Thursday 29 May 2008

A method for flexibly using external services

aka "How I picture a simple REST mechanism for queuing tasks with external services, such as transforming PDFs to images or documents to text, scanning, or file format identification."
Example: Document to .txt service utilisation


Step 1: send the URL for the document to the service (in this example, the request is automatically accepted - code 201 indicates that a new resource has been created)

{(( server h - 'in the cloud/resolvable' - /x.doc exists ))}

u | ----------------- POST /jobs (msg 'http://h/x.doc') ----------------> | Service (s)
| <---------------- HTTP resp code 201 (msg 'http://s/jobs/1') ---------- |

Step 2: Check the returned resource to find out if the job has completed (an error code 4XX would be suitable if there has been a problem such as unavailability of the doc resource)


u | ----------------- GET /jobs/1 (header: "Accept: text/rdf+n3") --------> | s


If job is in progress:

u | <---------------- HTTP resp code 202 ---------------------------------- | s

If job is complete (and accept format is supported):

u | <---------------- HTTP resp code 303 (location: /jobs/1.rdf) ---------- | s

u | ----------------- GET /jobs/1.rdf --------------> | s
| <---------------- HTTP 200 msg below ------------ |


@prefix s: <http://s/jobs/>.
@prefix store: <http://s/store/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix ore: <http://www.openarchives.org/ore/terms/>.
@prefix dcterms: <http://purl.org/dc/terms/>.

s:1
ore:isDescribedBy
s:1.rdf;
dc:creator
"Antiword service - http://s";
dcterms:created
"2008-05-29T20:33:33.009152";
ore:aggregates
store:1.txt.

store:1.txt
dc:format
"text/plain";
dcterms:created
"2008-05-29T12:20:33.009152";
XXXXXXX:deleted
"2009-05-29T00:00:00.000";
dc:title
"My Document".

<http://h/x.doc>
dcterms:hasFormat
store:1.txt.

---------------------------------------------------------


Then, the user can get the aggregate parts as required, noting the TTL (the deleted date predicate, for which I still need to find a good real choice).
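A minimal sketch of how a client might honour that TTL, assuming the deletion date arrives as the ISO-style timestamp shown in the example and that the client's clock uses the same (unstated) timezone. The function name is made up for illustration, and the predicate that carries the date is still undecided, so the sketch only takes the timestamp string. It is written for Python 3.

```python
from datetime import datetime


def is_expired(deleted, now=None):
    """Return True once the service's stated deletion time has passed.

    `deleted` is the timestamp string from the job report, for example
    "2009-05-29T00:00:00.000". Timestamps are treated as naive local
    times, which is an assumption and not something the report states.
    """
    if now is None:
        now = datetime.now()
    return now >= datetime.fromisoformat(deleted)
```

A client would check this before trying to GET an aggregated part, and resubmit the job if the stored copy has already gone.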

Also, as this is a transformation, the service has indicated this with the final triple, asserting that the created resource is a rendition of the original resource, but in a different format.

A report based on the item, such as something that would be output from JHOVE, Droid or a virus-scanning service, can be shown as an aggregate resource in the same way, or if the report can be rendered using RDF, can be included in the aggregation itself.

It should be straightforward to see that this response gives services the opportunity to return zero or more files, and for that reply to be self-describing. Re-using the basic structure of the OAI-ORE profile means that the work going into the Atom format rendition can be replicated here, so an Atom report format could also work.

General service description:

All requests take {[?pass=XXXXXXXXXXXXXXXXX]} as an optional parameter. The service can choose whether or not to support it.

Request:
GET /jobs
Response:
Content-negotiation applies, but default response is Atom format
List of job URIs that the user can see (without a pass, the list is just the anonymous ones if the service allows it)

Request:
POST /jobs
Body: "Resource URL"

Response:
HTTP Code 201 - Job accepted - Resp body == URI of job resource
HTTP Code 403 - Forbidden, due to bad credentials
HTTP Code 402 - Request is possible, but requires payment
- resp body => $ details and how to credit the account

Request:
DELETE /jobs/job-id
Response:
HTTP Code 200 - Job is removed from the queue, as are any created resources
HTTP Code 403 - Bad credentials/Not allowed

Request:
GET /jobs/job-id
Header (optional): "Accept: application/rdf+xml" to get rdf/xml rather than the default atom, should the service support it
Response:
HTTP Code 406 - Service cannot make a response to comply with the Accept header formats
HTTP Code 202 - Job is in process - msg MAY include an expected time for completion
HTTP Code 303 - Job complete, redirect to formatted version of the report (typically /jobs/job-id.atom/rdf/etc)

Request:
GET /jobs/job-id.atom
Response:
HTTP Code 200 - Job is completed, and the msg body is the ORE map in Atom format
HTTP Code 404 - Job is not complete
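
The request/response pairs above can be strung together into a small client. What follows is only a sketch under assumptions: the function names, the polling interval and the retry limit are invented for illustration, and the optional ?pass= parameter is left out. It uses the Python 3 standard library and stops urllib from following the 303 automatically so that the status codes stay visible to the caller.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib following the 303 itself, so the client sees the code."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def http_fetch(method, url, body=None, accept=None):
    """Return (status, headers, text) without raising on HTTP error codes."""
    headers = {"Accept": accept} if accept else {}
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urllib.request.build_opener(NoRedirect).open(req) as resp:
            return resp.status, resp.headers, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.headers, err.read().decode("utf-8")


def run_job(service, doc_url, fetch=http_fetch, poll_seconds=5,
            max_polls=60, accept="text/rdf+n3"):
    """Queue doc_url with the service and return the finished job report."""
    status, _, job_uri = fetch("POST", service + "/jobs", body=doc_url)
    if status != 201:
        raise RuntimeError("job not accepted: HTTP %d" % status)
    job_uri = job_uri.strip()
    for _ in range(max_polls):
        status, headers, _ = fetch("GET", job_uri, accept=accept)
        if status == 202:
            # Job still in progress: wait and ask again.
            time.sleep(poll_seconds)
        elif status == 303:
            # Job complete: fetch the formatted report it points to.
            report_url = urllib.parse.urljoin(job_uri, headers.get("Location"))
            status, _, report = fetch("GET", report_url)
            if status != 200:
                raise RuntimeError("report not available: HTTP %d" % status)
            return report
        else:
            raise RuntimeError("job failed: HTTP %d" % status)
    raise TimeoutError("job did not complete in time")
```

Passing the fetch function in as a parameter keeps the polling logic separate from the transport, so it can be exercised without a live service.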

Authorisation and economics


The authorisation for use of the service is a separate consideration, but ultimately it depends on the service implementation - whether anonymous access is allowed, whether there are rate limits, whether authorisation is by invitation only, etc.

I would suggest the use of SSL for those services that do use authorisation, but not HTTP Digest per se. HTTP Basic through an SSL connection should be good enough; the Digest standard is not pleasant to implement and get working (the standard is a little ropey).

Due to the possibility of a code 402 (payment required) on a job request, it is possible to start to add in some economic valuations. It is required that the server holding the resource can respond to a HEAD request sensibly and report information such as file-size and format.
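
As a rough illustration of that HEAD request, here is a sketch of how a service might ask the hosting server for the size and format before deciding whether to accept or price a job. The function name and the returned dictionary are assumptions made for the example; the only things taken as given are the standard Content-Length and Content-Type headers.

```python
import urllib.request


def describe_resource(url, opener=urllib.request.urlopen):
    """HEAD the resource and report its size in bytes and its media type."""
    req = urllib.request.Request(url, method="HEAD")
    with opener(req) as resp:
        size = resp.headers.get("Content-Length")
        return {
            # A server that omits Content-Length leaves the size unknown.
            "size": int(size) if size is not None else None,
            "format": resp.headers.get("Content-Type"),
        }
```

A server that does not answer HEAD sensibly would leave one or both values as None, which is the case where a service could not put a price on the job.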

A particular passcode can be credited to allow it to make use of a service, the use of which debits the account as required. When an automated system hits upon a 402 (payment required) rather than a plain 403 (Forbidden), this could trigger mechanisms to get more credit, rather than a simple fail.
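
The difference between the two codes can be sketched as a small retry loop. Everything named here is hypothetical: post_job stands for whatever submits the POST /jobs request and returns the status and body, and top_up stands for whatever mechanism credits the account using the details in the 402 response.

```python
def submit_with_credit(post_job, top_up, max_top_ups=1):
    """Submit a job, trying to add credit on a 402 before giving up.

    post_job() returns (status, body); top_up(details) returns True if
    the account was credited. Both are supplied by the caller.
    """
    top_ups = 0
    while True:
        status, body = post_job()
        if status == 201:
            return body  # URI of the new job resource
        if status == 402 and top_ups < max_top_ups and top_up(body):
            # Payment required: credit was added, so try the job again.
            top_ups += 1
            continue
        # 403, a failed or exhausted top-up, or anything else: simple fail.
        raise PermissionError("job refused: HTTP %d" % status)
```

The limit on top-ups is there so that an automated client cannot keep spending in a loop if the service carries on answering 402.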

Links:
OAI-ORE spec - http://www.openarchives.org/ore/0.3/toc

HTTP status codes - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

Thursday 8 May 2008

Internal object relationships - in the context of Fedora and Solr indexing.

Peter Sefton wrote to me recently and noted that the basic Solr indexer I've written still uses the rather poor convention that the datastream with a DSID of FULLTEXT contains all the extracted text from the other binary datastreams. He went on to suggest that the connection could instead be expressed as an inter-datastream relationship in the OAI-ORE resource map.

This is exactly my intention and I'll just write up the type of relationships that I am using for these purposes and also how I am serialising these with the Fedora software.

Peter was right when he said that the relationship between PDF/DOC/binary and its text version can be expressed in ORE - I am planning that very thing. While Fedora 3 doesn't seem to have plans for the RELS-INT datastream, this is the datastream ID in which I intend to store the necessary relationships as RDF/XML, as the name helps convey some of the intent of the datastream - the INTernal RELationshipS.

The predicate I'm using to bind a binary datastream, such as a pdf/doc/etc to the text that is extracted and stored alongside it is the dcterms:hasFormat property - the simple description of what this property asserts is:
dcterms:hasFormat - A related resource that is substantially the same as the pre-existing described resource, but in another format.

It also has the bonus that if you have a pdf of a .doc, you can bind them together with the same property too, e.g. (in n-triples for clarity)

<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/PDF2TEXT>
<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/DOC2>

and just to illustrate an interesting use case:

<info:fedora/thing:1234/PDF> <dcterms:hasFormat> <info:fedora/someotherthing:123456/DOC>

Another predicate I am using is the dcterms:conformsTo - again, the simple description is:
dcterms:conformsTo - An established standard to which the described resource conforms.

This I am using to indicate whether a datastream conforms to a metadata standard, such as MODS, oai_dc, etc. It could equally be used to express the file format for a binary ds.

E.g.

<info:fedora/thing:1234/DC> <dcterms:conformsTo> <http://namespace.for.whatever.schema/or/standard/used>

I think it would be useful to discuss what we use to uniquely identify a given standard, just so that it is clear. Should we use the URL that resolves to the .xsd for the standard, or do we use the namespace URI?

(The end idea being that anything in the RELS-INT is expressed in the OAI-ORE, and consequently, the easiest way to include extra information for the ORE is to put it into the RELS-INT)
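
To make the RELS-INT idea concrete, here is a sketch that serialises relationships like the ones above as RDF/XML, using only the Python 3 standard library. The layout (one rdf:Description per statement) and the function name are assumptions made for illustration, not a format that Fedora prescribes.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"


def rels_int(relationships):
    """Serialise (subject URI, dcterms property, object URI) as RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dcterms", DCTERMS)
    root = ET.Element("{%s}RDF" % RDF)
    for subject, prop, obj in relationships:
        # One rdf:Description per statement, keyed by the datastream URI.
        desc = ET.SubElement(root, "{%s}Description" % RDF,
                             {"{%s}about" % RDF: subject})
        ET.SubElement(desc, "{%s}%s" % (DCTERMS, prop),
                      {"{%s}resource" % RDF: obj})
    return ET.tostring(root, encoding="unicode")
```

Called with ("info:fedora/thing:1234/PDF2", "hasFormat", "info:fedora/thing:1234/PDF2TEXT"), this yields a document that could be stored as the RELS-INT datastream and parsed back into triples by the indexer.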

So, to illustrate how this can be used to index an object held in Fedora:

1) get the list of dsids and mimetypes via the API call of ListDatastreams
2) get the RELS-INT from the object itself and create a graph of the triples.
2b) [Get the RELS-INT from the content model for the object and munge the two graphs together. The content model RELS-INT could hold all the common relationships for a given type, e.g. Image to Thumbnail, etc. This will help performance by enabling caching, etc.]
3) Resolve and GET the metadata ds, as indicated by the RELS-INT, and put it into a suitably configured Solr (1 to 1 relationship - index doc to object). The decision of which metadata to use should be left up to the indexer. The Fedora object should only indicate (via hasFormat) which are the derived metadata and which is the canonical.
4) Grab the text/plain alternative formats and put them into a (second) solr (1 to 1 - index doc to parent datastream e.g. info:fedora/t:1234/PDF2, etc)
5) Use the distributed index query of solr to query both, or each individually, as the portal requires.
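
Steps 1, 2 and 4 can be sketched in a few lines, assuming the datastream list and the RELS-INT triples have already been fetched and parsed into plain Python structures. The function name and the input shapes are assumptions; the Fedora and Solr calls themselves are left out.

```python
HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"


def text_renditions(datastreams, triples):
    """Map each parent datastream to its text/plain rendition.

    datastreams: {datastream URI: mimetype}, as gathered in step 1.
    triples: (subject, predicate, object) tuples from RELS-INT (step 2).
    """
    found = {}
    for subject, predicate, obj in triples:
        # Only hasFormat links that point at a plain-text datastream
        # in this object are candidates for the second index (step 4).
        if predicate == HAS_FORMAT and datastreams.get(obj) == "text/plain":
            found[subject] = obj
    return found
```

Each entry in the result corresponds to one document in the second index: the key is the parent datastream and the value is where to GET the text from.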

Wednesday 7 May 2008

python-xml module deprecated in Hardy/Debian

If you've recently installed the shiny new Hardy Heron release of Ubuntu, or updated to the latest Debian, you may be surprised that a few old xml techniques in python no longer work. For example, the following no longer exist:
from xml import xpath
from xml.dom.ext import Anything_Really

For more details and a temporary workaround, see: http://www.aigarius.com/blog/2008/05/06/ubuntu-removing-xml-from-python/

Now, the reasons for the deprecation I can understand - the package has seen no real updates for some time, and a subset of its functionality is now part of the core python system. But no real warning? That's annoying.

It does mean that I'll have to update my (admittedly oldschool) xpath-dependent functions to use elementtree, which I had intended to do, but now my hand is forced :)
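
For anyone making the same move, here is a small sketch of the kind of lookups ElementTree handles with its limited XPath subset. The sample document and function names are invented for illustration, and the code is written against the xml.etree.ElementTree module in a current Python 3 standard library, not the version that shipped in 2008.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<records>
  <record id="a"><title>One</title></record>
  <record id="b"><title>Two</title></record>
</records>"""


def all_titles(xml_text):
    """Collect the text of every record/title element."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall("./record/title")]


def title_for(xml_text, record_id):
    """Find one title using an attribute predicate, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find("./record[@id='%s']/title" % record_id)
    return node.text if node is not None else None
```

ElementTree only supports a subset of XPath (child paths, simple predicates and the like), so functions that relied on the full language would need a different library or a rewrite.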