Thursday, 29 May 2008

A method for flexibly using external services

aka "How I picture a simple REST mechanism for queuing tasks with external services, such as transformation of PDFs to images, Documents to text, or scanning, file format identification."
Example: Document to .txt service utilisation


Step 1: send the URL for the document to the service (in this example, the request is automatically accepted - code 201 indicates that a new resource has been created)

{(( setup: server h is 'in the cloud'/resolvable, and /x.doc exists on it ))}

u | ----------------- POST /jobs (msg 'http://h/x.doc') ----------------> | Service (s)
| <---------------- HTTP resp code 201 (msg 'http://s/jobs/1') ---------- |
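
In code, step 1 is just a POST with the document's URL as the body. A rough sketch in Python (the hosts 's' and 'h' are the placeholders from the diagram, so this is illustrative only):

import http.client

# Step 1 sketch: hand the document URL to the service.
conn = http.client.HTTPConnection("s")           # placeholder service host
conn.request("POST", "/jobs", body="http://h/x.doc")
resp = conn.getresponse()
assert resp.status == 201                        # a new job resource was created
job_uri = resp.read().decode("utf-8")            # e.g. "http://s/jobs/1"
print(job_uri)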

Step 2: Check the returned resource to find out whether the job has completed (a 4XX error code would be suitable if there has been a problem, such as the doc resource being unavailable)


u | ----------------- GET /jobs/1 (header: "Accept: text/rdf+n3") --------> | s


If job is in progress:

u | <---------------- HTTP resp code 202 ---------------------------------- | s

If job is complete (and accept format is supported):

u | <---------------- HTTP resp code 303 (location: /jobs/1.rdf) --------- | s

u | ----------------- GET /jobs/1.rdf --------------> | s
| <---------------- HTTP 200 msg below ------------ |


@prefix s: <http://s/jobs/>.
@prefix store: <http://s/store/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix ore: <http://www.openarchives.org/ore/terms/>.
@prefix dcterms: <http://purl.org/dc/terms/>.

s:1
ore:isDescribedBy
s:1.rdf;
dc:creator
"Antiword service - http://s";
dcterms:created
"2008-05-29T20:33:33.009152";
ore:aggregates
store:1.txt.

store:1.txt
dc:format
"text/plain";
dcterms:created
"2008-05-29T12:20:33.009152";
XXXXXXX:deleted
"2009-05-29T00:00:00.000";
dc:title
"My Document"

<http://h/x.doc>
dcterms:hasFormat
store:1.txt.

---------------------------------------------------------
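
The whole of step 2 is then just a polling loop. Roughly, in Python (same placeholder host; note that http.client deliberately does not follow redirects, which keeps the 303 visible):

import http.client, time

def fetch(path, headers=None):
    # One request per connection keeps the sketch simple.
    conn = http.client.HTTPConnection("s")       # placeholder service host
    conn.request("GET", path, headers=headers or {})
    return conn.getresponse()

while True:
    resp = fetch("/jobs/1", {"Accept": "text/rdf+n3"})
    if resp.status == 202:       # job still in progress: wait, then retry
        time.sleep(5)
        continue
    if resp.status == 303:       # job complete: follow Location to the report
        report = fetch(resp.getheader("Location"))
        print(report.read().decode("utf-8"))     # the N3 shown above
    break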


Then, the user can get the aggregated parts as required, noting the TTL (the deleted-date predicate, for which I still need to find a good real choice).

Also, as this is a transformation, the service has indicated as much with the final triple, asserting that the created resource is a rendition of the original resource, but in a different format.
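
To show that the reply really is self-describing: assuming a library like rdflib is to hand, a client could read the report back something like this (the URLs are the placeholders from the example):

from rdflib import Graph, Namespace, URIRef

ORE = Namespace("http://www.openarchives.org/ore/terms/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.parse("http://s/jobs/1.rdf", format="n3")      # the report fetched above

# List the aggregated parts of the job
job = URIRef("http://s/jobs/1")
for part in g.objects(job, ORE.aggregates):
    print("aggregated part:", part)

# The final triple: which created resource is a rendition of which original
for original, rendition in g.subject_objects(DCTERMS.hasFormat):
    print(original, "has rendition", rendition)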

A report based on the item, such as something that would be output from JHOVE, Droid or a virus-scanning service, can be shown as an aggregate resource in the same way, or if the report can be rendered using RDF, can be included in the aggregation itself.

It should be straightforward to see that this response gives services the opportunity to return zero or more files, and for that reply to be self-describing. Re-using the basic structure of the OAI-ORE profile means that the work going into the Atom format rendition can be replicated here, so an Atom report format could also work.

General service description:

All requests may carry {[?pass=XXXXXXXXXXXXXXXXX]} as an optional parameter. The service has the choice of whether to support it or not.

Request:
GET /jobs
Response:
Content-negotiation applies, but default response is Atom format
List of job URIs that the user can see (without a pass, the list is just the anonymous ones if the service allows it)

Request:
POST /jobs
Body: "Resource URL"

Response:
HTTP Code 201 - Job accepted - Resp body == URI of job resource
HTTP Code 403 - Forbidden, due to bad credentials
HTTP Code 402 - Request is possible, but requires payment
- resp body => $ details and how to credit the account

Request:
DELETE /jobs/job-id
Response:
HTTP Code 200 - Job is removed from the queue, along with any created resources
HTTP Code 403 - Bad credentials/Not allowed

Request:
GET /jobs/job-id
Header (optional): "Accept: application/rdf+xml" to get RDF/XML rather than the default Atom, should the service support it
Response:
HTTP Code 406 - Service cannot make a response to comply with the Accept header formats
HTTP Code 202 - Job is in progress - msg MAY include an expected time for completion
HTTP Code 303 - Job complete; redirect to a formatted version of the report (typically /jobs/job-id.atom, .rdf, etc)

Request:
GET /jobs/job-id.atom
Response:
HTTP Code 200 - Job is completed, and the msg body is the ORE map in Atom format
HTTP Code 404 - Job is not complete
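
To make that surface area concrete, a thin client wrapper over these four requests might look like the sketch below; all the names in it are mine, not part of any spec:

import http.client

class JobServiceClient:
    """Minimal sketch of a client for the job API described above."""

    def __init__(self, host, passcode=None):
        self.host = host
        self.passcode = passcode

    def _path(self, path):
        # Append the optional ?pass= parameter if we have a passcode.
        return path + ("?pass=" + self.passcode if self.passcode else "")

    def _request(self, method, path, body=None, headers=None):
        conn = http.client.HTTPConnection(self.host)
        conn.request(method, self._path(path), body=body, headers=headers or {})
        return conn.getresponse()

    def list_jobs(self):
        return self._request("GET", "/jobs")       # Atom feed by default

    def submit(self, doc_url):
        return self._request("POST", "/jobs", body=doc_url)

    def status(self, job_id, accept=None):
        # 202/303 handling is left to the caller, as in the polling sketch.
        headers = {"Accept": accept} if accept else {}
        return self._request("GET", "/jobs/" + job_id, headers=headers)

    def delete(self, job_id):
        return self._request("DELETE", "/jobs/" + job_id)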

Authorisation and economics


The authorisation for use of the service is a separate consideration, but ultimately it is dependent on the service implementation: whether anonymous access is allowed, rate limits, authorisation by invitation only, and so on.

I would suggest the use of SSL for those services that do use authorisation, but not HTTP Digest per se. HTTP Basic through an SSL connection should be good enough; the Digest standard is not pleasant to try to implement and get working (the standard is a little ropey).
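
For illustration, HTTP Basic over SSL is about this much work on the client side (host and credentials are made up):

import base64, http.client

# Basic credentials are only safe because they travel over the SSL connection.
creds = base64.b64encode(b"alice:secret").decode("ascii")
conn = http.client.HTTPSConnection("s")          # placeholder service host
conn.request("POST", "/jobs", body="http://h/x.doc",
             headers={"Authorization": "Basic " + creds})
print(conn.getresponse().status)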

Because a job request can be answered with a code 402 (Payment Required), it is possible to start adding in some economic valuations. This requires that the server holding the resource can respond sensibly to a HEAD request and report information such as file size and format.

A particular passcode can be credited to allow its holder to make use of a service, each use debiting the account as required. When an automated system hits a 402 (Payment Required) rather than a plain 403 (Forbidden), this could trigger mechanisms to obtain more credit, rather than a simple failure.
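
A sketch of what that could look like, with a purely hypothetical top_up_credit hook, plus the HEAD helper that gives the service the sizing information mentioned above:

import http.client

def resource_info(host, path):
    # The resource server must answer a HEAD sensibly: file size and format.
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)
    resp = conn.getresponse()
    resp.read()
    return resp.getheader("Content-Length"), resp.getheader("Content-Type")

def top_up_credit(passcode, details):
    # Hypothetical hook: parse the $ details from the 402 body and credit
    # the account by whatever mechanism the service offers.
    raise NotImplementedError

def submit_with_credit(host, doc_url, passcode):
    # On a 402, try to obtain more credit and retry, rather than simply fail.
    conn = http.client.HTTPConnection(host)
    conn.request("POST", "/jobs?pass=" + passcode, body=doc_url)
    resp = conn.getresponse()
    body = resp.read().decode("utf-8")
    if resp.status == 201:
        return body                      # URI of the new job resource
    if resp.status == 402:
        top_up_credit(passcode, body)    # body holds the $ details
        return submit_with_credit(host, doc_url, passcode)
    raise RuntimeError("job refused: %d" % resp.status)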

Links:
OAI-ORE spec - http://www.openarchives.org/ore/0.3/toc

HTTP status codes - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

3 comments:

Anonymous said...

Really interesting post. We've been looking at the same problem in molecule repositories (although in our case the services are things like format wrangling, RDF extraction, 2D depiction etc etc). We're thinking about taking a slightly different approach - I wonder if you've considered some alternatives and if so, why you didn't go for them...

We're trying out an alternative in which h publishes an Atom feed of jobs to be done, along with an APP callback address for the results to be POSTed to. It means our external processors can be deployed more flexibly (e.g. they can be cron'd client processes if that's convenient), and they don't have to store the results long term.

In your solution there's a temporal coupling when the job is invoked; if the service is down the repo (h) ends up having to manage a queue of jobs. In our alternative, there's a temporal coupling when trying to return the results. Tomato, tomato. Another alternative would be for the service to publish results through a feed, which requires the service to hold all the results (which might not be desirable), but is robust to either part being unavailable, without needing managed queues.

I've got a nagging feeling that this is all skirting around creating HTTP interfaces to a messaging architecture. (Not that that's entirely straightforward either)

Ben O'Steen said...

This post is essentially a brain-dump of something that was at the back of my mind during the SUN PA-SIG conference. Whether this is a good way or not I don't know.

Matt Zumwalt tipped me off that Amazon have a message-based mechanism for a similar purpose, so I'll give that a look.

How does the APP callback structure the incoming resources in your model? Is it an OAIS-type bundle, a BagIt, or something more resource-orientated?

I was trying to keep the service and the repo as far apart as possible, and also that there may be many different service providers for the same service, whether for load balancing or for multi-vendor purposes (e.g. other people doing OpenCalais type services or multiple chemical registries.)

Anonymous said...

"Matt Zumwalt tipped me off that Amazon have a message-based mechanism for a similar purpose, so I'll give that a look."

It's called SQS. Let me know what it's like if you get time to look at it before me?

"How does the APP callback structure the incoming resources in your model? Is it a OAIS-type bundle, a BagIT, or something more resource-orientated?"

More resource-y; in our setup anything that's recognisable as RDF gets indexed, so you can add descriptive and structural metadata by POSTing RDF describing the resources after you've found their eventual URIs.

"I was trying to keep the service and the repo as far apart as possible, and also that there may be many different service providers for the same service, whether for load balancing or for multi-vendor purposes (e.g. other people doing OpenCalais type services or multiple chemical registries.)"

Since there isn't a standard to represent these kinds of services yet, we're really talking about a layer of proxies to give functionally equivalent services the same API.