Friday 20 February 2009

Pushing the BagIt manifest concept a little further

I really like the idea of BagIt - just enough framework to transfer files in a way that errors can be detected.

I really like the idea of RDF - just enough framework to detail, characterise and interlink resources in an extremely flexible and extendable fashion.

I really like the 4 rules of Linked Data - just enough rules to act as guides; follow the rules and your information will be much more useful to you and the wider world.

What I do not wish to go near is any format that requires a non-machine-readable profile to understand, or a human to reverse-engineer. METS is a good example: a framework that gives you enough rope to hang yourself with.

So, what's my use-case? First, I'll outline what digital objects I have, and why I handle and express them in the way I do.

I deal with lists of resources on a day-to-day basis, and what these resources are and the way they link together is very important. The metadata associated with the list is also important, as it conveys the perspective of the agent that constructed the list; the "who, what, where, when and why" of the list.

OAI-ORE is - at a basic level - a specification and a vocabulary, which can be used to depict a list of resources. This is a good thing. But here's the rub for me - I don't agree with how ORE semantically places this list. For me, the list is a subjective thing, a facet or perception of linkage between the resources listed. The list *always* implies a context through which the resources are to be viewed. This view leads me to the conclusion that any triples that are *asserted* by the list, such as those containing an ordering predicate like 'hasNext' or 'hasLast', must not be in the same graph as the factual triples which would enter the 'global' graph, such as: list A is called (dc:title) "My photos", contains resources a, b, c, d and e, and was authored by Agent X.

This is easier to illustrate with an example with everyone's friends, Alice and Bob:


Now, while Alice and Bob may be 'aggregating' some of the same images, this doesn't mean we can infer much at all. Alice might be researching the development of a fruit fly's wings based on genetic degradation, and Bob might be researching the fruit fly's eye structure, looking for clear photos of the front of the fly. It could be even more unrelated, in that Bob might actually be looking for features on the electron microscope photos caused by dust or pollen.

So, to cope with contextual assertions (A asserts that <B> <verb C> <D>) there are a few well-discussed tactics: Reification, 'Promotion' (not sure of the correct term here) and Named Graphs.

Reification is a no-no. Very bad. Google will tell you all the reasons why this is bad.

'Promotion' (I hope someone will post the real term for this in the comments) is just where a Classed node is introduced to allow contextual information to be added, which is very useful for adding information about a predicate. For example, consider <Person A> <researches> <ProjectX>. This, I'd argue, is a bad triple for any system that will last longer than <ProjectX>'s lifespan. We need to qualify this triple with temporal information, and perhaps even location information too. So, one solution is to 'promote' the <researches> predicate to be of the following form: <Person A> <has_role> <_: type=Researcher>; <_:> <dtstart> <etc>, <_:> <researches> <ProjectX> ...
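
As a rough sketch of that promotion step, here it is with plain Python tuples standing in for triples. The blank node label `_:role`, the `rdf:type` predicate and the `dtstart` value are assumptions made up for illustration; only `has_role`, `researches` and `dtstart` come from the form above.

```python
def promote(subject, predicate, obj, role_type, node="_:role", **context):
    """Replace a direct triple with a role node that can carry context."""
    triples = [
        (subject, "has_role", node),      # the person now points at a role node
        (node, "rdf:type", role_type),    # the node is classed, e.g. Researcher
        (node, predicate, obj),           # the original predicate hangs off the node
    ]
    # Any extra keyword arguments become contextual triples on the node.
    triples.extend((node, key, value) for key, value in context.items())
    return triples

# Hypothetical start date, purely for illustration.
promoted = promote("PersonA", "researches", "ProjectX", "Researcher",
                   dtstart="2009-01-01")
```

The direct <Person A> <researches> <ProjectX> triple no longer appears; everything time-bound now lives on the role node.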

From the ORE camp, this promotion comes in the form of a Proxy for each aggregated resource that needs context. So in this way, they have 'promoted' the resource, as a kind of promotion of the act of aggregation. Tomayto, Tomarto. The way this works for ORE doesn't sit well with me though, and the convention for the URI scheme here feels very awkward and heavy.

The third way (and my strong preference) is the Named Graph approach. Put all the triples that are asserted by, say, Alice into a resource at URI <Alices-NG> and say something like <Aggregation> <isProvidedContextBy> <Alices-NG>.
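
A minimal sketch of that separation, using quads held in a Python list rather than a real RDF store. The `illustrates` predicate, the photo identifiers and the `global` label are hypothetical, invented only to show which statements land where.

```python
from collections import defaultdict

# Hypothetical graph names, used only for illustration.
GLOBAL = "global"
ALICES_NG = "Alices-NG"

# Each statement is a quad: (graph, subject, predicate, object).
quads = [
    # Facts that anyone could reuse stay in the global graph.
    (GLOBAL, "Aggregation", "dc:title", "My photos"),
    (GLOBAL, "Aggregation", "ore:aggregates", "photo-a"),
    (GLOBAL, "Aggregation", "ore:aggregates", "photo-b"),
    (GLOBAL, "Aggregation", "isProvidedContextBy", ALICES_NG),
    # Alice's contextual assertions live only in her named graph.
    (ALICES_NG, "photo-a", "illustrates", "wing development"),
    (ALICES_NG, "photo-b", "illustrates", "wing development"),
]

def triples_by_graph(quads):
    """Group (s, p, o) triples under the graph that asserts them."""
    graphs = defaultdict(list)
    for graph, s, p, o in quads:
        graphs[graph].append((s, p, o))
    return dict(graphs)

graphs = triples_by_graph(quads)
```

Someone who only wants the facts reads `graphs["global"]`; someone who wants Alice's view follows isProvidedContextBy and reads her graph as well.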

For ease of reuse though, I'd suggest that the facts are left in the global graph, in the aggregation serialisation itself. I am sure that the semantic arguments over what should go where could rage on for eons, but my take is that information that is factual or generally useful should be left in the global graph: resource mime-type, standards compliance ('conformsTo', etc), and mirroring/alternate format information ('sha1_checksum', 'hasFormat' between PDF, txt and Word doc versions, etc).

(There is the murky middle ground of course, like licensing. But I'd suggest leaving it to the 'owning' aggregation to put that in the global graph.)

Enough of the digression on RDF!

So, how to extend BagIt, taking on board the things I have said above:

Add alongside the MANIFEST of BagIt (a simple list of files and checksums) an RDF serialisation - RDFMANIFEST.{format} (my preference being N3 or Turtle notation, .n3 or .turtle accordingly).
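
For reference, the plain manifest side of this is easy to sketch. The snippet below assumes one "checksum path" line per file and uses SHA-1 to match the checksum predicate used later in this post; the file contents are hypothetical stand-ins held in memory.

```python
import hashlib

def manifest_lines(files):
    """Return 'checksum path' lines for a mapping of path -> bytes."""
    lines = []
    for path in sorted(files):
        digest = hashlib.sha1(files[path]).hexdigest()
        lines.append(f"{digest} {path}")
    return lines

# Hypothetical payload, held in memory so the sketch runs anywhere.
files = {
    "raw_pages/page1/bookid_page1_0.tiff": b"fake tiff bytes",
    "thumbnail_pages/page1/bookid_page1_0.jpg": b"fake jpeg bytes",
}
lines = manifest_lines(files)
```

The RDF manifest would sit next to this file and describe the same paths, just with more to say about each one.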

Copy the modelling of Aggregations from OAI-ORE, and say that one BagIt archive is equivalent to one Aggregation. (NB nothing wrong with a BagIt archive of BagIt archives!)

Re-use the Agent and ore:aggregates concepts and conventions from the OAI-ORE spec to 'list' the archive, and give some form of provenance. Add in a simple record for what this archive is meant to be as a whole (attached to the Aggregation class).
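
A sketch of what that 'listing' might boil down to, again with tuples for triples. `ore:Aggregation` and `ore:aggregates` are ORE terms; using `dcterms:creator` for the agent and `dc:title` for the simple record is an assumption here, and the bag URI, title and agent name are invented.

```python
def describe_bag(bag_uri, title, agent, members):
    """List a bag as an ORE-style Aggregation with basic provenance."""
    triples = [
        (bag_uri, "rdf:type", "ore:Aggregation"),
        (bag_uri, "dc:title", title),        # what the archive is, as a whole
        (bag_uri, "dcterms:creator", agent), # assumed predicate for the agent
    ]
    # One ore:aggregates triple per resource in the archive.
    triples.extend((bag_uri, "ore:aggregates", member) for member in members)
    return triples

# Hypothetical bag and contents.
description = describe_bag(
    "info:BagIt/book-42",
    "Scanned pages of book 42",
    "AgentX",
    ["info:BagIt/book-42/raw_pages/page1/bookid_page1_0.tiff"],
)
```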

Give each BagIt a URI - in this case, preferably a resolvable URI from which you can download it, but for bulk transfers using SneakerNet or CarFullOfDrivesNet, use an info:BagIt/{id} scheme of your choice.

URIs for resources in transit are hierarchical, based on location in the archive: <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>
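
Building these URIs from a path inside the archive could look something like the following. The function name and the `book-42` id are hypothetical; percent-encoding awkward characters is an added assumption, not something the scheme above spells out.

```python
from urllib.parse import quote

def bag_uri(bag_id, path_in_archive):
    """Build an info:BagIt/{id}/{path} URI for a file inside a bag."""
    # Normalise Windows-style separators and drop any leading slash.
    clean = path_in_archive.replace("\\", "/").lstrip("/")
    return f"info:BagIt/{quote(bag_id, safe='')}/{quote(clean, safe='/')}"

uri = bag_uri("book-42", "raw_pages/page1/bookid_page1_0.tiff")
# uri == "info:BagIt/book-42/raw_pages/page1/bookid_page1_0.tiff"
```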

Checksums, mimetypes and alternates should be added to the RDF Manifest:

NB <page1> == <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>

<page1> <sha1> "9cbd4aeff71f5d7929b2451c28356c8823d09ab4" ;
    <mimetype> "image/tiff" ;
    <hasFormat> <info:BagIt/{book-id}/thumbnail_pages/page1/bookid_page1_0.jpg> .
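
Generating an entry like this for each file is mostly string assembly. This sketch keeps the same bare <sha1>, <mimetype> and <hasFormat> predicates as the snippet above; the small extension-to-mimetype table and the `book-42` id are assumptions for illustration, and the content is an empty stand-in.

```python
import hashlib
import posixpath

# Tiny hypothetical lookup; a real bag would need a fuller table.
MIMETYPES = {".tiff": "image/tiff", ".jpg": "image/jpeg"}

def turtle_entry(uri, content, alternates=()):
    """Render one resource's checksum, mimetype and alternates."""
    ext = posixpath.splitext(uri)[1].lower()
    parts = [f'<sha1> "{hashlib.sha1(content).hexdigest()}"']
    if ext in MIMETYPES:
        parts.append(f'<mimetype> "{MIMETYPES[ext]}"')
    parts.extend(f"<hasFormat> <{alt}>" for alt in alternates)
    # Predicates for one subject are joined with ';' and closed with '.'.
    return f"<{uri}> " + " ;\n    ".join(parts) + " ."

entry = turtle_entry(
    "info:BagIt/book-42/raw_pages/page1/bookid_page1_0.tiff",
    b"",  # empty stand-in for the file's bytes
    alternates=["info:BagIt/book-42/thumbnail_pages/page1/bookid_page1_0.jpg"],
)
```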


Any assertions, such as page ordering in this case, should be handled as necessary. Just *please* do not use 'hasNext'! Use a named graph, use the built-in RDF list mechanism, add an RSS 1.0 feed as a resource, anything but hasNext!
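
To show what the RDF list option looks like underneath, here is a sketch that turns an ordered list of pages into rdf:first / rdf:rest triples ending in rdf:nil. The blank node labels (`_:b0`, `_:b1`, ...) and page names are made up for illustration.

```python
RDF_FIRST = "rdf:first"
RDF_REST = "rdf:rest"
RDF_NIL = "rdf:nil"

def rdf_list(items):
    """Return (head, triples) encoding items as an RDF collection."""
    if not items:
        return RDF_NIL, []
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell points at rdf:nil instead of another cell.
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, item))
        triples.append((nodes[i], RDF_REST, rest))
    return nodes[0], triples

head, triples = rdf_list(["page1", "page2", "page3"])
```

The ordering lives entirely in the list cells, so nothing is asserted about the pages themselves, and the whole list can be placed in whichever graph is making the claim.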

And that's about it for the format. One last thing to say about using info URIs though - I'd strongly suggest that they are used only when the items do not have resolvable (http) URIs. Once transferred, I'd suggest that the info URIs are replaced with the http ones, and the info variants can be kept in a graph for provenance.
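
That swap could be sketched as below. The `wasTransferredAs` predicate and the example.org address are hypothetical; the point is only that the old info URIs end up in a separate set of provenance triples rather than being thrown away.

```python
def rehome(triples, info_prefix, http_prefix):
    """Swap info: URIs for http ones; keep the old names as provenance."""
    def swap(term):
        if term.startswith(info_prefix):
            return http_prefix + term[len(info_prefix):]
        return term

    rewritten, provenance = [], set()
    for s, p, o in triples:
        new_s, new_o = swap(s), swap(o)
        for old, new in ((s, new_s), (o, new_o)):
            if old != new:
                # Hypothetical predicate linking the new URI to its old name.
                provenance.add((new, "wasTransferredAs", old))
        rewritten.append((new_s, p, new_o))
    return rewritten, sorted(provenance)

page = "info:BagIt/book-42/raw_pages/page1/bookid_page1_0.tiff"
thumb = "info:BagIt/book-42/thumbnail_pages/page1/bookid_page1_0.jpg"
rewritten, provenance = rehome(
    [(page, "mimetype", "image/tiff"), (page, "hasFormat", thumb)],
    "info:BagIt/book-42/",
    "http://example.org/book-42/",
)
```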

(Please note that I am biased, in that this mirrors quite closely the way that the archives here work and the way that digital items are held, but I think this works!)
