Thursday, 8 May 2008

Internal object relationships - in the context of Fedora and Solr indexing.

Peter Sefton wrote to me recently and noted that the basic Solr indexer I've written still uses the rather poor convention that the datastream with a DSID of FULLTEXT contains all the text extracted from the other binary datastreams. He went on to suggest that the connection could instead be expressed as an inter-datastream relationship, exposed through the OAI-ORE resource map.

This is exactly my intention, so I'll write up the types of relationship that I am using for these purposes and how I am serialising them with the Fedora software.

Peter was right when he said that the relationship between a PDF/DOC/binary datastream and its text version can be expressed in ORE - I am planning that very thing. While Fedora 3 doesn't seem to have plans for a RELS-INT datastream, this is the datastream ID under which I intend to store the necessary relationships as RDF/XML, as the name helps convey some of the intent of the datastream - the INTernal RELationshipS.

The predicate I'm using to bind a binary datastream, such as a pdf/doc/etc, to the text that is extracted from it and stored alongside it is the dcterms:hasFormat property - the simple description of what this property asserts is:
dcterms:hasFormat - A related resource that is substantially the same as the pre-existing described resource, but in another format.

It also has the bonus that if you have a pdf of a .doc, you can bind them together with the same property too, e.g. (in n-triples style, with the predicate abbreviated for clarity)

<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/PDF2TEXT>
<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/DOC2>

and just to illustrate an interesting use case:

<info:fedora/thing:1234/PDF> <dcterms:hasFormat> <info:fedora/someotherthing:123456/DOC>
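
To show how an indexer might read these statements, here is a minimal Python sketch. It holds the three example statements as plain (subject, predicate, object) tuples rather than in a real RDF library, and the helper names alternative_formats and is_external are made up for illustration. The only outside fact it relies on is the full URI behind the dcterms:hasFormat abbreviation.

```python
# Full URI of the predicate abbreviated as dcterms:hasFormat above.
HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"

# The three example statements as (subject, predicate, object) tuples.
TRIPLES = [
    ("info:fedora/thing:1234/PDF2", HAS_FORMAT, "info:fedora/thing:1234/PDF2TEXT"),
    ("info:fedora/thing:1234/PDF2", HAS_FORMAT, "info:fedora/thing:1234/DOC2"),
    ("info:fedora/thing:1234/PDF", HAS_FORMAT, "info:fedora/someotherthing:123456/DOC"),
]


def alternative_formats(triples, subject):
    """Return every resource asserted to be another format of `subject`."""
    return [o for s, p, o in triples if s == subject and p == HAS_FORMAT]


def is_external(subject, obj):
    """True when two datastream URIs belong to different Fedora objects."""
    # info:fedora/<pid>/<dsid>: drop the trailing dsid and compare what is left.
    return subject.rsplit("/", 1)[0] != obj.rsplit("/", 1)[0]


if __name__ == "__main__":
    for s, p, o in TRIPLES:
        where = "another object" if is_external(s, o) else "the same object"
        print(s, "has a format in", where, "->", o)
```

The last statement is the interesting one: the alternative format lives in a different object, so an indexer following hasFormat would have to decide whether to cross object boundaries.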

Another predicate I am using is the dcterms:conformsTo - again, the simple description is:
dcterms:conformsTo - An established standard to which the described resource conforms.

This I am using to indicate which metadata standard, such as MODS, oai_dc, etc., a datastream conforms to. It could equally be used to express the file format of a binary datastream.

E.g.

<info:fedora/thing:1234/DC> <dcterms:conformsTo> <http://namespace.for.whatever.schema/or/standard/used>
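
As a rough sketch of how an indexer could use this, the Python below picks out the datastreams that claim to conform to a given standard. It assumes, purely for the sake of the example, that standards are identified by their namespace URI; the MODS datastream ID and the helper name datastreams_conforming_to are hypothetical.

```python
# Full URI of the predicate abbreviated as dcterms:conformsTo above.
CONFORMS_TO = "http://purl.org/dc/terms/conformsTo"

# Assumption for this sketch: a standard is identified by its namespace URI.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
MODS = "http://www.loc.gov/mods/v3"

# Hypothetical RELS-INT content for an object with two metadata datastreams.
TRIPLES = [
    ("info:fedora/thing:1234/DC", CONFORMS_TO, OAI_DC),
    ("info:fedora/thing:1234/MODS", CONFORMS_TO, MODS),
]


def datastreams_conforming_to(triples, standard):
    """Return the datastream URIs asserted to conform to `standard`."""
    return [s for s, p, o in triples if p == CONFORMS_TO and o == standard]


if __name__ == "__main__":
    print(datastreams_conforming_to(TRIPLES, MODS))
```

The lookup is an exact string match on the identifier, so it only works if everyone writes the same URI for the same standard.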

I think it would be useful to discuss what we use to uniquely identify a given standard, just so that it is clear. Should we use the URL that resolves to the .xsd for the standard, or do we use the namespace URI?

(The end idea being that anything in the RELS-INT is expressed in the OAI-ORE, and consequently, the easiest way to include extra information for the ORE is to put it into the RELS-INT)

So, to illustrate how this can be used to index an object held in Fedora:

1) Get the list of dsids and mimetypes via the ListDatastreams API call.
2) Get the RELS-INT from the object itself and create a graph of the triples.
2b) [Get the RELS-INT from the content model for the object and merge the two graphs together. The content model RELS-INT could hold all the common relationships for a given type, e.g. Image to Thumbnail, etc. This will help performance by enabling caching, etc.]
3) Resolve and GET the metadata datastream, as indicated by the RELS-INT, and put it into a suitably configured Solr (1 to 1 relationship - index doc to object). The decision of which metadata to use should be left up to the indexer. The Fedora object should only indicate (via hasFormat) which metadata datastreams are derived and which is the canonical one.
4) Grab the text/plain alternative formats and put them into a (second) Solr (1 to 1 - index doc to parent datastream, e.g. info:fedora/t:1234/PDF2, etc.)
5) Use the distributed index query of Solr to query both, or each individually, as the portal requires.
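
The steps above can be sketched in Python as follows. This is only an outline under stated assumptions: it makes no calls to Fedora or Solr, the datastream list and triples are passed in as plain data, the content model relationships are assumed to have been rewritten against the object's own URIs already, and the function names, host names and core paths are all hypothetical. The one Solr-specific detail it uses is the shards request parameter for distributed search.

```python
from urllib.parse import urlencode

HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"
CONFORMS_TO = "http://purl.org/dc/terms/conformsTo"


def plan_indexing(pid, datastreams, object_rels, model_rels=(), preferred=()):
    """Decide what would go into the metadata index and the text index.

    datastreams: dict of dsid -> mimetype (step 1)
    object_rels, model_rels: iterables of (s, p, o) triples (steps 2 and 2b)
    preferred: standard URIs in the order this indexer prefers them (step 3)
    """
    graph = sorted(set(object_rels) | set(model_rels))  # step 2b: merge graphs
    base = "info:fedora/%s/" % pid

    # Step 3: the indexer, not the object, chooses which metadata to use.
    metadata = None
    for standard in preferred:
        matches = [s for s, p, o in graph
                   if p == CONFORMS_TO and o == standard and s.startswith(base)]
        if matches:
            metadata = matches[0]
            break

    # Step 4: text/plain alternatives, keyed by their parent datastream.
    # (A parent with several text alternatives keeps only the last one here.)
    text = {}
    for s, p, o in graph:
        if p == HAS_FORMAT and o.startswith(base):
            if datastreams.get(o[len(base):]) == "text/plain":
                text[s] = o
    return metadata, text


def distributed_query_url(front, shards, query):
    """Step 5: build one request that Solr fans out to every listed shard."""
    params = urlencode({"q": query, "shards": ",".join(shards)})
    return "http://%s/select?%s" % (front, params)


if __name__ == "__main__":
    oai_dc = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    ds = {"DC": "text/xml", "PDF2": "application/pdf", "PDF2TEXT": "text/plain"}
    rels = [
        ("info:fedora/t:1234/PDF2", HAS_FORMAT, "info:fedora/t:1234/PDF2TEXT"),
        ("info:fedora/t:1234/DC", CONFORMS_TO, oai_dc),
    ]
    print(plan_indexing("t:1234", ds, rels, preferred=[oai_dc]))
    print(distributed_query_url("localhost:8983/solr/metadata",
                                ["localhost:8983/solr/metadata",
                                 "localhost:8983/solr/fulltext"], "fedora"))
```

Querying both indexes in one request presumably needs the two Solr instances to have compatible schemas; querying each individually just means leaving out the shards parameter.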
