Less Talk, More Code: Content Types in the Oxford Research Archive

I started working on the Oxford Research Archive last year, and very early on it seemed inevitable that a broad classification would be useful to distinguish between the types of things that were being submitted. I don't mean distinguishing between authors or file types, or even between items that contain images and those that don't.

What I mean is that a broad and somewhat rough characterisation of the item's nature is needed, the kind of thing you arrive at by asking questions like these:

What is the item's canonical metadata? (MODS, DC, Qual. DC, MIX, FoaF etc?)
Is its metadata likely to comprise a certain set or profile of information, such as the sets of information you might get from a journal article, a book, an image, a video or a thesis? For example, while a book is likely to have 'author', 'publisher' and so on in its set of metadata, an image is likely to have 'photographer', 'camera' and 'exposure' as well.
Is the item a metadata only item, or does it have binary attachments? (Is it a "by reference" item or "by value"?)
Which attachments should be listed for download (PDFs, etc) and which should be shown inline (images, thumbnails, video) in the page?
Is the item there to give the repository structure: an item that corresponds to a collection, an author or a department, rather than to the data and attachments from an actual submission?
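One way to picture the answers to these questions is as a small record per content type. The following is only a sketch: the class name and field names are my own invention for illustration, and the example values are taken from the descriptions of the eprint and collection types later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentTypeProfile:
    """Hypothetical record of the answers to the questions above."""
    name: str
    canonical_metadata: str        # e.g. "MODS" or "DC"
    metadata_only: bool            # True if the item has no binary attachments
    download_prefixes: tuple = ()  # datastream ID prefixes listed as downloads
    structural: bool = False       # True if the item only gives the repository structure


# Illustrative values only, based on the type descriptions in this post.
EPRINT = ContentTypeProfile(
    name="eprint",
    canonical_metadata="MODS",
    metadata_only=False,
    download_prefixes=("ATTACHMENT", "JOURNAL"),
)
COLLECTION = ContentTypeProfile(
    name="collection",
    canonical_metadata="DC",
    metadata_only=True,
    structural=True,
)

if __name__ == "__main__":
    for profile in (EPRINT, COLLECTION):
        print(profile.name, profile.canonical_metadata, profile.metadata_only)
```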

So, what broad types does the Oxford Research Archive have on the production server? Copied and pasted from this search (results taken 07 Dec):

(From near the bottom of the response:)
<int name="eprint">398</int> Content type name: eprint
<int name="basic">345</int> Content type name: basic
<int name="ethesi">54</int> Content type name: thesis
<int name="thesi">54</int> Content type name: thesis
<int name="general">52</int> Content type name: general
<int name="collect">6</int> Content type name: collection
<int name="confer">5</int> Content type name: conference and conferenceitem

NB The results are 'stemmed' ('runs' and 'running' both have the stem 'run', for example), so I have written the real names alongside them. Also, because Fedora 3.0 expresses content types (content models, to them) using RDF rather than a metadata field in the object's FOXML, the results above combine the old style of content model with the new. In the new style, I have dropped the 'e' from ethesis as it served no purpose.
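As a rough sketch of how counts like these could be read back into real type names, the snippet below parses a fragment in the same shape as the response above. The wrapping element and its name are assumptions about the surrounding response, and the stem-to-name table is written by hand rather than derived from the stemmer.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape of the counts quoted above. The <lst> wrapper and
# its name attribute are assumptions, not copied from the real response.
FACET_XML = """
<lst name="content_type">
  <int name="eprint">398</int>
  <int name="basic">345</int>
  <int name="ethesi">54</int>
  <int name="thesi">54</int>
  <int name="general">52</int>
  <int name="collect">6</int>
  <int name="confer">5</int>
</lst>
"""

# Hand-written mapping from stemmed index terms back to real type names.
# "ethesi" is the stem of the old-style name, "ethesis".
STEM_TO_NAME = {
    "eprint": "eprint",
    "basic": "basic",
    "ethesi": "ethesis",
    "thesi": "thesis",
    "general": "general",
    "collect": "collection",
    "confer": "conference / conferenceitem",
}


def facet_counts(xml_text):
    """Return {content type name: count} from a facet fragment."""
    root = ET.fromstring(xml_text.strip())
    counts = {}
    for node in root.findall("int"):
        stem = node.get("name")
        # Unknown stems are kept as they are rather than guessed at.
        counts[STEM_TO_NAME.get(stem, stem)] = int(node.text)
    return counts


if __name__ == "__main__":
    for name, count in facet_counts(FACET_XML).items():
        print(f"{name}: {count}")
```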

I should mention that there are universal datastreams, datastreams that are common to all of these types.

DC is the simple Dublin Core for an item and is equivalent to 'oai_dc', the one metadata format that OAI-PMH requires every repository to support. Fedora also requires it, at least up to version 2.2.1: if it is removed, or not included in an ingest, Fedora will make one.

There is also the optional FULLTEXT datastream, which holds whatever textual content can be harvested from the binary files using applications such as antiword, pdftotext and others.
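A minimal sketch of picking an extractor for a given file might look like this. It only builds the command line; which tool the archive really runs for each format, and the function name, are assumptions made for illustration.

```python
import os

# Map file extensions to a command that writes plain text to stdout.
# "pdftotext FILE -" and "antiword FILE" are the usual invocations of
# those tools; the choice of tool per format here is an assumption.
EXTRACTORS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def extractor_command(path):
    """Return the argv list for extracting text from path, or None."""
    ext = os.path.splitext(path)[1].lower()
    build = EXTRACTORS.get(ext)
    return build(path) if build else None


if __name__ == "__main__":
    print(extractor_command("chapter1.pdf"))
    print(extractor_command("submission.doc"))
    print(extractor_command("figure.png"))
```

The returned list can be handed to subprocess.run(cmd, capture_output=True, text=True), with the captured stdout becoming the content of FULLTEXT.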

There is the EVENTS datastream, holding a simple iCal-formatted log of actions taken on the item, as well as actions that should take place on it in future. This is not intended to replace PREMIS or a similar format. It simply allows a much more pragmatic approach to the technicalities of performing, scheduling and logging events, rather than being the canonical data for the provenance of an item.
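To give a feel for what an iCal-formatted log entry looks like, here is a small sketch that writes one event by hand. The property choices, the UID scheme and the PRODID value are assumptions for illustration, not the fields the archive actually records, and the SUMMARY text is not escaped.

```python
from datetime import datetime, timezone


def ical_event(uid, when, summary):
    """Return the lines of one VEVENT for a past or scheduled action.

    `when` should be a timezone-aware datetime; it is written in UTC.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return [
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp}",
        f"DTSTART:{stamp}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
    ]


def ical_log(events):
    """Wrap VEVENT line lists in a VCALENDAR, using CRLF line endings."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//events-log//EN"]
    for event in events:
        lines.extend(event)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    ingested = ical_event(
        "example-item-1-ingest",  # hypothetical identifier
        datetime(2007, 12, 7, 10, 30, tzinfo=timezone.utc),
        "Item ingested",
    )
    print(ical_log([ingested]))
```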

eprint - the bread and butter content type. I'll go through this one in depth:

Datastreams:

This covers a variety of actual content, but mainly the kind of thing you would see in a typical IR: journal articles and short article-type reports. There is a single canonical metadata format (we've chosen MODS) with two derived formats, simple DC and MARCXML, both produced using XSLT. The item also has zero or more attached files, whose datastream IDs have the prefix ATTACHMENT or JOURNAL (legacy) followed by a number that gives the desired listing order: 01, 02, etc. No inline images are likely to be needed, as producing thumbnails of the front pages would be an utter waste: a white box with a black smudge or smudges in the middle. Most of these have a cover page as well, so the thumbnail would be even more useless.
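The attachment naming convention can be sketched as a filter and sort over datastream IDs. The exact shape of the IDs (prefix immediately followed by digits, such as ATTACHMENT01) is an assumption, as is the function name.

```python
import re

# Prefixes named in the text; JOURNAL is the legacy one. The assumption
# here is that the number follows the prefix directly, e.g. ATTACHMENT01.
ATTACHMENT_ID = re.compile(r"^(ATTACHMENT|JOURNAL)(\d+)$")


def downloadable(datastream_ids):
    """Return attachment datastream IDs in their numeric listing order."""
    matched = []
    for dsid in datastream_ids:
        m = ATTACHMENT_ID.match(dsid)
        if m:
            matched.append((int(m.group(2)), dsid))
    # Sort on the number, so 02 comes before 10 whatever the prefix.
    return [dsid for _, dsid in sorted(matched)]


if __name__ == "__main__":
    ids = ["DC", "MODS", "ATTACHMENT02", "JOURNAL01", "ATTACHMENT10", "FULLTEXT"]
    print(downloadable(ids))
```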

Metadata:

Essentially, the metadata will be that provided by the MODS record. Optionally, the better catalogued items will have a mods:relatedItem element with type="host", which defines the host journal and where in it the article came from. This information is normally just filled in on a web form, and the XML nature of it is hidden.
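As an illustration of the host relationship, the sketch below pulls the host journal's title out of a MODS record. The sample record is invented; only the element names and the namespace come from MODS itself.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, for illustration only.
SAMPLE = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>An example article</mods:title></mods:titleInfo>
  <mods:relatedItem type="host">
    <mods:titleInfo><mods:title>Journal of Examples</mods:title></mods:titleInfo>
    <mods:part>
      <mods:detail type="volume"><mods:number>12</mods:number></mods:detail>
    </mods:part>
  </mods:relatedItem>
</mods:mods>
"""


def host_journal(mods_xml):
    """Return the title of the host journal, or None if none is recorded."""
    root = ET.fromstring(mods_xml.strip())
    title = root.find(
        "mods:relatedItem[@type='host']/mods:titleInfo/mods:title", NS
    )
    return title.text if title is not None else None


if __name__ == "__main__":
    print(host_journal(SAMPLE))
```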

Presentation:

Test to see if the item has MODS and get that. Transform it using the eprint-specific XSLT stylesheet.
If not, get the DC instead and use the basic dc2html.xsl.
Get the list of acceptable datastreams and present links to these as downloads.
Look for interesting RDF connections and present those too.
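The first two steps amount to a simple fallback, sketched here. The datastream ID "MODS" and the stylesheet name eprint_mods2html.xsl are hypothetical stand-ins; only dc2html.xsl is named above.

```python
def choose_view(datastream_ids,
                mods_stylesheet="eprint_mods2html.xsl",  # hypothetical file name
                dc_stylesheet="dc2html.xsl"):
    """Return (metadata datastream ID, stylesheet) for rendering an eprint."""
    if "MODS" in datastream_ids:
        return "MODS", mods_stylesheet
    # Fall back to the simple Dublin Core that every item carries.
    return "DC", dc_stylesheet


if __name__ == "__main__":
    print(choose_view(["DC", "MODS", "ATTACHMENT01"]))
    print(choose_view(["DC", "ATTACHMENT01"]))
```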

basic - a simple DC based atomic item. This will often be the default type of any item harvested from an EPrints.org or similar repository.

Datastreams: A Ronseal item (it does exactly what it says on the tin): a single 'thing' is archived, with metadata no more complicated than simple DC can handle. The 'thing' can be in multiple formats, especially if it is initially submitted or harvested in a proprietary format such as MS Word.

Metadata: Simple DC

Presentation:

Get DC and pass it through the dc2html.xsl
Get the list of acceptable datastreams and present links to these as downloads.
Look for interesting RDF connections and present those too.

thesis - rich metadata stored in MODS, with a good number of binary attachments (typically one PDF per chapter, alongside whatever original files were uploaded). Very similar to the eprint type, but with some important differences: one author is expected, with zero or more supervisors. The role is indicated in the usual MODS way, for example /mods:mods/mods:name/mods:role/mods:roleTerm = 'author'. Also, etd metadata should be present in the mods:extension section.

Datastreams and Metadata: As for eprint, but with small differences as noted above.

Presentation: Currently, as for eprint, but with a slightly different template and vocab. This will be developed given time to make more of the fact that it will have a single author, zero or more supervisors, etc.
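The role lookup described for the thesis type could be sketched as below, grouping names by their roleTerm. The sample record is invented, and 'supervisor' as the roleTerm value for supervisors is an assumption; only 'author' appears in the text above.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record; the 'supervisor' role value is an assumption.
SAMPLE = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:name>
    <mods:namePart>A. Student</mods:namePart>
    <mods:role><mods:roleTerm>author</mods:roleTerm></mods:role>
  </mods:name>
  <mods:name>
    <mods:namePart>B. Tutor</mods:namePart>
    <mods:role><mods:roleTerm>supervisor</mods:roleTerm></mods:role>
  </mods:name>
  <mods:name>
    <mods:namePart>C. Tutor</mods:namePart>
    <mods:role><mods:roleTerm>supervisor</mods:roleTerm></mods:role>
  </mods:name>
</mods:mods>
"""


def names_by_role(mods_xml):
    """Return {role term: [names]} for the top-level mods:name elements."""
    root = ET.fromstring(mods_xml.strip())
    by_role = {}
    for name in root.findall("mods:name", NS):
        parts = [p.text.strip() for p in name.findall("mods:namePart", NS) if p.text]
        display = " ".join(parts)
        for term in name.findall("mods:role/mods:roleTerm", NS):
            role = (term.text or "").strip()
            if role:
                by_role.setdefault(role, []).append(display)
    return by_role


if __name__ == "__main__":
    roles = names_by_role(SAMPLE)
    print("author:", roles.get("author", []))
    print("supervisors:", roles.get("supervisor", []))
```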

general - an ephemera collection rather than exactly a content type, as it holds everything else. Groups of items will get 'promoted' from this collection once we have identified that there are enough of one type of item to warrant the time and effort of doing so. The one thing that unites these items is that the MODS metadata format is capable of holding a good description of what they are.

Datastreams, Metadata, and Presentation: As for eprint

collection - a metadata-only, structural item. It exists to enable collections and so forth, providing a node in the triplestore that other items can then relate to. It holds a very basic Dublin Core record.

conference and conferenceitem - Conference item types. The conference type itself is intimately related to the collection type; it exists to provide a URI to hang some information upon. How it differs is in the metadata it has to describe itself. A conference item has MODS metadata indicating the location, the dates, editors and other associated information that a conference implies. The item can represent either a single instance of a conference (Open Repositories 08) or a series (Open Repositories).

A picture tells a thousand words; this is a link to an image that should help show how these conference item types are used: http://www.flickr.com/photos/oxfordrepo/2102829887/

(Erratum: the image should have been updated to show 'isMemberOfCollection' in preference to just 'isMemberOf'.)
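To make the 'node that other items relate to' idea concrete, here is a toy sketch using plain tuples in place of a triplestore. The item identifiers are invented, and treating a single conference as a member of its series is my assumption about the shape of the graph; only the isMemberOfCollection predicate comes from Fedora's external relations namespace.

```python
REL = "info:fedora/fedora-system:def/relations-external#"
IS_MEMBER_OF_COLLECTION = REL + "isMemberOfCollection"

# Invented identifiers, for illustration only.
TRIPLES = [
    ("info:fedora/example:paper1", IS_MEMBER_OF_COLLECTION, "info:fedora/example:or08"),
    ("info:fedora/example:paper2", IS_MEMBER_OF_COLLECTION, "info:fedora/example:or08"),
    # Assumption: a single conference hangs off its series in the same way.
    ("info:fedora/example:or08", IS_MEMBER_OF_COLLECTION, "info:fedora/example:or-series"),
]


def members_of(collection_uri, triples):
    """Return the subjects that are direct members of collection_uri."""
    return sorted(
        s for s, p, o in triples
        if p == IS_MEMBER_OF_COLLECTION and o == collection_uri
    )


if __name__ == "__main__":
    print(members_of("info:fedora/example:or08", TRIPLES))
    print(members_of("info:fedora/example:or-series", TRIPLES))
```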

Thursday, 13 December 2007
