Thursday 13 December 2007

Metadata - Why I wrap most of it in MODS and not Dublin Core

I should say a few things before going on -

My Opinion about metadata formats:

It is good for metadata to be accessible in different formats, whether the other formats are created on the fly or are stored alongside the canonical metadata format(s).
  • So, if all your metadata is expressed in MarcXML, it is handy for it to be translatable into other wrappers, such as Dublin Core or MODS - remembering (and explicitly noting) that the MODS and DC are derivative and may not be a 100% perfect translation.
For a given type of object, certain metadata wrappers are more capable of providing the granularity necessary to accurately reflect the data that they encapsulate. Also, certain metadata formats are simply easier to handle than others. A good choice for the canonical metadata format(s) for a given type of object should do well on both counts.

And by granularity, I mean that the data needs no additional scheme to make sense in context. For example, a good format for text-based item metadata should offer obvious ways to distinguish between the author, the editor, the supervisor and the artist of a given work. The date of creation and the date of publishing should be similarly distinguishable, and the format should have some mechanism for including rights and identifiers.
  • So, for metadata pertaining to printed or printable text-based items - articles, books, abstracts, theses, reports, presentations, booklets, etc - MODS, DC and MarcXML are clear possibilities, as is the NLM Journal DTD, although it immediately limits itself by its own scope.
    • In terms of built-in data granularity, MODS wins, followed by Qualified Dublin Core, then MarcXML, and then, at a distant position, Simple Dublin Core.
    • In terms of simplicity in handling, Simple Dublin Core wins hands down, followed by both MODS and Qualified Dublin Core - as the time and effort spent building tools to create/edit MODS is likely to be comparable to the time spent creating and dealing with profiles for Qualified Dublin Core, even though that format is simpler - and then trailing at a significant distance is MarcXML, which is less of a packaging format and akin to a coffin for data - you know the data is in there, but you dread the thought of trying to get it out again.
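To make the granularity point concrete, here is a minimal sketch in Python of reading roles and dates straight out of a MODS record with the standard library. The record itself is invented for illustration, and the helper names (names_by_role, dates) are hypothetical; only the MODS element names and namespace come from the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical record, invented purely for illustration.
MODS_RECORD = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>An Example Thesis</mods:title></mods:titleInfo>
  <mods:name type="personal">
    <mods:namePart>Smith, A.</mods:namePart>
    <mods:role><mods:roleTerm type="text">author</mods:roleTerm></mods:role>
  </mods:name>
  <mods:name type="personal">
    <mods:namePart>Jones, B.</mods:namePart>
    <mods:role><mods:roleTerm type="text">supervisor</mods:roleTerm></mods:role>
  </mods:name>
  <mods:originInfo>
    <mods:dateCreated>2006</mods:dateCreated>
    <mods:dateIssued>2007</mods:dateIssued>
  </mods:originInfo>
</mods:mods>
""".strip()


def names_by_role(mods_xml):
    """Group each mods:namePart under the text of its mods:roleTerm."""
    root = ET.fromstring(mods_xml)
    roles = {}
    for name in root.findall("mods:name", MODS_NS):
        person = name.findtext("mods:namePart", default="", namespaces=MODS_NS)
        role = name.findtext(
            "mods:role/mods:roleTerm", default="unknown", namespaces=MODS_NS
        )
        roles.setdefault(role, []).append(person)
    return roles


def dates(mods_xml):
    """Return the creation and publication dates as separate values."""
    root = ET.fromstring(mods_xml)
    origin = "mods:originInfo/"
    return {
        "created": root.findtext(origin + "mods:dateCreated", namespaces=MODS_NS),
        "issued": root.findtext(origin + "mods:dateIssued", namespaces=MODS_NS),
    }


print(names_by_role(MODS_RECORD))
print(dates(MODS_RECORD))
```

The author and the supervisor come out under separate keys, and so do the two dates, without any extra scheme layered on top. In simple Dublin Core the same people would most likely end up as a dc:creator and a dc:contributor, and the two dates would compete for the one dc:date element.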
So, that kind of thought process (excellent granularity for text-based items + moderate difficulty in handling due to it actually using an XML hierarchy of elements for once) led us to consider MODS as the de facto standard, the canonical format, for wrapping up the metadata associated with that item or collection of items.

Luckily, so far anyway, text-based items are the only main grouping of object types that this type of decision has been made for. All the content types created so far are based on the idea that MODS can handle the vast majority of the information needed to define the item or items.

However, there are two things which are not totally orthodox. One change was made to increase granularity at the expense of introducing a folksonomy (only in the sense that it is the opposite of a controlled vocabulary). The second is that we are making use of the mods:extension element to hold thesis-specific metadata, using elements from the UK ETD schema, specifically the 'degree' block.

Hopefully, you can see a little of the reasoning behind why we started with MODS now.

So why are multiple formats for the same information good?

Short answer - Because one format won't please all the people all of the time.

Longer answer - Best shown by example:


The eThOS service would like their thesis metadata in the ETD schema as mentioned above. OAI-PMH services tend to only harvest simple Dublin Core. The NEEO project (Network of European Economists Online) will only consider items harvested from an OAI-PMH service that holds the metadata in MODS and supplies it in MPEG-DIDL format. Certain members of the library community are interested in getting the metadata in MARC format.... and so on ad nauseam.

You are not going to change their minds. You simply have to try to support what is possible and pragmatic.

But being able to express the same data by wrapping it in a variety of formats, and making those accessible in the same manner as the canonical version rather than through some special 'export' function, will go a long way in helping you support a variety of these 'individual' demands....
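As a rough sketch of what "accessible in the same manner as the canonical version" could look like, the Python below keeps MODS as the canonical record and derives simple Dublin Core from it on request, through the same lookup that returns the MODS itself. Everything here is an assumption made for illustration: the record, the function names (as_mods, as_simple_dc, get_metadata), the format keys, and the crosswalk, which is a deliberate simplification and not the official MODS to Dublin Core mapping.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical canonical record, invented for illustration.
CANONICAL_MODS = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>An Example Thesis</mods:title></mods:titleInfo>
  <mods:name type="personal">
    <mods:namePart>Smith, A.</mods:namePart>
    <mods:role><mods:roleTerm type="text">author</mods:roleTerm></mods:role>
  </mods:name>
  <mods:name type="personal">
    <mods:namePart>Jones, B.</mods:namePart>
    <mods:role><mods:roleTerm type="text">supervisor</mods:roleTerm></mods:role>
  </mods:name>
  <mods:originInfo>
    <mods:dateCreated>2006</mods:dateCreated>
    <mods:dateIssued>2007</mods:dateIssued>
  </mods:originInfo>
</mods:mods>"""


def as_mods(mods_xml):
    """The canonical format is handed back untouched."""
    return mods_xml


def as_simple_dc(mods_xml):
    """Derive simple Dublin Core from MODS.

    Lossy by design: every non-author role collapses into dc:contributor
    and only the issue date survives as dc:date.
    """
    src = ET.fromstring(mods_xml)
    out = ET.Element("{%s}dc" % OAI_DC)

    def add(tag, text):
        # Skip elements that have no value in the source record.
        if text:
            ET.SubElement(out, "{%s}%s" % (DC, tag)).text = text

    add("title", src.findtext("mods:titleInfo/mods:title", namespaces=NS))
    for name in src.findall("mods:name", NS):
        role = name.findtext("mods:role/mods:roleTerm", default="", namespaces=NS)
        person = name.findtext("mods:namePart", namespaces=NS)
        add("creator" if role == "author" else "contributor", person)
    add("date", src.findtext("mods:originInfo/mods:dateIssued", namespaces=NS))
    return ET.tostring(out, encoding="unicode")


# One lookup table: derived formats are reached the same way as the canonical one.
FORMATS = {"mods": as_mods, "oai_dc": as_simple_dc}


def get_metadata(mods_xml, fmt="mods"):
    """Return the record in the requested wrapper, or fail loudly."""
    if fmt not in FORMATS:
        raise ValueError("unsupported metadata format: %s" % fmt)
    return FORMATS[fmt](mods_xml)


print(get_metadata(CANONICAL_MODS, "oai_dc"))
```

Supporting another 'individual' demand then means adding one more function to the table, and the derived output stays explicitly derivative: the supervisor role and the creation date are gone from the Dublin Core version.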

<rant>
I mean, NEEO mandating the future use of a tired protocol (OAI-PMH) to provide MPEG-DIDL (how many existing and stable repository software packages do this out of the box?), when the DIDL itself only contains a single MODS datastream and a load of links to the binary files? Bah.

If you are going to make up a whole new system of formats to use, at least research what is currently being done. I mean, Atom can easily perform the function of the MPEG-DIDL as used in this manner, plus there are easy handy dandy tools and software pre-existing for it. Oh, and Atom is being used all over the web. Oh, and what's that? OAI-ORE, the successor to OAI-PMH, is going to use Atom as well? Hmm, if I can spot the trend here NEEO, then maybe you can too.

[NB yes, this is a very blunt statement for which I will no doubt receive flak, but I have a lot of other things to handle and implement, and having a transient organisation such as NEEO (most organisations and governments are transient to a 900 year old+ University) accept only a very specific and uncommon type of dissemination and state that it is entirely up to the repository managers to implement this in software (i.e. it implies they won't lift a finger or fund development) is very unreasonable. I have higher priorities than to code custom software that is likely to be superseded in the near future.]
</rant>

1 comment:

David said...

This post partially depresses me as I wrote a similarly themed four-chapter dissertation for my master's, and you just summed it up in a blog post! doh!

Though the brilliant quote I will take away from this post: "MarcXML...akin to a coffin for data...you know the data is in there, but you dread the thought of trying to get it out again"!