Thursday, 13 November 2008

A Fedora/Solr Digital Library for Oxford's 'Forced Migration Online'

(mods:subtitle - Slightly more technical follow-up to the Fedora Hatcheck piece.)

As I have been prompted via email by Phil Cryer (of the Missouri Botanical Garden) to talk more about how this technically works, I thought it best to write it up as a post rather than give a more limited email response.


Forced Migration Online (FMO) had a proprietary system supporting their document needs. It was originally designed for newspaper holdings, and it applied that model to encoding the mostly paginated documents that FMO held, such that each part was broken up into paragraphs of text, images and the location of all these parts on a page. It even encoded (in its own format) the location of the words on the page when it OCR'd the documents, making per-word highlighting possible. Which is nice.

However, the backend that powered this was over-priced, and FMO wanted to move to a more open, sustainable platform.

Enter the DAMS

(DAMS = Digital Asset Management System)

I have been working on making a service out of a base of Fedora Commons and additional 'plugin' services, such as the wonderful Apache Solr and the useful eXist XML db. The end aim is for departments/users/whoever to requisition a 'store' with a certain quality of service (Solr attached, 50GB+ etc.), but this is not yet an automated process.

The focus for the store is a very clear separation between storage, management, indexing services and distribution. Normal filesystems or Sun Honeycomb are the storage; Fedora Commons provides the management and CRUD; Solr, eXist, Mulgara, Sesame and CouchDB can provide potential index and query services; and distribution is handled pragmatically, caching outgoing content and mirroring where necessary.

The FMO 'store'

From discussions with FMO, and from examining the information they held and the way they wished to make use of it, a simple Fedora/Solr store seemed to fulfil what they wanted: a persistent store of items with attachments, and the ability to search the metadata and retrieve results.

Bring in the consultants

FMO hired Aptivate to migrate their data from the proprietary system, in its custom format, to a Fedora/Solr store, while retaining as much as possible of the functionality they had.

Some points that I think it is important to impress on people here:
  • In general, software engineer consultants don't understand METS or FOXML.
  • They *really* don't understand the point of disseminators.
  • Having to teach software engineer consultants to do METS/FOXML/bDefs etc. is likely an arduous and costly task.
  • Consultants charge a lot of money to do things their team doesn't already have the experience to do.
So, my conclusion was to not make these things part of the development at all, to the extent that I might even have forgotten to mention these things to them except in passing. I helped them install their own local store and helped them with the various interfaces and gotchas of the two software packages. By showing them how I use Fedora and Solr, I enabled them to hit the ground running.

They began by using the REST interface to Fedora and the RESTful interface to Solr. By starting with the simple put/get REST interface to Fedora, they could concentrate on getting used to the nature of Fedora as an object store. I think they later moved to the SOAP interface as it better suited their Java background, although I cannot be certain as it wasn't an issue that came up.
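A minimal sketch of that put/get style, assuming the URL layout of the Fedora 3.x REST API (`/objects/{pid}/datastreams/{dsID}/content`). The base URL, the pid and the datastream ID are invented for illustration and are not FMO's real names; the paths should be checked against whichever Fedora version is installed. The functions only build requests; nothing is sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed location of a local Fedora install (hypothetical).
FEDORA_BASE = "http://localhost:8080/fedora"


def datastream_content_url(pid, dsid, base=FEDORA_BASE):
    """Return the URL of a datastream's content (Fedora 3.x REST layout)."""
    return "%s/objects/%s/datastreams/%s/content" % (
        base.rstrip("/"),
        quote(pid, safe=":"),   # keep the colon in pids like "changeme:1"
        quote(dsid, safe=""),
    )


def get_datastream_request(pid, dsid, base=FEDORA_BASE):
    """Build (but do not send) a GET request for a datastream's content."""
    return Request(datastream_content_url(pid, dsid, base), method="GET")
```

Passing the returned request to `urllib.request.urlopen` would perform the actual fetch against a running store.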

Once they had developed the migration scripts to their satisfaction, they asked me to give them a store, which I did (though, due to hardware and stupid support issues here, I am afraid to say I held them up on this). They fired off their scripts and moved all the content into Fedora with a straightforward layout per object (PDF, metadata, fulltext and thumbnail). The metadata is, from what I can see, the same XML metadata as before: very MARCXML in nature, with 'Application_Info' elements having types like 'bl:DC.Title'. If necessary, we will strip out the Dublin Core metadata and put what we can into the DC datastream, but that's not of particular interest to FMO right now.
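Pulling the Dublin Core values out of that metadata might look roughly like the sketch below. The real schema is not shown in this post, so the record structure is an assumption: the sample wrapper element, the `Type` attribute name and the sample values are all hypothetical. Only the 'Application_Info' element name and the 'bl:DC.Title' type come from the actual data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the real FMO layout may well differ.
SAMPLE = """<Record>
  <Application_Info Type="bl:DC.Title">An example report</Application_Info>
  <Application_Info Type="bl:DC.Creator">John Smith</Application_Info>
  <Application_Info Type="bl:Pages">12</Application_Info>
</Record>"""


def extract_dc(xml_text, prefix="bl:DC."):
    """Collect values of elements whose (assumed) Type starts with the prefix."""
    root = ET.fromstring(xml_text)
    found = {}
    for el in root.iter("Application_Info"):
        kind = el.get("Type", "")
        if kind.startswith(prefix):
            # "bl:DC.Title" becomes the Dublin Core element name "title".
            name = kind[len(prefix):].lower()
            found.setdefault(name, []).append((el.text or "").strip())
    return found
```

The resulting dictionary could then be serialised into whatever the DC datastream expects; that step is left out here.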

Fedora/Solr notes

As for the link between Solr and Fedora? This is very loosely coupled, such that they are running in the same Tomcat container for convenience, but aren't linked in a hard way.

I've looked at GSearch, which is great for a homogeneous collection of items that can all be acted on by the same XSLT to produce a suitable record for Solr, but as the metadata was a complete unknown for this project, it wasn't too suitable.

Currently, they have one main route into the fedora store, and so, it isn't hard to simply reindex an item after a change is made, especially for services such as Solr or eXist, which expect to have things change incrementally. I am looking at services such as ActiveMQ for scheduling these index tasks, but more and more I am starting to favour RabbitMQ which seems to be more useful, while retaining the scalability and very robust nature.
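The reindex-after-change idea can be sketched with an in-process queue. This uses Python's standard `queue` module purely as a stand-in for a real broker; it is not the ActiveMQ or RabbitMQ API, and the message shape and function names are made up for illustration.

```python
import queue


def notify_changed(q, pid):
    """Called by the single write route after an object is changed."""
    q.put({"action": "reindex", "pid": pid})


def drain(q, reindex):
    """Process every pending message, calling reindex(pid) for each one."""
    handled = []
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            break
        reindex(msg["pid"])       # e.g. rebuild and post the Solr document
        handled.append(msg["pid"])
        q.task_done()
    return handled
```

Because the write path only enqueues a message, the indexer can be slow, restarted or swapped for another service without touching the code that talks to Fedora.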

Sending an update to Solr is as simple as an HTTP POST to its /update service, consisting of an XML or JSON packet: for example, an <add><doc>...</doc></add> document whose fields carry values such as "changeme:1" and "John Smith". It uses a transactional model, such that you can push all the changes and additions into the live index via a commit call, without taking the index offline. To query Solr, all manner of clients exist, and it is built to be very simple to interact with: it handles facet queries, filtering and ordering, and can deliver the results in XML, JSON, PHP or Python directly. It can even do an XSLT transform of the results on the way out, leading to a trivial way to support OpenSearch, Atom feeds and even HTML blocks for embedding in other sites.
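As a rough sketch of those calls: the snippet below builds a Solr XML add message, the request that would post it to /update, and a faceted query URL. The `<add><doc><field name="...">` format and the `<commit/>` message are standard Solr; the base URL and the field names `id` and `author` are assumptions for the example, since the real schema is not given here. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import Request

SOLR_BASE = "http://localhost:8080/solr"  # assumed location
COMMIT_XML = "<commit/>"                  # posted to /update to make changes live


def solr_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


def solr_update_request(xml_body, base=SOLR_BASE):
    """Build (but do not send) the POST to Solr's /update handler."""
    return Request(
        base + "/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def solr_query_url(query, facet_field=None, base=SOLR_BASE):
    """Build a /select URL asking for JSON, optionally faceting on one field."""
    params = [("q", query), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base + "/select?" + urlencode(params)
```

A typical round trip would post the add message, then post `COMMIT_XML` to the same /update URL, and only after that expect the document to show up in query results.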

Likewise, changing a PDF in Fedora can be done with an HTTP POST as well. Does it need to be more complicated?
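A hedged sketch of such a request follows. The base URL, the pid and the `PDF` datastream ID are hypothetical, and the path follows the Fedora 3.x REST layout. As far as I know, the later Fedora 3.x REST documentation uses POST to add a datastream and PUT to modify an existing one, so the HTTP verb is left as a parameter rather than hard-coded. A real store would also require authentication, which is omitted here.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def pdf_update_request(pid, pdf_bytes, dsid="PDF",
                       base="http://localhost:8080/fedora", method="POST"):
    """Build (but do not send) a request that uploads new PDF content."""
    url = "%s/objects/%s/datastreams/%s?%s" % (
        base.rstrip("/"),
        quote(pid, safe=":"),
        quote(dsid, safe=""),
        urlencode({"mimeType": "application/pdf"}),
    )
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method=method,  # "POST" as described here; "PUT" on some versions
    )
```

Sending it is then a single `urllib.request.urlopen(request)` call, which is about as complicated as it gets.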

Last, but not least, a project to watch closely:

The Fascinator project, funded by ARROW, as part of their mini project scheme, is an Apache Solr front end to the Fedora commons repository. The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. Well worth a look, as it seeks to turn this Fedora/Solr pairing truly into an appliance, with a simple installer and handling the linkage between the two.

1 comment:

phil said...

Fascinator is well done. I'm using it now; I had to hack the install script a bit, but it now works perfectly in Debian. You just download the script and run it, and it'll download everything that's needed and set up a base Fascinator env running in Tomcat. Link with my Debian patch:;a=summary

Now need to learn how to harvest new Fedora objects into Solr.