Thursday 16 October 2008

News and updates Oct 2008

Right, I haven't forgotten about this blog; I've just been getting all my ducks in a row, as it were. Some updates:

  • Our JISC eAdministration bid, "Building the Research Information Infrastructure (BRII)", was successful. The project will categorise the research information structure, build vocabularies where necessary, and populate it with information. It will link research outputs (text and data), people, projects, groups, departments, grant/funding information and funding bodies together, using RDF and as many pre-existing vocabularies as are suitable. The first vocab gap we've hit is one for funding, and I've made a draft RDF schema for this which will be openly published once we've worked out a way to make it persistent here at Oxford (trying to get a vocab.ox.ac.uk address). There's a rough sketch of the sort of triples involved after the list below.
    • One of the final outputs will be a 'foafbook' which will re-use data in the BRII store - it will act as a blue book of researchers. Think Cornell's Vivo, but with the idea of Linked Data firmly in mind.
    • We are just sorting out a home for this project, and I'll post up an update as soon as it is there.
  • Forced Migration Online (FMO) have completed their archived document migration from a crufty, proprietary store to an ORA-style store (Fedora/Solr) - you can see their preliminary frontend at http://fmo.qeh.ox.ac.uk. Be aware that this is a work in progress. We provide the store as a service to them, giving them a Fedora and a Solr to use. They contracted a company called Aptivate to migrate their content, and I believe also to create their frontend. This is a pilot project to show that repositories can be treated in a distributed way, given out like very smart, shared drive space.
  • We are working to archive and migrate a number of library and historical catalogs. A few projects have a similar aim: to provide an architecture and software system to hold medieval catalog research - a record of what libraries existed, and what books and works they held. This is much more complex than a normal catalog, as each assertion is backed by a type of evidence, ranging from the solid (first-hand catalog evidence) to the looser (handwriting on the front page looks like a certain person who worked at a certain library). So modelling this informational structure is looking to be very exciting, and we will have to try a number of ways to represent it, starting with RDF due to the interlinked nature of the data (a sketch of one approach is after this list). This is related to the kinds of evidence that genealogy uses, and so related ontologies may be of use.
  • The work on storing and presenting scanned imagery is gearing up. We are investigating storing the sequence of images and associated metadata/OCR text/etc as a single tar file as part of a Fedora object (i.e. a book object will have a catalog record, technical/provenance information, an attached tar file and a list of file-to-offset information.)
    • This is due to us trying to hit the 'sweet spot' for most file systems. A very large number of highly compressed images and little pieces of text does not fit well with most FS internals. We estimate that a book will come to around [4+PDFs+2xPages] files, typically 500+. Just counting up the various sources of scanned media we already have, we are pressing for about half a million books from one source, 200,000 images from another, 54,000 from yet another... it's adding up fast. (There's a sketch of the tar-plus-offsets idea after this list.)
  • We are starting to deal with archiving/curating the 'long tail' of data - small, bespoke datasets that are useful to many but don't fall into the realm of Big Data or Web data. I don't plan on touching Access/FoxPro databases any time soon though! I am building a Fedora/Solr/eXist box to hold and disseminate these, which should live at databank.ouls.ox.ac.uk very, very shortly. (Just waiting on a new VMware host to grow into; our current one is at capacity.)
    • To give a better idea of the structure, etc., I am writing it up in a second blog post to follow shortly - currently titled "Modelling and storing a phonetics database inside a store".
  • I am in the process of integrating the Google-Analytics-style statistics package at http://piwik.org with the ORA interface, to give relatively live hit counts on a per-item basis and to build per-collection reports.
    • Right now, piwik is capturing the hits and downloads from ORA, but I have yet to add in the count display on each item page, so halfway there :) (A rough sketch of the kind of API call involved is after this list.)
  • We are just waiting on a number of departments here to upgrade the version of EPrints they use for their internal, disciplinary repositories, so that we can begin archiving surrogate copies of the work they wish to put up for this service (using ORE descriptions of their items). By doing so, their content becomes exposed in ORA and mirror copies are made (I'm working on a good way to maintain these as content evolves), but they retain control of the content, and ORA will also act as a registry for it. It's only when their service drops that users get redirected to the mirror copies that ORA holds - think Google cache, but a 100% copy. (A small sketch of the fallback logic is after this list.)
  • In the process of battle-testing the Fedora-Honeycomb connection, but as mentioned above, just waiting for a little more hardware before I set to it. Also, we are examining a number of other storage boxes that should plug in under Fedora using the Honeycomb software, such as the new and shiny Thumper box, "Thor" Sun Fire Xsomething-or-other. Also getting pretty interested in the idea of MAID storage - massive array of idle disks. Hopefully, this will act like tape but have the sustained access speed of disk. Also, a little greener than a tower of spinning hardware.
  • Planning out the indexer service at the moment. It will use the Solr 1.3 multicore functionality, with a little parsing magic at the ingest side of things to make a generic indexer-as-a-service type system. One use-case is to be able to bring up VMs with multicore Solr on them to act as indexers/search engines as needed. An example aim? "Economics want an index that facets on their JEL codes." POST a schema and ingest config to the nearest free indexer, and point the search interface at it once an XMPP message comes back saying that it is finished. (There's a sketch of the core-creation step after this list.)
  • URI resolvers - still investigating what can be put in place for this, as I strongly wish to avoid coding it myself. Looking at OCLC's OpenURL resolver and how I can hack it to feed it info:fedora URIs and link them to their disseminated location. Also, using a tinyurl-type library plus a simple interface might not be a bad idea for a quick PoC (trivial sketch after this list).
  • Just to let you all know that we are building up the digital team here - we most recently held interviews for the futureArch project, and we are looking to hire about three more people, given all the projects we have on. We will be putting out job adverts as and when we feel up to battling with HR :)
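A few quick sketches to flesh out the ideas above. All of them are illustrative Python written against assumptions I flag as I go - none of it is code that's actually running here.

First, the funding vocabulary gap from the BRII item: a minimal rdflib sketch of the kind of triples the draft schema is meant to support. The fund: namespace and its terms are placeholders of mine, not the actual draft, and all the URIs are invented:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Placeholder namespace: the real one should end up under vocab.ox.ac.uk
FUND = Namespace("http://vocab.ox.ac.uk/funding#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
g.bind("fund", FUND)
g.bind("foaf", FOAF)
g.bind("dc", DC)

grant = URIRef("http://example.org/id/grant/brii")      # invented
funder = URIRef("http://example.org/id/body/jisc")      # invented
project = URIRef("http://example.org/id/project/brii")  # invented

g.add((grant, FUND.awardedBy, funder))  # grant -> funding body
g.add((grant, FUND.funds, project))     # grant -> the project it pays for
g.add((project, DC.title, Literal("Building the Research Information Infrastructure")))
g.add((funder, FOAF.name, Literal("JISC")))

print(g.serialize(format="turtle"))
```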
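For the medieval catalog work, the interesting part is hanging evidence off each assertion. One obvious first cut is plain RDF reification, so the claim itself becomes a resource that evidence statements can point at. Every vocabulary term and URI below is hypothetical:

```python
from rdflib import BNode, Graph, Literal, Namespace, RDF, URIRef

EV = Namespace("http://example.org/evidence#")  # hypothetical terms
MLIB = Namespace("http://example.org/medlib#")  # hypothetical terms

g = Graph()

# Reify the claim "library X held work Y" so the claim itself
# can carry the type and strength of evidence behind it.
claim = BNode()
g.add((claim, RDF.type, RDF.Statement))
g.add((claim, RDF.subject, URIRef("http://example.org/id/library/x")))
g.add((claim, RDF.predicate, MLIB.held))
g.add((claim, RDF.object, URIRef("http://example.org/id/work/y")))

# The evidence backing the claim, graded from solid to loose:
g.add((claim, EV.basis, Literal("first-hand catalogue entry")))
g.add((claim, EV.strength, Literal("solid")))
```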
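The tar-file idea for scanned books stands or falls on being able to pull one page out without unpacking the lot. Conveniently, Python's tarfile module exposes the byte offset of each member's data, which is exactly what the file-to-offset list needs. A sketch, with invented filenames:

```python
import tarfile

def pack_book(tar_path, filenames):
    """Pack page images into one uncompressed (seekable) tar and
    record where each member's data starts, for later ranged reads."""
    offsets = {}
    with tarfile.open(tar_path, "w") as tar:
        for name in filenames:
            tar.add(name)
            member = tar.getmember(name)
            # offset_data = where this file's bytes begin inside the tar
            offsets[name] = (member.offset_data, member.size)
    return offsets

offsets = pack_book("book.tar", ["page001.jp2", "page002.jp2"])
# e.g. {'page001.jp2': (512, 48213), 'page002.jp2': (49664, 47990)}
```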
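On the piwik side, the per-item count display boils down to one call against piwik's HTTP API. The sketch below is from memory of the API docs - treat the method name, parameters and response fields as assumptions to check against the piwik version in use, and the host is invented:

```python
import json
import urllib.parse
import urllib.request

PIWIK = "http://piwik.example.org/index.php"  # invented host

def item_hits(item_url, token):
    """Ask piwik how many hits a single ORA item page has had."""
    params = urllib.parse.urlencode({
        "module": "API",
        "method": "Actions.getPageUrl",  # assumed method name
        "pageUrl": item_url,
        "idSite": 1,
        "period": "year",
        "date": "today",
        "format": "JSON",
        "token_auth": token,
    })
    with urllib.request.urlopen(PIWIK + "?" + params) as resp:
        rows = json.load(resp)
    return sum(row.get("nb_hits", 0) for row in rows)
```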
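The 'Google cache, but a 100% copy' behaviour for the EPrints surrogates is, at heart, an availability check plus a redirect. A trivial sketch of that fallback logic (URLs would be whatever the registry holds):

```python
import urllib.error
import urllib.request

def resolve(origin_url, mirror_url, timeout=5):
    """Send users to the departmental copy if it answers,
    otherwise fall back to the surrogate held in ORA."""
    try:
        head = urllib.request.Request(origin_url, method="HEAD")
        urllib.request.urlopen(head, timeout=timeout)
        return origin_url
    except urllib.error.URLError:
        return mirror_url
```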
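For the indexer-as-a-service idea, Solr 1.3's CoreAdmin handler already gives us remote core creation; the rest is plumbing. A sketch of the core-creation step - the host name and directory layout are assumptions:

```python
import urllib.parse
import urllib.request

SOLR = "http://indexer01.example.org:8983/solr"  # invented host

def create_core(name, instance_dir):
    """Ask a multicore Solr host to bring up a fresh core,
    e.g. one faceting on JEL codes for Economics."""
    params = urllib.parse.urlencode({
        "action": "CREATE",
        "name": name,
        "instanceDir": instance_dir,
    })
    with urllib.request.urlopen(SOLR + "/admin/cores?" + params) as resp:
        return resp.read()

create_core("economics", "cores/economics")
```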
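And lastly, the tinyurl-style resolver PoC really can be this small: a lookup table from info:fedora URIs to current dissemination locations, answered with a redirect. Everything here is illustrative, and the real table would live in a database rather than a dict:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

# Invented example mapping
LOCATIONS = {
    "info:fedora/ora:1234": "http://ora.ouls.ox.ac.uk/objects/ora:1234",
}

class Resolver(BaseHTTPRequestHandler):
    def do_GET(self):
        uri = unquote(self.path.lstrip("/"))
        target = LOCATIONS.get(uri)
        if target:
            self.send_response(303)  # redirect to the current home
            self.send_header("Location", target)
        else:
            self.send_response(404)  # unknown identifier
        self.end_headers()

HTTPServer(("", 8080), Resolver).serve_forever()
```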
That's most of the more interesting hot topics and projects I am working on at the moment... phew :)

2 comments:

Anonymous said...

Glad to see slave labour is back in full force! Hopefully we can start hiring kids again to work in sweat dev shops?... Darn EU laws getting in the way ;-D Of course the follow-up post has got to be a mapping of lines of code to each one of those projects! Less Talk, More Code!

Anonymous said...

Looks like a long list to me, Ben. Is "crufty" a technical term? Speaking of the long tail of data, here is Chris Anderson's big data offering, in which the Sloan Digital Sky Survey, a Fedora-related project, is mentioned: http://tinyurl.com/3tg7c9