Monday, 28 January 2008

Update

I am still here...

Just thought I'd put in a quick note to say that yes, I am still posting to this blog and yes, it is worth waiting for (I hope).

The posts that are sitting in my drafts at the moment are:

Image annotation and how to describe regions of images in a sound and stable manner. From talking to people on the #swig channel on Freenode IRC, most notably Benjamin Nowack, I think that the format for doing this is pretty much crystallised.

Why is image region annotation important? It gives a consistent mechanism for recording information, such as:

- relating the area on a scanned page to words from an OCR'd text.

- identifying the same thing in multiple images - for example, identifying a person in multiple photos, labelling a burial site both in a photo and in a topographical survey map, showing a particular phenotype being expressed on multiple slides of fruit flies, etc.

I think that is enough to whet your appetite for why this is handy to have sorted out early on.

And as this blog is called 'Less Talk, More Code', the answer to 'but how do we add this?' is that I have found a good GPL2 JavaScript library for drawing regions on images, and the format is simple to write for.


Another thing lurking in my drafts folder is all to do with "HOWTO: Building a fedora-commons backed web site from scratch, using open source tools" - Parts 1 & 2.


My aim is that someone following both parts should end up with a web site that will allow them to do what is normally expected of an archive - handily, some points are listed in this photo from the Jisc-Crig unconference - http://www.flickr.com/photos/wocrig/2197484000/


"Put stuff in, get stuff out, Find stuff in, relationships between objects, annotations" plus content types, edit, event logging and visualisation and record pages built from one or more metadata sources.


Part 1 is the mundane details of installing an OS (Ubuntu JeOS) and then acquiring all the dependencies, such as Sun Java 1.5, Fedora, Apache Solr, and the requisite python libraries (PyXML, Pylons, and the libraries I've written), for part 2, building the interface itself.


The key, I hope, is that the build of the python interface is described in such a way that it will help people do the same and allow them to think about what can be done further.


Progress? Part 1 is almost done and I am re-jigging part 2 to be more readable; I am trying to order part 2 so that the simple things are dealt with first and the more complex things later, rather than using a 'function-by-function' structure.

Monday, 14 January 2008

Myth: Repositories should have one way in and one way out.

Repository managers can no longer rely on items - and the metadata about those items - getting into the repository through their own carefully crafted ingest web service.

We need to make sure that this doesn't become problematic further down the line and that we, in turn, do not create problems or obstacles for other services using the disseminated data.


I think that, from the point of view of an institutional repository, we really need to start working with the user as soon as possible, and while this can be mostly a political problem, we [code monkeys] have to make sure that the technical barrier is as low as possible.

We have to be less arrogant about our services and more pragmatic about what is already out there. It should be obvious that it is not possible to design a single workflow or web service that everyone will happily and accurately use, and equally obvious that customising the ingest and edit mechanism for each group of users would take an infeasible amount of time.

So, what's the alternative? We need to look longer and harder at the existing and emergent content repositories out there, especially those that people already use without much provocation. Flickr is a prime example of what I am talking about, and services such as Google Docs, blogging platforms and even Amazon may be very good sources for information that the user cares about and has already put into a 'machine-understandable' form.

My question then: Why shouldn't we take the following requests very seriously indeed?

"I have a collection of images on Flickr that I want to archive in your repository. It is at X url, that's all the info you need, right? I have a profile held at Y location."

or the more common:

"I need to put a couple of papers online for RAE - I'll email you the files. I'm a busy guy, I've told research services all the details, get it from them."

And finally, I found this, and I think it speaks for itself.

(From Language Log blog, talking about this workshop:)

With the objective of providing a more creative environment for scholarship, assume the following goal:

By 2015, all publicly-funded research products and primary resources will be readily available, accessible, and usable via common infrastructure and tools through space, time, and across disciplines, stages of research, and modes of human expression.

* Identify the intermediate tasks, resources, and enabling conditions
* Sketch a roadmap with major tracks and milestones to achieve the goal


I do like that. I like that assumption a lot. While we may never achieve those aims completely, at least having them in mind will help in the long run when designing or thinking about new services.

Friday, 11 January 2008

Image Annotation using common XML namespaces

This will outline methods for holding and describing the annotations of regions of images. I place emphasis on the word region as that is what I feel most people mean when they talk about adding metadata to an image; usually, they are more interested in the things in the photo than in the whole photo itself.

I'll give a few examples of what I mean:
A researcher wishes to indicate a particular architectural feature in some photographs of an archaeological site and indicate the same feature in a floorplan diagram of the same site.

A researcher cataloguing photos of insect life in the rainforest needs to be able to note the species of a particular insect in the image, who identified that insect, and what plant or animal it is seen to be interacting with or eating in this image - all of which pertain to an individual insect in the photo, even though there may be one or more animals in the frame of the picture.

A foreign language book has been scanned in, and there needs to be a mechanism to relate each translated sentence and commentary to the relevant section of the image to which it applies.

A collection of ephemera from the 1900s (bus tickets, etc.) contains certain compound items - a display case for a set of cigarette cards, for example - and it is necessary to relate the locations of the individual items to the original - locations on the display case, to continue the example.

At first glance, it looks like we need something that can encode some metadata defining a 2-D shape and also encode the shape's relative positioning on the surface of an image. We will also need to be able to link arbitrary metadata to the individual shapes - arbitrary, as we need to be flexible in terms of what information is recorded:

Region metadata --(links to)--> Region --(links to)--> Image

I knew I couldn't have been the first person to consider this, so I asked around. Benjamin Nowack (bengee) on the irc://irc.freenode.net/swig channel was very helpful and pointed me at the w3photo project, which arrived at an ontology for defining regions of images:

Namespace for the ontology
Image Region Vocabulary: RDFS+OWL Documentation

[NB From what I've seen, most of the effort and work has gone into defining and producing metadata for the entire image, so I'll assume that a good image ontology/metadata format has been developed (I'm counting on the exif namespace myself, but a better one may appear.)]

So, to cut to the chase, here is a piece of RDF, hopefully giving an example of what it is I am considering:


<rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:imreg="http://web-semantics.org/ns/image-regions#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exif="http://www.w3.org/2003/12/exif/ns#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <foaf:Image rdf:about=
    "http://ora.ouls.ox.ac.uk/objects/uuid%3Aeb709f3b-a340-48bc-90d8-c8aa54dd7a07/datastreams/IMAGE01">
    <exif:orientation>top-left</exif:orientation>
    <exif:width>2048px</exif:width>
    <exif:height>2048px</exif:height>
    <!-- insert misc image metadata here, concerning the entire image. -->
    <imreg:hasRegion>
      <imreg:Rectangle>
        <imreg:coords>1002,920 1100,1035</imreg:coords>
        <dc:description>Location of the 1st burial site.</dc:description>
        <dc:creator>Ben O'Steen</dc:creator>
        <dc:date>circa 200BC</dc:date>
        <dc:identifier>uuid:b46d3321-e8fd-41c3-81e4-793a6ab40e68</dc:identifier>
      </imreg:Rectangle>
    </imreg:hasRegion>
    <imreg:hasRegion>
      <imreg:Rectangle>
        <imreg:coords>1600,1200 1750,1300</imreg:coords>
        <dc:description>Location of the 2nd burial site.</dc:description>
        <dc:creator>Ben O'Steen</dc:creator>
        <dc:date>circa 200BC</dc:date>
        <dc:identifier>uuid:7bc22dad-ff1d-45ad-86eb-dd24c1d0964a</dc:identifier>
      </imreg:Rectangle>
    </imreg:hasRegion>
  </foaf:Image>
</rdf:RDF>
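
(And since this blog is called 'Less Talk, More Code', here is a minimal python sketch of how one of those region blocks might be written out - plain string templating, with the field values being illustrative:)

from xml.sax.saxutils import escape

REGION_TEMPLATE = """<imreg:hasRegion>
  <imreg:Rectangle>
    <imreg:coords>%(coords)s</imreg:coords>
    <dc:description>%(description)s</dc:description>
    <dc:creator>%(creator)s</dc:creator>
    <dc:identifier>%(identifier)s</dc:identifier>
  </imreg:Rectangle>
</imreg:hasRegion>"""

region = REGION_TEMPLATE % {
    'coords': '1002,920 1100,1035',
    'description': escape('Location of the 1st burial site.'),
    'creator': escape("Ben O'Steen"),
    'identifier': 'uuid:b46d3321-e8fd-41c3-81e4-793a6ab40e68',
}
print region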


Combine that with a decent web interface for making these annotations, and a backend sophisticated enough to understand these user-made 'comments' on the image, and you have yourself an exciting system, I feel. (I am using the Fedora RDF relationship of 'isAnnotationOf' to mark these connections between comment and original.)

(Oh, and for a good GPL'd DHTML method for drawing rectangles on an image, check out SpeedingRhino's Cropper tool, a working demo of which is here.)

Thursday, 10 January 2008

Populating a search engine (Apache Solr) from Fedora Objects

I am going to move fairly fast on this one, with the following assumptions about the person(s) reading this:
  • You can at least set up a linux server, with Fedora 2.2 or so using its bundled tomcat without help from me
  • You have python, and have added the libraries indicated at the top of this post.
  • You have pulled the SVN hosted libraries, also from the aforementioned post from:
    • svn co https://orasupport.ouls.ox.ac.uk/archive/archive/lib
Let's get started then, right after I outline why I am using Solr:

Why Solr

From their front page:

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.
It has open APIs, faceted searching and replication, and if you have a look at SOLR-303, the developers have added distributed (i.e. federated) searching over HTTP between Solr instances; the precise functionality is still being refined (cleaner handling of tiebreakers, refinement queries, etc.) but it is functioning nonetheless.

I think from now on, I would no more code my own web server than I would code my own search engine service.

Getting and Installing Solr

First step is to grab Solr from this page. Pick a mirror that is close to you, and download the Solr package (I am currently using version 1.2), either as the .zip version or the gzipped tar, .tgz.

Extract the whole archive somewhere on disc and you will see something like this in the apache-solr-1.2 folder:

~/apache-solr-1.2.0$ ls
build.xml CHANGES.txt dist docs example KEYS.txt lib LICENSE.txt NOTICE.txt README.txt src

~/apache-solr-1.2.0$ ls dist
apache-solr-1.2.0.jar apache-solr-1.2.0.war

The easiest thing is to install Solr straight into an instance of Tomcat. One thing to be aware of is that search applications eat RAM and heap space for breakfast, so make sure you install it onto a server with plenty of RAM, and it would be wise to increase the amount of heap space available to the Tomcat instance. This can be done by making sure that the environment variable CATALINA_OPTS is set to "-Xmx512m" or, even better, "-Xmx1024m"; this can be set inside the startup.sh script in your tomcat/bin directory if needed.

One final bit of advice before I point you at the rather good installation docs: you might want to rename the .war file to match the URL pathname you desire, as the guide relies on Tomcat automatically unpacking the archive:

So, a war called "apache-solr-1.2.0.war" will result in the final app being accessible at http://tomcat-hostname:8080/apache-solr-1.2.0/. I renamed mine to just solr.war.

Finally, Solr needs a place to keep its configuration files and its indexes. The indexes themselves can get huge (1Gb is not unheard of) and need somewhere to be stored. The documentation linked to below will refer to this location as 'your solr home', so it would be wise to make sure that this location has the space to expand. (NB this is not the directory inside Tomcat where the application was unbundled.)

Right, the installation instructions:

http://wiki.apache.org/solr/SolrInstall - Basic installation
http://wiki.apache.org/solr/SolrTomcat - Tomcat specific things to bear in mind

Now if you point your browser in the right place (http://tomcat_host:8080/solr/admin perhaps) you should see the admin page of the pre-built example application... which we do not want of course :)

Customising the Solr installation

The file solrconfig.xml is where you can configure the more mundane aspects of Solr - where it stores its indexes, and so on. The documentation on what the options mean can be found here: http://wiki.apache.org/solr/SolrConfigXml - but really, all you are initially going to change is the index save path, if you change anything at all.

The real file, the one that is pretty much crucial to what we are going to do, is schema.xml. It has umpteen options, but you can get by just by changing the example schema to hold the fields you want. Here is the wiki documentation on this file to help out, and here (should) be a copy of the schema I am currently using.
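
For orientation, the field definitions in schema.xml look like the fragment below (the field names here are illustrative, in the style of the example schema that ships with Solr 1.2):

<field name="id" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>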

In fact, if you are using MODS or DC as your metadata, you may wish for the time being to just use the same schema as I am, just to progress.

Getting metadata into Solr

In my opinion, Solr has a really good API for remote management and for updating the items ('documents' in Solr lingo) in its indexes. Each 'document' has a unique id and consists of one or more name-value pairs, like 'name=author, value=Ben'. To update or add a document to the Solr indexes, you HTTP POST a simple XML document of the following form to the update service, usually found at http://tomcathost:8080/solr/update.

The XML you would post looks like this:

<add>
  <doc>
    <field name="id">ora:1234</field>
    <field name="title">The FuBar and me</field>
    .... etc.
  </doc>
</add>

Note that the unique identifier field is defined in schema.xml near the end, by the element "uniqueKey"; for example, in my Solr instance, it's <uniqueKey>id</uniqueKey>.

The cunning thing about this is that to update a document in Solr, all I have to do is re-send the XML post above, having made any changes I wish. Solr will spot that it already has a document with an id of 'ora:1234' and update the information in the index, rather than adding a second copy of it.

One thing to note is that the service you are posting to is the update manager. No change is actually made to the indexes themselves until either the update manager is sent an XML message telling it to commit the changes, or the pre-defined (in solrconfig.xml) maximum number of documents waiting to be indexed is hit.
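
To make that concrete, here is a minimal sketch of posting an update and then a commit by hand (assuming an unsecured Solr instance at localhost:8080; the BasicSolrFeeder described below wraps this sort of thing up for you):

import urllib2

SOLR_UPDATE = 'http://localhost:8080/solr/update'

def post_xml(xml):
    # POST a raw XML message to the Solr update manager
    req = urllib2.Request(SOLR_UPDATE, data=xml,
                          headers={'Content-Type': 'text/xml'})
    return urllib2.urlopen(req).read()

# Add (or re-add, i.e. update) a document...
post_xml('<add><doc>'
         '<field name="id">ora:1234</field>'
         '<field name="title">The FuBar and me</field>'
         '</doc></add>')

# ...then tell the update manager to commit, or nothing hits the index.
post_xml('<commit/>')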

Hooking into Solr from your library of choice

Handily, you are very unlikely to have to futz around writing code to connect to Solr, as there are a good number of libraries built for just such purposes - http://wiki.apache.org/solr/IntegratingSolr

As I tend to develop in python, I opted to start with the SolrPython script which I have made a little modification to and included in the libraries as libs/solr.py.

The BasicSolrFeeder class in solrfeeder.py and how it turns the Fedora Object into a Solr document

There are a number of different content types in the Oxford Research Archive, and ideally, there would be a type-specific method for indexing each one into the Solr service. As ORA is an emergent service, 99% of the items in the repository are text-based (journals, theses, book chapters, working/conference papers) and they all use either MODS (dsid: MODS) or simple Dublin Core (dsid: DC) to store their metadata.

I have written a python class called BasicSolrFeeder (inside the libs/solrfeeder.py file) which performs a set sequence of steps on a Fedora object, given its pid. (BasicSolrFeeder.add_pid in solrfeeder.py is the method I am describing; there is a sketch of it after the list below.)

Using a list to hold all the resultant "<field name='foo'>bar</field>" strings, it does the following:
  • Get the content type of the object (from objContentModel field in the FOXML)
    • --> <field name="content_type">xxxxxxxx</field>
  • Get the datastream with id 'MODS' and if it is XML, pass it through the xsl sheet at 'xsl/mods2solr.xsl'
    • --> Lots of <field.... lines
  • If there is no MODS datastream, then default to getting the DC, and pass that through 'xsl/dc2solr.xsl'
    • --> Lots of <field.... lines
  • As collections are defined in a bottom-up manner in the RELS-EXT, applying an xsl transformation to the RELS-EXT datastream will yield the collections that this object is in.
    • --> Zero or more <field name="collection">.... lines
    • (NB there is some migratory code here that will also deduce the content type, as certain collections are being used to group types of content.)
  • Finally, if there is a datastream with an id of FULLTEXT, this is loaded in as plain text and added to the 'text' field in Solr. This is how searching the text of an object works.
    • (NB there are other services that extract the text from the binary attachments to an object, collate these texts and add them as a datastream called FULLTEXT.)
  • This list of fields is then simply POSTed to the Solr update service, after which a commit may or may not be called.
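
To show how those steps hang together, here is a minimal sketch of the add_pid method. The helper names (get_content_model, has_datastream, transform, get_datastream, post_to_solr) and the rels2solr.xsl sheet are illustrative stand-ins; the real code lives in libs/solrfeeder.py:

from xml.sax.saxutils import escape

def add_pid(self, pid, commit=False):
    fields = []  # collects the resultant '<field ...>...</field>' strings
    # 1) content type, from the objContentModel field in the FOXML
    fields.append('<field name="content_type">%s</field>'
                  % self.get_content_model(pid))
    # 2) MODS if present, otherwise fall back to DC
    if self.has_datastream(pid, 'MODS'):
        fields.extend(self.transform(pid, 'MODS', 'xsl/mods2solr.xsl'))
    else:
        fields.extend(self.transform(pid, 'DC', 'xsl/dc2solr.xsl'))
    # 3) collection membership, deduced from the RELS-EXT
    fields.extend(self.transform(pid, 'RELS-EXT', 'xsl/rels2solr.xsl'))
    # 4) plain-text FULLTEXT datastream, if one exists
    if self.has_datastream(pid, 'FULLTEXT'):
        fields.append('<field name="text">%s</field>'
                      % escape(self.get_datastream(pid, 'FULLTEXT')))
    # 5) POST the lot to the Solr update service
    ok = self.post_to_solr('<add><doc>%s</doc></add>' % ''.join(fields))
    if commit:
        self.commit()
    return ok
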
So, if you have a Fedora repository which uses either MODS or DC, and the collections are bottom-up, then the code should just work for you. You may need to tinker with the xsl stylesheets in the xsl/ directory to match what you want, but essentially it should work.

Here's an example script which will try to add a range of pids to a Solr instance:


from lib.solrfeeder import BasicSolrFeeder

sf = BasicSolrFeeder(fedora_url='http://ora.ouls.ox.ac.uk:8080/fedora',
                     fedora_version="2.0",  # Supports either 2.0 or 2.2
                     solr_base="/solr",     # or whatever it is for your solr app
                     solr_url='orasupport.ouls.ox.ac.uk:8080')
                     # Point at the tomcat instance for solr

# Now to scan through a range of pids, ora:1500 to ora:1550:

namespace = "ora"
start = 1500
end = 1550

# A simple hash to catch all the responses
responses = {}
for pid in xrange(start, end + 1):
    responses[pid] = sf.add_pid('%s:%s' % (namespace, pid), commit=False)

# Commit the batch afterwards to improve performance
sf.commit()

# Temporary variables to better report what went into Solr and
# what didn't
passed = []
failed = []

for key in responses:
    if responses[key]:
        passed.append('%s:%s' % (namespace, key))
    else:
        failed.append('%s:%s' % (namespace, key))

if passed:
    print "Pids %s went in successfully." % passed
if failed:
    print "Pids %s did not go in successfully." % failed


Adding security to Solr updates


Simple to add and update things in Solr, isn't it? A little too simple, though, as by default anyone can do it. Let's add some authentication to the process. Solr does not concern itself with authenticating requests, and I think that is the right decision; the authentication should be enforced either by Tomcat or by some middleware.

The easiest mechanism is to use Tomcat's basic authentication scheme to password protect the solr/update url, to stop abuse by 3rd parties. It's pretty easy to do, and a quick google gives me this page - http://www.onjava.com/pub/a/onjava/2003/06/25/tomcat_tips.html - with 10 tips on running Tomcat. While most of the tips make for good reading, it is the 5th tip, about adding authentication to your Tomcat app, that is most interesting to us now.

Assuming that the password protection has been added, the script above needs a little change. The BasicSolrFeeder line needs to have two additional keywords, solr_user and solr_password and the rest of it should run as normal.

e.g.


sf = BasicSolrFeeder(fedora_url='http://ora.ouls.ox.ac.uk:8080/fedora',
                     fedora_version="2.0",  # Supports either 2.0 or 2.2
                     solr_base="/solr",     # or whatever it is for your solr app
                     solr_username="your_username",
                     solr_password="your_password",
                     solr_url='orasupport.ouls.ox.ac.uk:8080')
                     # Point at the tomcat instance for solr
# Point at the tomcat instance for solr


Hopefully, this should be enough to get people thinking about using Solr with Fedora, as Solr is a very, very powerful and easily embeddable search service. It is even possible to write a client in javascript to perform searches, as can be seen from the javascript search boxes at http://ora.ouls.ox.ac.uk/access/adv_search.php

I have purposefully left out how to format queries from this post, but if people yell enough, I'll add some guidelines - more than I provide at http://ora.ouls.ox.ac.uk/access/search_help.php, anyway.

Wednesday, 9 January 2008

Conclusions on UUIDs and local ids in Fedora

I mentioned earlier the possibility of using UUIDs as Fedora identifiers. I'll write my conclusion first and the reasoning afterwards, for all you lazy people out there :)

Conclusions

Fedora repositories that wish to use UUIDs as identifiers should have the namespace 'uuid' added to the list in <retainPIDs> in fedora.fcfg.

The UUID, in its canonical hyphenated form, is entered as the object id under the uuid namespace. For example:

Fedora pid - uuid:34b706b4-f080-4655-8695-641a0a8acb25
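
(Minting one of these pids is a one-liner with python's standard uuid module, available from Python 2.5 onwards - a quick sketch:)

import uuid

pid = 'uuid:%s' % uuid.uuid4()
print pid  # e.g. uuid:34b706b4-f080-4655-8695-641a0a8acb25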

Benefits

  1. Using the uuid scheme for identifiers (and retaining the 'uuid' namespace), administrators will be able to painlessly transfer objects from one instance of Fedora to another, or even to run a distributed set of Fedora instances as a single 'repository'; no fiddling with pid changes, no editing of RELS-EXT or other datastreams, and no changing of metadata datastream identifiers.
  2. Fedora pids will fit easily into the RFC 4122 mechanism for the uuid urn namespace -> 'urn:' + pid will be a valid URI.
  3. The command 'getNextPID()' could be used to provide a local 'id', which can be added as a FOXML field such as the label, or even added as the Alternate ID on ingest/migration.
  4. Given a URI resolver that is updated when an object migrates from one fedora repository to another, a distributed set of Fedora instances could have cross-repository relationships in RDF that stay valid regardless of where the objects reside.
  5. Using a distributed set of Fedora instances with a federated search tool such as Apache Solr is quite an attractive prospect for large-scale implementations.
Reasoning

I've thought about the logistics of actually using them, and also about the fact that some people are happier with object ids that they can type (although for the life of me, I can't work out why; when was the last time that you, as a normal user, typed in a full URL rather than just going to a discovery tool like google or a site search to get to a specific item? 99% of the time, I rely on my browser's address bar defaulting to google for anything that doesn't look like a url.) But I digress...

I prefer to deal with situations as they are, as opposed to what might be possible later, so let's recap what pids Fedora allows or needs:

Fedora pid = namespace : id

or, more formally (from http://www.fedora.info/definitions/identifiers/):

object-pid    = namespace-id ":" object-id
namespace-id  = ( [A-Z] / [a-z] / [0-9] / "-" / "." ) 1+
object-id     = ( [A-Z] / [a-z] / [0-9] / "-" / "." / "~" / "_" / escaped-octet ) 1+
escaped-octet = "%" hex-digit hex-digit
hex-digit     = [0-9] / [A-F]

e.g. anything that fits the following regular expression:

^([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+$
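
(A quick sketch of checking candidate pids against that expression in python - note that a bare UUID, lacking a namespace and colon, fails:)

import re

FEDORA_PID = re.compile(
    r'^([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+$')

for candidate in ('ora:1234',
                  'uuid:34b706b4-f080-4655-8695-641a0a8acb25',
                  '550e8400-e29b-41d4-a716-446655440000'):
    print candidate, '->', bool(FEDORA_PID.match(candidate))
# The last one fails: no namespace, no colon.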

As I said before, I am interested in using UUIDs (or something like them) because they need no real scheme to be unique and persistently unique; UUIDs are designed so that the chance of creating two identical ids is vanishingly small. So, what does one look like?

(from wikipedia:)

In its canonical form, a UUID consists of 32 hexadecimal digits, displayed in 5 groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters. For example:
550e8400-e29b-41d4-a716-446655440000
Regular expressions:

[A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}
matches: 550e8400-e29b-41d4-a716-446655440000

^(0x)?[A-Fa-f0-9]{32}$
matches: 0x550e8400e29b41d4a716446655440000

So a UUID can't be used as it is in place of a Fedora pid. However, according to RFC 4122, there is a uuid urn namespace, which makes me more hopeful. The above uuid can be represented as urn:uuid:550e8400-e29b-41d4-a716-446655440000, for example.
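
(Python's uuid module already knows about this urn form, as a quick check shows:)

import uuid

u = uuid.UUID('550e8400-e29b-41d4-a716-446655440000')
print u.urn  # urn:uuid:550e8400-e29b-41d4-a716-446655440000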

So, how about if we make the reasonable assumption that a pid is a "valid" urn namespace and id, just one where the namespace may or may not be registered yet? For example, I am currently using the fedora namespace ora for items in the Oxford repository. Would it be too far-fetched to say that ora:1234 is understandable as urn:ora:1234?

So, all we need to do is make sure that the namespace 'uuid' is one of the ones in the <retainPIDs> element of fedora.fcfg and we are set to go. Looks like the pid format 'restriction', as I thought of it, was quite handy after all :) So to state it clearly:

Fedora pids that follow the UUID scheme should be in the form of:

object-pid   = "uuid:" object-id
object-id    = 8-digit-hex '-' 4-digit-hex '-' 4-digit-hex '-' 4-digit-hex '-' 12-digit-hex
8-digit-hex  = ( hex-digit ) 8
4-digit-hex  = ( hex-digit ) 4
12-digit-hex = ( hex-digit ) 12
hex-digit    = [0-9] / [A-F] / [a-f]

e.g. "uuid:34b706b4-f080-4655-8695-641a0a8acb25"

(NB forgive any syntactical slips above; I hope it's clear as it is.)

I mentioned before that some people want human-typeable fedora pids... urgh. Not really sure what purpose it serves. In fact, let me have a little rant...

<rant>

A 'Cool URL' is one that doesn't change. Short pids make for pretty URLs and guarantee little more than that.

</rant>

Right, that's out of my system. Now to accommodate the request...

Firstly, I'd just like to point out that I will ignore any external search and discovery services - essentially any type of resolver, from search engines to the Handle system. This is because I feel that the format of the fedora pid is quite irrelevant to these services. (I am aware that certain systems made use of a nasty hack where the object id part of the fedora pid was used as the Handle id, after the institution namespace, and believe me, this hack does have me quite worried. I can understand the reasoning behind it, as the Handle system doesn't seem to have a simple way to query for the next available id, but I think this is a potential problem on the horizon.)

My suggestion is that the pid itself is the uuid as defined above, but that the repository system has a notion of a local 'id'; the Fedora call 'getNextPid()' could be used to create these 'tinypids' with whatever namespace is deemed pleasant.

They can be stored in the FOXML in fields such as the label, or stored as an Alternate ID (fields which I personally have no use for). Fedora will index these for its basic search service, and they could be used as a mechanism to look up the real pid given the local id.

For example, with the RDF triplestore turned on, the following iTQL query should be enough:

"select $object from <#ri>
where $object <info:fedora/fedora-system:def/model#label> 'ora:1234'"

The tuple that is returned will be something like "uuid:34b706b4-f080-4655-8695-641a0a8acb25".
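
(A sketch of running that lookup over HTTP against Fedora's Resource Index search interface - assuming the triplestore is enabled and risearch is exposed at the usual place:)

import urllib
import urllib2

query = ("select $object from <#ri> "
         "where $object <info:fedora/fedora-system:def/model#label> 'ora:1234'")
params = urllib.urlencode({'type': 'tuples',
                           'lang': 'itql',
                           'format': 'CSV',
                           'query': query})
print urllib2.urlopen('http://localhost:8080/fedora/risearch', params).read()
# prints a CSV header line, then e.g. info:fedora/uuid:34b706b4-...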

But it still isn't great, and I don't think the benefits outweigh the work involved in implementing it - but it's a workable solution, I think, for those that need it.