Thursday, 15 October 2009

Python in a Pairtree

(Thanks to @anarchivist for the title - I'll let him take all the 'credit')

"Pairtree? huh, what's that?" - in a nutshell it's 'just enough veneer on top of a conventional filesystem' for it to be able to store objects sensibly; a way of storing objects by id on a normal hierarchical filesystem in a pragmatic fashion. You could just have one directory that holds all the objects, but this would unbalance the filesystem and, due to how most are implemented, would result in a less-than-efficient store. Filesystems just don't deal well with thousands or hundreds of thousands of directories at the same level.

Pairtree provides enough convention and fanning out of hierarchical directories to spread the load of storing high numbers of objects, while retaining the ability to treat each object distinctly.

The Pairtree specification is a compromise between fanning out too much and too little, and it assumes that the ids used are opaque: that the ids have no meaning and are to all intents and purposes 'random'. If your ids are not opaque (for example, if they are human-readable words), then you will have to tweak how the ids are split into directories to ensure better performance.

[I'll copy&paste some examples from the specifications to illustrate what it does]

For example, suppose you want to store objects whose identifiers are URIs of the form http://n2t.info/ark:/13030/xt2{some string}

eg:

http://n2t.info/ark:/13030/xt2aacd
http://n2t.info/ark:/13030/xt2aaab
http://n2t.info/ark:/13030/xt2aaac

This works out to look like this on the filesystem:
current_directory/
|   pairtree_version0_1        [which version of pairtree]
|    ( This directory conforms to Pairtree Version 0.1. Updated spec: )
|    ( http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html )
|
|   pairtree_prefix
|    ( http://n2t.info/ark:/13030/xt2 )
|
\--- pairtree_root/
     |--- aa/
     |    |--- cd/
     |    |    |--- foo/
     |    |    |    |   README.txt
     |    |    |    |   thumbnail.gif
     |    |    ...
     |    |--- ab/ ...
     |    |--- af/ ...
     |    |--- ag/ ...
     |    ...
     |--- ab/ ...
     ...
     \--- zz/ ...
          |   ...


Here the object http://n2t.info/ark:/13030/xt2aacd lives under pairtree_root/aa/cd/ and contains a directory 'foo', which itself contains a README and a thumbnail gif.
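
To make the mapping from id to directory concrete, here is a minimal sketch of the splitting step. It is not the code from the pairtree library or the full algorithm from the specification: it skips the character escaping that the spec applies to ids, and the function name id_to_dirpath and its prefix, root and width parameters are my own assumptions for illustration. The width parameter is there to show the kind of tweak you might try if your ids are not opaque.

```python
def id_to_dirpath(identifier, prefix="", root="pairtree_root", width=2):
    """Map an identifier to a fanned-out directory path (simplified sketch)."""
    # Strip the shared prefix so that only the distinguishing part is split.
    if prefix and identifier.startswith(prefix):
        identifier = identifier[len(prefix):]
    # Cut the remainder into chunks of `width` characters;
    # the final chunk may be shorter if the length is not a multiple of width.
    parts = [identifier[i:i + width] for i in range(0, len(identifier), width)]
    return "/".join([root] + parts)


print(id_to_dirpath("http://n2t.info/ark:/13030/xt2aacd",
                    prefix="http://n2t.info/ark:/13030/xt2"))
# prints: pairtree_root/aa/cd
```

The object's own files and directories (such as 'foo' above) would then sit inside the directory this returns.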

Creating this structure by hand is tedious and, luckily for you, you don't have to (if you use Python, that is).

To get the pairtree library that I've written, you can either install it from PyPI (http://pypi.python.org/pypi/Pairtree) or, if python-setuptools/easy_install is on your system, you can just run: sudo easy_install pairtree

You can find API documentation and a quick start here.

The quick start should get you up and running in no time at all, but let's look at how we might store Fedora-like objects on disk using pairtree. (I don't mean how to replicate how Fedora stores objects on disk, I mean how to make an object store that gives us the basic framework of 'objects are bags of stuff')


>>> from pairtree import *
>>> f = PairtreeStorageFactory()
>>> fedora = f.get_store(store_dir="objects", uri_base="info:fedora/")


Right, that's the basic framework done, let's add some content:


>>> obj = fedora.create_object('changeme:1')
>>> with open('somefileofdublincore.xml', 'rb') as dc:
...     obj.add_bytestream('DC', dc)
...
>>> # open in binary mode so the PDF bytes are not altered on any platform
>>> with open('somearticle.pdf', 'rb') as pdf:
...     obj.add_bytestream('PDF', pdf)
...
>>> obj.add_bytestream('RELS-EXT', """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rel="info:fedora/fedora-system:def/relations-external#">
<rdf:Description rdf:about="info:fedora/changeme:1">
<rel:isMemberOf rdf:resource="info:fedora/type:article"/>
</rdf:Description>
</rdf:RDF>""")


The add_bytestream method is adaptive - if you pass it something that supports a read() method, it will attempt to stream out the content in chunks to avoid reading the whole item into memory at once. If not, it will just write the content out as is.
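
As a rough illustration of that adaptive behaviour, the sketch below shows one way such a method could be written. It is not the library's actual implementation: write_bytestream, its chunk_size parameter and the 64 KB default are assumptions of mine, and it assumes file-like objects were opened in binary mode.

```python
import io
import os
import tempfile


def write_bytestream(path, content, chunk_size=64 * 1024):
    """Write content to path, streaming in chunks when content is file-like."""
    with open(path, "wb") as out:
        if hasattr(content, "read"):
            # File-like: copy one chunk at a time so memory use stays bounded.
            while True:
                chunk = content.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
        else:
            # Plain content: encode text as UTF-8, then write it out in one go.
            if isinstance(content, str):
                content = content.encode("utf-8")
            out.write(content)


# Usage: one file-like source and one plain string, written to a temp directory.
target_dir = tempfile.mkdtemp()
write_bytestream(os.path.join(target_dir, "PDF"), io.BytesIO(b"%PDF-1.4 example"))
write_bytestream(os.path.join(target_dir, "RELS-EXT"), "<rdf:RDF/>")
```

Checking for a read() method rather than a particular type means anything file-like (open files, sockets wrapped as files, in-memory buffers) takes the streaming path.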

I hope this gives people some idea of what is possible with a conventional filesystem. After all, filesystem code is pretty well tested in the majority of cases, so why not make use of it?

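(NB the Python 'with' statement is a nice way of dealing with file-like objects. It is available in Python 2.5 via "from __future__ import with_statement" and is switched on by default from 2.6. It makes sure that the file is closed at the end of the block, even if an exception is raised, so it is equivalent to a "temp = open(foo) - do stuff - temp.close()" where the close always happens.)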
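
To spell out the equivalence in that note, here is a small sketch of the two forms side by side. The file name example.txt and the temporary directory are just placeholders for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Long form: close the file explicitly, even if the body raises.
temp = open(path, "w")
try:
    temp.write("hello")
finally:
    temp.close()

# Short form: the with statement closes the file when the block exits.
with open(path) as fh:
    data = fh.read()

print(temp.closed, fh.closed, data)
```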