Friday, 14 December 2007

Using python to play with a Fedora repository

Firstly, you'll need some extra libraries:

(If you are using Windows, I'm afraid you are on your own with any problems. I can't help, as it's not a system I use.)

Get EasyInstall from here: http://peak.telecommunity.com/DevCenter/EasyInstall
(If that site has slowed to a trickle, just get ez_setup.py from somewhere else that is trustworthy.)

Then, as root:

easy_install ZSI
easy_install uuid
easy_install 4Suite-xml
easy_install pyxml

(There may be more; I don't have a clean system set aside to check.)
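Since the list above may be incomplete, a quick way to see what is still missing is to try importing each package. This is only a minimal sketch: the import names ZSI, uuid and Ft (which I believe is the top-level package installed by 4Suite-xml) are assumptions, so adjust them to whatever your install actually provides.

```python
def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing

# Assumed import names for the packages installed above.
still_needed = missing_modules(['ZSI', 'uuid', 'Ft'])
```

An empty still_needed list means every name imported cleanly; anything left in it needs another easy_install.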

Then create a clean working directory and grab the libraries from here:

svn co https://orasupport.ouls.ox.ac.uk/archive/archive/lib

These are of questionable quality, and are in a state of transition from a jumbled proof-of-concept structure into a more refined and refactored set of libraries. The main failing is that convenience methods, which can be pretty specific in use, sit alongside more fundamental methods that are much more generic.

(PS, if you want to try the full archive interface out, you'll need to inject some objects into the repository to start with, specifically the resource objects that have the xsl for the view transforms. If anyone wants, I'll wrap these with a bow and add them when I have time.)

But, for now, they will at least provide the fundamentals to play with.

I will assume you have a Fedora repository set up somewhere, and that you know a working username and password that will let you create/edit/etc objects inside it. I also assume the instance exposes an API like that of Fedora 2.2, especially for SOAP. I'll post later about making the FedoraClient work across versions with regard to SOAP.

For the rest of this post, I'll assume that fedoraAdmin is both the username and the password for the repository, and that the repository lives at localhost:8080/fedora.

Inside the directory that holds the lib/ directory, start the Python interpreter:

~/temp$ python
Python 2.5.1c1 (release25-maint, Apr 12 2007, 21:00:25)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

Now let's get a Fedora client and poke around the repository:

>>> from lib.fedoraClient import FedoraClient
(cue SOAP related chugging of CPU when loading the SOAP libs)
>>> help(FedoraClient) # This will show you all sorts about this class
>>> # But we are interested in the following:
>>> f = FedoraClient(serverurl='http://localhost:8080/fedora', username='fedoraAdmin', password='fedoraAdmin', version='2.2')

Now we have the client, let's try out a few things:

>>> print f.getDescriptionXML()
(XML related stuff in reply)

>>> f.doesObjectExist('namespace:pid')
(True or False, depending on whether that object exists)

>>> # For example, in my dev repo:
>>> f.getContentModel('person:1')
u'person'

>>> f.listDatastreams('ora:20')
[{'mimetype': u'image/png', 'checksumtype': u'DISABLED', 'controlgroup': 'M', 'checksum': u'none', 'createdate': u'2007-09-25T14:36:29.381Z', 'pid': 'ora:20', 'versionid': u'IMAGE.0', 'label': u'Downloadable stuff', 'formaturi': None, 'state': u'A', 'location': None, 'versionable': True, 'winname': u'ora_20-IMAGE.png', 'dsid': u'IMAGE', 'size': 0}, {'mimetype': u'text/xml', 'checksumtype': u'DISABLED', 'controlgroup': 'X', 'checksum': u'none', 'createdate': u'2007-09-25T14:37:02.882Z', 'pid': 'ora:20', 'versionid': u'DC.2', 'label': u'Dublin Core Metadata', 'formaturi': None, 'state': u'A', 'location': None, 'versionable': True, 'winname': u'ora_20-DC.xml', 'dsid': u'DC', 'size': 272}, {'mimetype': u'text/calendar', 'checksumtype': u'DISABLED', 'controlgroup': 'M', 'checksum': u'none', 'createdate': u'2007-09-25T14:37:03.391Z', 'pid': 'ora:20', 'versionid': u'EVENT.3', 'label': u'Events', 'formaturi': None, 'state': u'A', 'location': None, 'versionable': True, 'winname': u'ora_20-EVENT.ics', 'dsid': u'EVENT', 'size': 0}, {'mimetype': u'text/xml', 'checksumtype': u'DISABLED', 'controlgroup': 'X', 'checksum': u'none', 'createdate': u'2007-08-31T14:21:39.743Z', 'pid': 'ora:20', 'versionid': u'MODS.4', 'label': u'MODS Record', 'formaturi': None, 'state': u'A', 'location': None, 'versionable': True, 'winname': u'ora_20-MODS.xml', 'dsid': u'MODS', 'size': 1730}]

>>> f.doesDatastreamExist('ora:20','DC')
True
>>> f.doesDatastreamExist('ora:20','IMAGE')
True
>>> f.doesDatastreamExist('ora:20','IMAGE00123')
False
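The list of dictionaries that listDatastreams returns is easy to slice up with plain Python. Here is a minimal sketch that works on a trimmed copy of the 'ora:20' output shown above, so it runs without a repository; the two helper functions are hypothetical and are not part of FedoraClient. It is written to run on a current Python 3 rather than the 2.5 session used in this post.

```python
# Trimmed copy of the f.listDatastreams('ora:20') output shown earlier.
datastreams = [
    {'dsid': 'IMAGE', 'mimetype': 'image/png', 'controlgroup': 'M', 'size': 0},
    {'dsid': 'DC', 'mimetype': 'text/xml', 'controlgroup': 'X', 'size': 272},
    {'dsid': 'EVENT', 'mimetype': 'text/calendar', 'controlgroup': 'M', 'size': 0},
    {'dsid': 'MODS', 'mimetype': 'text/xml', 'controlgroup': 'X', 'size': 1730},
]

def index_by_dsid(listing):
    """Map each datastream ID to its description dictionary."""
    return dict((ds['dsid'], ds) for ds in listing)

def dsids_with_mimetype(listing, mimetype):
    """Return the IDs of the datastreams that have the given MIME type."""
    return [ds['dsid'] for ds in listing if ds['mimetype'] == mimetype]

by_id = index_by_dsid(datastreams)
xml_ids = dsids_with_mimetype(datastreams, 'text/xml')
```

With the index in hand, a membership test such as 'IMAGE' in by_id gives the same answer as the doesDatastreamExist calls above, without a further round trip to the server.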

Creating new items:

The steps are as simple as creating a new blank FoXML object and ingesting it; datastreams are uploaded and added afterwards. The first example is trivial and the second is more detailed.

First Demo:
http://pastebin.com/f7b1f21e7

Second Demo:
Look at the 'createBlankItem' method in FedoraClient. Plenty of scope for creating complex objects on the fly there.
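To give a feel for what a "blank FoXML object" looks like, here is a rough sketch that builds a bare FOXML skeleton with the standard library. It is not what createBlankItem does internally (I am not reproducing that code here), and the property names are written from my recollection of FOXML 1.0, so treat them as assumptions and check them against the schema shipped with your Fedora. The pid 'demo:1' and the label are made up for illustration. It targets a current Python 3.

```python
import xml.etree.ElementTree as ET

FOXML_NS = 'info:fedora/fedora-system:def/foxml#'
MODEL_NS = 'info:fedora/fedora-system:def/model#'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def blank_foxml(pid, label):
    """Return a FOXML skeleton with object properties and no datastreams."""
    ET.register_namespace('foxml', FOXML_NS)
    root = ET.Element('{%s}digitalObject' % FOXML_NS, {'PID': pid})
    props = ET.SubElement(root, '{%s}objectProperties' % FOXML_NS)
    # Assumed FOXML 1.0 object properties: type, state and label.
    for name, value in (
        (RDF_TYPE, 'FedoraObject'),
        (MODEL_NS + 'state', 'Active'),
        (MODEL_NS + 'label', label),
    ):
        ET.SubElement(props, '{%s}property' % FOXML_NS,
                      {'NAME': name, 'VALUE': value})
    return ET.tostring(root, encoding='unicode')

foxml = blank_foxml('demo:1', 'A blank test object')
```

The resulting string is the sort of thing that gets ingested first, with the datastreams added afterwards as described above.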

Poking around the Triplestore:

Using the above libs:

>>> from lib.risearch import Risearch
>>> r = Risearch(server='http://localhost:8080/fedora') # This is the default, and equivalent to Risearch()

Then you can ask it fun things:

>>> # Retrieve a list of all the objects in the repository:
>>> pids = r.getTuples("select $object from <#ri> where $object <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <info:fedora/fedora-system:def/model#FedoraObject>", format='csv', limit='10000').split("\n")[1:-1]

>>> # Get a list of the pids in a given bottom up collection (ora:neeo):
>>> pids = r.getTuples("select $object from <#ri> where $object <fedora-rels-ext:isMemberOf> <info:fedora/ora:neeo>", format='csv', limit='10000').split("\n")[1:-1]
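The .split("\n")[1:-1] trick above drops the CSV header row and the trailing empty string. A slightly more defensive version can use the csv module instead. This is a minimal sketch under a few assumptions: the sample response text is invented to match the shape I would expect (a header row, then one info:fedora/ URI per line), the helper name is hypothetical, and unlike the one-liner it also strips the info:fedora/ prefix so you end up with bare pids. Written for a current Python 3.

```python
import csv
import io

FEDORA_URI_PREFIX = 'info:fedora/'

def pids_from_csv(csv_text):
    """Extract bare pids from a one-column CSV tuples response."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row, if there is one
    pids = []
    for row in rows:
        if not row:
            continue  # ignore blank lines
        value = row[0]
        if value.startswith(FEDORA_URI_PREFIX):
            value = value[len(FEDORA_URI_PREFIX):]
        pids.append(value)
    return pids

# Invented sample in the assumed response shape.
sample = '"object"\ninfo:fedora/ora:1\ninfo:fedora/ora:20\n'
pids = pids_from_csv(sample)
```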

>>> # Test to see if a certain relationship exists:
>>> # May need to change the code in risearch.py to use the old method
>>> r.doesTripleExist('<info:fedora/ora:1> <person:hasStatus> <info:fedora/collection:open>')
False
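Under the hood, tuple queries like the ones above go to Fedora's Resource Index search service over plain HTTP. As a hedged sketch of what such a request looks like, the function below builds the GET URL for an iTQL tuples query. I am assuming the standard /risearch endpoint and its type, lang, format, limit and query parameters; how lib/risearch.py actually assembles its requests may differ, and the function name is hypothetical. Written for a current Python 3.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def tuples_query_url(server, query, fmt='CSV', limit=10000):
    """Build a GET URL for an iTQL tuples query against the risearch service."""
    params = [
        ('type', 'tuples'),
        ('lang', 'itql'),
        ('format', fmt),
        ('limit', str(limit)),
        ('query', query),
    ]
    return server.rstrip('/') + '/risearch?' + urlencode(params)

QUERY = ('select $object from <#ri> where $object '
         '<fedora-rels-ext:isMemberOf> <info:fedora/ora:neeo>')
url = tuples_query_url('http://localhost:8080/fedora', QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)
```

Pasting the resulting URL into a browser is a handy way to debug a query before wiring it into code.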

In the next post, I'll write up how Solr can be fed from objects in a Fedora repository.

8 comments:

Stephen said...

Hi, I like your blog - you are doing interesting stuff - keep the posts coming.

Anonymous said...

Your python-fedoracommons library on google code looks like great stuff,

By the way it has a problem - look at line 27 of upload.py - I have no archive.lib.mimeTypes :)

Haven't yet tested the library - have problems installing rdflib-2.4.0 on my Windows computer.

Will try on Linux tomorrow.

Ben O'Steen said...

@edgars - Yeah, you are right, sorry about that. The library is used to, er... 'correct' the python inbuilt mimetype -> windows extension library. It has a tendency to do things like 'text/plain' -> '.ksh' if left to its own devices.

I'll sort that out now, and I should be getting round to a release soon enough.

Shravan Thummalapelly said...

Hello Ben,
Thanks a lot for this useful python wrapper for FedoraClient. As I was trying to get the API documentation for this library using setup.py pudge, it is not recognising the pudge option.

Could you provide me any URL for API for this library or could you let me know other options to generate the API documentation for the same.

Thanks,
Shravan

Shravan Thummalapelly said...

Ben,

Could you look into my above post?

And I am trying to use FedoraClient's addDatastream() but it is giving ZSI error as "'list' object has no attribute '__dict__'" Did you come across this error?

Could you reply back with your suggestions.

Ben O'Steen said...

@shravan, the python-fedoracommons code hasn't reached a release point yet, and I haven't even written up the usage guide yet!

As for the issue you've raised, please could you add it as an issue to http://code.google.com/p/python-fedoracommons/issues/entry including what versions of Fedora and ZSI you are using? Steps to reproduce? OS/environment/etc?

As a piece of general advice, I'd use the Fedora 3.0 client, as this uses the REST api and is far less troublesome to get working.

Shravan Thummalapelly said...

Ben,
Thanks a lot for your reply. I have opened an issue in the provided code.google url.

FYI,
I am using Fedora Client 2.2 in Ubuntu 7.10 gutsy.

I will try it with Fedora Client 3.0 and let you know the result.

Thanks,
Shravan

Anonymous said...

Hi Ben,

Thank you for your very helpful posts on Fedora.

I'd like to have a look at the Python code, but have no joy in fetching:

[orac.129] svn co https://orasupport.ouls.ox.ac.uk/archive/archive/lib
svn: OPTIONS of 'https://orasupport.ouls.ox.ac.uk/archive/archive/lib': could not connect to server (https://orasupport.ouls.ox.ac.uk)

Is this stuff still available?

[Not too worried about its "state", or even if it runs.]

Cheers,

--
Phil