Tuesday, 11 December 2007

Linking items together by using an RDF store

(Disclaimer: This is going to rapidly skip over functionality that is present in Fedora, and can be duplicated for other object stores given effort. The key pieces of functionality are that each item has a set of RDF triples that describe how it relates to other items, and that each item can be referenced by a URI.)

(If you wish to follow along without a pre-made environment, see the bottom of this post, where I'll stick some Python code that'll create an RDF triplestore that you can tinker with. It is command line only at this point, not a web service.)

Assumption #1 An item is the tangible object in your store. It is likely to hold both metadata and the data itself as part of it, so it could easily be a METS file or a Fedora object.
Assumption #2 Items and their various important parts are identifiable using URIs, or at the very least, a hackable scheme.

So, considering Fedora, its objects have URIs of the form , where the namespace and id are the determining part. A part of this object (a datastream, such as an attached PDF or even the Dublin Core of the object itself) can be written in URI form by using the datastream's id - e.g. will resolve to the Dublin Core XML that is attached to the object 'ora:1600'.
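
As a rough sketch of the URI scheme described above, the helpers below build the two kinds of identifier used in this post: the 'info:fedora/' form that appears in the triples, and a URL for one datastream of an object. The function names are mine, and the '/get/<pid>/<datastream id>' pattern is inferred from the RELS-EXT link given elsewhere in this post rather than taken from the Fedora documentation, so treat both as assumptions.

```python
# Hypothetical helpers; the names and the URL pattern are assumptions for illustration.
FEDORA_BASE = "http://ora.ouls.ox.ac.uk:8080/fedora"  # base URL as used in the links in this post


def object_uri(pid):
    """Return the info:fedora URI for an object id such as 'ora:1600'."""
    return "info:fedora/" + pid


def datastream_url(pid, dsid, base=FEDORA_BASE):
    """Return a URL for one datastream of an object (assumed /get/<pid>/<dsid> pattern)."""
    return "{0}/get/{1}/{2}".format(base, pid, dsid)


print(object_uri("ora:1600"))
print(datastream_url("ora:1600", "RELS-EXT"))
```

The point is only that both the object and its parts can be named by strings built from the namespace, the id and the datastream id.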

Skipping very lightly over questions of semantics and other things, the basic premise of using RDF in the way I shall show is to define relationships between nodes expressed as URIs or as text (more formally, as 'Literals'). These relationships, or 'properties', are drawn from a host of namespaces where the properties themselves are defined.

Let's see a few hopefully familiar properties:

From http://purl.org/dc/elements/1.1/ (You may need to view the source of that page): {NB the following has been heavily cut, as indicated by [ snip ]}

[ snip ]
<rdf:Property rdf:about="http://purl.org/dc/elements/1.1/title">
  <rdfs:label xml:lang="en-US">Title</rdfs:label>
  <rdfs:comment xml:lang="en-US">A name given to the resource.</rdfs:comment>
  <dc:description xml:lang="en-US">Typically, a Title will be a name by which
    the resource is formally known.</dc:description>
  <rdfs:isDefinedBy rdf:resource="http://purl.org/dc/elements/1.1/"/>
  <dc:type rdf:resource="http://dublincore.org/usage/documents/principles/#element"/>
  <dcterms:hasVersion rdf:resource="http://dublincore.org/usage/terms/history/#title-005"/>
</rdf:Property>
<rdf:Property rdf:about="http://purl.org/dc/elements/1.1/creator">
  <rdfs:label xml:lang="en-US">Creator</rdfs:label>
  <rdfs:comment xml:lang="en-US">An entity primarily responsible for making
    the resource.</rdfs:comment>
  <dc:description xml:lang="en-US">Examples of a Creator include a person, an
    organization, or a service. Typically, the name
    of a Creator should be used to indicate the entity.</dc:description>
  <rdfs:isDefinedBy rdf:resource="http://purl.org/dc/elements/1.1/"/>
[ snip ]

Familiar? Perhaps not, but now we can move on to what makes the RDF world go around: 'triples'. At a basic level, all that you will need to know is that a triple is a chunk of information that states that 'A → B → C', where A is a URI, C can be a URI or a literal, and B is the property which links A and C together.

Now for a real example, as it is quite straightforward. You should be able to understand the following triple:

<info:fedora/ora:909> <http://purl.org/dc/elements/1.1/creator> "Micheal Heaney" .

The object 'ora:909' has a dc:creator of 'Micheal Heaney'. Simple enough, eh? Now, let's look at a more useful namespace, and for my ease, let's choose one used by Fedora: http://www.fedora.info/definitions/1/0/fedora-relsext-ontology.rdfs
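
If it helps to see the same thing as code, here is a small sketch in plain Python (no RDF library) that holds a triple as a (subject, predicate, object) tuple and prints it in the angle-bracket notation used above. The names Triple and to_ntriples are made up for this example, and the escaping is deliberately minimal.

```python
from collections import namedtuple

# A triple is just (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

DC = "http://purl.org/dc/elements/1.1/"


def to_ntriples(triple, object_is_literal):
    """Render one triple as a single N-Triples style line.

    URIs go in angle brackets and a literal goes in double quotes.
    Only backslashes and double quotes are escaped, which is enough
    for this example but not for arbitrary text.
    """
    if object_is_literal:
        escaped = triple.object.replace("\\", "\\\\").replace('"', '\\"')
        obj = '"' + escaped + '"'
    else:
        obj = "<" + triple.object + ">"
    return "<{0}> <{1}> {2} .".format(triple.subject, triple.predicate, obj)


creator = Triple("info:fedora/ora:909", DC + "creator", "Micheal Heaney")
print(to_ntriples(creator, object_is_literal=True))
```

Whether the object is a URI or a literal has to be tracked separately here; a real RDF library carries that distinction in the type of the node.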

You don't really need to read the definition file, as the meaning should be fairly obvious:

<info:fedora/ora:1600> <info:fedora/fedora-system:def/relations-external#isMemberOf> <info:fedora/ora:younglives> .

<info:fedora/ora:1600> <info:fedora/fedora-system:def/relations-external#isMemberOf> <info:fedora/ora:general> .

To make things a little more clear, I have put the important part of the triple in bold. (If you are really, really curious to see all the triples on this item, go to http://tinyurl.com/33z3rn )

This, as you may have guessed, is how I am implementing collections of objects. ora:general is the object corresponding to the set of general items in the repository, and ora:younglives is the collection for the 'Young Lives' project.

As this snapshot should show, this is not the full extent of how I am using these relationships. (Picture of the submission form for the archive - http://www.flickr.com/photos/oxfordrepo/2102829891/)

Next Steps:

The mechanics of how the triples get into the RDF triplestore (think of it as a db for triples) are entirely up to you. What I will say is that if you use Fedora, and put the RDF for an object in a datastream called 'RELS-EXT', this RDF is automatically added to an internal triplestore. For an example of this, see http://ora.ouls.ox.ac.uk:8080/fedora/get/ora:1600/RELS-EXT
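
To give a feel for what such a RELS-EXT datastream holds, here is a sketch that pulls the triples out of a small RDF/XML fragment using only the Python standard library. The XML below is my hand-written approximation of a RELS-EXT document for ora:1600, not a copy of the real datastream, and the function (rels_ext_triples, a name invented here) only copes with this flat rdf:Description shape. A proper RDF parser such as rdflib is the right tool for anything more.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written sample, assumed to resemble a RELS-EXT datastream.
RELS_EXT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/ora:1600">
    <rel:isMemberOf rdf:resource="info:fedora/ora:younglives"/>
    <rel:isMemberOf rdf:resource="info:fedora/ora:general"/>
  </rdf:Description>
</rdf:RDF>
"""


def rels_ext_triples(xml_text):
    """Return (subject, predicate, object) tuples from flat RDF/XML."""
    triples = []
    root = ET.fromstring(xml_text.strip())
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree writes tags as '{namespace}localname'; joining the
            # two halves gives the full predicate URI.
            namespace, local_name = prop.tag[1:].split("}", 1)
            obj = prop.get("{%s}resource" % RDF_NS)
            if obj is None:
                obj = prop.text  # a literal value rather than a URI
            triples.append((subject, namespace + local_name, obj))
    return triples


for triple in rels_ext_triples(RELS_EXT):
    print(triple)
```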

So, I did warn that I was going to skip over the mechanics of getting triples into a triplestore. I will now talk about why we are doing this: querying the triplestore.

Querying the store

The garden variety query is of the following form:

"Give me the nodes that have some property linking them to a particular node" - i.e. return all the objects in a given collection, find me all the objects that are part of this other object, etc.

Let's consider finding all the things in the Young Lives collection mentioned before.

In iTQL:

"select $objects from <#ri> where $objects <fedora-rels-ext:isMemberOf> <info:fedora/ora:younglives>"

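or equally, in SPARQL for a different triplestore (where the 'fedora-rels-ext' shorthand is not defined, so the property URI is written out in full):
"select ?objects where { ?objects <info:fedora/fedora-system:def/relations-external#isMemberOf> <info:fedora/ora:younglives> . }"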

I think it should be straightforward to see that this is a very useful feature to have.
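
To show what the query above is really asking for, here is a toy version in plain Python: the "store" is a list of tuples and the query is a pattern in which None acts as a wildcard. This is only a sketch of the idea, not how a real triplestore works internally. The function names match and members_of are mine, and the membership of ora:909 in ora:general is invented purely so that the query has something to leave out.

```python
REL = "info:fedora/fedora-system:def/relations-external#"
DC = "http://purl.org/dc/elements/1.1/"

# A toy in-memory "store": a list of (subject, predicate, object) tuples.
# The ora:909 membership triple is made up for this example.
TRIPLES = [
    ("info:fedora/ora:1600", REL + "isMemberOf", "info:fedora/ora:younglives"),
    ("info:fedora/ora:1600", REL + "isMemberOf", "info:fedora/ora:general"),
    ("info:fedora/ora:909", REL + "isMemberOf", "info:fedora/ora:general"),
    ("info:fedora/ora:909", DC + "creator", "Micheal Heaney"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Yield the triples that fit the pattern; None matches anything."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


def members_of(triples, collection_uri):
    """Return the subjects that are isMemberOf the given collection."""
    pattern = match(triples, predicate=REL + "isMemberOf", obj=collection_uri)
    return [s for s, _, _ in pattern]


print(members_of(TRIPLES, "info:fedora/ora:younglives"))
print(members_of(TRIPLES, "info:fedora/ora:general"))
```

Leaving the subject as the wildcard gives "everything in this collection"; leaving the object as the wildcard instead would give "every collection this item is in".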

Python Triplestore backed by MySQL:

Install rdflib - either by using easy_install, or from this site: http://rdflib.net/
Install the MySQL libraries for Python (Debian package: python-mysqldb)

Then, to create a triplestore and add things to it:

# -*- coding: utf-8 -*-

import rdflib
from rdflib.Graph import ConjunctiveGraph as Graph
from rdflib import plugin
from rdflib.store import Store
from rdflib import Namespace
from rdflib import Literal
from rdflib import URIRef

from rdflib.sparql.bison import Parse

graph = Graph()

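# Now, say you have some RDF you want to add to this. If you wanted to add a file of it,
# parse() will load it from a path or URL.
# May I suggest http://ora.ouls.ox.ac.uk:8080/fedora/get/ora:1600/RELS-EXT as this will make the
# query work :)
graph.parse("http://ora.ouls.ox.ac.uk:8080/fedora/get/ora:1600/RELS-EXT")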

# Adding a single triple needs a little prep, (namespace definitions, etc)
dc = Namespace("http://purl.org/dc/elements/1.1/")
fedora = Namespace("info:fedora/") # Or whatever

# the format is -> graph.add( (subject,predicate,object) )
graph.add( (fedora['ora:1600'], dc['title'], Literal('Childhood poverty, basic services and cumulative disadvantage: an international comparative analysis') ) )

# And to look at all of your RDF, you can serialize the graph:
# in RDF/XML
print graph.serialize()

#in ntriples
print graph.serialize(format='nt')

# Now to query it:
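# (the property URI is written out in full, as the 'fedora-rels-ext' shorthand is not defined here)
parsed_query = Parse("select ?objects where { ?objects <info:fedora/fedora-system:def/relations-external#isMemberOf> <info:fedora/ora:younglives> . }")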

response = graph.query(parsed_query).serialize() # in XML

# or directly as python objects:
response = graph.query(parsed_query).serialize('python')

To make it persistent, you need a store. This is how you add MySQL as a store:

# replace the graph = Graph() line with the following:
# Assumes there is a MySQL instance on localhost, with an empty, writable db 'rdfstore' owned by user 'rdflib' with password 'asdfasdf'

default_graph_uri = "http://host/rdfstore"

# Get the mysql plugin. You may have to install the python mysql libraries
store = plugin.get('MySQL', Store)('rdfstore')

# Create a new MySQL backed triplestore. Remove the 'create=True' part to reopen the store
# at a later date.
resp = store.open("host=localhost,password=asdfasdf,user=rdflib,db=rdfstore", create=True)
graph = Graph(store, identifier = URIRef(default_graph_uri))


toom said...


I can't execute the code.
where it says:
-- 8< --
# Now to query it:
parsed_query = Parse("select ?objects where { ?objects . }"
-- >8 --

I think it lacks a parenthesis, but even if I close it I still can't execute it.

I like the title of your blog. Just what I need :)


Ben O'Steen said...

Heh, sorry about that, it's a case of Blogger ate my code. Should be fixed now. It should be the SPARQL query from the above text.