Friday, 18 April 2008

Distributed objects - how to cope with objects scattered across multiple Fedoras

I was asked this question by Chris Wilper a couple of weeks ago:
From http://blogs.sun.com/georg/entry/open_respository_2008_day_3
"Scalability : objects can be placed in any object store on the
network, and located via their object meta data. This means that
scaling over multiple Fedora instances is a no brainer."


Sounds cool. How'd you do resolution? (e.g. a request comes in,
which repository is it in?)

At the time, I had a good number of solutions I had tried, with one solution waiting on something to become more stable for it to work. They all had their plus sides and their minuses. They generally followed the tried and tested method of either a centralised source, or having a DNS style system.

What I didn't write to him about was an idea that's been fermenting in my mind for a little while now but which wasn't thought through enough to explain at that time.

De-centralised database for Fedora object URIs

The base premise is to use something called a distributed hash table, or DHT, to hold the link between an object's URI and the base URL of the Fedora that holds it. Now, as DHTs are kinda new and tend to be found in 2 main fields - research projects and trackerless BitTorrent - I'll just write a brief summary of what they are.

Distributed Hash Tables (DHT)

From Wikipedia:
Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (name, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given name. Responsibility for maintaining the mapping from names to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

DHTs form an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent (with extensions), eDonkey network, YaCy, and the Coral Content Distribution Network.

(Emphasis my own)

As other, much more qualified people have written about the underlying algorithms that power DHTs, I will simply link to the articles that I have found illuminating. (A DHT can also be treated as a black-box technology if you wish.)


Let's illustrate how having a DHT of URI-to-Fedora-server pairs will help us, by skipping ahead and imagining that each Fedora server already has a DHT node service of its own - just like a peer-to-peer filesharing program, it maintains a list of key-value pairs for the items it holds.

Let's now imagine 3 commands that a Fedora instance can use to work with its own DHT node - get, put and remove:
  • get key - gets the values associated with a given key
  • put key value - puts the given key-value pair into the DHT
  • remove key value - removes the given key-value pair from the DHT
So, as a new item is allocated to a given Fedora, it can add a pair to the DHT, the key being the item's URI <info:fedora/ns:id>, and the value being the base URL for that Fedora, e.g. <http://host:8080/fedora>


$ ./put.py info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f http://archive.sers.ox.ac.uk:8080/fedora
Success

$ ./get.py info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f
http://archive.sers.ox.ac.uk:8080/fedora
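
To make the three commands concrete, here is a minimal in-memory sketch of them. It is only a stand-in: a real DHT node would route each key to the peers responsible for it, and the class name ToyDHTNode is my own invention for illustration, not part of Bamboo or Fedora.

```python
from collections import defaultdict


class ToyDHTNode:
    """In-memory stand-in for a DHT node (hypothetical, for illustration only)."""

    def __init__(self):
        # A key may map to several values, so each key holds a set of values.
        self._table = defaultdict(set)

    def put(self, key, value):
        """Store the given key-value pair."""
        self._table[key].add(value)

    def get(self, key):
        """Return every value stored under the key, as a sorted list."""
        return sorted(self._table.get(key, ()))

    def remove(self, key, value):
        """Remove one key-value pair, leaving other values for the key alone."""
        values = self._table.get(key)
        if values is None:
            return
        values.discard(value)
        if not values:
            del self._table[key]


node = ToyDHTNode()
uri = "info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f"
node.put(uri, "http://archive.sers.ox.ac.uk:8080/fedora")
print(node.get(uri))
```

Holding a set of values per key means the same item can be advertised by more than one Fedora, which is why remove takes the value as well as the key.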


Specific implementation notes

The Bamboo DHT implementation is a very useful implementation of a DHT system, and has a good amount of helpful documentation. It is also the basis for a good test-bed service, the OpenDHT service mentioned earlier.

In fact, the three commands illustrated above have real, live Python implementations, which are presented on the OpenDHT site - get.py, put.py, and rm.py. The underlying mechanics of the protocol are just simple XML-RPC, so most languages have solid libraries for interfacing with the API.
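
As a rough idea of what such an XML-RPC call looks like on the wire, the sketch below builds (but does not send) a put request using Python's standard library. The argument order (SHA-1 of the key, the value, a time-to-live in seconds, an application name) is how I remember the OpenDHT sample scripts working, so treat it as an assumption and check the OpenDHT documentation before relying on it. The function name, default TTL and application name are placeholders of mine.

```python
import hashlib
import xmlrpc.client


def build_put_request(uri, base_url, ttl_seconds=3600, app_name="fedora-dht-sketch"):
    """Serialise an XML-RPC 'put' call without sending it anywhere.

    Assumed parameter order: hashed key, value, TTL, application name.
    """
    # The key is hashed to a fixed-size (20 byte) identifier before storage.
    key = hashlib.sha1(uri.encode("utf-8")).digest()
    params = (
        xmlrpc.client.Binary(key),
        xmlrpc.client.Binary(base_url.encode("utf-8")),
        ttl_seconds,
        app_name,
    )
    return xmlrpc.client.dumps(params, methodname="put")


request_xml = build_put_request(
    "info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f",
    "http://archive.sers.ox.ac.uk:8080/fedora",
)
print(request_xml)
```

To send the request for real, the same parameters would be passed to a method on an xmlrpc.client.ServerProxy pointed at a DHT gateway.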

I see this service hooking into Fedora by using a listening service on the ActiveMQ message queue - put'ing the URI/URL hash pair when items are added, and rm'ing when purged. The Fedora and the Bamboo service should be started and shut down together, adding or removing whole sets of hashes to or from the DHT, so that the DHT accurately reflects the accessibility of each Fedora item.
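
The core of such a listener could be as small as the sketch below. It leaves out the message queue itself and only shows the dispatch step. I am assuming each message carries the name of the Fedora API method that fired ("ingest" and "purgeObject" here) plus the object's PID; the function and class names are hypothetical.

```python
def handle_repository_event(dht, method_name, pid, base_url):
    """Mirror one repository event into the DHT.

    'dht' is any object with put(key, value) and remove(key, value).
    The method names checked here are assumptions about the messages.
    """
    uri = "info:fedora/" + pid
    if method_name == "ingest":
        dht.put(uri, base_url)
    elif method_name == "purgeObject":
        dht.remove(uri, base_url)
    # Any other event (e.g. a datastream change) does not move the object,
    # so the URI-to-server pair is left untouched.


class RecordingDHT:
    """Tiny stand-in for a DHT client that just remembers the pairs."""

    def __init__(self):
        self.pairs = set()

    def put(self, key, value):
        self.pairs.add((key, value))

    def remove(self, key, value):
        self.pairs.discard((key, value))


dht = RecordingDHT()
base = "http://archive.sers.ox.ac.uk:8080/fedora"
handle_repository_event(dht, "ingest", "ns:id", base)
print(dht.pairs)
```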

Now that we have a DHT, why not...

Another beneficial use of the hash table may be to encode certain information, such as the template URL for an HTML splash page (should one exist) - perhaps in the template style used by OpenSearch 1.1 - e.g. "http://archive.sers.ox.ac.uk:5000/resolve/{uri}". (Note: I have added this resolving service to handle both info:fedora/ns:id and info:fedora/ns:id/dsid type URIs - the first form leads to the splash page and the second redirects to the download for that given datastream.)
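
A client that has pulled such a template out of the DHT only needs to fill in the {uri} slot. The sketch below does that, and also splits the two URI forms the way a resolver might. Percent-encoding the URI before substitution is my assumption (it is what OpenSearch templates expect), and both function names are made up for this example.

```python
from urllib.parse import quote


def expand_template(template, uri):
    """Fill the {uri} slot of a template URL with a percent-encoded URI."""
    return template.replace("{uri}", quote(uri, safe=""))


def parse_fedora_uri(uri):
    """Split info:fedora/ns:id[/dsid] into (pid, dsid); dsid may be None."""
    prefix = "info:fedora/"
    if not uri.startswith(prefix):
        raise ValueError("not an info:fedora URI: %r" % uri)
    pid, _, dsid = uri[len(prefix):].partition("/")
    return pid, (dsid or None)


template = "http://archive.sers.ox.ac.uk:5000/resolve/{uri}"
print(expand_template(template, "info:fedora/ns:id"))
print(parse_fedora_uri("info:fedora/ns:id/dsid"))
```

A resolver would send the (pid, None) case to the splash page and the (pid, dsid) case to the datastream download.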

Important final note

It is very important in distributed environments that you have unique identifiers (which may or may not involve the numerical UUIDs I promote the use of) - the important part is that each Fedora identifier is unique across the whole set of Fedora instances. As the only issue with joining these URI lookup tables across institutional boundaries is political, it may be a good thing to adopt a consistent and bullet-proof mechanism for ensuring that your id system is not going to collide with someone else's.
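
For what it's worth, minting an identifier of the numerical UUID kind takes one line with Python's standard library. This is just a sketch of the approach; the function name is mine, and the uuid: namespace simply follows the example URI used earlier in this post.

```python
import uuid


def mint_fedora_uri():
    """Return a new info:fedora URI built on a random (version 4) UUID."""
    return "info:fedora/uuid:" + str(uuid.uuid4())


print(mint_fedora_uri())
```

Because the UUID is random, no coordination between institutions is needed at minting time.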

Thursday, 17 April 2008

Ditching the DB-based blog for a semantic one

Why Yet-another-blog-engine?

Well, blog engines tend to do the same things; their functionality is derived from simple views on a relational DB. To a large extent, I think that this RDB reliance has shaped the scope of what you can do with a blog, and I really feel it has also guided how blog (and related publishing) technology has developed.

Things like blog export, and interlinking of blogs and the bloggers who power the system are seen as extras, features added as plugins or as an additional service - this needs to change! Let's see what naturally happens when we try to build a system with a more interesting backend.

So, what to replace the RDB with?

Simply put, the data which a blog needs to function normally can be modelled in an objectstore, by linking items together by RDF predicates. Specifically, there is a namespace created by the SIOC project, aimed at defining social networks and their inter-linking in a semantic way - http://www.sioc-project.org/. I have a strong hunch that by using this work, a whole load of extra possibilities will emerge.

(Aside from the obvious benefits of simple export and reuse of objects, being able to make a single comment on more than one blog post from more than one blog, and so on.)

Okay... what's the plan?

So, my coding bias for objectstore and framework language is FedoraCommons and Python, so no surprise there. It also means that I'll be reusing my code, so each object will have an OAI-ORE aggregation and good search integration (via Apache Solr).

Modelling the blog:

Luckily, the good folks at the SIOC project have done a good load of the work for me, and having read through their work, I can say that I see no problem with it for my purposes. This means that using their namespace (http://rdfs.org/sioc/ns#), I can adopt the structure of classes, helpfully illustrated here

The first class objects from or subclassed from SIOC therefore will be as follows (a first class object has a 1-to-1 parity with the underlying Fedora objects):

User [contains or links to FOAF record] - Post [text and zero or more attachments] - Forum(Blog) [Dublin Core] - Post(Comment) [text] - Site [Dublin Core]

Other first class objects:

Link [DC record] and Petition [ Later ;) ]

Link speaks for itself. If a post contains a link, the link is promoted to an object and the post will connect to the link object. If someone else uses that link, the previous object is reused.
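
The reuse rule for Links boils down to a lookup-or-create step. Here is a minimal sketch of it, with a plain dictionary standing in for the objectstore; the class name and the "link:N" identifiers are hypothetical.

```python
class LinkRegistry:
    """Maps each URL to exactly one Link object identifier."""

    def __init__(self):
        self._by_url = {}

    def object_for(self, url):
        """Return the Link object id for a URL, creating it on first use."""
        if url not in self._by_url:
            # First time this URL is seen: promote it to an object.
            self._by_url[url] = "link:%d" % (len(self._by_url) + 1)
        return self._by_url[url]


links = LinkRegistry()
first = links.object_for("http://www.sioc-project.org/")
again = links.object_for("http://www.sioc-project.org/")
print(first, again)
```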

Petition is a social experiment which I will go into later ;)

Now for the behaviour of the blog - I am envisioning an academically focussed blog, so the social network, persistence, trust and discourse are important features to consider.

Users - accounts and blogs

Frankly, I am tired of writing authentication systems. So is everyone else. Users are tired of having an account per site too. Thank god for OpenID then :)

Right, so we have a working, live system for authenticating people, but what about authorising? Authenticate to comment, that's self-explanatory. But this is where we can do some interesting things:

The blog engine is 'seeded' by a blog author or authors, most likely the same people that installed the engine. Borrowing a common idea, these seeds can invite other people to have a publishing account. This is done by an indication of trust - at a technical level <uri-inviter> <trust:trust10> <uri-invitee>, with the <uri-invitee> being the User object created to correspond to a given OpenID. The next time they log in, the ability to create a blog should be apparent.

(trust namespace: http://trust.mindswap.org/ont/trust.owl)

Why not <foaf:knows>? Because that is saved for later :) <trust:trust10> is a predicate intended as a way for someone to fully vouch for someone else - you may know people, but that doesn't mean they should automatically gain blogging rights just because you have those rights.
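
As a sketch of how the authorisation check might read, the snippet below treats the triplestore as a plain list of (subject, predicate, object) tuples. The "#trust10" fragment on the namespace is my assumption about how the predicate URI is formed, the user URIs are made up, and I have kept the rule deliberately simple: only a seed's vouching counts.

```python
TRUST10 = "http://trust.mindswap.org/ont/trust.owl#trust10"  # assumed URI form


def may_create_blog(triples, seeds, user):
    """True if the user is a seed, or a seed has fully vouched for them."""
    if user in seeds:
        return True
    return any(
        subject in seeds and predicate == TRUST10 and obj == user
        for subject, predicate, obj in triples
    )


seeds = {"user:alice"}
triples = [
    ("user:alice", TRUST10, "user:bob"),
    ("user:alice", "http://xmlns.com/foaf/0.1/knows", "user:carol"),
]
print(may_create_blog(triples, seeds, "user:bob"))
print(may_create_blog(triples, seeds, "user:carol"))
```

Carol is known but not vouched for, so she can comment but not publish.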

Any post (post, comment or link) can be tagged as interesting to a User, via the <foaf:interest> predicate - declaring "I am interested in this thing". The payoff to the user is that their page (every User is an object, remember) will display the things they have marked as interesting. If a User declares that another User's blog is 'interesting', then all the latter User's posts will be accessible from here as well.

There are two forms of free-text tags; a trusted tag and a normal tag. Trusted tags are those placed on a Post, Blog or User, by the author/owner of that object - a statement about what the author feels is the subject of the object. An author can also tag themselves, and this is to give extra indication about what they normally blog or comment on, in addition to the tags they've put on their own items.

A normal tag can be placed by any authenticated User on any Post or Blog. Expected functionality, really.

I haven't tripped across a suitable ontology for these two, and I'd really like to use an existing one so feel free to add a comment if you have an idea of one.

Now, I mentioned a curious thing called a Petition - this is a second method to gain trust in the social network. A User can make a Petition, stating a brief summary of what they'd like to write about, and then tag the Petition with its subjects.

This is where it gets semantically fun - a Petition will be visible to existing blog posters with the following filters available:
  • Show Petitions that have tags in common with mine, from Users I trust
  • Show Petitions that have tags in common with mine, from Users I know
  • Show Petitions that have tags in common with mine, from Users my friends know
  • Show Petitions from Users who I've shown interest in
  • Show Petitions from Users who my friends show interest in
  • Show Petitions that have tags in common with mine
  • Show most recent Petitions
It doesn't take much more time to see that the information already in the triplestore can be used to create some really interesting filters.
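
To give a flavour of how cheap these filters are, here is the first one ("tags in common with mine, from Users I trust") as a sketch over plain Python data. In the real system this would be a query against the triplestore; the dictionary layout and names here are assumptions for illustration.

```python
def petitions_sharing_tags_from_trusted(petitions, my_tags, trusted_users):
    """Keep petitions by trusted users that share at least one tag with me."""
    return [
        petition
        for petition in petitions
        if petition["author"] in trusted_users and my_tags & petition["tags"]
    ]


petitions = [
    {"author": "user:bob", "tags": {"fedora", "rdf"}},
    {"author": "user:carol", "tags": {"fedora"}},
    {"author": "user:bob", "tags": {"cooking"}},
]
matches = petitions_sharing_tags_from_trusted(
    petitions, my_tags={"fedora", "dht"}, trusted_users={"user:bob"}
)
print(matches)
```

The other filters are the same shape with a different set swapped in for trusted_users (people I know, people my friends know, and so on).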

Now, a User can then decide to place a level of trust in a given Petitioner. The actual mechanics of what is required to elevate the trust placed in the Petitioner are up to the system installer, but a few interesting things can be used: for example, a requirement that the combined <trust:trust> given to a User reaches a certain level, or maybe a tiered system (2 trust9's are equivalent to 1 trust10, etc).
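
One way the tiered idea could be scored is sketched below. The only rule taken from the text is that 2 trust9's equal 1 trust10; halving the weight at every further level down is my own extrapolation of the "etc", and the threshold of one full trust10 is an arbitrary choice an installer would tune.

```python
from fractions import Fraction


def trust_weight(level):
    """Weight of one trustN statement, measured in trust10 units."""
    if not 0 <= level <= 10:
        raise ValueError("trust level must be between 0 and 10")
    # Assumed rule: each level down is worth half of the level above it.
    return Fraction(1, 2 ** (10 - level))


def is_elevated(levels, threshold=Fraction(1)):
    """True once the combined trust reaches the threshold."""
    return sum(trust_weight(level) for level in levels) >= threshold


print(is_elevated([9, 9]))
print(is_elevated([9]))
```

Fractions keep the sums exact, so two trust9's really do add up to one trust10 rather than something a rounding error short of it.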

The bottom line is that this trust system means that there is no requirement for a super User as the network should be self-regulating - take the trust away, and the Petitioner loses their Blog. (It also doesn't mean that having a Super user is a bad idea!)

Posts and Posting

I hate the word blog... I've been using it as a crutch, as what I'd like this to be is a site to have a voice on. By using semantic relationships, it is quite possible to view it as you might a blog, but you might also view it like a forum, with posts and threaded comments. The underlying information and connections are the same, but the way you can view and present them becomes a lot more flexible.

So when I say Post, I mean it in a twitter/blogger/thread-reply kind of way ;)

The Post objects have a summary (200 char limit) and an optional body (a blog 'post') - no title! They can be tagged, of course, and a post can also hold attachments for download, or for embedding here or elsewhere. (All resources have dereferenceable URIs, and so can be linked to directly.)
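
Written down as a data structure, a Post might look something like this. It is a sketch of the shape described above, not the eventual Fedora object model; the field names are my guesses, and attachments are represented simply as a list of URIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUMMARY_LIMIT = 200  # characters, as described above


@dataclass
class Post:
    summary: str                      # always present, at most 200 characters
    body: Optional[str] = None        # present for a blog-style post
    tags: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # URIs of resources

    def __post_init__(self):
        if len(self.summary) > SUMMARY_LIMIT:
            raise ValueError("summary exceeds %d characters" % SUMMARY_LIMIT)


tweet = Post(summary="Trying out DHTs for locating Fedora objects")
article = Post(summary="Why I'm ditching the DB", body="Well, blog engines...")
print(tweet.body is None, article.body is not None)
```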

So a Post can be a 'tweet', or a 'blog' post - it's all the same thing. However, on the user's jumpoff page, the summaries are listed.

I'll have a go at putting something together, to see what works and what doesn't, so watch this space.

Wednesday, 16 April 2008

Release of alpha-quality web interface framework for Fedora

Just a heads up that I have uploaded the code for the web interface framework for Fedora - the same framework I used at Open Repositories 2008, but cleaned up a bit. (In respect for that, it ships with the same graphics and blurb from the OR08 EPrints repository)

Project Home:
http://code.google.com/p/python-fedoracommons-webarchive/

I am adding documentation as I go along, and it is at an early preview level, i.e. if you can get it up and running (which isn't too taxing), have fun. Please raise any issues or problems on the Google Code issue tracker.

One thing I'd like to point out, is that this is very much a framework - you tailor it to how you need it. What it provides is:

- Items can have a content-type, and this predefines how the item is presented - see http://code.google.com/p/python-fedoracommons-webarchive/source/browse/trunk/archive/lib/cmodel_mapper.py and http://code.google.com/p/python-fedoracommons-webarchive/source/browse/trunk/archive/lib/app_globals.py for more in depth details.

- Items support pingback, and trackback out of the box.
- See the URL structures here for more goodness: http://code.google.com/p/python-fedoracommons-webarchive/wiki/DefaultURLScheme

- Oh and the project is most definitely a WIP, so bear with me. It may not all be there at the moment, but I work fast :)