Friday 14 November 2008

Beginning with RDF triplestores - a 'survey'

Like last time, this was prompted by an email that eventually was passed to me. It was a call for opinion - "we thought we'd check first to see what software either of you recommend or use for an RDF database."

It's a good question.

In fact, it's a really great question, as searching for similar advice online results in very few opinions on the subject.

But which ones are the best for novices? Which have the best learning curves? Which has the easiest install or the shortest time between starting out and being able to query things?

I'll try to pose as a newcomer as much as I can, which won't be too hard :) Some of the comments will be my own, and some will be comments from others, but I'll try to be as honest as I can to reflect new user expectation and experience and, most importantly, developer attention span. (See the end for some of my reasons for this approach.)

(Puts on newbie hat and enables PEBKAC mode.)

Installable (local) triplestores

Sesame - http://www.openrdf.org/

Simple menu on the left of the website, one called downloads. Great, I'll give that a whirl. "Download the latest Sesame 2.x release" looks good to me. Hmm 5 differently named files... I'll grab the 'onejar' file and try to run it. "Failed to load Main-Class manifest attribute from openrdf-sesame-2.2.1-onejar.jar", okay... so back to the site to find out how to install this thing.

No links for an installation guide... on the Documentation page, there's no link to installation instructions for the Sesame 2.2.1 I downloaded, but there is Sesame 2 user documentation and Sesame 2 system documentation. Phew, after guessing that the user documentation might have the guide, I finally found the installation guide (the system documentation was about the architecture, not how to administer the system as you might expect).

(Developer losing interest...)

Ah, I see, I need the SDK. I wonder what that 'onejar' was then... "The deployment process is container-specific, please consult the documentation for your container on how to deploy a web application." - right, okay... let's assume that I have a Java background and am not just a user wanting to hook into it from my language of choice, such as PHP, Ruby, Python, or dare I say it, JavaScript.

(Only Java-friendly developers continue on)

Right, got Tomcat, and put in the war file... right so, now I need to work out how to use a commandline console tool to set up a 'repository'... does this use SVN or CVS then? Oh, it doesn't do anything unless I end the line with a period. I thought it had hung trying to connect!  "Triple indexes [spoc,posc]" Wha? Well, whatever that was, the test repository is created. Let's see what's at http://localhost:8080/openrdf-sesame then.
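(For the curious: 'spoc' and 'posc' name the field orders of the store's indexes, with s, p, o and c standing for subject, predicate, object and context. Below is a rough Python sketch of the idea only. The sample statements and the helper name index_order are made up for illustration, and this is not a claim about how Sesame implements its indexes internally.)

```python
# Each statement is a tuple: (subject, predicate, object, context).
FIELDS = {"s": 0, "p": 1, "o": 2, "c": 3}


def index_order(statements, spec):
    """Return the statements sorted by the field order named in spec, e.g. 'posc'."""
    positions = [FIELDS[ch] for ch in spec]
    return sorted(statements, key=lambda st: tuple(st[i] for i in positions))


# Made-up sample data, purely for illustration.
statements = [
    ("ex:bob", "foaf:name", "Bob", "ex:g1"),
    ("ex:alice", "foaf:knows", "ex:bob", "ex:g1"),
    ("ex:alice", "foaf:name", "Alice", "ex:g1"),
]

spoc = index_order(statements, "spoc")  # grouped by subject first
posc = index_order(statements, "posc")  # grouped by predicate first
```

An index sorted subject-first makes "everything about ex:alice" a cheap lookup, while a predicate-first one suits "every foaf:name in the store".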

"You are currently accessing an OpenRDF Sesame server. This server is
intended to be accessed by dedicated clients, using a specialized
protocol. To access the information on this server through a browser,
we recommend using the OpenRDF Workbench software."

Bugger. Google for "sesame clients" then.
I've pretty much given up at this point. If I knew I needed to use a triplestore then I might have persisted, but if I was just investigating it? I would've probably given up earlier.

Mulgara - http://www.mulgara.org/

Nice, they've given the frontpage some style, not too keen on orange, but the effort makes it look professional. "Mulgara is a scalable RDF database written entirely in Java." -> Great, I found what I am looking for, and it warns me it needs Java. "DOWNLOAD NOW" - that's pretty clear. *click*

Hmm, where's the style gone? Lots of download options, but thankfully one is marked by "These released binaries are all that are required for most applications." so I'll grab those. 25Mb? Wow...

Okay, it's downloaded and unpacked now. Let's see what we've got - a 'dist/' directory and two jars. Well, I guess I should try to run one (wonder what the licence is, where's the README?)
Mulgara Semantic Store Version 2.0.6 (Build 2.0.6.local)
INFO [main] (EmbeddedMulgaraServer.java:715) - RMI Registry started automatically on port 1099
0 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - RMI Registry started automatically on port 1099
INFO [main] (EmbeddedMulgaraServer.java:738) - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
3 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
2008-11-14 14:06:39,899 INFO  Database - Host name aliases for this server are: [billpardy, localhost, 127.0.0.1]
Well, I guess something has started... back to the site, there is a documentation page and a wiki. A quick view of the official documentation has just confused me: is this an external site? No easy link to something like 'getting started' or tutorials. I've heard of SPARQL, but what's iTQL? Never mind, let's see if the wiki is more helpful.

Let's try 'Documentation' - sweet, first link looks like what I want - Web User Interface.
A default configuration for a standalone Mulgara server runs a set of
web services, including the Web User Interface. The standard
configuration puts uses port 8080, so the web services can be seen by
pointing a browser on the server running Mulgara to http://localhost:8080/.
Ooo cool. *click*

Available Services


SPARQL, I've heard of that. *click*

HTTP ERROR: 400

Query must be supplied

RequestURI=/sparql/

Powered by Jetty://

I guess that's the SPARQL api, good to know, but the frontpage could've warned me a little. Ah, second link is to the User Interface.
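(That 400 most likely just means the request arrived without a query attached. The SPARQL protocol carries the query text in a parameter named 'query', so a minimal Python sketch of a well-formed GET might look like the following. The endpoint URL is taken from the RequestURI in the error page. The helper name sparql_get_url and the sample query are assumptions for illustration, and whether Mulgara also wants a graph named in the query is not something this sketch checks.)

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Build a GET URL that carries the query text in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Assumed endpoint, based on the RequestURI shown in the error page.
ENDPOINT = "http://localhost:8080/sparql/"
# Hypothetical catch-all query, just to have something to send.
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# To actually send it (needs the server to be running):
#   from urllib.request import urlopen
#   print(urlopen(url).read().decode("utf-8"))
```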

Good, I can use a drop down to look at lots of example queries, nice. Don't understand most of them at the moment, but it's definitely comforting to have examples. They look nothing like SPARQL though... wonder what it is? I'm sure it does SPARQL... was I wrong?

Quick poke at the HTML shows that it is just POSTing the query text to webui/ExecuteQuery. Looks straightforward to start hacking against too, but probably should password protect this somehow! I wonder how that is done... documentation mentions a 'java.security.policy' field:

java.security.policy

string: URL
: The URL for the security policy file to use.
Default: jar:file:/jar_path!/conf/mulgara-rmi.policy

Kinda stumped... will investigate that later, but at least there's hope. Just by firing off the example queries, though, I can see stuff, so I've got something to work with at least.

Jena - http://jena.sourceforge.net/

Front page is pretty clear, even if I don't understand what all those acronyms are. The downloads link takes me to a page with an obvious download link, good. (Oh, and sourceforge, you suck. How many frikkin mirrors do I have to try to get this file?)

Have to put Jena on pause while Sourceforge sorts its life out.

ARC2 - http://arc.semsol.org/

Frontpage: "Easy RDF and SPARQL for LAMP systems" Nice, I know of LAMP and I particularly like the word Easy. Let's see... Download is easy to find, and tells me straight away I need PHP 4.3+ and MySQL 4.0.4+ *check* Right, now how do I enable PHP for apache again?... Ah, it helps if I install it first... Okay, done. Dropping the folder into my web space... Hmm nothing does anything. From the documentation, it does look like it is geared to providing a PHP library framework for working with its triplestore and RDF. Hang on, SPARQL Endpoint Setup looks like what I want. It wants a database, okay... done, bit of a hassle though.

Hmm, all I get is "Fatal error: Call to undefined function mysql_connect() in /********/arc2/store/ARC2_Store.php on line 53"

Of course, install php libraries to access mysql (PEBKAC)... done and I also realise I need to set up the store, like the example in "Getting Started"... done (with this) and what does the index page now look like?



Yay! there's like SPARQL and stuff... I guess 'load' and 'insert' will help me stick stuff in, and 'select' looks familiar... Well, it seems to be working at least.

Unfortunately, it looks like the Jena download from sourceforge is in a world of FAIL for now. Maybe I'll look at it next time?

Triplestores in the cloud

Talis Platform - http://www.talis.com/platform/

From the frontpage - "Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage." - pretty much what I wanted to hear, so how do I get to play with it?

There is a Get involved link on the left, which rapidly leads me to see the section: "Develop, play and try out" - n² developer community seems to be where it wants me to go.

Lots of links on the frontpage, takes a few seconds to spot: "Join - join the n² community to get free developer stores and online support" - free, nice word that. So, I just have to email someone? Okay, I can live with that.

Documentation seems good, lots of choices though, a little hard to spot a single thread to follow to get up to speed, but Guides and Tutorials looks right to get going with. The Kniblet tutorial (whatever a kniblet is) looks the most beginnerish, and it's also very PHP focussed, which is either a good thing or a bad thing depending on the user :)

Commercial triplestores

Openlink Virtuoso - http://virtuoso.openlinksw.com/

Okay, I tried the Download link, but I am pretty confused by what I'm greeted with:



Not sure which one to pick just to try it out. It's late in the day, and my tolerance for all things installable has run out.

-----------------------------------------

Why take the http/web-centric, newbie approach to looking at these?

Answer: In part, I am taking this approach because I have a deep belief that it was only after relational DBs became commoditised - "You want fries with your MySQL database?" - that the dynamic web kicked off. If we want the semantic web to kick off, we need to commoditise it or, at least, make it very easy for developers to get started. And I mean EASY. A query that I want answered is: "Is there something that fits: 'apt-get install triplestore; r = store('localhost'), r.add(rdf), r.query(blah)'?"
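To make that wish concrete, here is a toy in-memory sketch of the kind of three-call interface I mean, in Python. Everything in it is hypothetical: the Store class, its add and query methods and the sample triples are made up for illustration and do not correspond to any real package, and the 'query' here is plain pattern matching rather than SPARQL.

```python
class Store:
    """Toy in-memory triplestore; None in a query pattern acts as a wildcard."""

    def __init__(self, host="localhost"):
        self.host = host  # kept only to mirror the wished-for call; unused here
        self.triples = set()

    def add(self, triples):
        """Add an iterable of (subject, predicate, object) tuples."""
        self.triples.update(tuple(t) for t in triples)

    def query(self, s=None, p=None, o=None):
        """Return the stored triples matching the pattern, sorted for stable output."""
        pattern = (s, p, o)
        return sorted(
            t for t in self.triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


r = Store("localhost")
r.add([
    ("ex:ben", "foaf:name", "Ben"),
    ("ex:ben", "foaf:knows", "ex:tom"),
    ("ex:tom", "foaf:name", "Tom"),
])
names = r.query(p="foaf:name")  # every triple whose predicate is foaf:name
```

The point is not the implementation, which is trivial, but the distance between install and first answer: three calls and no container, console or repository setup in between.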

(I am particularly interested to see what happens when Tom Morris's work on Reddy collides with ActiveRecord or activerdf...)

(NB I've short-circuited the discovery of software homepages - imagine I've seen projects stating that they use "XXXXX as a triplestore". I know this will likely mean I've compared apples to oranges, but as a newbie, how would I be expected to know this? "Powered by the Talis Platform" and "Powered by Jena" seem pretty similar on the surface.)

5 comments:

Stephen said...

does fedora/mulgara fare any better?

Ben O'Steen said...

@stephen ...fedora/mulgara fares significantly worse in terms of "amount-of-knowledge-needed-to-install" - barring scripts like the Fascinator, of course.

Stephen said...

That's a shame, because I think your article on modeling with rdf/fedora really brings out where the triplestore shines, that is, combined with a digital library. An obvious application is as a viable alternative to iPhoto's function as a DL.

btw i couldn't tell if mulgara was packaged with the fedora3 installer? (I'm persisting with my attempts, but a ppc mac as my only available machine is hampering things. )

Ben O'Steen said...

Mulgara should be shipped as default with both the latest Fedora 3.1 and the 2.2.4(?) versions.

I've had issues installing Fedora on Mac, but if you can do it, it is worth the effort - the mix of an RDF modelled store of objects with the Fedora model of providing fixed aggregations of objects is a powerful one.

Anonymous said...

the JAVA stuff was a parade of fail for me, Virtuoso is a huge beast i dont want running on phone/netbook and i hate PHP, not to mention the needless abstraction/bloat/latency of a solution like pomegranate->reddy->activerdf->sparql->tcp->virtuoso (and back) and all the awesome sounding triplestores written in C (garlik/metaweb) are proprietary so i ended up writing a minimalist FS based triple-store for element

the tradeoff is using a decent FS optimized for many small files and/or blank nodes, like BTRfs or Reiser4, and not having to worry about your DB server being there and all sorts of metalayers in between running smooth

i mean, if your FS isnt there, you got bigger issues to worry about..