Friday, 14 November 2008

Beginning with RDF triplestores - a 'survey'

Like last time, this was prompted by an email that was eventually passed on to me. It was a call for opinion: "we thought we'd check first to see what software either of you recommend or use for an RDF database."

It's a good question.

In fact, it's a really great question, as searching for similar advice online results in very few opinions on the subject.

But which ones are best for novices? Which have the gentlest learning curves? Which has the easiest install, or the shortest time between starting out and being able to query things?

I'll try to pose as much as I can as a newcomer, which won't be too hard :) Some of the comments will be my own, and some will be comments from others, but I'll try to be as honest as I can to reflect new-user expectations and experience and, most importantly, developer attention span. (See the end for some of my reasons for this approach.)

(Puts on newbie hat and enables PEBKAC mode.)

Installable (local) triplestores

Sesame - http://www.openrdf.org/

Simple menu on the left of the website, with one entry called Downloads. Great, I'll give that a whirl. "Download the latest Sesame 2.x release" looks good to me. Hmm, 5 differently-named files... I'll grab the 'onejar' file and try to run it. "Failed to load Main-Class manifest attribute from openrdf-sesame-2.2.1-onejar.jar" - okay... so back to the site to find out how to install this thing.

No link to an installation guide... On the Documentation page, there are no installation instructions for the Sesame 2.2.1 I downloaded, but there are Sesame 2 user documentation and Sesame 2 system documentation. Phew - after guessing that the user documentation might hold the guide, I finally found the installation instructions (the system documentation was about the architecture, not how to administer the system as you might expect).

(Developer losing interest...)

Ah, I see, I need the SDK. I wonder what that 'onejar' was for, then... "The deployment process is container-specific, please consult the documentation for your container on how to deploy a web application." - right, okay... let's assume that I have a Java background and am not just a user wanting to hook into it from my language of choice, such as PHP, Ruby, Python or, dare I say it, JavaScript.

(Only Java-friendly developers continue on)

Right, got Tomcat and put in the war file... Right, so now I need to work out how to use a command-line console tool to set up a 'repository'... does this use SVN or CVS then? Oh, it doesn't do anything unless I end the line with a period - I thought it had hung trying to connect! "Triple indexes [spoc,posc]" Wha? Well, whatever that was, the test repository is created. Let's see what's at http://localhost:8080/openrdf-sesame then.

"You are currently accessing an OpenRDF Sesame server. This server is
intended to be accessed by dedicated clients, using a specialized
protocol. To access the information on this server through a browser,
we recommend using the OpenRDF Workbench software."

Bugger. Google for "sesame clients" then.
I've pretty much given up at this point. If I knew I needed to use a triplestore then I might have persisted, but if I were just investigating? I'd probably have given up even earlier.
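
(Postscript, for anyone who persists: the 'specialized protocol' turns out to be plain HTTP underneath, documented in the system docs. A minimal sketch, assuming the server from above and a repository called 'test' created via the console - treat the details as illustrative rather than gospel:)

    # Sketch: query a Sesame 2 repository over its HTTP protocol.
    # Assumes a server at localhost:8080 and a repository named "test";
    # the URL layout is from the Sesame 2 system documentation.
    import urllib.parse
    import urllib.request

    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
    url = ("http://localhost:8080/openrdf-sesame/repositories/test?"
           + urllib.parse.urlencode({"query": query}))

    # Ask for SPARQL XML results; other formats can be negotiated too.
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+xml"})
    print(urllib.request.urlopen(req).read())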

Mulgara - http://www.mulgara.org/

Nice, they've given the frontpage some style, not too keen on orange, but the effort makes it look professional. "Mulgara is a scalable RDF database written entirely in Java." -> Great, I found what I am looking for, and it warns me it needs Java. "DOWNLOAD NOW" - that's pretty clear. *click*

Hmm, where's the style gone? Lots of download options, but thankfully one set is marked "These released binaries are all that are required for most applications." so I'll grab those. 25MB? Wow...

Okay, it's downloaded and unpacked now. Let's see what we've got - a 'dist/' directory and two jars. Well, I guess I should try to run one (wonder what the licence is, where's the README?)
    Mulgara Semantic Store Version 2.0.6 (Build 2.0.6.local)
    INFO [main] (EmbeddedMulgaraServer.java:715) - RMI Registry started automatically on port 1099
    0 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - RMI Registry started automatically on port 1099
    INFO [main] (EmbeddedMulgaraServer.java:738) - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
    3 [main] INFO org.mulgara.server.EmbeddedMulgaraServer  - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
    2008-11-14 14:06:39,899 INFO  Database - Host name aliases for this server are: [billpardy, localhost, 127.0.0.1]
Well, I guess something has started... Back to the site: there is a documentation page and a wiki. A quick look at the official documentation has just confused me - is this an external site? There's no easy link to anything like 'getting started' or tutorials. I've heard of SPARQL, but what's iTQL? Never mind, let's see if the wiki is more helpful.

Let's try 'Documentation' - sweet, first link looks like what I want - Web User Interface.
"A default configuration for a standalone Mulgara server runs a set of web services, including the Web User Interface. The standard configuration uses port 8080, so the web services can be seen by pointing a browser on the server running Mulgara to http://localhost:8080/."
Ooo cool. *click*

Available Services

[screenshot: the list of services exposed by the server, including SPARQL and the Web User Interface]

SPARQL, I've heard of that. *click*

HTTP ERROR: 400

Query must be supplied

RequestURI=/sparql/

Powered by Jetty://

I guess that's the SPARQL API - good to know, but the frontpage could've warned me a little.
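
(In hindsight, that 400 is a decent hint: the endpoint presumably follows the SPARQL protocol's ?query= convention, so a sketch like this ought to exercise it - the graph URI below is a made-up placeholder:)

    # Sketch: poke Mulgara's /sparql/ endpoint, assuming it honours the
    # SPARQL protocol's "query" parameter (the HTTP 400 above suggests
    # a query is all it is missing).
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "query": "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
        # Some stores also want a default graph named explicitly, e.g.
        # "default-graph-uri": "rmi://localhost/server1#test",
    })
    print(urllib.request.urlopen(
        "http://localhost:8080/sparql/?" + params).read())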

Ah, the second link is to the User Interface. Good, I can use a drop-down to look at lots of example queries - nice. I don't understand most of them at the moment, but it's definitely comforting to have examples. They look nothing like SPARQL though... I wonder what this language is? I'm sure it said it does SPARQL... was I wrong?

A quick poke at the HTML shows that it is just POSTing the query text to webui/ExecuteQuery. Looks straightforward to start hacking against too, but I should probably password-protect this somehow! I wonder how that is done... the documentation mentions a 'java.security.policy' field:

java.security.policy (string: URL) - The URL for the security policy file to use. Default: jar:file:/jar_path!/conf/mulgara-rmi.policy

Kinda stumped... I'll investigate that later, but at least there's hope. Just firing off the example queries shows me stuff, though, so I've got something to work with at least.

Jena - http://jena.sourceforge.net/

The front page is pretty clear, even if I don't understand what all those acronyms are. The downloads link takes me to a page with an obvious download link - good. (Oh, and SourceForge, you suck. How many frikkin' mirrors do I have to try to get this file?)

Have to put Jena on pause while Sourceforge sorts its life out.

ARC2 - http://arc.semsol.org/

Frontpage: "Easy RDF and SPARQL for LAMP systems" Nice, I know of LAMP and I particularly like the word Easy. Let's see... Download is easy to find, and tells me straight away I need PHP 4.3+ and MySQL 4.0.4+ *check* Right, now how do I enable PHP for apache again?... Ah, it helps if I install it first... Okay, done. Dropping the folder into my web space... Hmm nothing does anything. From the documentation, it does look like it is geared to providing a PHP library framework for working with its triplestore and RDF. Hang on, SPARQL Endpoint Setup looks like what I want. It wants a database, okay... done, bit of a hassle though.

Hmm, all I get is "Fatal error: Call to undefined function mysql_connect() in /********/arc2/store/ARC2_Store.php on line 53"

Of course - install the PHP libraries that access MySQL (PEBKAC)... done. I also realise I need to set up the store, like the example in "Getting Started"... done (with this), and what does the index page now look like?

[screenshot: the endpoint's index page, listing operations such as 'load', 'insert' and 'select']

Yay! There's, like, SPARQL and stuff... I guess 'load' and 'insert' will help me stick stuff in, and 'select' looks familiar... Well, it seems to be working at least.
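
(Scripting against it from outside looks equally painless. A sketch - the endpoint path below is wherever you put your endpoint script, and the 'output' parameter name is my reading of ARC2's docs, so check before relying on it:)

    # Sketch: query an ARC2 SPARQL endpoint over HTTP. The endpoint
    # path is a placeholder for wherever the endpoint script lives;
    # "query" is the usual SPARQL protocol convention, and "output"
    # is assumed from ARC2's documentation.
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "query": "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
        "output": "json",  # ARC2 can also emit XML and other formats
    })
    print(urllib.request.urlopen(
        "http://localhost/arc2/store/endpoint.php?" + params).read())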

Unfortunately, it looks like the Jena download from sourceforge is in a world of FAIL for now. Maybe I'll look at it next time?

Triplestores in the cloud

Talis Platform - http://www.talis.com/platform/

From the frontpage - "Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage." - pretty much what I wanted to hear, so how do I get to play with it?

There is a Get involved link on the left, which rapidly leads me to see the section: "Develop, play and try out" - n2 developer community seems to be where it wants me to go.

Lots of links on the frontpage, takes a few seconds to spot: "Join - join the n² community to get free developer stores and online support" - free, nice word that. So, I just have to email someone? Okay, I can live with that.

Documentation seems good, lots of choices though, a little hard to spot a single thread to follow to get up to speed, but Guides and Tutorials looks right to get going with. The Kniblet tutorial (whatever a kniblet is) looks the most beginnerish, and it's also very PHP focussed, which is either a good thing or a bad thing depending on the user :)

Commercial triplestores

Openlink Virtuoso - http://virtuoso.openlinksw.com/

Okay, I tried the Download link, but I am pretty confused by what I'm greeted with:

[screenshot: the sprawling grid of Virtuoso download options]

Not sure which one to pick just to try it out; it's late in the day, and my tolerance for all things installable has ended.

-----------------------------------------

Why take the HTTP/web-centric, newbie approach to looking at these?

Answer: in part, I am taking this approach because I have a deep belief that it was only after relational DBs became commoditised - "You want fries with your MySQL database?" - that the dynamic web kicked off. If we want the semantic web to kick off, we need to commoditise it too or, at least, make it very easy for developers to get started. And I mean EASY. A query that I want answered is: "Is there something that fits: 'apt-get install triplestore; r = store('localhost'), r.add(rdf), r.query(blah)'?"
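
(For what it's worth, Python's rdflib already gets surprisingly close to that shape - a sketch, assuming a recent rdflib and substituting a placeholder data URL:)

    # Sketch of the "commodity triplestore" ideal, using rdflib as one
    # example of how close we already are. The data URL is a placeholder;
    # swap in any RDF document you like.
    from rdflib import Graph

    g = Graph()                             # r = store('localhost')
    g.parse("http://example.org/data.rdf")  # r.add(rdf)

    results = g.query("""
        SELECT ?s ?o
        WHERE { ?s ?p ?o }
        LIMIT 10
    """)                                    # r.query(blah)
    for row in results:
        print(row)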

(I am particularly interested to see what happens when Tom Morris's work on Reddy collides with ActiveRecord or activerdf...)

NB: I've short-circuited the discovery of software homepages - imagine I've seen projects stating that they use "XXXXX as a triplestore". I know this will likely mean I've compared apples to oranges but, as a newbie, how would I be expected to know that? "Powered by the Talis Platform" and "Powered by Jena" seem pretty similar on the surface.

Thursday, 13 November 2008

A Fedora/Solr Digital Library for Oxford's 'Forced Migration Online'

(mods:subtitle - Slightly more technical follow-up to the Fedora Hatcheck piece.)

As I have been prompted via email by Phil Cryer (of the Missouri Botanical Garden) to talk more about how this technically works, I thought it would be best to make it a written post, rather than the more limited email response.

Background

Forced Migration Online (FMO) had a proprietary system supporting their document needs. It was originally designed for newspaper holdings and applied that model to encoding the mostly paginated documents that FMO held, such that each part was broken up into paragraphs of text, images, and the locations of all these parts on a page. It even encoded (in its own format) the location of the words on the page when it OCR'd the documents, making per-word highlighting possible. Which is nice.

However, the backend that powered this was over-priced, and FMO wanted to move to a more open, sustainable platform.

Enter the DAMS

(DAMS = Digital Asset Management System)

I have been doing work on trying to make a service out of a base of Fedora-commons and additional 'plugin' services, such as the wonderful Apache Solr and the useful eXist XML DB. The end aim is for departments/users/whoever to requisition a 'store' with a certain quality of service (Solr attached, 50GB+, etc.) but this is not yet an automated process.

The focus for the store is a very clear separation between storage, management, indexing services and distribution: normal filesystems or Sun Honeycomb provide the storage; Fedora-commons provides the management and CRUD; Solr, eXist, Mulgara, Sesame and CouchDB can provide potential index and query services; and distribution is handled pragmatically, caching outgoing traffic and mirroring where necessary.

The FMO 'store'

From discussions with FMO, and from examining the information they held and the way they wished to make use of it, a simple Fedora/Solr store seemed to fulfil what they wanted: a persistent store of items with attachments, and the ability to search the metadata and retrieve results.

Bring in the consultants

FMO hired Aptivate to migrate their data from the proprietary system, in its custom format, to a Fedora/Solr store, retaining as much of the existing functionality as possible.

Some points that I think it is important to impress on people here:
  • In general, software engineer consultants don't understand METS or FOXML.
  • They *really* don't understand the point of disseminators.
  • Having to teach software engineer consultants to do METS/FOXML/bDef's etc is likely an arduous and costly task.
  • Consultants charge lots of money to do things their team doesn't already have the experience to do.
So, my conclusion was not to make these things part of the development at all - to the extent that I might even have forgotten to mention them except in passing. I helped them install their own local store and helped them with the various interfaces and gotchas of the two software packages. By showing them how I use Fedora and Solr in ora.ouls.ox.ac.uk, they were able to hit the ground running.

They began with the REST interface to Fedora and the RESTful interface to Solr. Starting with Fedora's simple put/get REST interface let them concentrate on getting used to the nature of Fedora as an object store. I think they later moved to the SOAP interface, as it better suited their Java background, although I cannot be certain as it wasn't an issue that came up.
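
(To give a flavour of the put/get style: fetching a datastream is just a GET on a predictable URL. A sketch against Fedora 3's REST API - the PID "demo:1" and datastream ID "PDF" are hypothetical, and paths differ between Fedora versions:)

    # Sketch: fetch a datastream over Fedora's REST interface.
    # "demo:1" and "PDF" are made-up identifiers; the
    # /objects/<pid>/datastreams/<dsid>/content path follows
    # Fedora 3's REST API layout.
    import urllib.request

    base = "http://localhost:8080/fedora"
    url = base + "/objects/demo:1/datastreams/PDF/content"

    with urllib.request.urlopen(url) as resp:
        pdf_bytes = resp.read()

    with open("demo-1.pdf", "wb") as f:
        f.write(pdf_bytes)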

Once they had developed the migration scripts to their satisfaction, they asked me to give them a store, which I did (though due to hardware and stupid support issues here, I am afraid to say I held them up on this). They fired off their scripts and moved all the content into the Fedora with a straightforward layout per object (PDF, metadata, fulltext and thumbnail). The metadata is - from what I can see - the same XML metadata as before, very MARCXML in nature, with 'Application_Info' elements having types like 'bl:DC.Title'. If necessary, we will strip out the Dublin Core metadata and put what we can into the DC datastream, but that's not of particular interest to FMO right now.

Fedora/Solr notes

As for the link between Solr and Fedora? They are very loosely coupled: they run in the same Tomcat container for convenience, but aren't linked in any hard way.

I've looked at GSearch, which is great for a homogeneous collection of items, such that they can all be acted on by the same XSLT to produce a suitable record for Solr; but as the metadata was a complete unknown for this project, it wasn't too suitable.

Currently, they have one main route into the Fedora store, and so it isn't hard to simply reindex an item after a change is made, especially for services such as Solr or eXist, which expect to have things change incrementally. I am looking at services such as ActiveMQ for scheduling these index tasks, but more and more I am starting to favour RabbitMQ, which seems more useful while retaining scalability and a very robust nature.

Sending an update to Solr is as simple as an HTTP POST to its /update service, consisting of an XML or JSON packet like <add><doc><field name="id">changeme:1</field><field name="name">John Smith</field>...</doc></add>. It uses a transactional model, such that you can push all the changes and additions into the live index via a commit call, without taking the index offline. To query Solr, all manner of clients exist, and it is built to be very simple to interact with, handling facet queries, filtering and ordering, and delivering the results in XML, JSON, PHP or Python directly. It can even do an XSLT transform of the results on the way out, leading to a trivial way to support OpenSearch, Atom feeds and even HTML blocks for embedding in other sites.
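
(A sketch of that round trip from Python - add, commit, query - assuming a stock Solr on port 8983; the 'name' field is whatever your schema defines, so treat the field names as illustrative:)

    # Sketch of Solr's HTTP API: add a document, commit it into the
    # live index, then query it back. Assumes Solr at localhost:8983
    # with a schema defining "id" and "name" fields.
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr"

    def post_xml(xml):
        req = urllib.request.Request(
            SOLR + "/update",
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        return urllib.request.urlopen(req).read()

    # Add (or replace) a document in the pending transaction...
    post_xml('<add><doc>'
             '<field name="id">changeme:1</field>'
             '<field name="name">John Smith</field>'
             '</doc></add>')

    # ...then push it into the live index without taking Solr offline.
    post_xml('<commit/>')

    # Query it back; wt=json asks for JSON instead of XML results.
    params = urllib.parse.urlencode({"q": 'name:"John Smith"', "wt": "json"})
    print(urllib.request.urlopen(SOLR + "/select?" + params).read())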

Likewise, changing a PDF in Fedora can be done with an HTTP POST as well. Does it need to be any more complicated than that?

Last, but not least, a project to watch closely:

The Fascinator project, funded by ARROW as part of their mini-project scheme, is an Apache Solr front end to the Fedora-commons repository. The goal of the project is to create a simple interface to Fedora that uses a single technology - that's Solr - to handle all browsing, searching and security. Well worth a look, as it seeks to turn this Fedora/Solr pairing into a true appliance, with a simple installer that handles the linkage between the two.

OCLC - viral licence being added to WorldCat data

A very short post, this, as I just wanted to highlight a fantastic piece written by Rob Styles about OCLC's policy changes to WorldCat.

In a nutshell, it seems that OCLC's policy changes are intended to restrict the usage of the data in order to prevent competing services from appearing - competing services such as LibraryThing and the Internet Archive's Open Library. Unfortunately, it also seems that the changes will impinge on the rights of users to collate citation lists in software such as Zotero, EndNote and others. Read Rob's post for a well-researched view on the changes.

Wednesday, 12 November 2008

Useful, interesting, inspiring technology/software that is out there that you might not know about.

(I guess this is more like a filtered link list, but with added comments in case you don't feel like following the links to find out what it's all about. A mix of old but solid links, and a load of tabs that I really should close ;))
  1. Tahoe - http://allmydata.org/~warner/pycon-tahoe.html
    The "Tahoe" project is a distributed filesystem, which safely stores files on multiple machines to protect against hardware failures. Cryptographic tools are used to ensure integrity and confidentiality, and a decentralized architecture minimizes single points of failure. Files can be accessed through a web interface or native system calls (via FUSE). Fine-grained sharing allows individual files or directories to be delegated by passing short URI-like strings through email. Tahoe grids are easy to set up, and can be used by a handful of friends or by a large company for thousands of customers.
  2. CouchDB - http://incubator.apache.org/couchdb/

    Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

    CouchDB is written in Erlang, but can be easily accessed from any environment that provides means to make HTTP requests. There are a multitude of third-party client libraries that make this even easier for a variety of programming languages and environments.

  3. Yahoo Term Extractor - http://developer.yahoo.com/search/content/V1/termExtraction.html
    The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content.
  4. Kea term extractor (SKOS enabled) - http://www.nzdl.org/Kea/
    KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
  5. Piwik - http://piwik.org/

    piwik is an open source (GPL license) web analytics software. It gives interesting reports on your website visitors, your popular pages, the search engines keywords they used, the language they speak… and so much more. piwik aims to be an open source alternative to Google Analytics.

  6. RabbitMQ - http://www.rabbitmq.com/
    RabbitMQ is an Open (Mozilla licenced) implementation of AMQP, the emerging standard for high performance enterprise messaging. Built on top of the Open Telecom Platform (OTP). OTP is used by multiple telecommunications companies to manage switching exchanges for voice calls, VoIP and now video. These systems are designed to never go down and to handle truly vast user loads. And because the systems cannot be taken offline, they have to be very flexible, for instance it must be possible to 'hot deploy' features and fixes on the fly whilst managing a consistent user SLA.
  7. Talis Platform - http://n2.talis.com/wiki/Main_Page
    The Talis Platform provides solid infrastructure for building Semantic Web applications. Delivered as Software as a Service (SaaS), it dramatically reduces the complexity and cost of storing, indexing, searching and augmenting data. It enables applications to be brought to market rapidly with a smaller initial investment. Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage.
  8. The Fascinator - http://ice.usq.edu.au/projects/fascinator/trac
    The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. This contrasts with solutions that use RDF for browsing by ‘collection’, XACML for security and a text indexer for fulltext search, and in some cases relational database tables as well. We wanted to see if taking out some of these layers makes for a fast application which is easy to configure. So far so good.
  9. RDFQuery - http://code.google.com/p/rdfquery/

    This project is for plugins for jQuery that enable you to generate and manipulate RDF within web pages [in javascript]. There are two main aims of this project: 1) to provide a way of querying and manipulating RDF triples within Javascript that is as intuitive as jQuery is for querying and manipulating a DOM, and 2) to provide a way of gleaning RDF from elements within a web page, whether that RDF is represented as RDFa or with a microformat.

  10. eXist - http://exist.sourceforge.net/
    eXist-db is an open source database management system entirely built on XML technology. It stores XML data according to the XML data model and features efficient, index-based XQuery processing.
  11. Evergreen - http://evergreen-ils.org/
    Evergreen is an enterprise-class [Open Source] library automation system that helps library patrons find library materials, and helps libraries manage, catalog, and circulate those materials, no matter how large or complex the libraries.
  12. Apache Solr - http://lucene.apache.org/solr/
    Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features.
  13. GATE - http://gate.ac.uk/
    GATE is a leading toolkit for text-mining. It bills itself as "the Eclipse of Natural Language Engineering, the Lucene of Information Extraction". [NB I have yet to use this, but it has the kind of provenance and userbase that makes me feel okay about sharing this link]
  14. Ubuntu JeOS - http://www.ubuntu.com/products/whatisubuntu/serveredition/jeos
    Ubuntu Server Edition JeOS (pronounced "Juice") is an efficient variant of our server operating system, configured specifically for virtual appliances. Currently available as a CD-ROM ISO for download, JeOS is a specialised installation of Ubuntu Server Edition with a tuned kernel that only contains the base elements needed to run within a virtualized environment.
  15. Synergy2 - http://synergy2.sourceforge.net/
    Synergy lets you easily share a single mouse and keyboard between multiple computers with different operating systems, each with its own display, without special hardware. It's intended for users with multiple computers on their desk since each system uses its own monitor(s).