Thursday, 10 January 2008

Populating a search engine (Apache Solr) from Fedora Objects

I am going to move fairly fast on this one, with the following assumptions about the person(s) reading this:
  • You can at least set up a linux server, with Fedora 2.2 or so using its bundled tomcat without help from me
  • You have python, and have added the libraries indicated at the top of this post.
  • You have pulled the SVN hosted libraries, also from the aforementioned post from:
    • svn co https://orasupport.ouls.ox.ac.uk/archive/archive/lib
Let's get started then after I outline why I am using Solr:

Why Solr

From their front page:

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.
It has open APIs, faceted searching, replication and if you have a look at SOLR-303, the developers have added in distributed (i.e. federated) searching over HTTP between Solr instances, the precise functionality is still being refined (cleaner handling of tiebreakers, refinement queries, etc) but is functioning nonetheless.

I think from now on, I would no more code my own web server than I would code my own search engine service.

Getting and Installing Solr

First step is to grab Solr from this page. Pick a mirror that is close to you, and download the Solr package (I am currently using version 1.2), either as the .zip version or the gzipped tar, .tgz.

Extract the whole archive somewhere on disc and you will see something like this in the apache-solr-1.2 folder:

~/apache-solr-1.2.0$ ls
build.xml CHANGES.txt dist docs example KEYS.txt lib LICENSE.txt NOTICE.txt README.txt src

~/apache-solr-1.2.0$ ls dist
apache-solr-1.2.0.jar apache-solr-1.2.0.war

The easiest thing is to install Solr straight into an instance of Tomcat. One thing to be aware of is that search applications eat RAM and Heap for breakfast, so make sure you install it onto a server with plenty of RAM and it would be wise to increase the amount of Heap space available to the Tomcat instance. This can be done by making sure that the environment variable CATALINA_OPTS is set to "-Xmx512m" or even better "-Xmx1024m". This can be done inside the startup.sh script in your tomcat/bin directory is needed.

One final bit of advice before I point you at the rather good installation docs is that you might want to rename the .war file to match with the URL pathname you desire, as the guide relies on Tomcat automatically unpacking the archive:

So, a war called "apache-solr-1.2.0.war" will result in the final app being accessible at http://tomcat-hostname:8080/apache-solr-1.2.0/. I renamed mine to just solr.war.

Finally, Solr needs a place to keep its configuration files and its indexes. The indexes themselves have the capability to get huge (1Gb is not unheard of) and need somewhere to be stored. The documentation linked to below will refer to this location as 'your solr home' so it would be wise to make sure that this location has the space to expand. (NB this is not the directory inside Tomcat where the application was unbundled.)

Right, the installation instructions:

http://wiki.apache.org/solr/SolrInstall - Basic installation
http://wiki.apache.org/solr/SolrTomcat - Tomcat specific things to bear in mind

Now if you point your browser in the right place (http://tomcat_host:8080/solr/admin perhaps) you should see the admin page of the pre-built example application... which we do not want of course :)

Customising the Solr installation

The file solrconfig.xml is where you can configure the more mundane aspects of Solr - where it stores its indexes, etc. The documentation on what the options mean can be found here: http://wiki.apache.org/solr/SolrConfigXml but really, all you are initially going to change is the index save path, if you change that at all anyway.

The real file, the one that is pretty much crucial to what we are going to do, is the schema.xml. It has umpteen options, but you can get by, by just changing the example schema to hold the fields you want. Here is the wiki documentation on this file to help out and here (should) be a copy of the schema I am currently using

In fact, if you are using MODS or DC as your metadata, you may wish for the time being to just use the same schema as I am, just to progress.

Getting metadata into Solr

In my opinion, Solr has a really good API for remote management and updating the items ('documents' in solr lingo) in its indexes. Each 'document' has a unique id, and consists of one or more name-value pairs, like 'name=author, value=Ben'. To update or add a document to the Solr indexes, you HTTP POST a simple XML document of the following form to the update service, usually found at http://tomcathost:8080/solr/update.

The XML you would post looks like this:

<add>
<doc>
<field name="id">ora:1234</field>
<field name="title">The FuBar and me</field>
.... etc.
</doc>
</add>
Note that the unique identifier field is defined in the schema.xml near the end, by the element "uniqueKey", for example in my Solr instance, its <uniqueKey>id</uniqueKey>

The cunning thing about this is, is that to update a document in Solr, all I have to do is re-send the XML post above, having made any changes I wish. Solr will spot that it already has a document with id of 'ora:1234' and perform an update of the information in the index, rather than adding a second copy of this information.

One thing to note is that the service you are posting to is the update manager. No change is actually made to the indexes themselves until either the update manager is sent an XML package telling it to commit the changes, or the pre-defined (solrconfig.xml) maximum number of documents waiting to be indexed is hit.

Hooking into Solr from your library of choice

Handily, you are very unlikely to have to futz around writing code to connect to Solr, as there are a good number of libraries built for just such purposes - http://wiki.apache.org/solr/IntegratingSolr

As I tend to develop in python, I opted to start with the SolrPython script which I have made a little modification to and included in the libraries as libs/solr.py.

The BasicSolrFeeder class in solrfeeder.py and how it turns the Fedora Object into a Solr document

There are a number of different content types in the Oxford Research Archive, and ideally, there would be a type-specific method for indexing each one into the Solr service. As ORA is an emergent service, 99% of the items in the repository are text-based (journals, theses, book chapters, working/conference papers) and they all use either MODS (dsid: MODS) or simple Dublin Core (dsid: DC) to store their metadata.

I have written a python object called BasicSolrFeeder (inside the libs/solrfeeder.py file) which performs a certain script sequence of functions on a fedora object given its pid. (BasicSolrFeeder.add_pid in solrfeeder.py is the method I am describing.)

Using an array to hold all the resultant "<field name='foo'>bar</field>" strings
  • Get the content type of the object (from objContentModel field in the FOXML)
    • --> <field name="content_type">xxxxxxxx</field>
  • Get the datastream with id 'MODS' and if it is XML, pass it through the xsl sheet at 'xsl/mods2solr.xsl'
    • --> Lots of <field.... lines
  • If there is no MODS datastream, then default to getting the DC, and pass that through 'xsl/dc2solr.xsl'
    • --> Lots of <field.... lines
  • As collections are defined in a bottom-up manner in the RELS-EXT, applying an xsl transformation to the RELS-EXT datastream, will yield the collections that this object is in.
    • --> Zero or more < field name="collection">.... lines
    • (NB there is some migrationary code here that will also deduce the content type, as certain collections are being used to group types of content.)
  • Finally, if there is a datastream with id of FULLTEXT, this is loaded in as plain-text, and added to the 'text' field in Solr. This is how the searching the text of an object functions.
    • (NB There are other services that extract the text from the binary attachments to an object, which collate these texts and adds them as a datastream called FULLTEXT.)
  • This list of fields is then simply POSTed to the Solr update service, after which a commit may or may not be called.
So, if you have a fedora repository which uses either MODS or DC, and the collections are bottom-up, then the code should just work for you. You may need to tinker with the xsl stylesheets in the xsl/ directory to match what you want, but essentially it should work.

Here's an example script which will try to add a range of pids to a Solr instance:


from lib.solrfeeder import BasicSolrFeeder

sf = BasicSolrFeeder(fedora_url='http://ora.ouls.ox.ac.uk:8080/fedora',
fedora_version="2.0", # Supports either 2.0 or 2.2
solr_base="/solr", # or whatever it is for your solr app
solr_url='orasupport.ouls.ox.ac.uk:8080')
# Point at the tomcat instance for solr

# Now to scan through a range of pids, ora:1500 to ora:1550:

namespace = "ora"
start = "1500"
end = "1550"

# A simple hash to catch all the responses
responses = {}
for pid in xrange(start, end+1):
responses[pid] = sf.add_pid('%s:%s' % (namespace,pid), commit=False)

# Commit the batch afterwards to improve performance
sf.commit()

# Temporary variables to better report what went into Solr and
# what didn't
passed = []
failed = []

for key in responses:
if responses[key]:
passed.append('%s:%s' % (namespace, key))
else:
failed.append('%s:%s' % (namespace, key))
if passed:
print "Pids %s went in successfully." % passed
if failed:
print "Pids %s did not go in successfully." % failed


Adding security to Solr updates


Simple to add and update things in Solr, isn't it? A little too simple though, as by default anyone can do it. Let's add some authentication to the process. Solr does not concern itself with authenticating requests, and I think that is the right decision. The authentication should be either enforced by Tomcat, oor by some middleware.

The easiest mechanism is to use Tomcat's basic authentication scheme to password protect the solr/update url, to stop abuse by 3rd parties. It's pretty easy to do, and a quick google gives me this page - http://www.onjava.com/pub/a/onjava/2003/06/25/tomcat_tips.html - with 10 tips on running Tomcat. While most of the tips make for good reading, it is the 5th tip, about adding authentication to you Tomcat app, that is most interesting to us now.

Assuming that the password protection has been added, the script above needs a little change. The BasicSolrFeeder line needs to have two additional keywords, solr_user and solr_password and the rest of it should run as normal.

e.g.


sf = BasicSolrFeeder(fedora_url='http://ora.ouls.ox.ac.uk:8080/fedora',
fedora_version="2.0", # Supports either 2.0 or 2.2
solr_base="/solr", # or whatever it is for your solr app
solr_username="your_username",
solr_password="your_password",
solr_url='orasupport.ouls.ox.ac.uk:8080')
# Point at the tomcat instance for solr


Hopefully, this should be enough to get people to think about using Solr with Fedora, as Solr is a very, very powerful and easily embeddable search service. It is even possible to write a client in javascript to perform searches, as can be seen from the javascript search boxes in http://ora.ouls.ox.ac.uk/access/adv_search.php

I have purposefully left out how to format the queries from this post, but if people yell enough, I'll add some guidelines, more than I provide at http://ora.ouls.ox.ac.uk/access/search_help.php anyway

1 comment:

Claire said...

Ben, Thanks for your tip about Solr last week at the RepoFringe. I have indexed our Repository and am now looking at how to do my integration. I had some utf-8 issues with Python, but have managed to add documents using python and java.

Claire