Less Talk, More Code

Thursday, 8 May 2008

Internal object relationships - in the context of Fedora and Solr indexing.

Peter Sefton wrote to me recently and noted that the basic Solr indexer I've written still uses the rather poor convention that the datastream with a DSID of FULLTEXT contains all the extracted text from the other binary datastreams. He went on to suggest that the connection could instead be expressed as an inter-datastream relationship, exposed through the OAI-ORE resource map.

This is exactly my intention and I'll just write up the type of relationships that I am using for these purposes and also how I am serialising these with the Fedora software.

Peter was right when he said that the relationship between PDF/DOC/binary and its text version can be expressed in ORE - I am planning that very thing. While Fedora 3 doesn't seem to have plans for the RELS-INT datastream, this is the datastream ID in which I intend to store the necessary relationships as RDF/XML, as the name helps convey some of the intent of the datastream - the INTernal RELationshipS.

The predicate I'm using to bind a binary datastream, such as a pdf/doc/etc to the text that is extracted and stored alongside it is the dcterms:hasFormat property - the simple description of what this property asserts is:

dcterms:hasFormat - A related resource that is substantially the same as the pre-existing described resource, but in another format.

It also has the bonus that if you have a pdf of a .doc, you can bind them together with the same property too, e.g. (in n-triples for clarity)

<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/PDF2TEXT>
<info:fedora/thing:1234/PDF2> <dcterms:hasFormat> <info:fedora/thing:1234/DOC2>

and just to illustrate an interesting use case:

<info:fedora/thing:1234/PDF> <dcterms:hasFormat> <info:fedora/someotherthing:123456/DOC>
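The lines above use dcterms: as shorthand. Strict N-Triples wants the full predicate URI and a terminating full stop, so here is a minimal sketch of a serialiser for it. The helper names ntriple and has_format are made up for illustration and are not part of Fedora or of my indexer.

```python
# Namespace for the Dublin Core terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"


def ntriple(subject, predicate, obj):
    """Serialise one triple of URIs as a single N-Triples line."""
    return "<%s> <%s> <%s> ." % (subject, predicate, obj)


def has_format(source, derived):
    """State that `derived` is the same content as `source` in another format."""
    return ntriple(source, DCTERMS + "hasFormat", derived)


print(has_format("info:fedora/thing:1234/PDF2",
                 "info:fedora/thing:1234/PDF2TEXT"))
```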

Another predicate I am using is the dcterms:conformsTo - again, the simple description is:

dcterms:conformsTo - An established standard to which the described resource conforms.

This I am using to indicate whether a datastream conforms to a metadata standard, such as MODS, oai_dc, etc. It could equally be used to express the file format for a binary ds.

E.g.

<info:fedora/thing:1234/DC> <dcterms:conformsTo> <http://namespace.for.whatever.schema/or/standard/used>

I think it would be useful to discuss what we use to uniquely identify a given standard, just so that it is clear. Should we use the URL that resolves to the .xsd for the standard, or do we use the namespace URI?

(The end idea being that anything in the RELS-INT is expressed in the OAI-ORE, and consequently, the easiest way to include extra information for the ORE is to put it into the RELS-INT)

So, to illustrate how this can be used to index an object held in Fedora:

1) get the list of dsids and mimetypes via the API call of ListDatastreams
2) get the RELS-INT from the object itself and create a graph of the triples.
2b) [Get the RELS-INT from the content model for the object and munge the two graphs together. The content model RELS-INT could hold all the common relationships for a given type, e.g. Image to Thumbnail, etc. This will help performance by enabling caching, etc.]
3) Resolve and GET the metadata ds, as indicated by the RELS-INT and put it into a suitable configured Solr (1 to 1 relationship - index doc to object). The decision of which metadata to use should be left up to the indexer. The Fedora object should only indicate (via hasFormat) which are the derived metadata and which is the canonical.
4) Grab the text/plain alternative formats and put them into a (second) solr (1 to 1 - index doc to parent datastream e.g. info:fedora/t:1234/PDF2, etc)
5) Use the distributed index query of solr to query both, or each individually, as the portal requires.
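As a rough sketch of how steps 1, 2 and 4 fit together: assume the datastream list has already been fetched into a dict of URI to mimetype, and the RELS-INT has already been parsed into (subject, predicate, object) tuples. The function name text_alternatives and the sample data are hypothetical, and the calls to Fedora and Solr are left out entirely.

```python
HAS_FORMAT = "http://purl.org/dc/terms/hasFormat"


def text_alternatives(datastreams, triples):
    """Map each parent datastream URI to its text/plain derivatives.

    datastreams: dict of datastream URI -> mimetype (step 1).
    triples: iterable of (subject, predicate, object) URIs (step 2).
    """
    found = {}
    for subject, predicate, obj in triples:
        if predicate == HAS_FORMAT and datastreams.get(obj) == "text/plain":
            found.setdefault(subject, []).append(obj)
    return found


# Made-up object, reusing the identifiers from the examples above.
datastreams = {
    "info:fedora/thing:1234/PDF2": "application/pdf",
    "info:fedora/thing:1234/PDF2TEXT": "text/plain",
    "info:fedora/thing:1234/DOC2": "application/msword",
}
triples = [
    ("info:fedora/thing:1234/PDF2", HAS_FORMAT, "info:fedora/thing:1234/PDF2TEXT"),
    ("info:fedora/thing:1234/PDF2", HAS_FORMAT, "info:fedora/thing:1234/DOC2"),
]

# Each (parent, text datastream) pair would become one document in the second Solr.
for parent, texts in text_alternatives(datastreams, triples).items():
    print(parent, "->", texts)
```

The DOC2 datastream is skipped because it is an alternative format but not a text/plain one, which is the distinction step 4 relies on.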

Wednesday, 7 May 2008

python-xml module deprecated in Hardy/Debian

If you've recently installed the shiny new Hardy Heron release of Ubuntu, or updated to the latest Debian, you may be surprised that a few old xml techniques in python no longer work. For example, the following no longer exist:

from xml import xpath
from xml.dom.ext import Anything_Really

For more details and a temporary workaround, see: http://www.aigarius.com/blog/2008/05/06/ubuntu-removing-xml-from-python/

Now, the reasons for the deprecation I can understand - the package has seen no real updates for some time, and a subset of its functionality is now part of the core python system. But no real warning? That's annoying.

It does mean that I'll have to update my (admittedly oldschool) xpath-dependent functions to use elementtree, which I had intended to do, but now my hand is forced :)
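For anyone in the same boat, here is a small sketch of the kind of lookup that used to go through xml.xpath, done with the ElementTree that ships in the standard library. The sample record is made up. Note that ElementTree only understands a limited XPath subset, and the attribute predicate in the second query needs ElementTree 1.3 (Python 2.7 or later).

```python
import xml.etree.ElementTree as ET

# A made-up record, only here to have something to query.
SAMPLE = """<record>
  <title>Example item</title>
  <creator role="author">A. Person</creator>
  <creator role="editor">B. Person</creator>
</record>"""

root = ET.fromstring(SAMPLE)

# Roughly what the XPath /record/creator/text() would have returned.
creators = [el.text for el in root.findall("creator")]

# Attribute tests are part of the supported XPath subset.
authors = [el.text for el in root.findall("creator[@role='author']")]

print(creators)
print(authors)
```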

Friday, 18 April 2008

Distributed objects - how to cope with objects scattered across multiple Fedoras

I was asked this question by Chris Wilper a couple of weeks ago:

From http://blogs.sun.com/georg/entry/open_respository_2008_day_3
"Scalability : objects can be placed in any object store on the
network, and located via their object meta data. This means that
scaling over multiple Fedora instances is a no brainer."

Sounds cool. How'd you do resolution? (e.g. a request comes in,
which repository is it in?)

At the time, I had a good number of solutions I had tried, with one solution waiting on something to become more stable for it to work. They all had their plus sides and their minuses. They generally followed the tried and tested method of either a centralised source, or having a DNS style system.

What I didn't write to him about was an idea that's been fermenting in my mind for a little while now but which wasn't thought through enough to explain at that time.

De-centralised database for Fedora object URIs

The base premise is to use something called a distributed hash table, or DHT, to hold the link between URI and base Fedora URL. Now, as DHTs are kinda new and tend to be found in 2 main fields - research projects and trackerless BitTorrent - I'll just write a brief summary of what they are.

Distributed Hash Tables (DHT)

From Wikipedia:

Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (name, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given name. Responsibility for maintaining the mapping from names to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

DHTs form an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent (with extensions), eDonkey network, YaCy, and the Coral Content Distribution Network.

(Emphasis my own)

As other, much more qualified people have written about the underlying algorithms that power DHTs, I will simply link to the articles that I have found illuminating. (A DHT can also be treated as a black-box technology, if you wish.)

Distributed Hash Tables, Part I October 1st, 2003 by Brandon Wiley, Linux Journal

Wikipedia: Distributed Hash Tables

OpenDHT: User's Guide

Let's illustrate how having a DHT of URI-to-Fedora-Server pairs will help us, by skipping ahead and imagining that we already have the situation that each Fedora server has a DHT node service of its own - just like a peer2peer filesharing program, it maintains a list of value pairs for the items it holds.

Let's now imagine 3 commands that a Fedora instance can use to work with its own DHT node - get, put and remove:

get key - gets the values associated with a given key
put key value - puts the given key-value pair into the DHT
remove key value - removes the given key-value pair from the DHT

So, as a new item is allocated to a given Fedora, it can add a pair to the DHT, the key being the item's URI <info:fedora/ns:id>, and the value being the base URL for that Fedora, e.g. <http://host:8080/fedora>


$ ./put.py info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f http://archive.sers.ox.ac.uk:8080/fedora
Success

$ ./get.py info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f 
http://archive.sers.ox.ac.uk:8080/fedora
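To make the semantics of the three commands concrete, here is a toy, single-process stand-in for a DHT node. It is not a DHT at all (nothing is distributed, and the class name ToyDHT is made up), but it shows the behaviour I am assuming: a key can hold several values, and remove takes away one specific pair.

```python
class ToyDHT:
    """In-memory stand-in for a DHT node: each key maps to a set of values."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        """Store the key-value pair."""
        self._table.setdefault(key, set()).add(value)

    def get(self, key):
        """Return all values stored under key, sorted for stable output."""
        return sorted(self._table.get(key, set()))

    def remove(self, key, value):
        """Remove one key-value pair; unknown pairs are ignored."""
        values = self._table.get(key)
        if values is not None:
            values.discard(value)
            if not values:
                del self._table[key]


dht = ToyDHT()
uri = "info:fedora/uuid:00e41229-1c9f-4c1d-ac3a-b51d34bbbe8f"
dht.put(uri, "http://archive.sers.ox.ac.uk:8080/fedora")
print(dht.get(uri))
dht.remove(uri, "http://archive.sers.ox.ac.uk:8080/fedora")
print(dht.get(uri))
```

Keeping a set of values per key means the same object could be advertised by more than one Fedora (a mirror, say) without one entry overwriting the other.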

Specific implementation notes

The Bamboo DHT implementation is a very useful implementation of a DHT system, and has a good amount of helpful documentation. It is also the basis for a good test-bed service, the OpenDHT service mentioned earlier.

In fact, the three commands illustrated above have real, live python implementations which are presented on the OpenDHT site - get.py, put.py, and rm.py. The underlying mechanics of the protocol are just simple XMLRPC, so most languages have solid libraries for interfacing with the API.

I see this service hooking into Fedora by using a listening service on the ActiveMQ message queue - put'ing the URI/URL hash pair when items are added, and rm'ing when purged. The Fedora and the Bamboo service should be booted and shut down together, adding or removing whole sets of hashes to or from the DHT, accurately reflecting the accessibility of the Fedora item.

Now that we have a DHT, why not...

Another beneficial use of the hash table may be to encode certain information, such as the template URL for a HTML splash page (should one exist) - perhaps the template style as used by OpenSearch 1.1 - e.g. "http://archive.sers.ox.ac.uk:5000/resolve/{uri}" (Note: I have added this resolving service to handle both info:fedora/ns:id and info:fedora/ns:id/dsid type URIs - the first leads to the splash page and the second form redirects to the download for that given datastream)
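A minimal sketch of filling in such a template. The function name expand_template is made up, and percent-encoding the whole URI is my assumption about how a resolver would want to receive it; a resolver that expects the raw info:fedora/... form would need the quote() call dropped.

```python
from urllib.parse import quote


def expand_template(template, uri):
    """Fill the {uri} slot of an OpenSearch-style URL template."""
    # Percent-encode everything, including ':' and '/', so the URI
    # travels as a single path segment. Whether the resolver expects
    # this encoding is an assumption.
    return template.replace("{uri}", quote(uri, safe=""))


print(expand_template("http://archive.sers.ox.ac.uk:5000/resolve/{uri}",
                      "info:fedora/ns:id/dsid"))
```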

Important final note

It is very important in distributed environments that you have UUIDs (which may or may not involve the numerical UUIDs I promote the use of) - the important part is that each Fedora identifier is unique across the whole set of Fedora instances. As the only issue with joining these URI lookup tables across institutional boundaries is political, it may be a good thing to adopt a consistent and bullet-proof mechanism for ensuring that your id system is not going to collide with someone else's.
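For what it's worth, minting a numerical UUID-based identifier takes a couple of lines with python's uuid module. The function name new_fedora_uri is made up, and the 'uuid' namespace is simply the one used in the put/get example earlier.

```python
import uuid


def new_fedora_uri():
    """Mint an identifier of the form info:fedora/uuid:<random UUID>."""
    return "info:fedora/uuid:%s" % uuid.uuid4()


print(new_fedora_uri())
```

Random (version 4) UUIDs need no coordination between institutions, which is the property that matters when the lookup tables are joined.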

Thursday, 17 April 2008

Ditching the DB-based blog for a semantic one

Why Yet-another-blog-engine?

Well, blog engines tend to do the same things; their functionality is derived from simple views on a relational DB. To a large extent, I think that this RDB reliance has shaped the scope of what you can do with a blog and also I really feel it has guided how the blog (and related publishing) technology has developed.

Things like blog export, and interlinking of blogs and the bloggers who power the system are seen as extras, features added as plugins or as an additional service - this needs to change! Let's see what naturally happens when we try to build a system with a more interesting backend.

So, what to replace the RDB with?

Simply put, the data which a blog needs to function normally can be modelled in an objectstore, by linking items together by RDF predicates. Specifically, there is a namespace created by the SIOC project, aimed at defining social networks and their inter-linking in a semantic way - http://www.sioc-project.org/. I have a strong hunch that by using this work, a whole load of extra possibilities will emerge.

(Aside from the obvious benefits of simple export and reuse of objects, being able to make a single comment on more than one blog post from more than one blog, and so on.)

Okay... what's the plan?

So, my coding bias for objectstore and framework language is FedoraCommons and python, so no surprise there. It also means that I'll be reusing my code, so each object will have an OAI-ORE aggregation and good search integration (via Apache Solr).

Modelling the blog:

Luckily, the good folks at the SIOC project have done a good load of the work for me, and having read through their work, I can say that I see no problem with it for my purposes. This means that using their namespace (http://rdfs.org/sioc/ns#), I can adopt the structure of classes, helpfully illustrated here

The first class objects from or subclassed from SIOC therefore will be as follows (a first class object has a 1-to-1 parity with the underlying Fedora objects):

User [contains or links to FOAF record]
Post [text and zero or more attachments]
Forum(Blog) [Dublin Core]
Post(Comment) [text]
Site [Dublin Core]

Other first class objects:

Link [DC record] and Petition [ Later ;) ]

Link speaks for itself. If a post contains a link, the link is promoted to an object and the post will connect to the link object. If someone else uses that link, the previous object is reused.

Petition is a social experiment which I will go into later ;)

Now for the behaviour of the blog - I am envisioning an academically focussed blog, so the social network, persistence, trust and discourse are important features to consider.

Users - accounts and blogs

Frankly, I am tired of writing authentication systems. So is everyone else. Users are tired of having an account per site too. Thank god for OpenID then :)

Right, so we have a working, live system for authenticating people, but what about authorising? Authenticate to comment, that's self-explanatory. But this is where we can do some interesting things:

The blog engine is 'seeded' by a blog author or authors, most likely the same people that installed the engine. Borrowing a common idea, these seeds can invite other people to have a publishing account. This is done by an indication of trust - at a technical level <uri-inviter> <trust:trust10> <uri-invitee>, with the <uri-invitee> being the User object created to correspond to a given OpenID. The next time they log in, the ability to create a blog should be apparent.

(trust namespace: http://trust.mindswap.org/ont/trust.owl)

Why not <foaf:knows>? Because that is saved for later :) <trust:trust10> is a predicate intended as a way for someone to fully vouch for someone else - you may know people, but that doesn't mean they should automatically gain blogging rights just because you have those rights.

Any post (post, comment or link) can be tagged as interesting to a User, via the <foaf:interest> predicate - declaring "I am interested in this thing". The payoff to the user is that their page (every User is an object, remember) will display the things they have marked as interesting. If a User declares that another User's blog is 'interesting' then all the latter User's posts will be accessible from there as well.

There are two forms of free-text tags; a trusted tag and a normal tag. Trusted tags are those placed on a Post, Blog or User, by the author/owner of that object - a statement about what the author feels is the subject of the object. An author can also tag themselves, and this is to give extra indication about what they normally blog or comment on, in addition to the tags they've put on their own items.

A normal tag can be placed by any authenticated User on any Post or Blog. Expected functionality, really.

I haven't tripped across a suitable ontology for these two, and I'd really like to use an existing one so feel free to add a comment if you have an idea of one.

Now, I mentioned a curious thing called a Petition - this is a second method to gain trust in the social network. A User can make a Petition, stating a brief summary of what they'd like to write about, and then tag the Petition with its subjects.

This is where it gets semantically fun - a Petition will be visible to existing blog posters with the following filters available:

Show Petitions that have tags in common with mine, from Users I trust
Show Petitions that have tags in common with mine, from Users I know
Show Petitions that have tags in common with mine, from Users my friends know
Show Petitions from Users who I've shown interest in
Show Petitions from Users who my friends show interest in
Show Petitions that have tags in common with mine
Show most recent Petitions

It doesn't take much more time to see that the information already in the triplestore can be used to create some really interesting filters.
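As an illustration, the first filter in that list could look something like this. In the real system it would be a query against the triplestore; here plain dicts and sets stand in for the results, and all the names and data are made up.

```python
def matching_petitions(my_tags, trusted, petitions):
    """Petitions from trusted Users that share at least one tag with me."""
    return [
        p for p in petitions
        if p["user"] in trusted and my_tags & p["tags"]
    ]


# Made-up data standing in for what the triplestore would return.
petitions = [
    {"user": "alice", "tags": {"fedora", "solr"}},
    {"user": "bob", "tags": {"cooking"}},
    {"user": "carol", "tags": {"solr"}},
]

hits = matching_petitions({"solr", "rdf"}, {"alice", "bob"}, petitions)
print([p["user"] for p in hits])
```

Swapping the set of trusted Users for "Users I know" or "Users my friends know" gives the next two filters without touching the function.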

Now, a User can then decide to place a level of trust in a given Petitioner. The actual mechanics of what is required to elevate the trust placed in the Petitioner are up to the system installer, but a few interesting things can be used: a requirement that the combined <trust:trust> given to a User adds up to a certain level, or maybe a tiered system (2 trust9's are equivalent to 1 trust10, etc).
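A sketch of one possible tiered scheme, purely as an example of what an installer might configure. The weights and threshold are hypothetical, chosen only so that two trust9 statements add up to one trust10.

```python
# Hypothetical weights: two trust9 statements count as one trust10.
WEIGHTS = {"trust10": 2, "trust9": 1}
THRESHOLD = 2  # equivalent to one full trust10


def may_publish(statements):
    """Decide whether the trust given to one User is enough to publish.

    statements: list of trust predicate names received by that User.
    Levels missing from WEIGHTS contribute nothing.
    """
    total = sum(WEIGHTS.get(level, 0) for level in statements)
    return total >= THRESHOLD


print(may_publish(["trust9", "trust9"]))
print(may_publish(["trust9"]))
```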

The bottom line is that this trust system means that there is no requirement for a super User as the network should be self-regulating - take the trust away, and the Petitioner loses their Blog. (It also doesn't mean that having a Super user is a bad idea!)

Posts and Posting

I hate the word blog... I've been using it as a crutch, as what I'd like this to be is a site to have a voice on. By using semantic relationships, it is quite possible to view it as you might a blog, but you might also view it like a forum, with posts and threaded comments. The underlying information and connections are the same, but the way you can view and present this becomes a lot more flexible.

So when I say Post, I mean it in a twitter/blogger/thread-reply kind of way ;)

The Post objects have a summary (200 char limit) and an optional body (a blog 'post') - no title! They can be tagged of course and the post can also hold attachments for download, or embedding here or elsewhere. (All resources have dereferenceable URIs, and so can be linked to directly)

So a Post can be a 'tweet', or a 'blog' post - it's all the same thing. However, on the user's jumpoff page, the summaries are listed.
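In python terms, the shape I have in mind for a Post is roughly the following. This is only a sketch of the data model described above, not the Fedora object itself; the field names and the is_tweet helper are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUMMARY_LIMIT = 200


@dataclass
class Post:
    summary: str                      # always present, shown on the jumpoff page
    body: Optional[str] = None        # optional long-form text
    tags: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # URIs of attached files

    def __post_init__(self):
        # Enforce the 200 character limit on the summary.
        if len(self.summary) > SUMMARY_LIMIT:
            raise ValueError("summary is limited to %d characters" % SUMMARY_LIMIT)

    @property
    def is_tweet(self):
        """A Post with no body is just a short message."""
        return self.body is None


print(Post("Indexing RELS-INT with Solr").is_tweet)
```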

I'll have a go at putting something together, to see what works and what doesn't, so watch this space.

Wednesday, 16 April 2008

Release of alpha-quality web interface framework for Fedora

Just a heads up that I have uploaded the code for the web interface framework for Fedora - the same framework I used at Open Repositories 2008, but cleaned up a bit. (In respect for that, it ships with the same graphics and blurb from the OR08 EPrints repository)

Project Home:
http://code.google.com/p/python-fedoracommons-webarchive/

I am adding documentation as I go along, and it is at an early preview level, i.e. if you can get it up and running (which isn't too taxing), have fun. Please raise any issues or problems on the Google code issue tracker.

One thing I'd like to point out, is that this is very much a framework - you tailor it to how you need it. What it provides is:

- Items can have a content-type, and this predefines how the item is presented - see http://code.google.com/p/python-fedoracommons-webarchive/source/browse/trunk/archive/lib/cmodel_mapper.py and http://code.google.com/p/python-fedoracommons-webarchive/source/browse/trunk/archive/lib/app_globals.py for more in depth details.

- Items support pingback, and trackback out of the box.
- See the URL structures here for more goodness: http://code.google.com/p/python-fedoracommons-webarchive/wiki/DefaultURLScheme

- Oh and the project is most definitely a WIP, so bear with me. It may not all be there at the moment, but I work fast :)

Sunday, 24 February 2008

Creating a web application from scratch, backed by Fedora-Commons and Apache Solr (Part 1)

(Part 1 will detail the installation and setup of the basic system, services and libraries needed for a Fedora-Commons/Apache Solr backed web 'service'. Subsequent parts will deal with configuring and feeding the search engine, and constructing a web interface to handle article/blog/comment posting and using OpenID for authentication.)

Step 1 - Get a nice clean linux distribution focused on use on servers.

I am using Ubuntu JeOS, as it will be hosted on a VMware virtual machine. This also means that the walkthrough that follows will be very debian specific.

See the following pages for help:

This page aims at documenting how to create virtual appliance using Ubuntu Server Edition's JeOS.

This page has snapshots of the entire install process and also information on how to set up a LAMP stack but if you want to follow this guide, don't install any of the applications it asks. It is only useful for our purposes up until the first reboot, before the guide does things we don't need, like activating the root user account, or adding PHP.

Note that the user name I will be using in this guide is simply 'user' and I will refer to this as either 'user' or 'username'. Replace this with whatever the username was that you chose during installation.

Step 2 - install it and set up networking and firewalls

Now, firewalls aren't as critical as you might think, especially if you have installed something like JeOS which has nothing really running as default. But there are some very handy tricks to help stop abuse from malicious script kiddies.

(For example, my favorite two-liner, which I always add to the iptables firewall script, rate-limits ssh access to 3 attempts every 180 seconds. Note that the following doesn't immediately ACCEPT the connection; it passes it into an iptables chain called 'TRUSTED', which deals with what may be genuine attempts at access. If you wanted to just accept it, change TRUSTED to ACCEPT.)


# Rate limit SSH attempts.
iptables -A INPUT -p tcp -m tcp --dport ssh -m state --state NEW \
-m recent --hitcount 3 --seconds 180 --update -j DROP

# Allow first attempts through
iptables -A INPUT -p tcp -m tcp --dport ssh -m state --state NEW \
-m recent --set -j TRUSTED

NB Something along these lines would be fine to rate-limit upload attempts to Fedora as well.
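If it helps to see what the two rules amount to, here is a toy python model of the behaviour. It is a simplification based on my reading of the iptables 'recent' module, not a replacement for the firewall, and the class name RecentLimiter is made up.

```python
from collections import defaultdict

HITCOUNT = 3   # matches --hitcount 3
SECONDS = 180  # matches --seconds 180


class RecentLimiter:
    """Toy model of the two rules: DROP once a source has HITCOUNT recent hits."""

    def __init__(self):
        self._stamps = defaultdict(list)

    def allow(self, source, now):
        """Return True if a new connection from `source` at time `now` gets through."""
        recent = [t for t in self._stamps[source] if now - t < SECONDS]
        # Both rules record the attempt: --set on the way in, --update on a DROP,
        # so hammering the port keeps the block alive.
        recent.append(now)
        self._stamps[source] = recent
        return len(recent) <= HITCOUNT


limiter = RecentLimiter()
print([limiter.allow("10.0.0.1", t) for t in (0, 1, 2, 3)])
```

The fourth attempt inside the window is dropped, and because dropped attempts are recorded too, a script that keeps retrying stays locked out until it goes quiet for the full 180 seconds.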

[Edit: seems there was something awry with the following script - as illustrated here. Thanks to the Rubric team. I've added their fix :) But their fix may not be enough, so I'd advise not applying this until everything is installed and working correctly! I'll test it out as soon as I can. ]
Full example firewall script, including this snippet and port opening for tomcat, http/https, and ssh. (As this is from my home server, it'll include a few other services that you may not need for this walkthrough).

If you wish to use the SSL connection to Fedora on port 8443, remember to open that port as well!

Step 3 - Install all updates and get the basic applications we will need

Get a root prompt on a commandline:

e.g.

[user@server]$ sudo -s
[root@server]#

Then make sure that a) you can connect to the internet and b) the server is up to date:

[root@server]# apt-get update
[..... lots of lines of stuff ....]
Hit http://gb.archive.ubuntu.com gutsy-updates/multiverse Sources [1708B]
Fetched 278kB in 0s (319kB/s)
Reading package lists... Done

[root@server]# apt-get upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
[ whatever packages that need to be upgraded will be listed here]
X upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 2497kB of archives.
After unpacking 16.4kB of additional disk space will be used.
Do you want to continue [Y/n]? Y
[ .... lots of lines of packages installing hopefully without error ..... ]

Hopefully, once those are installed, your machine will be up to date. Now to install all the necessary packages:

[root@server]# apt-get install build-essential python-dev mysql-server sun-java5-jdk openssh-server python-mysqldb python-pysqlite2

Just let those install, but be aware that certain packages should ask you for information during installation, such as the required default root password for MySQL and a prompt will ask you if you agree with the Sun licence for Java.

Also, note that at this point, you should be able to SSH into the machine to continue working on it. It makes it a lot easier to cut and paste from guides if you do!

Step 4 - install some python libraries using Easy Install

Go here: http://peak.telecommunity.com/DevCenter/EasyInstall and download and install the ez_setup.py script as it shows. Simply running it as root should do the trick:

[root@server]# wget http://peak.telecommunity.com/dist/ez_setup.py
[root@server]# python ez_setup.py

[Edit: I did originally write the first part of this for Fedora 2.2, but since the REST api for Fedora 3.0 looks pretty damn usable, I've re-written this guide for version 3. Removing the installation/configuration for the SOAP client drastically reduces the library dependencies this needs.]

So, we need to install some python libraries for later, iCalendar format (vobject) , OpenID consumer library (python-openid), and also install other miscellaneous things, such as a library that can generate UUIDs and a very good web framework called Pylons:

[root@server]# easy_install python-openid
[root@server]# easy_install uuid
[root@server]# easy_install vobject
[root@server]# easy_install pylons

(NB we have already installed the python libraries to interact with MySQL and SQLite with the apt-get install command earlier. It is best to install the latest stable packages for the items above, which is why they are installed through easy_install.)

Step 5 - Get Fedora-Commons and Apache Solr.

Either just blindly download the packages I tell you to:

[user@server]$ wget http://downloads.sourceforge.net/fedora-commons/fedora-3.0b1-installer.jar
[user@server]$ wget http://apache.rmplc.co.uk/lucene/solr/1.2/apache-solr-1.2.0.tgz

Or better, download them from the homepages of the projects themselves, using links2

Install a text-based web-browser and browse and download the packages that way (A manual page for links2):

[root@server]# apt-get install links2

When that has finished installing, you can drop out of your root session (press Ctrl+D, or type 'exit') and download the relevant applications:

[root@server]# exit
[user@server]$ links2

Don't be alarmed, it's meant to blank the screen! Press the letter 'g' and a location bar prompt will appear.

First let's go to the Fedora commons site so type in 'http://www.fedora-commons.org/' and press enter. Use the cursor keys to go down and click (press return) on the 'Download Fedora 3.0 beta 1' link (24/02/2008). Scroll down a bit, and you should see a link to download the installer. You will be presented with the 'save jar file' dialog, so save the fedora installer jar file.

Now, let's get the search appliance, Solr. Go to 'http://lucene.apache.org/solr/' and click on the 'download' link. Choose a mirror, go into the 1.2/ folder on that mirror and download the 'apache-solr-1.2.0.tgz' file. Press 'q' to quit links2.

Step 6 - Make the server environment ready for Fedora Commons

If you now list the home directory, you should see something like this:

[user@server]:~$ ls
apache-solr-1.2.0.tgz doc ez_setup.py fedora-3.0b1-installer.jar

We will need the following:

A directory to store Fedora's root directory (config files, logs, libraries, and default Tomcat instance)
A mysql database and account for Fedora to use
(Optional) A large filesystem to hold Fedora's data storage directory

Point 1 then - I chose to store the Fedora root directory at /opt/fedora30b1 -

[user@server]$ sudo -s
[root@server]# mkdir /opt/fedora30b1

Let the user own it: (Remember to change 'user' to whatever your user is actually called!)

[root@server]# chown user:user /opt/fedora30b1

(Optional) And to aid upgrading, create a symlink at /opt/fedora to this folder:

[root@server]# ln -s /opt/fedora30b1 /opt/fedora

Fedora needs certain environment variables to be set up now, FEDORA_HOME and JAVA_HOME at the very least. Open up the system wide profile (/etc/profile) and add them in there. (I'm using the nano editor, vim is also available from a default JeOS install.)

[root@server]# nano -w /etc/profile

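And add the following lines to the end of the file, keeping only one of the two alternatives given for FEDORA_HOME and for CATALINA_HOME (if you add both, the later line wins). Also, note that there *must not* be any gaps either side of the '=' character, as tempting as it might be to press space to space it out to look better: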

# If you did not create the symlink, just point directly at your Fedora root
FEDORA_HOME=/opt/fedora30b1
# or if you did do the 'ln -s ...' step, use this instead:
FEDORA_HOME=/opt/fedora

export FEDORA_HOME

# If you did not create the symlink, just point directly at your tomcat root
CATALINA_HOME=/opt/fedora30b1/tomcat
# or if you did do the 'ln -s ...' step, use this instead:
CATALINA_HOME=/opt/fedora/tomcat

export CATALINA_HOME

JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
export JAVA_HOME

Save the file and exit (Ctrl-X in nano, answering 'Y' when asked whether to save)

Now, to check that this has worked, type the command 'exit' a few times to logout and then log back in again as your default user. If things have worked well, the following commands should work:

[user@server]$ echo $FEDORA_HOME
/opt/fedora30b1

[Or '/opt/fedora' depending on what you chose.]

[user@server]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.5.0-sun
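If you'd rather check all three variables in one go, a few lines of python will do it. This is just a convenience sketch; the function name missing_vars is made up and it only checks that the variables are set, not that the paths exist.

```python
import os

REQUIRED = ("FEDORA_HOME", "CATALINA_HOME", "JAVA_HOME")


def missing_vars(environ=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Environment looks ready for the Fedora installer.")
```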

Now to sort out MySQL. Remember that default root password you set for MySQL? You'll need it now.

[user@server]$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.0.45-Debian_1ubuntu3.1-log Debian etch distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>

Now issue the following commands:

mysql> create database fedora30;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on fedora30.* to 'fedoraAdmin'@'localhost' identified by 'PUTYOURPASSWORDHERE';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> ALTER DATABASE fedora30 DEFAULT CHARACTER SET utf8;
Query OK, 1 row affected (0.00 sec)

mysql> ALTER DATABASE fedora30 DEFAULT COLLATE utf8_bin;
Query OK, 1 row affected (0.00 sec)

mysql> exit
Bye

[user@server]$

(NB You may or may not need to add the utf-8 configuration lines for your particular version of MySQL, but as far as I know, the commands are harmless if you don't need them and utterly crucial if you do. Well, crucial unless you are dealing purely with ascii, but could you really guarantee that?)

Step 7 - install Fedora commons 3.0b1

(Note - Official installation guide is here)

Go to the location where you saved the fedora installer, probably the user's home directory and run the installer. I'll include the entire installation dialog here. Where the response is blank, I simply pressed enter to accept the default.

[user@server]$ cd /home/user
[user@server]$ java -jar fedora-3.0b1-installer.jar

***********************
Fedora Installation
***********************

To install Fedora, please answer the following questions.
Enter CANCEL at any time to abort the installation.
Detailed installation instructions are available at:
http://www.fedora.info/download/

Installation type
-----------------
The 'quick' install is designed to get you up and running with Fedora
as quickly and easily as possible. It will install Tomcat and an
embedded version of the McKoi database. SSL support and XACML policy
enforcement will be disabled.
For more options, including the choice of hostname, ports, security,
and databases, select 'custom'.
To install only the Fedora client software, enter 'client'.

Options : quick, custom, client

Enter a value ==> custom

Fedora home directory
---------------------
This is the base directory for Fedora scripts, configuration files, etc.
Enter the full path where you want to install these files.

Enter a value [default is /opt/fedora] ==>

Fedora administrator password
-----------------------------
Enter the password to use for the Fedora administrator (fedoraAdmin) account.

Enter a value ==> PUTTHEPASSWORDYOUDLIKEHERE

Fedora server host
------------------
The host Fedora will be running on.
If a hostname (e.g. www.example.com) is supplied, a lookup will be
performed and the IP address of the host (not the host name) will be used
in the default Fedora XACML policies.

Enter a value [default is localhost] ==>

Authentication requirement for API-A
------------------------------------
Fedora's management (API-M) interface always requires user authentication.
Require user authentication for Fedora's access (API-A) interface?

Options : true, false

Enter a value [default is false] ==>

SSL availability
----------------
Should Fedora be available via SSL? Note: this does not preclude
regular HTTP access; it just indicates that it should be possible for
Fedora to be accessed over SSL.

Options : true, false

Enter a value [default is true] ==>

SSL required for API-A
----------------------
Should API-A be accessible exclusively via SSL? If true, requests
to access API-A URLs will be automatically redirected to the secure port.

Options : true, false

Enter a value [default is false] ==>

SSL required for API-M
----------------------
Should API-M be accessible exclusively via SSL? If true, requests
to access API-M URLs will be automatically redirected to the secure port.

Options : true, false

Enter a value [default is true] ==> false

Servlet engine
--------------
Which servlet engine will Fedora be running in?
Enter 'included' to use the bundled Tomcat 5.5.23 server.
To use your own, existing installation of Tomcat, enter 'existingTomcat'.
Enter 'other' to use a different servlet container.

Options : included, existingTomcat, other

Enter a value [default is included] ==> included

Tomcat home directory
---------------------
Please provide the full path to your existing Tomcat installation, or
the path where you plan to install the bundled Tomcat.

Enter a value [default is /opt/fedora/tomcat] ==>

Tomcat HTTP port
----------------
Which HTTP port (non-SSL) should Tomcat listen on? This can be changed
later in Tomcat's server.xml file.

Enter a value [default is 8080] ==>

Tomcat shutdown port
--------------------
Which port should Tomcat use for shutting down? Make sure this doesn't
conflict with an existing service. This can be changed later in Tomcat's
server.xml file.

Enter a value [default is 8005] ==>

Tomcat Secure HTTP port
-----------------------
Which port (SSL) should Tomcat listen on? This can be changed
later in Tomcat's server.xml file.

Enter a value [default is 8443] ==>

Keystore file
-------------
For SSL support, Tomcat requires a keystore file.
If the keystore file is located in the default location expected by
Tomcat (a file named .keystore in the user home directory under which
Tomcat is running), enter 'default'.
Otherwise, please enter the full path to your keystore file, or, enter
'included' to use the the sample, self-signed certificate) provided by
the installer.
For more information about the keystore file, please consult:
http://tomcat.apache.org/tomcat-5.5-doc/ssl-howto.html.

Enter a value ==> included

Policy enforcement enabled
--------------------------
Should XACML policy enforcement be enabled? Note: This will put a set of
default security policies in play for your Fedora server.

Options : true, false

Enter a value [default is true] ==> false

Enable Resource Index
---------------------
Enable the Resource Index?

Options : true, false

Enter a value [default is false] ==> true

Enable REST-API
---------------
Enable the REST-API? The REST-API is an EXPERIMENTAL feature that exposes
the Fedora API with a REST-style interface. In particular, URL endpoints
should not be considered final, nor has policy enforcement been evaluated.
For more information about the REST-API, see
http://www.fedora.info/wiki/index.php/RESTful_Fedora_Proposal

Options : true, false

Enter a value [default is false] ==> true

Database
--------
Please select the database you will be using with
Fedora. The supported databases are McKoi, MySQL, Oracle and Postgres.
If you do not have a database ready for use by Fedora or would prefer to
use the embedded version of McKoi bundled with Fedora, enter 'included'.

Options : mckoi, mysql, oracle, postgresql, included

Enter a value ==> mysql

MySQL JDBC driver
-----------------
You may either use the included JDBC driver or your own copy.
Enter 'included' to use the included JDBC driver, or, enter the location
(full path) of the driver.

Enter a value [default is included] ==>

Database username
-----------------
Enter the database username Fedora will use to connect to the Fedora database.

Enter a value ==> fedoraAdmin

Database password
-----------------
Enter the database password Fedora will use to connect to the Fedora database.

Enter a value ==> PUTYOURDBPASSWORDHERE

JDBC URL
--------
Please enter the JDBC URL.

Enter a value [default is jdbc:mysql://localhost/fedora30?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true] ==>

JDBC DriverClass
----------------
Please enter the JDBC driver class.

Enter a value [default is com.mysql.jdbc.Driver] ==>

Successfully connected to MySQL
Deploy local services and demos
-------------------------------
Several sample back-end services are included with this distribution.
These are required if you want to use the demonstration objects.
If you'd like these to be automatically deployed, enter 'true'.
Otherwise, the installer will put the files in your FEDORA_HOME/install
directory in case you want to deploy them later.

Options : true, false

Enter a value [default is true] ==>

Preparing FEDORA_HOME...
Configuring fedora.fcfg
Installing beSecurity
Installing Tomcat...
Preparing fedora.war...
Processing web.xml
Deploying fedora.war...
Deploying fop.war...
Deploying imagemanip.war...
Deploying saxon.war...
Deploying fedora-demo.war...
Installation complete.

----------------------------------------------------------------------
Before starting Fedora, please ensure that any required environment
variables are correctly defined
(e.g. FEDORA_HOME, JAVA_HOME, JAVA_OPTS, CATALINA_HOME).
For more information, please consult the Installation & Configuration
Guide, located online at
http://www.fedora.info/download/ or locally at
/opt/fedora/docs/userdocs/distribution/installation.html
----------------------------------------------------------------------

And that should merrily go away and install and set up Fedora and the bundled Tomcat server for you. Unlike other services you may install, this won't start the Fedora service, nor will it create a handy startup/shutdown script that integrates with your Linux startup scripts in /etc/init.d. We will create one later on.

Step 8 - Further configuration of Fedora 3.0

!IMPORTANT! Fix the broken 'mail.jar' library! (Broken, as in the REST API will not work correctly with the version released in 3.0b1)

Get it from here: http://python-fedoracommons-webarchive.googlecode.com/files/mail.jar and use it to replace the mail.jar found at $FEDORA_HOME/tomcat/webapps/fedora/WEB-INF/lib/mail.jar. Restart Tomcat if it is already running.
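If you'd rather script the swap than do it by hand, here is a minimal sketch, assuming Python 3 and that you have already downloaded the replacement jar. The function name replace_jar and the ".orig" backup suffix are my own choices for illustration, not anything Fedora provides; the suffix is there so the backup is not picked up as a library.

```python
import shutil
from pathlib import Path


def replace_jar(new_jar, target_jar):
    """Back up target_jar next to itself, then overwrite it with new_jar.

    Returns the path of the backup copy, or None if there was nothing
    to back up.
    """
    new_jar, target_jar = Path(new_jar), Path(target_jar)
    backup = None
    if target_jar.exists():
        # Keep the shipped jar so the change can be rolled back.
        backup = target_jar.with_name(target_jar.name + ".orig")
        shutil.copy2(target_jar, backup)
    # Drop the replacement in place under the original file name.
    shutil.copy2(new_jar, target_jar)
    return backup
```

Call it with the path of the downloaded file as new_jar and the mail.jar inside Fedora's webapp as target_jar.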

I am keen on UUIDs, and I cannot see a good reason for not using them. I suggest using the Fedora ID 'namespace' of uuid, so that a Fedora URI will look like <info:fedora/uuid:d3733f61-1083-4a3e-b914-5a853c42189b>

These are also trivial to generate in Python; consider the following:

[user@server]$ python
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from uuid import uuid4
>>> uuid4().urn[4:]
'uuid:d3733f61-1083-4a3e-b914-5a853c42189b'
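The same one-liner can be wrapped up for reuse. This is only a sketch, assuming Python 3 (the session above is 2.5, but the uuid module behaves the same way); the helper names new_pid, fedora_uri and is_uuid_pid are mine, not part of any Fedora client library.

```python
from uuid import UUID, uuid4


def new_pid():
    """Return a fresh PID in the 'uuid' namespace, e.g. 'uuid:d3733f61-...'."""
    return uuid4().urn[4:]  # strip the leading 'urn:' from 'urn:uuid:...'


def fedora_uri(pid):
    """Wrap a PID in the info:fedora/ URI scheme."""
    return "info:fedora/" + pid


def is_uuid_pid(pid):
    """True if pid looks like 'uuid:<UUID in canonical hyphenated form>'."""
    namespace, _, value = pid.partition(":")
    if namespace != "uuid":
        return False
    try:
        # Round-trip through UUID so only the canonical form is accepted.
        return str(UUID(value)) == value.lower()
    except ValueError:
        return False
```

So fedora_uri(new_pid()) gives you a URI of the shape shown earlier, and is_uuid_pid is a cheap sanity check before handing a PID to Fedora.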

To get Fedora to accept these, though, the 'uuid' namespace needs to be added to the retainPIDs parameter in Fedora's configuration file.

[user@server]$ nano -w /opt/fedora/server/config/fedora.fcfg

Press Ctrl-W and search for retainPIDs. Add uuid to the list of namespaces (the ordering is not important):

<param name="retainPIDs" value="demo uuid test changeme ...
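To double-check the edit without eyeballing the file, something like the following should do. It is a sketch, assuming Python 3 and that the configuration file is well-formed XML containing a param element named retainPIDs as shown above; the function name retained_namespaces is my own.

```python
import xml.etree.ElementTree as ET


def retained_namespaces(fcfg_xml):
    """Return the PID namespaces listed in the first retainPIDs param found."""
    root = ET.fromstring(fcfg_xml)
    for element in root.iter():
        # Compare on the local tag name so any XML namespace is ignored.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "param" and element.get("name") == "retainPIDs":
            return element.get("value", "").split()
    return []
```

For example, "uuid" in retained_namespaces(open("/opt/fedora/server/config/fedora.fcfg", "rb").read()) should come back True once the change is saved.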

Step 9 - Installing Solr

[NB you should only need to follow the guide below, but here are the official docs, should you get into trouble:
http://wiki.apache.org/solr/SolrInstall - Basic installation
http://wiki.apache.org/solr/SolrTomcat - Tomcat specific things to bear in mind]

Extract the whole archive somewhere on disk and you will see something like this in the apache-solr-1.2.0 folder:

~/apache-solr-1.2.0$ ls
build.xml  CHANGES.txt  dist  docs  example  KEYS.txt  lib  LICENSE.txt  NOTICE.txt  README.txt  src

~/apache-solr-1.2.0$ ls dist
apache-solr-1.2.0.jar  apache-solr-1.2.0.war

The easiest thing is to install Solr straight into the instance of Tomcat that Fedora has installed. Be aware that search applications eat RAM and heap for breakfast, so install it on a server with plenty of RAM, and increase the heap space available to the Tomcat instance. You can do this by setting the environment variable CATALINA_OPTS to "-Xmx512m", for example inside the catalina.sh script in your /opt/fedora/tomcat/bin directory.

[i.e. just add CATALINA_OPTS="-Xmx512m" near the top of the file, after the #!/bin/sh line, if it isn't already set.]

One final bit of advice before I point you at the rather good installation docs: you might want to rename the .war file to match the URL path you desire, as the guide relies on Tomcat automatically unpacking the archive.

So, a war called "apache-solr-1.2.0.war" will result in the final app being accessible at http://tomcat-hostname:8080/apache-solr-1.2.0/. We will rename ours when we copy it into Tomcat's webapps directory.

Finally, Solr needs a place to keep its configuration files and its indexes. The indexes themselves can get huge (1 GB is not unheard of) and need somewhere to be stored. The documentation linked to above refers to this location as 'your solr home', so it would be wise to make sure that this location has the space to expand. (NB this is not the directory inside Tomcat where the application was unbundled.)

So, let's create a Solr home in /opt as we did for Fedora (NB substitute your own username for user):

[user@server]$ sudo -s
[root@server]# mkdir /opt/solr
[root@server]# chown user:user /opt/solr

Place the solr.war into Fedora's Tomcat instance:

[root@server]# exit
[user@server]$ pwd
/home/user/apache-solr-1.2.0
[user@server]$ cp dist/apache-solr-1.2.0.war $CATALINA_HOME/webapps/solr.war

Finally, we have to make sure one more variable is available in Tomcat's environment: the location of the Solr home directory. Remember that CATALINA_OPTS line we added before? Amend that now to look like:

(E.g. via nano -w $CATALINA_HOME/bin/catalina.sh )

CATALINA_OPTS="-Xmx512m -Dsolr.solr.home=/opt/solr"

Now, as we will shape the Solr search service later on (i.e. choosing the fields to be indexed, and how to index them for faceted searching) we will just copy across the basic solr example, to make sure everything is running fine.

[Make sure you are in the unpacked solr directory:]
[user@server]$ pwd
/home/user/apache-solr-1.2.0
[user@server]$ cp -a example/solr/* /opt/solr
[user@server]$ ls /opt/solr
bin conf README.txt
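A quick way to confirm the copy landed where Solr will look for it. This is a sketch, assuming Python 3; the two file names are the ones I'd expect in the stock example's conf directory, and the function name missing_solr_files is just for illustration.

```python
from pathlib import Path

# Files assumed to ship in the stock example configuration.
EXPECTED = ("conf/solrconfig.xml", "conf/schema.xml")


def missing_solr_files(solr_home, expected=EXPECTED):
    """Return the expected files that are absent from solr_home."""
    home = Path(solr_home)
    return [name for name in expected if not (home / name).is_file()]
```

missing_solr_files("/opt/solr") should return an empty list; anything it does return is a file to go and find before starting Tomcat.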

Adding HTTP authentication to Solr update

First add a username/password to tomcat/conf/tomcat-users.xml:

<tomcat-users>
...
<user username="solradmin" password="XXXXXXXX" roles="solradmin"/>
...
</tomcat-users>

Then, in your Solr context, in tomcat/webapps/solr/WEB-INF/web.xml, add the following:

<web-app>

.... usual stuff ....

<security-constraint>
<web-resource-collection>
<web-resource-name>
SolrUpdate
</web-resource-name>
<url-pattern>/update/*</url-pattern>
</web-resource-collection>
<auth-constraint>
<role-name>solradmin</role-name>
</auth-constraint>
</security-constraint>

<login-config>
<auth-method>BASIC</auth-method>
<realm-name>Auth needed</realm-name>
</login-config>

</web-app>

NB BASIC authentication sends the password in plain text (it is only base64-encoded, not encrypted), so this isn't too great, but it is suitable for a localhost updater. Change this to DIGEST to increase the security, but bear in mind you may need to set the Realm for the Tomcat container and the digest hash mechanism (SHA1, MD5, etc.)
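On the client side, an updater then has to send the credentials with every request to the update URL. Here is a minimal sketch of what that looks like, assuming Python 3 and the standard library only; build_update_request and send are hypothetical helper names, and the URL and password in the usage note are placeholders.

```python
import base64
import urllib.request


def build_update_request(update_url, xml_body, username, password):
    """Build (but do not send) a POST to the update URL with a Basic auth header."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),  # a request body makes this a POST
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "Authorization": "Basic " + token,
        },
    )


def send(request, timeout=30):
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError (e.g. 401) if Tomcat rejects
    the credentials.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For instance, send(build_update_request("http://localhost:8080/solr/update", "<commit/>", "solradmin", "XXXXXXXX")) issues a commit as the solradmin user, while the same request without the Authorization header should be refused.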

(Some good guides to securing Tomcat services are but a Google search away - for example: http://www.unidata.ucar.edu/projects/THREDDS/tech/reference/TomcatSecurity.html )

Step 10 - Test your foundation

Now, we need to start up Fedora, and hopefully, it will all go smoothly:

[user@server]$ cd /opt/fedora/tomcat/bin/
[user@server]$ ./startup.sh
Using CATALINA_BASE: /opt/fedora/tomcat
Using CATALINA_HOME: /opt/fedora/tomcat
Using CATALINA_TMPDIR: /opt/fedora/tomcat/temp
Using JRE_HOME: /usr/lib/jvm/java-1.5.0-sun

Now try these links:

http://localhost:8080/fedora/search
http://localhost:8080/fedora/describe - make sure 'uuid' is one of the retainPIDs
http://localhost:8080/solr/admin - should show a whole heap of options, bells and whistles.

Any 404 or 500 server error means that something has come unstuck. But if you've followed this guide on Ubuntu Gutsy, you should be all set without a problem - I just followed it on my home computer without a hitch :)
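If you'd rather not click through the links by hand, the same check can be scripted. This is a sketch, assuming Python 3 and the default host and port used throughout this guide; the function names are my own.

```python
import urllib.error
import urllib.request

URLS = (
    "http://localhost:8080/fedora/search",
    "http://localhost:8080/fedora/describe",
    "http://localhost:8080/solr/admin",
)


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 404, 500, ... still carry a status code
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def failing_urls(urls=URLS, fetch=http_status):
    """Return (url, status) pairs for every URL that did not answer 200."""
    results = [(url, fetch(url)) for url in urls]
    return [(url, status) for url, status in results if status != 200]
```

An empty list from failing_urls() means all three pages answered; it does not check the page contents, so you still need to look at the describe page to confirm 'uuid' is listed.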

Next up, we are going to build a Pylons interface to do basic CRUD-type functionality, with the ability to link items together semantically using the SIOC project's (http://sioc-project.org/ontology) namespace, at http://rdfs.org/sioc/ns#

Fedora, RDF, Pylons and OpenID - a VRE?

I was toying with the idea of simple, atomic objects with a limited payload - a block of text, a file, or a reference (typically a URL) - and letting the user bind these together using RDF descriptions (the user will have all the technical details hidden from them of course. It'll be point and click to them.)

And I wanted to test this out by letting anyone come along and make use of it. I looked at Shibboleth for the auth layer, but there are some serious roadblocks to casual hacky use of it, which is part of its design (for better not worse, I might add.)

So, I've turned to OpenID for this demo thingy - Anything that takes away the authent/authorise part of an app is a Good Thing(tm) from my point of view.

Right, how to bind it all together? After I asked on irc://irc.freenode.net#swig, 'kwijibo' pointed me towards the SIOC namespace - http://rdfs.org/sioc/ns# - which has handy classes like #Item and #Space, and very handy properties, such as #about, #attachment, #content, #note, #related_to, and #reply_of.
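To give a flavour of what such descriptions might look like, here is a sketch, assuming Python 3, that writes a couple of statements as N-Triples using those SIOC properties. The item URI and the target URL are made up for illustration, and the literal escaping is deliberately minimal (backslashes, quotes and newlines only).

```python
SIOC = "http://rdfs.org/sioc/ns#"


def uri(value):
    """Format an absolute URI as an N-Triples term."""
    return "<%s>" % value


def literal(text):
    """Format a plain string literal with minimal escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"%s"' % escaped


def triple(subject, predicate, obj):
    """Return one N-Triples statement; obj must already be a formatted term."""
    return "%s %s %s ." % (uri(subject), uri(predicate), obj)


# Hypothetical item and target, for illustration only.
note = "info:fedora/uuid:d3733f61-1083-4a3e-b914-5a853c42189b"
target = "http://example.org/some/page"

statements = [
    triple(note, SIOC + "content", literal('A "quick" note')),
    triple(note, SIOC + "about", uri(target)),
]
print("\n".join(statements))
```

The user would never see any of this, of course; the point-and-click layer would just be producing statements of this shape behind the scenes.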

And, due to planets aligning, synchronicity, and all that type of stuff, I am going to implement it as the final part of my tutorial on building a website backed by Fedora-Commons.
