Tuesday 2 December 2008

Archive file resiliences

Tar format: http://en.wikipedia.org/wiki/Tar_(file_format)#Format_details

The tar file format seems to be quite robust as a container: files are stored byte for byte as they are on disc, each with a 512 byte header prefix, and each file's data padded out to a multiple of 512 bytes.
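
(To make that layout concrete, here is a rough Python sketch - not anything I actually used for these tests - that walks a tar file by hand, reading the name and size fields out of each 512 byte header and skipping the padded data. Field offsets are from the ustar layout; 'test.tar' is just the test archive used below.)

import os

def list_tar_members(path):
    """Walk 512 byte blocks by hand: header, then data padded to a 512 byte boundary."""
    with open(path, "rb") as f:
        while True:
            header = f.read(512)
            if len(header) < 512 or header == b"\0" * 512:
                break  # end-of-archive marker (zero blocks) or EOF
            name = header[0:100].rstrip(b"\0").decode("utf-8", "replace")
            size = int(header[124:136].strip(b"\0 ") or b"0", 8)  # size field is octal text
            yield name, size
            f.seek((size + 511) // 512 * 512, os.SEEK_CUR)  # skip data plus padding

for name, size in list_tar_members("test.tar"):
    print(name, size)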

(Also, a quick tip for working with archives on the commandline - the utility 'less' can list the contents of many of them: 'less test.zip', 'less test.tar', etc.)

Substitution damage:

By hacking the headers (using a hex editor like ghex2), I found that the inbuilt header checksum detects corruption as intended. The normal tar utility will (possibly by default) skip corrupted headers, and therefore the files they describe, but will find the other files in the archive:

ben@billpardy:~/tar_testground$ tar -tf test.tar
New document 1.2007_03_07_14_59_58.0
New document 1.2007_09_07_14_18_02.0
New document 1.2007_11_16_12_21_20.0
ben@billpardy:~/tar_testground$ echo $?
0
(tar reports success)
ben@billpardy:~/tar_testground$ ghex2 test.tar
(munge first header a little)
ben@billpardy:~/tar_testground$ tar -tf test.tar
tar: This does not look like a tar archive
tar: Skipping to next header
New document 1.2007_09_07_14_18_02.0
New document 1.2007_11_16_12_21_20.0
tar: Error exit delayed from previous errors
ben@billpardy:~/tar_testground$ echo $?
2

Which is all well and good: at least you can detect errors which hit important parts of the tar file. But what are a couple of 512 byte targets in a 100MB tar file?

As the files are stored in the archive unaltered, any damage that doesn't hit a tar header section or padding is restricted to the file it lands in, not the files around it. So a few bytes of damage will be contained to the area in which they occur. It is important, then, to make sure that you checksum the files!
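
(One easy way to do that is to record a manifest of per-file digests at the time the tar is made. A minimal Python sketch - the choice of SHA-1 and the 'test.tar' filename are just illustrative:)

import hashlib, tarfile

def tar_member_checksums(path):
    """Return {member name: sha1 hex digest} for the regular files inside a tar."""
    sums = {}
    with tarfile.open(path) as tar:
        for member in tar:
            if member.isfile():
                sums[member.name] = hashlib.sha1(tar.extractfile(member).read()).hexdigest()
    return sums

print(tar_member_checksums("test.tar"))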

Additive damage:

The tar format is (due to legacy tape concerns) pretty much fixed on the 512 byte block size. Any addition seems to render the part of the archive after the addition 'unreadable' to the tar utility, as it only checks for headers on 512 byte boundaries. The exception is - of course - a 512 byte addition (or a multiple thereof).
Summary: the tar format is reasonably robust, in part due to its uncompressed nature and also due to its inbuilt header checksum. Bitwise damage seems to be localised to the area/files it affects. Therefore, if used as, say, a BagIt container, it might be useful for the server to allow individual file re-download, to avoid pulling the entire container again.


ZIP format: http://en.wikipedia.org/wiki/ZIP_(file_format)#The_format_in_detail

Zip has a similar layout to tar, in that the files are stored sequentially, with the main file table or 'central directory' being at the end of the file.

Substitution damage:

Hacking the file in a similar way to the tar test above, a few important things come out:
  • It is possible to corrupt individual files without the ability to unzip the archive being affected. However, like tar, the utility will report an error (echo $? -> '2') if a file's CRC-32 doesn't match the checksum in the central directory. (A small checking sketch follows this list.)
    • BUT, this corruption seemed to only affect the file that the damage was made to. The other files in the archive were unaffected. Which is nice.
  • Losing the end of the file renders the archive hard to recover.
  • The format does not checksum the central directory itself, so slight alterations here can cause all sorts of unintended changes. (NB I altered a filename in the central directory [a '3'->'4']):
    • ben@billpardy:~/tar_testground/zip_test/3$ unzip test.zip
      Archive: test.zip
      New document 1.2007_04_07_14_59_58.0: mismatching "local" filename (New document 1.2007_03_07_14_59_58.0),
      continuing with "central" filename version
      inflating: New document 1.2007_04_07_14_59_58.0
      inflating: New document 1.2007_09_07_14_18_02.0
      inflating: New document 1.2007_11_16_12_21_20.0
      ben@billpardy:~/tar_testground/zip_test/3$ echo $?
      1
    • Note that the unzip utility did error out, but the filename of the extracted file was altered to the 'phony' central directory one. It should have been New document 1.2007_03_07_14_59_58.0.
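
(A quick way to spot which member a CRC mismatch belongs to, sketched in Python - this is just the stock zipfile module, nothing clever, and was not part of the hex-editing tests above:)

import zipfile

def bad_members(path):
    """Return the names of zip members whose stored CRC-32 doesn't match their data."""
    bad = []
    with zipfile.ZipFile(path) as zf:
        # zf.testzip() would return only the *first* bad member; read each one instead
        for name in zf.namelist():
            try:
                zf.read(name)  # raises BadZipFile on a CRC mismatch
            except zipfile.BadZipFile:
                bad.append(name)
    return bad

print(bad_members("test.zip"))
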
Additive damage:

Addition to file region: This test was quite a surprise for me; it did a lot better than I had anticipated. The unzip utility not only worked out that I had added 3 bytes to the first file in the set, but managed to reroute around that damage so that I could retrieve the subsequent files in the archive:
ben@billpardy:~/tar_testground/zip_test/4$ unzip test.zip
Archive: test.zip
warning [test.zip]: 3 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 3
(attempting to re-compensate)
inflating: New document 1.2007_03_07_14_59_58.0
error: invalid compressed data to inflate
file #2: bad zipfile offset (local header sig): 1243
(attempting to re-compensate)
inflating: New document 1.2007_09_07_14_18_02.0
inflating: New document 1.2007_11_16_12_21_20.0
Addition to central directory region: This one was, as anticipated, more devastating. A similar addition of three bytes to the middle of the region gave this result:
ben@billpardy:~/tar_testground/zip_test/5$ unzip test.zip
Archive: test.zip
warning [test.zip]: 3 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [test.zip]: reported length of central directory is
-3 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
zipfile?). Compensating...
inflating: New document 1.2007_03_07_14_59_58.0
file #2: bad zipfile offset (lseek): -612261888
It rendered just a single file in the archive readable, but this was still a good result. It might be possible to remove the problem addition, given thorough knowledge of the zip format. However, the most important part is that it errors out, rather than silently failing.
Summary: the zip format looks quite robust as well, but pay attention to the error codes that the commandline utility (and hopefully, native unzip libraries) emit. Bitwise errors to files do not propagate to the other files, but do widespread damage to the file in question, due to its compressed nature. It survives additive damage far better than the tar format, being able to compensate and retrieve unaffected files.
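
(If you're scripting around the commandline tool, the exit code is the thing to watch. A tiny sketch - 'unzip -t' test-extracts in memory, and any non-zero exit status is treated as damage:)

import subprocess

def zip_looks_ok(path):
    """True only if 'unzip -t' exits cleanly; any non-zero return code means trouble."""
    result = subprocess.run(["unzip", "-t", path],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

print(zip_looks_ok("test.zip"))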


Tar.gz 'format':

This can be seen as a combination of the two archive formats above. On the surface, it behaves like a zip archive holding a single file (the .tar file). It shouldn't take long to realise, then, that any change to the bulk of the tar.gz will cause havoc throughout the archive. In fact, a single byte substitution I made to the file region of a test archive caused all byte sequences of 'it' to turn into ':/'. This decimated the XML in the archive, as all files were affected in the same manner.
Summary: combine the two archive types above, but in a bad way. Errors affect the archive as a whole - damage to any part of the bulk of the archive can cause widespread corruption, and damage to the end of the file can render it all unreadable.
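
(One small consolation: the gzip layer carries a CRC-32 of the uncompressed data in its trailer, so simply streaming through the archive will at least detect that something is wrong, even if it can't say where. A sketch - 'test.tar.gz' is a placeholder filename:)

import gzip

def targz_streams_cleanly(path):
    """Read the gzip layer to the end; corruption shows up as an exception here."""
    try:
        with gzip.open(path, "rb") as gz:
            while gz.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError):
        return False

print(targz_streams_cleanly("test.tar.gz"))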

Friday 14 November 2008

Beginning with RDF triplestores - a 'survey'

Like last time, this was prompted by an email that eventually was passed to me. It was a call for opinion - "we thought we'd check first to see what software either of you recommend or use for an RDF database."

It's a good question.

In fact, it's a really great question, as searching for similar advice online results in very few opinions on the subject.

But which ones are the best for novices? Which have the best learning curves? Which has the easiest install, or the shortest time between starting out and being able to query things?

I'll try to pose as much as I can as a newcomer, which won't be too hard :) Some of the comments will be my own, and some will be comments from others, but I'll try to be as honest as I can in reflecting new user expectation and experience and, most importantly, developer attention span. (See the end for some of my reasons for this approach.)

(Puts on newbie hat and enables PEBKAC mode.)

Installable (local) triplestores

Sesame - http://www.openrdf.org/

Simple menu on the left of the website, with one entry called Downloads. Great, I'll give that a whirl. "Download the latest Sesame 2.x release" looks good to me. Hmm, 5 differently named files... I'll grab the 'onejar' file and try to run it. "Failed to load Main-Class manifest attribute from openrdf-sesame-2.2.1-onejar.jar" - okay... so back to the site to find out how to install this thing.

No link for an installation guide... on the Documentation page there are no installation instructions for the Sesame 2.2.1 I downloaded, but there are Sesame 2 user documentation and Sesame 2 system documentation. Phew, after guessing that the user documentation might have the guide, I finally found the installation guide (the system documentation was about the architecture, not how to administer the system as you might expect).

(Developer losing interest...)

Ah, I see, I need the SDK. I wonder what that 'onejar' was then... "The deployment process is container-specific, please consult the documentation for your container on how to deploy a web application." - right, okay... let's assume that I have a Java background and am not just a user wanting to hook into it from my language of choice, such as PHP, Ruby, Python, or dare I say it, JavaScript.

(Only Java-friendly developers continue on)

Right, got Tomcat, and put in the war file... right so, now I need to work out how to use a commandline console tool to set up a 'repository'... does this use SVN or CVS then? Oh, it doesn't do anything unless I end the line with a period. I thought it had hung trying to connect!  "Triple indexes [spoc,posc]" Wha? Well, whatever that was, the test repository is created. Let's see what's at http://localhost:8080/openrdf-sesame then.

"You are currently accessing an OpenRDF Sesame server. This server is
intended to be accessed by dedicated clients, using a specialized
protocol. To access the information on this server through a browser,
we recommend using the OpenRDF Workbench software."

Bugger. Google for "sesame clients" then.
I've pretty much given up at this point. If I knew I needed to use a triplestore then I might have persisted, but if I was just investigating it? I would've probably given up earlier.

Mulgara - http://www.mulgara.org/

Nice, they've given the frontpage some style, not too keen on orange, but the effort makes it look professional. "Mulgara is a scalable RDF database written entirely in Java." -> Great, I found what I am looking for, and it warns me it needs Java. "DOWNLOAD NOW" - that's pretty clear. *click*

Hmm, where's the style gone? Lots of download options, but thankfully one is marked by "These released binaries are all that are required for most applications." so I'll grab those. 25Mb? Wow...

Okay, it's downloaded and unpacked now. Let's see what we've got - a 'dist/' directory and two jars. Well, I guess I should try to run one (wonder what the licence is, where's the README?)
Mulgara Semantic Store Version 2.0.6 (Build 2.0.6.local)
INFO [main] (EmbeddedMulgaraServer.java:715) - RMI Registry started automatically on port 1099
0 [main] INFO org.mulgara.server.EmbeddedMulgaraServer - RMI Registry started automatically on port 1099
INFO [main] (EmbeddedMulgaraServer.java:738) - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
3 [main] INFO org.mulgara.server.EmbeddedMulgaraServer - java.security.policy set to jar:file:/home/ben/Desktop/apache-tomcat-6.0.18/mulgara-2.0.6/dist/mulgara-2.0.6.jar!/conf/mulgara-rmi.policy
2008-11-14 14:06:39,899 INFO Database - Host name aliases for this server are: [billpardy, localhost, 127.0.0.1]
Well, I guess something has started... back to the site, where there is a documentation page and a wiki. A quick view of the official documentation has just confused me - is this an external site? No easy link to something like 'getting started' or tutorials. I've heard of SPARQL, but what's iTQL? Never mind, let's see if the wiki is more helpful.

Let's try 'Documentation' - sweet, first link looks like what I want - Web User Interface.
A default configuration for a standalone Mulgara server runs a set of web services, including the Web User Interface. The standard configuration uses port 8080, so the web services can be seen by pointing a browser on the server running Mulgara to http://localhost:8080/.
Ooo cool. *click*

Available Services


SPARQL, I've heard of that. *click*

HTTP ERROR: 400

Query must be supplied

RequestURI=/sparql/

Powered by Jetty://

I guess that's the SPARQL api, good to know, but the frontpage could've warned me a little. Ah, second link is to the User Interface.

Good, I can use a drop down to look at lots of example queries, nice. Don't understand most of them at the moment, but it's definitely comforting to have examples. They look nothing like SPARQL though... wonder what it is? I'm sure it does SPARQL... was I wrong?

A quick poke at the HTML shows that it is just POSTing the query text to webui/ExecuteQuery. Looks straightforward to start hacking against too, but I probably should password-protect this somehow! I wonder how that is done... the documentation mentions a 'java.security.policy' field:

java.security.policy
    string: URL - The URL for the security policy file to use.
    Default: jar:file:/jar_path!/conf/mulgara-rmi.policy

Kinda stumped... I'll investigate that later, but at least there's hope. Just firing off the example queries, though, shows me stuff, so I've got something to work with at least.
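
(For the record, the /sparql/ endpoint that complained above can be poked from code too - a hedged sketch, assuming it follows the standard SPARQL protocol and takes a 'query' parameter; the query itself is just an illustrative SELECT:)

import urllib.parse, urllib.request

endpoint = "http://localhost:8080/sparql/"
query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
url = endpoint + "?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))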

Jena - http://jena.sourceforge.net/

Front page is pretty clear, even if I don't understand what all those acronyms are. The Downloads link takes me to a page with an obvious download link, good. (Oh, and sourceforge, you suck. How many frikkin mirrors do I have to try to get this file?)

Have to put Jena on pause while Sourceforge sorts its life out.

ARC2 - http://arc.semsol.org/

Frontpage: "Easy RDF and SPARQL for LAMP systems" Nice, I know of LAMP and I particularly like the word Easy. Let's see... Download is easy to find, and tells me straight away I need PHP 4.3+ and MySQL 4.0.4+ *check* Right, now how do I enable PHP for apache again?... Ah, it helps if I install it first... Okay, done. Dropping the folder into my web space... Hmm nothing does anything. From the documentation, it does look like it is geared to providing a PHP library framework for working with its triplestore and RDF. Hang on, SPARQL Endpoint Setup looks like what I want. It wants a database, okay... done, bit of a hassle though.

Hmm, all I get is "Fatal error: Call to undefined function mysql_connect() in /********/arc2/store/ARC2_Store.php on line 53"

Of course - install the PHP libraries to access MySQL (PEBKAC)... done, and I also realise I need to set up the store, like the example in "Getting Started"... done (with this), and what does the index page now look like?



Yay! There's like SPARQL and stuff... I guess 'load' and 'insert' will help me stick stuff in, and 'select' looks familiar... Well, it seems to be working at least.

Unfortunately, it looks like the Jena download from sourceforge is in a world of FAIL for now. Maybe I'll look at it next time?

Triplestores in the cloud

Talis Platform - http://www.talis.com/platform/

From the frontpage - "Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage." - pretty much what I wanted to hear, so how do I get to play with it?

There is a Get involved link on the left, which rapidly leads me to see the section: "Develop, play and try out" - n2 developer community seems to be where it wants me to go.

Lots of links on the frontpage; it takes a few seconds to spot: "Join - join the n² community to get free developer stores and online support" - free, nice word that. So, I just have to email someone? Okay, I can live with that.

Documentation seems good; lots of choices though, and it's a little hard to spot a single thread to follow to get up to speed, but Guides and Tutorials looks right to get going with. The Kniblet tutorial (whatever a kniblet is) looks the most beginnerish, and it's also very PHP focussed, which is either a good thing or a bad thing depending on the user :)

Commercial triplestores

Openlink Virtuoso - http://virtuoso.openlinksw.com/

Okay, I tried the Download link, but I am pretty confused by what I'm greeted with:



Not sure which one to pick just to try it out; it's late in the day, and my tolerance for all things installable has ended.

-----------------------------------------

Why take the http/web-centric, newbie approach to looking at these?

Answer: In part, I am taking this approach because I have a deep belief that it was only after relational DBs became commoditised - "You want fries with your MySQL database?" - that the dynamic web kicked off. If we want the semantic web to kick off, we need to commoditise it, or at least make it very easy for developers to get started. And I mean EASY. A question that I want answered is: "Is there something that fits: 'apt-get install triplestore; r = store('localhost'), r.add(rdf), r.query(blah)'?"
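
(For what it's worth, something close to that shape does already exist for local hacking - this sketch uses rdflib, which is my example rather than anything from the email thread, and 'data.rdf' is just a placeholder RDF/XML file; recent rdflib versions handle the SPARQL query natively:)

from rdflib import Graph

g = Graph()                        # in-memory store; persistent back-ends can be plugged in
g.parse("data.rdf", format="xml")  # roughly the 'r.add(rdf)' step
for row in g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"):  # the 'r.query(blah)' step
    print(row)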

(I am particularly interested to see what happens when Tom Morris's work on Reddy collides with ActiveRecord or activerdf...)

(NB I've short-circuited the discovery of software homepages - imagine I've seen projects stating that they use "XXXXX as a triplestore". I know this will likely mean I've compared apples to oranges, but as a newbie, how would I be expected to know this? "Powered by the Talis Platform" and "Powered by Jena" seem pretty similar on the surface.)

Thursday 13 November 2008

A Fedora/Solr Digital Library for Oxford's 'Forced Migration Online'

(mods:subtitle - Slightly more technical follow-up to the Fedora Hatcheck piece.)

As I have been prompted via email by Phil Cryer (of the Missouri Botanical Garden) to talk more about how this technically works, I thought it would be best to make it a written post, rather than the more limited email response.

Background

Forced Migration Online (FMO) had a proprietary system supporting their document needs. It was originally designed for newspaper holdings and applied that model to encoding the mostly paginated documents that FMO held - such that each part was broken up into paragraphs of text, images and the location of all these parts on a page. It even encoded (in its own format) the location of the words on the page when it OCR'd the documents, making per-word highlighting possible. Which is nice.

However, the backend that powered this was over-priced, and FMO wanted to move to a more open, sustainable platform.

Enter the DAMS

(DAMS = Digital Asset Management System)

I have been doing work on trying to make a service out of a base of Fedora-Commons and additional 'plugin' services, such as the wonderful Apache Solr and the useful eXist XML db. The end aim is for departments/users/whoever to requisition a 'store' with a certain quality of service (Solr attached, 50GB+, etc.), but this is not yet an automated process.

The focus for the store is a very clear separation between storage, management, indexing services and distribution: normal filesystems or Sun Honeycomb provide the storage; Fedora-Commons provides the management and CRUD; Solr, eXist, Mulgara, Sesame and CouchDB can provide potential index and query services; and distribution is handled pragmatically, caching outgoing traffic and mirroring where necessary.

The FMO 'store'

From discussions with FMO, and from examining the information they held and the way they wished to make use of it, a simple Fedora/Solr store seemed to fulfil what they wanted: a persistent store of items with attachments and the ability to search the metadata and retrieve results.

Bring in the consultants

FMO hired Aptivate to do the migration of their data from the proprietary system, in its custom format, to a Fedora/Solr store, trying as much as possible to retain the functionality they had.

Some points that I think it is important to impress on people here:
  • In general, software engineer consultants don't understand METS or FOXML.
  • They *really* don't understand the point of disseminators.
  • Having to teach software engineer consultants to do METS/FOXML/bDef's etc is likely an arduous and costly task.
  • Consultants charge lots of money to do things their team doesn't already have the experience to do.
So, my conclusion was not to make these things part of the development at all - to the extent that I may only have mentioned them in passing. I helped the consultants install their own local store and helped them with the various interfaces and gotchas of the two software packages. By showing them how I use Fedora and Solr in ora.ouls.ox.ac.uk, they were able to hit the ground running.

They began with the REST interface to Fedora and the RESTful interface to Solr. By starting with the simple put/get REST interface to Fedora, they could concentrate on getting used to the nature of Fedora as an object store. I think they later moved to the SOAP interface as it better suited their Java background, although I cannot be certain, as it wasn't an issue that came up.

Once they had developed the migration scripts to their satisfaction, they asked me to give them a store, which I did (though due to hardware and stupid support issues here, I am afraid to say I held them up on this). They fired off their scripts and moved all the content into the Fedora with a straightforward layout per object (PDF, metadata, fulltext and thumbnail). The metadata is - from what I can see - the same XML metadata as before - very MARCXML in nature, with 'Application_Info' elements having types like 'bl:DC.Title'. If necessary, we will strip out the Dublin Core metadata and put what we can into the DC datastream, but that's not of particular interest to FMO right now.

Fedora/Solr notes

As for the link between Solr and Fedora? This is very loosely coupled, such that they are running in the same Tomcat container for convenience, but aren't linked in a hard way.

I've looked at GSearch, which is great for a homogeneous collection of items, such that they can be acted on by the same XSLT to produce a suitable record for Solr, but as the metadata was a complete unknown for this project, it wasn't too suitable.

Currently, they have one main route into the Fedora store, and so it isn't hard to simply reindex an item after a change is made, especially for services such as Solr or eXist, which expect to have things change incrementally. I am looking at services such as ActiveMQ for scheduling these index tasks, but more and more I am starting to favour RabbitMQ, which seems more useful while retaining the scalability and very robust nature.

Sending an update to Solr is as simple as an HTTP POST to its /update service, consisting of an XML or JSON packet along the lines of "<add><doc><field name="id">changeme:1</field><field name="author">John Smith</field>...</doc></add>" - it uses a transactional model, such that you can push all the changes and additions into the live index via a commit call, without taking the index offline. To query Solr, all manner of clients exist, and it is built to be very simple to interact with, handling facet queries, filtering and ordering, and it can deliver the results in XML, JSON, PHP or Python directly. It can even do an XSLT transform of the results on the way out, leading to a trivial way to support OpenSearch, Atom feeds and even HTML blocks for embedding in other sites.
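
(A minimal sketch of that interaction in Python - the URL assumes a default standalone Solr on port 8983, so adjust for wherever Solr actually lives (here it sits in the same Tomcat as Fedora), and the field names are illustrative, not FMO's schema:)

import urllib.request

def post_xml(url, xml):
    """POST an XML payload to Solr and return the raw response."""
    req = urllib.request.Request(url, data=xml.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req).read()

solr_update = "http://localhost:8983/solr/update"
post_xml(solr_update, '<add><doc>'
                      '<field name="id">changeme:1</field>'
                      '<field name="author">John Smith</field>'
                      '</doc></add>')
post_xml(solr_update, '<commit/>')  # the change only becomes visible in the live index here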

Likewise, changing a PDF in Fedora can be done with an HTTP POST as well. Does it need to be more complicated?

Last, but not least, a project to watch closely:

The Fascinator project, funded by ARROW, as part of their mini project scheme, is an Apache Solr front end to the Fedora commons repository. The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. Well worth a look, as it seeks to turn this Fedora/Solr pairing truly into an appliance, with a simple installer and handling the linkage between the two.

OCLC - viral licence being added to WorldCat data

Very short post on this, as I just wanted to highlight a fantastic piece written by Rob Styles about OCLC's policy changes to WorldCat.

In a nutshell, it seems that OCLC's policy changes are intended to restrict the usage of the data in order to prevent competing services from appearing - competing services such as LibraryThing and the Internet Archive's Open Library. Unfortunately, it seems that the changes will also impinge on the rights of users to collate citation lists in software such as Zotero, EndNote and others. Read Rob's post for a well-researched view on the changes.

Wednesday 12 November 2008

Useful, interesting, inspiring technology/software that is out there that you might not know about.

(I guess this is more like a filtered link list, but with added comments in case you don't feel like following the links to find out what it's all about... A mix of old but solid links and a load of tabs that I really should close ;))
  1. Tahoe - http://allmydata.org/~warner/pycon-tahoe.html
    The "Tahoe" project is a distributed filesystem, which safely stores files on multiple machines to protect against hardware failures. Cryptographic tools are used to ensure integrity and confidentiality, and a decentralized architecture minimizes single points of failure. Files can be accessed through a web interface or native system calls (via FUSE). Fine-grained sharing allows individual files or directories to be delegated by passing short URI-like strings through email. Tahoe grids are easy to set up, and can be used by a handful of friends or by a large company for thousands of customers.
  2. CouchDB - http://incubator.apache.org/couchdb/

    Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

    CouchDB is written in Erlang, but can be easily accessed from any environment that provides means to make HTTP requests. There are a multitude of third-party client libraries that make this even easier for a variety of programming languages and environments.

  3. Yahoo Term Extractor - http://developer.yahoo.com/search/content/V1/termExtraction.html
    The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content.
  4. Kea term extractor (SKOS enabled) - http://www.nzdl.org/Kea/
    KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
  5. Piwik - http://piwik.org/

    piwik is an open source (GPL license) web analytics software. It gives interesting reports on your website visitors, your popular pages, the search engines keywords they used, the language they speak… and so much more. piwik aims to be an open source alternative to Google Analytics.

  6. RabbitMQ - http://www.rabbitmq.com/
    RabbitMQ is an Open (Mozilla licenced) implementation of AMQP, the emerging standard for high performance enterprise messaging. Built on top of the Open Telecom Platform (OTP). OTP is used by multiple telecommunications companies to manage switching exchanges for voice calls, VoIP and now video. These systems are designed to never go down and to handle truly vast user loads. And because the systems cannot be taken offline, they have to be very flexible, for instance it must be possible to 'hot deploy' features and fixes on the fly whilst managing a consistent user SLA.
  7. Talis Platform - http://n2.talis.com/wiki/Main_Page
    The Talis Platform provides solid infrastructure for building Semantic Web applications. Delivered as Software as a Service (SaaS), it dramatically reduces the complexity and cost of storing, indexing, searching and augmenting data. It enables applications to be brought to market rapidly with a smaller initial investment. Developers using the Platform can spend more of their time building extraordinary applications and less of their time worrying about how they will scale their data storage.
  8. The Fascinator - http://ice.usq.edu.au/projects/fascinator/trac
    The goal of the project is to create a simple interface to Fedora that uses a single technology – that’s Solr – to handle all browsing, searching and security. This contrasts with solutions that use RDF for browsing by ‘collection’, XACML for security and a text indexer for fulltext search, and in some cases relational database tables as well. We wanted to see if taking out some of these layers makes for a fast application which is easy to configure. So far so good.
  9. RDFQuery - http://code.google.com/p/rdfquery/

    This project is for plugins for jQuery that enable you to generate and manipulate RDF within web pages [in javascript]. There are two main aims of this project: 1) to provide a way of querying and manipulating RDF triples within Javascript that is as intuitive as jQuery is for querying and manipulating a DOM, and 2) to provide a way of gleaning RDF from elements within a web page, whether that RDF is represented as RDFa or with a microformat.

  10. eXist - http://exist.sourceforge.net/
    eXist-db is an open source database management system entirely built on XML technology. It stores XML data according to the XML data model and features efficient, index-based XQuery processing.
  11. Evergreen - http://evergreen-ils.org/
    Evergreen is an enterprise-class [Open Source] library automation system that helps library patrons find library materials, and helps libraries manage, catalog, and circulate those materials, no matter how large or complex the libraries.
  12. Apache Solr - http://lucene.apache.org/solr/
    Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features.
  13. GATE - http://gate.ac.uk/
    GATE is a leading toolkit for text-mining. It bills itself as "the Eclipse of Natural Language Engineering, the Lucene of Information Extraction". [NB I have yet to use this, but it has the kind of provenance and userbase that makes me feel okay about sharing this link]
  14. Ubuntu JeOS - http://www.ubuntu.com/products/whatisubuntu/serveredition/jeos
    Ubuntu Server Edition JeOS (pronounced "Juice") is an efficient variant of our server operating system, configured specifically for virtual appliances. Currently available as a CD-Rom ISO for download, JeOS is a specialised installation of Ubuntu Server Edition with a tuned kernel that only contains the base elements needed to run within a virtualized environment.
  15. Synergy2 - http://synergy2.sourceforge.net/
    Synergy lets you easily share a single mouse and keyboard between multiple computers with different operating systems, each with its own display, without special hardware. It's intended for users with multiple computers on their desk since each system uses its own monitor(s).

Friday 24 October 2008

"Expressing Argumentative Discussions in Social Media sites" - and why I like it

At the moment, there are a lot of problems with the way information is cited or referenced in paper-based research articles, etc., and I am deeply concerned that this model is being applied to digitally held content. I am by no means the first to say the following, and people have put a lot more thought into this field, but I can't find a way out of the situation using only the information that is currently captured. I find it easier to see the holes when I express it to myself in rough web metaphors:
  • a reference list is a big list of 'bookmarks', ordered generally by the order in which they appear in the research paper.
  • Sometimes a bookmark is included, because everyone else in the field includes the same one, a core text for example.
  • There are no tags, no comments and no numbers on how often a given bookmark is used.
  • There is no intentional way to find out how often the bookmark appears in other people's lists. This has to be reverse engineered.
  • There is no way to see from the list what the intent of the reference is, whether the author agrees, disagrees, refutes, or relies on the thing in question.
  • There are no anchor tags in the referenced articles (normally), so there is little way to reliably refer to a single quote in a text. Articles are often referenced as a whole, rather than to the line, chart, or paragraph.
  • The bookmark format varies from publisher to publisher, and from journal to journal.
  • Due to the coarse-grained citation, a single reference will sometimes be used when the author refers to multiple parts of a given piece of work.
Now, on a much more positive note, this issue is being tackled. At the VoCamp in Oxford, I talked with CaptSolo about the developments with SIOC and their idea to extend the vocabularies to deal with argumentative discourse. Their paper is now online at the SDoW2008 site (or directly as the PDF). The essence of this is an extension to the SIOC vocab, recording the intent of a statement, such as Idea, Issue or Elaboration, as well as recording an Argument.

I (maybe naïvely) have felt that there is a direct parallel between social discourse and academic discourse, to the point where I used the sioc:has_reply property to connect links made in blogs to items held in the archive (using trackbacks and pingbacks, a system in hiatus until I get time to beef up the antispam/moderation facilities). So, to see an argumentation vocab developing makes me happy :) Hopefully, we can extend this vocab's intention with more academic-focussed terms.

What about the citation vocabularies that exist? I think that those that I have looked at suffer from the same issue - they are built to represent what exists in the paper-world, rather than what could exist in the web-world.

I also want to point out the work of the Spider project, which aims to semantically mark up a single journal article, as they have taken significant steps towards showing what could be possible with enhanced citations. Take a look at their enhanced article - all sorts of very useful examples of what is possible. Pay special attention to how the references are shown, and how they can be reordered, typed and so on. Note that I am able to link to the references section in the first place! The part I really find useful is demonstrated by the two references in red in the 2nd paragraph of the introduction. Hover over them to find out what I mean. Note that even though the two references are the same in the reference list (due to this starting as a paper-version article), they have been enhanced to pop up the reasons and sections referred to in each case.

In summary then, please think twice when compiling a sparse reference list! Quote the actual section of text if you can and harvard format be damned ;)

Monday 20 October 2008

Modelling and storing a phonetics database inside a store

Long story short, a researcher asked us to store and disseminate a DVD containing ~600+ audio files and associated analyses comprising a phonetics database, focussed on the beat of speech.

This was the request that let me start on something I had planned for a while: a databank of long-tail data. This is data that is too small to fit into the plans of Big Data (which have, IMO, a vanishingly small userbase for reuse) and too large and complex to sit as archived lumps on the web. The system supporting the databank is a Fedora-Commons install, with a basic Solr setup for indexing.

As I haven't gotten an IP address for the databank machine, I cannot link to things yet, but I will walk through the modelling process. (I'll add in links to the raw objects once I have a networked VM for the databank)

Analysis: "What have we got here?"

The dataset was given to us by a researcher called Dr. Greg Kochanski, the data having been burnt onto a DVD-R. He called it the "2008 Oxford Tick1 corpus". A quick look at the contents showed that it was a collection of files, grouped by file folder into a hierarchy of some sort. First things first, though - this is a DVD-R and very much prone to degradation. As it is the files on the disc that are important, rather than the disc image itself, a quick "tar -cvzf Tick1_backup.tar.gz /media/cdrom0/" made a gzipped archive of the files. Remember, burnt DVDs have an integrity half-life of around 1.5 to 2 years (according to a talk I heard at the last Sun PASIG) and I myself have lost data to unreadable discs.

Disc contents: http://pastebin.com/f74aadacc

Top-level directory listing:

ch ej lp ps rr sh ta DB.fiat DBsub.fiat README.txt
cw jf nh rb sb sl tw DBsent.fiat LICENSE.txt

Each one of the two-letter directories holds a number of files, each file seemingly having a subdirectory of its own containing ~6+ data files in a custom format.

The top-level .fiat files (DB.fiat, etc.) are in a format roughly described here - the documentation of the data held within each file being targeted at humans. In essence, it looks like a way to add comments and information to a CSV file, but it doesn't seem to support/encourage the line syntax that CSV enjoys, like quoting, likely because it doesn't use any standard CSV library. The same data could be captured using standard CSV libraries without any real invention, but I digress.

Ultimately, by parsing the fiat files, I can get the text spoken in each audio file, some metadata about each one, and some information about how some of the audio is interrelated (by text spoken, type, etc.).
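
(Purely as a sketch of the sort of parsing involved - the comment marker and the comma-separated data rows are guesses on my part, not a statement of what the fiat format actually specifies:)

import csv

def read_fiat(path, comment_prefix="#"):
    """Split a fiat-style file into comment/metadata lines and csv-ish data rows."""
    metadata, data_lines = [], []
    with open(path, newline="") as fh:
        for line in fh:
            if line.startswith(comment_prefix):
                metadata.append(line.rstrip("\n"))
            elif line.strip():
                data_lines.append(line)
    return metadata, list(csv.reader(data_lines))

metadata, rows = read_fiat("DB.fiat")
print(len(metadata), "metadata lines,", len(rows), "data rows")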

Modelling: "How to turn the data into a representation of objects"

There are very many ways to approach this, but I shall outline the aims I keep in mind, and also the fact that this will always be a compromise between different aims, in part due to the origins of the data.

I am a fan of sheer curation; I think that this is not only a great way to work, but also the only practical way to deal with this data in the first place. The researcher knows their intentions better than a post-hoc curator. Injecting data modelling/reuse considerations into the working lives of researchers is going to take a very long time. I have other projects focussed on just this, but I don't see it being the workflow through which the majority of data is piped any time soon.

In the meantime, the way I believe is best for this type of data is to capture and to curate by addition. Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we've put out there has a very narrow userbase, as the act of curation and preservation takes time. We need to start putting (cleared) data out there, in parallel to capturing the information necessary for understanding and then preserving the information. By putting the data out there quickly, the researcher feels that they are having an effect and are able to see that what we are doing for them is worth it. This can favourably aid dialogue that you need to have with the researcher (or other related researchers or analysts) to further characterise this data.

So, step one for me is to describe a physical view of the filesystem in terms of objects and then to represent this description in terms of RDF and Fedora objects. Luckily, this is straightforward, as it's already pretty intuitive.
  • There are groupings
    • a top-level containing 10 group folders and a description of the files
    • Each group folder contains a folder for each recording in its group
    • Each recording folder contains
      • There is a .wav file for each recording
      • There are 6 analysis .dat files (sound analyses like RMS, f0, etc), dependent and exclusive to each audio file
      • There are some optional files, dependent and exclusive to each audio file
    • Each audio file is symbolically linked into its containing group folder.
So, from this, I have 3 types of object: the top-level 'dataset' object, a grouping object of some kind, and a recording object, containing the files pertinent to a single recording. (In fact, the two grouping classes are preexisting classes in the archival system here, albeit informally.)

We can get a crude but workable representation by using three 'different' (marked as different in the RELS-EXT ds) Fedora objects, and by using the dcterms 'isPartOf' property to indicate the groupings (recording --- dcterms:isPartOf --> grouping).
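
(A sketch of those triples using rdflib - the library choice and the object ids are mine, purely for illustration; the real identifiers will be whatever the databank mints:)

from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
g = Graph()

dataset = URIRef("info:fedora/databank:tick1")       # placeholder ids
grouping = URIRef("info:fedora/databank:tick1-ch")
recording = URIRef("info:fedora/databank:tick1-ch-001")

g.add((grouping, DCTERMS.isPartOf, dataset))    # grouping --- dcterms:isPartOf --> dataset
g.add((recording, DCTERMS.isPartOf, grouping))  # recording --- dcterms:isPartOf --> grouping

print(g.serialize(format="nt"))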



Curation by addition


The way I've approached this with Fedora objects is to use a datastream with a reserved ID to capture the characteristics of the data being held. At the moment, I am using full RDF stored in a datastream called RELS-INT. (NB I fully expect someone to look at this dataset later in its life and say 'RDF? that's so passé'; the curation of the dataset will be a long-term process that may not end entirely.) RELS-INT is to contain any RDF that cannot be contained by the RELS-EXT datastream. (Yes, having two sources for the RDF where one should do is not desirable, but it's a compromise between how Fedora works and how I would like it to work.)

To indicate that the RELS-INT should also be considered when viewing the RELS-EXT, I add an assertion (which slightly abuses the intended range of the dcterms requires property, but reuse before reinvention):

<info:fedora/{object-id}> <http://purl.org/dc/terms/requires> <info:fedora/{object-id}/RELS-INT> .

Term Name: requires
URI: http://purl.org/dc/terms/requires
Definition: A related resource that is required by the described resource to support its function, delivery, or coherence.
I am also using the convention of storing timestamped notes in iCal format within a datastream called EVENTS (I am doing this in a much more widespread fashion throughout). These notes are intended to document the curatorial/archivist 'whys' behind changes to the objects, rather than the technical ones, which Fedora can keep track of itself. As the notes are available to be read, they are intended to describe how this dataset has evolved and why it has been changed or added to.

An assertion to add, then, is that the EVENTS datastream contains information pertinent to the provenance of the whole object (into the RELS-INT in this case). I am not happy with the following method, but I am open to suggestions.

<info:fedora/{object-id}> <http://purl.org/dc/terms/provenance>
    [ a <http://purl.org/dc/terms/ProvenanceStatement> ;
      dc:source <info:fedora/{object-id}/EVENTS> ] .
Term Name: provenance
URI: http://purl.org/dc/terms/provenance
Definition: A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation.
Comment: The statement may include a description of any changes successive custodians made to the resource.

From this point on, the characteristics of the files attached to each object can be recorded in a sensible and extendable manner. My next steps are to add simple dublin core metadata for what I can (both the objects and the individual files) and to indicate how the data is interrelated. I will also add (as an object in the databank) the basic description of the custom data format, which seems to be based loosely on the NASA FITS format, but not based well enough for FITS tools to work on, or to be able to validate the data.

Data Abstracts: "Binding data to traditional formats of research (articles, etc)

It should be possible to cite a single piece of data, as well as a grouping or even an arbitrary set of data and groupings of data. From a re-use point of view, this citation is a form of data currency that is passed around and re-cited, so it makes sense to make this citation as simple as possible; a data citation as a URL.

I start from the point of view that a single, generic, perfectly modelled citation format for data will take more time and resources to develop than I have. What I believe I can do, though, is enable the more practically focussed case for re-use and the sharing of citations: a single URL (URI) is created which serves as an anchor node to bind together resources and information. It should provide a means for the original citing author to indicate why they selected this grouping of information, and what this grouping of information means at that time and for what reason. I can imagine modelling the simple facts of such a citation in RDF assertions (person, date of citation, etc.), but it's beyond me to imagine a generic but useful way to indicate author intention and perception in the same way. The best I can do is to adopt the model that researchers/academics are most comfortable with, and allow them to record a data 'abstract' in their native language.




Hopefully, this will prove a useful hook for researchers to focus on and to link to or from related or derived data. Whilst groupings are typically there to make it easier to understand the underlying data as a whole, the data abstract is there to record an author's perception of a grouping, based on whatever reasoning they choose to express.

Thursday 16 October 2008

News and updates Oct 2008

Right, I haven't forgotten about this blog, just getting all my ducks in a line as it were. Some updates:

  • The JISC bid for eAdministration was successful, titled "Building the Research Information Infrastructure (BRII)". The project will categorise the research information structure, build vocabularies if necessary, and populate it with information. It will link research outputs (text and data), people, projects, groups, departments, grant/funding information and funding bodies together, using RDF and as many pre-existing vocabularies as are suitable. The first vocab gap we've hit is one for funding, and I've made a draft RDF schema for this which will be openly published once we've worked out a way to make it persistent here at Oxford (we're trying to get a vocab.ox.ac.uk address).
    • One of the final outputs will be a 'foafbook' which will re-use data in the BRII store - it will act as a blue book of researchers. Think Cornell's Vivo, but with the idea of Linked Data firmly in mind.
    • We are just sorting out a home for this project, and I'll post up an update as soon as it is there.
  • Forced Migration Online (FMO) have completed their archived document migration from a crufty, proprietary store to an ORA-style store (Fedora/Solr) - you can see their preliminary frontend at http://fmo.qeh.ox.ac.uk. Be aware that this is a work in progress. We provide the store as a service to them, giving them a Fedora and a Solr to use. They contracted a company called Aptivate to migrate their content, and I believe also to create their frontend. This is a pilot project to show that repositories can be treated in a distributed way, given out like very smart, shared drive space.
  • We are working to archive and migrate a number of library and historical catalogs. A few projects have a similar aim to provide an architecture and software system to hold medieval catalog research - a record of what libraries existed, and what books and works they held. This is much more complex than a normal catalog, as each assertion is backed by a type of evidence, ranging from the solid (first-hand catalog evidence) to the more loose (handwriting on the front page looks like a certain person who worked at a certain library). So modelling this informational structure is looking to be very exciting, and we will have to try a number of ways to represent it, starting with RDF due to the interlinked nature of the data. This is related to the kinds of evidence that genealogy uses, and so related ontologies may be of use.
  • The work on storing and presenting scanned imagery is gearing up. We are investigating storing the sequence of images and associated metadata/OCR text/etc. as a single tar file as part of a Fedora object (i.e. a book object will have a catalog record, technical/provenance information, an attached tar file and a list of file-to-offset information.)
    • This is due to us trying to hit the 'sweet spot' for most file systems. A very large number of highly compressed images and little pieces of text does not fit well with most FS internals. We estimate that for a book there will be around [4+PDFs+2xPages] files, or 500+ typically. Just counting up the various sources of scanned media we already have, we are pressing for about 1/2 million books from one source, 200,000 images from another, 54,000 from yet another... it's adding up real fast.
  • We are starting to deal with archiving/curating the 'long-tail' of data - small, bespoke datasets that are useful to many, but don't fall into the realm of Big Data, or Web data. I don't plan on touching Access/FoxPro databases any time soon though! I am building a Fedora/Solr/eXist box to hold and disseminate these, which should live at databank.ouls.ox.ac.uk very, very shortly. (Just waiting on a new VMware host to grow into, our current one is at capacity.)
    • To give a better idea of the structure, etc, I am writing it up in a second blog post to follow shortly - currently named "Modelling and storing a phonetics database inside a store"
  • I am in the process of integrating the Google-Analytics-style statistics package at http://piwik.org with the ORA interface, to give relatively live hit counts on a per-item basis and to build per-collection reports.
    • Right now, piwik is capturing the hits and downloads from ORA, but I have yet to add in the count display on each item page, so halfway there :)
  • We are just waiting on a number of departments here to upgrade the version of EPrints they are using for their internal, disciplinary repositories, so that we can begin archiving surrogate copies of the work they wish to put up for this service (using ORE descriptions of their items). By doing so, their content becomes exposed in ORA and mirror copies are made (we're working on a good way to maintain these as content evolves), but they retain control of the content; ORA will also act as a registry for their content. It's only when their service drops that users get redirected to the mirror copies that ORA holds (think Google cache, but a 100% copy).
  • In the process of battle-testing the Fedora-Honeycomb connection, but as mentioned above, just waiting for a little more hardware before I set to it. Also, we are examining a number of other storage boxes that should plug in under Fedora, using the Honeycomb software, such as the new and shiny Thumper box, "Thor" Sun Fire Xsomething-or-other. Also, getting pretty interested in the idea of MAID storage - a massive array of idle disks. Hopefully, this will act like tape, but with the sustained access speed of disk. Also, a little greener than a tower of spinning hardware.
  • Planning out the indexer service at the moment. It will use the Solr 1.3 multicore functionality, with a little parsing magic at the ingest side of things to make a generic indexer-as-a-service type system. One use-case is to be able to bring up VMs with multicore Solr on them to act as indexers/search engines as needed. An example aim? "Economics want an index that facets on their JEL codes." POST a schema and ingest configuration to the nearest free indexer, and point the search interface at it once an XMPP message comes back saying that it is finished.
  • URI resolvers - still investigating what can be put in place for this, as I strongly wish to avoid coding this myself. Looking at OCLC's OpenURL and how I can hack it to feed it info:fedora uris and link them to their disseminated location. Also, using a tinyurl type library + simple interface might not be a bad idea for a quick PoC.
  • Just to let you all know that we are building up the digital team here; we most recently held interviews for the futureArch project, but we are looking to hire about 3 others, due to all the projects we are doing. We will be putting out job adverts as and when we feel up to battling with HR :)
That's most of the more interesting hot topics and projects I am doing at the moment.... phew :)

Monday 18 August 2008

Cherry picking the Semantic Web (from Talis's Nodalities magazine)

Just to say that in the Talis Nodalities magazine, Issue 3 [PDF], page 13, they have published an article of mine about how treating everything - author, department, funder, etc. - as a first-class object will have knock-on benefits for the curation and cataloguing of archived items.

When I find a good, final version of the article that I haven't accidentally deleted, I'll post the text of it here ;) Until then, download the PDF version. Read all the articles actually, they are all good!

(NB The magazine itself is licensed under the CC-by-sa, which I think is excellent!)

DSpace and Fedora *need* opinionated installers.

Just to say that both Fedora-Commons and DSpace really, really need opinionated installers that make choices for the user. Getting either installed is a real struggle - as we demonstrated during the Crigshow - so please don't write in the comments that it is easy; it just isn't.

Something that is relatively straightforward to install is a Debian package.

So, just a plea in the dark: can we set up a race? Who can make their repository software installable as a .deb first? Will it be DSpace or Fedora? Who am I going to send a box of cookies and a thank-you note from the entire developer community to?

(EPrints doesn't count in this race; they've already done it.)

Re-using video compression code to aid document quality checking

(Expanding on this video post from the Crigshow)

Problem:


The volume of pages from a large digitisation project can be overwhelming. Add to that the simple fact that all (UK) institutional projects are woefully underfunded and under-resourced, and it's surprising that we can cope with them at all, really.

One issue that repeatedly comes up is the idea of quality assurance: how can we know that a given book has been scanned well? How can we spot images easily? Can we detect if foreign bodies were present in the scan, such as thumbs, fingers or bookmarks?

A quick solution:

This was inspired by a talk at one of the conference strands at WorldComp, where the author talked about using a component of a commonly used video compression standard (MPEG-2) to detect degrees of motion and change in a video, without having to analyse the image sequences with a novel or smart algorithm.

He talked about using the motion vector stream as a good rough guide to the amount of change between frames of video.

So, why did this inspire me?
  • MPEG-2 compression is pretty much a solved problem; there are some very fast and scalable solutions out there today. Direct benefit: no new code needs to be written and maintained.
  • The format is very well understood and stripping out the motion vector stream wouldn't be tricky. Code exists for this too.
  • Pages of text in printed documents tend towards being justified so that the two edges of the text columns are straight lines. There is also (typically) a fixed number of lines on a page.
  • A (comparatively rapid) MPEG2 compression of the scans of a book would have the following qualities:
    • The motion vectors between pages of text would either show little overall change (as differing letters are actually quite similar) or show a small, global shift if the page was printed at a slight offset.
    • The motion vectors between a page of text and the next page - one with an image embedded in the text, or a thumb on the edge - would show localised and distinct changes that differ greatly from the page as a whole.
  • In fact, a really crude solution could be just using the vector stream to create a bookmark list of all the suspect changes (a rough sketch of this follows). This might bring the number of pages to check down to a level that a human mediator could handle.
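
(A rough, purely illustrative sketch of that bookmark-list idea: given per-block motion-vector magnitudes for each page image - however they were pulled out of the MPEG-2 stream - flag the pages where one region moves far more than the page as a whole. The thresholds are arbitrary.)

def suspect_pages(pages, ratio=5.0, floor=1.0):
    """pages: list of lists of motion-vector magnitudes, one list per page image."""
    flagged = []
    for index, magnitudes in enumerate(pages):
        if not magnitudes:
            continue
        mean = sum(magnitudes) / len(magnitudes)
        peak = max(magnitudes)
        # a big local spike against a quiet background suggests a thumb, insert or image
        if peak > floor and peak > ratio * max(mean, floor):
            flagged.append(index)
    return flagged

print(suspect_pages([[0.1] * 64, [0.1] * 63 + [9.0]]))  # flags page 1
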
How much needs to be checked?

Via basic sample survey statistics: to be 95% sure (±5%) that the scanned images of 300 million pages are okay, just 387 totally random pages need to be checked. However, to be sure to the same degree that each individual book is okay, a book being ~300 pages, 169 pages need to be checked in each book. I would suggest that the above technique would significantly lower this threshold, but by an empirically determined amount.
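
(Those figures come from the standard sample-size formula - z=1.96 for 95% confidence, ±5% margin, worst-case p=0.5 - with the finite population correction applied. A quick sketch that reproduces the same ballpark numbers:)

import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with the finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(300_000_000))  # ~385 pages across the whole corpus
print(sample_size(300))          # ~169 pages for a single ~300 page book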

Also note that the above figures carry the assumption that the scanning process doesn't change over time, which of course it does!

The four rules of the web and compound documents

A real quirk that truly interests me is the difference in aims between the way documents are typically published and the way that the information within them is reused.

A published document is normally in a single 'format' - a paginated layout, and this may comprise text, numerical charts, diagrams, tables of data and so on.

My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; the full paper gives the context to the information, but it is much more likely that a small part of the paper contains the novel insight being referenced.

In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn't solve the underlying issue: the author hadn't put any time into considering that a chart, a table or a section of text might be reused on its own.

So, on the web, where multiple representations of the same information are becoming commonplace (mashups, RSS, microblogs, etc), what can we do to better fulfil both aims: to show a paginated final version of a document, and also to allow each of its components to exist as items in their own right, with their own URIs (or better, URLs containing some notion of the context - e.g. if /store/article-id gets to the splash page of the article, /store/article-id/paragraph-id will resolve to the text of that paragraph in the article)?

Note that the four rules of the web (well, of Linked Data) are in essence:
  • give everything a name,
  • make that name a URL ...
  • ...which results in data about that thing,
  • and have it link to other related things.
[From TimBL's originating article. Also, see this presentation - a remix of presentations from TimBL and the speaker, Kingsley Idehen - given at the recent Linked Data Planet conference]

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

One thing first: we have to get over the legal issue of only storing and presenting a bitwise-perfect copy of what an author gives us. We need to let authors know that we may present alternate versions, based on a user's demands. This actually needs to be the case for preservation anyway, and the repository needs to make format migrations, accessibility requirements and so on part of its submission policy.

The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.

When a compound document is submitted to the archive, a second, parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements. One issue that has already come up in testing is that documents tend to clump multiple, separate diagrams together into a single physical image. It is likely that the only solution to breaking these apart is going to be a human one, either author/publisher education (unlikely) or splitting them up by hand.

I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight using HTML elements, which would have the double benefit of being able to be sent directly to a browser to roughly 'recreate' the document.
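
As a purely illustrative sketch of the HTML-based option (all the element, class and id names here are mine), the logical structure might look something like:

<div class="article" id="article-id">
  <div class="page" id="page-1">
    <p class="para" id="page-1-para-1">First paragraph of text...</p>
    <img class="figure" id="page-1-figure-1" src="/store/article-id/page-1/figure-1" alt="Figure 1"/>
    <table class="data" id="page-1-table-1"><!-- rows of data --></table>
  </div>
  <div class="page" id="page-2">
    <p class="para" id="page-2-para-1">...</p>
  </div>
</div>

The same ids then map directly onto the hierarchical URLs (/store/article-id/page-1/figure-1), so one structure drives both the rough in-browser 'recreation' and the per-component addressing.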

Summary:

1) Break apart any compound document into its constituent elements (paragraph level is suggested for text)
2) Make sure that each one of these parts is clearly expressed in the context it is in, using hierarchical URLs: /article/paragraph or, even better, /article/page/chart
3) On the article's splashpage, make a clear distinction between the real article and the broken up version. I would suggest a scheme like Google search's 'View [PDF, PPT, etc] as HTML'. I would assert that many people intuitively understand that this view is not like the original and will look or act differently.

Some related video blogs from the Crigshow trip
Finding and reusing algorithms from published articles
OCR'ing documents; Real documents are always complex
Providing a systematic overview of how a research paper is written - giving each component, and each version of a component, its own URL would have major benefits here

Trackbacks, and spammers, and DDoS, oh my!

The Idea

Before I give you all the dark news about this, let me set out my position: I really, really think that repositories communicating the papers that are cited and referenced to each other is a really good thing. If a paper was deposited in the Oxford archive, and it referenced a paper held in a different repository, say in Southampton's EPrints archive, I think that it is a really fantastic idea to let the Oxford archive tell the Southampton one about it.

And I decided to do something about it - I added two linkback facilities to the archive's user interface, allowing both trackbacks and pingbacks to be archived by the system. I adopted the pre-existing "standards" - really, they are just rough APIs - because I think we have all learned our lessons about making up new APIs for basic tasks.

What is Trackback?

Trackback is an agreed technique from the blogging world. Many blogging systems have it built in, and it enables one blog post to explicitly reference and talk about another post, made on a remote blog somewhere. It does this by POSTing a number of form-encoded parameters to a URL specific to the item that is being referenced. The parameters include things like the title, an excerpt and the URL of the item making the reference.
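
On the wire, a Trackback ping is just an HTTP form POST. A minimal sketch in Python (the endpoint and values are made up; the spec's parameter names are title, excerpt, url and blog_name):

# Sketch of sending a Trackback ping with the Python 2 standard library.
import urllib
import urllib2

trackback_url = 'http://repository.example.org/trackback/some-item-id'  # made-up endpoint
params = urllib.urlencode({
    'title':     'A paper that cites some-item-id',
    'excerpt':   'Short summary of the citing item...',
    'url':       'http://other-repository.example.org/citing-item',
    'blog_name': 'Other repository',
})
response = urllib2.urlopen(trackback_url, params)   # passing data makes this a POST
print response.read()   # the spec replies with a small XML body, <error>0</error> on success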

So on the surface, it appears that this trackback idea performs exactly what I was looking for.

BUT! Trackback has massive, gaping flaws, akin to the flaws that have left the email system full of spam. For one, all trackbacks are trusted in the basic specification: no checking that the URL exists, no checking of the text for relevance, and so on.

Pingback is a slightly different system, in that all that is passed is the URL of the referencing item. It is then up to the remote server to go and fetch the referencing page and parse it to find the reference. (The next version of the specification is crying out to recommend microformats et al, in my opinion.)
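
In code, a pingback is a single XML-RPC call, pingback.ping(sourceURI, targetURI); a minimal sketch (endpoint made up):

# Sketch of sending a Pingback with the Python 2 standard library.
import xmlrpclib

server = xmlrpclib.ServerProxy('http://repository.example.org/pingback')  # made-up endpoint
result = server.pingback.ping(
    'http://other-repository.example.org/citing-item',     # source: the page making the reference
    'http://repository.example.org/objects/some-item-id')  # target: the item being referenced
print result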

So, these systems, trackback and pingback, have been on trial in the live system for about 4 or 5 months, and I am sure you all want to hear my conclusions:

  • Don't implement Trackback as it is defined in its specification... seriously. It is a poorly designed method, with so much slack that it is a spammer's goldmine.
  • Even after adding in some safeguards to the Trackback method, such as parsing the supposed referencing page and checking for HTML and the presence of the supposed link, it was still possible for spammers to get through.
  • When I implemented Trackbacks, I did so with the full knowledge that I might have to stand at a safe distance and nuke the lot. Here is the Trackback model used in the Fedora repository: a DC datastream containing the POSTed information mapped to simple Dublin Core, and a RELS-EXT RDF entry asserting that this Fedora object <info:fedora/trackback-id> references <dcterms:references> the main item in the archive <info:fedora/item-id> (a rough Turtle sketch of this follows the list below). As the user interface for the archive gets the graph for that object, it was easy to get the trackbacks out as well. Having separate objects for the trackbacks, and not changing the referenced item at all, made it very easy to remove the trackbacks at the end.
  • The Trackback system did get hit, once the spammers found a way around my safeguards. So, yes, the trackbacks got 'nuked' and the system turned off.
  • Currently, the system is under a sort of mini-DDoS, from the spammer's botnet trying to make trackbacks and overloading the session tracking system.
  • The Pingback system, utilising XML-RPC calls, was never hit by spam. I still turned it off, as the safeguards on this system were equivalent to the Trackback system.
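
For reference, the RELS-EXT assertion mentioned above boils down to a single triple; shown here in Turtle for brevity (Fedora itself stores RELS-EXT as RDF/XML, and the object identifiers are illustrative):

@prefix dcterms: <http://purl.org/dc/terms/>.

<info:fedora/trackback:1234>
    dcterms:references <info:fedora/ora:5678> .
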
So, how do we go on from this quagmire of spam?

Well, for one, if I had time (and resources) to pass all requests through SpamAssassin, or to pay for Akismet, that would have cut the numbers down drastically. Also, if I had time to sit and moderate all the linkbacks, again, spam would be nipped in the bud.

So, while I truly believe that this type of system is the future, it certainly isn't the case that it can be a system that can just be turned on and the responsibility for maintaining it added to an already full workload.

Alternatives?

White-listing sites may be one method. To limit the application to sharing references between institutions, you could use the PGP idea - a web of trust: sign the passed information with a private key that resolves to a public key from a white-listed institution. This would ensure that the passed reference really was from a given institution, and it should be more flexible than accepting references only from a single fixed IP address.

(There is always the chance that the private key could be leaked and made not-so-private by the institution, but that would have to be their responsibility. Any spam from a mistake of this sort would be directly attributed to those at fault!)

A slower, far less accurate but more traditional method would be for a given institution to harvest references from all the other repositories it knows about. I really don't think this is workable, but it has the pro that a harvester can be sure that a reference links to a given URL (barring the increasingly common DNS-poisoning attacks).

Monday 14 July 2008

OSCELOT Open Source Day III - views

The event was held at the Nevada gaming institute and was, overall, a well-structured day. The driving ideology was that of the unconference - "... a facilitated, face-to-face, and participant-driven conference centered around a theme or purpose."

However, it seemed that the theme or purpose of the event was not about Open Source - it was as if it were a Blackboard self-help group, trying to solve the issues and failings of this proprietary software. Some of the issues were a little shocking - someone proposed that they had "a need to search the content of [Blackboard Vista] repository" - it came as some surprise to me that this wasn't already possible in such a mature product.

I was pleased that we were able to help and inform the other attendees about more open technology and standards, such as OAuth, resource-orientated architecture, creative commons licensing and more.

One session I led was titled - controversially - "Why [bother with] Portals?", in which I wanted to start a discussion about what students actually use. The point I wanted to make was that URLs are the base currency of the internet - search engines produce lists of them, people bookmark them, and URLs are used when sharing information between people.

This means that there is a very large responsibility on the content providers not to change URLs, or they will devalue the very resources they are trying to get people to use. This is the reason why persistent URLs are a crucial thing to aim for.

I hope that we were able to bring extra value to the meeting, due to the fact that, unlike the vast majority of attendees, we do not have a Blackboard background.

However, I do think that the event needed to have more emphasis on real-world open source projects such as Sakai and Moodle, and examine how best to integrate them with external systems.

Tuesday 8 July 2008

Open replacements for Twitter and more importantly, Tinyurl

I hope that you all already know about http://identi.ca and the software stack laconi.ca that it runs - in short, it's a Twitter-like micro-blogging service, that is geared to be open. It provides the possibility of a distributed micro-blogging set of services that can talk to each other. Pretty cool.

But the less well known release, was that of the lilurl service, a Tinyurl replacement, again, geared to be very open. For example, the database of links the service holds can be downloaded by any user! BUT it lacks an API to create these links on the fly...

I think you've already guessed the end of that statement: I've made an API for it, as the base code for the service is open source. Hearty thanks to Evan Prodromou!

So, changes from the source (which is at: http://ur1.ca/ur1-source.tar.gz)

Firstly, change the .htaccess rewrite rules:

From:
  • RewriteRule (.*) index.php
To:
  • RewriteRule s/(.*) index.php


This requires a few cosmetic changes to the index.php to serve correct lil'urls:

Line 41 in index.php:
From:
  • $url = 'http://'.$_SERVER['SERVER_NAME'].'/'.$lilurl->get_id($longurl);
To:
  • $url = 'http://'.$_SERVER['SERVER_NAME'].'/s/'.$lilurl->get_id($longurl);
And then, in the root directory for the app, add in api.php, which is currently pastebinned:

http://pastebin.com/f29465399 - api.php

How it works - Creating lilurls:

POST to /api.php with parameters of longurl=desired url

This will be the response:

HTTP/1.1 201 Created
Date: Tue, 08 Jul 2008 16:51:13 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/5.1.4
Content-Length: 36
Connection: close
Content-Type: text/html; charset=utf-8

http://somehost.com/s/1


The message body of the response will contain the URL. Similarly, you can look up a given lilurl by GET /api.php?id=lilurl_id

GET /api.php?id=1 HTTP/1.1
Host: somehost.com
content-type: text/plain
accept-encoding: compress, gzip
user-agent: Basic Agent

HTTP/1.1 200 OK
Date: Tue, 08 Jul 2008 16:55:00 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/5.1.4
Content-Length: 24
Connection: close
Content-Type: text/html; charset=utf-8

http://ora.ouls.ox.ac.uk
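
A quick client-side sketch of those two calls (somehost.com stands in for wherever the service is installed):

# Create a short URL and look it back up, using the Python 2 standard library.
import urllib
import urllib2

data = urllib.urlencode({'longurl': 'http://ora.ouls.ox.ac.uk'})
short_url = urllib2.urlopen('http://somehost.com/api.php', data).read().strip()
print short_url   # e.g. http://somehost.com/s/1

long_url = urllib2.urlopen('http://somehost.com/api.php?id=1').read().strip()
print long_url    # e.g. http://ora.ouls.ox.ac.uk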


So... er, yeah. Job done ;)

Monday 7 July 2008

Archiving Webpages with ORE

(Idea presented is from the school of "write it down, and then see how silly/workable it is")


This follows on from the example by pkeane on the OAI-ORE mailing list about constructing an Atom 'feed' listing the resources linked to by a webpage. Well, it was more a post wondering what ORE provides that we didn't have before - which, for me, is the idea of an abstract model with multiple possible serialisations. But anyway, I digress.

(pkeane++ for an actual code example too!)

For me, this could be the start of a very good, incremental method for archiving static/semi-static (wiki) pages (a rough code sketch follows the archiving steps below).

Archiving:
  1. Create 'feed' of page (either as an Atom feed or an RDF serialisation)
    • It should be clearly asserted in the feed which one of the resources is the (X)HTML resource that is the page being archived.
  2. Walk through the resources, and work out which ones are suitable for archiving
    • Ignore adverts, perhaps video, and maybe also some remote resources (decisions here are based on policy, and the process is an incremental one: step 2 can be revisited with new policy decisions, such as remote PDF harvesting and so on)
  3. For each resource selected for archiving,
    1. Copy it by value to a new, archived location.
    2. Add this new resource to the feed.
    3. Indicate that this new resource is a direct copy of the original in the feed as well (using the new rdf-in-atom syntax, or just plain rdf in the graph.)
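
As a rough sketch of the archiving steps above (Python 2 standard library only): the 'feed' here is just a list of (original URI, archived location) pairs, the Atom/RDF serialisation is left out, and the policy filter is deliberately trivial.

# Sketch: collect a page's resources, apply a trivial policy, copy the keepers.
import os
import urllib2
import urlparse
from HTMLParser import HTMLParser

class ResourceCollector(HTMLParser):
    """Step 1: walk the (X)HTML and collect linked resources."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.resources = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('img', 'script') and 'src' in attrs:
            self.resources.append(attrs['src'])
        elif tag == 'link' and 'href' in attrs:
            self.resources.append(attrs['href'])

def archivable(uri):
    """Step 2: a stand-in policy decision - here, a crude 'no adverts' rule."""
    return 'ads.' not in uri

def archive_page(page_url, archive_dir='archive'):
    html = urllib2.urlopen(page_url).read()
    collector = ResourceCollector()
    collector.feed(html)
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    feed = [(page_url, page_url)]   # first entry: the (X)HTML page being archived itself
    for uri in collector.resources:
        absolute = urlparse.urljoin(page_url, uri)
        if not archivable(absolute):
            continue
        # Step 3: copy by value to an archived location and record the pair
        local = os.path.join(archive_dir, os.path.basename(absolute) or 'resource')
        open(local, 'wb').write(urllib2.urlopen(absolute).read())
        feed.append((absolute, local))   # 'local' is a direct copy of 'absolute'
    return feed
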
Presentation: (Caching-reliant)
  1. A user queries the service to give a representation for an archived page.
  2. Service recovers ORE map for requested page from internal store
  3. Resource determination - again, policy-based; suggestions:
    • Last-Known-Good: Replaces all URIs in (X)HTML source
      with their archived duplicates, and sends the page to the user.
      (Assumes dupes are RESTful - archived URIs can be GET'ed)
    • Optimistic: Wraps embedded resources with javascript, to attempt to get the original resources or timeout to get the archived versions.
Presentation: (CPU-reliant)
  1. Service processes ORE map on completion of resource archiving
  2. Resource determination - again, policy-based; suggestions:
    • Last-Known-Good: Replaces all URIs in (X)HTML source
      with their archived duplicates, and sends the page to the user.
      (Assumes dupes are RESTful - archived URIs can be GET'ed)
    • Optimistic: Wraps embedded resources with javascript, to attempt to get the original resources or timeout to get the archived versions.
  3. Service stores a new version of the (X)HTML with the URI changes, adds this to the feed, and indicates that this is the archived version.
  4. A user queries the service to give a representation for an archived page and gets it back simply.
So, one presentation method relies on caching, but doesn't need a lot of CPU power to get up and running. The archived pages are also quick to update, and this route may even be a nice way to 'discover' pages for archiving - i.e. the method for archive URL submission is the same as the request. Archiving can continue in the background, while the users get a progressively more archived view of the resource.

The upshot of having a dynamic URI swapping on page request is that there can be multiple copies in potentially mobile locations for each resource, and the service can 'round-robin' or pick the best copies to serve as replacement URIs. This is obviously a lot more difficult to implement with static 'archived' (X)HTML, and would involve URI lookup tables embedded into the DNS or resource server.

Wednesday 11 June 2008

Internet Explorer can go to hell

Download a PDF file from the archive through a browser


[Just to get the headers here, I used: "curl -I http://ora.ouls.ox.ac.uk/objects/uuid%3A4af22069-ec0e-4407-b42d-2926c5a6c9ca/datastreams/ATTACHMENT01"]

Server Response:
HTTP/1.1 200 OK
content-length: 53760
content-disposition: attachment; filename="uuid4af22069-ec0e-4407-b42d-2926c5a6c9ca-ATTACHMENT01.doc"
accept-ranges: bytes
last-modified: Wed, 11 Jun 2008 13:54:45 GMT
content-range: 0-53759/53760
etag: 1213192485.0-53760
content-type: application/msword
x-pingback: http://ora.ouls.ox.ac.uk/pingback
Date: Wed, 11 Jun 2008 13:54:45 GMT
Server: CherryPy/3.0.2
  • Firefox (any version, any OS) downloads this and passes it to an external app fine, job done.
  • Safari, same deal.
  • Camino, ditto.
  • Opera, not a problem.
  • Even Internet Explorer 7, downloads it and opens it fine.
But:

Internet Explorer 6 -> Adobe Acrobat says "Error, file not found"

Internet Explorer 5 -> Adobe still says "Error, file not found"

I'll post up the fix to IE5/6 not being able to download the file properly. It's all in the response headers, and I'll let you play spot the difference:

Server Response (works with IE5/6):
HTTP/1.1 200 OK
content-length: 53760
content-disposition: attachment; filename="uuid4af22069-ec0e-4407-b42d-2926c5a6c9ca-ATTACHMENT01.doc"
accept-ranges: bytes
last-modified: Wed, 11 Jun 2008 13:54:45 GMT
content-range: 0-53759/53760
etag: 1213192485.0-53760
pragma:
content-cache:
content-type: application/msword
x-pingback: http://ora.ouls.ox.ac.uk/pingback
Date: Wed, 11 Jun 2008 13:54:45 GMT
Server: CherryPy/3.0.2

Found it? No?

etag: 1213192485.0-53760
pragma:
content-cache:
content-type: application/msword

That's right, adding empty fields into the response headers magically fixes the download issue for IE6 and IE5. One day, I'll find an IE engineer, and if I do, when I do... well... my bail money will be with my lawyer, just in case.
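
In CherryPy terms the workaround is, as far as I can tell, just a matter of setting those two empty headers on the response; a sketch (handler name and file path are placeholders, not the archive's actual code):

# Sketch of a CherryPy 3.x handler that adds the empty headers for IE5/6.
import cherrypy

class Datastream(object):
    @cherrypy.expose
    def default(self, *args):
        body = open('/tmp/example.doc', 'rb').read()   # placeholder content
        headers = cherrypy.response.headers
        headers['Content-Type'] = 'application/msword'
        headers['Content-Disposition'] = 'attachment; filename="example.doc"'
        headers['Pragma'] = ''          # deliberately empty - the IE5/6 workaround
        headers['Content-Cache'] = ''   # ditto
        return body

if __name__ == '__main__':
    cherrypy.quickstart(Datastream())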

Tuesday 3 June 2008

Google App Engine SDK - How to work out if you are running deployed or locally

Short post this one:

os.environ['SERVER_SOFTWARE'] is your variable to see.
  • Deployed, it reads 'Google Apphosting/1.0'
  • Running locally, it reads 'Development/1.0'
So doing "if os.environ['SERVER_SOFTWARE'].startswith('Development'):" should be enough to deal with the differences between local and deployed.

Release 0.2 of the Python REST Client

The update is due to improvements made to the Google App Engine flavour of the restful lib code - I've added automatic Basic and Digest authentication to the code, so that apps deployed to the App engine can now use services that require either form of authentication, such as Twitter (Basic auth needed) or the Talis Platform (Digest auth only).


http://code.google.com/p/python-rest-client/
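
For what it's worth, a hypothetical usage sketch from App Engine code - the module, class and method names below are from memory and may not match the released code exactly, so check the project page above:

# Hypothetical sketch: calling a Basic-auth service from App Engine code.
from restful_lib import Connection   # name may differ in the GAE flavour

conn = Connection('http://twitter.com', username='someuser', password='secret')
resp = conn.request_get('/statuses/friends_timeline.json')   # auth handled by the library
print resp   # response (status, headers, body)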

OAI-ORE reaches beta

Beta release of specs - http://www.openarchives.org/ore/0.9/

At a cursory glance through the documents, there seems to be a few refinements on version 0.3:
  1. Resource map metadata is now (IMO) handled better
  2. The inclusion of resource typing into the aggregation data model
    - Yeah, wonder where they got that idea ;) But seriously, it suggests typing the objects at a conceptual level. I would have liked to see more physical typing included alongside this, such as metadata-standard adherence (dcterms:conformsTo) and a 'mimetype' of some sort (dc:format, but this is open to debate) - see the sketch after this list.
  3. Proxies are more fleshed out (I still think that this is a solution looking for a problem though)
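
To illustrate what I mean by conceptual versus physical typing, a Turtle sketch (the resource URI and the exact vocabulary choices are illustrative only):

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix dc:      <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.

<http://example.org/article-1/datastream-1>
    rdf:type <http://purl.org/eprint/type/JournalArticle> ;   # conceptual typing
    dc:format "application/pdf" ;                             # 'mimetype' of some sort
    dcterms:conformsTo <http://www.loc.gov/mods/v3> .         # metadata-standard adherence
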
So, aside from the resource map metadata changes, I don't think updating to this will cause me undue work. The alpha-to-beta spec transition was just refinement, not an overhaul.

Thursday 29 May 2008

A method for flexibly using external services

aka "How I picture a simple REST mechanism for queuing tasks with external services, such as transformation of PDFs to images, Documents to text, or scanning, file format identification."
Example: Document to .txt service utilisation


Step 1: send the URL for the document to the service (in this example, the request is automatically accepted - code 201 indicates that a new resource has been created)

{(( server h - 'in the cloud/resolvable' - /x.doc exists ))}

u | ----------------- POST /jobs (msg 'http://h/x.doc') ----------------> | Service (s)
| <---------------- HTTP resp code 201 (msg 'http://s/jobs/1') ---------- |

Step 2: Check the returned resource to find out if the job has completed (an error code 4XX would be suitable if there has been a problem such as unavailability of the doc resource)


u | ----------------- GET /jobs/1 (header: "Accept: text/rdf+n3") --------> | s


If job is in progress:

u | <---------------- HTTP resp code 202 ---------------------------------- | s

If job is complete (and accept format is supported):

u | <---------------- HTTP resp code 303 (location: /jobs/1.rdf) ---------- | s

u | ----------------- GET /jobs/1.rdf --------------> | s
| <---------------- HTTP 200 msg below ------------ |


@prefix s: <http://s/jobs/>.
@prefix store: <http://s/store/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix ore: <http://www.openarchives.org/ore/terms/>.
@prefix dcterms: <http://purl.org/dc/terms/>.

s:1
    ore:isDescribedBy s:1.rdf ;
    dc:creator "Antiword service - http://s" ;
    dcterms:created "2008-05-29T20:33:33.009152" ;
    ore:aggregates store:1.txt .

store:1.txt
    dc:format "text/plain" ;
    dcterms:created "2008-05-29T12:20:33.009152" ;
    XXXXXXX:deleted "2009-05-29T00:00:00.000" ;
    dc:title "My Document" .

<http://h/x.doc>
    dcterms:hasFormat store:1.txt .

---------------------------------------------------------


Then, the user can get the aggregated parts as required, noting the TTL (the 'deleted' date predicate, for which I still need to find a good real choice).

Also, as this is a transformation, the service has indicated this with the final triple, asserting that the created resource is a rendition of the original resource, but in a different format.

A report based on the item, such as something that would be output from JHOVE, Droid or a virus-scanning service, can be shown as an aggregate resource in the same way, or if the report can be rendered using RDF, can be included in the aggregation itself.

It should be straightforward to see that this response gives the opportunity for services to return zero or more files, and for that reply to be self-describing. The re-use of the basic structure of the OAI-ORE profile means that the work going into the Atom format rendition can be replicated here, so an Atom report format could also work.

General service description:

All requests have {[?pass=XXXXXXXXXXXXXXXXX]} as an optional parameter. The service has the choice of whether to support it or not.

Request:
GET /jobs
Response:
Content-negotiation applies, but default response is Atom format
List of job URIs that the user can see (without a pass, the list is just the anonymous ones if the service allows it)

Request:
POST /jobs
Body: "Resource URL"

Response:
HTTP Code 201 - Job accepted - Resp body == URI of job resource
HTTP Code 403 - Forbidden, due to bad credentials
HTTP Code 402 - Request is possible, but requires payment
- resp body => $ details and how to credit the account

Request:
DELETE /jobs/job-id
Response:
HTTP Code 200 - Job is removed from the queue, as are any created resources
HTTP Code 403 - Bad credentials/Not allowed

Request:
GET /jobs/job-id
Header (optional): "Accept: application/rdf+xml" to get rdf/xml rather than the default atom, should the service support it
Response:
HTTP Code 406 - Service cannot make a response to comply with the Accept header formats
HTTP Code 202 - Job is in process - msg MAY include an expected time for completion
HTTP Code 303 - Job complete, redirect to formatted version of the report (typically /jobs/job-id.atom/rdf/etc)

Request:
GET /jobs/job-id.atom
Response:
HTTP Code 200 - Job is completed, and the msg body is the ORE map in Atom format
HTTP Code 404 - Job is not complete
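
Pulling the above together, the client's side of the protocol might look something like this sketch (the service URL is made up, and error handling is minimal):

# Sketch of a client for the job protocol above (Python 2 standard library).
import time
import urllib2

SERVICE = 'http://s.example.org'

def submit_job(resource_url):
    """POST the resource URL to /jobs; the 201 response body is the job URI."""
    req = urllib2.Request(SERVICE + '/jobs', data=resource_url)
    return urllib2.urlopen(req).read().strip()

def wait_for_report(job_uri, poll_interval=10):
    """Poll the job URI until the 303 redirects us to the formatted report."""
    req = urllib2.Request(job_uri, headers={'Accept': 'application/rdf+xml'})
    while True:
        resp = urllib2.urlopen(req)     # urllib2 follows the 303 redirect itself
        if resp.getcode() == 202:
            time.sleep(poll_interval)   # job still in progress
            continue
        return resp.read()              # the ORE map describing the job's outputs

job_uri = submit_job('http://h.example.org/x.doc')
print wait_for_report(job_uri)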

Authorisation and economics


The authorisation for use of the service is a separate consideration, but ultimately it is dependent on the service implementation - whether anonymous access is allowed, rate limits, authorisation through invitation only, etc.

I would suggest the use of SSL for those services that do use authorisation, but not HTTP Digest per se. HTTP Basic through an SSL connection should be good enough; the Digest standard is not pleasant to try to implement and get working (the standard is a little ropey).

Due to the possibility of a code 402 (payment required) on a job request, it is possible to start to add in some economic valuations. It is required that the server holding the resource can respond to a HEAD request sensibly and report information such as file-size and format.

A particular passcode can be credited to allow it to make use of a service, the use of which debits the account as required. When an automated system hits upon a 402 (payment required) rather than a plain 403 (Forbidden), this could trigger mechanisms to get more credit, rather than a simple fail.

Links:
OAI-ORE spec - http://www.openarchives.org/ore/0.3/toc

HTTP status codes - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html