Monday 18 August 2008

Cherry picking the Semantic Web (from Talis's Nodalities magazine)

Just to say that Talis's Nodalities magazine, Issue 3 [PDF], page 13, has published an article of mine about how treating everything - author, department, funder, etc. - as a first-class object will have knock-on benefits for the curation and cataloguing of archived items.

When I find a good, final version of the article that I haven't accidentally deleted, I'll post the text of it here ;) Until then, download the PDF version. Actually, read all the articles; they are all good!

(NB The magazine itself is licensed under CC BY-SA, which I think is excellent!)

DSpace and Fedora *need* opinionated installers.

Just to say that both Fedora Commons and DSpace really, really need opinionated installers that make choices for the user. Getting either installed is a real struggle - we demonstrated as much during the Crigshow - so please don't write in the comments that it is easy; it just isn't.

Something that is relatively straightforward to install is a Debian package.

So, just a plea in the dark: can we set up a race? Who can make their repository software installable as a .deb first? Will it be DSpace or Fedora? Who am I going to send a box of cookies and a thank-you note to, from the entire developer community?

(EPrints doesn't count in this race; they've already done it.)

Re-using video compression code to aid document quality checking

(Expanding on this video post from the Crigshow)

Problem:

The volume of pages from a large digitisation project can be overwhelming. Add to that the simple fact that all (UK) institutional projects are woefully underfunded and under-resourced, and it's surprising that we cope with them at all.

One issue that repeatedly comes up is quality assurance: how can we know that a given book has been scanned well? How can we spot images easily? Can we detect whether foreign bodies were present in the scan, such as thumbs, fingers or bookmarks?

A quick solution:

This was inspired by a talk at one of the conference strands at WorldComp, where the speaker described using a component of a commonly used video compression standard (MPEG-2) to detect degrees of motion and change in a video, without having to analyse the image sequences with a novel or particularly smart algorithm.

He used the motion vector stream as a good, rough guide to the amount of change between frames of video.

So, why did this inspire me?
  • MPEG-2 compression is pretty much a solved problem; there are some very fast and scalable implementations out there today. Direct benefit: no new code needs to be written and maintained.
  • The format is very well understood and stripping out the motion vector stream wouldn't be tricky. Code exists for this too.
  • Pages of text in printed documents tend towards being justified so that the two edges of the text columns are straight lines. There is also (typically) a fixed number of lines on a page.
  • A (comparatively rapid) MPEG2 compression of the scans of a book would have the following qualities:
    • The motion vectors between pages of text would either show little overall change (as differing letters are actually quite similar), or a small, global shift if the page was printed at a slight offset.
    • The motion vectors between a page of plain text and a following page with an image embedded in the text, or a thumb on the edge, would show localised, distinct changes that stand out from the overall pattern.
  • In fact, a really crude solution would be to just use the vector stream to create a bookmark list of all the suspect changes (see the sketch below). This might bring the number of pages to check down to a level that a human mediator could handle.
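
For the crude bookmark-list idea above, here is a minimal sketch of the flagging logic. Pulling real motion vectors out of an MPEG-2 stream needs an MPEG-aware tool, so this sketch substitutes OpenCV's dense optical flow as a stand-in for the vector field; the file pattern and thresholds are illustrative assumptions, not tested values.

    # Sketch: flag page transitions showing large, localised change rather
    # than a small global shift. OpenCV's dense optical flow stands in for
    # the MPEG-2 motion vector stream, purely to illustrate the logic.
    import glob
    import cv2
    import numpy as np

    SHIFT_THRESHOLD = 2.0   # px: ignore small global offsets from page registration
    PATCH_THRESHOLD = 8.0   # px: a region moving this much is 'suspect'

    def suspect_pages(scan_glob="scans/page-*.png"):
        """Return indices of pages whose transition from the previous page
        shows localised motion rather than a small global shift."""
        pages = sorted(glob.glob(scan_glob))
        flagged = []
        prev = cv2.imread(pages[0], cv2.IMREAD_GRAYSCALE)
        for idx, path in enumerate(pages[1:], start=1):
            curr = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            magnitude = np.linalg.norm(flow, axis=2)
            global_shift = np.median(magnitude)        # whole-page offset
            local_peak = np.percentile(magnitude, 99)  # strongest local change
            if global_shift < SHIFT_THRESHOLD and local_peak > PATCH_THRESHOLD:
                flagged.append(idx)                    # bookmark for a human to check
            prev = curr
        return flagged

    if __name__ == "__main__":
        print("Pages to eyeball:", suspect_pages())
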
How much needs to be checked?

Via basic sample survey statistics: to be sure to 95% confidence (±5% margin of error) that the scanned images of 300 million pages are okay, just 387 totally random pages need to be checked. However, to be sure that each individual book is okay to the same degree - a book being ~300 pages - 169 pages need to be checked in each book. I would suggest that the above technique would significantly lower this threshold, though by an amount that would have to be found empirically.
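
For reference, figures of this order come from the standard sample-size formula with a finite population correction; a quick sketch (which, with the usual defaults, gives roughly 385 for the whole corpus and 169 for a ~300-page book):

    # Cochran's sample-size formula with a finite population correction.
    import math

    def sample_size(population, z=1.96, margin=0.05, p=0.5):
        """Pages to check for the given confidence level and margin of error."""
        n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population sample size
        n = n0 / (1 + (n0 - 1) / population)        # finite population correction
        return math.ceil(n)

    print(sample_size(300_000_000))  # whole corpus: ~385 pages
    print(sample_size(300))          # a single ~300-page book: 169 pages
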

Also note that the above figures carry the assumption that the scanning process doesn't change over time, which of course it does!

The four rules of the web and compound documents

A real quirk that truly interests me is the difference in aims between the way documents are typically published and the way that the information within them is reused.

A published document is normally in a single 'format' - a paginated layout, and this may comprise text, numerical charts, diagrams, tables of data and so on.

My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; the full paper gives the context for the information, but it is much more likely that a small part of the paper contains the novel insight being referenced.

In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn't solve the underlying issue: the author hadn't put time into considering that a chart, a table or a section of text might be reused in its own right.

So, on the web, where multiple representations of the same information are becoming commonplace (mashups, RSS, microblogs, etc.), what can we do to better fulfil both aims: to show a paginated, final version of a document, and also to allow each of its components to exist as items in their own right, with their own URIs (or better, URLs containing some notion of the context - e.g. if /store/article-id gets you to the splash page of the article, /store/article-id/paragraph-id resolves to the text of that paragraph in the article)?

Note that the four rules of the web (well, of Linked Data) are in essence:
  • give everything a name,
  • make that name a URL ...
  • ...which results in data about that thing,
  • and have it link to other related things.
[From TimBL's originating article. Also, see this presentation - a remix of presentations from TimBL and the speaker, Kingsley Idehen - given at the recent Linked Data Planet conference]

I strongly believe that applying this to the individual components of a document is a very good and useful thing.
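
As a rough illustration - not a description of any live system - this is what giving a single paragraph its own URL and linking it to its parent article might look like, sketched with rdflib and Dublin Core terms; the base URL and identifiers are made up:

    # Sketch: a paragraph as a first-class resource, linked to its article.
    # Base URL and identifiers are illustrative assumptions.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    BASE = "http://example.org/store/"
    article = URIRef(BASE + "article-42")
    paragraph = URIRef(BASE + "article-42/paragraph-7")

    g = Graph()
    g.add((paragraph, DCTERMS.isPartOf, article))   # rule 4: link to related things
    g.add((paragraph, DCTERMS.description,
           Literal("The paragraph containing the novel insight being cited.")))
    g.add((article, DCTERMS.hasPart, paragraph))

    print(g.serialize(format="turtle"))
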

One thing first: we have to get over the legal issue of only storing and presenting a bit-perfect copy of what an author gives us. We need to let authors know that we may present alternate versions, based on a user's demands. This needs to be the case for preservation anyway, and the repository needs to make it part of its submission policy to allow for format migrations, accessibility requirements and so on.

The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.

When a compound document is submitted to the archive, a second, parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements. One issue that has already come up in testing is that documents tend to clump multiple, separate diagrams together into a single physical image. It is likely that the only solution to this is going to be a human one: either author/publisher education (unlikely) or breaking them up by hand.

I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle on whether to base it on the content XML format inside the OpenDocument format, or on something very lightweight using HTML elements, which would have the double benefit of being sendable directly to a browser to roughly 'recreate' the document.

Summary:

1) Break apart any compound document into its constituent elements (paragraph level is suggested for text)
2) Make sure that each one of these parts is clearly expressed in the context it is in, using hierarchical URLs, /article/paragraph or, even better, /article/page/chart (see the sketch below)
3) On the article's splash page, make a clear distinction between the real article and the broken-up version. I would suggest a scheme like Google search's 'View [PDF, PPT, etc] as HTML'; many people intuitively understand that this view is not like the original and will look or act differently.
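
As a very rough sketch of steps 1 and 2 - assuming an HTML rendering of the article is available - the broken-up version could be produced along these lines (file names and URL scheme are illustrative assumptions):

    # Sketch: split an HTML rendering of an article into fragments and give
    # each a hierarchical URL. Input file and URL scheme are made up.
    from pathlib import Path
    from bs4 import BeautifulSoup

    def fragment_article(html_path, article_id, base_url="http://example.org/store"):
        soup = BeautifulSoup(Path(html_path).read_text(encoding="utf-8"), "html.parser")
        fragments = {}
        for n, element in enumerate(soup.find_all(["p", "table", "img"]), start=1):
            kind = {"p": "paragraph", "table": "table", "img": "figure"}[element.name]
            url = f"{base_url}/{article_id}/{kind}-{n}"
            fragments[url] = str(element)   # keep the markup so a browser can re-render it
        return fragments

    if __name__ == "__main__":
        for url, chunk in fragment_article("article-42.html", "article-42").items():
            print(url, len(chunk), "bytes")
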

Some related video blogs from the Crigshow trip:
  • Finding and reusing algorithms from published articles
  • OCR'ing documents; real documents are always complex
  • Providing a systematic overview of how a research paper is written - giving each component, and each version of a component, its own identity would have major benefits here

Trackbacks, and spammers, and DDoS, oh my!

The Idea

Before I give you all the dark news about this, let me set out my position: I really, really think that repositories telling each other which papers are cited and referenced is a really good thing. If a paper was deposited in the Oxford archive, and it referenced a paper held in a different repository, say in Southampton's EPrints archive, I think it is a really fantastic idea to let the Oxford archive tell the Southampton one about it.

And I decided to do something about it - I added two linkback facilities to the archive's user interface, allowing both trackbacks and pingbacks to be archived by the system. I adopted the pre-existing "standards" - really, they are just rough APIs - because I think we have all learned our lessons about making up new APIs for basic tasks.

What is Trackback?

Trackback is an agreed technique from the blogging world. Many blogging systems have it built in, and it enables one blog post to explicitly reference and talk about another post made on a remote blog somewhere. It does this by POSTing a number of form-encoded parameters to a trackback URL specific to the item that is being referenced. The parameters include things like the title, an excerpt and the URL of the item making the reference.
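
On the wire, a trackback ping is nothing more than this (the endpoint and values below are made up for illustration):

    # Sketch: sending a trackback ping, per the Trackback specification -
    # a form-encoded POST of url/title/excerpt/blog_name. Endpoint is made up.
    import requests

    trackback_url = "http://example.org/store/article-42/trackback"
    response = requests.post(trackback_url, data={
        "url": "http://repository.example.ac.uk/item/123",   # the referencing item
        "title": "A paper that cites article-42",
        "excerpt": "...in contrast to the results reported in article-42...",
        "blog_name": "Some other repository",
    }, timeout=10)
    print(response.status_code, response.text)   # the spec replies with a small XML body
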

So on the surface, it appears that this trackback idea performs exactly what I was looking for.

BUT! Trackback has massive, gaping flaws, akin to the flaws in the email system that leave it full of spam. For one, all trackbacks are trusted in the basic specification: no checking that the URL exists, no checking of the text for relevance, and so on.

Pingback is a slightly different system, in that all that is passed is the URL of the referencing item (along with the URL of the item being referenced). It is then up to the remote server to go and fetch that page and parse it to find the reference. (The next version of the specification is crying out to recommend microformats et al., in my opinion.)
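
For comparison, a pingback is a single XML-RPC call, pingback.ping(sourceURI, targetURI), made to the endpoint advertised by the referenced page; the URLs here are made up for illustration:

    # Sketch: sending a pingback, per the Pingback specification.
    import xmlrpc.client

    endpoint = xmlrpc.client.ServerProxy("http://example.org/xmlrpc/pingback")
    result = endpoint.pingback.ping(
        "http://repository.example.ac.uk/item/123",   # source: the referencing item
        "http://example.org/store/article-42",        # target: the item being referenced
    )
    print(result)   # on success the server returns a human-readable string
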

So, these systems, trackback and pingback, have been on trial in the live system for about 4 or 5 months, and I am sure you all want to hear my conclusions:

  • Don't implement Trackback as it is defined in its specifications... seriously. It is a poorly designed method, with so much slack that it is a spammer's goldmine.
  • Even after adding some safeguards to the Trackback method, such as parsing the supposed referencing page and checking that it was HTML and contained the supposed link (see the sketch after this list), it was still possible for spammers to get through.
  • When I implemented Trackbacks, I did so with the full knowledge that I might have to stand at a safe distance and nuke the lot. Here is the Trackback model used in the Fedora repository: a DC datastream containing the POSTed information mapped to simple Dublin Core, and a RELS-EXT RDF entry asserting that this Fedora object <info:fedora/trackback-id> referenced <dcterms:references> the main item in the archive <info:fedora/item-id>. As the user interface for the archive gets the graph for that object, it was easy to get the trackbacks out as well. Having separate objects for the trackbacks, and not changing the referenced item at all, made it very easy to remove the trackbacks at the end.
  • The Trackback system did get hit, once the spammers found a way around my safeguards. So, yes, the trackbacks got 'nuked' and the system turned off.
  • Currently, the system is under a sort of mini-DDoS, from the spammer's botnet trying to make trackbacks and overloading the session tracking system.
  • The Pingback system, utilising XML-RPC calls, was never hit by spam. I still turned it off, as the safeguards on this system were equivalent to those on the Trackback system.
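
For the record, the safeguard mentioned above amounted to something along these lines - a simplified sketch of that kind of check, which, as noted, spammers still found ways around:

    # Sketch: fetch the page that claims to reference us, check it really is
    # HTML and that it actually links to the item in question.
    import requests

    def plausible_linkback(source_url, target_url):
        try:
            resp = requests.get(source_url, timeout=10)
        except requests.RequestException:
            return False                                   # unreachable: reject
        content_type = resp.headers.get("Content-Type", "")
        if resp.status_code != 200 or "html" not in content_type.lower():
            return False                                   # not an HTML page: reject
        return target_url in resp.text                     # must actually link to us

    print(plausible_linkback("http://repository.example.ac.uk/item/123",
                             "http://example.org/store/article-42"))
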
So, how do we go on from this quagmire of spam?

Well, for one, if I had the time (and resources) to pass all requests through SpamAssassin or pay for Akismet, that would have cut the numbers down drastically. Also, if I had time to sit and moderate all the linkbacks, again, spam would be nipped in the bud.

So, while I truly believe that this type of system is the future, it certainly isn't something that can just be turned on, with the responsibility for maintaining it added to an already full workload.

Alternatives?

White-listing sites may be one method. To limit the application to sharing references between institutions, you could use the PGP idea - a web of trust - signing the passed information with a private key that can be verified against a public key from a white-listed institution. This would ensure that the passed reference really was from a given institution, and it should be more flexible than only accepting references from a single white-listed IP address.

(There is always the chance that the private key could be leaked and made not-so-private by the institution, but that would have to be their responsibility. Any spam from a mistake of this sort would be directly attributed to those at fault!)
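
To make the signing idea concrete, here is a minimal sketch using Ed25519 keys from Python's 'cryptography' package as a stand-in for a full PGP web of trust; the keys, payload and white-list handling are all illustrative assumptions:

    # Sketch: sign a reference at the sending institution, verify it against
    # a white-list of institutional public keys at the receiving repository.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # --- at the sending institution ---
    private_key = Ed25519PrivateKey.generate()        # kept secret by the institution
    reference = b"http://repository.example.ac.uk/item/123 references http://example.org/store/article-42"
    signature = private_key.sign(reference)

    # --- at the receiving repository ---
    whitelisted_keys = [private_key.public_key()]     # public keys distributed out of band

    def accepted(payload, sig):
        for key in whitelisted_keys:
            try:
                key.verify(sig, payload)
                return True                           # signed by a trusted institution
            except InvalidSignature:
                continue
        return False

    print(accepted(reference, signature))             # True
    print(accepted(b"some spam payload", signature))  # False
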

A slower, far less accurate but more traditional method would be for a given institution to harvest references from all the other repositories it knows about. I really don't think this is workable, but it has the advantage that a harvester can be sure that a reference links to a given URL (barring the increasingly common DNS poisoning attacks).