Thursday 15 October 2009

Python in a Pairtree

(Thanks to @anarchivist for the title - I'll let him take all the 'credit')

"Pairtree? huh, what's that?" - in a nutshell it's 'just enough veneer on top of a conventional filesystem' for it to be able to store objects sensibly; a way of storing objects by id on a normal hierarchical filesystem in a pragmatic fashion. You could just have one directory that holds all the objects, but this would unbalance the filesystem and due to how most are implemented, would result in a less-than-efficient store. Filesystems just don't deal well with thousands or hundreds of thousands of directories in the same level.

Pairtree provides enough convention and fanning out of hierarchical directories to both spread the load of storing high numbers of objects, while retaining the ability to treat each object distinctly.

The Pairtree specification is a compromise between fanning out too much and too little, and it assumes that the ids used are opaque; that the ids have no meaning and are to all intents and purposes 'random'. If your ids are not opaque (for example, if they are human-readable words), then you will have to tweak how the ids are split into directories to get good performance.
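Roughly speaking, the id-to-path mapping is just "split the id into pairs of characters". A much-simplified sketch of that idea (it ignores the spec's character-cleaning and object-encapsulation rules, so treat it as illustrative only):

def pairtree_path(identifier):
    """Split an id into two-character directory names, pairtree-style.

    Simplified sketch only - the real spec also cleans/escapes awkward
    characters in the id before splitting.
    """
    pairs = [identifier[i:i+2] for i in range(0, len(identifier), 2)]
    return "pairtree_root/" + "/".join(pairs)

print(pairtree_path("abcdefg"))   # pairtree_root/ab/cd/ef/g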

[I'll copy&paste some examples from the specifications to illustrate what it does]

For example, to store objects that have identifiers like the following URI - http://n2t.info/ark:/13030/xt2{some string}

eg:

http://n2t.info/ark:/13030/xt2aacd
http://n2t.info/ark:/13030/xt2aaab
http://n2t.info/ark:/13030/xt2aaac

This works out to look like this on the filesystem:
current_directory/
|   pairtree_version0_1    [which version of pairtree]
|   ( This directory conforms to Pairtree Version 0.1. Updated spec: )
|   ( http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html )
|
|   pairtree_prefix
|   ( http://n2t.info/ark:/13030/xt2 )
|
\--- pairtree_root/
     |--- aa/
     |    |--- cd/
     |    |    |--- foo/
     |    |    |    |   README.txt
     |    |    |    |   thumbnail.gif
     |    |    ...
     |    |--- ab/ ...
     |    |--- af/ ...
     |    |--- ag/ ...
     |    ...
     |--- ab/ ...
     ...
     \--- zz/ ...
          ...


With the object http://n2t.info/ark:/13030/xt2aacd containing a directory 'foo', which itself contains a README and a thumbnail gif.

Creating this structure by hand is tedious, and luckily for you, you don't have to (if you use python that is)

To get the pairtree library that I've written, you can either install it from the PyPI site http://pypi.python.org/pypi/Pairtree or, if python-setuptools/easy_install is on your system, you can just run: sudo easy_install pairtree

You can find API documentation and a quick start here.

The quick start should get you up and running in no time at all, but let's look at how we might store Fedora-like objects on disk using pairtree. (I don't mean how to replicate how Fedora stores objects on disk, I mean how to make an object store that gives us the basic framework of 'objects are bags of stuff')


>>> from pairtree import *
>>> f = PairtreeStorageFactory()
>>> fedora = f.get_store(store_dir="objects", uri_base="info:fedora/")


Right, that's the basic framework done, let's add some content:


>>> obj = fedora.create_object('changeme:1')
>>> with open('somefileofdublincore.xml', 'rb') as dc:
...     obj.add_bytestream('DC', dc)
...
>>> with open('somearticle.pdf', 'rb') as pdf:
...     obj.add_bytestream('PDF', pdf)
...
>>> obj.add_bytestream('RELS-EXT', """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rel="info:fedora/fedora-system:def/relations-external#">
<rdf:Description rdf:about="info:fedora/changeme:1">
<rel:isMemberOf rdf:resource="info:fedora/type:article"/>
</rdf:Description>
</rdf:RDF>""")


The add_bytestream method is adaptive - if you pass it something that supports a read() method, it will attempt to stream out the content in chunks to avoid reading the whole item into memory at once. If not, it will just write the content out as is.
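The gist of that adaptive behaviour is something like the following - my own simplified sketch of the idea, not the library's actual code:

def write_adaptively(source, out_file, chunk_size=1024 * 8):
    """Stream file-like sources out in chunks; write anything else directly.

    A sketch of the 'adaptive' idea only, not pairtree's implementation.
    """
    if hasattr(source, 'read'):
        chunk = source.read(chunk_size)
        while chunk:
            out_file.write(chunk)
            chunk = source.read(chunk_size)
    else:
        out_file.write(source)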

I hope this gives people some idea of what is possible with a conventional filesystem. After all, filesystem code is pretty well tested in the majority of cases, so why not make use of it?

(NB python's with statement is a nice way of dealing with file-like objects; it was made part of the core in python ~2.6 I think. It makes sure that the file is closed at the end of the block, equivalent to a "temp = open(foo) - do stuff - temp.close()")

Friday 19 June 2009

What is a book if you can print one in 5 minutes?

There exists technology now, available in bookshops and certain forward-thinking libraries, to print a book in 5 minutes from pressing Go, to getting the book into your hands.

This excites me a lot. Yes, that does imply I am a geek, but whatever.

So, what would I want to do with one? Well, printing books that already exist is fun, but it doesn't grasp the potential. If you can print a book in 5 minutes, for how long must that book have been in existence before you press print? Why can't we start talking about repurposing correctly licenced or public domain content?

Well, what I need (and am keen to get going with) is the following:

1) PDF generator -> pass it an RSS feed of items and it will do its best to generate page content from these.
- blogs/etc: grab the RSS/Atom feed and parse out the useful content
- Include option to use blog comments or to gather comments/backlinks/tweets from the internet
- PDFs - simply concatenate the PDF as is into the final PDF
- Books/other digital items with ORE -> interleave these
- offer similar comment/backlink option as above
- ie the book can be added 'normally' with the internet-derived comments on the facing page to the book/excerpt they actually refer to, or the discussion can be mirrored with the comments in order and threaded, with the excerpts from the pages being attached to these. Or why not both?

- Automated indexes of URLs, dates and commenters can be generated without too much trouble on demand.
- Full-text indexes will be more demanding to generate, but I am sure that a little money and a crowd-sourced solution can be found.

2) Ability to (onsite) print these PDFs into a single, (highly sexy) bound volume using a machine such as can be found in many Blackwell's bookshops today.

3) A little capital to run competitions, targeting various levels in the university, asking the simple question "If you could print anything you want as a bound book in 5 minutes, what's the most interesting thing you can think of to print out?"

Why?
People like books. They work, they don't need batteries and people who can read can intuitively 'work' a book. But books are not very dynamic. You have to have editors, drafters, publishers, and so on and so forth, and the germination of a book has to be measured in years... right?

Print on demand smashes that and breaks down conceptions of what a book is. Is it a sacred tome that needs to be safeguarded and lent only to the most worthy? Or is it a snapshot of an ongoing teaching/research process? Or can it simply be a way to print out a notebook with page numbers as you would like them? Can a book be an alive and young collation of works, useful now, but maybe not as critical in a few years?

Giving people the ability to make and generate their own books offers more potential - what books are they creating? Which generated books garner the most reuse, comments and excitement? Would the comments about the generated works be worth studying and printing in due course? Will people break through the pen-barrier, that taboo of taking pen to a page? Or will we just see people printing wikitravel guides and their flickr account?

Use-cases to give a taste of the possibilities:
- Print and share a discussion about an author, with excerpts ordered and surrounded by the chronologically ordered and threaded comments made by a research group, a teaching group or even just a book club.
- Library 'cafe' - the library can subsidise the printing of existing works for use in the cafe, as long as the books stay in the cafe. Spillages and crumbs are not an issue for these facsimile books.
- Ability to record and store your term's or year's worth of notes in a single volume for posterity. At £5 a go, many students will want this.
- Test print a Thesis/Dissertation, without the expense of consulting a book binder.
- Archive in paper a snapshot of a digital labbook implemented on drupal or wordpress.
- Lecturer's notes from a given term, to avoid the looseleaf A4 overload spillage that often occurs.
- Printing of personalised or domain specific notebooks. (ie. a PDF with purposed fields, named columns and uniquely identified pages for recording data in the field - who says a printed book has to be full of info?)
- Maths sheets/tests/etc
- Past Papers

I am humbled by the work done by Russell Davies, Ben Terrett and friends in this area, and I can pinpoint the time at which I started to think more about these things: BookCamp, sponsored by Penguin UK and run by Jeremy Ettinghausen (blog)

Please, please see:

http://tinyurl.com/9qfoyt - Things Our Friends Have Written On The Internet 2008

Russell Davies UnNotebook: http://russelldavies.typepad.com/planning/2009/02/unnotebook.html
(http://tinyurl.com/cpdllw)

Friday 15 May 2009

RDF + UI + Fedora for object metadata (RDF) editing

Just a walkthrough of something I am trying to implement at the moment:

Requirements:

For the Web UI:

Using jQuery and 3 plugins: jEditable, autocomplete and rdfquery.

Needed middleware controls from the Web App (a rough code sketch of these calls follows the list):
  1. Create a new session (specifically, a delta of the RDF expressed in iand's ChangeSet schema, http://vocab.org/changeset/schema ): POST /{object-id}/{RDF}/session/new -> HTTP 201 with the session url (which includes the object id root)
  2. POST triples to /{session-url}/update to add to the 'add' and/or 'delete' portions
  3. POST to /{session-url}/commit to apply the changeset, or just DELETE /{session-url} to discard it
And all objects typed by rdf:type (multitypes allowed)
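To make those calls concrete, here is roughly how a client might drive them. This is a sketch only: the endpoints are the proposed ones above, the host name and the form-encoded payload are my own invention, and httplib2 is simply what I tend to reach for.

import urllib
import httplib2

h = httplib2.Http()
base = "http://repo.example.org"   # hypothetical host

# 1. create a new editing session for the object's RDF datastream
#    ('RELS-EXT' standing in for the {RDF} datastream id)
resp, body = h.request(base + "/changeme:1/RELS-EXT/session/new", "POST")
session_url = resp['location']     # expecting HTTP 201 plus a Location header

# 2. post a delete/add pair of triples to the session
update = urllib.urlencode({
    'delete': '<info:fedora/changeme:1> <http://purl.org/dc/terms/title> "Old title" .',
    'add':    '<info:fedora/changeme:1> <http://purl.org/dc/terms/title> "New title" .',
})
h.request(session_url + "/update", "POST", body=update,
          headers={'Content-Type': 'application/x-www-form-urlencoded'})

# 3. either commit the changeset...
h.request(session_url + "/commit", "POST")
# ...or abandon the session instead:
# h.request(session_url, "DELETE")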

Workflow:

  1. Template grabs RDF info from object, and then displays it in the typical manner (substituting labels for uris when relevant), but also encodes the values with RDFa.
  2. If the user is auth'd to edit, each of these values has a css class added so that the inline editing for jeditable can act on it.
  3. The template then reads, for the given type of object, the cardinality of the fields present (eg from an OWL markup for the class) and also the other predicates that can be applied to this object. For multivalued predicates, an 'add another' type link is appended below. For unused predicates, it's up to the template to suggest these - currently, all the objects in the repo can have type-specific templates, but for this example, I am considering generics.
  4. For predicates which have usefully typed ranges - ie foaf:knows in our system points to a URI, rather than a string - autocomplete is used to hook into our (or maybe another's) index of known labels for URIs to suggest correct values. For example, if an author was going to indicate their affiliation to a department here at Oxford (BRII project), it would be handy if a correct list of department labels was used. A choice from the list would display as the label, but represent the URI in the page.
  5. When the user clicks on it to change the value, a session is created if none exists stamped with the start time of the edit and the last modified date of the RDF datastream, along with details of the editor, etc.
  6. rdfquery is used to pull the triple from the RDFa in the edited field. When the user submits a change, the rdfa triple is posted to the session url as a 'delete' triple and the new one is encoded as an 'add' triple.
  7. A simple addition would just post to the session with no 'delete' parameter.
  8. The UI should then reflect that the session is live and should be committed when the user is happy with the changes.
  • On commit, the session would save the changeset to the object being edited, and update the RDF file in question (so we keep a record of every change made). rdfquery would then update the RDFa in the page to the new values, upon a 200/204 reply.
  • On cancel, the values would be restored, and the session deleted.
Commit Notes:
If the lastmodified date on the datastream is different from the one marked on the session (ie a possible conflict), the page information is updated to the most recent version, the session is reapplied in the browser, highlighting the conflicts, and a warning is given to the user.

I am thinking of increasing the feedback using a messaging system, while keeping the same optimistic edit model - you can see the status of an item, and that someone else has a session open on it. The degree of feedback is something I am still thinking about - should the UI highlight or even reflect the values that the other user(s) are editing in realtime? Is that useful?

Monday 30 March 2009

Early evaluation and serialisation of preservation policy decisions.

(Apologies: this had been sitting as a draft when I thought it was published. I have updated it to reflect changes that have been made since we started doing this.)

It may be policy to make sure that the archive's materials are free of computer malware - one part of enacting this policy is running anti-virus and anti-spyware scans of the content. However, malware may be stored in the archive a number of months before it is widely recognised as such. So, the enactment of the 'no malware' policy would mean that content is scanned on ingest, again 3 to 6 months after ingest, and one final time a year later.

Given that it is possible to monitor when changes occur to the preservation archive, it is not necessary to run continual sweeps of the content held in the archive to assess whether a preservation action is needed or not. Most actions that result from a preservation policy choice can be pre-assigned to an item when it undergoes a change of state (creation, modification, deletion, extension)

These decisions for actions (and also the auditable record of them occurring) are recorded inside the object itself and, in a bid to reuse a standard rather than reinvent one, this serialisation uses the iCal standard. iCal already has a proven capability to mark and schedule events, to handle recurring events, and to attach arbitrary information to these individual events.
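For illustration, a pared-down example of the kind of record I mean - the field values here are invented, and a real record would reference the archived description of the action it instantiates:

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//example archive//preservation events//EN
BEGIN:VEVENT
UID:rescan-objectXYZ-20100115@archive.example.org
DTSTAMP:20091015T120000Z
DTSTART:20100115T000000Z
SUMMARY:Scheduled anti-virus re-scan of object XYZ
DESCRIPTION:Re-run the malware scan 3 to 6 months after ingest\, as per the 'no malware' policy
END:VEVENT
END:VCALENDAR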

For the archive to self-describe and be preservable for the longer term, it is necessary for the actions taken to be archivable in some way too. A human-readable description of the action, alongside a best-effort attempt to describe this in machine-readable terms, should be archived and referenced by any event that is an instance of that action. ('best-effort' due to the underwhelming nature of the current semantics and schemas for describing these preservation processes)

In the Oxford system, an iCal calendar implementation called Darwin Calendar Server was initially used to provide a queriable index of the preservation actions, along with a report of what events needed to be queued to be performed on a given day. These actions are queued in the short-term job queues (technically, being held in persistent AMQP queues) for later processing. However, the various iCal server implementations were neither lightweight nor stable enough to be easily reused, so from this point on, simple indexes are created as needed from the serialised iCal and retained to be used in its stead.

Preservation actions such as scanning (viruses, file formats, etc) are not the only things to benefit from monitoring the state of an item. Text- and data-mining and analysis, indexing for search indices, dissemination copy production, and so on are all actions that can be driven and kept in line with the content in this way. For example, it is likely that the indices will be altered or will benefit from refreshing on a periodic basis, and the event of last-indexing can be included in this iCal file as a VJOURNAL entry.

NB no effort has been made to intertwine the iCal-serialised information with the packaging standard used, as this is expected both to take considerable time and effort, and to severely limit our ability to reuse or migrate content from this packaging standard to a later, newer format. It is being stored as a separate Fedora datastream within the same object it describes, and is registered as an iCal file containing preservation event information using information stored in the internal RDF manifest.

Thursday 19 March 2009

We need people!

(UPDATE - Grrr.... seems that the concept of persistent URLs is lost on the admin - link below has been removed - see google cached copy here)

http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml - job description.

Essentially, we need smart people who are willing to join us to do good, innovative stuff; work that isn't by-the-numbers with room for initiative and ideas.

Help us turn our digital repository into a digital library, it'll be fun! Well, maybe not fun, but it will be very interesting at least!

bulletpoints: python/ruby frameworks, REST, a little SemWeb, ajax, jQuery, AMQP, Atom, JSON, RDF+RDFa, Apache WSGI deployment, VMs, linux, NFS, storage, RAID, etc.

Wednesday 25 February 2009

Developer Happiness days - why happyness is important

Creativity and innovation

Among the defining qualities of a good, innovative developer are creativity and a pragmatic attitude; someone with the 'rough consensus, running code' mentality that pervades good software innovation. This can be seen as the drive to experiment, to turn inspiration and ideas into real, running code, or to pathfind by trying out different things. Innovation can often happen when talking about quite separate, seemingly unrelated things, even to the point that most of the time, the 'outcomes' of an interaction are impossible to pin down.

Play, vagueness and communication

Creativity, inspiration, innovation, ideas, fun, and curiosity are all useful and important when developing software. These words convey concepts that do not thrive in situations that are purely scheduled, didactic, and teacher-pupil focussed. There needs to be an amount of 'play' in the system (see 'Play'.) While this 'play' is bad in a tightly regimented system, it is an essential part of a creative system, allowing new things to develop, new ideas to happen and 'random' interactions to take place.

Alongside this notion of play in an event, there also needs to be an amount of blank space, a vagueness to the event. I think that we can agree that much of the usefulness of normal conferences comes from the 'coffee breaks' and 'lunch breaks', which are blank spaces of a sort. It is the recognition of this that is important and to factor it in more.

Note that if a single developer could guess at how things should best be developed in the academic space, they would have done so by now. Pre-compartmentalisation of ideas into 'tracks' can kill potential innovation stone-dead. The distinction between CMSs, repositories and VLE developers is purely semantic and it is detrimental for people involved in one space to not overhear the developments, needs, ideas and issues in another. It is especially counter-productive to further segregate by community, such as having simultaneous Fedora, DSpace and EPrints strands at an event.

While the inherent and intended vagueness provides the potential for cross-fertilisation of ideas, and the room for play provides the space, the final ingredient is that of speech, or any communication that takes place with the same ease and at the same speed as speech. While some may find the 140 character limit on twitter or identi.ca a strange constraint, it provides a target for people to really think about what they wish to convey and keeps the dialogue from becoming a series of monologues - much like the majority of emails on mailing lists - and keeps it as a dialogue between people.

Communication and Developers

One of the ironies of communication being so necessary to development is that developers can be shy, initially preferring the false anonymity of textual communication to spoken words between real people. There is a need to provide means for people to break the ice, and to strike up conversations with people that they can recognise as being of like minds. Asking that people's public online avatars are changed to be pictures of themselves can help people at an event find those that they have been talking to online and start talking, face to face.

On a personal note, one of the most difficult things I have to do when meeting people out in real life is answer the question 'What do you do?' - it is much easier when I already know that the person asking the question has a technical background.

And again, going back to the concept of compartmentalisation - developers who only deal with developers and their managers/peers will build systems that work best for their peers and their managers. If these people are not the only users, then they need to widen their communications. It is important for developers who do not use their own systems to engage with the people who actually do. They should do this directly, without the potential for garbled dialogue via layers of protocol. This part needs managing in whatever space, both to avoid dominance by loud, disgruntled users and to mitigate anti-social behaviour. By and large, I am optimistic about this process: people tend to want to be thanked, and this simple feedback loop can be used to help motivate. Making this feedback more disproportionate (a small 'thank you' can lead to great effects) and adding in the notion of a highscore can lead to all sorts of interactions and outcomes, most notably the rapid reinforcement of any behaviour that led to a positive outcome.

Disproportionate feedback loops and Highscores drive human behaviour

I'll just digress quickly to cover what I mean by a disproportionate feedback loop: a disproportionate feedback loop is something that encourages a certain behaviour; the input to it is something small and inexpensive, in either time or effort, but the output can be large and very rewarding. This pattern can be seen in very many interactions: playing the lottery, [good] video game controls, twitter and facebook, musical instruments, the 'who wants to be a millionaire' format, mashups, posting to a blog ('free' comments, auto rss updating, a google-able webpage for each post), etc.

The natural drive for highscores is also worth pointing out. At first glance, is it as simple as considering its use in videogames? How about the concept of getting your '5 fruit and veg a day'? http://www.5aday.nhs.uk/topTips/default.html Running in a marathon against other people? Inbox Zero (http://www.slideshare.net/merlinmann/inbox-zero-actionbased-email), Learning to play different musical scores? Your work being rated highly online? An innovation of yours being commented on by 5 different people in quick succession? Highscores can be very good drivers for human behaviour, addictive to some personalities.

Why not set up some software highscores? For example, in the world of repositories, how about 'Fastest UI for self-submission' - encouraging automatic metadata/datamining, a monthly prize for 'Most issue tickets handled' - to the satisfaction of those posting the tickets, and so on.

It is very easy to over-metricise this - some will purposefully abstain and some metrics are truly misleading. In the 90s, there was a push to have lines of code added as a metric of productivity. The false assumption was that lines of code have much to do with productivity - code should be lean, but not so lean that it becomes hard to maintain.

So be very careful when adding means to record highscores - they should be flexible, and be fun - if they are no fun for the developers and/or the users, they become a pointless metric, more of an obstacle than a motivation.

The Dev8D event

People were free to roam and interact at the Dev8D event and there was no enforced schedule, but twitter and a loudhailer were used to make people aware of things that were going on. Talks and discussions were lined up prior to the event of course, but the event was organised on a wiki which all were free to edit. As experience has told us, the important and sometimes inspired ideas occur in relaxed and informal surroundings where people just talk and share information, such as in a typical social situation like having food and drink.

As a specific example, look at the role of twitter at the event. Sam Easterby-Smith (http://twitter.com/samscam) created a means to track 'developer happiness' and shared the tracking 'Happyness-o-meter' site with us all. This unplanned development inspired me to relay the information back to twitter, and similarly led to me running an operating system/hardware survey in a very similar fashion.

To help break the ice and to encourage play, we instituted a number of ideas:

A wordcloud on each attendee's badge, consisting of whatever we could find of their work online, be it their blog or similar, so that it might provide a talking point, or allow people to spot those who write about things they might be interested in learning more about.

The poker chip game - each attendee was given 5 poker chips at the start of the event, and it was encouraged that chips be traded for help, advice or as a way to convey a thank you. The goal was that the top 4 people ranked by number of chips at the end of the third day would receive a Dell mini 9 computer. The balance to this was that each chip was also worth a drink at the bar on that day too.

We were well aware that we'd left a lot of play in this particular system, allowing for lotteries to be set up, people pooling their chips, and so on. As the sole purpose of this was to encourage people to interact, to talk and bargain with each other, and to provide that feedback loop I mentioned earlier, it wasn't too important how people got the chips as long as it wasn't underhanded. It was the interaction and the 'fun' that we were after. Just as an aside, Dave Flanders deserves the credit for this particular scheme.

Developer Decathlon

The basic concept of the Developer Decathlon was also reusing these ideas of play and feedback: "The Developer Decathlon is a competition at dev8D that enables developers to come together face-to-face to do rapid prototyping of software ideas. [..] We help facilitate this at dev8D by providing both 'real users' and 'expert advice' on how to run these rapid prototyping sprints. [..] The 'Decathlon' part of the competition represents the '10 users' who will be available on the day to present the biggest issues they have with the apps they use and in turn to help answer developer questions as the prototypes applications are being created. The developers will have two days to work with the users in creating their prototype applications."

The best two submissions will get cash prizes that go to the individuals, not to the company or institution that they are affiliated with. The outcomes will be made public shortly, once the judging panel has done its work.

Summary

To foster innovation and to allow for creativity in software development:
  • Having play space is important
  • Being vague with aims and flexible with outcomes is not a bad thing and is vital for unexpected things to develop - e.g. A project's outcomes should be under continual re-negotiation as a general rule, not as the exception.
  • Encouraging and enabling free and easy communication is crucial.
  • Be aware of what drives people to do what they do. Push all feedback to be as disproportionate as possible, allowing both developers and users to benefit while putting in only a relatively trivial amount of input (this pattern affects web UIs, development cycles, team interaction, etc)
  • Choose useful highscores and be prepared to ditch them or change them if they are no longer fun and motivational.

Sunday 22 February 2009

Handling Tabular data

"Storage"

I put the s-word in quotes because the storing of the item is actually a very straightforward process - we have been dealing with storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.

One realisation to make with regard to preserving access to data coming from research is that there is a huge backlog of data in formats that we shall kindly call 'legacy'. Not only is there this issue, but data is being made with tools and systems that effectively 'trap' or lock in a lot of this information - a case in point being any research recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is normally difficult to impossible to extract the implicit information; how tables interlink, how the Access Form adds information to the dataset, etc.

It is this implicit knowledge that is the elephant in the room. Very many serialisations, such as SQL 'dumps', csv, xls and so on, rely on implicit knowledge that is either related to the particulars of the application used to open it, or is actually highly domain-specific.

So, it is trivial and easy to specify a model for storing data, but without also encoding the implied information and without making allowances for the myriad of sources, the model is useless; it would be akin to defining the colour of storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded in as good a way as possible.

Characterisation

The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team and help and guide them so that as much of the unsaid knowledge as possible is made known to all parties.)

Some serialisations do a good job of this themselves: *SQL-based serialisations include basic data type information inside the table declarations. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we'll consider the implicit information that also needs to be recorded alongside such a declaration.

Some others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out or even deduce the type of data held in each column. Again, this data can be serialised in a SQL table declaration held alongside the original unmodified dataset.

(It is assumed that a basic data review will be carried out; does the csv have a consistent number of columns per row, is the version and operating system known for the MySQL that held the data, is there a PI or responsible party for the data, etc.)

Implicit information

Good teachers are right to point out this simple truth: "don't forget to write down the obvious!" It may seem obvious that all your data is latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, but the painful truth is that we can't guarantee that these aspects won't affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing [at the time, detected] data corruption, or MySQL truncating fields when serialised in a way that is not anticipated or discovered until later.

In characterising the legacy sets of data, it is important to realise that there will be loss, especially with the formats and applications that blend presentation with storage. For example, it will require a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset, a highly researched dictionary of Old English words and phrases, the final output of which is a Macromedia Authorware application, with the source files held by an unknown party (that is, if they still exist at all) - the Joy of hiring Contractors. In fact, this warrants a slight digression:

The gap in IT support for research

If an academic researcher wishes to gain an external email account at their institution, there is an established protocol for this. Email is so commonplace, it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration, adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email (text or text with MIME attachments - NB the IM in MIME stands for Internet Mail)

If a researcher has a need to store tables of data, where do they turn? They should turn to the same department, who will handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What the IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it, to configure it as they want. No wonder the resulting legacy systems are as free-form as they are.

Practical measures - Curation

Back to specifics now, consider that a set of data has been found to be important, research has been based on it, and it's been recognised that this dataset needs to be looked after. [This will illustrate the technical measures. Licencing, dialogue with the data owners, and other non-technical analysis and administration is left out, but assumed.]

The first task is to store the incoming data, byte-for-byte, as far as is possible - storing the iso image of the media the data arrived on, storing the SQL dump of a database, etc.

Analyse the tables of data - record the base types of each column (text, binary, float, decimal, etc), aping the syntax of a SQL table declaration, as well as trying to identify the key columns.

Record the inter-table joins between primary and foreign keys, possibly by using a "table.column SAMEAS table.column;" declaration after the table declarations.

Likewise, attempt to add information concerning each column, information such as units or any other identifying material.
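Pulling those steps together, the description I have in mind would look something like the following. The dataset and its columns are invented here purely to show the shape; the types ape SQL DDL and the SAMEAS line follows the convention suggested above.

-- Column types recorded by aping a SQL table declaration
CREATE TABLE samples (
    sample_id   INTEGER,        -- key column
    site_id     INTEGER,        -- links to sites.site_id
    collected   DATE,
    ph_reading  DECIMAL(4,2)    -- unit: pH, two decimal places
);

CREATE TABLE sites (
    site_id     INTEGER,        -- key column
    site_name   TEXT
);

-- Inter-table join, using the SAMEAS convention described above
samples.site_id SAMEAS sites.site_id;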

Store this table description alongside the recorded tabular data source.

Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are 'frozen', cells that are the results of embedded formulae should be calculated and added as fixed values. It is important to record the environment, library and platform that these calculations are made with.

Table description as RDF (strictly, referencing cols/rows via the URI)

One syntax I am playing around with is the notion that by appending sensible suffixes to the base URI for a dataset, we can uniquely specify a row, a column, a region or even a single cell. Simply put:

http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column
http://datasethost/datasets/{data-id}#table/{table-name}/row/{row-id} to reference a whole row, etc

[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]

The purpose for this is so that the tabular description can be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata without disrupting or altering how this is handled.
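As a quick sketch of what I mean, a column described using that URI convention might look like this in turtle - the dataset id and the choice of predicates here are mine, purely for illustration:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://datasethost/datasets/42#table/samples/column/ph_reading>
    rdfs:label "pH reading" ;
    dcterms:description "Soil pH of the sample, recorded to two decimal places" ;
    dcterms:isPartOf <http://datasethost/datasets/42#table/samples> .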

A possible content type for tabular data

By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:

As a Fedora Object:

  • Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)
  • Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)
  • 'Manifest' of the contents (dsid convention: RELS-INT)
  • Connections between this dataset and other objects, like articles, etc as well as the RDF description of this item (RELS-EXT)
  • Basic description of dataset for interoperability (Simple dublin core - DC)

As a BagIt+RDF:

Zip archive -
  • /MANIFEST (list of files and checksums)
  • /RDFMANIFEST (RELS-INT and RELS-EXT from above)
  • /data/* (original dataset files/disk images/etc)
  • /derived/* (normalised/re-rendered datasets in a well known format)

Presentation - the important part

What is described above is the archival of the data. This is a form suited to discovery, but not a form suited to reuse. So, what are the possibilities?

BigTable (Google) or HBase (Hadoop) provide a platform where tabular data can be stored in a scalable manner. In fact, I would go on to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise, and to educate the academic users in a manner that is useful to them, not just to the archivist. Google spreadsheet is an extremely good example of how such a large, scalable database might be presented to the end-user.

For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy of putting unused datasets to 'sleep' can provide a useful compromise, avoiding having all the tables live but still providing a useful service.

It should also be noted that the adoption of popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, and protocols and methods for access alter with fashion. Currently, Atom/RSS feeds of any part of a table of data (the google spreadsheet model) fits very well with the landscape of applications that can reuse this information.

Summary
  • Try to record as much information as can be found or derived - from host operating system to column types.
  • Keep the original dataset byte-for-byte as you received it.
  • Try to maintain a version of the data in a well-understood format
  • Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism
  • Be prepared to create services that the users want and need, not services that you think they should have.


Friday 20 February 2009

Pushing the BagIt manifest concept a little further

I really like the idea of BagIt - just enough framework to transfer files in a way that errors can be detected.

I really like the idea of RDF - just enough framework to detail, characterise and interlink resources in an extremely flexible and extendable fashion.

I really like the 4 rules of Linked Data - just enough rules to act as guides; follow the rules and your information will be much more useful to you and the wider world.

What I do not wish to go near is any format that requires a non-machine-readable profile to understand, or a human to reverse-engineer - METS being a good example of a framework giving you enough rope to hang yourself with.

So, what's my use-case? First, I'll outline what digital objects I have, and why I handle and express them the way I do.

I deal with lists of resources on a day-to-day basis, and what these resources are and the way these resources link together is very important. The metadata associated with the list is also important, as this conveys the perspective of the agent that constructed this list; the "who,what,where,when and why" of the list.

OAI-ORE is - at a basic level - a specification and a vocabulary which can be used to depict a list of resources. This is a good thing. But here's the rub for me - I don't agree with how ORE semantically places this list. For me, the list is a subjective thing, a facet or perception of linkage between the resources listed. The list *always* implies a context through which the resources are to be viewed. This view leads me to the conclusion that any triples that are *asserted* by the list - such as triples containing an ordering predicate like 'hasNext' or 'hasLast' - must not be in the same graph as the factual triples which would enter the 'global' graph, such as: list A is called (dc:title) "My photos", contains resources a, b, c, d and e, and was authored by Agent X.

This is easier to illustrate with an example with everyone's friends, Alice and Bob:


Now, while Alice and Bob may be 'aggregating' some of the same images, this doesn't mean we can infer much at all. Alice might be researching the development of a fruit fly's wings based on genetic degradation, and Bob might be researching the fruit fly's eye structure, looking for clear photos of the front of the fly. It could be even more unrelated, in that Bob is actually looking for features on the electron microscope photos caused by dust or pollen.

So, to cope with contextual assertions (A asserts that <B> <verb C> <D>) there are a couple of well-discussed tactics: Reification, 'Promotion' (not sure of the correct term here) and Named Graphs.

Reification is a no-no. Very bad. Google will tell you all the reasons why this is bad.

'Promotion' (whatever the real term for this is - I hope someone will post it in the comments) is just where a classed node is introduced to allow contextual information to be added, very useful for adding information about a predicate. For example, consider <Person A> <researches> <ProjectX>. This, I'd argue, is a bad triple for any system that will last longer than <ProjectX>'s lifespan. We need to qualify this triple with temporal information, and perhaps even location information too. So, one solution is to 'promote' the <researches> predicate to be of the following form: <Person A> <has_role> <_: type=Researcher>; <_:> <dtstart> <etc>, <_:> <researches> <ProjectX> ...

From the ORE camp, this promotion comes in the form of a Proxy for each aggregated resource that needs context. So in this way, they have 'promoted' the resource, as a kind of promotion to the act of aggregation. Tomayto, Tomarto. The way this works for ORE doesn't sit well for me though, and the convention for the URI schema here feels very awkward and heavy.

The third way (and my strong preference) is the Named Graph approach. Put all the triples that are asserted by, say Alice, into a resource at URI <Alices-NG> and say something like <Aggregation> <isProvidedContextBy> <Alices-NG>

For ease of reuse though, I'd suggest that the facts are left in the global graph, in the aggregation serialisation itself. I am sure that the semantic arguments over what should go where could rage on for eons, my take is that information that is factual or generally useful should be left in the global graph. Like resource mime-type, standards compliance ('conformsTo', etc), mirroring/alternate format information ('sha1_checksum', 'hasFormat' between a PDF, txt and Word doc versions, etc)

(There is the murky middle ground of course, like licencing. But I'd suggest leaving it to the 'owning' aggregation to put that in the global graph.)

Enough of the digression on RDF!

So, how to extend BagIt, taking on board the things I have said above:

Add alongside the MANIFEST of BagIt (a simple list of files and checksums) an RDF serialisation - RDFMANIFEST.{format} (which in my preference is in N3 or turtle notation, .n3 or .turtle accordingly)

Copying the modelling of Aggregations from OAI-ORE, and we will say that one BagIt archive is equivalent to one Aggregation. (NB nothing wrong with a BagIt archive of BagIt archives!)

Re-use the Agent and ore:aggregates concepts and conventions from the OAI-ORE spec to 'list' the archive, and give some form of provenance. Add in a simple record for what this archive is meant to be as a whole (attached to the Aggregation class).

Give each BagIt a URI - in this case, preferably a resolvable URI from which you can download it, but for bulk transfers using SneakerNet or CarFullOfDrivesNet, use a info:BagIt/{id} scheme of your choice.

URIs for resources in transit are hierarchical, based on location in the archive: <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>

Checksums, mimetypes and alternates should be added to the RDF Manifest:

NB <page1> == <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>

<page1> <sha1> "9cbd4aeff71f5d7929b2451c28356c8823d09ab4" ;
    <mimetype> "image/tiff" ;
    <hasFormat> <info:BagIt/{book-id}/thumbnail_pages/page1/bookid_page1_0.jpg> .


Any assertions, such as page ordering in this case, should be handled as necessary. Just *please* do not use 'hasNext'! Use a named graph, use the built in rdf list mechanism, add an RSS 1.0 feed as a resource, anything but hasNext!

And that's about it for the format. One last thing to say about using info URIs though - I'd strongly suggest that they are used only when the items do not have resolvable (http) URIs, and once transferred, I'd suggest that the info URIs are replaced with the http ones, and the info variants can be kept in a graph for provenance.

(Please note that I am biased in that this mirrors quite closely the way that the archives here and the way that digital items are held, but I think this works!)

Wednesday 18 February 2009

Tracking conferences (at Dev8D) with python, twitter and tags

There was so much going on at http://www.dev8d.org (#dev8d) that it might be foolish for me to attempt to write up what happened.

So, I'll focus on a small, but to my mind, crucial aspect of it - tag tracking with a focus on Twitter.

The Importance of Tags

First, the tag (#)dev8d was cloudburst over a number of social sites - Flickr(dev8d tagged photos), Twitter(dev8d feed), blogs such as the JISCInvolve Dev8D site, and so on. This was not just done for publicity, but as a means to track and re-assemble the various inputs to and outputs from the event.

Flickr has some really nice photos on it, shared by people like Ian Ibbotson (who caught an urban fox on camera during the event!) While there was an 'official' dev8d flickr user, I expect the most unexpected and most interesting photos to be shared by other people who kindly add on the dev8d tag so we can find them. For conference organisers, this means that there is a pool of images that we can choose from, each with their own provenance so we can contact the owner if we wanted to re-use, or re-publish. Of course, if the owner puts a CC licence on them, it makes things easier :)

So, asserting a tag or label for an event is a useful thing to do in any case. But this, twinned with using a messaging system like Twitter or Identi.ca, means that you can coordinate, share, and bring together an event. There was a projector in the Basecamp room, which was either the bar, or one of the large basement rooms at Birkbeck, depending on the day. Initially, this was used to run through the basic flow of events, which was primarily organised through the use of a wiki, of which all of us and the attendees were members.

Projecting the bird's eye view of the event

I am not entirely sure whose idea it was initially to use the projector to follow the dev8d tag on twitter, auto-refreshing itself every minute, but it would be one or more of the following: Dave Flanders (@dfflanders), Andy McGregor (@andymcg) and Dave Tarrant (@davetaz), who is aka BitTarrant due to his network wizardry keeping the wifi going despite Birkbeck's network's best efforts at stopping any form of useful networking.

The funny thing about the feed being there was that it felt perfectly natural from the start. Almost like a mix of notice board, event liveblog and facebook status updates, but the overall effect was of a bird's eye view of the entire event, which you could dip into and out of at will, follow up on talks you weren't even attending, catch interesting links that people posted, and just follow the whole event while doing your own thing.

Then things got interesting.

From what I heard, a conversation in the bar about developer happiness (involving @rgardler?) led Sam Easterby-Smith (@samscam) to create a script that dug through the dev8d tweets looking for n/m (like 7/10) and to use that as a mark of happyness, e.g.
" @samscam #dev8d I am seriously 9/10 happy http://samscam.co.uk/happier HOW HAPPY ARE YOU? " (Tue, 10 Feb 2009 11:17:15)



It then computed the average happyness and overall happyness of those who tweeted how they were doing!

Of course, being friendly, constructive sorts, we knew the best way to help 'improve' his happyometer was to try to break it by sending it bad input... *ahem*.
" @samscam #dev8d based on instant discovery of bugs in the Happier Pipe am now only 3/5 happy " (Tue, 10 Feb 2009 23:05:05)
BUT things got fixed, and the community got involved and interested. It caused talk and debate, got people wondering how it was done, how they could do the same thing and how to take it further.

At which point, I thought it might be fun to 'retweet' the happyness ratings as they changed, to keep a running track of things. And so, a purpose for @randomdev8d was born:



How I did this was fairly simple: I grabbed his page every minute or so, used BeautifulSoup to parse the HTML, got the happyness numbers out and compared them to the last ones the script had seen. If there was a change, it tweeted it and, seconds later, the projected tweet feed updated to show the new values - a disproportionate feedback loop, the key to involvement in games; you do something small like press a button or add 4/10 to a message, and you can affect the stock-market ticker of happyness :)

If I had been able to give my talk on the python code day, the code to do this would contain zero surprises, because I covered 99% of this - so here's my 'slides'[pdf] - basically a snapshot screencast.

Here's the crufty code though that did this:
import time
import urllib

import httplib2
import BeautifulSoup

# Authenticated connection for posting tweets as @randomdev8d
h = httplib2.Http()
h.add_credentials('randomdev8d', 'PASSWORD')

# Separate, unauthenticated connection for fetching the happyness-o-meter page
happy = httplib2.Http()

# Last-seen overall and average happyness values
o = 130.9
a = 7.7

while True:
    print "Checking happiness...."
    (resp, content) = happy.request('http://samscam.co.uk/happier/')
    soup = BeautifulSoup.BeautifulSoup(content)
    # The overall and average scores sit in the 3rd and 5th divs of the page
    overallHappyness = soup.findAll('div')[2].contents
    avergeHappyness = soup.findAll('div')[4].contents
    over = float(overallHappyness[0])
    ave = float(avergeHappyness[0])
    print "Overall %s - Average %s" % (over, ave)

    # Work out which way each number has moved since last time
    omess = "DOWN"
    if over > o:
        omess = "UP!"
    amess = "DOWN"
    if ave > a:
        amess = "UP!"
    if over == o:
        omess = "SAME"
    if ave == a:
        amess = "SAME"

    if not (o == over and a == ave):
        print "Change!"
        o = over
        a = ave
        tweet = "Overall happiness is now %s(%s), with an average=%s(%s) #dev8d (from http://is.gd/j99q)" % (overallHappyness[0], omess, avergeHappyness[0], amess)
        data = {'status': tweet}
        body = urllib.urlencode(data)
        (rs, cont) = h.request('http://www.twitter.com/statuses/update.json', "POST", body=body)
    else:
        print "No change"
    time.sleep(120)
(Available from http://pastebin.com/f3d42c348 with syntax highlighting - NB this was written beat-poet style, written from A to B with little concern for form. The fact that it works is a miracle, so comment on the code if you must.)

The grand, official #Dev8D survey!

... which was anything but official, or grand. The happyness-o-meter idea led BitTarrant and me to think "Wouldn't it be cool to find out what computers people have brought here?" Essentially, finding out what computer environment developers choose to use is a very valuable thing - developers choose things which make our lives easier, by and large, so finding out which setups they use by preference to develop or work with could guide later choices, such as being able to actually target the majority of environments for wifi, software, or talks.

So, on the Wednesday morning, Dave put out the call on @dev8d for people to post the operating systems on the hardware they brought to this event, in the form of OS/HW. I then busied myself with writing a script that hit the twitter search api directly, and did the parsing itself. As this was a more deliberate script, I made sure that it kept track of things properly, pickling its per-person tallies. (You could post up multiple configurations in one or more tweets, and it kept track of it per person.) This script was a little bloated at 86 lines, so I won't post it inline - plus, it also showed that I should've gone to the regexp lesson, as I got stuck trying to do it with regexp, gave up, and then used whitespace-tokenising... but it worked fine ;)
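The core of the whitespace-tokenising approach is nothing cleverer than this - a toy re-creation of the idea, not the pastebin script itself:

def extract_configs(tweet_text):
    """Pull OS/HW tokens (e.g. 'ubuntu/laptop') out of a tweet by
    whitespace-tokenising - a toy version of the survey parsing."""
    configs = []
    for token in tweet_text.split():
        if '/' in token and not token.lower().startswith('http'):
            os_part, _, hw_part = token.partition('/')
            if os_part and hw_part:
                configs.append((os_part.lower(), hw_part.lower()))
    return configs

print(extract_configs("#dev8d osx/macbook ubuntu/netbook"))
# [('osx', 'macbook'), ('ubuntu', 'netbook')]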

Survey code: http://pastebin.com/f2c04719b

Survey results: http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqToHzjCs2jaQ

OS:
Linux was the majority at 42%, closely followed by Apple at 37%, with MS-based OSes at 18% and a stellar showing of one OpenSolaris user (4%)!

Hardware type:
66% were laptops, with 25% of the machines there being classed as netbooks. 8% of the hardware there were iPhones too, and one person claimed to have brought Amazon EC2 with them ;)

The post hoc analysis

Now then, having gotten back to normal life, I've spent a little time grabbing stuff from twitter and digging through it. Here is the list of the 1300+ tweets with the #dev8d tag in them, published via google docs, and here are some derived things posted by Tony Hirst (@psychemedia) and Chris Wilper (@cwilper) seconds after I posted this:

Tagcloud of twitterers:
http://www.wordle.net/gallery/wrdl/549364/dev8_twitterers [java needed]

Tagcloud of tweeted words:
http://www.wordle.net/gallery/wrdl/549350/dev8d [java needed]

And a column of all the tweeted links:
http://spreadsheets.google.com/pub?key=p1rHUqg4g423-wWQn8arcTg

This led me to dig through them and republish the list of tweets, trying to unminimise the urls and grab the <title> tag of the html page each one points to, which you can find here:

http://spreadsheets.google.com/pub?key=pDKcyrBE6SJpwVmV4_4qOdg

(Which, incidentally, led me to spot that there was one link to "YouTube - Rick Astley - Never Gonna Give You Up", which means the hacking was all worthwhile :))

Graphing Happyness

For one, I've re-analysed the happyness tweets and posted up the following:
It is easier to understand the averages as graphs over time of course! You could also use Tony Hirst's excellent write up here about creating graphs from google forms and spreadsheets. I'm having issues embedding the google timeline widget here, so you'll have to make do with static graphs.

Average happyness over the course of the event - all tweets counted towards the average.

Average happyness, but with only the previous 10 tweets counted towards the average, making it more reflective of the happyness at that time.

If you are wondering about the first dip, that was when we all tried to break Sam's tracker by sending it bad data - a lot of 0 happynesses were recorded as a result :) As for the second dip, well, you can see that for yourselves from the log of happyness :)