Sunday 22 February 2009

Handling Tabular Data

"Storage"

I put the s-word in quotes because storing the data itself is actually a very straightforward process - we have been storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.

One realisation to make with regard to preserving access to research data is that there is a huge backlog of data in formats that we shall kindly call 'legacy'. Worse, data is still being produced with tools and systems that effectively 'trap' or lock in a lot of this information - a case in point being any research recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is usually difficult or impossible to extract the implicit information: how the tables interlink, how an Access form adds information to the dataset, and so on.

It is this implicit knowledge that is the elephant in the room. A great many serialisations, such as SQL 'dumps', CSV, XLS and so on, rely on implicit knowledge that is either tied to the particulars of the application used to open them, or is highly domain specific.

So, it is easy to specify a model for storing data, but without also encoding the implied information, and without making allowances for the myriad of sources, the model is useless; it would be akin to specifying the colour of the storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded as well as possible.

Characterisation

The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team, and to help and guide them so that as much of the unsaid knowledge as possible is made known to all parties.)

Some serialisations do a good job of this on their own: *SQL-based serialisations include basic data type information inside the table declarations themselves. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we'll consider the implicit information that also needs to be recorded alongside such a declaration.

Others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out, or failing that deduce, the type of data held in each column. Again, this information can be serialised as a SQL table declaration held alongside the original, unmodified dataset.

(It is assumed that a basic data review will be carried out: does the CSV have a consistent number of columns per row; is the version and operating system of the MySQL instance that held the data known; is there a PI or responsible party for the data; and so on.)
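To make that concrete, a rough sketch of both the review and the type deduction might look like the following (Python; the filename and table name are invented for illustration, and the guessed types are only a starting point for the curator to correct):

    import csv

    def guess_type(values):
        """Guess a SQL-ish base type for a column from its string values."""
        def all_match(cast):
            non_empty = [v for v in values if v != ""]
            if not non_empty:
                return False          # an entirely empty column tells us nothing
            for v in non_empty:
                try:
                    cast(v)
                except ValueError:
                    return False
            return True
        if all_match(int):
            return "INTEGER"
        if all_match(float):
            return "FLOAT"
        return "TEXT"

    # 'survey.csv' and the table name below are invented for illustration.
    with open("survey.csv", newline="") as f:
        rows = list(csv.reader(f))

    header, body = rows[0], rows[1:]

    # Basic review: does every row have the same number of columns as the header?
    widths = {len(r) for r in body}
    if widths != {len(header)}:
        print("WARNING: inconsistent column counts found:", sorted(widths))

    # Ape a SQL table declaration to record the deduced column types.
    columns = []
    for i, name in enumerate(header):
        values = [r[i] for r in body if i < len(r)]
        columns.append("  %s %s" % (name, guess_type(values)))

    print("CREATE TABLE survey (\n%s\n);" % ",\n".join(columns))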

Implicit information

Good teachers are right to point out a simple truth: "don't forget to write down the obvious!" It may seem obvious that all your data is Latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, but the painful truth is that we can't guarantee that these aspects won't affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing data corruption that went undetected at the time, or MySQL truncating fields on serialisation in a way that is not anticipated or discovered until later.
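As a small sketch of 'writing down the obvious', something along these lines captures the sort of environmental facts that usually go unsaid (the notes filename is arbitrary):

    import json
    import locale
    import platform
    import sys

    # Capture the 'obvious' facts about the environment the data was handled in.
    environment = {
        "platform": platform.platform(),          # OS, version, architecture
        "python_version": platform.python_version(),
        "byte_order": sys.byteorder,              # 'little' or 'big'
        "preferred_encoding": locale.getpreferredencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }

    # Keep these notes alongside the dataset itself (filename is arbitrary).
    with open("environment-notes.json", "w") as f:
        json.dump(environment, f, indent=2)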

In characterising legacy sets of data, it is important to realise that there will be loss, especially with the formats and applications that blend presentation with storage. For example, it would require a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset, a highly researched dictionary of Old English words and phrases, whose final output is a Macromedia Authorware application and whose source files are held by an unknown party (if they still exist at all) - the joy of hiring contractors. In fact, this warrants a slight digression:

The gap in IT support for research

If an academic researcher wishes to get an external email account at their institution, there is an established protocol for this. Email is so commonplace that it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration and the adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email (text, or text with MIME attachments - NB the 'IM' in MIME stands for Internet Mail).

If a researcher has a need to store tables of data, where do they turn? They should be able to turn to the same department, which would handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it and configure it as they want. No wonder the resulting legacy systems are as free-form as they are.

Practical measures - Curation

Back to specifics now: consider that a set of data has been found to be important, research has been based on it, and it has been recognised that this dataset needs to be looked after. [This section will illustrate the technical measures. Licensing, dialogue with the data owners, and other non-technical analysis and administration is left out, but assumed.]

The first task is to store the incoming data byte-for-byte, as far as is possible - storing the ISO image of the media the data arrived on, storing the SQL dump of a database, and so on.
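A minimal sketch of that first step, assuming the incoming material has been copied into an 'incoming/' directory exactly as received: fix a checksum for every file at the point of receipt, so later copies can be verified against it.

    import hashlib
    import os

    def sha256(path):
        """Checksum a file without reading it all into memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Record a checksum for every incoming file, byte-for-byte as received.
    with open("checksums.sha256", "w") as manifest:
        for dirpath, _dirs, files in os.walk("incoming"):
            for name in files:
                path = os.path.join(dirpath, name)
                manifest.write("%s  %s\n" % (sha256(path), path))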

Analyse the tables of data - record the base types of each column (text, binary, float, decimal, etc), aping the syntax of a SQL table declaration, and try to identify the key columns.

Record the inter-table joins between primary and foreign keys, possibly by using a "table.column SAMEAS table.column;" declaration after the table declarations.

Likewise, attempt to add information about each column, such as units or any other identifying material.

Store this table description alongside the recorded tabular data source.
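Pulling those steps together, the stored description might read something like the sketch below; the tables, columns and units are entirely invented, and the SAMEAS and comment conventions are simply the ones suggested above.

    # A hypothetical characterisation file, written alongside the original data.
    DESCRIPTION = """
    CREATE TABLE samples (
      sample_id INTEGER,     -- primary key
      site_id   INTEGER,     -- foreign key into sites
      depth     FLOAT,       -- metres below surface
      collected TEXT         -- ISO 8601 date
    );

    CREATE TABLE sites (
      site_id   INTEGER,     -- primary key
      name      TEXT,
      latitude  FLOAT,       -- decimal degrees, WGS84
      longitude FLOAT        -- decimal degrees, WGS84
    );

    samples.site_id SAMEAS sites.site_id;
    """

    with open("dataset-description.sql", "w") as f:
        f.write(DESCRIPTION)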

Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are 'frozen', cells that are the results of embedded formulae should be calculated and added as fixed values. It is important to record the environment, library and platform with which these calculations are made.
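One hedged way of 'freezing' a spreadsheet, assuming the openpyxl library and an invented results.xlsx, is to read the values cached by the application that last saved the file rather than the formulae, and write those out as fixed values - which is exactly why the calculating environment needs recording too.

    import csv
    from openpyxl import load_workbook   # assumes openpyxl is installed

    # data_only=True returns the values cached by the application that last
    # saved the workbook, rather than the formula strings themselves.
    workbook = load_workbook("results.xlsx", data_only=True)

    for sheet in workbook.worksheets:
        out_name = "results-%s-frozen.csv" % sheet.title
        with open(out_name, "w", newline="") as f:
            writer = csv.writer(f)
            for row in sheet.iter_rows(values_only=True):
                writer.writerow(row)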

Table description as RDF (strictly, referencing cols/rows via the URI)

One syntax I am playing around with is the notion that, by appending sensible suffixes to the base URI for a dataset, we can uniquely specify a row, a column, a region or even a single cell. Simply put:

http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column
http://datasethost/datasets/{data-id}#table/{table-name}/row/{row-id} to reference a whole row, etc

[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]
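A trivial helper shows the intent; the host and identifiers are placeholders, and the cell form is just one possible extension of the two forms given above.

    BASE = "http://datasethost/datasets/%s#table/%s"   # placeholder host

    def column_uri(data_id, table, column_id):
        """URI for a whole column, following the suffix convention above."""
        return (BASE % (data_id, table)) + "/column/%s" % column_id

    def row_uri(data_id, table, row_id):
        """URI for a whole row."""
        return (BASE % (data_id, table)) + "/row/%s" % row_id

    def cell_uri(data_id, table, row_id, column_id):
        """One possible extension for a single cell, by row then column."""
        return (BASE % (data_id, table)) + "/row/%s/column/%s" % (row_id, column_id)

    print(column_uri("ds-0001", "samples", "depth"))
    # http://datasethost/datasets/ds-0001#table/samples/column/depth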

The purpose of this is to allow the tabular description to be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata, without disrupting or altering how that is handled.
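As a sketch of what such a description might look like, assuming the rdflib library is available and using Dublin Core terms purely by way of example:

    from rdflib import Graph, Literal, Namespace, URIRef

    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.bind("dcterms", DCTERMS)

    # Describe one column of a hypothetical dataset using the URI form above.
    column = URIRef("http://datasethost/datasets/ds-0001#table/samples/column/depth")
    g.add((column, DCTERMS.title, Literal("depth")))
    g.add((column, DCTERMS.description, Literal("Sample depth in metres below surface")))

    # Serialise for inclusion in the RDF manifest of the package.
    g.serialize(destination="column-description.ttl", format="turtle")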

A possible content type for tabular data

By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:

As a Fedora Object:

  • Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)
  • Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)
  • 'Manifest' of the contents (dsid convention: RELS-INT)
  • Connections between this dataset and other objects, such as articles, as well as the RDF description of this item (RELS-EXT)
  • Basic description of dataset for interoperability (Simple dublin core - DC)

As a BagIt+RDF (a sketch of assembling such a bag follows the layout):

Zip archive -
  • /MANIFEST (list of files and checksums)
  • /RDFMANIFEST (RELS-INT and RELS-EXT from above)
  • /data/* (original dataset files/disk images/etc)
  • /derived/* (normalised/re-rendered datasets in a well known format)
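A rough sketch of assembling that layout follows; the paths are illustrative, and this is not a full BagIt implementation, just the shape of one.

    import hashlib
    import os
    import zipfile

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_bag(zip_path, originals, derived, rdf_manifest):
        """Assemble the zip layout described above (not a full BagIt implementation)."""
        manifest_lines = []
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as bag:
            for prefix, files in (("data", originals), ("derived", derived)):
                for path in files:
                    arcname = "%s/%s" % (prefix, os.path.basename(path))
                    bag.write(path, arcname)
                    manifest_lines.append("%s  %s" % (sha256(path), arcname))
            bag.write(rdf_manifest, "RDFMANIFEST")
            bag.writestr("MANIFEST", "\n".join(manifest_lines) + "\n")

    build_bag("dataset-ds-0001.zip",
              originals=["incoming/survey.csv"],
              derived=["derived/survey.sql"],
              rdf_manifest="column-description.ttl")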

Presentation - the important part

What is described above is the archival of the data. This form is suited to discovery, but not to reuse. So what are the possibilities?

Google's BigTable, or HBase from the Hadoop project, provides a platform where tabular data can be stored in a scalable manner. In fact, I would go so far as to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise data and to educate the academic users in a manner that is useful to them, not just to the archivist. Google Spreadsheets is an extremely good example of how such a large, scalable database might be presented to the end user.

For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy of putting unused datasets to 'sleep' can provide a useful compromise, avoiding keeping all the tables live while still providing a useful service.
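As a hedged sketch of 'waking' an archived table on such a platform, assuming the happybase client library and an HBase Thrift gateway on localhost (the table and column family names are invented):

    import csv
    import happybase   # assumes an HBase Thrift gateway is reachable on localhost

    connection = happybase.Connection("localhost")

    # Create a table with a single column family to hold the archived columns.
    if b"samples" not in connection.tables():
        connection.create_table("samples", {"d": dict()})

    table = connection.table("samples")

    # Load the normalised CSV copy of the archived dataset, row by row.
    with open("derived/survey.csv", newline="") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            data = {("d:%s" % k).encode(): (v or "").encode() for k, v in row.items()}
            table.put(b"row-%08d" % i, data)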

It should also be noted that adopting popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, as protocols and methods for access shift with fashion. Currently, Atom/RSS feeds of any part of a table of data (the Google Spreadsheets model) fit very well with the landscape of applications that can reuse this information.

Summary
  • Try to record as much information as can be found or derived - from host operating system to column types.
  • Keep the original dataset byte-for-byte as you received it.
  • Try to maintain a version of the data in a well-understood format.
  • Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism.
  • Be prepared to create services that the users want and need, not services that you think they should have.

