Friday 24 October 2008

"Expressing Argumentative Discussions in Social Media sites" - and why I like it

At the moment, there are a lot of problems with the way information is cited or referenced in paper-based research articles, and I am deeply concerned that this model is being applied wholesale to digitally held content. I am by no means the first to say the following, and people have put far more thought into this field than I have, but I can't see a way out of the situation using only the information that is currently captured. I find it easier to see the holes when I express it to myself in rough web metaphors:
  • a reference list is a big list of 'bookmarks', generally given in the order in which they appear in the research paper.
  • Sometimes a bookmark is included simply because everyone else in the field includes the same one - a core text, for example.
  • There are no tags, no comments and no counts of how often a given bookmark is used.
  • There is no intentional way to find out how often the bookmark appears in other people's lists; this has to be reverse-engineered.
  • There is no way to tell from the list what the intent of a reference is - whether the author agrees with, disagrees with, refutes, or relies on the work in question.
  • There are (normally) no anchors in the referenced articles, so there is little way to reliably refer to a single quote in a text. Articles are usually referenced as a whole, rather than down to the line, chart, or paragraph.
  • The bookmark format varies from publisher to publisher and from journal to journal.
  • Due to this coarse-grained citation, a single reference will sometimes be used when the author is actually referring to multiple parts of a given piece of work.
Now, on a much more positive note, this issue is being tackled. At the VoCamp in Oxford, I talked with CaptSolo about the developments with SIOC and their idea of extending the vocabularies to deal with argumentative discourse. Their paper is now online at the SDoW2008 site (or go directly to the pdf). The essence of it is an extension to the SIOC vocab that records the intent of a statement (such as Idea, Issue or Elaboration) as well as recording an Argument.

I have (maybe naïvely) felt that there is a direct parallel between social discourse and academic discourse, to the point where I used the sioc:has_reply property to connect links made in blogs to items held in the archive (using trackbacks and pingbacks, a system on hiatus until I get time to beef up the antispam/moderation facilities). So, seeing an argumentation vocab developing makes me happier :) Hopefully, we can extend this vocab's intention with more academically focussed terms.
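For the non-RDF-inclined, here is roughly what that kind of link looks like when built programmatically. This is a minimal sketch, not the production trackback code: sioc:has_reply is the real SIOC property, but the item and post URLs are placeholders I have made up for illustration.

# Minimal sketch: recording that a blog post replies to / discusses an archived item,
# using the SIOC vocabulary via rdflib. The URLs below are hypothetical.
from rdflib import Graph, Namespace, URIRef

SIOC = Namespace("http://rdfs.org/sioc/ns#")

g = Graph()
g.bind("sioc", SIOC)

archived_item = URIRef("http://example.org/archive/item/1234")    # hypothetical archive item
blog_post = URIRef("http://example.org/blog/2008/10/some-post")   # hypothetical blog post

# "this archived item has a reply, namely the blog post that linked to it"
g.add((archived_item, SIOC.has_reply, blog_post))

print(g.serialize(format="turtle"))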

What about the citation vocabularies that already exist? I think the ones I have looked at suffer from the same issue: they are built to represent what exists in the paper world, rather than what could exist in the web world.

I also want to point out the work of the Spider project, which aims to semantically mark up a single journal article; they have taken significant steps towards showing what could be possible with enhanced citations. Take a look at their enhanced article: all sorts of very useful examples of what is possible. Pay special attention to how the references are shown, and to how they can be reordered, typed, and so on. Note that I am able to link to the references section in the first place! The part I find really useful is demonstrated by the two references shown in red in the second paragraph of the introduction - hover over them to see what I mean. Note that even though the two references are identical in the reference list (a consequence of this starting life as a paper article), they have been enhanced to pop up the reason for, and the section referred to by, each citation.

In summary then: please think twice when compiling a sparse reference list! Quote the actual section of text if you can, and Harvard format be damned ;)

Monday 20 October 2008

Modelling and storing a phonetics database inside a store

Long story short: a researcher asked us to store and disseminate a DVD containing ~600+ audio files and their associated analyses, comprising a phonetics database focussed on the beat of speech.

This was the request that let me start on something I had planned for a while: a databank of long-tail data. This is data that is too small to fit into the plans of Big Data (which have, IMO, a vanishingly small userbase for reuse) and too large and complex to sit as archived lumps on the web. The system supporting the databank is a Fedora Commons install, with a basic Solr set up for indexing.

As I haven't yet got an IP address for the databank machine, I cannot link to things, but I will walk through the modelling process. (I'll add links to the raw objects once I have a networked VM for the databank.)

Analysis: "What have we got here?"

The dataset has been given to us by a researcher, Dr. Greg Kochanski, the data having been burnt onto a DVD-R. He called it the "2008 Oxford Tick1 corpus". A quick look at the contents showed a collection of files, grouped by folder into a hierarchy of some sort. First things first, though: this is a DVD-R, and very much prone to degradation. As it is the files on the disc that are important, rather than the disc image itself, a quick "tar -cvzf Tick1_backup.tar.gz /media/cdrom0/" made a zipped archive of the files. Remember, burnt DVDs have an integrity half-life of around 1.5 to 2 years (according to a talk I heard at the last SUN-PAsig), and I have lost data to unreadable discs myself.

Disc contents: http://pastebin.com/f74aadacc

Top-level directory listing:

ch ej lp ps rr sh ta DB.fiat DBsub.fiat README.txt
cw jf nh rb sb sl tw DBsent.fiat LICENSE.txt

Each of the two-letter directories holds a number of audio files, and each audio file seems to have a subdirectory of its own, containing ~6+ data files in a custom format.

The top-level .fiat files (DB.fiat, etc.) are in a format roughly described here, with the documentation about the data held within each file aimed at humans. In essence, it looks like a way to add comments and information to a CSV file, but it doesn't seem to support or encourage the line syntax that CSV enjoys, such as quoting, likely because it doesn't use any standard CSV library. The same data could be captured using standard CSV libraries without any real invention, but I digress.

Ultimately, by parsing the fiat files, I can get the text spoken in each audio file, some metadata about each one, and some information about how some of the audio is interrelated (by text spoken, type, etc.).
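To make the 'comments plus CSV' reading of the format concrete, here is a minimal parsing sketch. The comment and quoting conventions (lines starting with '#', whitespace-delimited records) are assumptions on my part, so treat this as an illustration rather than a reference parser for fiat files.

# Rough sketch: read a fiat-style file as "comment/metadata lines + delimited data rows"
# and re-express the rows as standard CSV. ASSUMPTION: '#' marks comment lines and the
# data rows are whitespace-delimited; the real fiat format may well differ in detail.
import csv
import io

def parse_fiat(path):
    comments, rows = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            if line.startswith("#"):
                comments.append(line.lstrip("# ").strip())
            else:
                rows.append(line.split())   # no quoting support, as noted above
    return comments, rows

def to_csv(rows):
    # Re-express the data rows as standard CSV, quoting only where needed.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    return buf.getvalue()

if __name__ == "__main__":
    comments, rows = parse_fiat("DB.fiat")   # file name taken from the disc listing
    print("\n".join(comments[:5]))
    print(to_csv(rows[:5]))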

Modelling: "How to turn the data into a representation of objects"

There are very many ways to approach this, so I shall outline the aims I keep in mind, noting that the result will always be a compromise between different aims, due in part to the origins of the data.

I am a fan of sheer curation; I think it is not only a great way to work, but also the only practical way to deal with this data in the first place. The researcher knows their intentions better than a post-hoc curator does. Injecting data modelling/reuse considerations into the working lives of researchers is going to take a very long time; I have other projects focussed on just this, but I don't see it becoming the workflow through which the majority of data is piped any time soon.

In the meantime, the approach I believe is best for this type of data is to capture it and then curate by addition. Rather than trying to get systems to learn the individual ways that researchers store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, we shouldn't sweat it that the data we've put out there has a very narrow userbase to begin with, as the act of curation and preservation takes time. We need to start putting (cleared) data out there quickly, in parallel with capturing the information necessary for understanding and then preserving it. By putting the data out there quickly, the researcher feels that they are having an effect and can see that what we are doing for them is worthwhile, which favourably aids the dialogue you need to have with the researcher (or other related researchers or analysts) to further characterise the data.

So, step one for me is to describe a physical view of the filesystem in terms of objects and then to represent this description in terms of RDF and Fedora objects. Luckily, this is straightforward, as it's already pretty intuitive.
  • There are groupings:
    • a top-level directory containing the group folders and a description of the files
    • each group folder contains a folder for each recording in its group
    • each recording folder contains:
      • a .wav file for the recording
      • 6 analysis .dat files (sound analyses such as RMS, f0, etc), dependent on and exclusive to that audio file
      • some optional files, also dependent on and exclusive to that audio file
    • each audio file is symbolically linked into its containing group folder.
So, from this, I have 3 types of object: the top-level 'dataset' object, a grouping object of some kind, and a recording object, containing the files pertinent to a single recording. (In fact, the two grouping classes are preexisting classes in the archival system here, albeit informally.)

We can get a crude but workable representation by using three 'different' Fedora objects (marked as different in the RELS-EXT datastream), and by using the dcterms:isPartOf property to indicate the groupings (recording --- dcterms:isPartOf --> grouping).
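To make this concrete, here is a sketch of generating the grouping triples for one recording object's RELS-EXT with rdflib. The PIDs and the content-model URI are invented for illustration, and marking the object 'type' with fedora-model:hasModel is just one possible way of doing the RELS-EXT marking mentioned above.

# Sketch: RELS-EXT triples for a single recording object. The tick1:* PIDs and the
# recording-model URI are hypothetical; dcterms:isPartOf does the grouping work.
from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
MODEL = Namespace("info:fedora/fedora-system:def/model#")

recording = URIRef("info:fedora/tick1:recording-ch-001")   # hypothetical PID
grouping = URIRef("info:fedora/tick1:group-ch")            # hypothetical PID

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("fedora-model", MODEL)

# recording --- dcterms:isPartOf --> grouping
g.add((recording, DCTERMS.isPartOf, grouping))
# one way to mark the object as 'different' in the RELS-EXT (the model URI is made up)
g.add((recording, MODEL.hasModel, URIRef("info:fedora/tick1:recording-model")))

print(g.serialize(format="xml"))   # RELS-EXT is stored as RDF/XML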



Curation by addition


The way I've approached this with Fedora objects is to use a datastream with a reserved ID to capture the characteristics of the data being held. At the moment, I am using full RDF stored in a datastream called RELS-INT. (NB: I fully expect someone to look at this dataset later in its life and say 'RDF? That's so passé'; the curation of the dataset will be a long-term process that may never entirely end.) RELS-INT is to contain any RDF that cannot be contained by the RELS-EXT datastream. (Yes, having two sources for the RDF where one should do is not desirable, but it's a compromise between how Fedora works and how I would like it to work.)

To indicate that the RELS-INT should also be considered when viewing the RELS-EXT, I add the following assertion (which slightly abuses the intended range of the dcterms:requires property, but reuse before reinvention):

<info:fedora/{object-id}> <http://purl.org/dc/terms/requires> <info:fedora/{object-id}/RELS-INT> .

Term Name: requires
URI: http://purl.org/dc/terms/requires
Definition: A related resource that is required by the described resource to support its function, delivery, or coherence.
I am also using the convention of storing timestamped notes in iCalendar format in a datastream called EVENTS (I am doing this in a much more widespread fashion throughout the archive). These notes are intended to document the curatorial/archivist 'whys' behind changes to the objects, rather than the technical ones, which Fedora keeps track of itself. As the notes are available to be read, they are intended to describe how this dataset has evolved and why it has been changed or added to.
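For concreteness, here is the sort of note I mean, sketched as a hand-rolled VEVENT. The choice of fields (SUMMARY for the short reason, DESCRIPTION for the detail) is just my working convention here rather than a fixed schema, and the UID domain is made up.

# Sketch: a curation note as a minimal iCalendar VEVENT, ready to append to the
# EVENTS datastream. Field choices and the UID domain are illustrative.
from datetime import datetime, timezone

def curation_event(summary, description):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "\r\n".join([
        "BEGIN:VEVENT",
        "UID:" + stamp + "-curation@databank.example.org",   # hypothetical domain
        "DTSTAMP:" + stamp,
        "DTSTART:" + stamp,
        "SUMMARY:" + summary,
        "DESCRIPTION:" + description,
        "END:VEVENT",
    ])

note = curation_event(
    "Added RELS-INT datastream",
    "Recorded file-level relationships that do not fit in RELS-EXT.",
)
calendar = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//databank//curation notes//EN",
    note,
    "END:VCALENDAR",
])
print(calendar)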

An assertion to add, then, is that the EVENTS datastream contains information pertinent to the provenance of the whole object (added to the RELS-INT in this case). I am not happy with the following method, and I am open to suggestions:

<info:fedora/{object-id}> <http://purl.org/dc/terms/provenance>
    [ a <http://purl.org/dc/terms/ProvenanceStatement> ;
      <http://purl.org/dc/elements/1.1/source> <info:fedora/{object-id}/EVENTS> ] .
Term Name: provenance
URI: http://purl.org/dc/terms/provenance
Definition: A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation.
Comment: The statement may include a description of any changes successive custodians made to the resource.
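Putting the two assertions together, here is a sketch of how the RELS-INT graph could be built programmatically; the PID is a placeholder and the property choices simply mirror the triples shown above.

# Sketch: building the RELS-INT graph containing the dcterms:requires assertion and
# the provenance statement pointing at the EVENTS datastream. The tick1:dataset PID
# is hypothetical.
from rdflib import Graph, Namespace, URIRef, BNode
from rdflib.namespace import RDF

DCTERMS = Namespace("http://purl.org/dc/terms/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

obj = URIRef("info:fedora/tick1:dataset")                 # hypothetical PID
rels_int = URIRef("info:fedora/tick1:dataset/RELS-INT")
events = URIRef("info:fedora/tick1:dataset/EVENTS")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("dc", DC)

# "to see the full set of relationships, you also need the RELS-INT datastream"
g.add((obj, DCTERMS.requires, rels_int))

# provenance statement whose source is the EVENTS datastream
prov = BNode()
g.add((obj, DCTERMS.provenance, prov))
g.add((prov, RDF.type, DCTERMS.ProvenanceStatement))
g.add((prov, DC.source, events))

print(g.serialize(format="turtle"))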

From this point on, the characteristics of the files attached to each object can be recorded in a sensible and extendable manner. My next steps are to add simple Dublin Core metadata where I can (for both the objects and the individual files) and to indicate how the data is interrelated. I will also add (as an object in the databank) a basic description of the custom data format, which seems to be based loosely on the NASA FITS format, but not closely enough for FITS tools to work on it or to validate the data.

Data Abstracts: "Binding data to traditional formats of research (articles, etc)"

It should be possible to cite a single piece of data, as well as a grouping, or even an arbitrary set of data and groupings of data. From a re-use point of view, this citation is a form of data currency that is passed around and re-cited, so it makes sense to make it as simple as possible: a data citation as a URL.

I start from the point of view that a single, generic, perfectly modelled citation format for data will take more time and resources to develop than I have. What I believe I can do, though, is enable the more practically focussed case for re-use and the sharing of citations: a single URL (URI) that serves as an anchor node to bind together resources and information. It should provide a means for the original citing author to indicate why they selected this grouping of information, and what this grouping means to them at that time. I can imagine modelling the simple facts of such a citation in RDF assertions (person, date of citation, etc.), but it's beyond me to imagine a generic yet useful way to indicate author intention and perception in the same way. The best I can do is to adopt the model that researchers/academics are most comfortable with, and allow them to record a data 'abstract' in natural language.
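To make that concrete, here is a minimal sketch of the 'simple facts' side of such a data abstract. The anchor URL, the cited item URIs and the choice of dcterms properties are all illustrative rather than a settled model.

# Sketch: a data 'abstract' as an anchor URI plus a handful of simple facts and a
# free-text statement of intent. All URIs and property choices here are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("dcterms", DCTERMS)

abstract = URIRef("http://databank.example.org/cite/42")   # hypothetical anchor URL
cited_a = URIRef("info:fedora/tick1:recording-ch-001")     # hypothetical cited items
cited_b = URIRef("info:fedora/tick1:recording-ch-002")

g.add((abstract, DCTERMS.creator, Literal("A. Researcher")))
g.add((abstract, DCTERMS.created, Literal("2008-10-20")))
g.add((abstract, DCTERMS.references, cited_a))
g.add((abstract, DCTERMS.references, cited_b))
# the author's own reasoning, kept as free text rather than forced into a model
g.add((abstract, DCTERMS.abstract, Literal(
    "These two recordings show the timing contrast discussed in the article.")))

print(g.serialize(format="turtle"))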




Hopefully, this will prove a useful hook for researchers to focus on and to link to or from related or derived data. Whilst groupings are typically there to make it easier to understand the underlying data as a whole, the data abstract is there to record an author's perception of a grouping, based on whatever reasoning they choose to express.

Thursday 16 October 2008

News and updates Oct 2008

Right, I haven't forgotten about this blog; I've just been getting all my ducks in a row, as it were. Some updates:

  • The JISC bid for eAdministration was successful, titled "Building the Research Information Infrastructure (BRII)". The project will categorise the research information structure, build vocabularies where necessary, and populate it with information. It will link research outputs (text and data), people, projects, groups, departments, grant/funding information and funding bodies together, using RDF and as many pre-existing vocabularies as are suitable. The first vocab gap we've hit is one for funding, and I've made a draft RDF schema for this which will be openly published once we've worked out a way to make it persistent here at Oxford (we're trying to get a vocab.ox.ac.uk address).
    • One of the final outputs will be a 'foafbook' which will re-use data in the BRII store - it will act as a blue book of researchers. Think Cornell's Vivo, but with the idea of Linked Data firmly in mind.
    • We are just sorting out a home for this project, and I'll post up an update as soon as it is there.
  • Forced Migration Online (FMO) have completed their archived document migration from a crufty, proprietary store to an ORA-style store (Fedora/Solr) - you can see their preliminary frontend at http://fmo.qeh.ox.ac.uk. Be aware that this is a work in progress. We provide the store as a service to them, giving them a Fedora and a Solr to use. They contracted a company called Aptivate to migrate their content, and I believe also to create their frontend. This is a pilot project to show that repositories can be treated in a distributed way, given out like very smart, shared drive space.
  • We are working to archive and migrate a number of library and historical catalogs. A few projects have a similar aim: to provide an architecture and software system to hold medieval catalog research - a record of what libraries existed, and what books and works they held. This is much more complex than a normal catalog, as each assertion is backed by a type of evidence, ranging from the solid (first-hand catalog evidence) to the more tenuous (handwriting on the front page that looks like that of a certain person who worked at a certain library). Modelling this informational structure is looking to be very exciting, and we will have to try a number of ways to represent it, starting with RDF due to the interlinked nature of the data. This is related to the kinds of evidence that genealogy uses, so related ontologies may be of use.
  • The work on storing and presenting scanned imagery is gearing up. We are investigating storing the sequence of images and associated metadata/OCR text/etc as a single tar file as part of a Fedora object (i.e. a book object will have a catalog record, technical/provenance information, an attached tar file, and a list of file-to-offset information; a rough sketch of that file-to-offset index is at the end of this list).
    • This is because we are trying to hit the 'sweet spot' for most file systems: a very large number of highly compressed images and little pieces of text does not sit well with most FS internals. We estimate that a book will come to around [4+PDFs+2xPages] files, or 500+ typically. Just counting up the various sources of scanned media we already have, we are pressing towards about half a million books from one source, 200,000 images from another, 54,000 from yet another... it's adding up real fast.
  • We are starting to deal with archiving/curating the 'long tail' of data: small, bespoke datasets that are useful to many, but don't fall into the realm of Big Data or Web data. I don't plan on touching Access/FoxPro databases any time soon, though! I am building a Fedora/Solr/eXist box to hold and disseminate these, which should live at databank.ouls.ox.ac.uk very, very shortly. (Just waiting on a new VMware host to grow into; our current one is at capacity.)
    • To give a better idea of the structure, etc, I am writing it up in a second blog post to follow shortly - currently named "Modelling and storing a phonetics database inside a store"
  • I am in the process of integrating the Google-Analytics-style statistics package at http://piwik.org with the ORA interface, to give relatively live hit counts on a per-item basis and to build per-collection reports.
    • Right now, piwik is capturing the hits and downloads from ORA, but I have yet to add in the count display on each item page, so halfway there :)
  • We are just waiting on a number of departments here to upgrade the version of EPrints they are using for their internal, disciplinary repositories, so that we can begin archiving surrogate copies of the work they wish to put up for this service (using ORE descriptions of their items). By doing so, their content becomes exposed in ORA and mirror copies are made (we're working on a good way to maintain these as content evolves), but they retain control of the content; ORA will also act as a registry for it. It's only when their service drops that users get redirected to the mirror copies ORA holds (think Google cache, but a 100% copy).
  • We are in the process of battle-testing the Fedora-Honeycomb connection but, as mentioned above, are waiting for a little more hardware before I set to it. We are also examining a number of other storage boxes that should plug in under Fedora using the Honeycomb software, such as the new and shiny Thumper box, the "Thor" Sun Fire Xsomething-or-other. I'm also getting pretty interested in the idea of MAID storage (massive array of idle disks); hopefully this will act like tape but with the sustained access speed of disk, and be a little greener than a tower of spinning hardware.
  • Planning out the indexer service at the moment. It will use the Solr 1.3 multicore functionality, with a little parsing magic on the ingest side, to make a generic indexer-as-a-service type system. One use case is to be able to bring up VMs running multicore Solr to act as indexers/search engines as needed. An example aim? "Economics want an index that facets on their JEL codes." POST a schema and ingest configuration to the nearest free indexer, and point the search interface at it once an XMPP message comes back saying it is finished.
  • URI resolvers - still investigating what can be put in place for this, as I strongly wish to avoid coding it myself. I'm looking at OCLC's OpenURL work and how I can hack it to feed it info:fedora URIs and link them to their disseminated locations. Also, a tinyurl-type library plus a simple interface might not be a bad idea for a quick PoC.
  • Just to let you all know that we are building up the digital team here; we most recently held interviews for the futureArch project, and we are looking to hire about three more people, given all the projects we are doing. We will be putting out job adverts as and when we feel up to battling with HR :)
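As promised above, here is a rough sketch of the file-to-offset index idea for the scanned-image tar files. TarInfo.offset_data is a CPython implementation attribute rather than documented API, the tar is assumed to be uncompressed, and the file names are invented, so this is illustrative only.

# Sketch: build a name -> (offset, size) index for an uncompressed tar of page images,
# so a single page can be read back without unpacking the whole archive.
import tarfile

def build_offset_index(tar_path):
    index = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                # offset_data is where this member's bytes start inside the tar
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    offset, size = index[name]
    with open(tar_path, "rb") as f:      # plain seek+read, no tar unpacking
        f.seek(offset)
        return f.read(size)

if __name__ == "__main__":
    idx = build_offset_index("book-12345.tar")                  # hypothetical book tar
    page = read_member("book-12345.tar", idx, "pages/0001.tif")  # hypothetical page file
    print(len(page), "bytes")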
That's most of the more interesting hot topics and projects I am doing at the moment.... phew :)