Monday 30 March 2009

Early evaluation and serialisation of preservation policy decisions.

(Apologies: this sat as a draft when I thought it had been published. I have updated it to reflect the changes made since we started doing this.)

It may be policy to make sure that the archive's materials are free of computer malware. One part of enacting this policy is running anti-virus and anti-spyware scans of the content. However, malware may sit in the archive for a number of months before it is widely recognised as such. So, enacting a 'no malware' policy would mean that content is scanned on ingest, again 3 to 6 months after ingest, and one final time a year later.
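As a rough sketch of how that schedule could be worked out at ingest time: the offsets below are assumptions (the policy only says "3 to 6 months" and "a year later", and here the final scan is counted from the recheck rather than from ingest), and the function name is made up for illustration.

```python
from datetime import date, timedelta

# Assumed offsets in days; the policy text only gives "3 to 6 months"
# and "a year later", so these exact values are illustrative.
RECHECK_AFTER_DAYS = 120
FINAL_AFTER_DAYS = 365  # counted from the recheck (an assumption)


def malware_scan_dates(ingest_date):
    """Return the three scan dates implied by a 'no malware' policy."""
    recheck = ingest_date + timedelta(days=RECHECK_AFTER_DAYS)
    final = recheck + timedelta(days=FINAL_AFTER_DAYS)
    return [ingest_date, recheck, final]


if __name__ == "__main__":
    for when in malware_scan_dates(date(2009, 3, 30)):
        print(when.isoformat())
```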

Given that it is possible to monitor when changes occur to the preservation archive, it is not necessary to run continual sweeps of the content held in the archive to assess whether a preservation action is needed or not. Most actions that result from a preservation policy choice can be pre-assigned to an item when it undergoes a change of state (creation, modification, deletion, extension).
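To make the idea of pre-assignment a bit more concrete, here is a minimal sketch. The policy table, the action names and the day offsets are all hypothetical; only the four kinds of state change come from the text above.

```python
from datetime import date, timedelta

# Hypothetical policy table: state change -> (action name, days after the change).
# Action names and offsets are illustrative assumptions, not the Oxford policy.
POLICY = {
    "creation": [("malware-scan", 0), ("malware-scan", 120), ("file-format-check", 0)],
    "modification": [("malware-scan", 0), ("reindex", 0)],
    "deletion": [],
    "extension": [("reindex", 0)],
}


def assign_actions(change, on):
    """Return (action, due date) pairs to record against an item after a change."""
    if change not in POLICY:
        raise ValueError(f"unknown state change: {change!r}")
    return [(action, on + timedelta(days=offset)) for action, offset in POLICY[change]]
```

The point is that the decision is made once, when the change happens, rather than being rediscovered by sweeping the whole archive.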

These decisions for actions (and also the auditable record of them occurring) are recorded inside the object itself. In a bid to reuse a standard rather than reinvent one, this serialisation uses the iCal standard. iCal already has a proven capability to mark and schedule events, handle recurring events, and attach arbitrary information to individual events.
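For illustration, a scheduled action might be serialised along these lines. This is a sketch, not the serialisation actually used: the choice of a VTODO component, the X-PRESERVATION-ACTION property, the PRODID and the example identifiers are all assumptions, and text escaping and line folding are left out.

```python
from datetime import date, datetime


def vtodo(uid, due, summary, action_uri, stamp):
    """Build the content lines for one scheduled preservation action."""
    return [
        "BEGIN:VTODO",
        f"UID:{uid}",
        f"DTSTAMP:{stamp:%Y%m%dT%H%M%SZ}",
        f"DUE;VALUE=DATE:{due:%Y%m%d}",
        f"SUMMARY:{summary}",  # no escaping of commas/semicolons in this sketch
        f"X-PRESERVATION-ACTION:{action_uri}",  # hypothetical extension property
        "STATUS:NEEDS-ACTION",
        "END:VTODO",
    ]


def build_calendar(components):
    """Wrap component lines in a VCALENDAR, using CRLF line endings."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//preservation-sketch//EN"]
    for component in components:
        lines.extend(component)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    todo = vtodo("scan-1@example.org", date(2009, 7, 28), "Anti-virus rescan",
                 "info:example/actions/av-scan", datetime(2009, 3, 30, 12, 0, 0))
    print(build_calendar([todo]))
```

The X- property is where a reference to the archived description of the action could live.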

For the archive to self-describe and be preservable for the longer term, it is necessary for the actions taken to be archivable in some way too. A human-readable description of the action, alongside a best-effort attempt to describe this in machine-readable terms, should be archived and referenced by any event that is an instance of that action. ('best-effort' due to the underwhelming nature of the current semantics and schemas for describing these preservation processes.)

In the Oxford system, an iCal calendar implementation called Darwin Calendar Server was initially used to provide a queryable index of the preservation actions, along with a report of what events needed to be queued to be performed on a given day. These actions are queued in the short-term job queues (technically, held in persistent AMQP queues) for later processing. However, the various iCal server implementations were neither lightweight nor stable enough to be easily reused, so from this point on simple indexes are created as needed from the serialised iCal, and retained, to be used in the server's stead.
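A 'simple index' of this kind could be as little as a mapping from due date to the actions due that day. The sketch below assumes the actions are stored as VTODO components with a DUE property (an assumption), uses a deliberately naive parser, and leaves the hand-off to the AMQP queue out entirely.

```python
from collections import defaultdict
from datetime import date, datetime


def unfold(text):
    """Undo iCal line folding: a line starting with space/tab continues the previous one."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        elif raw:
            lines.append(raw)
    return lines


def index_due_dates(ical_text, object_id, index=None):
    """Add every outstanding VTODO in ical_text to a {due date: [(object, uid)]} index."""
    index = defaultdict(list) if index is None else index
    current = None
    for line in unfold(ical_text):
        # Naive split: ignores quoted parameter values that contain ':'.
        name_params, _, value = line.partition(":")
        name = name_params.split(";")[0].upper()
        if name == "BEGIN" and value.upper() == "VTODO":
            current = {}
        elif name == "END" and value.upper() == "VTODO" and current is not None:
            if "DUE" in current and current.get("STATUS") != "COMPLETED":
                due = datetime.strptime(current["DUE"][:8], "%Y%m%d").date()
                index[due].append((object_id, current.get("UID")))
            current = None
        elif current is not None:
            current[name] = value
    return index


def due_on(index, day: date):
    """Return the (object, uid) pairs that should be queued on the given day."""
    return list(index.get(day, []))
```

Whatever `due_on` returns for today is what would be pushed onto the job queue.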

Preservation actions such as scanning (viruses, file formats, etc.) are not the only systems to benefit from monitoring the state of an item. Text- and data-mining and analysis, indexing for search indices, dissemination copy production, and so on are all actions that can be driven and kept in line with the content in this way. For example, it is likely that the indices will be altered or benefit from refreshing on a periodic basis, and the event of last-indexing can be included in this iCal file as a VJOURNAL entry.
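Recording a 'last indexed' note might look something like this. Again a sketch: the function names, the UID and the summary text are made up, and it simply slots a VJOURNAL component in before the closing END:VCALENDAR of an existing iCal serialisation.

```python
from datetime import datetime, timezone


def journal_entry(uid, when, summary):
    """Build VJOURNAL content lines; `when` should be a timezone-aware datetime."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return [
        "BEGIN:VJOURNAL",
        f"UID:{uid}",
        f"DTSTAMP:{stamp}",
        f"DTSTART:{stamp}",
        f"SUMMARY:{summary}",  # no text escaping in this sketch
        "STATUS:FINAL",
        "END:VJOURNAL",
    ]


def append_component(ical_text, component_lines):
    """Insert a component just before the final END:VCALENDAR line."""
    marker = "END:VCALENDAR"
    head, found, tail = ical_text.rpartition(marker)
    if not found:
        raise ValueError("not a VCALENDAR serialisation")
    return head + "\r\n".join(component_lines) + "\r\n" + marker + tail
```

An indexer would call `append_component` after each successful run, so the record of when the item was last indexed travels with the object.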

NB: no effort has been made to intertwine the iCal-serialised information with the packaging standard used, as this is expected both to take considerable time and effort and to severely limit our ability to reuse or migrate content from this packaging standard to a later, newer format. It is stored as a separate Fedora datastream within the same object it describes, and the internal RDF manifest registers it as an iCal file containing preservation event information.

Thursday 19 March 2009

We need people!

(UPDATE - Grrr.... seems that the concept of persistent URLs is lost on the admin - link below has been removed - see google cached copy here)

http://www.admin.ox.ac.uk/ps/oao/ar/ar3979j.shtml - job description.

Essentially, we need smart people who are willing to join us to do good, innovative stuff; work that isn't by-the-numbers with room for initiative and ideas.

Help us turn our digital repository into a digital library. It'll be fun! Well, maybe not fun, but it will be very interesting at least!

bulletpoints: python/ruby frameworks, REST, a little SemWeb, ajax, jQuery, AMQP, Atom, JSON, RDF+RDFa, Apache WSGI deployment, VMs, linux, NFS, storage, RAID, etc.