Monday, 10 December 2007

The 'rules' of [digital] preservation

These are not rules of how digital preservation should be done, but more like rules or statements of how preservation is or, more importantly, is not being done well and what I think might be done about it.

The rules:

First rule of preservation is that no creator is worried about preservation. Well, it's less of a rule and more of a cold, hard fact. People just don't think about how something they create digitally is going to be preserved, even though they make decision after decision that could significantly affect how the resource can be used later on. The user couldn't care less about preserving it; they just want it to work right now with the minimum of fuss.

Second rule of preservation is best summed up by the simple statement "garbage in, garbage out". This is where the majority of the significant problems arise, and it is also the one where a technological answer may not be good enough. The people doing the creating don't really care if it looks a bit ropey or if something they rely on is a proprietary toy that may not be around for much longer. As long as what they produce does the job for the time they need it to, they are happy.

Third rule of preservation is that everyone gets very excited about file formats, especially about the spectre of file format obsolescence. I really, truly do think it's just a spectre, and that there are far more real obstacles to overcome right now. (See rule #2.) The people having to deal with the garbage that comes in are focusing on technological solutions that take a simple view of the items coming in (e.g. Word-processor doc = Bad, PDF = Good) rather than a more internalised, detailed view of what is coming in, assessing along the lines of: PDF with tabular data held as images, or an illegal/custom font = Bad; Word-processor file with Unicode used throughout = Good. People thinking about preservation tend to look at the outside of a file, rather than at its contents.
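As a rough illustration of judging a file by its contents rather than its wrapper, here is a minimal sketch that inspects text already extracted from a document. The function name `content_warnings` and the three checks are my own assumptions for illustration, not an established tool; the private-use check rests on the observation that custom and symbol fonts often place their glyphs in Unicode's private-use area.

```python
import unicodedata


def content_warnings(text):
    """Return a list of content-level warnings for text extracted from a file.

    Hypothetical helper: the checks are illustrative, not exhaustive.
    """
    warnings = []

    # Characters in the private-use area (category "Co") only mean something
    # when paired with one particular font.
    private_use = sum(1 for ch in text if unicodedata.category(ch) == "Co")
    if private_use:
        warnings.append(f"{private_use} private-use character(s), likely tied to a custom font")

    # U+FFFD is left behind when bytes could not be decoded.
    replacement = text.count("\ufffd")
    if replacement:
        warnings.append(f"{replacement} replacement character(s), text was already damaged")

    # No text at all may mean the content is held as images.
    if not text.strip():
        warnings.append("no extractable text, content may be held as images")

    return warnings
```

A clean Unicode document yields an empty list whatever its file extension, while a "good" format carrying font-dependent characters does not.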

Fourth rule of preservation is that everyone seems to divide into two camps: Nursemaids and Tyrants. (Yes, there are likely better, better-known terms for what I describe below. Please use the comments below to point them out.)
  • The nursemaids will seek to care for ailing formats, writing things like migration tools to take something from one version of a format to the latest version, Java applet viewers for old documents, and emulators/shims for other, more esoteric formats.
    • Taking the nursemaid approach completely will involve a vast amount of work and detailed knowledge of the formats in question, and there is the distinct possibility that certain forms of support are utterly intractable or even illegal (DRM).
  • The tyrants will dictate that all file formats should be mapped into their essential information, and this information will be put into a universal format. Often, the word 'semantic' appears at some point.
    • Taking the tyrants' path, normalising everything, also requires a vast amount of work and file format knowledge. In addition, one or more 'universal' formats have to be selected, formats which can both hold this data and present it with the same context as the original.
So, what to do?

#1 "Educate the user" is a simple enough solution to say, but educate them how? The route I am taking is to inform them about bad encodings, how to properly type with different character sets, why open formats are good, and how Unicode and open standards will help ensure that the work they are producing now can still be read or watched decades from now.
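To make "bad encodings" concrete for a user, a tiny demonstration can help. The sketch below is only an illustration, and the function names are my own: it shows what happens when UTF-8 bytes are read as Latin-1, and how to check whether a byte string is valid UTF-8 at all.

```python
def show_mojibake(text):
    """Encode text as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(show_mojibake("café"))                  # prints cafÃ©
print(is_valid_utf8("café".encode("utf-8")))   # prints True
print(is_valid_utf8("café".encode("latin-1"))) # prints False
```

The word is intact in both encodings; the damage only appears when the reader guesses the wrong one, which is why stating the encoding matters.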

#2 "Stop people creating garbage". More users have to be made aware that the majority of people need to be trained to use software products effectively, and that this applies to them too. Hopefully this will help curb the number of flow chart diagrams drawn in an MS Excel spreadsheet, the number of diagrams submitted as Encapsulated PostScript, and the number of documents using fonts to make normal text look like Coptic, Greek, or Russian, rather than changing how those words are entered in the first place.

#3 "Focus on what's inside the file, rather than the package you got it in." Whilst it is useful to detect when a certain file format is going to be a problem for those downloading it, the key point is that something will then need to be done about it. If the file is full of garbage, then migration is not going to be easy or even possible. For example, examine the number of classicists using the fonts Normyn, SPIonic, GreekKeys, Athenian and other more custom fonts in their documents. The thing that unites all of these fonts is that they each have a custom way of mapping the letters A-z onto Greek or Latin glyphs. As time goes on, these mappings and the fonts themselves get harder and harder to find. Good luck migrating those!
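To show why the mapping table is the thing that has to survive, here is a minimal sketch of migrating font-encoded text to Unicode. The table `LEGACY_TO_UNICODE` is a hypothetical excerpt invented for illustration; it is not the real mapping of SPIonic or any other font named above, and the function name is likewise my own.

```python
# Hypothetical excerpt of a legacy font mapping (NOT any real font's table).
LEGACY_TO_UNICODE = {
    "a": "α", "b": "β", "g": "γ", "d": "δ",
    "e": "ε", "l": "λ", "o": "ο", "s": "σ",
}


def legacy_to_unicode(text, mapping):
    """Convert font-encoded text to Unicode.

    Returns the converted string and the set of characters that had no
    entry in the mapping (whitespace is passed through silently).
    """
    converted = []
    unmapped = set()
    for ch in text:
        if ch in mapping:
            converted.append(mapping[ch])
        else:
            converted.append(ch)
            if not ch.isspace():
                unmapped.add(ch)
    return "".join(converted), unmapped


result, missing = legacy_to_unicode("logos x", LEGACY_TO_UNICODE)
print(result)   # prints λογοσ x
print(missing)  # prints {'x'}
```

Even this toy version gets the final sigma wrong, and every character in the `missing` set is text that cannot be migrated until someone finds the original font's table.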

#4 "The problem is not with the files, but PEBKAC" - Problem Exists Between Keyboard And Chair - the user. (There is a good argument that users' poor choices are to do with the computing environment they are given, but since the environment isn't going to change without user demand...) A large set of the problems will arise from users not using the tools they have properly. A second large set of problems arises from DRM and other forms of locked-in format, such as Microsoft Word. If someone can hand me working technical solutions to these problems, then that will be fantastic, but until that time I cannot say whether one methodology is better than the other. I will be seeking to educate users, to stop the garbage coming in in the first place. And when I get garbage in? Pragmatism will dictate the next moves.

1 comment:

Sean said...

I really like these rules, perhaps because I think they are right.

Wondering what side of the tyrant/nursemaid fence you sit?