Monday 18 August 2008

The four rules of the web and compound documents

A real quirk that truly interests me is the difference in aims between the way documents are typically published and the way that the information within them is reused.

A published document is normally in a single 'format' - a paginated layout, and this may comprise text, numerical charts, diagrams, tables of data and so on.

My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; The full paper gives the context to the information, but it is much more likely that a small part of this paper contains the novel insight being referenced.

In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn't solve this issue that the author hadn't put time to considering that a chart, a table or a section of text would be reused.

So, on the web, where multiple representations of the same information is getting to be commonplace (mashups, rss, microblogs, etc), what can we do to help better fulfill both aims, to show a paginated final version of a document, and also to allow each of the components to exist as items in their own right, with their own URIs (or better, URLs containing some notion of the context e.g. if /store/article-id gets to the splash page of the article, /store/article-id/paragraph-id will resolve to the text for that paragraph in the article.)

Note that the four rules of the web (well, of Linked Data) are in essence:
  • give everything a name,
  • make that name a URL ...
  • ...which results in data about that thing,
  • and have it link to other related things.
[From TimBL's originating article. Also, see this presentation - a remix of presentations from TimBL and the speaker, Kingsley Idehen - given at the recent Linked Data Planet conference]

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

One thing first, we have to get over the legal issue of just storing and presenting a bitwise perfect copy of what an author gives us. We need to let author's know that we may present alternate versions, based on a user's demands. This actually needs to be the case for preservation and the repository needs to make it part of their submission policy to allow for format migrations, accessibility requirements and so on.

The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.

When a compound document is submitted to the archive, a second parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements. One issue that has already come up in testing, is that documents tend to clump multiple, separate diagrams together into a single physical image. It is likely that the only solution to breaking these up to this is going to be a human one, either author/publisher education(unlikely) or by breaking them up by hand.

I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to 'recreate' the document roughly.

Summary:

1) Break apart any compound document into its constituent elements (paragraph level is suggested for text)
2) Make sure that each one of these parts are clearly expressed in the context they are in, using hierarchical URLs, /article/paragraph or even better, /article/page/chart
3) On the article's splashpage, make a clear distinction between the real article and the broken up version. I would suggest a scheme like Google search's 'View [PDF, PPT, etc] as HTML'. I would assert that many people intuitively understand that this view is not like the original and will look or act differently.

Some related video blogs from the Crigshow trip
Finding and reusing algorithms from published articles
OCR'ing documents; Real documents are always complex
Providing a systematic overview of how a Research paper is written - giving each component and each version of a component would have major benefits here

5 comments:

Anonymous said...

Ben,

I assuem you know you've just explained "Linked Data" in a nutshell via it's fundamental principles.

I am making this comment primary to relink this post to the broad discourse around the subject of "Linked Data on the Web".

Also see:
1. My Linked Data Presentation (a remix of presentations from TimBL and I from the recent Linked Data Planet conference)
2. http://linkeddata.org

Anonymous said...

I mean't "assume" no "assuem", in my last post :-)

Ben O'Steen said...

I believe this is known as throwing stones from within a glass house... talking about Linked Data, without actually linking to the documents about it...

Thanks for adding the links, I'll edit the post to promote it further.

pkeane said...

"I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to 'recreate' the document roughly."

Does OAI-ORE have a role here? Seems like a natural fit. (Of course there would need to be an easy/standard way to convert the serialized ORE to html is it wasn't html already (i.e., RDFa).

Unknown said...

There is a lot of scope in this however there are some things to be wary of.

1) When sub-linking you still need to be able to resolve that the link is relevant to the first class item for statistical and download metrics purposes.
2) You also need to be careful with page numbers, it would be much better to reference section numbers and figure numbers.

It's a problem which isn't going to go away fast, but it would be nice to get a PDF reader supporting highlighting of referenced sections pdf?highlight=2.3.1:20-30 anyone?