Monday 7 July 2008

Archiving Webpages with ORE

(The idea presented here is from the school of "write it down, and then see how silly/workable it is".)


This follows on from pkeane's example on the OAI-ORE mailing list about constructing an Atom 'feed' that lists the resources linked to by a webpage. Well, it was more a post wondering what ORE provides that we didn't have before, which for me is the idea of an abstract model with multiple possible serialisations. But anyway, I digress.

(pkeane++ for an actual code example too!)
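To make that concrete, here's a rough Python sketch of the resource-listing step. This is not pkeane's code; the names are mine, and a real version would go on to serialise the collected URIs as Atom entries or as an RDF graph:

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class ResourceCollector(HTMLParser):
        """Collect the URIs of resources a page links to or embeds."""

        # which attribute carries the URI for each tag we care about
        TAGS = {"img": "src", "script": "src", "link": "href", "a": "href"}

        def __init__(self, base_uri):
            super().__init__()
            self.base_uri = base_uri
            self.resources = set()

        def handle_starttag(self, tag, attrs):
            wanted = self.TAGS.get(tag)
            if wanted:
                for name, value in attrs:
                    if name == wanted and value:
                        # resolve relative references against the page URI
                        self.resources.add(urljoin(self.base_uri, value))


    def list_page_resources(page_uri):
        """Return the URIs that would become entries in the page's 'feed'."""
        html = urlopen(page_uri).read().decode("utf-8", errors="replace")
        collector = ResourceCollector(page_uri)
        collector.feed(html)
        return collector.resources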

For me, this could be the start of a very good, incremental method for archiving static/semi-static (wiki) pages.

Archiving:
  1. Create a 'feed' for the page (either as an Atom feed or an RDF serialisation)
    • The feed should clearly assert which of the resources is the (X)HTML resource being archived.
  2. Walk through the resources, and work out which ones are suitable for archiving
    • Ignore adverts, perhaps video, and maybe also some remote resources. (Decisions here are policy-based and the process is incremental: step 2 can be revisited with new policy decisions, such as remote PDF harvesting, and so on.)
  3. For each resource selected for archiving (a rough sketch follows this list):
    1. Copy it by value to a new, archived location.
    2. Add this new resource to the feed.
    3. Indicate in the feed that this new resource is a direct copy of the original (using the new RDF-in-Atom syntax, or just plain RDF in the graph).
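A minimal sketch of step 3, assuming a local directory stands in for the 'archived location' and a list of (subject, predicate, object) tuples stands in for whatever actually gets written into the feed or graph; the 'isDuplicateOf' predicate is made up for illustration, not an ORE term:

    import hashlib
    import os
    from urllib.request import urlopen


    def archive_resources(resource_uris, archive_dir, archive_base_uri):
        """Copy each resource by value and return assertions to add to the feed."""
        os.makedirs(archive_dir, exist_ok=True)
        assertions = []
        for original in resource_uris:
            data = urlopen(original).read()
            # name the copy by content hash, so re-archiving the same bytes is idempotent
            name = hashlib.sha1(data).hexdigest()
            with open(os.path.join(archive_dir, name), "wb") as out:
                out.write(data)                                   # 3.1 copy by value
            archived = archive_base_uri.rstrip("/") + "/" + name  # 3.2 new resource
            assertions.append((archived, "isDuplicateOf", original))  # 3.3 relate to original
        return assertions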
Presentation: (Caching-reliant)
  1. A user queries the service for a representation of an archived page.
  2. The service retrieves the ORE map for the requested page from its internal store.
  3. Resource determination. Again, this is policy-based; some suggestions:
    • Last-Known-Good: replace all URIs in the (X)HTML source with their archived duplicates, and send the page to the user. (Assumes the duplicates are RESTful, i.e. the archived URIs can be dereferenced with a plain GET.) A rough sketch follows this list.
    • Optimistic: wrap embedded resources with JavaScript that tries the original resources first and falls back to the archived versions on failure or timeout.
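The Last-Known-Good swap could be as blunt as the sketch below. It is deliberately naive: plain string substitution over the page source, reusing the hypothetical assertion tuples from the archiving sketch above; a real version would rewrite the markup properly:

    def build_lookup(assertions):
        """Map original URI -> archived URI from the (archived, rel, original) tuples."""
        return {original: archived for archived, _rel, original in assertions}


    def last_known_good(xhtml, lookup):
        """Swap every known original URI for its archived duplicate in the page source."""
        for original, archived in lookup.items():
            xhtml = xhtml.replace(original, archived)
        return xhtml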
Presentation: (CPU-reliant)
  1. Service processes ORE map on completion of resource archiving
  2. Resource determination. Again, this is policy-based, with the same suggestions as above:
    • Last-Known-Good: replace all URIs in the (X)HTML source with their archived duplicates. (Assumes the duplicates are RESTful, i.e. the archived URIs can be dereferenced with a plain GET.)
    • Optimistic: wrap embedded resources with JavaScript that tries the original resources first and falls back to the archived versions on failure or timeout (a rough sketch of this wrapping follows this list).
  3. Service stores a new version of the (X)HTML with the URI changes, adds this to the feed, and indicates that this is the archived version.
  4. A user queries the service for a representation of an archived page and gets the stored version back directly.
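The Optimistic wrapping could be done while the archived (X)HTML is being generated. Here's a small sketch for embedded images, using an onerror fallback as an easy approximation of the timeout described above (the function name is mine):

    import html


    def optimistic_img(original_uri, archived_uri):
        """Emit an <img> that prefers the live resource and falls back to the archive."""
        orig = html.escape(original_uri, quote=True)
        arch = html.escape(archived_uri, quote=True)
        # onerror fires if the original can't be fetched, so the archived copy takes over
        return (
            f'<img src="{orig}" '
            f'onerror="this.onerror=null; this.src=\'{arch}\';">'
        )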
So, one presentation method relies on caching, but doesn't need a lot of CPU power to get up and running. The archived pages are also quick to update, and this route may even be a nice way to 'discover' pages for archiving, i.e. the method for archive URL submission is the same as the request itself. Archiving can continue in the background, while users get a progressively more archived view of the resource.

The upshot of having dynamic URI swapping on page request is that there can be multiple copies of each resource, in potentially mobile locations, and the service can 'round-robin' between them or pick the best copies to serve as replacement URIs. This is obviously a lot more difficult to implement with static 'archived' (X)HTML, and would involve URI lookup tables embedded into the DNS or the resource server.
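To make the round-robin idea concrete, here is a tiny sketch (the names are mine) of a picker that cycles through whatever archived copies are known for a given original URI. Swapping it in for the plain lookup in the Last-Known-Good sketch above would give per-request copy selection:

    from itertools import cycle


    class CopyPicker:
        """Round-robin over the known archived copies of each original URI."""

        def __init__(self, copies):
            # copies: {original_uri: [archived_uri, archived_uri, ...]}
            self._cycles = {orig: cycle(dupes) for orig, dupes in copies.items()}

        def pick(self, original_uri):
            """Return the next archived copy to use for this original URI."""
            return next(self._cycles[original_uri])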

1 comment:

Anonymous said...

Glad to see someone noticed that message! I fear I have been too ORE-cranky on the mailing list, when in fact I am pretty excited about it. It's the data model, as you say, that's quite valuable.