Monday 14 July 2008

OSCELOT Open Source Day III - views

The event was held at the Nevada gaming institute and was, overall, a well-structured day. The driving ideology was that of the unconference - "... a facilitated, face-to-face, and participant-driven conference centered around a theme or purpose."

However, it seemed that the theme or purpose of the event was not really Open Source - it was as if it were a Blackboard self-help group, trying to solve the issues and failings of that proprietary software. Some of the issues were a little shocking - someone reported "a need to search the content of [Blackboard Vista] repository" - it came as some surprise to me that this wasn't already possible in such a mature product.

I was pleased that we were able to help and inform the other attendees about more open technologies and standards, such as OAuth, resource-orientated architecture, Creative Commons licensing and more.

One session I led was titled - controversially - "Why [bother with] Portals?" - in which I wanted to get a discussion going on what students actually use. The point I wanted to make was that URLs are the base currency of the internet - search engines produce lists of them, people bookmark them, and URLs are used when sharing information between people.

This means that there is a very large responsibility on content providers not to change URLs, or they will devalue the very resources they are trying to get people to use. This is why persistent URLs are a crucial thing to aim for.

I hope that we were able to bring extra value to the meeting because, unlike the vast majority of attendees, we do not have a Blackboard background.

However, I do think that the event needed to place more emphasis on real-world open source projects such as Sakai and Moodle, and to examine how best to integrate them with external systems.

Tuesday 8 July 2008

Open replacements for Twitter and, more importantly, TinyURL

I hope that you all already know about http://identi.ca and the laconi.ca software stack that it runs - in short, it's a Twitter-like micro-blogging service that is geared to be open. It opens up the possibility of a distributed set of micro-blogging services that can talk to each other. Pretty cool.

But the less well-known release was that of the lilurl service - a TinyURL replacement, again geared to be very open. For example, the database of links the service holds can be downloaded by any user! BUT it lacks an API for creating these links on the fly...

I think you've already guessed the end of that statement: I've made an API for it, as the base code for the service is open source. Hearty thanks to Evan Prodromou!

So, the changes from the source (which is at http://ur1.ca/ur1-source.tar.gz) are as follows:

Firstly, change the .htaccess rewrite rules:

From:
  • RewriteRule (.*) index.php
To:
  • RewriteRule s/(.*) index.php


This requires a few cosmetic changes to index.php so that it serves correct lil'urls:

Line 41 in index.php:
From:
  • $url = 'http://'.$_SERVER['SERVER_NAME'].'/'.$lilurl->get_id($longurl);
To:
  • $url = 'http://'.$_SERVER['SERVER_NAME'].'/s/'.$lilurl->get_id($longurl);
And then, in the root directory of the app, add api.php, which is currently pastebinned:

http://pastebin.com/f29465399 - api.php
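Just to give a flavour of what api.php does (this is not the pastebinned file - the include paths, the class name and the get_url() lookup helper below are all guesses on my part; only get_id() actually appears in the stock index.php), a sketch along these lines would do the job:

<?php
// api.php - hypothetical sketch only; the real file is the pastebin above.
// The include paths and class name are assumptions - use whatever the stock
// index.php pulls in.
require_once 'includes/conf.php';
require_once 'includes/lilurl.php';

$lilurl = new lilURL(); // class name assumed

if ($_SERVER['REQUEST_METHOD'] == 'POST' && isset($_POST['longurl'])) {
    // Create (or fetch) the short id for the submitted URL and return the
    // full lilurl in the response body, with a 201 status.
    $id = $lilurl->get_id($_POST['longurl']);
    header('HTTP/1.1 201 Created');
    echo 'http://'.$_SERVER['SERVER_NAME'].'/s/'.$id."\n";
} elseif (isset($_GET['id'])) {
    // Look up an existing id and return the long URL in the response body.
    // get_url() is a hypothetical lookup method - substitute the real one.
    $longurl = $lilurl->get_url($_GET['id']);
    if ($longurl) {
        header('HTTP/1.1 200 OK');
        echo $longurl."\n";
    } else {
        header('HTTP/1.1 404 Not Found');
    }
} else {
    header('HTTP/1.1 400 Bad Request');
}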

How it works - Creating lilurls:

POST to /api.php with the parameter longurl set to the desired URL.

This will be the response:

HTTP/1.1 201 Created
Date: Tue, 08 Jul 2008 16:51:13 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/5.1.4
Content-Length: 36
Connection: close
Content-Type: text/html; charset=utf-8

http://somehost.com/s/1


The message body of the response will contain the short URL. Similarly, you can look up a given lilurl with GET /api.php?id=lilurl_id:

GET /api.php?id=1 HTTP/1.1
Host: somehost.com
content-type: text/plain
accept-encoding: compress, gzip
user-agent: Basic Agent

HTTP/1.1 200 OK
Date: Tue, 08 Jul 2008 16:55:00 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/5.1.4
Content-Length: 24
Connection: close
Content-Type: text/html; charset=utf-8

http://ora.ouls.ox.ac.uk

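And for completeness, here's a quick client-side sketch of both calls using PHP's curl extension (somehost.com is just a placeholder, as in the captures above):

<?php
// Hypothetical client for the api.php above - somehost.com is a placeholder.

// Create a lilurl: POST longurl=<desired url> to /api.php
$ch = curl_init('http://somehost.com/api.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('longurl' => 'http://ora.ouls.ox.ac.uk')));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$shorturl = trim(curl_exec($ch));
curl_close($ch);
echo "Short URL: $shorturl\n";   // e.g. http://somehost.com/s/1

// Look up an existing lilurl: GET /api.php?id=<lilurl id>
$ch = curl_init('http://somehost.com/api.php?id=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$longurl = trim(curl_exec($ch));
curl_close($ch);
echo "Long URL: $longurl\n";     // e.g. http://ora.ouls.ox.ac.uk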

So... er, yeah. Job done ;)

Monday 7 July 2008

Archiving Webpages with ORE

(The idea presented here is from the school of "write it down, and then see how silly/workable it is".)


This follows on from the example by pkeane on the OAI-ORE mailing list about constructing an Atom 'feed' listing the resources linked to by a webpage. Well, it was more a post wondering what ORE provides that we didn't have before - which, for me, is the idea of an abstract model with multiple possible serialisations. But anyway, I digress.

(pkeane++ for an actual code example too!)

For me, this could be the start of a very good, incremental method for archiving static/semi-static (wiki) pages.

Archiving:
  1. Create a 'feed' of the page (either as an Atom feed or an RDF serialisation)
    • It should be clearly asserted in the feed which one of the resources is the (X)HTML resource that is the page being archived.
  2. Walk through the resources, and work out which ones are suitable for archiving
    • Ignore adverts, perhaps video, and maybe also some remote resources (decisions here are based on policy, and the process is an incremental one - step 2 can be revisited with new policy decisions, such as remote PDF harvesting and so on.)
  3. For each resource selected for archiving (see the sketch after this list):
    1. Copy it by value to a new, archived location.
    2. Add this new resource to the feed.
    3. Indicate in the feed as well that this new resource is a direct copy of the original (using the new RDF-in-Atom syntax, or just plain RDF in the graph).
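To make that a bit less hand-wavy, here's a rough sketch of the archiving walk (steps 2 and 3) in PHP - entirely hypothetical: the resource list would really come out of parsing the ORE resource map/Atom feed, the policy check is just a stub, and the paths and hosts are made up:

<?php
// Hypothetical sketch of the archiving walk (steps 2-3 above).
// $resources would really be pulled out of the ORE resource map / Atom feed;
// it is hard-coded here purely for illustration.
$resources = array(
    'http://example.org/page.html',        // the (X)HTML page being archived
    'http://example.org/styles.css',
    'http://example.org/images/fig1.png',
);

$archive_dir  = '/var/archive/example.org/';                // assumed local store
$archive_base = 'http://archive.example.org/example.org/';  // assumed archived URI prefix

$copies = array();  // original URI => archived URI, to be asserted back into the feed

foreach ($resources as $uri) {
    // Step 2: policy decision - a real implementation would filter adverts,
    // video, some remote resources and so on; this stub archives everything.
    if (!passes_archiving_policy($uri)) {
        continue;
    }

    // Step 3.1: copy the resource by value to the archived location
    // (file_get_contents on a URL assumes allow_url_fopen is enabled).
    $data = file_get_contents($uri);
    if ($data === false) {
        continue;
    }
    $filename = basename(parse_url($uri, PHP_URL_PATH));
    file_put_contents($archive_dir.$filename, $data);

    // Steps 3.2 and 3.3: record the new resource and the fact that it is a
    // copy of the original - this mapping is what would be serialised back
    // into the feed/graph.
    $copies[$uri] = $archive_base.$filename;
}

function passes_archiving_policy($uri) {
    return true;  // placeholder for the policy decisions discussed above
}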
Presentation: (Caching-reliant)
  1. A user queries the service for a representation of an archived page.
  2. Service recovers ORE map for requested page from internal store
  3. Resource determination: again, policy-based - suggestions:
    • Last-Known-Good: replaces all URIs in the (X)HTML source
      with their archived duplicates, and sends the page to the user
      (assumes the dupes are RESTful - archived URIs can be GET'ed). See the sketch after this list.
    • Optimistic: wraps embedded resources with javascript, to attempt to fetch the original resources or time out and fall back to the archived versions.
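To make the Last-Known-Good swap concrete (the same swap applies at archive time in the CPU-reliant route below), here is a minimal hypothetical sketch, assuming the original-to-archived URI mapping has already been recovered from the ORE map:

<?php
// Hypothetical sketch of Last-Known-Good URI swapping at request time.
// $copies (original URI => archived URI) is assumed to have been recovered
// from the ORE resource map for the requested page; the URIs are made up.
$copies = array(
    'http://example.org/styles.css'      => 'http://archive.example.org/example.org/styles.css',
    'http://example.org/images/fig1.png' => 'http://archive.example.org/example.org/fig1.png',
);

// Fetch the archived (X)HTML source of the page itself.
$html = file_get_contents('http://archive.example.org/example.org/page.html');

// Swap every original URI for its archived duplicate and send the result to
// the user (assumes the duplicates can simply be GET'ed).
$html = strtr($html, $copies);

header('Content-Type: text/html; charset=utf-8');
echo $html;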
Presentation: (CPU-reliant)
  1. Service processes ORE map on completion of resource archiving
  2. Resource determination: again, policy-based - the same suggestions as above:
    • Last-Known-Good: replaces all URIs in the (X)HTML source
      with their archived duplicates
      (assumes the dupes are RESTful - archived URIs can be GET'ed).
    • Optimistic: wraps embedded resources with javascript, to attempt to fetch the original resources or time out and fall back to the archived versions.
  3. Service stores a new version of the (X)HTML with the URI changes, adds this to the feed, and indicates that this is the archived version.
  4. A user queries the service for a representation of an archived page and simply gets it back.
So, one presentation method relies on caching, but doesn't need a lot of CPU power to get up and running. The archived pages are also quick to update, and this route may even be a nice way to 'discover' pages for archiving - i.e. submitting a URL for archiving is the same as requesting its archived view. Archiving can continue in the background, while users get a progressively more archived view of the resource.

The upshot of having a dynamic URI swapping on page request is that there can be multiple copies in potentially mobile locations for each resource, and the service can 'round-robin' or pick the best copies to serve as replacement URIs. This is obviously a lot more difficult to implement with static 'archived' (X)HTML, and would involve URI lookup tables embedded into the DNS or resource server.