Wednesday, 9 January 2008

Conclusions on UUIDs and local ids in Fedora

I mentioned earlier about the possibility of using UUIDs as Fedora identifiers. I'll write my conclusion first and then the reasoning later for all you lazy people out there :)

Conclusions

Fedora repositories that wish to use UUIDs as identifiers should have the namespace 'uuid' added to the list of <retainPids> in fedora.fcfg.

The 32 character hex string representing the UUID is entered as the object id to the uuid namespace. For example:

Fedora pid - uuid:34b706b4-f080-4655-8695-641a0a8acb25

Benefits

  1. Using the uuid scheme for identifiers, (those who retain the 'uuid' namespace), administrators will be able to painlessly transfer objects from one instance of Fedora to another, or even to have a distributed set of Fedora instances as a single 'repository'; No fiddling with pid changes, changing RELS-EXT datastreams or others, or changing metadata datastream identifiers.
  2. Fedora pids will fit into the RFC 4122 mechanism for the uuid urn namespace easily -> urn:pid will be a valid URI.
  3. The command 'getNextPID()' could be used to provide a local 'id', which can be added as a FOXML field, such as label or even added as the Alternate ID on ingest/migration.
  4. Given a URI resolver that is updated when an object migrates from one fedora repository to another, a distributed set of Fedora instances could have cross-repository relationships in RDF that stay valid regardless of where the objects reside.
  5. Use of a distributed set of Fedora instances with a federated search tool such as the Apache Solr is quite an attractive prospect for large scale implementations.
Reasoning

I've thought about the logistics of actually using them, and also of the fact that some people are happier with object ids that they can type (although for the life of me, I can't work out why; when was the last time that you, as a normal user, typed in a full URL and didn't just go to a discovery tool like google or a site search to get to a specific item? 99% of the time, I rely on my browser's address bar defaulting to google for anything that doesn't look like a url.) But I digress...

I prefer to deal with situations as they are, as opposed to what might be possible later, so let's recap what pids Fedora allows or needs:

Fedora pid = namespace : id

or more formally - (from http://www.fedora.info/definitions/identifiers/
object-pid    = namespace-id ":" object-id
namespace-id = ( [A-Z] / [a-z] / [0-9] / "-" / "." ) 1+
object-id = ( [A-Z] / [a-z] / [0-9] / "-" / "." / "~" / "_" / escaped-octet ) 1+
escaped-octet = "%" hex-digit hex-digit
hex-digit = [0-9] / [A-F]
e.g. Anything that fits the following regular expression:
^([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+$

As I said before, I am interested in using UUIDs (or something like them) because they need no real scheme to be unique and persistantly unique; UUIDs are designed so the chances of creating two ids that are the same is vanishingly small. So, what does one look like?

(from wikipedia:)

In its canonical form, a UUID consists of 32 hexadecimal digits, displayed in 5 groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters. For example:
550e8400-e29b-41d4-a716-446655440000
Regular expressions:

[A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}
matches: 550e8400-e29b-41d4-a716-446655440000

^((?-i:0x)?[A-Fa-f0-9]{32}
matches: 0x550e8400e29b41d4a716446655440000

So a UUID can't be used as it is in place of a Fedora pid. However, according to RFC 4122, there is a uuid urn namespace which makes me more hopeful. The above uuid can be represented as urn:uuid:550e8400-e29b-41d4-a716-446655440000 for example.

So, how about if we make the reasonable assumption that a pid is a "valid" urn namespace, but a namespace that may or may not be registered yet? For example, I am currently using the fedora namespace ora for items in the Oxford repository. Would it be to far fetched to say that ora:1234 is understandable as urn:ora:1234?

So, all we need to do is make sure that the namespace 'uuid' is one of the ones in the <retainPid> element of fedora.fcfg and we are set to go. Looks like the pid format 'restriction' as I thought it, was quite handy after all :) So to state it clearly:

Fedora pids that follow the UUID scheme should be in the form of:
object-pid    = "uuid:" object-id
object-id = 8-digit-hex '-' 4-digit-hex '-' 4-digit-hex '-' 4-digit-hex
'-' 4-digit-hex '-' 12-digit-hex
8-digit-hex = ( hex-digit ) 8
4-digit-hex = ( hex-digit ) 4
12-digit-hex = ( hex-digit ) 12
hex-digit = [0-9] / [A-F] / [a-f]
e.g. "uuid:34b706b4-f080-4655-8695-641a0a8acb25"

(NB forgive any syntactically slips above, I hope it's clear as it is.)

I mentioned before that some people want human-typeable fedora pids.... urgh. No really sure what purpose it serves. In fact, let me have a little rant...

<rant>

A 'Cool URL' is one that doesn't change. Short pids make for pretty URLs and guarantee little more than that.

</rant>

Right, that's out of my system. Now to accommodate the request...

Firstly, I'd just like to point out that I will ignore any external search and discovery services; essentially any type of resolver from search engine to the Handles system. This is because I feel that the format of the fedora pid is quite irrelevent to these services. (I am aware that certain systems made use of a nasty hack where the object id part of the fedora pid was used as the Handle id, after the institution namespace and believe me, this hack does have me quite worried. I can understand the reasoning behind this as the Handles system doesn't seem to have a simple way to query for the next available id, but I think this is a potential problem on the horizon.)
My suggestion is that the pid itself is the uuid as defined above, but that the repository system has a notion of local 'id'; the Fedora call of 'getNextPid()' could be used to create these 'tinypids' with whatever namespace is deemed pleasant.
They can be stored in the FOXML in fields such as Label or stored as an Alternate ID (fields which I personally have no use of). Fedora will index these for its basic search service, and could be used as a mechanism to look up the real pid given the local id.

For example, with the RDF triplestore turned on, the following iTQL query should be enough:
"select $object from <#ri>
where $object <info:fedora/fedora-system:def/model#label> 'ora:1234'"
The tuple that is returned will be something like "uuid:34b706b4-f080-4655-8695-641a0a8acb25"

But it still isn't great, and I don't think the benefits outweigh the work involved in implementing it but it's a workable solution I think for those that need it.

1 comment:

Anonymous said...

I am excited to see the possibility of using uuid's as identifiers in Fedora, but I see the mistake made repeatedly concerning what the appropriate scheme looks like for the UUID "URN" namespace that is registered:

Per the specification...
http://www.ietf.org/rfc/rfc4122.txt

The following is an example of the string representation of a UUID as
a URN:

urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6

UUID URI are URN's and "uuid:" is a namespace in the URN scheme not a scheme of its own like "info:"

This means:

uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6

is not a registered URI, while

urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6

is a valid registered URI.

I think this is an important misinterpretation that results in inconsistency in representing such URI. If it can be fixed, I would recommend it.

Cheers,
Mark Diggory
MIT Libraries.