Thursday, 17 April 2008

Ditching the DB-based blog for a semantic one

Why Yet-another-blog-engine?

Well, blog engines tend to do the same things: their functionality is derived from simple views on a relational DB. To a large extent, I think this RDB reliance has shaped the scope of what you can do with a blog, and I really feel it has guided how blog (and related publishing) technology has developed.

Things like blog export, and interlinking of blogs and the bloggers who power the system are seen as extras, features added as plugins or as an additional service - this needs to change! Let's see what naturally happens when we try to build a system with a more interesting backend.

So, what to replace the RDB with?

Simply put, the data which a blog needs to function normally can be modelled in an objectstore, by linking items together by RDF predicates. Specifically, there is a namespace created by the SIOC project, aimed at defining social networks and their inter-linking in a semantic way - http://www.sioc-project.org/. I have a strong hunch that by using this work, a whole load of extra possibilities will emerge.

(Aside from the obvious benefits of simple export and reuse of objects, being able to make a single comment on more than one blog post from more than one blog, and so on.)
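As a rough illustration of the idea (not the actual Fedora-backed implementation), here is a minimal sketch that keeps triples as plain Python tuples. The example.org URIs are made up for the example; `has_container` and `reply_of` are taken from the SIOC namespace. It shows one comment object replying to posts on two different blogs:

```python
# Triples are plain (subject, predicate, object) tuples held in a set.
# A real system would keep them in a triplestore; every example.org URI
# here is hypothetical.
SIOC = "http://rdfs.org/sioc/ns#"

triples = {
    ("http://example.org/post/1", SIOC + "has_container", "http://example.org/blog/a"),
    ("http://example.org/post/2", SIOC + "has_container", "http://example.org/blog/b"),
    # One comment object, linked to posts that live on two different blogs.
    ("http://example.org/comment/1", SIOC + "reply_of", "http://example.org/post/1"),
    ("http://example.org/comment/1", SIOC + "reply_of", "http://example.org/post/2"),
}


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


# Which posts does the comment reply to, and which blogs hold those posts?
replied_to = objects("http://example.org/comment/1", SIOC + "reply_of")
blogs = {blog for post in replied_to for blog in objects(post, SIOC + "has_container")}
print(sorted(blogs))
```

Because the comment is its own object rather than a row hanging off a single post, nothing special is needed to attach it to a second post.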

Okay... what's the plan?

So, my coding bias for objectstore and framework language is FedoraCommons and Python, so no surprise there. It also means that I'll be reusing my code, so each object will have an OAI-ORE aggregation and good search integration (via Apache Solr).

Modelling the blog:

Luckily, the good folks at the SIOC project have done a good load of the work for me, and having read through their work, I can say that I see no problem with it for my purposes. This means that using their namespace (http://rdfs.org/sioc/ns#), I can adopt the structure of classes, helpfully illustrated here

The first class objects from or subclassed from SIOC therefore will be as follows (a first class object has a 1-to-1 parity with the underlying Fedora objects):

  • User [contains or links to FOAF record]
  • Post [text and zero or more attachments]
  • Forum(Blog) [Dublin Core]
  • Post(Comment) [text]
  • Site [Dublin Core]

Other first class objects:

Link [DC record] and Petition [ Later ;) ]

Link speaks for itself. If a post contains a link, the link is promoted to an object and the post will connect to the link object. If someone else uses that link, the previous object is reused.
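A minimal sketch of that promote-and-reuse behaviour, assuming a hypothetical in-memory `LinkStore` standing in for the object store and a deliberately naive URL pattern:

```python
import re
from itertools import count

# Naive pattern for the sketch: anything from http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


class LinkStore:
    """Hypothetical stand-in for the object store: one Link object per URL."""

    def __init__(self):
        self._ids = {}
        self._counter = count(1)

    def get_or_create(self, url):
        # Reuse the existing Link object if this URL has been seen before.
        if url not in self._ids:
            self._ids[url] = f"link:{next(self._counter)}"
        return self._ids[url]


def promote_links(post_text, store):
    """Return the Link object ids that a post should be connected to."""
    return [store.get_or_create(url) for url in URL_PATTERN.findall(post_text)]


store = LinkStore()
first = promote_links("See http://www.sioc-project.org/ for details", store)
second = promote_links("Also http://www.sioc-project.org/ and http://example.org/x", store)
print(first, second)
```

The second post ends up connected to the same Link object as the first, plus one new one.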

Petition is a social experiment which I will go into later ;)

Now for the behaviour of the blog - I am envisioning an academically focussed blog, so the social network, persistence, trust and discourse are important features to consider.

Users - accounts and blogs

Frankly, I am tired of writing authentication systems. So is everyone else. Users are tired of having an account per site too. Thank god for OpenID then :)

Right, so we have a working, live system for authenticating people, but what about authorising? Authenticate to comment, that's self-explanatory. But this is where we can do some interesting things:

The blog engine is 'seeded' by a blog author or authors, most likely the same people that installed the engine. Borrowing a common idea, these seeds can invite other people to have a publishing account. This is done by an indication of trust - at a technical level <uri-inviter> <trust:trust10> <uri-invitee>, with the <uri-invitee> being the User object created to correspond to a given OpenID. The next time they log in, the ability to create a blog should be apparent.

(trust namespace: http://trust.mindswap.org/ont/trust.owl)
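Here is a small sketch of how the invitation could gate blog creation. It assumes that only seed authors can invite, that the predicate URI is the trust namespace plus `#trust10` (the `#` separator is my assumption), and the OpenID-style user identifiers are invented for the example:

```python
# Assumed predicate URI: the trust namespace followed by "#trust10".
TRUST10 = "http://trust.mindswap.org/ont/trust.owl#trust10"

# Hypothetical seed authors, e.g. whoever installed the engine.
seeds = {"openid:alice"}
triples = set()


def invite(inviter, invitee):
    """Record <inviter> <trust:trust10> <invitee>; only seeds may invite here."""
    if inviter not in seeds:
        raise PermissionError(f"{inviter} is not a seed author")
    triples.add((inviter, TRUST10, invitee))


def can_create_blog(user):
    """A user may create a blog if they are a seed or fully trusted by one."""
    return user in seeds or any(
        s in seeds and p == TRUST10 and o == user for s, p, o in triples
    )


invite("openid:alice", "openid:bob")
print(can_create_blog("openid:bob"), can_create_blog("openid:carol"))
```

The check runs against the stored triples at login time, which is why the new ability only shows up the next time the invitee logs in.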

Why not <foaf:knows>? Because that is saved for later :) <trust:trust10> is a predicate intended as a way for someone to fully vouch for someone else - you may know people, but that doesn't mean they should automatically gain blogging rights just because you have those rights.

Any post (post, comment or link) can be tagged as interesting to a User, via the <foaf:interest> predicate - declaring "I am interested in this thing". The payoff to the user is that their page (every User is an object, remember) will display the things they have marked as interesting. If a User declares that another User's blog is 'interesting' then all the latter User's posts will be accessible as well from here.
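A sketch of how a User page could be assembled from those statements, assuming short made-up identifiers and using `sioc:container_of` to find the posts inside a blog marked as interesting:

```python
FOAF_INTEREST = "http://xmlns.com/foaf/0.1/interest"
SIOC_CONTAINER_OF = "http://rdfs.org/sioc/ns#container_of"

# Hypothetical data: Ben is interested in one post and in Bruce's whole blog.
triples = [
    ("user:ben", FOAF_INTEREST, "post:7"),
    ("user:ben", FOAF_INTEREST, "blog:bruce"),
    ("blog:bruce", SIOC_CONTAINER_OF, "post:1"),
    ("blog:bruce", SIOC_CONTAINER_OF, "post:2"),
]


def interesting_items(user):
    """List what a user's page would show, expanding blogs into their posts."""
    items = []
    for s, p, marked in triples:
        if s == user and p == FOAF_INTEREST:
            items.append(marked)
            # If the marked thing contains posts (i.e. it is a blog), add them.
            items.extend(
                o for s2, p2, o in triples
                if s2 == marked and p2 == SIOC_CONTAINER_OF
            )
    return items


print(interesting_items("user:ben"))
```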

There are two forms of free-text tags; a trusted tag and a normal tag. Trusted tags are those placed on a Post, Blog or User, by the author/owner of that object - a statement about what the author feels is the subject of the object. An author can also tag themselves, and this is to give extra indication about what they normally blog or comment on, in addition to the tags they've put on their own items.

A normal tag can be placed by any authenticated User on any Post or Blog. Expected functionality, really.
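In the absence of an ontology, the distinction itself is easy to sketch: a tag is trusted when the person who placed it owns the tagged object. The `Tag` record and the `owners` lookup below are hypothetical names for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    label: str    # the free-text tag
    target: str   # the Post, Blog or User being tagged
    tagger: str   # the authenticated User who placed the tag


# Hypothetical ownership lookup; a User object is owned by that User.
owners = {
    "post:1": "user:ben",
    "blog:ben": "user:ben",
    "user:ben": "user:ben",
}


def tag_kind(tag):
    """Return 'trusted' if the tagger owns the target, otherwise 'normal'."""
    return "trusted" if owners.get(tag.target) == tag.tagger else "normal"


print(tag_kind(Tag("rdf", "post:1", "user:ben")))    # placed by the author
print(tag_kind(Tag("rdf", "post:1", "user:bruce")))  # placed by a reader
```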

I haven't tripped across a suitable ontology for these two, and I'd really like to use an existing one so feel free to add a comment if you have an idea of one.

Now, I mentioned a curious thing called a Petition - this is a second method to gain trust in the social network. A User can make a Petition, stating a brief summary of what they'd like to write about, and then tag the Petition with its subjects.

This is where it gets semantically fun - a Petition will be visible to existing blog posters with the following filters available:
  • Show Petitions that have tags in common with mine, from Users I trust
  • Show Petitions that have tags in common with mine, from Users I know
  • Show Petitions that have tags in common with mine, from Users my friends know
  • Show Petitions from Users who I've shown interest in
  • Show Petitions from Users who my friends show interest in
  • Show Petitions that have tags in common with mine
  • Show most recent Petitions
It doesn't take much more time to see that the information already in the triplestore can be used to create some really interesting filters.
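A few of the filters above can be sketched as small composable functions. This is only an in-memory illustration; the `Petition` record, the `created` ordering key and the user identifiers are assumptions made for the example, and a real version would query the triplestore instead:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Petition:
    author: str
    summary: str
    tags: frozenset
    created: int  # hypothetical ordering key, e.g. a Unix timestamp


def tags_in_common(petitions, my_tags):
    """Petitions sharing at least one tag with mine."""
    return [p for p in petitions if p.tags & my_tags]


def from_users(petitions, users):
    """Petitions written by any of the given users."""
    return [p for p in petitions if p.author in users]


def known_by_friends(me, knows):
    """Users that the people I know say they know."""
    return set().union(*(knows.get(f, set()) for f in knows.get(me, set())))


def most_recent(petitions):
    return sorted(petitions, key=lambda p: p.created, reverse=True)


petitions = [
    Petition("user:carol", "Notes on repositories", frozenset({"fedora", "rdf"}), 3),
    Petition("user:dave", "Cooking", frozenset({"food"}), 2),
    Petition("user:erin", "SIOC experiments", frozenset({"rdf", "sioc"}), 1),
]
my_tags = {"rdf", "python"}
trusted = {"user:erin"}
knows = {"user:me": {"user:frank"}, "user:frank": {"user:carol"}}

# "Tags in common with mine, from Users I trust"
by_trusted = from_users(tags_in_common(petitions, my_tags), trusted)
# "Tags in common with mine, from Users my friends know"
by_friends = from_users(tags_in_common(petitions, my_tags), known_by_friends("user:me", knows))
print([p.author for p in by_trusted], [p.author for p in by_friends])
```

Each filter is just a different set of users or tags fed into the same two steps.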

Now, a User can then decide to place a level of trust in a given Petitioner. The actual mechanics of what is required to elevate the trust placed in the Petitioner are up to the system installer, but a few interesting things can be used: a requirement that the combined <trust:trust> given to a User adds up to a certain level, or maybe a tiered system (2 trust9's are equivalent to 1 trust10, etc.).
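One way the tiered idea might be expressed is as a points table. The point values and the threshold below are assumptions an installer would choose; the only thing taken from the description is that two trust9's add up to one trust10:

```python
# Assumed weights: two trust9 statements equal one trust10 statement.
POINTS = {"trust10": 2, "trust9": 1}
# Assumed threshold: the equivalent of one full trust10.
THRESHOLD = 2


def combined_trust(statements, user):
    """Sum the points of every (truster, level, trustee) aimed at `user`."""
    return sum(
        POINTS.get(level, 0)
        for truster, level, trustee in statements
        if trustee == user
    )


def has_blog_rights(statements, user):
    return combined_trust(statements, user) >= THRESHOLD


statements = [
    ("user:ben", "trust9", "user:petitioner"),
    ("user:bruce", "trust9", "user:petitioner"),
]
print(has_blog_rights(statements, "user:petitioner"))

# Withdrawing one statement drops the total back below the threshold.
statements.remove(("user:bruce", "trust9", "user:petitioner"))
print(has_blog_rights(statements, "user:petitioner"))
```

Because the rights are recomputed from the current statements, removing trust is all it takes to remove them.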

The bottom line is that this trust system means that there is no requirement for a super User as the network should be self-regulating - take the trust away, and the Petitioner loses their Blog. (It also doesn't mean that having a Super user is a bad idea!)

Posts and Posting

I hate the word blog... I've been using it as a crutch, as what I'd like this to be is a site to have a voice on. By using semantic relationships, it is quite possible to view it as you might a blog, but you might also view it like a forum, with posts and threaded comments. The underlying information and connections are the same, but the way you can view and present this becomes a lot more flexible.

So when I say Post, I mean it in a twitter/blogger/thread-reply kind of way ;)

The Post objects have a summary (200 char limit) and an optional body (a blog 'post') - no title! They can be tagged of course and the post can also hold attachments for download, or embedding here or elsewhere. (All resources have dereferenceable URIs, and so can be linked to directly.)

So a Post can be a 'tweet', or a 'blog' post - it's all the same thing. However, on the user's jumpoff page, the summaries are listed.
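As a sketch of that shape (field names are my own guesses, and in the real system each Post would be a Fedora object rather than a Python dataclass):

```python
from dataclasses import dataclass, field
from typing import Optional

SUMMARY_LIMIT = 200  # the 200 character limit on summaries


@dataclass
class Post:
    author: str
    summary: str                 # always present, at most 200 characters
    body: Optional[str] = None   # present for a 'blog' post, absent for a 'tweet'
    tags: list = field(default_factory=list)
    attachments: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.summary) > SUMMARY_LIMIT:
            raise ValueError(f"summary is longer than {SUMMARY_LIMIT} characters")


def jumpoff_page(posts, author):
    """The user's jumpoff page lists summaries only, whatever the post type."""
    return [p.summary for p in posts if p.author == author]


posts = [
    Post("user:ben", "Trying out SIOC"),
    Post("user:ben", "Why another blog engine?", body="Well, blog engines tend to..."),
    Post("user:bruce", "Flat files in git"),
]
print(jumpoff_page(posts, "user:ben"))
```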

I'll have a go at putting something together, to see what works and what doesn't, so watch this space.

9 comments:

Stephen said...

Great post!

I'd love to read an update on the implementation?

Bruce said...

I've been thinking about and dabbling with something similar, though at once more general, and also (at least initially) much more minimalist: a kind of personal website to publish notes, articles, etc., built with just rdflib, rdfalchemy, and web.py.

I don't know much about fedora commons, but it seems like a really heavy-weight solution for this sort of thing. What would be the advantage that would outweigh the problems for those who don't already have it installed?

Bruce said...

Just as a quick followup, it seems to me this should be the sort of thing that one could easily deploy on things like the Google App Engine platform.

Ben O'Steen said...

@bruce The advantage for me is simply that by using Fedora-Commons, the backend is the same as for the archives here. Yep, it certainly is heavyweight, and I am working to slot it over the top of some equally heavyweight storage.

As for RDFLib+RDFalchemy with web.py, that sounds great too, but you might find yourself being a little limited by web.py when you start to experiment with what you have.

One piece of advice is to try to think about versioning and rebuilding - something that I get for free with Fedora - do you need it? How would you do it? and so on.

There is also the Talis Platform, which looks to be a very powerful and interesting service, very attractive - a tutorial on how to get things done with this is http://n2.talis.com/wiki/Kniblet_Tutorial

In a similar vein, there is http://smob.sioc-project.org/ which is something I found today.

Ben O'Steen said...

And for your second comment, I can definitely see a place for apps in the cloud, using something like Amazon SC or for more semantic apps, the Talis platform, with lightweight apps on Google Apps, or other, as yet to be released, WSGI app servers.

(And yes, I have my app engine invite ready to try out just such a thing :) )

Bruce said...

@ben, re:

"One piece of advice is to try to think about versioning and rebuilding - something that I get for free with Fedora - do you need it? How would you do it?"

I'm thinking of using git. So for the moment, all my content as flat files, stored and versioned in a distributed SCM, with the web application built out of that.

As I said, minimalist ;-)

Anonymous said...

There is a SIOC plugin available for Wordpress. Also triplify.org provides a small plugin (currently only in PHP) which reveals the semantic structures encoded in relational databases by making database content available as RDF, JSON or Linked Data.

Sure, a Blog/CMS system natively based on RDF/Triplestore would be nicer :) but I guess that a lot of people do not want to change their preferred (and long used) blog system.

Stephen said...

You're not the only one interested in this; see this question: "Is there an RDF ontology for blogs?" on Stack Overflow.