Monday 18 August 2008

Trackbacks, and spammers, and DDoS, oh my!

The Idea

Before I give you all the dark news about this, let me set out my position: I really, really think it is a good thing for repositories to tell each other about the papers they hold that cite and reference one another. If a paper was deposited in the Oxford archive, and it referenced a paper held in a different repository, say in Southampton's EPrints archive, I think it is a really fantastic idea to let the Oxford archive tell the Southampton one about it.

And I decided to do something about it - I added two linkback facilities to the archive's user interface, allowing both trackbacks and pingbacks to be archived by the system. I adopted the pre-existing "standards" - really, they are just rough APIs - because I think we have all learned our lesson about making up new APIs for basic tasks.

What is Trackback?

Trackback is an established convention from the blogging world. Many blogging systems have it built in, and it enables one blog post to explicitly reference and talk about another post, made on a remote blog somewhere. It does this by POSTing a number of form-encoded parameters to a URL specific to the item being referenced. The parameters include things like the title, an excerpt or abstract, and the URL of the item making the reference.
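
To make that concrete, here is a minimal sketch (in Python) of what sending a Trackback ping involves. The endpoint URL and field values are invented for illustration; the parameter names (title, excerpt, url, blog_name) are the ones defined in the Trackback specification.

```python
# A sketch of sending a Trackback ping. The endpoint and values below are
# hypothetical; the field names come from the Trackback specification.
import urllib.parse
import urllib.request

trackback_url = "http://repository.example.org/trackback/uuid:1234"  # hypothetical endpoint
payload = urllib.parse.urlencode({
    "title": "A paper that cites the deposited item",
    "excerpt": "A short abstract of the citing item...",
    "url": "http://other-repo.example.ac.uk/items/5678",
    "blog_name": "Another Institution's Repository",
}).encode("utf-8")

request = urllib.request.Request(
    trackback_url,
    data=payload,
    headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
)
with urllib.request.urlopen(request) as response:
    # The receiver replies with a small XML document containing an <error>
    # flag: 0 for success, 1 plus a message for failure.
    print(response.read().decode("utf-8"))
```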

So on the surface, it appears that this trackback idea performs exactly what I was looking for.

BUT! Trackback has massive, gaping flaws, akin to the flaws that have left the email system full of spam. For one, the basic specification trusts all trackbacks implicitly: there is no check that the URL exists, no check of the text for relevance, and so on.

Pingback is a slightly different system, in that all that is passed is the URL of the referencing item. It is then up to the remote server to fetch that page and parse it to find the reference. (The next version of the specification is crying out to recommend microformats et al, in my opinion.)
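
For comparison, a Pingback is a single XML-RPC call, pingback.ping(sourceURI, targetURI). A minimal sketch of the client side, with invented URLs and endpoint:

```python
# A sketch of the client side of a Pingback. The Pingback specification
# defines one XML-RPC method, pingback.ping(sourceURI, targetURI); the
# server then fetches sourceURI itself and looks for a link to targetURI.
# The endpoint and URLs below are invented for illustration.
import xmlrpc.client

server = xmlrpc.client.ServerProxy("http://repository.example.org/pingback")  # hypothetical endpoint
result = server.pingback.ping(
    "http://other-repo.example.ac.uk/items/5678",        # sourceURI: the citing page
    "http://repository.example.org/objects/uuid:1234",   # targetURI: the item being cited
)
print(result)  # a short human-readable string on success
```

Because the server fetches the source page itself, the receiving end has at least one built-in check that Trackback lacks.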

So, these systems, trackback and pingback, have been on trial in the live system for about 4 or 5 months, and I am sure you all want to hear my conclusions:

  • Don't implement Trackback as it is defined in its specifications... seriously. It is a poorly designed method, with so much slack that it is a spammer's goldmine.
  • Even after adding some safeguards to the Trackback method, such as fetching the supposed referencing page and checking that it is HTML and actually contains the supposed link, it was still possible for spammers to get through.
  • When I implemented Trackbacks, I did so with the full knowledge that I might have to stand at a safe distance and nuke the lot. Here is the Trackback model used in the Fedora repository: a DC datastream containing the POSTed information mapped to simple Dublin Core, plus a RELS-EXT RDF entry asserting that the trackback's Fedora object <info:fedora/trackback-id> references <dcterms:references> the main item in the archive <info:fedora/item-id>. As the user interface for the archive gets the graph for that object, it was easy to pull the trackbacks out as well. Keeping the trackbacks in separate objects, and not changing the referenced item at all, made it very easy to remove them at the end. (A sketch of this model, together with the link-checking safeguard, appears after this list.)
  • The Trackback system did get hit, once the spammers found a way around my safeguards. So yes, the trackbacks got 'nuked' and the system was turned off.
  • Currently, the system is under a sort of mini-DDoS from the spammers' botnets trying to post trackbacks and overloading the session-tracking system.
  • The Pingback system, utilising XML-RPC calls, was never hit by spam. I still turned it off, as its safeguards were equivalent to those on the Trackback system.
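
To make the two key pieces above concrete, here is a rough sketch of the link-checking safeguard and of the RELS-EXT statement used for a trackback object. The function names, PIDs and URLs are invented for illustration; only the separate-object, dcterms:references, info:fedora/... pattern follows the model described above.

```python
# A sketch of (a) the safeguard that fetches the claimed citing page and
# checks it really links to the item, and (b) the RELS-EXT datastream that
# asserts dcterms:references from the trackback object to the archived item.
# PIDs, URLs and function names are hypothetical.
import urllib.request

def source_really_links_to(source_url: str, target_url: str) -> bool:
    """Fetch the claimed citing page and confirm it links to the item."""
    try:
        with urllib.request.urlopen(source_url, timeout=10) as response:
            if "html" not in response.headers.get("Content-Type", "").lower():
                return False  # reject anything that does not even claim to be HTML
            page = response.read().decode("utf-8", errors="replace")
    except Exception:
        return False  # an unreachable source page is rejected outright
    return target_url in page  # crude, but catches pings that never link back

def rels_ext_for_trackback(trackback_pid: str, item_pid: str) -> str:
    """Build a RELS-EXT datastream asserting that the trackback object
    references (dcterms:references) the archived item."""
    return f"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="info:fedora/{trackback_pid}">
    <dcterms:references rdf:resource="info:fedora/{item_pid}"/>
  </rdf:Description>
</rdf:RDF>"""

# Hypothetical usage:
if source_really_links_to("http://other-repo.example.ac.uk/items/5678",
                          "http://repository.example.org/objects/uuid:1234"):
    print(rels_ext_for_trackback("trackback:42", "uuid:1234"))
```

Keeping each trackback in its own object, with only this one outbound assertion, is what made the eventual clean-up a matter of deleting those objects and nothing else.
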
So, how do we go on from this quagmire of spam?

Well, for one, if I had time (and resources) to pass all incoming linkbacks through SpamAssassin or pay for Akismet, that would have cut down the amount of spam drastically. Also, if I had time to sit and moderate all the linkbacks, again, the spam would be nipped in the bud.
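
For what it's worth, checking an incoming linkback against Akismet is only a few lines of code once you have a key. A rough sketch against Akismet's comment-check call, with the key, blog URL and field values as placeholders (check Akismet's own documentation for the full list of accepted fields):

```python
# A rough sketch of checking an incoming trackback with Akismet's
# comment-check call. The API key, blog URL and field values are
# placeholders; see Akismet's documentation for the accepted fields.
import urllib.parse
import urllib.request

API_KEY = "your-akismet-key"  # placeholder
ENDPOINT = f"https://{API_KEY}.rest.akismet.com/1.1/comment-check"

def looks_like_spam(source_url: str, excerpt: str, client_ip: str) -> bool:
    data = urllib.parse.urlencode({
        "blog": "http://repository.example.org/",  # placeholder
        "user_ip": client_ip,
        "user_agent": "repository-linkback/0.1",
        "comment_type": "trackback",
        "comment_author_url": source_url,
        "comment_content": excerpt,
    }).encode("utf-8")
    with urllib.request.urlopen(ENDPOINT, data=data, timeout=10) as response:
        return response.read().decode("utf-8").strip() == "true"  # "true" means spam
```

The moderation queue is the part that no amount of code removes, of course - someone still has to look at the borderline cases.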

So, while I truly believe that this type of system is the future, it certainly isn't something that can just be turned on, with the responsibility for maintaining it simply added to an already full workload.

Alternatives?

White-listing sites may be one method. To limit the application to sharing references between institutions, you could use the PGP idea of a web of trust: sign the passed information with a private key that can be verified against a public key from a white-listed institution. This would ensure that the passed reference really did come from a given institution, and it should be more flexible than only accepting references from a single fixed IP address.

(There is always the chance that the private key could be leaked and made not-so-private by the institution, but that would have to be their responsibility. Any spam from a mistake of this sort would be directly attributed to those at fault!)
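
To make the web-of-trust idea concrete, here is a rough sketch using ordinary public-key signatures. Ed25519 via the Python cryptography package stands in for PGP purely for brevity, and the institution name, payload format and white-list structure are all invented for illustration: the sender signs the reference with its private key, and the receiver verifies it against the public key it holds for that institution on its white-list.

```python
# A sketch of the web-of-trust idea: the sending repository signs the
# reference with its private key; the receiver verifies the signature
# against the public key it has white-listed for that institution.
# Ed25519 (from the 'cryptography' package) stands in for PGP here;
# the institution name and payload format are invented.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# --- at the sending institution --------------------------------------------
private_key = Ed25519PrivateKey.generate()  # kept secret by the sender
reference = (
    b"source=http://other-repo.example.ac.uk/items/5678;"
    b"target=http://repository.example.org/objects/uuid:1234"
)
signature = private_key.sign(reference)

# --- at the receiving repository --------------------------------------------
# The receiver already holds the sender's public key on its white-list.
white_list = {"other-repo.example.ac.uk": private_key.public_key()}

def accept_reference(institution: str, reference: bytes, signature: bytes) -> bool:
    public_key = white_list.get(institution)
    if public_key is None:
        return False  # unknown institution: reject outright
    try:
        public_key.verify(signature, reference)  # raises if the signature is bad
        return True
    except InvalidSignature:
        return False

print(accept_reference("other-repo.example.ac.uk", reference, signature))  # True
```

A leaked private key, as noted above, would let a spammer forge references in that institution's name - but at least the blame would be easy to place.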

A slower, far less accurate but more traditional method would be for a given institution to harvest references from all the other repositories it knows about. I really don't think this is workable, but it has the advantage that a harvester can be sure a reference links to a given URL (barring the increasingly common DNS poisoning attacks).

3 comments:

Anonymous said...

It's worth asking akismet if you can have a key for free. It is free for certain kinds of entities/purposes. It was unclear to me whether that applied to some local stuff at my library, so I emailed them, and they said yes. It may well also apply to a non-profit repository application.

There is also a free (even for commercial use) clone of akismet, but I forget the name. It was what I was going to use if akismet said no; I'm not sure if its quality is as good as akismet's.

Mark R Diggory said...

I'm curious what your thoughts are on the linkbacks approach that Blogger employs. Yes, spam is a serious issue in trackbacks, and I'm concerned about its application in a repository.

Ben O'Steen said...

@jonathan rochkind - Yes, akismet might be a good solution for ignoring the spam, but it's just a matter of resources :)

@Mark Diggory - Google's Backlinks (as used on Blogger) use a quintessential google technique: scan the entire web *that google has access to* and look for links to the page in question :) I was hoping to widen this idea of active linkbacks to cover things which aren't strictly links, such as http citations inside PDF/other files.