Friday 26 March 2010

Usage Statistics parsing and querying with redis and python

This is an update of my previous dabblings with chomping through log files. To summarise where I am now:

I have a distributable workflow, loosely coordinated using Redis and Supervisord - redis is used in two fashions: firstly using its lists as queues, buffering the communication between the workers, and secondly as a store, counting and associating the usage with the items and the metadata entities (people, subjects, etc) of those items.

I have written a very small python logger, that pushes loglines directly onto a redis list, providing me with live updating abilities, as well as manual log file parsing. This is currently switched on for testing in the live repository.

Current code base is here: http://github.com/benosteen/UsageLogAnalysis - it has a good number of things hardcoded to the perculiarities of my log files and repository. However, as part of the PIRUS 2 project, I am turning this into an easily reusable codebase, adding in the ability to push out OpenURLs to PIRUS statistics gatherers.

Overview:

Loglines -- lpush'd to 'q:loglines'

workers - 'debot.py' - pulls lines from this queue and parses them up, separating them into 4 categories:

  1. Any hit by a recognised Bot or spider

  2. Any view or download made by a real person on an item in the repository

  3. Any 404, etc

  4. And anything else


and the lines are moved onto 4 (5) queues respectively, q:bothits, q:objectviews (and q:count simultaneously), q:fof, and q:other. I am using prefixes as a convention when working with Redis keys - "q:" will almost always be a queue of some sort. These four queues are consumed by loggers, who commit the logs to disc, segregated into their categories.

The q:count queue is consumed by a further worker called - count.py. This does a number of jobs, and is the part that actually does the analysis.

For each repository item logged event, it finds the ID of the item and also whether this was a download of an item's files. With my repository, both these facts are deducible from the URL itself.

Given the ID, it checks redis to see if this item has had its metadata analysed before. If it hasn't, it grabs the metadata for the item from the repositories index (hosted by an instance of Apache Solr) and starts to add connections between metadata entity and ID to the redis index:

eg say item "pid:1" has the simple metadata of author_name='Ben' and subjects='foo, bar'

create unique IDs from the text by hashing the text and prefix it with the type of the field they came from:

Prefixes:

  • name => "n:"

  • institution => "i:"

  • faculty => "f:"

  • subjects => "s:"

  • keyphrases => "k:"

  • content type => "type:"

  • collection => "col:"

  • thesis type => "tt:"


eg

>>> from hashlib import md5

>>> md5("Ben").hexdigest()

'092f2ba9f39fbc2876e64d12cd662f72'

So, the hashkey of the 'name' 'Ben' is 'n:092f2ba9f39fbc2876e64d12cd662f72'

Now to make the connections in Redis:

  • Add ID to the set 'objectitems' - to keep track of all the IDs (SADD objectitems {ID})

  • Set 'n:092f2....' to 'Ben' (so we can keep a reverse mapping)

  • Add 'n:092f2...' to 'names' set (to make it clearer. KEYS n:* should return an equivalent set)

  • Add 'n:092f2...' to 'e:{id}' eg "e:pid:1" - (e -> prefix for collections of entities. e:{id} is a set of all entities that occur in id)

  • Add 'e:pid:1' to 'e:n:092f2....' (gathers a list of item ids in which this entity 'Ben' occurs in)


Repeat for any entity you wish to track.

To make this more truth-manageable, you should include the id of record with the text when you generate the hashkey. That way, 'Ben' appearing in one record will have a different key than 'Ben' occuring in another. The assertion that these two entities are the same can easily take place in a different set, (I'm using b: as the prefix for these bundles of asserted equivalence)

Once you have made these assertions, you can set about counting :)

Conventions for tracking hits:

d[v|d|o]:{id} - set of the dates on which {id} was viewed (v), downloaded from (d) or any other page action (o)

eg dv:pid:1 -> set of dates on which pid:1 had page views.


YYYY-MM-DD:{id}:[v|d|o] - set of IP clients that accessed a particular item on a given day - v,d,o as above

eg 2010-02-03:pid:1:d - set of IP clients that downloaded a file from pid:1 on 2010-02-03


t:views:{hashkey}, t:dls:{hashkey}, t:other:{hashkey}

Grand totals of views, downloads or other accesses on a given entity or id. Good for quick lookups.


Let's walk through an example: consider that a client of IP 1.2.3.4 visits the record page for this 'pid:1' on 2010-01-01:

ID = pid:1

Add the User Agent string ("mozilla... etc") to the 'ua:{IP}' set, to keep track of the fingerprints of the visitors.

Try to add the IP address to the set - in this case "2010-01-01:pid:1:v"

If the IP isn't already in this set (the client hasn't accessed this page already today) then:

  • make sure that "2010-01-01" is a part of the 'dv:pid:1' set

  • go through all the entities that are part of pid:1 (n:092... etc) and increment their totals by one.

    • INCR t:views:n:092...

    • INCR t:views:pid:1




Now, what about querying?

Say we wish to look up the activity on a given entity, say for 'Ben'?

First, find the hashkey(s) that exist that are equivalent - either directly using the simple md5sum hash, or by checking which bundles are for this entity.

You can get the grand totals by simply querying "t:views:key", "t:dls..." for each key and summing them together.

You can get more refined answers by getting the set of IDs that this entity is associated with, and querying that to gather all the daily IP sets for them, and summing the answer. This gives me a nice way to generate data suitable for a daily activity sparkline, like:



I have added another set of keys to the store, of the form 'geocode:{IP}' that record country code to IP address, which gives me a nice way to plot out graphs like the following also using the google chart API:



Python logging to Redis

This functionality is mainly in one file in the github repo: redislogger.py

As you can see, most of that file is taken up with a demonstration of how to invoke it! The file that holds the logging configuration which this demo uses is in logging.conf.example.

NB The usage analysis code and UI is very much a WIP

but, I just wanted to post quickly on the rough overview on how it is set up and working.


Thursday 25 March 2010

Curating content from one repository to put into another

First you need a little code that I've written:

sudo easy_install recordsilo oaipmhscraper

(This should install all the dependencies for the following)

To harvest some OAI-PMH records from say... http://eprints.soton.ac.uk/perl/oai2 :

First, take a look at the Identify page for the OAI-PMH endpoint: http://eprints.soton.ac.uk/perl/oai2?verb=Identify

The example identifier indicates that the record identifiers start with: "oai:eprints.soton.ac.uk:" - we'll need this in a bit. Maybe not need, but it'll make the local storage more... elegant?

Go to a nice clean directory, with enough storage to handle whatever you want to harvest.

Start a python commandline:

>>> from oaipmhscraper import OAIPMHScraper

---> NB OAIPMHScraper(storage_dir, base_oai_url, identifier_uri_prefix)

>>> oaipmh = OAIPMHScraper("myrepo",
"http://eprints.soton.ac.uk/perl/oai2",
"oai:eprints.soton.ac.uk:")

Let's have a look at what could be found out about the OAI-PMH endpoint then:

>>> oaipmh.state

{'lastidentified': '2010-03-25T15:57:15.670552', 'identify': {'deletedRecord': 'persistent', 'compression': [], 'granularity': 'YYYY-MM-DD', 'baseURL': 'http://eprints.soton.ac.uk/perl/oai2', 'adminEmails': ['mailto:eprints@soton.ac.uk'], 'descriptions': ['........'], 'protocolVersion': '2.0', 'repositoryName': 'e-Prints Soton', 'earliestDatestamp': '0001-01-01 00:00:00'}}

>>> oaipmh.getMetadataPrefixes()

{'oai_dc': ('http://www.openarchives.org/OAI/2.0/oai_dc.xsd', 'http://www.openarchives.org/OAI/2.0/oai_dc/'), 'uketd_dc': ('http://naca.central.cranfield.ac.uk/ethos-oai/2.0/uketd_dc.xsd', 'http://naca.central.cranfield.ac.uk/ethos-oai/2.0/')}

Let's grab all the oai_dc from all the objects:

>>> oaipmh.getRecords('oai_dc')
...

Go make a cup of coffee or tea.... you'll get lots of stuff like:

INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1267 found with datestamp 2004-04-27T00:00:00 - storing.
2010-03-25 16:01:11,807 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.
INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1268 found with datestamp 2005-04-22T00:00:00 - storing.
2010-03-25 16:01:11,813 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.
INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1269 found with datestamp 2004-04-07T00:00:00 - storing.
2010-03-25 16:01:11,819 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1270 found with datestamp 2004-04-07T00:00:00 - storing.
INFO:OAIPMH Harvester:New object: oai:eprints.soton.ac.uk:1270 found with datestamp 2004-04-07T00:00:00 - storing.
2010-03-25 16:01:11,824 - OAIPMH Harvester - INFO - New object: oai:eprints.soton.ac.uk:1271 found with datestamp 2004-04-14T00:00:00 - storing.

...

My advice is to hop to a different terminal window and start to poke around with the content you are getting. The underlying store is a take on the CDL's Pairtree microspec (pairtree being a minimalist specification for how to structure the access to object-orientated items on a hierarchical filesystem) This model on top of pairtree I've called a Silo (in the RecordSilo library I've written) and constitutes a basic object model, where each object has a persistent JSON state (r/w-able) and can store any file or file in a subdirectory. It has crude object-level versioning, rather than file-versioning, so you can clone one version, delete/alter/add to it to create a second, curated version for reuse elsewhere without affecting the original.

What makes pairtree attractive is that the files themselves are not altered in form, so normal posix tools can be used on the files without unwrapping, depacking, etc.

Let's have a look around at what's been harvested so far into the "myrepo" silo:

>>> from recordsilo import Silo
>>> s = Silo("myrepo")
>>> s.state
{'storage_dir': 'myrepo', 'identifier_uri_prefix': 'oai:eprints.soton.ac.uk:', 'uri_base': 'oai:eprints.soton.ac.uk:', 'base_oai_url': 'http://eprints.soton.ac.uk/perl/oai2'}'}

>>> len(s) # NB this can be a time-consuming operation
1100
>>> len(s)
1200

Now let's look at a record: I'm sure I saw '6102' whizz past as it was harvesting...

>>> obj = s.get_item("oai:eprints.soton.ac.uk:6102")

>>> obj
{'files': {'1': ['oai_dc']}, 'subdir': {'1': []}, 'versions': ['1'], 'date': '2004-06-24T00:00:00', 'currentversion': '1', 'metadata_files': {'1': ['oai_dc']}, 'item_id': 'oai:eprints.soton.ac.uk:6102', 'version_dates': {'1': '2004-06-24T00:00:00'}, 'metadata': {'identifier': 'oai:eprints.soton.ac.uk:6102', 'firstSeen': '2004-06-24T00:00:00', 'setSpec': ['7374617475733D707562', '7375626A656374733D51:5148:5148333031', '7375626A656374733D47:4743', '74797065733D61727469636C65', '67726F75703D756F732D686B']}}

>>> obj.files
['oai_dc']
>>> obj.versions
['1']
>>> obj.clone_version("1","workingcopy")
'workingcopy'
>>> obj.versions
['1', 'workingcopy']
>>> obj.currentversion
'workingcopy'
>>> obj.set_version_cursor("1")
True
>>> obj.set_version_cursor("workingcopy")
True
>>> obj.files
['oai_dc']
>>> with obj.get_stream("oai_dc") as oai_dc_xml:
... print oai_dc_xml.read()
...
<metadata xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Population biology of Hirondellea sp. nov. (Amphipoda: Gammaridea: Lysianassoidea) from the Atacama Trench (south-east Pacific Ocean)</dc:title>
<dc:creator>Perrone, F.M.</dc:creator>
<dc:creator>Dell'Anno, A.</dc:creator>
<dc:creator>Danovaro, R.</dc:creator>
<dc:creator>Groce, N.D.</dc:creator>
<dc:creator>Thurston, M.H.</dc:creator>
<dc:subject>QH301 Biology</dc:subject>
<dc:subject>GC Oceanography</dc:subject>
<dc:description/>
<dc:publisher/>
<dc:date>2002</dc:date>
<dc:type>Article</dc:type>
<dc:type>PeerReviewed</dc:type>
<dc:identifier>http://eprints.soton.ac.uk/6102/</dc:identifier></oai_dc:dc></metadata>

You can add bytestreams as strings:

>>> obj.put_stream("foo.txt", "Some random text!")

or as file-like objects:

>>> with open("README", "r") as readmefile:
... obj.put_stream("README", readmefile)
...
>>> obj.files
['oai_dc', 'foo.txt', 'README']
>>> obj.set_version_cursor("1")
True
>>> obj.files
['oai_dc']

This isn't the easiest way to browse or poke around the files. It would be nice to see these through a web UI of some kind:

Grab the basic UI code from http://github.com/benosteen/siloserver

(You'll need to install web.py and Mako: sudo easy_install mako web.py)

Then edit the silodirectory_conf.py file to point to the location of the Silo - if the directory structure looks like the following:

myrepo
|
--- Silo directory stuff...
SiloServer
|
- dropbox.py
etc

You need to change data_dir to "../myrepo" and then you can start the server by running 'python dropbox.py'

Point a browser at http://localhost:8080/ and wait a while - that start page loads *every* object in the Silo.

And let's revisit our altered record, at http://localhost:8080/oai:eprints.soton.ac.uk:6102
So, from this point, I can curate the records as I wish, add files to each item - perhaps licences, PREMIS files, etc - and then push them onto another repository, such as Fedora.