Recently, it has let me cut through access-log munging like a hot knife through butter, all with multiprocessing goodness.
Key things:
Using sets to manage botlists:
>>> from redis import Redis
>>> r = Redis()
>>> for bot in r.smembers("botlist"):
...     print bot
...
lycos.txt
non_engines.txt
inktomi.txt
misc.txt
askjeeves.txt
oucs_bots
wisenut.txt
altavista.txt
msn.txt
googlebotlist.txt
>>> total = 0
>>> for bot in r.smembers("botlist"):
...     total = total + r.scard(bot)
...
>>> total
3882
So, I have built up 3882 different IP addresses that I consider to be bots.
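For reference, the sets could be built up along these lines - a minimal sketch only (the set names are the real ones from the session above, but the loading loop here is an assumption, not my actual code):

# Sketch: populate one Redis set per source file of bot IPs, and register
# each set's name in the "botlist" set so it can be found later.
from redis import Redis

r = Redis()
for listname in ["googlebotlist.txt", "msn.txt", "lycos.txt"]:
    r.sadd("botlist", listname)        # "botlist" holds the names of the sets
    for line in open(listname):
        ip = line.strip()
        if ip:
            r.sadd(listname, ip)       # each named set holds bot IP addresses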
Keeping counts and avoiding race conditions
By using the Redis INCR command, it's easy to write little workers that each run in their own process but atomically increment counts of hits.
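As a minimal illustration (not taken from the actual workers), several processes can bump the same counter with no locking on the Python side, because INCR is atomic on the Redis server:

# Sketch: four processes hammer the same counter via INCR.
# Redis applies each increment atomically, so no updates are lost.
from multiprocessing import Process
from redis import Redis

def worker(hits):
    r = Redis()
    for _ in range(hits):
        r.incr("u:objectviews")    # atomic server-side increment

if __name__ == "__main__":
    procs = [Process(target=worker, args=(1000,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # "u:objectviews" is now exactly 4000 higher than before - no lost updates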
What does the stat system look like?
I am treating each line of the Apache-style log as a message that I am passing through a number of workers.
Queues
All queues hang off the same AMQP exchange ("stats"):
Queue "loglines" - messages are single log lines in the Apache format. They can be sourced either from local logs or from the live service.
loglines is listened to by a debot.py worker, just one at the moment. This worker feeds three queues:
Queue "bothits" - log lines from a request that matches a bot IP
Queue "objectviews" - log lines from a request that was a record page view or item download
Queue "other" - log lines that I am presently not so interested in.
[These three queues are consumed by three loggers, which maintain a pre-separated copy of the logs. They are designed to be temporary parts of the workflow, to be discarded once we know what we want from the logs.]
objectviews is subscribed to by a count.py worker which does the heavy crunching as shown below.
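To give a flavour of the plumbing, a log-line producer for this topology might look something like the sketch below. The exchange and queue names come from the outline above; the pika client library, connection details, and the "access.log" filename are assumptions for illustration:

# Sketch: declare the "stats" exchange, bind the "loglines" queue,
# and publish each Apache log line as its own message.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="stats", exchange_type="direct")
channel.queue_declare(queue="loglines", durable=True)
channel.queue_bind(queue="loglines", exchange="stats", routing_key="loglines")

for line in open("access.log"):
    channel.basic_publish(exchange="stats",
                          routing_key="loglines",
                          body=line.rstrip("\n"))

connection.close()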
Debot.py
The first worker is 'debot.py' - this does the broad separation and checking of a logged event. In essence, it uses the Redis SISMEMBER command to see if the IP address is in any of the blacklists and, if not, applies a few regexes to see if the request is a record view and/or a download, or something else.
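A sketch of that routing decision is below. Only the SISMEMBER check against the bot sets and the three routing keys come from the outline; the regex patterns, the log-line parsing, and the republish step are placeholders:

# Sketch of debot.py's decision: is the client IP in any bot set?
# If not, does the request look like a record view or a download?
import re
from redis import Redis

r = Redis()

VIEW_RE = re.compile(r'GET /objects/[^ ]+ HTTP')                        # placeholder pattern
DOWNLOAD_RE = re.compile(r'GET /objects/[^ ]+/datastreams/[^ ]+ HTTP')  # placeholder pattern

def is_bot(ip):
    # "botlist" holds the names of the individual bot-IP sets
    for botset in r.smembers("botlist"):
        if r.sismember(botset, ip):
            return True
    return False

def route(logline):
    ip = logline.split(" ", 1)[0]          # first field of an Apache log line
    if is_bot(ip):
        return "bothits"
    if VIEW_RE.search(logline) or DOWNLOAD_RE.search(logline):
        return "objectviews"
    return "other"

# each line pulled from "loglines" is then republished on the "stats"
# exchange, using route(line) as the routing key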
Broad Logging
There are three logger workers that debot.py feeds for "bothits", "objectviews", and "other" - each worker just sits and listens on its queue for an Apache log line and appends it to the logfile it has open. This saves me having to open/close logger objects or pass anything around.
The logfiles are purely a record of the processing, so that I can skip redoing it if I want to do any further analysis, like tracking individuals, etc.
The loggers also INCR a key in Redis for each line they see - u:objectviews, u:bothits, and u:other as appropriate - these give me a rough idea of how the processing is going.
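Each logger is only a few lines; a sketch along these lines (the counter names and queue names are from the post, but pika and the logfile naming are assumptions):

# Sketch of a logger worker: append each log line to its file and bump the
# matching rough-progress counter in Redis.
import pika
from redis import Redis

QUEUE = "objectviews"                      # or "bothits" / "other"
r = Redis()
logfile = open(QUEUE + ".log", "a")

def on_line(channel, method, properties, body):
    logfile.write(body.decode("utf-8") + "\n")
    logfile.flush()
    r.incr("u:" + QUEUE)                   # u:objectviews, u:bothits, u:other

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.basic_consume(queue=QUEUE, on_message_callback=on_line, auto_ack=True)
channel.start_consuming()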
(And you can generate pretty charts from it too - data sourced at a point during the processing: 10 million bot hits vs 360k object views/downloads.)
Counting hits (metadata and time based)
Most of the heavy lifting is in count.py - this is fed from the object views/downloads stream coming from the debot.py worker. It does a number of procedural steps for the metadata (a rough code sketch follows the list):
- Get metadata from ORA's Solr endpoint (as JSON)
- Specifically, get the 'authors' (names), subjects/keyphrases, institutions, content types, and collections things appear in.
These fields correspond to certain keys in Redis. E.g. for names: 'number:names' = the number of unique names, 'n:...' = hits for a given name, etc.
- For each view/dl:
- If the entity is new: INCR 'ids:XXXXX' where XXXXX is 'names', 'subjects', etc. This returns the new id for it, e.g. 142
- SET X:142 to be equal to the text for this new entity, where X is the prefix for the field.
- SADD this id (eg X:142) to the relevant set for it, like 'names', 'subjects', etc - This is so we can have an accurate idea of the entities in use even after removing/merging them.
- Reverse lookup: Hash the text for the entity (eg md5("John F. Smith")) and SET r:X:{hash} to be equal to "X:142"
- SET X:views:142 to be equal to 1 to get the ball rolling (or X:dl:142 for downloads)
- If the name is not new:
- Hash the text and look it up in r:X:{hash} to get the id (e.g. n:132)
- INCR the item's counter (eg INCR n:views:132)
- Time-based and other counts:
- INCR t:{object id} (total hits on that repository object since logs began)
- INCR t:MMYY (total 'proper' hits for that month)
- INCR t:MMYY:{object id} (total 'proper' hits for that repo item that month)
- INCR t:MMYY:{entity id} (Total hits for an entity, say 'n:132' that month)
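Pulling those steps together, the per-entity bookkeeping might look something like this. It's a sketch only: the prefix mapping, function names, month format, and the handling of downloads vs views are assumptions, while the key shapes follow the list above (the Solr lookup and log parsing are elided entirely):

# Sketch of count.py's bookkeeping for one view/download event.
import hashlib
from redis import Redis

r = Redis()

# assumed prefix per metadata field, e.g. names -> 'n', subjects -> 's'
PREFIXES = {"names": "n", "subjects": "s"}

def count_entity(field, text, kind="views"):
    """Record one view (or download, kind='dl') against a metadata entity."""
    prefix = PREFIXES[field]
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    entity_id = r.get("r:%s:%s" % (prefix, digest))        # reverse lookup by hash
    if entity_id is None:
        # new entity: mint an id, store its text, register it, seed its counter
        new_num = r.incr("ids:%s" % field)                 # e.g. returns 142
        entity_id = "%s:%s" % (prefix, new_num)            # e.g. "n:142"
        r.set(entity_id, text)                             # n:142 -> "John F. Smith"
        r.sadd(field, entity_id)                           # track entities in use
        r.set("r:%s:%s" % (prefix, digest), entity_id)     # reverse lookup entry
        r.set("%s:%s:%s" % (prefix, kind, new_num), 1)     # e.g. n:views:142 = 1
    else:
        if isinstance(entity_id, bytes):
            entity_id = entity_id.decode("utf-8")
        num = entity_id.split(":")[1]
        r.incr("%s:%s:%s" % (prefix, kind, num))           # e.g. INCR n:views:132
    return entity_id

def count_hit(object_id, month, entity_ids):
    """Time-based and per-object counters for one 'proper' hit (month = MMYY)."""
    r.incr("t:%s" % object_id)                  # total hits for that object
    r.incr("t:%s" % month)                      # total proper hits that month
    r.incr("t:%s:%s" % (month, object_id))      # that object, that month
    for entity_id in entity_ids:
        r.incr("t:%s:%s" % (month, entity_id))  # e.g. t:0410:n:132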
A lot of pressure is put on Redis by count.py but it seems to be coping fine. A note for anyone else thinking about this: Redis keeps its datastore in RAM - running out of RAM is a Bad Thing(tm).
I know that I could also just use the md5 hashes as ids, rather than minting a second id - I'm still developing this section, and this outline just describes it as it is now!
Also, it's worth noting that, if I needed to, I could put remote Redis 'shards' on other machines and have them pull log lines from the main objectviews queue to process. (They'd still need to create the id <-> entity name mapping on the main store, or on a slave of the main store, though.)
But why did I do this?
I thought it would mean I could handle both legacy logs and live data, and have a framework I could point at other systems, while writing less code and ending up with a more reliable system.
So far, I still think this is the case. If people are interested, I'll abstract out a class or two (e.g. the metadata lookup function) and stick it on Google Code. It's not really a lot of code so far; I think even this outline post is longer....