Wednesday 25 February 2009

Developer Happiness days - why happyness is important

Creativity and innovation

One of the defining qualities of a good, innovative developer is creativity paired with a pragmatic attitude; someone with the 'rough consensus, running code' mentality that pervades good software innovation. This can be seen as the drive to experiment, to turn inspiration and ideas into real, running code, or to pathfind by trying out different things. Innovation can often happen when talking about quite separate, seemingly unrelated things, even to the point that, most of the time, the 'outcomes' of an interaction are impossible to pin down.

Play, vagueness and communication

Creativity, inspiration, innovation, ideas, fun, and curiosity are all useful and important when developing software. These words convey concepts that do not thrive in situations that are purely scheduled, didactic, and teacher-pupil focussed. There needs to be an amount of 'play' in the system (see 'Play'.) While this 'play' is bad in a tightly regimented system, it is an essential part of a creative system, allowing new things to develop, new ideas to happen and 'random' interactions to take place.

Alongside this notion of play in an event, there also needs to be an amount of blank space, a vagueness to the event. I think we can agree that much of the usefulness of normal conferences comes from the 'coffee breaks' and 'lunch breaks', which are blank spaces of a sort. What matters is recognising this and factoring it in deliberately.

Note that if a single developer could guess at how things should best be developed in the academic space, they would have done so by now. Pre-compartmentalisation of ideas into 'tracks' can kill potential innovation stone-dead. The distinction between CMS, repository and VLE developers is purely semantic, and it is detrimental for people involved in one space not to overhear the developments, needs, ideas and issues in another. It is especially counter-productive to further segregate by community, such as having simultaneous Fedora, DSpace and EPrints strands at an event.

While the inherent and intended vagueness provides the potential for cross-fertilisation of ideas, and the room for play provides the space, the final ingredient is speech, or any communication that takes place with the same ease and at the same speed as speech. While some may find the 140 character limit on twitter or identi.ca a strange constraint, it gives people a target that makes them really think about what they wish to convey, and it keeps the exchange from becoming a series of monologues - much like the majority of emails on mailing lists - keeping it a dialogue between people.

Communication and Developers

One of the ironies of communication being so necessary to development is that developers can be shy, initially preferring the false anonymity of textual communication to spoken words between real people. There is a need to provide means for people to break the ice and to strike up conversations with those they can recognise as being of like minds. Asking people to change their public online avatars to pictures of themselves can help attendees at an event find those they have been talking to online and start talking, face to face.

On a personal note, one of the most difficult things I have to do when meeting people out in real life is answer the question 'What do you do?' - it is much easier when I already know that the person asking the question has a technical background.

And again, going back to the concept of compartmentalisation - developers who only deal with developers and their managers/peers will build systems that work best for their peers and their managers. If these people are not the only users, then the developers need to widen their communications. It is important for developers who do not use their own systems to engage with the people who actually do. They should do this directly, without the potential for garbled dialogue via layers of protocol. This part needs managing in whatever space, both to avoid dominance by loud, disgruntled users and to mitigate anti-social behaviour. By and large, I am optimistic about this process: people tend to want to be thanked, and this simple feedback loop can be used to help motivate. Making this feedback disproportionate (a small 'thank you' can lead to great effects) and adding in the notion of highscores can lead to all sorts of interactions and outcomes, most notably the rapid reinforcement of any behaviour that led to a positive outcome.

Disproportionate feedback loops and Highscores drive human behaviour

I'll just digress quickly to cover what I mean by a disproportionate feedback loop: it is something that encourages a certain behaviour, where the input is something small and inexpensive in either time or effort, but the output can be large and very rewarding. This pattern can be seen in very many interactions: playing the lottery, [good] video game controls, twitter and facebook, musical instruments, the 'who wants to be a millionaire' format, mashups, posting to a blog ('free' comments, auto rss updating, a google-able webpage for each post), etc.

The natural drive for highscores is also worth pointing out. At first glance, is it as simple as considering its use in videogames? How about the concept of getting your '5 fruit and veg a day' (http://www.5aday.nhs.uk/topTips/default.html)? Running in a marathon against other people? Inbox Zero (http://www.slideshare.net/merlinmann/inbox-zero-actionbased-email)? Learning to play different musical scores? Your work being rated highly online? An innovation of yours being commented on by 5 different people in quick succession? Highscores can be very good drivers for human behaviour, addictive to some personalities.

Why not set up some software highscores? For example, in the world of repositories, how about 'Fastest UI for self-submission' - encouraging automatic metadata/datamining, a monthly prize for 'Most issue tickets handled' - to the satisfaction of those posting the tickets, and so on.

It is very easy to over-metricise this - some will purposefully abstain, and some metrics are truly misleading. In the 90s, there was a push to use lines of code as a productivity metric. The false assumption is that lines of code have anything to do with productivity - code should be lean, but not so lean that it becomes hard to maintain.

So be very careful when adding means to record highscores - they should be flexible, and be fun - if they are no fun for the developers and/or the users, they become a pointless metric, more of an obstacle than a motivation.

The Dev8D event

People were free to roam and interact at the Dev8D event and there was no enforced schedule, but twitter and a loudhailer were used to make people aware of things that were going on. Talks and discussions were lined up prior to the event of course, but the event was organised on a wiki which all were free to edit. As experience has told us, the important and sometimes inspired ideas occur in relaxed and informal surroundings where people just talk and share information, such as in a typical social situation like having food and drink.

As a specific example, look at the role of twitter at the event. Sam Easterby-Smith (http://twitter.com/samscam) created a means to track 'developer happiness' and shared the tracking 'Happyness-o-meter' site with us all. This unplanned development inspired me to relay the information back to twitter, and similarly led me to run an operating system/hardware survey in much the same fashion.

To help break the ice and to encourage play, we instituted a number of ideas:

A wordcloud on each attendee's badge, built from whatever we could find of their work online, be it their blog or similar, so that it might provide a talking point, or allow people to spot those who write about things they might be interested in learning more about.

The poker chip game - each attendee was given 5 poker chips at the start of the event, and was encouraged to trade chips for help, advice or as a way to say thank you. The goal was that the top 4 people ranked by number of chips at the end of the third day would receive a Dell mini 9 computer. The balance to this was that each chip was also worth a drink at the bar on that day too.

We were well aware that we'd left a lot of play in this particular system, allowing for lotteries to be set up, people pooling their chips, and so on. As the sole purpose of this was to encourage people to interact, to talk and bargain with each other, and to provide that feedback loop I mentioned earlier, it wasn't too important how people got the chips as long as it wasn't underhanded. It was the interaction and the 'fun' that we were after. Just as an aside, Dave Flanders deserves the credit for this particular scheme.

Developer Decathlon

The basic concept of the Developer Decathlon was also reusing these ideas of play and feedback: "The Developer Decathlon is a competition at dev8D that enables developers to come together face-to-face to do rapid prototyping of software ideas. [..] We help facilitate this at dev8D by providing both 'real users' and 'expert advice' on how to run these rapid prototyping sprints. [..] The 'Decathlon' part of the competition represents the '10 users' who will be available on the day to present the biggest issues they have with the apps they use and in turn to help answer developer questions as the prototypes applications are being created. The developers will have two days to work with the users in creating their prototype applications."

The best two submissions will get cash prizes that go to the individuals, not to the company or institution they are affiliated with. The outcomes will be made public shortly, once the judging panel has done its work.

Summary

To foster innovation and to allow for creativity in software development:
  • Having play space is important
  • Being vague with aims and flexible with outcomes is not a bad thing and is vital for unexpected things to develop - e.g. a project's outcomes should be under continual re-negotiation as a general rule, not as the exception.
  • Encouraging and enabling free and easy communication is crucial.
  • Be aware of what drives people to do what they do. Push all feedback to be as disproportionate as possible, allowing both developers and users to benefit while putting in only a relatively trivial amount of effort (this pattern affects web UIs, development cycles, team interaction, etc.)
  • Choose useful highscores and be prepared to ditch them or change them if they are no longer fun and motivational.

Sunday 22 February 2009

Handling Tabular data

"Storage"

I put the s-word in quotes because the storing of the item is actually a very straightforward process - we have been dealing with storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.

One realisation to make with regard to preserving access to research data is that there is a huge backlog of data in formats that we shall kindly call 'legacy'. Not only that, but data is still being made with tools and systems that effectively 'trap' or lock in a lot of this information - a case in point being any research recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is normally difficult or impossible to extract the implicit information: how the tables interlink, how the Access form adds information to the dataset, and so on.

It is this implicit knowledge that is the elephant in the room. Very many serialisations, such as SQL 'dumps', csv, xls and so on, rely on implicit knowledge that is either tied to the particulars of the application used to open them, or is actually highly domain specific.

So, it is trivial and easy to specify a model for storing data, but without also encoding the implied information and without making allowances for the myriad of sources, the model is useless; it would be akin to defining the colour of storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded in as good a way as possible.

Characterisation

The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team and help and guide them, so that as much of the unsaid knowledge as possible is known by all parties.)

Some serialisations do a good job of this themselves: *SQL-based serialisations include basic data type information inside the table declarations. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we'll consider the implicit information that also needs to be recorded alongside such a declaration.

Others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out, or even deduce, the type of data held in each column. Again, this can be serialised as a SQL table declaration held alongside the original, unmodified dataset.

(It is assumed that a basic data review will be carried out: does the csv have a consistent number of columns per row, is the version and operating system known for the MySQL instance that held the data, is there a PI or responsible party for the data, and so on.)
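
As a rough illustration of that deduction-and-review step, here is a minimal sketch - the guess_sql_type and describe_csv helpers and the survey.csv file name are made up for the example - that guesses a coarse SQL-ish type for each column, checks that the column counts are consistent, and prints a table declaration that could be stored alongside the untouched original:

import csv

def guess_sql_type(values):
    # crude heuristic: try integer, then float, else fall back to TEXT
    try:
        for v in values:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "FLOAT"
    except ValueError:
        return "TEXT"

def describe_csv(path, table_name):
    # first row is assumed to hold the column names
    rows = list(csv.reader(open(path)))
    headers, body = rows[0], rows[1:]
    # basic review: every row should have the same number of columns
    assert all(len(r) == len(headers) for r in body), "inconsistent column count"
    cols = []
    for i, name in enumerate(headers):
        cols.append("  %s %s" % (name, guess_sql_type([r[i] for r in body])))
    return "CREATE TABLE %s (\n%s\n);" % (table_name, ",\n".join(cols))

# e.g. print(describe_csv("survey.csv", "survey"))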

Implicit information

Good teachers are right to point out this simple truth: "don't forget to write down the obvious!" It may seem obvious that all your data is latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, but the painful truth is that we can't guarantee that these aspects won't affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing [at the time, detected] data corruption, or MySQL truncating fields when serialised in a way that is not anticipated or discovered until later.

In characterising legacy sets of data, it is important to realise that there will be loss, especially with formats and applications that blend presentation with storage. For example, it will take a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset - a highly researched dictionary of Old English words and phrases - whose final output is a Macromedia Authorware application, with the source files held by an unknown party (if they still exist at all) - the joy of hiring contractors. In fact, this warrants a slight digression:

The gap in IT support for research

If an academic researcher wishes to gain an external email account at their institution, there is an established protocol for this. Email is so commonplace that it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration and the adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email to deal with (text, or text with MIME attachments - NB the 'IM' in MIME stands for Internet Mail).

If a researcher has a need to store tables of data, where do they turn? They should turn to the same department, who will handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What the IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it, to configure it as they want. No wonder the resulting legacy systems are as free-form as they are.

Practical measures - Curation

Back to specifics now: consider that a set of data has been found to be important, research has been based on it, and it has been recognised that this dataset needs to be looked after. [This will illustrate the technical measures. Licencing, dialogue with the data owners, and other non-technical analysis and administration are left out, but assumed.]

The first task is to store the incoming data, byte-for-byte, as far as is possible - storing the ISO image of the media the data arrived on, storing the SQL dump of a database, and so on.

Analyse the tables of data - record the base type of each column (text, binary, float, decimal, etc), aping the syntax of a SQL table declaration, and try to identify the key columns.

Record the inter-table joins between primary and secondary keys, possibly by using a "table.column SAMEAS table.column;" declaration after the table declarations.

Likewise, attempt to add information about each column, such as units or any other identifying material.

Store this table description alongside the recorded tabular data source.

Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are 'frozen', cells that are the results of embedded formula should be calculated and added as fixed values. It is important to record the environment, library and platform that these calculations are made with.
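
Even a blunt record of the processing environment and the original bytes is far better than nothing. A minimal sketch of that recording step - the snapshot helper, the file name and the JSON output format are all just for illustration:

import hashlib, json, platform, sys

def snapshot(path):
    # checksum the untouched original, byte for byte
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha1.update(chunk)
    # record the environment that any derived values were produced on
    return {
        'file': path,
        'sha1': sha1.hexdigest(),
        'platform': platform.platform(),
        'python': sys.version.split()[0],
    }

# e.g. print(json.dumps(snapshot('original_dump.sql'), indent=2))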

Table description as RDF (strictly, referencing cols/rows via the URI)

One syntax I am playing around with is the notion that, by appending sensible suffixes to the base URI for a dataset, we can uniquely specify a row, a column, a region or even a single cell. Simply put:

http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column
http://datasethost/datasets/{data-id}#table/{table-name}/row/{row-id} to reference a whole row, etc

[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]

The purpose for this is so that the tabular description can be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata without disrupting or altering how this is handled.
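
To make the suffix idea concrete, here is a tiny sketch of building such references in code - the host, the example ids ('ds-0001', 'results', 'temperature') and the helper names are all made up, and the single-cell form is just one possible combination:

BASE = "http://datasethost/datasets/%(data_id)s#table/%(table)s"

def column_uri(data_id, table, column_id):
    # a whole column of a named table within the dataset
    return (BASE + "/column/%(col)s") % {'data_id': data_id, 'table': table, 'col': column_id}

def row_uri(data_id, table, row_id):
    # a whole row
    return (BASE + "/row/%(row)s") % {'data_id': data_id, 'table': table, 'row': row_id}

def cell_uri(data_id, table, row_id, column_id):
    # one possible way to pin down a single cell: row first, then column
    return row_uri(data_id, table, row_id) + "/column/%s" % column_id

print(column_uri("ds-0001", "results", "temperature"))
# http://datasethost/datasets/ds-0001#table/results/column/temperature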

A possible content type for tabular data

By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:

As a Fedora Object:

  • Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)
  • Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)
  • 'Manifest' of the contents (dsid convention: RELS-INT)
  • Connections between this dataset and other objects, like articles, etc as well as the RDF description of this item (RELS-EXT)
  • Basic description of dataset for interoperability (Simple dublin core - DC)

As a BagIt+RDF:

Zip archive -
  • /MANIFEST (list of files and checksums)
  • /RDFMANIFEST (RELS-INT and RELS-EXT from above)
  • /data/* (original dataset files/disk images/etc)
  • /derived/* (normalised/re-rendered datasets in a well known format)
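
As a sketch of assembling the BagIt+RDF variant with nothing more than the standard library - the make_bag helper and the example file names are invented, and a production bag would follow the BagIt spec's own manifest naming more closely:

import hashlib, os, zipfile

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

def make_bag(zip_path, data_files, derived_files, rdf_manifest_path):
    # assemble /MANIFEST, /RDFMANIFEST, /data/* and /derived/* into one zip
    manifest_lines = []
    with zipfile.ZipFile(zip_path, 'w') as z:
        for prefix, files in (('data', data_files), ('derived', derived_files)):
            for path in files:
                arcname = "%s/%s" % (prefix, os.path.basename(path))
                z.write(path, arcname)
                manifest_lines.append("%s  %s" % (sha1_of(path), arcname))
        z.write(rdf_manifest_path, 'RDFMANIFEST')
        z.writestr('MANIFEST', "\n".join(manifest_lines) + "\n")

# e.g. make_bag('dataset-0001.zip', ['disk.iso'], ['normalised.sql'], 'manifest.rdf')
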
Presentation - the important part

What is described above is the archival form of the data. This is a form suited for discovery, but it is not a form suited for reuse. So, what are the possibilities?

BigTable (Google) or HBase (Hadoop) provide a platform where tabular data can be stored in a scalable manner. In fact, I would go so far as to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise, and to educate the academic users in a manner that is useful to them, not just to the archivist. Google spreadsheet is an extremely good example of how such a large, scalable database might be presented to the end-user.

For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy of putting unused datasets to 'sleep' can provide a useful compromise, avoiding having all the tables live while still providing a useful service.

It should also be noted that the adoption of popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, and protocols and methods for access alter with fashion. Currently, Atom/RSS feeds of any part of a table of data (the google spreadsheet model) fits very well with the landscape of applications that can reuse this information.

Summary
  • Try to record as much information as can be found or derived - from host operating system to column types.
  • Keep the original dataset byte-for-byte, as you received it.
  • Try to maintain a version of the data in a well-understood format
  • Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism
  • Be prepared to create services that the users want and need, not services that you think they should have.


Friday 20 February 2009

Pushing the BagIt manifest concept a little further

I really like the idea of BagIt - just enough framework to transfer files in a way that errors can be detected.

I really like the idea of RDF - just enough framework to detail, characterise and interlink resources in an extremely flexible and extendable fashion.

I really like the 4 rules of Linked Data - just enough rules to act as guides; follow the rules and your information will be much more useful to you and the wider world.

What I do not wish to go near is any format that requires a non-machine-readable profile to understand, or a human to reverse-engineer - METS being a good example of a framework that gives you enough rope to hang yourself with.

So, what's my use-case? First, I'll outline what digital objects I have, and why I handle and express them in the way I do.

I deal with lists of resources on a day-to-day basis, and what these resources are and the way these resources link together is very important. The metadata associated with the list is also important, as this conveys the perspective of the agent that constructed this list; the "who,what,where,when and why" of the list.

OAI-ORE is - at a basic level - a specification and a vocabulary which can be used to depict a list of resources. This is a good thing. But here's the rub for me - I don't agree with how ORE semantically places this list. For me, the list is a subjective thing, a facet or perception of linkage between the resources listed. The list *always* implies a context through which the resources are to be viewed. This view leads me to the conclusion that any triples *asserted* by the list - such as those containing an ordering predicate like 'hasNext' or 'hasLast' - must not be in the same graph as the factual triples that would enter the 'global' graph, such as: list A is called (dc:title) "My photos", contains resources a, b, c, d and e, and was authored by Agent X.

This is easier to illustrate with an example involving everyone's friends, Alice and Bob:


Now, while Alice and Bob may be 'aggregating' some of the same images, this doesn't mean we can infer much at all. Alice might be researching the development of a fruit fly's wings based on genetic degradation, and Bob might be researching the fruit fly's eye structure, looking for clear photos of the front of the fly. It could be even more unrelated, in that Bob is actually looking for features on the electron microscope photos caused by dust or pollen.

So, to cope with contextual assertions (A asserts that <B> <verb C> <D>) there are a few well-discussed tactics: Reification, 'Promotion' (not sure of the correct term here) and Named Graphs.

Reification is a no-no. Very bad. Google will tell you all the reasons why this is bad.

'Promotion' (I hope someone will post the real term for this in the comments) is where a classed node is introduced to allow contextual information to be added; it is very useful for adding information about a predicate. For example, consider <Person A> <researches> <ProjectX>. This, I'd argue, is a bad triple for any system that will last longer than <ProjectX>'s lifespan. We need to qualify this triple with temporal information, and perhaps location information too. So, one solution is to 'promote' the <researches> predicate into the following form: <Person A> <has_role> <_: type=Researcher>; <_:> <dtstart> <etc>, <_:> <researches> <ProjectX> ...

From the ORE camp, this promotion comes in the form of a Proxy for each aggregated resource that needs context. So in this way, they have 'promoted' the resource - a kind of promotion of the act of aggregation. Tomayto, tomarto. The way this works for ORE doesn't sit well with me though, and the convention for the URI scheme here feels very awkward and heavy.

The third way (and my strong preference) is the Named Graph approach: put all the triples that are asserted by, say, Alice into a resource at URI <Alices-NG> and say something like <Aggregation> <isProvidedContextBy> <Alices-NG>.
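
In rdflib terms (one RDF library option), a minimal sketch of that split might look like the following - the URIs, the example images and the EX predicates are purely illustrative, and this is only one way of cutting it:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

ORE = Namespace("http://www.openarchives.org/ore/terms/")
EX = Namespace("http://example.org/terms/")  # illustrative predicates only

aggregation = URIRef("http://example.org/alice/photos")
alices_ng = URIRef("http://example.org/alice/photos/context")  # <Alices-NG>

# factual triples that can safely live in the global graph
facts = Graph()
facts.add((aggregation, DC["title"], Literal("My photos")))
for img in ("a", "b", "c", "d", "e"):
    facts.add((aggregation, ORE.aggregates, URIRef("http://example.org/images/" + img)))
facts.add((aggregation, EX.isProvidedContextBy, alices_ng))

# Alice's contextual assertions, kept in their own named graph
context = Graph(identifier=alices_ng)
context.add((URIRef("http://example.org/images/a"), EX.showsWingDegradation, URIRef("http://example.org/flies/line-7")))

print(facts.serialize(format="turtle"))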

For ease of reuse though, I'd suggest that the facts are left in the global graph, in the aggregation serialisation itself. I am sure the semantic arguments over what should go where could rage on for eons; my take is that information that is factual or generally useful should be left in the global graph - things like resource mime-type, standards compliance ('conformsTo', etc) and mirroring/alternate format information ('sha1_checksum', 'hasFormat' between PDF, txt and Word doc versions, etc).

(There is the murky middle ground of course, such as licencing. But I'd suggest leaving it to the 'owning' aggregation to put that in the global graph.)

Enough of the digression on RDF!

So, how to extend BagIt, taking on board the things I have said above:

Add alongside the MANIFEST of BagIt (a simple list of files and checksums) an RDF serialisation - RDFMANIFEST.{format} - which in my preference is in N3 or turtle notation (.n3 or .turtle accordingly).

Copying the modelling of Aggregations from OAI-ORE, we will say that one BagIt archive is equivalent to one Aggregation. (NB there is nothing wrong with a BagIt archive of BagIt archives!)

Re-use the Agent and ore:aggregates concepts and conventions from the OAI-ORE spec to 'list' the archive, and give some form of provenance. Add in a simple record for what this archive is meant to be as a whole (attached to the Aggregation class).

Give each BagIt a URI - in this case, preferably a resolvable URI from which you can download it, but for bulk transfers using SneakerNet or CarFullOfDrivesNet, use an info:BagIt/{id} scheme of your choice.

URIs for resources in transit are hierarchical, based on location in the archive: <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>

Checksums, mimetypes and alternates should be added to the RDF Manifest:

NB <page1> == <info:BagIt/{book-id}/raw_pages/page1/bookid_page1_0.tiff>

<page1> <sha1> "9cbd4aeff71f5d7929b2451c28356c8823d09ab4" ;
        <mimetype> "image/tiff" ;
        <hasFormat> <info:BagIt/{book-id}/thumbnail_pages/page1/bookid_page1_0.jpg> .
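
If you prefer to generate that entry programmatically, a sketch with rdflib might look like the following - the sha1 predicate lives in a stand-in namespace (substitute whichever established checksum vocabulary you prefer), dcterms covers the format/hasFormat parts, and the book id is an invented example:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/bagit-terms/")  # stand-in for a real checksum vocabulary

page1 = URIRef("info:BagIt/book-0001/raw_pages/page1/bookid_page1_0.tiff")
thumb = URIRef("info:BagIt/book-0001/thumbnail_pages/page1/bookid_page1_0.jpg")

manifest = Graph()
manifest.add((page1, EX.sha1, Literal("9cbd4aeff71f5d7929b2451c28356c8823d09ab4")))
manifest.add((page1, DCTERMS["format"], Literal("image/tiff")))
manifest.add((page1, DCTERMS.hasFormat, thumb))

# written alongside the plain BagIt MANIFEST
manifest.serialize(destination="RDFMANIFEST.n3", format="n3")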


Any assertions, such as page ordering in this case, should be handled as necessary. Just *please* do not use 'hasNext'! Use a named graph, use the built in rdf list mechanism, add an RSS 1.0 feed as a resource, anything but hasNext!

And that's about it for the format. One last thing to say about using info URIs though - I'd strongly suggest that they are used only when the items do not have resolvable (http) URIs, and once transferred, I'd suggest that the info URIs are replaced with the http ones, with the info variants kept in a graph for provenance.

(Please note that I am biased, in that this mirrors quite closely the way the archives here work and the way that digital items are held, but I think this works!)

Wednesday 18 February 2009

Tracking conferences (at Dev8D) with python, twitter and tags

There was so much going on at http://www.dev8d.org (#dev8d) that it might be foolish for me to attempt to write up what happened.

So, I'll focus on a small, but to my mind, crucial aspect of it - tag tracking with a focus on Twitter.

The Importance of Tags

First, the tag (#)dev8d was cloudburst over a number of social sites - Flickr (dev8d tagged photos), Twitter (dev8d feed), blogs such as the JISCInvolve Dev8D site, and so on. This was not just done for publicity, but as a means to track and re-assemble the various inputs to and outputs from the event.

Flickr has some really nice photos on it, shared by people like Ian Ibbotson (who caught an urban fox on camera during the event!) While there was an 'official' dev8d flickr user, I expect the most unexpected and most interesting photos to be shared by other people who kindly add on the dev8d tag so we can find them. For conference organisers, this means that there is a pool of images that we can choose from, each with their own provenance so we can contact the owner if we wanted to re-use, or re-publish. Of course, if the owner puts a CC licence on them, it makes things easier :)

So, asserting a tag or label for an event is a useful thing to do in any case. But this, twinned with a messaging system like Twitter or Identi.ca, means that you can coordinate, share, and bring together an event. There was a projector in the Basecamp room, which was either the bar or one of the large basement rooms at Birkbeck, depending on the day. Initially, this was used to run through the basic flow of events, which was primarily organised through a wiki of which all of us and the attendees were members.

Projecting the bird's eye view of the event

I am not entirely sure whose idea it was initially to use the projector to follow the dev8d tag on twitter, auto-refreshing every minute, but it would be one or more of the following: Dave Flanders (@dfflanders), Andy McGregor (@andymcg) and Dave Tarrant (@davetaz), who is aka BitTarrant due to his network wizardry keeping the wifi going despite Birkbeck's network's best efforts at stopping any form of useful networking.

The funny thing about the feed being there was that it felt perfectly natural from the start - almost a mix of notice board, event liveblog and facebook status updates. The overall effect was of a bird's eye view of the entire event, which you could dip into and out of at will, following up on talks you weren't even attending, catching interesting links that people posted, and just following the whole event while doing your own thing.

Then things got interesting.

From what I heard, a conversation in the bar about developer happiness (involving @rgardler?) led Sam Easterby-Smith (@samscam) to create a script that dug through the dev8d tweets looking for n/m scores (like 7/10) and used them as a mark of happyness, e.g.
" @samscam #dev8d I am seriously 9/10 happy http://samscam.co.uk/happier HOW HAPPY ARE YOU? " (Tue, 10 Feb 2009 11:17:15)



It computed the average happyness and the overall happyness of those who tweeted how they were doing!
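
I don't know exactly how Sam's parser works, but picking the n/m scores out of a tweet needs nothing more than a small regular expression; a rough sketch:

import re

# matches scores like "9/10" or "3/5" anywhere in the text
SCORE = re.compile(r'(\d+)\s*/\s*(\d+)')

def happyness(tweet):
    # returns a fraction between 0 and 1, or None if no score was tweeted
    match = SCORE.search(tweet)
    if not match:
        return None
    n, m = int(match.group(1)), int(match.group(2))
    if m == 0:
        return None
    return float(n) / m

print(happyness("#dev8d I am seriously 9/10 happy"))  # 0.9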

Of course, being friendly, constructive sorts, we knew the best way to help 'improve' his happyometer was to try to break it by sending it bad input... *ahem*.
" @samscam #dev8d based on instant discovery of bugs in the Happier Pipe am now only 3/5 happy " (Tue, 10 Feb 2009 23:05:05)
BUT things got fixed, and the community got involved and interested. It caused talk and debate, and got people wondering how it was done, how they could do the same thing and how to take it further.

At which point, I thought it might be fun to 'retweet' the happyness ratings as they change, to keep a running track of things. And so, a purpose for @randomdev8d was born:



How I did this was fairly simple: I grabbed his page every minute or so, used BeautifulSoup to parse the HTML, got the happyness numbers out and compared them to the last ones the script had seen. If there was a change, it tweeted it and, seconds later, the projected tweet feed updated to show the new values - a disproportionate feedback loop, the key to involvement in games; you do something small like press a button or add 4/10 to a message, and you can affect the stock-market ticker of happyness :)

If I had been able to give my talk on the python code day, the code to do this would contain zero surprises, because I covered 99% of this - so here's my 'slides'[pdf] - basically a snapshot screencast.

Here's the crufty code though that did this:
import time
import simplejson, httplib2, BeautifulSoup
import urllib

# authenticated connection for posting to twitter as @randomdev8d
h = httplib2.Http()
h.add_credentials('randomdev8d', 'PASSWORD')
happy = httplib2.Http()

# last-seen overall and average happyness values
o = 130.9
a = 7.7

while True:
    print "Checking happiness...."
    # fetch the happyness-o-meter page and pull the two numbers out of its divs
    (resp, content) = happy.request('http://samscam.co.uk/happier/')
    soup = BeautifulSoup.BeautifulSoup(content)
    overallHappyness = soup.findAll('div')[2].contents
    avergeHappyness = soup.findAll('div')[4].contents
    over = float(overallHappyness[0])
    ave = float(avergeHappyness[0])
    print "Overall %s - Average %s" % (over, ave)
    # work out whether each value has gone up, down or stayed the same
    omess = "DOWN"
    if over > o:
        omess = "UP!"
    amess = "DOWN"
    if ave > a:
        amess = "UP!"
    if over == o:
        omess = "SAME"
    if ave == a:
        amess = "SAME"
    if not (o == over and a == ave):
        print "Change!"
        o = over
        a = ave
        # retweet the new values back into the #dev8d stream
        tweet = "Overall happiness is now %s(%s), with an average=%s(%s) #dev8d (from http://is.gd/j99q)" % (overallHappyness[0], omess, avergeHappyness[0], amess)
        data = {'status': tweet}
        body = urllib.urlencode(data)
        (rs, cont) = h.request('http://www.twitter.com/statuses/update.json', "POST", body=body)
    else:
        print "No change"
    time.sleep(120)
(Available from http://pastebin.com/f3d42c348 with syntax highlighting - NB this was written beat-poet style, written from A to B with little concern for form. The fact that it works is a miracle, so comment on the code if you must.)

The grand, official #Dev8D survey!

... which was anything but official, or grand. The happyness-o-meter idea led BitTarrant and me to think "Wouldn't it be cool to find out what computers people have brought here?" Essentially, finding out what computing environment developers choose to use is a very valuable thing - developers choose things which make our lives easier, by and large, so finding out which setups they use by preference to develop or work with could guide later choices, such as being able to target the majority of environments for wifi, software, or talks.

So, on the Wednesday morning, Dave put out the call on @dev8d for people to post the operating systems and hardware they had brought to the event, in the form of OS/HW. I then busied myself with writing a script that hit the twitter search api directly and parsed the results itself. As this was a more deliberate script, I made sure that it kept track of things properly, pickling its per-person tallies. (You could post multiple configurations in one or more tweets, and it kept track of them per person.) This script was a little bloated at 86 lines, so I won't post it inline - plus, it also showed that I should've gone to the regexp lesson, as I got stuck trying to do it with regexps, gave up, and then used whitespace-tokenising... but it worked fine ;)
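
(For what it's worth, the regexp needn't have been scary. Something along these lines would pull the OS/HW pairs out of a tweet - a sketch rather than the code I actually ran, and URLs would need weeding out first:)

import re

# matches tokens of the form OS/HW, e.g. "ubuntu/eee901" or "OSX/MacBookPro"
OS_HW = re.compile(r'\b([A-Za-z][\w.+-]*)/([A-Za-z0-9][\w.+-]*)\b')

def configurations(tweet):
    # (os, hardware) pairs, lowercased for tallying
    return [(os_name.lower(), hw.lower()) for os_name, hw in OS_HW.findall(tweet)]

print(configurations("#dev8d ubuntu/eee901 and osx/macbook"))
# [('ubuntu', 'eee901'), ('osx', 'macbook')]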

Survey code: http://pastebin.com/f2c04719b

Survey results: http://spreadsheets.google.com/pub?key=pDKcyrBE6SJqToHzjCs2jaQ

OS:
Linux was the majority at 42%, closely followed by Apple at 37%, with MS-based OSes at 18% and a stellar showing of one user of OpenSolaris (4%)!

Hardware type:
66% were laptops, with 25% of the machines there classed as netbooks. 8% of the hardware was iPhones, and one person claimed to have brought Amazon EC2 with them ;)

The post hoc analysis

Now then, having got back to normal life, I've spent a little time grabbing stuff from twitter and digging through it. Here is the list of the 1300+ tweets with the #dev8d tag in them, published via google docs, and here are some derived things posted by Tony Hirst (@psychemedia) and Chris Wilper (@cwilper) seconds after I posted this:

Tagcloud of twitterers:
http://www.wordle.net/gallery/wrdl/549364/dev8_twitterers [java needed]

Tagcloud of tweeted words:
http://www.wordle.net/gallery/wrdl/549350/dev8d [java needed]

And a column of all the tweeted links:
http://spreadsheets.google.com/pub?key=p1rHUqg4g423-wWQn8arcTg

This led me to dig through them and republish the list of tweets, trying to unminimise the urls and grab the <title> tag of the html page each one goes to, which you can find here:

http://spreadsheets.google.com/pub?key=pDKcyrBE6SJpwVmV4_4qOdg

(Which, incidentally, led me to spot that there was one link to "YouTube - Rick Astley - Never Gonna Give You Up", which means the hacking was all worthwhile :))

Graphing Happyness

For one, I've re-analysed the happyness tweets and posted up the following:
It is easier to understand the averages as graphs over time, of course! You could also use Tony Hirst's excellent write up here about creating graphs from google forms and spreadsheets. I'm having issues embedding the google timeline widget here, so you'll have to make do with static graphs.

Average happyness over the course of the event - all tweets counted towards the average.

Average happyness, but with only the previous 10 tweets counted towards the average, making it more reflective of the happyness at that time.

If you are wondering about the first dip, that was when we all tried to break Sam's tracker by sending it bad data, so a lot of 0-happyness scores were recorded :) As for the second dip, well, you can see that for yourselves from the log of happyness :)