Analytics/Kraken/Researcher analysis

Background
I (Oliver) have been asked by Toby to play around with our Hadoop cluster and see how it handles common research situations - or, at least, those that are common from my point of view. This is the write-up of how that went.

Use cases for Kraken
Kraken (as I understand it) is envisioned to be a data warehouse for vast volumes of information. This information can come from a variety of sources; examples of the sort of thing Kraken might store are:
 * Reams of EventLogging data;
 * Pageview data;
 * Copies of the MW tables from all production wikis, unified.

This is pretty advantageous, for a variety of reasons. Pageview data is currently not easy to get at in any form (it's simply not available, due to storage constraints). The MW tables are available, but aren't unified, so gathering global data requires a run across multiple analytics slaves. Kraken's existence, and its storage of this data, dramatically reduces the barrier to global queries and effectively obliterates the barrier to pageview data availability.
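
To make that contrast concrete, here is a minimal sketch of what a global query could look like against a unified copy of the MW tables in Hive. The schema is an assumption for illustration: I'm imagining a unified user_properties table with an extra wiki column recording the source wiki, which may not match whatever Kraken actually ends up with.

  -- Hypothetical: count users with an explicit Vector skin preference,
  -- per wiki, in one query. The 'wiki' column is an assumed addition
  -- marking which production wiki each row was copied from.
  SELECT wiki, COUNT(*) AS vector_users
  FROM user_properties
  WHERE up_property = 'skin'
    AND up_value = 'vector'
  GROUP BY wiki;

(This ignores the complication that users on the default skin may have no preference row at all.) Under the current setup, the equivalent is one query per analytics slave, with the results stitched together by hand.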

My background/limitations
I'm approaching this review from the perspective of an "itinerant researcher". I've never been formally trained to conduct research or write code; instead I have taught myself SQL and R (with assistance from a couple of friends), and am happily proceeding towards knowing my way around PHP too, which is nice. I've been doing this kind of work for just over a year.

Generally speaking, my work falls into three categories. The first is to simply retrieve a number - this might be "the number of users on Vector, globally" or "how many edits have been made in the last 30 days with the VisualEditor" - something that can be done through the terminal against the analytics slaves. The second is to take aggregate data and graph or map it with post-processing ("show me how many edits have been made in the last 30 days with the VisualEditor, and iterate for each previous 30-day cycle"); this usually involves querying the analytics slaves via R. The third is to take non-aggregated data ("predict revert rates from edit summaries") and aggregate and map it using complex post-processing, also accomplished with a mix of the slaves and R.
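
As an illustration of the first category, the VisualEditor count is the sort of thing I'd currently run against a single wiki's analytics slave, assuming the standard MW schema and the 'visualeditor' change tag:

  -- Edits tagged as made with VisualEditor in the last 30 days,
  -- run against one wiki's analytics slave (MySQL).
  SELECT COUNT(*) AS ve_edits
  FROM change_tag
  JOIN revision ON rev_id = ct_rev_id
  WHERE ct_tag = 'visualeditor'
    AND rev_timestamp >= DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%s');

The second and third categories wrap queries like this in R for the graphing, mapping, and heavier post-processing.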

My approach to Hadoop/Hive is to test whether these tasks can be completed with as-great or greater efficiency than under the current setup, for both manual and automated runs. I'm aided in this by analytics cluster access and Prokopp's "The Free Hive Book".
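
For a flavour of what the comparison involves, here is a sketch of the same 30-day VisualEditor count expressed in HiveQL. Everything schema-related here is an assumption: a unified revision and change_tag pair each carrying a wiki column, and timestamps stored as 'yyyy-MM-dd HH:mm:ss' strings so the lexical comparison below works.

  -- Hypothetical Hive version of the 30-day VisualEditor count,
  -- global rather than per-slave. date_sub() yields a 'yyyy-MM-dd'
  -- string, so this relies on the assumed timestamp format.
  SELECT r.wiki, COUNT(*) AS ve_edits
  FROM revision r
  JOIN change_tag ct
    ON (r.wiki = ct.wiki AND r.rev_id = ct.ct_rev_id)
  WHERE ct.ct_tag = 'visualeditor'
    AND r.rev_timestamp >= date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 30)
  GROUP BY r.wiki;

Whether queries like this run as fast as, or faster than, the slave-plus-R pipeline is what the rest of this write-up tests.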