Analytics/Kraken/Researcher analysis

Background
I (Oliver) have been asked by Toby to play around with our Hadoop cluster and see how it handles common research situations - or, at least, those that are common from my point of view. This is the write-up of how that went.

Use cases for Kraken
Kraken (as I understand it) is envisioned to be a data warehouse for vast volumes of information. This information can come from a variety of sources; examples of the sort of thing Kraken might store are:
 * 1) Reams of EventLogging data;
 * 2) Pageviews data;
 * 3) Copies of the MW tables from all production wikis, unified.

This is pretty advantageous, for a variety of reasons. Pageview data is currently all but impossible to get at - it's simply not available, due to storage constraints. The MW tables are available, but they aren't unified, so gathering global data means running the same query across multiple analytics slaves. Kraken's existence, and its storage of this data, dramatically reduces the barrier to global queries and effectively obliterates the barrier to pageview data availability.

My background/limitations
Skillsets at the WMF around analytics tend to fall into several categories:
 * 1) Analytics engineers: people tasked with, and concerned about, the infrastructure for gathering and analysing data (i.e., you people).
 * 2) Formal researchers: people formally trained in (or highly experienced at) research and analysis. They depend on the work of analytics engineers; their primary role for the Foundation is providing data to support, or demonstrate the need for, particular decisions, be that in Product or Programs.
 * 3) Informal researchers: people informally trained in (or moderately experienced at) research and analysis. They also depend on the work of analytics engineers, but providing the sort of quantitative data we store is only a secondary or tertiary part of their job; they might be a general analyst tasked with gathering both qualitative and quantitative information, a data-informed designer, or a product manager who knows their way around SQL.

I'm approaching this review from the perspective of an "itinerant researcher". I've never been formally trained to conduct research or to write code; instead I've taught myself SQL and R (with assistance from a couple of friends), and am happily proceeding towards knowing my way around PHP, too. I've been doing this kind of thing for just over a year. My regular use cases are:
 * 1) Querying a single database for simple data. This might be a number, or a list of users; examples would be finding out who has been testing a particular piece of software, or which templates are most used in a particular namespace. These can be run manually, or hooked up to a cron job and automated if I want non-static data (see the first sketch after this list).
 * 2) Querying a single database for complex data - in other words, data that requires post-processing in R. Examples could be anything from using regular expressions to parse log entries to visualising the results of simple queries (see the second sketch below).
 * 3) Querying multiple databases for simple or complex data. An example would be user_properties data from all wikis. Regardless of data complexity, this needs pre-processing to stitch the per-wiki results together (see the third sketch below).
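
To make use case 1 concrete, here's a minimal sketch of the sort of thing I mean, in R with RMySQL. The hostname, the credentials file and the specific query are placeholders rather than anything I actually ran; the point is the shape of the task - one connection, one simple query, a short list back.

<syntaxhighlight lang="r">
# Use case 1 sketch: one database, one simple query.
# The hostname and credentials file below are placeholders.
library(RMySQL)

con <- dbConnect(MySQL(),
                 host = "analytics-store.eqiad.wmnet",  # placeholder analytics slave
                 dbname = "enwiki",
                 default.file = "~/.my.cnf")

# One reading of "which templates are most used": the most-transcluded
# pages in the Template namespace (10).
top_templates <- dbGetQuery(con, "
  SELECT tl_title AS template, COUNT(*) AS uses
  FROM templatelinks
  WHERE tl_namespace = 10
  GROUP BY tl_title
  ORDER BY uses DESC
  LIMIT 20;")

dbDisconnect(con)
</syntaxhighlight>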
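
Use case 2 is the same pattern, with R doing work afterwards that SQL handles badly. Again the host, the assumed table contents and the regular expression are illustrative, not a real analysis:

<syntaxhighlight lang="r">
# Use case 2 sketch: one query, then post-processing in R
# (regular expressions, date handling, plotting). Host is a placeholder.
library(RMySQL)

con <- dbConnect(MySQL(),
                 host = "analytics-store.eqiad.wmnet",
                 dbname = "enwiki",
                 default.file = "~/.my.cnf")
moves <- dbGetQuery(con, "
  SELECT log_timestamp, log_comment
  FROM logging
  WHERE log_type = 'move'
  ORDER BY log_timestamp DESC
  LIMIT 10000;")
dbDisconnect(con)

# Parse the log entries: pull the date out of the MW timestamp and use a
# regular expression to flag moves whose comment mentions a redirect.
moves$day <- as.Date(moves$log_timestamp, format = "%Y%m%d%H%M%S")
moves$mentions_redirect <- grepl("redirect", moves$log_comment, ignore.case = TRUE)

# Visualise the result: logged page moves per day.
per_day <- aggregate(log_comment ~ day, data = moves, FUN = length)
plot(per_day$day, per_day$log_comment, type = "l",
     xlab = "Date", ylab = "Logged page moves")
</syntaxhighlight>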
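
Use case 3 is where the pre-processing comes in, and it's the case Kraken's unified tables would largely remove: run the same query against each wiki's database, then stitch the per-wiki results together before any analysis can start. The host and the short wiki list here are placeholders; the real task covers every production wiki.

<syntaxhighlight lang="r">
# Use case 3 sketch: the same query run per wiki, then bound together.
# Host and wiki list are placeholders.
library(RMySQL)

wikis <- c("enwiki", "dewiki", "frwiki")

gadget_users <- function(wiki) {
  con <- dbConnect(MySQL(),
                   host = "analytics-store.eqiad.wmnet",
                   dbname = wiki,
                   default.file = "~/.my.cnf")
  on.exit(dbDisconnect(con))
  result <- dbGetQuery(con, "
    SELECT up_property AS property, COUNT(*) AS users
    FROM user_properties
    WHERE up_property LIKE 'gadget-%'
      AND up_value = '1'
    GROUP BY up_property;")
  result$wiki <- wiki
  result
}

# The pre-processing step: one data frame per wiki, stacked into a global view.
global_gadget_users <- do.call(rbind, lapply(wikis, gadget_users))
</syntaxhighlight>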