Analytics/Kraken/Researcher analysis

Background
I (Oliver) have been asked by Toby to play around with our hadoop cluster and see how it handles common research situations - or, at least, those that are common from my point of view. This is the write-up of how that went.

Use cases for Kraken
Kraken (as I understand it) is envisioned to be a data warehouse for vast volumes of information. This information can come from a variety of sources; examples of the sort of thing Kraken might store are:
 * 1) Reams of eventlogging data;
 * 2) Pageviews data;
 * 3) Copies of the MW tables from all production wikis, unified.

This is pretty advantageous, for a variety of reasons: pageviews data currently can't be easily got at all (storage constraints mean it simply isn't available). The MW tables are currently available, but they aren't unified, so gathering global data requires a run across multiple analytics slaves. Kraken's existence, and its storage of this data, dramatically reduces the barrier to global queries and effectively obliterates the barrier to pageview data availability.

My background/limitations
I'm approaching this review from the perspective of an "itinerant researcher". In training and experience terms, I've never been formally trained to conduct research or handle code. Instead I have taught myself (with assistance from a couple of friends) SQL and R, and am happily proceeding towards knowing my way around PHP, too, which is nice. I've been doing this kind of thing for just over a year.

Generally speaking, my work falls into three categories. The first is to simply retrieve a number - this might be "the number of users on vector, globally" or "how many edits have there been in the last 30 days with the VisualEditor" - something that can be done through the terminal with the analytics slaves. The second is to take aggregate data and graph or map it with post-processing ("show me how many edits there have been in the last 30 days with the VisualEditor, and iterate for each previous 30-day cycle"); this usually involves the analytics slaves queried via R. The third is to take non-aggregated data ("predict revert rates from edit summary") and aggregate and map it using complex post-processing, also accomplished with a mix of the slaves and R.

My approach to hadoop/hive is to test whether these tasks can be completed with as-great or greater efficiency than the current setup, for both manual and automatic runs. I'm aided in this by analytics cluster access and Prokopp's "The Free Hive Book", along with the standard hive language documentation.

Data retrieval
The first task is simply being able to conveniently retrieve numbers that are useful prima facie from Kraken - a test of syntax ease as much as anything else. As my test case, I picked "find out how many pageviews the article William Blackstone has received in 2013". Steps in this:

 * 1) Find the pageviews table;
 * 2) Understand its structure;
 * 3) Query, retrieving a sum of a set of numerical fields, with restrictions.
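Assuming the table layout that emerges later in this write-up - a pagecounts table partitioned by year and month, with project, page and views fields; none of these names should be taken as a verified schema - the three steps sketch out roughly as:

```sql
-- Steps 1 and 2: locate the table and inspect its structure.
SHOW TABLES;
DESCRIBE pagecounts;

-- Step 3: sum a numerical field, with restrictions.
-- Column and partition names here are assumptions, not confirmed.
SELECT SUM(views)
FROM pagecounts
WHERE year = 2013
  AND project = 'enwiki'
  AND page = 'William Blackstone';
```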

Hadoop/hive syntax is easy to understand, from my position; it's SQL with knobs on. Many elements of it are as useful as SQL or more so, particularly the accuracy with which it identifies the real location of errors rather than a perceived one, and the way it actively reports on job progress. I easily found the relevant database and table via SHOW, and understood the table's structure via DESCRIBE. With the ease of syntax, constructing the query was simple.

Once it started running I twiddled my thumbs for a long while waiting for it to process - the query (SELECT SUM(views) FROM pagecounts WHERE year = 2013 AND project = 'enwiki' AND page = 'William Blackstone';) took 55 minutes to process before returning a NULL. I (correctly) diagnosed this as a formatting problem (William Blackstone or William_Blackstone or William*Blackstone?) but was unable to verify the correct page format without consulting one of the analytics engineers. The 'comment' field for these tables is empty, there is no documentation of the format of each field (beyond "int" or "char"), and an attempt to run "SELECT * FROM pagecounts LIMIT 1;" as a way of getting the same information generated "FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "pagecounts" Table "pagecounts"".

Having eventually found the correct format, the query ran for approximately an hour (15 hours of CPU time) to again produce a NULL. An obvious problem might be "there's only data for 2012!", something I'd willingly test if SELECT DISTINCT(year) didn't produce the same SemanticException.
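A cheaper way to check which years actually hold data - assuming year really is a partition column - is to list the partitions themselves; Hive answers this from the metastore rather than a table scan, so it sidesteps the partition-predicate restriction:

```sql
-- Lists partition values (e.g. year=2012/month=1, ...) from metadata,
-- without launching a MapReduce job over the data.
SHOW PARTITIONS pagecounts;
```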

With the help of Dan I worked around the SemanticException bug and tried a second hypothesis, namely that the data for William Blackstone didn't exist. To test this, I identified a page I could definitely find within the cluster (namely "Branches", which seemed apropos for a distributed computing system) and ran a query for that. Again, it returned NULL. Eventually I discovered the appropriate naming convention for projects in the pageviews data (which, while matching the database dumps, does not match any internal data sources) and succeeded in running a query that returned usable results.

Retrieval and post-processing
The next task is retrieving aggregated data and processing it through an analytics-biased language, such as R or Python. As an example I decided to pick out the monthly data for the Barack_Obama article on wikipedia, and export it to a TSV file. I was successful in exporting it, and then went to build a pretty ggplot2 plot with the data.
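One way to do the export inside Hive itself is a directory write (a sketch - the output path and grouping columns are assumptions, and the default field delimiter is ^A rather than a tab, so some massaging into TSV may still be needed):

```sql
-- Write monthly view totals for one article to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/obama_monthly'
SELECT month, SUM(views)
FROM pagecounts
WHERE year = 2013
  AND project = 'enwiki'
  AND page = 'Barack_Obama'
GROUP BY month;
```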

Problem the first: the analytics portal (analytics1011.eqiad.wmnet) doesn't actually have R installed. It does have Python installed, but lacks Pandas, NumPy, SciPy and other key data analysis modules. The ability to process data on the same machine you retrieve it from (or, on a machine integrated tightly enough with the retrieval machine to allow for interactions) is key. Without that, automated queries with post-processing involved simply cannot be run; manual queries require someone to come in and shift the files over, which, in the case of non-linked machines, is both a pain and something that potentially increases the risk of security problems. The obvious way around this is to install the pertinent modules and packages on analytics1011 - cat /proc/cpuinfo indicates the machine can certainly take it.

Problem the second: there's no direct link from analytics1011 to stat1, which would be a way around the lack of base language packages on analytics1011. Eventually I opened both up in gigolo and transferred manually. Once it was over on stat1, processing was both trivial and based on the status quo, since I've spent the past 12 months making sure stat1 has absolutely everything I need. Result: graph on the right.

Conclusion
There are a lot of positives to our hive/hadoop infrastructure. Syntax is easy to learn (very similar to SQL, albeit slightly stricter), as are the rules around performance (queries against more precise partitions are faster than queries across wide ones, much as WHERE clauses that exclude big chunks of a MySQL table are faster). Error messages are highly informative, as are the progress reports provided with in-hive execution of queries, which allow for useful evaluation of query speed and progress.
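To illustrate the partitioning rule (with the same caveat as before: table and column names are assumptions):

```sql
-- Pruned: only the year=2013/month=10 partition is scanned.
SELECT SUM(views) FROM pagecounts
WHERE year = 2013 AND month = 10 AND project = 'enwiki';

-- Wide: every month of 2013 must be scanned, much like a MySQL
-- WHERE clause that excludes only a small chunk of the table.
SELECT SUM(views) FROM pagecounts
WHERE year = 2013 AND project = 'enwiki';
```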

There are also a lot of negatives.

Positives
 * Syntax is easy to learn
 * Error messages for problems with the query proper are more distinct (har har) and easier to parse into an actual, fixable problem.
 * Informative progress reports, allowing for time estimate and an awareness of when a query is Broken or 'broken'

Negatives
 * No documentation of field formatting, or of which fields are/are not partitions.
 * Long query runtimes make trial-and-error debugging of data-format problems very expensive.
 * The analytics portal lacks R and key Python data-analysis modules (Pandas, NumPy, SciPy), so post-processing can't happen where the data is retrieved.
 * No direct link between analytics1011 and stat1, so results have to be shifted between machines by hand.

To-dos