Analytics/Kraken

Rationale
The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. This need permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in an easily accessible way; our efforts are distributed among different departments; our data is fragmented across different systems and databases; and our tools are ad hoc.

Rather than merely improving existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API that can power applications, be mashed up into websites, and stream to devices.

Timeline

 * Rough Teamwide Milestones: http://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Analytics
 * Analytics Team Roadmap: https://www.mediawiki.org/wiki/Analytics/2012-2013_Roadmap

Docs

 * Dataflow diagram: https://upload.wikimedia.org/wikipedia/mediawiki/3/38/Kraken_flow_diagram.png
 * Test Cluster setup notes: http://etherpad.wikimedia.org/AnalyticsKrakenTestClusterSetup
 * Hadoop setup notes: http://etherpad.wikimedia.org/AnalyticsHadoopSetupNotes
 * Hardware Planning docs: https://www.mediawiki.org/wiki/Analytics/2012-2013_Roadmap/Hardware