Analytics/Kraken

Kraken is the code-name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.

= Status =

= Rationale =

The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. It permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; our tools are ad-hoc.

Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

= Documentation =




 * Cluster Dataflow Diagram
 * Data
 * Product Codes
 * Data Streams
 * Data Formats
 * Pixel Service Endpoint
 * Request Logging
 * Hardware Planning
 * Notes
 * Test Cluster Setup Notes
 * Hadoop Setup Notes
 * Blurbs About Kraken

= Components =

Pixel Service and Request Logging
Public data import endpoint, and system for capturing the incoming firehose from our front-end servers.


 * Request Logging
 * System Recommendation
 * Distributed Logging Systems Research
 * Feature Comparison Spreadsheet
 * Pixel Service Endpoint

Tasks


 * Work with Ori, Patrick, & ops to figure out Kafka producer situation for production [otto, dsc, ori, preilly, ops]
 * MediaWiki EventLogging integration

ETL Topology
The Storm topology that processes incoming data (aka, "Extract-Transform-Load" in data warehousing jargon).


 * ETL Topology
 * Data Formats

Tasks


 * Maven + Nexus setup w/ storm pom [dsc]
 * Storm Bolts:
 * Consume from Kafka (by topic?)
 * Serialize & canonicalize record using Avro schema
 * GeoIP annotation
 * Mobile carrier annotation (by IP)
 * IP anonymization (where does the salt live? how often should we change it?)
 * Append record to per-hour files in HDFS (path using timestamp, topic)
 * Notifier Bolt to update Kafka consumer checkpoint (and/or publish event for external hooks)
 * Define Avro schemas:
 * WebRequest
 * EventData
 * Figure out convention for EventData packets to self-identify with a more specific schema

Core Jobs
Core data processing jobs for processing web requests and event data, maintained by the Analytics team.


 * Data Formats

Tasks


 * Figure out request tagging -- how can we avoid mutating records?
 * Reportcard Oozie workflow [DvL]
 * Pig Macro for "typical request analytics"
 * Figure out indexing scheme:
 * Aggregated time buckets
 * Referrer chains
 * Event aggregations
 * Funnel Analysis
 * A/B testing event data

Data Tools
Data processing library and toolkit provided & maintained by the Analytics team, esp for use with Kraken.


 * Data Formats

Tasks


 * Setup regular sqoop imports [DvL]
 * Avro De/Serialization storage format compatibility:
 * WebRequest
 * EventData
 * Build re-usable pig/hive library, outsource this to analysts as much as possible [DvL]
 * KV-Pairs parsing tools
 * Conversion funnel analysis tools
 * A/B testing analysis tools
 * MW EventLogging extension integration
 * Investigate Ori's Mediawiki extension for instrumentation, sending event/edit data

Infrastructure
Bucket for general cluster infrastructure and maintenance tasks.
 * Kraken Puppetization
 * Hardware Planning
 * JMX Monitoring Research

Tasks


 * Fix the one busted Cisco machine (an07) [otto]
 * Continue puppetization [otto]
 * JMX Monitoring [otto, dsc]
 * Investigate / experiment / benchmark Hadoop stack [DvL]
 * Data owner services -- export, dashboard, visualizations (on Hue?) [dsc, dan]
 * Hue plugin for Limn integration? [dsc]
 * Get analysts experimenting with Hive/Pig [DvL]
 * General purpose job-runner using Kafka+Storm
 * Setup Hadoop FairScheduler [DvL]

= Research =

Bundling
Investigate size of savings of structure sharing via a short-term (between 1-30sec window) queue.


 * Reorder within window by timestamp? Seq number?
 * Parameters of experiment:
 * Buffer time: 1s, 2s, 5s, 10s
 * Sharded by server
 * Bundle by IP, UA, IP+UA
 * Pull-up fields for structure sharing:
 * Invariant: ip, ua, accept-lang, carrier
 * Often invariant: method, status, referer

Software

 * Hbase
 * Hive
 * Hue plugins
 * Cascading -- do we even need this as we have Oozie?