Analytics/Kraken

Kraken is the code-name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.

= Status =

= Rationale =

The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. It permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; our tools are ad-hoc.

Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

= Documentation =




 * Cluster Dataflow Diagram
 * Product Codes
 * Data Formats
 * Pixel Service Endpoint
 * Request Logging
 * Hardware Planning
 * Notes
 * Test Cluster Setup Notes
 * Hadoop Setup Notes
 * Blurbs About Kraken

= Components =

Pixel Service and Request Logging
Public data import endpoint, and system for capturing the incoming firehose from our front-end servers.


 * Request Logging
 * System Recommendation
 * Distributed Logging Systems Research
 * Feature Comparison Spreadsheet
 * Pixel Service Endpoint

Tasks


 * Product Code wiki page [dsc]
 * Prototype
 * on bits multicast stream -> udp2log 1:1 on analytics cluster [otto]
 * ...until bits caches are set up, we'll work with a dummy endpoint on an01 [otto, dsc]
 * Kafka consumes udp2log, creating topic per product-code [otto]
 * cron for kafka-hadoop consumer w/ path of product-code+datetime [otto]
 * Work with ops to figure out Kafka producer situation for production [otto, dsc, ops]
 * Event Data Conventions for proxied timestamp, referrer; normal web request components [dsc, e3]
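
The two naming tasks above (one Kafka topic per product code, and an import path built from product code plus datetime) could be sketched roughly as follows. The topic prefix, the `/kraken/raw` root, and the hourly directory layout are all hypothetical placeholders; nothing here reflects a decided convention.

```python
from datetime import datetime, timezone

# Hypothetical values; the real prefix and HDFS root are not decided on this page.
TOPIC_PREFIX = "event"
HDFS_ROOT = "/kraken/raw"


def topic_for(product_code: str) -> str:
    """Return the Kafka topic name for a product code (assumed naming scheme)."""
    return f"{TOPIC_PREFIX}-{product_code.strip().lower()}"


def import_path(product_code: str, ts: datetime, root: str = HDFS_ROOT) -> str:
    """Build an HDFS directory path from product code + datetime.

    `ts` must be timezone-aware; it is normalized to UTC so that every
    consumer run writes to the same directory for the same instant.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    return f"{root}/{product_code.strip().lower()}/{utc:%Y/%m/%d/%H}"
```

A cron-driven consumer would call `import_path` once per run to decide where to write each topic's batch.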

ETL Topology
The Storm topology that processes incoming data (known as "Extract-Transform-Load", or ETL, in data-warehousing jargon).


 * ETL Topology
 * Data Formats

Tasks


 * Maven + Nexus setup w/ storm pom [dsc]
 * Set up Storm [otto]
 * Bolts:
   * Consume from Kafka (by topic?)
   * Serialize & canonicalize record using Avro schema
   * GeoIP annotation
   * Mobile carrier annotation (by IP)
   * IP anonymization (where does the salt live? how often should we change it?)
   * Append record to per-minute files in HDFS (path using timestamp, topic)
   * Notifier Bolt to update Kafka consumer checkpoint (and/or publish event for external hooks)
 * Define Avro schemas:
   * WebRequest
   * EventData
 * Research Crane (Java ETL framework used by FB and Twitter)
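
To make the IP anonymization question concrete, here is a minimal sketch of one possible approach: a keyed hash (HMAC-SHA256) over the packed address. This is an assumption for illustration, not the chosen design; where the salt lives and how often it rotates remain open, so the salt is simply passed in as a parameter.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ip(ip: str, salt: bytes) -> str:
    """Return a hex digest that replaces the raw IP in a record.

    The address is parsed first so that equivalent spellings (for example
    "::1" and "0:0:0:0:0:0:0:1") map to the same token. The same IP yields
    the same token only while the salt is unchanged, so rotating the salt
    breaks linkability across rotation periods.
    """
    packed = ipaddress.ip_address(ip).packed
    return hmac.new(salt, packed, hashlib.sha256).hexdigest()
```

Because the token depends on the salt, GeoIP and mobile carrier annotation would need to run before this step, while the raw address is still available.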

Core Jobs
Core data processing jobs for processing web requests and event data, maintained by the Analytics team.


 * Data Formats

Tasks


 * Figure out request tagging -- how can we avoid mutating records?
 * Reportcard Oozie workflow [DvL]
 * Figure out indexing scheme:
   * Aggregated time buckets
   * Referrer chains
   * Event aggregations
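
As one illustration of what "aggregated time buckets" might mean, the sketch below floors each timestamp to the start of a fixed-width bucket and counts records per bucket. The bucket width and the choice to align buckets to the Unix epoch are assumptions; the actual indexing scheme is still to be worked out.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts: datetime, width: timedelta) -> datetime:
    """Floor a timezone-aware timestamp to the start of its bucket."""
    n = (ts - EPOCH) // width  # whole number of buckets since the epoch
    return EPOCH + n * width


def count_by_bucket(timestamps, width: timedelta) -> Counter:
    """Count how many timestamps fall into each bucket."""
    return Counter(bucket_start(ts, width) for ts in timestamps)
```

Coarser buckets (hourly, daily) can then be derived by re-bucketing the keys of a finer-grained count rather than rescanning raw records.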

Data Tools
Data processing library and toolkit provided and maintained by the Analytics team, especially for use with Kraken.


 * Data Formats

Tasks


 * Set up regular Sqoop imports [DvL]
 * Avro De/Serialization storage format compatibility:
   * WebRequest
   * EventData
 * Build re-usable Pig/Hive library, outsource this to analysts as much as possible [DvL]
 * KV-Pairs parsing tools
 * Conversion funnel analysis tools
 * A/B testing analysis tools
 * JS Event logging library
 * Investigate E3 client logging library [dsc, e3]
 * MediaWiki extension for instrumentation, sending event/edit data
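
For the KV-pairs parsing item above, a minimal sketch might look like the following. It assumes the pairs arrive as a URL-encoded query string (as a pixel-style endpoint would receive them); the actual event data format is not specified here, so treat the function name and behavior as hypothetical.

```python
from urllib.parse import parse_qsl


def parse_kv(query: str) -> dict:
    """Parse a URL-encoded query string into a flat dict.

    Blank values are kept so that "key=" is distinguishable from a missing
    key. If a key repeats, the last occurrence wins.
    """
    return dict(parse_qsl(query.lstrip("?"), keep_blank_values=True))
```

A library version would likely also need to validate keys against the event data conventions once those are defined.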

Infrastructure
Bucket for general cluster infrastructure and maintenance tasks.
 * Kraken Puppetization
 * Hardware Planning
 * JMX Monitoring Research

Tasks


 * Reinstall Dells [otto]
 * Fix Cisco machines (an02, an07) [otto]
 * Continue puppetization [otto]
 * Ganglia monitoring [otto]
 * JMX Monitoring [otto, dsc]
 * Investigate / experiment / benchmark Hadoop stack [DvL]
 * Data owner services -- export, dashboard, Limn integration (on Hue?) [dsc, dan]
 * Hue plugin for Limn integration? [dsc]
 * Get analysts to experiment with Hive/Pig [DvL]
 * General purpose job-runner using Kafka+Storm
 * Set up Hadoop FairScheduler

= Research =
 * HBase
 * Hive
 * Hue plugins
 * Cascading -- do we even need this as we have Oozie?
 * Crane -- http://www.theregister.co.uk/2010/06/29/twitter_to_open_source_crane/ -- this article dates from 2010; although I've heard Crane mentioned since, we shouldn't count on it being open-sourced anytime soon.