Analytics/Kraken

Rationale
The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. It permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; our tools are ad hoc.

Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

Documentation



 * Cluster Dataflow Diagram
 * Request Logging -- capturing the incoming firehose from our front-end servers.
 * System Recommendation
 * Distributed Logging Systems Research
 * Feature Comparison Spreadsheet
 * Pixel Service Endpoint
 * Product Codes
 * Hardware Planning
 * Notes
 * Test Cluster Setup Notes
 * Hadoop Setup Notes

Planning
This task list represents all known "near-future" work for the cluster, broken out by sub-component.

Blockers

 * Waiting for replacement hardware (need more space)
 * Two remaining Cisco machines (an02, an07) have undiagnosed problems

Pixel Service
Public data import endpoint.


 * Product Code wiki page [dsc]
 * Prototype
 * /event.gif requests on the bits multicast stream -> udp2log 1:1 on the analytics cluster [otto]
 * Until the bits caches are set up, we'll work with a dummy endpoint on an01 [otto, dsc]
 * Kafka consumes the udp2log stream, creating a topic per product-code [otto]
 * Cron job for the kafka-hadoop consumer, writing paths keyed by product-code + datetime [otto]
 * Work with ops to figure out Kafka producer situation for production [otto, dsc, ops]
 * Event Data Conventions for proxied timestamp, referrer; normal web request components [dsc, e3]
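The contract sketched by the tasks above can be illustrated in miniature: serve a 1x1 transparent GIF so the request looks like an ordinary image fetch, and treat the query string as the event payload. This is a toy sketch, not the production endpoint (which will live on the bits caches); the `pc` parameter name for the product-code is a placeholder, not an agreed convention.

```python
import urllib.parse

# A 1x1 transparent GIF, returned to every client so the request
# looks like a normal image fetch while the query string carries
# the event data.
PIXEL_GIF = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00"
             b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
             b"\x00\x00\x02\x02D\x01\x00;")

def parse_event(path):
    """Split an /event.gif request path into (product_code, params).

    Assumes the product-code arrives as a query parameter named 'pc';
    that name is a placeholder, not the documented convention.
    """
    query = urllib.parse.urlsplit(path).query
    params = dict(urllib.parse.parse_qsl(query))
    return params.pop("pc", "unknown"), params
```

The product-code returned here would become the Kafka topic, and the remaining params the event record.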

ETL
Storm topology for processing incoming data.


 * Maven + Nexus setup w/ storm pom [dsc]
 * Setup Storm [otto]
 * Bolts:
   * Consume from Kafka (by topic?)
   * Serialize & canonicalize record using Avro schema
   * GeoIP annotation
   * Mobile carrier annotation (by IP)
   * IP anonymization (where does the salt live? how often should we change it?)
   * **Append** record to per-minute files in HDFS (path using timestamp, topic)
   * Notifier Bolt to update Kafka consumer checkpoint (and/or publish event for external hooks)
 * Define Avro schema for messages
 * Research Crane (Java ETL framework used by FB and Twitter)
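One option for the IP-anonymization bolt is a keyed hash: records within one salt period stay joinable (same IP maps to the same token), while the raw address is unrecoverable without the salt. This is a sketch of that option, not a decided design; salt storage and rotation remain the open questions flagged above.

```python
import hashlib
import hmac
import ipaddress

def anonymize_ip(ip, salt):
    """Replace an IP address with a keyed (HMAC-SHA256) hash.

    Same ip + same salt -> same token, so joins within a salt
    period still work; rotating the salt breaks linkability
    across periods. 'salt' is a bytes key; where it lives and
    how often it rotates is an open design question.
    """
    ipaddress.ip_address(ip)  # validate before hashing
    digest = hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]        # truncated for compactness
```

Truncating the digest trades collision resistance for storage; whether 16 hex chars is enough depends on traffic volume and would need checking.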

Core Jobs
Core data processing jobs maintained by the Analytics team.


 * Figure out request tagging -- how can we avoid mutating records?
 * Setup regular sqoop imports [DvL]
 * Reportcard Oozie workflow [DvL]
 * Figure out indexing scheme:
   * Aggregated time buckets
   * Referrer chains
   * Event aggregations

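A minimal illustration of the "aggregated time buckets" idea, assuming hourly buckets and (timestamp, topic) records; the bucket size and key shape here are placeholders, not the scheme to be decided above.

```python
from collections import Counter
from datetime import datetime, timezone

def hour_bucket(ts):
    """Truncate a unix timestamp to its containing UTC hour."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:00")

def aggregate(records):
    """Count records per (hour, topic) pair -- the simplest form
    of time-bucketed aggregation. 'records' is an iterable of
    (unix_timestamp, topic) tuples."""
    return Counter((hour_bucket(ts), topic) for ts, topic in records)
```

In practice this would run as a Pig/Hive or Oozie-scheduled job over HDFS rather than in-process, but the bucketing key is the same idea.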
Data Tools
Data processing library and toolkit provided & maintained by the Analytics team, especially for use with Kraken.


 * Avro request-message (de)serialization and storage format
 * Build re-usable pig/hive library, outsource this to analysts as much as possible [DvL]
 * KV-Pairs parsing tools
 * Conversion funnel analysis tools
 * A/B testing analysis tools
 * JS Event logging library
 * Investigate E3 client logging library [dsc, e3]
 * Mediawiki extension for instrumentation, sending event/edit data
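A sketch of what the KV-pairs parsing tool might look like; the `;` and `=` separators are assumptions, since the actual log field format isn't specified here.

```python
def parse_kv_pairs(field, pair_sep=";", kv_sep="="):
    """Parse a packed key-value field like 'a=1;b=2' into a dict.

    Separator defaults are guesses -- the real Kraken log format
    may differ. Pairs without a separator are kept as flag-style
    keys with the value True; empty pairs are skipped.
    """
    out = {}
    for pair in filter(None, field.split(pair_sep)):
        if kv_sep in pair:
            key, _, value = pair.partition(kv_sep)
            out[key.strip()] = value.strip()
        else:
            out[pair.strip()] = True
    return out
```

A library version would presumably ship as a Pig/Hive UDF alongside the funnel and A/B-testing tools listed above.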

General
General cluster tasks.


 * Reinstall Dells [otto]
 * Fix Cisco machines (an02, an07) [otto]
 * Continue puppetization [otto]
 * Ganglia monitoring [otto]
 * JMX Monitoring [otto, dsc]
 * Investigate / experiment / benchmark Hadoop stack [DvL]
 * Data owner services -- export, dashboard, Limn integration (on Hue?) [dsc, dan]
 * Hue plugin for Limn integration? [dsc]
 * Get analysts experimenting with Hive/Pig [DvL]
 * General purpose job-runner using Kafka+Storm
 * Setup Hadoop FairScheduler
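For the FairScheduler task, the scheduler reads an allocations file defining pools and their relative shares. A minimal sketch with placeholder pool names and weights (the real pools and shares would need to be agreed with the team):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Core scheduled jobs get twice the share of ad-hoc analyst work.
       Pool names and weights are placeholders, not a decided config. -->
  <pool name="analytics">
    <weight>2.0</weight>
  </pool>
  <pool name="adhoc">
    <weight>1.0</weight>
  </pool>
</allocations>
```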

See the Analytics Team Roadmap for milestones and timeline.

Research

 * Hbase
 * Hive
 * Hue plugins
 * Cascading -- do we even need this, given we have Oozie?
 * Crane -- http://www.theregister.co.uk/2010/06/29/twitter_to_open_source_crane/ -- given this was announced in 2010, though I've heard it mentioned since, it seems we shouldn't count on it being open-sourced anytime soon.