Analytics/Infrastructure
| Group: | Analytics |
| Start: | 2011-05-22 |
| End: | |
| Team: | Andrew Otto, David Schoonover, Diederik van Liere |
| Status: | See updates |
Kraken is the code-name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.
Status
We're also now importing 1:1000 traffic streams, enabling us to migrate reports from our legacy analytics platform, WikiStats, onto our big data cluster, Kraken. In the future, this will make it easier for us to publish data and visualize reports using our newer infrastructure.
We have implemented secure login to the User Metrics API via SSL. We've also introduced a new metric called <code>pages_created</code>, allowing us to count the number of pages created by a specific editor.
We improved the accuracy of the udp2log monitoring and upgraded the machines to Ubuntu Precise in order to make the system more robust.
Rationale
The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. It permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; our tools are ad-hoc.
Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.
Documentation
- Getting Access
- Overview
- Data
- Pixel Service Endpoint
- Request Logging
- Hardware Planning
- Notes
- Meeting Notes
Components
Pixel Service and Request Logging
Public data import endpoint, and system for capturing the incoming firehose from our front-end servers.
Tasks
- Work with Ori, Patrick, & ops to figure out Kafka producer situation for production [otto, dsc, ori, preilly, ops]
- MediaWiki EventLogging integration
ETL Topology
The Storm topology that processes incoming data (aka, "Extract-Transform-Load" in data warehousing jargon).
Tasks
- Maven + Nexus setup w/ storm pom [dsc]
- Storm Bolts:
  - Consume from Kafka (by topic?)
  - Serialize & canonicalize record using Avro schema
  - GeoIP annotation
  - Mobile carrier annotation (by IP)
  - IP anonymization (where does the salt live? how often should we change it?)
  - Append record to per-hour files in HDFS (path using timestamp, topic)
  - Notifier Bolt to update Kafka consumer checkpoint (and/or publish event for external hooks)
- Define Avro schemas:
  - WebRequest
  - EventData
- Figure out convention for EventData packets to self-identify with a more specific schema
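Two of the bolt steps above, IP anonymization and per-hour file placement, can be sketched in plain Python. This is only an illustration under stated assumptions, not the topology's implementation: the keyed hash (HMAC-SHA256), the path root `/kraken/raw`, the `.avro` suffix, and the record field names `ip`, `topic`, and `timestamp` are all hypothetical. Where the salt lives and how often it rotates remain the open questions noted in the task list.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Tuple


def anonymize_ip(ip: str, salt: bytes) -> str:
    """Return a keyed hash of the IP so the raw address is not stored.

    HMAC-SHA256 is an assumption made for this sketch.
    """
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def hourly_path(topic: str, ts: float, root: str = "/kraken/raw") -> str:
    """Build a per-hour file path from the topic and the record timestamp (UTC).

    The root directory and file suffix are hypothetical.
    """
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{root}/{topic}/{dt:%Y/%m/%d/%H}.avro"


def transform(record: dict, salt: bytes) -> Tuple[str, dict]:
    """Anonymize one record and choose its destination path.

    Works on a copy, so the input record is not mutated.
    """
    out = dict(record)
    out["ip"] = anonymize_ip(record["ip"], salt)
    return hourly_path(record["topic"], record["timestamp"]), out
```

With the same salt, repeated requests from one address map to the same token, so per-visitor aggregation still works. Changing the salt breaks that linkage, which is the trade-off behind the rotation question.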
Core Jobs
Core data processing jobs for processing web requests and event data, maintained by the Analytics team.
Tasks
- Figure out request tagging -- how can we avoid mutating records?
- Reportcard Oozie workflow [DvL]
- Pig Macro for "typical request analytics"
- Figure out indexing scheme:
- Aggregated time buckets
- Referrer chains
- Event aggregations
- Funnel Analysis
- A/B testing event data
Data Tools
Data processing library and toolkit provided & maintained by the Analytics team, especially for use with Kraken.
Tasks
- Set up regular Sqoop imports [DvL]
- Avro De/Serialization storage format compatibility:
  - WebRequest
  - EventData
- Build a reusable Pig/Hive library, outsourcing this to analysts as much as possible [DvL]
- KV-Pairs parsing tools
- Conversion funnel analysis tools
- A/B testing analysis tools
- MW EventLogging extension integration
- Investigate Ori's MediaWiki extension for instrumentation, sending event/edit data
Infrastructure
Bucket for general cluster infrastructure and maintenance tasks.
Meeting Notes
Tasks
- Fix the one busted Cisco machine (an07) [otto]
- Continue puppetization [otto]
- JMX Monitoring [otto, dsc]
- Investigate / experiment / benchmark Hadoop stack [DvL]
- Data owner services -- export, dashboard, visualizations (on Hue?) [dsc, dan]
- Hue plugin for Limn integration? [dsc]
- Get analysts experimenting with Hive/Pig [DvL]
- General purpose job-runner using Kafka+Storm
- Set up Hadoop FairScheduler [DvL]
Research
Bundling
Investigate how much space structure sharing saves when records are bundled in a short-term queue (a window of 1 to 30 seconds).
- Reorder within window by timestamp? Seq number?
- Parameters of experiment:
  - Buffer time: 1s, 2s, 5s, 10s
  - Sharded by server
  - Bundle by IP, UA, IP+UA
- Pull-up fields for structure sharing:
  - Invariant: ip, ua, accept-lang, carrier
  - Often invariant: method, status, referer
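The experiment outlined above could be prototyped roughly as follows. This is a minimal sketch under several assumptions: records are plain dicts with a numeric `ts` field, windows are fixed (not sliding), the bundle layout (`shared` plus `records`) is invented for illustration, and JSON-encoded length stands in for the real wire or storage size. The function and field names are hypothetical.

```python
import json
from collections import defaultdict

# Fields expected to be constant within a bundle (from the list above;
# "accept_lang" is a hypothetical spelling of accept-lang).
INVARIANT = ("ip", "ua", "accept_lang", "carrier")


def bundle(records, window_s=5.0, key_fields=("ip", "ua")):
    """Group records into fixed time windows, then by key, and pull shared fields up."""
    groups = defaultdict(list)
    for r in records:
        window = int(r["ts"] // window_s)
        key = (window,) + tuple(r[f] for f in key_fields)
        groups[key].append(r)

    bundles = []
    for members in groups.values():
        # Reorder within the window by timestamp.
        members.sort(key=lambda r: r["ts"])
        first = members[0]
        # A field is pulled up only if every record in the bundle agrees on it.
        shared = {
            f: first[f]
            for f in INVARIANT
            if f in first and all(m.get(f) == first[f] for m in members)
        }
        rest = [{k: v for k, v in m.items() if k not in shared} for m in members]
        bundles.append({"shared": shared, "records": rest})
    return bundles


def savings(records, bundles):
    """Fraction of JSON-encoded bytes saved by bundling (a rough proxy only)."""
    raw = sum(len(json.dumps(r)) for r in records)
    packed = sum(len(json.dumps(b)) for b in bundles)
    return 1 - packed / raw
```

Running `bundle` with `window_s` set to 1, 2, 5, and 10 and with `key_fields` set to `("ip",)`, `("ua",)`, and `("ip", "ua")` would cover the parameter grid listed above. For very small bundles the wrapper overhead can outweigh the shared fields, so `savings` may come out negative.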
Software
- HBase
- Hive
- Hue plugins
- Cascading -- do we even need this as we have Oozie?