Analytics/Archive/Roadmap

Our overarching vision is to give the wiki movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices. It must be powerful enough to keep pace with our ample institutional motivation and energy, and robust enough to serve needs that are as yet hidden from view.

This page is the clearinghouse for our planning in both broad strokes and small tasks.

This plan originally assumed we would have the Dells back and set up by Dec 1. Update: the Dells are back and installed as of 12 Nov! Woo!

= Projects =

== Kraken ==

 * Pixel Service
 * First Prototype: udp2log -> product code topic filters -> Kafka -> Hadoop [otto] (Nov 9) [DONE]
 * Pixel Service documentation [dsc] (Nov) [DONE]
 * Work out Event Data conventions for proxied fields / normal web request components [dsc, e3] (Nov) [DONE]
 * Product Code wiki [dsc] (Nov) [DONE]
 * Email stakeholders and gather feedback [dsc] (Dec)
 * Reinstall Dells [otto] (Nov) [DONE]
 * Fix Cisco machines [otto] (Nov) (Awaits ops help) [DONE]
 * Puppetize Kraken [otto] (Ongoing)
 * Monitoring [otto] (Nov)
 * Ganglia Monitoring [otto] (Nov) [DONE]
 * Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Nov) [DONE]
 * Solution Research page [dsc] (Nov) [DONE]
 * Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Dec)
 * WMF Maven parent pom [dsc] (Oct) [DONE]
 * Walkthrough to get up and running with Maven in Eclipse [dsc] (Dec)
 * Storm pom [dsc] (Dec)
 * Core Jobs: aggregation, bucketing [dsc, DvL, others] (Dec)
 * Data Formats wiki page [dsc + diederik] (Nov) [DONE]
 * Avro Schemas for Request & Event Data records [dsc + diederik] (Dec) [DONE]
 * Set up full pixel/firehose (w/o ETL) via  [otto] (Dec) (Ready to go, holding off on flipping switch)
 * Figure out varnishncsa log format for event stream [otto + dsc] (Nov) [DONE]
 * Get to consensus with Ops regarding logging of the firehose [dsc + otto] [DONE]
 * Research needed: test whether running CLI JVM producers uses acceptable resources [otto] (Oct) [not doing]
 * Get Storm set up [dsc + otto] (Dec)
 * Storm ETL bolts for (some of) GeoIP, Anonymization, HDFS Import, Kafka Checkpointing [dsc/otto] (Dec)
 * Consume some fraction (1:1000, 1:10000?) of web access logs [otto] (Dec)
 * Consume 1:1 web access logs into HDFS with ETL + Bucketing + Tagging [otto] (Jan)

== Limn ==
Major release (0.8) planned for 5 Dec (prior to Metrics Meeting on 6 Dec)
 * Refactor charting to use d3 [dan + dsc]
 * Feature Parity with Dygraphs (plus bugfixes, etc) (Dec)
 * Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Dec)
 * Mirror GitHub to Gerrit [dsc] (Jan)
 * Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Jan)
 * Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Jan)
 * Coke task to download and set up dummy testing data for ease of development [dsc] (Jan)
 * UI support for remote datasets via proxy [dsc + dan] (Dec)
 * Migrate Dario's dashboards to Limn [dsc] (Dec)
 * Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Dec)
 * Support the Global Dev dashboard [evan] (Ongoing)
 * Support the Gerrit Stats dashboard [diederik] (Ongoing)

== Legacy Log Collection ==

 * Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept) [Status: Made a Baby Instead???]
 * udp2log filters
 * Update filters for Wikipedia Zero [otto] (Ongoing)
 * Filter by X-Carrier headers. [otto + asher + diederik] (Oct) [Status: At Least We Think It's a Baby?]
 * udp-filter to filter by HTTP status. [otto] (Oct) [Status: Maybe It's a T-Rex!!!]

== WikiStats ==
(Stale -- will consult Stefan)
 * Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
 * Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)

== Infrastructure ==

 * Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * Maintenance of oxygen/emery/locke [otto] (Ongoing)
 * Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Dec)

== Data Releases ==

 * Start pushing datasets to AWS [diederik] (Nov) [DONE]
 * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct) [Status: BABYPOCALYPSE 2k12]
 * Finalize scripts to massively compact dammit.lt data [erik] (Oct) [???]
 * Create Data Release Practices Task Force [diederik] (Sept) [Status: Baaaaaby!]
 * Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)

= Team Planning =


 * 2012-2013 Team Roadmap
 * Cluster Hardware Planning
 * Meeting Notes