Analytics/Archive/Roadmap

Our overarching vision is to give the wiki movement a true data services platform: a cluster capable of providing real-time insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices. It must be powerful enough to keep pace with our ample institutional motivation and energy, and robust enough to service needs that are as yet hidden from view.

This page is the clearinghouse for our planning in both broad strokes and small tasks.

This roadmap assumes we have the Dells back and set up by Dec 1!

= Projects =

Kraken

 * First pixel/firehose prototype: udp2log -> product code topic filters -> Kafka -> hadoop [otto] (Nov 9)
 * Product Code wiki & email stakeholders [dsc] (Nov)
 * Reinstall Dells, Fix Cisco machines [otto] (Nov)
 * Puppetize Kraken [otto] (Ongoing)
 * Monitoring [otto] (Nov)
 * Ganglia Monitoring [otto] (Nov)
 * Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Nov)
 * Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Nov)
 * WMF Maven parent pom [dsc] (Oct) [DONE]
 * Walkthrough to get up and running with Maven in Eclipse [dsc] (Nov)
 * Storm pom [dsc] (Nov)
 * Avro Schema for Request records [dsc] (Dec)
 * Core Jobs: aggregation, bucketing [dsc, DvL, others] (Dec)
 * Get to consensus with Ops regarding logging of the firehose [dsc + otto]
 * Research needed: test whether running CLI JVM producers uses acceptable resources [otto] (Oct)
 * Set up full pixel/firehose (w/o ETL) [otto] (Dec)
 * Get Storm set up [dsc + otto] (Dec)
 * Storm ETL bolts for (some of) GeoIP, Anonymization, HDFS Import, Kafka Checkpointing [dsc] (Dec)
 * Consume some fraction (1:1000, 1:10000?) of web access logs [otto] (Dec)
 * Consume 1:1 web access logs into HDFS with ETL + Bucketing + Tagging [otto] (Jan)
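
For the sampled-consumption milestone above (1:1000 or 1:10000 of the web access logs), the simplest approach is probably to keep every Nth line of the stream. The following is only a minimal sketch under that assumption: `sample_one_in_n` is a hypothetical helper written for illustration, not part of udp2log, Kafka, or Kraken, and it treats the log stream as a plain iterable of lines.

```python
from itertools import islice
from typing import Iterable, Iterator


def sample_one_in_n(lines: Iterable[str], n: int) -> Iterator[str]:
    """Return a 1:n sample of `lines` by keeping the n-th, 2n-th, ... line.

    Hypothetical helper for illustration only; a real consumer would read
    from the log stream rather than an in-memory list.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # islice with start=n-1 and step=n keeps the items at 1-based
    # positions n, 2n, 3n, ... without buffering the whole stream.
    return islice(lines, n - 1, None, n)


if __name__ == "__main__":
    demo = [f"line {i}" for i in range(1, 11)]
    # Keep one line in five from a ten-line demo input.
    print(list(sample_one_in_n(demo, 5)))
```

Counting-based sampling like this is deterministic and cheap; a random sample would need a different approach and is not shown here.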

Limn

 * Refactor charting to use d3 [dan + dsc]
 * Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
 * Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
 * Mirror GitHub to Gerrit [dsc] (Oct)
 * Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
 * Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Oct)
 * Coke task to download and setup dummy testing data for ease of development [dsc] (Nov)
 * UI support for remote datasets via proxy [dsc + dan] (Nov)
 * Migrate Dario's dashboards to Limn [dsc] (Nov)
 * Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Nov)
 * Support the Global Dev dashboard [evan] (Ongoing)
 * Support the Gerrit Stats dashboard [diederik] (Ongoing)

Legacy Log Collection

 * Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
 * udp2log filters
 * Update filters for Wikipedia Zero [otto] (Ongoing)
 * Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
 * udp-filter to filter by http status. [otto] (Oct)

WikiStats

 * Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
 * Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)

Infrastructure

 * Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
 * Maintenance of oxygen/emery/locke [otto] (Ongoing)

Data Releases

 * Create Data Release Practices Task Force [diederik] (Sept)
 * Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
 * Start pushing datasets to AWS [diederik] (Oct)
 * Finalize scripts to massively compact dammit.lt data [erik] (Oct)
 * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)

= Team Planning =


 * 2012-2013 Team Roadmap
 * Cluster Hardware Planning
 * Meeting Notes