Analytics/Archive/Roadmap

Our overarching vision is to give the wiki movement a true data services platform: a cluster capable of providing real-time insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices. It must be powerful enough to keep pace with our ample institutional motivation and energy, and robust enough to serve needs that are as yet hidden from view.

This page is the clearinghouse for our planning in both broad strokes and small tasks.

= Projects =

== Kraken ==

 * Set up CDH4 cluster. [otto + dsc] (Sept) [DONE]
 * Load in sample data sets. [otto] (Sept) [DONE]
 * Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
 * First-pass at Hive/Pig Jobs [dsc + otto] (Sept) [DONE]
 * Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
 * WMF Maven parent pom [dsc] (Oct)
 * Puppetize Kraken [otto] (Ongoing)
 * Monitoring
   * Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
   * Hadoop Ganglia Monitoring [otto] (Oct)
 * Get Storm set up [dsc + otto] (Oct)
 * Start work on ETL topology [dsc] (Oct)
 * Hardware reinstallation -- Depends on Ops [otto] (Oct)
 * Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
 * Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)

== Limn ==

 * Bootstrap Dan [dan + dsc] (Sept) [DONE]
 * Refactor charting to use d3 [dan + dsc]
   * Initial Prototype with Options UI (Sept)
   * Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
 * Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
 * Mirror GitHub to Gerrit [dsc] (Sept)
 * Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
 * Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
 * Coke task to download and setup dummy testing data for ease of development [dsc] (Sept)
 * UI support for remote datasets via proxy [dsc + dan] (Oct)
 * Migrate Dario's dashboards to Limn [dsc] (Sept)
 * Support the Global Dev dashboard [evan] (Ongoing)
 * Support the Gerrit Stats dashboard [diederik] (Ongoing)
 * Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)

== Legacy Log Collection ==

 * Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
 * udp2log filters
   * Update filters for Wikipedia Zero [otto] (Ongoing)
   * Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
   * udp-filter to filter by HTTP status. [otto] (Oct)

== WikiStats ==

 * Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
 * Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)

== Infrastructure ==

 * Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
 * Maintenance of oxygen/emery/locke [otto] (Ongoing)

== Data Releases ==

 * Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
 * Create Data Release Practices Task Force [diederik] (Sept)
 * Start pushing datasets to AWS [diederik] (Oct)
 * Finalize scripts to massively compact dammit.lt data [erik] (Oct)
 * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)

= Milestones =

== September ==

 * (Kraken) Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
 * Load in sample data sets. [otto] (Sept)
 * Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
 * First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
 * (Kraken) Puppetize Kraken [otto] (Ongoing)
 * (Legacy Log Collection) Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
 * (Data) Create Data Release Practices Task Force [diederik] (Sept)
 * (Limn) Bootstrap Dan [dan + dsc] (Sept) [DONE]
 * (Limn) Refactor charting to use d3 [dan + dsc]
   * Initial Prototype with Options UI (Sept)
 * (Limn) Mirror GitHub to Gerrit [dsc] (Sept)
 * (Limn) Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
 * Coke task to download and setup dummy testing data for ease of development [dsc] (Sept)
 * (Limn) Migrate Dario's dashboards to Limn [dsc] (Sept)
 * (Limn) Support the Global Dev dashboard [evan] (Ongoing)
 * (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)

== October ==

 * (Kraken) Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
 * WMF Maven parent pom [dsc] (Oct)
 * (Kraken) Puppetize Kraken [otto] (Ongoing)
 * (Kraken) Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
 * (Kraken) Get Storm set up [dsc + otto] (Oct)
 * Start work on ETL topology [dsc] (Oct)
 * (Kraken) Hardware reinstallation -- Depends on Ops [otto] (Oct)
 * (Kraken) Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
 * Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)
 * (Legacy Log Collection) udp2log filters
   * Update filters for Wikipedia Zero [otto] (Ongoing)
   * Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
   * udp-filter to filter by HTTP status. [otto] (Oct)
 * (WikiStats) Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * (WikiStats) Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * (WikiStats) Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
 * (Ops & Maintenance) Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * (Ops & Maintenance) Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
 * (Ops & Maintenance) Maintenance of oxygen/emery/locke [otto] (Ongoing)
 * (Data) Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
 * (Data) Start pushing datasets to AWS [diederik] (Oct)
 * (Data) Finalize scripts to massively compact dammit.lt data [erik] (Oct)
 * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)
 * (Limn) Refactor charting to use d3 [dan + dsc]
   * Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
 * (Limn) Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
 * (Limn) Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
 * (Limn) UI support for remote datasets via proxy [dsc + dan] (Oct)
 * (Limn) Support the Global Dev dashboard [evan] (Ongoing)
 * (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)
 * (Limn) Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)

= Team Planning =


 * 2012-2013 Team Roadmap
 * Cluster Hardware Planning
 * Meeting Notes