Analytics/Archive/Roadmap

Our overarching vision is to give the wiki movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices. It must be powerful enough to keep pace with our ample institutional motivation and energy, and robust enough to serve needs that are as yet hidden from view.

This page is the clearinghouse for our planning in both broad strokes and small tasks.

= Projects =

Kraken

 * Product Codes: Email stakeholders and gather feedback [dsc] (Dec)
 * Ensure Mobile is sending data with a product code!! [dsc] (Dec)
 * EventLogging extension integration
 * Make sure all event data goes into Kraken (I think it may only be esams at the moment, not sure) [otto] (Dec)
 * Divvy up TODOs [otto + dsc + ori] (Dec)
 * Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Dec)
 * WMF Maven parent pom [dsc] (Oct) [DONE]
 * Walkthrough to get up and running with Maven in Eclipse [dsc] (Dec)
 * Storm pom [dsc] (Jan)
 * Core Jobs: aggregation, bucketing [dsc, DvL, others] (Dec)
 * Import MediaWiki tables using Sqoop [dvl] (Dec)
 * Create tool to generate Sqoop import statements and Oozie workflow documents (Sqoopy) [dvl] (Dec) [DONE]
 * Fine-tune Kraken configuration (Hue, Hive, Oozie) [dvl + ottomata + dsc] (Dec)
 * Storm ETL
 * Get Storm set up [dsc + otto] (Dec) [DONE]
 * Storm ETL bolt for GeoIP [dsc + otto] (Jan)
 * Storm ETL bolt for HDFS Import [dsc + otto] (Jan)
 * HDFS Import enforces product-code directories and permissions [dsc + otto] (Jan)
 * Storm ETL bolt for Anonymization [dsc + otto] (Jan)
 * Storm ETL bolt for Kafka Checkpointing [dsc + otto] (Jan)
 * Puppetize Kraken [otto] (Ongoing)

Limn

 * Refactor charting to use d3 [dan + dsc] (Jan)
 * Feature Parity with Dygraphs (plus bugfixes, etc) (Jan)
 * Regular deploys to dev-reportcard.wmflabs.org [dan] (Dec)
 * Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Jan)
 * Migrate Dario's dashboards to Limn [dsc + dan] (Jan)
 * Mirror GitHub to Gerrit [dsc] (Dec)
 * Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Jan)
 * Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Jan)
 * Coke task to download and set up dummy testing data for ease of development [dsc] (Jan)
 * UI support for remote datasets via proxy [dsc + dan] (Jan)
 * Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Jan)
 * Support the Global Dev dashboard [evan] (Ongoing)
 * Support the Gerrit Stats dashboard [diederik] (Ongoing)

Legacy Log Collection

 * Add support for new domain names in webstatscollector (blog, etc) [diederik] (Dec) [DONE]
 * udp2log filters
 * Update filters for Wikipedia Zero [otto] (Ongoing)
 * Filter by X-Carrier headers [otto + asher + diederik] (Jan) [Waiting for Fundraiser to finish]
 * udp-filter to filter by HTTP status [otto] (Jan) [Waiting for Fundraiser to finish]

WikiStats

 * Set up Jenkins support [stefan] (Dec) [DONE]
 * Wikistats for editors [stefan] (Dec) [DONE]
 * Decouple wikistats from stat1 (so you can run it locally) [stefan] (Dec) [DONE]
 * Fix country-related mobile pageview bugs [stefan / diederik] (Dec) [DONE]
 * Git migration [ezachte/diederik] (Nov/Dec) [DONE]
 * Reduce backlog regarding Wikistats traffic (squid etc) scripts [diederik / stefan] (Oct)
 * Repair data errors in Wikistats, and add process for checking data integrity [stefan / diederik / ezachte] (Sept)
 * Make Wikistats more robust (MoM validations) [stefan] (Oct)
 * Add black-box testing to Wikistats [diederik + stefan] (Oct)

Infrastructure

 * Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * Maintenance of oxygen/emery/locke [otto] (Ongoing)
 * Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Jan)

Data Releases

 * Start pushing datasets to AWS [diederik] (Nov) [DONE]
 * Blogpost about what awesome stuff you can do with the AWS datasets [diederik] (Jan)
 * Finalize scripts to massively compact dammit.lt data [erik] (Oct)
 * Deploy stats.grok.se on stat1001 [diederik/dario/andrew] (Dec/Jan)
 * Create Data Release Practices Task Force [diederik/dario] (waiting for feedback from Dario) (Sept)
 * Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)

= Team Planning =


 * 2012-2013 Team Roadmap
 * Cluster Hardware Planning
 * Q2 Quarterly Review
 * Meeting Notes