Analytics/Roadmap/PlanningMeetings/2012 Sept 20

Notes for the Team Analytics roadmap planning meeting for 20 Sept 2012, taken by Dave Schoonover.

= Need to create a set of Data Release Processes =
 * Search log data release contained unacceptable data. How can we prevent this in the future?
 * Diederik on point
 * Conversation with: Team Analytics, robla, Moeller, Legal, Dario, Chris MzMcBride?, Chris Steipp?
 * Who else might have a valued/paranoid perspective? MZMcBride??
 * Process for NEW datasets, as well as a smoke check of each data upload prior to public notice
 * Concrete threads:
 * For any new datastream: "What's the Attack Surface?"
 * Need to spend more time thinking about what sort of privacy exploits are possible
 * Strip all (no matter if we take other steps):
   * URLs? Spam, SEO links, etc
   * Email addresses
   * IP addresses
 * What criteria for k-anonymity are we going to use (if any)? --> Publish behavioral/request data only as aggregates
 * Followup on this release:
   * Disclosure requirements
   * Legal is looking into the impact and our obligations
   * Need to convey clearly to the community what happened and what we're doing
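
As a starting point for the "strip all" and "aggregates only" threads above, here is a minimal sketch, not an agreed process. The regular expressions are deliberately simple and will miss cases (IPv6 addresses, obfuscated emails). The function names `scrub` and `aggregate_with_threshold` and the threshold `k` are hypothetical. Dropping values seen fewer than `k` times is only a rough stand-in for a real k-anonymity criterion.

```python
import re
from collections import Counter

# Illustrative patterns only; they are NOT exhaustive (no IPv6, for example).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text):
    """Replace URLs, email addresses, and IPv4 addresses with placeholders."""
    # URLs go first so that an IP or email embedded in a URL is removed with it.
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IPV4_RE.sub("[IP]", text)
    return text


def aggregate_with_threshold(values, k):
    """Count occurrences and keep only values seen at least k times.

    This is a simplification: it publishes aggregates and suppresses rare
    values, which is weaker than a full k-anonymity analysis.
    """
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= k}
```

A smoke check before each upload could run `scrub` over a sample of rows and fail the release if any row changes, since a change means one of the patterns was still present in the data.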

= Milestone Planning =

Kraken

 * Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
   * Load in sample data sets. [otto] (Sept)
   * Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
   * First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
 * Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
   * WMF Maven parent pom [dsc] (Oct)
 * Puppetize Kraken [otto] (Ongoing)
 * Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
 * Get Storm set up [dsc + otto] (Oct)
   * Start work on ETL topology [dsc] (Oct)
 * Hardware reinstallation -- Depends on Ops [otto] (Oct)
 * Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
   * Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)

Legacy Log Collection

 * Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
 * udp2log filters
   * Update filters for Wikipedia Zero [otto] (Ongoing)
   * Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
   * udp-filter to filter by http status. [otto] (Oct)
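
To make the "filter by http status" item concrete, here is a rough sketch of the filtering logic. It does not reflect how udp-filter is actually implemented. The field position (`field_index=5`), the space separator, and the `RESULT/STATUS` token shape are assumptions about the log line layout and need checking against the real stream. The function names are hypothetical.

```python
def status_of(line, field_index=5, sep=" "):
    """Return the HTTP status found in one log line, or None if absent.

    ASSUMPTION: the status sits in field `field_index`, either bare ("404")
    or as the tail of a "RESULT/404" style token.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) <= field_index:
        return None
    status = fields[field_index].rsplit("/", 1)[-1]
    if status.isascii() and status.isdigit():
        return int(status)
    return None


def filter_by_status(lines, wanted, field_index=5):
    """Yield only the lines whose HTTP status is in the `wanted` set."""
    for line in lines:
        if status_of(line, field_index) in wanted:
            yield line
```

Hooked up to a log stream on standard input, this could be driven with `sys.stdout.writelines(filter_by_status(sys.stdin, {404, 500}))`. Malformed or short lines are dropped rather than raising.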

WikiStats

 * Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * Repair data errors in wikistats, and add process for checking data integrity [ezachte] (Sept)
 * Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
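
For the "MoM validations" item, here is one possible shape for the check, assuming MoM means month-over-month. This is a sketch only, not how wikistats works today. The function name `mom_anomalies`, the `(month, value)` input format, and the 50% default threshold are all hypothetical.

```python
def mom_anomalies(series, max_change=0.5):
    """Flag months whose value moved more than `max_change` vs. the prior month.

    `series` is a chronologically ordered list of (month_label, value) pairs.
    Returns a list of (month_label, relative_change) for flagged months;
    relative_change is None when the previous value was zero.
    """
    flagged = []
    for (_, prev), (month, cur) in zip(series, series[1:]):
        if prev == 0:
            # Relative change is undefined; flag any move away from zero.
            if cur != 0:
                flagged.append((month, None))
            continue
        change = (cur - prev) / prev
        if abs(change) > max_change:
            flagged.append((month, change))
    return flagged
```

A flagged month is only a prompt for a human to look at the data. Real traffic does sometimes move that much, so the threshold would need tuning per metric.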

Ops & Maintenance

 * Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
 * Maintenance of oxygen/emery/locke [otto] (Ongoing)

Data

 * Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
 * Create Data Release Practices Task Force [diederik] (Sept)
 * Start pushing datasets to AWS [diederik] (Oct)
 * Finalize scripts to massively compact dammit.lt data [erik] (Oct)
   * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)

Limn

 * Bootstrap Dan [dan + dsc] (Sept) [DONE]
 * Refactor charting to use d3 [dan + dsc]
   * Initial Prototype with Options UI (Sept)
   * Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
 * Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
 * Mirror GitHub to Gerrit [dsc] (Sept)
 * Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
 * Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
 * Coke task to download and setup dummy testing data for ease of development [dsc] (Sept)
 * UI support for remote datasets via proxy [dsc + dan] (Oct)
 * Migrate Dario's dashboards to Limn [dsc] (Sept)
 * Support the Global Dev dashboard [evan] (Ongoing)
 * Support the Gerrit Stats dashboard [diederik] (Ongoing)
 * Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)

September

 * (Kraken) Set up Cassandra cluster, get it working with Hadoop. [otto + dsc] (Sept)
   * Load in sample data sets. [otto] (Sept)
   * Tee the udp2log stream into Kraken. [otto + dsc] (Sept)
   * First-pass at Hive/Pig Jobs [dsc + otto] (Sept)
 * (Kraken) Puppetize Kraken [otto] (Ongoing)
 * (Legacy Log Collection) Add support for new domain names in webstatscollector (blog, etc) [diederik] (Sept)
 * (Data) Create Data Release Practices Task Force [diederik] (Sept)
 * (Limn) Bootstrap Dan [dan + dsc] (Sept) [DONE]
 * (Limn) Refactor charting to use d3 [dan + dsc]
   * Initial Prototype with Options UI (Sept)
 * (Limn) Mirror GitHub to Gerrit [dsc] (Sept)
 * (Limn) Coke ( for Coco) task to create symlinks into   from an existing data repository (such as, say, analytics/reportcard/data) [dsc] (Sept)
 * Coke task to download and setup dummy testing data for ease of development [dsc] (Sept)
 * (Limn) Migrate Dario's dashboards to Limn [dsc] (Sept)
 * (Limn) Support the Global Dev dashboard [evan] (Ongoing)
 * (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)

October

 * (Kraken) Set up Maven / Sonatype Artifact Repository (continuous integration) [dsc] (Oct)
   * WMF Maven parent pom [dsc] (Oct)
 * (Kraken) Puppetize Kraken [otto] (Ongoing)
 * (Kraken) Set up JMX monitoring -- needs to be on our LAN [otto + dsc] (Oct)
 * (Kraken) Get Storm set up [dsc + otto] (Oct)
   * Start work on ETL topology [dsc] (Oct)
 * (Kraken) Hardware reinstallation -- Depends on Ops [otto] (Oct)
 * (Kraken) Get to consensus with Ops regarding logging of the firehose [dsc + otto] (Oct)
   * Research needed: test that running CLI JVM producers does not cause extra load [otto] (Oct)
 * (Legacy Log Collection) udp2log filters
   * Update filters for Wikipedia Zero [otto] (Ongoing)
   * Filter by X-Carrier headers. [otto + asher + diederik] (Oct)
   * udp-filter to filter by http status. [otto] (Oct)
 * (WikiStats) Reduce backlog regarding Wikistats traffic (squid etc) scripts [stefan] (Oct)
 * (WikiStats) Make wikistats more robust (MoM validations) [ezachte] (Oct)
 * (WikiStats) Add Blackbox testing to WikiStats [diederik + ezachte] (Oct)
 * (Ops & Maintenance) Access/support requests for stat1, stat1001 [otto] (Ongoing)
 * (Ops & Maintenance) Migrate Reportcard off Labs onto stat1001 -- reportcard.wikimedia.org [otto + dsc] (Oct)
 * (Ops & Maintenance) Maintenance of oxygen/emery/locke [otto] (Ongoing)
 * (Data) Publish Monthly Report Card -- deal with monthly data processing irregularities, perform correction/validation [ezachte + diederik + dsc] (Ongoing)
 * (Data) Start pushing datasets to AWS [diederik] (Oct)
 * (Data) Finalize scripts to massively compact dammit.lt data [erik] (Oct)
   * Blogpost about what awesome stuff you can do with this [diederik + ?] (Oct)
 * (Limn) Refactor charting to use d3 [dan + dsc]
   * Feature Parity with Dygraphs (plus bugfixes, etc) (Oct)
 * (Limn) Bugfixes (like Save-As, UI Error Notifications, ...) [dan] (Oct)
 * (Limn) Improve Limn wiki, docs, & guides (esp those shameful screenshots) [dan] (Oct)
 * (Limn) UI support for remote datasets via proxy [dsc + dan] (Oct)
 * (Limn) Support the Global Dev dashboard [evan] (Ongoing)
 * (Limn) Support the Gerrit Stats dashboard [diederik] (Ongoing)
 * (Limn) Deploy reportcard / gerrit-stats on stat1001 (aka, "the Debian packaging discussion") [otto + dsc] (Oct)

Followups

 * [dsc] Update wiki with project pages for everything on the Roadmap page
 * Each project owner will then update their Project Status for Sept
 * [dsc] Update the Engineering Roadmap wiki page: https://www.mediawiki.org/wiki/Roadmap
 * [dsc] Fill in week-by-week team roadmap without breakout by project

= ♥ =

<3 http://art.less.ly/2012/heart-dino.png <3