Wikimedia Release Engineering Team/Quarterly review, November 2013

Date: TBD

Time: TBD

Slides: TBC

Notes: etherpad

Topics: Deploy process/pipeline, release process, bug fixing (the how of priority), code review, code management, security deploy/release, automation prioritization

Who:
 * Leads: Greg G, Chris M
 * Virtual team: Greg G, Chris M, Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Andre, Ariel
 * Other review participants (invited): Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified: when we do this work well, development becomes more responsive and code quality rises. That is our focus.

What we want to accomplish:
 * Greater appreciation of, responsiveness to, and creation of tests during development
 * Better monitoring and reporting out of our development and deployment processes, especially test environments and pre-deployment
 * Reduce the time between code being finished and code being deployed, while finding issues earlier and with more certainty
 * Provide information about software quality in a way that informs release decisions
 * Help WMF Engineering learn and adapt from experience

...All in an effort to pave the way toward a more reliable continuous deployment environment.

Team roles
Many people outside of the virtual team play an important role in releases, but this review will focus on the work of the following people in the following roles:
 * Release engineering: Greg G, Sam, Chris S (security)
 * QA and Test Automation: Chris M, Zeljko, Michelle G
 * Bug escalation: Andre, Greg G, Chris M, Chris S (security)
 * Beta cluster development/maintenance: Antoine, Ariel, Sam
 * Development tools (e.g. Gerrit, Jenkins): Chad, Antoine

Goals from last quarter

 * Better align QA effort with high profile features
   * Apply model of Language/Mobile embedded QA to a new feature team (specifically VisualEditor)
   * Include more user-contributed code testing (e.g. Gadgets)
   * Increase capacity through community training for browser tests
 * Improve our deployment process
   * automate as much as possible
   * improve monitoring (see DevOps Sprint 2013)
   * improve tooling (e.g. atomic updates/rollbacks and cache invalidation; see DevOps Sprint 2013)
 * Take the Beta Cluster to the next level
   * monitoring of fatals, errors, and performance
   * add more automated tests, e.g. for the API (see the sketch after this list)
   * feed experience and knowledge gained from Beta Cluster automation up to production automation
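
A hedged sketch of what one such automated API test could look like, in Python: it hits the standard MediaWiki action API (/w/api.php) on a Beta Cluster wiki and checks for a sane siteinfo response. The target wiki and the choice of endpoint are illustrative assumptions, not the team's actual test suite.

 # Minimal API smoke test against the Beta Cluster (illustrative sketch).
 import requests
 
 BETA_API = "http://en.wikipedia.beta.wmflabs.org/w/api.php"
 
 def test_siteinfo_responds():
     # meta=siteinfo is a cheap, read-only call that every MediaWiki exposes.
     resp = requests.get(BETA_API, params={
         "action": "query",
         "format": "json",
         "meta": "siteinfo",
     }, timeout=10)
     resp.raise_for_status()
     data = resp.json()
     # "generator" reports the running MediaWiki version, e.g. "MediaWiki 1.23wmf4".
     assert "generator" in data["query"]["general"]
 
 if __name__ == "__main__":
     test_siteinfo_responds()
     print("Beta Cluster API smoke test passed")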

Measures of success from last quarter

 * Successfully integrate QA support in one more feature team (as defined by: more regular/predictable testing and more test coverage)
 * Automation provides the bulk of what is needed now to deploy code
 * The Beta Cluster has monitoring equal to that of production (just without paging Ops). https://bugzilla.wikimedia.org/show_bug.cgi?id=51497
 * Atomicity in deploys (see DevOps Sprint 2013; a sketch of one symlink-swap approach follows this list)
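
A hedged sketch of what deploy atomicity could mean in practice, assuming the common symlink-swap pattern: each release lives in its own directory and a "current" symlink is flipped between them. The paths and function names are illustrative, not our actual tooling.

 # Atomic activate/rollback via symlink swap (illustrative sketch).
 import os
 
 RELEASES = "/srv/deploy/releases"   # assumed layout: one directory per release
 CURRENT = "/srv/deploy/current"     # symlink that serving code follows
 
 def activate(release_id: str) -> None:
     target = os.path.join(RELEASES, release_id)
     if not os.path.isdir(target):
         raise ValueError("unknown release: %s" % release_id)
     tmp_link = CURRENT + ".tmp"
     if os.path.lexists(tmp_link):
         os.remove(tmp_link)
     os.symlink(target, tmp_link)
     # rename() over the old symlink is a single atomic syscall on POSIX,
     # so readers see either the old release or the new one, never a mix.
     os.replace(tmp_link, CURRENT)
 
 def rollback(previous_release_id: str) -> None:
     # Rollback is the same atomic flip, aimed at the prior release.
     activate(previous_release_id)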

Action items from last review

 * ACTION RL/CM/JF: Put together an RFP for an experienced tester for VisualEditor, with "experience writing automated tests" as a plus rather than a core requirement (Quim already has ~3 CVs from past QA events).
   * done
 * ACTION JF: The VE team has a hacky JS splice-out proxy idea that they will share so that others can use it (it only allows local testing against production where the code is in JS and executed client-side).
   * not done
 * ACTION CM: Put browser tests in the repos of the features they test; this will allow more frequent test runs than the twice a day we have now.
   * in progress (Chris would need to give a percentage estimate)
 * ACTION GG: We need test discoverability for Selenium/etc. tests - add to core's backlog a system for QA tests similar to how unit tests work in MW core right now?
   * not done
 * ACTION GG: Outline the options for testing infrastructure and document where we want to go, what we're missing, and the pain points.
   * ongoing conversations with feature teams (especially Mobile and Parsoid)
   * still needs a one-stop-shop version
 * ACTION GG/CM/RL: Process documentation for ideal test/deployment steps - re-run the ThoughtWorks process we used two years ago to examine and help us start to iterate?
   * in progress
 * ACTION GG: Add atomicity to the success metrics for the deploy-related goal.
   * meta action done, but the actual goal is not yet met
 * ACTION GG/KS: Do retrospectives ("post-mortem" isn't a nice word).
   * in progress (not done)

What we've done

 * Proper support for all extensions in the Beta Cluster https://bugzilla.wikimedia.org/show_bug.cgi?id=49846 (carried over as "in progress" from the last quarterly review)
 * Better align QA effort with high profile features
   * Apply model of Language/Mobile embedded QA to a new feature team (specifically VisualEditor)
   * Increase capacity through community training for browser tests
 * Take the Beta Cluster to the next level
   * monitoring of fatals, errors, and performance
   * support for Parsoid
   * support for pre-release extensions, e.g. Flow, with full i18n messages
 * Improve our deployment process
   * improve tooling (e.g. atomic updates/rollbacks and cache invalidation; see DevOps Sprint 2013)

Still in progress

 * Break browser tests out of the catch-all /qa/browsertests repo and into per-feature builds, following the Mobile model (CirrusSearch, ULS, VE); a minimal per-feature smoke test sketch follows. https://bugzilla.wikimedia.org/show_bug.cgi?id=52890 https://bugzilla.wikimedia.org/show_bug.cgi?id=52120
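
For illustration, a hedged sketch of the kind of small, per-feature smoke test that could live in a feature's own repo and run in its own Jenkins build. Our real browser tests are Ruby/Cucumber; this Python/Selenium version, with an assumed Beta Cluster URL, only shows the per-feature shape.

 # Per-feature browser smoke test (illustrative sketch).
 import unittest
 from selenium import webdriver
 from selenium.webdriver.common.by import By
 
 BASE_URL = "http://en.wikipedia.beta.wmflabs.org"  # assumed target wiki
 
 class SearchSmokeTest(unittest.TestCase):
     def setUp(self):
         self.driver = webdriver.Firefox()
 
     def tearDown(self):
         self.driver.quit()
 
     def test_search_box_is_present(self):
         # Every MediaWiki skin exposes a search input on the main page.
         self.driver.get(BASE_URL + "/wiki/Main_Page")
         search = self.driver.find_element(By.ID, "searchInput")
         self.assertTrue(search.is_displayed())
 
 if __name__ == "__main__":
     unittest.main()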

Goals

 * Align QA with high profile features
   * Parsoid
 * Release Process
   * Provide more release information to stakeholders
     * where code is when (dashboard; a sketch of the idea follows this list) - https://www.mediawiki.org/wiki/Wikimedia_Release_%26_QA_Team/Wishlist#Code_Deploy_Dashboard
     * one-page performance metrics with deploy times highlighted
     * one-page view of test passes/failures for code that rides the train
     * just draw the dang diagram already (current system)
   * Process documentation for ideal test/deployment steps - re-run the ThoughtWorks process we used two years ago to examine and help us start to iterate?
     * old: tomasz, alolita, roan, howie, brion, robla (some others I missed)
     * new: greg, chris m, ori, reedy, gabriel, howie, ....
 * Beta Cluster
   * ganglia.wmflabs.org
   * HTTPS
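
A hedged sketch of the "where code is when" dashboard idea: ask each wiki's action API which MediaWiki version (and hence which wmfXX deploy branch) it is running, and print a one-line summary. The wiki list here is an illustrative assumption; a real dashboard would cover every production wiki and group them by branch.

 # "Where code is when" summary (illustrative sketch).
 import requests
 
 WIKIS = [
     "https://en.wikipedia.org",
     "https://commons.wikimedia.org",
     "https://www.mediawiki.org",
 ]
 
 def deployed_version(base_url: str) -> str:
     resp = requests.get(base_url + "/w/api.php", params={
         "action": "query",
         "format": "json",
         "meta": "siteinfo",
     }, timeout=10)
     resp.raise_for_status()
     # "generator" looks like "MediaWiki 1.23wmf4", which names the deploy branch.
     return resp.json()["query"]["general"]["generator"]
 
 if __name__ == "__main__":
     for wiki in WIKIS:
         print("%-32s %s" % (wiki, deployed_version(wiki)))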

Dependencies

 * Beta Cluster
   * Platform pulling in learnings from it
   * Level of Ops support
   * Level of feature team support
     * Examples: MF with Varnish; CirrusSearch; Flow and messages; Parsoid host for VE support
 * Deploy
   * Automation of the deployment process
   * Monitoring of deploys (performance)
   * Security patch management on the cluster (especially after new wmfXX branches are cut)

Stretch activities as time allows

 * Provide hermetic test environments for developers/testers/community. Vagrant shows the way.
 * Use Vagrant for targeted tests within the WMF Jenkins workflow (a sketch of driving Vagrant from a build step follows)
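
A hedged sketch of how a Jenkins build step could drive a hermetic Vagrant VM: bring the box up, run one test command inside it, and always destroy it so the next build starts clean. Only standard Vagrant CLI subcommands are used; the PHPUnit invocation is an assumed example of a targeted test.

 # Run a test command inside a throwaway Vagrant VM (illustrative sketch).
 import subprocess
 
 def run_in_fresh_vm(test_command: str) -> int:
     subprocess.check_call(["vagrant", "up"])
     try:
         # "vagrant ssh -c" runs a single command inside the guest.
         return subprocess.call(["vagrant", "ssh", "-c", test_command])
     finally:
         # Tear the VM down so every build starts from a clean state.
         subprocess.call(["vagrant", "destroy", "-f"])
 
 if __name__ == "__main__":
     raise SystemExit(run_in_fresh_vm("php tests/phpunit/phpunit.php --group Database"))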