Wikimedia Release Engineering Team/Quarterly review, August 2013

Date: August 21st, 2013

Time: 1:30pm Pacific (20:30 UTC)

Slides: gslides

Notes: etherpad

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Who:
 * Leads: Greg G, Chris M
 * Virtual team: Greg G, Chris M, Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Andre, Ariel
 * Other review participants (invited): Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher quality code. That is our focus.

What we want to accomplish:
 * More appreciation of, response to, and creation of tests in development
 * Better monitoring and reporting out of our development and deployment processes
 * Reduce time between code being finished and being deployed, while finding issues with code earlier and with more certainty.
 * Provide information about software quality in a way that informs release decisions
 * Help WMF Engineering learn and adapt from experience

...All in an effort to pave the path to a more reliable continuous deployment environment.

Team roles
Many people outside of the virtual team play an important role in releases, but this review will focus on the work of the following people in the following roles:
 * Release engineering: Greg G, Sam, Chris S (security)
 * QA and Test Automation: Chris M, Zeljko, Michelle G
 * Bug escalation: Andre, Greg G, Chris M, Chris S (security)
 * Beta cluster development/maintenance: Antoine, Ariel, Sam
 * Development tools (e.g. Gerrit, Jenkins): Chad, Antoine

What we've done

 * Built the Beta Cluster into something instrumental to the quality of the code we produce
   * all platform and extension code merged to master is deployed to beta labs automatically
   * automated db updates are still under discussion but greatly improved
 * Provided embedded QA support to important feature teams (Language and Mobile)
 * Successfully transitioned to a one-week deploy cycle
 * Community growth through, e.g., OPW, live and online training sessions, and the QA mailing list
 * Virtual team creation
 * Testing and automated browser tests across WMF development teams and projects

Still in progress

 * Proper support for all extensions in beta cluster https://bugzilla.wikimedia.org/show_bug.cgi?id=49846
 * Break browser tests out of catch-all /qa/browsertests and into per-feature builds, following the Mobile model. CirrusSearch, ULS, VE https://bugzilla.wikimedia.org/show_bug.cgi?id=52890 https://bugzilla.wikimedia.org/show_bug.cgi?id=52120

Goals for the next quarter
We have a lot; see also the list of sprints with associated tracking tickets.
 * Better align QA effort with high-profile features
   * see: QA testing levels describing test events
 * Apply model of Language/Mobile embedded QA to a new feature team (specifically VisualEditor)
 * Include more user-contributed code testing (e.g. Gadgets)
 * Increase capacity through community training for browser tests
 * Improve our deployment process
   * automate as much as possible
   * improve monitoring
   * improve tooling (e.g. atomic updates/rollbacks and cache invalidation)
 * Take the Beta Cluster to the next level
   * monitoring of fatals, errors, performance
   * add more automated tests, e.g. for the API
   * feed experiences/gained knowledge of Beta Cluster automation up to production automation

Stretch activities as time allows

 * Provide hermetic test environments for developers/testers/community. Vagrant shows the way.
 * Use Vagrant for targeted tests within the WMF Jenkins workflow

ACTIONS!

 * ACTION RL/CM/JF: Put together an RFP for an experienced tester for VisualEditor, with "experience writing automated tests" as a plus rather than a core requirement (Quim already has ~3 CVs from past QA events).
 * ACTION JF: The VE team has a hacky JS splice-out proxy idea that they will share so that others can use it (it only allows local testing against production where the code is JS executed client-side).
 * ACTION CM: Put browser tests in the repos of the features they test; this will allow more frequent test runs than the current twice a day.
 * ACTION GG: We need test discoverability for Selenium/etc. tests - add to core's backlog a system for QA tests similar to how unit tests work in MW core right now?
 * ACTION GG: Outline the options for testing infra and document where we want to go/what we're missing/pain points
 * ACTION GG/CM/RL: process documentation for ideal test/deployment steps - re-run the ThoughtWorks process we used two years ago to examine and help us start to iterate?
 * ACTION GG: Add atomicity to success metrics for deploy related goal
 * ACTION GG/KS: Do retrospectives ("post-mortem" isn't a nice word)

Measures of success

 * Successfully integrate QA support in one more feature team (as defined by: more regular/predictable testing and more test coverage)
 * Automation provides the bulk of what is needed now to deploy code
 * The Beta Cluster has an equal amount of monitoring to that of production (just without the paging to Ops). https://bugzilla.wikimedia.org/show_bug.cgi?id=51497

Questions

 * What does Product Management and QA communication look like?
 * There's a lot to do, where should we prioritize? Where should we build capacity?
 * Sign off for bigger feature deploys/enablings?
 * Is how we plan on measuring success sufficient for your needs?

Worries

 * Our goals are wide-ranging and need support from multiple teams, maybe more so than the average goal list