Wikimedia Release Engineering Team/Quarterly review, August 2013

Date: August 21st, 2013

Time: 1:30pm Pacific (20:30 UTC)

Who:
 * Leads: Greg G, Chris M
 * Virtual team: Greg G, Chris M, Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Andre, Ariel
 * Other review participants (invited): Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher-quality code. That is our focus.

What we want to accomplish:
 * Increase appreciation of, response to, and creation of tests during development
 * Improve monitoring and reporting on our development and deployment processes
 * Reduce the time between code being finished and being deployed, while finding issues earlier and with more certainty
 * Provide information about software quality in a way that informs release decisions
 * Help WMF Engineering learn and adapt from experience

Team roles
Many people outside of the virtual team play an important role in releases, but this review will focus on the work of the following people in the following roles:
 * Release engineering: Greg G, Sam, Chris S (security)
 * QA and Test Automation: Chris M, Zeljko, Michelle G
 * Bug escalation: Andre, Greg G., Chris M, Chris S (security)
 * Beta Cluster development/maintenance: Antoine, Ariel, Sam
 * Development tools (e.g. Gerrit, Jenkins): Chad, Antoine

What we've done

 * Built the Beta Cluster into an instrumental part of how we ensure the quality of our code
   * All platform and extension code merged to master is deployed to beta labs automatically (the update step is sketched after this list)
   * Automated database updates are much improved, though the approach is still under discussion
 * Provided embedded QA support to important feature teams (Language and Mobile)
 * Successfully transitioned to a one-week deploy cycle
 * Community growth through, e.g., OPW, live and online training sessions, and the QA mailing list
 * Virtual team creation
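
As a rough illustration of the automatic update step above, here is a minimal sketch. The checkout location and wiki list are hypothetical, and the real jobs run from Jenkins and differ in detail:

    #!/usr/bin/env python
    """Sketch: fast-forward Beta Cluster code to master and run DB updates."""
    import subprocess

    MEDIAWIKI_DIR = "/srv/mediawiki"  # assumed checkout location
    WIKIS = ["enwiki", "dewiki"]      # assumed list of beta wikis

    def run(cmd, cwd=None):
        # Raise on failure so a broken update is visible to the job runner.
        subprocess.check_call(cmd, cwd=cwd)

    def update_code():
        # Bring the working copy to the tip of master.
        run(["git", "fetch", "origin"], cwd=MEDIAWIKI_DIR)
        run(["git", "reset", "--hard", "origin/master"], cwd=MEDIAWIKI_DIR)

    def update_databases():
        # Apply any pending schema changes for each wiki.
        for wiki in WIKIS:
            run(["php", "maintenance/update.php", "--quick", "--wiki=" + wiki],
                cwd=MEDIAWIKI_DIR)

    if __name__ == "__main__":
        update_code()
        update_databases()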

Still in progress

 * Proper support for all extensions in the Beta Cluster: https://bugzilla.wikimedia.org/show_bug.cgi?id=49846

Goals for the next quarter
We have a lot; see also the list of sprints with associated tracking tickets.
 * Align high-profile features with QA testing levels
 * Increase capacity for browser tests through community training
   * This includes Gadget support
 * Improve monitoring and reporting of errors so we can prevent outages and/or respond to them faster
 * Improve our deployment process, with the ultimate aim of making it as automated as possible (focusing for now on git-deploy and on ensuring caches are invalidated when needed)
 * Take the Beta Cluster to the next level
   * Monitoring of fatals, errors, and performance
   * More (non-feature-specific) automated tests, e.g. for the API (a minimal example is sketched after this list)
   * Automate deploying to the Beta Cluster as much as possible, in order to find the best approaches for automating production deployments
 * Provide hermetic test environments for developers/testers that mimic what labs offers; Vagrant shows the way
 * Apply the Language/Mobile embedded QA model to a new feature team (E2? E3?)
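
A minimal sketch of the kind of non-feature-specific API test meant above, assuming the Beta Cluster's English Wikipedia endpoint (the URL and the test wiring here are illustrative, not the actual suite):

    """Sketch: a read-only smoke test against the MediaWiki action API."""
    import requests

    API_URL = "http://en.wikipedia.beta.wmflabs.org/w/api.php"  # assumed endpoint

    def test_siteinfo():
        # A trivial query: the API should answer with valid JSON that
        # contains the site's general metadata.
        resp = requests.get(API_URL, params={
            "action": "query",
            "meta": "siteinfo",
            "format": "json",
        }, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        assert "general" in data.get("query", {}), data

    if __name__ == "__main__":
        test_siteinfo()
        print("API smoke test passed")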

Measures of success

 * Successfully integrate QA support into one more feature team (as defined by more regular/predictable testing and more test coverage)
 * The Beta Cluster has monitoring equal to production's (just without the paging to Ops)
   * E.g., we have sufficient monitoring that pertinent team members know when the Beta Cluster is down (a minimal check is sketched below)
 * We are able to quickly spin up throw-away browser testing sessions per merge to master (also sketched below)
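
The "is the Beta Cluster down" check could be as small as the following sketch (the URL list and notification step are hypothetical; real monitoring would hang off Icinga rather than a standalone script):

    """Sketch: a minimal Beta Cluster up/down check."""
    import sys
    import requests

    PAGES = [
        "http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page",  # assumed beta URL
    ]

    def is_up(url):
        try:
            return requests.get(url, timeout=15).status_code == 200
        except requests.RequestException:
            return False

    if __name__ == "__main__":
        failures = [url for url in PAGES if not is_up(url)]
        for url in failures:
            print("DOWN: %s" % url)  # placeholder: really notify via IRC/email
        sys.exit(1 if failures else 0)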
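
And a throw-away browser testing session per merge could look like this sketch (Python/Selenium for illustration only; the target page and assertion are assumptions, not the team's actual suite):

    """Sketch: a disposable browser session running one smoke test."""
    from selenium import webdriver

    def smoke_test(base_url="http://en.wikipedia.beta.wmflabs.org"):
        driver = webdriver.Firefox()  # fresh browser, no leftover state
        try:
            driver.get(base_url + "/wiki/Main_Page")
            assert "Wikipedia" in driver.title, driver.title
        finally:
            driver.quit()  # tear the session down regardless of outcome

    if __name__ == "__main__":
        smoke_test()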

Questions

 * Is VisualEditor or E3 the right next team to apply the embedded test engineer model to (as we did with the Mobile and Language teams)?
 * QA test coverage priorities generally: are we hitting them appropriately? How do we communicate this?
 * There's a lot to do: where should we prioritize, and where should we build capacity?
 * Should there be a sign-off step for bigger feature deploys/enablements?
 * Is how we plan to measure success sufficient for your needs?

Worries

 * Our goals are wide-ranging and need support from multiple teams, perhaps more so than the average goal list