Wikimedia Release Engineering Team/Project/Train2.0

Goal
Generally: To increase developer output by speeding up development and deployment feedback loops.

Improve our weekly deployment train process by adopting a saner deployment and release process. Primarily, this means moving to a long-lived branch strategy instead of the current weekly branching strategy.

Much thought has gone into the way we do our "deployment train," and especially the weekly branch cutting. Several adjustments to this process will result in a significant increase in quality and improved developer productivity.

What

 * Maintain two long-lived production branches, wmf/next and wmf/prod
 * Periodically merge from master into wmf/next. As part of this process, all merged commits should be briefly reviewed for production readiness.
 * Communicate with the author of any questionable or poorly understood commit. Release Engineering should be confident that a change is reasonable before merging it to wmf/prod. Otherwise, the commit will be reverted on the branch.
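
The workflow above could be sketched as a small helper that lists the git commands for one merge cycle. This is a minimal sketch under stated assumptions, not the team's actual tooling: the function name `plan_merge_cycle` and its `rejected_shas` input are hypothetical, and only the branch names master and wmf/next come from the list above. The git subcommands used (checkout, log, merge, revert) are standard.

```python
from typing import List


def plan_merge_cycle(rejected_shas: List[str]) -> List[List[str]]:
    """Return the git commands for one master -> wmf/next merge cycle.

    rejected_shas is a hypothetical input: commits that reviewers could
    not confirm as production-ready, which are reverted on the branch.
    """
    commands = [
        ["git", "checkout", "wmf/next"],
        # List the commits the merge will bring in, so they can be reviewed.
        ["git", "log", "--oneline", "wmf/next..master"],
        # Merge master; --no-ff records an explicit merge commit per cycle.
        ["git", "merge", "--no-ff", "master"],
    ]
    for sha in rejected_shas:
        # Revert on the branch only; master itself is left untouched.
        commands.append(["git", "revert", "--no-edit", sha])
    return commands


if __name__ == "__main__":
    # "abc1234" is a placeholder hash used only for illustration.
    for cmd in plan_merge_cycle(["abc1234"]):
        print(" ".join(cmd))
```

Returning the commands instead of running them keeps the sketch easy to inspect; a real script would presumably pause for review between the log and merge steps.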

Dependencies

 * TechOps for any needed caching changes.
 * MediaWiki expertise for the needed changes to MW internals.

Milestones

 * 1) Move MW+Extension deploys to scap3
 * 2) Integrate scap with etcd/pybal to automatically depool and repool servers
 * 3) Convert our production deployment strategy to use long-lived branches
 * 4) Start using scap's canary deploy option
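
Milestones 2 and 4 combine into a rollout pattern that can be sketched roughly as follows. This is an illustrative model only and does not use scap, etcd, or pybal APIs: `deploy`, `healthy`, `depool`, and `repool` are hypothetical callables supplied by the caller, and the health check stands in for whatever signal a real canary deploy would inspect.

```python
from typing import Callable, List


def canary_rollout(
    canaries: List[str],
    servers: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    depool: Callable[[str], None],
    repool: Callable[[str], None],
) -> List[str]:
    """Deploy to canary hosts first, then to the remaining servers.

    Returns the hosts that were updated and repooled. Raises RuntimeError
    on the first unhealthy host, which is left depooled so that it does
    not serve traffic. A canary failure therefore stops the rollout
    before any non-canary server is touched.
    """
    updated: List[str] = []
    for phase in (canaries, servers):
        for host in phase:
            depool(host)   # take the host out of rotation first
            deploy(host)
            if not healthy(host):
                raise RuntimeError(f"health check failed on {host}")
            repool(host)   # only healthy hosts return to rotation
            updated.append(host)
    return updated
```

Passing the four operations in as callables keeps the ordering logic separate from the infrastructure, so the sequence can be exercised with simple fakes.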

Movement

 * Improved code quality means fewer outages, and fewer bugs make it to production.
 * We should be able to move more quickly and with more confidence, releasing improvements sooner and more often.

Foundation

 * Less time spent cutting branches and less time spent hunting down unexpected changes during a deployment. Release Engineering should have a much better "big picture" understanding of what is happening across MediaWiki teams, leading to a better response to problems when they do arise.

KPI

 * Potential KPIs:
    * The number of times reverts are needed after code is deployed to all wikis (including group2 train deploys and SWAT deploys)
 * Others not considered right now:
    * Mean time to recovery
    * Time to roll back a deploy
    * Production log errors
    * Time to deploy a new branch to all web servers
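
Two of the KPIs listed above, the revert count and mean time to recovery, could be computed along these lines. This is a sketch under assumptions: the `Deploy` record and its fields are hypothetical and do not correspond to an existing log format, and "recovery" is taken here to mean the time from the start of an incident to its end.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    """Hypothetical record of one deploy to all wikis."""
    finished: datetime
    reverted: bool = False
    incident_start: Optional[datetime] = None
    incident_end: Optional[datetime] = None


def revert_count(deploys: List[Deploy]) -> int:
    """Number of deploys that needed a revert."""
    return sum(1 for d in deploys if d.reverted)


def mean_time_to_recovery(deploys: List[Deploy]) -> Optional[timedelta]:
    """Average incident duration, or None if no incident was recorded."""
    durations = [
        d.incident_end - d.incident_start
        for d in deploys
        if d.incident_start is not None and d.incident_end is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Returning None when no incidents exist avoids reporting a misleading zero recovery time.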