Wikimedia Release Engineering Team/Project/Train2.0

Goal

Generally: To increase developer output by speeding up development and deployment feedback loops.

Improve our weekly deployment train by adopting a saner deployment and release process. Primarily, this means moving to a long-lived branch strategy instead of the current weekly branching strategy.

Much thought has gone into how we run our "deployment train," especially the weekly branch cutting. Several adjustments to this process should significantly improve quality and developer productivity.

What

Maintain two long-lived production branches, wmf/next and wmf/prod.
Periodically merge from master into wmf/next. As part of this process, all merged commits should be briefly reviewed for production readiness.
Communicate with the authors of questionable or not-well-understood commits. Release Engineering should be confident that a change is reasonable before merging it to wmf/prod; otherwise, the commit will be reverted on the branch.
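
The branch flow described above can be sketched as a handful of git invocations. This is a minimal illustration, assuming a plain git checkout; the helper names (`commits_to_review_cmd`, `merge_cmds`, `revert_cmd`) are hypothetical and are not part of scap or any existing Release Engineering tooling. The functions only build the command lines, so they can be inspected before anything is run.

```python
from typing import List

# Branch names taken from the plan above.
NEXT_BRANCH = "wmf/next"
PROD_BRANCH = "wmf/prod"


def commits_to_review_cmd(source: str = "master",
                          target: str = NEXT_BRANCH) -> List[str]:
    """Build the git command that lists commits on `source` not yet on `target`.

    These are the commits to review briefly for production readiness.
    """
    return ["git", "log", "--oneline", "--no-merges", f"{target}..{source}"]


def merge_cmds(source: str, target: str) -> List[List[str]]:
    """Build the git commands that merge `source` into the long-lived `target`."""
    return [
        ["git", "checkout", target],
        ["git", "merge", "--no-ff", source],
    ]


def revert_cmd(commit: str) -> List[str]:
    """Build the git command that reverts one questionable commit on the current branch."""
    return ["git", "revert", "--no-edit", commit]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` inside a repository checkout, for example `merge_cmds("master", NEXT_BRANCH)` for the periodic merge and `merge_cmds(NEXT_BRANCH, PROD_BRANCH)` once the reviewed changes are considered reasonable.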

Dependencies

TechOps for any needed caching changes.
MediaWiki expertise for the needed changes to MW internals.

Milestones

1617 Q1
- Convert our production deployment strategy to use long-lived branches - task T89945

Q2
- Move MW+Extension deploys to scap3 - task T114313 (part 1)
  - Use a unified git repo for MW deploys
  - Assess new directory structure and how it will interplay with MW
  - Stretch goal: Replace rsync with git syncing

Q3
- Move MW+Extension deploys to scap3 - task T114313 (part 2)
  - Build fanout support in scap3
  - Assess impact of moving to `scap deploy` for all parts

Q4
- Move MW+Extension deploys to scap3 - task T114313 (part 3)
  - Migrate to scap3 natively
- Integrate scap with etcd/pybal to automatically depool and repool servers - task T104352

Impact

Movement

Improved code quality means fewer outages and fewer bugs making it to production.
We should be able to move more quickly and with more confidence, releasing improvements sooner and more often.

Foundation

Less time spent cutting branches and less time spent hunting down unexpected changes during a deployment. Release Engineering should have a much better "big picture" understanding of what is happening across MediaWiki teams, leading to a better response to problems when they do arise.

KPI

Potential KPIs
- The number of times reverts are needed after code is deployed to all wikis (including group2 train deploys and SWAT deploys)
- Others not considered right now:
  - Mean time to recovery
  - Time to roll back a deploy
  - Production log errors
  - Time to deploy a new branch to all web servers
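
Two of the KPIs above (revert count and mean time to recovery) could be computed from a simple log of deploys. The following is only a sketch: the `Deploy` record and its fields are assumptions made for illustration and do not correspond to any existing Wikimedia data source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    """Hypothetical record of one deploy to all wikis (train or SWAT)."""
    deployed_at: datetime
    reverted: bool = False                  # a revert was needed afterwards
    failed_at: Optional[datetime] = None    # when the problem was detected
    recovered_at: Optional[datetime] = None  # when production was healthy again


def revert_count(deploys: List[Deploy]) -> int:
    """Number of deploys that needed a revert."""
    return sum(1 for d in deploys if d.reverted)


def mean_time_to_recovery(deploys: List[Deploy]) -> Optional[timedelta]:
    """Average of (recovered_at - failed_at) over deploys that have both timestamps."""
    durations = [
        d.recovered_at - d.failed_at
        for d in deploys
        if d.failed_at is not None and d.recovered_at is not None
    ]
    if not durations:
        return None  # no incidents recorded, so the metric is undefined
    return sum(durations, timedelta()) / len(durations)
```

Keeping the failure and recovery timestamps separate from the deploy timestamp means the same records could also feed a "time to roll back a deploy" metric later.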