Wikimedia Release Engineering Team/Project/Train2.0

From mediawiki.org

Goal

Generally: To increase developer output by speeding up development and deployment feedback loops.

Improve our weekly deployment train process by adopting a saner deployment and release process. Primarily, this means moving to a long-lived branch strategy instead of the current weekly branching strategy.

Much thought has gone into the way we do our "deployment train," and especially the weekly branch cutting. Several adjustments to this process will result in a significant increase in quality and improved developer productivity.

What

  • Maintain two long-lived production branches, wmf/next and wmf/prod.
  • Periodically merge from master into wmf/next. As part of this process, all merged commits should be briefly reviewed for production readiness.
  • Communicate with the author of any questionable or poorly understood commit. Release Engineering should be confident that a change is reasonable before merging it to wmf/prod. Otherwise, the commit will be reverted on the branch.
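
The cycle described above can be sketched as a small planner that lists the git commands for one merge, review, and promote pass. This is only an illustration under stated assumptions: the `Commit` type, its `production_ready` field, and the `plan_train` function are hypothetical, and the final promotion step (merging wmf/next into wmf/prod) is a guess, since this page does not say how changes move between the two branches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    subject: str
    production_ready: bool  # outcome of the brief review (hypothetical field)


def plan_train(commits):
    """Return git commands for one cycle; `commits` is ordered oldest first."""
    cmds = [
        "git checkout wmf/next",
        "git merge --no-ff master",
    ]
    # Revert anything the reviewer is not confident about, newest first,
    # so that later changes are undone before the ones they build on.
    for commit in reversed(commits):
        if not commit.production_ready:
            cmds.append(f"git revert --no-edit {commit.sha}")
    # Assumed promotion step: not specified by the source.
    cmds += ["git checkout wmf/prod", "git merge --no-ff wmf/next"]
    return cmds
```

The reverts happen only on wmf/next, so master is left untouched and the author can follow up there. Reverting a merge commit would additionally need `git revert -m`, which this sketch does not handle.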

Dependencies

  • TechOps for any needed caching changes.
  • MediaWiki expertise for the needed changes to MW internals.

Milestones

FY 16-17: Q1, Q2, Q3, Q4
Convert our production deployment strategy to use long-lived branches - task T89945
  • Move MW+Extension deploys to scap3 - task T114313 (part 1)
    • Use a unified git repo for MW deploys
    • Assess new directory structure and how it will interplay with MW
    • Stretch goal: Replace rsync with git syncing
  • Move MW+Extension deploys to scap3 - task T114313 (part 2)
    • Build fanout support in scap3
    • Assess impact of moving to `scap deploy` for all parts
  • Move MW+Extension deploys to scap3 - task T114313 (part 3)
    • Migrate to scap3 natively
    • Integrate scap with etcd/pybal to automatically depool and repool servers - task T104352
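
To make the depool/repool milestone concrete, here is a minimal sketch of a rolling deploy that takes servers out of rotation in small batches. It does not use any real scap, etcd, or pybal API: `depool`, `deploy`, and `repool` are hypothetical callables supplied by the caller, and the batch size is an arbitrary illustrative parameter.

```python
def batches(hosts, size):
    """Split hosts into consecutive groups of at most `size`."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rolling_deploy(hosts, batch_size, depool, deploy, repool):
    """Depool, deploy to, and repool one batch of hosts at a time.

    depool, deploy and repool are hypothetical stand-ins for whatever
    the real tooling would expose. Error handling is omitted.
    """
    done = []
    for group in batches(hosts, batch_size):
        for host in group:
            depool(host)          # stop sending traffic to this host
        for host in group:
            deploy(host)          # update code while it serves no requests
            done.append(host)
        for host in group:
            repool(host)          # put it back into rotation
    return done
```

Keeping batches small limits how much serving capacity is out of rotation at any moment, at the cost of a longer total deploy.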

Impact

Movement

  • Improved code quality means fewer outages and fewer bugs reaching production.
  • We should be able to move more quickly and with more confidence, releasing improvements sooner and more often.

Foundation

  • Less time spent cutting branches and hunting down unexpected changes during a deployment. Release Engineering should have a much better "big picture" understanding of what is happening across MediaWiki teams, leading to a better response to problems when they do arise.

KPI

  • Potential KPIs
    • The number of times reverts are needed after code is deployed to all wikis (including group2 train deploys and SWAT deploys)
    • Others not being considered right now:
      • Mean time to recovery
      • Time to roll back a deploy
      • Production log errors
      • Time to deploy a new branch to all web servers
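
As a rough illustration of how two of these KPIs could be computed, the sketch below counts reverts and averages the time from deploy to revert. The `Deploy` record and both function names are hypothetical, and treating the deploy-to-revert gap as "time to recovery" is an assumption; a real measurement would more likely start from when the problem was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    deployed_at: datetime
    reverted_at: Optional[datetime] = None  # None means no revert was needed


def revert_count(deploys):
    """Number of deploys that had to be reverted."""
    return sum(1 for d in deploys if d.reverted_at is not None)


def mean_time_to_recovery(deploys):
    """Average deploy-to-revert delay, or None when nothing was reverted."""
    gaps = [
        d.reverted_at - d.deployed_at
        for d in deploys
        if d.reverted_at is not None
    ]
    if not gaps:
        return None
    return sum(gaps, timedelta()) / len(gaps)
```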