User:Roan Kattouw (WMF)/Deployment process

This page explains the processes and tools we use to deploy code at the Wikimedia Foundation. It's intended to be a high-level introduction for both a technical and a non-technical audience.

Code review in Gerrit
When a developer wants to fix a bug, implement a feature or change the code for some other reason, they first make the change locally on their own computer and test it. Then they upload the proposed change to Gerrit, our code review system. There, other developers can review the change and comment on it. Reviewers also assign one of the following scores to their review:


 * -2 (Do not submit): The reviewer is vetoing this change. A -2 review prevents another reviewer from approving the change.
 * -1 (There's a problem with this change, please improve): The reviewer is asking the author to amend the change.
 * 0 (No score): The reviewer is just commenting, not expressing judgment.
 * +1 (Looks good to me, but someone else must approve): The reviewer is expressing support, but is not approving the change.
 * +2 (Looks good to me, approved): The reviewer has approved the change, and it will be accepted without further amendments

Any Gerrit user can leave -1, 0 and +1 reviews (i.e. can express support or ask for amendments), but only reviewers with "+2 rights" can leave +2 and -2 reviews (i.e. can approve or veto a change).

Once a change has been approved (i.e., a single reviewer has +2ed it), our continuous integration infrastructure (often called "CI" or "Jenkins") runs a suite of automated tests on the change. If all of those tests pass, the change is "merged", meaning it is accepted into the official version of the code. The latest version of the code with all merged changes is called "master".

Beta cluster
The beta cluster (also known by its old name "beta labs") is a set of wikis used for testing that runs the code in master. It is updated by an automated process about every 10 minutes. Once a change is merged, it usually appears on the beta cluster within 30 minutes. QA engineers use the beta cluster wikis for testing new features and bug fixes. Because every change goes live on the beta cluster shortly after it's merged, the beta cluster breaks from time to time when bad changes are merged. This is our first opportunity to notice and fix problems, without them affecting the production wikis.

The deployment train
The code on our "production" wikis (Wikipedias, Wiktionaries, Wikiquotes, etc.) is updated weekly using a staged process called the "deployment train". Each week's train deployment is performed by a member of the Release Engineering team.

Deployment branches
Each week, the deployer creates a snapshot of master. This snapshot is then frozen, and rolled out to the wikis in stages over the course of the week. The snapshot is called a "deployment branch", and the process of creating it is called "cutting the deployment branch" or "the branch cut". Every change that was merged before the branch was cut will be deployed that week, and every change that was merged after ("missed the train") will be deployed the following week. Because a snapshot is used, the changes riding the train range from almost a week old (the ones that narrowly missed the previous week's train) to very recent (the ones that were merged just before the cut happened).

Each deployment branch is assigned a version number that looks like 1.34.0-wmf.21. The second number (21) is incremented every week. The first number (34) is incremented twice a year, at which point the second number resets: 1.33.0-wmf.25 was followed by 1.34.0-wmf.1. When referring to deployment branches, developers typically only use the second number: 1.34.0-wmf.21 is colloquially referred to as "wmf.21".
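The numbering scheme can be illustrated with a few lines of ordinary shell string manipulation (the version string below is just an example):

```shell
ver="1.34.0-wmf.21"      # example deployment branch version number
major="${ver#*.}"        # strip the leading "1." -> "34.0-wmf.21"
major="${major%%.*}"     # keep everything before the next "." -> "34"
week="${ver##*wmf.}"     # keep everything after "wmf." -> "21"
echo "wmf.${week}"       # the colloquial name: "wmf.21"
```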

Some weeks, the deployment train is skipped, often because of holidays, WMF all-hands, or when the Release Engineering or SRE teams have an off-site. When this happens, that week's deployment branch number is also skipped. For example, 1.34.0-wmf.11 (week of June 24th, 2019) was followed by 1.34.0-wmf.13 (week of July 8th); the week of July 1st was skipped because of the 4th of July holiday, so the number 1.34.0-wmf.12 was skipped as well.

A schedule of past and future deployment branch numbers is available on the roadmap page.

The weekly train schedule
On Tuesday, the deployer creates the new deployment branch and deploys it to the wikis in group0. This group contains several test wikis (such as test.wikipedia.org, test.wikidata.org and test-commons.wikimedia.org), and mediawiki.org. These are production wikis, but they're not "real" wikis with real users (except for mediawiki.org, whose users are mostly developers). This is our first chance to find issues with the new code running in a production environment, without exposing it to many users yet. Typically, issues caught at this stage are ones significant enough to be noticeable even with a small number of users, ones that cause lots of error messages in the logs, or ones related to complexities in the production environment that aren't modeled well in the beta cluster environment.

If everything goes well in group0 (or any issues found are addressed quickly), then on Wednesday the new version is deployed to the wikis in group1. This group contains all non-Wikipedia wikis (Wiktionaries, Wikiquotes, Wikivoyages, etc.) and the Catalan and Hebrew Wikipedias. This is the first time a significant number of users use the new software. Most remaining problems are found at this stage. Typically, these are issues specific to certain wikis, especially wikis where centralized functionality lives (Commons, Wikidata, Meta and loginwiki), or issues that are hard to discover without a high volume of usage.

If everything keeps going well, then on Thursday, the new version is deployed to all remaining wikis (group2). This group contains all Wikipedias (except Catalan and Hebrew, which are in group1), and accounts for the vast majority (over 90%) of our pageviews. Most weeks, we don't find any significant problems at this stage, but it does sometimes happen. You can see the current version running on each of the groups in the versions tool. You can click to expand each group to see the list of wikis in that group.

Train deployments usually take place around 13:00 UTC (if that week's deployer is in Europe) or 19:00 UTC (if they're in North America). For any given week's schedule, see the "MediaWiki train" entries on the deployment schedule page.

Train rollbacks, delays and blockers
A problem that prevents the next deployment branch from being deployed (to some or all wikis) is called a train blocker. Train blockers are filed as tasks in Phabricator, and their priority is always Unbreak Now. Every train has a tracking task (in the Train Deployments project) that tracks blockers for that train: every train blocker task is a child task of this tracking task. When a train blocker is found, the train is often rolled back, meaning that a wiki or group of wikis that was on e.g. wmf.8 is downgraded back to wmf.7. The train might be rolled back partially (only on some wikis/groups) or completely, depending on the specifics of the issue. The train will not advance (i.e. the new version will not be deployed to any more wikis) until the train blockers are resolved. Train blockers typically disrupt the train schedule, with some phases being delayed, and later phases potentially being accelerated to catch up.

For a lot more information on this topic, and policies on when and how to hold the train, see the Holding the train page on wikitech.

Timing of merges around the branch cut
Reviewers are expected to be careful when merging changes shortly before the branch cut happens (on Mondays, and early on Tuesdays). Changes that are large, complex, impactful, or "scary" should not be merged shortly before the cut, because there is little time to test them on the beta cluster before they ride the train and get deployed to production. Changes that introduce new translatable messages also shouldn't be merged last-minute, because then there's no opportunity for the new messages to be translated.

Such changes can be merged on any other day of the week, but for especially sensitive changes, a reviewer may deliberately choose to hold back approval of a change until the branch cut has happened, and merge it on Tuesday right after the branch cut. This maximizes the amount of time for testing on the beta cluster, and for any issues found to be addressed with follow-up changes, before the change rides the next week's train.

How to find out which train a change is in
Every Tuesday, some time after the branch cut, a list of all changes included in that branch is added to the wiki page for that branch. This list contains the title of each change, the link to the change in Gerrit, and links to the Phabricator task(s) tagged in the change. The list of changes in 1.34.0-wmf.21, for example, is MediaWiki 1.34/wmf.21. These pages are linked from the roadmap page.

You can also see which train a certain change was (or will be) deployed on through Gerrit. When viewing the change in Gerrit, first verify that the change has been merged (it will say "Merged" in the title at the top); only merged changes are picked up by the train. If the change is merged, click "Included in" to reveal a list of branches the change is in. If one or more wmf.NN branches are listed, the change was deployed in the lowest-numbered train listed. If only "master" is listed, the change hasn't been deployed yet, but will be picked up by the next train.
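As a sketch, given the branch names shown by "Included in" (an illustrative list here), the earliest train a change rode can be picked out by sorting on the week number:

```shell
# Branch names as reported by Gerrit's "Included in" view (illustrative list)
branches="master wmf/1.34.0-wmf.20 wmf/1.34.0-wmf.21"
# Keep only the wmf.NN suffixes, sort numerically on the week number,
# and take the first: that's the lowest-numbered train containing the change
first=$(printf '%s\n' $branches | grep -o 'wmf\.[0-9]*$' | sort -t. -k2 -n | head -n 1)
echo "$first"
```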

Finally, ReleaseTaggerBot tags Phabricator tasks with the train that their patches will be deployed in, listing both the version number and the date that that version will begin to be deployed. If a task has multiple changes associated with it, the tag reflects the train in which the most recently merged change was or will be deployed.

Backports / cherry-picks
Sometimes a change needs to be added to the deployment branch after the branch is cut. This most commonly happens when an issue is discovered with the code in the deployment branch, or sometimes because a change was merged too late and missed the train for other reasons. Adding a change to the deployment branch after the fact is called backporting or cherry-picking the change. Technically, this involves creating a copy of the change that applies to the deployment branch (instead of master), while omitting any other changes that were merged after the branch was cut, then merging that change into the deployment branch.

If a change needs to be backported, it is first submitted to Gerrit, reviewed, and merged into master like any other change. Then, a cherry-pick change is created in Gerrit against the relevant deployment branch; in cases where two deployment branches are active and both are affected by the bug, two cherry-picks need to be created, one for each branch. The cherry-pick change is then scheduled for deployment in a SWAT window, by adding it to the Deployments wiki page (see the section on SWAT deployments below). When it's time to deploy the change, the deployer +2s the change, it gets merged, and the deployer deploys it.

Note that cherry-picks in deployment branches are not merged until right before they are deployed. This is because it's our policy that a deployment branch should always reflect what's in production (or what will very shortly be in production). Consequently, a change should be deployed shortly after it's merged, and only deployers have +2 rights in deployment branches.
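Under the hood, a backport is a standard git cherry-pick. The following is a minimal, self-contained sketch in a throwaway repository (all file, branch and commit names are made up for illustration); it transplants only the fix onto the deployment branch, leaving an unrelated later change behind:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo base > file.txt
git add file.txt && git commit -qm 'initial commit'
git branch wmf/1.34.0-wmf.20        # "branch cut": snapshot of master
echo unrelated > other.txt          # a later change we do NOT want to backport
git add other.txt && git commit -qm 'unrelated change'
echo fix >> file.txt                # the change we DO want to backport
git add file.txt && git commit -qm 'the fix'
fix=$(git rev-parse HEAD)
git checkout -q wmf/1.34.0-wmf.20
git cherry-pick "$fix" >/dev/null   # copy only the fix onto the branch
```

After the cherry-pick, the deployment branch contains the fix but not the unrelated change that was merged in between.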

Notes from presentation

 * Code review process
 * Gerrit is the system for code review
 * Gerrit dashboard
 * Dev makes a change on their own computer, uploads it to Gerrit
 * Reviewer can view the change, comment, have a back and forth
 * When reviewer approves change (+2), CI (Jenkins) kicks in, runs automated tests, if those all succeed the change gets merged
 * Merged = canonized, becomes part of the canonical history of MW. From proposed change to put in place
 * Beta labs
 * Once change is merged, it goes here
 * Runs in beta cluster, not real prod servers
 * Updates itself to whatever is in master every ~10 mins
 * Depending on infra hiccups, 5-30 mins after change merging it's live in beta
 * QAs use this for testing
 * This means beta breaks from time to time, when bad changes are merged. Hopefully quickly noticed and fixed
 * Deployment train
 * Runs every week
 * Link to roadmap page
 * Every Tuesday (6am or noon Pacific, depending on European vs American deployer), a releng team member will cut a WMF branch
 * Take a snapshot of what's in master right then, made into a release of sorts, e.g. 1.34.0-wmf.16
 * Number incremented every week. Skip a week, skip a number. E.g. 4th of July, skipped wmf.12
 * A week may be skipped because of holidays, large events, or offsites of relevant teams
 * Snapshot made on Tuesday, deployed to group0 (test wikis, mw.org and closed wikis, but mw.org is the only "real" wiki) that same day
 * Code is now running in real prod but only on test wikis and mw.org (techy people only)
 * Hopefully if something is terribly wrong, we'll notice it at this stage. Most bugs found at this stage are things that break something really badly, or spam the logs.
 * Bugs/tasks preventing deployment from going forward are called "train blockers", when one is encountered the train is generally rolled back
 * Then group1 on Wednesday. Non-Wikipedias plus Catalan and Hebrew Wikipedias. (Including big shared wikis like Commons and Wikidata)
 * With some frequency, we find train blockers at the group1 stage. User-facing bugs sometimes get caught in group0, sometimes in group1 depending on how easy they are to run into.
 * Then group2 (i.e. everywhere) on Thursday. group2 = Wikipedias (except ca, he), including all the big ones
 * Status dashboard on tools [link], tells you which group is running which version
 * Normal situations: 15/15/15, 16/15/15, 16/16/15, 16/16/16
 * Click to expand which wikis are in which groups
 * Train is operated by releng team members
 * Train blockers
 * [move some stuff here]
 * Every train has a tracking task
 * Train blockers are always UBN
 * Practices
 * Merge scary changes on Tuesday, right after the cut: maximum testing time in beta labs before it goes to prod
 * Careful with merging non-trivial changes on Monday
 * Determining what's in which train
 * wmf.* MW pages with full list of changes and task numbers
 * Phab tags
 * Cherry-picks
 * Sometimes the snapshot/release needs to be modified after it was made (or even partially deployed)
 * Because something is broken, or because you want to
 * Change first follows normal process: submitted to Gerrit, reviewed, merged into master
 * Then a cherry-pick commit is created, backporting the (now-merged) commit to wmf.N.
 * Can be done from the Gerrit UI (unless there's a conflict)
 * Cherry-pick is git lingo for taking just one change and transplanting it on top of the release, omitting things that happened in between
 * Cherry-pick is its own Gerrit change. These are approved blindly by deployers, and sometimes self-approved, because the underlying change in master already went through review
 * Only deployers can approve cherry-picks, and they must be deployed immediately after merge: this way the wmf.N branch always reflects what is deployed in production
 * Sometimes a change must be cherry-picked twice, when two different versions are live (e.g. wmf.15 and wmf.16) and both are affected
 * Cherry-picks typically deployed in SWAT windows, although train blockers sometimes get deployed immediately
 * SWATs
 * Deployment schedule page on wikitech
 * SWATs are daily opportunities (usually 3 per day, 0 on Fridays) where people can list patches to be deployed
 * Each window lists the requestor (with irc nick), a list of patches (links to Gerrit) and a category prefix ([config] or [wmf.NN])
 * At the start of a scheduled deploy window, jouncebot pings the listed deployers and the listed requestor for each patch in #wikimedia-operations on IRC
 * A deployer responds to say they will perform the SWAT
 * Requestors are expected to confirm their presence
 * Each change is first merged in Gerrit, then deployed to a test server, where the requestor tests it (using the WikimediaDebug browser extension)
 * Once the requestor confirms the change works, the deployer deploys it to the real servers
 * Patches may not be deployed in the order listed, the order is up to the deployer. They usually decide this based on which requestors immediately indicated their presence, and on how long it will take to merge a change (CI takes much longer for wmf.N cherry-picks than it does for config patches)
 * Config patches
 * The other use case for SWATs
 * SWAT is the only process for deploying config changes, the train doesn't pick them up
 * MW has a lot of config settings that can be changed on a per-wiki basis, also which extensions are enabled on which wiki
 * Config is in a git repo, versioned as if it were software, it's in a Gerrit repo. Processes work as normal. Anyone can submit a config change to Gerrit.
 * Only deployers have +2 rights in this repo, and changes are only merged right before they're deployed
 * Config for beta labs also lives in this repo (and beta sites inherit their prod counterparts' config, plus beta-specific overrides); beta-only patches can be +2ed at any time, but the +2er should git pull them onto the deployment server (without syncing). A Jenkins job deploys config changes to beta as they are merged.
 * Most config settings are set declaratively in InitialiseSettings.php
 * How the deployment process works (in a nutshell)
 * ssh into the deployment server
 * Run git pull in the right directory (the wmf-config repo, or e.g. php-1.34.0-wmf.16); for extension changes, also run git submodule update extensions/Foo
 * ssh into mwdebug1002 and run scap pull there
 * On the deployment server, run scap sync-file (individual files/directories) or scap sync (everything)
 * TODO: explain extension submodules and cherry-picks to them
 * TODO: explain that i18n changes require a full scap: something about the distinction between code that can ride the train (new functionality) and config that needs a SWAT (telling existing functionality how to behave)


 * Also explain dark deploys
 * Something about using cherry-picks sparingly (emergencies, small changes)
 * TODO: wiki page with exhaustive list of changes