User:Roan Kattouw (WMF)/Deployment process

This page explains the processes and tools we use to deploy code at the Wikimedia Foundation.

Code review in Gerrit
When a developer wants to fix a bug, implement a feature or change the code for some other reason, they first make the change locally on their own computer and test it. Then they upload the proposed change to Gerrit, our code review system. There, other developers can review the change and comment on it. Reviewers also assign one of the following scores to their review:


 * -2 (Do not submit): The reviewer is vetoing this change. A -2 review prevents another reviewer from approving the change.
 * -1 (There's a problem with this change, please improve): The reviewer is asking the author to amend the change
 * 0 (No score): The reviewer is just commenting, not expressing judgment
 * +1 (Looks good to me, but someone else must approve): The reviewer is expressing support, but is not approving the change
 * +2 (Looks good to me, approved): The reviewer has approved the change, and it will be accepted without further amendments

Any Gerrit user can leave -1, 0 and +1 reviews (i.e. can express support or ask for amendments), but only reviewers with "+2 rights" can leave +2 and -2 reviews (i.e. can approve or veto a change).

Once a change has been approved (i.e., a single reviewer has +2ed it), our continuous integration infrastructure (often called "CI" or "Jenkins") runs a suite of automated tests on the change. If all of those tests pass, the change is "merged", meaning it is accepted into the official version of the code. The latest version of the code with all merged changes is called "master".
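The scoring rules above can be sketched as a small model. This is a simplified illustration, not how Gerrit is actually implemented: the `Review` class, the `has_plus2_rights` flag and the `is_approved` function are hypothetical names, and the model assumes each reviewer appears at most once in the list.

```python
from dataclasses import dataclass

VALID_SCORES = {-2, -1, 0, 1, 2}


@dataclass(frozen=True)
class Review:
    """One reviewer's score on a change (hypothetical model, not Gerrit's API)."""
    reviewer: str
    score: int
    has_plus2_rights: bool = False


def validate(review: Review) -> None:
    """Reject scores that the reviewer is not allowed to give."""
    if review.score not in VALID_SCORES:
        raise ValueError(f"invalid score: {review.score}")
    # Only reviewers with "+2 rights" may approve (+2) or veto (-2).
    if abs(review.score) == 2 and not review.has_plus2_rights:
        raise PermissionError(f"{review.reviewer} cannot leave a {review.score:+d} review")


def is_approved(reviews: list[Review]) -> bool:
    """A change is approved once it has a +2 and no -2 veto."""
    for review in reviews:
        validate(review)
    scores = [review.score for review in reviews]
    return 2 in scores and -2 not in scores
```

For example, a change with one +1 and one +2 counts as approved in this model, while adding a -2 from another reviewer with +2 rights blocks it. Approval only triggers the automated tests; the change is merged if those pass.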

Beta cluster
The beta cluster (also known by its old name "beta labs") is a set of wikis used for testing that runs the code in master. It is updated by an automated process about every 10 minutes. Once a change is merged, it usually appears on the beta cluster within 30 minutes. QA engineers use the beta cluster wikis for testing new features and bug fixes. Because every change goes live on the beta cluster shortly after it's merged, the beta cluster breaks from time to time when bad changes are merged. This is our first opportunity to notice and fix problems before they affect production wikis.

The deployment train
The code on our "production" wikis (Wikipedias, Wiktionaries, Wikiquotes, etc.) is updated weekly using a staged process called the "deployment train". Each week's train is run by a member of the Release Engineering team.

Deployment branches
Each week, the deployer creates a snapshot of master. This snapshot is then frozen, and rolled out to the wikis in stages over the course of the week. The snapshot is called a "deployment branch", and the process of creating it is called "cutting the deployment branch" or "the branch cut". Every change that was merged before the branch was cut will be deployed that week, and every change that was merged after ("missed the train") will be deployed the following week. Because a snapshot is used, the changes riding the train range from almost a week old (the ones that narrowly missed the previous week's train) to very recent (the ones that were merged just before the cut happened).

Each deployment branch is assigned a version number that looks like 1.34.0-wmf.21. The second number (21) is incremented every week. The first number (34) is incremented twice a year, at which point the second number resets: 1.33.0-wmf.25 was followed by 1.34.0-wmf.1. When referring to deployment branches, developers typically only use the second number: 1.34.0-wmf.21 is colloquially referred to as "wmf.21".

Some weeks, the deployment train is skipped, often because of holidays, a WMF all-hands, or an off-site of the Release Engineering or SRE teams. When this happens, that week's deployment branch number is also skipped. For example, 1.34.0-wmf.11 (week of June 24th, 2019) was followed by 1.34.0-wmf.13 (week of July 8th); the week of July 1st was skipped because of the 4th of July holiday, so the number 1.34.0-wmf.12 was skipped as well.
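The numbering scheme can be illustrated with a short sketch. The function names below are hypothetical, and `branch_for_week` assumes that both weeks fall within the same release cycle (the first number does not change) and that every calendar week uses up one number whether or not the train runs.

```python
import re
from datetime import date

_BRANCH_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-wmf\.(\d+)$")


def parse_branch(version: str) -> tuple[int, int, int, int]:
    """Split e.g. '1.34.0-wmf.21' into (1, 34, 0, 21)."""
    match = _BRANCH_PATTERN.match(version)
    if match is None:
        raise ValueError(f"not a deployment branch version: {version!r}")
    major, minor, patch, wmf = (int(part) for part in match.groups())
    return major, minor, patch, wmf


def short_name(version: str) -> str:
    """Return the colloquial name, e.g. 'wmf.21'."""
    return f"wmf.{parse_branch(version)[3]}"


def branch_for_week(known_version: str, known_week: date, target_week: date) -> str:
    """Number the branch for target_week, counting skipped weeks too.

    Assumption: both dates are the same weekday and lie in the same
    release cycle, so only the weekly number changes.
    """
    major, minor, patch, wmf = parse_branch(known_version)
    delta_days = (target_week - known_week).days
    if delta_days < 0 or delta_days % 7 != 0:
        raise ValueError("target_week must be a whole number of weeks after known_week")
    return f"{major}.{minor}.{patch}-wmf.{wmf + delta_days // 7}"
```

With the example above, `branch_for_week("1.34.0-wmf.11", date(2019, 6, 24), date(2019, 7, 8))` returns `"1.34.0-wmf.13"`, because the skipped week of July 1st still consumes wmf.12.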

A schedule of past and future deployment branch numbers is available on the roadmap page.

The weekly train schedule
On Tuesday, the deployer creates the new deployment branch, and deploys it to the wikis in group0. This group contains several test wikis (such as test.wikipedia.org, test.wikidata.org and test-commons.wikimedia.org), and mediawiki.org. These are production wikis, but they're not "real" wikis with real users (except for mediawiki.org, whose users are mostly developers). This is our first chance to find issues with the new code running in a production environment, before it is exposed to many users. Typically, issues caught at this stage are significant issues that are noticeable even with a small number of users, issues that cause lots of error messages in the logs, or issues related to complexities in the production environment that aren't modeled well in the beta cluster environment.

If everything goes well in group0 (or any issues found are addressed quickly), then on Wednesday the new version is deployed to the wikis in group1. This group contains all non-Wikipedia wikis (Wiktionaries, Wikiquotes, Wikivoyages, etc.) and the Catalan and Hebrew Wikipedias. This is the first time a significant number of users use the new software. Most remaining problems are found at this stage. Typically, these are issues specific to certain wikis, especially wikis where centralized functionality lives (Commons, Wikidata, Meta and loginwiki), or issues that are hard to discover without a high volume of usage.

If everything keeps going well, then on Thursday, the new version is deployed to all remaining wikis (group2). This group contains all Wikipedias (except Catalan and Hebrew, which are in group1), and accounts for the vast majority (over 90%) of our pageviews. Most weeks, we don't find any significant problems at this stage, but it does sometimes happen. You can see the current version running on each of the groups in the versions tool.

Train deployments usually take place around 13:00 UTC (if that week's deployer is in Europe) or 19:00 UTC (if they're in North America). For any given week's schedule, see the "MediaWiki train" entries on the deployment schedule page.
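The staged rollout described above can be modelled in a few lines. This is only a sketch: the function names are hypothetical, and it assumes the train runs exactly on schedule with no rollbacks or delays, which is not always the case in practice.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Day of the week on which each group receives the new branch.
ROLLOUT_DAYS = {"group0": "Tuesday", "group1": "Wednesday", "group2": "Thursday"}


def versions_by_group(day: str, old: str, new: str) -> dict[str, str]:
    """Return the branch each group runs at the end of the given day.

    Assumption: the week starts on Monday and every stage happened on time.
    """
    today = WEEKDAYS.index(day)
    return {
        group: new if today >= WEEKDAYS.index(rollout_day) else old
        for group, rollout_day in ROLLOUT_DAYS.items()
    }


def live_branches(versions: dict[str, str]) -> list[str]:
    """List the distinct branches currently serving traffic."""
    return sorted(set(versions.values()))
```

For instance, `versions_by_group("Wednesday", "wmf.15", "wmf.16")` gives wmf.16 for group0 and group1 and wmf.15 for group2. `live_branches` then reports two branches, which is the situation in which a fix may have to be backported to both.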

Notes from presentation

 * Code review process
 * Gerrit is the system for code review
 * Gerrit dashboard
 * Dev makes a change on their own computer, uploads it to Gerrit
 * Reviewer can view the change, comment, have a back and forth
 * When reviewer approves change (+2), CI (Jenkins) kicks in, runs automated tests, if those all succeed the change gets merged
 * Merged = canonized, becomes part of the canonical history of MW. From proposed change to put in place
 * Beta labs
 * Once change is merged, it goes here
 * Runs in beta cluster, not real prod servers
 * Updates itself to whatever is in master every ~10 mins
 * Depending on infra hiccups, 5-30 mins after change merging it's live in beta
 * QAs use this for testing
 * This means beta breaks from time to time, when bad changes are merged. Hopefully quickly noticed and fixed
 * Deployment train
 * Runs every week
 * Link to roadmap page
 * Every Tuesday (6am or noon Pacific, depending on European vs American deployer), a releng team member will cut a WMF branch
 * Take a snapshot of what's in master right then, made into a release of sorts, e.g. 1.34.0-wmf.16
 * Number incremented every week. Skip a week, skip a number. E.g. 4th of July, skipped wmf.12
 * A week may be skipped because of holidays, large events, or offsites of relevant teams
 * Snapshot made on Tuesday, deployed to group0 (test wikis, mw.org and closed wikis, but mw.org is the only "real" wiki) that same day
 * Code is now running in real prod but only on test wikis and mw.org (techy people only)
 * Hopefully if something is terribly wrong, we'll notice it at this stage. Most bugs found at this stage are things that break something really badly, or spam the logs.
 * Bugs/tasks preventing deployment from going forward are called "train blockers", when one is encountered the train is generally rolled back
 * Then group1 on Wednesday. Non-Wikipedias plus Catalan and Hebrew Wikipedias. (Including big shared wikis like Commons and Wikidata)
 * With some frequency, we find train blockers at the group1 stage. User-facing bugs sometimes get caught in group0, sometimes in group1 depending on how easy they are to run into.
 * Then group2 (i.e. everywhere) on Thursday. group2 = Wikipedias (except ca, he), including all the big ones
 * Status dashboard on tools [link], tells you which group is running which version
 * Normal situations: 15/15/15, 16/15/15, 16/16/15, 16/16/16
 * Click to expand which wikis are in which groups
 * Train is operated by releng team members
 * Train blockers
 * [move some stuff here]
 * Every train has a tracking task
 * Train blockers are always UBN
 * Practices
 * Merge scary changes on Tuesday, right after the cut: maximum testing time in beta labs before it goes to prod
 * Careful with merging non-trivial changes on Monday
 * Determining what's in which train
 * wmf.* MW pages with full list of changes and task numbers
 * Phab tags
 * Cherry-picks
 * Sometimes the snapshot/release needs to be modified after it was made (or even partially deployed)
 * Because something is broken, or because you want to
 * Change first follows normal process: submitted to Gerrit, reviewed, merged into master
 * Then a cherry-pick commit is created, backporting the (now-merged) commit to wmf.N.
 * Can be done from the Gerrit UI (unless there's a conflict)
 * Cherry-pick is git lingo for taking just one change and transplanting it on top of the release, omitting things that happened in between
 * Cherry-pick is its own Gerrit change. These are approved blindly by deployers, and sometimes self-approved, because the underlying change in master already went through review
 * Only deployers can approve cherry-picks, and they must be deployed immediately after merge: this way the wmf.N branch always reflects what is deployed in production
 * Sometimes a change must be cherry-picked twice, when two different versions are live (e.g. wmf.15 and wmf.16) and both are affected
 * Cherry-picks typically deployed in SWAT windows, although train blockers sometimes get deployed immediately
 * SWATs
 * Deployment schedule page on wikitech
 * SWATs are daily opportunities (usually 3 per day, 0 on Fridays) where people can list patches to be deployed
 * Each window lists the requestor (with irc nick), a list of patches (links to Gerrit) and a category prefix ([config] or [wmf.NN])
 * At the start of a scheduled deploy window, jouncebot pings the listed deployers and the listed requestor for each patch in #wikimedia-operations on IRC
 * A deployer responds to say they will perform the SWAT
 * Requestors are expected to confirm their presence
 * Each change is first merged in Gerrit, then deployed to a test server, where the requestor tests it (using the WikimediaDebug browser extension)
 * Once the requestor confirms the change works, the deployer deploys it to the real servers
 * Patches may not be deployed in the order listed; the order is up to the deployer. They usually decide this based on which requestors immediately indicated their presence, and on how long it will take to merge a change (CI takes much longer for wmf.N cherry-picks than it does for config patches)
 * Config patches
 * The other use case for SWATs
 * SWAT is the only process for deploying config changes; the train doesn't pick them up
 * MW has a lot of config settings that can be changed on a per-wiki basis, also which extensions are enabled on which wiki
 * Config lives in a git repo hosted in Gerrit, versioned as if it were software. Processes work as normal. Anyone can submit a config change to Gerrit.
 * Only deployers have +2 rights in this repo, and changes are only merged right before they're deployed
 * Config for beta labs also lives in this repo (and beta sites inherit their prod counterparts' config, plus beta-specific overrides); beta-only patches can be +2ed at any time, but the +2er should git pull them onto the deployment server (without syncing). A Jenkins job deploys config changes to beta as they are merged.
 * Most config settings are set declaratively in InitialiseSettings.php
 * How the deployment process works (in a nutshell)
 * ssh into the deployment server
 * Run git pull in the right directory (wmf-config repo, or e.g. php-1.34.0-wmf.16); for extensions, then run git submodule update extensions/Foo
 * ssh into mwdebug1002 and run scap pull there
 * On the deployment server, run scap sync-file (individual files/directories) or scap sync (everything)
 * TODO: explain extension submodules and cherry-picks to them
 * TODO: explain that i18n changes require a full scap: something about the distinction between code that can ride the train (new functionality) and config that needs a SWAT (telling existing functionality how to behave)
 * Also explain dark deploys
 * Something about using cherry-picks sparingly (emergencies, small changes)
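
The deployment steps in the notes above can be written out as a dry-run plan. This sketch only builds a list of commands and never runs anything. The function name, the `cd` step and the host labels are assumptions made for illustration, and the real `scap` commands may need further arguments that these notes don't mention.

```python
from typing import Optional


def deployment_plan(
    checkout_dir: str,
    sync_target: Optional[str] = None,
    submodule_path: Optional[str] = None,
    debug_host: str = "mwdebug1002",
) -> list[tuple[str, str]]:
    """Return (host, command) pairs in the order the notes describe.

    checkout_dir   -- directory to update, e.g. a wmf.N checkout (placeholder)
    sync_target    -- file or directory to sync; None means sync everything
    submodule_path -- extension submodule to update, if any (placeholder)
    """
    server = "deployment server"
    steps = [
        (server, f"cd {checkout_dir}"),  # assumed way of reaching "the right directory"
        (server, "git pull"),
    ]
    if submodule_path is not None:
        steps.append((server, f"git submodule update {submodule_path}"))
    # Stage the change on the debug host so the requestor can test it first.
    steps.append((debug_host, "scap pull"))
    if sync_target is None:
        steps.append((server, "scap sync"))
    else:
        steps.append((server, f"scap sync-file {sync_target}"))
    return steps


if __name__ == "__main__":
    for host, command in deployment_plan("php-1.34.0-wmf.16", sync_target="extensions/Foo",
                                         submodule_path="extensions/Foo"):
        print(f"[{host}] {command}")
```

Keeping the plan as data makes the ordering explicit: the sync to the real servers always comes after the pull on the debug host, which is where the requestor confirms that the change works.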