User:Roan Kattouw (WMF)/Deployment process

This page explains the processes and tools we use to deploy code at the Wikimedia Foundation.

Code review in Gerrit
When a developer wants to fix a bug, implement a feature or change the code for some other reason, they first make the change locally on their own computer and test it. Then they upload the proposed change to Gerrit, our code review system. There, other developers can review the change and comment on it. Reviewers also assign one of the following scores to their review:


 * -2 (Do not submit): The reviewer is vetoing this change. A -2 review prevents another reviewer from approving the change.
 * -1 (There's a problem with this change, please improve): The reviewer is asking the author to amend the change.
 * 0 (No score): The reviewer is just commenting, not expressing judgment.
 * +1 (Looks good to me, but someone else must approve): The reviewer is expressing support, but is not approving the change.
 * +2 (Looks good to me, approved): The reviewer has approved the change, and it will be accepted without further amendments

Any Gerrit user can leave -1, 0 and +1 reviews (i.e. can express support or ask for amendments), but only reviewers with "+2 rights" can leave +2 and -2 reviews (i.e. can approve or veto a change).

Once a change has been approved (i.e., a single reviewer has +2ed it), our continuous integration infrastructure (often called "CI" or "Jenkins") runs a suite of automated tests on the change. If all of those tests pass, the change is "merged", meaning it is accepted into the official version of the code. The latest version of the code with all merged changes is called "master".
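The scoring and merge rules described above can be sketched in a few lines of Python. This is only an illustrative model, not Gerrit's actual implementation or API: the names `Review`, `is_allowed`, `is_approved` and `is_merged` are hypothetical, and the sketch ignores details such as multiple patch sets.

```python
from dataclasses import dataclass

# Scores any Gerrit user may leave, and the ones that need "+2 rights".
OPEN_SCORES = {-1, 0, 1}
RESTRICTED_SCORES = {-2, 2}


@dataclass(frozen=True)
class Review:
    """Hypothetical record of one review; not a Gerrit data structure."""
    reviewer: str
    score: int
    has_plus2_rights: bool = False


def is_allowed(review: Review) -> bool:
    """Return True if the reviewer is entitled to leave this score."""
    if review.score in OPEN_SCORES:
        return True
    return review.score in RESTRICTED_SCORES and review.has_plus2_rights


def is_approved(reviews: list[Review]) -> bool:
    """A single +2 approves a change, unless a -2 vetoes it."""
    valid = [r.score for r in reviews if is_allowed(r)]
    return 2 in valid and -2 not in valid


def is_merged(reviews: list[Review], tests_pass: bool) -> bool:
    """Approval triggers CI; the change merges only if the tests pass too."""
    return is_approved(reviews) and tests_pass
```

For example, a change with a +1 from one user and a +2 from a reviewer with +2 rights is approved, and it is merged only if the automated tests pass as well. A +2 from a user without +2 rights is ignored in this model, and a -2 from a reviewer with +2 rights blocks approval.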

Beta cluster
The beta cluster (also known by its old name "beta labs") is a set of wikis used for testing that runs the code in master. It is updated by an automated process about every 10 minutes. Once a change is merged, it usually appears on the beta cluster within 30 minutes. QA engineers use the beta cluster wikis for testing new features and bug fixes. Because every change goes live on the beta cluster shortly after it's merged, with no further checks in between, the beta cluster breaks from time to time when bad changes are merged. This is our first opportunity to notice and fix problems.

Notes from presentation

 * Code review process
 * Gerrit is the system for code review
 * Gerrit dashboard
 * Dev makes a change on their own computer, uploads it to Gerrit
 * Reviewer can view the change, comment, have a back and forth
 * When reviewer approves change (+2), CI (Jenkins) kicks in, runs automated tests, if those all succeed the change gets merged
 * Merged = canonized: the change goes from proposed to in place and becomes part of the canonical history of MW
 * Beta labs
 * Once change is merged, it goes here
 * Runs in beta cluster, not real prod servers
 * Updates itself to whatever is in master every ~10 mins
 * Depending on infra hiccups, a change is live in beta 5-30 mins after it is merged
 * QAs use this for testing
 * This means beta breaks from time to time, when bad changes are merged. Hopefully quickly noticed and fixed
 * Deployment train
 * Runs every week
 * Link to roadmap page
 * Every Tuesday (6am or noon Pacific, depending on European vs American deployer), a releng team member will cut a WMF branch
 * Take a snapshot of what's in master right then, made into a release of sorts, e.g. 1.34.0-wmf.16
 * Number incremented every week. Skip a week, skip a number. E.g. 4th of July, skipped wmf.12
 * A week may be skipped because of holidays, large events, or offsites of relevant teams
 * Snapshot made on Tuesday, deployed to group0 (test wikis, mw.org and closed wikis, but mw.org is the only "real" wiki) that same day
 * Code is now running in real prod but only on test wikis and mw.org (techy people only)
 * Hopefully if something is terribly wrong, we'll notice it at this stage. Most bugs found at this stage are things that break something really badly, or spam the logs.
 * Bugs/tasks preventing deployment from going forward are called "train blockers"; when one is encountered, the train is generally rolled back
 * Then group1 on Wednesday. Non-Wikipedias plus Catalan and Hebrew Wikipedias. (Including big shared wikis like Commons and Wikidata)
 * With some frequency, we find train blockers at the group1 stage. User-facing bugs sometimes get caught in group0, sometimes in group1 depending on how easy they are to run into.
 * Then group2 (i.e. everywhere) on Thursday. group2 = Wikipedias (except ca, he), including all the big ones
 * Status dashboard on tools [link], tells you which group is running which version
 * Normal situations: 15/15/15, 16/15/15, 16/16/15, 16/16/16
 * Click to expand which wikis are in which groups
 * Train is operated by releng team members
 * Train blockers
 * [move some stuff here]
 * Every train has a tracking task
 * Train blockers are always UBN
 * Practices
 * Merge scary changes on Tuesday, right after the cut: maximum testing time in beta labs before it goes to prod
 * Careful with merging non-trivial changes on Monday
 * Determining what's in which train
 * wmf.* MW pages with full list of changes and task numbers
 * Phab tags
 * Cherry-picks
 * Sometimes the snapshot/release needs to be modified after it was made (or even partially deployed)
 * Because something is broken, or because you want to
 * Change first follows normal process: submitted to Gerrit, reviewed, merged into master
 * Then a cherry-pick commit is created, backporting the (now-merged) commit to wmf.N.
 * Can be done from the Gerrit UI (unless there's a conflict)
 * Cherry-pick is git lingo for taking just one change and transplanting it on top of the release, omitting things that happened in between
 * Cherry-pick is its own Gerrit change. These are approved blindly by deployers, and sometimes self-approved, because the underlying change in master already went through review
 * Only deployers can approve cherry-picks, and they must be deployed immediately after merge: this way the wmf.N branch always reflects what is deployed in production
 * Sometimes a change must be cherry-picked twice, when two different versions are live (e.g. wmf.15 and wmf.16) and both are affected
 * Cherry-picks typically deployed in SWAT windows, although train blockers sometimes get deployed immediately
 * SWATs
 * Deployment schedule page on wikitech
 * SWATs are daily opportunities (usually 3 per day, 0 on Fridays) where people can list patches to be deployed
 * Each window lists the requestor (with IRC nick), a list of patches (links to Gerrit) and a category prefix ([config] or [wmf.NN])
 * At the start of a scheduled deploy window, jouncebot pings the listed deployers and the listed requestor for each patch in #wikimedia-operations on IRC
 * A deployer responds to say they will perform the SWAT
 * Requestors are expected to confirm their presence
 * Each change is first merged in Gerrit, then deployed to a test server, where the requestor tests it (using the WikimediaDebug browser extension)
 * Once the requestor confirms the change works, the deployer deploys it to the real servers
 * Patches may not be deployed in the order listed; the order is up to the deployer. They usually decide this based on which requestors immediately indicated their presence, and on how long it will take to merge a change (CI takes much longer for wmf.N cherry-picks than it does for config patches)
 * Config patches
 * The other use case for SWATs
 * SWAT is the only process for deploying config changes; the train doesn't pick them up
 * MW has a lot of config settings that can be changed on a per-wiki basis, including which extensions are enabled on which wiki
 * Config lives in a git repo hosted in Gerrit and is versioned as if it were software. Processes work as normal. Anyone can submit a config change to Gerrit.
 * Only deployers have +2 rights in this repo, and changes are only merged right before they're deployed
 * Config for beta labs also lives in this repo (and beta sites inherit their prod counterparts' config, plus beta-specific overrides); beta-only patches can be +2ed at any time, but the +2er should git pull them onto the deployment server (without syncing). A Jenkins job deploys config changes to beta as they are merged.
 * Most config settings are set declaratively in InitialiseSettings.php
 * How the deployment process works (in a nutshell)
 * ssh into the deployment server
 * Run git pull in the right directory (wmf-config repo, or e.g. php-1.34.0-wmf.16); for extension changes, then also run git submodule update extensions/Foo
 * ssh into mwdebug1002 and run scap pull there
 * On the deployment server, run scap sync-file (individual files/directories) or scap sync (everything)
 * TODO: explain extension submodules and cherry-picks to them
 * TODO: explain that i18n changes require a full scap
 * TODO: something about the distinction between code that can ride the train (new functionality) and config that needs a SWAT (telling existing functionality how to behave)
 * Also explain dark deploys
 * Something about using cherry-picks sparingly (emergencies, small changes)
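
The weekly train rollout in the notes above (group0 on Tuesday, group1 on Wednesday, group2 on Thursday) can be modelled as a small Python sketch that reproduces the "15/15/15, 16/15/15, 16/16/15, 16/16/16" status sequence. All function names here are hypothetical, the model assumes the train runs on schedule and is never rolled back, and the `is_normal` rule is an assumption generalized from the four states listed in the notes.

```python
GROUPS = ("group0", "group1", "group2")
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Day on which the new branch normally reaches each group, per the notes.
ROLLOUT_DAY = {"group0": "Tuesday", "group1": "Wednesday", "group2": "Thursday"}


def expected_versions(day: str, old: int, new: int) -> dict[str, int]:
    """Return the wmf.N number each group runs at the end of `day`,
    assuming an on-schedule train with no rollback."""
    today = WEEKDAYS.index(day)
    return {
        group: new if today >= WEEKDAYS.index(ROLLOUT_DAY[group]) else old
        for group in GROUPS
    }


def status_line(versions: dict[str, int]) -> str:
    """Format versions the way the notes abbreviate them, e.g. '16/15/15'."""
    return "/".join(str(versions[g]) for g in GROUPS)


def is_normal(versions: dict[str, int]) -> bool:
    """Assumed rule: at most two versions live, and no group is ahead of
    a group that receives the train earlier in the week."""
    g0, g1, g2 = (versions[g] for g in GROUPS)
    return g0 >= g1 >= g2 and len({g0, g1, g2}) <= 2


if __name__ == "__main__":
    for day in ("Monday", "Tuesday", "Wednesday", "Thursday"):
        print(day, status_line(expected_versions(day, old=15, new=16)))
```

Because `is_normal` compares versions only by ordering and by how many distinct values are live, it also accepts weeks where a number was skipped (for example 13/11/11), and it flags states such as 15/16/15 where a later group is ahead of an earlier one.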