User:Roan Kattouw (WMF)/Deployment process

Notes from presentation

 * Code review process
 * Gerrit is the system for code review
 * Gerrit dashboard
 * Dev makes a change on their own computer, uploads it to Gerrit
 * Reviewer can view the change, comment, have a back and forth
 * When reviewer approves change (+2), CI (Jenkins) kicks in, runs automated tests, if those all succeed the change gets merged
 * Merged = canonized: the change stops being a proposal and becomes part of the canonical history of MW
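The merge gate described above can be sketched as a tiny state function. This is only an illustration of the rule in these notes (+2, then CI, then merge); the `ChangeState` names and the `change_state` function are made up for the example and are not part of Gerrit or Jenkins.

```python
from enum import Enum


class ChangeState(Enum):
    UPLOADED = "uploaded"   # waiting for a reviewer to approve
    APPROVED = "approved"   # has +2, but CI has not passed
    MERGED = "merged"       # +2 and all automated tests succeeded


def change_state(code_review: int, tests_passed: bool) -> ChangeState:
    """Return where a change sits, given its review score and CI result."""
    if code_review < 2:
        # Without a +2, CI's merge run never starts.
        return ChangeState.UPLOADED
    if not tests_passed:
        return ChangeState.APPROVED
    return ChangeState.MERGED
```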
 * Beta labs
 * Once change is merged, it goes here
 * Runs in beta cluster, not real prod servers
 * Updates itself to whatever is in master every ~10 mins
 * Depending on infra hiccups, 5-30 mins after change merging it's live in beta
 * QAs use this for testing
 * This means beta breaks from time to time, when bad changes are merged. Hopefully quickly noticed and fixed
 * Deployment train
 * Runs every week
 * Link to roadmap page
 * Every Tuesday (6am or noon Pacific, depending on European vs American deployer), a releng team member will cut a WMF branch
 * Take a snapshot of what's in master right then, made into a release of sorts, e.g. 1.34.0-wmf.16
 * Number incremented every week. Skip a week, skip a number. E.g. 4th of July, skipped wmf.12
 * A week may be skipped because of holidays, large events, or offsites of relevant teams
 * Snapshot made on Tuesday, deployed to group0 (test wikis, mw.org and closed wikis, but mw.org is the only "real" wiki) that same day
 * Code is now running in real prod but only on test wikis and mw.org (techy people only)
 * Hopefully if something is terribly wrong, we'll notice it at this stage. Most bugs found at this stage are things that break something really badly, or spam the logs.
 * Bugs/tasks preventing deployment from going forward are called "train blockers". When one is encountered, the train is generally rolled back
 * Then group1 on Wednesday. Non-Wikipedias plus Catalan and Hebrew Wikipedias. (Including big shared wikis like Commons and Wikidata)
 * With some frequency, we find train blockers at the group1 stage. User-facing bugs sometimes get caught in group0, sometimes in group1 depending on how easy they are to run into.
 * Then group2 on Thursday, at which point the new version is everywhere. group2 = Wikipedias (except ca, he), including all the big ones
 * Status dashboard on tools [link], tells you which group is running which version
 * Normal situations (wmf version on group0/group1/group2): 15/15/15, 16/15/15, 16/16/15, 16/16/16
 * Click to expand which wikis are in which groups
 * Train is operated by releng team members
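The version numbering and the group0/group1/group2 rollout can be sketched in Python. This is a minimal model, not the releng tooling: the function names are hypothetical, and it assumes each group moves on its usual weekday with no rollbacks.

```python
import re

# Weekday on which each group normally gets the new branch (Monday == 0).
# Rollbacks and train blockers are ignored in this sketch.
ROLLOUT_DAY = {"group0": 1, "group1": 2, "group2": 3}  # Tue, Wed, Thu


def next_version(version: str, skipped_weeks: int = 0) -> str:
    """Return the name of the next weekly branch.

    A skipped week also skips a number, so the suffix advances by
    1 + skipped_weeks.
    """
    match = re.fullmatch(r"(.*-wmf\.)(\d+)", version)
    if match is None:
        raise ValueError(f"not a wmf branch version: {version!r}")
    prefix, number = match.group(1), int(match.group(2))
    return f"{prefix}{number + 1 + skipped_weeks}"


def versions_by_group(old: str, new: str, weekday: int) -> dict:
    """Return which version each group runs on the given weekday."""
    return {
        group: new if weekday >= day else old
        for group, day in ROLLOUT_DAY.items()
    }


def dashboard(old: str, new: str, weekday: int) -> str:
    """Format the state as group0/group1/group2 wmf numbers."""
    state = versions_by_group(old, new, weekday)
    groups = ("group0", "group1", "group2")
    return "/".join(state[g].rsplit(".", 1)[1] for g in groups)
```

For example, `dashboard("1.34.0-wmf.15", "1.34.0-wmf.16", 2)` describes a Wednesday after group1 has moved, and `next_version("1.34.0-wmf.11", skipped_weeks=1)` models the week in which wmf.12 was skipped.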
 * Train blockers
 * [move some stuff here]
 * Every train has a tracking task
 * Train blockers are always UBN
 * Practices
 * Merge scary changes on Tuesday, right after the cut: maximum testing time in beta labs before it goes to prod
 * Careful with merging non-trivial changes on Monday
 * Determining what's in which train
 * wmf.* MW pages with full list of changes and task numbers
 * Phab tags
 * Cherry-picks
 * Sometimes the snapshot/release needs to be modified after it was made (or even partially deployed)
 * Because something is broken, or because you want to
 * Change first follows normal process: submitted to Gerrit, reviewed, merged into master
 * Then a cherry-pick commit is created, backporting the (now-merged) commit to wmf.N.
 * Can be done from the Gerrit UI (unless there's a conflict)
 * Cherry-pick is git lingo for taking just one change and transplanting it on top of the release, omitting things that happened in between
 * Cherry-pick is its own Gerrit change. These are approved blindly by deployers, and sometimes self-approved, because the underlying change in master already went through review
 * Only deployers can approve cherry-picks, and they must be deployed immediately after merge: this way the wmf.N branch always reflects what is deployed in production
 * Sometimes a change must be cherry-picked twice, when two different versions are live (e.g. wmf.15 and wmf.16) and both are affected
 * Cherry-picks typically deployed in SWAT windows, although train blockers sometimes get deployed immediately
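To illustrate why a change sometimes needs two cherry-picks, here is a small sketch that works out which live versions need their own backport. The function name and the input shapes are assumptions made for the example; in practice the deployer decides this by hand.

```python
def backport_targets(live_by_group: dict, affected: set) -> list:
    """Return the live versions that need their own cherry-pick.

    live_by_group maps a group name to the version it currently runs;
    affected is the set of versions that contain the bug.
    """
    live = set(live_by_group.values())
    # Only versions that are both deployed somewhere and affected matter.
    return sorted(live & affected, key=lambda v: int(v.rsplit(".", 1)[1]))
```

On a Tuesday or Wednesday two versions are live, so a bug present in both yields two targets; once the train has reached group2, only one remains.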
 * SWATs
 * Deployment schedule page on wikitech
 * SWATs are daily opportunities (usually 3 per day, 0 on Fridays) where people can list patches to be deployed
 * Each window lists the requestor (with IRC nick), a list of patches (links to Gerrit) and a category prefix ([config] or [wmf.NN])
 * At the start of a scheduled deploy window, jouncebot pings the listed deployers and the listed requestor for each patch in #wikimedia-operations on IRC
 * A deployer responds to say they will perform the SWAT
 * Requestors are expected to confirm their presence
 * Each change is first merged in Gerrit, then deployed to a test server, where the requestor tests it (using the WikimediaDebug browser extension)
 * Once the requestor confirms the change works, the deployer deploys it to the real servers
 * Patches may not be deployed in the order listed; the order is up to the deployer. Deployers usually decide based on which requestors immediately indicated their presence, and on how long it will take to merge a change (CI takes much longer for wmf.N cherry-picks than it does for config patches)
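The way a deployer might order a SWAT window can be sketched as a sort. This is one plausible reading of the heuristics in these notes, not an actual tool: the `SwatPatch` class, its field names and the example URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SwatPatch:
    requestor: str                   # IRC nick of the person requesting the deploy
    gerrit_url: str                  # link to the Gerrit change
    prefix: str                      # "config" or a branch such as "wmf.16"
    requestor_present: bool = False  # has the requestor answered the ping?


def deploy_order(patches: list) -> list:
    """Order patches: present requestors first, then config before wmf.N.

    Config patches go ahead of cherry-picks because their CI run is
    shorter. sorted() is stable, so ties keep the listed order.
    """
    def key(patch: SwatPatch):
        return (not patch.requestor_present, patch.prefix != "config")

    return sorted(patches, key=key)
```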
 * Config patches
 * The other use case for SWATs
 * SWAT is the only process for deploying config changes; the train doesn't pick them up
 * MW has a lot of config settings that can be changed on a per-wiki basis, also which extensions are enabled on which wiki
 * Config lives in a git repo hosted in Gerrit and is versioned as if it were software. Processes work as normal: anyone can submit a config change to Gerrit.
 * Only deployers have +2 rights in this repo, and changes are only merged right before they're deployed
 * Config for beta labs also lives in this repo (and beta sites inherit their prod counterparts' config, plus beta-specific overrides); beta-only patches can be +2ed at any time, but the +2er should git pull them onto the deployment server (without syncing). A Jenkins job deploys config changes to beta as they are merged.
 * Most config settings are set declaratively in InitialiseSettings.php
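Per-wiki settings with beta overrides can be modelled roughly as nested dictionaries. This is a simplified Python sketch, not the real PHP: it assumes each setting maps a `default` key and wiki names to values, the setting name `wgExampleFeature` is invented, and the actual resolution logic has more cases than shown here.

```python
def resolve_setting(setting: dict, wiki: str):
    """Return the value for one wiki, falling back to 'default'."""
    if wiki in setting:
        return setting[wiki]
    return setting["default"]


def beta_settings(prod: dict, beta_overrides: dict) -> dict:
    """Layer beta-specific overrides on top of the prod settings.

    Returns a new mapping; the prod settings are left untouched.
    """
    merged = {name: dict(values) for name, values in prod.items()}
    for name, values in beta_overrides.items():
        merged.setdefault(name, {}).update(values)
    return merged


# Hypothetical setting: off by default, on for one wiki in prod,
# and on by default in beta.
prod = {"wgExampleFeature": {"default": False, "cawiki": True}}
beta = beta_settings(prod, {"wgExampleFeature": {"default": True}})
```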
 * How the deployment process works (in a nutshell)
 * ssh into the deployment server
 * Run git pull in the right directory (wmf-config repo, or e.g. php-1.34.0-wmf.16); for extensions, then run git submodule update extensions/Foo
 * ssh into mwdebug1002 and run scap pull there
 * On the deployment server, run scap sync-file (individual files/directories) or scap sync (everything)
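The steps above can be written down as a dry-run plan. The sketch below only builds the list of commands and never executes anything; the `deploy_plan` function and its parameters are hypothetical, directory and submodule paths are supplied by the caller, and real invocations may need arguments not shown here.

```python
def deploy_plan(checkout_dir, sync_path=None, submodule=None,
                debug_host="mwdebug1002"):
    """Return the ordered (host, command) steps for one deploy.

    Nothing is run; this only lists what a deployer would type.
    """
    steps = [("deployment server", f"cd {checkout_dir} && git pull")]
    if submodule:
        # Extensions are submodules, so they need an explicit update.
        steps.append(("deployment server", f"git submodule update {submodule}"))
    # Pull onto the debug host so the requestor can test there first.
    steps.append((debug_host, "scap pull"))
    if sync_path:
        steps.append(("deployment server", f"scap sync-file {sync_path}"))
    else:
        steps.append(("deployment server", "scap sync"))
    return steps


for host, command in deploy_plan("php-1.34.0-wmf.16",
                                 sync_path="extensions/Foo",
                                 submodule="extensions/Foo"):
    print(f"[{host}] {command}")
```

Keeping the plan as data makes it easy to review the order of steps before anyone runs them by hand.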
 * TODO: explain extension submodules and cherry-picks to them
 * TODO: explain that i18n changes require a full scap
 * TODO: something about the distinction between code that can ride the train (new functionality) and config that needs a SWAT (telling existing functionality how to behave)
 * Also explain dark deploys