Wikimedia Release Engineering Team/Checkin archive/2022-06-08

= 2022-06-08 =

Inspiration Week 2022

 * 5 weeks away
 * July 11th: 
 * Please do it!
 * I know you have ideas, let's lead here

ERC Update

 * Still working

Team API
https://docs.google.com/document/d/1KoWCLyhHbekAf8OTmtDnCvNzssCnzEv479et9ljhhvs/edit#

🏆 Wins

 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Monthly_notable_accomplishments
 * June '22 edition


 * GitLab Sprint summary by Brennen https://phabricator.wikimedia.org/phame/post/view/288/gitlab-a-thon/
 * We have GitLab on new metal, and can probably enable GL Container Registry \o/
 * We know more about git than we did in May
 * Functional scap already self-installed in prod
 * JWT presentation!
 * Phab deployment has a runbook https://wikitech.wikimedia.org/wiki/Phabricator/Deployment
 * scap scap

😶 Let's keep these empty

 * +1'd gerrit changes
 * Gerrit access requests

📅 Vacations/Important dates

 * https://office.wikimedia.org/wiki/HR_Corner/Holiday_List#2022
 * https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Time_off

June

 * ~9-10 Jun: Brennen (🔥⛺) (probably ((maybe)))
 * 9 Jun: Antoine


 * 15-17 Jun: Dan (🎒⛰)
 * 20 Jun: Juneteenth (observed) (U.S. Staff with Reqs)
 * 22 Jun: Brennen out afternoon
 * 20-30 Jun: Jaime

July

 * 1 Jul: Jaime; Dan (🏔)
 * 4 Jul: US Independence day (U.S. Staff with Reqs)
 * 25-29 Jul: Dancy out
 * ~29 Jul: Brennen (🪕)

August

 * Antoine: some weeks
 * 9 Aug: International Day of the World’s Indigenous Peoples
 * 12 Aug: Brennen (🎸)
 * 27-31 Aug: Brennen (🔥)

September

 * 5 Sept: U.S. Labor Day (U.S. Staff with Reqs)
 * 1-6 Sept: Brennen (🔥)
 * ~14-18 Sept: Brennen (⛺🪕)

🔥🚂 Train

 * https://tools.wmflabs.org/versions/
 * https://train-blockers.toolforge.org/
 * https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar


 * 2 May - wmf.10 - Antoine + Brennen
 * 9 May - wmf.11 - Skipping for GitLab-a-thon
 * 16 May - wmf.12 - Jaime + Antoine
 * 23 May - wmf.13 - Ahmon + Jaime (Antoine out)
 * 30 May - wmf.14 - Jeena + Ahmon


 * 6 Jun - wmf.15 - Dan + Jeena (Brennen out)
 * 13 Jun - wmf.16 - Brennen + Jeena (Dan out)
 * 20 Jun - wmf.17 - Antoine + Brennen (Jaime out)
 * 27 Jun - wmf.18 - Dan + Antoine (Jaime out)
 * 4 Jul - wmf.19 - Jaime + Dan
 * 11 Jul - wmf.20 - Ahmon + Jaime
 * 18 Jul - wmf.21 - Jeena + Ahmon
 * 25 Jul - wmf.22 - Brennen + Jeena
 * 1 Aug - wmf.23 - Antoine + Brennen
 * 8 Aug - wmf.24 - No train (Brennen out)
 * 15 Aug - wmf.25

Hiring Update

 * We hired someone!
 * Let's fix up our onboarding
 * meetings are incoming

Simple mediawiki rollbacks are one command
The rollbacks we're considering here are rollbacks of prior `scap backport` runs. Rollbacks of wikiversions should be done by running deploy-promote w/ the desired state.

Current documentation: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Rollback
 git revert $(git log -1 --format=%H -- wikiversions.json)
 scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'
 git commit --amend
 git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2
 * ^ update docs for backport deployers ( https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers )


 * Command exists in scap
 * Command presents a list of history that the user can select
 * The history is of prior `scap prep auto` runs. `scap prep auto` keeps a history, which you can view by running `scap prep --history auto`
 * I swear we could use git notes to track those, or maybe a local git tag (recursive reflog?)
 * scap3 has .git/DEPLOY_HEAD and git tags like scap/sync/ /
 * When selected, /srv/mediawiki-staging is restaged according to the selection and scap sync-world is executed.
 * Command restores Git state on the deployment server
 * scap prep --history auto
 * Command does a full sync of the reverted state
 * scap sync-world


 * NOT IN SCOPE: Revert on Gerrit to be done by user
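
The flow in the list above could be sketched roughly as follows. This is a minimal illustration, not scap's implementation: `plan_rollback` and the message text are hypothetical, and passing a log message to `scap sync-world` is an assumption. Only the two commands themselves come from the notes above.

```python
import shlex
from typing import List


def plan_rollback(message: str) -> List[List[str]]:
    """Return the commands a rollback would run, in order (hypothetical helper)."""
    return [
        # Step 1: pick an earlier staged state from the `scap prep auto` history,
        # which restores the Git state on the deployment server.
        ["scap", "prep", "--history", "auto"],
        # Step 2: full sync of the restaged /srv/mediawiki-staging.
        # Passing a log message here is an assumption.
        ["scap", "sync-world", message],
    ]


if __name__ == "__main__":
    for cmd in plan_rollback("Rollback of earlier backport"):
        # Print only; this sketch does not execute anything.
        print(shlex.join(cmd))
```

The Gerrit revert stays out of this plan on purpose, matching the "not in scope" item: the user still has to do it.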

Current state

 * Scap backport exists
 * Stages Gerrit version numbers
 * No rollback behavior
 * scap prep auto subsystem has a mechanism for recording successful staging
 * Jeena is working on: Scap backport doesn't do mwdebug currently

Pre-discussion

 * If you sync and something goes wrong, it should roll back
 * If you notice something is wrong later, you backport a NEW revert


 * Where does a rollback differ from deploying a patch?
 * MWDebug or scap stopping you from deploying
 * Only for things using scap prep auto


 * Is there ever a case where you finish a backport, and then later you want to undo?
 * scap prep --history
 * scap rollback: gives you a history, and you can just hit "enter"
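
The "just hit enter" idea could look something like the sketch below. Everything here is hypothetical: `choose_entry` is not a scap function, the history is modelled as plain strings ordered newest first, and the default of "the state just before the current one" is an assumption about what Enter should select.

```python
from typing import List


def choose_entry(history: List[str], reply: str) -> str:
    """Pick a history entry from the user's reply (hypothetical helper).

    `history` is ordered newest first: index 0 is the currently staged
    state and index 1 is the state before it. An empty reply (just
    hitting Enter) selects index 1, which is an assumed default.
    """
    if len(history) < 2:
        raise ValueError("nothing to roll back to")
    text = reply.strip()
    # int() raises ValueError on non-numeric input, which we let propagate.
    index = 1 if text == "" else int(text)
    # Index 0 is the current state, so it is not a valid rollback target.
    if not 1 <= index < len(history):
        raise ValueError(f"no rollback target at index {index}")
    return history[index]
```

With a history of `["state-c", "state-b", "state-a"]`, an empty reply returns `"state-b"` and a reply of `"2"` returns `"state-a"`.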

Stage-train testwikis happens without human intervention

 * New code is checked out and security patches are applied
 * DONE (stage-train does all the magic)
 * systemd timer (cron job?) for stage-train
 * https://releases-jenkins.wikimedia.org/ gives a nice history of builds, which is handy when something explodes, but it is public and we probably don't want to hook it to the production deployment server
 * find a way to notify folks of completion
 * Probably need some alarms to be emitted on failure https://wikitech.wikimedia.org/wiki/Alertmanager ?
 * Security patch explosion handling
 * Don't we have a system to routinely test they still apply? (No, not for new branches)
 * Interlock with other deployment operations.


 * New version is sync'd to all MediaWiki servers & TestWiki runs new version
 * Do we need a way to do this without flipping wikiversions? Or JFDI testwiki? I think we landed on JFDI but we could modify scap to be able to pre-stage a non-active mw version.
 * That is "scap prep", isn't it? IIRC that holds a global lock and prevents scap backports, so the auto task has to be timed so that it does not overlap with the backport window (which is tied to CEST rather than PST/UTC). I guess 5am UTC will work, or maybe just after Jenkins has cut the branch
 * agree on a set time, update deployment calendar
 * Ensure the new branch has been cut (verify the Jenkins job succeeded? check the branch exists in all repos?)
 * Handle skipping trains (due to holidays, team offsite, December deployment freeze) (maybe the timer script can check the deployment calendar repo to check whether a train should run).
 * Alerts (for failed security patches, etc.)
 * Sudo permissions to re-run service
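
The pre-flight checks discussed above (skipped trains, branch already cut) might be combined as in this sketch. The function name and inputs are hypothetical. It assumes something else has already looked up the skipped weeks from the deployment calendar and checked that the branch exists. The two skipped dates are taken from the train list earlier in these notes.

```python
from datetime import date
from typing import Set

# Weeks with no train, per the schedule above (GitLab-a-thon, Brennen out).
SKIPPED_WEEKS: Set[date] = {date(2022, 5, 9), date(2022, 8, 8)}


def should_stage_train(week_start: date, skipped: Set[date], branch_cut: bool) -> bool:
    """Decide whether the timer should run stage-train (hypothetical helper)."""
    if week_start in skipped:
        # Holiday, offsite, or freeze: nothing to stage this week.
        return False
    # Only stage once the new branch has actually been cut.
    return branch_cut


if __name__ == "__main__":
    print(should_stage_train(date(2022, 6, 13), SKIPPED_WEEKS, branch_cut=True))
```

A "skip" and a "branch not cut" both return `False` here; a real timer would probably want to alert on the second case but stay quiet on the first.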

Current state

 * MWpresync account exists
 * Cronjob that runs stage-train!
 * sudo mwpresync still needed