Wikimedia Release Engineering Team/Checkin archive/2023-09-06



 * Last time

🏆 Wins

 * Aug '23 recap


 * Migrated to Phorge!
 * Adopted GerritLab: https://gitlab.wikimedia.org/repos/releng/gerritlab
 * New GitLab runner pools
 * Migrated to use new GitLab JWT Tokens
 * Scap3 feature: disable service on secondary host


 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Monthly_notable_accomplishments
 * Sep '23 edition


 * Image published for Blubber that is native LLB, no dockerfile anymore
 * implications
 * dockerfile is unnecessary since no one sees the dockerfile—we can customize each llb instruction and what it displays to the users: a name that corresponds to the blubber.yaml config
 * now we have the ability to create our own instructions
 * dockerfile2llb gone! No more external helper images that haven't been maintained just to copy files around—no more cross-platform compatibility/emulation issues
 * llb gets new stuff first—ex: diffop/mergeop https://www.docker.com/blog/mergediff-building-dags-more-efficiently-and-elegantly/
 * Phorge working on the scap3∞ deployment environment
 * Landed 3 upstream Phorge patches; 1 is one we've had for years, for an issue that blocks some tasks from rendering (T284397)
 * Patch for T&S that outputs the MediaWiki SUL account along with the phab username (T344303)

Last week
The six questions I answer week-by-week about our work. This is pretty much all that CTPO/VP/Director-types see of what we're doing. If there are specific things to call out here, let's do so.

On track


 * Progress update on the hypothesis for the week
 * T345000 – Create a separate memory-optimized GitLab runner pool for memory-hungry jobs. We created CPU-optimized and memory-optimized GitLab runners this week
 * In the process tweaked the size of our staging cluster to save cost
 * T300819 – Created UI to make stacked merge requests clearer (upstream)
 * T337570#9133281 – Local Gems for our GitLab instance are in testing on our devtools instance; hopefully this enables lots of UI customization.
 * Any new metrics related to the hypothesis
 * Repositories on Gerrit decreased (2022 last week → 2020 this week)
 * Any emerging blockers or risks
 * Reached out/set up conversations about pulling apart/scheduling migrations of repos (for T344739 – Old Platform Team projects + T344733 – Metrics Platform as I believe they're unblocked)
 * Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?
 * No
 * Have there been any new learnings from the hypothesis?
 * No
 * Are you working on anything else outside of this hypothesis? If so, what?
 * MediaWiki 1.41.0-wmf.24
 * 309 Patches ▁▁▇█▂
 * 0 Rollbacks ██▁▁▁
 * 0 Days of delay ▁▁█▁▁
 * 1 Blockers ▅█▅▁▁
 * T345458 – Refactor Blubber's BuildKit frontend gateway to use LLB directly—enables some nicer features in our docker image builds
 * T343967 – Bugfixes for scap backport deploying two stacked patches

This week
 * Progress update on the hypothesis for the week
 * Any new metrics related to the hypothesis
 * Any emerging blockers or risks
 * Any unresolved dependencies - do you depend on another team that hasn’t already given you what you need? Are you on the hook to give another team something you aren’t able to give right now?
 * Have there been any new learnings from the hypothesis?
 * Are you working on anything else outside of this hypothesis? If so, what?
 * https://phabricator.wikimedia.org/T288624

🌻 Open source/Upstream contributions

 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Upstream


 * https://github.com/yaoyuannnn/gerritlab/pulls?q=is%3Apr+is%3Aclosed
 * https://github.com/tox-dev/tox/issues/3115

Code review

 * +1'd gerrit changes
 * (filed as: https://phabricator.wikimedia.org/T344361 )

Gerrit Access requests

 * Gerrit access requests

Private repo requests
https://phabricator.wikimedia.org/search/query/E7t2_WXX01bB/#R

Gerrit repo requests

 * https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests

GitLab Access requests

 * Accounts and auth -
 * GitLab access requests

High priority tasks

 * UBN! + High: https://phabricator.wikimedia.org/maniphest/query/PkxR1BXrbbU4/#R
 * New in inbox: https://phabricator.wikimedia.org/maniphest/query/7vRDrcVnt8OI/#R

📅 Vacations/Important dates

 * https://office.wikimedia.org/wiki/HR_Corner/Holiday_List#2023
 * https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Time_off


 * 04 Sep: Labor day (US Staff with reqs)


 * 08 Sep: Tyler
 * 15, 18 Sep: Tyler
 * 26 Aug–05 Sep: Brennen (🔥)
 * 13 Weds–17 Sun: Brennen → KS (approximate)


 * 2-16 Oct: Jaime

Future

 * 15 Jan - 15 Mar: Andre

🔥🚂 Train

 * https://tools.wmflabs.org/versions/
 * https://train-blockers.toolforge.org/
 * https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar


 * 2 Jan - wmf.17 - Dan + Antoine (Jaime out)
 * 9 Jan - wmf.18 - Jeena + Dan (Jaime out)
 * 16 Jan - wmf.19 - Jaime + Jeena
 * 23 Jan - wmf.20 - Brennen + Jaime
 * 30 Jan - wmf.21 - Ahmon + Brennen
 * 6 Feb - wmf.22 - Chad + Ahmon
 * 13 Feb - wmf.23 – Dan + Chad
 * 20 Feb - wmf.24 – Antoine + Dan
 * 27 Feb - wmf.25 – Jaime + Antoine
 * 6 Mar – wmf.26 – Jeena + Jaime
 * 13 Mar – wmf.27 – Brennen + Jeena
 * 20 Mar – wmf.1 – Ahmon + Brennen
 * 27 Mar – wmf.2 – Chad Dan + Ahmon
 * 3 Apr – wmf.3 – Antoine + Dan
 * 10 Apr – wmf.4 – Chad + Antoine
 * 17 Apr – wmf.5 – Jaime + Chad
 * 24 Apr – wmf.6 – Jeena + Jaime
 * 1 May – wmf.7 – Brennen + Jeena
 * 8 May – wmf.8 – Antoine + Brennen (Ahmon out + Antoine Out 8th)
 * 15 May – wmf.9 – Ahmon + Antoine (Dan out + Chad out)
 * 22 May – wmf.10 – Chad + Ahmon (Dan out + Jeena out 26th)
 * 29 May – wmf.11 – Dan + Chad (Memorial Day 29th)
 * 5 Jun – wmf.12 – Jeena + Dan (Brennen out, Jaime out)
 * 12 Jun – wmf.13 – Jaime + Jeena
 * 19 Jun – wmf.15 – Cancelled for offsite
 * 26 Jun – wmf.16 – Brennen + Jaime (Jeena out)
 * 3 Jul – wmf.17 – Antoine + Brennen (3rd + 4th holidays)
 * 10 Jul – wmf.18 – Dan + Antoine (Ahmon out)
 * 17 Jul – wmf.19 – Ahmon+Dan (Brennen out Friday)
 * 24 Jul – wmf.20 – Jaime+Ahmon
 * 31 Jul – wmf.21 – Ahmon+Jaime (Jeena out, Antoine out) (Ahmon volunteered)
 * 7 Aug – wmf.22 – No train
 * 14 Aug - wmf.23 – Ahmon+Jaime (Jeena out, Antoine out)
 * 21 Aug - wmf.24 – Dan (Brennen out, Jeena out, Antoine out)
 * 28 Aug – wmf.25 – Jeena+Dan
 * 04 Sep – wmf.26 – Antoine+Jeena


 * 11 Sep – wmf.27 – Jaime+Antoine+Andre as lurker!
 * 18 Sep – wmf.28 – Brennen+Jaime
 * 25 Sep – wmf.29 –

Offsite!

 * SF
 * Approved Arrival Date: December 4, 2023
 * Approved Departure Date: December 9, 2023
 * In Person Meeting Days: December 5, 6, 7, 8

Please complete the survey by September 19

DX Runs the train
We should still make sure at least one of us is on call for the train, as we do currently, to offer support, in particular because I'm assuming we are still taking care of the pre-train automated processes that run late Monday/early Tuesday (branch cut + train presync).
There are a few places where I think we could already review/improve the docs beforehand:
Security patches: They fail relatively often. This is the only documentation I could find about patches. It would be useful to have something that explains who to contact/how to get an updated security patch when necessary.
Triage/Bug reporting: Especially relevant people/teams to tag: https://www.mediawiki.org/wiki/Developers/Maintainers
Rollback/holding the train: I would consolidate the sections we have about breakage, holding/rolling back, and where to monitor into a single place. I would add more direct links to the relevant dashboards in logstash and grafana and revise the criteria themselves too; for example, based on how we normally operate, this criterion sounds too draconian: "In general, if there is an unexplained error that occurs within 1 hour of a train deployment — always roll back the train": https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Breakage https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#Issues_that_hold_the_train
 * We want to work closer with others in Developer Experience
 * We're looking for short, well-defined projects to tackle together
 * Initially, the projects should be time-bound and simple—we're trying to build our process for doing this and learn how to do this together
 * Later they will be bigger and gnarlier

https://phabricator.wikimedia.org/T264231
 * Update on "Investigate whether issues, operations, wikis, etc. can be disabled globally on GitLab"
 * Antoine tried it!


 * metrics-platform migrating in next sprint (maybe), can I tentatively add someone? https://phabricator.wikimedia.org/T344733


 * Continuous delivery all the things
 * Have to work with SRE on this since this is the deployment-charts repo
 * Need a different way of keeping track of state
 * Access control?
 * We want GitLab to do it?
 * Why do we store the version?
 * Information needs to be stored if we need to rebuild the cluster
 * Need the image name to run; it should be stored somewhere
 * Git is a question of access: team A can bump versions for team B
 * Don't object to *a* git repo, but to the mechanism: it should give them control over what gets deployed
 * don't want something deployed, don't merge it
 * Image tags are currently not meaningful
 * Building on a tag (although this may not be the strictest definition of continuous deployment)
 * Enforcing main always deployable would constrain people
 * Keeping those decoupled would remove that constraint on folks
 * Agree having the mentality of main always deployable is a good mindset, but it's too restrictive if our goal is to make things easier for developers
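
One way to picture the "building on a tag" idea from the discussion above is a small sketch in Python. Everything here is an assumption for illustration: the `vX.Y.Z` tag convention, the function names, and the idea that the git tag doubles as the image tag are hypothetical, not something the team has decided. The point is only that pushes to main do not trigger a deployment, so main does not have to be deployable at all times, while tags produce image tags that mean something.

```python
import re
from typing import Optional

# Hypothetical convention: only refs like "refs/tags/v1.2.3" are releases.
RELEASE_TAG = re.compile(r"^refs/tags/(v\d+\.\d+\.\d+)$")


def image_tag_for_ref(git_ref: str) -> Optional[str]:
    """Return the image tag to build and deploy for a pushed ref.

    Returns None for anything that is not a release tag, including
    pushes to main, so merging to main never deploys by itself.
    """
    match = RELEASE_TAG.match(git_ref)
    return match.group(1) if match else None


def should_deploy(git_ref: str) -> bool:
    """A ref triggers a deployment only if it maps to an image tag."""
    return image_tag_for_ref(git_ref) is not None
```

Under this sketch, a team that does not want something deployed simply does not tag it, which keeps the decision with the owning team rather than with whoever can bump a version elsewhere.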