Wikimedia Release Engineering Team/Checkin archive/20180910

= 2018-09-10 =

Vacations/Important dates

 * https://office.wikimedia.org/wiki/HR_Corner/Holiday_List
 * How to do it


 * Mid september - Mid october, Antoine to take off some weeks/days/part time
 * October 5th (Friday) - Željko on a conference (https://2018.webcampzg.org/ )
 * October 8th - Holiday (Indigenous People's Day, Independence Day - Željko)
 * November 1 (Thursday) - Holiday (All Saints' Day - Željko)
 * November 9th - Holiday (Veterans Day)
 * November 22+23 - Holidays (Thanksgiving)
 * Week of December 3rd - Team offsite
 * December 24-28 - Holidays (Christmas)

Train

 * Maniphest query for deployment blocker tasks: https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-fmcvjrkfvvzz3gxavs3a&statuses=open%28%29&group=none&order=newest#R


 * July 02 - wmf.11 - Zeljko - no train, Fourth of July
 * July 09 - wmf.12 - Zeljko
 * July 16 - wmf.13 - Zeljko
 * July 23 - wmf.14 - Zeljko
 * July 30 - wmf.15 - Mukunda
 * Aug 06 - wmf.16 - Mukunda
 * Aug 13 - wmf.17 - Mukunda (No train - Wednesday is a holiday)
 * Aug 20 - wmf.18 - Tyler
 * Aug 27 - wmf.19 - Dan (with Antoine looking over his shoulder)
 * Sep 03 - wmf.20 - Antoine
 * Sep 10 - wmf.21 - Antoine (No train due to DC switchover; current week)
 * Sep 17 - wmf.22 - Antoine
 * Sep 24 - wmf.23 - Zeljko (only one week for me? -- Željko)
 * Oct 01 - wmf.24 - Dan
 * Oct 08 - wmf.25 - Dan (No train due to DC switchover)
 * Oct 15 - wmf.26 - Mukunda (last 1.32 wmf.XX release, 1.33 starts the next week)
 * Oct 22 - wmf.1 - Mukunda

SoS

 * July 04 - Dan
 * July 11 - Antoine
 * July 18 - Antoine
 * July 25 - Tyler
 * Aug 01 - Tyler
 * Aug 08 - Zeljko
 * Aug 15 - Dan (No SoS this week)
 * Aug 22 - Zeljko
 * Aug 29 - Zeljko
 * Sep 05 - Tyler / Željko
 * Sep 12 - Tyler / Željko (current week)
 * Sep 19 - Dan
 * Sep 26 - Dan
 * Oct 03 - Zeljko
 * Oct 10 - Zeljko
 * Oct 17 - Antoine
 * Oct 24 - Antoine
 * Oct 31 - Mukunda

Hiring

 * Offer accepted!
 * October 8th start date
 * Time to review https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Onboarding (Greg did some on Friday)
 * Train the first week?


 * Software Engineer position still open; reviewing applicants and hiring for now

First Offsite
Details:
 * Week of December 3rd
 * At the Queen Mary hotel in Long Beach
 * Deb T will be facilitating

Topics!
 * https://etherpad.wikimedia.org/p/RelEng-Offsite-201811-Topics

Needs attention

 * Gerrit Privacy Policy & CoC patch
 * https://phabricator.wikimedia.org/T196835


 * Run mediawiki::maintenance scripts in Beta Cluster
 * https://phabricator.wikimedia.org/T125976
 * Tyler to create instance


 * Deprecate and remove the EducationProgram extension from Wikimedia servers after June 30, 2018
 * https://phabricator.wikimedia.org/T125618
 * legoktm poked thcipriani about it in IRC
 * add to SoS for DBA review of Reedy's proposal on the subtask


 * eqiad row D switch upgrade (email with Greg and Mukunda on thread)
 * m3 db (Phabricator) affected
 * either the week of Sept 17 or the week of Sept 24
 * mukunda to reply :)

Google Code In ?

 * https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090799.html
 * interest? Need small/easy-ish tasks that you're willing to help someone think through and review.

Scrum of Scrums

 * Greg to copy to etherpad after meeting: https://etherpad.wikimedia.org/p/Scrum-of-Scrums

Release Engineering

 * Blocked by:
 * DBA (in support of Reedy): https://phabricator.wikimedia.org/T174802 (EducationProgram db dump in prep of removing the extension)
 * Blocking:
 * you tell us :)
 * Updates:
 * Train:
 * we had a UBN! backport needed on Thursday ( https://phabricator.wikimedia.org/T203566 )
 * This has been thoroughly documented in https://phabricator.wikimedia.org/T156541 and it is a regularly recurring problem which causes production breakage every time the structure of a class is changed in an incompatible way. We can do better!
 * Log Health:
 * Exception thrown for failure to save settings appears ~ 1000 times/day: https://phabricator.wikimedia.org/T202149 (Note: add to SoS Callouts)

Release Engineering

 * Blocked by:
 * Noise from https://phabricator.wikimedia.org/T201082 during Train deployment (not really blocked but distracted)
 * Blocking:
 * Updates
 * Train:
 * 1.32.0-wmf.20 at group 1, no problems
 * on European time this week
 * No train next week, DC switchover
 * Log Health:
 * Exception thrown for failure to save settings appears ~ 1000 times/day: https://phabricator.wikimedia.org/T202149
 * labtestweb2001 is sending updates to a read-only db host: db2037: https://phabricator.wikimedia.org/T201082
 * ErrorException from line EducationProgram PHP Notice: Undefined variable: retValue: https://phabricator.wikimedia.org/T203577

Train status and happenings

 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Roles#Train_Conductor

Past week status updates

 * All of it in table form: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Goals/201718Q4

Pipeline: Move verify stage from Minikube to CI k8s namespace in production context

 * tracking task


 * Done

Code Health

 * T199253 - Investigate and propose record of origin (ROO) for deployed code (currently Developers/Maintainers page)
 * talked to Tyler a bit about this. Also need to hook up with SRE (and other stakeholders).  This appears to be tightly coupled with the review queue process.
 * Perform existing Stewardship review process for Q1 cycle.
 * no reviews requested at the moment. Corey requested to meet with me today to discuss finding homes for some of the platform code.
 * T199254 - Add test evaluation to post mortem review process.
 * Review existing e2e test coverage.
 * Define prioritization scheme.
 * Prioritize e2e testing gaps.
 * T199257 - make current unit testing coverage more visible by reporting out to Engineering Management.
 * worked on creating a template for the first monthly report. Actually thinking that this will be part of a broader Code Health monthly newsletter.
 * T199259 - Platform and Search Platform teams are using TDM PoC
 * T199262 - Identify key Tech Debt areas
 * T199263 - Put in place Tech Debt management process for PEP
 * T199261 - Define base Code Health metric set.
 * scheduled WG kickoff meeting (tomorrow)

Developer Productivity

 * Make a hire to create the capacity needed for this program.
 * Write and share a survey to measure developer satisfaction and areas for investment.

Selenium

 * Q1 goals task: T198389 Q1 Selenium framework improvements
 * T179188 Video recording for Selenium tests in Node.js
 * patch working, Timo requested some changes https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933
 * T188742 Run tests daily targeting beta cluster for all repositories with Selenium tests
 * almost done for repos that had nodepool jobs, all jobs green, patch in final review https://gerrit.wikimedia.org/r/c/integration/config/+/443931
 * working on repos that did not have jobs https://gerrit.wikimedia.org/r/c/integration/config/+/457882
 * T185011 Create selenium-daily-beta-MediaWiki daily Jenkins job
 * job is green because it only runs tests that pass on beta https://integration.wikimedia.org/ci/view/Selenium/job/selenium-daily-beta-MediaWiki/
 * patch in review https://gerrit.wikimedia.org/r/c/integration/config/+/457881
 * I have to investigate which tests could run on beta with little/some/much refactoring

Phabricator

 * Nothing significant to note this week.

Jenkins

 * mediawiki-quibble Docker jobs fail due to full disks
 * https://phabricator.wikimedia.org/T202457
 * running containers are eating into /
 * Docker devicemapper


 * Statsd publisher -- still going? still needed?
 * Either needs to be expanded to collect more data or ditched
 * Doesn't allow for metadata so it's harder to narrow down stats to particular segments
 * In the end, it was a lot easier to pull stats from the Jenkins JSON API, but Jenkins only keeps 30 days' worth of build data around
 * Can we simply extend the Jenkins retention period?
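For reference, a minimal sketch of the JSON API approach in Python. It assumes a job's builds were requested with Jenkins' `tree` query parameter (e.g. `.../api/json?tree=builds[number,result,duration,timestamp]`) and that `duration` is in milliseconds; the job name in the URL and the `summarize_builds` helper are hypothetical illustrations, not an existing RelEng tool. Because it only sees what the API returns, the summary is bounded by whatever build history Jenkins still retains.

```python
import json
from collections import Counter
from statistics import median

# Hypothetical job URL, shown for reference only; this sketch does not fetch it.
# The `tree` parameter restricts the response to the listed build fields.
EXAMPLE_URL = ("https://integration.wikimedia.org/ci/job/EXAMPLE-JOB/api/json"
               "?tree=builds[number,result,duration,timestamp]")


def summarize_builds(api_json: str) -> dict:
    """Summarize results and durations from a Jenkins job JSON API response.

    Assumes each build has `result` ("SUCCESS", "FAILURE", ..., or null while
    still running) and `duration` in milliseconds.
    """
    builds = json.loads(api_json).get("builds", [])
    # Skip builds that are still running (their result is null).
    finished = [b for b in builds if b.get("result") is not None]
    results = Counter(b["result"] for b in finished)
    durations_s = [b["duration"] / 1000 for b in finished]
    return {
        "total": len(finished),
        "results": dict(results),
        "median_duration_s": median(durations_s) if durations_s else None,
    }


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    sample = json.dumps({"builds": [
        {"number": 3, "result": None, "duration": 0, "timestamp": 1536570000000},
        {"number": 2, "result": "FAILURE", "duration": 90000, "timestamp": 1536560000000},
        {"number": 1, "result": "SUCCESS", "duration": 60000, "timestamp": 1536550000000},
    ]})
    print(summarize_builds(sample))
```

Unlike statsd counters, each build record here carries its own fields, so results can be segmented after the fact (by job, node type, time window) as long as the builds are still retained.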

QA

 * all quiet - no additional discussions have occurred since the initial barrage.

Antoine

 * What I plan to do this week
 * disk space cleanup patch
 * new version of Chromium in Debian
 * What I'm blocked on
 * ArticlePlaceholder QUnit test fixes
 * DonationInterface composer-merge bug
 * Other?
 * Last train:
 * database corruption from parser-cache output

Dan

 * What I plan to do this week
 * Helping with CI disk-full problems – https://phabricator.wikimedia.org/T202457
 * Rolling back stats publisher (wah wah)
 * Starting up new bigmem instance
 * Looking at JSON API based stats for tmpfs change and node types
 * What I'm blocked on
 * Disk full problems in CI
 * Other?

Greg

 * What I plan to do this week
 * Hiring/resume review/getting more applicants
 * Development Plans - mine, yous'ins
 * read the first chapter of https://www.worldcat.org/title/leadership-pipeline-how-to-build-the-leadership-powered-company/oclc/47009595 (everything is a pipeline)
 * make a thing based on that ^ that's due, uh, soon, this week?
 * Staging catchup?
 * What I'm blocked on
 * time, I'm late in doing a required manager training :(
 * Other?

Jean-Rene

 * What I plan to do this week
 * Continue work on ROO/Review Queue
 * Continue work on Code Coverage/Code Health report
 * Code Health Metrics workgroup is spinning up
 * Talk with Corey about platform code ownership
 * What I'm blocked on
 * Got new laptop (yay!). But it has a firmware password :-( and as a result I can't migrate my current laptop's configuration.
 * WTF?!
 * Other?

Mukunda

 * What I plan to do this week
 * Work on developer productivity survey.
 * Finish phabricator support for elasticsearch 6.
 * Finish testing the phabricator spam-revert tool.
 * Pairing with Tyler (task TBD, probably catch up on scap and keyholder stuff)
 * Create a personal workboard in phab.
 * What I'm blocked on
 * Other?

Tyler

 * What I plan to do this week
 * Fix eval.jit=1 via scap (I'm evidently running a newer sudo version than prod)
 * keyholder patch review
 * get list of services running in beta (cumin?)
 * maintenance-disconnect-full-disks bring recovered nodes back online (somehow?)
 * Add instance for mwmaint1001/2001 https://phabricator.wikimedia.org/T125976
 * What I'm blocked on
 * Other?
 * tinker w/zotero v2 if there's time
 * review ext/ContentTranslation -> gatedextension https://gerrit.wikimedia.org/r/c/integration/config/+/450508/

Zeljko

 * What I plan to do this week
 * T179188 Video recording for Selenium tests in Node.js
 * patch working, Timo requested some changes https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933
 * T188742 Run tests daily targeting beta cluster for all repositories with Selenium tests
 * almost done for repos that had nodepool jobs, all jobs green, patch in final review https://gerrit.wikimedia.org/r/c/integration/config/+/443931
 * working on repos that did not have jobs https://gerrit.wikimedia.org/r/c/integration/config/+/457882
 * T185011 Create selenium-daily-beta-MediaWiki daily Jenkins job
 * job green because it's running only tests that pass on beta https://integration.wikimedia.org/ci/view/Selenium/job/selenium-daily-beta-MediaWiki/
 * patch in review https://gerrit.wikimedia.org/r/c/integration/config/+/457881
 * I have to investigate which tests could run on beta with little/some/much refactoring
 * What I'm blocked on
 * Other?

Team Kanban Board Review and Triage

 * closed and touched in the last 7 days
 * No update for 4 weeks
 * No update for 3 weeks
 * No update for 2 weeks
 * No update for 1 week
 * All Open
 * Review To Triage column of #releng
 * Assigned
 * Unassigned

Once / month-ish review of backlog(s)

 * releng - Review To Triage column of #releng
 * releng-kanban - Review unassigned in kanban
 * releng-kanban - Review 'backlog' column of -kanban
 * releng-next - Review for things we need to put on our kanban backlog
 * releng-backlog - oh my, the huge backlog of things...

Kanban stats

 * Burnup chart