Wikimedia Release Engineering Team/Checkin archive/20180924

= 2018-09-24 =

Vacations/Important dates

 * https://office.wikimedia.org/wiki/HR_Corner/Holiday_List
 * How to do it


 * September 27th (Thursday) - Antoine busy handling paperwork
 * Beginning to mid-October - Antoine to take some weeks/days off or work part time
 * October 5th (Friday) - Željko on a conference (https://2018.webcampzg.org/ )
 * October 8th - Holiday (Indigenous Peoples' Day, Independence Day - Željko)
 * October 8th - New hire start date
 * November 1 (Thursday) - Holiday (All Saints' Day - Željko)
 * November 9th - Holiday (Veterans Day)
 * November 22+23 - Holidays (Thanksgiving)
 * November 25-December 2nd: Mukunda vacation (in California ahead of the offsite)
 * Week of December 3rd - Team offsite
 * December 24-28 - Holidays (Christmas)

Train

 * Maniphest query for deployment blocker tasks: https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-fmcvjrkfvvzz3gxavs3a&statuses=open%28%29&group=none&order=newest#R


 * July 02 - wmf.11 - Zeljko - no train, Fourth of July
 * July 09 - wmf.12 - Zeljko
 * July 16 - wmf.13 - Zeljko
 * July 23 - wmf.14 - Zeljko
 * July 30 - wmf.15 - Mukunda
 * Aug 06 - wmf.16 - Mukunda
 * Aug 13 - wmf.17 - Mukunda (No train - Wednesday is a holiday)
 * Aug 20 - wmf.18 - Tyler
 * Aug 27 - wmf.19 - Dan && Antoine lurking over the shoulders
 * Sep 03 - wmf.20 - Antoine
 * Sep 10 - wmf.21 - Antoine (No train due to DC switchover)
 * Sep 17 - wmf.22 - Antoine
 * Sep 24 - wmf.23 - Zeljko <
 * Oct 01 - wmf.24 - Dan
 * Oct 08 - wmf.25 - Dan (No train due to DC switchover)
 * Oct 15 - wmf.26 - Mukunda (last 1.32 wmf.XX release, 1.33 starts the next week)
 * Oct 22 - wmf.1 - Mukunda

SoS

 * July 04 - Dan
 * July 11 - Antoine
 * July 18 - Antoine
 * July 25 - Tyler
 * Aug 01 - Tyler
 * Aug 08 - Zeljko
 * Aug 15 - Dan (No SoS this week)
 * Aug 22 - Zeljko
 * Aug 29 - Zeljko
 * Sep 05 - Tyler / Željko
 * Sep 12 - Tyler / Željko
 * Sep 19 - Dan / Željko
 * Sep 26 - Zeljko <
 * Oct 03 - Zeljko
 * Oct 10 - Zeljko
 * Oct 17 - Zeljko
 * Oct 24 - Zeljko
 * Oct 31 - Zeljko

Hiring

 * Software Engineer position open and reviewing/hiring for now
 * https://boards.greenhouse.io/wikimedia/jobs/1225258

First Offsite
Details:
 * Week of December 3rd
 * At the Queen Mary hotel in Long Beach
 * Deb T will be facilitating

Topics!
 * https://etherpad.wikimedia.org/p/RelEng-Offsite-201811-Topics

Development plans

 * Due end of the week!

Needs attention

 * 2018-09-10 -- Gerrit Privacy Policy & CoC patch
 * https://phabricator.wikimedia.org/T196835
 * 2018-09-17 -- Patches for new UI:
 * (ops/puppet) Replace polygerrit theme in repo: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458523/
 * (gerrit) Remove from repo: https://gerrit.wikimedia.org/r/#/c/operations/software/gerrit/+/458524/
 * (ops/puppet) Add footer link for new UI: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458833/
 * (ops/puppet) Add footer link for old UI: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460914/
 * All applied to: http://gerrit.tylercipriani.com:8080
 * 2018-09-24 -- for puppet swat tomorrow


 * 2018-09-10 -- Run mediawiki::maintenance scripts in Beta Cluster
 * https://phabricator.wikimedia.org/T125976
 * Tyler to create instance
 * 2018-09-17 - not done
 * 2018-09-24 -- done (deployment-mwmaint01)

Operational Excellence posts

 * greg got it at 5:45 on Friday, hasn't had a chance to review yet....


 * CI: https://docs.google.com/document/d/181-LQJ-iyxKYXEo93tEGEiAsm9dtPiA3iXVDCj_RSRI/edit?ts=5ba5620c
 * Ops: https://docs.google.com/document/d/1dtkvwWGknReIqhA2wkGQSr1Exmel4Q00bttT3rXQBQE/edit?ts=5ba56205

Scrum of Scrums

 * Greg to copy to etherpad after meeting: https://etherpad.wikimedia.org/p/Scrum-of-Scrums

Release Engineering

 * Blocked by:
 * Blocking:
 * Updates:
 * Train Health:
 * Log Health:
 * T204871 web request took longer than 60 seconds and timed out (copy to callouts)
 * Code Health:
 * Creating communication channels (Phabricator https://phabricator.wikimedia.org/tag/code-health-metrics/, IRC, mailing list)

Release Engineering

 * Blocked by:
 * [WMCS] Increased quotas for vcpu and memory in integration project: https://phabricator.wikimedia.org/T204373
 * Blocking:
 * Updates:
 * Train Health: no train last week due to DC switchover, train continues this week
 * Log Health:
 * Code Health:
 * Code Health Metrics Working Group Kickoff last week
 * Code Health Metrics Working Group meeting this week - further discuss/define the workgroup's scope and next steps

Train status and happenings

 * https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Roles#Train_Conductor

1.32.0-wmf.22 went well. Antoine wrote a quick summary at the end of the task, with thanks to the people involved: https://phabricator.wikimedia.org/T191068#4604040

Things potentially worth attention:


 * New but not blocking T204871: Promoting group1 to 1.32.0-wmf.22 caused a flood of "web request took longer than 60 seconds and timed out" errors
 * A wikiversions.json update (and probably any scap action) causes a flood of request timeouts, which resolves itself. The timeouts were previously NOT enforced, so we have probably always had the issue and it only shows up now. To be investigated.
 * For next train: the timeouts can be ignored for the 3 or 4 minutes that follow the promotion. See task for details.
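
The "ignore for 3 or 4 minutes" rule above could be applied to a list of timeout timestamps roughly as follows. This is only a sketch: the function name, the 4-minute default, and the idea of working from plain timestamps are assumptions for illustration, not something from T204871 or from scap.

```python
from datetime import datetime, timedelta

# Assumed grace period; the notes only say "3 or 4 minutes".
GRACE = timedelta(minutes=4)


def timeouts_outside_grace(promoted_at, timeout_times, grace=GRACE):
    """Return the timeout timestamps that are NOT inside the grace window.

    The window starts at the promotion time and ends `grace` later
    (both ends inclusive). Timeouts inside it are treated as expected noise.
    """
    window_end = promoted_at + grace
    return [t for t in timeout_times if not (promoted_at <= t <= window_end)]
```

Anything this returns would still deserve a look by the train conductor; an empty list means every timeout fell inside the expected window.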


 * Worked around T204907: Scap is checking canary servers in dormant instead of active-dc
 * The scap dsh groups were still referencing eqiad servers, which made the canary check useless. Antoine changed them to codfw hosts. A better solution is needed so that they change automagically based on the active datacenter. Maybe conftool/etc. can help.
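
One way to picture the "pick canaries from the active datacenter" idea is a lookup keyed by datacenter name. This is a minimal sketch under stated assumptions: the host names are placeholders, `canaries_for` is a hypothetical helper, and how the active datacenter is discovered (conftool or otherwise) is left out and simply passed in as an argument. It is not how scap currently works.

```python
# Placeholder host lists; the real dsh group contents are not in these notes.
CANARIES_BY_DC = {
    "eqiad": ["canary-a.eqiad.example", "canary-b.eqiad.example"],
    "codfw": ["canary-a.codfw.example", "canary-b.codfw.example"],
}


def canaries_for(active_dc, canaries_by_dc=CANARIES_BY_DC):
    """Return a copy of the canary host list for the active datacenter.

    Raises ValueError for an unknown datacenter or an empty list, so a
    misconfiguration fails loudly instead of silently checking nothing.
    """
    hosts = canaries_by_dc.get(active_dc)
    if hosts is None:
        raise ValueError(f"unknown datacenter: {active_dc!r}")
    if not hosts:
        raise ValueError(f"no canary hosts configured for {active_dc!r}")
    return list(hosts)
```

Failing loudly on an unknown or empty group is a deliberate choice here: the problem described above was a canary check that kept passing while looking at the wrong hosts.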


 * Known T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out
 * When wikis change versions, ORES seems to have trouble handling the new requests. There are a few HTTP timeouts when reaching the ORES service. Amir stepped in immediately and asked on Friday whether that was UBN worthy, but Antoine said it could wait for Monday SWAT.


 * thcipriani: maybe make a simple timeline/incident report for this? (for example https://wikitech.wikimedia.org/wiki/Incident_documentation/20180821-Train )
 * ACTION: Antoine to do this

Past week status updates

 * All of it in table form: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Goals/201718Q4

Pipeline: Move verify stage from Minikube to CI k8s namespace in production context

 * tracking task


 * some movement for next quarter stuff -- zotero-v2/node10js images

Code Health

 * T199253 - Investigate and propose record of origin (ROO) for deployed code (currently Developers/Maintainers page)
 * On track to have first pass proposal defined
 * Perform existing Stewardship review process for Q1 cycle.
 * T199254 - Add test evaluation to post mortem review process.
 * Review existing e2e test coverage.
 * Define prioritization scheme.
 * Prioritize e2e testing gaps.
 * T199257 - make current unit testing coverage more visible by reporting out to Engineering Management.
 * Will have first pass Code Health Newsletter (which will include coverage info) by the end of the week.
 * T199259 - Platform and Search Platform teams are using TDM PoC
 * T199262 - Identify key Tech Debt areas
 * T199263 - Put in place Tech Debt management process for PEP
 * T199261 - Define base Code Health metric set.
 * Working group met last week as well, have base tasks defined, and have started defining some metric candidates.

Developer Productivity

 * Make a hire to create the capacity needed for this program.
 * Write and share a survey to measure developer satisfaction and areas for investment.


 * hiring
 * survey?

Selenium

 * Q1 goals task: T198389 Q1 Selenium framework improvements
 * T179188 Video recording for Selenium tests in Node.js - Antoine and Željko disagree on if it's done :) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933


 * T199133 Find top 15 target projects that could use Selenium tests to prevent incidents
 * Review existing e2e test coverage - done
 * Define prioritization scheme - doing
 * Prioritize e2e testing gaps - next

Phabricator

 * Task types work: https://phabricator.wikimedia.org/T93499
 * Blog post about task types: https://phabricator.wikimedia.org/phame/post/view/116/an_introduction_to_task_types_in_phabricator/

Jenkins

 * Timo is writing a wikitech-l newsletter and including a section about our recent CI work (disk space issues, consolidation of instances, etc.). He wants to link out to a more substantial post from us. This would need to be done by Tuesday. :)
 * (Covered. See Production Excellence section under Team Business)

QA

 * Had QA sig meeting last week. Spoke with Elena to see if additional discussions about QA career paths took place in Audiences.  None so far.

SCAP

 * Scap REAL canary patch: https://phabricator.wikimedia.org/D1114
 * thcipriani: accepted! land at will.
 * The "rebuildLocalisationCache.php takes 40 minutes" task is complete
 * Took 1m 7s with no changes to process, so a real run will be slower than that, but should still be much, much faster than before

Antoine
Did train, a bit of Quibble and CI config. Train went well!


 * What I plan to do this week
 * What I'm blocked on
 * Other?

Dan

 * What I plan to do this week
 * Continuing my crusade of collecting Jenkins build duration stats
 * Blubberoid Swagger/OpenAPI spec
 * Development plan
 * What I'm blocked on
 * Understanding prometheus and/or best way to aggregate statsd buckets
 * Review of my change to service-checker
 * Other?
 * Anyone feel like reviewing?:
 * Blubberoid unit test
 * Remove support for `sharedvolume` in Blubber
 * thcipriani: will do some review on these :)

Greg

 * What I plan to do this week
 * interviewing
 * doing a SWAT today :)
 * "finalize" y'all's development plans
 * ping Deb on when to start planning out our Offsite - delay this
 * review of onboarding docs again (steal some good stuff from Discovery Team's) (thcipriani: https://wikitech.wikimedia.org/wiki/Ops_Onboarding ops has good stuff to steal, too :))
 * production excellence blog review
 * Pipeline presentation outlining?
 * What I'm blocked on
 * Other?

Jean-Rene

 * What I plan to do this week
 * wrap up Q1 Goals
 * Dev plan
 * What I'm blocked on
 * Other?

Mukunda

 * What I plan to do this week
 * Finish development plan
 * Scap swat https://phabricator.wikimedia.org/T196411
 * Working with Chase on custom "security issue" task type
 * Some other things and stuff
 * Get feedback on Dev Productivity survey
 * What I'm blocked on
 * Other?

Tyler

 * What I plan to do this week
 * Development plan convo
 * CoC footer patch
 * keyholder code review
 * What I'm blocked on
 * Other?
 * zotero-v2 followup as needed
 * scap workboard cleanup as there's time

Zeljko

 * What I plan to do this week
 * T191069 1.32.0-wmf.23 deployment blockers


 * T198389 Q1 Selenium framework improvements
 * T179188 Video recording for Selenium tests in Node.js - Antoine and Željko disagree on if it's done :) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933


 * T199133 Find top 15 target projects that could use Selenium tests to prevent incidents
 * Review existing e2e test coverage - done
 * Define prioritization scheme - doing
 * Prioritize e2e testing gaps - next
 * What I'm blocked on
 * Other?

Team Kanban Board Review and Triage

 * Closed and touched in the last 7 days
 * No update for 4 weeks
 * No update for 3 weeks
 * No update for 2 weeks
 * No update for 1 week
 * All Open
 * Review To Triage column of #releng
 * Assigned
 * Unassigned

Once / month-ish review of backlog(s)

 * releng Review To Triage column of #releng
 * releng-kanban Review unassigned in kanban
 * releng-kanban Review 'backlog' column of -kanban
 * releng-next - Review for things we need to put on our kanban backlog
 * releng-backlog - oh my, the huge backlog of things...

Kanban stats

 * Burnup chart