Wikimedia Release Engineering Team/Offsites/2018-05-Barcelona/Notes

These are the raw notes from our 2 days of offsite discussions.

Data Data Data
Lead: Jean-René
 * Data for code-stewardship reviews (historic data)
 * Commits & patch sets
 * Jenkins & CI test results are discarded after 15 or 30 days
 * Where can we put new kinds of data/metrics? Is there a shared environment to store them?
 * jr: for example, talking to exploratory testers: we have no idea about the results of their work. It is hard to get new QA testers on board. The role is broad, but one sure thing is that they will either produce or consume testing data.
 * We have lots of data/dashboards, but no statistics over the long term
 * antoine: Raita was the dashboard for that (but it has been decommissioned)
 * Historic dashboard for metrics and data
 * Dan: targeted towards browser tests
 * Hypothetical Entity Relationship (ER) diagram (see the sketch after this list):
 * Patchsets relate to deployments
 * Deployments relate to outages
 * Relationships in a tree format
 * Relationships between gerrit change and phabricator tasks
 * Developer/maintainers page. For an extension/skin, JR would like to see:
 * Activity (commits and changes)
 * Outstanding tasks
 * How closely it follows the latest MediaWiki standards (e.g. extension.json, versions of linters, test coverage, etc.)
 * Tests that are running:
 * How frequently errors occur
 * How many tests are failing
 * Average time to resolution for a failed test (e.g. E2E or unit tests failing on an unrelated change because core changed months ago and the extension is barely active)
 * The pace of changes being merged
 * Extension status: alpha, maintenance, Wikimedia-deployed, obsolete. That is mostly tracked on mediawiki.org (partly in CI config as "archived")
 * Overview of stewardship
 * https://www.mediawiki.org/wiki/Development_policy/Code_Stewardship
 * github pulse ( https://github.com/wikimedia/mediawiki/pulse ) -- do we want that?
 * Human process oriented vs repository oriented (merges vs task closing)
 * time to resolution (TTR) for tasks (filed to resolved/declined/whatever)
 * but this is only meaningful for "bugs", not other planning-type tasks
 * What are the systems we have, how do we normalize the data from those systems, and where do we put it?
 * A consistent interface for retrieving data
 * We need to keep all the data that we can and get data out of Jenkins (for example, we could send that data to Elasticsearch, but currently it is locked up in Jenkins)
 * We have an agreement that we'd like to collect all the test data...somewhere somehow
 * RelEng is the best place for this data
 * Do we set this up? Or do we work with other teams to do this?
 * Proposal: prepare for a 20-minute session with the Analytics team at the hackathon
 * A system: https://wikimedia.biterg.io/app/kibana#/dashboard/Overview (see also https://www.mediawiki.org/wiki/Community_metrics )
 * Stewardship creates these open questions, useful for annual planning as well
 * Going through, system-by-system, and finding out what data we want to store
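
A minimal sketch of the hypothetical ER model above as Python dataclasses. This is an illustration only: the class and field names (Patchset, Deployment, Outage, gerrit_change, etc.) are assumptions, not an existing schema.

# Illustrative sketch of the hypothetical ER model discussed above.
# All class and field names are assumptions; no such schema exists yet.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Patchset:
    gerrit_change: int             # Gerrit change number
    phabricator_tasks: List[str]   # e.g. ["T456"]
    repo: str
    merged_at: Optional[datetime] = None

@dataclass
class Deployment:
    started_at: datetime
    finished_at: datetime
    deployer: str
    patchsets: List[Patchset] = field(default_factory=list)

@dataclass
class Outage:
    incident_id: str               # e.g. a wikitech incident report slug
    started_at: datetime
    # deployments suspected or confirmed to be related to this outage
    deployments: List[Deployment] = field(default_factory=list)

Even a toy model like this makes the open questions concrete: which system is authoritative for each field, and where the normalized records would live (e.g. Elasticsearch, as mentioned above).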

Open Questions
 * Is our current analytics stack open for use by others in open ended ways?
 * Example: https://pivot.wikimedia.org/ for page views/requests ( upstream: https://imply.io ). Lets one easily build graphs by country/browser, etc.
 * Analytics: Can we start dumping various data sources into a place and figure out how we're going to view/make sense of it later?
 * How can we interact with Bitergia to extend the data sources and views? (poke Quim/Andre)
 * Identify reviewers/maintainers: https://www.mediawiki.org/wiki/Git/Reviewers | https://www.mediawiki.org/wiki/Developers/Maintainers

Next Steps:
 * Talk with Analytics - JR
 * Talk with CE/Bitergia - JR
 * Explore Bitergia - JR
 * Identify data sources we want to collect - RelEng (who knows which systems exist)
 * Erik Bernhardson / Guillaume Lederrey

SWATs/Trains
Lead: Tyler TODO
 * Automating/improving logging of SWATs and Trains - https://phabricator.wikimedia.org/T193311 :
 * It would be nice to have concrete data about SWAT windows without having to dig through the SAL. Some nice-to-have info: number of syncs per SWAT window and time spent deploying patches for a given SWAT window.
 * Problem: We've wanted to change SWAT windows/deploys. People hated that we wanted to change things (namely: reducing the number of patchsets deployed and how they are deployed). We need data to make informed decisions, e.g. correlating syncs with SWATs and outages.
 * Definition: SWAT is three 1-hour windows per day for developers to propose hotfixes/config changes, served by RelEng / deployment-group users.
 * Now we have syncs and we have windows, and their only relation is through the wiki pages
 * out of scope:
 * relating patches -> swat window
 * proposing patches in a window
 * Zeljko: we are just pushing buttons. We do not add much value
 * NEEDs:
 * Given a time window, get the list of syncs / patchsets deployed (and ultimately a developer / point of contact)
 * we need the data
 * a place to display/query it
 * Minimal Viable Solution
 * Have scap ask "is this a SWAT? y/n" each time it's not a full scap or --force
 * This Deployment did this Change associated with this Task.
 * what about...
 * scap swat start (or: `scap swat` starts a shell)
 * (query wiki page, list changes, etc)
 * scap swat done
 * See: "scap swat" patch from Mukunda
 * ( https://gerrit.wikimedia.org/r/#/c/306259/ / https://phabricator.wikimedia.org/T142880 ). Demo: https://asciinema.org/a/1x54kw77tvatxiqv45ba6ael7
 * current documentation https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment
 * current command: scap sync-file path/to/file 'SWAT: Commit message (T456)'
 * if the comment is not in this format, scap prompts you for the SWAT/Gerrit/Phabricator information
 * Do not allow deploys without first indicating what window you're starting
 * scap swat start or scap deploy start (or --force)
 * that informs scap how to act/log
 * mw-config.php
 * assume that as soon as it's merged, it's deployed
 * Tyler to reassess Mukunda's scap swat patch in mw-config
 * Look into parsing scap messages for known patterns and pulling out the data (see the sketch after this list)
 * Look into enabling scap start/done
 * Look into recording if mwdebug was used during the deploy (eg: 'scap stage')
 * How/when will we get time for this?
 * Have Mukunda do a couple weeks of SWATs
 * Mukunda has a lot to say about this subject.... writeup incoming
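
A minimal Python sketch of the "parse scap messages for known patterns" idea above. The SAL line shape and the SWAT/task-ID patterns here are assumptions based on the scap sync-file 'SWAT: Commit message (T456)' convention mentioned earlier, not a confirmed log format.

# Sketch: pull SWAT syncs out of SAL-style log lines and count them per window.
# The assumed line shape (an assumption, not a spec) is:
# "2018-05-16 13:04 deployer: Synchronized path/to/file: SWAT: msg (T456) ..."
import re
from collections import defaultdict

SAL_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}) "
    r"(?P<user>\S+): Synchronized (?P<path>\S+): (?P<msg>.*)$"
)
TASK_ID = re.compile(r"\bT\d+\b")

def swat_syncs(lines):
    """Yield (window, user, path, task_ids) for lines that look like SWAT syncs."""
    for line in lines:
        m = SAL_LINE.match(line)
        if not m or "SWAT" not in m.group("msg"):
            continue
        # Crude 1-hour bucket standing in for the scheduled window
        window = m.group("date") + " " + m.group("time")[:2] + ":00"
        yield window, m.group("user"), m.group("path"), TASK_ID.findall(m.group("msg"))

def syncs_per_window(lines):
    """Count syncs per crude one-hour bucket."""
    counts = defaultdict(int)
    for window, _user, _path, _tasks in swat_syncs(lines):
        counts[window] += 1
    return dict(counts)

This only yields sync counts per crude one-hour bucket; relating them to the actual scheduled windows would still need the Deployments calendar, which is exactly the gap that scap swat start / scap swat done would close.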

Staging
https://docs.google.com/document/d/1CT_pKjwiDmFhZZ9LW9mz0z434-wgr3NFdapUPWUvMNA/edit?ts=5aba5398#heading=h.ra4sbg2fs7zl
2018-2019 annual plan: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019

Lead: Greg


 * The presentation
 * The project as defined by operations is incomplete


 * The response to Victoria
 * We are here due to the initial issue of a choice between doing the Pipeline project vs a Staging project. That either/or is now a both/and.
 * Operations wants an environment that can potentially prevent outages depending on how they define it. It could potentially prevent outages of services that we don't control nor deploy.
 * We are making a survey to gather the current usage of the Beta Cluster, which can help inform SRE's decisions/planning.
 * We have defined use cases
 * The other questions are best answered by SRE as they heavily depend on technical implementation decisions
 * Protocol changes as proposed are out of scope for this discussion and truthfully feel like reach-through micromanagement without any real data or reasoning.

What RelEng needs:
 * Just to continue to do our positive interaction with SRE in our weekly Pipeline meetings
 * A simple part of that is for SRE to provide a k8s cluster and/or namespace for CI to deploy to (as previously discussed and agreed upon)


 * Idea (Dan): rebrand the "deployment pipeline" project to "Continuous Delivery of the MediaWiki Stack"

NEXT:
 * Greg to talk with Deb about what to do next with talking to Victoria
 * Greg to figure out how we can better market what we are accomplishing (eg "monthly showcase")
 * Get a k8s cluster from SRE for CI to deploy to.

Developer Productivity JD
Lead: Greg. Blog post: https://squiggle.city/~frencil/archives/20150625.html#anatomy_of_a_healthy_job_post

You will be leading the effort to improve overall developer productivity. We will want you to create a replacement for our homebuilt Vagrant-based local development environment using the latest technologies such as Kubernetes (minikube), Docker, and Helm. You will be working closely with several teams and volunteers in the community.

Responsibilities
 * Help engineer container based tooling for MediaWiki application development and deployment
 * Maintain integration of developer tooling into a continuous delivery pipeline
 * Proactively find and create productivity improvements
 * Work within a highly collaborative and open organization and community

Requirements
 * Proficiency with software, systems, or devops engineering
 * Collaboration skills that are as important as, if not more important than, technical skills
 * Experience with continuous integration/deployment systems
 * Experience with virtualization or container technologies
 * Experience with server configuration management software

Nice to haves
 * Free Software experience
 * Experience working in a remote-first organization
 * Experience using a Kubernetes environment
 * MediaWiki and/or Wikimedia project experience
 * Golang experience

Moving to an "everyone deploys their own changes" model (for SWAT)

 * Why are SWATs scheduled?
 * Why are there only a limited number of people in-charge of doing them?

Z: Would like everyone who is already staff or a contractor to be able to do their own deploys. Z: a lot of European SWAT users now self-deploy (e.g. Amir, David Causse).


 * Turn SWATs into "volunteer patch deployment" windows. If you are staff/contractor, you deploy your own thing when you need to.

Pipeline Demo
Lead: Dan/Tyler. https://integration.wikimedia.org/ci/job/service-pipeline-test-only-debug is a job using Jenkins Pipeline, defined in Groovy.


 * Presentation of Blubber and pipeline
 * What is minikube?

Blubber and MediaWiki + extensions

 * We use docker-pkg with Quibble and Blubber in the pipeline. Is that a problem? No. Not really.
 * Use of docker-pkg is appropriate in domains that require/allow full control of Dockerfile and image build (root)
 * Base images are controlled by SRE (operations/docker-images/production-images)
 * CI images for use with Quibble are controlled by RelEng (integration/config)
 * Talked about whether we should use Quibble as the entrypoint in pipeline testing. Should we? No. Probably not.
 * Different use case: Quibble depends on an environment that has a superset of MW+ext dependencies, whereas Blubber is meant to be repo-authoritative.
 * EVERYTHING IS GREAT, AGAIN.
 * What does a Blubberized MediaWiki look like? For the limited scope of the FY1718 Q4 goal ((MediaWiki + Math) + Mathoid)? For the far future?
 * Discussion about how to deal with Debian dependencies and extensions depending on each other.
 * For Q4 goal, we don't technically need to solve the ext dependency issue (Math does not depend on other extensions or skins)

Are we testing a lot?
All Quibble jobs run these combinations: mysql/vendor/php70, mysql/composer/php70, mysql/vendor/php55, mysql/vendor/hhvm. Each job runs (see the sketch after this list):
 * php/js lint/eslint
 * qunit/phpunit
 * webdriver.io
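
A tiny Python sketch of the matrix above. The four combinations are the ones listed in these notes (not a full database x dependency-source x runtime product), and the job-name format is illustrative, not the real Jenkins job naming.

# The Quibble job combinations named above, with the stages each runs.
# Job-name format is illustrative; only these four combos are mentioned.
QUIBBLE_JOBS = [
    ("mysql", "vendor", "php70"),
    ("mysql", "composer", "php70"),
    ("mysql", "vendor", "php55"),
    ("mysql", "vendor", "hhvm"),
]
STAGES = ["php/js lint + eslint", "qunit/phpunit", "webdriver.io"]

for db, deps, runtime in QUIBBLE_JOBS:
    print("quibble-{}-{}-{}: ".format(db, deps, runtime) + ", ".join(STAGES))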