Wikimedia Release Engineering Team/Quarterly review, April 2014

Date: April 30th | Time: 18:00 UTC | Slides: ... | Notes: etherpad on wiki

Who:
 * Leads: Greg G, Chris M
 * Virtual team: Greg G, Chris M, Antoine, Sam, Bryan, Chris S, Chad, Zeljko, Andre, Rummana
 * Other review participants (invited): Robla, Sumana, Quim, Maryana, James F, Terry, Tomasz, Alolita, Erik

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher quality code. That is our focus, all in an effort to pave the path to a more reliable continuous deployment environment. What we want to accomplish:
 * More appreciation of, response to, and creation of tests in development
 * Better monitoring and reporting out of our development and deployment processes, especially test environments and pre-deployment
 * Reduce time between code being merged and being deployed
 * Provide information about software quality in a way that informs release decisions
 * Help WMF Engineering learn and adapt from experience

Team roles
Many people outside of the virtual team play an important role in releases, but this review will focus on the work of the following people in the following roles:
 * Release engineering: Greg G, Sam, Chris S (security), Bryan Davis
 * QA and Test Automation: Chris M, Zeljko, Rummana
 * Bug escalation: Andre, Greg G, Chris M, Chris S (security)
 * Beta cluster development/maintenance: Antoine, Sam, Bryan Davis
 * Development tools (e.g. Gerrit, Jenkins): Antoine, Zeljko

Goals
vis a vis the WMF Engineering 2013-14 goals.

Deployment Tooling

 * Process through all (useful) pain points from the Dev/Deploy review session (Greg)
 * some done, not all
 * Scap incremental improvements
 * step 1:
 * Mostly done: refactor existing scap scripts to improve maintainability and reveal the hidden complexity of the current solution (Bryan)
 * "Easy" parts are done. Remaining work was blocked on getting scap running in beta so that changes could be tested somewhere larger than a Vagrant VM and less potentially catastrophic than production.
 * step 2:
 * Create matrix of tool requirements per software stack (MW, Parsoid, ElasticSearch) (Greg)
 * Use the above matrix to add/fix functionality in scap (or related) tooling for ONE software stack, prioritized by cross-stack use (Bryan)

Beta cluster
Goal: continue to have beta labs emulate production more closely (Antoine, all)
 * Make database in beta emulate production (set up db slaves) (Antoine)
 * This could have demonstrated a Flow problem before it was deployed
 * Partly done: use beta labs as a testing ground for the above Deployment Tooling work (Greg, Bryan, all)
 * Infra work in place, so far working out.
 * Not from last QR, but a big priority: migrate Beta cluster from pmtpa to eqiad
 * Much (most?) of the beta cluster configuration was puppetized during the migration. This is a great improvement over the prior cluster in pmtpa, which included many hand-built instances.
 * Beta now includes a local puppet master, which allows cherry-picking work-in-progress puppet changes and applying them across the cluster. This frees Antoine and others from needing +2 approval in operations/puppet.git for each desired change. It also provides a testing platform for changes prior to usage in production.
 * Beta now includes a salt master which allows the use of Trebuchet and general experimentation with salt by non-roots.

Hiring

 * Complete hiring and train new Release Engineer (previously listed as Test Infrastructure Engineer) (Greg, all)
 * Complete hiring and train new Automation Engineer (Ruby) (previously listed as QA Automation Engineer) (Chris, all)

Browser tests
Goal: use the API to create test data for given tests at run time. (Jeff, Chris, Željko)
 * Status: done; in heavy use in MobileFrontend tests, queued for VisualEditor and others
Goal: create the ability to test headless (Željko, Jeff, Chris)
 * but so much more to come now that we have the basic operation working
Goal: run versions of tests compatible with target test environments (Chris, all)
 * tracking this at https://bugzilla.wikimedia.org/show_bug.cgi?id=62509 but have not implemented anything from it
Ongoing:
 * target dev environments with bare wikis/one-off instances/vagrant/"hermetic" test environments
 * in support of teams who requested this, for example Mobile and the public MediaWiki release (Chris)
 * in support of browser tests on WMF Jenkins (Jeff, Željko)
 * requires thoughtful use of the API
 * first pass: create articles with particular title and content. Create users with particular names and passwords.
 * although Vagrant languishes; one focus for the new hire is to bring Vagrant back to current
 * targets build systems (Antoine, all)
 * today we always run the master branch of browser tests. This is inconvenient, as target environments such as test2wiki lag beta labs by at least one week.
 * create the ability in Jenkins builds to run the versions of tests appropriate to the versions of extensions in the target wiki.
 * discussion has only begun, but this would be worthwhile.
 * Continue to move shared code to shared repo; e.g. Login
 * current status: https://www.mediawiki.org/wiki/Quality_Assurance/Browser_testing/Shared_features
 * Continue to maintain tests and keep them green, e.g. connection issues
 * Status: in progress
 * builds WMF-Jenkins -> beta labs in place
 * builds WMF-Jenkins -> SauceLabs coming
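
As a rough illustration of the "first pass" item above (creating articles with a particular title and content through the API at run time), the sketch below builds the form parameters for a MediaWiki `action=edit` call. It rests on stated assumptions: the helper name `build_edit_request`, the sample title, and the placeholder token are hypothetical, the edit token is assumed to have been fetched from the target wiki already, and nothing here is taken from the team's own test code. No network request is made.

```python
from urllib.parse import urlencode


def build_edit_request(title, text, token):
    """Return form parameters for a MediaWiki action=edit API call.

    Hypothetical helper for illustration only; the caller is assumed to
    have obtained an edit token from the target wiki beforehand.
    """
    if not title:
        raise ValueError("title must not be empty")
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "summary": "Test data created at run time",
        "token": token,
    }


# Example: encode the parameters as a POST body (nothing is sent here).
# The token value is a placeholder, not a real token.
params = build_edit_request("Browser test page", "Known content", "abc+\\")
body = urlencode(params)
```

Building the parameters separately from sending them keeps the sketch runnable against any environment; a real test would POST `body` to the target wiki's `api.php` before the browser steps begin.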

Dependencies
Ops dependency:
 * Deployment Tooling (see above)
MW Core dependency:
 * Deployment Tooling (see above)
 * Vagrant

Last quarter actions

 * Greg/Bryan to send periodic updates about scap refactoring
 * Greg to convene a conversation with labs folks post-migration re labs-vagrant (including OpenStack API etc)
 * Have a plan for Vagrant
 * determine fit within test infra explicitly
 * Add MW release tarball as a goal in the next quarterly review
 * Figure out whether a central developer is needed to generate metrics on unit tests, maintain the framework, etc

Goals
vis a vis the WMF Engineering 2013-14 goals.

Deployment tooling

 * (continued from last quarter) Process through all (useful) pain points from the Dev/Deploy review session - (Greg)
 * Integrate HHVM support into our deployment systems - (Bryan, Greg, others from Platform)
 * start the scap(py) & trebuchet integration conversation
 * dependent upon beta cluster work below

Beta cluster

 * Complete transition to scap as code deploy system - (Bryan, Antoine)
 * (from last quarter) - Make database in beta emulate production (set up db slaves) - (Antoine)
 * Swift cluster in beta??
 * RFC support
 * Support HHVM deployment tooling and puppet configuration testing

MediaWiki Release

 * Successfully support the release of MediaWiki 1.23 - (Antoine, Greg)
 * Investigate and create useful release/deployment metrics visualizations - (Greg)
 * e.g. number of builds per day, commits per day, deploys per day, etc
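
To make the per-day metrics above concrete, here is a minimal sketch that counts events per calendar day from a list of timestamps. The function name `events_per_day` and the timestamp format are assumptions, and the sample timestamps are made up for illustration rather than drawn from real deployment logs.

```python
from collections import Counter
from datetime import datetime


def events_per_day(timestamps):
    """Count events per calendar day from ISO 8601 timestamp strings."""
    days = Counter()
    for stamp in timestamps:
        day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").date()
        days[day.isoformat()] += 1
    # Sort by date string so the result is ready for plotting in order.
    return dict(sorted(days.items()))


# Made-up sample data, not real deployment records.
sample_deploys = [
    "2014-04-28T18:05:00",
    "2014-04-28T23:10:00",
    "2014-04-29T18:00:00",
]
deploys_per_day = events_per_day(sample_deploys)
```

The same function could be fed build or commit timestamps; the resulting date-to-count mapping is the kind of input a visualization would start from.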

Past quarter
Since Feb 1 we have reported and fixed (at least) 39 bugs found by analyzing failed browser test results. (As of 21 April)

Bugs found by browser tests were reported for:


 * VisualEditor
 * MultimediaViewer
 * Flow
 * CirrusSearch
 * Page editing -- wikitext editor
 * MobileFrontend
 * beta labs itself
 * ULS
 * CentralNotice


 * Three bugs were specific to Firefox, all for VisualEditor.
 * One bug was specific to Chrome, for MobileFrontend

One bug should have been caught by a browser test but was not: https://bugzilla.wikimedia.org/show_bug.cgi?id=63503. Chris afterward updated the poorly-implemented test to work properly.

Future plans

 * Increase build speed by using WMF Jenkins with headless browsers and parallel execution of tests
 * Reduce noise in Jenkins builds by reducing false failures
 * target beta labs/test2wiki directly from WMF Jenkins
 * target SauceLabs VMs as appropriate
 * ultimately we eliminate Cloudbees Jenkins completely
 * Add browser tests to new repos where possible
 * GettingStarted is the top of the list. (We once had tests for GettingStarted, but they became obsolete and other priorities took precedence over continuing that work)
 * Continue to report software and configuration issues in a timely way
 * We have these tests because they find bugs in time to report and fix the bugs!

Hiring

 * Complete hiring and train new Release Engineer (Greg, all)
 * Complete hiring and train new Automation Engineer (Ruby)  (Chris, all)

Questions

 * to fill in...

Action items
to fill in...