Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, July 2013/Release Management and QA

Date: August 21st, 2013

Time: 1:30pm Pacific (20:30 UTC)

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Who:
 * Leads: Greg G, Chris M
 * Participants (invited): Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher quality code. That is our goal.

What we want to accomplish:
 * More appreciation of, response to, and use of (failing) tests in development
 * Better monitoring and reporting out of our development and deployment processes
 * Reducing the time between code being written and being deployed, while discovering issues with code earlier in the process and with more certainty.
 * Help WMF Engineering learn from our mistakes, and adapt to what we learn.

Non-Sprint Activities
The Mobile and Language teams have done wonderful things with their embedded QA people. We want either to expand this model or, where that is not possible, to approximate it by giving more teams supplementary training. For the next quarter, we propose to work more closely with the E3 team on the use of WMF testing infrastructure. They are well positioned to make extensive use of what we have provided.

Sprints
We plan to put the items below into Release Management and QA related sprints sometime between July and December 2013:

Relevant Bugzilla searches:
 * whiteboard:deploysprint-13
 * whiteboard:rmqa-2013
 * whiteboard:rmqa-2013 or deploysprint-13

git-deploy

 * tracking
 * auditing salt scripts for completeness
 * deal with dirty git fetches properly
 * Deployment-prep deploys from master and uses a submodule with submodules
 * Questions:

monitoring / reporting

 * Better 500 error/PHP exception monitoring
 * create monitoring for the issue in that bug
 * Questions:
 * Where should this be stashed in our monitoring?
 * Better reporting out of Jenkins builds status
 * including exporting results from our third-party Jenkins
 * overall dashboard of all build activity
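As a rough illustration of the 500 error monitoring item above, here is a minimal sketch that computes the share of 5xx responses in a batch of access log lines. It assumes the logs are in Common Log Format, which these notes do not state; the names `server_error_rate` and `should_alert` and the 1% threshold are hypothetical and only for illustration.

```python
import re
from typing import Iterable

# Assumption: Common Log Format, where the status code follows the quoted
# request line, e.g. ... "GET /wiki/Main_Page HTTP/1.1" 500 1234
STATUS_RE = re.compile(r'" (\d{3}) ')


def server_error_rate(lines: Iterable[str]) -> float:
    """Return the fraction of parseable log lines with a 5xx status."""
    total = 0
    errors = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


def should_alert(lines: Iterable[str], threshold: float = 0.01) -> bool:
    """Hypothetical alert rule: fire when the 5xx share exceeds threshold."""
    return server_error_rate(lines) > threshold
```

Where such a check would live in the existing monitoring is the open question listed above; the sketch only shows the calculation.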

deployment script improvements

 * Make updates atomic (e.g. symlink + directory move tricks or git-deploy?)
 * Some improvements for the deployment scripts
 * Reconciling the use of timestamps on JavaScript files (rsync vs ResourceLoader vs git)
 * resetUserTokens.php not usable on large wikis
 * Kill deployment hacks with fire - live hacks that are still applied as of 2013-05-16
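The "symlink + directory move" idea for atomic updates can be sketched as follows. This is a generic illustration of the technique on a POSIX filesystem, not a description of the existing deployment scripts; the function name `atomic_symlink_swap` and the `current`/release directory layout are assumptions made for the example.

```python
import os


def atomic_symlink_swap(target_dir: str, link_path: str) -> None:
    """Point link_path at target_dir with no moment where link_path is missing."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Create the new symlink under a temporary name in the same directory,
    # so the final rename stays within one filesystem.
    tmp_link = os.path.join(
        link_dir, ".{}.tmp-{}".format(os.path.basename(link_path), os.getpid())
    )
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(target_dir, tmp_link)
    # os.replace renames the temporary link over the old one; on POSIX this
    # is a single rename, so readers see either the old or the new target.
    os.replace(tmp_link, link_path)
```

A deploy would unpack the new code into a fresh directory first and then call, for example, `atomic_symlink_swap("/srv/releases/new", "/srv/current")` (hypothetical paths), so requests never observe a half-updated tree.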

multi-site awareness

 * mwscript.php/mctest.php does not know about memcache in both datacenters
 * migrate scripts from hume to terbium
 * Database config cleanup -- multisite awareness in MediaWiki
 * Can we fail over the database?
 * Identify what on our side needs to be multi-site first.

Labs related

 * the use of Beta cluster as a true canary (tracking)
 * allowing extensions to be run from a branch other than master
 * hermetic test environment on labs (vagrant or otherwise)
 * set up monitoring of beta cluster
 * also usable for throwaway dev environments for developer use

QA review draft below
QA + Release management:

Release management and QA intersect in some important areas:


 * we want to release software safely and routinely
 * we want to identify problems before they affect users
 * we want to minimize the time between when a problem is introduced and when it is discovered
 * we want to provide information about software quality in a way that informs release decisions

And we intersect in some important spaces:


 * test2wiki and mediawiki.org
 * beta cluster

Beta cluster holds great potential for helping provide software releases in a Continuous Deployment/DevOps way, but it needs more attention:


 * more automation proven for production
 * for example automatic db updates; prod doesn't have this
 * for example automatic extension updates; prod doesn't have this
 * more monitoring
 * automated fatal monitoring, error monitoring, status monitoring
 * performance monitoring and improvements
 * priority of UI tests
 * other automated tests required: e.g. API

Finally, we may require more upstream work:

Reliable unit tests, API tests, PhantomJS tests upstream in Jenkins/CI before automated deploy to beta cluster etc.

Since last review Feb 2013

Done:
 * Beta cluster improvements
 * Automatic db updates
 * Search working properly
 * Now the main target for tests, ahead of test2wiki

Still in progress:
 * Proper support for all extensions in beta cluster https://bugzilla.wikimedia.org/show_bug.cgi?id=49846
 * Bizarre errors still occur in beta: https://bugzilla.wikimedia.org/show_bug.cgi?id=50622
 * SUL https://bugzilla.wikimedia.org/show_bug.cgi?id=51622 https://bugzilla.wikimedia.org/show_bug.cgi?id=51700

Abandoned:
 * Detailed project analysis
 * Groups
 * Formal community test events not related to automated browser tests
 * (Although not completely abandoned; one test event in May identified important issues in VisualEditor at the time)

Black Swans:


 * Shifted to community and Foundation emphasis on browser test automation
 * Provided significant support for Language team browser tests
 * Provided significant support for Mobile team browser tests
 * OPW intern
 * QA mailing list is working; we have contributors and enthusiastic learners
 * Dovetail with Release Management/Continuous Delivery/Continuous Deployment/DevOps

Next quarter priorities:


 * Community training/contributions for browser tests
 * Gadgets too
 * emphasis on beta cluster as prod staging environment
 * spike DevOps tools in beta cluster
 * examples: automatic db updates, automatic extensions updates, etc.
 * vagrant vms as shareable hermetic test environments
 * involve E3 dev team in creating browser tests using Language/Mobile as ongoing examples

Casualties:


 * Echo: HTML email notification would benefit from QA attention, but beta cluster, OPW, QA community are higher priority

Upcoming:


 * Flow
 * Multimedia