Wikimedia Platform Engineering/MediaWiki Core Team/Quarterly review, July 2013/Release Management and QA

Date: August 21st, 2013

Time: 1:30pm Pacific (20:30 UTC)

Who:
 * Leads: Greg G, Chris M
 * Participants (invited): Antoine, Sam, Chris S, Chad, Zeljko, Michelle G, Robla, Sumana, Quim, Maryana, James F, Ryan Lane, Ken, Terry, Tomasz, Alolita

Topics: Deploy process/pipeline, release process, bug fixing, code review, code management, security deploy/release, automation prioritization

Big picture
Release Engineering and QA are where our efforts in Platform can be amplified. When we do things well, we start to see more responsive development with higher quality code. That is our goal.

What we want to accomplish:
 * More appreciation of, response to, and use of (failing) tests in development
 * Better monitoring and reporting out of our development and deployment processes

Non-Sprint Activities
The Mobile and Language teams have done wonderful things with their embedded QA person. This is a model we want either to expand or, with supplementary training, to mimic as best we can across more teams. For the next quarter, we propose to work more closely with the E3 team on the use of WMF testing infrastructure. They are well poised to make extensive use of what we have provided.

Sprints
We plan to put the items below into Release Management and QA related sprints sometime between July and December 2013:

Relevant Bugzilla searches:
 * whiteboard:deploysprint-13
 * whiteboard:rmqa-2013
 * whiteboard:rmqa-2013 or deploysprint-13

git-deploy

 * tracking
 * auditing salt scripts for completeness
 * deal with dirty git fetches properly
 * Deployment-prep deploys from master and uses a submodule with submodules
 * Questions:
 * Will Platform take over maintenance of git-deploy?

monitoring / reporting

 * Better 500 error/PHP exception monitoring
 * create monitoring for the issue in that bug
 * Questions:
 * Where should this be stashed in our monitoring?
 * Better reporting out of Jenkins builds status
 * including exporting results from our third-party Jenkins
 * overall dashboard of all build activity
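
As a rough illustration of the 500 error monitoring item above, here is a minimal sketch that computes the share of HTTP 5xx responses in a batch of status codes and decides whether to raise an alert. The function names, the input shape (a plain list of integer status codes), and the 1% threshold are all assumptions for illustration, not part of any existing WMF monitoring setup.

```python
from collections import Counter
from typing import Iterable


def server_error_rate(status_codes: Iterable[int]) -> float:
    """Return the fraction of responses whose status is in the 5xx class."""
    # Bucket every code by its hundreds digit (2 for 2xx, 5 for 5xx, ...).
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    return classes[5] / total if total else 0.0


def should_alert(status_codes: Iterable[int], threshold: float = 0.01) -> bool:
    """Hypothetical alert rule: fire when the 5xx share exceeds the threshold."""
    return server_error_rate(status_codes) > threshold
```

Where such a check would live (the open question above) is left undecided here; the sketch only shows the calculation itself.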

deployment script improvements

 * Make updates atomic (e.g. symlink + directory move tricks or git-deploy?)
 * Some improvements for the deployment scripts
 * Reconciling the use of timestamps on JavaScript files (rsync vs ResourceLoader vs git)
 * resetUserTokens.php not usable on large wikis
 * Kill deployment hacks with fire - live hacks that are still applied as of 2013-05-16
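
The "symlink + directory move" idea for atomic updates can be sketched as follows. This is only an illustration under assumptions: a POSIX filesystem, a hypothetical `current` symlink that the web server follows, and release directories that are already fully synced before the switch. It is not how the existing deployment scripts work.

```python
import os


def switch_release(current_link: str, new_release_dir: str) -> None:
    """Repoint `current_link` at `new_release_dir` in a single rename."""
    link_dir = os.path.dirname(os.path.abspath(current_link))
    link_name = os.path.basename(current_link)
    # Create the new symlink under a temporary name in the same directory,
    # so the final rename never crosses a filesystem boundary.
    tmp_link = os.path.join(link_dir, f".{link_name}.tmp.{os.getpid()}")
    os.symlink(new_release_dir, tmp_link)
    try:
        # rename(2) swaps the symlink itself: readers resolve either the
        # old target or the new one, never a missing path.
        os.replace(tmp_link, current_link)
    except OSError:
        os.unlink(tmp_link)
        raise
```

A call such as `switch_release("/srv/app/current", "/srv/app/releases/new")` (both paths hypothetical) would flip the live tree in one step, and rolling back is the same call with the previous release directory.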

multi-site awareness

 * mwscript.php/mctest.php does not know about memcache in both datacenters
 * migrate scripts from hume to terbium
 * Database config cleanup -- multisite awareness in MediaWiki
 * Can we fail over the database?
 * Identify what on our side needs to be multi-site first.

Labs related

 * the use of BetaLabs as a true canary (tracking)
 * allowing extensions to be run from branches other than master
 * hermetic test environment on labs (vagrant or otherwise)
 * set up monitoring of betalabs
 * also usable for throw away dev environments for developer use

QA review draft below
QA + Release management:

Release management and QA intersect in some important areas:


 * we want to release software safely and routinely
 * we want to identify problems before they affect users
 * we want to minimize the time between when a problem is introduced and when it is discovered
 * we want to provide information about software quality in a way that informs release decisions

And we intersect in some important spaces:


 * test2wiki and mediawiki.org
 * beta labs

Beta labs holds great potential for helping provide software releases in a Continuous Deployment/DevOps way, but it needs more attention:


 * more automation proven for production
 * for example automatic db updates; prod doesn't have this
 * for example automatic extension updates; prod doesn't have this
 * more monitoring
 * automated fatal monitoring, error monitoring, status monitoring
 * priority of UI tests
 * other automated tests required: e.g. API

Finally, we may require more upstream work:

Reliable unit tests, API tests, PhantomJS tests upstream in Jenkins/CI before automated deploy to beta etc.

Since last review Feb 2013

Done:
 * Beta labs improvements
 * Automatic db updates
 * Search working properly
 * Now the main target for tests, ahead of test2wiki

Still in progress:
 * Proper support for all extensions in beta https://bugzilla.wikimedia.org/show_bug.cgi?id=49846
 * Bizarre errors still occur in beta: https://bugzilla.wikimedia.org/show_bug.cgi?id=50622
 * SUL https://bugzilla.wikimedia.org/show_bug.cgi?id=51622 https://bugzilla.wikimedia.org/show_bug.cgi?id=51700

Abandoned:
 * Detailed project analysis
 * Groups
 * Formal community test events not related to automated browser tests
 * (Although not completely abandoned; one test event in May identified important issues in VisualEditor at the time)

Black Swans:


 * Shifted to community and Foundation emphasis on browser test automation
 * Provided significant support for Language team browser tests
 * Provided significant support for Mobile team browser tests
 * OPW intern
 * QA mailing list is working; we have contributors and enthusiastic learners
 * Dovetail with Release Management and Continuous Delivery/Continuous Deployment/DevOps

Next quarter priorities:


 * Community training/contributions for browser tests
 * emphasis on beta labs as prod staging environment
 * spike DevOps tools in beta labs
 * examples: automatic db updates, automatic extensions updates, etc.
 * vagrant vms as shareable hermetic test environments
 * involve dev teams in creating browser tests using Language/Mobile as ongoing examples
 * Gadgets too

Casualties:


 * Echo: HTML email notification would benefit from QA attention, but beta labs, OPW, and the QA community are higher priority

Upcoming:


 * Flow
 * Multimedia