Development process improvement/2014-01-22/Notes

Pain Points

 * Product review doesn't always happen
 * Security patches don't always get reapplied when extensions are redeployed
 * Getting security review can take a long time
 * Some teams/products don't ride the train
 * Error apathy: lots of known bugs that nobody is fixing ("Meh, that error is always there" or "just ignore it")
 * Keeping non-Bugzilla tracking systems (Mingle/Trello) synced with Bugzilla is hard
 * Unit test coverage is inadequate across features and projects.
 * Browser/full stack tests are effective, but we rely on them too much
 * Our "test pyramid" is upside-down: http://martinfowler.com/bliki/TestPyramid.html.
 * No facility for pre-merge full stack tests
 * Browser tests are slow (and always will be, even at their fastest)
 * Cloudbees is flaky, and browser tests have lots of other known problems. See: https://www.mediawiki.org/wiki/Browser_testing/architecture
 * Time between merge and release branch cut can be anywhere from 1 minute to 1 week.
 * We don't test integration across repos at branch cut time (extensions with core, config with extensions; not an easy task)
 * Could run browser tests on branch cut. Integration/API tests would be useful.
 * Setting up a complex wiki environment in Labs is often manual/difficult
 * Labs configuration is not like production
 * No official Vagrant maintainer
 * Can't easily run automated browser tests against Vagrant. Improvements to this are in progress now: https://bugzilla.wikimedia.org/show_bug.cgi?id=58939
 * Bootstrapping a wiki on Vagrant isn't automated
 * "Minor" changes deployed outside windows
 * Sometimes people deploy during reserved deploy windows that they don't own
 * How do we know that a shellbug request has consensus?
 * Sometimes shellbug requests bypass Bugzilla
 * WMF product should be consulted on some shellbugs
 * People sometimes merge wmfconfig changes without deploying
 * Beta cluster can be broken by a production config change
 * External software dependencies keep some software from riding the train
 * Need for backwards compatibility with schema changes limits velocity
 * Instrumentation is not sufficient for continuous deployment
 * Bug fixes don't roll out quickly enough
 * Gerrit's workflow is "not like GitHub"

Deploy Train

 * "Most" things ride the train
 * But lots of things go as Lightning deploys
 * Is it broken in prod?
 * Is it going to break prod?
 * And then there is Parsoid...

Wants

 * Block a commit from production unless the related commit it depends on (in Core or an Extension) is already in production
 * Has bitten Cirrus on more than one occasion; primarily on the old branch
 * Would be nice to automate a "-2 until other change merges" workflow (used by VE)
 * Backports suffer from the same or a similar problem, and it's possibly exacerbated there
 * Integrate browser tests with Jenkins (CI is working on this; browser tests being slow is a problem)
 * Replace (most of) lightning deploys with a task force of rotating deployers that gathers bug fixes and deploys them during a daily window
 * Hopefully makes nominating things for fast deployments more egalitarian
 * Visual regression testing. We have spiked this using Sikuli but the value seems low for now at least.
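
The "block a commit unless its related commit is in production" want above could be sketched roughly as follows. This is only an illustration in Python: the `Change` record, the change ids, and the idea of an explicit `depends_on` set are all hypothetical and do not reflect how Gerrit or the deploy tooling actually models dependencies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    """A hypothetical change record; not Gerrit's actual data model."""
    change_id: str
    # Ids of changes (e.g. in Core or another Extension) that must be
    # in production before this change may be deployed.
    depends_on: frozenset = field(default_factory=frozenset)


def missing_dependencies(change, deployed_ids):
    """Return the dependency ids that are not yet in production, sorted."""
    return sorted(change.depends_on - set(deployed_ids))


def can_deploy(change, deployed_ids):
    """A change is deployable only when every dependency is already deployed."""
    return not missing_dependencies(change, deployed_ids)


if __name__ == "__main__":
    # Made-up ids, purely for illustration.
    deployed = {"core-101", "ext-202"}
    cirrus_fix = Change("cirrus-303", frozenset({"core-101", "core-404"}))
    if not can_deploy(cirrus_fix, deployed):
        print("blocked on:", missing_dependencies(cirrus_fix, deployed))
```

A check along these lines could also back the "-2 until other change merges" workflow: the block would lift automatically once the list of missing dependencies is empty.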

Investigate

 * Many people mentioned Etsy's work. There is detailed information in these blog posts, circa 2011:
 * http://codeascraft.com/2011/04/20/divide-and-concur/ Note the division of test types. Note the PHP code base.
 * http://codeascraft.com/2011/10/11/did-you-try-it-before-you-committed/ Note that Etsy's 'try' server is modeled on Mozilla's 'try' server
 * Mozilla's 'try' server: http://rhelmer.org/blog/buildbot-try-support
 * https://wiki.mozilla.org/ReleaseEngineering/TryServer