Deployment tooling

DevOps Sprint 2013 is a MediaWiki Core team project to improve the deployment process and the operational monitoring used for the Wikimedia content projects and related infrastructure.

= Sprint focus areas =

in progress

 * When a Commons image is updated, update the pages that use it
   * Brian Wolff added a patch that deals with one aspect; the other half still needs work

todo
sorted (roughly) by priority
 * Redirect to canonical encoding
 * Include version in thumbnail URL
 * Separate Cache-Control header for proxy and client
 * Implement thumbnail purging RfC
 * Image URLs should have far-future Expires headers
 * Rewrite jobs-loop.sh in a proper programming language
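
One way to read the "separate Cache-Control header for proxy and client" item: HTTP already lets a single header carry two lifetimes, with max-age applying to the browser and s-maxage overriding it for shared caches such as a caching proxy. The sketch below only illustrates that split; the function name and the lifetimes are assumptions, not values taken from the Wikimedia configuration.

```python
def cache_control(client_max_age: int, proxy_max_age: int) -> str:
    """Build a Cache-Control value with separate client and proxy lifetimes.

    max-age is honoured by the browser cache; s-maxage overrides it for
    shared caches (for example a caching proxy in front of the app servers).
    """
    if client_max_age < 0 or proxy_max_age < 0:
        raise ValueError("lifetimes must be non-negative")
    return f"public, max-age={client_max_age}, s-maxage={proxy_max_age}"


# Hypothetical lifetimes: browsers revalidate after 5 minutes,
# while the proxy layer may keep the object for 30 days.
header = cache_control(client_max_age=300, proxy_max_age=30 * 24 * 3600)
print("Cache-Control:", header)
```

A long proxy lifetime like this only works if objects can be purged or their URLs change on update, which is why the versioned thumbnail URL and thumbnail purging items sit in the same list.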

done

 * ✅ (Brad) - Queue refreshLinks jobs on template deletion

Monitoring
Primary Goals:
 * We shouldn't find out that parts of our infrastructure are down because of a failed browser test (which only runs twice a day).
 * Inform deployment rollback decisions based on pre/post deployment performance metrics

in progress

 * (Ori) Finish migration & puppetization of graphite to eqiad
 * (Ori) Upgrade graphite (this will fix graph exceptions when a line has no data points)
 * (Aaron) Enable deploy markings (the vertical lines) on all graphs in graphite by default
   * Depends on the graphite migration
 * (Bryan) Post RFC for "more structured" logging in MW
   * See User:BDavis_(WMF)/Projects/Structured_logging
 * (Ori) Record sync/scap elapsed wall clock times in graphite
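
As a rough illustration of the "record sync/scap elapsed wall clock times" item, the sketch below times a block of work and formats the result for Carbon's plaintext protocol (one "path value timestamp" line per sample, by default on TCP port 2003). The metric name and the helper names are hypothetical, and this is not how scap itself is instrumented.

```python
import socket
import time
from contextlib import contextmanager


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample as 'path value timestamp' followed by a newline."""
    return f"{path} {value:.3f} {timestamp}\n"


def send_to_graphite(line: str, host: str, port: int = 2003) -> None:
    """Send one sample over TCP; 2003 is Carbon's default plaintext port."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


@contextmanager
def timed(path: str, report):
    """Measure the wall-clock time of the enclosed block and hand the sample to report()."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        report(graphite_line(path, elapsed, int(time.time())))


if __name__ == "__main__":
    # Hypothetical metric name. The sample is printed rather than sent,
    # so the sketch runs without a Graphite host.
    with timed("deploy.scap.elapsed_seconds", report=print):
        time.sleep(0.1)  # stand-in for the real sync
```

To actually ship the sample, pass something like `lambda line: send_to_graphite(line, "graphite.example.org")` as `report`; the host name here is a placeholder.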

todo

 * On-wiki documentation of fatal and exception logging on the cluster
 * Make l10nupdate emit useful log messages
   * Report to SAL where the log lives
 * Get Icinga to alert on important metrics
   * Determine if Icinga has the graphite plugin
   * Have graphite use the prediction algorithm plugin for alertable metrics
   * Figure out where platform alerts should go (mw-core initially?)
   * Turn stuff on (for a set of metrics)
 * Review of current metrics for alertable ones
 * Find new relevant monitoring/alerting metrics
 * Instrumentation in scap to relay more information out of the deployer's terminal
   * Dashboard grid of servers being updated, color indicating status of individual server's code version
 * Log/show exceptions per file/extension
   * Simple text file initially
 * Logstash in production
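
For the Icinga alerting item, a check could look roughly like the sketch below: take the JSON that Graphite's render API returns (a list of series, each with `[value, timestamp]` datapoints), pick the newest non-null value, and map it to the standard Nagios/Icinga plugin states. It uses fixed thresholds rather than the prediction plugin mentioned above, and the metric name, thresholds, and sample data are all made up for illustration.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json: str):
    """Return the newest non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def check(value, warn: float, crit: float):
    """Map a metric value to (exit_code, status line) using fixed thresholds."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data points"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


if __name__ == "__main__":
    # Hypothetical sample in the shape of Graphite's render JSON output.
    sample = (
        '[{"target": "mw.errors.rate", "datapoints": '
        "[[2.0, 1380000000], [12.0, 1380000060], [null, 1380000120]]}]"
    )
    code, message = check(latest_value(sample), warn=5, crit=10)
    print(code, message)
```

A real plugin would end with `sys.exit(code)` so that Icinga picks up the state; the sketch only prints it.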

done

 * ✅ (Hashar) Create logstash project in labs for testing
 * ✅ (Bryan) Install logstash in labs for testing

Deployment
Delayed/de-prioritized relative to Monitoring

Stories:
 * As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged

Primary goals:
 * Maintain reasonable usability
 * Speed: no more than 10-15 minutes
 * Graceful handling of unresponsive Apaches
 * A workflow for security patches
 * Better alerting / monitoring
 * Smoke test
 * Usability: commands used should map to logical activities rather than minutiae
 * Better SAL entries (include commit ranges)
 * Maybe an easy way to get diffs of what was deployed
 * Deal with umask and .bashrc insanity
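
The "better SAL entries" and "easy way to get diffs" goals could be served by recording the deployed commit range in git's `old..new` notation, since the same range can later be fed to `git log` or `git diff`. The sketch below is only an illustration: the message wording, function names, and hashes are invented and do not reflect the existing SAL format.

```python
def commit_range(old: str, new: str, length: int = 7) -> str:
    """Abbreviate two commit hashes into git's 'old..new' range notation."""
    return f"{old[:length]}..{new[:length]}"


def sal_entry(deployer: str, repo: str, old: str, new: str) -> str:
    """Compose a one-line log message that names the deployed commit range."""
    return f"{deployer} synchronized {repo} ({commit_range(old, new)})"


def git_log_command(old: str, new: str) -> list:
    """Argument list for listing the commits in the range, one per line."""
    return ["git", "log", "--oneline", f"{old}..{new}"]


if __name__ == "__main__":
    # Hypothetical hashes, for illustration only.
    old = "1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d"
    new = "9f8e7d6c5b4a39281706f5e4d3c2b1a098765432"
    print(sal_entry("deployer", "mediawiki/core", old, new))
    print(" ".join(git_log_command(old, new)))
```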

ACTION ITEMS

 * Audit of salt scripts for completeness
 * Add rsync backend for Trebuchet
 * Add submodule (and recursive submodule) support to Trebuchet
 * (Ryan) Put new Trebuchet frontend on labs
 * (Aaron) Replace scap-recompile with a .deb package of texvc
 * Enable Trebuchet logging to SAL/IRC
 * Fix the db migration from small to medium
 * Integrate work from Joey H into Trebuchet (git corruption fixing)
 * Test Trebuchet on production to dummy dir, point a testwiki to it
 * Look at l10nupdate; see DevOps Sprint 2013/l10nupdate dataflow