Deployment tooling

The DevOps Sprint 2013 is a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

= Sprint focus areas =

Cache Improvements

 * Rewrite jobs-loop.sh in a proper programming language
 * Redirect to canonical encoding
 * When a Commons image is updated, update the pages that use it
 * Include version in thumbnail URL
 * ✅ (Brad) Queue refreshLinks jobs on template deletion
 * Separate Cache-Control header for proxy and client
 * Implement thumbnail purging RfC
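
One way to read the "separate Cache-Control header for proxy and client" item is to emit a single header that carries two lifetimes: `max-age` for browsers and `s-maxage` for shared caches such as a caching proxy. The sketch below only illustrates that idea; the function name and the example lifetimes are assumptions, not something decided in the sprint.

```python
def cache_control(client_max_age: int, proxy_max_age: int) -> str:
    """Build a Cache-Control value with separate client and proxy lifetimes.

    max-age is honoured by browsers; s-maxage overrides it for shared
    caches (for example a caching proxy in front of the app servers).
    Hypothetical helper, for illustration only.
    """
    if client_max_age < 0 or proxy_max_age < 0:
        raise ValueError("lifetimes must be non-negative")
    return f"public, max-age={client_max_age}, s-maxage={proxy_max_age}"


if __name__ == "__main__":
    # Example lifetimes are made up: 5 minutes for clients, 1 day for proxies.
    print("Cache-Control:", cache_control(300, 86400))
```

Keeping the client lifetime short while letting the proxy cache for longer only works if the proxy layer is purged reliably on edits, which is what the purge-related items above are about.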

Monitoring
Primary Goals:
 * We shouldn't learn that parts of our infrastructure are down only because a browser test failed (those tests run just twice a day).
 * Inform deployment rollback decisions based on pre/post deployment performance metrics
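
As a rough illustration of the second goal, a rollback hint could compare a metric sampled before and after a deployment. This is a minimal sketch under stated assumptions: the function name, the 20% threshold, and the choice of a simple mean are all hypothetical, and the metric is assumed to be one where higher is worse (for example response time).

```python
from statistics import mean
from typing import Sequence


def should_consider_rollback(
    pre: Sequence[float],
    post: Sequence[float],
    max_regression: float = 0.2,  # assumed threshold: 20% worse than baseline
) -> bool:
    """Return True if the post-deploy mean regressed past the threshold.

    Assumes a "higher is worse" metric such as response time in ms.
    """
    if not pre or not post:
        raise ValueError("need samples from both before and after the deploy")
    baseline = mean(pre)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (mean(post) - baseline) / baseline > max_regression


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(should_consider_rollback([100, 110, 90], [130, 125, 135]))
```

The result is meant to inform a human decision, not to trigger a rollback automatically.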

ACTION ITEMS

 * (Ori) Finish migration & puppetization of graphite to eqiad
 * Upgrade graphite (this will fix graph exceptions when a line has no data points)
 * (Aaron) Enable deploy markings (the vertical lines) on all graphs in graphite
 * BLOCKED on graphite migration
 * Document fatal and exception logging on the cluster
 * (Bryan) Start document/RFC for "more structured" logging in MW
 * ✅ (Hashar) Install logstash in labs for testing
 * Get Icinga to alert on important metrics
 * Determine if Icinga has the graphite plugin
 * Have graphite use the prediction algo plugin for alertable metrics
 * Turn stuff on (for a set of metrics)
 * Review of current metrics for alertable ones
 * Find new relevant monitoring/alerting metrics
 * Expose exceptions data (showing exceptions per file/extension)
 * (Ori) Record sync/scap elapsed wall clock times in graphite
 * Make l10nupdate emit useful log messages
 * Report to SAL where the log lives
 * Figure out where platform alerts should go
 * Instrumentation in scap to relay more information out of the deployer's terminal
 * adds ability to identify problem areas of scap (which parts take the longest)
 * dashboard grid of servers being updated, color indicating status of individual server's code version
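
The "record sync/scap elapsed wall clock times in graphite" item could be sketched as follows. Graphite's carbon daemon accepts a plaintext line of the form `path value timestamp` (by default on TCP port 2003). The metric name, host, and helper names here are hypothetical; this is not how scap does it, just one possible shape.

```python
import socket
import time
from contextlib import contextmanager
from typing import Callable, Iterator


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    """Format one line of Graphite's plaintext protocol."""
    return f"{path} {value:.3f} {timestamp}\n".encode("ascii")


def send_to_carbon(line: bytes, host: str = "localhost", port: int = 2003) -> None:
    """Send one plaintext line to carbon. Host and port are assumptions."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line)


@contextmanager
def timed(path: str, send: Callable[[bytes], None]) -> Iterator[None]:
    """Measure wall clock time of the enclosed block and report it via send()."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        send(format_metric(path, elapsed, int(time.time())))


if __name__ == "__main__":
    # Collect into a list instead of a real socket so this runs anywhere.
    sent = []
    with timed("deploy.scap.elapsed", sent.append):  # hypothetical metric name
        time.sleep(0.1)  # stand-in for the actual sync
    print(sent[0].decode("ascii"), end="")
```

Passing the sender in as a callable keeps the timing logic testable without a carbon instance; in real use `send_to_carbon` would be passed instead of `sent.append`. Wrapping individual phases in `timed` would also cover the item about identifying which parts of scap take the longest.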

Deployment
Delayed/de-prioritized relative to Monitoring

Stories:
 * As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged
 * NEEDS ACTION ITEMS

Primary goals:
 * Maintain reasonable usability
 * Speed: no more than 10-15 minutes
 * Graceful handling of unresponsive Apaches
 * A workflow for security patches
 * Better alerting / monitoring
 * Smoke test
 * Usability: commands used should map to logical activities rather than minutiae
 * Better SAL entries (include commit ranges)
 * Maybe an easy way to get diffs of what was deployed
 * Deal with umask and .bashrc insanity
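
For the "better SAL entries (include commit ranges)" goal, an entry could carry the abbreviated old and new commit in `old..new` form, which is also what `git log` and `git diff` accept as a range. The message layout, function name, and sample values below are assumptions made for illustration, not an agreed format.

```python
def sal_entry(deployer: str, repo: str, old_sha: str, new_sha: str, abbrev: int = 7) -> str:
    """Build a log message that includes the deployed commit range.

    The layout is hypothetical; only the old..new range syntax is git's.
    """
    old, new = old_sha[:abbrev], new_sha[:abbrev]
    if old_sha == new_sha:
        return f"{deployer}: synced {repo} (no change, at {old})"
    return f"{deployer}: synced {repo} {old}..{new}"


if __name__ == "__main__":
    # Made-up deployer, repository name, and hashes.
    print(sal_entry(
        "alice",
        "example-repo",
        "0123456789abcdef0123456789abcdef01234567",
        "fedcba9876543210fedcba9876543210fedcba98",
    ))
```

With the range in the log entry, running `git diff` on the same `old..new` pair in the deployed repository would give the diff of what was deployed.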

ACTION ITEMS

 * Audit of salt scripts for completeness
 * Add rsync backend for Trebuchet
 * Add submodule (and recursive submodule) support to Trebuchet
 * (Ryan) Put new Trebuchet frontend on labs
 * (Aaron) Replace scap-recompile with a .deb package of texvc
 * Enable Trebuchet logging to SAL/IRC
 * Fix the db migration from small to medium
 * Integrate work from Joey H into Trebuchet (git corruption fixing)
 * Test Trebuchet on production to dummy dir, point a testwiki to it
 * Look at l10nupdate; see DevOps Sprint 2013/l10nupdate dataflow