Deployment tooling

The DevOps Sprint 2013 is a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

= Sprint focus areas =

Deployment
Stories:
 * As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged
 * NEEDS ACTION ITEMS

Primary goals:
 * Maintain reasonable usability
 * Speed: a deployment should take no more than 10-15 minutes
 * Graceful handling of unresponsive Apaches
 * A workflow for security patches
 * Better alerting / monitoring
 * Smoke test
 * Usability: commands used should map to logical activities rather than minutiae
 * Better SAL entries (include commit ranges)
 * Maybe an easy way to get diffs of what was deployed
 * Deal with umask and .bashrc insanity
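
As an illustration of the "better SAL entries" goal, here is a minimal sketch of a log message that names the deployed commit range. The function name, argument names, and message wording are all hypothetical; none of this is taken from the existing deployment scripts.

```python
def format_sal_entry(user, repo, old_sha, new_sha, summary=""):
    """Build a one-line log message that names the deployed commit range.

    All names and the message layout are illustrative assumptions.
    """
    # Abbreviate both hashes to 7 characters to keep the entry short.
    commit_range = f"{old_sha[:7]}..{new_sha[:7]}"
    entry = f"{user} synchronized {repo} [{commit_range}]"
    if summary:
        entry += f": {summary}"
    return entry


if __name__ == "__main__":
    print(format_sal_entry("alice", "mediawiki/core", "a" * 40, "b" * 40, "fix typo"))
```

A range in `old..new` form can also be handed to `git log` or `git diff`, which would cover the "diffs of what was deployed" goal with the same piece of information.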

ACTION ITEMS

 * Audit of salt scripts for completeness -
 * Add rsync backend for Trebuchet -
 * Add submodule (and recursive submodule) support to Trebuchet -
 * (Ryan) Put new Trebuchet frontend on labs
 * (Aaron) Replace scap-recompile with a .deb package of texvc -
 * Enable Trebuchet logging to SAL/IRC
 * Fix the db migration from small to medium -
 * Integrate work from Joey H into Trebuchet (git corruption fixing)
 * Test Trebuchet on production to dummy dir, point a testwiki to it

Monitoring
Primary goals:
 * We shouldn't learn that parts of our infrastructure are down from a failed browser test (those only run twice a day).
 * Inform deployment rollback decisions based on pre/post deployment performance metrics
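
One way to picture the second goal is a simple comparison of a metric sampled before and after a deploy. The sketch below is only an assumption about how such a check might look: the function name, the 20% threshold, and the use of a plain mean are all hypothetical choices, not an agreed design.

```python
from statistics import mean


def should_consider_rollback(pre, post, max_regression=0.20):
    """Return True when the post-deploy mean exceeds the pre-deploy mean
    by more than max_regression (a fraction, e.g. 0.20 for 20%).

    pre and post are lists of samples of a "lower is better" metric,
    such as response time in milliseconds. Hypothetical helper.
    """
    if not pre or not post:
        return False  # not enough data to judge either way
    baseline = mean(pre)
    if baseline == 0:
        # Avoid dividing by zero; any increase from zero counts as a regression.
        return mean(post) > 0
    return (mean(post) - baseline) / baseline > max_regression


if __name__ == "__main__":
    print(should_consider_rollback([100, 110, 90], [150, 160, 140]))
```

A real check would probably need to account for normal traffic variation, so this should be read as a starting point for discussion rather than an alerting rule.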

ACTION ITEMS

 * (Ori) Finish migration & puppetization of graphite to eqiad
 * Upgrade graphite (this will fix graph exceptions when a line has no data points)
 * (Aaron) Enable deploy markings (the vertical lines) on all graphs in graphite
 * BLOCKED on graphite migration
 * Document fatal and exception logging on the cluster -
 * (Bryan) Start document/RFC for "more structured" logging in MW
 * ✅ (Hashar) Install logstash in labs for testing
 * Brainstorm relevant monitoring/alerting metrics
 * Review of current metrics for alertable ones
 * Expose exceptions data (showing exceptions per file/extension)
 * (Ori) Record sync/scap elapsed wall clock times in graphite
 * Make l10nupdate emit useful log messages
 * Figure out where platform alerts should go
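
For the item on recording sync/scap elapsed times, a minimal sketch follows. It assumes Graphite's plaintext protocol (one `path value timestamp` line per sample, sent to carbon's plaintext listener, port 2003 by default). The metric name `deploy.scap.elapsed` and the helper names are hypothetical and only serve as illustration.

```python
import socket
import time


def graphite_line(metric_path, value, timestamp=None):
    """Format one sample as a Graphite plaintext line: path, value, timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {int(timestamp)}\n"


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


def send_to_graphite(line, host, port=2003):
    """Send one plaintext line to a carbon listener (host is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Stand-in for the real sync step; replace with the actual deploy call.
    _, elapsed = timed(time.sleep, 0.1)
    print(graphite_line("deploy.scap.elapsed", round(elapsed, 3)), end="")
```

Measuring with a monotonic clock keeps the elapsed time correct even if the system clock is adjusted during a long sync, while the timestamp on the sample still uses wall-clock time so it lines up with other graphs.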