Deployment tooling

The DevOps Sprint 2013 is a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

= Status =

= Sprint focus areas =

Monitoring
Primary Goals:
 * We shouldn't find out that parts of our infrastructure are down from a failed browser test (those only run twice a day).
 * Inform deployment rollback decisions based on pre/post deployment performance metrics
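
One way the pre/post comparison could feed a rollback decision is sketched below. It is only an illustration: the function name, the use of a simple mean, and the 20% regression threshold are all assumptions, not something the sprint has decided.

```python
from statistics import mean


def should_consider_rollback(pre_samples, post_samples, max_regression=0.2):
    """Flag a deploy when a "lower is better" metric (e.g. latency) got worse.

    Hypothetical helper: compares the mean of samples taken before the
    deploy with the mean of samples taken after it. max_regression is an
    assumed threshold, expressed as a fraction of the pre-deploy mean.
    """
    if not pre_samples or not post_samples:
        raise ValueError("need at least one sample on each side of the deploy")
    baseline = mean(pre_samples)
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative change")
    relative_change = (mean(post_samples) - baseline) / baseline
    return relative_change > max_regression
```

A result of True would only prompt a human to look at the graphs; it is not meant as an automatic rollback trigger.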

Graphite

 * Ori: finish puppetization
 * need ops review?
 * Migrate to eqiad (ops or ori?)
 * Aaron: enable deploy markings by default

Logstash

 * Ops/Bryan: procure servers - RT ticket
 * Ops/Greg: get Aaron and Bryan root on those servers - RT ticket
 * Write more filter rules to parse various message formats
 * Get a syslog feed from beta
 * Package logstash jar for deployment
 * Ask Andrew Otto for advice based on Analytics packaging experience for Java projects
 * Make a puppet module
 * Note: matanya has said "i'm writing the logstash module" on IRC which may cover this and the packaging question
 * We would probably still need to provide config for our usage
 * Determine architecture for production deployment
 * What log forwarding methods should be used?
 * How many Logstash instances?
 * HA strategy?
 * QOS terms?
 * Tee the logs from MW to Logstash (in addition to the current fluorine logs)
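
To make the "filter rules" item concrete, here is a rough sketch of turning one log line into a structured event. The line layout (`date time host wiki: message`), the UTC assumption, the field names other than `@timestamp` and `message`, and the `unparsed` tag are all hypothetical; the real rules would be written as Logstash filter configuration, not Python.

```python
import json
import re

# Assumed line layout: "2013-10-01 12:00:00 mw1017 enwiki: some text"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<wiki>\S+): (?P<message>.*)$"
)


def parse_line(line):
    """Split a log line into fields; keep unparseable lines instead of dropping them."""
    text = line.rstrip("\n")
    match = LINE_RE.match(text)
    if match is None:
        # Hypothetical tag so unparsed lines can be found and new rules written.
        return {"message": text, "tags": ["unparsed"]}
    fields = match.groupdict()
    return {
        # Assumes the source timestamps are UTC.
        "@timestamp": "%sT%sZ" % (fields["date"], fields["time"]),
        "host": fields["host"],
        "wiki": fields["wiki"],
        "message": fields["message"],
    }


def to_json_line(event):
    """Serialize an event as one JSON object per line."""
    return json.dumps(event, sort_keys=True)
```

Keeping unmatched lines with a tag makes it easier to see which message formats still need a rule.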

Logging

 * structured logging RFC
 * Design proposed PHP API changes
 * Clean up proposal to remove/archive other examples
 * Submit structured logging RFC
 * Ori: Record sync/scap elapsed wall clock times in graphite
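
Recording elapsed wall clock time in Graphite could look roughly like the sketch below. It uses Carbon's plaintext protocol (one `path value timestamp` line per data point, conventionally on port 2003). The metric path, host name, and helper names are made up for illustration, and this is not how scap actually reports timings.

```python
import socket
import time


def format_graphite_line(path, value, timestamp):
    """Build one data point in Carbon's plaintext format."""
    return "%s %s %d\n" % (path, value, int(timestamp))


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed wall clock seconds)."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


def send_to_graphite(line, host, port=2003):
    """Send one plaintext line to a Carbon listener (host is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example usage (hypothetical metric path and host, not run here):
#   _, elapsed = measure(run_sync)
#   send_to_graphite(
#       format_graphite_line("deploy.scap.elapsed", elapsed, time.time()),
#       "graphite.example.org",
#   )
```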

todo

 * On-wiki documentation of fatal and exception logging on the cluster
 * Make l10nupdate emit useful log messages
 * report to SAL with where the log lives
 * Get Icinga to alert on important metrics
 * Determine if icinga has the graphite plugin
 * Have graphite use the prediction algo plugin for alertable metrics
 * Figure out where platform alerts should go (mw-core initially?)
 * Turn stuff on (for a set of metrics)
 * Review of current metrics for alertable ones
 * Find new relevant monitoring/alerting metrics
 * Instrumentation in scap to relay more information out of the deployer's terminal
 * dashboard grid of servers being updated, color indicating status of individual server's code version
 * Log/show exceptions per file/extension
 * simple text file initially
 * Logstash in production
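
As a starting point for the dashboard grid item, a plain-text version might be enough before anything graphical exists. The sketch below assumes some other tool has already collected a mapping of host name to deployed version; the function name, the `ok`/`STALE` markers, and the layout are hypothetical.

```python
def render_grid(versions, target, columns=4):
    """Render host status as a text grid.

    versions: dict mapping host name to the code version it reports.
    target: the version every host is expected to be running.
    """
    cells = []
    for host in sorted(versions):
        mark = "ok" if versions[host] == target else "STALE"
        cells.append("%s:%s" % (host, mark))
    # Break the flat list of cells into rows of at most `columns` entries.
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    return "\n".join("  ".join(row) for row in rows)
```

The text markers could later be swapped for colors without changing how the data is gathered.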

done

 * ✅ (Hashar) Create logstash project in labs for testing
 * ✅ (Bryan) Install logstash in labs for testing

in progress

 * When a Commons image is updated, update the pages that use it
 * Brian Wolff added one patch to deal with one aspect; the other half is still needed

todo
sorted (roughly) by priority
 * Include version in thumbnail URL
 * Separate Cache-Control header for proxy and client
 * Implement thumbnail purging RfC
 * Image URLs should have far-future Expires headers
 * Rewrite jobs-loop.sh in a proper programming language
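
The first two items can be illustrated together. In HTTP, `max-age` applies to all caches while `s-maxage` overrides it for shared caches (proxies), so one header can carry different lifetimes for each. The sketch below assumes the version is carried as a `v` query parameter, which is only one possible scheme and not a decided design; the function names and example host are hypothetical.

```python
def cache_control(client_seconds, proxy_seconds):
    """Build a Cache-Control value with separate client and proxy lifetimes."""
    return "public, max-age=%d, s-maxage=%d" % (client_seconds, proxy_seconds)


def versioned_thumb_url(base_url, version):
    """Append a version marker so an updated image gets a new URL.

    Assumption: the version goes in a "v" query parameter.
    """
    separator = "&" if "?" in base_url else "?"
    return "%s%sv=%s" % (base_url, separator, version)
```

Once the URL changes whenever the image does, long lifetimes become safe because stale copies are never requested under the new URL.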

done

 * ✅ (Brad) Queue refreshLinks jobs on template deletion
 * ✅ (Tim) Redirect to canonical encoding

Deployment
Delayed/de-prioritized relative to Monitoring

Stories:
 * As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged

Primary goals:
 * Maintain reasonable usability
 * Speed: no more than 10-15 minutes
 * Graceful handling of unresponsive Apaches
 * A workflow for security patches
 * Better alerting / monitoring
 * Smoke test
 * Usability: commands used should map to logical activities rather than minutiae
 * Better SAL entries (include commit ranges)
 * Maybe an easy way to get diffs of what was deployed
 * Deal with umask and .bashrc insanity
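
For the SAL and diff goals, a commit range in git's `old..new` notation would let anyone reproduce what changed. The sketch below shows one possible message shape; the wording of the entry, the function names, and the seven-character abbreviation are assumptions, not the current SAL format.

```python
def sal_entry(user, repo, old_sha, new_sha, short=7):
    """Format a log entry that includes the deployed commit range."""
    return "%s deployed %s: %s..%s" % (user, repo, old_sha[:short], new_sha[:short])


def diff_command(old_sha, new_sha):
    """Return the git command (as an argument list) that shows what was deployed."""
    return ["git", "diff", "%s..%s" % (old_sha, new_sha)]
```

The same range string works with `git log` to list the commits instead of the combined diff.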

ACTION ITEMS

 * Audit of salt scripts for completeness
 * Add rsync backend for Trebuchet
 * Add submodule (and recursive submodule) support to Trebuchet
 * (Ryan) Put new Trebuchet frontend on labs
 * (Aaron) Replace scap-recompile with a .deb package of texvc
 * Enable Trebuchet logging to SAL/IRC
 * Fix the db migration from small to medium
 * Integrate work from Joey H into Trebuchet (git corruption fixing)
 * Test Trebuchet on production to dummy dir, point a testwiki to it
 * Look at l10nupdate; see DevOps Sprint 2013/l10nupdate dataflow