Deployment tooling/2013Sprint

From mediawiki.org

The DevOps Sprint 2013 was a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

Post-sprint retrospective notes

Sprint focus areas

Monitoring

Primary Goals:

  • We shouldn't find out that parts of our infrastructure are down only because a browser test failed (those tests run just twice a day).
  • Inform deployment rollback decisions based on pre/post deployment performance metrics
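The second goal can be pictured as a before/after comparison around a deploy. The following is a minimal sketch, not something the sprint specified: the function name, the 20% threshold, and the sample response times are all assumptions made for illustration.

```python
from statistics import mean


def should_consider_rollback(pre_samples, post_samples, max_regression=0.20):
    """Return True when the post-deploy mean is worse than the pre-deploy
    mean by more than max_regression (a fraction, e.g. 0.20 for 20%).

    Assumes a metric where higher is worse (for example response time in ms).
    The default threshold is a hypothetical value chosen for illustration.
    """
    if not pre_samples or not post_samples:
        # Not enough data to inform a decision either way.
        return False
    baseline = mean(pre_samples)
    if baseline <= 0:
        # A relative regression is undefined without a positive baseline.
        return False
    regression = (mean(post_samples) - baseline) / baseline
    return regression > max_regression


if __name__ == "__main__":
    # Hypothetical response times (ms) sampled before and after a deploy.
    pre = [100, 110, 90, 100]
    post = [150, 140, 160, 150]
    print(should_consider_rollback(pre, post))
```

A check like this only informs the decision; whether to roll back would still be a human call.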

in progress

Graphite

  • Ori: finish puppetization
    • need ops review?
  • Migrate to eqiad (ops or ori?)
  • Aaron: enable deploy markings by default

Logstash

  • Done: Ops/Bryan: procure servers - rt ticket
  • Done: Ops/Greg: get Aaron and Bryan root on those servers - rt ticket
  • Write more filter rules to parse various message formats
  • Get a syslog feed from beta
  • Done: Package logstash jar for deployment
    • Ask Andrew Otto for advice based on Analytics packaging experience for Java projects
  • Done: Make a puppet module
    • Note: matanya has said "i'm writing the logstash module" on IRC which may cover this and the packaging question
    • We would probably still need to provide config for our usage
  • Done: Determine architecture for production deployment
    • What log forwarding methods should be used?
      • Udp2log is being used for now. Additional inputs will be added as needed
    • How many Logstash instances?
      • Starting with a 3 node cluster. Udp2log is only pointed at a single instance currently (logstash1001).
    • HA strategy?
      • Native Elasticsearch clustering. Kibana can be served from any of the 3 nodes. Udp2log input is currently a single point of failure.
    • QOS terms?
      • No QoS/ToS requirements have been established yet.
  • Done: tee the logs from MW to logstash (in addition to current fluorine logs)
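The "tee" idea in the last item, where each log line goes to the existing destination and also to Logstash, can be sketched as follows. This is only an illustration of the concept, not the actual MediaWiki or udp2log configuration: the function name is hypothetical, and the demo uses loopback sockets as stand-ins for the real hosts.

```python
import socket


def tee_log_line(line, destinations):
    """Send one log line to every (host, port) destination over UDP.

    UDP is fire-and-forget: a destination we cannot send to does not stop
    the others, but delivery is not guaranteed either.
    Returns the number of destinations the line was handed to.
    """
    payload = line.encode("utf-8")
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in destinations:
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # Skip this destination and keep teeing to the rest.
                continue
    return sent


if __name__ == "__main__":
    # Two loopback receivers stand in for the existing log host and Logstash.
    receivers = []
    for _ in range(2):
        r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        r.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        r.settimeout(2)
        receivers.append(r)
    count = tee_log_line("example log line", [r.getsockname() for r in receivers])
    print(count, [r.recvfrom(4096)[0] for r in receivers])
    for r in receivers:
        r.close()
```

Because the send is per destination, losing one receiver does not affect the other, which matches the note above that the single udp2log input is the remaining point of failure.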

Logging

todo

  • On-wiki documentation of fatal and exception logging on the cluster - bug 52026
  • Make l10nupdate emit useful log messages
    • report to SAL with where the log lives
  • Get Icinga to alert on important metrics
    • Determine if icinga has the graphite plugin
    • Have graphite use the prediction algo plugin for alertable metrics
    • Figure out where platform alerts should go (mw-core initially?)
    • Turn stuff on (for a set of metrics)
    • Review of current metrics for alertable ones
    • Find new relevant monitoring/alerting metrics
  • Instrumentation in scap to relay more information out of the deployer's terminal
    • dashboard grid of servers being updated, color indicating status of individual server's code version
  • Log/show exceptions per file/extension
    • simple text file initially
  • Logstash in production
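For the Icinga item above, a check that reads a Graphite metric could look roughly like the sketch below. It assumes the JSON shape returned by Graphite's render API with format=json, and the standard Nagios/Icinga plugin exit codes. The metric name and thresholds are hypothetical, and fixed thresholds are used here in place of the prediction-based approach the list mentions.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit):
    """Evaluate the newest non-null datapoint of the first series.

    render_json is the text of a Graphite render response in JSON format:
    a list of {"target": ..., "datapoints": [[value, timestamp], ...]}.
    Returns (exit_code, message) in the style of an Icinga plugin.
    """
    series = json.loads(render_json)
    if not series:
        return UNKNOWN, "UNKNOWN: no series returned"
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints"
    latest = values[-1]
    target = series[0]["target"]
    if latest >= crit:
        return CRITICAL, f"CRITICAL: {target} = {latest}"
    if latest >= warn:
        return WARNING, f"WARNING: {target} = {latest}"
    return OK, f"OK: {target} = {latest}"


if __name__ == "__main__":
    # Hypothetical metric name and values; the last datapoint is null,
    # as often happens for the most recent interval.
    sample = (
        '[{"target": "mw.errors.rate", "datapoints": '
        "[[1.0, 1380000000], [7.5, 1380000060], [null, 1380000120]]}]"
    )
    code, message = check_series(sample, warn=5, crit=10)
    print(code, message)
```

Keeping the evaluation separate from the HTTP fetch makes it easy to test against saved responses before anything is turned on for real alerts.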

done

  • Done: (Hashar) Create logstash project in labs for testing
  • Done: (Bryan) Install logstash in labs for testing

Cache Improvements

in progress

  • bug 22390 - When a commons image is updated, update the pages that use it
    • Brian Wolff added a patch that deals with one aspect; the other half is still needed

todo

sorted (roughly) by priority

done

  • Done: (Brad) bug 5382 - Queue refreshLinks jobs on template deletion
  • Done: (Tim) bug 27935 - Redirect to canonical encoding

Deployment

Delayed/de-prioritized relative to Monitoring

Stories:

  • As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged

Primary goals:

  • Maintain reasonable usability
  • Speed: no more than 10-15 minutes
  • Graceful handling of unresponsive Apaches
  • A workflow for security patches
  • Better alerting / monitoring
    • Smoke test
  • Usability: commands used should map to logical activities rather than minutiae
  • Better SAL entries (include commit ranges)
    • Maybe an easy way to get diffs of what was deployed
  • Deal with umask and .bashrc insanity
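As an illustration of the "better SAL entries" goal, a deploy tool could build its log line from the commit range it just deployed. The sketch below is an assumption about what such an entry might look like: the function names, the message wording, and the example SHAs are hypothetical, not the format used by scap or Trebuchet.

```python
def format_sal_entry(deployer, repo, old_sha, new_sha, abbrev=7):
    """Build a one-line log entry that names the deployed commit range."""
    commit_range = f"{old_sha[:abbrev]}..{new_sha[:abbrev]}"
    return f"{deployer} deployed {repo}: {commit_range}"


def diff_command(old_sha, new_sha):
    """Return the git argv that summarises what changed between two deploys."""
    return ["git", "diff", "--stat", f"{old_sha}..{new_sha}"]


if __name__ == "__main__":
    # Hypothetical deployer, repository, and commit hashes.
    old = "1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d"
    new = "9f8e7d6c5b4a39281706f5e4d3c2b1a09f8e7d6c"
    print(format_sal_entry("alice", "mediawiki/core", old, new))
    print(" ".join(diff_command(old, new)))
```

Recording the range in the entry is also what makes the "easy way to get diffs" possible, since the same two hashes can be handed straight to git.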

ACTION ITEMS

  • Audit of salt scripts for completeness - bug 43615
  • Add rsync backend for Trebuchet - bug 54185
  • Add submodule (and recursive submodule) support to Trebuchet - bug 51581
  • (Ryan) Put new Trebuchet frontend on labs
  • (Aaron) replace scap-recompile with a .deb package of texvc - bug 45076
  • Enable Trebuchet logging to SAL/IRC
  • Fix the db migration from small to medium - bug 56222
  • Integrate work from Joey H into Trebuchet (git corruption fixing - bug 51142)
  • Test Trebuchet on production to dummy dir, point a testwiki to it
  • Look at l10nupdate; DevOps Sprint 2013/l10nupdate dataflow