The DevOps Sprint 2013 was a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.

Sprint focus areas

Monitoring

Primary Goals:

We shouldn't find out that parts of our infrastructure are down only because a browser test failed (those tests run just twice a day).
Inform deployment rollback decisions based on pre/post deployment performance metrics

Done Ops/Bryan: procure servers - rt ticket
Done Ops/Greg: get Aaron and Bryan root on those servers - rt ticket
Write more filter rules to parse various message formats
Get a syslog feed from beta
Done Package logstash jar for deployment
- ~~Ask Andrew Otto for advice based on Analytics packaging experience for Java projects~~
Done Make a puppet module
- Note: matanya has said "i'm writing the logstash module" on IRC which may cover this and the packaging question
- We would probably still need to provide config for our usage
Done Determine architecture for production deployment
- What log forwarding methods should be used?
  - Udp2log is being used for now. Additional inputs will be added as needed
- How many Logstash instances?
  - Starting with a 3 node cluster. Udp2log is only pointed at a single instance currently (logstash1001).
- HA strategy?
  - Native Elasticsearch clustering. Kibana can be served from any of the 3 nodes. Udp2log input is currently a single point of failure.
- QoS terms?
  - No QoS/ToS requirements have been established yet.
Done tee the logs from MW to logstash (in addition to current fluorine logs)
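
A minimal sketch of the "tee" idea above: every log datagram goes to the existing collector and also to Logstash. This is an illustration only, not the MediaWiki or udp2log implementation; the function name `tee_datagram`, the destination addresses, and the payload format are all assumptions.

```python
from typing import Iterable, Tuple

Address = Tuple[str, int]


def tee_datagram(payload: bytes, destinations: Iterable[Address], sock) -> int:
    """Send the same payload to every destination.

    `sock` is any object with a UDP-style sendto(payload, address) method,
    for example socket.socket(AF_INET, SOCK_DGRAM).
    Returns the number of destinations the payload was handed off to.
    """
    sent = 0
    for address in destinations:
        try:
            sock.sendto(payload, address)
            sent += 1
        except OSError:
            # UDP is best effort: a failing receiver must not block the others.
            continue
    return sent
```

With a real socket this would be called as `tee_datagram(line, [fluorine_addr, logstash_addr], sock)`, where both addresses are placeholders for whatever hosts and ports are actually configured. Because delivery is fire-and-forget, the single udp2log input noted above remains a single point of failure.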

structured logging RFC
- Design proposed PHP API changes
- Clean proposal to remove/archive other examples
- Done Submit structured logging RFC
Ori: Record sync/scap elapsed wall clock times in graphite
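
A sketch of how elapsed wall clock time could be recorded in Graphite, assuming Carbon's plaintext protocol (one `path value timestamp` line per metric, port 2003 by default). The metric name `deploy.scap.elapsed`, the helper names, and the host are hypothetical; this is not how scap itself reports timings.

```python
import socket
import time
from contextlib import contextmanager


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one metric in Carbon's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


@contextmanager
def timed(path, emit, clock=time.monotonic, now=time.time):
    """Time the enclosed block and pass the formatted metric line to emit()."""
    start = clock()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed syncs are timed too.
        elapsed = round(clock() - start, 3)
        emit(graphite_line(path, elapsed, int(now())))


def send_to_graphite(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one plaintext line to a Carbon listener (host is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

Usage would look like `with timed("deploy.scap.elapsed", send_to_graphite): run_sync()`, where `run_sync` stands in for whatever actually performs the deployment.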

On-wiki documentation of fatal and exception logging on the cluster - bug 52026
Make l10nupdate emit useful log messages
- Report the log location to the SAL (Server Admin Log)
Get Icinga to alert on important metrics
- Determine if icinga has the graphite plugin
- Have graphite use the prediction algo plugin for alertable metrics
- Figure out where platform alerts should go (mw-core initially?)
- Turn stuff on (for a set of metrics)
- Review of current metrics for alertable ones
- Find new relevant monitoring/alerting metrics
Instrumentation in scap to relay more information out of the deployer's terminal
- dashboard grid of servers being updated, color indicating status of individual server's code version
Log/show exceptions per file/extension
- simple text file initially
Logstash in production
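
For the Icinga alerting item above, a threshold check against a Graphite metric could be sketched as follows. It assumes the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and Graphite's JSON render output, where datapoints are `[value, timestamp]` pairs and the value may be null. The function names and thresholds are hypothetical, and this is a plain threshold check, not the prediction-based plugin mentioned in the list.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(datapoints):
    """Return the most recent non-null value from [value, timestamp] pairs."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def evaluate(value, warn, crit):
    """Map a metric value to a plugin exit code and a status line."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints available"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - {value} >= {warn}"
    return OK, f"OK - {value}"
```

A real plugin would fetch the datapoints over HTTP, print the status line, and call `sys.exit()` with the returned code; keeping the evaluation separate from the I/O makes the thresholds easy to review per metric.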

bug 22390 - When a commons image is updated, update the pages that use it
- Brian Wolff added one patch to deal with one aspect; the other half is still outstanding

sorted (roughly) by priority

Delayed/de-prioritized relative to Monitoring

Stories:

As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged

Primary goals:

Maintain reasonable usability
Speed: no more than 10-15 minutes
Graceful handling of unresponsive Apaches
A workflow for security patches
Better alerting / monitoring
- Smoke test
Usability: commands used should map to logical activities rather than minutiae
Better SAL entries (include commit ranges)
- Maybe an easy way to get diffs of what was deployed
Deal with umask and .bashrc insanity
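
To illustrate the "better SAL entries" goal, a log message that includes the deployed commit range might be built like this. The message format, function names, and arguments are assumptions made for illustration; they do not reflect an existing SAL convention.

```python
def sal_entry(deployer: str, repo: str, old_sha: str, new_sha: str, summary: str) -> str:
    """Build a hypothetical SAL message that names the deployed commit range."""
    commit_range = f"{old_sha[:7]}..{new_sha[:7]}"
    return f"{deployer} deployed {repo} ({commit_range}): {summary}"


def diff_command(old_sha: str, new_sha: str) -> list:
    """Return a git command that summarizes what changed between two deploys."""
    return ["git", "diff", "--stat", f"{old_sha}..{new_sha}"]
```

Recording the range in the entry means anyone reading the log can reproduce the diff with a single git command instead of asking the deployer what went out.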