The DevOps Sprint 2013 was a project undertaken by the MediaWiki Core team to improve the deployment process and operational monitoring capabilities in use for the Wikimedia content projects and related infrastructure.
Sprint focus areas
- We shouldn't find out that various parts of our infrastructure are down because of a failed browser test (which only runs twice a day).
- Base deployment rollback decisions on pre- and post-deployment performance metrics
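A minimal sketch of how the pre/post-deployment comparison could inform a rollback decision. The function name, the 20% threshold, and the choice of comparing means are all assumptions made for illustration; the sprint notes do not specify a metric or a decision rule.

```python
from statistics import mean


def should_consider_rollback(pre_samples, post_samples, max_regression=0.20):
    """Return True when the post-deploy mean is worse than the pre-deploy
    mean by more than max_regression (a fraction, so 0.20 means 20%).

    Assumes a "lower is better" metric such as request latency.
    The threshold and the use of a plain mean are illustrative only.
    """
    if not pre_samples or not post_samples:
        raise ValueError("need at least one sample on each side of the deploy")
    baseline = mean(pre_samples)
    current = mean(post_samples)
    if baseline == 0:
        # No meaningful relative change; flag any non-zero value after deploy.
        return current > 0
    return (current - baseline) / baseline > max_regression


# Hypothetical latency samples (ms) taken before and after a deployment.
before = [100, 110, 90]
after = [150, 160, 140]
print(should_consider_rollback(before, after))
```

In practice the samples would come from the metrics store around a deploy marking, and a person would still make the final call.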
- Ori: finish puppetization
- need ops review?
- Migrate to eqiad (ops or ori?)
- Aaron: enable deploy markings by default
- Done Ops/Bryan: procure servers - rt ticket
- Done Ops/Greg: get Aaron and Bryan root on those servers - rt ticket
- Write more filter rules to parse various message formats
- Get a syslog feed from beta
- Done Package logstash jar for deployment
- Ask Andrew Otto for advice based on the Analytics team's packaging experience with Java projects
- Done Make a puppet module
- Note: matanya has said "i'm writing the logstash module" on IRC, which may cover both this and the packaging question
- We would probably still need to provide config for our usage
- Done Determine architecture for production deployment
- What log forwarding methods should be used?
- Udp2log is being used for now. Additional inputs will be added as needed
- How many Logstash instances?
- Starting with a 3 node cluster. Udp2log is only pointed at a single instance currently (logstash1001).
- HA strategy?
- Native Elasticsearch clustering. Kibana can be served from any of the 3 nodes. Udp2log input is currently a single point of failure.
- QoS terms?
- No QoS/ToS requirements have been established yet.
- Done tee the logs from MW to logstash (in addition to current fluorine logs)
- structured logging RFC
- Design proposed PHP API changes
- Clean up the proposal to remove/archive other examples
- Done Submit structured logging RFC
- Ori: Record sync/scap elapsed wall clock times in graphite
- On-wiki documentation of fatal and exception logging on the cluster - bug 52026
- Make l10nupdate emit useful log messages
- Report to SAL where the log lives
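For the item above about recording sync/scap elapsed wall clock times in Graphite, here is a rough sketch using Graphite's plaintext protocol (one `path value timestamp` line per datapoint, conventionally sent to TCP port 2003). The metric path `deploy.scap.elapsed`, the host, and the helper names are hypothetical; this is not how scap itself is instrumented.

```python
import socket
import time
from contextlib import contextmanager


def format_graphite_line(path, value, timestamp):
    """Build one line of Graphite's plaintext protocol."""
    return f"{path} {value:.3f} {int(timestamp)}\n"


def send_to_graphite(line, host="localhost", port=2003):
    """Send a single plaintext-protocol line over TCP.

    host and port are assumptions; point them at the real carbon listener.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


@contextmanager
def timed(path, send=send_to_graphite):
    """Measure the wall clock time of the enclosed block and report it."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        send(format_graphite_line(path, elapsed, time.time()))


# Usage with a stand-in sender, so nothing is sent over the network here.
captured = []
with timed("deploy.scap.elapsed", send=captured.append):
    time.sleep(0.01)  # placeholder for the actual sync step
print(captured[0], end="")
```

Passing the sender as a parameter keeps the timing logic testable without a running Graphite instance.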
- Get Icinga to alert on important metrics
- Determine whether Icinga has the Graphite plugin
- Have Graphite use the prediction algorithm plugin for alertable metrics
- Figure out where platform alerts should go (mw-core initially?)
- Turn stuff on (for a set of metrics)
- Review current metrics for alertable ones
- Find new relevant monitoring/alerting metrics
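As an illustration of the Icinga alerting items above, the following sketch evaluates Graphite datapoints against static thresholds and maps the result to the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It deliberately uses fixed thresholds rather than the prediction-based approach mentioned in the list, and the function name and sample values are assumptions. The input shape follows the Graphite render API's JSON output, where each datapoint is a `[value, timestamp]` pair and missing values are null.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(datapoints, warn, crit):
    """Return (exit_code, message) for the most recent non-null datapoint.

    datapoints: list of [value, timestamp] pairs, value may be None.
    warn, crit: static thresholds; higher values are treated as worse.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL: {latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"WARNING: {latest} >= {warn}"
    return OK, f"OK: {latest}"


# Hypothetical series with gaps, as Graphite may return for recent intervals.
series = [[1.0, 1380000000], [None, 1380000060], [5.0, 1380000120], [None, 1380000180]]
code, message = check_threshold(series, warn=3, crit=10)
print(code, message)
```

A real check would fetch the series over HTTP and finish with `sys.exit(code)` so that Icinga picks up the state.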
- Instrumentation in scap to relay more information out of the deployer's terminal
- Dashboard grid of servers being updated, with color indicating the status of each server's code version
- Log/show exceptions per file/extension
- simple text file initially
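The "exceptions per file/extension, as a simple text file" idea could start out as something like the sketch below. It assumes log lines in PHP's usual `... in /path/file.php on line N` form and assumes that extension code lives under an `extensions/<Name>/` directory; the sample paths and function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the "in <file> on line <n>" tail of a PHP error message.
LOCATION = re.compile(r" in (\S+\.php) on line \d+")
# Assumption: extension code sits under .../extensions/<Name>/...
EXTENSION = re.compile(r"/extensions/([^/]+)/")


def count_by_file(lines):
    """Count error lines per source file; lines without a location are skipped."""
    counts = Counter()
    for line in lines:
        match = LOCATION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_by_extension(file_counts):
    """Roll per-file counts up to extension names, with 'core' as the fallback."""
    counts = Counter()
    for path, n in file_counts.items():
        match = EXTENSION.search(path)
        counts[match.group(1) if match else "core"] += n
    return counts


def render_report(counts):
    """Render counts as plain text, most frequent first."""
    return "\n".join(f"{n:6d}  {name}" for name, n in counts.most_common())


sample = [
    "PHP Fatal error:  Call to undefined method Foo::bar() in /srv/mw/extensions/Foo/Foo.php on line 12",
    "PHP Fatal error:  Call to undefined method Foo::bar() in /srv/mw/extensions/Foo/Foo.php on line 12",
    "PHP Fatal error:  boom in /srv/mw/includes/Title.php on line 99",
    "unrelated line",
]
print(render_report(count_by_extension(count_by_file(sample))))
```

Writing the rendered string to a file on a schedule would give the "simple text file" first version before any dashboard work.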
- Logstash in production
- Done (Hashar) Create logstash project in labs for testing
- Done (Bryan) Install logstash in labs for testing
- bug 22390 - When a commons image is updated, update the pages that use it
- Brian Wolff added a patch that deals with one aspect; the other half is still needed
sorted (roughly) by priority
- bug 17577 - Include version in thumbnail URL
- bug 48835 - Separate Cache-Control header for proxy and client
- Implement thumbnail purging RfC
- bug 17577 - Image urls should have far future expires
- bug 46770 - Rewrite jobs-loop.sh in a proper programming language
- Done (Brad) bug 5382 - Queue refreshLinks jobs on template deletion
- Done (Tim) bug 27935 - Redirect to canonical encoding
Delayed/de-prioritized relative to Monitoring
- As a release manager, I want to eliminate manual steps that may be overlooked so that the Ops team doesn't get paged
- Maintain reasonable usability
- Speed: no more than 10-15 minutes
- Graceful handling of unresponsive Apaches
- A workflow for security patches
- Better alerting / monitoring
- Smoke test
- Usability: commands used should map to logical activities rather than minutiae
- Better SAL entries (include commit ranges)
- Maybe an easy way to get diffs of what was deployed
- Deal with umask and .bashrc insanity
- Audit of salt scripts for completeness - bug 43615
- Add rsync backend for Trebuchet - bug 54185
- Add submodule (and recursive submodule) support to Trebuchet - bug 51581
- (Ryan) Put new Trebuchet frontend on labs
- (Aaron) replace scap-recompile with a .deb package of texvc - bug 45076
- Enable Trebuchet logging to SAL/IRC
- Fix the db migration from small to medium - bug 56222
- Integrate work from Joey H into Trebuchet (git corruption fix, bug 51142)
- Test Trebuchet on production to dummy dir, point a testwiki to it
- Look at l10nupdate; see DevOps Sprint 2013/l10nupdate dataflow
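One of the items above asks for SAL entries that include commit ranges. A small sketch of what building such an entry might look like follows, using git's `old..new` range notation with abbreviated hashes. The message wording, the function name, and the sample values are assumptions and do not reflect the actual SAL format.

```python
def format_sal_entry(user, repo, old_rev, new_rev, note=""):
    """Build a log line naming the deployed commit range.

    Hashes are shortened to 7 characters, which is only an abbreviation
    convention and not guaranteed to be unique in a large repository.
    """
    short_old, short_new = old_rev[:7], new_rev[:7]
    if short_old == short_new:
        change = f"{repo} unchanged at {short_new}"
    else:
        change = f"{repo} {short_old}..{short_new}"
    entry = f"{user}: deployed {change}"
    return f"{entry} ({note})" if note else entry


# Hypothetical deployer, repository, and revisions.
print(format_sal_entry("alice", "mediawiki/core",
                       "0123456789abcdef", "fedcba9876543210", "bug fix"))
```

The same range string can be passed to `git log` or `git diff` to answer the "what was deployed" question raised in the list.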