Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals
Program Goals and Status for FY18/19
- Goal Owner: Mark Bergsma
- Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
- Annual Plan: TEC6 Address Infrastructure Gaps
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining
Q1
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Dependencies on: Search Platform; Primary team: Infrastructure Foundations
Goal(s)
Adopt Logstash: Done
- Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
- Audit log producers across the infrastructure and plan their transition to centralized logging.
- Investigate log shipping methods and standardize on them.
Status
Note: July 2018
In progress
Note: August 14, 2018
In progress
Note: September 11, 2018
In progress. A comprehensive design document for logging has been prepared and is currently in final review.
Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence
Goal(s)
Monitor database backup generation for failure or incorrect generation: Done
- Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
- Detect and alert on backup metrics anomalies
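As a rough illustration of the anomaly-detection goal above, the following minimal sketch flags backups that are missing or whose size deviates sharply from their own history. It is not the software the team wrote: the function name `find_backup_anomalies`, the input layout, and the 20% tolerance are all assumptions made for this example.

```python
from statistics import median


def find_backup_anomalies(history, latest, tolerance=0.2):
    """Return {backup_name: reason} for backups that look anomalous.

    history:   dict mapping a backup name to a list of past sizes in bytes.
    latest:    dict mapping a backup name to its newest size in bytes.
    tolerance: allowed relative deviation from the historical median
               (hypothetical default, not a value from the plan).
    """
    anomalies = {}
    for name, past_sizes in history.items():
        size = latest.get(name)
        if size is None:
            # A backup that was expected but never produced is an anomaly.
            anomalies[name] = "backup missing"
            continue
        if not past_sizes:
            continue  # no baseline yet, nothing to compare against
        baseline = median(past_sizes)
        if baseline == 0:
            if size > 0:
                anomalies[name] = "size is non-zero but baseline is zero"
            continue
        deviation = abs(size - baseline) / baseline
        if deviation > tolerance:
            anomalies[name] = f"size deviates {deviation:.0%} from median"
    return anomalies


if __name__ == "__main__":
    # Hypothetical section names and sizes, for illustration only.
    history = {"s1": [100, 102, 98], "s2": [200, 210, 205], "s3": [50, 50]}
    latest = {"s1": 101, "s2": 90}
    for name, reason in sorted(find_backup_anomalies(history, latest).items()):
        print(f"{name}: {reason}")
```

Comparing against the median rather than the mean keeps a single unusually large or small historical backup from shifting the baseline; a production check would also need to account for legitimate growth over time.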
Status
Note: July 30, 2018
In progress
Note: August 14, 2018
In progress
Note: September 11, 2018
In progress. Software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations
Goal(s)
Migrate the hardware inventory from Racktables to Netbox: Done
- Define Netbox existing and custom fields usage standards/best practices
- Switch over from Racktables to Netbox
- Stretch: Investigate Netbox reporting capabilities to automatically validate data
- Stretch: Investigate Netbox potential future integrations, towards a single source of truth
To do
Status
Note: July 30, 2018
In progress
Note: August 2018
In progress
Note: September 11, 2018
In progress. A final proposal for custom fields usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switch from eqiad to codfw.
Q2
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Begin the implementation of Q1's Logging Infrastructure design
- Procure and provision Logging pipeline hardware in multiple datacenters
In progress
- Migrate >=90% of existing Logstash traffic to the logging pipeline
In progress
- Onboard at least 10 new non-sensitive log producers to the logging pipeline
In progress
- Investigate approaches to ingest sensitive log producers
To do
- [stretch] Deprecate >= 50% of udp2log producers
To do
Expand modern metrics infrastructure coverage
- Plan and execute a new organization scheme for SRE Grafana dashboards
In progress
- Retire >= 80% of production Diamond collectors
In progress
- Provision >= 50% of statsd/Graphite-only metrics in Prometheus
In progress
Status
Note: November 14, 2018
- Updated goals to reflect current status.
Note: December 12, 2018
- The implementation of the logging infrastructure is going well; it is mostly still in progress and is expected to be done by the end of December. The stretch goals will be done in Q3.
- Expanding the metrics infrastructure is going well; it is in progress and should be done by end of quarter.
Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
- Design and prepare infrastructure for database binary backups
In progress
- Research options for producing binary backups (lvm snapshots, cold backups, mariabackup)
Done
- Implement a proof of concept of a snapshot cycle automation for a mediawiki section database
In progress
- Procure hardware for binary backups
In progress
Status
Note: November 14, 2018
- Updated goals to reflect current status.
Note: December 12, 2018
- This goal is progressing much more slowly than expected, for various reasons, and will be completed in Q3.
Outcome 3 / Output 4 (Performance)
Wikimedia projects and content are protected against major disasters that threaten availability.
Primary teams: SRE / Data Persistence, Performance
Goal(s)
- Test performance implications of MySQL TLS connectivity in production, once ready (carried over from FY17/18 Q4)
To do
- Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views
To do
Status
Note: November 14, 2018
- Updated goals to reflect current status.
Note: December 12, 2018
- TLS is still stalled on DBA technology selection/implementation because other work has higher priority.
- Watchlist is also stalled due to emergent and other higher-priority work; we hope to get it done in early Q3.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Expand Spicerack library and SRE Cookbooks
- Split and convert the existing wmf-auto-reimage-lib into Spicerack modules
In progress
- Convert wmf-auto-reimage scripts to Cookbooks
In progress
- Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish)
To do
- Generate documentation for Spicerack
To do
Expand Netbox usage
- Upgrade Netbox to the latest version (>= 2.4)
Done
- Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.)
Done
- Explore Netbox/NAPALM integration to pull live data from network devices
Done
- Develop and deploy at least three Netbox reports to assist with data correctness and consistency
In progress
- [stretch] Add a Cumin backend for Netbox
To do
Status
Note: November 14, 2018
- The migration of logging to Logstash and metrics into Prometheus is in progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.
Note: December 12, 2018
- "Convert wmf-auto-reimage scripts to Cookbooks" is in progress and will mostly be finished in Q3 due to the holidays. The other two goals will start after the conversion is done.
- "Upgrade Netbox to the latest version" is done, but the stretch goal will mostly be tackled in Q3.
Q3
Outcome 1 / Output 1
Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.
- Create a staging cluster comparable to production infrastructure
Primary teams: SRE / Service Operations, Release Engineering
Goal(s)
First steps towards Canary Deployments
- Introduce progressive rollouts to the mediawiki train
- Introduce deployment run state in scap to keep track of successful scap runs
- Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions
Status
Note: April 8, 2019
- This has been postponed to Q4.
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Build an understanding of our needs around external monitoring services
- Produce a short document with a cost/benefit analysis of our current external monitoring systems
- Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)
Increase utilization of application logging pipeline
- Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
- Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
- Retire udp2log: onboard its producers and consumers to the logging pipeline
- [stretch] Implement sensitive log access control, onboard 3 sensitive log producers
Upgrade metrics monitoring infrastructure core components
- Serve >= 50% of production Prometheus systems with Prometheus v2
- Upgrade production prometheus-node-exporter to >= 0.16
- [stretch] Investigate distributed and long term storage solutions for Prometheus
Status
Note: April 8, 2019
- "Build an understanding of our needs around external monitoring services" is partially done in Q3.
- "Increase utilization of application logging pipeline" is partially done: work remains on the "Migrate at least 3 existing Logstash inputs" goal (so it is partially done), while retiring udp2log and the stretch goal have been postponed to Q4.
Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
Design and prepare infrastructure for database binary backups
- Design a backup policy for logical and binary backups for both short term and long term storage
- Procure and set up final hardware for binary backups
- Fully implement binary backups and their rotation policy for all MediaWiki metadata and misc databases
Status
Note: April 8, 2019
- The backup policy is done, but hardware procurement and implementation have been postponed to Q4.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Build automated workflows for server provisioning
- Take additional steps towards a "single source of truth" system (Netbox)
- Upgrade Netbox to v2.5 and use the new cable tracking feature
- Expose production VMs to Netbox and keep them synchronized with Ganeti
- Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
- Redesign the server provisioning and decommissioning process to facilitate orchestration
- Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
- Convert virtual machine creation script to a cookbook
- Reduce the number of manual steps involved in the provisioning process by at least 4
Status
Note: April 8, 2019
- Both goals are in progress and will continue into Q4.
Q4
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Dependencies on:
Goal(s)
Logging
- Deprecate all non-Kafka logstash inputs
- [stretch] Implement sensitive log access control, onboard 3 sensitive log producers
Metrics
- 100% of Prometheus traffic served by Prometheus v2
- Migrate all metrics originated by PoPs from statsd to Prometheus
- Investigate distributed and long term storage solutions for Prometheus
Status
Note: May 8, 2019
- Logging: deprecating non-Kafka inputs is in progress; the stretch goal is still to do.
- Metrics: 100% of Prometheus traffic served by Prometheus v2 is now done! :)
- Migrating the metrics and investigating the distributed storage solutions are in progress.
Note: June 13, 2019
- Logging: in progress, but might be pushed into next quarter along with the stretch goal.
- Metrics: serving 100% of Prometheus traffic with Prometheus v2 is done. Migrating all metrics is currently blocked, but the blocker should be resolved by end of quarter. Investigating long-term storage is partially done and will be completed by end of quarter.
Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
Stretch: Set up and deploy backup hardware
- Install and set up eqiad/codfw backups/recovery hosts
- Install and set up dump slaves
- Perform fine tuning of snapshot and dumps performance on final hardware
- Decommission old backups hosts dbstore1001, dbstore2001 and dbstore2002
Status
Note: May 8, 2019
- Installing and setting up the backup hosts and dump slaves is done; the rest is still in progress: fine tuning is ongoing and the decommissioning of the old hosts will take place later.
Note: June 13, 2019
- Installing and setting up the eqiad/codfw backups/recovery hosts is done.
- Installing and setting up the dump slaves is done.
- Fine tuning of snapshot performance is still in progress and will be done by end of quarter.
- Decommissioning the old backup hosts is blocked on time: it has to wait until the other work is done by end of quarter.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)
Goal(s)
Database workflows automation
- Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
- Migrate MediaWiki to use etcd for the database configuration in production
- Write Spicerack abstractions for common database operations (pool/depool)
- [stretch] Write Spicerack cookbooks to automate 2 common DBA workflows
Status
Note: May 8, 2019
- All of this is in progress except for the stretch goal.
Note: June 13, 2019
- "Complete and deploy the tool" should be finished by the end of this quarter; the rest of this goal will carry over into next quarter.
Outcome 4 / Output 7
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts
Primary teams: SRE / Infrastructure Foundations
Dependencies on: Cloud Services, Security
Goal(s)
Developer account management
- Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
- Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
- Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
- Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts
Status
Note: May 8, 2019
- The audit and documentation of production and WMCS infrastructure is in progress; the other goals are awaiting its completion.
Note: June 13, 2019
- The audit of production and WMCS infrastructure is done.
- Engaging with stakeholders and collecting functional and non-functional requirements is in progress and should be done by end of quarter.
- Evaluating free & open source Identity Management/SSO software solutions is partially done.
- Building a migration plan is still to do; the team met this week, so it should soon be in progress, but it will probably finish early next quarter.