Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals

=Program Goals and Status for FY18/19=

TEC6 Address Infrastructure Gaps
 * Goal Owner: Mark Bergsma
 * Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
 * Annual Plan: TEC6 Address Infrastructure Gaps
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining



= Q1 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)
Adopt Logstash ✅
 * Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
 * Audit log producers across the infrastructure and plan their transition to centralized logging.
 * Investigate log shipping methods and standardize on them.

Status
July 2018

August 14, 2018

September 11, 2018
 * A comprehensive design document has been prepared for logging and is currently in final review.

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)
Monitor database backup generation for failure or incorrect generation ✅
 * Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
 * Detect and alert on backup metrics anomalies

Status
July 30, 2018

August 14, 2018

September 11, 2018
 * Software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)
Migrate the hardware inventory from Racktables to Netbox ✅
 * Define Netbox existing and custom fields usage standards/best practices
 * Switch over from Racktables to Netbox
 * Stretch: Investigate Netbox reporting capabilities to automatically validate data
 * Stretch: Investigate Netbox potential future integrations, towards a single source of truth

Status
July 30, 2018

August 2018

September 11, 2018
 * A final proposal for custom field usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switch from eqiad to codfw.



= Q2 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Begin the implementation of Q1's Logging Infrastructure design

 * Procure and provision Logging pipeline hardware in multiple datacenters
 * Migrate >= 90% of existing Logstash traffic to the logging pipeline
 * Onboard at least 10 new non-sensitive log producers to the logging pipeline
 * Investigate approaches to ingest sensitive log producers
 * [stretch] Deprecate >= 50% of udp2log producers

Expand modern metrics infrastructure coverage

 * Plan and execute a new organization scheme for SRE Grafana dashboards
 * Retire >= 80% of production Diamond collectors
 * Provision >= 50% of statsd/Graphite-only metrics in Prometheus

Status
November 14, 2018
 * updated goals for current status

December 12, 2018
 * The implementation of the logging infrastructure is going well, though mostly still in progress, and is expected to be ✅ done by the end of December. The stretch goals will be done in Q3.
 * Expanding the metrics infrastructure is going well and should be done by the end of the quarter.

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

 * Design and prepare infrastructure for database binary backups
 * Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) ✅
 * Implement a proof of concept of a snapshot cycle automation for a mediawiki section database
 * Procure hardware for binary backups

Status
November 14, 2018
 * updated goals for current status

December 12, 2018
 * This goal is progressing much more slowly than expected, for various reasons, and will be completed in Q3.

Outcome 3 / Output 4 (Performance)
Wikimedia projects and content are protected against major disasters that threaten availability.



Primary teams: SRE / Data Persistence, Performance

Goal(s)

 * Test performance implications of MySQL TLS connectivity in production, once ready (carried over from FY17/18 Q4)
 * Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views

Status
November 14, 2018
 * updated goals for current status

December 12, 2018
 * TLS is still ❌ blocked on DBA technology selection/implementation because other work has higher priority.
 * Watchlist is also ❌ blocked by emergent and other higher-priority work; we hope to get it done in early Q3.

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Expand Spicerack library and SRE Cookbooks

 * Split and convert the existing wmf-auto-reimage-lib into Spicerack modules
 * Convert wmf-auto-reimage scripts to Cookbooks
 * Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish)
 * Generate documentation for Spicerack

Expand Netbox usage

 * Upgrade Netbox to the latest version (>= 2.4) ✅
 * Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.) ✅
 * Explore Netbox/NAPALM integration to pull live data from network devices ✅
 * Develop and deploy at least three Netbox reports to assist with data correctness and consistency
 * [stretch] Add a Cumin backend for Netbox

Status
November 14, 2018
 * The migration of logging to Logstash and of metrics into Prometheus is in progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

December 12, 2018
 * "Convert wmf-auto-reimage scripts to Cookbooks" is in progress and will mostly be finished in Q3 because of the holidays. The other two goals will start after the conversion is done.
 * "Upgrade Netbox to the latest version" is ✅ done, but the stretch goal will mostly be tackled in Q3.



= Q3 Goals =

Outcome 1 / Output 1
Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.


 * Create a staging cluster comparable to production infrastructure

Primary teams: SRE / Service Operations, Release Engineering

Goal(s)
First steps towards Canary Deployments


 * Introduce progressive rollouts to the mediawiki train
 * Introduce deployment run state in scap to keep track of successful scap runs
 * Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions

Status
April 8, 2019
 * This has been ❌ postponed to Q4.

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Build an understanding of our needs around external monitoring services

 * Produce a short document with a cost/benefit analysis of our current external monitoring systems
 * Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)

Increase utilization of application logging pipeline

 * Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
 * Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
 * Retire udp2log: onboard its producers and consumers to the logging pipeline
 * [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Upgrade metrics monitoring infrastructure core components

 * Serve >= 50% of production Prometheus systems with Prometheus v2
 * Upgrade production prometheus-node-exporter to >= 0.16
 * [stretch] Investigate distributed and long term storage solutions for Prometheus
 * Formulate requirements around aggregation, retention, hardware, etc.
 * Evaluate M3 and Thanos

Status
April 8, 2019
 * "Build an understanding of our needs around external monitoring services" is partially done in Q3.
 * "Increase utilization of application logging pipeline" is still in progress: work remains on the "Migrate at least 3 existing Logstash inputs" goal, and retiring udp2log and the stretch goal have been ❌ postponed to Q4.

Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.


 * Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)
Design and prepare infrastructure for database binary backups
 * Design a backup policy for logical and binary backups for both short term and long term storage
 * Procure and set up final hardware for binary backups
 * Fully implement binary backups and their rotation policy for all MediaWiki metadata and misc databases

Status
April 8, 2019
 * The backup policy is ✅ done, but hardware procurement and the implementation of binary backups have been ❌ postponed to Q4.

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Build automated workflows for server provisioning

 * Take additional steps towards a "single source of truth" system (Netbox)
 * Upgrade Netbox to v2.5 and use the new cable tracking feature
 * Expose production VMs to Netbox and keep them synchronized with Ganeti
 * Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
 * Redesign the server provisioning and decommissioning process to facilitate orchestration
 * Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
 * Convert virtual machine creation script to a cookbook
 * Reduce the number of manual steps involved in the provisioning process by at least 4

Status
April 8, 2019
 * Both goals are in progress and will continue into Q4.



= Q4 Goals =

Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.


 * Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Stretch: Set up and deploy backup hardware

 * Install and set up eqiad/codfw backup/recovery hosts
 * Install and set up dump slaves
 * Fine-tune snapshot and dump performance on the final hardware
 * Decommission old backup hosts dbstore1001, dbstore2001 and dbstore2002

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Dependencies on:

Logging

 * Deprecate all non-Kafka Logstash inputs
 * [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Metrics

 * 100% of Prometheus traffic served by Prometheus v2
 * Migrate all metrics originated by PoPs from statsd to Prometheus
 * Investigate distributed and long term storage solutions for Prometheus

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...

Outcome 4 / Output 7
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts

Primary teams: SRE / Infrastructure Foundations

Dependencies on: Cloud Services, Security, 

Developer account management

 * Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
 * Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
 * Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
 * Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.


 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)

Database workflows automation

 * Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
 * Migrate MediaWiki to use etcd for the database configuration in production
 * Write Spicerack abstractions for common database operations (pool/depool)
 * [stretch] Write Spicerack cookbooks to automate 2 common DBA workflows

Status
April 2019
 * Discussed...

May 2019
 * Discussed...

June 2019
 * Discussed...