Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals

=Program Goals and Status for FY18/19=

TEC6 Address Infrastructure Gaps
 * Goal Owner: Mark Bergsma
 * Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
 * Annual Plan: TEC6 Address Infrastructure Gaps
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining



= Q1 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)
Adopt Logstash ✅
 * Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
 * Audit log producers across the infrastructure and plan their transition to centralized logging.
 * Investigate log shipping methods and standardize on them.

Status
July 2018

August 14, 2018

September 11, 2018
 * A comprehensive design document has been prepared for logging and is currently in final review.

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)
Monitor database backup generation for failure or incorrect generation ✅
 * Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
 * Detect and alert on backup metrics anomalies
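
As a minimal sketch of what detecting a backup metrics anomaly could look like, assuming each backup run records a total size in bytes. The function name, the 20% threshold and the sample figures are hypothetical, and this is not the software referred to in the status notes below:

```python
from statistics import median


def is_backup_anomalous(history, latest, max_deviation=0.2):
    """Flag a backup whose size is missing, empty, or far from recent runs.

    history: sizes in bytes of previous successful backups (hypothetical input)
    latest: size in bytes of the newest backup, or None if it was not produced
    max_deviation: allowed relative difference from the median of `history`
    """
    # A missing or empty backup is always an anomaly.
    if latest is None or latest <= 0:
        return True
    # With no history there is nothing to compare against yet.
    if not history:
        return False
    baseline = median(history)
    if baseline <= 0:
        return True
    # Compare the relative change against the allowed deviation.
    return abs(latest - baseline) / baseline > max_deviation


# Usage: sizes of the last three backups versus the newest one.
previous_sizes = [100, 102, 104]
print(is_backup_anomalous(previous_sizes, 103))
print(is_backup_anomalous(previous_sizes, 50))
```

Using the median rather than the mean keeps a single earlier outlier from shifting the baseline; a real alerting rule would likely also account for expected growth over time.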

Status
July 30, 2018

August 14, 2018

September 11, 2018
 * Software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)
Migrate the hardware inventory from Racktables to Netbox ✅
 * Define Netbox existing and custom fields usage standards/best practices
 * Switch over from Racktables to Netbox
 * Stretch: Investigate Netbox reporting capabilities to automatically validate data
 * Stretch: Investigate Netbox potential future integrations, towards a single source of truth

Status
July 30, 2018

August 2018

September 11, 2018
 * A final proposal for custom field usage standards and best practices is under discussion; work, including the switch to Netbox, will continue after the data center switch from eqiad to codfw.



= Q2 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Begin the implementation of Q1's Logging Infrastructure design

 * Procure and provision Logging pipeline hardware in multiple datacenters
 * Migrate >= 90% of existing Logstash traffic to the logging pipeline
 * Onboard at least 10 new non-sensitive log producers to the logging pipeline
 * Investigate approaches to ingest sensitive log producers
 * [stretch] Deprecate >= 50% of udp2log producers

Expand modern metrics infrastructure coverage

 * Plan and execute a new organization scheme for SRE Grafana dashboards
 * Retire >= 80% of production Diamond collectors
 * Provision >= 50% of statsd/Graphite-only metrics in Prometheus

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

 * Design and prepare infrastructure for database binary backups
 * Research options for producing binary backups (LVM snapshots, cold backups, mariabackup)
 * Implement a proof of concept of an automated snapshot cycle for a MediaWiki section database
 * Procure hardware for binary backups

Outcome 3 / Output 4 (Performance)
Wikimedia projects and content are protected against major disasters that threaten availability.



Primary teams: SRE / Data Persistence, Performance

Goal(s)

 * Test performance implications of MySQL TLS connectivity in production, once ready (carried over from FY17/18 Q4)
 * Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Expand Spicerack library and SRE Cookbooks

 * Split and convert the existing wmf-auto-reimage-lib into Spicerack modules
 * Convert wmf-auto-reimage scripts to Cookbooks
 * Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish)
 * Generate documentation for Spicerack

Expand Netbox usage

 * Upgrade Netbox to the latest version (>= 2.4) ✅
 * Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.)
 * Explore Netbox/NAPALM integration to pull live data from network devices
 * Develop and deploy at least three Netbox reports to assist with data correctness and consistency
 * [stretch] Add a Cumin backend for Netbox
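
The kind of data correctness and consistency check such reports might perform can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use the Netbox reporting API, and the record fields (`name`, `status`, `rack`, `serial`) and the rules themselves are hypothetical:

```python
from collections import Counter


def check_inventory(devices):
    """Return a list of human-readable problems found in device records.

    devices: list of dicts with hypothetical keys name, status, rack, serial.
    """
    problems = []
    # Count serial numbers so duplicates can be reported per device.
    serial_counts = Counter(d["serial"] for d in devices if d.get("serial"))
    for d in devices:
        name = d.get("name", "<unnamed>")
        # Rule 1 (assumed): an active device must be assigned to a rack.
        if d.get("status") == "active" and not d.get("rack"):
            problems.append(f"{name}: active device has no rack assigned")
        # Rule 2 (assumed): every device has a unique serial number.
        serial = d.get("serial")
        if not serial:
            problems.append(f"{name}: missing serial number")
        elif serial_counts[serial] > 1:
            problems.append(f"{name}: duplicate serial {serial}")
    return problems


# Usage with made-up records.
sample = [
    {"name": "host1", "status": "active", "rack": "A1", "serial": "S1"},
    {"name": "host2", "status": "active", "rack": None, "serial": "S2"},
    {"name": "host3", "status": "spare", "rack": "A2", "serial": "S2"},
]
for problem in check_inventory(sample):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a single run surface every inconsistency at once.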

Status
October 2018
 * Discussed...

November 2018
 * The migration of logging to Logstash and of metrics to Prometheus is well under way. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored to incorporate wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

December 2018
 * Discussed...