Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals

=Program Goals and Status for FY18/19=

TEC6 Address Infrastructure Gaps
 * Goal Owner: Mark Bergsma
 * Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
 * Annual Plan: TEC6 Address Infrastructure Gaps
 * Primary Goal is Knowledge as a Service: Evolve our systems and structures
 * Tech Goal: Sustaining



= Q1 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)
Adopt Logstash ✅
 * Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
 * Audit log producers across the infrastructure and plan their transition to centralized logging.
 * Investigate log shipping methods and standardize on them.

Status
July 2018

August 14, 2018

September 11, 2018
 * A comprehensive design document has been prepared for logging and is currently in final review.

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)
Monitor database backup generation for failure or incorrect generation ✅
 * Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
 * Detect and alert on backup metrics anomalies
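
As a rough illustration of the anomaly detection named in the goal above, the sketch below flags a backup whose size deviates sharply from the median of recent runs. The function name, the 20% threshold, and the sample sizes are hypothetical and are not taken from the actual backup tooling.

```python
from statistics import median


def is_backup_anomalous(history_bytes, latest_bytes, max_deviation=0.20):
    """Return True if the latest backup size deviates from the median of
    previous sizes by more than max_deviation (expressed as a fraction)."""
    if not history_bytes:
        # No baseline yet, so there is nothing to compare against.
        return False
    baseline = median(history_bytes)
    if baseline == 0:
        # Avoid dividing by zero; any non-empty backup is a change.
        return latest_bytes != 0
    return abs(latest_bytes - baseline) / baseline > max_deviation


# Hypothetical sizes (in bytes) of the last four backups of one section.
history = [100_000, 102_000, 101_000, 103_000]
normal_run = is_backup_anomalous(history, 104_000)
truncated_run = is_backup_anomalous(history, 40_000)
```

A median baseline is used here because a single earlier bad backup does not skew it the way a mean would; a real check would likely also track file counts and generation time.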

Status
July 30, 2018

August 14, 2018

September 11, 2018
 * Software to generate and track metrics for database backups has been written, and will soon be used to set up alerts on backup anomalies.

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)
Migrate the hardware inventory from Racktables to Netbox ✅
 * Define Netbox existing and custom fields usage standards/best practices
 * Switch over from Racktables to Netbox
 * Stretch: Investigate Netbox reporting capabilities to automatically validate data
 * Stretch: Investigate Netbox potential future integrations, towards a single source of truth

Status
July 30, 2018

August 2018

September 11, 2018
 * A final proposal for custom fields usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switch from eqiad to codfw.



= Q2 Goals =

Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
 * Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Begin the implementation of Q1's Logging Infrastructure design

 * Procure and provision Logging pipeline hardware in multiple datacenters
 * Migrate >= 90% of existing Logstash traffic to the logging pipeline
 * Onboard at least 10 new non-sensitive log producers to the logging pipeline
 * Investigate approaches to ingest sensitive log producers
 * [stretch] Deprecate >= 50% of udp2log producers

Expand modern metrics infrastructure coverage

 * Plan and execute a new organization scheme for SRE Grafana dashboards
 * Retire >= 80% of production Diamond collectors
 * Provision >= 50% of statsd/Graphite-only metrics in Prometheus

Status
November 14, 2018
 * Updated goals to reflect current status

December 2018
 * Discussed...

Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
 * Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

 * Design and prepare infrastructure for database binary backups
 * Research options for producing binary backups (lvm snapshots, cold backups, mariabackup)
 * Implement a proof of concept of a snapshot cycle automation for a mediawiki section database
 * Procure hardware for binary backups

Status
November 14, 2018
 * Updated goals to reflect current status

December 2018
 * Discussed...

Outcome 3 / Output 4 (Performance)
Wikimedia projects and content are protected against major disasters that threaten availability.



Primary teams: SRE / Data Persistence, Performance

Goal(s)

 * Test performance implications of MySQL TLS connectivity in production, once ready (carried over from 1718Q4)
 * Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views

Status
November 14, 2018
 * Updated goals to reflect current status

December 2018
 * Discussed...

Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
 * Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Expand Spicerack library and SRE Cookbooks

 * Split and convert the existing wmf-auto-reimage-lib into Spicerack modules
 * Convert wmf-auto-reimage scripts to Cookbooks
 * Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish)
 * Generate documentation for Spicerack

Expand Netbox usage

 * Upgrade Netbox to the latest version (>= 2.4) ✅
 * Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.)
 * Explore Netbox/NAPALM integration to pull live data from network devices
 * Develop and deploy at least three Netbox reports to assist with data correctness and consistency
 * [stretch] Add a Cumin backend for Netbox
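
To give a sense of what a data correctness and consistency report might check, here is a minimal plain-Python sketch. It does not use Netbox's own report framework; the record layout, field names, and device names are all hypothetical, chosen only to illustrate the kind of rule such a report could enforce.

```python
from collections import Counter


def check_devices(devices):
    """Return a list of human-readable problems found in device records."""
    problems = []
    # Count serial numbers so duplicates can be reported per device.
    serial_counts = Counter(d["serial"] for d in devices if d.get("serial"))
    for d in devices:
        name = d.get("name", "<unnamed>")
        if d.get("status") == "active" and not d.get("rack"):
            problems.append(f"{name}: active device has no rack assigned")
        serial = d.get("serial")
        if not serial:
            problems.append(f"{name}: missing serial number")
        elif serial_counts[serial] > 1:
            problems.append(f"{name}: duplicate serial {serial}")
    return problems


# Hypothetical inventory records.
inventory = [
    {"name": "host1001", "status": "active", "rack": "A1", "serial": "SN1"},
    {"name": "host1002", "status": "active", "rack": None, "serial": "SN2"},
    {"name": "host1003", "status": "spare", "rack": "A2", "serial": "SN1"},
    {"name": "host1004", "status": "spare", "rack": "A3", "serial": ""},
]
report = check_devices(inventory)
```

Returning a flat list of messages keeps the check easy to run from a scheduled job and to diff between runs.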

Status
November 14, 2018
 * The migration of logging to Logstash and metrics into Prometheus is in progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

December 2018
 * Discussed...