Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps

Current Goal Status.

Teams contributing to the program
Site Reliability Engineering

Annual Plan priorities
Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?
The wiki projects are Wikimedia’s primary tool for advancing its mission, and the underlying infrastructure is core to its work. By evolving this infrastructure and strengthening the teams, processes and structures supporting it, we are putting Wikimedia in a better position to execute its mid-term strategy.

Program Goal
Long-standing gaps in the resiliency, reliability and maintainability of Wikimedia’s technical infrastructure and resourcing of supporting teams are addressed.
 * Outcome 1
 * Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment and work as expected.
 * Output 1
 * Create a staging cluster comparable to production infrastructure


 * Output 2
 * Migrate (micro)services to our Streamlined Service Delivery platform with integrated CI/CD


 * Outcome 2
 * Technical staff have increased visibility into the operation of our services and infrastructure.


 * Output 3
 * Modernize logging, alerting and metrics monitoring infrastructure


 * Outcome 3
 * Wikimedia projects and content are protected against major disasters that threaten availability.


 * Output 4
 * Strengthen backups with reliable and redundant backup infrastructure


 * Output 5
 * Serve projects and services out of multiple data centers


 * Outcome 4
 * Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.


 * Output 6
 * Automate common operational tasks around service deployment, maintenance and incident response, and build automated workflows for data center infrastructure, network and equipment lifecycle management


 * Output 7
 * Provision centralized, self-service identity and access management for privileged staff and volunteer accounts


 * Outcome 5
 * The Site Reliability team is able to perform its duties with adequate resourcing and a more reasonable division of responsibilities.


 * Output 8
 * Continue the FY17-18 efforts to build a management structure that supports the SRE team's growth and process duties


 * Output 9
 * Address under-resourcing and reduce the bus factor in several key areas by adding engineering capacity/staffing

Outcome 2

 * Targets
 * 1) 20% increase in services that have adopted the modern metrics stack
 * 2) 100% of services involved in page views use centralized logging
 * Measurement method
 * 1) Percentage of modern metrics stack adoption
 * 2) Percentage of production services using centralized logging

Outcome 3

 * Targets
 * 1) > 90% of backup generation jobs succeed
 * 2) < 5% of services important and relevant to the wider public are served out of a single data center
 * Measurement method
 * 1) Ratio of successful to failed backup generation jobs
 * 2) Number of services that are served out of a single data center
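The two Outcome 3 measurements can be checked against their targets with simple arithmetic. The following is a minimal sketch only, not the SRE team's actual tooling: the `Service` record, its fields and all function names are hypothetical, and it assumes the backup target is read as successful jobs divided by all jobs, and the data center target as the share of public-facing services served from exactly one data center.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    public_facing: bool   # "important and relevant to the wider public"
    data_centers: int     # number of data centers serving this service


def backup_success_rate(succeeded: int, failed: int) -> float:
    """Fraction of backup generation jobs that succeeded."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no backup generation jobs recorded")
    return succeeded / total


def single_dc_share(services: List[Service]) -> float:
    """Fraction of public-facing services served out of a single data center."""
    public = [s for s in services if s.public_facing]
    if not public:
        raise ValueError("no public-facing services listed")
    single = [s for s in public if s.data_centers == 1]
    return len(single) / len(public)


def meets_outcome_3_targets(succeeded: int, failed: int,
                            services: List[Service]) -> bool:
    """True when both targets hold: > 90% success and < 5% single data center."""
    return (backup_success_rate(succeeded, failed) > 0.90
            and single_dc_share(services) < 0.05)
```

For example, 95 successful and 5 failed jobs give a success rate of 0.95, and 1 single-data-center service out of 25 public-facing services gives a share of 0.04, so both targets would be met under these assumptions.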

Outcome 4

 * Measurement method
 * 1) Amount of time spent on manual, non-automated tasks ("toil") in common workflows as indicated in repeated surveys of SREs

Outcome 5

 * Target
 * 1) 50% improvement between status quo at program start and 3-year "healthy" goal
 * Measurement method
 * 1) Progression on the SRE team's "get healthy" staff responsibilities diagram

Dependencies
n/a