Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps



Program outline

Teams contributing to the program

Site Reliability Engineering

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

The wiki projects are Wikimedia’s primary tool for advancing its mission, and the underlying infrastructure is core to that work. By evolving this infrastructure and strengthening the teams, processes, and structures that support it, we put Wikimedia in a better position to execute its mid-term strategy.

Program Goal

Long-standing gaps in the resiliency, reliability and maintainability of Wikimedia’s technical infrastructure and resourcing of supporting teams are addressed.

Outcome 1
Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment and work as expected.
Output 1
Create a staging cluster comparable to production infrastructure
Output 2
Migrate (micro)services to our Streamlined Service Delivery platform with integrated CI/CD
Outcome 2
Technical staff have increased visibility into the operation of our services and infrastructure.
Output 3
Modernize logging, alerting and metrics monitoring infrastructure
Outcome 3
Wikimedia projects and content are protected against major disasters that threaten availability.
Output 4
Strengthen backups with reliable and redundant backup infrastructure
Output 5
Serve projects and services out of multiple data centers
Outcome 4
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
Output 6
Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Output 7
Provision a centralized, self-service identity and access management system for privileged staff and volunteer accounts
Outcome 5
The Site Reliability team is able to perform its duties with adequate resourcing and a more reasonable division of responsibilities.
Output 8
Continue the FY17-18 efforts to build a management support structure to support the SRE team's growth and process duties
Output 9
Address under-resourcing and reduce the bus factor in several key areas by adding engineering capacity/staffing


FY2017–18 FY2018–19
People (OpEx)
  • Senior Site Reliability Engineer
  • Site Reliability Engineer
  • Site Reliability Engineer
  • Database Architect / Engineering Manager (new hire)
  • Engineering Manager (new hire)
  • Site Reliability Engineer (new hire)
  • Site Reliability Engineer (new hire)
  • Site Reliability Engineer (new hire)
  • Senior Site Reliability Engineer
  • Site Reliability Engineer
  • Site Reliability Engineer
Stuff (CapEx) TBD TBD
Travel & Other


Outcome 1

Outcome 2

Target 2
  1. 20% increase in services that have adopted the modern metrics stack
  2. 100% of services involved in page views use centralized logging
Measurement method
  1. Percentage of modern metrics stack adoption
  2. Percentage of production services using centralized logging

Outcome 3

Target 3
  1. > 90% of backup generation jobs succeed
  2. < 5% of services important and relevant to the wider public are served out of a single data center
Measurement method
  1. Ratio of successful/failed backup generation jobs
  2. Number of services that are served out of a single data center
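
As an illustration only, the two Outcome 3 measurements could be checked against their targets roughly as follows. This is a minimal sketch, not the SRE team's actual tooling: the `Service` record, its fields, and all function names are hypothetical, and it assumes the second target is read as a share of public-facing services rather than an absolute count.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    data_centers: int      # number of data centers the service is served from
    public_facing: bool    # "important and relevant to the wider public"


def backup_success_rate(succeeded: int, failed: int) -> float:
    """Fraction of backup generation jobs that succeeded."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no backup jobs recorded")
    return succeeded / total


def single_dc_share(services: list[Service]) -> float:
    """Fraction of public-facing services served out of a single data center."""
    public = [s for s in services if s.public_facing]
    if not public:
        raise ValueError("no public-facing services listed")
    single = [s for s in public if s.data_centers == 1]
    return len(single) / len(public)


def meets_outcome_3_targets(succeeded: int, failed: int,
                            services: list[Service]) -> bool:
    """True if both targets hold: > 90% backup success, < 5% single-DC services."""
    return (backup_success_rate(succeeded, failed) > 0.90
            and single_dc_share(services) < 0.05)
```

Both thresholds are treated as strict inequalities, matching the "> 90%" and "< 5%" wording of the targets.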

Outcome 4

Measurement method
  1. Amount of time spent on manual, non-automated tasks ("toil") in common workflows as indicated in repeated surveys of SREs

Outcome 5

Target 5
  1. 50% improvement between status quo at program start and 3-year "healthy" goal
Measurement method
  1. Progression on the SRE team's "get healthy" staff responsibilities diagram