Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps

Program outline

Teams contributing to the program

Site Reliability Engineering

Annual Plan priorities

Primary Goal: 3. Knowledge as a Service - evolve our systems and structures

How does your program affect annual plan priority?

The wiki projects are Wikimedia’s primary tool for advancing its mission, and the underlying infrastructure is core to its work. By evolving this infrastructure and strengthening the teams, processes, and structures that support it, we put Wikimedia in a better position to execute its mid-term strategy.

Program Goal

Long-standing gaps in the resiliency, reliability, and maintainability of Wikimedia’s technical infrastructure, and in the resourcing of its supporting teams, are addressed.

Outcome 1: Technical staff are able to deploy their changes to Production with confidence, because their improvements have been tested in a credible staging environment and shown to work as expected.

Output 1

Create a staging cluster comparable to production infrastructure

Output 2

Migrate (micro)services to our Streamlined Service Delivery platform with integrated CI/CD

Outcome 2: Technical staff have increased visibility into the operation of our services and infrastructure.

Output 3

Modernize logging, alerting and metrics monitoring infrastructure

Outcome 3: Wikimedia projects and content are protected against major disasters that threaten availability.

Output 4

Strengthen backups with reliable and redundant backup infrastructure

Output 5

Serve projects and services out of multiple data centers

Outcome 4: Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Output 6

Automate common operational tasks around service deployment, maintenance, and incident response, and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Output 7

Provision a centralized, self-service identity and access management system for privileged staff and volunteer accounts

Outcome 5: The Site Reliability team is able to perform its duties with adequate resourcing and a more reasonable division of responsibilities.

Output 8

Continue the FY17-18 efforts to build a management support structure to support the SRE team's growth and process duties

Output 9

Address under-resourcing and reduce the bus factor in several key areas by adding engineering capacity/staffing

Resources

	FY2017–18	FY2018–19
People (OpEx)	Senior Site Reliability Engineer; Site Reliability Engineer; Site Reliability Engineer	Database Architect / Engineering Manager (new hire); Engineering Manager (new hire); Site Reliability Engineer (new hire); Site Reliability Engineer (new hire); Site Reliability Engineer (new hire); Senior Site Reliability Engineer; Site Reliability Engineer; Site Reliability Engineer
Stuff (CapEx)	TBD	TBD
Travel & Other

Targets

Outcome 1

Outcome 2

Targets

20% increase in the number of services that have adopted the modern metrics stack
100% of services involved in page views use centralized logging

Measurement method

Percentage of modern metrics stack adoption
Percentage of production services using centralized logging
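
As an illustration of how these two measurements could be checked against the targets, here is a minimal sketch. All service counts are hypothetical, and it assumes the "20% increase" is a relative increase in the number of adopting services; the plan does not say whether a relative or a percentage-point increase is meant.

```python
def percentage(part: int, total: int) -> float:
    """Return part/total as a percentage (0.0 if total is zero)."""
    return 100.0 * part / total if total else 0.0


def relative_increase(before: int, after: int) -> float:
    """Return the relative growth from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (after - before) / before


# Hypothetical counts, for illustration only (not figures from the plan).
baseline_adopters = 50           # services on the modern metrics stack at program start
current_adopters = 60            # services on the modern metrics stack now
pageview_services = 40           # services involved in page views
pageview_services_logging = 40   # of those, services using centralized logging

# Assumed reading of the targets: >= 20% relative growth, and 100% coverage.
metrics_target_met = relative_increase(baseline_adopters, current_adopters) >= 20.0
logging_target_met = percentage(pageview_services_logging, pageview_services) >= 100.0

print(f"metrics target met: {metrics_target_met}")
print(f"logging target met: {logging_target_met}")
```

Under the percentage-point reading, the comparison would instead use the difference between two `percentage(...)` values.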

Outcome 3

Targets

> 90% of backup generation jobs succeed
< 5% of services important and relevant to the wider public are served out of a single data center

Measurement method

Ratio of successful to failed backup generation jobs
Number of services that are served out of a single data center
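
The two measurements above could be turned into target checks roughly as follows. This is a sketch under stated assumptions: the `Service` record, its fields, and all sample values are hypothetical, and the backup figure is computed as the share of successful jobs among all jobs so that it can be compared directly with the "> 90%" target.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical inventory record; not an existing Wikimedia schema."""
    name: str
    public_facing: bool   # "important and relevant to the wider public"
    data_centers: int     # number of data centers serving this service


def backup_success_rate(succeeded: int, failed: int) -> float:
    """Share of backup generation jobs that succeeded, in percent."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no backup jobs recorded")
    return 100.0 * succeeded / total


def single_dc_share(services: list[Service]) -> float:
    """Percentage of public-facing services served from exactly one data center."""
    public = [s for s in services if s.public_facing]
    if not public:
        return 0.0
    single = sum(1 for s in public if s.data_centers == 1)
    return 100.0 * single / len(public)


# Made-up sample data, for illustration only.
inventory = [
    Service("service-a", public_facing=True, data_centers=2),
    Service("service-b", public_facing=True, data_centers=1),
    Service("service-c", public_facing=False, data_centers=1),
    Service("service-d", public_facing=True, data_centers=2),
]

backup_target_met = backup_success_rate(succeeded=95, failed=5) > 90.0
multi_dc_target_met = single_dc_share(inventory) < 5.0

print(f"backup target met: {backup_target_met}")
print(f"multi-DC target met: {multi_dc_target_met}")
```

Internal-only services are excluded from the denominator here because the target is scoped to services relevant to the wider public; that scoping is an interpretation of the target, not something the measurement method states.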

Outcome 4

Measurement method

Amount of time spent on manual, non-automated tasks ("toil") in common workflows as indicated in repeated surveys of SREs

Outcome 5

Target

50% of the way from the status quo at program start to the 3-year "healthy" goal

Measurement method

Progression on the SRE team's "get healthy" staff responsibilities diagram

Dependencies

n/a