Wikimedia Technology/Annual Plans/ERF OKR: Platform Excellence: Resilience

From mediawiki.org

Platform Excellence: Resilience

FY21/22 Organization Efficacy & Resilience OKR for Wikimedia Technology Department

Accountable: Faidon Liambotis

OKR Overview

Our services, infrastructure and data are resilient to and/or quick to recover from unexpected malicious or nonmalicious events

Key Result 1
Wikimedia's infrastructure is scaled to address known compute, storage and traffic capacity risks, by adding a new data center in EMEA (by end of Q1), expanding our main data center by at least 20% (by end of Q2), and by documenting two new capacity plans (by end of Q4)

Key Result 2
Service and security operational issues are detected, escalated, remediated and communicated to stakeholders and the movement, as measured by a 20% incident score improvement

Key Result 3
Security and privacy services are enterprise wide, centrally coordinated, scalable and resilient in a way that empowers all users to make good security and privacy decisions, measured by a 10% increase in consumption of consultation services and a 30% decrease in operational services


Objective Rationale

As part of the Foundation's Efficacy and Resilience Framework, this objective is intended to capture programmatic work that goes into improving the resilience of our services, processes, infrastructure and data. In many ways, this objective is a continuation of our Front-Line Defenses objective from FY20-21; it builds upon that objective, extending its scope to cover resilience concerns, including reliability and security, for malicious and nonmalicious events alike.

The activities to achieve this objective will happen across our technical stack, in the infrastructure and software we build, and also in process and documentation improvements, cross-training, consulting and advocacy.

The work is necessary for the Foundation to evolve to meet the demands of a changing landscape, and specifically:

  • The emergence of new use cases (e.g. machine learning & data engineering) or traffic patterns, including ones that result in increased usage of our resources;
  • Technologies in the industry evolving, requiring us to respond to new challenges and threats, and to adjust the way we work;
  • Growth of our engineering organization, requiring us to adjust for our size;
  • Past investments (such as capital investments, large infrastructure contracts, or technical implementation choices) at the end of their runway, requiring extended, renewed commitments for us to sustain our current pace of growth.

Three concrete Key Results are envisioned for this Objective. However, the Objective is purposefully broad in nature: more OKRs at various levels of the organization, in service of that broader goal, are expected to be aligned to it on a quarter-by-quarter basis.

The objective meets the budgetary guardrails necessary to surface this as one of the top priorities of the department and organization, by requiring work from multiple teams and at least 15 FTEs, and with over a $1M budget in OpEx.


Key Result 1: Address capacity risks

Wikimedia's infrastructure is scaled to address known compute, storage and traffic capacity risks, by adding a new data center in EMEA (by end of Q1), expanding our main data center by at least 20% (by end of Q2), and by documenting two new capacity plans (by end of Q4)

Intent and Desired Outcomes
This is a multi-faceted KR, attempting to "move the needle" on multiple fronts.

The first part is a continuation of the FY20-21 Front-line Defenses Key Result around a data center expansion for our EMEA service region (originally envisioned to complete in FY21-22 Q1). It captures the addition of a site in Marseille, France, to increase our resilience against failures. Our Amsterdam location (“esams”) has grown to serve half of our traffic, making it “too big to fail”: it can no longer fail, or be “drained” for scheduled maintenance, without risking cascading failures in other sites and the site as a whole, or performance degradation in all of our regions. The Marseille location sits in a network hub that is well connected to our backbone network and US sites, is reasonably priced, and is also uniquely placed where multiple submarine communication cables interconnecting Europe, the Middle East and Africa land. Therefore, while the primary impact is envisioned to be increased resiliency for our infrastructure, we expect this key result to also contribute to improved website performance for users in North Africa and the Middle East.

The second part is the growth of one of our main data centers, specifically our largest one in Ashburn. As the Foundation has steadily grown its use cases and overall footprint, the utilization of that site has increased to the point where the data center has only a few months of runway before space runs out entirely, with the effects of space constraints already visible today for some of our use cases. Given that data center contracts require an upfront investment with fixed bootstrapping costs, we envision a growth step of at least 20%, with a ramp of power usage up to 50% over the coming year.

Finally, the third part of this KR is around building a culture of structured capacity planning, to support attempts to predict future growth and provide inflection points in future decision making. This can be a complex endeavour, at the heart of the SRE practice; we envision the KR in this FY to deliver on two capacity plans as pilots, to help inform the FY22-23 annual planning process.


Definitions and Scoping

  • SRE – Site Reliability Engineering, an engineering discipline and a Foundation team comprising engineers & managers specializing in the discipline and responsible for the Foundation’s site reliability
  • Data center – large facilities where servers are hosted. In this context, we are referring to leased space, power & cooling in secured spaces within a vendor’s larger facility.
  • EMEA – Europe, Middle East and Africa
  • FTE – Full-time Employee


Related Quarterly OKRs

  • Add a new data center in the EMEA region (Marseille) [Q1-Q2]
  • Expand our main data center (eqiad) by at least 20% [Q1-Q2]
  • Document two capacity plans [Q1-Q3]


Activities and Deliverables

  • EMEA data center (“drmrs”)
    • Contract & hardware procurement
    • Physical deployment (buildout)
    • Network design and deployment
    • Traffic edge site deployment and turn-up
  • Main data center (“eqiad”) expansion
    • Contract & hardware procurement
    • Physical deployment (buildout)
    • Network design and deployment
  • Capacity plan pilots


Resourcing

Activity Responsible Accountable Consulted Informed
EMEA data center: Contract & hardware procurement Data Center Operations team Willy Pao Contracts, Purchasing, FP&A, Infrastructure Foundations team
EMEA data center: Physical deployment Data Center Operations team Willy Pao Traffic team, Infrastructure Foundations team SRE organization
EMEA data center: Network design & deployment Infrastructure Foundations team Joanna Borun Traffic team SRE organization
EMEA data center: Traffic edge site deployment and turn-up Traffic team Mark Bergsma Performance team, Infrastructure Foundations team The world
Main data center expansion: Contract & hardware procurement Data Center Operations team Willy Pao Traffic team, Infrastructure Foundations team SRE organization
Main data center expansion: Physical deployment Data Center Operations team Willy Pao Infrastructure Foundations team SRE organization
Main data center expansion: Network design and deployment Infrastructure Foundations team Joanna Borun Traffic team SRE organization
Capacity plan pilots To be selected later Lukasz Sobanski SRE organization Budget owners for SRE & APP delegates


Key Result 2: Incident management

Service and security operational issues are detected, escalated, remediated and communicated to stakeholders and the movement, as measured by a 20% incident score improvement

Intent and Desired Outcomes
As the Site Reliability Engineering (SRE) organization grows and evolves, some practices require maturing and polishing. Stability is a critical aspect of steady and predictable operations. Systems at the Foundation are generally stable, and the incident count is arguably low, considering Wikimedia’s traffic scale; however, there is always room for improvement.

In 2020, the ONFIRE working group, alongside the Foundation’s SRE Observability team, put together an internal Incident Management Survey. The survey received about 20 responses from engineers, indicating a need for improvement in managing processes, people and tooling around the overall incident management practice. The survey served as a means of gathering bottom-up feedback, which was then combined with top-down requirements.

The result was a set of broad directional practices intended to bolster our incident management:

  • Transparency & communication: a better experience for our users and community, by openly sharing relevant information about impactful events, expectations for recovery, impact to the movement, and our actions towards remediation, and by evolving and enhancing our current practices (“look at Phabricator or Wikitech” is not sufficient).
  • Efficiency & sustainability: as the organization has grown, practices that worked in the past ("all hands on deck" paging for every incident) are, at the current team size, not just inefficient but also counterproductive.
  • Availability & scalability: efforts made in service of Incident Management should result in long-term reduction in the number and severity of incidents, and a reduction of time to recover.
  • Equity & fairness: preventing a disproportionate impact on individuals who have to respond to incidents that occur on a 24/7 basis, and promoting fairness and equity, in alignment with our “Resilient and Inclusive Foundation” organizational goal.

Applying these directional practices, the intent is for progress in FY21-22 to result in the following desired outcomes:

  • Engineers, including but not limited to Site Reliability Engineers, hold the necessary knowledge and skills to engage effectively in any incident.
  • An incident management process that is well documented and understood by everyone in the technology team/organization.
  • Structured incident documentation with detailed event categorization, severity, impact, and other relevant metrics or Incident Artifacts.
  • Clear guidelines for communications and escalation during incidents.
  • Adequate tooling exists, to be able to communicate, respond and engage effectively during incidents.

To measure progress on all of those different fronts, the intent is to spend Q1 and Q2 primarily on constructing an incident scorecard, to be applied to every major incident, measuring the team’s incident response and engagement score. Through the activities that have already been envisioned and described below, as well as other activities to be devised over the course of FY21-22, we expect the score to improve by 20% by the end of the fiscal year.

This Key Result is also expected to benefit significantly from the work that is described in the “Culture, Equity and Team Practices” FY21-22 Objective, and specifically the SLO activities in KR1. The gradual deployment of Service Level Objectives (SLOs), as well as the associated Service Level Indicators (SLIs), is going to be instrumental in defining entry and exit conditions for the incident management processes described in this Key Result, and therefore progress in these two initiatives is expected to be synergistic in nature.


Definitions

  • Incident: An incident is an outage, security issue, or other operational issue whose severity demands an immediate human response.
  • SLO: A Service Level Objective (SLO) is an understanding between teams about expectations for reliability and performance: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound (e.g., more than 99% of all requests are successful).
  • SLI: An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
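As an illustration of the two definitions above, the following minimal sketch aggregates raw request counts from one measurement window into a success-rate SLI and checks it against an SLO of the form lower bound ≤ SLI. The `Window` type, the function names and the request counts are hypothetical and chosen only for this example; they do not describe any existing Wikimedia tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Raw request counts collected over one measurement window (hypothetical)."""
    total_requests: int
    successful_requests: int


def availability_sli(window: Window) -> float:
    """Aggregate the raw counts into a success rate, which serves as the SLI."""
    if window.total_requests == 0:
        # Assumption: a window without traffic counts as fully available.
        return 1.0
    return window.successful_requests / window.total_requests


def meets_slo(sli: float, lower_bound: float) -> bool:
    """Check an SLO of the form: lower bound <= SLI."""
    return sli >= lower_bound


# Made-up numbers: 198,600 successful requests out of 200,000.
window = Window(total_requests=200_000, successful_requests=198_600)
sli = availability_sli(window)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli, lower_bound=0.99)}")
```

With these made-up counts the success rate is 99.3%, which satisfies a 99% lower bound. A real SLI would more likely be computed from monitoring data over a rolling window, and may only be a proxy for the service level of interest.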


Scoping

  • People: How do we prepare our people to respond to incidents better?
  • Process: How should we behave/operate during an incident?
  • Tooling: What tools should we implement to facilitate responding to incidents?


Related Quarterly OKRs

  • Define and document a project plan which outlines changes to people practices, processes, and tooling required to ensure the success of the objective [Q1]
  • Define and document the scorecard used to measure incident engagement [Q1]
  • 100% adoption of scorecard across all incidents to establish metrics baseline [draft, Q2]
  • ~10% scorecard assessment improvement over previous quarter [draft, Q3]
  • ~10% scorecard assessment improvement over previous quarter [draft, Q4]
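The two draft quarterly targets above compound rather than add, which is how they relate to the 20% annual target. A small sketch of that arithmetic follows, assuming that a higher scorecard value is better and that each ~10% is measured relative to the previous quarter; the baseline value and function names are hypothetical, since the scorecard itself is still to be defined in Q1.

```python
from typing import Sequence


def apply_improvements(baseline: float, rates: Sequence[float]) -> float:
    """Apply successive relative improvements to a baseline score."""
    score = baseline
    for rate in rates:
        score *= 1.0 + rate
    return score


def relative_improvement(baseline: float, final: float) -> float:
    """Return the overall improvement as a fraction of the baseline."""
    return (final - baseline) / baseline


# Hypothetical Q2 baseline on a made-up scale.
q2_baseline = 50.0
q4_score = apply_improvements(q2_baseline, [0.10, 0.10])  # Q3, then Q4
overall = relative_improvement(q2_baseline, q4_score)
print(f"Overall improvement: {overall:.1%}")  # 1.10 * 1.10 = 1.21
```

Under these assumptions, two consecutive 10% improvements yield roughly 21% over the Q2 baseline, slightly above the 20% target.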


Activities and Deliverables
There are two overarching activities envisioned in Q1 & Q2: to define and document a detailed project plan as well as the scorecard that will be used to measure incident engagement.

After this phase of the project is complete, and a baseline is established, activities in Q3 & Q4 will be selected based on where we can make the most measurable impact on the incident scores. Activities that have been envisioned so far include (but are not limited to) the following, with only a subset of them expected to be implemented in FY21-22:

  • People
    • Development of a training & certification program for incident responders
    • Clear expectation setting for 24/7 incident responders
    • Tabletop incident walkthroughs and simulations
  • Process
    • Incident response process assessment, documentation and revamp
    • “Post-mortem” incident review protocols & standards
  • Tooling
    • Development of incident management coordination tooling
    • Improvements in alerting and escalation tooling
    • Improvements on public communication and visibility of incidents


Resourcing

Activity Responsible Accountable Consulted Informed
Define and document a detailed project plan Working group Leo Mata SRE organization
Define and document scorecard To be defined Leo Mata SRE organization
Development of a training & certification program for incident responders To be defined Leo Mata SRE organization
Clear expectation setting for 24/7 incident responders Leo Mata Faidon Liambotis T&C organization
Tabletop incident walkthroughs and simulations Multiple SREs Leo Mata SRE organization, Security
Incident response process assessment, documentation and revamp Working group Leo Mata Technology Department
“Post-mortem” incident review protocols & standards Leo Mata SRE organization, Release Engineering, Security Technology Department
Development of incident management coordination tooling Observability, Infrastructure Foundations Leo Mata SRE organization
Improvements in alerting and escalation tooling Observability Leo Mata SRE organization, Security Leadership
Improvements on public communication and visibility of incidents Observability, Infrastructure Foundations Leo Mata SRE organization, Communications The world


Key Result 3: Security and privacy services

Security and privacy services are enterprise wide, centrally coordinated, scalable and resilient in a way that empowers all users to make good security and privacy decisions, measured by a 10% increase in consumption of consultation services and a 30% decrease in operational services

Intent and Desired Outcomes
The intention is to identify, prioritize, coordinate and scale security and privacy activities across the Foundation. The outcomes will be expressed as a reduction in operational work, meaning the Security team will be pushing security and privacy services in a consumable way so that other teams in Technology and Product can make good security and privacy decisions. This Key Result is all about helping the Security and Privacy teams better develop and deliver their services.


Definitions
Work will begin within the Security team, where we will be baselining the following security services and their consumption:

  • Application Security
  • Privacy Engineering
  • Threat and Vulnerability Management
  • Security Incident Response
  • Capabilities Management
  • Cyber Risk

The 1st pass will be to baseline these activities to understand their volume, who is consuming these services, and how.

The 2nd pass will be to understand bottlenecks, and to identify and prioritize service gaps and areas where we need to equip consumers differently.

The 3rd pass will be to apply controls that address efficacy, efficiency and various other gaps in our deliverables.


Related Quarterly OKRs

  • Security and privacy services have each identified and documented baseline measurements for the purposes of transparency, accountability, and quality control [Q1]


Activities and Deliverables

  • Fusion Center
    • Count of Severity 1 & 2 Security Incidents
    • Count of Critical and High risk vulnerabilities
    • Count of supplier security reviews
  • Capabilities Management
    • Count of onboarding security awareness modules
      • Not yet available
    • Count of new members in #talk-to-security
      • Increase from 31 members on 8/3/21 (first date of measurement) to 59 members on 8/25/21
  • Cyber Risk
    • Count of Critical and High risk issues
  • Architecture
    • Count of privacy engineering risk assessments

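Once baselines such as the counts above exist, progress against the Key Result can be expressed as a percentage change from the baseline. The sketch below shows that arithmetic using the one pair of figures given in this section (31 and 59 members of #talk-to-security); the function names are hypothetical, and treating any single count as a proxy for service consumption is an assumption, not something this plan states.

```python
def percent_change(baseline: float, current: float) -> float:
    """Return the change from baseline to current, as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def meets_kr3_targets(consultation_change_pct: float,
                      operational_change_pct: float) -> bool:
    """Check the KR3 thresholds: at least +10% consultation, at least -30% operational."""
    return consultation_change_pct >= 10.0 and operational_change_pct <= -30.0


# Figures from this section: 31 members on 8/3/21, 59 members on 8/25/21.
membership_change = percent_change(31, 59)
print(f"#talk-to-security membership: {membership_change:+.1f}%")
```

The membership figures correspond to an increase of about 90%. `meets_kr3_targets` only encodes the two thresholds from the Key Result statement; which measurements feed into it depends on the baselines still to be documented.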

Resourcing

Activity | Responsible | Accountable | Consulted | Informed
Baseline services | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Identify service delivery issues | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Implement service delivery controls | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Test and provide feedback on deliverables | Various consumers | Security team | Security team | Jennifer Cross / John Bennett