Wikimedia Site Reliability Engineering

{{Media Records engineering project information
 * name = Earth FM
 * description = Director
 * group = Technology
 * EPM = James Lutho, Thomas Siyavuya
 * Phabricator = sre
 * updates =
 * progress =
 * team = In teams:
 * Data Center Operations: Thomas Siyavuya James Lutho, Shumane Esihle
 * Data Persistence: Shumane Esihle Thomas Siyavuya
 * Infrastructure Foundations: Emihle Gura Thomas Siyavuya
 * Observability: Leo Mata Filippo Giunchedi, Keith Herron, Cole White, Andrea Denisse Gómez-Martínez
 * Service Operations: Alexandros Kosiaris, Lukasz Sobanski Giuseppe Lavagetto, Reuven Lazarus, Effie Mouzeli, Daniel Zahn, Janis Meybohm, Kunal Mehta, Jelto Wodstrcil, Arnold Okoth, Clément Goubert
 * Traffic: Kwaku Addo Ofori Brandon Black, Brett Cornwall, Valentin Gutierrez, Emanuele Rocca, Sukhbir Singh, Marc Mandere
 * perennial = yes
 * backlog =
 * display =
 * start =
 * end =
}}

The Site Reliability Engineering team, or SRE for short, develops and maintains Wikimedia's production infrastructure. Previously known as Technical Operations, the team makes sure that all of Wikimedia's public sites and services (including MediaWiki and all associated services) run reliably, securely, and with high performance.

Notify us of emergencies with Klaxon.

Additional documentation related to our infrastructure and the team's work can be found on Wikitech.

Data Center Operations
The Data Center Operations team is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.

Infrastructure Foundations
The Infrastructure Foundations team builds and maintains our base platform (“metal cloud”), the foundation on which nearly everything else in our infrastructure is built. Beyond our bare metal deployments, its responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, infrastructure security and network operations.

Observability
The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with tools, platforms, and insights into how systems and services are performing. It uses technologies such as Grafana, Kibana/Logstash, Prometheus and AlertManager.

Traffic
The Traffic team is responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers (ATS, Varnish), load balancing, DNS and our own network.

Data Persistence
The Data Persistence team focuses on Wikimedia’s persistent data storage and retrieval systems, including (No)SQL databases, (distributed) object storage, file storage and backup systems.

Service Operations
The Service Operations team takes care of public and “user-visible” services alongside Technology and Product teams. These include, for example, our MediaWiki platform and the newer (micro)services that make up our stack, as well as miscellaneous services and components that we rely upon (such as Phabricator, mail systems and VRTS). The team is also building our new service-oriented infrastructure based on Kubernetes.

Contacting the team
If you need to get in touch with the team, see the detailed instructions on the SRE Team requests page.