Wikimedia Site Reliability Engineering

The Site Reliability Engineering team, or SRE for short, is responsible for developing and maintaining Wikimedia's production infrastructure. Formerly known as Technical Operations, it ensures that all Wikimedia sites and services used by the public (including MediaWiki and all related services) run reliably, securely, and with high performance.

In an emergency, notify us via Klaxon.

Additional documentation on our infrastructure and the team's work can be found on Wikitech.



Data Center Operations
The Data Center Operations team is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.

Infrastructure Foundations
The team focuses on building and maintaining our base platform (“metal cloud”), which forms the foundation on which nearly everything else in our infrastructure builds. On top of our bare metal deployments, their responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, infrastructure security and network operations.

Observability
The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with diagnostic tools, platforms, and insights into how systems and services perform. It uses technologies such as Grafana, Kibana/Logstash, OpenSearch, Prometheus, Alertmanager and more.

Traffic
The Traffic team is responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers (ATS, Varnish), load balancing, DNS and our own network.

Data Persistence
The Data Persistence team focuses on Wikimedia’s persistent data storage and retrieval systems, including (No)SQL databases, (distributed) object storage, file storage and backup systems.

Service Operations
The Service Operations team takes care of public and “user-visible” services in close collaboration with both the Technology and Product teams. This includes our MediaWiki platform, the SOA service infrastructure based on Kubernetes, as well as community and developer-facing services like GitLab, Gerrit, Phabricator and VRTS.

Contacting the team
If you need to get in touch with the team, there are detailed instructions on SRE Team requests.