Core Platform Team/Initiatives/Serve read requests from multiple datacenters

From mediawiki.org

Initiative Description


Summary

Currently we serve all of our MediaWiki-related traffic from a single DC (data centre), which is problematic for several reasons:

  • if the main DC gets cut off or experiences networking issues, all of our projects become unavailable to users;
  • we keep a full copy of the infrastructure in the secondary DC, thus wasting hardware and electricity; and
  • serving content from only one DC means higher latency for users in regions that would benefit from being served by a closer DC

As a first step in being able to fully serve traffic from multiple DCs, we need to start serving only GET requests (a.k.a. read requests) from multiple DCs.
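The split described above can be sketched very simply: idempotent read methods can be answered by any DC, while everything else goes to the primary. This is an illustrative sketch only, assuming a generic routing layer; the function and DC names are invented and do not reflect Wikimedia's actual edge configuration.

```python
# Hypothetical read/write routing: GET/HEAD requests may be served from
# the datacenter closest to the client, all other methods go to the
# primary DC. Names here are illustrative, not Wikimedia's real config.

READ_METHODS = {"GET", "HEAD"}

def pick_datacenter(method: str, nearest_dc: str, primary_dc: str) -> str:
    """Route read requests to the closest DC; send writes to the primary."""
    if method.upper() in READ_METHODS:
        return nearest_dc
    return primary_dc

# A GET from a user closest to codfw stays in codfw,
# while a POST must still be handled by the primary (eqiad).
assert pick_datacenter("GET", "codfw", "eqiad") == "codfw"
assert pick_datacenter("POST", "codfw", "eqiad") == "eqiad"
```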

Significance and Motivation

By completing this project, we will make better use of the hardware we have in all DCs and be partially protected against catastrophic events. Moreover, we will be able to serve emerging markets faster and more reliably.

Outcomes

Increase the scalability of the platform for future applications and new types of content, as well as for a growing user base and volume of content.

The primary measure of the success of this project is that both MediaWiki and the production environment can serve GET requests in all DCs.

Baseline Metrics
  • Infrastructure in codfw ready to serve read requests without slowing down reads and writes in any DC
Target Metrics
  • Split incoming GET traffic at the edge and route it to the closest DC (instead of the main DC)
Stakeholders
  • Core Platform
  • SRE
Known Dependencies/Blockers

Multi-DC Session Storage

Epics, User Stories, and Requirements


  • Replace all the DB replicas in codfw (new hardware needed to support the traffic)
  • Update and/or replace GTID_Wait / heartbeat
  • Decide where to move mainstash data (non-session data in Redis)
  • Migrate mainstash data to new location
  • Decide what replication lag is acceptable for MediaWiki
    • Current lag is 10–15 seconds. With ~1 second of lag, roughly 90% of reads would get the data they need; the remaining 10% require re-engineering
  • Evaluate the lag for codfw; if it is not acceptable, engineer a solution
  • Update MediaWiki code to wait for replication (if needed)
  • TLS Proxy work
  • Separate traffic at the Varnish level
  • Serve thumbnails
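The "wait for replication" item above can be illustrated in the spirit of MariaDB's MASTER_GTID_WAIT: after a write, record the primary's position; before serving a dependent read from a replica, block briefly until the replica has applied that position. This is a minimal sketch under assumed names (the `Replica` class and polling loop are invented for illustration), not MediaWiki's actual mechanism.

```python
# Hypothetical replication-wait helper: poll a replica's applied position
# until it reaches the position recorded after the client's last write,
# or give up after a timeout (and serve possibly stale data / fall back).
import time

class Replica:
    """Toy stand-in for a DB replica exposing its applied position."""
    def __init__(self) -> None:
        self.applied_position = 0

def wait_for_position(replica: Replica, target_position: int,
                      timeout: float = 1.0, poll: float = 0.01) -> bool:
    """Return True once the replica has caught up, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if replica.applied_position >= target_position:
            return True
        time.sleep(poll)
    return replica.applied_position >= target_position

r = Replica()
r.applied_position = 5
assert wait_for_position(r, 3) is True                  # already caught up
assert wait_for_position(r, 9, timeout=0.05) is False   # still lagging
```

In a real deployment the position would be a GTID or binlog coordinate carried in the user's session, which is exactly why the Multi-DC Session Storage dependency above matters.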

Time and Resource Estimates


Estimated Start Date

None given

Actual Start Date

Started July 2018

Estimated Completion Date

None given

Actual Completion Date

None given

Resource Estimates

None given

Collaborators
  • Core Platform
  • Performance
  • SRE

Open Questions

Many parts of MediaWiki assume that the DB is "close by and secure", which will change with Multi-DC. How do we address this?

Long term, what do we do with MySQL? It does not support master-master operation, which blocks us from supporting writes in both locations.

If we remove performance tricks from MediaWiki, would performance still be acceptable, and would that let us generalize the DB abstraction? Working on the Abstract Schema Migrations RFC may help answer this question.
