Requests for comment/Multi datacenter strategy for MediaWiki

Request for comment (RFC)
Multi datacenter strategy for MediaWiki
Component	General
Creation date	7 February 2015
Author(s)	Aaron Schulz
Document status	accepted "we can't approve every last detail but it is broadly going to happen" – Tim Starling in 2015-09-02 RFC meeting See Phabricator.

Tracked in Phabricator
Task T88666

This proposal concerns Wikimedia configuration changes and MediaWiki support for routing "reads" (GET/HEAD requests) to both datacenters, with writes (POST requests) going to a single datacenter. Making writes routable to multiple datacenters is outside the scope of this particular proposal.

Background

Historically, MediaWiki has been designed for single-cluster master/slave database setups. Some modules (FileBackend, Search, BagOStuff, ExternalStore, ...) support distributed stores with no master/slave distinction. In any case, it was not designed for cross-DC sites serving the same MediaWiki instance. In order to avoid single-DC disaster scenarios, the Wikimedia Foundation has opted to have multiple datacenters, each with a full copy of the data. Caching proxy servers exist at these and at separate sites to handle latency.

Problem

The use of proxy servers only handles basic page, file, and asset (JS/CSS) requests by logged-out users.

It does not cover various special pages or actions, just standard page views and asset delivery. Once users log in, page views always go to the active datacenter. The standby datacenter only makes use of its proxy cache servers; its other services (Swift, Redis, MariaDB, ...) are just backups.
There is currently no way to keep the caches in the standby warm. Also, since no traffic goes to the standby datacenter, its ability to handle traffic and perform correctly is not well tested, leaving room for surprises on failover. One could make a relay script, but that would not yield as many benefits. There is also less motivation to keep the secondary DC config up to date if no traffic goes there.
Switch-over time is not well optimized and involves a lot of Puppet work. This should be optimized in any case, but it would be nice to have read requests "Just Work" if a DC has a power outage or other internal issues.
A fair amount of hardware/power is wasted just sitting there losing value.

Proposal

Rather than active/standby, a master/slave setup could be established by sending actual traffic to all DCs. This could be extended to more than just two datacenters. Read requests could go to the DC closest to the cache proxy nodes handling them. Write requests would always be routed to the "master" DC, where the DB masters reside. The basic implementation is laid out in the next sections.

The T88445 "epic" Phabricator task tracks major code work for this effort.

Request routing

Varnish will follow some rules for selecting a backend on cache miss:

GET/HEAD/OPTIONS goes to closest DC
POST goes to the master DC
Any request with a valid "datacenter_preferred" cookie is routed to the master DC
A response header should indicate the DC for debugging
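
The rules above can be sketched as a small decision function. This is only an illustration of the logic in Python; the real rules would live in the Varnish configuration. The DC name, the header name "X-Served-By-DC", and the simplification that a "valid" cookie just means a present, non-empty one are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical DC name; the proposal only speaks of "the master DC".
MASTER_DC = "dc-master"

READ_METHODS = {"GET", "HEAD", "OPTIONS"}
STICKY_COOKIE = "datacenter_preferred"


@dataclass
class Request:
    method: str
    cookies: dict = field(default_factory=dict)


def choose_backend_dc(request: Request, closest_dc: str) -> str:
    """Pick the datacenter that should serve a cache miss."""
    # The sticky cookie overrides everything else. Validation is
    # simplified here to "present and non-empty" (an assumption).
    if request.cookies.get(STICKY_COOKIE):
        return MASTER_DC
    if request.method.upper() in READ_METHODS:
        return closest_dc
    # POST goes to the master DC. Sending every other method there
    # too is an assumption of this sketch.
    return MASTER_DC


def debug_headers(dc: str) -> dict:
    """Response header naming the DC (hypothetical header name)."""
    return {"X-Served-By-DC": dc}
```

With these definitions, a plain GET is served by the closest DC, while a POST or any request carrying the cookie is sent to the master DC.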

HTTP GET/HEAD idempotence

Currently, some MediaWiki GET requests (like rollback) do DB writes. Sometimes opportunistic updates (e.g. cascading protection, protected page list purging) happen on read requests. These should be changed to use a local job queue to avoid slow cross-DC database master queries. Each DC would have its own local jobqueue with runners for "enqueue jobs". These jobs simply enqueue the job to the master DC. All other job types would normally only run on the master DC.

Data store usage/replication

MariaDB (main wiki tables): master DBs are in the primary DC, slaves in the slave DC
MariaDB (externalstore): master DBs are in the primary DC, slaves in the slave DC
Swift: use a global swift cluster with read/write affinity set (http://docs.openstack.org/developer/swift/admin_guide.html)
CirrusSearch: clusters maintained in both DCs with a "MultiEnqueueJob" queue in the primary DC that forks jobs into both DCs (cirrus jobs could run in both DCs)
MediaWiki sessions: redis masters in the master DC and slaves in the slave DC
- The longer-term goal is to evaluate automatically sharded stores (e.g. Cassandra, etc.). We only have ~15 MB of usage per redis server, with mostly get() traffic and a tiny portion being setex(). Various stores can handle this load easily. Manual redis sharding has limited ability to intelligently handle slaves in the slave DC going down and moving traffic over.

Job queuing

There would be two local job queues with runners:

a) The "EnqueueJob" type that pushes jobs to the master DC
b) The "MultiEnqueueJob" type that pushes jobs to both DCs

Some job types might also be run locally if needed (mostly to work around systems with poor geo-replication). Most jobs that do actual work will do so on the master DC.
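
A minimal sketch of how the two wrapper job types could be forwarded, assuming an in-memory queue per DC. The JobQueue class, the dictionary layout of a job, and the function names are hypothetical and exist only for illustration; they are not MediaWiki's actual job queue API.

```python
from collections import deque


class JobQueue:
    """Toy in-memory queue standing in for one job queue."""

    def __init__(self):
        self.jobs = deque()

    def push(self, job):
        self.jobs.append(job)

    def pop(self):
        return self.jobs.popleft() if self.jobs else None


def make_enqueue_job(job):
    # Wrapper whose only work is to push `job` to the master DC.
    return {"type": "EnqueueJob", "payload": job}


def make_multi_enqueue_job(job):
    # Wrapper whose only work is to push `job` to every DC.
    return {"type": "MultiEnqueueJob", "payload": job}


def run_local_runner(local_enqueue_queue, work_queues_by_dc, master_dc):
    """Drain the local "enqueue" queue, forwarding each wrapped job.

    Returns the number of pushes made to the work queues.
    """
    forwarded = 0
    while True:
        job = local_enqueue_queue.pop()
        if job is None:
            break
        if job["type"] == "EnqueueJob":
            targets = [master_dc]
        elif job["type"] == "MultiEnqueueJob":
            targets = list(work_queues_by_dc)
        else:
            raise ValueError("unexpected job type: " + job["type"])
        for dc in targets:
            work_queues_by_dc[dc].push(job["payload"])
            forwarded += 1
    return forwarded
```

The local runner does no real work itself; it only relays, so a GET request that needs a deferred write never has to wait on a cross-DC call.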

Locking/PoolCounter

PoolCounter daemons would be local to each DC
LockManager would reside in the master DC, since the slave DCs shouldn't need it anyway
Random callers using $wgMemc->lock() or $wgMemc->add() can usually get DC local locks
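
To illustrate why add()-based locks end up DC-local, here is a minimal sketch with an in-process stand-in for one DC's memcached cluster. The LocalCache class and the try_lock/unlock helpers are hypothetical; the only behavior assumed from memcached is that add() stores a key only when it does not already exist.

```python
import time


class LocalCache:
    """Tiny in-process stand-in for a DC-local memcached cluster."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, ttl):
        """Store the value only if the key is absent or expired."""
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def delete(self, key):
        return self._data.pop(key, None) is not None


def try_lock(cache, name, owner, ttl=30):
    # add() succeeds for exactly one caller until the key is
    # deleted or expires, which gives a simple advisory lock.
    return cache.add("lock:" + name, owner, ttl)


def unlock(cache, name):
    cache.delete("lock:" + name)
```

Because each DC talks to its own cache cluster, such a lock excludes only callers in the same DC. Anything that needs a global guarantee would have to go through the LockManager in the master DC.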

Memory stashes

MediaWiki sometimes stashes values (like upload statuses, prepared edits in ApiEditStash, StatCounter deltas) in memcached for later use. The stash strategy will be:

memcached: stashed values could go exclusively to the primary DC memcached cluster
redis: stashed values are written to the master servers in redis and replicated to the other DC

In some cases, it might be worth switching from memcached to redis if cross-DC replication is desired.

Live activity state

Some things need to convey real-time state between DCs. For example, FlaggedRevs shows the (last) user currently viewing a pending changes diff at Special:PendingChanges (and the API exposes this). This essentially means there is state that needs to be mutated and replicated by users simply viewing content. As long as the state changes use POST requests (e.g. via AJAX), a master/slave strategy could be used, with the data going in Redis. Using async AJAX also means that this doesn't slow anything down.

Caching and purges

MediaWiki makes fairly heavy use of caching. The cache strategy will be:

memcached: cached values would go in DC-specific memcached clusters with an interface for broadcasting deletes (WANObjectCache) via daemons
varnish: the HTCP purger could be subscribed to relay updates, but could possibly be consolidated into the memcached purger daemons
MariaDB parsercache: these caches would be cluster-local
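
The memcached strategy can be sketched as follows, assuming a toy model in which each DC has its own cache and a relay delivers deletes to all of them. The class names DCCache, PurgeRelay, and WanCache are hypothetical and this is not the real WANObjectCache interface. The relay here is synchronous for simplicity, whereas daemon-based relaying would be asynchronous.

```python
class DCCache:
    """Stand-in for one DC's memcached cluster."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value

    def delete(self, key):
        self.store.pop(key, None)


class PurgeRelay:
    """Stand-in for the daemons that broadcast deletes to every DC."""

    def __init__(self, caches):
        self.caches = caches

    def broadcast_delete(self, key):
        for cache in self.caches:
            cache.delete(key)


class WanCache:
    """Reads and writes stay DC-local; only deletes are broadcast."""

    def __init__(self, local, relay):
        self.local = local
        self.relay = relay

    def get(self, key):
        return self.local.get(key)

    def set(self, key, value):
        # Values are never copied across DCs.
        self.local.set(key, value)

    def delete(self, key):
        # A purge must reach every DC so no stale copy survives.
        self.relay.broadcast_delete(key)
```

The design choice this shows is that values are cheap to recompute locally, so only invalidations need to cross the DC boundary.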

Logging and metrics

Ganglia: already grouped by DC and will stay that way
Logging/metrics in Elastic: these could be DC-local, although there is a request for a single Kibana that can see all logs for both DCs
graphite/gdash: deprecated
grafana et al: probably global except for special cases (a prefix could be used or something)

Consistency

Using redis replication for sessions creates a slight opportunity for stale data. To avoid this, when sessions mutate (on POST), a "datacenter_preferred" cookie can be sent to the user (lasting only 10 seconds). This would stick them to the active DC for more than enough time for replication to complete. This also means that the DB positions stored in ChronologyProtector will make it through, preserving the "session consistency" we try to maintain now.
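
As a rough sketch of the cookie side of this, the function below builds a Set-Cookie value with a 10-second lifetime when a POST has changed the session. The cookie value "1", the path, and the function name are assumptions made for the example; the proposal only fixes the cookie name and its lifetime.

```python
from http.cookies import SimpleCookie

STICKY_COOKIE = "datacenter_preferred"
STICKY_SECONDS = 10  # lifetime given in the proposal


def sticky_cookie_header(method, session_changed):
    """Return a Set-Cookie value, or None if no cookie is needed."""
    if method.upper() != "POST" or not session_changed:
        return None
    cookie = SimpleCookie()
    # The value is a placeholder; the proposal does not specify one.
    cookie[STICKY_COOKIE] = "1"
    cookie[STICKY_COOKIE]["max-age"] = STICKY_SECONDS
    cookie[STICKY_COOKIE]["path"] = "/"
    return cookie[STICKY_COOKIE].OutputString()
```

Once the cookie expires, the user's reads fall back to the closest DC, by which time replication should have caught up.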

Design implications

Some changes would be needed to MediaWiki development standards (in Backend performance guidelines#Caching layers and possibly elsewhere):

DB and other data store writes should be avoided on non-POST requests to avoid latency (shared-nothing caches are fine, even if using RDBMSes). In some cases, using DeferredUpdates is acceptable, even for GET requests (the action is done after the user gets the response, so it does not block the request).
DB_MASTER reads should be avoided on non-POST requests to avoid latency as well
Code that uses caches must be aware of whether it needs to do explicit purges or can use the datacenter-local cache. In general, if code uses delete() or uses set() to update the cache when a record changes, it will need to use the WAN cache. If it just uses TTL-based logic or caches immutable objects, it can use either.
Code pushing to job queues on GET must use the local "enqueue" queue (EnqueueJob).
Ephemeral stashing, user activity data, and such must come about via POST requests (not just updates serving GET) or must use the local "enqueue" job queue (EnqueueJob).
Session data should only change on login/logout (POSTs) or in *rare* cases otherwise

Deployment steps

Actual deployment could be done in stages to keep things manageable:

Make the new MediaWiki job/cache classes and sticky DC cookie code available in core
Set up the new (WAN) caches and daemons, testing them in labs and manually (this can be done in 1 DC)
Likewise for the local job queues and runners (that merely route jobs)
Likewise for the sticky DC cookie logic
Start using the new cache modules in core
Document new usage patterns for cache/queue/stores on mw.org and convert extensions
Get the second datacenter puppetized and fully set up with replicated data (with no user traffic)
- Search and file/thumbnail requests might be excluded if there is no replication solution for those yet
Set up and verify that requests work in the second datacenter (with cache daemons in place there)
Deploy the Varnish routing logic to the inactive datacenter and verify that requests work
- If the second DC is being used for live requests as a CDN, then the routing will need to require some debug header
Actually route traffic to both DCs (via DNS or not requiring the debug header)
Work on getting elasticsearch, swift, and anything else left handled by both DCs

Codfw mediawiki-config checklist

Redis slaves set up with proper tag hashing (using the master names)
$wgGettingStartedRedisSlave is set
The $wgJobQueueConf['enqueue'] is set to be local and has runners via puppet
Memcached set up as DC-local with WAN cache relay daemons (at least the secondary DC should be listening for changes)
PoolCounter set up as DC-local
HTTP POST cookie config set