WMF Projects/Master & Secondary Datacenters

Overview
This project originated from an approved RfC. The work has been split into two phases:

Phase I: Fast failover
 * Making sure failover from one DC to the secondary DC actually works for core MediaWiki (MW) services
 * Simplifying the failover process and DC-based configuration (mediawiki-config, puppet, ...)

Phase II: Concurrent traffic
 * Allowing concurrent traffic to both DCs, as per the RfC

Work is currently ongoing on both goals at the same time, as there is a long tail of tasks needed for Phase II. In any case, the first goal has become an organization-wide priority.

Phase I: Fast failover
The project currently has weekly meetings and a tracking board in Phabricator. Periodic switches will be tested, with the resulting bugs and problems fixed in each iteration. The goal is to get steadily better at this (fewer errors and faster switchovers). A test switch involves these basic steps: read-only, eqiad => codfw, r/w, wait, read-only, codfw => eqiad, r/w. Ideally, much of the switching can be done at the CDN level, and MW config changes can be done via a central switch of which DC is "master". This is what ProductionServices.php in mediawiki-config was created for. Other config files are still being improved in this regard.
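The rehearsal sequence above can be sketched as a small state model. This is only an illustration: the `Cluster` class, `switch_master`, and `rehearsal` are hypothetical names, and a real switch touches the CDN, mediawiki-config, and puppet rather than an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of the global state touched during a switchover (hypothetical)."""
    master_dc: str = "eqiad"
    read_only: bool = False
    log: list = field(default_factory=list)


def switch_master(cluster: Cluster, target_dc: str) -> None:
    """Move the master role to target_dc inside a read-only window."""
    # Writes are blocked first so that no write can land in the old master
    # while the role is moving.
    cluster.read_only = True
    cluster.log.append("read-only")

    source_dc = cluster.master_dc
    cluster.master_dc = target_dc
    cluster.log.append(f"{source_dc} => {target_dc}")

    # Writes are re-enabled only once the new master is in place.
    cluster.read_only = False
    cluster.log.append("r/w")


def rehearsal(cluster: Cluster, primary: str = "eqiad", secondary: str = "codfw") -> list:
    """Run one full test iteration: switch away, wait, switch back."""
    switch_master(cluster, secondary)
    cluster.log.append("wait")  # period during which the secondary DC serves as master
    switch_master(cluster, primary)
    return cluster.log


if __name__ == "__main__":
    print(", ".join(rehearsal(Cluster())))
```

Keeping the read-only window inside a single function makes it easy to time each iteration and compare it with the previous one.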

Tasks which could use more support:
 * T129657 - Read-only error message display problems
 * T129258 - Long-running scripts interacting with read-only mode
 * T114271 - This is mostly for *unplanned* emergency switch-over. The work starts with the DBTransaction log in Kibana, looking for multi-DB "commits" from MediaWiki. It also covers secondary services populated by event streams, the job queue, hooks, etc. In some cases, the decision may be to "not worry about it".

Phase II: Concurrent traffic
There is a basic RfC defining the idea and a tracking task in Phabricator of blockers. Read-only traffic can use the closest datacenter for performance, while HTTP POST requests that may cause database writes will always use a "primary" datacenter.

Tasks which could use more support:
 * T92357 - long tail of fixes for extensions that cause a high number of master queries on page views (HTTP GET/HEAD)
 * T97562 - memcached & varnish purge daemon using Kafka (via a new EventRelayer subclass in MediaWiki)

Proposal
Rather than active/standby, a master/slave setup could be established by sending actual traffic to all DCs. This could be extended to more than just two datacenters. Read requests could go to the DC closest to the cache proxy nodes handling them. Write requests would always be routed to the "master" DC, where the DB masters reside. The basic implementation is laid out in the next sections.

The T88445 "epic" Phabricator task tracks major code work for this effort.

Request routing
Varnish will follow some rules for selecting a backend on cache miss:


 * GET/HEAD/OPTIONS goes to closest DC
 * POST goes to the master DC
 * Any request with a valid "datacenter_preferred" cookie is routed to the master DC
 * A response header should indicate the DC for debugging
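The routing rules above can be sketched as a pure function. This is a minimal illustration in Python rather than actual Varnish configuration. Several details are assumptions not stated above: the debug header name `X-Served-By-DC` is hypothetical, any non-empty cookie value is treated as "valid", and methods other than GET/HEAD/OPTIONS (not only POST) are sent to the master DC.

```python
READ_METHODS = {"GET", "HEAD", "OPTIONS"}
PREFERENCE_COOKIE = "datacenter_preferred"


def is_valid_preference(value):
    """Assumption: any non-empty cookie value counts as valid."""
    return bool(value)


def select_backend(method, cookies, closest_dc, master_dc):
    """Pick the DC for a cache miss and return (dc, debug_headers)."""
    if method.upper() not in READ_METHODS:
        # POST (and, by assumption, any other non-read method) may write
        # to the databases, so it goes to the master DC.
        dc = master_dc
    elif is_valid_preference(cookies.get(PREFERENCE_COOKIE)):
        # The client was recently told to stick to the master DC.
        dc = master_dc
    else:
        # Plain reads go to the DC closest to the cache proxy.
        dc = closest_dc
    # Hypothetical header name, exposing the chosen DC for debugging.
    return dc, {"X-Served-By-DC": dc}


if __name__ == "__main__":
    print(select_backend("GET", {}, "codfw", "eqiad"))
    print(select_backend("POST", {}, "codfw", "eqiad"))
    print(select_backend("GET", {PREFERENCE_COOKIE: "1"}, "codfw", "eqiad"))
```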

HTTP GET/HEAD idempotence
Currently, some MediaWiki GET requests (like rollback) do DB writes. Sometimes opportunistic updates (e.g. cascading protection, protected page list purging) happen on read requests. These should be changed to use the job queue to avoid slow cross-DC database master queries. Job insertion can work via deferred updates.

Data store usage/replication

 * MariaDB (main wiki tables): master DBs are in the primary DC, slaves in the slave DC
 * MariaDB (externalstore): master DBs are in the primary DC, slaves in the slave DC
 * Swift: use a global swift cluster with read/write affinity set (http://docs.openstack.org/developer/swift/admin_guide.html)
 * CirrusSearch: clusters maintained in both DCs with a "MultiEnqueueJob" queue in the primary DC that forks jobs into both DCs (cirrus jobs could run in both DCs)
 * MediaWiki sessions: redis masters in the master DC and slaves in the slave DC
 * The longer-term goal is to evaluate automatically sharded stores (e.g. Cassandra). We only see ~15 MB of usage per redis server, with mostly get traffic and a tiny portion being setex. Various stores can handle this load easily. Manual redis sharding has limited ability to intelligently handle slaves in the slave DC going down and to move traffic over.

Job queue & runners
Jobs will be enqueued to and run in the master datacenter. The 'enqueue' queue (for the EnqueueJob class) can, however, exist in both DCs if desired (which would also both have runners in that case). The use of post-send DeferredUpdates and JobQueueGroup::lazyPush means that there is no latency issue with jobs enqueued on HTTP GET requests.
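The reason post-send enqueueing hides the latency can be shown with a toy request object. This is a sketch in Python, not the MediaWiki API: `Request`, `defer`, `lazy_push`, and `finish` are hypothetical names that only mimic the roles of DeferredUpdates and JobQueueGroup::lazyPush.

```python
class Request:
    """Toy request that runs deferred work only after the response is sent."""

    def __init__(self):
        self._deferred = []
        self.events = []

    def defer(self, callback):
        """Register work to run after the response has gone out."""
        self._deferred.append(callback)

    def lazy_push(self, queue, job):
        """Buffer a job; the (possibly cross-DC) insertion happens post-send."""
        self.defer(lambda: queue.append(job))

    def finish(self, body):
        """Send the response, then run the deferred callbacks in order."""
        self.events.append("response sent")
        for callback in self._deferred:
            callback()
            self.events.append("deferred update ran")
        self._deferred.clear()
        return body


if __name__ == "__main__":
    master_queue = []  # stands in for the job queue in the master DC
    request = Request()
    request.lazy_push(master_queue, {"type": "refreshLinks", "page": 42})
    print(len(master_queue))  # nothing has been enqueued yet
    request.finish("<html>...</html>")
    print(master_queue, request.events)
```

The user-visible response time covers only the work before "response sent"; the slow enqueue to the master DC happens afterwards.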

Locks/PoolCounter

 * PoolCounter daemons would be local to each DC
 * LockManager (used for file operations) would reside in the master DC, since the slave DCs shouldn't need it anyway
 * Random callers using $wgMemc->lock or $wgMemc->add can usually get DC local locks

Memory stashes
MediaWiki sometimes stashes values (like upload statuses, prepared edits in ApiEditStash, StatCounter deltas) in memcached for later use. The stash strategy will be:

 * memcached: stashed values could go exclusively to the primary DC memcached cluster (e.g. if all involved requests are POST anyway)
 * redis: stashed values are written to the master servers in redis and replicated to the other DC

In some cases, it might be worth switching from memcached to redis if cross-DC replication is desired.

Live activity state
Some things need to convey real-time state between DCs. For example, FlaggedRevs shows the (last) user currently viewing a pending changes diff at Special:PendingChanges (and the API exposes this). This essentially means there is state that needs to be mutated and replicated by users simply viewing content. As long as the state changes use POST requests (e.g. via AJAX), a master/slave strategy could be used, with the data going in Redis. Using async AJAX also means that this doesn't slow anything down.

Caching and purges
MediaWiki makes fairly heavy use of caching. The cache strategy will be:


 * memcached: cached values would go in DC-specific memcached clusters with an interface for broadcasting deletes (WANObjectCache) via daemons
 * varnish: the HTCP purger could be subscribed to relay updates, but could possibly be consolidated into the memcached purger daemons
 * MariaDB parsercache: these caches would be cluster-local
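The memcached strategy above (DC-local values, broadcast deletes) can be sketched as follows. This is a simplified model, not the WANObjectCache implementation: the `WanCache` class is hypothetical, the per-DC clusters are plain dicts, and the relay daemons are replaced by a direct loop over all DCs.

```python
class WanCache:
    """Toy cache: reads and sets are DC-local, deletes reach every DC."""

    def __init__(self, local_dc, clusters):
        self.local_dc = local_dc
        self.clusters = clusters  # shared mapping: DC name -> dict of cached values

    def get(self, key):
        """Read from the local DC's cluster only."""
        return self.clusters[self.local_dc].get(key)

    def set(self, key, value):
        """Write to the local DC's cluster only; other DCs fill on their own misses."""
        self.clusters[self.local_dc][key] = value

    def delete(self, key):
        """Broadcast the purge so that no DC keeps serving the stale value."""
        for store in self.clusters.values():
            store.pop(key, None)


if __name__ == "__main__":
    clusters = {"eqiad": {}, "codfw": {}}
    eqiad = WanCache("eqiad", clusters)
    codfw = WanCache("codfw", clusters)

    eqiad.set("page:1", "rev 10")
    codfw.set("page:1", "rev 10")
    eqiad.delete("page:1")      # the record changed in the master DC
    print(codfw.get("page:1"))  # the secondary DC no longer has the old value
```

Only deletes need to cross the DC boundary, which keeps cross-DC traffic small compared with replicating every cached value.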

Logging and metrics
Various backends are used to store logs and metrics.


 * Ganglia: already grouped by DC and will stay that way
 * Logging/metrics in Elastic: these could be DC-local, but a single Kibana that can see all logs for both DCs is strongly preferred
 * grafana et al: probably global except for special cases (a prefix could be used or something)

Consistency

 * Using redis replication for sessions creates a slight opportunity for stale data. To avoid this, when sessions mutate (on POST), a "UseDC" cookie can be sent to the user (lasting only 10 seconds). This would pin them to the active DC for more than enough time for replication to complete. This also means that the DB positions stored in ChronologyProtector will make it through, preserving the "session consistency" we try to maintain now.
 * Additionally, redis writes to ChronologyProtector can be synchronous to all DCs (247325), which handles cross-domain redirects and page views that involve user updates to one wiki (domain) followed by views of another wiki (domain). For example, CentralAuth account creation involves updates to a local wiki, then login.wikimedia.org via redirect, and then back to the local wiki; each step involves various state changes.
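The short-lived sticky cookie described above can be sketched like this. It is a minimal illustration, assuming the cookie is represented as a dict with an absolute expiry time; the function names and the cookie value `"master"` are hypothetical. Only the cookie name "UseDC" and the 10 second lifetime come from the text.

```python
STICKY_COOKIE = "UseDC"
STICKY_SECONDS = 10


def cookies_after_request(method, session_changed, now):
    """Return the cookies to set once a request has been handled."""
    if method.upper() == "POST" and session_changed:
        # The session changed in the master DC: pin the user there until
        # replication has had time to catch up.
        return {STICKY_COOKIE: {"value": "master", "expires": now + STICKY_SECONDS}}
    return {}


def is_pinned(cookies, now):
    """True while the sticky cookie is present and has not yet expired."""
    cookie = cookies.get(STICKY_COOKIE)
    return cookie is not None and now < cookie["expires"]


if __name__ == "__main__":
    t0 = 1000.0
    cookies = cookies_after_request("POST", session_changed=True, now=t0)
    print(is_pinned(cookies, t0 + 3))   # still within the 10 second window
    print(is_pinned(cookies, t0 + 11))  # window elapsed, closest DC is used again
```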

Design implications
Some changes would be needed to MediaWiki development standards (in Performance guidelines and possibly elsewhere):
 * DB and other data store writes should be avoided on non-POST requests to avoid latency (shared-nothing caches are fine, even if using RDBMSes). In some cases, using DeferredUpdates is acceptable, even for GET requests (the action is done after the user gets the response, so it does not block the request).
 * DB_MASTER reads should be avoided on non-POST requests to avoid latency as well
 * Code that uses caches must be aware of whether it needs to do explicit purges or can use the datacenter-local cache. In general, if code uses delete, or uses set to update the cache when a record changes, it will need to use the WAN cache. If it just uses TTL-based logic or caches immutable objects, it can use either.
 * Code pushing to job queues on GET must use the local "enqueue" queue (EnqueueJob).
 * Ephemeral stashing, user activity data, and such must come about via POST requests (not just updates serving GET) or must use the local "enqueue" job queue (EnqueueJob).
 * Session data should only change on login/logout (POSTs) or in *rare* cases otherwise