Requests for comment/Multi datacenter strategy for MediaWiki

This proposal is in regards to Wikimedia configuration changes and MediaWiki support for having "reads" (GET/HEAD requests) be routed to both datacenters, with writes (POST requests) going to a single datacenter. Having writes be routable to multiple datacenters is not in the scope of this particular proposal.

Background
Historically, MediaWiki has designed for single-cluster master/slave database setups. Some modules (FileBackend, Search, BagOStuff, ExternalStore, ...) support distributed stores with no master/slave distinction. In any case, it was not designed for cross-DC sites serving the same MediaWiki instance. In order to avoid single DC disaster scenarios, the Wikimedia Foundation has opted to have multiple datacenters, with a full copy of the data. Caching proxy servers exist at these and at separate sites to handle latency.

Problem
The use of proxy servers only handles basic page, file, and asset (JS/CSS) requests by logged-out users.
 * It does not cover various special pages or actions, just standard page views and asset delivery. Once users log in, page views always go to the active datacenter. The standby datacenter only makes use of proxy cache servers, the rest of them are just backups (Swift, Redis, MariaDB, ...).
 * There is currently no way to keep the caches in the standby warm. Also, since no traffic goes to the standby datacenter, its ability to handle traffic and perform correctly is not well tested, leaving room for surprises on failover. One could make a relay script, but that would not yield as many benefits. There is also less motivation to keep the secondary DC config up to date if no traffic goes there.
 * Switch-over time is not terribly well optimized and involves a lot of puppet work. This should be optimized in any case, but it would be nice to have read requests "Just Work" if a DC has a power outage or other internal issues.
 * A fair amount of hardware/power is wasted just sitting there losing value.

Proposal
Rather than active/standby, a master/slave setup could be established by sending actual traffic to all DCs. This could be extended to more than just two datacenters. Read requests could go to the DC closest to the cache proxy nodes handling them. Write requests would always be routed to the "master" DC, where the DB masters reside. The basic implementation is laid out in the next sections.

The T88445 "epic" Phabricator task tracks major code work for this effort.

Request routing
Varnish will follow some rules for selecting a backend on cache miss:


 * GET/HEAD/OPTIONS goes to closest DC
 * POST goes to the master DC
 * Any request with a valid "datacenter_preferred" cookie is routed to the master DC
 * A response header should indicate the DC for debugging

HTTP GET/HEAD idempotence
Currently, some MediaWiki GET requests (like rollback) do DB writes. Sometimes opportunistic updates (e.g. cascading protection, protected page list purging) happen on read requests. These should be changed to use a local job queue to avoid slow cross-DC database master queries. Each DC would have its own local jobqueue with runners for "enqueue jobs". These jobs simply enqueue the job to the master DC. All other job types would normally only run on the master DC.

Data store usage/replication

 * MariaDB (main wiki tables): master DBs are in the primary DC, slaves in the slave DC
 * MariaDB (externalstore): master DBs are in the primary DC, slaves in the slave DC
 * Swift: use a global swift cluster with read/write affinity set (http://docs.openstack.org/developer/swift/admin_guide.html)
 * CirrusSearch: clusters maintained in both DCs with a "MultiEnqueueJob" queue in the primary DC that forks jobs into both DCs (cirrus jobs could run in both DCs)
 * MediaWiki sessions: redis masters in the master DC and slaves in the slave DC
 * Longer term goal is to evaluated automatically sharded stores (e.g. Cassandra ect). We only ~15mb usage per redis server with mostly get traffic and a tiny portion being setex. Various stores can handle this load easily. Manual redis sharding has limited ability to intelligently handle slaves in the slave DC going down and moving traffic over.

Job queuing
The would be two local job queues with runners:
 * a) The "EnqueueJob" type that pushes jobs to the master DC
 * b) The "MultiEnqueueJob" type that pushes jobs to both DCs

Some job types might also be run locally if needed (mostly to work around systems with poor geo-replication). Most jobs that do actual work will do so on the master DC.

Locking/PoolCounter

 * PoolCounter daemons would be local to each DC
 * LockManager would reside in the master DC, since the slave DCs shouldn't need it anyway
 * Random callers using $wgMemc->lock or $wgMemc->add can usually get DC local locks

Memory stashes
MediaWiki somtimes stashes values (like upload statuses, prepared edits in ApiEditStash, StatCounter deltas) in memcached for later use. The stash strategy will be:

In some cases, it might be worth switching from memcached to redis if cross-DC replication is desired.
 * memcached: stashed values could go exclusively to the primary DC memcached cluster
 * redis: stashes values are written to the master servers in redis and replicated to the other DC

Live activity state
Some things need to convey real-time state between DCs. For example, FlaggedRevs shows the (last) user current viewing a pending changes diff at Special:PendingChanges (and the API exposes this). This essentially means there is state that needs to be mutated and replicated by users simply viewing content. As long as the state changes use POST requests (e.g. via AJAX), then a master/slave strategy could be used, with the data going in Redis. Using async AJAX also means that this doesn't slow anything down.

Caching and purges
MediaWiki makes fairly heavy use of caching. The cache strategy will be:


 * memcached: cached values would go in DC-specific memcached clusters with an interface for broadcasting deletes (WANObjectCache) via daemons
 * varnish: the HTCP purger could be subscribed to relay updates, but could possibly be consolidated into the memcached purger daemons
 * MariaDB parsercache: these caches would be cluster-local

Logging and metrics

 * Ganglia: already grouped by DC and will stay that way
 * Logging/metrics in Elastic: these could be DC-local? Please, one Kibana that can see all logs for both DCs.
 * graphite/gdash: deprecated
 * grafana et al: probably global except for special cases (a prefix could be used or something)

Consistency

 * Sessions using redis replication creates a slight opportunity for stale data. To avoid this, when sessions mutate (on POST), a "datacenter_preferred" cookie can be sent to the user (lasting only 10 seconds). This would sticky them to the active DC for more than enough time for replication to complete. This also means that the DB positions stored in ChronologyProtecter will make it through, preserving the "session consistency" we try to maintain now.

Design implications
Some changes would be needed to MediaWiki development standards (in Backend performance guidelines#Caching layers and possibly elsewhere):
 * DB and other data store writes should be avoided on non-POST requests to avoid latency (shared-nothing caches are fine, even if using RDBMes). In some cases, using DeferredUpdates is acceptable, even for GET requests (the action is done after the user gets the response, so it does not block the request).
 * DB_MASTER reads should be avoided on non-POST request to avoid latency as well
 * Code that uses caches must be aware of whether it needs to do explicit purges or can use the data-center local cache. In general, if code uses delete or uses set to update the cache when a record changes, it needs will need to use the WAN cache. If it just uses TTL based logic or caches immutable objects, it can use either.
 * Code pushing to job queues on GET must use the local "enqueue" queue (EnqueueJob).
 * Ephemeral stashing, user activity data, and such most come about via POST requests (not just updates serving GET) or must use the local "enqueue" job queue (EnqueueJob).
 * Session data should only change on login/logout (POSTS) or in *rare* cases otherwise

Deployment steps
Actually deployment could be done in stages to keep things manageable:
 * Make the new MediaWiki job/cache classes and sticky DC cookie code available in core
 * Setup the new (WAN) caches and daemons, testing them in labs and manually (this can be done in 1 DC)
 * Likewise for the local job queues and runners (that merely route jobs)
 * Likewise for the sticky DC cookie logic
 * Start using the new cache modules in core
 * Document new usage patterns for cache/queue/stores on mw.org and convert extensions
 * Get the second datacenter puppetized and fully setup with replicated data (with no user traffic)
 * Search and file/thumbnail requests might be excluded if there is no replication solution for those yet
 * Setup and verify that request work in the second datacenter (with cache daemons in place there)
 * Deploy the Varnish routing logic to the inactive datacenter and verify that requests work
 * If the second DC is being used for live requests as a CDN, then the routing will need to require some debug header
 * Actually route traffic to both DCs (via DNS or not requiring the debug header)
 * Work on getting elasticsearch, swift, and anything else left handled by both DCs

Codfw mediawiki-config checklist

 * Redis slaves setup with proper tag hashing (using the master names)
 * $wgGettingStartedRedisSlave is set
 * The $wgJobQueueConf['enqueue'] is set to be local and has runners via puppet
 * Memcached setup as DC-local with WAN cache relay daemons (at least the secondary DC should be listening for changes)
 * PoolCounter setup as DC-local
 * HTTP POST cookie config set