WMF Projects/Master & Secondary Datacenters

Overview
This project grew out of an approved RfC. The work has been split into two phases:

Phase I: Fast failover
 * Making sure failover from one DC to the secondary DC actually works for core MediaWiki (MW) services
 * Simplifying the failover process and DC-based configuration (mediawiki-config, puppet, ...)

Phase II: Concurrent traffic
 * Allowing concurrent traffic to both DCs, as per the RfC

Work is currently ongoing on both phases at the same time, as there is a long tail of tasks needed for Phase II. In any case, Phase I has become an organization-wide priority.

Fast failover
The project currently has weekly meetings and a tracking board in Phabricator. Switches will be tested periodically, with the resulting bugs and problems fixed after each iteration. The goal is to get steadily better at this (fewer errors and a faster switchover). Each test involves the basic steps of read-only, eqiad => codfw, r/w, wait, read-only, codfw => eqiad, r/w. Ideally, much of the switching can be done at the CDN level, and MW config changes can be made via a central switch of which DC is "master". This is what ProductionServices.php in mediawiki-config was created for. Other config files are still being improved in this regard.
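
The step sequence above can be sketched as a small state model. This is only an illustration, assuming a hypothetical in-memory `Cluster` object; the class, its methods, and `switchover_test` are invented for this example and do not correspond to any real switchover tooling or to ProductionServices.php.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory model of the two-DC setup (not real tooling)."""
    master: str = "eqiad"
    read_only: bool = False
    log: list = field(default_factory=list)

    def set_read_only(self, value: bool) -> None:
        self.read_only = value
        self.log.append("read-only" if value else "r/w")

    def switch_master(self, target: str) -> None:
        # Refuse to move the master while writes are still being accepted.
        if not self.read_only:
            raise RuntimeError("must be read-only before switching master")
        self.log.append(f"{self.master} => {target}")
        self.master = target


def failover(cluster: Cluster, target: str) -> None:
    """One direction of the switch: read-only, move master, back to r/w."""
    cluster.set_read_only(True)
    cluster.switch_master(target)
    cluster.set_read_only(False)


def switchover_test(cluster: Cluster, primary: str = "eqiad",
                    secondary: str = "codfw") -> list:
    """Run the full round trip and return the recorded steps."""
    failover(cluster, secondary)
    cluster.log.append("wait")  # placeholder for the observation period
    failover(cluster, primary)
    return cluster.log


if __name__ == "__main__":
    print(", ".join(switchover_test(Cluster())))
```

The guard in `switch_master` reflects the ordering in the step list: the master only moves while the cluster is read-only, so no writes land in the old DC during the switch.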

Tasks which could use more support:
 * T129657 - Read-only error message display problems
 * T129258 - Long-running scripts interacting with read-only mode
 * T114271 - This is mostly for *unplanned* emergency switch-over. The work starts with the DBTransaction log in Kibana, looking for multi-DB "commits" from MediaWiki. It also covers secondary services populated by event streams, the job queue, hooks, etc. In some cases, the decision may be to "not worry about it".

Concurrent traffic
There is a basic RfC defining the idea, and a tracking task in Phabricator lists the blockers.

Tasks which could use more support:
 * T92357 - long tail of fixes for extensions that cause a high number of master queries on page views (HTTP GET/HEAD)
 * T97562 - memcached & varnish purge daemon using Kafka (via a new EventRelayer subclass in MediaWiki)
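
To illustrate why master queries on GET/HEAD matter for concurrent traffic, here is a minimal sketch. It assumes that read requests are served by the DC that receives them while other requests go to the master DC; that routing rule, the function names, and the request-log format are all hypothetical and are not taken from the RfC or from MediaWiki.

```python
# Assumption: only GET and HEAD are treated as read-only page views.
SAFE_METHODS = {"GET", "HEAD"}


def choose_datacenter(method: str, local_dc: str, master_dc: str) -> str:
    """Pick the DC for a request under the assumed routing rule."""
    if method.upper() in SAFE_METHODS:
        return local_dc
    return master_dc


def find_offenders(requests: list) -> list:
    """Return entry points that issued master queries on GET/HEAD.

    Each request is a hypothetical dict with the keys
    'method', 'entry_point' and 'master_queries'.
    """
    offenders = set()
    for req in requests:
        if req["method"].upper() in SAFE_METHODS and req["master_queries"] > 0:
            offenders.add(req["entry_point"])
    return sorted(offenders)


if __name__ == "__main__":
    sample = [
        {"method": "GET", "entry_point": "ExtensionA", "master_queries": 3},
        {"method": "POST", "entry_point": "ExtensionB", "master_queries": 2},
        {"method": "HEAD", "entry_point": "ExtensionC", "master_queries": 0},
    ]
    print(find_offenders(sample))
```

Under this assumption, a page view handled in the secondary DC that still queries the master has to cross DCs, which is why such queries show up as blockers.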