Wikimedia Performance Team/Multi-DC MediaWiki

Multi-DC MediaWiki (also known as Active-active MediaWiki) is a cross-cutting project driven by the Performance Team to give MediaWiki the ability to serve read requests from multiple datacenters. Currently, MediaWiki is deployed at Wikimedia Foundation to serve requests from the primary datacenter only.

Having MediaWiki deployed and actively serving from two or more datacenters during normal operations ensures higher resilience in case of a datacenter failure, removes most switchover costs and complexity, and eases regular maintenance. It also promises future performance gains, and throughout its development it has already brought about various performance improvements, because it required restructuring how business logic is implemented and naturally called for more asynchronous or event-driven solutions.

Remaining work
The project was formalised via the Multi-DC strategy RFC in 2015. Since then, Aaron Schulz has driven the effort of improving, upgrading, and porting the various production systems around MediaWiki to work in an active-active context with multiple datacenters serving MediaWiki web requests. You can see the history of subtasks on Phabricator.

This document focuses on the work remaining as of December 2020: the major blockers to enabling active-active serving of MediaWiki.

ChronologyProtector
ChronologyProtector is the system ensuring that editors see the result of their own actions in subsequent interactions.
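
One common way to provide this guarantee, sketched here in Python under stated assumptions, is to remember the replication position of each client's most recent write and have later reads wait until the chosen replica has caught up to it. The names `PositionStore`, `Replica`, and `wait_for_replica`, and the use of plain integers as positions, are hypothetical and are not MediaWiki's actual API.

```python
import time


class PositionStore:
    """Hypothetical store, reachable from every datacenter, mapping a client to
    the replication position of that client's most recent write."""

    def __init__(self):
        self._positions = {}

    def record_write(self, client_id, position):
        # Never move backwards: keep the highest position seen per client.
        current = self._positions.get(client_id, 0)
        self._positions[client_id] = max(current, position)

    def last_write(self, client_id):
        # Clients that have not written anything need no waiting.
        return self._positions.get(client_id, 0)


class Replica:
    """Stand-in for a database replica that reports how far it has replicated."""

    def __init__(self, applied_position=0):
        self.applied_position = applied_position


def wait_for_replica(replica, wanted_position, timeout=1.0, poll_interval=0.01):
    """Block until the replica has applied wanted_position; False on timeout."""
    deadline = time.monotonic() + timeout
    while replica.applied_position < wanted_position:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

In this sketch, the application would call `record_write` after a write and `wait_for_replica(replica, store.last_write(client_id))` before a later read, falling back (for example to the primary) if the wait times out. Where such positions are stored is the question this section tracks.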

The remaining work is to decide where and how to store the data going forward, deploy any infra and software changes as needed, and enable them.


 * Lead: Performance Team (Timo).
 * In collaboration with SRE Service Operations.
 * Task: T254634

Updates:


 * September 2020: An architectural solution has been decided on. The Performance Team, in collaboration with Service Operations, will migrate ChronologyProtector to a new data store (either Memcached or Redis) during Oct-Dec 2020 (FY 2020-2021 Q2).
 * February 2021: code simplification and backend configuration for Multi-DC ChronologyProtector have been implemented and deployed to production for all wikis.
 * March 2021: Documented CP store requirements for third-parties.
 * March 2021: Task closed.

Session storage
The session store holds temporary data required for authentication and authorization procedures such as logging in, creating accounts, and security checks before actions such as editing pages.

The older data storage system has various shortcomings beyond mere incompatibility with multi-DC operation. Even in our current single-DC deployment, the annual switchovers are cumbersome, and a replacement has been underway for some time.

The remaining work is to finish the data storage migration from Redis (non-replicated) to Kask (Cassandra-based).


 * Lead: Core Platform Team (Eric Evans), Performance Team (Tim, Timo).
 * Past tasks: RFC T206010, T206016.
 * Current tasks: T270225 (core logic).

Updates:


 * 2018-2020 (T206016): Develop and deploy Kask, gradually roll out to all Beta and production wikis.
 * Dec 2020: Performance Team realizes that requirements appear unmet, citing multiple unresolved "TODOs" in the code for primary requirements and internally inconsistent claims about the service interface. - T270225
 * Jan 2021: CPT triages task from Inbox.
 * Feb 2021: CPT moves task to "Platform Engineering Roadmap > Later".
 * March 2021: Future optimisation identified by Tim (two-level session storage) - T277834.
 * July 2022: Performance Team takes over T270225 within the limited scope of completing Multi-DC needs.




 * Fulfil or explain away the unresolved TODOs at T270225.
 * Straighten out internally inconsistent interface guarantees at T270225.

CentralAuth storage
A special kind of session storage for the central login system and cross-wiki "auto login" and "stay logged in" mechanism.

The last part of the session storage work, migrating CentralAuth sessions, is currently scheduled for completion in Oct-Dec 2020 (FY 2020-2021 Q2).


 * Lead: Core Platform Team (Bill Pirkle), Performance Team (Tim, Timo).
 * Task: T267270

Updates:


 * Nov 2020: Initial assessment done by CPT.
 * Jan 2021: Assessment concluded.
 * Feb 2021: Assessment re-opened.
 * Jul 2022: Decided on Kask-sessions (Cassandra) as backend. Should not be separate from core sessions. TTL mismatch considered a bug and also fixed by Tim. T313496




 * Document CA interface requirements and expectations.

Main Stash store
The Redis cluster previously used for session storage is also host to other miscellaneous application data through the Main Stash interface. This data has different needs than session storage. These differences become more prominent in a multi-DC deployment and make it unsuitable for Cassandra/Kask.

The remaining work is to survey the consumers and needs of Main Stash and decide how to accommodate them going forward. For example, would it help to migrate some of its consumers elsewhere and have a simpler replacement for the rest? The work also includes carrying out any software and infra changes as needed.


 * Lead: SRE Data Persistence Team (Manuel).
 * In collaboration with Performance Team (Tim, Aaron)
 * Task: T212129

Updates:


 * June 2020: The plan is to move this data to a new small MariaDB cluster. This project requires fixing "makeGlobalKey" in SqlBagOStuff, and new hardware. The hardware is being procured and set up in Q2 2020-2021 by the Data Persistence Team. The Performance Team will take care of migrating the Main Stash as soon as the new database cluster is available, i.e. between Oct 2020 and Mar 2021 (FY 2020-2021 Q2 or Q3).
 * July 2020: SqlBagOStuff now supports makeGlobalKey and can work with separate DB connections outside the local wiki. - T229062
 * Sep 2020: Hardware procurement submitted. Oct 2020: Procurement approved as part of larger order. Dec 2020: Hardware arrived. - T264584
 * Jan 2021: Hardware racked and being provisioned. - T269324
 * Feb 2021: MySQL service online and replication configured. - T269324
 * June 2022: Test config in production.
 * June 2022: Enable on all wikis.
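
The distinction behind the "makeGlobalKey" fix mentioned above can be illustrated with a short Python sketch: a per-wiki key is prefixed with the local wiki's identifier, whereas a global key carries a fixed prefix so that every wiki addresses the same entry. The function names, the `global` prefix, the colon-separated layout, and the `counter` collection are assumptions for illustration and do not reproduce SqlBagOStuff's actual key format.

```python
def make_key(wiki_id, collection, *components):
    """Build a key scoped to a single wiki (hypothetical layout)."""
    return ":".join([wiki_id, collection, *(str(c) for c in components)])


def make_global_key(collection, *components):
    """Build a key shared by all wikis (hypothetical layout)."""
    return ":".join(["global", collection, *(str(c) for c in components)])
```

With this layout, `make_key("enwiki", "counter", 7)` and `make_key("dewiki", "counter", 7)` name two separate entries, while `make_global_key("counter", 7)` names one entry regardless of which wiki asks. A real implementation would also need to escape separator characters inside components, which this sketch omits.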

MariaDB cross-datacenter secure writes
In an active-active deployment, writes still go only to the primary datacenter. However, a fallback is required for edge cases where a write is attempted in a secondary datacenter. In order to preserve our users' privacy, these writes need to be sent encrypted across datacenters. Multiple solutions are being considered, but a decision has yet to be made on which one will be implemented. This work will be a collaboration between the Data Persistence Team and the Performance Team. We hope for it to happen during fiscal year 2020-2021.


 * Lead: SRE Data Persistence Team (Manuel).
 * In collaboration with Performance Team (Tim, Timo, Aaron).
 * Task: T134809

Updates:


 * July 2020: Potential solutions raised so far: Connect with TLS to MariaDB from PHP directly, ProxySQL, dumb TCP tunnel, Envoy as TCP tunnel?, HAProxy in TCP mode.
 * Dec 2020: Leaning toward a tunnel approach; ProxySQL would take too long to set up and test from scratch.
 * May 2022: Decision is reached, led by Tim. TLS connection to be established directly from MediaWiki without additional proxies or tunnels.
 * May 2022: Configuration written.
 * June 2022: MariaDB-TLS tested and enabled for all wikis.
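
The routing rule described in this section can be sketched in Python as follows: reads use a database in the local datacenter, writes always target the primary datacenter, and any connection that leaves the local datacenter is marked as requiring TLS. The `DbTarget` type, the `pick_target` function, and the example hostnames are hypothetical. Limiting TLS to cross-datacenter connections is an assumption of this sketch, not a statement about the production configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbTarget:
    host: str
    datacenter: str
    use_tls: bool


def pick_target(is_write, local_dc, primary_dc, hosts):
    """Choose which database to contact for one query.

    hosts maps a datacenter name to a database hostname (hypothetical).
    """
    # Writes must reach the primary datacenter; reads stay local.
    dc = primary_dc if is_write else local_dc
    # Assumption: encrypt whenever the connection crosses datacenters.
    return DbTarget(host=hosts[dc], datacenter=dc, use_tls=(dc != local_dc))
```

For example, with `hosts = {"dc-a": "db.dc-a.example", "dc-b": "db.dc-b.example"}` and `dc-a` as primary, a write attempted in `dc-b` would be sent to `db.dc-a.example` with `use_tls=True`, while a read in `dc-b` would stay on `db.dc-b.example`.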

ResourceLoader file dependency store
File dependency data is currently written to a core wiki table using a primary DB connection. It must be restructured so that writes can be made within a secondary DC and then replicated. The plan is to migrate it to the Main Stash instead.


 * Lead: Performance Team (Timo, Aaron).
 * Task: T113916

Updates:


 * July 2019: Implement DepStore abstraction, decoupled from using primary DB, and now includes a KeyValue implementation that supports Main Stash.
 * May 2020: Rolled out to Beta Cluster and mediawiki.org.
 * June 2022: MainStashDB went live. Roll out to group0 wikis.
 * July 2022: Gradually rolled out to all wikis (details on task).

CDN routing
Remaining work is to agree on the MW requirements, and then write, test and deploy the traffic routing configuration.


 * Lead: Performance Team (Tim, Aaron, Timo).
 * In collaboration with SRE Traffic and SRE Service Ops.
 * Task: T91820 (implement CDN switch), T279664 (plan gradual rollout)

Updates:


 * May 2020: Aaron and Timo have thought through all relevant scenarios and drafted the requirements at T91820.
 * June 2020: Audit confirms that relevant routing cookies and headers are in place on the MW side.
 * May 2022: Traffic routing logic being developed by Tim.
 * June 2022: ATS routing logic deployed to Beta and prod, no-op but enabled.
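
As a rough illustration of the kind of decision such routing configuration makes, the Python sketch below picks a datacenter from the request method, a routing cookie, and a request header. The specific rules, the cookie name `pin_primary`, and the header name `x-promise-read-only` are hypothetical placeholders. The actual requirements are the ones drafted at T91820 and are not reproduced here.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def route_request(method, cookies, headers, nearest_dc, primary_dc):
    """Return the datacenter that should serve a request.

    cookies and headers are plain dicts; header names are assumed lowercase.
    """
    # Hypothetical cookie that pins a client to the primary datacenter,
    # for instance shortly after that client performed a write.
    if cookies.get("pin_primary") == "1":
        return primary_dc
    # Requests using safe methods are treated as reads.
    if method.upper() in SAFE_METHODS:
        return nearest_dc
    # Hypothetical header by which a client promises that a non-safe
    # request will not write anything.
    if headers.get("x-promise-read-only", "").lower() == "true":
        return nearest_dc
    # Everything else may write, so it goes to the primary datacenter.
    return primary_dc
```

Checking the pin cookie first means a pinned client goes to the primary datacenter even for reads, which is one simple way to keep a client's own recent writes visible to it.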

History
For notes from 2015-2016, see Multi datacenter strategy for MediaWiki/Progress.