User:Legoktm/Blog/June 2021 switchover

Earlier this week, the Wikimedia Foundation's Site Reliability Engineering team switched most user traffic from our primary datacenter in Virginia ("eqiad") to our secondary one in Texas ("codfw"). This is an exercise we've done multiple times over the past five years, and this was the smoothest and fastest one yet.

The main reason we perform a datacenter switchover is to verify that in an emergency, we can switch to a different datacenter with minimal interruptions for users. All of our services and datacenters have redundant networking, power, disks and more, but even then, freak accidents can happen, and we need to be prepared.

We're also using this time to perform maintenance in Virginia that's cumbersome to do when still actively serving user traffic. For example, we're swapping out about 45 MediaWiki application servers for brand new hardware, giving users a slight performance boost. There's also a large list of pending database maintenance that was waiting on the switchover to happen.

The switchover itself was divided into 3 primary sections: Services, Traffic (caches) and MediaWiki.

Services
"Services" covers the backend services that run alongside MediaWiki and support the wikis, such as Citoid. For this switchover, we included more services in this list than before, notably Swift, which handles all of our media storage.

Most of these services are active/active, meaning they can run out of both datacenters at the same time, but we chose to run them only in Texas to make sure we have enough capacity there to handle the full load. Here's an example of traffic shifting from Virginia to Texas for the Citoid service, which fetches and generates reference templates and metadata.
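The active/active idea above can be sketched in a few lines: each service lists the datacenters it is pooled in, and depooling a site just removes it from that pool. The data structure and function names below are invented for illustration; the service and site names ("citoid", "eqiad", "codfw") follow the post, but this is not Wikimedia's actual tooling.

```python
# Hypothetical sketch of active/active service pooling.
SERVICES = {
    # service name -> set of datacenters currently pooled
    "citoid": {"eqiad", "codfw"},  # active/active: runs in both sites
    "swift": {"eqiad", "codfw"},
}

def depool(service: str, site: str) -> None:
    """Remove a datacenter from a service's pool, refusing to empty it."""
    pool = SERVICES[service]
    if pool == {site}:
        raise RuntimeError(f"refusing to depool the last site for {service}")
    pool.discard(site)

def pooled_sites(service: str) -> set:
    """Return a copy of the set of datacenters a service is pooled in."""
    return set(SERVICES[service])

# Switch Citoid to run only out of Texas:
depool("citoid", "eqiad")
print(pooled_sites("citoid"))  # {'codfw'}
```

The guard against depooling the last site mirrors why these checks matter operationally: a service with an empty pool serves no one.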

During this process we identified a few issues:

 * T285707: Our helm-charts service doesn't have a service IP, causing it to fail verification that it switched over properly. This also interrupted the verification for the rest of the services, so we had to check them by hand.
 * T285710: Monitoring for the Wikidata Query Service required manually switching the datacenter being monitored, causing lag to be misreported. Most Wikidata bots do check the amount of lag before editing, so they were stalled until it was manually switched.
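The lag check mentioned in the second issue is a MediaWiki API convention: well-behaved bots send a `maxlag` parameter with each write, and if replication lag exceeds that many seconds the API refuses the request with a `maxlag` error so the bot backs off. Here's a small sketch of the bot side of that handshake; the responses are illustrative examples of the error shape, not live API output.

```python
# Sketch of how a bot reacts to the MediaWiki "maxlag" error.
def should_retry_later(api_response: dict) -> bool:
    """Return True if the API rejected the request because of high replication lag."""
    error = api_response.get("error")
    return bool(error) and error.get("code") == "maxlag"

# Example response shapes (illustrative):
lagged = {"error": {"code": "maxlag",
                    "info": "Waiting for a database server: 6 seconds lagged"}}
ok = {"edit": {"result": "Success"}}

print(should_retry_later(lagged))  # True
print(should_retry_later(ok))      # False
```

Because the misreported lag looked permanently high, bots applying this logic kept backing off until the monitoring was switched by hand.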

Traffic
Most requests for articles never hit MediaWiki itself; they're served out of our caches, typically from the datacenter closest to the user: Virginia, Texas, California, Amsterdam or Singapore. We depooled Virginia by updating our DNS, and within a few minutes nearly all of that traffic was going to Texas instead. We didn't run into any issues during this step.
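The routing behavior above can be sketched as "serve each user from the nearest pooled edge site", so depooling Virginia shifts its users to the next-closest datacenter. The region names and preference ordering below are invented for illustration; real GeoDNS decides based on the location of the user's resolver.

```python
# Toy model of geographic edge-cache routing with depooling.
# For each user region, edge sites in order of preference (hypothetical).
PREFERENCES = {
    "us-east": ["virginia", "texas", "california"],
    "europe": ["amsterdam", "virginia", "texas"],
    "asia": ["singapore", "california", "texas"],
}

pooled = {"virginia", "texas", "california", "amsterdam", "singapore"}

def edge_for(region: str) -> str:
    """Pick the first still-pooled edge site in the region's preference list."""
    for site in PREFERENCES[region]:
        if site in pooled:
            return site
    raise RuntimeError("no pooled edge site available")

print(edge_for("us-east"))  # virginia
pooled.discard("virginia")  # depool Virginia, as during the switchover
print(edge_for("us-east"))  # texas
```

Because every region has fallbacks in its preference list, removing one site from the pool degrades latency slightly rather than breaking routing.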

MediaWiki
MediaWiki is the application that powers all of our wikis. Work is ongoing to make it possible to run it out of multiple datacenters at the same time, but for now it can only be active in one datacenter at a time. The process for switching MediaWiki is complex, but in brief it entails setting the primary databases to read-only, waiting for replication to the other datacenter to catch up, and then lifting read-only mode in the new datacenter.
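Those three steps can be walked through with a toy model: stop writes at the old primary, wait until the new site has replicated everything, then allow writes there. The `Datacenter` class and the replication model are invented for illustration; the real procedure is automated and touches many more systems.

```python
import time

# Toy model of the read-only window during a MediaWiki switchover.
class Datacenter:
    def __init__(self, name: str, read_only: bool):
        self.name = name
        self.read_only = read_only
        self.replication_position = 0  # last transaction applied here

def switchover(old: Datacenter, new: Datacenter) -> None:
    # 1. Stop writes in the old primary datacenter (read-only window opens).
    old.read_only = True
    # 2. Wait until the new datacenter has replicated everything.
    while new.replication_position < old.replication_position:
        time.sleep(0.01)  # in reality: poll the databases' replication status
        new.replication_position = old.replication_position
    # 3. Allow writes in the new datacenter (read-only window closes).
    new.read_only = False

eqiad = Datacenter("eqiad", read_only=False)
codfw = Datacenter("codfw", read_only=True)
eqiad.replication_position = 100  # pretend 100 transactions have happened

switchover(eqiad, codfw)
print(eqiad.read_only, codfw.read_only)  # True False
```

Users can't edit between steps 1 and 3, which is why shrinking that window, as described below, matters so much.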

Because of how disruptive stopping edits is for wikis, we've been cutting down how long this read-only period takes each time. This time it only lasted 1 minute and 57 seconds, the fastest yet.

After the switch, the Turkish Wikivoyage was unavailable for a few minutes because of a typo in the configuration. An incident report was written for this and a patch is pending review to prevent it from happening again.

Various other improvements to the automation around switching have been filed in Phabricator as well.

Next steps
We will switch back to our primary Virginia datacenter sometime in August, once most of the maintenance has finished, allowing us to test the procedure once again. We also have the Datacenter-Switchover and MediaWiki-MultiDC Phabricator projects tracking our work to make Wikimedia wikis more resilient and available on a technical level.