User:Anomie/Schema-change-post-draft

We've recently done some significant normalization of MediaWiki's database schemas. I'd like to share the methods we used to do it and the lessons we learned.

Background
For each "action", meaning a revision to a page, a log entry, an image version uploaded, a block, and so on, MediaWiki was storing the comment/summary text and the user name or IP for each row. As this was causing some of our tables, particularly, to become very large, we decided to normalize these fields to store the strings once and refer to them from each "action" row by integer ID.

Our sites are large and popular; we can't just take an entire site offline to update the databases and the code in one go, even for a few minutes. And the database updates needed would take far longer than that. So we had to devise a process for making the changes online. Fortunately, we're well used to doing that sort of thing. The usual method for doing schema changes in the past has gone something like this:
 * 1) Define the new tables, columns, indexes, and so on ("schemas", for lack of a better word), through normal code review processes.
   * This may also involve some changes to the old schemas, such as adding defaults to columns so that their omission from INSERTs when we get to step 5 won't result in errors.
 * 2) Deploy the new schemas to Wikimedia's databases.
 * 3) Update the code to start populating both the old and new schemas, probably in multiple steps. Test that things seem to be working properly.
 * 4) Run a migration script to back-populate the new schemas for old rows.
 * 5) Update the code to remove all access to the old schemas, and the old schemas themselves, probably in multiple steps.
   * This may also involve some changes to the new schemas, such as removing defaults from columns that should now always be included in INSERTs.
 * 6) Deploy the removal of the old schemas to Wikimedia's databases.

On the whole this has worked rather well for us, but it does have some drawbacks:
 * There are a lot of separate code changes involved: one in step 1, multiple in step 3, and multiple in step 5. It can be hard to see the whole plan when reviewing the earlier pieces.
 * During intermediate stages, the intended behavior of any particular piece of code can be unclear. It would be easy to accidentally leave some piece behind.
 * The pace of rolling out each stage of the change across our multiple sites is tied to the deployment train. We can't easily go slower for additional testing, and rolling back in case of bugs is an intensive process.

That last bullet isn't just a problem here; we have it in rollouts of many major features. And we have a standard solution too: a "feature flag" configuration setting, so the new feature only shows up on a wiki when the configuration setting is set to true. This schema change process is more complicated than a simple on/off flag, but extending the idea to a multi-step process is straightforward.

The feature flag can help us improve the situation for the other bullets too. We can do the majority of the code changes at step 1, since the feature flag selects which code runs; the presence of the flag makes affected code much easier to find, and the associated conditional control structures make the intention clearer as well.

Overview
For various reasons, we wound up doing the normalizations of comments and user names/IPs ("actors") separately and mostly serially.

Both normalizations were complicated by an apparent dilemma: the revision tables on various wikis, and possibly the image table on Wikimedia Commons, were thought to be too large in their denormalized state to alter without blocking other work for several months. But to normalize them, we would need to alter them. We broke the dilemma by adding some temporary additional complexity to the schemas: instead of altering these tables right away, we created temporary "linking" tables to hold the additional fields needed for the normalization. We could then empty the old denormalized fields, and hope that a single change dropping the now-empty old columns and adding the new ones at the same time would be workable.
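In sketch form, the linking-table idea looks something like this (names approximate one of the real linking tables, revision_comment_temp):

```php
<?php
// Illustrative sketch: instead of ALTERing the huge revision table now,
// a small side table carries the revision-to-comment association until
// the old column can be dropped and the new one added in one change.
echo <<<'SQL'
CREATE TABLE revision_comment_temp (
    revcomment_rev        INT NOT NULL,    -- revision.rev_id
    revcomment_comment_id BIGINT NOT NULL, -- comment.comment_id
    PRIMARY KEY (revcomment_rev, revcomment_comment_id)
);
SQL;
```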

At this time we're almost done with both the comment and actor normalizations. The process for Multi-Content Revisions is in progress.

In the near future, we'll likely be following this process again to replace the temporary linking tables mentioned above with fields in the main tables.

The comment normalization

 * The tracking task for this normalization was T166733

The plan
The plan was to have a four-state feature "flag" covering 7 stages of migration. This plan was chosen to minimize the amount of duplicate data in the database, i.e. to avoid writing both the old denormalized field and the new normalized data for any one "action". It was also (somewhat unconsciously) informed by the one-way nature of the relationship between "action" and comment: we never look up, filter, or order "actions" by their comment.
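Concretely, the four states of the flag looked something like this sketch (constant and setting names are from memory; the stage mapping in the comments is spelled out in the transition list below):

```php
<?php
// The four-state migration "flag" (sketch). Stages 0 and 6 have no flag
// at all: the conditional code doesn't exist yet, or has been removed.
const MIGRATION_OLD        = 0; // stage 1: read old, write old
const MIGRATION_WRITE_BOTH = 1; // stage 2: read both (prefer new), write both
const MIGRATION_WRITE_NEW  = 2; // stages 3-4: read both, write new only
const MIGRATION_NEW        = 3; // stage 5: read new only, write new only

// Per-wiki configuration selects the stage, e.g.:
$wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH;
```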

These stages were carefully chosen to have the property that cross-wiki access between wikis in different stages is always safe as long as the stages are only one step apart. The same property ensures that, in case of problems, we can go back one stage without major issues due to missing data. One major implication is that we must always finish upgrading every wiki from stage S-1 to stage S before migrating any wiki to stage S+1.

The transitions between stages, and the reasons for compatibility between the stages, are as follows.
 * 0→1
 * This is one of only two mandatory code and schema deployment transitions, representing the deployment of the code adding the feature flag and all associated conditional control structures and the schema change adding all the new columns and tables to the databases.
 * These are compatible because in either case only the old schemas are accessed. Since the new schemas are never accessed, errors will not be generated due to attempts to access columns or tables that don't exist on the stage-0 wiki.


 * 1→2
 * This is the transition to starting to write data to both the old and new schemas. Read accesses load both the new and old data, using the new data in preference to the old.
 * Technically there's no reason this couldn't have been "read-both/write-old" to further reduce the amount of duplicate data written. But doing write-both allows us to test that the new data is being written completely and correctly, with recovery being as simple as deleting the incorrect new data (falling back to the old).
 * These are compatible because the stage-2 wiki still writes the old-schema data needed by the stage-1 wiki, and still uses the old-schema data when the stage-1 wiki doesn't write the new-schema data.


 * 2→3
 * This transition stops writing the old schema; only the new schema will be populated for new "actions". Read accesses still load both the new and old data, as old rows still lack new data.
 * These are compatible because the stage-2 wiki will ignore the old-schema data when new-schema data is present, so it's ok if the stage-3 wiki doesn't write the old data anymore.


 * 3→4
 * This transition represents the running of the migration script to back-populate the new schema in all the old rows. The maintenance script may in fact blank the old-schema columns as it copies data over.
 * These are running the same code with the same configuration. All this is doing is converting old-schema rows to new-schema rows.


 * 4→5
 * This transition removes all accesses to the old schema. Only the new schema is read now; missing new-schema data should start being treated as an error.
 * These are compatible because all rows always have new-schema data, so it's safe to stop reading old-schema data entirely.


 * 5→6
 * This is the second mandatory code and schema deployment transition, representing the deployment of the code removing the feature flag and associated conditional control structures and the schema change removing the old columns and tables from the databases.
 * These are compatible because in either case only the new schemas are accessed. Since the old schemas are never accessed anymore, it's safe to remove them.
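To make the read side of those stages concrete, here's a minimal sketch of the kind of helper the migration code centralizes, loosely modeled on what MediaWiki's CommentStore provides (the function and names here are simplified and hypothetical):

```php
<?php
const MIGRATION_OLD        = 0;
const MIGRATION_WRITE_BOTH = 1;
const MIGRATION_WRITE_NEW  = 2;
const MIGRATION_NEW        = 3;

/**
 * Hypothetical stage-aware read helper: which SELECT expression and
 * JOINs fetch a revision's comment at a given migration stage?
 */
function commentQueryInfo( int $stage ): array {
	if ( $stage === MIGRATION_OLD ) {
		// Stage 1: only the old denormalized column is read.
		return [ 'field' => 'rev_comment', 'joins' => [] ];
	}
	if ( $stage < MIGRATION_NEW ) {
		// Stages 2-4: read both, preferring new data over old.
		return [
			'field' => 'COALESCE(comment_text, rev_comment)',
			'joins' => [
				'LEFT JOIN revision_comment_temp ON revcomment_rev = rev_id',
				'LEFT JOIN comment ON comment_id = revcomment_comment_id',
			],
		];
	}
	// Stage 5: only the new schema is read; missing data is an error.
	return [
		'field' => 'comment_text',
		'joins' => [
			'JOIN revision_comment_temp ON revcomment_rev = rev_id',
			'JOIN comment ON comment_id = revcomment_comment_id',
		],
	];
}

print_r( commentQueryInfo( MIGRATION_WRITE_BOTH ) );
```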

The implementation
We combined a few other feature requests into the project: the long-standing request to allow comments longer than 255 bytes, and a request from Wikibase to allow for structured data for comments.

We introduced the CommentStore class as a combination "migration manager" and storage manager for the new table. The "migration manager" aspect refers to centralizing the common logic for handling the migration stages, so that callers don't each have to branch on the feature flag themselves.

Roll outs to Wikimedia production were conducted in stages, each separated by at least a few days to allow bugs to be found and reported: Beta Cluster first, then test wikis and mediawiki.org, then group 0, then group 1, and finally group 2.

Stage 1: The patch was first uploaded 8 June 2017, and merged 30 August 2017. The schema change for Wikimedia production was begun 2 November, and completed 6 February 2018. There was some delay involved due to an issue with s5, plus the holidays.

Stage 2: Beta Cluster was tested starting 17 November 2017. A false start for production on 4 December 2017 pointed out the necessity of all wikis being at stage 1 first. Roll out began again on 6 February 2018 and completed 27 February.

Stage 3: This stage was delayed by the changes to the Cloud Services replica views and by a followup production schema change needed to correct a difference between the varchar type MediaWiki specified for certain columns and the blob type actually used in Wikimedia production on some wikis. The latter in particular was held up by the need for a master failover, which wound up being delayed by several months.

Roll out of stage 3 began 17 September 2018 and completed 25 October.

Stage 4: The maintenance script runs began 30 October, and completed 9 November. Consistency checks were run 12–13 November.
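For flavor, this is roughly the shape of such a backfill script: batched by primary key so each write transaction stays short, with a pause for replication between batches. The sketch below uses SQLite via PDO so it's self-contained, and omits the hash-based deduplication and error handling of the real maintenance script.

```php
<?php
// Toy backfill: copy denormalized rev_comment text into the comment
// table in primary-key batches, blanking the old column as we go.
$db = new PDO( 'sqlite::memory:' );
$db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
$db->exec( 'CREATE TABLE revision (
	rev_id INTEGER PRIMARY KEY, rev_comment TEXT, rev_comment_id INTEGER )' );
$db->exec( 'CREATE TABLE comment (
	comment_id INTEGER PRIMARY KEY AUTOINCREMENT, comment_text TEXT )' );

// Seed some unmigrated rows.
$seed = $db->prepare( 'INSERT INTO revision (rev_id, rev_comment) VALUES (?, ?)' );
for ( $i = 1; $i <= 1000; $i++ ) {
	$seed->execute( [ $i, "edit summary $i" ] );
}

$insComment = $db->prepare( 'INSERT INTO comment (comment_text) VALUES (?)' );
$updRevision = $db->prepare(
	"UPDATE revision SET rev_comment_id = ?, rev_comment = '' WHERE rev_id = ?" );

$batchSize = 100;
$lastId = 0;
do {
	$rows = $db->query(
		"SELECT rev_id, rev_comment FROM revision
		 WHERE rev_id > $lastId AND rev_comment_id IS NULL
		 ORDER BY rev_id LIMIT $batchSize"
	)->fetchAll( PDO::FETCH_ASSOC );
	$db->beginTransaction();
	foreach ( $rows as $row ) {
		$insComment->execute( [ $row['rev_comment'] ] );
		$updRevision->execute( [ $db->lastInsertId(), $row['rev_id'] ] );
		$lastId = $row['rev_id'];
	}
	$db->commit();
	// Production would wait here for replicas to catch up.
} while ( $rows );

echo "Backfilled up to rev_id $lastId\n";
```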

Stage 5: Began 26 November, completed 19 December.

Stage 6: The patch was uploaded 4 January 2019, merged 7 February. The Cloud Services replica views took until 20 May, due to some errors in the initial attempt. The production schema change was held back because the actor migration was almost at the same stage; it was begun 18 September 2019.

The actor normalization

 * The tracking task for this normalization was T188327

The plan
The original plan was to use the same plan as for the comment normalization. We got as far as starting stage 2 when we found out that that plan failed horribly: unlike comments, we do look things up by actor, filter by actor, and even order results by the actor. The queries trying to handle read-both mode were unable to make use of indexes and so were extremely inefficient.

The new plan, based on some early planning of a migration for Multi-Content Revisions, was to read from only the old schemas until the new was fully populated, after which we would switch to reading only the new schemas and then stop writing the old. This is a significant duplication of data, but the resulting queries are much simpler.
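The difference is easiest to see in the WHERE clause of a "revisions by user X" query (column names here approximate the real schema):

```php
<?php
// Illustrative only: why read-both filtering by actor defeated indexes.
echo <<<'SQL'
-- Read-both (original plan): no single index covers a condition that
-- spans both schemas, so this degenerates into a table scan or an
-- awkward UNION rewrite.
SELECT rev_id FROM revision
LEFT JOIN revision_actor_temp ON revactor_rev = rev_id
WHERE (revactor_actor IS NOT NULL AND revactor_actor = 12345)
   OR (revactor_actor IS NULL AND rev_user = 67890);

-- Read-old (revised plan, before the switch): one fully populated
-- schema, one indexed column, one straightforward lookup.
SELECT rev_id FROM revision WHERE rev_user = 67890;

-- Read-new (after the switch): equally simple on the other side.
SELECT rev_id FROM revision
JOIN revision_actor_temp ON revactor_rev = rev_id
WHERE revactor_actor = 12345;
SQL;
```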

As with the previous plan, these stages were carefully chosen to have the property that cross-wiki access between wikis in different stages is always safe as long as the stages are only one step apart. The same property ensures that, in case of problems, we can go back one stage without major issues due to missing data. One major implication is that we must always finish upgrading every wiki from stage S-1 to stage S before migrating any wiki to stage S+1.

The transitions between stages, and the reasons for compatibility between the stages, are as follows.
 * 0→1
 * This is one of only two mandatory code and schema deployment transitions, representing the deployment of the code adding the feature flag and all associated conditional control structures and the schema change adding all the new columns and tables to the databases.
 * These are compatible because in either case only the old schemas are accessed. Since the new schemas are never accessed, errors will not be generated due to attempts to access columns or tables that don't exist on the stage-0 wiki.


 * 1→2
 * This is the transition to starting to write data to both the old and new schemas. Read accesses load from the old schema only.
 * These are compatible because only the old schemas are read, and they are written by both stages.


 * 2→3
 * This transition represents the running of the migration script to back-populate the new schema in all the old rows. The maintenance script cannot blank the old data, though, as it is still in use.
 * These are running the same code with the same configuration. All this is doing is filling in new-schema rows.


 * 3→4
 * This transition switches reads from the old schema to the new schema. Both are still written, though.
 * These are compatible because both the old and new schemas remain fully populated, so it doesn't matter whether the old or new is read.


 * 4→5
 * This transition removes all accesses to the old schema. Only the new schema is written now.
 * These are compatible because all reads use only the new schema, so it's safe to stop writing old-schema data entirely.


 * 5→6
 * This is the second mandatory code and schema deployment transition, representing the deployment of the code removing the feature flag and associated conditional control structures and the schema change removing the old columns and tables from the databases.
 * These are compatible because in either case only the new schemas are accessed. Since the old schemas are never accessed anymore, it's safe to remove them.
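The write side of these stages can be sketched the same way as the comment read helper earlier, loosely modeled on ActorMigration::getInsertValues() (a hypothetical simplified form; in production the new-schema write actually went into the temporary linking table rather than a column on revision itself):

```php
<?php
const MIGRATION_OLD        = 0;
const MIGRATION_WRITE_BOTH = 1;
const MIGRATION_WRITE_NEW  = 2;
const MIGRATION_NEW        = 3;

/**
 * Hypothetical stage-aware write helper: which columns to set when
 * inserting a revision row for a given user at a given stage.
 */
function actorInsertValues( int $stage, int $userId, string $userName, int $actorId ): array {
	$values = [];
	if ( $stage <= MIGRATION_WRITE_BOTH ) {
		// Stages 1-2: keep populating the old denormalized columns.
		$values['rev_user'] = $userId;
		$values['rev_user_text'] = $userName;
	}
	if ( $stage >= MIGRATION_WRITE_BOTH ) {
		// Stages 2-5: populate the new normalized reference.
		$values['rev_actor'] = $actorId;
	}
	return $values;
}

print_r( actorInsertValues( MIGRATION_WRITE_BOTH, 67890, 'ExampleUser', 12345 ) );
```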

The implementation
This change also introduced a migration manager, ActorMigration. This does not also serve as a storage manager, though, since management of actor data was added to the existing User class.

The temporary linking table for this change includes some other fields denormalized from the revision table, to allow queries looking up revisions by actor to use functionally the same indexes they'll use once the temporary table is removed.
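In sketch form (names approximate the real revision_actor_temp table; the index definitions are illustrative):

```php
<?php
// Illustrative: the linking table denormalizes rev_timestamp and
// rev_page so that "revisions by actor, ordered by timestamp" queries
// can use an index equivalent to the one planned for the revision
// table itself once the temporary table is folded back in.
echo <<<'SQL'
CREATE TABLE revision_actor_temp (
    revactor_rev       INT NOT NULL,        -- revision.rev_id
    revactor_actor     BIGINT NOT NULL,     -- actor.actor_id
    revactor_timestamp BINARY(14) NOT NULL, -- copy of rev_timestamp
    revactor_page      INT NOT NULL,        -- copy of rev_page
    PRIMARY KEY (revactor_rev, revactor_actor)
);
CREATE INDEX actor_timestamp
    ON revision_actor_temp (revactor_actor, revactor_timestamp);
CREATE INDEX page_actor_timestamp
    ON revision_actor_temp (revactor_page, revactor_actor, revactor_timestamp);
SQL;
```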

Roll outs to Wikimedia production were conducted as for the comment normalization.

Stage 1: The patch was first uploaded 25 September 2017, and merged 23 February 2018. The schema change for Wikimedia production was split into two parts: the first was begun 10 April and completed 30 May 2018, while the second was combined with the followup production schema change for the comment normalization and didn't complete until 17 September.

Stage 2: Roll out was attempted to Beta Cluster and test wikis on 17 September 2018. It was soon reverted, as query performance for read-both was awful as noted above. After some attempts to fix individual slow queries, we formed the new plan described above.

Stage 1 again: The patch implementing the new plan was submitted 19 September, merged 10 October. No additional schema changes were needed.

Stage 2: Roll out began 13 November, and completed 13 December.

Stage 3: The maintenance script runs began 17 December, but queries were slow and it was aborted. The fix was submitted 17 December 2018, and merged 10 January 2019. Script runs began again on 11 January, and completed 30 January. Some ancient database inconsistencies were fixed manually.

Consistency checks finished 4 February, and identified one bug which was fixed by 3 April.

Stage 4: Roll out began 3 April 2019 and completed 13 May. Some slow queries were found and fixed during the rollout.

Also of note: the Cloud Services replica view updates were finished by 5 October 2018, and in some cases contained "read-both" logic per the original plan.

Stage 5: Roll out began 13 May, since Beta Cluster and test wikis had been stable at stage 4 for some time. It completed 3 June.

The Cloud Services replica view updates to remove the read-old bits began 15 May, completed 3 June.

Stage 6: The patch was uploaded 25 July 2019, merged 9 September. The production schema change began 18 September 2019.

Retrospective
The plan used for the actor normalization should serve as a model for future schema migrations, as it proved to work well. The plan from the comment normalization should only be used if the relation between the source rows and the new normalized table is one-way.

Plan for a lot of "down" time during the process. Even when everything goes right, the production schema changes and maintenance script runs can each take several weeks. At the same time, be ready to react quickly if something does break. Roll outs of the other stages take a few minutes for the configuration change, then several days of waiting while watching for bugs to show up.

Code review was a major bottleneck, for the typical reasons it's a bottleneck for any project that doesn't have multiple engineers working on it as their top priority.

Plan interactions with other teams ahead of time, and file tasks early when possible to allow them time to plan their work and make any needed announcements. In the "actor plan" the Cloud Services view changes can be defined at the same time as the stage 1 code, can be started at the same time as stage 4, and will block further progress only once we reach the point of starting production schema changes in stage 6. The interactions with DBAs are less amenable to early filing (and consultation is required during the pre-stage-1 planning phase), but still benefit from status communication particularly as stages 1 and 6 approach.

The configuration change stage transitions should include changing the default, so CI will test the new default before it starts being pushed to the wikis. This wasn't done reliably in either case study, but proved useful during actor transition 4→5.
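That is, besides flipping the setting in the wiki-farm configuration, also move the default that ships with the code, something like this sketch (the real settings were named along the lines of $wgActorTableSchemaMigrationStage):

```php
<?php
const MIGRATION_OLD        = 0;
const MIGRATION_WRITE_BOTH = 1;

// In the code's default settings (sketch): advance the shipped default
// as each transition rolls out, so CI exercises the value that is about
// to be deployed rather than the one being left behind.
$wgActorTableSchemaMigrationStage = MIGRATION_WRITE_BOTH; // was MIGRATION_OLD
```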

While both plans here have almost all of the code changes upfront in stage 1 (stage 6 is mostly removal of old code), that isn't required. For example in the "actor plan" you might omit the backfill maintenance script and read-new code paths from the stage 1 patch since they aren't needed until stages 3 and 4, respectively. You could also split both the stage 1 and 6 patches into schema-change and code-change parts, particularly since the production schema changes by the DBAs depend only on the schema-change part being merged. On the other hand, more code upfront makes it easier to account for all the use cases and for reviewers to see those use cases during review.

Thanks
Thanks to Tim Starling for reviewing the major patches, to Jaime Crespo and Manuel Arostegui for doing the production schema changes, and to Brooke Storm for doing the Cloud Services replica view updates. If any of my comments above about errors or delays came across as critical of their work, that was not my intention.