Multi-Content Revisions/Content Meta-Data

For accessing the content of the different slots of a revision, a RevisionContentLookup service is defined:

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also T142980)

For virtual slots, RevisionContentLookup::getRevisionSlot or RevisionSlot::getContent would generate the desired information on the fly. In general, RevisionSlot::getContent should be considered to be potentially expensive, since it may trigger lazy loading of the content blob.
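To make the shape of these services concrete, here is a minimal Python sketch of the lookup and slot interfaces (method names are transliterated from the proposed PHP interfaces; the structure is illustrative, not the actual signatures):

```python
from typing import Callable, Dict, Optional, Tuple

class RevisionSlot:
    """One content slot of a revision. getContent is potentially
    expensive, since it may trigger lazy loading of the content blob."""

    def __init__(self, role: str, model: str, load: Callable[[], str]):
        self._role = role
        self._model = model
        self._load = load                    # deferred blob loader
        self._content: Optional[str] = None  # cached after first load

    def get_model(self) -> str:
        return self._model

    def get_content(self) -> str:
        # Lazy-load the blob on first access, then cache it.
        if self._content is None:
            self._content = self._load()
        return self._content

class RevisionContentLookup:
    """Maps (revision id, slot role) to a RevisionSlot. A real
    implementation would consult the storage layer; this stand-in
    is backed by a dict."""

    def __init__(self, slots: Dict[Tuple[int, str], RevisionSlot]):
        self._slots = slots

    def get_revision_slot(self, rev_id: int, role: str) -> RevisionSlot:
        return self._slots[(rev_id, role)]
```

For a virtual slot, the loader callable would generate the content on the fly instead of fetching a stored blob.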

Side note: the use of Title in this interface is unfortunate. We would want something more lightweight, like TitleValue, but with the information from the page table, most importantly the page ID. A PageEntry or PageRecord, perhaps.

TBD: Shall we use the full Content interface also for derived content, or should we define a narrower interface, with only simple methods like getModel and isEmpty?

The initial implementation of RevisionContentLookup would just be a refactoring of the current functionality. No schema changes are needed in the database. Only the main slot (and virtual slots) are supported. Implementation steps:


 * Move storage layer code for accessing revision content from Revision into RevisionContentLookup and an appropriate RevisionSlot implementation.
 * Change Revision to use a RevisionSlot to access revision content.
 * The initial implementation of RevisionContentLookup and RevisionSlot will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

To allow virtual slots to be defined, a dispatching implementation of RevisionContentLookup can be used. It would delegate to other specialized implementations of RevisionContentLookup based on the slot name and/or content type. This way, suitable RevisionContentLookup implementations can be created to e.g. provide access to the ParserCache, or RESTbase services, or use the new VRS infrastructure to provide information about revisions.
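A sketch of such a dispatcher, with dictionary-backed stand-ins for the specialized implementations (all names here are hypothetical):

```python
class DictContentLookup:
    """Trivial RevisionContentLookup backed by a dict; stands in for a
    database-, ParserCache- or RESTbase-backed implementation."""

    def __init__(self, slots):
        self._slots = slots  # {(rev_id, role): content}

    def get_revision_slot(self, rev_id, role):
        return self._slots[(rev_id, role)]

class DispatchingContentLookup:
    """Delegates to specialized lookups based on the slot name,
    falling back to a default implementation."""

    def __init__(self, by_role, default):
        self._by_role = by_role
        self._default = default

    def get_revision_slot(self, rev_id, role):
        lookup = self._by_role.get(role, self._default)
        return lookup.get_revision_slot(rev_id, role)

# The "main" slot comes from primary storage; a virtual "html" slot
# could be served by a ParserCache-backed implementation.
main_store = DictContentLookup({(7, "main"): "wikitext of rev 7"})
html_store = DictContentLookup({(7, "html"): "<p>rendered</p>"})

lookup = DispatchingContentLookup({"html": html_store}, default=main_store)
```

Callers see a single RevisionContentLookup, regardless of where each slot's content actually lives.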

Database Schema
Please refer to the Multi-Content Revisions/Glossary for a description of the entities modeled by this schema!

In order to allow multiple content objects per revision, the revision table needs to be split, so that the information about the content lives in a separate table, while information about the revision as such stays where it is. The structure of this table was discussed in the Create a content meta-data table RFC, in a somewhat different context.

Introducing this table is the core feature needed for multi-content revision support. It introduces a level of indirection that provides a new degree of freedom in modeling the wiki content: pages can now have multiple "streams". We go from the straightforward model to a more indirect one. The table structure is as follows:

(Code experiment: https://gerrit.wikimedia.org/r/#/c/302056/ and http://sqlfiddle.com/#!9/0b847/7)

A role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc.). It would reference a content_roles table defined in the same way as the content_models and content_formats tables proposed in T105652 and T142980. The revision reference and the role form a unique key - that is, there can only be one content object per role in each revision.
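For illustration, a minimal SQLite sketch of such a table, including the unique key (the cont_* column names follow the naming used elsewhere in this proposal but are assumptions, not a final schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content_roles (
    role_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    role_name TEXT UNIQUE NOT NULL
);
CREATE TABLE content (
    cont_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    cont_revision INTEGER NOT NULL,            -- owning revision
    cont_role     INTEGER NOT NULL REFERENCES content_roles(role_id),
    cont_model    INTEGER NOT NULL,            -- id from a content_models table
    cont_address  TEXT    NOT NULL,            -- blob address
    UNIQUE (cont_revision, cont_role)          -- one content object per role
);
""")

db.execute("INSERT INTO content_roles (role_name) VALUES ('main')")
db.execute("INSERT INTO content (cont_revision, cont_role, cont_model, cont_address) "
           "VALUES (1, 1, 1, 'tt:100')")

# A second 'main' content object for revision 1 violates the unique key:
try:
    db.execute("INSERT INTO content (cont_revision, cont_role, cont_model, cont_address) "
               "VALUES (1, 1, 1, 'tt:101')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The unique key is what enforces "one content object per role in each revision" at the database level.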

In some cases, it may be sufficient to access the content table (resp. the association table, see below) and bypass the revision table completely. For instance, the revision reference in the content table (resp. the association table) can be joined directly against page_latest to find the content relevant for showing the current revision of a page.

When joining against revision, page, etc., the role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.
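A sketch of such a join, with the role fixed to "main" (the role is stored denormalized as a name here for readability; table layout and addresses are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER NOT NULL);
CREATE TABLE content (
    cont_id       INTEGER PRIMARY KEY,
    cont_revision INTEGER NOT NULL,
    cont_role     TEXT    NOT NULL,  -- normally an id into content_roles
    cont_address  TEXT    NOT NULL
);
""")
db.execute("INSERT INTO page VALUES (7, 101)")
db.executemany("INSERT INTO content VALUES (?, ?, ?, ?)", [
    (1, 100, 'main', 'tt:1'),        # older revision
    (2, 101, 'main', 'tt:2'),        # current revision, main slot
    (3, 101, 'categories', 'tt:3'),  # current revision, another slot
])

# Join content directly against page_latest; fixing cont_role
# guarantees a unique match per revision.
row = db.execute("""
    SELECT cont_address FROM page
    JOIN content ON cont_revision = page_latest
    WHERE page_id = 7 AND cont_role = 'main'
""").fetchone()
# row -> ('tt:2',)
```

Without the `cont_role = 'main'` condition, the join would return one row per slot of the current revision.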

The auto-increment ID field is not strictly necessary unless we want to re-use content rows, see below. But being able to identify specific content with a unique ID seems like a good idea in any case.

Re-using Content Rows
If we assume that it will be common for one stream of a page to be edited frequently (e.g. the main stream), while other streams are only updated rarely (e.g. categories), it seems inefficient to create rows in the content table for every slot of every revision. Instead, we want to re-use a row in the content table for multiple rows in the revision table. To allow this, we can introduce another table that records the association of content with revisions. The idea is that instead of relying on the content table's revision reference (an n:1 relationship), we use a separate table to allow an n:m relationship:

Note that we still need the revision reference in the content table to track which revision introduced a specific bit of content.

Also note that multiple rows in the content table may refer to the same blob (that is, they have the same value in cont_address). So with this approach, there are two levels of indirection that allow re-use: revision -> content (n:m), and content -> blob (n:1).
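The n:m association can be sketched as follows (the revision_content table name and the rc_* columns are hypothetical; addresses are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (
    cont_id       INTEGER PRIMARY KEY,
    cont_revision INTEGER NOT NULL,   -- revision that introduced this content
    cont_address  TEXT    NOT NULL    -- several rows may share one blob address
);
CREATE TABLE revision_content (       -- hypothetical name for the n:m table
    rc_revision INTEGER NOT NULL,
    rc_content  INTEGER NOT NULL REFERENCES content(cont_id),
    rc_role     TEXT    NOT NULL,
    PRIMARY KEY (rc_revision, rc_role)
);
""")

# Revision 1 creates both slots; revision 2 edits only the main slot
# and re-uses the categories content row introduced by revision 1.
db.executemany("INSERT INTO content VALUES (?, ?, ?)", [
    (10, 1, 'tt:100'),   # main,       introduced by rev 1
    (11, 1, 'tt:101'),   # categories, introduced by rev 1
    (12, 2, 'tt:102'),   # main,       introduced by rev 2
])
db.executemany("INSERT INTO revision_content VALUES (?, ?, ?)", [
    (1, 10, 'main'), (1, 11, 'categories'),
    (2, 12, 'main'), (2, 11, 'categories'),   # rev 2 re-uses content row 11
])

slots_of_rev2 = db.execute("""
    SELECT rc_role, cont_address FROM revision_content
    JOIN content ON cont_id = rc_content
    WHERE rc_revision = 2 ORDER BY rc_role
""").fetchall()
# slots_of_rev2 -> [('categories', 'tt:101'), ('main', 'tt:102')]
```

Revision 2 gets a full set of slots, but only one new content row was written for it.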

Removing Redundant Information
Once we have the content table, the following columns in the old revision and archive tables are redundant and can be dropped: rev_content_model, rev_content_format, ar_content_model, and ar_content_format. Since the old way to store the content model and format was rather inefficient, we should free some space by doing this (even though the vast majority of these fields are currently NULL).

The following fields in the revision table do not become redundant, since they act as summary fields for all content objects of the revision. Their purpose is to compare revisions, which would not be possible in an efficient way if they were removed: rev_len is used to indicate how much an edit added or removed, while rev_sha1 can be used to check whether two revisions have the same content (e.g. to identify edits that restored an old revision).
 * rev_len: the sum of the (logical) size of the content objects of the revision's primary slots.
 * rev_sha1: the aggregated hash of the content objects of the revision's primary slots, i.e. sha1( content3, sha1( content2, sha1( content1 ) ) ). This way, the revision hash for a revision with only a single slot is the same as the slot's hash.
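The aggregation rule can be illustrated in Python, reading the two-argument sha1 as concatenation (hex digests are used here for brevity; MediaWiki actually stores base-36 digests):

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def revision_sha1(slot_contents):
    """Aggregate slot contents as sha1(c3, sha1(c2, sha1(c1))),
    treating the two-argument sha1 as concatenation of its inputs."""
    acc = sha1_hex(slot_contents[0])
    for content in slot_contents[1:]:
        acc = sha1_hex(content + acc.encode("ascii"))
    return acc

single = revision_sha1([b"main wikitext"])
multi = revision_sha1([b"main wikitext", b"category data"])
```

As intended, a single-slot revision's aggregated hash is simply the hash of its one slot, so legacy rev_sha1 values remain valid after conversion.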

Scalability
One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example. We assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average.
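Under these assumptions, the row counts scale as follows; a back-of-the-envelope calculation (the revision count is an illustrative stand-in for "several hundred million", not a measurement):

```python
revisions = 500_000_000        # illustrative stand-in, not a measured figure
streams_per_page = 3           # assumed average number of streams per page
slots_touched_per_edit = 1.5   # assumed average slots touched per post-MCR edit

# One association entry per revision and stream:
association_rows = revisions * streams_per_page
# One content row per revision and slot actually touched:
content_rows = revisions * slots_touched_per_edit

print(f"{association_rows:.0f} association rows, {content_rows:.0f} content rows")
```

So the association table grows about 3x faster than the old revision table, and the content table about 1.5x faster, which motivates keeping the association table narrow.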

Efficiency
Since we will have one entry per revision and stream (resp. slot) in the association table (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two integers per row), this will hopefully not be much of a problem. Since we will have one entry in the content table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old revision table. The content table is rather broad, since it contains the blob address and hash fields.

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer ID pointing to the text table. Considering however that, with External Storage, we are already storing these URLs in the text table for each revision, which we will not do with the new scheme, the new scheme should not need much more space for a single-slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema. For a single-slot revision, the following fields duplicate existing information:
 * the content hash is the same as rev_sha1 if there is only one slot. This is probably the biggest concern.
 * the (logical) size is the same as rev_len if there is only one slot.
 * the content model is the same as the revision's content model if there is only one slot.
Additionally, there are some fields added that act as foreign keys (the revision and role references), which introduces overhead. Some fields contain truly new information, most importantly the blob address. (Note: with External Storage enabled, this isn't a new field, the data just gets moved from the text table.)

Since the blob address is the "heaviest" field the new scheme introduces, it is worth considering how it will behave for legacy revisions. When converting legacy revisions, cont_address will be set to an address that points to a row in the text table (if we are desperate to save bits, we can use a compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere (note that the relevant flags from the text table need to be encoded in the address). This ES URL can be moved to cont_address during migration (or later, on the fly), and the now redundant row in the text table can be deleted.

Migration Plan
This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables
The following tables need to be created:
 * content
 * content_models
 * content_formats
 * content_roles

Phase I: Populate ar_rev_id

 * Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
 * Reserve m (or m+k, for good measure) IDs in the revision table:
 * Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
 * Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
 * For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.
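The reservation procedure can be sketched with SQLite's auto-increment behavior standing in for MySQL's (table layouts heavily simplified; m, k, and b as defined above):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY AUTOINCREMENT, rev_page INTEGER);
CREATE TABLE archive  (ar_id  INTEGER PRIMARY KEY AUTOINCREMENT, ar_rev_id INTEGER);
""")
db.executemany("INSERT INTO revision (rev_id, rev_page) VALUES (?, 1)",
               [(1,), (2,), (3,)])
db.executemany("INSERT INTO archive (ar_rev_id) VALUES (?)",
               [(2,), (None,), (None,)])

m = db.execute("SELECT COUNT(*) FROM archive WHERE ar_rev_id IS NULL").fetchone()[0]
k = 10  # safety margin
b = db.execute("""SELECT MAX(x) FROM (
                    SELECT MAX(rev_id) AS x FROM revision
                    UNION ALL SELECT MAX(ar_rev_id) FROM archive)""").fetchone()[0]

# Bump the auto-increment counter: insert rev_id = b+m+k, then delete it again.
db.execute("INSERT INTO revision (rev_id, rev_page) VALUES (?, 0)", (b + m + k,))
db.execute("DELETE FROM revision WHERE rev_id = ?", (b + m + k,))

# Hand out ids b+1, b+2, ... to the archive rows that lack one.
next_id = b
for (ar_id,) in db.execute(
        "SELECT ar_id FROM archive WHERE ar_rev_id IS NULL ORDER BY ar_id").fetchall():
    next_id += 1
    db.execute("UPDATE archive SET ar_rev_id = ? WHERE ar_id = ?", (next_id, ar_id))
```

After the bump, new rows inserted into revision receive ids above b+m+k, so the reserved range b+1 .. b+m+k can never collide with live edits.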

Phase II: Population

 * Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
 * Wait a bit and watch for performance issues caused by writing to the new table.
 * Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
 * Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:
 * Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
 * For each such row, construct a corresponding row for the content table[***]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
 * The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId function in order to get a numeric ID for content models, formats, and roles.
 * When all rows in one chunk have been processed, insert/commit the new rows in the content table.
 * Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.
 * Repeat the same procedure as above for the archive table.
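The chunked backfill loop can be sketched like this (table layout and the tt: address encoding are illustrative assumptions, not the script's actual implementation):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER);
CREATE TABLE content  (cont_id INTEGER PRIMARY KEY AUTOINCREMENT,
                       cont_revision INTEGER NOT NULL,
                       cont_address  TEXT NOT NULL);
""")
db.executemany("INSERT INTO revision VALUES (?, ?)",
               [(i, 100 + i) for i in range(1, 8)])

CHUNK = 3
while True:
    # NOT EXISTS plays nicely with LIMIT, unlike a LEFT JOIN.
    rows = db.execute("""
        SELECT rev_id, rev_text_id FROM revision r
        WHERE NOT EXISTS (SELECT 1 FROM content c WHERE c.cont_revision = r.rev_id)
        LIMIT ?""", (CHUNK,)).fetchall()
    if not rows:
        break  # every revision now has a content row
    db.executemany(
        "INSERT INTO content (cont_revision, cont_address) VALUES (?, ?)",
        [(rev_id, f"tt:{text_id}") for rev_id, text_id in rows])
    db.commit()  # one transaction per chunk
```

Because the selection condition is "no corresponding content row", the loop also converges when web requests insert content rows concurrently.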

Phase III: Normalization

 * Set MediaWiki to read content meta-data from the new content table.
 * Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
 * Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
 * Optional: Drop the redundant columns from the page, revision, and archive tables. Besides model and format, as well as size and hash, this is in particular the rev_text_id field.

Phase IV: Consolidation
If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided). Then the corresponding row can be deleted from the text table.
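A sketch of that consolidation step (the tt: and es: address encodings are placeholders, not the actual formats):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE text    (old_id INTEGER PRIMARY KEY, old_text TEXT);
CREATE TABLE content (cont_id INTEGER PRIMARY KEY, cont_address TEXT);
""")
# With External Storage, the text blob is itself a URL into the store.
db.execute("INSERT INTO text VALUES (1234, 'es://cluster5/99')")  # illustrative URL
db.execute("INSERT INTO content VALUES (1, 'tt:1234')")           # points at text row

# For each content row whose address points into the text table,
# copy the ES URL stored there into cont_address, then drop the text row.
for cont_id, addr in db.execute(
        "SELECT cont_id, cont_address FROM content WHERE cont_address LIKE 'tt:%'"
).fetchall():
    old_id = int(addr[3:])
    (url,) = db.execute("SELECT old_text FROM text WHERE old_id = ?",
                        (old_id,)).fetchone()
    db.execute("UPDATE content SET cont_address = ? WHERE cont_id = ?", (url, cont_id))
    db.execute("DELETE FROM text WHERE old_id = ?", (old_id,))
```

After consolidation, the indirection through the text table is gone: cont_address points straight at the External Store.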