Multi-Content Revisions/Content Meta-Data

See also T142980.

For accessing the content of the different slots of a revision, a RevisionContentLookup service is defined:

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492)

For virtual slots, RevisionContentLookup::getRevisionSlot or RevisionSlot::getContent would generate the desired information on the fly. In general, RevisionSlot::getContent should be considered to be potentially expensive, since it may trigger lazy loading of the content blob.

Side note: the use of Title in this interface is unfortunate. We would want somethign more light weight, like TitleValue, but with the information from the page table, most importantly the page ID. A PageEntry or PageRecord, prehaps.

TBD: Shall we use the full Conent interface also for derived content, or should we define a more narrow interface, with only simple methods like getModel and isEmpty?

The initial implementation of RevisionContentLookup would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot (and virtual slots) are supported. Implementation steps:


 * Move storage layer code for accessing revision content from Revision into RevisionContentLookup and an appropriate RevisionSlot implementation.
 * Change Revision to use a RevisionSlot to access revision content.
 * The initial implementation of RevisionContentLookup and RevisionSlot will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

To allow virtual slots to be defined, a dispatching implementation of RevisionContentLookup can be used. It would delegate to other specialized implementations of RevisionContentLookup based on the slot name and/or content type. This way, suitable RevisionContentLookup implementations can be created to e.g. provide access to the ParserCache, or RESTbase services, or use the new VRS infrastructure to provide information about revisions.

Database Schema
Please refer to the Multi-Content Revisions/Glossary for a description of the entities modeled by this schema!

Create a separate table that holds meta-data about revision content (at least the model and format, but we will also want things like the role/slot, blob address, and hash there later, for multi-content-revision support). The table would have at least these fields: cont_logical_size INT(10) UNSIGNED, -- the logical sizes of the Content object, according to Content::getSize, populated from. cont_hash VARBINARY(32) NOT NULL, -- the hash of the serialized content, populated from. cont_is_primary TINYINT DEFAULT 1, -- whether this is primary, user generated content. Always 1 for now. cont_deleted TINYINT DEFAULT 0 -- whether this content was suppressed, populated from. PRIMARY KEY (cont_revision) ) /*$wgDBTableOptions*/; CREATE UNIQUE INDEX /*i*/cont_revision_role ON /*_*/content_role (cont_revision, cont_role);

CREATE TABLE /*_*/content_role (  cr_id SMALLINT NOT NULL PRIMARY KEY AUTO_INCREMENT,   cr_role VARBINARY(32) NOT NULL ) /*$wgDBTableOptions*/; CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);

A  field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). would reference a content_role table defined in the same way as content_model and content_format. and  form a unique key.

When joining against,  , etc.,   will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.

The auto-increment  field is not strictly necessary, but being able to identify specific content with a unique id seems like a good idea for the future. For instance, it will allow us to re-use the same  row for multiple revisions by adding another table for relating revisions to content.

The definition of the  table given here is in accordance with the definition of the   and   tables suggested in T105652. All of these could be changed to use an UNSIGNED SMALLINT, and to enforce uniqueness of the readable name using a UNIQUE INDEX.

Re-using Content Rows
If we assume that it will be common for one stream of a page to be edited frequently, while other streams are only updated rarely, it seems inefficient to create rows in the content table for every slot of every revision. Instead, we want to re-use a row in the content table for multiple rows in the revision table. To allow this, we can introduce another table that records the association of content with revisions. The idea is that instead of relying on  (a n:1 relationship), we use a separate table to allow an n:m relationship:

Note that we still need the  field in the content table to track which revision introduced a specific bit of content.

Also note that multiple rows in the content table may refer to the same blob (that is, they have the same value in ). So with this approach, there are two levels of indirection that allow re-use: revision -> content (n:m), and content -> blob (n:1).

Migration Plan
This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables
The following tables need to be created:
 * content
 * content_models
 * content_formats
 * content_roles

Phase I: Populate ar_rev_id

 * Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
 * Reserve m (or m+k, for good measure) IDs in the revision table:
 * Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
 * Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
 * For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.

Phase II: Population

 * Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
 * Wait a bit and watch for performance issues caused by writing to the new table.
 * Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
 * Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:
 * Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
 * For each such row, construct a corresponding row for the content table[***]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a  transaction.
 * The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId function in order to get a numeric ID for content models,  formats, and roles.
 * When all rows in one chunk have been processed, insert/commit the new rows in the content table.
 * Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table  when creating new rows in revision.
 * Repeat the same procedure as above for the archive table.

Phase III: Normalization

 * Set MediaWiki to read content meta-data from the new content table.
 * Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
 * Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
 * Optional: Drop the redundant columns from the page, revision, and archive tables. Besides model and format as well as size and hash, this is in particular the rev_text_id field.

Phase IV: Consolidation
If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided). Then the corresponding row can be deleted from the text table.