Multi-Content Revisions/Content Meta-Data

The central component of Multi-Content-Revision support is the management of content meta-data. To allow multiple content objects to be managed per revision, part of the storage layer needs to be adapted by adding a new level of indirection between revisions and content objects. This provides a new degree of freedom in modeling wiki content: pages can now have multiple "streams". We go from a straightforward model to a more indirect one. Introducing this additional layer of indirection requires a new storage interface in PHP, but more importantly it requires a database schema change and a data migration, both described later on this page.

Data Model
An overview in bullet points (refer to the Glossary for a definition of the terms used):
 * Each revision has named slots; a slot's name defines its role. Each slot may contain a content object.
 * There are two basic types of slots (resp. content): primary (user-created content that can be edited, like wikitext, etc.) and derived (automatically generated based on the user-created content). Derived content may be materialized (stored along with the primary content) or virtual (generated on the fly). Virtual content is not relevant for the meta-data storage layer.
 * There is always at least one primary slot defined: the main slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
 * Primary slots can be enumerated. The link between a revision ID and the associated primary slots is maintained in the main database (see Database Schema below). Listing all primary slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
 * Derived slots cannot be enumerated. The link between a revision ID and the associated derived slots is also stored in the main database if they are "materialized" (i.e. not virtual).
 * A role may be associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot however may contain any kind of content (the default model depending on the page title, etc.), and some other roles may also not require a specific model to be used.
 * Slots can be uniquely identified by revision ID and slot name.
 * There is meta-data associated with each slot: the content meta-data (content model and format, logical size, and hash) and the slot meta-data (slot name, revision ID).
 * Slots have no intrinsic order. Two revisions are considered equal if their primary slots have the same content. Two equal revisions have the same hash and length.
 * A revision's hash is aggregated from the hashes of its primary slots. If there is only one primary slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the primary slots.

Content Meta-Data Service (DAO)
For accessing the content of the different slots of a revision, a service is defined as described below. For a description of the storage interface for retrieving actual content data and listing available slots per revision, see Revision Retrieval.

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also T142980)

Note that getRevisionSlots will return information for derived slots only if specifically asked for them; by default, it will return all primary slots. getRevisionSlots cannot return any information about virtual slots; they will be treated as unknown. To access virtual slots, use the RevisionRecord interface.
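The lookup behaviour described above can be illustrated with a small in-memory sketch. This is not the proposed PHP interface: the class name, the field names, and the choice of "blame" as a materialized derived slot are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SlotRecord:
    """Meta-data for one stored slot (hypothetical field names)."""
    revision_id: int
    role: str              # slot name, e.g. "main"
    model: str             # content model
    size: int              # logical size
    sha1: str              # content hash
    derived: bool = False  # True for materialized derived slots


class InMemorySlotLookup:
    """Toy stand-in for a slot lookup service; virtual slots are never stored."""

    def __init__(self, records: Iterable[SlotRecord]) -> None:
        # A slot is uniquely identified by revision ID and slot name.
        self._slots: Dict[Tuple[int, str], SlotRecord] = {
            (r.revision_id, r.role): r for r in records
        }

    def get_revision_slots(
        self, revision_id: int, derived_roles: Optional[Iterable[str]] = None
    ) -> List[SlotRecord]:
        """Return all primary slots, plus derived slots that were asked for."""
        wanted = set(derived_roles or ())
        result = []
        for (rev, role), record in self._slots.items():
            if rev != revision_id:
                continue
            if not record.derived or role in wanted:
                result.append(record)
        return result
```

Asking for a role that is not stored (for example a virtual slot) simply yields nothing for that role, which mirrors the "treated as unknown" behaviour.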

TBD:  needs to know the parent revision, so it can copy or link any content from the parent revision that was not changed in the present edit. The caller (WikiPage or Revision) typically already knows the parent revision ID, so we could pass it in, but that would pollute the interface, leaking implementations to allow optimization.

Initial Implementation
The initial implementation of RevisionSlotLookup would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot is supported. Implementation steps: (Code experiment: https://gerrit.wikimedia.org/r/#/c/246433/6)
 * Move storage layer code for accessing revision content from Revision into RevisionSlotLookup.
 * Change Revision to use a RevisionSlot to access revision content.
 * The initial implementation of RevisionSlotLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

Once the application code uses the new interface for accessing revision content, a new implementation can be created that uses the new database schema described below. For the migration period, we will need an implementation that can write to the old and the new schema at once.

Database Schema
Please refer to the Multi-Content Revisions/Glossary for a description of the entities modeled by this schema!

In order to allow multiple content objects per revision, the revision table needs to be split, so that the information about the content lives in a separate table, while information about the revision as such stays where it is. The structure of this table was discussed in the Create a content meta-data table RFC, in a somewhat different context.

Introducing the content and slots tables is the core feature needed for multi-content revision support. In terms of entities and relations, it allows revision content to be modeled as follows:

Old schema:
 [page] --page_current--> [revision] --rev_text_id--> [text] --old_text--> (external)
 [page] <--rev_page-- [revision]

New schema (content table only):
 [revision] <--cont_revision-- [content] --cont_address--> (text|external)

New schema (with slots table):
 [revision] <--slot_revision-- [slots] --slot_content--> [content] --cont_address--> (text|external)
 [revision] <--cont_revision-- [content]

The table structure is as follows:

(Code experiment: https://gerrit.wikimedia.org/r/#/c/302056/ and http://sqlfiddle.com/#!9/0b847/7)

A  field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). would reference a  table defined in the same way as the   and   tables proposed in T105652 and T142980. [TBD: instead,  , and   could have a single   table, mapping arbitrary names to integers for compact storage].

A content row can be uniquely identified by  and   - that is, there can only be one content object per role in each revision.

In some cases, it may be sufficient to access the  (or , see below) table, and bypass the   table completely. For instance, content (resp. ) can be joined directly against   to find the content relevant for showing the current revision of a page.

When joining against,  , etc.,   will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.

The auto-increment  field is not strictly necessary unless we want to re-use content rows, see below. But being able to identify specific content with a unique id seems like a good idea in any case.

Re-using Content Rows
If we assume that it will be common for one stream of a page to be edited frequently (e.g. the main stream), while other streams are only updated rarely (e.g. categories), it seems inefficient to create rows in the content table for every slot of every revision. Instead, we want to re-use a row in the content table for multiple rows in the revision table. To allow this, we can introduce another table that records the association of content with revisions: the slots table (which was called   in earlier versions of this proposal). The idea is that instead of relying on cont_revision (an n:1 relationship), we use a separate table to allow an n:m relationship:

Note that we still need the cont_revision field in the content table to track which revision introduced a specific bit of content.

Also note that multiple rows in the content table may refer to the same blob (that is, they have the same value in cont_address). So with this approach, there are two levels of indirection that allow re-use: revision -> content (n:m), and content -> blob (n:1).

An alternative design for associating  rows with   rows (TBD):

With this design, the role field would be removed from the content table and maintained as slot_role in the slots table instead. This is nicer semantically, since the role is really a property of the connection between content and revision, and the same content could in theory be re-used in different roles. However, in practice, it seems unlikely that the same content could be re-used in different roles, and we would be repeating the role information for all slots in each revision, even for slots that were not modified. On the other hand, this setup allows us to enforce the constraint that each revision must only have one content object for each role via a unique index. Without the slot_role field, this constraint cannot be enforced directly in the database.
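To make the row re-use and the unique-index constraint more concrete, here is a minimal sketch using SQLite from Python. It is not the proposed table definition: the cont_id column name, the column types, storing the role as text instead of a numeric ID, and the save_revision helper are all assumptions for illustration.

```python
import sqlite3

# Simplified stand-ins for the content and slots tables (hypothetical layout).
SCHEMA = """
CREATE TABLE content (
    cont_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    cont_revision INTEGER NOT NULL,  -- revision that introduced this content
    cont_address  TEXT NOT NULL      -- blob address (text row or external store)
);
CREATE TABLE slots (
    slot_revision INTEGER NOT NULL,
    slot_role     TEXT NOT NULL,     -- the proposal uses a numeric role ID
    slot_content  INTEGER NOT NULL REFERENCES content (cont_id),
    UNIQUE (slot_revision, slot_role)  -- one content object per role and revision
);
"""


def save_revision(db, rev_id, parent_id, changed):
    """Record a revision; `changed` maps role -> address of a newly stored blob.

    Slots that were not touched are linked to the parent's content rows
    instead of getting new rows in the content table.
    """
    if parent_id is not None:
        db.execute(
            "INSERT INTO slots (slot_revision, slot_role, slot_content) "
            "SELECT ?, slot_role, slot_content FROM slots WHERE slot_revision = ?",
            (rev_id, parent_id),
        )
    for role, address in changed.items():
        cur = db.execute(
            "INSERT INTO content (cont_revision, cont_address) VALUES (?, ?)",
            (rev_id, address),
        )
        # Replace the inherited link (if any) with the newly created content row.
        db.execute(
            "INSERT OR REPLACE INTO slots (slot_revision, slot_role, slot_content) "
            "VALUES (?, ?, ?)",
            (rev_id, role, cur.lastrowid),
        )
    db.commit()
```

With this layout, an edit that only touches the main slot adds one content row, while the slots table still gets one row per slot of the new revision.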

Multi-Content Archive
In order to allow multi-content revisions to be deleted and restored, content can no longer be stored directly in the archive table. Instead, rows from the content and slots tables will remain unchanged when a revision (or page) is deleted. They remain accessible via the ar_rev_id field.

Full migration to MCR requires the  and   fields to be copied to the   table (or, if we can integrate ExternalStore immediately, into  ). This also requires any legacy rows in the archive table that have no  set to be initialized.

However, as an interim solution, it would be possible to set  and   to NULL for new rows in the   table, relying on the information in   and , while old archive entries continue to use   and   directly.

Removing Redundant Information
Once we have the content table and have migrated existing data, the following columns in the old   and   table are redundant, and can be dropped: Since the old way to store content_model and content_format was rather inefficient, we should free some space by doing this (even though the vast majority of these fields are currently NULL).

The following fields in the revision table do not become redundant, since they act as summary fields for all content objects of the revision:
 * rev_len: the sum of the (logical) sizes of the content objects of the revision's primary slots.
 * rev_sha1: the aggregated hash of the content objects of the revision's primary slots, i.e. sha1( content3, sha1( content2, sha1( content1 ) ) ). This way, the hash of a revision with only a single slot is the same as the slot's hash.

The purpose of these fields is to compare revisions, which would not be possible in an efficient way if they were removed: rev_len is used to indicate how much an edit added or removed, while rev_sha1 can be used to check whether two revisions have the same content (e.g. to identify edits that restored an old revision).
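As an illustration of how the two summary values could be computed, here is a minimal sketch. Several details are assumptions rather than part of the proposal: slots are processed in sorted role order (since slots have no intrinsic order), the two-argument sha1 is read as "hash of the concatenation", the per-slot hashes (not the raw content) are chained, and the logical size is taken to be the byte length.

```python
import hashlib
from typing import Dict, Tuple


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def revision_summary(primary_slots: Dict[str, bytes]) -> Tuple[str, int]:
    """Return (aggregated hash, total length) for a revision's primary slots.

    `primary_slots` maps role name -> serialized content.
    """
    if not primary_slots:
        raise ValueError("a revision needs at least the main slot")
    aggregated = None
    total_length = 0
    # Sorting by role makes the result independent of insertion order.
    for role in sorted(primary_slots):
        content = primary_slots[role]
        total_length += len(content)
        slot_hash = sha1_hex(content)
        if aggregated is None:
            # Single slot: the revision hash is exactly the slot hash.
            aggregated = slot_hash
        else:
            # Chain the next slot hash with the hash accumulated so far.
            aggregated = sha1_hex((slot_hash + aggregated).encode("ascii"))
    return aggregated, total_length
```

Under these assumptions, two revisions whose primary slots hold the same content yield the same hash and length, regardless of the order in which the slots were supplied.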

Scalability
One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example, based on the numbers on Size of Wikipedia and site statistics of en.wikipedia.org: Note that we assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average. For the extrapolation into the future, a doubling time of 8 years is assumed for the x2 and x4 factors, and a linear growth of 15k pages/day is assumed for the +10 Million column.

Efficiency
Since we will have one entry per revision and stream (resp. slot) in the slots table (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two integers per row), this will hopefully not be much of a problem. Since we will have one entry in the content table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old revision table. The content table is rather broad, since it contains the   and   fields.
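The row counts implied by these averages can be estimated with a few lines of arithmetic. The sketch below is only a back-of-the-envelope helper: the function name is hypothetical, and it assumes that every legacy (pre-MCR) revision is converted to exactly one content row and one slots row.

```python
from typing import Dict


def estimate_rows(
    legacy_revisions: int,
    new_revisions: int,
    slots_touched_per_edit: float = 1.5,  # average assumed in the text above
    streams_per_page: float = 3.0,        # average assumed in the text above
) -> Dict[str, int]:
    """Estimate table heights after conversion to the new schema."""
    return {
        "revision": legacy_revisions + new_revisions,
        # One content row per slot actually touched by an edit.
        "content": legacy_revisions + round(new_revisions * slots_touched_per_edit),
        # One slots row per stream of every revision.
        "slots": legacy_revisions + round(new_revisions * streams_per_page),
    }
```

For example, with purely illustrative inputs of 100 legacy revisions and 200 post-conversion revisions, this gives 400 content rows and 700 slots rows against 300 revision rows.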

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer ID pointing to the text table. Considering, however, that with External Storage we are already storing these URLs in the text table for each revision, which we will not do with the new scheme, the new scheme should not need much more space for a single-slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema.

For a single-slot revision, the following fields are the same:
 * is the same as  if there is only one slot. This is probably the biggest concern. Perhaps a more compact storage of the hash can be used like a   field as suggested in.
 * is the same as   if there is only one slot.
 * is the same as  if there is only one slot.

Additionally, there are some fields added that act as foreign keys, which introduces overhead.

Some fields contain truly new information:
 * (Note: with External Storage enabled, this isn't a new field, the data just gets moved from the text table)

Since cont_address is the "heaviest" field the new scheme introduces, it is worth considering how it will behave for legacy revisions. When converting legacy revisions, cont_address will be set to an address that points to a row in the text table, e.g.   (if we are desperate to save bits, we can use a more compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere, something like. This ES URL can be moved to cont_address during migration (or later, on the fly), and the now redundant row in the text table can be deleted. Note that relevant  need to be encoded in the address, perhaps like.

Migration Plan
This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables
The following tables need to be created:
 * content
 * slots
 * content_models
 * content_formats
 * content_roles

Phase I: Fix Legacy Archive Rows
Populate empty ar_rev_id fields:
 * Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
 * Reserve m (or m+k, for good measure) IDs in the revision table:
 * Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
 * Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
 * For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.
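The ID reservation steps above can be sketched as follows, using SQLite from Python as a stand-in for the production database. The function name, the ar_id key used to address archive rows, and the drastically simplified tables are assumptions for illustration; the auto-increment behaviour of the real database engine should be checked separately.

```python
import sqlite3


def assign_missing_archive_rev_ids(db: sqlite3.Connection, k: int = 10) -> int:
    """Give every archive row without ar_rev_id a fresh, unique revision ID.

    Returns m, the number of archive rows that were fixed.
    """
    m = db.execute(
        "SELECT COUNT(*) FROM archive WHERE ar_rev_id IS NULL"
    ).fetchone()[0]
    if m == 0:
        return 0

    # b = max( max( rev_id ), max( ar_rev_id ) )
    max_rev = db.execute("SELECT COALESCE(MAX(rev_id), 0) FROM revision").fetchone()[0]
    max_ar = db.execute("SELECT COALESCE(MAX(ar_rev_id), 0) FROM archive").fetchone()[0]
    b = max(max_rev, max_ar)

    # Reserve b+1 .. b+m+k by inserting and deleting a dummy row, which
    # bumps the auto-increment counter of the revision table past the range.
    top = b + m + k
    db.execute("INSERT INTO revision (rev_id) VALUES (?)", (top,))
    db.execute("DELETE FROM revision WHERE rev_id = ?", (top,))

    # Hand out the reserved IDs programmatically, one per legacy row.
    rows = db.execute(
        "SELECT ar_id FROM archive WHERE ar_rev_id IS NULL ORDER BY ar_id"
    ).fetchall()
    for offset, (ar_id,) in enumerate(rows, start=1):
        db.execute(
            "UPDATE archive SET ar_rev_id = ? WHERE ar_id = ?", (b + offset, ar_id)
        )
    db.commit()
    return m
```

On a large wiki the UPDATE loop would presumably be chunked in the same way as the population script described below.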

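Make ar_text and ar_flags unused:
 * For each row in archive that has a non-null ar_text field, insert a row into the text table, copying ar_text to old_text and ar_flags to old_flags. Set ar_text_id to the old_id from the newly created text row.
 * Set ar_text and ar_flags to the empty string everywhere.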

Phase II: Population

 * Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
 * Wait a bit and watch for performance issues caused by writing to the new table.
 * Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
 * Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:
 * Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
 * For each such row, construct a corresponding row for the content and slots table[***][****]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
 * The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId function in order to get a numeric ID for content models, formats, and roles.
 * When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
 * Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.
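The chunked operation described above could look roughly like the following sketch, again using SQLite from Python rather than the actual PHP maintenance script. Everything schema-related here is an assumption for illustration: the column names cont_id, id and name, the simplified tables, the wait_for_replication callback, and in particular the "text-row:" address prefix, which is a placeholder and not the address format of the proposal.

```python
import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER);
CREATE TABLE content (cont_id INTEGER PRIMARY KEY AUTOINCREMENT,
                      cont_revision INTEGER, cont_address TEXT);
CREATE TABLE slots (slot_revision INTEGER, slot_role INTEGER, slot_content INTEGER);
CREATE TABLE content_roles (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE);
"""

NAME_TABLES = ("content_models", "content_formats", "content_roles")


def assign_id(db: sqlite3.Connection, table: str, name: str) -> int:
    """Map a name to a numeric ID, creating the mapping row on first use."""
    if table not in NAME_TABLES:  # table name is interpolated, so restrict it
        raise ValueError(f"unexpected table: {table}")
    row = db.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    if row:
        return row[0]
    return db.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,)).lastrowid


def populate_content_table(
    db: sqlite3.Connection,
    chunk_size: int = 1000,
    wait_for_replication: Callable[[], None] = lambda: None,
) -> int:
    """Create content and slots rows for revisions that lack them, in chunks."""
    main_role = assign_id(db, "content_roles", "main")
    total = 0
    while True:
        rows = db.execute(
            "SELECT rev_id, rev_text_id FROM revision r "
            "WHERE NOT EXISTS (SELECT 1 FROM content c WHERE c.cont_revision = r.rev_id) "
            "ORDER BY rev_id LIMIT ?",
            (chunk_size,),
        ).fetchall()
        if not rows:
            db.commit()
            return total
        for rev_id, text_id in rows:
            cur = db.execute(
                "INSERT INTO content (cont_revision, cont_address) VALUES (?, ?)",
                (rev_id, f"text-row:{text_id}"),  # placeholder address format
            )
            db.execute(
                "INSERT INTO slots (slot_revision, slot_role, slot_content) VALUES (?, ?, ?)",
                (rev_id, main_role, cur.lastrowid),
            )
        db.commit()             # one transaction per chunk
        wait_for_replication()  # give replicas time to catch up
        total += len(rows)
```

Because the selection only looks at revisions without a content row, the script can be interrupted and restarted without creating duplicates.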

The same procedure can be applied to the archive table respectively.

Phase III: Finalize

 * Set MediaWiki to read content meta-data from the new content table.
 * Set MediaWiki to not populate the ar_text and ar_flags fields.
 * Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
 * Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
 * Enable MCR support in the API and UI (as far as implemented).
 * Optional: Drop the redundant columns from the page, revision, and archive tables, see Removing Redundant Information above.

Phase IV: Migrate External Store URLs
If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided, see External Store Integration). Then the corresponding rows can be deleted from the text table.