Multi-Content Revisions/Content Meta-Data

The central component of Multi-Content-Revision support is the management of content meta-data. It provides an additional degree of freedom for modeling the wiki content: pages can have multiple "slots" (or "streams") now. We go from the straight forwards to a more indirect model:

Note that "meta-data" in this context refers to the information needed to find, load, deserialize and interpret the content associated with a revision. It does not refer to information extracted or derived from the actual content, which is discussed as "derived content" below.

Data Model
An overview in bullet points (refer to the Glossary for a definition of the terms used): For the future, it may prove useful to support the association of derived content with a revision.
 * Each revision has named slots, the slot names define the slot's role. Each slot may contain a content object.
 * There is always at least one slot defined: the main slot. It is used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
 * Slots can be enumerated (if they are primary, i.e. not derived slots, see below). The link between a revision ID and the associated slots is maintained in the main database (see Database Schema below). Listing all slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
 * A role may be associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot however may contain any kind of content (the default model depending on the page tile, etc), and some other roles may also not require a specific model to be used.
 * Slots can be uniquely identified by revision ID and slot name.
 * There is meta-data associated with each slot. The content meta-data (content model and format, logical size, and hash) and slot meta-data (slot name, revision id).
 * Slots have no intrinsic order. Two revisions are considered equal if their (primary) slots have the same content. Two equal revisions have the same hash and length.
 * A revision's hash is aggregated from the hashes of its (primary) slots. If there is only one slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the slots.
 * A generic way to associate materialized derived content with a revision could be provided, similar to (or integrated with) the mechanism used for associating primary, user-generated content with a revision.
 * Derived content may be generated and saved ("materialized") when a revision is created, or it may be generated on the fly ("virtual content"). The link between a revision ID and associated (materialized) derived content would be stored in the main database.
 * Derived slots cannot be enumerated.

Content Meta-Data Service (DAO)
For accessing the content of the different slots of a revision, a  service is defined as described below. For a description of the storage interface for retrieving actual content data and listing available slots per revision, see Revision Retrieval.

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also T142980)

TBD:  needs to know the parent revision, so it can copy or link any content from the parent revision that was not changed in the present edit. The caller (WikiPage or Revision) typically already knows the parent revision ID, so we could pass it in, but that would pollute the interface, leaking implementations to allow optimization.

Initial Implementation
The initial implementation of RevisionSlotLookup would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot is supported. Implementation steps: (Code experiment: https://gerrit.wikimedia.org/r/#/c/246433/6)
 * Move storage layer code for accessing revision content from Revision into RevisionSlotLookup.
 * Change Revision to use a RevisionSlot to access revision content.
 * The initial implementation of RevisionSlotLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

Once the application code uses the new interface for accessing revision content, a new implementation can be created that uses the new database schema described below. For the migration period, we will need an implementation that can write to the old and the new schema at once.

Database Schema
See Multi-Content Revisions/Database Schema for the database schema design!

Scalability
One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example, based on the numbers on Size of Wikipedia and site statistics of en.wikipedia.org: Note that we assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average. For the extrapolation into the future, a doubling time of 8 years is assumed for the x2 and x4 factors, and a linear growth of 15k pages/day is assumed for the +10 Million column.

Efficiency
Since we will have one entry per revision and stream (resp slot) in  (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two or three integers per row), this will hopefully not be much of a problem. Since we will have one entry in the  table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old   table. The  table is rather broad, since it contains the   and   fields.

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer id pointing to text table. Considering however that with External Storage, we are already storing these URLs now in the  table for each revision, which we will not do with the new scheme, the new scheme should not need much more space for a single slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema. For a single-slot revision, the following fields are the same: Additionally, there are some fields added that act as foreign keys, which introduces overhead: Some fields contain truly new information:
 * is the same as  if there is only one slot. This is probably the biggest concern. Perhaps a more compact storage of the hash can be used like a   field as suggested in.
 * is the same as   if there is only one slot.
 * is the same as  if there is only one slot.
 * resp.
 * (Note: with External Storage enabled, this isn't a new field, the data just gets moved from the text table)

Per default,  is set to an address that points to a row in the text table, e.g.   (if we are desperate to save bits, we can use a more compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere, something like. This ES URL can be moved to  in the future, and the now redundant row in the   table can be deleted. Note that relevant  need to be encoded in the address, perhaps like.

Optimization Considerations
''Note that the optimizations discussed below are not part of the MCR proposal. They should be investigated and discussed separately. They are mentioned here in order to provide context to the discussion of both issues, multi-content-revisions and revision storage optimization. For instance, schema changes needed for optimization could be combined with the schema changes that are part of the Migration Plan described below.''

The proposed schema offers some potential for optimization, which would make the schema less clean and obvious, but could improve performance.

Split cont_address: The reduce the storage requirements for, it would be possible to split off separate fields for common prefixes and suffixes, which would be stored as integer IDs, referencing a table that contains the actual strings, just like   does for roles. For example,  could be split into       which would be saved as e.g. ( 834234, , 787432 ).

Avoid rev_user_text: with,   ,   gone, revision rows are already smaller. They could be further reduced by setting  to NULL if   is not 0. could be removed completely If  was replaced with a 128 bit binary, this could be used to represent IPv4 and IPv6 addresses directly, as well as to refer to user accounts using their internal ID encoded in a reserved IPv6 address range.

Factor out rev_comment: the revision comment is potentially very wide,. Since it is not needed in all scenarios, it could be factored out into a separate table. Since revision comments often contain common prefixes, it may also be possible to split them in the same way suggested for  above.

Fixed width revision rows: with the above optimizations of  and , and with   and    gone, the only remaining variable width field is  , which could easily be converted to a fixed width   or even.

Partitioning: the,  , and   tables could benefit from partitioning. Further investigation is needed to determine which criterion would be useful for partitioning, and what the goal of such a partitioning should be. Should the number of partitions used per query be minimized (improve locality) or maximized (improve parallelism)? Assuming we want to optimize access to "hot" data versus the access to "stale" data, partitioning by blocks of  resp. would be the obvious choice. For optimizing localize, partitioning by modulo of  resp. would be an option, since it keeps information about all a revisions of a page in a single partition.

Migration Plan
This document describes the migration strategy for introducing the content table.

Phase 0: Create new tables (done)
The following tables need to be created:
 * content
 * slots
 * content_models
 * content_formats
 * content_roles

Phase I: Fix Legacy Archive Rows (done)
Populate empty  fields:
 * Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
 * Reserve m (or m+k, for good measure) IDs in the revision table:
 * Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
 * Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
 * For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.

Make  and   unused:
 * For each row in  that has a non-null   field, insert a row into the   table, copying   to   and   to  . Set   to the   from the newly created   row.
 * Set  and   to the empty string everywhere.

Phase II: Population (done)
See also the schema migration notes.
 * Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
 * Wait a bit and watch for performance issues caused by writing to the new table.
 * Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
 * Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:
 * Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
 * For each such row, construct a corresponding row for the content and slots table[***][****]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a  transaction.
 * The content_models, content_formats, and content_roles tables are populated as a side-effect, by virtue of calling the assignId function in order to get a numeric ID for content models,  formats, and roles.
 * When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
 * Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table  when creating new rows in revision.

The same procedure can be applied to the archive table respectively.

Phase III: Finalize (done)

 * Set MediaWiki to read content meta-data from the new content table.
 * Set MediaWiki to not populate the ar_text and ar_flags fields.
 * Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
 * Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
 * Enable MCR support in the API and UI (as far as implemented).
 * Optional: Drop the redundant columns from the page, revision, and archive tables, see Removing Redundant Information above. Schema changes desired for revision storage optimization may be applied at the same time.

Phase IV: Migrate External Store URLs
If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided, see External Store Integration). Then the corresponding rows can be deleted from the text table.