Multi-Content Revisions/Content Meta-Data

The central component of Multi-Content-Revision support is the management of content meta-data. To allow multiple content objects to be managed per revision, part of the storage layer needs to be adapted by adding a new level of indirection between revisions and content objects. This provides a new degree of freedom for modeling wiki content: pages can now have multiple "streams". We go from a straightforward model, in which each revision has exactly one content object, to a more indirect one, in which a revision refers to several content objects. The introduction of this additional layer of indirection requires a new storage interface in PHP, but more importantly it requires a database schema change and data migration, both described later on this page.

Data Model
An overview in bullet points (refer to the Glossary for a definition of the terms used):
 * Each revision has named slots; the slot name defines the slot's role. Each slot may contain a content object.
 * There is always at least one slot defined: the main slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
 * Slots can be enumerated (if they are primary, i.e. not derived slots, see below). The link between a revision ID and the associated slots is maintained in the main database (see Database Schema below). Listing all slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
 * A role may be associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot however may contain any kind of content (the default model depending on the page title, etc), and some other roles may also not require a specific model to be used.
 * Slots can be uniquely identified by revision ID and slot name.
 * There is meta-data associated with each slot: the content meta-data (content model and format, logical size, and hash) and the slot meta-data (slot name, revision ID).
 * Slots have no intrinsic order. Two revisions are considered equal if their (primary) slots have the same content. Two equal revisions have the same hash and length.
 * A revision's hash is aggregated from the hashes of its (primary) slots. If there is only one slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the slots.

For the future, it may prove useful to support the association of derived content with a revision:
 * A generic way to associate materialized derived content with a revision could be provided, similar to (or integrated with) the mechanism used for associating primary, user-generated content with a revision.
 * Derived content may be generated and saved ("materialized") when a revision is created, or it may be generated on the fly ("virtual content"). The link between a revision ID and associated (materialized) derived content would be stored in the main database.
 * Derived slots cannot be enumerated.
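
The data model for primary slots can be illustrated with a minimal in-memory sketch. It is written in Python purely for illustration; the class and method names (Content, Revision, get_content, roles, same_content_as) are hypothetical and are not part of the proposed PHP interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAIN_ROLE = "main"  # the slot used wherever no role is specified


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for a content object."""
    model: str   # content model, e.g. "wikitext"
    data: bytes  # serialized content


@dataclass
class Revision:
    """A revision with named slots; the slot name is the role."""
    rev_id: int
    slots: Dict[str, Content] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # There is always at least one slot: the main slot.
        if MAIN_ROLE not in self.slots:
            raise ValueError("every revision needs a main slot")

    def get_content(self, role: str = MAIN_ROLE) -> Content:
        # Callers that do not name a slot get the main slot.
        return self.slots[role]

    def roles(self) -> List[str]:
        # Slots have no intrinsic order; sorting only makes output stable.
        return sorted(self.slots)

    def same_content_as(self, other: "Revision") -> bool:
        # Equality depends only on slot content, not on the revision ID.
        return self.slots == other.slots
```

Code that is unaware of slots keeps working by calling get_content() without arguments, which mirrors the backwards compatibility role of the main slot.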

Content Meta-Data Service (DAO)
For accessing the content of the different slots of a revision, a RevisionSlotLookup service is defined as described below. For a description of the storage interface for retrieving actual content data and listing available slots per revision, see Revision Retrieval.

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also T142980)

TBD: the code that saves the slots of a revision needs to know the parent revision, so it can copy or link any content from the parent revision that was not changed in the present edit. The caller (WikiPage or Revision) typically already knows the parent revision ID, so we could pass it in, but that would pollute the interface, leaking implementation details to allow optimization.

Initial Implementation
The initial implementation of RevisionSlotLookup would just be a refactoring of the current functionality. No schema changes are needed in the database. Only the main slot is supported. (Code experiment: https://gerrit.wikimedia.org/r/#/c/246433/6)

Implementation steps:
 * Move storage layer code for accessing revision content from Revision into RevisionSlotLookup.
 * Change Revision to use a RevisionSlot to access revision content.
 * The initial implementation of RevisionSlotLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

Once the application code uses the new interface for accessing revision content, a new implementation can be created that uses the new database schema described below. For the migration period, we will need an implementation that can write to the old and the new schema at once.
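
The migration-period implementation mentioned above could look roughly like the following sketch. It is a Python illustration under stated assumptions: the WriteMode values, the MigratingSlotStore class, and the use of plain dictionaries in place of database tables are all hypothetical.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class WriteMode(Enum):
    """Hypothetical configuration values for the migration period."""
    OLD_ONLY = 1
    BOTH = 2
    NEW_ONLY = 3


class MigratingSlotStore:
    """Writes main-slot meta-data to the old schema, the new one, or both."""

    def __init__(self, mode: WriteMode) -> None:
        self.mode = mode
        # Stand-ins for database tables: the old schema is keyed by
        # revision ID, the new one by (revision ID, role).
        self.old_rows: Dict[int, str] = {}
        self.new_rows: Dict[Tuple[int, str], str] = {}

    def save_main_slot(self, rev_id: int, address: str) -> None:
        if self.mode in (WriteMode.OLD_ONLY, WriteMode.BOTH):
            self.old_rows[rev_id] = address
        if self.mode in (WriteMode.BOTH, WriteMode.NEW_ONLY):
            self.new_rows[(rev_id, "main")] = address

    def load_main_slot(self, rev_id: int) -> Optional[str]:
        # Prefer the new schema, fall back to the old one for rows
        # that have not been migrated yet.
        new = self.new_rows.get((rev_id, "main"))
        return new if new is not None else self.old_rows.get(rev_id)
```

Switching the mode from OLD_ONLY to BOTH and finally to NEW_ONLY corresponds to the phases of the Migration Plan.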

Database Schema
Please refer to the Multi-Content Revisions/Glossary for a description of the entities modeled by this schema!

In order to allow multiple content objects per revision, the revision table needs to be split, so that the information about the content lives in a separate table, while information about the revision as such stays where it is. The structure of this table was discussed in the Create a content meta-data table RFC, in a somewhat different context.

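Introducing the content and slots tables is the core feature needed for multi-content revision support. In terms of entities and relations, it allows revision content to be modeled as follows:

Old schema:
 [page] --page_current--> [revision] --rev_text_id--> [text] --old_text--> (external)

New schema, with the content table only:
 [revision] <--cont_revision-- [content] --cont_address--> (text|external)

New schema, with the slots table allowing content rows to be re-used:
 [revision] <--slot_revision-- [slots] --slot_content--> [content] --cont_address--> (text|external)

In all cases the revision row still refers back to its page via rev_page. With the slots table, the content row keeps cont_revision to record which revision introduced it.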

The table structure is as follows:

(Code experiment: https://gerrit.wikimedia.org/r/#/c/302056/ and http://sqlfiddle.com/#!9/0b847/7)

A role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). It would reference a content_roles table defined in the same way as the content_models and content_formats tables proposed in T105652 and T142980. [TBD: instead, content_models, content_formats, and content_roles could be a single table, mapping arbitrary names to integers for compact storage].

A content row can be uniquely identified by its revision and its role - that is, there can only be one content object per role in each revision.

In some cases, it may be sufficient to access the content (or slots, see below) table, and bypass the revision table completely. For instance, content (resp. slots) can be joined directly against the page table's reference to the current revision to find the content relevant for showing the current revision of a page.

When joining in this way, the role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.

The auto-increment ID field of the content table is not strictly necessary unless we want to re-use content rows, see below. But being able to identify specific content with a unique ID seems like a good idea in any case.

TBD: should it be, or  ?

TBD: do we need per-content hash and size fields? It seems we do need them to be able to calculate rev_sha1 and rev_len for a new revision, without loading the full content of unchanged slots.

TBD: instead of having separate content_models, content_formats, and content_roles tables, we could just as well have a single name table.

Re-using Content Rows
If we assume that it will be common for one stream of a page to be edited frequently (e.g. the main stream), while other streams are only updated rarely (e.g. categories), it seems inefficient to create rows in the content table for every slot of every revision. Instead, we want to re-use a row in the content table for multiple rows in the revision table. To allow this, we can introduce another table that records the association of content with revisions: the slots table (which had a different name in earlier versions of this proposal). The idea is that instead of relying on cont_revision (an n:1 relationship), we use a separate table to allow an n:m relationship.

Note that we still need the cont_revision field in the content table to track which revision introduced a specific bit of content.

Also note that multiple rows in the content table may refer to the same blob (that is, they have the same value in cont_address). So with this approach, there are two levels of indirection that allow re-use: revision -> content (n:m), and content -> blob (n:1).
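
The re-use of content rows can be tried out with a small SQLite sketch. It is an illustration only, not the proposed table definition: cont_revision, cont_address, slot_revision, and slot_content are taken from this page, while cont_id, cont_role, the column types, and the blob addresses are assumptions made for the example.

```python
import sqlite3


def build_example() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE content (
            cont_id       INTEGER PRIMARY KEY,  -- assumed name
            cont_revision INTEGER NOT NULL,     -- revision that introduced it
            cont_role     TEXT NOT NULL,        -- assumed name; text for brevity
            cont_address  TEXT NOT NULL
        );
        CREATE TABLE slots (
            slot_revision INTEGER NOT NULL,
            slot_content  INTEGER NOT NULL REFERENCES content (cont_id)
        );
    """)

    def add_content(rev: int, role: str, address: str) -> int:
        cur = db.execute(
            "INSERT INTO content (cont_revision, cont_role, cont_address)"
            " VALUES (?, ?, ?)", (rev, role, address))
        return cur.lastrowid

    def link(rev: int, cont_id: int) -> None:
        db.execute("INSERT INTO slots VALUES (?, ?)", (rev, cont_id))

    # Revision 1 introduces both slots.
    main1 = add_content(1, "main", "blob-main-1")
    cats1 = add_content(1, "categories", "blob-cat-1")
    link(1, main1)
    link(1, cats1)

    # Revision 2 only edits the main slot; the categories row is re-used.
    main2 = add_content(2, "main", "blob-main-2")
    link(2, main2)
    link(2, cats1)
    db.commit()
    return db


def slots_of(db: sqlite3.Connection, rev_id: int):
    """Return (role, address, introducing revision) for each slot."""
    return db.execute(
        "SELECT c.cont_role, c.cont_address, c.cont_revision"
        " FROM slots s JOIN content c ON c.cont_id = s.slot_content"
        " WHERE s.slot_revision = ? ORDER BY c.cont_role", (rev_id,)
    ).fetchall()
```

In this sketch two revisions with two slots each need four rows in slots but only three rows in content, and cont_revision still shows that the categories content of revision 2 was introduced by revision 1.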

An alternative design for associating content rows with revision rows (TBD):

With this design, the role field would be removed from the content table, and would be maintained as slot_role in the slots table instead. This is nicer semantically, since the role is really a property of the connection between content and revision, and the same content could in theory be re-used in different roles. However, in practice, it seems unlikely that the same content could be re-used in different roles, and we would be repeating the role information for all slots in each revision, even for slots that were not modified. On the other hand, this setup allows us to enforce the constraint that each revision must only have one content for each role via a unique index. Without the slot_role field, this constraint cannot be enforced directly in the database.
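
The unique index made possible by slot_role can be demonstrated with SQLite. This is a sketch under assumptions: the index name and column types are invented for the example, and slot_role is stored as text here instead of an integer referencing a role table.

```python
import sqlite3


def make_slots_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE slots (
            slot_revision INTEGER NOT NULL,
            slot_role     TEXT NOT NULL,   -- text for brevity in this sketch
            slot_content  INTEGER NOT NULL
        );
        -- One content object per role in each revision.
        CREATE UNIQUE INDEX slot_revision_role
            ON slots (slot_revision, slot_role);
    """)
    return db


def try_add_slot(db: sqlite3.Connection, rev: int, role: str,
                 content_id: int) -> bool:
    """Insert a slot row; return False if the constraint rejects it."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO slots VALUES (?, ?, ?)",
                       (rev, role, content_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

A second "main" slot for the same revision is rejected by the database itself, while the same content row can still be linked from several revisions.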

Multi-Content Archive
In order to allow multi-content revisions to be deleted and restored, content can no longer be stored directly in the archive table. Instead, rows in the content and slots tables will remain unchanged when a revision (or page) is deleted. They remain accessible via the ar_rev_id field.

Full migration to MCR requires the ar_text and ar_flags fields to be copied to the text table (or, if we can integrate ExternalStore immediately, into cont_address). This also requires any legacy rows in the archive table that have no ar_rev_id set to be initialized.

However, as an interim solution, it would be possible to set ar_text and ar_flags to NULL for new rows in the archive table, relying on the information in the content and slots tables, while old archive entries continue to use ar_text and ar_flags directly.

Removing Redundant Information
Once we have the content table and have migrated existing data, the columns of the old tables that hold content meta-data are redundant and can be dropped. Since the old way to store content_model and content_format was rather inefficient, we should free some space by doing this (even though the vast majority of these fields are currently NULL).

The following fields in the revision table do not become redundant, since they act as summary fields for all content objects of the revision:
 * rev_len: the sum of the (logical) sizes of the content objects of the revision's slots.
 * rev_sha1: the aggregated hash of the content objects of the revision's slots, i.e. sha1( content3, sha1( content2, sha1( content1 ) ) ). This way, the revision hash for a revision with only a single slot is the same as the slot's hash.

The purpose of these fields is to compare revisions, which would not be possible in an efficient way if they were removed: rev_len is used to indicate how much an edit added or removed, while rev_sha1 can be used to check whether two revisions have the same content (e.g. to identify edits that restored an old revision).

Note that rev_len and rev_sha1 would not cover any derived slots, if support for such was added.
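
The aggregation of rev_sha1 and rev_len can be sketched as follows. This is one possible reading of the nested formula, not a specification: it assumes that slot hashes (not raw content) are chained, that slots are processed in order of role name to get a deterministic result, that hashes are hex strings, and that the logical size is the byte length. The function names are hypothetical.

```python
import hashlib
from typing import Dict


def slot_hash(data: bytes) -> str:
    """SHA-1 of a single slot's content, as a hex string."""
    return hashlib.sha1(data).hexdigest()


def revision_hash(slots: Dict[str, bytes]) -> str:
    """Chain the slot hashes: sha1(h3 + sha1(h2 + h1)) for three slots."""
    if not slots:
        raise ValueError("a revision has at least the main slot")
    # Slots have no intrinsic order, so sort by role name (an assumption)
    # to make the result independent of insertion order.
    roles = sorted(slots)
    acc = slot_hash(slots[roles[0]])
    for role in roles[1:]:
        acc = hashlib.sha1((slot_hash(slots[role]) + acc).encode("ascii")).hexdigest()
    return acc


def revision_length(slots: Dict[str, bytes]) -> int:
    """Sum of the sizes of all slots."""
    return sum(len(data) for data in slots.values())
```

With a single slot the loop body never runs, so the revision hash equals the slot hash, which is the property that keeps existing rev_sha1 values valid for legacy revisions.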

Scalability
One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example, based on the numbers on Size of Wikipedia and site statistics of en.wikipedia.org.

Note that we assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average. For the extrapolation into the future, a doubling time of 8 years is assumed for the x2 and x4 factors, and a linear growth of 15k pages/day is assumed for the +10 Million column.

Efficiency
Since we will have one entry per revision and stream (resp. slot) in the slots table (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two or three integers per row), this will hopefully not be much of a problem. Since we will have one entry in the content table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old revision table. The content table is rather broad, since it contains the blob address and hash fields.
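
The row counts implied by these averages can be estimated with a few lines of arithmetic. The sketch below assumes that legacy revisions are converted as single-slot revisions (one slots row and one content row each); the function name and the input numbers in the test are hypothetical and are not taken from the site statistics.

```python
from typing import Dict


def estimate_rows(legacy_revisions: int, new_revisions: int,
                  streams_per_page: float = 3.0,
                  slots_touched_per_edit: float = 1.5) -> Dict[str, int]:
    """Rough row counts per table under the averages assumed on this page."""
    return {
        "revision": legacy_revisions + new_revisions,
        # One slots row per revision and stream.
        "slots": round(legacy_revisions + new_revisions * streams_per_page),
        # One content row per revision and slot actually touched.
        "content": round(legacy_revisions + new_revisions * slots_touched_per_edit),
    }
```

For every new multi-content revision the slots table grows by about three narrow rows, while the broader content table grows by about one and a half rows.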

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer ID pointing to the text table. Considering however that with External Storage, we are already storing these URLs in the text table for each revision, which we will not do with the new scheme, the new scheme should not need much more space for a single-slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema. For a single-slot revision, the following fields are the same:
 * The content hash is the same as rev_sha1 if there is only one slot. This is probably the biggest concern. Perhaps a more compact storage format can be used for the hash.
 * The content size is the same as rev_len if there is only one slot.
 * is the same as  if there is only one slot.

Additionally, there are some fields added that act as foreign keys, which introduces overhead:
 * resp.

Some fields contain truly new information:
 * cont_address (Note: with External Storage enabled, this isn't a new field, the data just gets moved from the text table)

Since cont_address is the "heaviest" field the new scheme introduces, it is worth considering how it will behave for legacy revisions. When converting legacy revisions, cont_address will be set to an address that points to a row in the text table (if we are desperate to save bits, we can use a more compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere. This ES URL can be moved to cont_address during migration (or later, on the fly), and the now redundant row in the text table can be deleted. Note that any relevant flags from the text row need to be encoded in the address.

Optimization
''Note that the optimizations discussed below are not part of the MCR proposal. They should be investigated and discussed separately. They are mentioned here in order to provide context to the discussion of both issues, multi-content-revisions and revision storage optimization. For instance, schema changes needed for optimization could be combined with the schema changes that are part of the Migration Plan described below.''

The proposed schema offers some potential for optimization, which would make the schema less clean and obvious, but could improve performance.

Split cont_address: To reduce the storage requirements for cont_address, it would be possible to split off separate fields for common prefixes and suffixes, which would be stored as integer IDs, referencing a table that contains the actual strings, just like content_roles does for roles. For example, an address could be split into a prefix, a middle part, and a suffix, which would be saved as e.g. ( 834234, middle part, 787432 ).

Avoid rev_user_text: With the redundant content columns gone, revision rows are already smaller. They could be further reduced by setting rev_user_text to NULL if rev_user is not 0. rev_user_text could be removed completely if rev_user was replaced with a 128 bit binary: this could be used to represent IPv4 and IPv6 addresses directly, as well as to refer to user accounts using their internal ID encoded in a reserved IPv6 address range.
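
The 128 bit representation can be sketched with Python's standard ipaddress module. The reserved range below (fd00::/64) is an arbitrary placeholder chosen for this example, since no range is named here; encoding IPv4 addresses as IPv4-mapped IPv6 addresses is likewise an assumption, and the function names are hypothetical.

```python
import ipaddress
from typing import Tuple, Union

# Placeholder range for account IDs; the real range is undecided.
RESERVED = ipaddress.IPv6Network("fd00::/64")
_BASE = int(RESERVED.network_address)


def encode_ip(text: str) -> bytes:
    """Encode an IPv4 or IPv6 address as 16 bytes."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # IPv4-mapped IPv6 address: ::ffff:a.b.c.d
        addr = ipaddress.IPv6Address((0xFFFF << 32) | int(addr))
    return addr.packed


def encode_user(user_id: int) -> bytes:
    """Encode an account ID inside the reserved range."""
    if not 0 < user_id < RESERVED.num_addresses:
        raise ValueError("user ID out of range")
    return (_BASE + user_id).to_bytes(16, "big")


def decode(raw: bytes) -> Tuple[str, Union[int, str]]:
    """Return ("user", id) or ("ip", address text)."""
    addr = ipaddress.IPv6Address(raw)
    if addr in RESERVED:
        return ("user", int(addr) - _BASE)
    if addr.ipv4_mapped is not None:
        return ("ip", str(addr.ipv4_mapped))
    return ("ip", str(addr))
```

Every value is exactly 16 bytes wide, which is what makes a fixed width column possible.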

Factor out rev_comment: The revision comment is potentially very wide. Since it is not needed in all scenarios, it could be factored out into a separate table. Since revision comments often contain common prefixes, it may also be possible to split them in the same way suggested for cont_address above.

Fixed width revision rows: With the above optimizations of rev_user_text and rev_comment, and with the redundant content columns gone, the only remaining variable width field is rev_sha1, which could easily be converted to a fixed width field.

Partitioning: The tables discussed above could benefit from partitioning. Further investigation is needed to determine which criterion would be useful for partitioning, and what the goal of such a partitioning should be. Should the number of partitions used per query be minimized (improve locality) or maximized (improve parallelism)? Assuming we want to optimize access to "hot" data versus the access to "stale" data, partitioning by blocks of the row ID would be the obvious choice. For optimizing locality, partitioning by modulo of the page ID would be an option, since it keeps information about all revisions of a page in a single partition.
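
The two partitioning criteria can be compared with a tiny sketch. The function names, the block size, and the number of partitions are hypothetical; the sketch assumes that block partitioning keys on an auto-increment row ID and that modulo partitioning keys on the page ID.

```python
def partition_by_block(row_id: int, block_size: int) -> int:
    """Recent ("hot") rows end up together in the highest partitions."""
    return row_id // block_size


def partition_by_modulo(page_id: int, partitions: int) -> int:
    """All rows of one page end up in the same partition."""
    return page_id % partitions
```

With a block size of one million, rows 2,000,000 to 2,999,999 share partition 2 regardless of page, whereas with modulo partitioning every revision of page 17 lands in the same partition however old it is.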

Migration Plan
This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables
The following tables need to be created:
 * content
 * slots
 * content_models
 * content_formats
 * content_roles

Phase I: Fix Legacy Archive Rows
Populate empty ar_rev_id fields:
 * Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
 * Reserve m (or m+k, for good measure) IDs in the revision table:
 * Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
 * Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
 * For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.
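
The ID reservation steps above can be sketched against SQLite. This is an illustration under assumptions, not the migration script: the tables are reduced to the columns needed here, ar_id is assumed to be the primary key of archive, and the function names are hypothetical. SQLite's AUTOINCREMENT is used so that the counter stays bumped after the placeholder row is deleted.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY AUTOINCREMENT);
        CREATE TABLE archive (ar_id INTEGER PRIMARY KEY, ar_rev_id INTEGER);
        INSERT INTO revision (rev_id) VALUES (1), (2), (3);
        INSERT INTO archive (ar_id, ar_rev_id) VALUES (1, 5), (2, NULL), (3, NULL);
    """)
    return db


def fix_legacy_archive_rows(db: sqlite3.Connection, k: int = 0) -> int:
    """Give every archive row without ar_rev_id a reserved, unique ID."""
    # m: number of archive rows that need an ID.
    m = db.execute(
        "SELECT COUNT(*) FROM archive WHERE ar_rev_id IS NULL").fetchone()[0]
    if m == 0:
        return 0
    # b: highest ID in use in either table.
    b = db.execute(
        "SELECT MAX(x) FROM (SELECT MAX(rev_id) AS x FROM revision"
        " UNION ALL SELECT MAX(ar_rev_id) FROM archive)").fetchone()[0] or 0
    with db:
        # Bump the auto-increment counter past the reserved range.
        db.execute("INSERT INTO revision (rev_id) VALUES (?)", (b + m + k,))
        db.execute("DELETE FROM revision WHERE rev_id = ?", (b + m + k,))
        # Hand out IDs b+1 .. b+m programmatically.
        rows = db.execute(
            "SELECT ar_id FROM archive WHERE ar_rev_id IS NULL"
            " ORDER BY ar_id").fetchall()
        for offset, (ar_id,) in enumerate(rows, start=1):
            db.execute("UPDATE archive SET ar_rev_id = ? WHERE ar_id = ?",
                       (b + offset, ar_id))
    return m
```

On a production database the assignment would be chunked rather than done in a single transaction.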

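Make ar_text and ar_flags unused:
 * For each row in archive that has a non-null ar_text field, insert a row into the text table, copying ar_text to old_text and ar_flags to old_flags. Set ar_text_id to the old_id from the newly created text row.
 * Set ar_text and ar_flags to the empty string everywhere.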

Phase II: Population

 * Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
 * Wait a bit and watch for performance issues caused by writing to the new table.
 * Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
 * Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:
 * Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
 * For each such row, construct a corresponding row for the content and slots tables[***][****]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
 * The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId function in order to get a numeric ID for content models, formats, and roles.
 * When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
 * Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.
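
The chunked population loop described above can be sketched with SQLite. This is a simplified illustration, not populateContentTable.php: the tables carry only the columns needed here, cont_id and the "text:" address prefix are assumptions, and the function names are hypothetical.

```python
import sqlite3


def make_db(revisions: int) -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_text_id INTEGER);
        CREATE TABLE content (cont_id INTEGER PRIMARY KEY,
                              cont_revision INTEGER, cont_address TEXT);
        CREATE TABLE slots (slot_revision INTEGER, slot_content INTEGER);
    """)
    db.executemany("INSERT INTO revision VALUES (?, ?)",
                   [(i, 100 + i) for i in range(1, revisions + 1)])
    db.commit()
    return db


def populate_chunk(db: sqlite3.Connection, n: int) -> int:
    """Migrate up to n revisions that have no content row yet."""
    rows = db.execute(
        "SELECT r.rev_id, r.rev_text_id FROM revision r"
        " WHERE NOT EXISTS (SELECT 1 FROM content c"
        "                   WHERE c.cont_revision = r.rev_id)"
        " ORDER BY r.rev_id LIMIT ?", (n,)).fetchall()
    with db:  # one transaction per chunk
        for rev_id, text_id in rows:
            cur = db.execute(
                "INSERT INTO content (cont_revision, cont_address)"
                " VALUES (?, ?)", (rev_id, f"text:{text_id}"))
            db.execute("INSERT INTO slots VALUES (?, ?)",
                       (rev_id, cur.lastrowid))
    return len(rows)


def populate_all(db: sqlite3.Connection, chunk_size: int = 2) -> int:
    """Process chunks until no unmigrated revision is left."""
    total = 0
    while True:
        done = populate_chunk(db, chunk_size)
        if done == 0:
            return total
        total += done
        # A real script would wait for replication to catch up here.
```

Because each chunk is selected with NOT EXISTS, the script can be interrupted and restarted at any point, and it skips revisions that web requests have already written to the new tables.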

The same procedure can be applied analogously to the archive table.

Phase III: Finalize

 * Set MediaWiki to read content meta-data from the new content table.
 * Set MediaWiki to not populate the ar_text and ar_flags fields.
 * Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
 * Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
 * Enable MCR support in the API and UI (as far as implemented).
 * Optional: Drop the redundant columns from the page, revision, and archive tables, see Removing Redundant Information above. Schema changes desired for revision storage optimization may be applied at the same time.

Phase IV: Migrate External Store URLs
If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided, see External Store Integration). Then the corresponding rows can be deleted from the text table.