Multi-Content Revisions/Content Meta-Data

From MediaWiki.org
Jump to navigation Jump to search
This page is part of the MCR proposal. Status: stable draft, comments welcome.

The central component of Multi-Content-Revision support is the management of content meta-data. To allow multiple content objects to be managed per revision, part of the storage layer needs to be adapted by adding a new level of indirection between revisions and content objects. This provides a new degree of freedom to modeling the wiki content: pages can have multiple "streams" now. We go from the straight forwards

  • page -> revision -> text ( -> external store )

to a more indirect model:

  • page -> ( revision -> ) slots -> content -> ( text | external store )

The introduction of this additional layer of indirection requires a new storage interface in PHP, but more importantly it requires a database schema change and data migration described later on this page.

Note that "meta-data" in this context refers to the information needed to find, load, deserialize and interpret the content associated with a revision. It does not refer to information extracted or derived from the actual content, which is discussed as "derived content" below.

Data Model[edit]

An overview in bullet points (refer to the Glossary for a definition of the terms used):

  • Each revision has named slots, the slot names define the slot's role. Each slot may contain a content object.
  • There is always at least one slot defined: the main slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
  • Slots can be enumerated (if they are primary, i.e. not derived slots, see below). The link between a revision ID and the associated slots is maintained in the main database (see Database Schema below). Listing all slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
  • A role may be associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot however may contain any kind of content (the default model depending on the page tile, etc), and some other roles may also not require a specific model to be used.
  • Slots can be uniquely identified by revision ID and slot name.
  • There is meta-data associated with each slot. The content meta-data (content model and format, logical size, and hash) and slot meta-data (slot name, revision id).
  • Slots have no intrinsic order. Two revisions are considered equal if their (primary) slots have the same content. Two equal revisions have the same hash and length.
  • A revision's hash is aggregated from the hashes of its (primary) slots. If there is only one slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the slots.

For the future, it may prove useful to support the association of derived content with a revision.

  • A generic way to associate materialized derived content with a revision could be provided, similar to (or integrated with) the mechanism used for associating primary, user-generated content with a revision.
  • Derived content may be generated and saved ("materialized") when a revision is created, or it may be generated on the fly ("virtual content"). The link between a revision ID and associated (materialized) derived content would be stored in the main database.
  • Derived slots cannot be enumerated.

Content Meta-Data Service (DAO)[edit]

For accessing the content of the different slots of a revision, a RevisionSlotLookup service is defined as described below. For a description of the storage interface for retrieving actual content data and listing available slots per revision, see Revision Retrieval.

/**
* @todo find a better name. ContentMetaData?
**/
class RevisionContentInfo {
  public function getBlobAddress();
  public function getContentFormat();
  public function getContentModel();
  public function getLogicalSize();
  public function getHash();
}

class RevisionSlot extends RevisionContentInfo {
  public function getRevisionId();
  public function getRole();
}

interface RevisionSlotLookup {
    /**
    * @param string[] $slots The names of the slots to return. Null means all primary slots.
    **/
    function getRevisionSlots( $revisionId, array $slots = null ): RevisionSlot[];
}

interface RevisionSlotStore extends RevisionSlotLookup {
    /**
    * @param RevisionSlot[] $slots The slot data to save.
    **/
    function insertRevisionSlots( $revisionId, array $slots );
}

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also phab:T142980)

TBD: insertRevisionSlots needs to know the parent revision, so it can copy or link any content from the parent revision that was not changed in the present edit. The caller (WikiPage or Revision) typically already knows the parent revision ID, so we could pass it in, but that would pollute the interface, leaking implementations to allow optimization.

Initial Implementation[edit]

The initial implementation of RevisionSlotLookup would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot is supported. Implementation steps:

  • Move storage layer code for accessing revision content from Revision into RevisionSlotLookup.
  • Change Revision to use a RevisionSlot to access revision content.
  • The initial implementation of RevisionSlotLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

(Code experiment: https://gerrit.wikimedia.org/r/#/c/246433/6)

Once the application code uses the new interface for accessing revision content, a new implementation can be created that uses the new database schema described below. For the migration period, we will need an implementation that can write to the old and the new schema at once.

Database Schema[edit]

See Multi-Content Revisions/Database Schema for the database schema design!

Scalability[edit]

One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example, based on the numbers on Size of Wikipedia and site statistics of en.wikipedia.org:

Description Assumed value

when introducing MCR

Extrapolated value

for +10 million pages

(~2 years)

Extrapolated value

for x2 pages

(~8 years)

Extrapolated value

for x4 pages

(~16 years)

Extr. value without MCR

and x4 pages

p page rows 50 million 60 million 100 million 200 million 200 million
n revision rows (avg 20 per page) 1000 million 1200 million 2000 million 4000 million 4000 million
t text rows (with External Store) 1000 million 1000 million (legacy only) 1000 million (legacy only) 1000 million (legacy only) 4000 million
k avg slots per revision 1 by design 1.33 incl. legacy,

3 for new revisions

2 incl. legacy,

3 for new revisions

2.5 incl. legacy,

3 for new revisions

1
e avg slots changed per edit 1 by design 1.08 incl. legacy,

1.5 for new edits

1.2 incl. legacy,

1.5 for new edits

1.375 incl. legacy,

1.5 for new edits

1
c content rows

= n for old + n * e for new edits

1000 million

Most heavy: cont_hash
and cont_address

1300 million 2500 million 5500 million 4000 million

(as part of the revision row)

r slots rows

= n for old + n * k for new edits

1000 million 1600 million 4000 million 10,000 million -

Note that we assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average. For the extrapolation into the future, a doubling time of 8 years is assumed for the x2 and x4 factors, and a linear growth of 15k pages/day is assumed for the +10 Million column.

Efficiency[edit]

Since we will have one entry per revision and stream (resp slot) in slots (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two or three integers per row), this will hopefully not be much of a problem. Since we will have one entry in the content table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old revision table. The content table is rather broad, since it contains the cont_address VARCHAR(255) and cont_hash VARCHAR(32) fields.

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer id pointing to text table. Considering however that with External Storage, we are already storing these URLs now in the text table for each revision, which we will not do with the new scheme, the new scheme should not need much more space for a single slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema. For a single-slot revision, the following fields are the same:

  • cont_hash VARCHAR(32) is the same as rev_sha1 VARCHAR(32) if there is only one slot. This is probably the biggest concern. Perhaps a more compact storage of the hash can be used like a BINARY(20) field as suggested in [1].
  • cont_logical_size INT UNSIGNED is the same as rev_len INT UNSIGNED if there is only one slot.
  • cont_deleted TINYINT is the same as rev_deleted TINYINT if there is only one slot.

Additionally, there are some fields added that act as foreign keys, which introduces overhead:

  • cont_id BIGINT UNSIGNED
  • cont_revision INT UNSIGNED
  • slot_content BIGINT UNSIGNED
  • slot_revision INT UNSIGNED

Some fields contain truly new information:

  • cont_role SMALLINT UNSIGNED resp. slot_role SMALLINT UNSIGNED
  • cont_address VARCHAR(255) (Note: with External Storage enabled, this isn't a new field, the data just gets moved from the text table)

Since cont_address is the "heaviest" field the new scheme introduces, it is worth considering how it will behave for legacy revisions. When converting legacy revisions, cont_address will be set to an address that points to a row in the text table, e.g. "TT:123456" (if we are desperate to save bits, we can use a more compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere, something like "DB://cluster5/878675423". This ES URL can be moved to cont_address during migration (or later, on the fly), and the now redundant row in the text table can be deleted. Note that relevant old_flags need to be encoded in the address, perhaps like "DB://cluster5/878675423;utf8,gzip".

Optimization[edit]

Note that the optimizations discussed below are not part of the MCR proposal. They should be investigated and discussed separately. They are mentioned here in order to provide context to the discussion of both issues, multi-content-revisions and revision storage optimization. For instance, schema changes needed for optimization could be combined with the schema changes that are part of the Migration Plan described below.

The proposed schema offers some potential for optimization, which would make the schema less clean and obvious, but could improve performance.

Split cont_address: The reduce the storage requirements for cont_address, it would be possible to split off separate fields for common prefixes and suffixes, which would be stored as integer IDs, referencing a table that contains the actual strings, just like content_roles does for roles. For example, "DB://cluster5/878675423;utf8,gzip" could be split into "DB://cluster5/" "878675423" ";utf8,gzip" which would be saved as e.g. ( 834234, "878675423", 787432 ).

Avoid rev_user_text: with rev_content_model, rev_content_format, rev_text_id gone, revision rows are already smaller. They could be further reduced by setting rev_user_text to NULL if rev_user is not 0. rev_user_text could be removed completely If rev_user was replaced with a 128 bit binary, this could be used to represent IPv4 and IPv6 addresses directly, as well as to refer to user accounts using their internal ID encoded in a reserved IPv6 address range.

Factor out rev_comment: the revision comment is potentially very wide, varbinary(767). Since it is not needed in all scenarios, it could be factored out into a separate table. Since revision comments often contain common prefixes, it may also be possible to split them in the same way suggested for cont_address above.

Fixed width revision rows: with the above optimizations of rev_user_text and rev_comment, and with rev_content_model and rev_content_format gone, the only remaining variable width field is rev_sha1, which could easily be converted to a fixed width binary(32) or even binary(20.

Partitioning: the revision, content, and slots tables could benefit from partitioning. Further investigation is needed to determine which criterion would be useful for partitioning, and what the goal of such a partitioning should be. Should the number of partitions used per query be minimized (improve locality) or maximized (improve parallelism)? Assuming we want to optimize access to "hot" data versus the access to "stale" data[2], partitioning by blocks of rev_id resp. cont_revision would be the obvious choice. For optimizing localize, partitioning by modulo of page_id resp. rev_page would be an option, since it keeps information about all a revisions of a page in a single partition.

Migration Plan[edit]

This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables[edit]

The following tables need to be created:

  • content
  • slots
  • content_models
  • content_formats
  • content_roles

Phase I: Fix Legacy Archive Rows[edit]

Populate empty ar_rev_id fields:

  • Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
  • Reserve m (or m+k, for good measure) IDs in the revision table:
  • Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
  • Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
  • For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically.

Make ar_text and ar_flags unused:

  • For each row in archive that has a non-null ar_text field, insert a row into the text table, copying ar_text to old_text and ar_flags to old_flags. Set ar_text_id to the old_id from the newly created text row.
  • Set ar_text and ar_flags to the empty string everywhere.

Phase II: Population[edit]

See also the schema migration notes.

  • Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
  • Wait a bit and watch for performance issues caused by writing to the new table.
  • Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
  • Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:

  • Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
  • For each such row, construct a corresponding row for the content and slots table[***][****]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
  • The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId() function in order to get a numeric ID for content models, formats, and roles.
  • When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
  • Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.

The same procedure can be applied to the archive table respectively.

Phase III: Finalize[edit]

  • Set MediaWiki to read content meta-data from the new content table.
  • Set MediaWiki to not populate the ar_text and ar_flags fields.
  • Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
  • Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
  • Enable MCR support in the API and UI (as far as implemented).
  • Optional: Drop the redundant columns from the page, revision, and archive tables, see Removing Redundant Information above. Schema changes desired for revision storage optimization may be applied at the same time.

Phase IV: Migrate External Store URLs[edit]

If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided, see External Store Integration). Then the corresponding rows can be deleted from the text table.

Footnotes[edit]

[*] update.php creates the tables, populates the content table using the same code as populateContentTable.php, and then drops the redundant columns.

[**] Instead of introducing new global settings for each schema migration, $wgMigrationSettings is introduced, with fields for each migration step, as needed. The above migration could use, $wgMigrationSettings['content-meta-data/write'] and $wgMigrationSettings['content-meta-data/read'].

[***] This generally does not involve loading the content blob, except for cases where rev_hash or rev_len is NULL, and need to be computed.

[****] We could migrate text rows that contain ExternalStore URLs at this point, but it seems to be prudent to not do even more at the same time. That migration can be done on the fly, or done by script later.