Requests for comment/Multi-Content Revisions

Multi-Content Revision support (also MCR or MuCoRe) allows multiple content "slots" per revision (comprising multiple "streams" per page). Each slot has a specific role (the combination of revision and role is unique), and typically each slot has a different content model. Some slots (called "primary" slots) are user editable, while others (called "derived" slots) are maintained automatically.

As of August 2016, MCR is agreed to be desirable, but the details are still under discussion; see Requests for comment/Multi-content revisions.

Introducing MCR support means introducing a level of indirection between the revision and the content object. The concept of a wiki page changes from a sequence of revisions of a single document to a sequence of revisions of multiple documents (slots). For simplicity and compatibility, each revision must have a "main" slot that corresponds to what is now the page content.
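
The indirection can be pictured with a minimal sketch. It is written in Python rather than MediaWiki's PHP, and the names `Slot`, `Revision`, `roles`, and `get_content` are hypothetical, not part of any proposed interface. It illustrates only the two constraints stated above: a role occurs at most once per revision, and every revision carries a "main" slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One content object within a revision (hypothetical structure)."""
    role: str           # e.g. "main"; unique within a revision
    content_model: str  # e.g. "wikitext" or "json"
    text: str           # stands in for the stored content blob


class Revision:
    """A revision that holds several slots instead of a single document."""

    def __init__(self, rev_id, slots):
        self.rev_id = rev_id
        self._slots = {}
        for slot in slots:
            # revision + role must be unique
            if slot.role in self._slots:
                raise ValueError(f"duplicate role: {slot.role}")
            self._slots[slot.role] = slot
        # the "main" slot corresponds to what is now the page content
        if "main" not in self._slots:
            raise ValueError("every revision needs a 'main' slot")

    def roles(self):
        return sorted(self._slots)

    def get_content(self, role="main"):
        # Callers that do not know about slots simply get the main slot.
        return self._slots[role].text


rev = Revision(1, [
    Slot("main", "wikitext", "Hello"),
    Slot("mediainfo", "json", "{}"),
])
```

Defaulting `role` to "main" is one way to keep slot-unaware callers working unchanged.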

Introducing MCR support does not change how content blobs are stored (they continue to be treated as blobs), only how and where content meta-data (the content model, format, hash, etc.) is stored, although the blob storage interface will likely be refactored in the process. Introducing MCR also does not change how derived index information, such as the imagelinks or page_props tables, is managed and used.

Please refer to the Glossary for a clarification of the terms used in this document.

Use Cases
The ability to manage different independent content objects (documents) on the same page is useful for several things:
 * Structured "media info", especially license data, associated with a file description page, in addition to the free-form wikitext. This would replace the template on Commons.
 * Managing file upload history. Managing meta-data about uploaded files along with the wikitext would remove the need for an oldimage table. This would however require quite a bit of refactoring.
 * Article quality assessments ("needs love", "featured", etc)
 * Blame maps (as derived content) that record which part of an article was added by which user.
 * Rendered HTML (as derived content). This would require the ability to update derived content without creating a new revision. It would also require "sub-slots", since different HTML may be produced for different devices or target languages.
 * Template schema information (aka template data), instead of embedding JSON in wikitext.
 * Template styles: the CSS used by a template would live in a separate slot, but could be edited along with the template.
 * Template documentation (instead of using a subpage)
 * Gadget styles (instead of having a separate .css page)
 * Lua module documentation
 * Infobox data. Removing the infobox parameters from the wikitext allows the infobox to easily be stripped, or shown separately, or formatted according to the target device.
 * OCR text from DjVu and PDF files (as derived content)
 * Workflow state

Rationale
The above use cases indicate that there is a need for bundling different kinds of information on a single page, to allow atomic editing of the different kinds of content together, provide a shared version history, and allow the different kinds of information to be watched, protected, moved, and deleted together. That is, a revision can have multiple "slots" containing different kinds of data, each having a unique role. The requirements for each slot correspond to what is already defined in the Content interface: we need to be able to store and load, import and export, render, edit, and diff them.

To address this need, MCR introduces an indirection between revisions and content objects, changing the relationship from 1:1 to 1:n (or m:n, if content meta-data can be re-used by multiple revisions). This requires a substantial refactoring in the storage layer of MediaWiki, and some changes to the database schema. By requiring all revisions to have a "main" slot which will be used as the default in many situations, we can achieve interoperability with code that does not know about MCR.
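
The m:n variant can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the proposed schema: the `slots` association table, all column names, and the helper `slots_of` are hypothetical. The point is that a revision which changes only one slot can point to the existing content row for the others.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
CREATE TABLE content (
    content_id   INTEGER PRIMARY KEY,
    model        TEXT NOT NULL,   -- content model, e.g. wikitext
    blob_address TEXT NOT NULL    -- where the blob itself is stored
);
-- Association table: one row per (revision, role) pair.
CREATE TABLE slots (
    rev_id     INTEGER NOT NULL REFERENCES revision(rev_id),
    role       TEXT    NOT NULL,
    content_id INTEGER NOT NULL REFERENCES content(content_id),
    PRIMARY KEY (rev_id, role)    -- revision + role are unique
);
""")

conn.executemany("INSERT INTO revision VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO content VALUES (?, ?, ?)", [
    (10, "wikitext", "blob:a"),
    (11, "json", "blob:b"),
    (12, "wikitext", "blob:c"),
])
conn.executemany("INSERT INTO slots VALUES (?, ?, ?)", [
    (1, "main", 10), (1, "mediainfo", 11),
    # Revision 2 edits only the main slot and re-uses content row 11.
    (2, "main", 12), (2, "mediainfo", 11),
])


def slots_of(rev_id):
    """Return a role -> content_id mapping for one revision."""
    rows = conn.execute(
        "SELECT role, content_id FROM slots WHERE rev_id = ? ORDER BY role",
        (rev_id,),
    )
    return dict(rows.fetchall())
```

Code that is unaware of MCR would, in this sketch, always look up the row with role "main".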

An alternative approach discussed earlier was Multipart Content: multiple Content objects would be encoded into a single meta-object, a multipart content object, using a mechanism similar to the MIME-Multipart mechanism used for email attachments. This however has two major disadvantages: performance and interoperability. The performance is not optimal, since the entire content blob must always be loaded to access any part of the content. Interoperability is not optimal, since it is unclear how external interfaces like the web API should behave to be backwards compatible: should they return the entire multipart object as the revision content, or always the main part of the content? The former breaks code that manipulates the content, while the latter causes data loss with code that treats content as opaque blobs. MCR faces a similar problem, but allows the new structures to be made explicit in the API response data and request parameters, instead of requiring clients to decode nested blobs.
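
To make the performance argument concrete, the following sketch packs two parts into one MIME multipart blob using Python's standard `email` package. The helper names `pack` and `unpack` are hypothetical, and carrying the role in the `filename` parameter is an assumption made for illustration only. Note that `unpack` has to parse the whole blob even when only one part is wanted.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack(parts):
    """Encode a role -> text mapping as a single multipart blob."""
    container = MIMEMultipart()
    for role, text in parts.items():
        part = MIMEText(text, "plain", "utf-8")
        # Assumption: the role is carried as the attachment's filename.
        part.add_header("Content-Disposition", "attachment", filename=role)
        container.attach(part)
    return container.as_string()


def unpack(blob, role):
    """Return one part; the entire blob must be loaded and parsed first."""
    message = message_from_string(blob)
    for part in message.get_payload():
        if part.get_filename() == role:
            return part.get_payload(decode=True).decode("utf-8")
    raise KeyError(role)


blob = pack({"main": "Hello", "mediainfo": "{}"})
main_text = unpack(blob, "main")
```

A client that treats `blob` as the page text would see the MIME envelope rather than "Hello", which is the interoperability problem described above.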

Another advantage of MCR over the Multipart approach is flexibility. Derived content can be updated or added to revisions after the time of creation. Also, different kinds of content can use different optimized blob store mechanisms.

For these reasons, MCR was identified as the preferred approach.

Architecture

 * Content Meta-Data Storage defines a DAO layer for content meta-data. It is the preferred way to access the content table.
 * Blob Storage defines a generic mechanism for storing and retrieving arbitrary data blobs, similar to ExternalStore.
 * Transaction Management specifies a generic transaction management interface.
 * Revision Retrieval Interface defines a (lazy loading) RevisionRecord object that provides access to revision meta-data as well as the Content objects of each slot, based on the Content Meta-Data and Blob Storage interfaces.
 * Page Update Interface uses the Content Meta-Data Storage and Blob Storage components to provide a builder object for new revisions which is responsible for managing the transactional context of the update.
 * The legacy Revision object uses the Revision Retrieval Interface and Page Update Interface components to retain the legacy interface for the main slot.
 * Rendering and Parser Cache need to be aware of slots. The web cache will hold a rendering of all slots of the current revision.
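
How these components might fit together can be sketched as follows. This is a simplified Python illustration, not the planned PHP interfaces: the class names echo the components listed above, but every method name and signature is an assumption, storage is in memory, and transaction handling is reduced to a comment.

```python
class BlobStore:
    """In-memory stand-in for blob storage (hypothetical interface)."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # counts loads, to make lazy loading observable

    def store(self, data):
        address = f"mem:{len(self._blobs)}"
        self._blobs[address] = data
        return address

    def load(self, address):
        self.reads += 1
        return self._blobs[address]


class ContentMetaStore:
    """Maps (rev_id, role) to content meta-data (model, blob address)."""

    def __init__(self):
        self._rows = {}

    def put(self, rev_id, role, model, address):
        self._rows[(rev_id, role)] = {"model": model, "address": address}

    def get(self, rev_id, role):
        return self._rows[(rev_id, role)]

    def roles(self, rev_id):
        return sorted(role for (rid, role) in self._rows if rid == rev_id)


class RevisionRecord:
    """Lazy view of a revision: a blob is loaded on first access only."""

    def __init__(self, rev_id, meta, blobs):
        self.rev_id = rev_id
        self._meta = meta
        self._blobs = blobs
        self._cache = {}

    def get_roles(self):
        return self._meta.roles(self.rev_id)

    def get_content(self, role="main"):
        if role not in self._cache:
            row = self._meta.get(self.rev_id, role)
            self._cache[role] = self._blobs.load(row["address"])
        return self._cache[role]


class PageUpdater:
    """Builder that collects slots and writes them as one revision."""

    def __init__(self, meta, blobs, rev_id):
        self._meta, self._blobs, self._rev_id = meta, blobs, rev_id
        self._pending = {}

    def set_slot(self, role, model, data):
        self._pending[role] = (model, data)
        return self  # allow chaining

    def save(self):
        if "main" not in self._pending:
            raise ValueError("a revision needs a 'main' slot")
        # A real implementation would perform these writes in a transaction.
        for role, (model, data) in self._pending.items():
            address = self._blobs.store(data)
            self._meta.put(self._rev_id, role, model, address)
        return RevisionRecord(self._rev_id, self._meta, self._blobs)


blobs, meta = BlobStore(), ContentMetaStore()
rev = (PageUpdater(meta, blobs, rev_id=1)
       .set_slot("main", "wikitext", "Hello")
       .set_slot("styles", "css", "p { color: red }")
       .save())
```

In this sketch, listing the roles of `rev` touches only the meta-data store; a blob is read from `BlobStore` the first time `get_content` is called for its slot.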

External Interfaces

 * User Interface
 * Composite view, edit, diff
 * Web API
 * Dumps (Import/Export)

Development
TBD

Migration
TBD