Requests for comment/Multi-Content Revisions

Multi-Content Revision support (also MCR or MuCoRe) allows multiple content "slots" per revision (comprising multiple "streams" per page). Each slot has a specific role (revision+role are unique), and typically each slot has a different content model. Some slots (called "primary" slots) are user editable, while others (called "derived") are maintained automatically. Please refer to the Glossary for a definition of the terms used in this document.

As of August 2016, MCR is agreed to be desirable, but the details are still under discussion; see Requests for comment/Multi-content revisions.

On an abstract level, the need that drives MCR is allowing a joined history (and atomic edits) of multiple documents that are logically bound to a page. For example, a template page may have additional "slots" that contain the template documentation, associated CSS, and a schema for the parameters. With MCR, all parts of the page (logical documents or "content objects") can be edited at once, creating a single new revision in a shared page history.

On the technical level, this translates to introducing a level of indirection between the revision and the content object, so multiple content objects can be managed per page revision. The concept of a wiki page changes from a sequence of revisions of a single document to a sequence of revisions of multiple documents (streams of slots). For simplicity and compatibility, each revision must have a "main" slot that corresponds to what, without MCR, is the single page content.

Introducing MCR support does not change how content is stored (it continues to be treated as blobs, stored in the text table or External Storage), although the PHP interface for blob storage will likely need to be refactored. Introducing MCR also does not change how derived index information, such as the imagelinks or page_props tables, is managed and used. MCR does change how and where content meta-data is stored (the content model, format, hash, etc).

System Design
A top-down view of the service architecture for MCR support:
 * Application logic uses a RevisionLookup service to access revision (resp page) content. Some of the database level code currently in WikiPage and Revision will move there.
 * Application logic uses a PageUpdateController to create new revisions (to update pages). Much of the update logic currently in WikiPage will go there.
 * The existing Revision class will continue to provide read and write access to the main slot, implemented based on RevisionLookup and PageUpdateController.
 * RevisionLookup uses a RevisionSlotLookup to load revision content meta-data, and a BlobLookup to load the serialized content. A ContentLookup layer may be added between RevisionLookup and RevisionSlotLookup to support structured storage and virtual slots. Access to the revision table may be implemented directly in RevisionLookup, or abstracted into a DAO.
 * PageUpdateController uses a RevisionSlotStore to write revision content meta-data, and a BlobStore to store the serialized content. A ContentStore layer may be added between the PageUpdateController and the BlobStore, to provide structured storage. Access to the revision table may be implemented directly in PageUpdateController, or abstracted into a DAO.
 * BlobStore and BlobLookup are initially implemented based on the text table and the External Storage logic. Much of the logic in Revision that currently implements access to revision content will move here.
 * Note that BlobLookup and BlobStore will typically be implemented by a single class, as will RevisionSlotLookup and RevisionSlotStore.
More detailed design documents can be found on the following sub-pages:
 * Content Meta-Data Storage defines a DAO layer for content meta-data. It is the preferred way to access the content table.
 * Blob Storage defines a generic mechanism for storing and retrieving arbitrary data blobs, similar to ExternalStore.
 * Transaction Management specifies a generic transaction management interface.
 * Revision Retrieval Interface defines a (lazy loading) RevisionRecord object that provides access to revision meta-data as well as the Content objects of each slot, based on the Content Meta-Data and Blob Storage interfaces.
 * Page Update Interface uses the Content Meta-Data Storage and Blob Storage components to provide a builder object for new revisions which is responsible for managing the transactional context of the update.
 * The legacy Revision object uses the Revision Lookup and Page Updater components to retain the legacy interface for the main slot.
 * Rendering and Parser Cache need to be aware of slots. The web cache will hold a rendering of all slots of the current revision.
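
The read path described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the real components would be PHP services), and every class and method signature below is an assumption: only the component names RevisionLookup, RevisionSlotLookup/RevisionSlotStore and BlobLookup/BlobStore come from the design. The sketch shows the indirection: slot meta-data is resolved first, then the blob address is used to load the serialized content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRecord:
    """Content meta-data for one slot (hypothetical shape)."""
    role: str
    content_model: str
    blob_address: str


class InMemoryBlobStore:
    """One class playing both the BlobStore and BlobLookup roles."""

    def __init__(self):
        self._blobs = {}

    def store(self, data: bytes) -> str:
        address = f"mem:{len(self._blobs) + 1}"
        self._blobs[address] = data
        return address

    def load(self, address: str) -> bytes:
        return self._blobs[address]


class InMemorySlotStore:
    """One class playing both the RevisionSlotStore and RevisionSlotLookup roles."""

    def __init__(self):
        self._slots = {}  # rev_id -> {role: SlotRecord}

    def save(self, rev_id: int, record: SlotRecord) -> None:
        self._slots.setdefault(rev_id, {})[record.role] = record

    def get(self, rev_id: int, role: str) -> SlotRecord:
        return self._slots[rev_id][role]


class RevisionLookup:
    """Resolves slot meta-data first, then loads the blob it points to."""

    def __init__(self, slot_lookup, blob_lookup):
        self._slots = slot_lookup
        self._blobs = blob_lookup

    def get_content(self, rev_id: int, role: str = "main") -> bytes:
        # Code that does not name a slot falls back to the "main" slot.
        record = self._slots.get(rev_id, role)
        return self._blobs.load(record.blob_address)


blobs = InMemoryBlobStore()
slots = InMemorySlotStore()
slots.save(1, SlotRecord("main", "wikitext", blobs.store(b"Hello")))
slots.save(1, SlotRecord("styles", "css", blobs.store(b"p { margin: 0 }")))

lookup = RevisionLookup(slots, blobs)
main_content = lookup.get_content(1)
styles_content = lookup.get_content(1, "styles")
```

Defaulting the role to "main" mirrors the compatibility rule that the legacy Revision class only ever sees the main slot.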

External Interface Compatibility
Support for multi-content revisions will have to be integrated into the external interfaces (both UI and API).
 * User Interface
 * Views (view, edit, diff, etc)
 * Web API
 * Dumps (Import/Export)

Content Model
An overview in bullet points (refer to the Glossary for a definition of the terms used):
 * Each revision has named slots; the slot name defines the slot's role. Each slot may contain a content object.
 * There are three types of slots (resp. content):
 * primary: user created content that can be edited (wikitext, etc)
 * derived: information derived from the user created content and stored along with it (e.g. diffs, blame map, rendered HTML)
 * virtual: information derived from the user created content on the fly and not stored. Virtual slots are always also "derived".
 * There is always at least one primary slot defined: the "main" slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
 * Primary slots can be enumerated. The link between a revision ID and the associated primary slots is maintained in the main database. Listing all primary slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
 * Derived slots cannot be enumerated. The link between a revision ID and the associated derived slots may be stored in the database, or in some other place, or may be purely programmatic.
 * Each role (except "main") is typically associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot however may contain any kind of content, and some other roles may also not require a specific model to be used.
 * The content model to use for the main slot is determined based on the page title (namespace, extension), just like we already do.
 * Slots can be uniquely identified by revision ID and slot name. Streams can be referenced by page title(+namespace) and slot name.
 * TBD: we may need a "closed" representation in a single string, for use in wiki links. For streams, we could use "namespace:title##slot" or "namespace:title||slot".
 * Which slots are available for a given page may be configured per namespace, or per content model of the main slot (which in turn would typically be based on the namespace). The details of this are TBD.
 * There is meta-data associated with each slot (see Content Meta-Data). At least:
 * revision-id
 * content-model
 * logical size
 * blob address
 * serialization format
 * hash
 * Two revisions are considered equal if their primary slots have the same content. Two equal revisions have the same hash and length.
 * A revision's hash is aggregated from the hashes of its primary slots. If there is only one primary slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the primary slots.
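
The hash and length rules above can be sketched as follows. This is an illustration in Python, not a proposed implementation: the Slot class, the use of SHA-1, and the aggregation scheme (hashing the sorted "role:hash" pairs) are all assumptions, since the text only requires that the revision hash be aggregated from the primary slot hashes and equal the slot hash when there is a single primary slot.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    role: str            # slot name, e.g. "main"
    content_model: str   # e.g. "wikitext"
    data: bytes          # serialized content, standing in for the blob
    primary: bool = True

    @property
    def size(self) -> int:
        # Logical size; here simply the length of the serialized content.
        return len(self.data)

    @property
    def hash(self) -> str:
        return hashlib.sha1(self.data).hexdigest()


def revision_hash(slots) -> str:
    """Aggregate the hashes of the primary slots (the scheme is an assumption)."""
    primary = sorted((s for s in slots if s.primary), key=lambda s: s.role)
    if len(primary) == 1:
        # A single primary slot: the revision hash is the slot hash.
        return primary[0].hash
    joined = "\n".join(f"{s.role}:{s.hash}" for s in primary)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def revision_size(slots) -> int:
    """Sum of the logical sizes of the primary slots; derived slots are ignored."""
    return sum(s.size for s in slots if s.primary)


main = Slot("main", "wikitext", b"Hello")
cats = Slot("categories", "categories", b'["Greetings"]')
html = Slot("html", "html", b"<p>Hello</p>", primary=False)

single = revision_hash([main, html])   # derived "html" slot does not count
combined = revision_hash([main, cats])
total = revision_size([main, cats, html])
```

Sorting by role before aggregating makes the result independent of slot order, so two revisions with the same primary content compare equal regardless of how the slots were enumerated.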

Page Update Process
See details at Page Update Controller.
 * When a page is edited, the content of at least one primary slot is updated. It does not matter whether a slot with the same role existed in the previous revision.
 * One edit (user interaction) creates one revision, regardless of how many slots were updated (see also Content Meta-Data).
 * Unchanged content of primary slots is re-used from previous revisions. E.g.:
 * Revision 1 has two slots, A and B, with content ( A1, B1 )
 * Now, slot A is edited, but slot B is untouched. Then revision 2 is ( A2, B1 ). That is, slot B in revision two is the same content as slot B of revision 1.
 * Derived slots are re-calculated when primary slots change (similarly to how we already handle secondary data updates like LinksUpdate, and pre-cached data like ParserOutput).
 * TBD: should the ContentHandler of the primary slot determine which derived content is generated? If not, what component is responsible for this decision?
 * Derived content of a revision can be updated without creating a new revision.


 * SecondaryDataUpdates are created and executed for all content objects of a revision, old or new, primary or derived.
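
The slot re-use rule from the example above can be sketched in a few lines of Python. The function name and the representation of a revision as a mapping from role to content address are assumptions made for illustration only.

```python
def build_revision(parent: dict, updated: dict) -> dict:
    """Return the slot map of a new revision.

    parent:  role -> content address in the previous revision
    updated: role -> content address for the slots changed by this edit
    """
    if not updated:
        raise ValueError("an edit must update at least one primary slot")
    new_revision = dict(parent)    # untouched slots are carried over by reference
    new_revision.update(updated)   # edited or newly added slots replace them
    return new_revision


rev1 = {"A": "A1", "B": "B1"}
rev2 = build_revision(rev1, {"A": "A2"})           # slot B is re-used
rev3 = build_revision(rev2, {"C": "C1"})           # a role that did not exist before
```

One call produces exactly one new revision, however many slots appear in the update. Removing a slot is not covered by this sketch.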

Use Cases
The ability to manage different independent content objects (documents) on the same page is useful for several things (given in no particular order):
 * Manage categories as structured data. Compare T87686.
 * Structured "media info", especially license data, associated with a file description page, in addition to the free form wikitext. This would replace the template on commons.
 * Managing file upload history. Managing meta-data about uploaded files along with the wikitext would remove the need for an oldimage table (see Requests for comment/image and oldimage tables). This would however require quite a bit of refactoring.
 * Article quality assessments ("needs love", "featured", etc)
 * Blame maps (as derived content) that record which part of an article was added by which user.
 * Rendered HTML (as derived, possibly virtual "on-the-fly" content). This would require the ability to update derived content without creating a new revision, on "events" like a template update. It would also require "sub-slots", since different HTML may be produced for different devices or target languages.
 * Template schema information (aka template data), instead of embedding XML in wikitext.
 * Template styles: the CSS used by a template would live in a separate slot, but could be edited along with the template.
 * Template documentation (instead of using a subpage)
 * Gadget styles (instead of having a separate .css page)
 * Lua module documentation
 * Infobox data. Removing the infobox parameters from the wikitext allows the infobox to easily be stripped, or shown separately, or formatted according to the target device.
 * OCR text from DjVu and PDF files (as derived content)
 * Workflow state

Rationale
The above use cases indicate that there is a need for bundling different kinds of information on a single page, to allow atomic editing of the different kinds of content together, provide a shared version history, and allow the different kinds of information to be watched, protected, moved, and deleted together. That is, a revision can have multiple "slots" containing different kinds of data, each having a unique role. The requirements for each slot correspond to what is already defined in the Content interface: we need to be able to store and load, import and export, render, edit, and diff them.

To address this need, MCR introduces an indirection between revisions and content objects, changing the relationship from 1:1 to 1:n (or m:n, if content meta-data can be re-used by multiple revisions). This requires a substantial refactoring in the storage layer of MediaWiki, and some changes to the database schema. By requiring all revisions to have a "main" slot which will be used as the default in many situations, we can achieve interoperability with code that does not know about MCR.

An alternative approach discussed earlier was Multipart Content: multiple Content objects would be encoded into a single meta-object, a multipart content object, using a mechanism similar to the MIME-Multipart mechanism used for email attachments. This however has two major disadvantages: performance and interoperability. The performance is not optimal, since the entire content blob must always be loaded to access any part of the content. Interoperability is not optimal, since it is unclear how external interfaces like the web API should behave to be backwards compatible: should they return the entire multipart object as the revision content, or always the main part of the content? The former breaks code that manipulates the content, while the latter causes data loss with code that treats content as opaque blobs. MCR faces a similar problem, but allows the new structures to be made explicit in the API response data and request parameters, instead of requiring clients to decode nested blobs.

Another advantage of MCR over the Multipart approach is flexibility. Derived content can be updated or added to revisions after the time of creation. Also, different kinds of content can use different optimized blob store mechanisms.

For these reasons, MCR was identified as the preferred approach.

Development
TBD

Migration

 * DB Schema migration
 * TBD...

Virtual Slots

 * Virtual slots will need to be implemented at a structured storage layer placed between RevisionLookup and the BlobLookup.
 * Virtual slots typically need access to the slot name and to the slots of the revision (per definition, virtual content is derived from other content). Perhaps it should have access to a RevisionRecord.
 * Some virtual slots may need access to data in the parent revision (e.g. for blame maps and diffs).
 * Virtual slots should have a way to signal whether the generated content should be stored (as persistent derived content) once it has been generated, or not. Virtual slots that do not allow generated content to be stored are considered "volatile". There could also be an option to persist virtual content only for a limited time ("cacheable").
 * Straw man interface: . Caution: accessing slot content via $revRec may cause other virtual slots to be triggered. Circular dependencies must be avoided.
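
To illustrate the circular-dependency caution, here is a minimal Python sketch. It is not the straw man interface referred to above; the RevisionRecord shape, the generator signature (rev_rec, role), and the exception name are all hypothetical. The sketch tracks which virtual slots are currently being generated and fails fast when one of them is requested again.

```python
class CircularSlotDependency(Exception):
    """Raised when a virtual slot (indirectly) depends on itself."""


class RevisionRecord:
    def __init__(self, stored, virtual):
        self._stored = dict(stored)    # role -> stored content
        self._virtual = dict(virtual)  # role -> callable(rev_rec, role) -> content
        self._in_progress = []         # virtual roles currently being generated

    def get_content(self, role):
        if role in self._stored:
            return self._stored[role]
        if role in self._in_progress:
            chain = " -> ".join(self._in_progress + [role])
            raise CircularSlotDependency(chain)
        self._in_progress.append(role)
        try:
            # The generator may call back into get_content() for other slots.
            return self._virtual[role](self, role)
        finally:
            self._in_progress.pop()


def word_count(rev_rec, role):
    # Derived on the fly from the main slot.
    return str(len(rev_rec.get_content("main").split()))


def ping(rev_rec, role):
    return rev_rec.get_content("pong")


def pong(rev_rec, role):
    return rev_rec.get_content("ping")


rev = RevisionRecord(
    stored={"main": "hello multi content world"},
    virtual={"wordcount": word_count, "ping": ping, "pong": pong},
)
count = rev.get_content("wordcount")

try:
    rev.get_content("ping")
    cycle = None
except CircularSlotDependency as err:
    cycle = str(err)
```

Detecting cycles at access time is only one option; dependencies between virtual slots could also be declared up front and validated when the slots are registered.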

Sub-slots

 * We may want to model sub-slots and/or sub-revisions.


 * Sub-slots: Slot names may be suffixed to allow multiple "sub-slots", all with the same storage backend and content model. E.g. there could be an "html" slot with one sub-slot per language, e.g. "html.de", "html.nl", etc.
 * Sub-revisions would model "events" such as a template being updated.
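
The suffix convention for sub-slots could be handled as in the following Python sketch. The helper name and the choice to split at the first dot are assumptions; the text only gives names like "html.de" as examples.

```python
def split_slot_name(name: str):
    """Split "html.de" into ("html", "de"); a plain name has no sub-slot."""
    base, sep, sub = name.partition(".")
    if not base or (sep and not sub):
        raise ValueError(f"malformed slot name: {name!r}")
    return base, (sub or None)


html_de = split_slot_name("html.de")
plain = split_slot_name("main")
```

Since all sub-slots share the storage backend and content model of their base slot, the base name is what would be used to look up the handler, while the full name identifies the content.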

Architecture Overview (old)
Note: primary (user generated) content slots must be enumerable. Which revision has which primary slots is recorded in the database. Secondary (derived) content slots may also be persistent in the database, but can just as well be purely virtual. As a case in point, we'd want a) a ParserCache implementation based on persistent derived slots as well as b) a virtual slot implementation based on the existing ParserCache.
 * PageStore -> create/update/delete pages. Uses RevisionStore. Does all the secondary data update stuff.
 * RevisionStore -> returns RevisionBuilder; Caller adds RevisionSlots and meta-data to RevisionBuilder
 * RevisionBuilder maintains transactional context. Needs to be aware of base rev id for "late" conflict detection!
 * Later, add support for RevisionUpdater, for updating persistent derived revision data
 * RevisionLookup returns RevisionRecord objects; LazyRevisionRecord for lazy loading?
 * RevisionRecord can enum RevisionSlots for primary content. LazyRevisionSlot for lazy loading of content.
 * RevisionSlots has Content and meta-data (size, hash, content model, change date, etc); Do we need a RevisionSlotLookup/RevisionSlotStore?
 * Primary content implements Content. Derived content implements Data(?!); Content extends Data.
 * RevisionStore/RevisionLookup is based on BlobStoreMultiplexer. Read/write is routed based on a prefix in the blob id.
 * BlobStoreMultiplexer manages multiple BlobStores
 * RevisionStore turns blobs into ContentObjects and creates RevisionSlot and RevisionRecord objects from them (or creates a LazyRevisionRecord that loads data on demand)
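
The prefix-based routing of the BlobStoreMultiplexer can be sketched as follows, in Python for illustration. The "prefix:key" id format, the method names, and the prefixes "tt" and "es" are assumptions; the outline above only states that reads and writes are routed based on a prefix in the blob id.

```python
class DictBlobStore:
    """Trivial in-memory blob store used as a stand-in backend."""

    def __init__(self):
        self._data = {}

    def store(self, data: bytes) -> str:
        key = str(len(self._data) + 1)
        self._data[key] = data
        return key

    def load(self, key: str) -> bytes:
        return self._data[key]


class BlobStoreMultiplexer:
    """Routes reads and writes to a backend chosen by the blob id prefix."""

    def __init__(self, stores: dict, default_prefix: str):
        if default_prefix not in stores:
            raise ValueError(f"unknown default prefix: {default_prefix!r}")
        self._stores = stores
        self._default = default_prefix

    def store(self, data: bytes, prefix: str = None) -> str:
        prefix = prefix or self._default
        key = self._stores[prefix].store(data)
        return f"{prefix}:{key}"   # the prefix records which backend owns the blob

    def load(self, blob_id: str) -> bytes:
        prefix, sep, key = blob_id.partition(":")
        if not sep or prefix not in self._stores:
            raise KeyError(f"no blob store for id {blob_id!r}")
        return self._stores[prefix].load(key)


mux = BlobStoreMultiplexer({"tt": DictBlobStore(), "es": DictBlobStore()}, "tt")
small_id = mux.store(b"short wikitext")
large_id = mux.store(b"large rendered html", prefix="es")
```

Because the backend is encoded in the id itself, different kinds of content can live in different stores without the caller needing to know where a given blob was written.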