Requests for comment/Multi-Content Revisions

"''This page is a full description for 'Multi-Content Revisions', an RFC tracked as T107595."

Multi-Content Revision support (also MCR) is a technology change to the back-end of MediaWiki that is being made in 2017, led by the MediaWiki and Wikidata teams. It will allow multiple content "slots" per revisions (comprising multiple "streams" per page). Each slot has a specific role (revision+role are unique), and typically each slot has a different content model. Please refer to the Glossary for a definition of the terms used in this document.

The main objectives of introducing MCR can be summarized as follows: On the technical level, this translates to introducing a level of indirection between the revision and the content object, so multiple content objects can be managed per page revision. The concept of a wiki page changes from a sequence of revisions of a single document to a sequence of revisions of multiple documents (streams of slots). For simplicity and compatibility, each revision must have a "main" slot that corresponds to what is without MCR the singular page content.
 * Allow auxiliary information that is currently embedded in wikitext (e.g. infoboxes, categories, template schema) to be factored out, and managed as a separate document (slot), without losing the shared history and page-level functionality associated with the main content, still allowing the auxiliary data to be edited at the same time as the main content, creating a single revision.
 * Allow auxiliary information that is currently managed on associated pages (e.g. template documentation, quality assessment, gadget styles) to share the edit history and page-level functionality associated with the main content, and allowing the auxiliary data to be edited at the same time as the main content, creating a single revision.
 * Potentially, provide a generic way to associate derived data with a revision (e.g. blame maps, rendered content variants).

Introducing MCR support does not change how content is stored (it continues to be treated as blobs, and use the text table or External Storage), although the PHP interface for blob storage will likely need to be refactored. Introducing MCR support also does not change how derived index information, such as the imagelinks or page_props table, are managed and used. Introducing MCR support also does not change the editing experience for any content, though later changes using this technology might. MCR does change how and where content meta-data is stored (the content model, format, hash, etc.).

System Design
A top-down view of the service architecture for MCR support:
 * Application logic uses a RevisionLookup service to access revision (resp page) content. Some of the database level code currently in WikiPage and Revision will move there.
 * Application logic uses a PageUpdateController to create new revisions (to update pages). Much of the update logic currently in WikiPage will go there.
 * The existing Revision class will continue to provide read and write access to the main slot, implemented based on RevisionLookup and PageUpdateController.
 * RevisionLookup uses a RevisionSlotLookup to load revision content meta-data, and a BlobLookup to load the serialized content. A ContentLookup layer may be added between RevisionLookup and RevisionSlotLookup to support structured storage and virtual slots. Access to the revision table may be implemented directly in RevisionLookup, or abstracted into a DAO.
 * PageUpdateController uses a RevisionSlotStore to write revision content meta-data, and a BlobStore to store the serialized content. A ContentStore layer may be added between the PageUpdateController and the BlobStore, to provide structured storage. Access to the revision table may be implemented directly in PageUpdateController, or abstracted into a DAO.
 * BlobStore and BlobLookup are initially implemented based on the text table and the External Storage logic. Much of the logic in Revision that currently implements access to revision content will move here.
 * Note that BlobLoockup and BlobStore will typically be implemented by a single class, as will RevisionSlotLookup and RevisionSlotStore.

Necessary configuration includes:
 * content model per slot role (not for the main slot)
 * blob store per slot role and (as a fallback) per content model.
 * legal slots per namespace (and/or main slot model)
 * rules for combined display and editing of slots (per namespace and/or main slot model)
 * Various compatibility flags to manage the transition phase.

More detailed design documents can be found on the following sub-pages:
 * Content Meta-Data Storage defines a DAO layer for content meta-data. It is the preferred way to access the content table.
 * Blob Storage defines a generic mechanism for storing and retrieving arbitrary data blobs, similar to ExternalStore.
 * Transaction Management specifies a generic transaction management interface.
 * Revision Retrieval Interface defines a (lazy loading) RevisionRecord object that provides access to revision meta-data as well as the Content objects of each slot, based on the Content Meta-Data and Blob Storage interfaces.
 * Page Update Interface uses the Content Meta-Data Storage and Blob Storage components to provide a builder object for new revisions which is responsible for managing the transactional context of the update.
 * The legacy Revision and WikiPage classes uses the Revision Lookup and Page Updater components to retain the legacy interface for the main slot.
 * Rendering and Parser Cache needs to be aware of slots. The web cache will be for a rendering of all slots of the current revision.

External Interface Compatibility
Support for multi-content revisions will have to be integrated into the external interfaces (both UI and API).
 * User Interface
 * Views (view, edit, diff, etc)
 * Action API
 * Dumps (Import/Export)

Key Components

 * Revision Storage Interface is the structural heart of MCR, defining how content is associated with revision via slots.
 * Blob Storage provides an generic interface for storing serialized page content.
 * Page Update Interface manages the creation of new revision, and the update of any associated information in the database.
 * Dumps MCR support has been added to XML dumps with version 0.11 of the XML schema (released with MediaWiki 1.35).

Use Cases
The ability to manage different independent content objects (documents) on the same page may be useful for a variety of applications, some of which are given below. Note that only a few of the use cases mentioned are actually planned or identified as desirable. Others are just ideas so far.

Primary (user editable) content: Derived (possibly virtual) content:
 * Structured media info, especially license data, associated with a file description page, in addition to the free form wikitext. Currently stored inline using the template. There is a desire by the reading and multimedia teams to have media info integrated closely with the file description page. This is one of the use cases that drive this proposal. See also Reading/Multimedia/Structured Data/Technical requirements.
 * Store extension data in slots - for tools such as Graph, Easytimeline, Map, json, etc.
 * References - Store references in a separate slot (https://phabricator.wikimedia.org/T130663)
 * Provide a way to specify what text/statement is supported by a block - ( https://phabricator.wikimedia.org/T23209)
 * Page components - ways of improving the handling of common content structures like infoboxes, data tables, citations or navboxes
 * Infobox data. Removing the infobox parameters from the wikitext allows the infobox to easily be stripped, or shown separately, or formatted according to the target device. It would also make form-based editing simpler. Currently stored as parameters of the infobox template, inline in wikitext. There is a desire by both the reading and the editing team to do this. This is one of the use cases that drive this proposal, even though the details are still unclear.
 * Categories - Manage as structured data. Compare T87686. Currently stored inline in wikitext. The editing teams seems to find this desirable, but it's not a high priority. Has been suggested several times over the years.
 * Meta-data editor - Use a dedicated interface for adding meta-data like interwiki links, rather than wikitext
 * Associated namespaces
 * Template schema information (aka template data). Currently stored inline in wikitext using JSON syntax. There appears to be desire from the editing team to have atomic edits for template styles, documentation, and schema. This is one of the use cases that drive this proposal.
 * Template styles: the CSS used by a template would live in a separate slot, but could be edited along with the template. Currently stored in MediaWiki:Common.css. There appears to be desire from the editing team to have atomic edits for template styles, documentation, and schema.
 * Template documentation. Currently using a subpage or inline in a section. There appears to be desire from the editing team to have atomic edits for template styles, documentation, and schema.
 * Lua module documentation, currently usually maintained in the Help namespace. Has been discussed in the context of a global repository for Lua modules. That combination would require a localization mechanism.
 * TimedMediaHandler-TimedText -  TimedText content as an integral part of the File page
 * Pages collections - ability to bind pages together
 * Article quality assessments ("needs love", "featured", etc), see PageAssessments. Currently stored inline in wikitext with templates. There is a desire by the Community Tech team to move this out of wikitext. This is one of the use cases that drive this proposal.
 * Focus area of images, for generating header images for the mobile app. Interest in this by the mobile team was discussed at the 2016 summit.
 * MassMessage target lists, which are part wikitext and part structured JSON.
 * Allow JS, CSS, HTML and wikitext for CentralNotice to be managed as separate documents and edited using a dedicated editor.
 * ProofreadPage pages, which are composed of 3 separate wikitext content sections (header, body, footer).
 * Gadget styles, currently stored on a parallel page with the .css extension. Has been discussed in the context of a global repository for Gadgets.
 * Workflow state. Currently stored in the form of templates and template parameters in the wikitext. There are currently no plans to implement this.
 * Wikidata description override. Currently going to be implemented as a parser function.
 * Article annotations / inline discussion
 * Page forms?
 * Page specific edit notices, T124379.
 * Blame maps (as derived content) that record which part of an article was added by which user. There are currently no plans to implement this.
 * OCR text from DeJaVu and PDF files (as derived content). Works best if the upload history is managed using MCR. There are currently no plans to implement this.

Other:
 * Managing file upload history. Managing meta-data about uploaded files along with the wikitext would remove the need for an oldimage table (see Requests for comment/image and oldimage tables). This would however require quite a bit of refactoring, and would introduce a large number of revisions (one per upload). There are currently no plans to implement this.
 * Rendered HTML (as derived, possibly virtual "on-the-fly" content). This would require the ability to update derived content without creating a new revision, on "events" like a template update. It would also require "sub-slots", since different HTML may be produced for different devices or target languages. There are currently no plans to implement this.

Rationale
The above use cases indicate that there is a need for bundling different kinds of information on a single page, to allow atomic editing of the different kinds of content together, provide a shared version history, and allow the different kinds of information to be watched, protected, moved, and deleted together. That is, a revision can have multiple "slots" containing different kinds of data, each having a unique role. The requirements for each slot correspond to what is already defined in the Content interface: we need to be able to store and load, import and export, render, edit, and diff them.

The ability to use a dedicated editing interface and specialized diffing code seems particularly desirable for information that is currently maintained inline as part of the wikitext, such as categories, infobox parameters or media licensing information.

To address this need, MCR introduces an indirection between revisions and content objects, changing the relationship from 1:0 to a 1:n (or m:n, if content meta-data can be re-used by multiple revisions). This requires a substantial refactoring in the storage layer of MediaWiki, and some changes to the database schema. By requiring all revisions to have a "main" slot which will be used as the default in many situations, we can achieve interoperability with code that does not know about MCR.

Advantages over Multipart Content
An alternative approach discussed earlier was Multipart Content: Multiple Content objects would be coded into a single meta-objects, a multipart content object, using a mechanism similar to the MIME-Multipart mechanism used for email attachments.

The advantage of this approach seems to be that most code can be oblivious to this change, so only code that actually need to manipulate or interpret the content needs to know about multi-part content model. However, it's not always clear which code that is. All extensions that access page content would need to become aware of the multipart mechanism. Also, all clients accessing content though the API would need to become aware of multipart, as should programs that process dumps. If they remained oblivious to multipart, they would likely treat multipart content as an unknown content model they cannot process. In the worst case, they would treat multipart content as wikitext, or replace it with plain wikitext, causing data loss or corruption.

Another problem with Multipart Content is performance and scalability: even if only one part (aka "slot") of the content is needed, the entire blob has to be loaded; even if only one part is changed, the entire blob needs to be stored again. With MCR, not all content needs to be loaded from storage if only the content of one slot is needed, and different kinds of content can use different optimized blob store mechanisms.

The first problem, "partial" opacity, could be overcome by "unwrapping" the multipart content objects on the storage layer, and expose a new slot-aware interface to application logic and clients. This allows legacy code to continue working on just the "main" part of the content, just like MCR does. However, to achieve this, much of the refactoring needed for MCR would still be needed - only the database schema could stay the same. So this approach would not avoid the the refactoring work, and it would not provide the benefit of the better scalability and flexibility that MCR gives. This is especially true since it has become clear that the database schema will need improvements in any case, see for instance T161671: Compacting the Revision Table.

So, the multipart approach will either be problematic for interoperability (for users of the API as well as extensions), or require just as much work as MCR, with less benefits. For these reasons, MCR was identified as the preferred approach. However, if need be, the multipart mechanism may still be used as a stepping stone that would allow us to work on the refactoring, and make use of the new features, before the database schema changes for MCR are deployed.

Streams

 * Streams are not modeled directly, they exist purely as the history of a given slot for a given page and can be referenced by page title(+namespace) and slot name.
 * TBD: we may need a "closed" representation in a single string, for use in wiki links. For streams, we could use "namespace:title##slot" or "namespace:title||slot".
 * Which slots (resp. streams) are available for a given page may be configured per namespace, or per content model of the main slot (which in turn would typically be based on the namespace). The details of this are TBD.

Derived Content

 * Derived content can be stored using the generic blob storage service interface
 * The association between derived content and revisions may need a more specialized mechanism than slots provide.

Virtual Slots

 * Virtual slots will need to be implemented at a structured storage layer placed between RevisionLookup and the BlobLookup.
 * Virtual slots typically need access to the slot name and to the slots of the revision (per definition, virtual content is derived from other content). Perhaps it should have access to a RevisionRecord.
 * Some virtual slots may need access to data in the parent revision (e.g. for blame maps and diffs).
 * Virtual slots should have a way to signal whether the generated content should be stored (as persistent derived content) once it has been generated, or not. Virtual slots that do not allow generated content to be stored are considered "volatile". There could also be an option to persist virtual content only for a limited time ("cacheable").
 * Straw man interface: . Caution: accessing slot content via $revRec may cause other virtual slots to be triggered. Circular dependencies must be avoided.

Sub-slots

 * We may want to model sub-slots and/or sub-revisions.


 * Sub-slots: Slot names may be suffixed to allow multiple "sub-slots", all with the same storage backend and content model. E.g. there could be an "html" slot with one sub-slot per language, e.g. "html.de", "html.nl", etc.
 * Sub-revisions would model "events" such as a template bein updated.

Architecture Overview (old)
Note: primary (user generated) content slots must be enumerable. Which revision has which primary slots is recorded in the database. Secondary (derived) content slots may also be persistent in the database, but can just as well be purely virtual. As a point in case, we'd want a) a ParserCache implementation based on persistent derived slots as well as b) a virtual slot implementation based on the existing ParserCache.
 * PageStore -> create/update/delete pages. Uses RevisionStore. Does all the secondary data update stuff.
 * RevisionStore -> returns RevisionBuilder; Caller adds RevisionSlots and meta-data to RevisionBuilder
 * RevisionBuilder maintains transactional context. Needs to be aware of base rev id for "late" conflict detection!
 * later add support for RevisionUpdater, for updating persistent derived revision data
 * RevisionLookup returns RevisionRecord objects; LazyRevisionRecord for lazy loading?
 * RevisionRecord can enum RevisionSlots for primary content. LazyRevisionSlot for lazy loading of content.
 * RevisionSlots has Content and meta-data (size, hash, content model, change date, etc); Do we need a RevisionSlotLookup/RevisionSlotStore?
 * Primary content implements Content. Derived content implements Data(?!); Content extends Data.
 * RevisionStore/RevisionLookup is based on BlobStoreMultiplexer. Read/write is routed based on a prefix in the blob id.
 * BlobStoreMultiplexer manages multiple BlobStores
 * RevisionStore turns blobs into ContentObjects and creates RevisionSlot and RevisionRecord objects from them (or creates a LazyRevisionRecord that loads data on demand)