User:Duesentrieb/Multi-Content Revisions

This is a draft for a proposal to support multiple content objects per revision (multiple content "streams" per page).

The idea of this RFC is to allow multiple Content objects to be associated with a single revision (one per "slot"), resulting in multiple content "streams" for each page. The "main" slot is reserved for the primary content of the page (that is, for what is currently considered the content of the page).

The RFC is structured to first give an overview in the sections Use Cases and Rationale and Concepts and Principles, and then goes on to provide an outline of the implementation, roughly structured in three levels. These levels can be implemented incrementally, each providing additional functionality and flexibility.

Use Cases and Rationale
The idea is to provide a mechanism for associating multiple (independent or interdependent) Content objects (generally of different types) with each page (and each revision of each page). This allows for greater flexibility with regard to what MediaWiki can store and manage, and where. It effectively adds another dimension to the revision management system, providing a standard mechanism for "attaching" derived or editable data to a page.

Potential use cases:
 * HTML rendering(s) - instead of the parser cache.
 * A "blame map" for tracking authorship on the word level
 * categories etc. maintained as structured, user editable data outside the wikitext
 * lead image and focus area maintained as structured, user editable data
 * Structured meta-data for media files (on file description pages)
 * Media file streams could be managed in the same way (see {T96384})
 * Expanded versions of templates used on the page; this would allow re-rendering just these blobs, and the dependent HTML rendering, when the template changes.

The approach described below revolves around managing several Content objects per revision, organizing them into "slots". This approach is distinct from the idea of supporting multi-part content blobs, which would use MIME-Multipart or a similar encoding to combine multiple content objects into one master object. Multi-content revisions have two advantages:
 * 1) they are more flexible and more efficient with respect to blob storage (different storage mechanisms can be used for different kinds of content).
 * 2) they avoid breaking changes to APIs that allow access to raw page content, by presenting the content of the "main" slot there by default. Attempting the same with multi-part revisions would lead to round-trip issues when only the main part of the content gets posted back from an edit.

Addressing

 * Each Revision has named "slots". Each slot may contain a Content object.
 * There are three types of slots:
   * primary: user-created content that can be edited (wikitext, structured data, etc.)
   * derived: information derived from the user-created content and stored along with it (e.g. diffs, blame map, rendered HTML)
   * virtual: information derived on the fly from the user-created content, and not stored
 * There is always at least one primary slot defined: the "main" slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
 * Each slot name, except the main slot, is associated with a content model to be used with that slot (by default).
 * The content model to use for the main slot is determined based on the page title (namespace, extension), just like we already do.
 * TBD: we need a name for the history of a page regarding a specific slot. "stream"? "facet"?
 * Slots can be referenced by revision ID and slot name. Streams can be referenced by page title(+namespace) and slot name.
 * TBD: we may need a "closed" representation in a single string. For page titles, we could use "namespace:title##slot" or "namespace:title||slot".
 * Which slots are available for a given page may be configured per namespace, or per content model of the main slot (which in turn would typically be based on the namespace)
 * There is metadata associated with each slot. Namely:
   * content-model
   * touch-date (null for virtual slots)
   * In practice, there would also be a blob URL and a serialization format associated with any non-virtual slot. But that should be considered an implementation detail, and not be exposed on this level, if it can be avoided.
   * Additional information that might be tracked includes a SHA1 hash and the actual blob size in bytes (or the content size in bogo-bytes, as provided by Content::getSize).
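The per-slot metadata listed above could be modeled as a simple value object. The sketch below is illustrative only; the field names are assumptions, not part of the proposal:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SlotMetadata:
    """Illustrative per-slot metadata record (names are assumptions)."""
    content_model: str
    touch_date: Optional[datetime]  # None for virtual slots
    sha1: Optional[str] = None      # optional: hash of the slot content
    size: Optional[int] = None      # optional: blob size in bytes

# A virtual slot carries no touch date, since its content is not stored:
html_meta = SlotMetadata(content_model="html", touch_date=None)
```

Blob URL and serialization format are deliberately omitted here, mirroring the point above that they are implementation details of the storage layer.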

Display

 * The handler for the view action for a given page is determined by the content model of the main slot.
 * The action handler shall have access to all slots, so the content of other slots can be used for display
 * In diff views, diffs for each primary slot are calculated and shown, one below the other.
 * Note that each Content for each slot can provide a separate ParserOutput object, which can be used to include style sheets and scripts, record the usage of images, store information in page_props, etc.
 * As soon as more than just the main slot's content is used to show pages to the user, the ParserCache needs to cache the ParserOutput for each slot. That is, the slot name must go into the parser cache key.
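The last point can be sketched as follows; the key scheme is hypothetical, the point is only that the slot name must be part of the cache key so that ParserOutput objects for different slots do not collide:

```python
def parser_cache_key(page_id: int, rev_id: int, slot: str) -> str:
    """Hypothetical parser cache key: including the slot name ensures
    each slot's ParserOutput is cached under a distinct key."""
    return f"pcache:{page_id}:{rev_id}:{slot}"
```

With this scheme, the "main" and "blame" renderings of the same revision are cached independently.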

Updating

 * When a page is edited, the content of at least one primary slot is created (or added)
 * Unchanged content of primary slots is re-used from previous revisions. E.g.:
 * Revision 1 has two slots, A and B, with content ( A1, B1 )
 * Now, slot A is edited, but slot B is untouched. Then revision 2 is ( A2, B1 ). That is, slot B in revision 2 has the same content as slot B of revision 1.
 * Derived slots are re-calculated when primary slots change (similarly to how we already handle secondary data like links, and pre-cached data like ParserOutput).
 * TBD: should the ContentHandler of the primary slot determine which derived content is generated? If not, what component is responsible for this decision?
 * SecondaryDataUpdates are determined for all content objects of a revision, old or new, primary or derived.
 * Derived content of a revision can be updated without creating a fresh revision. For example, if rendered HTML is stored as derived content, it could be updated because a template used on the page has changed.
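The re-use rule for unchanged primary slots (and the null-removal rule described later for the updaters) can be sketched as a simple map operation. This is a model of the behavior, not MediaWiki code:

```python
def next_revision(prev_slots: dict, edits: dict) -> dict:
    """Build the slot map of a new revision: edited slots take their new
    content, untouched slots are re-used from the previous revision, and
    slots explicitly set to None are dropped from the new revision."""
    new_slots = dict(prev_slots)       # re-use unchanged content
    for slot, content in edits.items():
        if content is None:
            new_slots.pop(slot, None)  # explicitly remove the slot
        else:
            new_slots[slot] = content
    return new_slots

rev1 = {"A": "A1", "B": "B1"}
rev2 = next_revision(rev1, {"A": "A2"})  # only slot A is edited
```

Here `rev2` comes out as `{"A": "A2", "B": "B1"}`, matching the example above: slot B's content is carried over unchanged.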

Representation

 * The XML export format will need to change to accommodate multiple content objects per revision.
 * API output (and input!) will be based on the main slot by default. Different or additional slots need to be explicitly requested.

Level I: RevisionContentLookup
For accessing the content of the different slots of a revision, a RevisionContentLookup service is defined:
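The interface definition itself is not reproduced here. Based on the method names used below (getRevisionContent, getRevisionSlot, listRevisionSlots), a minimal sketch might look like this; the signatures and return types are assumptions:

```python
from abc import ABC, abstractmethod

class RevisionContentLookup(ABC):
    """Sketch of the lookup service. Method names follow the text;
    signatures are illustrative assumptions."""

    @abstractmethod
    def listRevisionSlots(self, rev_id: int) -> list:
        """Names of the slots available for the given revision."""

    @abstractmethod
    def getRevisionSlot(self, rev_id: int, slot: str):
        """Meta-data about one slot of the given revision."""

    @abstractmethod
    def getRevisionContent(self, rev_id: int, slot: str):
        """The Content of one slot (generated on the fly for virtual slots)."""
```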

For virtual slots, getRevisionContent and getRevisionSlot would generate the desired information on the fly.

The initial implementation of RevisionContentLookup (we may not even need a separate interface, just this implementation) would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot (and virtual slots) are supported. Implementation steps:


 * Move storage layer code for accessing revision content from Revision into RevisionContentLookup
 * Change Revision to use a (global default) instance of RevisionContentLookup to access revision content.
 * The initial implementation of RevisionContentLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.
 * The initial implementation of RevisionContentLookup (or a suitable decorator) should provide in-process LRU caching for slot content (and perhaps also slot meta-data).
 * Extension points shall be provided in getRevisionContent, getRevisionSlot and listRevisionSlots for implementing virtual slots. It may be sufficient to use the standard hook mechanism, but it is probably more useful in the long run to define a registry for slot handlers ("revision content providers"), which can be accessed by the RevisionContentLookup:
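Such a registry might look roughly like the following; the class and method names are hypothetical:

```python
class RevisionContentProviderRegistry:
    """Hypothetical registry mapping slot names to handlers that can
    produce Content for virtual slots on the fly."""

    def __init__(self):
        self._providers = {}

    def register(self, slot: str, provider) -> None:
        """Associate a provider callable with a virtual slot name."""
        self._providers[slot] = provider

    def provide(self, rev_id: int, slot: str):
        """Generate the content for a virtual slot of the given revision."""
        provider = self._providers.get(slot)
        if provider is None:
            raise KeyError(f"no provider registered for slot {slot!r}")
        return provider(rev_id)

registry = RevisionContentProviderRegistry()
registry.register("html", lambda rev_id: f"<rendered html for rev {rev_id}>")
```

RevisionContentLookup would consult this registry from getRevisionContent, getRevisionSlot and listRevisionSlots before (or after) falling back to stored slots.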

Note that the RevisionContentProvider gets a RevisionContentLookup, so it can retrieve the content of the slots it needs to operate. This implies that RevisionContentLookup should apply aggressive in-process caching on Content objects.

A RevisionContentProvider could be implemented to provide access to the ParserCache, or RESTbase services, or use the new VRS infrastructure to provide information about revisions.

Side note: Storage layer code dealing with revision meta-data should also be moved out of the Revision object into a service object that provides access to information about revisions. That information would be modeled by a new "dumb" data object called RevisionInfo or some such. But all that is outside the scope of this proposal.

Level II: PageUpdater
Updating a revision is a complex process, with complicated requirements with regard to the use of transactional logic and deferred updates. To honor these requirements, stateful "interactor" objects are defined in addition to the stateless storage service:
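The interactor definitions are not reproduced here. Based on the method names used below (begin, setSlotContent, createRevision, rollback), a minimal sketch of the transactional bracket might look like this; everything beyond those names is an assumption:

```python
class PageUpdater:
    """Sketch of the stateful interactor. begin / setSlotContent /
    createRevision / rollback form a transactional bracket; this toy
    version only accumulates slot content in memory."""

    def __init__(self):
        self._pending = None

    def begin(self) -> None:
        self._pending = {}

    def setSlotContent(self, slot: str, content) -> None:
        if self._pending is None:
            raise RuntimeError("begin() must be called first")
        # content=None would mark the slot for removal (see below)
        self._pending[slot] = content

    def rollback(self) -> None:
        # undo any effect of setSlotContent calls, free resources
        self._pending = None

    def createRevision(self) -> dict:
        # conclude the bracket; a real implementation would persist
        # the slots and return the new revision
        slots, self._pending = self._pending, None
        return slots
```

A RevisionUpdater would follow the same shape, with updateRevision in place of createRevision and setSlotContent rejecting primary slots.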

PageUpdater shall be used to create new revisions when a page is edited (or created). RevisionUpdater can be used to update derived content when some dependency (e.g. a template) changes.

Updater behavior:
 * setSlotContent will fail for any virtual slot.
 * RevisionUpdater::setSlotContent will fail for any primary slot.
 * $content can be set to null to explicitly remove the slot from the revision, preventing any content of that slot from being re-used from earlier revisions.
 * As indicated by the names, begin, rollback, and createRevision (resp. updateRevision) shall be used to create a transactional bracket around the calls to setSlotContent. However, no transactional behavior is guaranteed.
 * Implementations shall free any resources (such as database connections) when concluding the transaction.
 * Implementations shall take care to undo any effect any calls to setSlotContent may have had.

createRevision (resp. updateRevision) would take the place of WikiPage::doUpdateContent. All the relevant code should be moved to PageUpdater. WikiPage::doUpdateContent should then be re-implemented based on a (global default) instance of RevisionContentStore. Maintaining backwards compatibility for hooks may be a major challenge.

The initial implementation of the updaters should allow the storage of additional primary and derived slots using MediaWiki's standard storage mechanism (i.e. the text table and/or the ExternalStore mechanism). The association between revisions and content blobs, along with the meta-data for each slot, shall be maintained in a new database table revision_slots with the following fields:
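The field list for revision_slots is not reproduced here. Based on the meta-data enumerated in the Addressing section (content model, blob URL, serialization format, SHA1, size), a row would plausibly carry roughly the following; all column names below are hypothetical, not the RFC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RevisionSlotRow:
    """Hypothetical shape of a revision_slots row. The real column
    names and types are defined by the RFC, not by this sketch."""
    rev_id: int          # revision this slot belongs to
    slot_name: str       # e.g. "main"
    content_model: str   # per-slot content model
    content_format: str  # serialization format of the stored blob
    blob_url: str        # address understood by the storage layer
    sha1: str            # hash of the slot content
    size: int            # blob size in bytes
```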

The following fields in the revision table should be modified or deprecated (or dropped, at least for new installs):
 * rev_text_id (deprecated, can be kept for B/C if the main slot content is in the text table or external store)
 * rev_deleted (as before - or do we need this per slot?)
 * rev_len (sum of the sizes of all primary slots associated with this revision)
 * rev_sha1 (hash of the hashes of all primary slots; for B/C, could be the main slot's hash)
 * rev_content_format (deprecated; could be the format of the main slot for B/C)
 * rev_content_model (deprecated; could be the model of the main slot for B/C)

Note that virtual slots would not have entries in the revision_slots table. It is sufficient for the respective RevisionContentProvider to declare that the slot is available for the given revision. The association between the revision and the content of virtual slots is purely programmatic.

Level III: BlobStore and SlotDataStore
It should be possible to use different storage backends for different slots (or for different content models, TBD). For each storage mechanism, a BlobStore service would be provided:
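The BlobStore interface is not reproduced here. Given the storeBlob/loadBlob names used below, a toy in-memory version might look like this; the content-hash addressing is just one possible scheme, as noted below:

```python
import hashlib

class MemoryBlobStore:
    """Toy BlobStore: storeBlob returns an opaque address that loadBlob
    accepts later. Here the address is a content hash, but it could just
    as well be an incremental number or a GUID."""

    def __init__(self):
        self._blobs = {}

    def storeBlob(self, data: bytes) -> str:
        address = hashlib.sha1(data).hexdigest()
        self._blobs[address] = data
        return address

    def loadBlob(self, address: str) -> bytes:
        return self._blobs[address]
```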

Besides the SQL-based storage currently used by MediaWiki, BlobStores could be implemented on top of the raw file system, Cassandra, OpenStack Swift, higher-level HTTP-based services like RESTBase, etc.

Addressing blobs:
 * The BlobStore has full control over the address that shall be used later to retrieve the blob. storeBlob returns the address that can be used with loadBlob to retrieve the blob.
 * The address is completely opaque. It may be based on the blob's content hash, use incremental numbering, or GUIDs, or some other scheme.

TBD: should the BlobStore be able to store, and be aware of, any meta-data such as the blob's MIME type?

The mapping between slots and storage backends is maintained by the SlotDataStore service:

Addressing slot data:
 * The mapping between slots and storage backends is implemented in two steps:
 * BlobStore names are associated with BlobStore implementations and configurations, designating a concrete storage location. This association must NEVER change, otherwise any stored data will become inaccessible (this is similar to how ExternalStore clusters are configured).
 * slot names (or content models, TBD) are associated with a BlobStore name. This indicates which store is to be used when storing new data. This association can be changed at will.
 * The string returned by storeSlotData is an opaque URL for later loading the slot data using loadSlotData. It is composed of two parts: the name of the BlobStore, and the address returned by the BlobStore.
 * loadSlotData relies on the prefix in the $url to find the correct BlobStore to load the slot data.

Once the SlotDataStore service is available, PageUpdater, RevisionUpdater, and RevisionContentLookup (resp. a RevisionContentProvider) shall be implemented on top of it. Maintaining any meta-information about the slots (in the new revision_slots table), and handling serialization, remains the responsibility of PageUpdater and RevisionUpdater. Interpreting such meta-data from the revision_slots table, and handling deserialization, remains the responsibility of RevisionContentLookup, resp. an appropriate implementation of RevisionContentProvider.