User:Duesentrieb/Multi-Content Revisions

From mediawiki.org

This is a draft for a proposal to support multiple content objects per revision (multiple "streams" per page).

The idea of this RFC is to allow multiple Content objects to be associated with a single revision (one per "slot"), resulting in multiple content "streams" for each page. The "main" slot is reserved for the primary content of the page (that is, for what is currently considered the content of the page).

The RFC first gives an overview in the sections Use Cases and Rationale and Concepts and Principles, and then outlines the implementation, roughly structured in three levels. These levels can be implemented incrementally, each providing additional functionality and flexibility.

Use Cases and Rationale

The idea is to provide a mechanism for associating multiple (independent or interdependent) Content objects (generally of different types) with each page (and each revision of each page). This allows for greater flexibility with regard to what MediaWiki can store and manage where. It effectively adds another dimension to the revision management system, providing a standard mechanism for "attaching" derived or editable data to a page.

Potential use cases:

  • HTML rendering(s) - instead of the parser cache.
  • A "blame map" for tracking authorship on the word level
  • categories etc. maintained as structured, user editable data outside the wikitext
  • lead image and focus area maintained as structured, user editable data
  • Structured meta-data for media files (on file description pages)
  • Media file streams could be managed in the same way (see T96384)
  • Expanded versions of templates used on the page; this would allow re-rendering just these blobs, and the dependent HTML rendering, when the template changes.

The approach described below revolves around managing several Content objects per revision, organizing them into "slots". This approach is distinct from the idea of supporting multi-part content blobs, which would use MIME-Multipart or a similar encoding to combine multiple content objects into one master object. Multi-content revisions have two advantages:

  1. they are more flexible and more efficient with respect to blob storage (different storage mechanisms can be used for different kinds of content).
  2. they avoid breaking changes to APIs that allow access to raw page content, by presenting the content of the "main" slot there by default. Attempting the same with multi-part revisions would lead to round-trip issues when only the main part of the content gets posted back from an edit.

Concepts and Principles

Addressing

  • Each Revision has named "slots". Each slot may contain a Content object.
  • There are three types of slots:
    • primary: user created content that can be edited (wikitext, structured data, etc)
    • derived: information derived from the user created content and stored along with it (e.g. diffs, blame map, rendered HTML)
    • virtual: information derived from the user created content on the fly and not stored
  • There is always at least one primary slot defined: the "main" slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
  • Each slot name, except the main slot, is associated with a content model to be used with that slot (by default).
  • The content model to use for the main slot is determined based on the page title (namespace, extension), just like we already do.
  • TBD: we need a name for the history of a page regarding a specific slot. "stream"? "facet"?
  • Slots can be referenced by revision ID and slot name. Streams can be referenced by page title(+namespace) and slot name.
    • TBD: we may need a "closed" representation in a single string. For page titles, we could use "namespace:title##slot" or "namespace:title||slot".
  • Which slots are available for a given page may be configured per namespace, or per content model of the main slot (which in turn would typically be based on the namespace)
  • There is metadata associated with each slot. Namely:
    • content-model
    • touch-date (null for virtual slots)
    • in practice, there would also be a blob URL and a serialization format associated with any non-virtual slot. But that should be considered an implementation detail, and not be exposed on this level, if it can be avoided.
    • additional information that might be tracked includes a SHA1 hash and the actual blob size in bytes (or the content size in bogo-bytes, as provided by Content::getSize())
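
The "closed" string representation mentioned above is still an open question. The following is a minimal sketch of how such addresses could be built and parsed, assuming the "##" separator is chosen and that the main slot is left implicit. The function names formatStreamAddress() and parseStreamAddress() are hypothetical and not part of MediaWiki.

```php
<?php
declare(strict_types=1);

// Hypothetical separator; the proposal leaves the choice between "##" and "||" open.
const SLOT_SEPARATOR = '##';

/** Builds "namespace:title##slot"; the main slot is left implicit (an assumption). */
function formatStreamAddress(string $prefixedTitle, string $slot = 'main'): string
{
    return $slot === 'main' ? $prefixedTitle : $prefixedTitle . SLOT_SEPARATOR . $slot;
}

/** Splits an address into [prefixed title, slot name], defaulting to the main slot. */
function parseStreamAddress(string $address): array
{
    $pos = strrpos($address, SLOT_SEPARATOR);
    if ($pos === false) {
        return [$address, 'main'];
    }
    return [
        substr($address, 0, $pos),
        substr($address, $pos + strlen(SLOT_SEPARATOR)),
    ];
}

$address = formatStreamAddress('File:Foo.jpg', 'metadata');
list($title, $slot) = parseStreamAddress($address);
```

Leaving the main slot implicit keeps existing page titles valid as stream addresses, which matches the backwards compatibility role of the main slot.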

Display

  • The handler for the view action for a given page is determined by the content model of the main slot.
  • The action handler shall have access to all slots, so the content of other slots can be used for display
  • In diff views, diffs for each primary slot are calculated and shown, one below the other.
  • Note that each Content for each slot can provide a separate ParserOutput object, which can be used to include style sheets and scripts, record the usage of images, store information in page_props, etc.
  • As soon as more than just the main slot's content is used to show pages to the user, the ParserCache needs to cache the ParserOutput for each slot. That is, the slot name must go into the parser cache key.

Updating

  • When a page is edited, the content of at least one primary slot is changed (or added).
  • Unchanged content of primary slots is re-used from previous revisions. E.g.:
    • Revision 1 has two slots, A and B, with content ( A1, B1 )
    • Now, slot A is edited, but slot B is untouched. Then revision 2 is ( A2, B1 ). That is, slot B in revision 2 has the same content as slot B in revision 1.
  • Derived slots are re-calculated when primary slots change (similarly to how we already handle secondary data like links, and pre-cached data like ParserOutput).
    • TBD: should the ContentHandler of the primary slot determine which derived content is generated? If not, what component is responsible for this decision?
  • SecondaryDataUpdates are determined for all content objects of a revision, old or new, primary or derived.
  • Derived content of a revision can be updated without creating a fresh revision. For example, if rendered HTML is stored as derived content, it could be updated because a template used on the page has changed.
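
The re-use of unchanged primary slots can be sketched as a simple merge of the parent revision's slots with the edited ones. This is only an illustration under simplifying assumptions: slots are represented as a map from slot name to an opaque content address, and the function name inheritSlots() is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Computes the primary slots of a new revision.
 *
 * @param array<string,string>  $parentSlots slot name => content address in the parent revision
 * @param array<string,?string> $changes     slot name => new content address, or null to drop the slot
 * @return array<string,string> slot name => content address in the new revision
 */
function inheritSlots(array $parentSlots, array $changes): array
{
    $slots = $parentSlots; // untouched slots are re-used as-is
    foreach ($changes as $name => $address) {
        if ($address === null) {
            unset($slots[$name]); // explicit removal, nothing is inherited
        } else {
            $slots[$name] = $address;
        }
    }
    return $slots;
}

// Revision 1 is ( A1, B1 ); only slot A is edited.
$rev1 = ['A' => 'A1', 'B' => 'B1'];
$rev2 = inheritSlots($rev1, ['A' => 'A2']);
// $rev2 is ['A' => 'A2', 'B' => 'B1']
```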

Representation

  • The XML export format will need to change to accommodate multiple content objects per revision.
  • API output (and input!) will be based on the main slot by default. Different or additional slots need to be explicitly requested.

Level I: RevisionContentLookup

For accessing the content of the different slots of a revision, a RevisionContentLookup service is defined:

interface RevisionContentLookup {
    public function getRevisionContent( TitleValue $title, $revisionId, $slot = 'main' ): Content;
    public function getRevisionSlot( TitleValue $title, $revisionId, $slot = 'main' ): RevisionSlot;
    /** @return string[] names of the slots available for the revision */
    public function listRevisionSlots( TitleValue $title, $revisionId ): array;
}

abstract class RevisionSlot {
    abstract public function getContentModel(): string;
    abstract public function getTouchedTime(): ?string; // null for virtual slots
}

For virtual slots, getRevisionContent() and getRevisionSlot() would generate the desired information on the fly.

The initial implementation of RevisionContentLookup (we may not even need a separate interface, just this implementation) would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot (and virtual slots) are supported. Implementation steps:

  • Move storage layer code for accessing revision content from Revision into RevisionContentLookup
  • Change Revision to use a (global default) instance of RevisionContentLookup to access revision content.
  • The initial implementation of RevisionContentLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.
  • The initial implementation of RevisionContentLookup (or a suitable decorator) should provide in-process LRU caching for slot content (and perhaps also slot meta-data).
  • Extension points shall be provided in getRevisionContent(), getRevisionSlot() and listRevisionSlots() for implementing virtual slots. It may be sufficient to use the standard Hook mechanism, but it is probably more useful in the long run to define a registry for slot handlers ("revision content providers"), which can be accessed by the RevisionContentLookup:
interface RevisionContentProviderRegistry {
    public function getContentProvider( $slot ): RevisionContentProvider;
    /** @return string[] slot names */
    public function listAvailableSlots( $slot ): array;
}

abstract class RevisionContentProvider {
    abstract public function getSlotName(): string;
    abstract public function getRevisionContent( RevisionContentLookup $lookup, TitleValue $title, $revisionId ): Content;
    abstract public function getRevisionSlot( RevisionContentLookup $lookup, TitleValue $title, $revisionId ): RevisionSlot;
}

Note that the RevisionContentProvider gets a RevisionContentLookup, so it can retrieve the content of the slots it needs to operate. This implies that RevisionContentLookup should apply aggressive in-process caching on Content objects.
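
The in-process LRU caching suggested above could be implemented as a decorator around the lookup service. The sketch below is deliberately simplified and makes several assumptions: the interface SimpleContentLookup is a hypothetical stand-in for RevisionContentLookup that omits the title and returns plain strings instead of Content objects, and CountingLookup exists only to demonstrate cache hits.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for RevisionContentLookup (hypothetical): titles are
// omitted and content is a plain string instead of a Content object.
interface SimpleContentLookup
{
    public function getRevisionContent(int $revisionId, string $slot = 'main'): string;
}

// Decorator that keeps the most recently used entries in memory.
final class LruCachingLookup implements SimpleContentLookup
{
    private SimpleContentLookup $inner;
    private int $maxEntries;
    /** @var array<string,string> cache key => content, least recently used first */
    private array $cache = [];

    public function __construct(SimpleContentLookup $inner, int $maxEntries = 100)
    {
        $this->inner = $inner;
        $this->maxEntries = $maxEntries;
    }

    public function getRevisionContent(int $revisionId, string $slot = 'main'): string
    {
        $key = $revisionId . ':' . $slot;
        if (array_key_exists($key, $this->cache)) {
            $content = $this->cache[$key];
            unset($this->cache[$key]); // re-inserted below as most recently used
        } else {
            $content = $this->inner->getRevisionContent($revisionId, $slot);
            if (count($this->cache) >= $this->maxEntries) {
                // PHP arrays keep insertion order, so the first key is the oldest.
                unset($this->cache[array_key_first($this->cache)]);
            }
        }
        $this->cache[$key] = $content;
        return $content;
    }
}

// Backend used only for demonstration; it counts how often it is asked.
final class CountingLookup implements SimpleContentLookup
{
    public int $calls = 0;

    public function getRevisionContent(int $revisionId, string $slot = 'main'): string
    {
        $this->calls++;
        return "content of $slot@$revisionId";
    }
}

$backend = new CountingLookup();
$lookup = new LruCachingLookup($backend, 2);
$first = $lookup->getRevisionContent(1);
$second = $lookup->getRevisionContent(1); // served from the cache
```

Because the cache key includes the slot name, a provider that reads several slots of the same revision benefits from the cache in the same way as repeated reads of the main slot.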

A RevisionContentProvider could be implemented to provide access to the ParserCache, or RESTBase services, or use the new VRS infrastructure to provide information about revisions.

Side note: Storage layer code dealing with revision meta-data should also be removed from the Revision object to a service object that provides access to information about revisions. That information would be modeled by a new "dumb" data object called RevisionInfo or some such. But all that is outside the scope of this proposal.

Level II: PageUpdater

Updating a revision is a complex process, with complicated requirements with regard to the usage of transactional logic and deferred updates. To honor these requirements, stateful "interactor" objects are defined in addition to the stateless storage service:

interface RevisionContentStore {
    function newPageUpdater( TitleValue $title, $baseRevisionId ): PageUpdater;
    function newRevisionUpdater( TitleValue $title, $revisionId ): RevisionUpdater;
}

interface PageUpdater {
    function begin();
    function setSlotContent( $slot, Content $content );
    function createRevision( $summary, User $user, $flags = 0 );
    function rollback();
}

interface RevisionUpdater {
    function begin();
    function setSlotContent( $slot, Content $content );
    function updateRevision();
    function rollback();
}

PageUpdater shall be used to create new revisions when a page is edited (or created). RevisionUpdater can be used to update derived content when some dependency (e.g. a template) changes.

Updater behavior:

  • setSlotContent() will fail for any virtual slot.
  • RevisionUpdater::setSlotContent() will fail for any primary slot.
  • $content can be set to null to explicitly remove the slot from the revision, preventing any content of that slot from being re-used from earlier revisions.
  • as indicated by the names, begin(), rollback() and createRevision() (resp. updateRevision()) shall be used to create a transactional bracket around the calls to setSlotContent(). However, no transactional behavior is guaranteed.
  • Implementations shall free any resources (such as database connections) when concluding the transaction.
  • Implementations shall take care to undo any effect that calls to setSlotContent() may have had when the transaction is rolled back.
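
The updater behavior listed above can be illustrated with an in-memory sketch. It is not a proposal for the actual implementation: content is a plain string, createRevision() omits the summary, user and flags parameters, and the class name InMemoryPageUpdater as well as the constructor argument listing virtual slots are assumptions made for this example.

```php
<?php
declare(strict_types=1);

// In-memory sketch of the PageUpdater life cycle (hypothetical class).
final class InMemoryPageUpdater
{
    /** @var array<string,string> slots of the base revision */
    private array $baseSlots;
    /** @var string[] names of virtual slots, which can never be written */
    private array $virtualSlots;
    /** @var array<string,?string>|null pending changes; null while no transaction is open */
    private ?array $pending = null;

    public function __construct(array $baseSlots, array $virtualSlots = [])
    {
        $this->baseSlots = $baseSlots;
        $this->virtualSlots = $virtualSlots;
    }

    public function begin(): void
    {
        $this->pending = [];
    }

    public function setSlotContent(string $slot, ?string $content): void
    {
        if ($this->pending === null) {
            throw new LogicException('begin() must be called first');
        }
        if (in_array($slot, $this->virtualSlots, true)) {
            throw new InvalidArgumentException("Slot '$slot' is virtual");
        }
        $this->pending[$slot] = $content; // null marks the slot for removal
    }

    /** @return array<string,string> the slots of the newly created revision */
    public function createRevision(): array
    {
        if ($this->pending === null) {
            throw new LogicException('begin() must be called first');
        }
        $slots = $this->baseSlots; // unchanged slots are inherited
        foreach ($this->pending as $slot => $content) {
            if ($content === null) {
                unset($slots[$slot]);
            } else {
                $slots[$slot] = $content;
            }
        }
        $this->pending = null; // conclude the transaction
        return $slots;
    }

    public function rollback(): void
    {
        $this->pending = null; // discard everything set since begin()
    }
}

$updater = new InMemoryPageUpdater(['main' => 'old text', 'categories' => 'Foo'], ['html']);
$updater->begin();
$updater->setSlotContent('main', 'new text');
$updater->setSlotContent('categories', null); // remove instead of inheriting
$newSlots = $updater->createRevision();
```

Buffering the changes until createRevision() makes rollback() trivial here; a real implementation that writes blobs eagerly would have to clean them up instead.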

createRevision() (resp. updateRevision()) would take the place of WikiPage::doEditContent(). All the relevant code should be moved to PageUpdater. WikiPage::doEditContent() should then be re-implemented based on a (global default) instance of RevisionContentStore. Maintaining backwards compatibility for hooks may be a major challenge.

The initial implementation of the updaters should allow the storage of additional primary and derived slots using MediaWiki's standard storage mechanism (i.e. the text table and/or the ExternalStore mechanism). The association between revisions and content blobs, along with the meta-data for each slot, shall be maintained in a new database table revision_slots with the following fields:

TABLE revision_slots (
  slot_revision INT -- -> revision_id
  slot_name VARCHAR -- or int ID
  slot_blob VARCHAR -- url/uri string, e.g. text:12345 or extstore:.... or restbase:...
  slot_flags INT -- primary, derived, virtual (ENUM would be nice, but isn't portable)
  slot_touched CHAR(14) -- timestamp; similar to page_touched; for updatable derived content
  slot_content_size INT -- content size in bogo-bytes, as provided by Content::getSize()
  slot_blob_size INT -- blob size in bytes; maybe better put this into the blob store, not into the relation table
  slot_hash CHAR -- blob sha1; maybe better put this into the blob store, not into the relation table
  slot_format VARCHAR -- the serialization format of the blob; perhaps drop this, since in practice, this has proven rather useless
  slot_model VARCHAR -- content model of the slot's content objects; there should probably be a default content model per slot name, except for the main slot, where the default would depend on the namespace and title, as it does now
  slot_model_version VARCHAR -- or INT; proposed; allows the representation of content to evolve, which could be useful for things like "wikitext without nasty exceptions" or "wikibase entity structure version 3"
)

The following fields in the revision table should be modified or deprecated (or dropped, at least for new installs):

  • rev_text_id (deprecated, can be kept for B/C if the main slot content is in the text table or external store)
  • rev_deleted (as before - or do we need this per slot?)
  • rev_len (sum of the sizes of all primary slots associated with this revision)
  • rev_sha1 (hash of the hashes of all primary slots; for B/C, could be the main slot's hash)
  • rev_content_format (deprecated; could be the format of the main slot for B/C)
  • rev_content_model (deprecated; could be the model of the main slot for B/C)
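
How rev_len and rev_sha1 could be derived from the primary slots is sketched below. The exact combination scheme is not specified in this proposal, so this example assumes that the hex SHA-1 hashes of the slots are concatenated in slot name order and hashed again. The function name summarizeRevision() and the array layout are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Derives aggregate revision fields from the primary slots.
 *
 * @param array<string,array{size:int,sha1:string}> $primarySlots keyed by slot name
 * @return array{rev_len:int,rev_sha1:string}
 */
function summarizeRevision(array $primarySlots): array
{
    // Sort by slot name so that the combined hash does not depend on input order.
    ksort($primarySlots);

    $length = 0;
    $hashes = '';
    foreach ($primarySlots as $slot) {
        $length += $slot['size'];
        $hashes .= $slot['sha1'];
    }

    // Assumed scheme: hash of the concatenated slot hashes.
    return ['rev_len' => $length, 'rev_sha1' => sha1($hashes)];
}

$main = 'Hello world';
$meta = '{"lead":"Foo.jpg"}';
$summary = summarizeRevision([
    'metadata' => ['size' => strlen($meta), 'sha1' => sha1($meta)],
    'main' => ['size' => strlen($main), 'sha1' => sha1($main)],
]);
```

For backwards compatibility, an implementation could instead store the main slot's hash in rev_sha1, as noted in the list above.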

Note that virtual slots would not have entries in the revision_slots table. It is sufficient for the respective RevisionContentProvider to declare that the slot is available for the given revision. The association between the revision and the content of virtual slots is purely programmatic.

Level III: BlobStore and SlotDataStore

It should be possible to use different storage backends for different slots (or for different content models, TBD). For each storage mechanism, a BlobStore service would be provided:

interface BlobStore {
    function storeBlob( $data ): string;
    function loadBlob( $address ): string;
}

Besides the SQL-based storage currently used by MediaWiki, BlobStores could be implemented on top of the raw file system, Cassandra, OpenStack Swift, higher level HTTP based services like RESTBase, etc.

Addressing blobs:

  • The BlobStore has full control over the address that shall be used later to retrieve the blob. storeBlob() returns the address that can be used with loadBlob() to retrieve the blob.
  • The address is completely opaque. It may be based on the blob's content hash, use incremental numbering, or GUIDs, or some other scheme.

TBD: should the BlobStore be able to store, and be aware of, any meta-data such as the blob's MIME type?

The mapping between slots and storage backends is maintained by the SlotDataStore service:

interface SlotDataStore {
    function storeSlotData( $data, $slot ): string;
    function loadSlotData( $url ): string;
}

Addressing slot data:

  • The mapping between slots and storage backends is implemented in two steps:
    • BlobStore names are associated with BlobStore implementations and configurations, designating a concrete storage location. This association must NEVER change, otherwise any stored data will become inaccessible (this is similar to how ExternalStore clusters are configured).
    • slot names (or content models, TBD) are associated with a BlobStore name. This indicates which store is to be used when storing new data. This association can be changed at will.
  • The string returned by storeSlotData() is an opaque URL for later loading the slot data using loadSlotData(). It is composed of two parts: the name of the BlobStore, and the address returned by the BlobStore.
  • loadSlotData() relies on the prefix in the $url to find the correct BlobStore to load the slot data.
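
The two-step mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the intended implementation: the type hints on BlobStore are added for the example, MemoryBlobStore and RoutingSlotDataStore are hypothetical classes, and the URL format "storeName:address" is one possible reading of "composed of two parts".

```php
<?php
declare(strict_types=1);

interface BlobStore
{
    public function storeBlob(string $data): string;
    public function loadBlob(string $address): string;
}

// Content-addressed in-memory store, for demonstration only.
final class MemoryBlobStore implements BlobStore
{
    /** @var array<string,string> address => data */
    private array $blobs = [];

    public function storeBlob(string $data): string
    {
        $address = sha1($data); // the address scheme is up to the store
        $this->blobs[$address] = $data;
        return $address;
    }

    public function loadBlob(string $address): string
    {
        if (!isset($this->blobs[$address])) {
            throw new OutOfBoundsException("No blob at $address");
        }
        return $this->blobs[$address];
    }
}

final class RoutingSlotDataStore
{
    /** @var array<string,BlobStore> store name => store; must never change */
    private array $stores;
    /** @var array<string,string> slot name => store name for new data; may change */
    private array $slotToStore;
    private string $defaultStore;

    public function __construct(array $stores, array $slotToStore, string $defaultStore)
    {
        $this->stores = $stores;
        $this->slotToStore = $slotToStore;
        $this->defaultStore = $defaultStore;
    }

    public function storeSlotData(string $data, string $slot): string
    {
        $name = $this->slotToStore[$slot] ?? $this->defaultStore;
        return $name . ':' . $this->stores[$name]->storeBlob($data);
    }

    public function loadSlotData(string $url): string
    {
        // The prefix before the first colon names the BlobStore.
        $parts = explode(':', $url, 2);
        if (count($parts) !== 2 || !isset($this->stores[$parts[0]])) {
            throw new InvalidArgumentException("Unknown blob store in URL: $url");
        }
        return $this->stores[$parts[0]]->loadBlob($parts[1]);
    }
}

$stores = ['sql' => new MemoryBlobStore(), 'fast' => new MemoryBlobStore()];
$store = new RoutingSlotDataStore($stores, ['html' => 'fast'], 'sql');
$url = $store->storeSlotData('<p>Hello</p>', 'html');

// Re-routing the slot affects only new writes; the old URL still resolves.
$rerouted = new RoutingSlotDataStore($stores, ['html' => 'sql'], 'sql');
$loaded = $rerouted->loadSlotData($url);
```

Because the store name is embedded in the URL, changing the slot-to-store mapping never invalidates existing data, while renaming or reconfiguring a store would.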

Once the SlotDataStore service is available, PageUpdater, RevisionUpdater, and RevisionContentLookup (resp. a RevisionContentProvider) shall be implemented on top of it. Maintaining any meta-information about the slots (in the new revision_slots table), and handling serialization, remains the responsibility of PageUpdater and RevisionUpdater. Interpreting such meta-data from the revision_slots table, and handling deserialization, remains the responsibility of RevisionContentLookup resp. an appropriate implementation of RevisionContentProvider.