Wikibase/Notes/ContentHandler


This page describes how MediaWiki could be made to natively support page content other than wikitext.

Quick Launch

A prototype implementation exists in the wikidata branch of the core repository in Gerrit. Note that this is an early proof of concept, not in any way a finished implementation.

Core functionality is defined in the following files:

Some core classes of MediaWiki have undergone heavy changes, most importantly:

Some fields have been added to the database tables page, revision and archive:

A very basic extension providing a handler for JSON data is available as the WikidataClient extension.

Design Overview

Rationale

The rationale behind this rather radical change is that being forced to rely on wikitext for all content makes a lot of things quite cumbersome in MediaWiki. A pluggable architecture for arbitrary types of page content would allow us to:

  • use a different syntax on some or all pages, such as TeX or Markdown
  • do away with special cases for CSS and JavaScript pages
  • store and edit structured configuration data in a more sensible way than e.g. what the Gadgets extension uses on MediaWiki:gadgets-definition.
  • provide data "attachments" to wikitext pages, e.g. for geodata (using a "multipart" content model for the page, similar to the way email attachments are implemented using the multipart message format)
  • transition to a system where categories etc. are not maintained in the wikitext itself, while still being stored and versioned in the usual way (again, using a multipart content model)
  • store structured data for Wikidata easily and natively as page content.

Design Idea

The idea is to store other kinds of data in exactly the same way as wikitext is stored currently, but make MediaWiki aware of the type of content it is dealing with for every page. This way, any kind of data can be used as the content of a wiki page, and it would be stored and versioned exactly as before. To achieve this, the following must be implemented in the MediaWiki core:

  • keep track of the content model of every page. This is done primarily in the page table in the database (also in the revision and archive tables), and made accessible through the relevant core classes such as Title, Revision, WikiPage and Article. The content model defines the native form of the content, be it a string containing text, a nested structure of arrays, or a PHP object. All operations on the content are performed on its native form.
  • keep track of the content format (serialization format) of every revision. This is done primarily in the revision table in the database (also in the archive table, but not in the page table), and made accessible through the relevant core classes such as Revision, WikiPage and Article (but not Title). Note that the serialization format is only relevant when loading and storing the revision; no operations are performed on the serialized form of the content.
    • Note: in case of flat text content (such as wikitext), the native form of the content is the same as the serialized form (namely, a string). However, conceivably, the native form of wikitext could be some form of AST or DOM in the future.
    • Note: the page table records the content model for the current revision, while the revision records the content model and serialization format. Model and format may both change from revision to revision, though this may be confusing, and doesn't allow for meaningful diffs.
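
The bookkeeping described in the list above can be sketched as follows. This is an illustrative Python model, not MediaWiki code: the record classes, the field names, the save_revision helper and the model and format identifiers are all assumptions made for the example. It shows the page record carrying only the content model of its current revision, while each revision record carries both model and format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model and format identifiers; the text above does not name them.
MODEL_WIKITEXT = "wikitext"
MODEL_JSON = "json"
FORMAT_WIKITEXT = "text/x-wiki"
FORMAT_JSON = "application/json"


@dataclass(frozen=True)
class RevisionRow:
    """One row of the revision (or archive) table."""
    rev_id: int
    page_id: int
    content_model: str   # native form the blob deserializes into
    content_format: str  # serialization format of the stored blob
    blob: str


@dataclass
class PageRow:
    """One row of the page table: it tracks the model only, not the format."""
    page_id: int
    title: str
    content_model: str
    latest_rev_id: Optional[int] = None


def save_revision(page: PageRow, rev: RevisionRow) -> None:
    """Make rev the current revision and mirror its content model onto the page."""
    if rev.page_id != page.page_id:
        raise ValueError("revision belongs to a different page")
    page.latest_rev_id = rev.rev_id
    page.content_model = rev.content_model


page = PageRow(page_id=1, title="Example", content_model=MODEL_WIKITEXT)
save_revision(page, RevisionRow(1, 1, MODEL_WIKITEXT, FORMAT_WIKITEXT, "Hello ''world''"))
# The model may change from one revision to the next, as noted above.
save_revision(page, RevisionRow(2, 1, MODEL_JSON, FORMAT_JSON, '{"greeting": "hello"}'))
```

After the second save, the page record reports the JSON model, because it always mirrors the current revision.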

This means that all code that needs to perform any operation on the content must be aware of the content's native form. This knowledge is encapsulated using a pluggable framework of handlers, based on two classes:

  • The Content class represents the content's native form as such, and provides an interface for all standard operations to be performed on the content. It does not have any knowledge of the page or revision the content belongs to. Content objects are immutable.
  • The ContentHandler class represents the knowledge about the specifics of a content model without access to concrete Content. Most importantly, instances of ContentHandler act as a factory for Content objects and provide serialization/deserialization. ContentHandler objects are stateless singletons, one for each content model.
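
A minimal sketch of this split, written in Python for brevity rather than in MediaWiki's PHP. The method names (register, for_model, serialize, unserialize), the two example handlers and the format strings are assumptions for illustration and are not taken from the prototype. Note that the frozen dataclass only prevents reassigning fields; it does not deep-freeze a nested native structure.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Content:
    """Native form of some content; knows nothing about pages or revisions."""
    model: str
    native: Any  # a str for flat text, a nested structure for JSON


class ContentHandler:
    """Stateless knowledge about one content model, kept as one instance per model."""
    model = ""
    default_format = ""
    _registry: Dict[str, "ContentHandler"] = {}

    @classmethod
    def register(cls, handler: "ContentHandler") -> None:
        cls._registry[handler.model] = handler

    @classmethod
    def for_model(cls, model: str) -> "ContentHandler":
        return cls._registry[model]

    def serialize(self, content: Content) -> str:
        raise NotImplementedError

    def unserialize(self, blob: str) -> Content:
        raise NotImplementedError


class TextContentHandler(ContentHandler):
    model = "wikitext"
    default_format = "text/x-wiki"

    def serialize(self, content: Content) -> str:
        return content.native  # native and serialized forms coincide

    def unserialize(self, blob: str) -> Content:
        return Content(self.model, blob)


class JsonContentHandler(ContentHandler):
    model = "json"
    default_format = "application/json"

    def serialize(self, content: Content) -> str:
        return json.dumps(content.native, sort_keys=True)

    def unserialize(self, blob: str) -> Content:
        return Content(self.model, json.loads(blob))


ContentHandler.register(TextContentHandler())
ContentHandler.register(JsonContentHandler())

handler = ContentHandler.for_model("json")
content = handler.unserialize('{"b": 2, "a": 1}')
```

Callers obtain the handler by model name and never construct Content from a stored blob themselves, which keeps all knowledge of the serialization format inside the handler.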

The ContentHandler is also used to generate suitable instances of subclasses of Article, EditPage, DifferenceEngine, etc. This way, a specialized UI for each content type can easily be plugged in through the ContentHandler interface.

All core code that currently accesses the revision text in any way should be changed to use the methods provided by the Content object instead. Core classes that provide access to the revision text (most importantly, Revision, WikiPage and Article) need to be changed to provide access to the appropriate Content object instead.

Besides the core classes representing the content (the new Content class and the old classes Revision, WikiPage and Article) and the classes used to interact with the content (most notably EditPage), some more classes/interfaces need to be generalized:

  • LinksUpdate shall be refactored to allow updating any kind of secondary data storage for indexed access to information from the content, be it through special database tables or even using another external storage mechanism.
  • ParserOutput will also likely need to be generalized to facilitate additional types of information extracted from the content, e.g. the index entries mentioned above.
  • DifferenceEngine shall be generalized to provide an interface that can be used to show differences of revisions of any kind of content.
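
As one possible reading of the generalized diff interface in the last item, the sketch below dispatches to a model-specific diff function. It is written in Python as a language-neutral illustration; the function names, the DIFFERS table and the output format are all hypothetical and do not describe the actual DifferenceEngine.

```python
import difflib
from typing import Any, Callable, Dict, List


def diff_text(old: str, new: str) -> List[str]:
    """Line-based diff for flat text; keeps only added and removed lines."""
    lines = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in lines if line.startswith(("+ ", "- "))]


def diff_mapping(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Key-by-key diff for structured content stored as a flat mapping."""
    out: List[str] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            out.append(f"- {key}: {old[key]!r}")
        elif key not in old:
            out.append(f"+ {key}: {new[key]!r}")
        elif old[key] != new[key]:
            out.append(f"- {key}: {old[key]!r}")
            out.append(f"+ {key}: {new[key]!r}")
    return out


# Hypothetical table mapping a content model to its diff function.
DIFFERS: Dict[str, Callable[[Any, Any], List[str]]] = {
    "wikitext": diff_text,
    "json": diff_mapping,
}


def diff_content(old_model: str, old: Any, new_model: str, new: Any) -> List[str]:
    """Diff two revisions, refusing to compare different content models."""
    if old_model != new_model:
        raise ValueError("cannot diff revisions with different content models")
    return DIFFERS[old_model](old, new)


changes = diff_content("json", {"a": 1, "b": 2}, "json", {"a": 1, "b": 3, "c": 4})
```

Refusing to compare revisions with different models reflects the earlier remark that a model change between revisions does not allow for meaningful diffs.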

Backward Compatibility

The assumption that pages contain wikitext is widespread throughout the MediaWiki code base. Remaining compatible with code that still makes this assumption, especially in extensions, is therefore important. The best way to preserve compatibility is not to change public interfaces. Thus, all methods providing access to the revision content (such as Revision::getText()) remain in place, and each is complemented by an alternative method that provides access to the content object instead (e.g. Revision::getContent()). The text-based methods may be deprecated, but shall function exactly as before for all pages and revisions that contain wikitext. This is also true for the action API.

The text-based methods shall be rewritten based on the content-object based methods, and shall use ContentHandler::getContentText() to provide the text to return. For flat text-based content models such as wikitext (but also JS and CSS), getContentText() will just return the text, so the old text-based method will return the same as it did before for such revisions. However, in case a text-based B/C method is called on a page/revision that does not contain wikitext (or another flat text content model, such as CSS), the behavior depends on the setting of $wgContentHandlerTextFallback: ignore makes it return null, fail causes it to raise an exception, and serialize causes it to return the default serialization of the content. The default is ignore, which is probably the most conservative option in most scenarios.
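
The three fallback settings can be sketched as a single function. This is a Python illustration of the behavior described above, not the PHP implementation: the function signature, the exception type and the use of JSON as the default serialization are assumptions.

```python
import json
from typing import Any, Optional

# Flat text models mentioned above; their native form is already a string.
FLAT_TEXT_MODELS = {"wikitext", "javascript", "css"}


class UnsupportedContentError(Exception):
    """Hypothetical exception type for the 'fail' setting."""


def get_content_text(model: str, native: Any, text_fallback: str = "ignore") -> Optional[str]:
    """Return text for legacy callers, following the three fallback settings."""
    if model in FLAT_TEXT_MODELS:
        return native  # same result as the old text-based method
    if text_fallback == "ignore":
        return None
    if text_fallback == "fail":
        raise UnsupportedContentError(f"content model {model!r} has no text form")
    if text_fallback == "serialize":
        # Stand-in for the model's default serialization; JSON is an assumption.
        return json.dumps(native, sort_keys=True)
    raise ValueError(f"unknown text fallback setting: {text_fallback!r}")
```

With the default setting, a legacy caller asking for the text of a structured page gets nothing back rather than a serialized blob it might mistake for wikitext.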

For editing, however, serialization is preferred. EditPage and the respective handlers in the web API are changed to use the default serialization format of the content, so all content can be edited via the traditional interface. However, depending on the serialization format, some input may be rejected as malformed.
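
The editing round trip can be sketched as follows, again in Python and under stated assumptions: JSON stands in for the default serialization format, and the function names and the error type are hypothetical. The content is serialized for the plain text area and deserialized on submit, at which point malformed input is rejected.

```python
import json
from typing import Any


class MalformedInputError(ValueError):
    """Hypothetical error raised when submitted text cannot be deserialized."""


def text_for_edit_form(native: Any) -> str:
    """Serialize the native form so it fits in a plain text area."""
    return json.dumps(native, indent=2, sort_keys=True)


def content_from_edit_form(submitted: str) -> Any:
    """Deserialize submitted text back into the native form, or reject it."""
    try:
        return json.loads(submitted)
    except json.JSONDecodeError as err:
        raise MalformedInputError(f"not valid JSON: {err.msg}") from err


original = {"label": "Berlin", "population": 3500000}
edited_text = text_for_edit_form(original).replace("Berlin", "Bonn")
saved = content_from_edit_form(edited_text)
```

A submission such as an unterminated object would raise MalformedInputError instead of being saved.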