Manual:ContentHandler

The ContentHandler facility is a mechanism for supporting arbitrary content types on wiki pages, instead of relying on wikitext for everything. It was developed as part of the Wikidata project and has been part of MediaWiki core since version 1.21.

The canonical architecture documentation of the ContentHandler is in the file docs/contenthandler.txt of every MediaWiki installation. The latest version can be viewed online in the Git repository. A copy that may not be up to date can be found at Manual:ContentHandler/Doc.

Rationale
The rationale behind this rather radical change is that being forced to rely on wikitext for all content makes a lot of things quite cumbersome in MediaWiki. The new pluggable architecture for arbitrary types of page content is intended to let us:


 * use a different markup language on some or all pages, such as TeX or Markdown.
 * do away with special cases for CSS and JavaScript pages.
 * store and edit structured configuration data in a more sensible way than e.g. what the Gadgets extension uses on MediaWiki:gadgets-definition or the LanguageConverter on MediaWiki:Conversiontable*** pages.
 * provide data "attachments" to wikitext pages, e.g. for geodata (using a "multipart" content model for the page, similar to the way email attachments are implemented using the multipart message format).
 * transition to a system where categories etc. are not maintained in the wikitext itself, while still being stored and versioned in the usual way (again, using a multipart content model).
 * store structured data for Wikidata easily and natively as page content.

Design idea
The idea is to store other kinds of data in exactly the same way as wikitext is stored currently, but make MediaWiki aware of the type of content it is dealing with for every page. This way, any kind of data can be used as the content of a wiki page, and it would be stored and versioned exactly as before. To achieve this, the following was implemented in the MediaWiki core:


 * keep track of the content model of every page. This is done primarily in the page table in the database (also in the revision and archive tables), and made accessible through the relevant core classes such as Title, Revision and WikiPage. The content model defines the native form of the content, be it a string containing text, a nested structure of arrays, or a PHP object. All operations on the content are performed on its native form.
 * keep track of the content format (serialization format) of every revision. This is done primarily in the revision table in the database (also in the archive table, but not in the page table), and made accessible through the relevant core classes such as Revision. Note that the serialization format is only relevant when loading and storing the revision; no operations are performed on the serialized form of the content.
 * Note: in case of flat text content (such as wikitext), the native form of the content is the same as the serialized form (namely, a string). However, conceivably, the native form of wikitext could be some form of AST or DOM in the future.
 * Note: the page table records the content model for the current revision, while the revision table records the content model and serialization format. Model and format may in theory both change from revision to revision, though this may be confusing, and doesn't allow for meaningful diffs.

This means that all code that needs to perform any operation on the content must be aware of the content's native form. This knowledge is encapsulated using a pluggable framework of handlers, based on two classes:


 * The Content class represents the content as such, and provides an interface for all standard operations to be performed on the content's native form. It does not have any knowledge of the page or revision the content belongs to. Content objects are generally, but not necessarily, immutable.
 * The ContentHandler class represents the knowledge about the specifics of a content model, without access to a concrete Content object. Most importantly, instances of ContentHandler act as a factory for Content objects and provide serialization/deserialization. ContentHandler objects are stateless singletons, one for each content model.
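
The split between the two classes can be illustrated with a small, language-neutral model. The following is a minimal sketch in Python, not MediaWiki code: the names TextHandler, JsonHandler and get_handler, as well as the model and format identifiers, are assumptions made up for this illustration. It only shows the shape of the design, namely an immutable content object holding the native form and one shared, stateless handler per content model that creates content objects and converts them to and from their serialized form.

```python
import json
from dataclasses import dataclass
from typing import Any

# Hypothetical model and format identifiers, chosen for this sketch only.
MODEL_TEXT = "text"
MODEL_JSON = "json"
FORMAT_TEXT = "text/plain"
FORMAT_JSON = "application/json"


@dataclass(frozen=True)
class Content:
    """Content in its native form. Knows its model, not its page or revision."""
    model: str
    native: Any


class ContentHandler:
    """Stateless knowledge about one content model: factory and (de)serialization."""
    model: str
    default_format: str

    def serialize(self, content: Content) -> str:
        raise NotImplementedError

    def unserialize(self, blob: str) -> Content:
        raise NotImplementedError


class TextHandler(ContentHandler):
    model = MODEL_TEXT
    default_format = FORMAT_TEXT

    def serialize(self, content: Content) -> str:
        # For flat text, the native form already is the serialized form.
        return content.native

    def unserialize(self, blob: str) -> Content:
        return Content(self.model, blob)


class JsonHandler(ContentHandler):
    model = MODEL_JSON
    default_format = FORMAT_JSON

    def serialize(self, content: Content) -> str:
        # The native form is a nested structure; it is only turned into
        # a string when the revision is stored.
        return json.dumps(content.native, sort_keys=True)

    def unserialize(self, blob: str) -> Content:
        return Content(self.model, json.loads(blob))


# One shared handler instance per content model.
_HANDLERS = {h.model: h for h in (TextHandler(), JsonHandler())}


def get_handler(model: str) -> ContentHandler:
    """Look up the single handler instance registered for a content model."""
    return _HANDLERS[model]
```

In this sketch, calling code never touches the serialized string directly: it asks the handler for a Content object, works on the native form, and hands the object back to the handler when the revision has to be stored.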

The ContentHandler is also used to generate suitable instances of subclasses of Article, EditPage, DifferenceEngine, etc. This way, a specialized UI for each content type can easily be plugged in through the ContentHandler interface.

All code that accesses the revision text in any way should be changed to use the methods provided by the Content object instead. Core classes that provide access to the revision text (most importantly, Revision and WikiPage) have been adapted to provide access to the appropriate Content object instead of the text.

Backward compatibility
The assumption that pages contain wikitext is widespread throughout the MediaWiki code base. Remaining compatible with the parts of the code that still assume this, especially extensions, is thus quite important. The right way to provide good compatibility is of course not to change public interfaces. Thus, all methods providing access to the revision content (like Revision::getText, etc.) remain in place, and are complemented with an alternative method that allows access to the content object instead (e.g. Revision::getContent). The text-based methods are now deprecated, but continue to function exactly as before for all pages/revisions that contain wikitext. This is also true for the web API.

A convenience method, ContentHandler::getContentText, is provided to make it easy to retrieve a page's text. For flat text-based content models such as wikitext (but also JS and CSS), getContentText will just return the text, so the old text-based methods return the same as they did before for such revisions. However, if a text-based backward-compatibility method is called on a page/revision that does not contain wikitext (or another flat text content model, such as CSS), the behavior depends on the setting of $wgContentHandlerTextFallback: ignore makes it return null, fail causes it to raise an exception, and serialize causes it to return the default serialization of the content. The default is ignore, which is probably the most conservative option in most scenarios.
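
The three fallback modes can be summarized as a small decision function. The following is a minimal sketch in Python that models the behavior described above. It is not the MediaWiki implementation: the function name get_content_text, the TEXT_MODELS set, its identifier strings, the NonTextContentError exception and the use of JSON as the "default serialization" are all assumptions made for this illustration.

```python
import json
from typing import Any, Optional

# Content models whose native form is already a flat string.
# The identifier strings are assumptions for this sketch.
TEXT_MODELS = {"wikitext", "javascript", "css"}


class NonTextContentError(Exception):
    """Hypothetical error raised in 'fail' mode for non-text content."""


def get_content_text(model: str, native: Any,
                     fallback: str = "ignore") -> Optional[str]:
    """Return the text of some content, applying the configured fallback.

    'fallback' plays the role of $wgContentHandlerTextFallback.
    """
    if model in TEXT_MODELS:
        # Flat text: old text-based callers see exactly what they saw before.
        return native
    if fallback == "ignore":
        return None
    if fallback == "fail":
        raise NonTextContentError(f"content model {model!r} is not text")
    if fallback == "serialize":
        # Stand-in for the content model's default serialization.
        return json.dumps(native, sort_keys=True)
    raise ValueError(f"unknown fallback mode: {fallback!r}")
```

With the default mode, legacy code that asks for the text of a non-text page simply receives nothing, which is why ignore tends to be the least surprising choice for callers that were written with wikitext in mind.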

For editing, however, non-text content is not supported by default. EditPage and the respective handlers in the web API have been changed to fail for non-textual content.

Links

 * Content and ContentHandler classes:
   * Content.php
   * ContentHandler.php

 * Settings:
   * $wgContentHandlers
   * $wgContentHandlerTextFallback
   * $wgContentHandlerUseDB
   * $wgNamespaceContentModels