Requests for comment/Content model storage

Request for comment (RFC)
Content model storage
Component General
Creation date
Author(s) Legoktm
Document status accepted
Approved but not implemented in 2015, and revisited in 2016. See Phab:T105652

A proposal to change how the content model and content format are stored in the page, revision, and archive tables. The issue has mainly come up because namespaces are being switched from a default content model of "wikitext" to "flow-board".

Background

Current abbreviated schema:

-- page
-- content model, see CONTENT_MODEL_XXX constants
page_content_model varbinary(32) DEFAULT NULL,

-- revision
-- content model, see CONTENT_MODEL_XXX constants
rev_content_model varbinary(32) DEFAULT NULL,

-- content format, see CONTENT_FORMAT_XXX constants
rev_content_format varbinary(64) DEFAULT NULL

-- archive
-- content model, see CONTENT_MODEL_XXX constants
ar_content_model varbinary(32) DEFAULT NULL,

-- content format, see CONTENT_FORMAT_XXX constants
ar_content_format varbinary(64) DEFAULT NULL

Problems

The current system for storing a page's (and revision's) content model in the database is non-optimal.

  • The content model and format of a revision are stored as NULL if they are the default. This makes changing the default problematic, because it requires updating every row in the revision and archive tables whose default is changing.
    • For example, moving a page from "MediaWiki:FooBar.js" -> "MediaWiki:FooBar.css" changes the default content model, so the move has to update all revision history rows to set an explicit rev_content_model (gerrit:226938).
  • Content model and content format are stored as strings ("wikitext" and "text/x-wiki" respectively), which is inefficient from a storage point of view.
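
To make the first problem concrete, here is a minimal sketch in Python (not MediaWiki code). The `effective_model` helper and the row dictionaries are hypothetical; they only imitate how a NULL rev_content_model is resolved against whatever the default happens to be at read time.

```python
# Hypothetical illustration: a row stores None (NULL) when its model
# equalled the default at the time the revision was saved.
def effective_model(stored_model, default_model):
    """Resolve a stored content model, treating None (NULL) as 'the default'."""
    return default_model if stored_model is None else stored_model


# Revisions written while the default content model was "wikitext".
revisions = [
    {"rev_id": 1, "rev_content_model": None},
    {"rev_id": 2, "rev_content_model": None},
    {"rev_id": 3, "rev_content_model": "javascript"},
]


def models_under_default(rows, default_model):
    """Return the model each row is read as under the given default."""
    return [effective_model(r["rev_content_model"], default_model) for r in rows]


# Same rows, read before and after the default is switched.
before = models_under_default(revisions, "wikitext")
after = models_under_default(revisions, "flow-board")
```

Rows 1 and 2 are silently reinterpreted once the default changes, unless every NULL row is first rewritten with an explicit value. That rewrite is the cost the proposal aims to avoid.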

Proposal

Create two new tables: one to assign each content model an id, and another to assign each content format an id.

CREATE TABLE /*_*/content_model (
  -- primary key
  cm_id smallint NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- content model, see CONTENT_MODEL_XXX constants
  cm_model VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/cm_model ON /*_*/content_model (cm_model);

CREATE TABLE /*_*/content_format (
  -- primary key
  cf_id smallint NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- content format, see CONTENT_FORMAT_XXX constants
  cf_format VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/cf_format ON /*_*/content_format (cf_format);

The page, revision, and archive tables will be modified to have only *_content_model_id and *_content_format_id columns. These fields will always be populated, even when the value is the default (except during the schema change, while the columns have not been populated yet).

The numerical ids will only be stored in the database and used by the classes that interact with it directly (Revision, PageArchive, LinkCache), through a quick lookup interface that is cached in APC or something similar. PHP code is expected to continue using the PHP API (Title::getContentModel(), Revision::getContentModel(), etc.), which will use the text form of the model and format. api.php output and dumps would continue to output the text forms as well.
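
One possible shape for such a lookup interface is sketched below in Python with SQLite standing in for the real database. The `ContentModelLookup` class and its method names are hypothetical, a plain dict stands in for APC, and registering unknown models on first use is an assumption rather than something this RFC specifies.

```python
import sqlite3


class ContentModelLookup:
    """Hypothetical name <-> id lookup backed by the content_model table."""

    def __init__(self, conn):
        self.conn = conn
        # In-process dicts stand in for APC or a similar cache.
        self._id_by_name = {}
        self._name_by_id = {}

    def _remember(self, cm_id, name):
        self._id_by_name[name] = cm_id
        self._name_by_id[cm_id] = name

    def get_id(self, name):
        """Return the id for a model name, inserting a row if it is new (assumption)."""
        if name in self._id_by_name:
            return self._id_by_name[name]
        row = self.conn.execute(
            "SELECT cm_id FROM content_model WHERE cm_model = ?", (name,)
        ).fetchone()
        if row is None:
            cur = self.conn.execute(
                "INSERT INTO content_model (cm_model) VALUES (?)", (name,)
            )
            cm_id = cur.lastrowid
        else:
            cm_id = row[0]
        self._remember(cm_id, name)
        return cm_id

    def get_name(self, cm_id):
        """Return the text form for an id; raise KeyError if the id is unknown."""
        if cm_id in self._name_by_id:
            return self._name_by_id[cm_id]
        row = self.conn.execute(
            "SELECT cm_model FROM content_model WHERE cm_id = ?", (cm_id,)
        ).fetchone()
        if row is None:
            raise KeyError(cm_id)
        self._remember(cm_id, row[0])
        return row[0]


# Toy setup: an in-memory table shaped like the proposed content_model table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_model ("
    "cm_id INTEGER PRIMARY KEY AUTOINCREMENT, cm_model TEXT NOT NULL)"
)
lookup = ContentModelLookup(conn)
wikitext_id = lookup.get_id("wikitext")
flow_id = lookup.get_id("flow-board")
```

Callers above the storage layer would keep passing strings such as "wikitext"; only the classes that read or write rows would translate through the lookup.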

Creating a separate content_format table will prevent us from needing to do a similar change in the future whenever we'd like to change that default, though there is no current need to set a non-default format AFAIK.

Migration

  • Add the new page_content_model_id, page_content_format_id, rev_content_model_id, rev_content_format_id, ar_content_model_id, and ar_content_format_id columns.
  • Update MediaWiki to read those columns and fall back on the old *_content_model and *_content_format columns if the new ones aren't populated yet.
  • Run a maintenance script to populate the new columns.
  • Drop the old *_content_model and *_content_format columns.

This will be done automatically by update.php.
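
The first three steps can be sketched as follows, again in Python with SQLite as a stand-in and only the revision table's model column. The helper names (`model_id`, `read_model`, `populate`) are hypothetical, and `DEFAULT_MODEL` is a single assumed constant, whereas a real wiki derives the default per page. This is not what update.php actually runs.

```python
import sqlite3

DEFAULT_MODEL = "wikitext"  # assumed default for this toy wiki

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_model (
        cm_id INTEGER PRIMARY KEY AUTOINCREMENT,
        cm_model TEXT NOT NULL
    );
    -- Old layout: NULL in rev_content_model means the default model.
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_content_model TEXT DEFAULT NULL
    );
    INSERT INTO revision (rev_id, rev_content_model)
        VALUES (1, NULL), (2, 'javascript');
    -- Step 1: add the new column, initially unpopulated.
    ALTER TABLE revision ADD COLUMN rev_content_model_id INTEGER DEFAULT NULL;
""")


def model_id(name):
    """Return the id for a model name, creating the row if needed."""
    row = conn.execute(
        "SELECT cm_id FROM content_model WHERE cm_model = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO content_model (cm_model) VALUES (?)", (name,))
    return cur.lastrowid


def read_model(rev_id):
    """Step 2: prefer the new id column, fall back on the old text column."""
    new_id, old_text = conn.execute(
        "SELECT rev_content_model_id, rev_content_model "
        "FROM revision WHERE rev_id = ?",
        (rev_id,),
    ).fetchone()
    if new_id is not None:
        return conn.execute(
            "SELECT cm_model FROM content_model WHERE cm_id = ?", (new_id,)
        ).fetchone()[0]
    return old_text if old_text is not None else DEFAULT_MODEL


def populate():
    """Step 3: fill the new column for unpopulated rows, making defaults explicit."""
    rows = conn.execute(
        "SELECT rev_id, rev_content_model FROM revision "
        "WHERE rev_content_model_id IS NULL"
    ).fetchall()
    for rev_id, old_text in rows:
        conn.execute(
            "UPDATE revision SET rev_content_model_id = ? WHERE rev_id = ?",
            (model_id(old_text or DEFAULT_MODEL), rev_id),
        )


before = [read_model(1), read_model(2)]
populate()
after = [read_model(1), read_model(2)]
unpopulated = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_content_model_id IS NULL"
).fetchone()[0]
```

Reads return the same text form before and after population, which is what lets the code change ship ahead of the maintenance script. The final step, dropping the old columns, is left out of the sketch.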

See also