Incremental dumps/File format/Specification

From mediawiki.org

This page describes the current format of the incremental dumps file. The format is far from finished, so it can change from day to day and this page can easily become out of date.

The format is binary; the file contains various objects and can also contain free space (remaining after deleted objects).

Data types

When reading an object, the type of the next piece of data is always known from context. This means that objects do not contain any information about which field comes next or what its type is.

The encoding for various data types is as follows:

  • integers: 1-, 2-, 4- and 6-byte unsigned integers are saved directly in little-endian order (6-byte integers are used to represent offsets in the file). Signed integers are first cast to unsigned integers of the same size and then saved the same way as unsigned integers.
  • timestamps: timestamps from 1 January 2000 to beyond 2100 with second accuracy are represented as 4-byte unsigned integers. The integer is not the number of seconds from the start date, but is instead directly calculated from parts of the date as (((((Year - 2000) * 12 + Month - 1) * 31 + Day - 1) * 24 + Hour) * 60 + Minute) * 60 + Second.
  • strings: strings are saved as the length of the string (n) followed by n bytes of its content. For short strings (those that are guaranteed to be at most 255 bytes long), the length is a 1-byte integer; for long strings, it is a 4-byte integer.
  • generic lists: lists are saved as a 4-byte count of items (n) followed by n items. The size of each item depends on its type and can be variable (e.g. for a list of strings). This is basically a representation of vector<T>.
  • generic maps: maps are saved as a 2-byte count of items (n) followed by n key-value pairs. This is basically a representation of map<TKey, TValue>.
  • generic pair: pairs are saved as the first item followed by the second item of the pair. Pairs are typically used as the value of a map. This is basically a representation of pair<T1, T2>.
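
The encodings above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not the reference implementation; the function names are hypothetical, and strings are handled as raw bytes because this page does not state a character encoding.

```python
from datetime import datetime


def encode_uint(value: int, size: int) -> bytes:
    """Encode an unsigned integer of 1, 2, 4 or 6 bytes in little-endian order."""
    if size not in (1, 2, 4, 6):
        raise ValueError("unsupported integer size")
    return value.to_bytes(size, "little")


def encode_int(value: int, size: int) -> bytes:
    """Encode a signed integer by reinterpreting it as unsigned of the same size."""
    return encode_uint(value % (1 << (8 * size)), size)


def encode_timestamp(dt: datetime) -> bytes:
    """Pack the date parts with the formula given above (not seconds since 2000)."""
    value = (((((dt.year - 2000) * 12 + dt.month - 1) * 31 + dt.day - 1) * 24
              + dt.hour) * 60 + dt.minute) * 60 + dt.second
    return encode_uint(value, 4)


def decode_timestamp(data: bytes) -> datetime:
    """Invert encode_timestamp by peeling off each date part in reverse order."""
    value = int.from_bytes(data, "little")
    value, second = divmod(value, 60)
    value, minute = divmod(value, 60)
    value, hour = divmod(value, 24)
    value, day = divmod(value, 31)
    year, month = divmod(value, 12)
    return datetime(year + 2000, month + 1, day + 1, hour, minute, second)


def encode_string(content: bytes, long_string: bool = False) -> bytes:
    """Length prefix (1 byte for short strings, 4 bytes for long) plus content."""
    if not long_string and len(content) > 255:
        raise ValueError("short strings hold at most 255 bytes")
    return encode_uint(len(content), 4 if long_string else 1) + content
```

Because every month is treated as having 31 slots, one encoded "year" spans 12 * 31 * 24 * 60 * 60 = 32,140,800 values, which is why a 4-byte integer reaches beyond the year 2100.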

File header

The file header always starts at offset 0 and contains the offsets of the indexes, which can be used to access the data stored in the file.

  • 4 bytes magic number: MWID
  • 1 byte file format version: 1
  • 1 byte data version: 1
  • 1 byte dump kind flags:
    • 0x01 for pages dump: a dump with revision text
    • 0x02 for current dump: a dump without old revisions of pages
    • 0x04 for articles dump: a dump that doesn't contain pages from talk namespaces and the User namespace
  • 6 bytes offset to the end of the file
  • 6 bytes offset to the root of the page id index
  • 6 bytes offset to the root of the revision id index
  • 6 bytes offset to the root of the text group id index
  • 6 bytes offset to the root of the model & format index
  • 6 bytes offset to the free space index
  • 6 bytes offset to the site info object
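
As a rough sketch, the header layout above (4 + 3 + 7 × 6 = 49 bytes) could be read as follows. The function and dictionary key names are hypothetical, and the sketch assumes that the magic number is stored as the four ASCII bytes "MWID".

```python
HEADER_SIZE = 4 + 3 + 7 * 6  # magic, three 1-byte fields, seven 6-byte offsets

# Hypothetical names for the seven offsets, in the order listed above.
OFFSET_FIELDS = (
    "file_end",
    "page_id_index_root",
    "revision_id_index_root",
    "text_group_id_index_root",
    "model_format_index_root",
    "free_space_index",
    "site_info",
)


def parse_header(data: bytes) -> dict:
    """Parse the fixed-size file header found at offset 0."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header is truncated")
    if data[:4] != b"MWID":  # assumption: magic number stored as ASCII
        raise ValueError("bad magic number")
    flags = data[6]
    header = {
        "file_format_version": data[4],
        "data_version": data[5],
        "pages": bool(flags & 0x01),     # dump with revision text
        "current": bool(flags & 0x02),   # dump without old revisions
        "articles": bool(flags & 0x04),  # no talk or User namespaces
    }
    pos = 7
    for name in OFFSET_FIELDS:
        header[name] = int.from_bytes(data[pos:pos + 6], "little")
        pos += 6
    return header
```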

Site info

The site info object contains metadata about the whole wiki and its namespaces.

  • 1 byte object kind: 0x21
  • short string name of the dump (e.g. enwiki)
  • short string timestamp of the dump
  • short string XML language code (e.g. en for English or cs for Czech)
  • short string site name
  • short string base URL
  • short string “generator”: the version of MediaWiki used
  • 1 byte site case:
    • 0x01 for first letter
    • 0x02 for case sensitive
  • map of namespaces
    • the key is signed 2-byte integer namespace id
    • the value is a pair of case (see above) and short string namespace name
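
A minimal reader for the site info object might look like the sketch below. The function and key names are hypothetical, and strings are returned as raw bytes because this page does not state their character encoding.

```python
import io


def parse_site_info(data: bytes) -> dict:
    """Parse a site info object from the bytes starting at its offset."""
    stream = io.BytesIO(data)

    def take(n: int) -> bytes:
        raw = stream.read(n)
        if len(raw) != n:
            raise EOFError("truncated site info object")
        return raw

    def uint(n: int) -> int:
        return int.from_bytes(take(n), "little")

    def sint(n: int) -> int:
        return int.from_bytes(take(n), "little", signed=True)

    def short_string() -> bytes:
        return take(uint(1))

    if uint(1) != 0x21:
        raise ValueError("not a site info object")
    info = {}
    # Six short strings, in the order listed above.
    for field in ("name", "timestamp", "language", "site_name",
                  "base_url", "generator"):
        info[field] = short_string()
    info["case"] = uint(1)  # 0x01 first letter, 0x02 case sensitive
    namespaces = {}
    for _ in range(uint(2)):  # maps use a 2-byte item count
        ns_id = sint(2)
        ns_case = uint(1)
        namespaces[ns_id] = (ns_case, short_string())
    info["namespaces"] = namespaces
    return info
```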

Index

The file currently contains 5 indexes:

  • page id index maps 4 byte page ids to 6 byte offsets of the corresponding page object
  • revision id index maps 4 byte revision ids to 6 byte offsets of the corresponding revision object
  • text group id index maps 4 byte text group ids to 6 byte offsets of the corresponding text group object
  • model & format index maps a 1 byte synthetic id to a pair of short strings representing model and format
    • this index is used to save space; using it, a revision's model and format can be represented as a single byte
  • free space index maps 6 byte offsets of free space blocks to their 4 byte lengths

Each index is saved as a B-tree,[1] with leaf nodes on the last level and inner nodes on the levels above.

A leaf node is saved as:

  • 1 byte object kind: 0x01
  • map of keys to values

An inner node is saved as:

  • 1 byte object kind: 0x02
  • 2 bytes count (n)
  • n keys
  • (n + 1) 6 byte offsets to child nodes
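
The two node layouts can be decoded with a sketch like the following. It covers only indexes whose keys and values are fixed-size integers (by default 4-byte keys and 6-byte offsets, as in the page id index); the function name and the returned tuples are hypothetical. How keys in an inner node select a child is not described on this page, so the sketch stops at decoding.

```python
import io


def parse_index_node(data: bytes, key_size: int = 4, value_size: int = 6):
    """Decode one index node with integer keys and integer values."""
    stream = io.BytesIO(data)

    def uint(n: int) -> int:
        raw = stream.read(n)
        if len(raw) != n:
            raise EOFError("truncated index node")
        return int.from_bytes(raw, "little")

    kind = uint(1)
    count = uint(2)  # map item count (leaf) or key count (inner node)
    if kind == 0x01:
        entries = {}
        for _ in range(count):
            key = uint(key_size)
            entries[key] = uint(value_size)
        return ("leaf", entries)
    if kind == 0x02:
        keys = [uint(key_size) for _ in range(count)]
        children = [uint(6) for _ in range(count + 1)]  # offsets of child nodes
        return ("inner", keys, children)
    raise ValueError("not an index node")
```

For the free space index, the same function would be called with key_size=6 and value_size=4.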

Page

The page object describes a single page and references its revisions. It is saved as:

  • 1 byte object kind: 0x11
  • 4 bytes page id
  • 2 bytes namespace id
  • short string page title
  • short string redirect target; if it is empty, the page is not a redirect
  • list of 4 byte ids of revisions of this page
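
A page object could be decoded roughly as shown below. The names are hypothetical, and the sketch assumes that the namespace id is signed, as it is in the site info namespace map; this page does not say so explicitly for page objects.

```python
import io


def parse_page(data: bytes) -> dict:
    """Parse a page object from the bytes starting at its offset."""
    stream = io.BytesIO(data)

    def take(n: int) -> bytes:
        raw = stream.read(n)
        if len(raw) != n:
            raise EOFError("truncated page object")
        return raw

    def uint(n: int) -> int:
        return int.from_bytes(take(n), "little")

    def short_string() -> bytes:
        return take(uint(1))

    if uint(1) != 0x11:
        raise ValueError("not a page object")
    page = {"id": uint(4)}
    # Assumption: signed, matching the key type of the namespace map.
    page["namespace"] = int.from_bytes(take(2), "little", signed=True)
    page["title"] = short_string()
    page["redirect_target"] = short_string()
    page["is_redirect"] = page["redirect_target"] != b""
    # Generic list: 4-byte count followed by that many 4-byte revision ids.
    page["revision_ids"] = [uint(4) for _ in range(uint(4))]
    return page
```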

Revision

The revision object describes a revision of a page. It is saved as:

  • 1 byte object kind: 0x12
  • 4 bytes revision id
  • 1 byte flags (the applicable values are ORed together)
    • 0x01: minor edit
    • 0x02: the model of this revision is wikitext, the format is text/x-wiki
      • this is a special value that saves an additional byte for the most common model & format
    • 0x04: the contributor for this revision is not an IP-address anonymous user
    • 0x08: the contributor is an IPv4 anonymous user
    • 0x10: the contributor is an IPv6 anonymous user
    • 0x20: the text of this revision was deleted
    • 0x40: the comment of this revision was deleted
    • 0x80: the contributor of this revision was deleted
  • 4 byte id of the parent revision
  • 4 byte timestamp of the revision
  • if the “contributor was deleted” flag is not set:
    • embedded object describing the user who made this revision
  • if the “comment was deleted” flag is not set:
    • short string comment
  • if the “model & format is wikitext” flag is not set:
    • 1 byte id of model & format of this revision
  • if the “text was deleted” flag is not set:
    • 20 bytes (little endian) SHA-1 of the revision
    • if this is a pages dump (with text):
      • 4 bytes id of the text group that contains the text for this revision
      • 1 byte index into the text group of the text of this revision
    • else (a stub dump):
      • 4 byte revision text length
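
Since most fields of a revision are optional, a reader has to consult the flags before each one. The sketch below follows the field order given above and the contributor formats from the User section; the names are hypothetical, and whether the dump is a pages dump has to be passed in from the file header. Raising an error when no contributor flag is set is a choice made for this sketch, because this page does not describe that case.

```python
import io

MINOR, WIKITEXT, NAMED_USER, IPV4, IPV6 = 0x01, 0x02, 0x04, 0x08, 0x10
TEXT_DELETED, COMMENT_DELETED, CONTRIBUTOR_DELETED = 0x20, 0x40, 0x80


def parse_revision(data: bytes, is_pages_dump: bool) -> dict:
    """Parse a revision object from the bytes starting at its offset."""
    stream = io.BytesIO(data)

    def take(n: int) -> bytes:
        raw = stream.read(n)
        if len(raw) != n:
            raise EOFError("truncated revision object")
        return raw

    def uint(n: int) -> int:
        return int.from_bytes(take(n), "little")

    def short_string() -> bytes:
        return take(uint(1))

    if uint(1) != 0x12:
        raise ValueError("not a revision object")
    rev = {"id": uint(4)}
    flags = uint(1)
    rev["minor"] = bool(flags & MINOR)
    rev["parent_id"] = uint(4)
    rev["timestamp"] = uint(4)  # packed date value, see Data types
    if not flags & CONTRIBUTOR_DELETED:
        if flags & IPV4:
            rev["contributor"] = ("ipv4", uint(4))
        elif flags & IPV6:
            rev["contributor"] = ("ipv6", take(16))
        elif flags & NAMED_USER:
            user_id = uint(4)
            rev["contributor"] = ("user", user_id, short_string())
        else:
            raise ValueError("no contributor kind flag set")
    if not flags & COMMENT_DELETED:
        rev["comment"] = short_string()
    if not flags & WIKITEXT:
        rev["model_format_id"] = uint(1)  # key into the model & format index
    if not flags & TEXT_DELETED:
        rev["sha1"] = take(20)
        if is_pages_dump:
            rev["text_group_id"] = uint(4)
            rev["text_index"] = uint(1)
        else:
            rev["text_length"] = uint(4)
    return rev
```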

Text group

A text group contains a group of revision texts compressed together to achieve better compression.

For saving, the texts are concatenated using the null character ('\0') as a delimiter. If a text is deleted from the group, it is replaced by the UTF-8 encoding of the Unicode noncharacter U+FFFF.

Specific texts from the group are accessed by a 1-byte index (which means a group can contain at most 256 texts).

A text group is saved as:

  • 1 byte object kind: 0x31
  • long string LZMA-compressed string containing the texts from the group (see above for details)
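
Reading a text group can be sketched as follows. This page does not say which LZMA container is used, so the sketch assumes one that Python's lzma.decompress can detect automatically (.xz or legacy .lzma); a raw LZMA stream would need explicit filter settings instead. It also assumes that the delimiter appears only between texts, not after the last one. The function names are hypothetical.

```python
import lzma

# UTF-8 encoding of U+FFFF, which stands in for a deleted text.
DELETED_MARKER = "\uffff".encode("utf-8")  # b"\xef\xbf\xbf"


def split_texts(decompressed: bytes) -> list:
    """Split the concatenated texts; deleted texts are returned as None."""
    return [None if part == DELETED_MARKER else part
            for part in decompressed.split(b"\0")]


def parse_text_group(data: bytes) -> list:
    """Parse a text group object from the bytes starting at its offset."""
    if data[:1] != b"\x31":
        raise ValueError("not a text group object")
    length = int.from_bytes(data[1:5], "little")  # long string: 4-byte length
    compressed = data[5:5 + length]
    if len(compressed) != length:
        raise EOFError("truncated text group object")
    return split_texts(lzma.decompress(compressed))
```

The 1-byte text index stored in a revision would then select an element of the returned list.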

User

The contributor who made a revision can be saved in one of three possible formats.

For anonymous IPv4 editors:

  • 4 byte integer of the IP address

For anonymous IPv6 editors:

  • 16 bytes of the IP address

For other editors (including normal logged-in editors, historical anomalies, …):

  • 4 byte integer user id (can be 0)
  • short string user name

Notes

  1. Technically, it's not a B-tree, because it doesn't follow the rules of a B-tree when it comes to deletion.