Incremental dumps/File format/Specification

This page describes the current format of the incremental dumps file. The format is far from finished, so it can change from day to day and this page can easily become out of date.

The format is binary; the file contains various objects and can also contain free space (remaining after deleted objects).

Data types
When reading an object, the type of the next piece of data is always known from context. This means that objects don't contain any information about which field comes next or what its type is.

The encoding for various data types is as follows:


 * integers: 1, 2, 4 and 6-byte unsigned integers are saved directly in little-endian order. (6-byte integers are used to represent offsets in the file.) Signed integers are first cast to unsigned integers of the same size and then saved the same way as unsigned integers.
 * timestamps: timestamps from 1 January 2000 to beyond 2100 with second accuracy are represented as 4-byte unsigned integers. The integer is not the number of seconds from the start date, but is instead directly calculated from parts of the date as.
 * strings: strings are saved as the length of the string (n) followed by n bytes of its content. For short strings (those that are guaranteed to be at most 255 bytes long), the length is a 1-byte integer; for long strings it is a 4-byte integer.
 * generic lists: lists are saved as a 4-byte count of items (n) followed by n items. The size of each item depends on its type and can be variable (e.g. for a list of strings). This is basically a representation of.
 * generic maps: maps are saved as 2-byte count of items (n) followed by n keys and then by n values. This is basically a representation of.
 * generic pair: pairs are saved as the first item followed by the second item of the pair. Pairs are typically used as the value of a map. This is basically a representation of.
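
The encodings above can be sketched as a handful of reader functions. This is only an illustration, not the actual implementation: the function names are made up for this example, strings are returned as raw bytes because the text encoding is not stated here, and timestamps are left out because their formula is not given on this page.

```python
import io


def read_exact(stream, size):
    """Read exactly `size` bytes, failing loudly on truncated data."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data


def read_uint(stream, size):
    """Unsigned integer of 1, 2, 4 or 6 bytes, little-endian."""
    return int.from_bytes(read_exact(stream, size), "little")


def read_int(stream, size):
    """Signed integer: stored as its unsigned cast, i.e. two's complement."""
    return int.from_bytes(read_exact(stream, size), "little", signed=True)


def read_string(stream, long=False):
    """Length prefix of 1 byte (short string) or 4 bytes (long string), then content."""
    length = read_uint(stream, 4 if long else 1)
    return read_exact(stream, length)  # raw bytes; the encoding is not stated


def read_list(stream, read_item):
    """4-byte count followed by that many items."""
    count = read_uint(stream, 4)
    return [read_item(stream) for _ in range(count)]


def read_map(stream, read_key, read_value):
    """2-byte count, then all keys, then all values."""
    count = read_uint(stream, 2)
    keys = [read_key(stream) for _ in range(count)]
    values = [read_value(stream) for _ in range(count)]
    return dict(zip(keys, values))


def read_pair(stream, read_first, read_second):
    """First item immediately followed by the second item."""
    first = read_first(stream)
    second = read_second(stream)
    return first, second
```

Because no type information is stored, the caller supplies the item readers: for instance, `read_list(stream, read_string)` reads a list of short strings.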

File header
The file header always starts at offset 0 and contains the offsets of the indexes, which are used to access the data stored in the file.


 * 4 bytes magic number:
 * 1 byte file format version: 1
 * 1 byte data version: 1
 * 1 byte dump kind flags:
   * for pages dump: a dump with revision text
   * for current dump: a dump without old revisions of pages
   * for articles dump: a dump that doesn't contain pages from talk namespaces and the User namespace
 * 6 bytes offset to the end of the file
 * 6 bytes offset to the root of the page id index
 * 6 bytes offset to the root of the revision id index
 * 6 bytes offset to the root of the model & format index
 * 6 bytes offset to the free space index
 * 6 bytes offset to the site info object
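
As a rough sketch, the header layout above could be decoded as follows. The `FileHeader` class and its field names are assumptions made for this example; the magic number is kept as raw bytes and the dump kind flags as a plain integer, since their values are not listed here.

```python
from dataclasses import dataclass

# 4 (magic) + 1 + 1 + 1 (versions and flags) + 6 offsets of 6 bytes each
HEADER_SIZE = 4 + 3 + 6 * 6


@dataclass
class FileHeader:  # hypothetical container, named for this example only
    magic: bytes
    file_format_version: int
    data_version: int
    dump_kind_flags: int
    end_of_file: int
    page_id_index_root: int
    revision_id_index_root: int
    model_format_index_root: int
    free_space_index: int
    site_info: int


def parse_header(data: bytes) -> FileHeader:
    """Decode the fixed-size header found at offset 0 of the file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("data is too short to contain a file header")
    # The six offsets follow the 7 leading bytes, each as a 6-byte
    # little-endian unsigned integer.
    offsets = [
        int.from_bytes(data[7 + 6 * i : 13 + 6 * i], "little") for i in range(6)
    ]
    return FileHeader(data[0:4], data[4], data[5], data[6], *offsets)
```

A reader would typically call `parse_header` on the first 43 bytes of the file and then seek to whichever index offset it needs.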

Index
The file currently contains 3 indexes:


 * page id index maps 4-byte page ids to 6-byte offsets of the corresponding page object
 * revision id index maps 4-byte revision ids to 6-byte offsets of the corresponding revision object
 * free space index maps 6-byte offsets of free space blocks to their 4-byte lengths

The index is saved as:


 * 1 byte object kind:
 * 2 bytes count of items (n)
 * n keys
 * n values

This kind of index object is meant as a leaf in a B-tree, but the B-tree itself is not implemented yet.
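
A minimal sketch of reading one such index object, assuming the whole file is available as a `bytes` value. The function name is hypothetical, and the key and value sizes are passed in by the caller because they differ between indexes (4 and 6 bytes for the page id and revision id indexes, 6 and 4 bytes for the free space index).

```python
def parse_index_leaf(data, offset, key_size, value_size):
    """Read one index object starting at `offset`.

    Returns the object kind byte and a dict mapping keys to values.
    """
    kind = data[offset]
    count = int.from_bytes(data[offset + 1 : offset + 3], "little")
    end = offset + 3 + count * (key_size + value_size)
    if end > len(data):
        raise ValueError("index object runs past the end of the data")

    # All keys are stored first, followed by all values.
    pos = offset + 3
    keys = [
        int.from_bytes(data[pos + i * key_size : pos + (i + 1) * key_size], "little")
        for i in range(count)
    ]
    pos += count * key_size
    values = [
        int.from_bytes(data[pos + i * value_size : pos + (i + 1) * value_size], "little")
        for i in range(count)
    ]
    return kind, dict(zip(keys, values))
```

Looking up a page then amounts to `parse_index_leaf(data, root_offset, 4, 6)[1][page_id]`, where `root_offset` comes from the file header.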

Page
The page object describes a single page and references its revisions. It is saved as:


 * 1 byte object kind:
 * 4 bytes page id
 * 2 bytes namespace
 * short string page title
 * short string redirect target; if it is empty, the page is not a redirect
 * list of 4-byte ids of revisions of this page
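
The page layout can be read field by field, as in the sketch below. The `Page` class is made up for this example. Two things are assumptions rather than part of the description above: strings are decoded as UTF-8, and the namespace is read as a signed integer (for values from 0 to 32767 the signed and unsigned readings agree).

```python
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:  # hypothetical container, named for this example only
    kind: int
    page_id: int
    namespace: int
    title: str
    redirect_target: Optional[str]
    revision_ids: List[int]


def _read(stream, size):
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return data


def _uint(stream, size):
    return int.from_bytes(_read(stream, size), "little")


def _short_string(stream):
    # 1-byte length followed by the content; UTF-8 is an assumption.
    return _read(stream, _uint(stream, 1)).decode("utf-8")


def read_page(stream) -> Page:
    kind = _uint(stream, 1)
    page_id = _uint(stream, 4)
    namespace = int.from_bytes(_read(stream, 2), "little", signed=True)
    title = _short_string(stream)
    redirect = _short_string(stream)
    count = _uint(stream, 4)  # generic list: 4-byte count, then the items
    revision_ids = [_uint(stream, 4) for _ in range(count)]
    # An empty redirect target means the page is not a redirect.
    return Page(kind, page_id, namespace, title, redirect or None, revision_ids)
```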

Revision
The revision object describes a revision of a page. It is saved as:


 * 1 byte object kind:
 * 4 bytes revision id
 * 1 byte flags (the applicable values are ORed together)
   * : minor edit
   * : the contributor for this revision does not fall into any of the following specific categories of users; this usually means that the user is a normal logged-in editor
   * : the contributor is an IPv4 anonymous user
 * 4-byte id of the parent revision
 * 4-byte timestamp of the revision
 * embedded object describing the user who made this revision
 * short string comment
 * long string LZMA-compressed text of the revision
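
The following sketch reads a revision in the order listed above. It rests on several assumptions: every listed field is present, the bit that marks an IPv4 anonymous contributor is supplied by the caller as `ipv4_flag` (its value is not given here), the timestamp is returned as the raw 4-byte integer, and the text is returned still compressed because the exact LZMA container is not stated. All names are hypothetical.

```python
import io


def _read(stream, size):
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return data


def _uint(stream, size):
    return int.from_bytes(_read(stream, size), "little")


def _string(stream, length_size):
    # length_size is 1 for short strings and 4 for long strings.
    return _read(stream, _uint(stream, length_size))


def read_revision(stream, ipv4_flag):
    """Read one revision object; `ipv4_flag` is the caller-supplied flag bit."""
    revision = {
        "kind": _uint(stream, 1),
        "revision_id": _uint(stream, 4),
        "flags": _uint(stream, 1),
        "parent_id": _uint(stream, 4),
        "timestamp": _uint(stream, 4),  # left undecoded
    }
    # The embedded user object has two layouts, selected by the flags.
    if revision["flags"] & ipv4_flag:
        revision["user"] = {"id": 0, "ip": _uint(stream, 4)}
    else:
        revision["user"] = {"id": _uint(stream, 4), "name": _string(stream, 1)}
    revision["comment"] = _string(stream, 1)
    revision["compressed_text"] = _string(stream, 4)
    return revision
```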

User
The contributor who made a revision can be saved in one of two possible formats.

For anonymous IPv4 editors, a user id of 0 is implied, so only the following is saved:


 * 4-byte integer of the IP address

For other editors (including normal logged-in editors, IPv6 anonymous editors, historical anomalies, …):


 * 4-byte integer user id (can be 0)
 * short string user name
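
Seen from the writing side, the two user formats could be produced as sketched below. The function names are hypothetical, the user name is encoded as UTF-8 by assumption, and the IP address is taken as an already computed 32-bit integer because the mapping from dotted notation to that integer is not described here.

```python
def encode_short_string(text):
    """1-byte length followed by the content (UTF-8 is an assumption)."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("short strings are limited to 255 bytes")
    return bytes([len(data)]) + data


def encode_user(user_id, name=None, ipv4=None):
    """Encode a contributor in one of the two formats.

    Pass `ipv4` (a 32-bit integer) for an anonymous IPv4 editor,
    otherwise pass `name` for every other kind of editor.
    """
    if ipv4 is not None:
        if user_id != 0:
            raise ValueError("anonymous IPv4 editors have an implied user id of 0")
        return ipv4.to_bytes(4, "little")  # only the address is stored
    if name is None:
        raise ValueError("a user name is required for non-IPv4 editors")
    return user_id.to_bytes(4, "little") + encode_short_string(name)
```

The output does not say which format was used, so the writer also has to set the corresponding flag in the enclosing revision object.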