Incremental dumps/File format/Specification

This page describes the current format of the incremental dumps file. It is far from finished, which means the format can change daily and that this page can easily become out of date.

The format is binary; the file contains various objects and can also contain free space (remaining after deleted objects).

Data types
The encoding for various data types is as follows:


 * integers: 1, 2, 4 and 6-byte unsigned integers are saved directly in little-endian order. (6-byte integers are used to represent offsets in the file).
 * timestamps: timestamps from 1 January 2000 to beyond 2100 with second accuracy are represented as 4-byte unsigned integers. The integer is not the number of seconds from the start date, but is instead directly calculated from parts of the date as.
 * strings: strings are saved as the length of the string (n) followed by n bytes of its content. For short strings (those that are guaranteed to be at most 255 bytes long), the length is a 1-byte integer; for long strings it is a 4-byte integer.
 * lists: lists are saved as a 4-byte count of items (n) followed by n items. The size of each item depends on its type and can be variable (e.g. for a list of strings). The type of the items in a list depends on the context (i.e. it is not stored in the list in any way).
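
The integer, string and list encodings above can be sketched in Python as follows. This is only an illustration, not the reference implementation: the helper names (`read_uint`, `write_string`, and so on) are made up for this example, and strings are kept as raw bytes because this page does not state a character encoding.

```python
import io

def read_uint(stream, size):
    """Read a little-endian unsigned integer of `size` bytes (1, 2, 4 or 6)."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(data, "little")

def write_uint(stream, value, size):
    """Write `value` as a little-endian unsigned integer of `size` bytes."""
    # int.to_bytes raises OverflowError if the value does not fit in `size` bytes.
    stream.write(value.to_bytes(size, "little"))

def read_string(stream, long=False):
    """Read a short (1-byte length) or long (4-byte length) string as bytes."""
    length = read_uint(stream, 4 if long else 1)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data")
    return data

def write_string(stream, data, long=False):
    """Write the length prefix followed by the string content."""
    write_uint(stream, len(data), 4 if long else 1)
    stream.write(data)

def read_list(stream, read_item):
    """Read a 4-byte count, then that many items using `read_item`."""
    count = read_uint(stream, 4)
    return [read_item(stream) for _ in range(count)]

def write_list(stream, items, write_item):
    """Write a 4-byte count, then each item using `write_item`."""
    write_uint(stream, len(items), 4)
    for item in items:
        write_item(stream, item)

# Round trip: a list of two short strings.
buf = io.BytesIO()
write_list(buf, [b"Main Page", b"Talk"], write_string)
encoded = buf.getvalue()
decoded = read_list(io.BytesIO(encoded), read_string)
```

Because the item type is not stored in the list, the reader has to be told how to decode each item, which is why `read_list` takes a `read_item` callable.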

File header
The file header always starts at offset 0 and contains the offsets of the indexes, which can be used to access the data stored in the file.


 * 4 bytes magic number:
 * 1 byte file format version: 1
 * 1 byte data version: 1
 * 6 bytes offset to the end of the file
 * 6 bytes offset to the root of the page id index
 * 6 bytes offset to the root of the revision id index
 * 6 bytes offset to the free space index
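
As a rough illustration, the header layout listed above (30 bytes in total) could be read like this. The names `FileHeader` and `parse_header` are hypothetical, and the magic number is returned as opaque bytes because its value is not given on this page.

```python
from typing import NamedTuple

HEADER_SIZE = 4 + 1 + 1 + 4 * 6  # magic, two version bytes, four 6-byte offsets

class FileHeader(NamedTuple):
    magic: bytes
    file_format_version: int
    data_version: int
    end_of_file: int
    page_id_index_root: int
    revision_id_index_root: int
    free_space_index: int

def parse_header(data):
    """Parse the header from the first HEADER_SIZE bytes of the file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header is truncated")

    def offset_at(start):
        # Offsets are 6-byte little-endian unsigned integers.
        return int.from_bytes(data[start:start + 6], "little")

    return FileHeader(
        magic=bytes(data[0:4]),
        file_format_version=data[4],
        data_version=data[5],
        end_of_file=offset_at(6),
        page_id_index_root=offset_at(12),
        revision_id_index_root=offset_at(18),
        free_space_index=offset_at(24),
    )
```

A reader would typically check the two version bytes before following any of the offsets, since the format is still changing.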

Index
The file currently contains 3 indexes:


 * page id index maps 4 byte page ids to 6 byte offsets of the corresponding page object
 * revision id index maps 4 byte revision ids to 6 byte offsets of the corresponding revision object
 * free space index maps 6 byte offsets of free space blocks to their 4 byte lengths

The index is saved as:


 * 1 byte object kind:
 * 2 bytes count of items (n)
 * n keys
 * n values

This kind of index object is meant as a leaf in a B-tree, but that's not implemented yet.
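
A minimal sketch of reading one such index object is shown below. Since the key and value sizes differ between the three indexes, they are passed in as parameters; the function name `parse_index_leaf` is an assumption made for this example, and the object kind byte is returned unchecked because its value is not listed here.

```python
def parse_index_leaf(data, offset, key_size, value_size):
    """Parse an index object located at `offset` in `data`.

    Returns (kind, mapping, end), where `end` is the offset just past the object.
    """
    kind = data[offset]
    count = int.from_bytes(data[offset + 1:offset + 3], "little")

    # All n keys are stored first, followed by all n values.
    keys_start = offset + 3
    values_start = keys_start + count * key_size
    end = values_start + count * value_size
    if end > len(data):
        raise ValueError("index object is truncated")

    def uints(start, size):
        return [
            int.from_bytes(data[start + i * size:start + (i + 1) * size], "little")
            for i in range(count)
        ]

    keys = uints(keys_start, key_size)
    values = uints(values_start, value_size)
    return kind, dict(zip(keys, values)), end
```

For the page id and revision id indexes this would be called with `key_size=4, value_size=6`, and for the free space index with `key_size=6, value_size=4`.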

Page
The page object describes a single page and references its revisions. It is saved as:


 * 1 byte object kind:
 * 4 bytes page id
 * 2 bytes namespace
 * short string page title
 * short string redirect target; if it is empty, the page is not a redirect
 * list of 4 byte ids of revisions of this page
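
The page layout above could be decoded along these lines. This is a sketch under assumptions: `Page`, `read_page` and `is_redirect` are illustrative names, the title and redirect target are kept as raw bytes, and the object kind byte is read but not validated.

```python
import io
from typing import List, NamedTuple

class Page(NamedTuple):
    kind: int
    page_id: int
    namespace: int
    title: bytes
    redirect_target: bytes
    revision_ids: List[int]

def _uint(stream, size):
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(data, "little")

def _short_string(stream):
    length = _uint(stream, 1)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data")
    return data

def read_page(stream):
    """Read one page object from the current stream position."""
    kind = _uint(stream, 1)
    page_id = _uint(stream, 4)
    namespace = _uint(stream, 2)
    title = _short_string(stream)
    redirect_target = _short_string(stream)
    # The revision list is a 4-byte count followed by 4-byte revision ids.
    revision_count = _uint(stream, 4)
    revision_ids = [_uint(stream, 4) for _ in range(revision_count)]
    return Page(kind, page_id, namespace, title, redirect_target, revision_ids)

def is_redirect(page):
    """An empty redirect target means the page is not a redirect."""
    return len(page.redirect_target) > 0
```

The revision ids returned here can then be looked up in the revision id index to find the offsets of the corresponding revision objects.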

Revision
The revision object describes a revision of a page. It is saved as:


 * 1 byte object kind:
 * 4 bytes revision id
 * 1 byte flags (the applicable values are ORed together):
    * : minor edit
    * : the contributor for this revision does not fall into any of the following specific categories of users; this usually means that the user is a normal logged-in editor
    * : the contributor is an IPv4 anonymous user
 * 4 byte id of the parent revision
 * 4 byte timestamp of the revision
 * embedded object describing the user who made this revision
 * short string comment
 * long string LZMA-compressed text of the revision
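
The field order above might be read as in the following sketch. Several things here are assumptions: the names `Revision` and `read_revision` are invented for the example, the embedded user object is delegated to a caller-supplied `read_user(stream, flags)` callable because its format depends on the flags, the timestamp is left as the raw 4-byte integer, and the text is returned still compressed because the exact LZMA container is not specified on this page.

```python
import io
from typing import Any, NamedTuple

class Revision(NamedTuple):
    kind: int
    revision_id: int
    flags: int
    parent_id: int
    raw_timestamp: int      # encoded timestamp, not decoded here
    user: Any               # whatever the supplied read_user returns
    comment: bytes
    compressed_text: bytes  # LZMA-compressed, left as stored

def _uint(stream, size):
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(data, "little")

def _string(stream, length_size):
    # length_size is 1 for short strings and 4 for long strings.
    length = _uint(stream, length_size)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of data")
    return data

def read_revision(stream, read_user):
    """Read one revision object; `read_user(stream, flags)` decodes the user."""
    kind = _uint(stream, 1)
    revision_id = _uint(stream, 4)
    flags = _uint(stream, 1)
    parent_id = _uint(stream, 4)
    raw_timestamp = _uint(stream, 4)
    user = read_user(stream, flags)
    comment = _string(stream, 1)
    compressed_text = _string(stream, 4)
    return Revision(kind, revision_id, flags, parent_id,
                    raw_timestamp, user, comment, compressed_text)
```

Keeping the text compressed until it is actually needed avoids decompressing every revision when only metadata is being scanned.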

User
The contributor who made a revision can be saved in one of two possible formats.

For anonymous IPv4 editors, a user id of 0 is implied, so they are saved as just:


 * 4 byte integer of the IP address

For other editors (including normal logged-in editors, IPv6 anonymous editors, historical anomalies, …):


 * 4 byte integer user id (can be 0)
 * short string user name
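
The two user formats can be told apart only by the revision's flags, so a reader needs that information passed in. The sketch below takes it as a boolean, `is_ipv4_anonymous`, which the caller would derive from the flags byte. The names `User` and `read_user` are hypothetical, and treating the 4-byte integer as the conventional numeric form of the address (most significant octet first) is an assumption, not something this page states.

```python
import io
import ipaddress
from typing import NamedTuple, Optional

class User(NamedTuple):
    user_id: int
    name: Optional[bytes]                   # None for IPv4 anonymous editors
    ip: Optional[ipaddress.IPv4Address]     # None for all other editors

def _uint(stream, size):
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of data")
    return int.from_bytes(data, "little")

def read_user(stream, is_ipv4_anonymous):
    """Read the embedded user object in one of its two formats."""
    if is_ipv4_anonymous:
        # Only the address is stored; the user id of 0 is implied.
        address = _uint(stream, 4)
        return User(0, None, ipaddress.IPv4Address(address))

    user_id = _uint(stream, 4)
    length = _uint(stream, 1)
    name = stream.read(length)
    if len(name) != length:
        raise EOFError("unexpected end of data")
    return User(user_id, name, None)
```

For example, under the assumption above, the stored bytes `01 02 00 C0` decode to the integer 0xC0000201, which corresponds to the address 192.0.2.1.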