Incremental dumps/File format

This document describes how I imagine the initial version of the incremental dump file format will look, along with some improvements for later versions (some necessary, others just nice to have). This version (black text only) should be completed by 29 July.

A dump file in the new format is a binary file that will contain:

  • At offset 0: header containing some basic information, especially offsets to indexes.
  • Indexes. These are used to map one value to another, e.g. page title to page id or page id to file offset.
  • Page objects. Equivalent to <page> in XML dumps.
  • Revision objects. Equivalent to <revision> in XML dumps.

Most data will be saved directly in binary form (so, e.g., a page id will take 4 bytes). Revision text is handled in a special way.[1]
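
For illustration, the header could look roughly like the sketch below. All field names, sizes and the exact set of index root offsets are assumptions, not a finalized layout.

#include <cstdint>

// Hypothetical on-disk header; every field is an illustrative guess.
struct DumpHeader
{
    uint8_t  magic[4];       // file signature identifying the format
    uint32_t version;        // format version, so readers can detect changes
    uint8_t  dumpKind;       // "current" vs. "history", "pages" vs. "stub"
    // Offset of the root B-tree node of each index (see Indexes below).
    uint64_t pageIdIndexRoot;      // page id -> page object offset
    uint64_t revisionIdIndexRoot;  // revision id -> revision object offset
    uint64_t userIndexRoot;        // user id -> user name
    uint64_t textModelIndexRoot;   // fake text & model id -> text & model
    uint64_t namespaceIndexRoot;   // fake namespace id -> real id + name
    uint64_t freeBlockIndexRoot;   // free blocks, sorted by size
};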

Indexes

Each index will map from one value to another. A complete dump file will contain the following indexes:

  • fake namespace id → real namespace id + namespace name[2]
  • page id → file offset of the corresponding page object
  • revision id → file offset of the corresponding revision object
  • user id → user name
  • fake text & model id → text & model, to save space
  • index of free blocks, sorted by size

These indexes can be used, for example, to quickly look up information about a page based on its title or id, or information about a revision based on its id.

An index will be saved as a B-tree; each node of the tree will be relatively big (4 KB?). Each node will be saved in the file separately, so that the index can easily grow (and shrink).
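
As a sketch, a fixed-size node could look like the following; the node size and the packing of the data area are assumptions.

#include <cstddef>
#include <cstdint>

constexpr std::size_t NODE_SIZE = 4096;  // the "4 KB?" mentioned above

// Hypothetical fixed-size B-tree node; 2 + 1 + 4093 bytes = exactly 4 KB.
// Keys and values are packed into the data area; inner nodes store child
// node offsets where leaf nodes store values.
struct BTreeNode
{
    uint16_t keyCount;             // number of keys stored in this node
    uint8_t  isLeaf;               // 1 = leaf node, 0 = inner node
    uint8_t  data[NODE_SIZE - 3];  // encoded keys, then values or child offsets
};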

The file header will contain the offset of the root node of each index.

To save space, string keys will be saved using incremental encoding, and integer keys using a simple form of delta compression.
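
To illustrate both encodings (the helper functions below are made up): incremental, or front, encoding stores for each string key only the length of the prefix shared with the previous key plus the remaining suffix, and delta compression stores each integer key as the difference from its predecessor.

#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Front-code a sorted list of string keys: each key becomes a pair of
// (length of prefix shared with the previous key, remaining suffix).
std::vector<std::pair<uint8_t, std::string>>
frontEncode(const std::vector<std::string>& sortedKeys)
{
    std::vector<std::pair<uint8_t, std::string>> encoded;
    std::string previous;
    for (const std::string& key : sortedKeys)
    {
        std::size_t common = 0;
        while (common < key.size() && common < previous.size()
               && key[common] == previous[common] && common < 255)
            ++common;
        encoded.push_back({static_cast<uint8_t>(common), key.substr(common)});
        previous = key;
    }
    return encoded;
}

// Delta-encode sorted integer keys: the differences are small numbers
// that can then be stored compactly, e.g. as variable-length integers.
std::vector<uint32_t> deltaEncode(const std::vector<uint32_t>& sortedKeys)
{
    std::vector<uint32_t> deltas;
    uint32_t previous = 0;
    for (uint32_t key : sortedKeys)
    {
        deltas.push_back(key - previous);
        previous = key;
    }
    return deltas;
}

For example, the sorted titles “Talk:Apple” and “Talk:Apricot” share the 7-character prefix “Talk:Ap”, so the second key is stored as (7, “ricot”); sorted revision ids like 1000, 1003, 1017 become the deltas 1000, 3, 14.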

Page object

The page object is the equivalent of <page> in XML dumps, so it contains page id, page namespace, page title, redirect target and a list of revision ids. Which revision ids are present depends on the dump type: “current” dumps will have just the most recent revision, “history” dumps will have the full list.
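
In memory, the page object could correspond to something like this sketch; the field names and types are assumptions, and on disk the fields would use the binary encodings described above.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical page object, mirroring <page> in the XML dumps.
struct Page
{
    uint32_t id;                        // page id
    uint8_t  namespaceId;               // "fake" namespace id, see note 2
    std::string title;                  // page title
    std::string redirectTarget;         // empty if the page is not a redirect
    std::vector<uint32_t> revisionIds;  // one id in "current" dumps, all in "history"
};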

Revision object

The revision object is the equivalent of <revision> in XML dumps, so it contains revision id, parent revision id, timestamp, contributor, comment, minor flag, SHA-1, model, format and deleted flags.
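
A corresponding sketch for the revision object (again, field names, types and sizes are assumptions):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical revision object, mirroring <revision> in the XML dumps.
struct Revision
{
    uint32_t id;               // revision id
    uint32_t parentId;         // 0 if this is the first revision of its page
    uint32_t timestamp;        // binary timestamp instead of an ISO 8601 string
    uint32_t contributorId;    // resolved to a name through the user index
    std::string comment;       // edit summary
    bool minor;                // minor-edit flag
    uint8_t sha1[20];          // binary SHA-1 instead of base-36 text
    uint16_t textModelId;      // "fake" id resolved through the text & model index
    uint8_t deletedFlags;      // bit field for suppressed parts of the revision
    std::vector<uint8_t> text; // compressed text (stub dumps store only its length)
};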

The initial implementation will save revision text (if present) together with the revision, each revision compressed separately using LZMA. The final version will use a better compression scheme, either compressing a group of revisions together or using some form of delta compression. For stub dumps, the revision text length is included instead of the text itself.
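
For the initial per-revision scheme, compression can be a single one-shot liblzma call per revision, roughly like this sketch (the function name and the error handling are illustrative; link with -llzma):

#include <lzma.h>

#include <cstdint>
#include <string>
#include <vector>

// Compress one revision's text with liblzma's one-shot encoder.
// Returns the compressed bytes, or an empty vector on failure.
std::vector<uint8_t> compressRevisionText(const std::string& text)
{
    // Worst-case size of the compressed output.
    std::vector<uint8_t> out(lzma_stream_buffer_bound(text.size()));
    size_t outPos = 0;
    lzma_ret ret = lzma_easy_buffer_encode(
        6,                // preset: compression level 0-9
        LZMA_CHECK_CRC32, // integrity check stored with the data
        nullptr,          // default allocator
        reinterpret_cast<const uint8_t*>(text.data()), text.size(),
        out.data(), &outPos, out.size());
    if (ret != LZMA_OK)
        return {};        // real code would report the error
    out.resize(outPos);   // shrink to the actual compressed size
    return out;
}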

Free space and deletions

After an object or index node is deleted, it leaves behind free space. When a new object (or index node) is created, it will first try to fit into one of those free blocks (this will use the index of free blocks). If it doesn't fit anywhere, it will be appended to the end of the file. If a free block is created next to another one, the two have to be merged.
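
A sketch of that allocation logic, with a size-ordered map standing in for the free-block index (all names are assumptions):

#include <cstdint>
#include <map>

// Free blocks keyed by size (the "index of free blocks, sorted by size").
// Returns the file offset at which to write the new object; fileEnd is
// advanced when the object has to be appended instead.
uint64_t allocate(std::multimap<uint64_t, uint64_t>& freeBlocks,
                  uint64_t& fileEnd, uint64_t size)
{
    // Best fit: the smallest free block that is large enough.
    auto it = freeBlocks.lower_bound(size);
    if (it != freeBlocks.end())
    {
        uint64_t blockSize = it->first;
        uint64_t offset = it->second;
        freeBlocks.erase(it);
        if (blockSize > size)  // return the unused tail to the free list
            freeBlocks.insert({blockSize - size, offset + size});
        return offset;
    }
    // Nothing fits: append at the end of the file.
    uint64_t offset = fileEnd;
    fileEnd += size;
    return offset;
}

Merging adjacent free blocks on deletion additionally requires finding a block's neighbors by offset, so the real structure will need to be searchable by offset as well as by size.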

Because of this, a dump file may contain some unused space. Hopefully, this won't be a big issue, because small page objects will fill most of the spaces where the large revision objects won't fit.

Deleting a revision and then adding another one (usually of a different size) has to work well, because this will happen extremely often in “current” dumps (basically, every new revision that's not a new page will cause the deletion of the previous revision of that page).

With improved compression, the text of one revision depends on other revisions. In the case of compressing revisions in groups, deleting a revision would mean recompressing the whole group. In the case of delta compression, it would mean tracking which revisions depend on which, and recompressing the revisions that directly depend on the deleted one.

Diff dumps

Another feature not in the initial version is “diff dumps”. Along with a normal dump, a “diff” will be produced that contains all changes since the last dump. This way, users don't have to download the whole dump every time; they can download just the changes and apply them to their previously downloaded dump.

If delta compression is implemented, diff dumps could use a specific form of it: the revision text can be compressed using revisions that are present in the dump that's being updated, but not in the diff dump itself (even if the base revision is to be deleted). This can be especially useful for diff dumps of “current” dumps.

Reading the dumps

XML output

The primary way to read the dumps will be a command-line application that outputs the data as uncompressed XML, in the same format as the current dumps (with some exceptions).

The application will have parameters for limiting the output (e.g. to a certain namespace or a range of page ids).

This way, current users of the dumps can use the new format with only small changes to their code, and new users can use any programming language.

Library

There will be a library for reading the contents of the dump file. It will be written in C++, with an “object-oriented C” interface. That is, the interface will be in plain C, but it will consist of functions operating on objects represented as opaque pointers.

For example, some declarations could look like this:

struct Dump;      /* opaque handle to an open dump file */
struct Revision;  /* opaque handle to a single revision */

struct Revision* dump_get_revision_by_id(struct Dump* dump, int revision_id);
char* revision_get_text(struct Revision* revision);

(The real code is going to be more complicated due to error handling.)
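
Purely as an assumption about the eventual shape (this is not the actual API), error handling might turn the declarations into something like:

/* Hypothetical error codes and error-handled variants: results are
 * returned through out-parameters, the return value is the error code. */
typedef enum
{
    DUMP_OK = 0,
    DUMP_ERROR_NOT_FOUND,
    DUMP_ERROR_IO
} dump_error;

dump_error dump_get_revision_by_id(struct Dump* dump, int revision_id,
                                   struct Revision** revision);
dump_error revision_get_text(struct Revision* revision, char** text);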

The C interface above would be represented in an OO language as:

class Dump
{
    Revision GetRevisionById(int revisionId);
}

class Revision
{
    string GetText();
}

Users are not expected to use the C interface directly; instead, they will access the dumps through bindings in a language like Python or C#.

Notes

  1. What about the comment? Should it be compressed too?
  2. Namespaces can have ids like 828, but I don't think any Wikimedia wiki uses more than 256 namespaces, so saving namespace ids as 2 bytes (or 4, since page_namespace is a 32-bit integer) would be wasteful. Assigning a new, smaller (“fake”) id to each namespace will therefore save space.