Incremental dumps/File format/Diff specification

This page describes the current format of “diff dumps”, that is, files that contain the changes that occurred on a wiki in a certain period of time. A diff dump can be applied to a normal incremental dump to update it.

The format is binary; the file contains various objects in series. There is no free space and no need for indexes.

Data types are exactly the same as in normal dumps. Some objects are very similar to objects in normal dumps.

The general structure is that the file starts with a file header and a Site info change object. What follows is a sequence of change objects. The order of change objects is undefined, except that revision changes always follow the corresponding page change.
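The ordering rule can be checked mechanically. The following is a minimal sketch, not part of the format: it uses hypothetical symbolic names in place of the 1-byte change kind values and only verifies that no revision-level change appears before the first page-level change.

```python
from typing import Iterable

# Hypothetical symbolic names; they stand in for the 1-byte change kind
# values, which this sketch does not assume.
PAGE_KINDS = {"new_page", "page_change"}
REVISION_KINDS = {"new_revision", "revision_change", "delete_revision"}


def check_order(kinds: Iterable[str]) -> bool:
    """Return True if every revision-level change follows some page-level change."""
    have_page = False
    for kind in kinds:
        if kind in PAGE_KINDS:
            have_page = True
        elif kind in REVISION_KINDS and not have_page:
            # A revision change with no preceding page change has no owner.
            return False
    return True
```

For example, `check_order(["site_info", "page_change", "new_revision"])` is accepted, while a sequence that begins with `"new_revision"` is rejected.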

File header

 * 4 bytes magic number:
 * 1 byte file format version: 1
 * 1 byte data version: 1
 * 1 byte dump kind flags (values are the same as in normal dumps)
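As an illustration, the 7-byte header can be read with Python's `struct` module. This is a sketch under stated assumptions: the names `FileHeader` and `parse_file_header` are made up for this example, and the magic number is returned as raw bytes without being compared against any value.

```python
import struct
from typing import NamedTuple


class FileHeader(NamedTuple):
    magic: bytes          # 4 raw bytes, not validated in this sketch
    format_version: int
    data_version: int
    kind_flags: int


# 4 raw bytes followed by three single bytes: 7 bytes in total.
HEADER = struct.Struct("<4sBBB")


def parse_file_header(buf: bytes) -> FileHeader:
    """Parse the file header from the start of buf."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated file header")
    header = FileHeader(*HEADER.unpack_from(buf, 0))
    if header.format_version != 1 or header.data_version != 1:
        raise ValueError("unsupported file format or data version")
    return header
```

The first change object (Site info change) starts at offset `HEADER.size`.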

Site info change
This object contains the new site info object (necessary e.g. when a namespace is added to a wiki) along with the name and timestamps of the dump.


 * 1 byte change kind:
 * short string dump name
 * short string old timestamp; the dump this diff is applied to must have this timestamp
 * short string new timestamp; the dump will have this timestamp after this diff is applied
 * the Site info object from normal dump, except without object kind, name and timestamp
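The two timestamps act as a precondition and a postcondition for applying the diff. A minimal sketch of that check, assuming a hypothetical in-memory `DumpState` that holds the values already decoded from the files:

```python
from dataclasses import dataclass


@dataclass
class DumpState:
    """Hypothetical in-memory view of the dump being updated."""
    name: str
    timestamp: str


def apply_site_info_timestamps(dump: DumpState, old_timestamp: str, new_timestamp: str) -> None:
    """Refuse a diff made for another dump state; otherwise advance the timestamp."""
    if dump.timestamp != old_timestamp:
        raise ValueError(
            f"diff expects dump timestamp {old_timestamp!r}, found {dump.timestamp!r}"
        )
    dump.timestamp = new_timestamp
```

A real implementation would presumably update the timestamp only after all change objects have been applied successfully; the sketch ignores that.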

New page change
This object describes a page that was added to the wiki. Any new revision change, revision change and delete revision objects that follow belong to this page.


 * 1 byte change kind:
 * the Page object from normal dump, except without object kind and list of revision ids; this describes the new page

Page change
This object describes a page that was changed in the wiki. Any new revision change, revision change and delete revision objects that follow belong to this page.


 * 1 byte change kind:
 * 4 bytes page id
 * 1 byte page change flags; this can be  if the page metadata itself didn't change, but something about its revisions did (new revision, or a change within a revision)
   * : namespace changed
   * : title changed
   * : redirect target changed
 * if the “namespace changed” flag is set:
   * 2 bytes new namespace id
 * if the “title changed” flag is set:
   * short string new page title
 * if the “redirect target changed” flag is set:
   * short string new redirect target
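The flag-gated layout above could be decoded roughly as follows. This is a sketch under several assumptions that are not stated on this page: integers are little-endian, the namespace id is signed, a short string is a 1-byte length followed by UTF-8 bytes, and the flag bit values are supplied by the caller in a hypothetical `masks` mapping rather than hard-coded.

```python
import struct
from typing import Dict, Tuple


def read_short_string(buf: bytes, pos: int) -> Tuple[str, int]:
    # ASSUMPTION: 1-byte length prefix followed by that many UTF-8 bytes.
    length = buf[pos]
    end = pos + 1 + length
    return buf[pos + 1:end].decode("utf-8"), end


def parse_page_change(buf: bytes, pos: int, masks: Dict[str, int]) -> Tuple[dict, int]:
    """Parse the fields that follow the change kind byte.

    masks maps "namespace", "title" and "redirect" to their flag bits
    (hypothetical values chosen by the caller).
    Returns the decoded change and the offset of the next object.
    """
    # ASSUMPTION: little-endian integers.
    page_id, flags = struct.unpack_from("<IB", buf, pos)
    pos += 5
    change = {"page_id": page_id}
    if flags & masks["namespace"]:
        (change["namespace"],) = struct.unpack_from("<h", buf, pos)
        pos += 2
    if flags & masks["title"]:
        change["title"], pos = read_short_string(buf, pos)
    if flags & masks["redirect"]:
        change["redirect"], pos = read_short_string(buf, pos)
    return change, pos
```

When no flag is set, only the page id is returned, which matches the case where the page metadata is unchanged and only its revisions changed.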

Delete page change
This object describes a page that was deleted from the wiki. All of its revisions should be deleted too, unless they were already referenced in this diff dump (this can happen when a page is deleted and then undeleted).


 * 1 byte change kind:
 * 4 bytes deleted page id
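The exception for revisions already referenced in the diff means an applier has to track which revision ids it has seen so far. A minimal sketch, assuming a hypothetical in-memory model where `pages` maps a page id to its set of revision ids and `touched` holds the revision ids referenced earlier in this diff dump:

```python
from typing import Dict, Set


def delete_page(pages: Dict[int, Set[int]], revisions: Dict[int, object],
                page_id: int, touched: Set[int]) -> None:
    """Remove a page and its revisions, keeping revisions already referenced in this diff."""
    for rev_id in pages.pop(page_id, set()):
        if rev_id not in touched:
            revisions.pop(rev_id, None)
```

Revisions left behind this way are the ones that an earlier new page change or page change in the same diff has already claimed, as happens when a page is deleted and then undeleted.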

New revision change
This object describes a revision that was added to the wiki. This revision belongs to the page from the most recent new page change or page change object.


 * 1 byte change kind:
 * the Revision object from normal dump, except without object kind; this describes the new revision

Revision change
This object describes a revision that was changed in the wiki. This revision belongs to the page from the most recent new page change or page change object. That can be a different page than the one the revision originally belonged to, because of deletion followed by undeletion.

If a piece of information (comment, text or contributor) about a revision is deleted, the corresponding changed flag shouldn't be set. If that piece of information is undeleted, the changed flag has to be set.


 * 1 byte object kind:
 * 4 bytes revision id
 * 1 byte revision change flags (this can be  in the case of revision move due to deletion and undeletion)
   * : revision flags changed
   * : parent id changed
   * : timestamp changed
   * : contributor changed
   * : comment changed
   * : text changed
   * : model & format id changed
 * if the “revision flags changed” flag is set:
   * 1 byte new revision flags
 * if the “parent id changed” flag is set:
   * 4 bytes new parent id
 * if the “timestamp changed” flag is set:
   * 4 bytes new timestamp
 * if the “contributor changed” flag is set:
   * embedded User object
 * if the “comment changed” flag is set:
   * short string new comment
 * if the “text changed” flag is set:
   * short string new SHA-1
   * if this is a diff for a pages dump (with text):
     * long string LZMA-compressed new text of the revision
   * else (diff for a stub dump):
     * 4 bytes revision text length
 * if the “model & format id changed” flag is set:
   * 1 byte new model & format id (can be 0)
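Because the optional fields always appear in the same order, a reader can first turn the flags byte into the list of fields it should expect. A small sketch of that step, with the flag bit values passed in through a hypothetical `masks` mapping rather than assumed:

```python
from typing import Dict, List

# Order in which the optional fields follow the flags byte, as listed above.
FIELD_ORDER = [
    "revision_flags",
    "parent_id",
    "timestamp",
    "contributor",
    "comment",
    "text",
    "model_format",
]


def fields_present(flags: int, masks: Dict[str, int]) -> List[str]:
    """Return the names of the optional fields that follow, in file order.

    masks maps each name in FIELD_ORDER to its flag bit (hypothetical values
    supplied by the caller).
    """
    return [name for name in FIELD_ORDER if flags & masks[name]]
```

An empty result corresponds to a revision that only moved to another page: the object then consists of just the kind byte, the revision id and the flags byte.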

Delete revision change
This object describes a revision that was deleted from the wiki. As far as is known, this should not actually happen unless someone edits the database directly or does something similar.


 * 1 byte change kind:
 * 4 bytes deleted revision id

New model format change
This object describes a new model & format id that should be added to the model format index. Since new revision change and revision change objects use just the id to represent model & format, this object has to appear before any change that uses an id that was not previously present in the index.


 * 1 byte change kind:
 * 1 byte id
 * short string model
 * short string format
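The ordering requirement follows from how a reader resolves ids. A minimal sketch, assuming the index is kept as a plain dictionary (the function names are made up for this example):

```python
from typing import Dict, Tuple

ModelFormatIndex = Dict[int, Tuple[str, str]]


def add_model_format(index: ModelFormatIndex, mf_id: int, model: str, fmt: str) -> None:
    """Apply a new model format change: register the id in the index."""
    index[mf_id] = (model, fmt)


def resolve_model_format(index: ModelFormatIndex, mf_id: int) -> Tuple[str, str]:
    """Look up an id used by a revision; fails if no change object introduced it yet."""
    try:
        return index[mf_id]
    except KeyError:
        raise ValueError(f"model & format id {mf_id} is not in the index") from None
```

If a revision change referred to an id before the matching new model format change had been read, `resolve_model_format` would fail, which is why the object must come first in the file.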