Incremental dumps/File format/Diff specification

From mediawiki.org

This page describes the current format of “diff dumps”, that is, files that contain the changes that occurred on a wiki in a certain period of time. A diff dump can be applied to a normal incremental dump to update it.

The format is binary; the file contains various objects in series. There is no free space and no need for indexes.

Data types are exactly the same as in normal dumps. Some objects are very similar to objects in normal dumps.

The general structure is that the file starts with a file header and a Site info change object. What follows is a sequence of change objects. The order of change objects is undefined, except that revision changes always follow the corresponding page change.

File header

  • 4 bytes magic number: MWDD
  • 1 byte file format version: 1
  • 1 byte data version: 1
  • 1 byte dump kind flags (values are the same as in normal dumps)
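As an illustration, the 7-byte header above could be read as follows. This is a minimal sketch, not a reference implementation: the names `DiffHeader` and `parse_header` are hypothetical, and rejecting any version other than 1 is an assumption about how a reader might want to behave.

```python
import struct
from typing import NamedTuple

MAGIC = b"MWDD"
# 4-byte magic followed by three single bytes (7 bytes total).
# All fields are byte-sized, so byte order does not matter here.
HEADER = struct.Struct("<4sBBB")


class DiffHeader(NamedTuple):
    format_version: int
    data_version: int
    kind_flags: int


def parse_header(data: bytes) -> DiffHeader:
    """Parse the diff dump file header from the start of `data`."""
    if len(data) < HEADER.size:
        raise ValueError("truncated file header")
    magic, format_version, data_version, kind_flags = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic number: {magic!r}")
    # Assumption: a reader written for version 1 refuses anything else.
    if format_version != 1 or data_version != 1:
        raise ValueError("unsupported file format or data version")
    return DiffHeader(format_version, data_version, kind_flags)
```

The dump kind flags are returned as a raw byte because their values are defined by the normal dump format, not by this page.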

Site info change

This object contains the new site info object (necessary e.g. when a namespace is added to a wiki) along with information about the name and timestamps of the dump.

  • 1 byte change kind: 0x01
  • short string dump name
  • short string old timestamp; the dump this diff is applied to must have this timestamp
  • short string new timestamp; the dump will have this timestamp after the diff is applied
  • the Site info object from a normal dump, except without object kind, name and timestamp
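The timestamp check described above might look like the sketch below. It relies on an assumption this page does not state: that a "short string" is a 1-byte length followed by that many UTF-8 bytes (the real encoding is defined by the normal dump format). The helper names `read_exact`, `read_short_string` and `read_site_info_prefix` are hypothetical.

```python
import io


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of diff dump")
    return data


def read_short_string(stream) -> str:
    # ASSUMPTION: 1-byte length prefix followed by UTF-8 data.
    length = read_exact(stream, 1)[0]
    return read_exact(stream, length).decode("utf-8")


def read_site_info_prefix(stream, current_timestamp: str):
    """Read change kind, dump name and both timestamps.

    Raises ValueError if the diff does not apply to a dump that has
    `current_timestamp`. The stream is left positioned at the start
    of the embedded Site info object, which is not parsed here.
    """
    kind = read_exact(stream, 1)[0]
    if kind != 0x01:
        raise ValueError(f"expected Site info change, got kind 0x{kind:02x}")
    name = read_short_string(stream)
    old_timestamp = read_short_string(stream)
    new_timestamp = read_short_string(stream)
    if old_timestamp != current_timestamp:
        raise ValueError("diff was made for a dump with a different timestamp")
    return name, old_timestamp, new_timestamp
```

Timestamps are compared as opaque strings here, since their format is not specified on this page.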

New page change

This object describes a page that was added to the wiki. Any new revision change, revision change and delete revision objects that follow belong to this page.

Page change

This object describes a page that was changed in the wiki. Any new revision change, revision change and delete revision objects that follow belong to this page.

  • 1 byte change kind: 0x11
  • 4 bytes page id
  • 1 byte page change flags; this can be 0x00 if the page metadata itself didn't change, but something about its revisions did (new revision, or a change within a revision)
    • 0x01: namespace changed
    • 0x02: title changed
    • 0x04: redirect target changed
  • if the “namespace changed” flag is set:
    • 2 bytes new namespace id
  • if the “title changed” flag is set:
    • short string new page title
  • if the “redirect target changed” flag is set:
    • short string new redirect target
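The flag-driven layout above can be sketched as a small reader. Several details are assumptions, because this page defers data types to the normal dump format: integers are taken to be little-endian, the namespace id is taken to be signed, and a short string is taken to be a 1-byte length followed by UTF-8 bytes. The names `PageChange`, `PageChangeFlags` and `read_page_change` are hypothetical.

```python
import io
import struct
from dataclasses import dataclass
from enum import IntFlag
from typing import Optional


class PageChangeFlags(IntFlag):
    NAMESPACE = 0x01
    TITLE = 0x02
    REDIRECT_TARGET = 0x04


@dataclass
class PageChange:
    page_id: int
    namespace: Optional[int] = None        # None means "unchanged"
    title: Optional[str] = None
    redirect_target: Optional[str] = None


def _read_exact(stream, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of diff dump")
    return data


def _read_short_string(stream) -> str:
    # ASSUMPTION: 1-byte length prefix followed by UTF-8 data.
    length = _read_exact(stream, 1)[0]
    return _read_exact(stream, length).decode("utf-8")


def read_page_change(stream) -> PageChange:
    # ASSUMPTION: multi-byte integers are little-endian.
    kind, page_id, raw_flags = struct.unpack("<BIB", _read_exact(stream, 6))
    if kind != 0x11:
        raise ValueError(f"expected Page change, got kind 0x{kind:02x}")
    flags = PageChangeFlags(raw_flags)
    change = PageChange(page_id)
    # Optional fields appear in flag order, only when their flag is set.
    if PageChangeFlags.NAMESPACE in flags:
        # ASSUMPTION: the namespace id is a signed 16-bit integer.
        (change.namespace,) = struct.unpack("<h", _read_exact(stream, 2))
    if PageChangeFlags.TITLE in flags:
        change.title = _read_short_string(stream)
    if PageChangeFlags.REDIRECT_TARGET in flags:
        change.redirect_target = _read_short_string(stream)
    return change
```

A flags byte of 0x00 yields a `PageChange` with only `page_id` set, which matches the case where only the page's revisions changed.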

Full delete page change

This object describes a page that was deleted from the wiki. All of its revisions should be deleted too.

  • 1 byte change kind: 0x12
  • 4 bytes deleted page id

Partial delete page change

This object describes a page that was deleted from the wiki, but at least some of its revisions weren't (this can happen when a page is deleted and then undeleted). The revisions that were deleted should be listed separately using Delete revision changes.

  • 1 byte change kind: 0x13
  • 4 bytes deleted page id
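Both delete page changes share the same 5-byte layout and differ only in the change kind, so one reader can cover both. The sketch below assumes a little-endian page id, which this page does not state; `read_delete_page` is a hypothetical name.

```python
import io
import struct

FULL_DELETE_PAGE = 0x12
PARTIAL_DELETE_PAGE = 0x13


def read_delete_page(stream):
    """Return (page_id, delete_all_revisions) for either delete page change."""
    raw = stream.read(5)
    if len(raw) != 5:
        raise EOFError("unexpected end of diff dump")
    # ASSUMPTION: the 4-byte page id is little-endian and unsigned.
    kind, page_id = struct.unpack("<BI", raw)
    if kind not in (FULL_DELETE_PAGE, PARTIAL_DELETE_PAGE):
        raise ValueError(f"not a delete page change: 0x{kind:02x}")
    # For a partial delete, the revisions to remove arrive separately
    # as Delete revision changes.
    return page_id, kind == FULL_DELETE_PAGE
```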

New revision change

This object describes a revision that was added to the wiki. This revision belongs to the page from the last new page change or page change object.

  • 1 byte change kind: 0x20
  • the Revision object from a normal dump, except without object kind; this describes the new revision
    • for a pages dump, text is described just as a 1 byte index into a text group; the right text group is the one from the most recent text group object

Revision change

This object describes a revision that was changed in the wiki. This revision belongs to the page from the last new page change or page change object. That can be a different page than the one the revision originally belonged to, because of deletion followed by undeletion.

If a piece of information (comment, text or contributor) about a revision is deleted, the corresponding changed flag shouldn't be set. If that piece of information is undeleted, the changed flag has to be set.

  • 1 byte change kind: 0x21
  • 4 bytes revision id
  • 1 byte revision change flags (this can be 0x00 in the case of a revision move due to deletion and undeletion)
    • 0x01: revision flags changed
    • 0x02: parent id changed
    • 0x04: timestamp changed
    • 0x08: contributor changed
    • 0x10: comment changed
    • 0x20: text changed
    • 0x40: model & format id changed
  • if the “revision flags changed” flag is set:
    • 1 byte new revision flags
  • if the “parent id changed” flag is set:
    • 4 bytes new parent id
  • if the “timestamp changed” flag is set:
    • 4 bytes new timestamp
  • if the “contributor changed” flag is set:
  • if the “comment changed” flag is set:
    • short string new comment
  • if the “text changed” flag is set:
    • short string new SHA-1
    • if this is a diff for a pages dump (with text):
      • 1 byte index into a text group containing the text for this revision; the right text group is from the most recent text group object
    • else (diff for a stub dump):
      • 4 bytes revision text length
  • if the “model & format id changed” flag is set:
    • 1 byte new model & format id (can be 0)
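Because the optional fields follow in flag order, a reader first has to decode the flags byte. The sketch below does only that step: it maps the byte to the list of fields that follow, in the order they appear. It does not read the fields themselves. `RevisionChangeFlags` and `fields_present` are hypothetical names.

```python
from enum import IntFlag
from typing import List


class RevisionChangeFlags(IntFlag):
    # Declared in the order in which the optional fields are stored.
    REVISION_FLAGS = 0x01
    PARENT_ID = 0x02
    TIMESTAMP = 0x04
    CONTRIBUTOR = 0x08
    COMMENT = 0x10
    TEXT = 0x20
    MODEL_FORMAT = 0x40


def fields_present(flags_byte: int) -> List[str]:
    """Names of the optional fields that follow, in on-disk order."""
    flags = RevisionChangeFlags(flags_byte)
    return [member.name for member in RevisionChangeFlags if member in flags]


print(fields_present(0x30))  # ['COMMENT', 'TEXT']
print(fields_present(0x00))  # [] (a revision moved by deletion and undeletion)
```

Using an `IntFlag` keeps the bit values next to their names, so a later reader function can test `RevisionChangeFlags.TEXT in flags` rather than a bare `0x20`.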

Delete revision change

This object describes a revision that was deleted from the wiki.

This can be used when the page this revision belonged to has been deleted, but some of its revisions weren't (see Partial delete page change).

If the page this revision belonged to wasn't deleted, this change has to follow the corresponding page change object. Otherwise, it can be placed anywhere.

Note: this change has nothing to do with RevisionDelete. For that, see the last three flags on the Revision object.

  • 1 byte change kind: 0x22
  • 4 bytes deleted revision id

New model format change

This object describes a new model & format id that should be added to the model format index. Since new revision change and revision change use just the id to represent model & format, this object has to be included before any change that uses an id that was not previously present in the index.

  • 1 byte change kind: 0x30
  • 1 byte id
  • short string model
  • short string format
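One way to picture the model format index is as a mapping from id to a (model, format) pair that grows as these objects are read. The sketch below assumes that a short string is a 1-byte length followed by UTF-8 bytes, which is not stated on this page. The function name `apply_new_model_format` and the use of a plain `dict` as the index are hypothetical.

```python
import io
from typing import Dict, Tuple


def _read_exact(stream, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of diff dump")
    return data


def _read_short_string(stream) -> str:
    # ASSUMPTION: 1-byte length prefix followed by UTF-8 data.
    length = _read_exact(stream, 1)[0]
    return _read_exact(stream, length).decode("utf-8")


def apply_new_model_format(stream, index: Dict[int, Tuple[str, str]]) -> None:
    """Read one New model format change and register it in `index`."""
    kind = _read_exact(stream, 1)[0]
    if kind != 0x30:
        raise ValueError(f"expected New model format change, got 0x{kind:02x}")
    entry_id = _read_exact(stream, 1)[0]
    model = _read_short_string(stream)
    fmt = _read_short_string(stream)
    # Later revision changes refer to this pair by id only.
    index[entry_id] = (model, fmt)
```

A revision change that carries an id missing from `index` would then indicate a malformed diff, since the defining object must come first.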

Text group

This object contains a compressed group of texts for the following new revision changes and revision changes.

It is saved the same as the text group object from a normal dump, except with a change kind of 0x40 instead of the object kind.