Manual:Mwbzutils

From mediawiki.org

Introduction

This page describes a collection of command-line utilities, written in C, for checking or manipulating MediaWiki XML dump files of page content or metadata. Some of these utilities are used by Wikimedia in the production of XML dumps; they are not required for third-party sites or for sites with a mirror of our content.

You aren't likely to need these unless you are doing very specific work with the bz2-compressed XML dumps or are generating them in bulk yourself.

They have only been tested on 64-bit Linux and 64-bit FreeBSD. They are likely to fail to build or run anywhere else.

Summary of the utilities

  • checkforbz2footer: determine whether a bzipped file terminates with a bz2 footer; this can be used as a quick test to see if the file was truncated during generation
  • dumpbz2filefromoffset: given an offset into a bzipped XML file, display the contents from the first page tag until the end of the file, prefacing the output with the usual <mediawiki> and <siteinfo> content
  • dumplastbz2block: find the last bz2 block marker in a bzipped file and display whatever can be decompressed after that point; this can be used to determine what is left to process when a dump job dies in the middle
  • findpageidinbz2xml: display the offset of the bz2 block in the specified XML dump file containing the given page id; this assumes the dumps are written with monotonically increasing page ids, as is currently the case
  • recompressxml: reads XML page content from stdin and writes a multistream bz2 file, where each bz2 stream contains the specified number of pages (except the last stream, which may have fewer)
  • writeuptopageid: reads XML page content/metadata from stdin, writes it for the specified page range; this assumes the input has monotonically increasing page ids
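
As a rough illustration of the multistream layout that recompressxml is described as producing, and of why a byte offset is enough to start reading in the middle of such a file, here is a minimal Python sketch using only the standard bz2 module. It is not how the C utilities are implemented. The function names write_multistream and read_stream_at and the sample page strings are hypothetical, made up for this example. The truncation check relies on the decompressor reaching the end-of-stream marker, which is a different technique from scanning for the footer directly.

```python
import bz2

def write_multistream(pages, pages_per_stream):
    """Compress pages into concatenated bz2 streams; return (data, offsets)."""
    out = bytearray()
    offsets = []  # byte offset at which each bz2 stream starts
    for i in range(0, len(pages), pages_per_stream):
        offsets.append(len(out))
        chunk = "".join(pages[i:i + pages_per_stream]).encode("utf-8")
        out += bz2.compress(chunk)  # each call yields one complete bz2 stream
    return bytes(out), offsets

def read_stream_at(data, offset):
    """Decompress only the single bz2 stream that starts at the given offset."""
    decomp = bz2.BZ2Decompressor()
    raw = decomp.decompress(data[offset:])
    if not decomp.eof:
        # The end-of-stream marker was never reached, so the data was cut short.
        raise ValueError("stream is truncated")
    return raw.decode("utf-8")

# Hypothetical sample input: five tiny "pages" with increasing ids.
pages = ["<page><id>%d</id></page>\n" % n for n in (1, 2, 5, 8, 13)]
data, offsets = write_multistream(pages, pages_per_stream=2)

print(len(offsets))                       # 3 streams: 2 + 2 + 1 pages
print(read_stream_at(data, offsets[1]))   # the pages with ids 5 and 8
```

Because every stream is independently decompressible, a reader that knows a stream's starting offset does not need to decompress anything before it. The last stream holds fewer pages than the others whenever the page count is not a multiple of pages_per_stream.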

Setup and use

Getting the source

You can get the source from the WMF gerrit repo: git clone https://gerrit.wikimedia.org/r/operations/dumps/mwbzutils.git.

Building the source

On a 64-bit Linux platform with the standard build tools and the libbz2 library and headers available, running make should build the binary executables.

Installing

Installation is done by default into /usr/local/bin.

Use

Each utility, if run with the --help option, will produce a comprehensive usage message.
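
To read all of the usage messages in one pass, something like the following Python sketch could be used. It assumes the six utilities listed above have been installed somewhere on PATH. The helper name usage_message is hypothetical. The page does not say whether the usage text goes to stdout or stderr, so both are captured.

```python
import shutil
import subprocess

# Utility names as listed in the summary on this page.
UTILITIES = [
    "checkforbz2footer",
    "dumpbz2filefromoffset",
    "dumplastbz2block",
    "findpageidinbz2xml",
    "recompressxml",
    "writeuptopageid",
]

def usage_message(tool):
    """Return the --help text for tool, or None if it is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    # Capture both streams, since which one carries the usage text is an unknown.
    result = subprocess.run([path, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr

for tool in UTILITIES:
    text = usage_message(tool)
    print("== %s ==" % tool)
    print(text if text is not None else "(not found on PATH)")
```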