MWDumper is a tool written in Java for extracting sets of pages from a MediaWiki dump file. For example, it can load Wikipedia's content into MediaWiki.

MWDumper can read MediaWiki XML export dumps (version 0.3, minus uploads), perform optional filtering, and output back to XML or to SQL statements to add things directly to a database in 1.4 or 1.5 schema.

After many years, it is still very much under construction.[1]

Note: While MWDumper can be used to import XML dumps into a MediaWiki database, it might not be the best choice for small imports (say, 100 pages or fewer). See Manual:Importing XML dumps for an overview.

Where to find it

To import current XML export dumps, you'll need an up-to-date build of MWDumper...

Current WMF builds are produced by jenkins; check the copy at first, and for the absolute latest version check the jenkins build.

You can find a mostly up-to-date build at

Third-party builds start in GUI mode by default, so you won't need most of the parameters below, but they may not contain the latest bug fixes. Just run one with java -jar mwdumper.jar.

There are also third-party builds that do not start in GUI mode by default.

You can also build MWDumper from source (see "How to build MWDumper from source" below).

Usage

Prerequisites for imports via MWDumper

Before using mwdumper, your page, text, and revision tables must be empty. To empty them, do this (note that this will wipe out an existing wiki):

In maintenance directory: php rebuildall.php
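
The SQL for emptying the tables is not shown above. As a minimal sketch, assuming the default table names with no table prefix, the three tables can be emptied with TRUNCATE TABLE statements. The helper name empty_tables_sql is made up for illustration, and running its output against a live wiki destroys all pages and revisions.

```shell
# Hypothetical helper: print one TRUNCATE statement per table that must be empty.
# Assumes the wiki uses no table prefix.
empty_tables_sql() {
  for table in page text revision; do
    printf 'TRUNCATE TABLE %s;\n' "$table"
  done
}

# Review the statements first, then pipe them into mysql, for example:
#   empty_tables_sql | mysql -u <username> -p <databasename>
empty_tables_sql
```

Printing the statements before running them gives you a chance to check that they target the right database.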

Import dump files with MWDumper

Sample command line for a direct database import:

  java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 |
    mysql -u <username> -p <databasename>

If you are working from a source checkout instead of a prebuilt jar, compile the classes and run the Dumper class directly:


  cd mwdumper/src
  javac org/mediawiki/dumper/
  cd ..
  java -classpath ./src org.mediawiki.dumper.Dumper --format=sql:1.5 pages_full.xml.bz2 |
    mysql -u <username> -p <databasename>

Note: A third-party developer has added a tab-delimited output format for processing large dumps. Both a compiled version and the project's source code are available. To use it, you have to specify a separate output file for pages, since you don't want the two tab-delimited outputs dumped into the same file. This was done specifically for processing the large Wikipedia dumps with Hadoop. Usage is shown below:

cat train.xml | java -jar mwdumper.jar --format=flatfile:pages_output_file.txt - --quiet > train.txt

Hint: The tables 'page', 'revision' and 'text' must be empty for a successful import.

Note: The piped command above keeps going even if MySQL reports an error, which is probably not what you want. If you use the direct connection to MySQL instead, the import stops when an error occurs.

Note: If you nohup a mwdumper command, be sure to use the --quiet option.

A note on character encoding

Make sure the database is expecting utf8-encoded text. If the database is expecting latin1-encoded text (which MySQL does by default), you'll get invalid characters in your tables if you use the output of mwdumper directly. One way to make MySQL expect UTF-8 is to pass --default-character-set=utf8 to mysql in the sample command above.

If you want to use the output of mwdumper in a JDBC URL, you should set characterEncoding=utf8 in the query string.

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get error messages like Duplicate entry in UNIQUE Key 'name_title' because MySQL fails to distinguish certain characters.
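
As a minimal sketch of the first suggestion, the sample import command can be wrapped in a function that always passes --default-character-set=utf8 to mysql. The function name import_utf8 and its argument order are made up for illustration; the dump file, user name, and database name are placeholders.

```shell
# Hypothetical wrapper around the sample import command.
# $1 = dump file, $2 = MySQL user, $3 = database name (all placeholders).
import_utf8() {
  java -jar mwdumper.jar --format=sql:1.5 "$1" |
    mysql --default-character-set=utf8 -u "$2" -p "$3"
}

# Example call (mysql prompts for the password):
#   import_utf8 pages_full.xml.bz2 <username> <databasename>
```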

Complex filtering

You can also do complex filtering to produce multiple output files:

  java -jar mwdumper.jar \
    --output=bzip2:pages_public.xml.bz2 \
      --format=xml \
      --filter=notalk \
      --filter=namespace:\!NS_USER \
      --filter=latest \
    --output=bzip2:pages_current.xml.bz2 \
      --format=xml \
      --filter=latest \
    --output=gzip:pages_full_1.5.sql.gz \
      --format=sql:1.5 \
    --output=gzip:pages_full_1.4.sql.gz \
      --format=sql:1.4 \
    pages_full.xml.gz

A bare parameter will be interpreted as a file to read XML input from; if "-" or none is given, input will be read from stdin. Input files with ".gz" or ".bz2" extensions will be decompressed as gzip and bzip2 streams, respectively.

Internal decompression of 7-zip .7z files is not yet supported; you can pipe such files through p7zip's 7za:

  7za e -so pages_full.xml.7z |
    java -jar mwdumper.jar --format=sql:1.5 |
    mysql -u <username> -p <databasename>

Defaults if no parameters are given:

  • read uncompressed XML from stdin
  • write uncompressed XML to stdout
  • no filtering

Output sinks

  --output=stdout
      Send uncompressed XML or SQL output to stdout for piping.
      (May have charset issues.) This is the default if no output
      is specified.
  --output=file:<filename.xml>
      Write uncompressed output to a file.
  --output=gzip:<filename.xml.gz>
      Write gzip-compressed output to a file.
  --output=bzip2:<filename.xml.bz2>
      Write bzip2-compressed output to a file.
  --output=mysql:<jdbc url>
      Valid only for SQL format output; opens a connection to the
      MySQL server and sends commands to it directly.
      This will look something like:
      mysql://<host>/<databasename>?user=<username>&password=<password>

Output formats

  --format=xml
      Output back to MediaWiki's XML export format; use this for
      filtering dumps for limited import. Output should be idempotent.
  --format=sql:1.4
      SQL statements formatted for bulk import in MediaWiki 1.4's schema.
  --format=sql:1.5
      SQL statements formatted for bulk import in MediaWiki 1.5's schema.
      Both SQL schema versions currently require that the table structure
      be already set up in an empty database; use maintenance/tables.sql
      from the MediaWiki distribution.

Filter actions

  --filter=latest
      Skips all but the last revision listed for each page.
      FIXME: currently this pays no attention to the timestamp or
      revision number, but simply the order of items in the dump.
      This may or may not be strictly correct.
  --filter=list:<list-filename>
      Excludes all pages whose titles do not appear in the given file.
      Use one title per line; blanks and lines starting with # are
      ignored. Talk and subject pages of given titles are both matched.
  --filter=exactlist:<list-filename>
      As above, but does not try to match associated talk/subject pages.
  --filter=namespace:[!]<NS_KEY,NS_OTHERKEY,100,...>
      Includes only pages in (or not in, with "!") the given namespaces.
      You can use the NS_* constant names or the raw numeric keys.
  --filter=notalk
      Excludes all talk pages from output (including custom namespaces).
  --filter=titlematch:<regex>
      Excludes all pages whose titles do not match the regex.
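
To illustrate the title-list file format described above, here is a small sketch that writes a list file and shows which lines remain once comments and blank lines are dropped. The file name titles.txt, the titles, and the helper effective_titles are made up; the grep only imitates the documented rule and is not MWDumper's own parser.

```shell
# Write a hypothetical title list: one title per line, plus a comment and a blank line.
cat > titles.txt <<'EOF'
# Pages to keep
Main Page

Help:Contents
EOF

# Approximate the documented rule: drop blank lines and lines starting with '#'.
effective_titles() {
  grep -v -e '^#' -e '^[[:space:]]*$' "$1"
}

effective_titles titles.txt

# The list file is then passed to MWDumper, for example:
#   java -jar mwdumper.jar --filter=list:titles.txt --format=xml pages_full.xml.bz2 > subset.xml
```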

Misc options

  --progress=<n>
      Change progress reporting interval from the default 1000 revisions.
  --quiet
      Don't send any progress output to stderr. Recommended when running under nohup.

Direct connection to MySQL

Example of using mwdumper with a direct connection to MySQL

java -server -classpath mysql-connector-java-3.1.11/mysql-connector-java-3.1.11-bin.jar:mwdumper.jar \
   org.mediawiki.dumper.Dumper --output=mysql://<host>/<databasename>?user=<username>\&password=wiki \
   --format=sql:1.5 20051020_pages_articles.xml.bz2


  • You will need the mysql-connector JDBC driver. On Ubuntu this comes in package libmysql-java and is installed at /usr/share/java/mysql-connector-java.jar.
  • The JRE does not allow you to mix the -jar and -classpath arguments (hence the different command structure).
  • The --output argument must come before the --format argument.
  • The ampersand in the MySQL URI must be escaped on Unix-like systems.
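
As an alternative to backslash-escaping, the whole argument can be double-quoted so the shell leaves the ampersands alone. A minimal sketch, assuming the usual Connector/J layout of host/database followed by user and password query parameters; the helper name jdbc_output_arg and all four values are placeholders made up for illustration.

```shell
# Hypothetical helper: build the --output argument for a direct MySQL connection.
# $1 = host, $2 = database, $3 = user, $4 = password (all placeholders).
jdbc_output_arg() {
  printf '%s\n' "--output=mysql://$1/$2?user=$3&password=$4&characterEncoding=utf8"
}

# Inside double quotes the shell does not treat '&' as a control operator,
# so no backslash is needed.
jdbc_output_arg localhost wikidb wiki secret
```

The printed string can be passed to the Dumper class as a single quoted argument, ahead of --format.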

Example of using mwdumper with a direct connection to MySQL on Windows XP

If the example above causes problems, the following works better on Windows XP.

1. Create a batch file with the following text.

set class=mwdumper.jar;mysql-connector-java-3.1.12/mysql-connector-java-3.1.12-bin.jar
set data="C:\Documents and Settings\All Users.WINDOWS\Documents\enwiki-20060207-pages-articles.xml.bz2"
java -client -classpath %class% org.mediawiki.dumper.Dumper "--output=mysql://<host>/<databasename>?user=<username>&password=<password>&characterEncoding=UTF8" "--format=sql:1.5" %data%

2. Download the mysql-connector-java-3.1.12-bin.jar and mwdumper.jar.

3. Run the batch file.


  1. It still reports a problem with the import files, "duplicate key"...
  2. The class path separator is a ; (semi-colon) in this example; different from the example above.

The "duplicate key" error may result from the page, revision and text tables in the database not being empty, or from character encoding problems. See A note on character encoding.

Performance Tips

Please elaborate on these tips if you can.

To speed up importing into a database, you might try the following:

Remove indexes and auto-increment fields

Temporarily remove all indexes and auto_increment fields from the following tables: page, revision and text. This gives a tremendous speed bump, because MySQL will otherwise be updating these indexes after each insert.

Don't forget to recreate the indexes afterwards.


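The ALTER TABLE statements in "Alternate method of loading a huge wiki" below show one way to drop these indexes before the import and re-create them afterwards.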

Set -server option

Java's -server option may significantly increase performance on some versions of Sun's JVM for large files. (Not all installations will have this available.)

Increase MySQL's innodb_log_file_size

Increase MySQL's innodb_log_file_size. The default is as little as 5 MB, but you can improve performance dramatically by increasing this to reduce the number of disk writes. (See the my-huge.cnf sample config.)
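
As a rough sketch, the setting goes in the [mysqld] section of the server configuration. The file name import-tuning.cnf and the value 256M are illustrative assumptions, not recommendations. Check the MySQL manual for your version before applying it, since changing the log size needs a server restart and recent releases deprecate this variable in favor of a different one.

```shell
# Write a hypothetical config fragment; 256M is an example value only.
cat > import-tuning.cnf <<'EOF'
[mysqld]
innodb_log_file_size = 256M
EOF

# Show the fragment so it can be merged into the server's my.cnf by hand.
cat import-tuning.cnf
```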

Disable the binary log

If you don't need it, disable the binary log (log-bin option) during the import. On a standalone machine this is just wasteful, writing a second copy of every query that you'll never use.

More tips in the MySQL reference manual

Various other wacky tips can be found in the MySQL reference manual. If you find any useful ones, please write about them here.

Troubleshooting

If strange XML errors are encountered under Java 1.4, try Java 1.5.

If mwdumper throws java.lang.IllegalArgumentException: Invalid contributor, see bugzilla:18328.

If it throws java.lang.OutOfMemoryError: Java heap space, run it with a larger heap, for example java -Xms128m -Xmx1000m -jar mwdumper.jar ... (-Xms sets the initial heap size, -Xmx the maximum) (bug 21937).

How to build MWDumper from source

You can build MWDumper from source. Just

git clone

and let Maven sort out the dependencies.

Programming

Reporting bugs

Bugs can be reported in the MediaWiki Bugzilla.

Change history (abbreviated)

  • 2005-10-25: Switched SqlWriter.sqlEscape back to less memory-hungry StringBuffer
  • 2005-10-24: Fixed SQL output in non-UTF-8 locales
  • 2005-10-21: Applied more speedup patches from Folke
  • 2005-10-11: SQL direct connection, GUI work begins
  • 2005-10-10: Applied speedup patches from Folke Behrens
  • 2005-10-05: Use bulk inserts in SQL mode
  • 2005-09-29: Converted from C# to Java
  • 2005-08-27: Initial extraction code

Todo

  • Add some more junit tests
  • Include table initialization in SQL output
  • Allow use of table prefixes in SQL output
  • Ensure that titles and other bits are validated correctly.
  • Test XML input for robustness
  • Provide filter to strip ID numbers
  • <siteinfo> is technically optional; live without it and use default namespaces
  • GUI frontend(s)
  • Port to Python? ;)

Alternate method of loading a huge wiki

Warning: This method takes days to run.

If you have to load a huge wiki this might help...

Below is a set of instructions that makes loading a large wiki less error-prone and maybe a bit faster. It is not a script but rather a set of commands you can copy into bash (running in a screen session). You'll have to babysit and customize the process for your needs.

  1. Dump SQL to disk in even-sized chunks. This takes about 80 GB of hard drive space and 3 hours for enwiki.
  2. Set up the db to receive the chunks. This takes a few seconds.
  3. Import the chunks. This takes a few days for enwiki.
  4. Rebuild the DB. This takes another day for enwiki.
  5. Run standard post-import cleanup. I haven't finished this step successfully yet, but some of it can be skipped, I think.

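export DUMP_PREFIX=/public/datasets/public/enwiki/20130604/enwiki-20130604
export DIR_ROOT=/data/project/dump
export DIR=${DIR_ROOT}/enwiki
export DB=enwiki2
export LOG=~/log
# Tuning knobs used by the steps below; these values are illustrative assumptions, adjust them for your hardware.
export EXPORT_PROCESSES=4            # parallel mwdumper jobs
export IMPORT_PROCESSES=4            # parallel mysql imports
export EXPORT_FILE_SIZE=5            # lines (bulk INSERT statements) per chunk
export EXPORT_FILE_SUFFIX_LENGTH=8   # suffix length for the chunk names written by split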

bash -c 'sleep 1 && echo y' | mysqladmin drop ${DB} -u root
sudo rm -rf ${DIR}
rm -rf ${LOG}

sudo mkdir -p ${DIR}
sudo chown -R ${USER} ${DIR_ROOT}
mkdir -p ${LOG}

# Dump SQL to disk in even sized chunks.
# Sort by size descending to keep as many threads as possible hopping.
# uconv cleans up UTF-8 errors in the source files.
# grep removes BEGIN and COMMIT statements that mwdumper thinks are good, but I do better below
sudo apt-get install openjdk-7-jdk libicu-dev -y #jdk for mwdumper and libicu-dev for uconv
ls -1S ${DUMP_PREFIX}-pages-meta-current*.xml-p* |
  xargs -I{} -P${EXPORT_PROCESSES} -t bash -c '
  mkdir -p ${DIR}/$(basename {})
  cd ${DIR}/$(basename {})
  bunzip2 -c {} |
    uconv -f UTF-8 -t ascii --callback escape-xml-dec -v 2> ${LOG}/$(basename {}).uconv |
    java -jar ~/mwdumper-1.16.jar --format=sql:1.5 2> ${LOG}/$(basename {}).mwdumper |
    grep INSERT |
    split -l ${EXPORT_FILE_SIZE} -a ${EXPORT_FILE_SUFFIX_LENGTH} 2> ${LOG}/$(basename {}).split'

# Setup the db to receive the chunks.
mysqladmin create ${DB} --default-character-set=utf8 -u root
mysql -u root ${DB} < /srv/mediawiki/maintenance/tables.sql
mysql -u root ${DB} <<HERE
ALTER TABLE page
  CHANGE page_id page_id INTEGER UNSIGNED,
  DROP INDEX name_title,
  DROP INDEX page_random,
  DROP INDEX page_len,
  DROP INDEX page_redirect_namespace_len;
ALTER TABLE revision
  DROP INDEX rev_page_id,
  DROP INDEX rev_timestamp,
  DROP INDEX page_timestamp,
  DROP INDEX user_timestamp,
  DROP INDEX usertext_timestamp,
  DROP INDEX page_user_timestamp;
HERE

# Import the chunks
# Each chunk is wrapped in a transaction, and if the import succeeds the chunk is removed from disk.
# This means you should be able to safely ctrl-c the process at any time and rerun this block, and
# it'll pick up where it left off.  The worst case scenario is you'll get some chunk that was added
# but not deleted and you'll see mysql duplicate key errors.  Or something like that.  Anyway, if you
# are reading this you are a big boy and can figure out how to clean up the database or remove the file.
# The BEGIN and COMMIT files hold the statements that wrap each chunk in a transaction.
echo 'BEGIN;' > ${DIR_ROOT}/BEGIN
echo 'COMMIT;' > ${DIR_ROOT}/COMMIT
find ${DIR} -type f |
  sort -R |
  xargs -I{} -P${IMPORT_PROCESSES} -t bash -c '
    cat ${DIR_ROOT}/BEGIN {} ${DIR_ROOT}/COMMIT | mysql -u root ${DB} &&
    rm {}'

# Rebuild the DB
# Duplicate titles are collected in bad_page and renamed so the unique index can be re-created.
mysql -u root ${DB} <<HERE
CREATE TABLE bad_page AS
  SELECT page_namespace, page_title, COUNT(*) AS count
  FROM page GROUP BY page_namespace, page_title
  HAVING count > 1;
UPDATE page, bad_page
  SET page.page_title = CONCAT(page.page_title, page.page_id)
  WHERE page.page_namespace = bad_page.page_namespace AND page.page_title = bad_page.page_title;
DROP TABLE bad_page;
ALTER TABLE page
  ADD UNIQUE INDEX name_title (page_namespace,page_title),
  ADD INDEX page_random (page_random),
  ADD INDEX page_len (page_len),
  ADD INDEX page_redirect_namespace_len (page_is_redirect, page_namespace, page_len);
ALTER TABLE revision
  ADD UNIQUE INDEX rev_page_id (rev_page, rev_id),
  ADD INDEX rev_timestamp (rev_timestamp),
  ADD INDEX page_timestamp (rev_page,rev_timestamp),
  ADD INDEX user_timestamp (rev_user,rev_timestamp),
  ADD INDEX usertext_timestamp (rev_user_text,rev_timestamp),
  ADD INDEX page_user_timestamp (rev_page,rev_user,rev_timestamp);
HERE

# Run standard post import cleanup
cd /srv/mediawiki
php maintenance/update.php
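
The filter-and-split step of the export can be tried on toy data before committing to a multi-day run. This is only a sketch: the fake dump, the chunk size of 4 lines, and the 3-letter suffix are made-up values standing in for real mwdumper output and the EXPORT_* settings.

```shell
# Toy demonstration of the filter-and-split step; names and sizes are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fake mwdumper output: transaction statements around ten INSERT lines.
{
  echo 'BEGIN;'
  i=1
  while [ "$i" -le 10 ]; do
    echo "INSERT INTO page VALUES ($i);"
    i=$((i + 1))
  done
  echo 'COMMIT;'
} > dump.sql

# Keep only the INSERT lines, then write chunks of 4 lines with 3-letter suffixes.
grep INSERT dump.sql | split -l 4 -a 3

# List the chunk files that split created (default prefix "x").
ls x*
```

Each chunk holds only INSERT statements, which is what lets the import step wrap every chunk in its own transaction.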

Notes

  1. MIT-style license like our other Java/C# tools; boilerplate to be added.
