Manual:MWDumper

MWDumper is a quick little tool for extracting sets of pages from a MediaWiki dump file.

To import current XML export dumps, you should build MWDumper from source. You can find a mostly up-to-date build at https://integration.mediawiki.org/ci/job/MWDumper/org.wikimedia$mwdumper/.

Third-party builds (which start in GUI mode by default, so you won't need most of the parameters below; just run it with [...]) may not contain the latest bug fixes. There are also third-party builds that do not default to the GUI. An old JAR at download.wikimedia.org doesn't work.

It can read MediaWiki XML export dumps (version 0.3, minus uploads), perform optional filtering, and write the result back out as XML or as SQL statements that load the pages directly into a database using the MediaWiki 1.4 or 1.5 schema.

It is still very much under construction.

While this can be used to import XML dumps into a MediaWiki database, it may not always be the best choice for this task. See Manual:Importing XML dumps for an overview.

Usage
Sample command line for a direct database import:

java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -p <databasename>

[...] to MySQL in the above sample command.

If you want to use the output of mwdumper in a JDBC URL, you should set [...] in the query string.

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get error messages like [...] because MySQL fails to distinguish certain characters.

Complex filtering
You can also do complex filtering to produce multiple output files:

java -jar mwdumper.jar \
  --output=bzip2:pages_public.xml.bz2 \
    --format=xml \
    --filter=notalk \
    --filter=namespace:\!NS_USER \
    --filter=latest \
  --output=bzip2:pages_current.xml.bz2 \
    --format=xml \
    --filter=latest \
  --output=gzip:pages_full_1.5.sql.gz \
    --format=sql:1.5 \
  --output=gzip:pages_full_1.4.sql.gz \
    --format=sql:1.4 \
  pages_full.xml.gz

A bare parameter will be interpreted as a file to read XML input from; if "-" or none is given, input will be read from stdin. Input files with ".gz" or ".bz2" extensions will be decompressed as gzip and bzip2 streams, respectively.

Internal decompression of 7-zip .7z files is not yet supported; you can pipe such files through p7zip's 7za:

7za e -so pages_full.xml.7z | java -jar mwdumper.jar --format=sql:1.5 | mysql -u <username> -p <databasename>

[...] (the first is the starting size, the second the maximum size) (bug 21937)
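The input rules above can be summarised in a small shell helper. This is only a sketch: the function name mwdumper_pipeline is hypothetical, it assumes file names without spaces, and it prints the command line rather than running it, so you can check the result before piping it into mysql.

```shell
#!/bin/sh
# Sketch only: print (do not run) the command that feeds a dump into MWDumper.
# mwdumper_pipeline is a hypothetical helper name, not part of MWDumper.
# Assumes file names without spaces or shell metacharacters.
mwdumper_pipeline() {
    dump=$1
    base="java -jar mwdumper.jar --format=sql:1.5"
    case "$dump" in
        ''|-)
            # No file or "-": MWDumper reads the XML from stdin.
            printf '%s\n' "$base" ;;
        *.7z)
            # .7z is not decompressed internally, so go through 7za.
            printf '%s\n' "7za e -so $dump | $base" ;;
        *)
            # Plain XML, .gz and .bz2 are handled by MWDumper itself.
            printf '%s\n' "$base $dump" ;;
    esac
}
```

Calling mwdumper_pipeline pages_full.xml.7z prints the 7za pipeline shown earlier (without the mysql stage), while a .gz or .bz2 name is passed straight to MWDumper as a bare parameter.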

Performance Tips
To speed up importing into a database, you might try:


 * Temporarily remove all indexes and auto_increment fields from the following tables: page, revision and text. This gives a tremendous speed boost, because MySQL will otherwise be updating these indexes after each insert. Don't forget to recreate the indexes afterwards.
 * Java's -server option may significantly increase performance on some versions of Sun's JVM for large files. (Not all installations will have this available.)
 * Increase MySQL's innodb_log_file_size. The default is as little as 5 MB, but you can improve performance dramatically by increasing this to reduce the number of disk writes. (See the my-huge.cnf sample config.)
 * If you don't need it, disable the binary log (log-bin option) during the import. On a standalone machine this is just wasteful, writing a second copy of every query that you'll never use.
 * Various other wacky tips in the MySQL reference manual.
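As one example of the kind of tip the MySQL reference manual gives for bulk loading into InnoDB tables, the SQL stream can be wrapped in session settings that defer commits and key checks. The sketch below is an assumption-laden illustration, not something MWDumper does itself: wrap_bulk_import is a hypothetical helper name, and turning off unique_checks is only sensible if the dump is known to contain no duplicate keys.

```shell
#!/bin/sh
# Sketch only: wrap SQL read from stdin in session settings suggested by the
# MySQL manual for bulk loading InnoDB tables.
# wrap_bulk_import is a hypothetical helper name, not part of MWDumper or MySQL.
wrap_bulk_import() {
    # Defer commits and skip uniqueness / foreign-key checks during the load.
    printf '%s\n' 'SET autocommit=0;' 'SET unique_checks=0;' 'SET foreign_key_checks=0;'
    # Pass the SQL statements through unchanged.
    cat
    # Commit the load and restore the checks for the rest of the session.
    printf '%s\n' 'COMMIT;' 'SET unique_checks=1;' 'SET foreign_key_checks=1;'
}
```

It would sit between MWDumper and mysql in the pipeline, for example: java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | wrap_bulk_import | mysql -u <username> -p <databasename>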

Reporting bugs
Bugs can be reported to the mwdumper product in the MediaWiki Bugzilla.

Todo

 * Add some more JUnit tests
 * Include table initialization in SQL output
 * Allow use of table prefixes in SQL output
 * Ensure that titles and other bits are validated correctly.
 * Test XML input for robustness
 * Provide filter to strip ID numbers
 * <siteinfo> is technically optional; live without it and use default namespaces
 * GUI frontend(s)
 * Port to Python? ;)

Change history (abbreviated)

 * 2005-10-25: Switched SqlWriter.sqlEscape back to less memory-hungry StringBuffer
 * 2005-10-24: Fixed SQL output in non-UTF-8 locales
 * 2005-10-21: Applied more speedup patches from Folke
 * 2005-10-11: SQL direct connection, GUI work begins
 * 2005-10-10: Applied speedup patches from Folke Behrens
 * 2005-10-05: Use bulk inserts in SQL mode
 * 2005-09-29: Converted from C# to Java
 * 2005-08-27: Initial extraction code