Manual:MWDumper/ja

MWDumper is a tool written in Java for extracting sets of pages from a MediaWiki dump file. For example, it can load Wikipedia's content into MediaWiki.

MWDumper can read MediaWiki XML export dumps (version 0.3, minus uploads), perform optional filtering, and output back to XML or to SQL statements to add things directly to a database in 1.4 or 1.5 schema.

After many years, it is still very much under construction.

While this can be used to import XML dumps into a MediaWiki database, it might not be the best choice for small imports (say, 100 pages or fewer). See Manual:Importing XML dumps for an overview.

Where to find it
To import current XML export dumps, you'll need an up-to-date build of MWDumper...

Current WMF builds are produced by Jenkins; check the copy at download.wikimedia.org first, and for the absolute latest version check the Jenkins build.

You can find a mostly up-to-date build at https://integration.wikimedia.org/ci/view/Java/job/MWDumper-package/.

Third-party builds are also available. They start in GUI mode by default, so you won't need most of the parameters below, but they may not contain the latest bug fixes.

There are also third-party builds that do not default to the GUI.

You could build MWDumper from source, but you'll need many scattered dependencies dating from 2005 (see "How to build MWDumper from source" below).

Before you get started
Before using mwdumper, your page, text, and revision tables must be empty. To empty them, do the following (note that this will wipe out an existing wiki).

In SQL:

DELETE FROM page;
DELETE FROM text;
DELETE FROM revision;

In the maintenance directory:

php rebuildall.php

Import dump files with MWDumper
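A sample command line for a direct database import:

java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -p

The user name after -u and the target database name are left out above; supply your own. The SQL that mwdumper generates is passed straight to MySQL.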

If you want to use the output of mwdumper in a JDBC URL, you should set the character encoding in the query string.

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get errors because MySQL fails to distinguish certain characters.

Complex filtering
You can also do complex filtering to produce multiple output files:

java -jar mwdumper.jar \
  --output=bzip2:pages_public.xml.bz2 \
    --format=xml \
    --filter=notalk \
    --filter=namespace:\!NS_USER \
    --filter=latest \
  --output=bzip2:pages_current.xml.bz2 \
    --format=xml \
    --filter=latest \
  --output=gzip:pages_full_1.5.sql.gz \
    --format=sql:1.5 \
  --output=gzip:pages_full_1.4.sql.gz \
    --format=sql:1.4 \
  pages_full.xml.gz

A bare parameter is interpreted as a file from which to read XML input; if "-" or nothing is given, input is read from stdin. Input files with a ".gz" or ".bz2" extension are decompressed as gzip or bzip2 streams, respectively.
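The input rules above can be mimicked in a small shell wrapper, which may be handy when a dump has to pass through other tools before it reaches MWDumper. This is only a sketch: the function name read_dump is made up for illustration, and it imitates the described behaviour with gzip and bzip2 rather than reproducing MWDumper's own code.

```shell
# Hypothetical helper (not part of MWDumper): write the XML of a dump to
# stdout, choosing a decompressor from the file extension.
read_dump() {
  case "${1:--}" in
    -)     cat ;;                  # "-" or no argument: read from stdin
    *.gz)  gzip -dc -- "$1" ;;     # gzip stream
    *.bz2) bzip2 -dc -- "$1" ;;    # bzip2 stream
    *)     cat -- "$1" ;;          # anything else: plain XML file
  esac
}
```

For example, read_dump pages_full.xml.gz | java -jar mwdumper.jar --format=sql:1.5 gives MWDumper the decompressed XML on stdin.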

Internal decompression of 7-zip .7z files is not yet supported; you can pipe such files through p7zip's 7za:

7za e -so pages_full.xml.7z | java -jar mwdumper.jar --format=sql:1.5 | mysql -u -p

(first is starting, second maximum size) (bug 21937)

How to build MWDumper from source
You could build MWDumper from source, but you'll need many scattered dependencies dating from 2005.

''Note: If you manage to build MWDumper from source, please include instructions below on how you did so. Thank you.''

Step 1:

Step 2:

Step 3:

Reporting bugs
Bugs can be reported to the mwdumper product in the MediaWiki Bugzilla.

Change history (abbreviated)

 * 2005-10-25: Switched SqlWriter.sqlEscape back to less memory-hungry StringBuffer
 * 2005-10-24: Fixed SQL output in non-UTF-8 locales
 * 2005-10-21: Applied more speedup patches from Folke
 * 2005-10-11: SQL direct connection, GUI work begins
 * 2005-10-10: Applied speedup patches from Folke Behrens
 * 2005-10-05: Use bulk inserts in SQL mode
 * 2005-09-29: Converted from C# to Java
 * 2005-08-27: Initial extraction code

Todo

 * Add some more junit tests
 * Include table initialization in SQL output
 * Allow use of table prefixes in SQL output
 * Ensure that titles and other bits are validated correctly.
 * Test XML input for robustness
 * Provide filter to strip ID numbers
 * <siteinfo> is technically optional; live without it and use default namespaces
 * GUI frontend(s)
 * Port to Python? ;)

Alternate method of loading a huge wiki

 * Warning: This method takes days to run.

If you have to load a huge wiki this might help...

Below is a set of instructions that makes loading a large wiki less error-prone and maybe a bit faster. It is not a script but rather a set of commands you can copy into bash (running in a screen session). You'll have to babysit and customize the process for your needs.


 * 1) Dump SQL to disk in evenly sized chunks. This takes about 80 GB of hard drive space and 3 hours for enwiki.
 * 2) Set up the database to receive the chunks. This takes a few seconds.
 * 3) Import the chunks. This takes a few days for enwiki.
 * 4) Rebuild the database. This takes another day for enwiki.
 * 5) Run the standard post-import cleanup. I haven't finished this step successfully yet, but I think some of it can be skipped.

export DUMP_PREFIX=/public/datasets/public/enwiki/20130604/enwiki-20130604
export DIR_ROOT=/data/project/dump
export DIR=${DIR_ROOT}/enwiki
export EXPORT_PROCESSES=4
export IMPORT_PROCESSES=4
export DB=enwiki2
export EXPORT_FILE_SIZE=5
export EXPORT_FILE_SUFFIX_LENGTH=8
export LOG=~/log

bash -c 'sleep 1 && echo y' | mysqladmin drop ${DB} -u root
sudo rm -rf ${DIR}
rm -rf ${LOG}

sudo mkdir -p ${DIR}
sudo chown -R ${USER} ${DIR_ROOT}
mkdir -p ${LOG}

# Dump SQL to disk in even sized chunks.
# Sort by size descending to keep as many threads as possible hopping.
# uconv cleans up UTF-8 errors in the source files.
# grep removes BEGIN and COMMIT statements that mwdumper thinks are good, but I do better below
sudo apt-get install openjdk-7-jdk libicu-dev -y #jdk for mwdumper and libicu-dev for uconv
ls -1S ${DUMP_PREFIX}-pages-meta-current*.xml-p* |
  xargs -I{} -P${EXPORT_PROCESSES} -t bash -c '
    mkdir -p ${DIR}/$(basename {})
    cd ${DIR}/$(basename {})
    bunzip2 -c {} |
      uconv -f UTF-8 -t ascii --callback escape-xml-dec -v 2> ${LOG}/$(basename {}).uconv |
      java -jar ~/mwdumper-1.16.jar --format=sql:1.5 2> ${LOG}/$(basename {}).mwdumper |
      grep INSERT |
      split -l ${EXPORT_FILE_SIZE} -a ${EXPORT_FILE_SUFFIX_LENGTH} 2> ${LOG}/$(basename {}).split
  '
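To see what the grep and split stage of this pipeline does on its own, the sketch below applies it to arbitrary SQL on stdin. The function name chunk_inserts and its arguments are made up for illustration; the default suffix length of 8 simply mirrors EXPORT_FILE_SUFFIX_LENGTH above.

```shell
# Hypothetical helper: keep only the INSERT lines read from stdin and write
# them into directory $1 as files of at most $2 lines each. split names the
# files xaaaaaaaa, xaaaaaaab, ... when the suffix length ($3) is 8.
chunk_inserts() {
  dir=$1 lines=$2 suffix_len=${3:-8}
  mkdir -p "$dir" &&
  grep INSERT | (cd "$dir" && split -l "$lines" -a "$suffix_len")
}
```

For instance, piping five lines of SQL, three of them INSERT statements, into chunk_inserts out 2 leaves two files in out: one with two statements and one with one. The BEGIN and COMMIT lines are dropped, which is what lets the import step add its own transaction boundaries per chunk.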

# Setup the db to receive the chunks.
mysqladmin create ${DB} --default-character-set=utf8 -u root
mysql -u root ${DB} < /srv/mediawiki/maintenance/tables.sql
mysql -u root ${DB} <<HERE
ALTER TABLE page
  CHANGE page_id page_id INTEGER UNSIGNED,
  DROP INDEX name_title,
  DROP INDEX page_random,
  DROP INDEX page_len,
  DROP INDEX page_redirect_namespace_len;
ALTER TABLE revision
  CHANGE rev_id rev_id INTEGER UNSIGNED,
  DROP INDEX rev_page_id,
  DROP INDEX rev_timestamp,
  DROP INDEX page_timestamp,
  DROP INDEX user_timestamp,
  DROP INDEX usertext_timestamp,
  DROP INDEX page_user_timestamp;
ALTER TABLE text
  CHANGE old_id old_id INTEGER UNSIGNED;
HERE

# Import the chunks.
# Each chunk is wrapped in a transaction, and if the import succeeds the chunk is removed from disk.
# This means you should be able to safely ctrl-c the process at any time, rerun this block, and
# have it pick up where it left off. The worst case scenario is that some chunk was added
# but not deleted, and you'll see mysql duplicate key errors. Or something like that. Anyway, if you
# are reading this, you can figure out how to clean up the database or remove the file.
echo 'BEGIN;' > ${DIR_ROOT}/BEGIN
echo 'COMMIT;' > ${DIR_ROOT}/COMMIT
find ${DIR} -type f |
  sort -R |
  xargs -I{} -P${IMPORT_PROCESSES} -t bash -c '
    cat ${DIR_ROOT}/BEGIN {} ${DIR_ROOT}/COMMIT |
    mysql -u root ${DB} &&
    rm {}'
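The resume-safe pattern described here (wrap a chunk in one transaction, delete it only on success) can be isolated in a small function. This is a sketch under assumptions: import_chunk is a made-up name, and the loader command is passed in as arguments so the same logic can be tried with a stand-in before pointing it at mysql.

```shell
# Hypothetical helper: feed one chunk to a loader command inside a single
# transaction, and delete the chunk only if the loader exits successfully.
# Usage: import_chunk FILE LOADER [ARGS...]
import_chunk() {
  chunk=$1
  shift
  { echo 'BEGIN;'; cat -- "$chunk"; echo 'COMMIT;'; } | "$@" && rm -- "$chunk"
}
```

With the variables from this section it could be called as import_chunk "$file" mysql -u root "${DB}". Because a chunk disappears only after its loader succeeds, rerunning the loop over whatever files remain resumes an interrupted import.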

# Rebuild the DB
# The start of this heredoc, including the statement that creates bad_page, is cut off; only its tail "> 1;" survives.
mysql -u root ${DB} <<HERE
UPDATE page, bad_page
  SET page.page_title = CONCAT(page.page_title, page.page_id)
  WHERE page.page_namespace = bad_page.page_namespace
    AND page.page_title = bad_page.page_title;
DROP TABLE bad_page;
ALTER TABLE page
  CHANGE page_id page_id INTEGER UNSIGNED AUTO_INCREMENT,
  ADD UNIQUE INDEX name_title (page_namespace,page_title),
  ADD INDEX page_random (page_random),
  ADD INDEX page_len (page_len),
  ADD INDEX page_redirect_namespace_len (page_is_redirect, page_namespace, page_len);
ALTER TABLE revision
  CHANGE rev_id rev_id INTEGER UNSIGNED AUTO_INCREMENT,
  ADD UNIQUE INDEX rev_page_id (rev_page, rev_id),
  ADD INDEX rev_timestamp (rev_timestamp),
  ADD INDEX page_timestamp (rev_page,rev_timestamp),
  ADD INDEX user_timestamp (rev_user,rev_timestamp),
  ADD INDEX usertext_timestamp (rev_user_text,rev_timestamp),
  ADD INDEX page_user_timestamp (rev_page,rev_user,rev_timestamp);
ALTER TABLE text
  CHANGE old_id old_id INTEGER UNSIGNED AUTO_INCREMENT;
HERE

# Run standard post import cleanup
cd /srv/mediawiki
php maintenance/update.php