Manual talk:MWDumper

Process Error
When I tried to process an XML file exported by the Wikipedia export page, i.e. import it into a MySQL database, I got this error: "XML document structures must start and end within the same entity". Has anyone come across this before? How did you solve the problem eventually? Or is it a bug at the moment? Thank you all in advance. I look forward to someone discussing it.
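That error is typically what an XML parser reports when the file is truncated or was not downloaded completely. A quick sanity check, assuming the dump should end with a closing </mediawiki> tag (the file names here are just examples):

 bzip2 -t pages-articles.xml.bz2     # test a compressed dump for integrity
 tail -c 200 pages-articles.xml      # an uncompressed dump should end with </mediawiki>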

Table Creation
I am interested in using Wikipedia for research and do not need the web front end. I cannot use the browser-based setup used by MediaWiki. Is there either a list of CREATE TABLE statements necessary to make this database, or a non-browser version of the MediaWiki setup?
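One non-browser route, assuming a MySQL backend: the MediaWiki source tree ships the table definitions as a plain SQL file, maintenance/tables.sql, which can be loaded directly (database and user names here are just examples):

 mysqladmin -u root -p create wikidb
 mysql -u root -p wikidb < maintenance/tables.sql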

GFDL
From http://mail.wikipedia.org/pipermail/wikitech-l/2006-February/033975.html:


 * I hereby declare it GFDL and RTFM-compatible. :) -- brion vibber


 * So this article, which started as the README file from MWDumper, is allowed on the wiki. This might be good, as I tend to read wikis more than I read READMEs! --Kernigh 04:53, 12 February 2006 (UTC)

Example (in)correct?
Is the parameter -d correctly described in this example?
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -p
 * My mysql tells me:
 * -p, --password[=name] Password to use when connecting to server ...
 * -D, --database=name Database to use.
 * (mysql Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i486) using readline 5.1)
 * Would this be better?
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -D -p
 * (If the password is given on the command line, there must be no space between -p and the actual password.)
 * Or, if the password is omitted, it is requested interactively:
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -D -p
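 * For what it's worth, the same pipeline with placeholder values filled in would look like this (the user name, database name and file name are only examples; with a bare -p, mysql prompts for the password interactively):
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u wikiuser -D wikidb -p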

MWDumper error
Running WinXP, XAMPP, JRE 1.5.0_08, MySQL JDBC 3.1.13

http://f.foto.radikal.ru/0610/4d1d041f3fd7.png --89.178.61.174 22:09, 9 October 2006 (UTC)

MWDumper Issues
Using MWDumper, how would I convert a Wikipedia/Wikibooks XML dump to an SQL file?

ANSWER
java -jar mwdumper.jar --format=sql:1.5 x.xml > y.sql, where x.xml is the name of your input file and y.sql is the name of your output file.
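Once the SQL file exists, it can be loaded into MySQL as a second step (the database and user names here are just examples):

 mysql -u wikiuser -p wikidb < y.sql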

Problems with MWDumper
When I run: java -jar mwdumper.jar -–format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb

I get:

Exception in thread "main" java.io.FileNotFoundException: -ûformat=sql:1.5 (The system cannot find the file specified)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(Unknown Source)
        at java.io.FileInputStream.<init>(Unknown Source)
        at org.mediawiki.dumper.Tools.openInputFile(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)

Please help!

SOLUTION:
For the above problem here is the fix:

java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb

Notice the "-ûformat=sql:1.5" in the error message? The problem is that one of the dashes in "--format" is the wrong character, an en dash introduced by copy & paste. Just delete the pasted dashes and retype "--" by hand in front of format=sql:1.5, as in the command above.

P.S. For a really fast dump (60 min vs. 24 hrs), un-bzip the enwiki-latest-pages-articles.xml.bz2 file so that it becomes enwiki-latest-pages-articles.xml, then use the command: java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb
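Decompressing can be done with bzip2 itself; the -k flag keeps the original .bz2 around (drop it if disk space is tight):

 bunzip2 -k enwiki-latest-pages-articles.xml.bz2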

Page Limitations?
I'm attempting to import a Wikipedia database dump comprising about 4,800,000 pages on a Windows XP system. I'm using the following command: java -jar mwdumper.jar --format=sql:1.5 enwiki-20070402-pages-articles.xml | mysql -u root -p wikidb

Everything appears to go smoothly, and the progress indicator goes up to the expected 4 million and something, but only 432,000 pages are actually imported into the MySQL database. Why is this? Any assistance is greatly appreciated. Uiop 02:31, 15 April 2007 (UTC)


 * MySQL experienced some error, and the error message scrolled off your screen. To aid in debugging, either save the output from mysql's stderr stream, or run mwdumper to a file first, etc. --brion 21:15, 20 April 2007 (UTC)
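 * One way to keep the error output around, as suggested above, is to redirect mysql's stderr to a log file (a sketch; the file names and credentials are placeholders):
 * java -jar mwdumper.jar --format=sql:1.5 enwiki-20070402-pages-articles.xml.bz2 | mysql -u root -p wikidb 2> mysql-errors.log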

PROBLEM SOLVED
Mate, I had the same problem with it stopping at 432,000 pages. I'm assuming you're using WAMP here.

The problem is with the log files. If you go to C:\wamp\mysql\data (or whatever your equivalent directory is) you'll see two files, ib_logfile0 and ib_logfile1. You'll notice they are both 10 MB. They need to be much bigger. This is how you fix it.

To start off, you'll need to delete the dump you've done so far. Left-click on the WAMP icon in the taskbar, choose MySQL, then MySQL Console. It will ask you for a password, which is blank by default, so just press enter. Now type the following commands:

use wikidb;
delete from page;
delete from revision;
delete from text;
quit

OK. Now left-click on the WAMP icon in the taskbar, choose Config Files and then 'my.ini'. Find the line innodb_log_file_size, and set this to 512M (it was 10M in mine). Scroll to the bottom, and add the following line:

set-variable=max_allowed_packet=32M
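For reference, after both edits my.ini should contain lines roughly like these (exact placement varies between WAMP versions; set-variable= is the old MySQL 4.x spelling, and on newer servers a plain max_allowed_packet=32M under [mysqld] does the same thing):

 innodb_log_file_size=512M
 set-variable=max_allowed_packet=32M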

Left-click on the WAMP icon in the taskbar and select MySQL -> Stop Service. Open C:\wamp\mysql\data (or whatever your equivalent directory is) and delete ib_logfile0 and ib_logfile1. Left-click on the WAMP icon in the taskbar again, and select MySQL -> Start/Resume Service.

Now go ahead and run mwdumper.

Happy dumping!

Same problem as above, how to disable innodb_log_file or make it greater than 2048M?
I am having the same problem as above: with innodb_log_file_size set to 500MB, about 400k pages are created. With innodb_log_file_size set to 2000MB, I get 1.1 million pages created. I would like to import Enwiki's 5 million pages, so I need a much larger innodb_log_file_size. However, MySQL crashes on startup if I set this to a value larger than 2047MB. According to http://dev.mysql.com/doc/refman/5.0/en/innodb-configuration.html, the combined size of the log files is capped at 4GB. Does anyone know why this log file is written to so heavily by MWDumper, and how we can reduce the output to it?

Can dump be resumed?
I am using mwdumper to import enwiki-20070402-pages-articles.xml. I got up to 1,144,000 pages and then, instead of showing how many pages per second it was importing, it said (-/sec). At 1,115,000 it said the same thing.

After that, the import sped up dramatically: It says it's processing around (20/sec) but in fact it seems to be doing about 2000/sec, because those figures are flying past!

I'm afraid that after 30 hours of importing, I may have lost it. I don't want to start again, especially if that means the possibility of the same happening again. Debugging by trial and error could take the rest of my life!

Is there any way that I can resume the dump from 1,144,000 pages if need be?

(Hopefully this dump DOES work anyway. Maybe I need to increase innodb_log_file_size to 1024M or perhaps 2048M.)

Importing to a database with table prefix for wiki
I want to import an "XML export" of MediaWiki into a local wiki, but it tries to import into non-prefixed tables, whereas my tables have a prefix. How can I solve this problem? Is there a way to make this software import the XML into prefixed tables (like fa_page, fa_text, fa_revisions)? It's a shame if it doesn't have this feature.--Soroush 16:40, 5 September 2007 (UTC)

Yes. Open a text editor and paste:

 #!/usr/bin/perl
 while(<>) { s/INTO /INTO yourprefixhere_/g; print; }

Save it as prefixer.pl. Run MWDumper with the --output=file:temp.sql option (instead of --output=mysql:...). Then execute perl prefixer.pl < temp.sql > fill.sql. Run mysql -u wikiuser -pyourpasswordhere, type use wikidb, then source fill.sql. --Derbeth talk 21:11, 31 October 2007 (UTC)
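Putting the whole thing together, the sequence would look roughly like this (the prefix, user name, database name and file names are only examples):

 java -jar mwdumper.jar --format=sql:1.5 --output=file:temp.sql pages_full.xml.bz2
 perl prefixer.pl < temp.sql > fill.sql
 mysql -u wikiuser -p wikidb
 mysql> source fill.sql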

Source Code
Is the source code to mwdumper available?
 * http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/ --78.106.145.69 22:17, 23 October 2007 (UTC)
 * http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/ --82.255.239.71 09:52, 16 March 2008 (UTC)

Overwrite
How do I overwrite articles which already exist? Werran 21:08, 10 April 2008 (UTC)

Size restrictions
Maybe you could add a feature to limit how many pages are imported, or how big the resulting dump can be?

More recent compiled version
The latest compiled version of MWDumper in http://download.wikimedia.org/tools/ dates from 2006-Feb-01, while http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/README shows changes to the code up to 2007-07-06. The *.jar version from 2006 doesn't work on recent Commons dumps, and I don't know how to compile the program under Windows. Could you please make a more recent compiled version available? -- JovanCormac 06:28, 3 September 2009 (UTC)

This one? http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip


 * Can someone compile the latest rev for Windows? The version above replaces the contributor credentials with 127.0.0.1 ;<

Filtering does not seem to work
I want to import only a few pages from the whole English Wikipedia into my database. I suppose I should give the titles of the desired pages in a file (one per line) and use the "--filter=list:fileName" option. But when I tried this option, the filtering did not seem to have any effect: the script started to import pages, saying 4 pages, 1000 versions, 4 pages, 2000 versions and so on, and it imports pages that are not listed in the filter file.

This is the command that I use:

java -jar mwdumper.jar --filter=exactlist:titles --filter=latest --filter=notalk --output=file:out.txt --format=xml datasets/enwiki-latest-pages-meta-history.xml
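For comparison, the kind of titles file I would expect the list/exactlist filters to read is just one page title per line, something like the hypothetical file below (I am not certain whether spaces or underscores are expected, which may itself be the source of the problem):

 Albert Einstein
 Quantum mechanics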

0.4 compatible version?
Given that dumps are now in version 0.4 format ("http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4") and MWDumper's page says "It can read MediaWiki XML export dumps (version 0.3, minus uploads)," are there plans to support the 0.4 version? I didn't have success with it as it is, perhaps operator error, but I think not. Thanks

Encoding
Here: http://www.mediawiki.org/wiki/Manual:MWDumper#A_note_on_character_encoding This is mentioned: '' Make sure the database is expecting utf8-encoded text. If the database is expecting latin1 (which MySQL does by default), you'll get invalid characters in your tables if you use the output of mwdumper directly. One way to do this is to pass --default-character-set=utf8 to mysql in the above sample command.

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get error messages like Duplicate entry in UNIQUE Key 'name_title' because MySQL fails to distinguish certain characters.''

How is it possible to use --default-character-set=utf8 and make sure the character set=binary at the same time?

If the character set is utf8, it is not binary... Can somebody explain how to force CHARACTER SET=binary while using --default-character-set=utf8? Is this possible?
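As far as I understand it, the two settings apply to different layers, so they do not conflict: --default-character-set=utf8 describes the client connection (how the bytes piped into mysql are interpreted), while CHARACTER SET=binary is a property of the table and column definitions (MediaWiki stores titles and text as raw bytes, so MySQL never applies charset-aware comparisons that could collapse distinct characters into one unique key). A cut-down sketch, not the real MediaWiki schema:

 -- hypothetical table: the binary default charset makes page_title a raw byte column
 CREATE TABLE page_demo (
   page_id int unsigned NOT NULL auto_increment PRIMARY KEY,
   page_title varchar(255) NOT NULL,
   UNIQUE KEY name_title (page_title)
 ) ENGINE=InnoDB DEFAULT CHARSET=binary;

 -- the client connection still declares utf8 when the dump is loaded:
 -- mysql --default-character-set=utf8 -u wikiuser -p wikidb < dump.sql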