Manual talk:MWDumper

Process Error
When I tried to process an XML file exported by the Wikipedia export page, i.e. import it into a MySQL database, I got this error: "XML document structures must start and end within the same entity". Has anyone come across this before? How did you eventually solve the problem? Or is it a bug at the moment? Thank you all in advance. I look forward to someone discussing it.

Table Creation
I am interested in using Wikipedia for research and do not need the web front end. I cannot use the browser-based setup used by MediaWiki. Is there either a list of the CREATE TABLE statements needed to make this database, or a non-browser version of the MediaWiki setup?

GFDL
From http://mail.wikipedia.org/pipermail/wikitech-l/2006-February/033975.html:


 * I hereby declare it GFDL and RTFM-compatible. :) -- brion vibber


 * So this article, which started as the README file from MWDumper, is allowed on the wiki. This might be good, as I tend to read wikis more than I read READMEs! --Kernigh 04:53, 12 February 2006 (UTC)

Example (in)correct?
Is the parameter -d correctly described in this example?
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -p
 * Rebuild and execute. Disclaimer: Use at your own risk.

Importing to a database with table prefix for wiki
I want to import an "XML export" of MediaWiki into a local wiki, but MWDumper tries to import into non-prefixed tables, whereas my tables have a prefix. How can I solve this problem? Is there a way to import XML into prefixed tables (like fa_page, fa_text, fa_revision) with this software? It would be a real shortcoming if this feature were missing.--Soroush 16:40, 5 September 2007 (UTC)

Yes. Open a text editor and paste:

#!/usr/bin/perl

while(<>) { s/INTO /INTO yourprefixhere_/g; print; }

Save it as prefixer.pl. Run MWDumper with the --output=file:temp.sql option (instead of --output=mysql:..). Execute perl prefixer.pl < temp.sql > fill.sql. Run mysql -u wikiuser -p yourpasswordhere. Type use wikidb, then source fill.sql. --Derbeth talk 21:11, 31 October 2007 (UTC)
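One caveat with the Perl one-liner above is that it rewrites every "INTO " it sees, including any that occur inside article text. A minimal Python sketch of a narrower variant follows. It assumes that each statement in the SQL file starts on its own line with "INSERT INTO", which is not confirmed anywhere on this page; the names add_prefix, prefix_file and the default prefix "fa_" are made up for illustration.

```python
import re

# Hypothetical prefix for illustration; use the one your wiki tables actually have.
PREFIX = "fa_"

# Only touch "INSERT INTO <table>" at the start of a line, so that the word
# "INTO " inside article text is left alone. This assumes every statement in
# the dump starts on its own line.
_INSERT_RE = re.compile(r"^(INSERT INTO `?)(\w+)", re.MULTILINE)


def add_prefix(sql, prefix=PREFIX):
    """Return sql with prefix prepended to each INSERT INTO table name."""
    return _INSERT_RE.sub(lambda m: m.group(1) + prefix + m.group(2), sql)


def prefix_file(src_path, dst_path, prefix=PREFIX):
    """Rewrite src_path into dst_path one line at a time."""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(add_prefix(line, prefix))
```

With the file names from the reply above, prefix_file("temp.sql", "fill.sql", "yourprefixhere_") would take the place of the perl step; the rest of the procedure stays the same.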

Source Code
Is the source code to mwdumper available?
 * http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/ --78.106.145.69 22:17, 23 October 2007 (UTC)
 * http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/ --82.255.239.71 09:52, 16 March 2008 (UTC)

Overwrite
How to overwrite articles which already exist? Werran 21:08, 10 April 2008 (UTC)

Size restrictions
Maybe you could add an option to limit how many pages the resulting dump contains, or how big it can get?

More recent compiled version
The latest compiled version of MWDumper in http://download.wikimedia.org/tools/ dates from 2006-Feb-01, while http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/README shows changes to the code up to 2007-07-06. The *.jar version from 2006 doesn't work on recent Commons dumps, and I don't know how to compile the program under Windows. Could you please make a more recent compiled version available? -- JovanCormac 06:28, 3 September 2009 (UTC)

This one? http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip


 * Can someone compile the latest revision for Windows? The version above replaces the contributor credentials with 127.0.0.1 ;<

Filtering does not seem to work
I want to import only a few pages from the whole English Wikipedia into my database. I assume I should list the titles of the desired pages in a file (one per line) and use the "--filter=list:fileName" option. But when I tried this option, the filtering seemed to have no effect: the script starts importing pages, reporting 4 pages, 1000 versions, 4 pages, 2000 versions and so on, and it imports pages that are not listed in the filter file.

This is the command that I use:

java -jar mwdumper.jar --filter=exactlist:titles --filter=latest --filter=notalk --output=file:out.txt --format=xml datasets/enwiki-latest-pages-meta-history.xml

0.4 compatible version?
Given that dumps are now in version 0.4 format ("http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4") and MWDumper's page says "It can read MediaWiki XML export dumps (version 0.3, minus uploads)," are there plans to support the 0.4 version? I didn't have success with it as it is, perhaps operator error, but I think not. Thanks

Encoding
Here: http://www.mediawiki.org/wiki/Manual:MWDumper#A_note_on_character_encoding This is mentioned: '' Make sure the database is expecting utf8-encoded text. If the database is expecting latin1 (which MySQL does by default), you'll get invalid characters in your tables if you use the output of mwdumper directly. One way to do this is to pass --default-character-set=utf8 to mysql in the above sample command.

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get error messages like Duplicate entry in UNIQUE Key 'name_title' because MySQL fails to distinguish certain characters.''

How is it possible to use --default-character-set=utf8 and make sure the character set=binary at the same time?

If the character set is utf8, then it is not binary... Can somebody explain how to force CHARACTER SET=binary while using --default-character-set=utf8? Is this possible?
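As far as I understand it, the two settings act on different layers and so do not conflict: --default-character-set=utf8 describes the bytes travelling over the client connection, while CHARACTER SET=binary is a property of the tables and controls how stored values are compared. The Python sketch below only illustrates the two failure modes quoted above and does not touch MySQL. The title "Fryslân" is an arbitrary example, and fold_accents is a made-up, crude stand-in for an accent-insensitive collation, not MySQL's actual algorithm.

```python
import unicodedata

title = "Fryslân"  # arbitrary example title

# 1) Connection charset: UTF-8 bytes that are read as latin1 become mojibake.
utf8_bytes = title.encode("utf-8")
misread = utf8_bytes.decode("latin-1")  # "FryslÃ¢n"


def fold_accents(s):
    """Crude stand-in for an accent-insensitive collation (assumption)."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# 2) Column charset: under an accent-insensitive comparison two different
# titles look equal, which is how a UNIQUE key can report a duplicate entry.
collide = fold_accents("Fryslân") == fold_accents("Fryslan")  # True

# Compared as raw bytes, which is what a binary column does, they stay distinct.
same_bytes = "Fryslân".encode("utf-8") == "Fryslan".encode("utf-8")  # False
```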

Steps taken to restore Frysk Wikipedia

 * Make sure you select the option 'Experimental MySQL 4.1/5.0 binary' when selecting the type of database (in Mediawiki 1.11.2 on Ubuntu 8.04).
 * This is the batch-file (thanks for the tip in bug https://bugzilla.wikimedia.org/show_bug.cgi?id=14379):
 * especially the &characterEncoding=UTF-8 helps a lot
 * the mwdumper program was updated to allow for continuing when a batch of 100 records fails because of duplicate keys. (Yes, they still happen.) Please contact gerke dot ephorus dot groups at gmail dot com to request an updated version. (Sorry, no github version available yet, maybe after my holiday ;-) )

SQL Output Going to the Wrong Place
I am trying to simply take an XML dump and convert it to SQL code, which I will then run on a MySQL server. The code I've been using to do so is below:

java -jar mwdumper.jar --format=sql:1.5 --output=file:stubmetahistory.sql --quiet enwiki-20100312-stub-meta-history.xml > out.txt

What I've found is happening is that the file that I would like to be a SQL file (stubmetahistory.sql) is an exact XML copy of the original file (enwiki-20100312-stub-meta-history.xml). However, what is appearing on the screen and being piped to the out.txt file is the SQL file I am looking for. Any thoughts on what I am doing wrong, or what I am missing here to get this correct? The problem of course with just using the out.txt to load into my MySQL server is that there could be problems with the character encoding.

Thank you, CMU Researcher 20:37, 19 May 2010 (UTC)

Alternatives
For anyone unfamiliar with Java (such as myself), is there any other program we can use? 70.101.99.64 21:09, 21 July 2010 (UTC)


 * There are a bunch listed here.. http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps

Java GC
There seems to be a problem with the garbage collection in mwdumper. On trying to import the Wikipedia 20100130 English dump containing 19,376,810 pages and 313,797,035 revisions, it aborts with the error after 4,216,269 pages and 196,889,000 revs:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.Arrays.copyOfRange(Arrays.java:3209)
 at java.lang.String.<init>(String.java:215)
 at org.mediawiki.importer.XmlDumpReader.bufferContents(Unknown Source)
 at org.mediawiki.importer.XmlDumpReader.bufferContentsOrNull(Unknown Source)
 at org.mediawiki.importer.XmlDumpReader.readText(Unknown Source)
 at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
 at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
 at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
 at org.mediawiki.dumper.Dumper.main(Unknown Source)

The usual suggestion (http://forums.sun.com/thread.jspa?threadID=5114529; http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom) of turning off this feature is NOT feasible, as the error is thrown when 98% of the program's time is spent in GC and under 2% of the heap is recovered. I would appreciate any help or comments.

Some section on building from SVN checkout?
How do you build the JAR from an SVN checkout? I think we should include it on this page. I'm a Java JAR newbie and I couldn't get it to work.


 * In the root folder (folder with build.xml), type "ant". It should put the new jar file in mwdumper/build/mwdumper.jar.

Editing Code to add tab delimited output
I've had success using MWDumper to dump Wikipedia data into MySQL, but I'd like to do some analysis using Hadoop (Hive or Pig). I will need the Wikipedia data (revision, page, and text tables) in tab-delimited form, or really with any other delimiter, to dump it into a cluster. How difficult would it be to make those modifications? Could you point out where in the code I should be looking? It would also be nice to be able to filter by table (e.g., have a separate txt output for each table).
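Not an answer about where to change the Java code, but one possible workaround is to skip MWDumper for this step and stream the XML dump straight into tab-delimited text. The Python sketch below is only a starting point under stated assumptions: it writes just the page id and title, the function names pages_to_tsv and local_name are made up, and nothing here reflects how MWDumper itself is structured.

```python
import csv
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_to_tsv(xml_source, out):
    """Write one 'page id <TAB> title' row per <page> element to out."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children are inspected, so the <id> nested inside
        # <revision> is not confused with the page id.
        fields = {local_name(child.tag): child.text or "" for child in elem}
        writer.writerow([fields.get("id", ""), fields.get("title", "")])
        elem.clear()  # drop the finished page's contents before moving on
```

xml_source can be a file name or an open binary file, and out is any text file opened for writing. Revision and text rows could be written to separate files in the same loop, which would also give the per-table split asked about above.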

Needed to DELETE before running MWDumper
I installed and configured a fresh MediaWiki 1.16.2 install, and found that before I could run MWDumper successfully I had to delete all rows from the page, text, and revision tables (USE wikidb; DELETE FROM page; DELETE FROM text; DELETE FROM revision;). If I didn't do this first I received the error: "ERROR 1062 (23000) at line xxx: Duplicate entry '1' for key 'PRIMARY'". Dcoetzee 10:32, 21 March 2011 (UTC)