Manual talk:MWDumper

Process Error[edit]

When I tried to process an XML file exported by the Wikipedia export page (i.e. import it into a MySQL database), I got the error "XML document structures must start and end within the same entity". Has anyone come across this before? How did you solve the problem eventually? Or is it a bug at the moment? Thank you all in advance. I look forward to someone discussing it.

I got this error when importing a *.xml.bz2 file, but after uncompressing it to *.xml, the error was gone.

Table Creation[edit]

I am interested in using Wikipedia for research and do not need the web front end. I cannot use the browser-based setup used by MediaWiki. Is there either a list of CREATE TABLE statements necessary to make this database, or a non-browser version of the MediaWiki setup?

  • Just install MediaWiki and run SHOW CREATE TABLE `table_name` in your SQL client.
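For example (a sketch, not part of the original answer; the user, database and table names are MediaWiki defaults and may differ on your install):

 mysql -u root -p wikidb -e "SHOW CREATE TABLE page; SHOW CREATE TABLE revision; SHOW CREATE TABLE text;"
 # Alternatively, the full schema ships with MediaWiki as maintenance/tables.sql.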

GFDL[edit]

From http://mail.wikipedia.org/pipermail/wikitech-l/2006-February/033975.html:

I hereby declare it GFDL and RTFM-compatible. :) -- brion vibber
So this article, which started as the README file from MWDumper, is allowed on the wiki. This might be good, as I tend to read wikis more than I read READMEs! --Kernigh 04:53, 12 February 2006 (UTC)Reply

Example (in)correct?[edit]

Is the parameter -d correctly described in this example?

 java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -p <databasename>
My mysql tells me
 -p, --password[=name] Password to use when connecting to server ...
 -D, --database=name Database to use.
 (mysql  Ver 14.7 Distrib 4.1.15, for pc-linux-gnu (i486) using readline 5.1)
Would this be better?
 java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -D <databasename> -p<password>
 (If the password is given on the command line, there must be no space between -p and the actual password.)
Or, if the password is omitted, it is requested interactively:
 java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -D <databasename> -p

MWDumper error[edit]

Running WinXP, XAMPP, JRE 1.5.0_08, MySQL JDBC 3.1.13

http://f.foto.radikal.ru/0610/4d1d041f3fd7.png --89.178.61.174 22:09, 9 October 2006 (UTC)Reply

MWDumper Issues[edit]

Using MWDumper, how would I convert a Wikipedia/Wikibooks XML dump to an SQL file?

ANSWER[edit]

java -jar mwdumper.jar --format=sql:1.5 x.xml > y.sql where x.xml is the name of your input file and y.sql is the name of your output file.
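To load the generated file into MySQL afterwards, something like this should work (a sketch; the user and database names are placeholders, and see the character-encoding notes further down this page):

 mysql -u wikiuser -p --default-character-set=utf8 wikidb < y.sql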

Problems with MWDumper[edit]

When I run: java -jar mwdumper.jar -–format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb

I get:

Exception in thread "main" java.io.FileNotFoundException: -ûformat=sql:1.5 (The system cannot find the file specified)

at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at org.mediawiki.dumper.Tools.openInputFile(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)

Please help!


SOLUTION:[edit]

For the above problem here is the fix:

java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb

Notice "-ûformat=sql:1.5" in the error message? One of the dashes in "--format" is the wrong character (an en-dash introduced by copy & paste). Just retype the two hyphens by hand so the option reads --format=sql:1.5.


P.S. For a really fast import (60 min vs. 24 hrs), unbzip the enwiki-latest-pages-articles.xml.bz2 file so that it becomes enwiki-latest-pages-articles.xml, then use the command: java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml | c:\wamp\mysql\bin\mysql -u wikiuser -p wikidb

Page Limitations?[edit]

I'm attempting to import a Wikipedia database dump comprised of about 4,800,000 pages on a Windows XP system. I'm using the following command: java -jar mwdumper.jar --format=sql:1.5 enwiki-20070402-pages-articles.xml | mysql -u root -p wikidb

Everything appears to go smoothly; the progress indicator goes up to the expected 4 million and something, but only 432,000 pages are actually imported into the MySQL database. Why is this? Any assistance is greatly appreciated. Uiop 02:31, 15 April 2007 (UTC)Reply

MySQL experienced some error, and the error message scrolled off your screen. To aid in debugging, either save the output from mysql's stderr stream, or run mwdumper to a file first, etc. --brion 21:15, 20 April 2007 (UTC)Reply
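One way to follow that advice (a sketch; file names and credentials are placeholders):

 # Keep mysql's error output in a log file so it doesn't scroll away:
 java -jar mwdumper.jar --format=sql:1.5 enwiki-20070402-pages-articles.xml | mysql -u root -p wikidb 2> mysql-errors.log
 # Or write the SQL to a file first and load it in a separate step:
 java -jar mwdumper.jar --format=sql:1.5 enwiki-20070402-pages-articles.xml > pages.sql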

PROBLEM SOLVED[edit]

Mate, I had the same problem with it stopping at 432,000 pages. I'm assuming you're using WAMP here.

The problem is with the log files. If you go to C:\wamp\mysql\data (or whatever your equivalent directory is) you'll see two files, ib_logfile0 and ib_logfile1. You'll notice they are both 10 MB. They need to be much bigger. This is how you fix it.

To start off, you'll need to delete the dump you've been doing so far. Left click on the WAMP icon in the taskbar, choose MySQL, then MySQL Console. It will ask you for a password, which is blank by default so just press enter. Now type the following commands:

use wikidb; delete from page; delete from revision; delete from text; quit

OK. Now, left click on the WAMP icon in the taskbar and choose Config Files and then 'my.ini'. Find the line innodb_log_file_size and set it to 512M (it was 10M in mine). Scroll to the bottom and add the following line:

set-variable=max_allowed_packet=32M

Left click on the WAMP icon in the taskbar and select MySQL->Stop Service. Open C:\wamp\mysql\data (or whatever your equivalent directory is) and delete ib_logfile0 and ib_logfile1. Left-click on the WAMP icon in the taskbar again, and select MySQL->Start / Resume Service.

Now go ahead and run mwdumper.

Happy dumping!

Same problem as above, how to disable innodb_log_file or make it greater than 2048M?[edit]

I am having the same problem as above: with innodb_log_file_size set to 500MB, about 400k pages are created. With innodb_log_file_size set to 2000MB, I get 1.1 million pages created. I would like to import enwiki's 5 million pages, so I need a much larger innodb_log_file_size. However, MySQL crashes on startup if I set this to a value larger than 2047MB. According to http://dev.mysql.com/doc/refman/5.0/en/innodb-configuration.html, the size of both log files is capped at 4GB. Does anyone know why this log file is written to so much by MWDumper, and how we can reduce the output to that file?

Can dump be resumed?[edit]

I am using mwdumper to import enwiki-20070402-pages-articles.xml. I got up to 1,144,000 pages and, instead of showing how many pages per second it was importing, it said (-/sec). 1,115,000 said the same thing.

After that, the import sped up dramatically: It says it's processing around (20/sec) but in fact it seems to be doing about 2000/sec, because those figures are flying past!

I'm afraid that after 30 hours of importing, I may have lost it. I don't want to start again, especially if that means the possibility of the same happening again. Debugging by trial and error could take the rest of my life!

Is there any way that I can resume the dump from 144,000 pages if need be?

(Hopefully this dump DOES work anyway. Maybe I need to increase innodb_log_file_size to 1024G or perhaps 2048G.)

ANSWERS: 1) I suggest

a) java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml > a.sql

b) remove the statements from a.sql for whatever was already inserted into the tables (a rough sketch of one way to do this follows after step c).

c) mysql -u <username> -p <dbname> < a.sql
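A rough sketch of step b), assuming you can tell from mwdumper's progress output roughly where the previous run stopped (the line numbers 12345/12346 are placeholders):

 grep -n "INSERT INTO page" a.sql | less      # find the line of the last batch that was already applied (say 12345)
 tail -n +12346 a.sql > remaining.sql         # keep only what comes after it
 mysql -u <username> -p <dbname> < remaining.sql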

2) Previous fix:

Download the source. Edit src/org/mediawiki/dumper/ProgressFilter.java and change the following functions, replacing [REVISION_NUMBER] with the appropriate number.
    public void writeStartPage(Page page) throws IOException {
        pages++;
        if (revisions > [REVISION_NUMBER])
            super.writeStartPage(page);
    }
    public void writeRevision(Revision rev) throws IOException {
        revisions++;
        if (revisions > [REVISION_NUMBER])
            super.writeRevision(rev);
        if (revisions % interval == 0)
            showProgress();
    }
Rebuild and execute. Disclaimer: Use at your own risk.

Importing to a database with table prefix for wiki[edit]

I want to import a MediaWiki "XML export" into a local wiki, but it tries to import into non-prefixed tables, while my tables have a prefix. How can I solve this problem? Is there a way to import the XML into prefixed tables (like fa_page, fa_text, fa_revisions) with this software? It's so bad if it doesn't have this feature.--Soroush 16:40, 5 September 2007 (UTC)Reply

Yes. Open a text editor and paste:

#!/usr/bin/perl

while(<>) {
	s/INTO /INTO yourprefixhere_/g;
	print;
}

Save it as prefixer.pl. Run MWDumper with the --output=file:temp.sql option (instead of --output=mysql:...). Execute perl prefixer.pl < temp.sql > fill.sql. Run mysql -u wikiuser -p yourpasswordhere, type use wikidb, then source fill.sql. --Derbeth talk 21:11, 31 October 2007 (UTC)Reply
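Putting those steps together (a sketch; the dump file name and credentials are placeholders, and the --output option is given before --format, as in the working examples elsewhere on this page):

 java -jar mwdumper.jar --output=file:temp.sql --format=sql:1.5 fawiki-latest-pages-articles.xml.bz2
 perl prefixer.pl < temp.sql > fill.sql
 mysql -u wikiuser -p wikidb < fill.sql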

Source Code[edit]

Is the source code to mwdumper available?

http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/ --78.106.145.69 22:17, 23 October 2007 (UTC)Reply
http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/ --82.255.239.71 09:52, 16 March 2008 (UTC)Reply
https://git.wikimedia.org/git/mediawiki/tools/mwdumper.git

Overwrite[edit]

How to overwrite articles which already exist? Werran 21:08, 10 April 2008 (UTC)Reply

Size restrictions[edit]

Maybe you could add a feature to limit how many pages are processed, or how big the resulting dump can be?

More recent compiled version[edit]

The latest compiled version of MWDumper in download:tools/ dates from 2006-Feb-01, while http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper/README shows changes to the code up to 2007-07-06. The *.jar version from 2006 doesn't work on recent Commons dumps, and I don't know how to compile the program under Windows. Could you please make a more recent compiled version available? -- JovanCormac 06:28, 3 September 2009 (UTC)Reply

This one? http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip

Can someone compile the latest rev for Windows? The version above replaces the contributor credentials with 127.0.0.1 ;<

Filtering does not seem to work[edit]

I want to import only a few pages from the whole English Wikipedia into my database. I suppose I should list the titles of the desired pages in a file (one per line) and use the "--filter=list:fileName" option. But when I tried this option, the filtering does not seem to have any effect: the script starts to import pages, reporting 4 pages, 1000 versions, 4 pages, 2000 versions and so on, and imports other pages that are not listed in the filter.

This is the command that I use:

java -jar mwdumper.jar --filter=exactlist:titles --filter=latest --filter=notalk --output=file:out.txt --format=xml datasets/enwiki-latest-pages-meta-history.xml


0.4 compatible version?[edit]

Given that dumps are now in version 0.4 format ("http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4") and MWDumper's page says "It can read MediaWiki XML export dumps (version 0.3, minus uploads)," are there plans to support the 0.4 version? I didn't have success with it as it is, perhaps operator error, but I think not. Thanks

Encoding[edit]

Here, at Manual:MWDumper#A note on character encoding, this is mentioned: "Make sure the database is expecting utf8-encoded text. If the database is expecting latin1 (which MySQL does by default), you'll get invalid characters in your tables if you use the output of mwdumper directly. One way to do this is to pass --default-character-set=utf8 to mysql in the above sample command."

Also make sure that your MediaWiki tables use CHARACTER SET=binary. Otherwise, you may get error messages like Duplicate entry in UNIQUE Key 'name_title' because MySQL fails to distinguish certain characters.
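For reference, the sample command with the suggested client option added would look something like this (a sketch; file and database names are placeholders):

 java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u <username> -p --default-character-set=utf8 <databasename>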

How is it possible to use --default-character-set=utf8 and make sure the character set=binary at the same time?

If the character set is utf8, it is not binary... Can somebody explain how to force CHARACTER SET=binary while using --default-character-set=utf8? Is this possible?

Steps taken to restore Frysk Wikipedia[edit]

  • Make sure you select the option 'Experimental MySQL 4.1/5.0 binary' when selecting the type of database (in Mediawiki 1.11.2 on Ubuntu 8.04).
  • This is the batch-file (thanks for the tip in bug bugzilla:14379):
#!/bin/bash

#
# Example call:
# ~/bin/load.sh fy ~/wikidownloads/fywiki-20100310-pages-articles.xml.bz2
#

if [ -z "$1" ]; then
	echo Please provide a language
	exit 1
fi

if [ -z "$2" ]; then
	echo Please provide an import file
	exit 1
fi

LANG=$1
IMPORT_FILE=$2
DBUSER=wikiuser
DBPASSWD=wikiuser

#MYSQL_JDBC=~/bin/mysql-connector-java-3.1.14-bin.jar
MYSQL_JDBC=~/bin/mysql-connector-java-5.0.8-bin.jar

nice -n +5 java -server -classpath $MYSQL_JDBC:~/bin/mwdumper-1.16.7.jar \
   org.mediawiki.dumper.Dumper \
   --progress=500 \
   --output=mysql://127.0.0.1/wikidb_$LANG?user=$DBUSER\&password=$DBPASSWD\&characterEncoding=UTF-8 \
   --format=mysql:1.5 \
   $IMPORT_FILE 2>&1|tee -a ~/import-log-$LANG-`date +%F-%H-%M-%S`.log
  • Especially the &characterEncoding=UTF-8 part helps a lot.
  • The mwdumper program was updated to allow continuing when a batch of 100 records fails because of duplicate keys (yes, they still happen). Please contact gerke dot ephorus dot groups at gmail dot com to request an updated version. (Sorry, no GitHub version available yet, maybe after my holiday ;-) )

SQL Output Going to the Wrong Place[edit]

I am trying to simply take an XML dump and convert it to SQL code, which I will then run on a MySQL server. The code I've been using to do so is below:

java -jar mwdumper.jar --format=sql:1.5 --output=file:stubmetahistory.sql --quiet enwiki-20100312-stub-meta-history.xml > out.txt

What I've found is that the file I would like to be an SQL file (stubmetahistory.sql) ends up as an exact XML copy of the original file (enwiki-20100312-stub-meta-history.xml), while what appears on the screen and gets piped to out.txt is the SQL I am looking for. Any thoughts on what I am doing wrong, or what I am missing here to get this right? The problem, of course, with just using out.txt to load into my MySQL server is that there could be problems with the character encoding.

Thank you, CMU Researcher 20:37, 19 May 2010 (UTC)Reply

Alternatives[edit]

For anyone unfamiliar with Java (such as myself), is there any other program we can use? 70.101.99.64 21:09, 21 July 2010 (UTC)Reply

There are a bunch listed here: Manual:Importing XML dumps

Java GC[edit]

There seems to be a problem with the garbage collection in mwdumper. When trying to import the 20100130 English Wikipedia dump, containing 19,376,810 pages and 313,797,035 revisions, it aborts with the following error after 4,216,269 pages and 196,889,000 revs:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at org.mediawiki.importer.XmlDumpReader.bufferContents(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.bufferContentsOrNull(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.readText(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)

The recommended suggestion (http://forums.sun.com/thread.jspa?threadID=5114529; http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom) of turning off this feature is NOT feasible, as the error is thrown when 98% of the program's time is spent in GC and under 2% of the heap is recovered. I would appreciate any help or comments.

Some section on building from SVN checkout?[edit]

How do you build the JAR from an SVN checkout? I think we should include it on this page. I'm a Java JAR newbie and I couldn't get it to work.

In the root folder (folder with build.xml), type "ant". It should put the new jar file in mwdumper/build/mwdumper.jar.
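A sketch of those steps end to end, using the SVN URL from the "Source Code" section above (requires Apache Ant and a JDK):

 svn checkout http://svn.wikimedia.org/viewvc/mediawiki/trunk/mwdumper/ mwdumper
 cd mwdumper
 ant
 ls build/mwdumper.jar   # the freshly built jar, as described above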

Editing Code to add tab delimited output[edit]

I've had success using mwdumper to load Wikipedia data into MySQL, but I'd like to do some analysis using Hadoop (Hive or Pig). I will need the Wikipedia data (revision, page, and text tables) in tab-delimited form (or really any other delimiter) to load it into a cluster. How difficult would it be to make those modifications? Could you point out where in the code I should be looking? It would also be nice to be able to filter by table (e.g., have a separate text output file for each table).

Needed to DELETE before running MWDumper[edit]

I installed and configured a fresh MediaWiki 1.16.2 install, and found that before I could run MWDumper successfully I had to delete all rows from the page, text, and revision tables (USE wikidb; DELETE FROM page; DELETE FROM text; DELETE FROM revision;). If I didn't do this first, I received the error: "ERROR 1062 (23000) at line xxx: Duplicate entry '1' for key 'PRIMARY'". Dcoetzee 10:32, 21 March 2011 (UTC)Reply

FIX: Editing Code to add tab delimited output[edit]

I've updated mwdumper with a new dumper class that can be used to export a flat file (tab-delimited). The updated code is at https://github.com/bcollier/mwdumper.

error: 11 million pages in enwiki[edit]

I have installed mwdumper and just ran it against an uncompressed version of enwiki-20110901-pages-articles.xml.bz2.

I was expecting ~4 million articles, but it processed ~11 million before stopping; however, at that point it was sending all the SQL output to null (oops). Now I have fixed the SQL problem and am using phpMyAdmin to watch mwdumper trundle along writing rows.

mwdumper is on 1,567,000 rows and phpMyAdmin is seeing this:

Table      Rows          Type    Collation  Size
page       ~1,564,895    InnoDB  binary     296.6 MiB
revision   ~1,441,088    InnoDB  binary     590.1 MiB
text       ~10,991,108   InnoDB  binary     10.9 GiB

Should it complete when mwdumper gets to 4 million or when it gets to 11 million?

I made an import of enwiki in May 2015. The result was about 17 million imported pages.

Error: Duplicate entry '1' for key 'PRIMARY'[edit]

If you get ERROR 1062 (23000) at line 35: Duplicate entry '1' for key 'PRIMARY' when restoring a database dump into a new MediaWiki install, it is because mwdumper expects the database to be empty. A default MediaWiki install contains sample pages. The error can be solved by clearing the database. I did it with this:

echo "TRUNCATE TABLE page; TRUNCATE TABLE revision; TRUNCATE TABLE text;" |  mysql -u $mediawikiuser -p $mediawikidatabase

--91.153.53.216 01:07, 22 October 2011 (UTC)Reply

at line 1: Duplicate entry '0-' for key 'name_title' Bye[edit]

Hi, I wanted to import a backup into a MySQL db and got stuck with this error, but SQLDumpSplitter helped me find the exact line. I don't know why, but I get this error when importing these two rows into the MySQL server. If anybody knows the reason, I will be happy to know.--Pouyana (talk) 22:00, 21 August 2012 (UTC)Reply

(506331,0,'𝓩','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,3934368,90),
(506316,0,'𝕽','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,3934351,90),
I have this same issue when using the direct connection but not when piping through MySQL. It most likely has something to do with character encoding, but I'm not sure how to fix it. Dcoetzee (talk) 01:17, 2 April 2013 (UTC)Reply

Under construction[edit]

It's been "very much under construction" since 11 February 2006. Is that still the case, or should it be considered stable yet? Leucosticte (talk) 21:16, 16 September 2012 (UTC)Reply

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048[edit]

Hello

I have the following error:

   Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
      at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
      at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

When I try to import enwiki-20130708-pages-articles, it crashes around page 5,800,000.

I have tested the build from June 26 without any luck.

How can I fix this?

Thanks

Also having the same problem. any help would be deeply appreciated. -- NutzTheRookie (talk) 11:31, 5 September 2013 (UTC)Reply

I also find that:

   java -server -jar mwdumper-1.16.jar --format=sql:1.5 enwiki-20131202-pages-articles.xml.bz2

crashes soon after 4,510,000 pages with an identical stack trace.

Same problem[edit]

4,510,000 pages (5,294.888/sec), 4,510,000 revs (5,294.888/sec) Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048

       at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
       at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
       at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
       at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
       at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
       at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
       at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
       at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

This is a Xerces bug, documented at https://issues.apache.org/jira/browse/XERCESJ-1257

The workaround suggested is to use the JVM's UTF-8 reader instead of the Xerces UTF8Reader. I tried this suggested workaround, and it seemed to fix it for me. I made this change:

	public void readDump() throws IOException {
		try {
			SAXParserFactory factory = SAXParserFactory.newInstance();
			SAXParser parser = factory.newSAXParser();
			Reader reader = new InputStreamReader(input,"UTF-8");
			InputSource is = new InputSource(reader);
			is.setEncoding("UTF-8");
			parser.parse(is, this);
		} catch (ParserConfigurationException e) {
			throw (IOException)new IOException(e.getMessage()).initCause(e);
		} catch (SAXException e) {
			throw (IOException)new IOException(e.getMessage()).initCause(e);
		}
		writer.close();
	}

This goes in src/org/mediawiki/importer/XmlDumpReader.java. Don't forget to add

import java.io.*;

and add

import org.xml.sax.InputSource;

to the top of your file, to settle the imports.

Importing multiple XML files iteratively[edit]

Is there any setting in MWDumper that will allow the import of multiple XML exports iteratively? For example, if I have 100 pages with full history in separate XML files, is there any way to command MWDumper to import all files (full path data) from an external text file (a la wget -i)? Thanks. Wikipositivist (talk) 22:26, 19 November 2013 (UTC)Reply

Any succes importing English wiki XML dump with mwdumper?[edit]

Was anyone able to import a recent English Wikipedia XML dump with mwdumper? I tried multiple dumps from the past few months and I'm getting various errors like https://bugzilla.wikimedia.org/show_bug.cgi?id=57236 or https://bugzilla.wikimedia.org/show_bug.cgi?id=24909. Could someone share the last dump that can be imported with mwdumper? Jogers (talk) 21:46, 18 April 2014 (UTC)Reply

How to use MwDumper on my shared host to import Dumps into my wiki?[edit]

I'm basically a noob. Can anyone tell me, step by step, how I can import XML dumps into my wiki, which is hosted? I'm not understanding anything from the manual. Thanks. — Preceding unsigned comment added by 80.114.132.32 (talkcontribs)

You need to run it on your computer, and it will output a SQL file (you should redirect the output to a file). Then load the SQL file on the database server. Your shared host probably provides some sort of cpanel where you can access PhpMyAdmin to write SQL queries. From PhpMyAdmin there's an option to load an SQL file. --Ciencia Al Poder (talk) 18:57, 26 September 2015 (UTC)Reply
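A sketch of that workflow (file names are placeholders):

 # On your own computer: convert the XML dump to an SQL file.
 java -jar mwdumper.jar --output=file:dump.sql --format=sql:1.5 pages-articles.xml.bz2
 # Then upload dump.sql through phpMyAdmin's option to load an SQL file on the shared host.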

Unable to import xml dump duplicate entry error[edit]

I am getting a duplicate entry error while importing the 2008 Wikipedia dump into MediaWiki 1.24.4. I am using one of the methods from the import XML wiki page, using a direct Java connection to MySQL and mwdumper. After importing 182,000 pages it fails, saying there is a duplicate entry.

182,000 pages (155.23/sec), 182,000 revs (155.23/sec)
Exception in thread "main" java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '0-?' for key 'name_title'
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Caused by: org.xml.sax.SAXException: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '0-?' for key 'name_title'
        at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:229)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        ... 1 more
 

At the time my charset was utf8. A troubleshooting page I read had someone with the same issue saying it might be caused by not setting the charset to binary, so I changed it to binary and still got the same error. I am not sure how to proceed to fix this issue. Does anyone know why I am getting this error? --Ksures (talk) 20:38, 30 October 2015 (UTC)Reply

SOLUTION:[edit]

When you first create the DB to store the dump, even before running mwdumper, you should specify the charset for that DB: CREATE DATABASE wikidb DEFAULT CHARACTER SET utf8;

java.lang.IllegalArgumentException: Invalid contributor[edit]

Doing this on a Debian Linux machine, I get this exception:

2,439 pages (0.595/sec), 55,000 revs (13.42/sec)
2,518 pages (0.614/sec), 56,000 revs (13.656/sec)
2,630 pages (0.641/sec), 57,000 revs (13.891/sec)
2,865 pages (0.698/sec), 58,000 revs (14.13/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
        at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
root@odroid-server:/var/storage/temp#

Any ideas how to fix this?

Update from December 2015[edit]

Where can I download the latest version (jar file) from December 2015? https://dumps.wikimedia.org/tools/ is outdated. Thx — Preceding unsigned comment added by 92.75.174.219 (talkcontribs)

There are no new dumps available AFAIK. But compilation should be straightforward. Install maven and then follow the steps in the "How to build MWDumper from source" section --Ciencia Al Poder (talk) 17:08, 10 February 2016 (UTC)Reply
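A rough sketch of those build steps, using the Git URL from the "Source Code" section above (assumes Maven and a JDK are installed; the jar should end up under target/):

 git clone https://git.wikimedia.org/git/mediawiki/tools/mwdumper.git
 cd mwdumper
 mvn package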

Cannot import recent dumps[edit]

Is mwdumper still considered a recommended way to import the Wikipedia database? I couldn't get it to work at all on English Wikipedia dumps since May 2015.

It always fails with either "MySQLIntegrityConstraintViolationException: Duplicate entry" or "SAXParseException: XML document structures must start and end within the same entity". I was able to get around the first exception by removing the uniqueness constraint on the page title field, but it later fails with the second one, or with "SAXParseException: The element type "parentid" must be terminated by the matching end-tag "</parentid>"". — Preceding unsigned comment added by 5.175.208.103 (talkcontribs)

The duplicate error may be because of edits happening while the dump is being generated, so the XML effectively has duplicate entries. That's probably a bug in the XML dump generation. About the <parentid> not matching </parentid>, it would be good to see the XML portion where this happens, to check whether this is indeed the problem, or whether the XML dump you downloaded is cut off somehow... If the XML is not well formed, that's a bug in the XML dump generation, something that should be reported on Phabricator. --Ciencia Al Poder (talk) 11:37, 27 March 2016 (UTC)Reply

Where is the my-huge.cnf sample config?[edit]

Is that available anywhere? I don't see it under /etc/mysql. Thanks. MW131tester (talk) 16:58, 9 February 2019 (UTC)Reply

It was removed because it's dated, according to https://mariadb.com/kb/en/library/configuring-mariadb-with-option-files/#example-option-files --Ciencia Al Poder (talk) 14:45, 10 February 2019 (UTC)Reply

The "huge wiki" instructions were added in July 2013 by an anonymous IP[edit]

Those instructions don't include any commands to, e.g., DROP PRIMARY KEY from the page table, and they say to readd it as CHANGE page_id page_id INTEGER UNSIGNED AUTO_INCREMENT rather than as int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT like what you see in tables.sql. (Also, the rev_page_id index is supposedly getting removed from tables.sql in some future MediaWiki version, although this hasn't happened yet.) MW131tester (talk) 17:53, 9 February 2019 (UTC)Reply

I've added a warning about this. Removing indices doesn't seem to be of much benefit when storage is InnoDB [1]. Maybe for a fast import it would be better to use MyISAM? --Ciencia Al Poder (talk) 15:11, 10 February 2019 (UTC)Reply
Hard to say; I removed the indexes, because I thought they were slowing down my script, and then discovered that my script was failing for other reasons. But I was using my own custom script rather than MWDumper.
If we wanted to make that script more robust, we could have it run "show index from" statements to see which indexes are there; and then it could execute the appropriate SQL statements depending on what MW version it realizes it's working with. MW131tester (talk) 14:25, 11 February 2019 (UTC)Reply

Which dump[edit]

Hi there, which dump needs to be imported from this page? Generally, how realistic is it that the wiki will then work, showing the rendered templates and categories? --Aschroet (talk) 16:34, 3 May 2019 (UTC)Reply

I guess the commonswiki-20190420-pages-meta-current.xml.bz2 (9.9 GB) which should contain pages from all namespaces, but only latest revision. --Ciencia Al Poder (talk) 09:16, 8 May 2019 (UTC)Reply

Slots and content tables making revision not available?[edit]

"each revision must have a "main" slot that corresponds to what is without MCR the singular page content." (Requests for comment/Multi-Content Revisions)

Since those tables are new and the tool is old, I guess this is why I got an error on every import I did using the tool. Uziel302 (talk) 06:43, 30 August 2019 (UTC)Reply

Your guess is correct. This tool is outdated. --Ciencia Al Poder (talk) 09:39, 30 August 2019 (UTC)Reply
Ciencia Al Poder, is there any other way to create many pages that is faster than importDump.php and pywikibot? I tried to play with the DB, but to no avail; I couldn't figure out all the dependencies needed for the revision to show. Uziel302 (talk) 15:28, 6 September 2019 (UTC)Reply
Yes, edit.php --Ciencia Al Poder (talk) 15:52, 7 September 2019 (UTC)Reply
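A minimal sketch of bulk page creation with that maintenance script (the loop, file layout, user name and edit summary are only illustrative; check php maintenance/edit.php --help for the exact options on your MediaWiki version):

 for f in pages/*.wikitext; do
   title=$(basename "$f" .wikitext)
   php maintenance/edit.php -u ImportBot -s "bulk import" "$title" < "$f"
 done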