Manual talk:MWDumper

Process Error
When I tried to process an xml file exported by the wikipedia export page, i.e. import it into mysql database, I got this error "XML document structures must start and end within the same entity". Has any one come across this before? How did you solve the problem eventually? Or is it a bug at the moment? Thank you all in advance. I look forward to someone discussing it.

Table Creation
I am interested in using wikipedia for research and do not need the web front end. I cannot use the browser based setup used by mediawiki. is there either a list of create table statements necessary to make this database, or a not browser version of the mediawiki setup?

GFDL
From http://mail.wikipedia.org/pipermail/wikitech-l/2006-February/033975.html:


 * I hereby declare it GFDL and RTFM-compatible. :) -- brion vibber


 * So this article, which started as a the README file from MWDumper, is allowed on the wiki. This might be good, as I tend to read wikis more than I read READMEs! --Kernigh 04:53, 12 February 2006 (UTC)

Example (in)correct?
Is the parameter -d correct described in this example?
 * java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u -p
 * especially the &characterEncoding=UTF-8 helps a lot
 * the mwdumper program was updated to allow for continuing when a batch of 100 records fails because of dumplicate keys. (Yes, they still happen) Please contact gerke dot ephorus dot groups at gmail dot com to request a updated version. (Sorry, no github version available yet, maybe after my holiday ;-) )

SQL Output Going to the Wrong Place
I am trying to simply take an XML dump and convert it to SQL code, which I will then run on a MySQL server. The code I've been using to do so is below:

java -jar mwdumper.jar --format=sql:1.5 --output=file:stubmetahistory.sql --quiet enwiki-20100312-stub-meta-history.xml > out.txt

What I've found is happening is that the file that I would like to be a SQL file (stubmetahistory.sql) is an exact XML copy of the original file (enwiki-20100312-stub-meta-history.xml). However, what is appearing on the screen and being piped to the out.txt file is the SQL file I am looking for. Any thoughts on what I am doing wrong, or what I am missing here to get this correct? The problem of course with just using the out.txt to load into my MySQL server is that there could be problems with the character encoding.

Thank you, CMU Researcher 20:37, 19 May 2010 (UTC)

Alternatives
For anyone unfamiliar with Java (such as myself), is there any other program we can use? 70.101.99.64 21:09, 21 July 2010 (UTC)

Java GC
There seems to be a problem with the garbage collection in mwdumper. On trying to import the Wikipedia 20100130 English dump containing 19,376,810 pages and 313,797,035 revisions, it aborts with the error after 4,216,269 pages and 196,889,000 revs:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOfRange(Arrays.java:3209) at java.lang.String. (String.java:215) at org.mediawiki.importer.XmlDumpReader.bufferContents(Unknown Source) at org.mediawiki.importer.XmlDumpReader.bufferContentsOrNull(Unknown Source) at org.mediawiki.importer.XmlDumpReader.readText(Unknown Source) at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source) at org.mediawiki.dumper.Dumper.main(Unknown Source)

The recommended suggestion (http://forums.sun.com/thread.jspa?threadID=5114529; http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom) of turning off this feature is NOT feasible as the error is thrown when 98% of the program is spent in GC and under 2% of the heap is recovered. I would appreciate any help or comments.