Manual:Importing XML dumps

This page describes methods to import XML dumps.

The Special:Export page of any MediaWiki site, including any Wikimedia site and Wikipedia, creates an XML file (content dump). See Data dumps and Manual:DumpBackup.php. XML files are explained in more detail at meta:Help:Export.

There are several methods for importing these XML dumps:

Using Special:Import
Special:Import can be used by wiki users with import permission (by default this is users in the sysop group) to import a small number of pages (about 100 should be safe). Trying to import large dumps this way may result in timeouts or connection failures. See meta:Help:Import for a detailed description.
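
Before choosing Special:Import, it can help to check how many pages a dump actually contains. The following is a minimal Python sketch, not part of MediaWiki: the function names, the `SAFE_PAGE_LIMIT` constant (taken from the "about 100" guideline above), and the sample dump with its namespace URI are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

SAFE_PAGE_LIMIT = 100  # assumption: mirrors the "about 100 should be safe" guideline


def count_pages(source):
    """Count <page> elements in a MediaWiki XML dump, streaming the input."""
    pages = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Dump elements carry an XML namespace, so compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            pages += 1
            elem.clear()  # drop the page's children to keep memory use low
    return pages


def fits_special_import(source, limit=SAFE_PAGE_LIMIT):
    """Return True when the dump is small enough to try Special:Import."""
    return count_pages(source) <= limit


# Tiny hypothetical dump, used only to demonstrate the functions.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>one</text></revision></page>
  <page><title>B</title><revision><text>two</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(count_pages(io.BytesIO(SAMPLE_DUMP)))
```

`count_pages` accepts a file name or any binary file object, so it can be pointed at a real export file instead of the sample.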

See Manual:XML Import file manipulation in CSharp for a C# code sample that manipulates an XML import file.

Possible Problems
To use Transwiki-Import, PHP safe_mode must be off and open_basedir must be empty; otherwise the import fails.

Using importDump.php, if you have shell access

 * Recommended method for general use, but slow for very big data sets. For very large amounts of data, such as a dump of a big Wikipedia, use mwdumper, and import the links tables as separate SQL dumps.

importDump.php is a command-line script located in the maintenance folder of your MediaWiki installation. If you have shell access, you can call importDump.php like this:

php importDump.php <dumpfile>

where <dumpfile> is the name of the XML dump file. If the file is compressed and has a .gz or .bz2 file extension, it is decompressed automatically.
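
importDump.php handles decompression itself, but you may want to inspect a compressed dump before importing it. The sketch below shows the same extension-based choice in Python. It is an illustration only, not importDump.php's own code, and `open_dump` is a hypothetical helper name.

```python
import bz2
import gzip
import os
import tempfile


def open_dump(path):
    """Open a dump file for reading bytes, decompressing by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")  # assume an uncompressed XML file otherwise


def read_dump(path):
    """Read the whole (decompressed) dump; only sensible for small files."""
    with open_dump(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # Write a tiny gzip-compressed file to a temporary directory and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.xml.gz")
        with gzip.open(demo, "wb") as out:
            out.write(b"<mediawiki/>")
        print(read_dump(demo))
```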

If you are using a WAMP installation, you may have trouble importing because of the InnoDB settings: by default this engine is disabled in my.ini, so to avoid problems, use the MyISAM engine.

To run importDump.php (or any other tool from the maintenance directory), you need to set up your AdminSettings.php file.

Running importDump.php can take quite a long time. For a large Wikipedia dump with millions of pages, it may take days, even on a fast server. Note that the information in meta:Help:Import about merging histories, etc. also applies.

After running importDump.php, you may want to run rebuildrecentchanges.php in order to update the content of your Special:Recentchanges page.
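
The two steps above (import, then rebuild recent changes) can be scripted. The following Python sketch is one possible wrapper, not an official tool: `build_import_commands` and `run_import` are hypothetical names, and the maintenance directory and dump path are whatever your installation uses. The commands run with the maintenance folder as the working directory, since the scripts are located there.

```python
import os
import subprocess


def build_import_commands(dump_file):
    """Return the two commands described above, in the order they should run."""
    return [
        ["php", "importDump.php", dump_file],
        ["php", "rebuildrecentchanges.php"],
    ]


def run_import(maintenance_dir, dump_file):
    """Run the import and the recent-changes rebuild from the maintenance folder."""
    # Resolve the dump path first, because the working directory changes below.
    dump_file = os.path.abspath(dump_file)
    for command in build_import_commands(dump_file):
        # check=True stops at the first failing step instead of continuing.
        subprocess.run(command, cwd=maintenance_dir, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_import() on a real installation.
    for command in build_import_commands("dump.xml"):
        print(" ".join(command))
```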

FAQ
How do we set up debug mode?


 * In the importDump.php file there is a variable $debug. Set that variable to true to display debug messages. Set that variable to false to turn off debug messages.

How do we do a dry run (no data added to the database)?


 * In the importDump.php file there is a variable $dryrun. Set that variable to true to do a dry run. Set that variable to false to add the data to the database.

Which file controls the main functionality of importing the data?


 * Import.php in the includes directory. The main class is WikiImporter.

Does SpecialImport.php (the importer behind the special page) use the WikiImporter class?


 * Yes, importDump.php and SpecialImport.php both use WikiImporter.

Error messages
 * Typed:

roots@hello:~# php importImages.php /maps gif bmp PNG JPG GIF BMP

 * Error:

> PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/mcrypt.ini on line 1 in Unknown on line 0
> Could not open input file: importImages.php

 * Cause:

Before running importImages.php you first need to change directories to the maintenance folder, which contains the importImages.php maintenance script.

Using mwdumper
mwdumper is a Java application that can be used to read, write and convert MediaWiki XML dumps. It can be used to generate an SQL dump from the XML file (for later use with mysql or phpmyadmin) as well as for importing into the database directly. It is much faster than importDump.php; however, it only imports the revisions (page contents) and does not update the internal link tables accordingly. That means that category pages and many special pages will show incomplete or incorrect information unless you update those tables.

If available, you can fill the link tables by importing separate SQL dumps of these tables using the mysql command line client directly. For Wikimedia wikis, this data is available along with the XML dumps.

Otherwise, you can run rebuildall.php, which will take a long time, because it has to parse all pages. This is not recommended for large data sets.

Using xml2sql
Xml2sql (a multiplatform ANSI C program) converts a MediaWiki XML file into an SQL dump for use with mysql or phpmyadmin. Just like using mwdumper (see above), importing this way is fast, but does not update secondary data like link tables, so you need to run rebuildall.php, which nullifies that advantage.

xml2sql is not an official tool and not maintained by MediaWiki developers. It may become outdated and incompatible with the latest version of MediaWiki!


 * It already did. It's reporting "unexpected element " just as the original 1.15 Import.php does (reported on 13 February 2010). (Tried on April 29th 2010 and getting same issue).

Using pywikipediabot, pagefromfile.py and Nokogiri
pywikipediabot is a collection of tools written in Python that automate work on Wikipedia or other MediaWiki sites. Once installed on your computer, you can use the specific tool 'pagefromfile.py', which lets you upload a wiki file to Wikipedia or other MediaWiki sites. The XML file created by dumpBackup.php can be transformed into a wiki file suitable for processing by 'pagefromfile.py' using a simple Ruby program. The program transforms all XML files in the current directory, which is needed if your MediaWiki site is a family. The output of the command 'ruby dumpxml2wiki.rb' is a wiki file whose pages can then be uploaded by pagefromfile.py, for example a Template and a second page which is a redirect.

The program accesses each XML file, extracts the text within the markup of each page, looks up the corresponding title in the parent element, and encloses the text with the paired start and end markers used by 'pagefromfile' to create or update a page. The name of the page is in an HTML comment, delimited by three quotes, on the same first start line. Note that the name of the page can be written in Unicode. Sometimes it is important that the page starts directly with a command, as with #REDIRECT; in that case the comment giving the name of the page must come after the command but still on the first line.

Note that the elements in the XML dump files produced by dumpBackup.php are prefixed by a namespace. To access the text node using Nokogiri, you need to prefix your path with 'xmlns'. Nokogiri is an HTML, XML, SAX, and Reader parser for Ruby, from the latest generation of XML parsers, with the ability to search documents via XPath or CSS3 selectors.
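
The transformation described above can also be sketched in Python with only the standard library; this swaps the Ruby/Nokogiri program for `xml.etree.ElementTree` and is not the original dumpxml2wiki.rb. Several things are assumptions: the `{{-start-}}` and `{{-stop-}}` markers (believed to be pagefromfile's defaults), the exact placement of the title comment on the marker line, the helper names, and the sample dump. In ElementTree the dump's namespace shows up as a `{uri}` prefix on every tag, which plays the role of the 'xmlns' path prefix in Nokogiri.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the default page markers expected by pagefromfile.py.
START, STOP = "{{-start-}}", "{{-stop-}}"


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on dump tags."""
    return tag.rsplit("}", 1)[-1]


def dump_to_wiki(source):
    """Convert a MediaWiki XML dump into pagefromfile-style wiki text."""
    chunks = []
    for _event, page in ET.iterparse(source, events=("end",)):
        if local(page.tag) != "page":
            continue
        title, text = "", ""
        for node in page.iter():
            if local(node.tag) == "title":
                title = node.text or ""
            elif local(node.tag) == "text":
                text = node.text or ""  # the last revision listed in the page wins
        name = "<!--'''%s'''-->" % title  # page name inside an HTML comment
        first, sep, rest = text.partition("\n")
        if first.upper().startswith("#REDIRECT"):
            # A redirect must open the page, so the name goes after the command.
            body = first + name + sep + rest
        else:
            body = name + text
        chunks.append(START + body + "\n" + STOP + "\n")
        page.clear()  # free the processed page
    return "".join(chunks)


# Hypothetical two-page dump: a template and a redirect to it.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Template:Hello</title><revision><text>Hello, world</text></revision></page>
  <page><title>Hi</title><revision><text>#REDIRECT [[Template:Hello]]</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(dump_to_wiki(io.BytesIO(SAMPLE_DUMP)), end="")
```

If your pagefromfile version expects different markers or title delimiters, adjust `START`, `STOP`, and the `name` format accordingly.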

Example of the use of 'pagefromfile' to upload the output wiki text file:

What to import?

 * /Importing from mediawiki.org