Manual:Importing XML dumps

This page describes methods to import XML dumps. XML dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump is not a full backup of the wiki database: it does not contain user accounts, images, edit logs, etc.

The Special:Export page of any MediaWiki site, including any Wikimedia site and Wikipedia, creates an XML file (content dump). See Data dumps and Manual:DumpBackup.php. The XML files are explained in more detail at meta:Help:Export.

What to import?

 * Importing from mediawiki.org

How to import?
There are several methods for importing these XML dumps.

Using Special:Import
Special:Import can be used by wiki users who have the import permission (by default, only users in a specific group) to import a small number of pages (about 100 should be safe). Trying to import large dumps this way may result in timeouts or connection failures. See meta:Help:Import for a detailed description.
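Since Special:Import is only suitable for small dumps, it can help to count the pages in a dump before choosing an import method. The following is a minimal Python sketch, not part of MediaWiki: the names `count_pages`, `fits_special_import` and `SAFE_PAGE_LIMIT` are hypothetical, and the limit of 100 is simply the rough figure given above.

```python
import io
import xml.etree.ElementTree as ET

# Rough limit taken from the text above ("about 100 should be safe").
SAFE_PAGE_LIMIT = 100


def count_pages(source):
    """Count <page> elements in a MediaWiki XML dump.

    `source` is a file path or a binary file-like object. Elements are
    matched by local name, so the export namespace version is ignored.
    """
    pages = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace-uri}page"; keep the part after "}".
        if elem.tag.rsplit("}", 1)[-1] == "page":
            pages += 1
            elem.clear()  # free the memory used by the finished page
    return pages


def fits_special_import(source, limit=SAFE_PAGE_LIMIT):
    """Return True if the dump is small enough for Special:Import."""
    return count_pages(source) <= limit


if __name__ == "__main__":
    # Tiny made-up dump, used only to demonstrate the call.
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>A</title></page>"
        b"<page><title>B</title></page>"
        b"</mediawiki>"
    )
    print(count_pages(io.BytesIO(sample)))
```

For a real dump, pass the path of the uncompressed XML file instead of the in-memory sample.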

You are asked to give an interwiki prefix. For instance, if you exported from the English Wikipedia, you have to type 'en'.

Changing permissions
See Manual:User_rights

To allow all registered editors to import (not recommended) the line added to "LocalSettings.php" would be:

Possible problems
To use transwiki import, PHP safe_mode must be off and "open_basedir" must be empty (both are settings in php.ini). Otherwise the import fails.

If you get an error like this:

Warning: XMLReader::open: Unable to open source data in /.../wiki/includes/Import.php on line 53
Warning: XMLReader::read: Load Data before trying to read in /.../wiki/includes/Import.php on line 399

and Special:Import shows "Import failed: Expected <mediawiki> tag, got ", the cause may be a fatal error during a previous import, which leaves libxml in a wrong state across the entire server, or another PHP script on the same server that disabled the entity loader (a PHP bug). This happens on MediaWiki versions prior to 1.26. The solution is to restart the web server service (Apache, etc.), or to write and run a script that re-enables the libxml entity loader.

Using importDump.php, if you have shell access

 * Recommended method for general use, but slow for very big data sets. For very large amounts of data, such as a dump of a big Wikipedia, use mwdumper, and import the links tables as separate SQL dumps.

importDump.php is a command-line script located in the maintenance folder of your MediaWiki installation. If you have shell access, you can call importDump.php from within the maintenance folder like this (add paths as necessary):

php importDump.php --conf ../LocalSettings.php /path_to/dumpfile.xml.gz --username-prefix=""

In the command above, the file argument is the name of the XML dump file. If the file is compressed and has a supported file extension (such as .gz, as in the example), it is decompressed automatically.

Afterwards, use importImages.php to import the images:

php importImages.php ../path_to/images

If you have other digital media file types uploaded to your wiki, e.g. .zip, .nxc, .cpp, .py, or .pdf, then you must also back up (export) the wiki_prefix_imagelinks table and insert it into the corresponding table of the database used by your new MediaWiki. Otherwise, all links referencing these file types will show up as broken in your wiki pages.

If you are using a WAMP installation, you may have problems with the import because of the InnoDB settings: by default this engine is disabled in my.ini, so to avoid problems, use the MyISAM engine.

Running importDump.php can take quite a long time. For a large Wikipedia dump with millions of pages, it may take days, even on a fast server. Add --no-updates for a faster import. Also note that the information in meta:Help:Import about merging histories, etc. also applies.

Optimizing the database after the import is recommended: it can reduce the database size by a factor of two or three.

After running importDump.php, you may want to run rebuildrecentchanges.php in order to update the content of your Special:Recentchanges page.
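The order of the steps described above can be sketched as a small Python helper that only builds the command lines, without running them. This is an illustration under stated assumptions: the function name `build_import_commands` is hypothetical, only script names and options mentioned on this page are used, and the commands are assumed to be run from the maintenance folder.

```python
import shlex


def build_import_commands(dump_file, conf="../LocalSettings.php", no_updates=False):
    """Return the maintenance commands, in order, as argument lists.

    Only script names and options mentioned in the text are used.
    The commands are meant to be run from the maintenance folder.
    """
    import_cmd = ["php", "importDump.php", "--conf", conf]
    if no_updates:
        import_cmd.append("--no-updates")  # faster import, per the text
    import_cmd.append(dump_file)
    return [
        import_cmd,
        # Refresh the content of Special:Recentchanges after the import.
        ["php", "rebuildrecentchanges.php"],
    ]


if __name__ == "__main__":
    for cmd in build_import_commands("/path_to/dumpfile.xml.gz", no_updates=True):
        print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, cwd=maintenance_dir, check=True)`, where `maintenance_dir` is a placeholder for the path of your maintenance folder.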

FAQ

 * How to set up debug mode?: Use command line option.


 * How to make a dry run (no data added to the database)?: Use command line option

Error messages
 * Typed: roots@hello:~# php importImages.php /maps gif bmp PNG JPG GIF BMP
 * Error:
 > PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/mcrypt.ini on line 1 in Unknown on line 0
 > Could not open input file: importImages.php
 * Cause: Before running importImages.php, you first need to change directories to the maintenance folder, which contains the importImages.php maintenance script.

 * Error while running MAMP: DB connection error: No such file or directory (localhost)
 * Solution: Use specific database credentials:
 $wgDBserver = "localhost:/Applications/MAMP/tmp/mysql/mysql.sock";
 $wgDBadminuser = "XXXX";
 $wgDBadminpassword = "XXXX";

Using importTextFiles.php Maintenance Script
If you have a lot of content converted from another source (several word processor files, content from another wiki, etc), you may have several files that you would like to import into your wiki. In MediaWiki 1.27 and later, you can use the importTextFiles.php maintenance script.

You can also use the edit.php maintenance script for this purpose.

Using mwdumper
mwdumper is a Java application that can be used to read, write and convert MediaWiki XML dumps. It can be used to generate an SQL dump from the XML file (for later import with a database client) as well as for importing into the database directly. It is a lot faster than importDump.php; however, it only imports the revisions (page contents) and does not update the internal link tables accordingly. That means that category pages and many special pages will show incomplete or incorrect information unless you update those tables.

Note that mwdumper reportedly cannot be used to import into MediaWiki 1.31 or later.

If available, you can fill the link tables by importing separate SQL dumps of these tables directly with the database's command-line client. For Wikimedia wikis, this data is available along with the XML dumps.

Otherwise, you can rebuild the link tables with a maintenance script, which will take a long time because it has to parse all pages. This is not recommended for large data sets.

Using pywikibot, pagefromfile.py and Nokogiri
pywikibot is a collection of tools written in Python that automate work on Wikipedia or other MediaWiki sites. Once it is installed on your computer, you can use the tool 'pagefromfile.py', which lets you upload a wiki file to Wikipedia or other MediaWiki sites.

The XML file created by dumpBackup.php can be transformed into a wiki file suitable for processing by 'pagefromfile.py' using a simple Ruby program, here called 'dumpxml2wiki.rb'. The program transforms all XML files in the current directory, which is needed if your MediaWiki site is a family. In the wiki file output by the command 'ruby dumpxml2wiki.rb', each page (for example a template, or a page that is a redirect) forms one block that pagefromfile.py can then upload.

The program reads each XML file, extracts the text of each page, looks up the corresponding title in the parent element, and encloses the text in the paired start and stop markers that 'pagefromfile' uses to create or update a page. The name of the page is placed in an HTML comment, delimited by three quotes, on the first line after the start marker. Note that the name of the page can be written in Unicode. Sometimes the page must start directly with a command, as for a #REDIRECT; in that case the comment giving the name of the page must come after the command but still on the first line.

Note that the elements in the XML dump files produced by dumpBackup.php belong to an XML namespace. To access the text node using Nokogiri, you need to prefix your path with 'xmlns'. Nokogiri is an HTML, XML, SAX, and Reader parser for Ruby that can search documents via XPath or CSS3 selectors.
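The transformation described above can also be sketched in Python with the standard library. This is not the Ruby/Nokogiri program referred to on this page but an illustrative substitute. It assumes that pagefromfile uses its default page markers `{{-start-}}` and `{{-stop-}}` and finds the title between triple quotes (check this against your pywikibot version), and that the last revision listed for a page is the one wanted. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed pagefromfile defaults; check them against your pywikibot version.
PAGE_START = "{{-start-}}"
PAGE_STOP = "{{-stop-}}"


def local_name(tag):
    """Strip the "{namespace-uri}" prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def last_text(parent, name):
    """Return the text of the last descendant called `name`, or ""."""
    found = [e.text or "" for e in parent.iter() if local_name(e.tag) == name]
    return found[-1] if found else ""


def format_page(title, text):
    """Wrap one page so that its title sits in a comment on the first line."""
    marker = "<!--'''" + title + "'''-->"
    if text.startswith("#"):
        # Commands such as #REDIRECT must open the page, so the title
        # comment goes at the end of the first line instead.
        first, sep, rest = text.partition("\n")
        body = first + marker + sep + rest
    else:
        body = marker + text
    return PAGE_START + body + "\n" + PAGE_STOP + "\n"


def dump_to_wiki(xml_string):
    """Convert a dump given as a string into pagefromfile input."""
    root = ET.fromstring(xml_string)
    pages = [p for p in root.iter() if local_name(p.tag) == "page"]
    return "".join(
        format_page(last_text(p, "title"), last_text(p, "text")) for p in pages
    )
```

Matching elements by local name sidesteps the namespace prefix issue mentioned above: ElementTree reports each tag as `{namespace-uri}name`, and the sketch simply ignores the part in braces.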

Example of the use of 'pagefromfile' to upload the output wiki text file:

How to import logs?
Exporting and importing logs with the standard MediaWiki scripts often proves very hard; an alternative for import is a script in the WikiDAT tool, as suggested by Felipe Ortega.

Interwikis
If you get the message 'Page "meta:Blah blah" is not imported because its name is reserved for external linking (interwiki).', the problem is that some pages to be imported have a prefix that is used for interwiki linking. For example, pages with a prefix of 'Meta:' would conflict with the interwiki prefix 'meta:', which by default links to https://meta.wikimedia.org.

You can do any of the following.
 * Remove the prefix from the interwiki table. This will preserve page titles, but prevent interwiki linking through that prefix.
   * Example: you will preserve page titles such as 'Meta:Blah blah' but will not be able to use the prefix 'meta:' to link to meta.wikimedia.org (although it will be possible through a different prefix).
   * How to do it: before importing the dump, run a query that deletes the row for that prefix from the interwiki table (note: do not include the colon in the prefix). Alternatively, if you have enabled editing the interwiki table, you can simply go to Special:Interwiki and click the 'Delete' link on the right side of the row belonging to that prefix.
 * Replace the unwanted prefix in the XML file with "Project:" before importing. This will preserve the functionality of the prefix as an interwiki link, but will replace the prefix in the page titles with the name of the wiki into which they are imported, and it can be tedious on large dumps.
   * Example: replace all 'Meta:' with 'Project:' in the XML file. MediaWiki will then replace 'Project:' with the name of your wiki during importing.
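The second option can be sketched in Python as follows. This is a minimal illustration, not a MediaWiki tool: the function name `replace_title_prefix` is hypothetical, and the sketch deliberately rewrites only `<title>` elements, so links inside page text keep the old prefix and continue to act as interwiki links. Replacing every occurrence in the file, as in the example above, would be a plain string replacement instead.

```python
import re


def replace_title_prefix(xml_text, old_prefix="Meta", new_prefix="Project"):
    """Rewrite "<title>Old:..." to "<title>New:..." in dump text.

    Only <title> elements are touched; links inside page text keep the
    old prefix and would need separate handling.
    """
    pattern = re.compile(r"(<title>)" + re.escape(old_prefix) + r":")
    # A function is used as the replacement so that backslashes in
    # new_prefix are never interpreted as regex escapes.
    return pattern.sub(lambda m: m.group(1) + new_prefix + ":", xml_text)


if __name__ == "__main__":
    line = "<title>Meta:Blah blah</title>"
    print(replace_title_prefix(line))
```

For large dumps, the function can be applied line by line while copying the file, which avoids loading the whole dump into memory; this assumes each `<title>` element sits on a single line.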