|MediaWiki file: importDump.php|
|Source code:||master • 1.39.0 • 1.38.4 • 1.35.8|
- Recommended method for general use, but slow for very big data sets. See #Importing English Wikipedia or other large wikis, below.
importDump.php is a maintenance script that imports XML dump files into the current wiki. It reads pages from an XML file, as produced by Special:Export or dumpBackup.php, and saves them into the current wiki. It is located in the maintenance folder of your MediaWiki installation.
If you have shell access, you can call importDump.php from within the maintenance folder like this (add paths as necessary):
php importDump.php --conf ../LocalSettings.php /path_to/dumpfile.xml.gz --username-prefix=""
--username-prefix="" when importing files.
php importDump.php < dumpfile.xml
dumpfile.xml is the name of the XML dump file.
If the file is compressed and has a .bz2 file extension (but not .tar.bz2), it is decompressed automatically.
Afterwards, use importImages.php to import the images:
php importImages.php ../path_to/images
--no-updates for faster import. Also note that the information in meta:Help:Import about merging histories, etc. also applies.
If you imported a dump with the --no-updates parameter, you'll need to run rebuildall.php to populate all the links, templates and categories.
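The two-step sequence (import with --no-updates, then rebuild) can be sketched as a small shell wrapper. This is only an illustration: the run helper, the RUN variable and the function name import_without_updates are hypothetical, and dumpfile.xml is a placeholder. By default the commands are printed rather than executed, so you can check them before running them from the maintenance folder.

```shell
#!/bin/sh
# run: print the command by default; execute it only when RUN=1 is set.
# (Hypothetical helper for this sketch, not part of MediaWiki.)
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# import_without_updates DUMPFILE
# Imports the dump with link table updates disabled, then rebuilds the
# tables. The rebuild step only runs if the import step succeeded.
import_without_updates() {
    run php importDump.php --no-updates "$1" &&
    run php rebuildall.php
}

import_without_updates dumpfile.xml
```

Calling it with RUN=1 set in the environment would execute both commands instead of printing them.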
Description of operation
The script reports progress in 100-page increments (by default), giving the number of pages imported per second for each increment, so you can monitor its activity and see that it hasn't hung. It can take 30 seconds or more between increments.
The script is robust: it skips past previously loaded pages rather than overwriting them, so it can pick up where it left off fairly quickly after being interrupted and restarted. It still displays progress increments while skipping, and these go by quickly.
Pages are imported preserving the timestamp of each edit. Because of this, if a page being imported is older than the existing page, the import only populates the page history and does not replace the most recent revision with an older one. If that behavior is not desired, either delete the existing pages before the import, or edit them afterwards, reverting to the last imported revision found in the page history.
The wiki is usable during the import.
The wiki looks odd at first, with most templates missing and many red links, but it improves as the import proceeds.
php importDump.php --conf ../LocalSettings.php /path_to/dumpfile.xml.gz
php importDump.php < dumpfile.xml
|--report||Report position and speed after every n pages processed.|
|--namespaces||Import only the pages from namespaces belonging to the list of pipe-separated namespace names or namespace indexes.|
|--dry-run||Parse dump without actually importing pages.|
|--debug||Output extra verbose debug information.|
|--uploads||Process file upload data if included (experimental).|
|--no-updates||Disable link table updates. Is faster but leaves the wiki in an inconsistent state. Run rebuildall.php after the import to correct the link table.|
|--image-base-path||Import files from a specified path.|
|--skip-to||Start from the given page number, by skipping first n-1 pages.|
|--username-prefix||Adds a prefix to usernames. Due to this bug it may be necessary to specify |
How to setup debug mode?
Use command line option
How to make a dry run (no data added to the database)?
Use command line option
Failed to open stream
If you get the error "failed to open stream: No such file or directory", make sure that the specified file exists and that PHP has access to it.
Error while running importImages
roots@hello:~# php importImages.php /maps gif bmp PNG JPG GIF BMP
> PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/mcrypt.ini on line 1 in Unknown on line 0
> Could not open input file: importImages.php
Before running importImages.php, you first need to change directories to the maintenance folder, which contains the importImages.php maintenance script.
Error while running MAMP
DB connection error: No such file or directory (localhost)
Using specific database credentials
$wgDBserver = "localhost:/Applications/MAMP/tmp/mysql/mysql.sock";
$wgDBadminuser = "XXXX";
$wgDBadminpassword = "XXXX";
Importing English Wikipedia or other large wikis
For very large data sets, importDump.php may take a long time (days or weeks). There are alternative methods that can be much faster for full site restoration; see Manual:Importing XML dumps.
If you can't get the other methods to work, here are some pointers for reducing the import time as much as possible when using importDump.php on large wikis.
Parallelizing the import
You could try running importDump.php multiple times simultaneously on the same dump, using the option
In an experiment on Ubuntu, the script was run (on a decompressed dump) multiple times in separate windows simultaneously using the --skip-to option. On a quad-core laptop computer, running the script in 4 windows sped up the import by a factor of 4. In the experiment, the --skip-to parameter was set 1,000,000 pages apart per instance, and the import was monitored (checked on from time to time) to stop each instance before it caught up to another.
Note: The experiment did not try running multiple instances without the --skip-to parameter, to avoid potential clashes. If you try this without --skip-to, or you let the instances catch up to each other, please post your findings here. In this experiment, 2 of the windows caught up, and no error messages resulted. The instances of the script appeared to be jumping past each other.
--skip-to differs from normal operation in that progress increments are not displayed during the skip; only the (blinking) cursor is shown. After a few minutes, the increment reports begin to display.
It may be a good idea to segment the data first, with an XML splitter, before importing it in parallel. Then run importDump.php on each segment in a separate window, which would avoid potential clashes. (If you successfully split the dump so that it works in this process, please post how to do it here.)
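As a rough sketch of the --skip-to approach described in the experiment, the following shell function prints one importDump.php command per instance, with starting points spaced a fixed number of pages apart. The function name print_parallel_commands and the file name dumpfile.xml are hypothetical. Because --skip-to n skips the first n-1 pages, an instance meant to skip 1,000,000 pages is given 1000001. The commands are only printed; each would be started by hand in its own window and stopped before it catches up to the next one, as in the experiment.

```shell
#!/bin/sh
# print_parallel_commands DUMPFILE INSTANCES PAGES_APART
# Prints one import command per instance. The first instance starts at
# the beginning of the dump; instance i starts at page i*PAGES_APART+1.
print_parallel_commands() {
    dump=$1
    instances=$2
    apart=$3
    i=0
    while [ "$i" -lt "$instances" ]; do
        if [ "$i" -eq 0 ]; then
            echo "php importDump.php $dump"
        else
            echo "php importDump.php --skip-to=$(( i * apart + 1 )) $dump"
        fi
        i=$(( i + 1 ))
    done
}

# Four instances, 1,000,000 pages apart, as in the experiment.
print_parallel_commands dumpfile.xml 4 1000000
```

Nothing in this sketch stops an instance when it reaches pages that another instance has already imported, so the manual monitoring described above is still needed.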
Import the most useful namespaces first
To speed up import of the most useful parts of the wiki, use the --namespaces parameter. Import templates first, because articles without working templates look awful. Then import articles. Or do both at the same time, in multiple windows, as described above, starting the templates first; they import faster, so the articles window(s) won't catch up. Note: The main namespace doesn't have a prefix, so it must be specified using 0. "Main" and "Article" fail to run and return errors.
Once this is complete, you will need to run importDump.php again to get the pages in all the other namespaces.
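Because --namespaces takes a pipe-separated list, the value has to be quoted so that the shell does not treat the | as a pipeline. A minimal sketch, assuming the namespace name Template is accepted for the template namespace and using a hypothetical helper join_namespaces and a placeholder dumpfile.xml:

```shell
#!/bin/sh
# join_namespaces NS [NS...]
# Prints its arguments joined with '|'. The subshell keeps the IFS
# change from leaking into the rest of the script.
join_namespaces() {
    ( IFS='|'; echo "$*" )
}

# Templates and the main namespace (index 0) in a single pass.
# The command is printed, not executed; note the quotes around the list.
echo "php importDump.php --namespaces='$(join_namespaces Template 0)' dumpfile.xml"
```

To import templates and articles in separate windows instead, the same command would be run once per namespace, each with a single-entry list.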
Estimating how long it will take
Before you can estimate how long an import will take, you've got to find out how many total pages are in the wiki you are importing. That is displayed at Special:Statistics in each wiki. As of March 2022, the English Wikipedia had over 55,000,000 pages, including all page types such as talk pages, redirects, etc, but not including pictures ("files").
To see how fast the import is going, go to the page Special:Statistics in the wiki you are importing into. Note the time and jot down the total pages. Then come back later and see by how much that number has changed. Convert that to pages per day, and then divide that figure into the total pages for the wiki you are importing, to see how many days the import will take.
For example, in the experiment mentioned above, importing with parallelization and watching the total pages in Special:Statistics, the wiki grew by about 1,000,000 pages per day. At that rate, it would take around 55 days to import the 55,000,000 pages (as of March 2022) in the English Wikipedia (not including pictures).
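The arithmetic above can be sketched as a small shell function. The name estimate_days and the two sample readings are hypothetical; only the 55,000,000 total and the 1,000,000 pages per day rate come from the example. The function takes two page counts from Special:Statistics and the number of hours between them, scales the difference to pages per day, and divides the total by that rate, rounding up.

```shell
#!/bin/sh
# estimate_days TOTAL_PAGES PAGES_BEFORE PAGES_AFTER HOURS_BETWEEN
# Prints the estimated number of days needed to import TOTAL_PAGES,
# based on two page counts read HOURS_BETWEEN hours apart.
estimate_days() {
    total=$1
    before=$2
    after=$3
    hours=$4
    if [ "$hours" -le 0 ]; then
        echo "hours between readings must be positive" >&2
        return 1
    fi
    # Pages imported per day, scaled from the observed interval.
    per_day=$(( (after - before) * 24 / hours ))
    if [ "$per_day" -le 0 ]; then
        echo "no progress observed between readings" >&2
        return 1
    fi
    # Ceiling division, so a partial day counts as a whole day.
    echo $(( (total + per_day - 1) / per_day ))
}

# Illustrative readings: 250,000 pages in 6 hours is 1,000,000 per day,
# which gives 55 days for 55,000,000 pages.
estimate_days 55000000 100000 350000 6
```

The estimate assumes the import rate stays constant, which may not hold over a long import.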
If errors occur when importing files, it may be necessary to use the