Manual talk:Importing XML dumps

link tables?
(regarding mwdumper import) I want to avoid the expensive rebuildall.php script. Looking at enwiki/20080724/, I'm wondering - should we import ALL of the SQL dump files, or are there any that should be skipped? --JaGa 00:50, 23 August 2008 (UTC)
 * OK, I went through maintenance/tables.sql and compared what importDump.php populates with what mwdumper populates (only the page, revision, and text tables), so I'm thinking this is the list of SQL dumps I'll want after mwdumper finishes:


 * category
 * categorylinks
 * externallinks
 * imagelinks
 * pagelinks
 * redirect
 * templatelinks


 * Thoughts? --JaGa 07:04, 24 August 2008 (UTC)

When I try to import using this command: C:\Program Files\xampp\htdocs\mediawiki-1.13.2\maintenance>"C:\Program Files\xampp\php\php.exe" importDump.php C:\Users\Matthew\Downloads\enwiki-20080524-pages-articles.xml.bz2

It fails with this error: XML import parse failure at line 1, col 1 (byte 0; "BZh91AY&SYö┌║O☺Ä"): Empty document

What do you think is wrong?
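The bytes quoted in the error ("BZh91AY&SY...") are the usual start of a bzip2 stream, which suggests the importer read the compressed file as if it were plain XML. A minimal sketch of one workaround, assuming Python is available on the machine: sniff the compression from the leading bytes and decompress to a plain .xml file before running importDump.php. The function names here are hypothetical helpers, not part of MediaWiki.

```python
import bz2
import gzip
import shutil


def sniff_compression(path):
    """Return 'bz2', 'gzip', or 'none' based on the file's leading magic bytes."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head == b"BZh":  # bzip2 streams start with the ASCII bytes "BZh"
        return "bz2"
    if head[:2] == b"\x1f\x8b":  # gzip streams start with 0x1f 0x8b
        return "gzip"
    return "none"


def decompress_to(path, out_path):
    """Copy the dump to out_path, decompressing it if needed; return the detected kind."""
    kind = sniff_compression(path)
    opener = {"bz2": bz2.open, "gzip": gzip.open, "none": open}[kind]
    with opener(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large dumps are not held in memory
    return kind
```

The decompressed file can then be passed to importDump.php in place of the .bz2 file.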

table prefix
I have a set of wikis with a different table prefix for each of them. How do I tell importDump.php which wiki to use?


 * Set $wgDBprefix in AdminSettings.php —Emufarmers(T 11:10, 25 February 2009 (UTC)

Importing multiple dumps into same database?
If we try to import multiple dumps into the same database, what happens?

Will it work this way?

For example, if there are two articles with the same title in both databases, what will happen?

Is it possible to import both of them into the same database and distinguish titles with prefixes?

Merging with an existing wiki
How do I merge the dumps with another wiki I've created without overwriting existing pages/articles?

.bz2 files decompressed automatically by importDump.php?
It seems only .gz files, not .bz2, are decompressed on the fly. --Apoc2400 22:40, 18 June 2009 (UTC)


 * Filed as bug 19289. —Emufarmers(T 05:15, 19 June 2009 (UTC)

Add

to the importFromFile function

Having trouble with importing XML dumps into database
I have been trying to upload one of the latest versions of the dumps, pages-articles.xml.bz2 from enwiki/20090604/. I don't want the front end and the other things that come with MediaWiki installations, so I thought I would just create the database and upload the dump. I tried using mwdumper, but it breaks with the following error. 18328 I also tried using mwimport, which failed with the same problem. Does anyone have any suggestions for importing the dump successfully into the database?

Thanks Srini

Error Importing XML Files
A colleague has exported Wikipedia help contents and when attempting to import ran into an error. One of the errors had to do with Template:Seealso. The XML that is produced has a tag which causes the import.php module to error out. If I remove the line from the XML the imports just fine. We are using 1.14.0. Any thoughts?


 * I am using 1.15., and I get the following errors:


 * Warning: xml_parse [function.xml-parse]: Unable to call handler in_ in /home/content/*/h/s/*hscentral/html/w/includes/Import.php on line 437




 * Warning: xml_parse [function.xml-parse]: Unable to call handler out_ in /home/content/*/h/s/*hscentral/html/w/includes/Import.php on line 437


 * By analyzing which entries kill the script, I found that the cause is protected redirects: these errors come when a page has both and the lines. Manually removing the restrictions line makes it work. I get these errors both from importDump.php and in my browser window on Special:Import when there is a protected redirect in the file. 76.244.158.243 02:55, 30 September 2009 (UTC)

Simply download the updated Import.php from here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Import.php?view=co and replace the original file in the /includes directory. Works fine!

The Import.php above doesn't work; tested under Ubuntu 12.


 * xml2sql has the same problem:

xml2sql-0.5/xml2sql -mv commonswiki-latest-pages-articles.xml
unexpected element
xml2sql-0.5/xml2sql: parsing aborted at line 10785 pos 16.
212.55.212.99 12:22, 13 February 2010 (UTC)

Error message
The error message I get is "Import failed: Loss of session data. Please try again." Ikip 02:50, 27 December 2009 (UTC)

Fix: I got this error while trying to upload a 10 MB file. After cutting it down into 3.5 MB pieces, each individual file received "The file is bigger than the allowed upload size." error messages. 1.8 MB files worked though. --bhandy 19:24, 16 March 2011 (UTC)
 * THANK YOU! This was driving me mad! LOL But your fix worked. ;) Zasurus 13:00, 5 September 2011 (UTC)
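For anyone who wants to script the "cut it into pieces" fix, here is a rough sketch in Python. It splits by page count rather than by file size, so the right `pages_per_chunk` value has to be found by trial against your upload limit. The function name is hypothetical, and it loads the whole dump into memory, so it only suits small exports like the 10 MB file mentioned above.

```python
import xml.etree.ElementTree as ET


def split_dump(xml_text, pages_per_chunk):
    """Split one export into several smaller exports with at most pages_per_chunk pages each."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        # Keep the dump's default namespace un-prefixed when serializing.
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])

    def is_page(elem):
        return elem.tag.rsplit("}", 1)[-1] == "page"

    header = [child for child in root if not is_page(child)]  # e.g. siteinfo
    pages = [child for child in root if is_page(child)]

    chunks = []
    for start in range(0, len(pages), pages_per_chunk):
        part = ET.Element(root.tag, root.attrib)  # same root element and attributes
        part.extend(header)                        # repeat the header in every piece
        part.extend(pages[start:start + pages_per_chunk])
        chunks.append(ET.tostring(part, encoding="unicode"))
    return chunks
```

Each returned string can be written to its own file and uploaded separately through Special:Import.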

Another Fix: Put the following into your .htaccess file (adjust these figures according to the size of your dump file):

php_value upload_max_filesize 20M
php_value post_max_size 20M
php_value max_execution_time 200
php_value max_input_time 200

Another Fix: Set upload_max_filesize = 20M in php.ini

Does NOT allow importing of modified data on my installation
If I export a dump of the current version using dumpBackup.php --current, then make changes to that dumped file, then attempt to import the changed file back into the system using importDump.php, NONE of the changes come through, even after running rebuildall.php.

Running MW 1.15.1, SemanticMediaWiki 1.4.3.

Am I doing something wrong, or is there a serious bug that I need to report? --Fungiblename 14:09, 13 April 2010 (UTC)

And for the necro-bump.... yes, I was doing something wrong.
For anyone else who has run into this problem, you need to delete revision IDs from your XML page dumps if you want to re-import the XML after modifying it. Sorry for not posting this earlier, but this issue was addressed almost instantly as invalid in response to an admittedly invalid bug report that I filed on Bugzilla in 2010: This is exactly how it's supposed to work to keep you from overwriting revisions via XML imports. --Fungiblename 07:57, 21 September 2011 (UTC)
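A minimal sketch of the "delete revision IDs" step, assuming the dump is small enough to load into memory and that only the `id` element directly under each `revision` needs to go (page IDs and contributor IDs are left alone; whether other elements such as `parentid` also matter is not covered here). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def strip_revision_ids(xml_text):
    """Return the dump with the <id> child of every <revision> removed."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        # Keep the dump's default namespace un-prefixed when serializing.
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])

    # Collect first, then modify, so the tree is not changed while iterating.
    revisions = [e for e in root.iter() if local_name(e.tag) == "revision"]
    for rev in revisions:
        for child in list(rev):
            if local_name(child.tag) == "id":
                rev.remove(child)
    return ET.tostring(root, encoding="unicode")
```

The result can be saved to a new file and fed to importDump.php instead of the edited original.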

Error message: PHP Warning: Parameter 3 to parseForum
Two errors: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/imagick.ini on line 1 in Unknown on line 0

PHP Warning: Parameter 3 to parseForum expected to be a reference, value given in /home/t/public_html/deadwiki.com/public/includes/parser/Parser.php on line 3243

100 (30.59 pages/sec 118.68 revs/sec)

Adamtheclown 05:11, 30 November 2010 (UTC)

XML that does NOT come from a wiki dump
Can this feature be used on an xml file that was not created as, or by, a wiki dump? I am looking for a way to import a lot of text documents at once, that can be wikified later. Your advice, wisdom, insight, etc, greatly appreciated.
 * NO - XML is a structure not a format, so the MediaWiki XML reader only accepts XML dumps for MediaWiki or similarly formatted XML. -- (unsigned comment)
 * YES – You could write a script that will convert your text files into a pseudo XML dump. XML files are supposed to be human readable, so you could load a real XML dump into a text editor in order to study the structure of that dump and to write your script accordingly. With some basic programming skills, this should be pretty straightforward. -- Sloyment (talk) 17:16, 28 April 2018 (UTC)
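As a starting point for such a script, here is a rough Python sketch that wraps plain text files in a bare-bones page/title/revision/text structure. This skeleton is an assumption: real dumps also carry a namespace declaration, a siteinfo block, and more revision fields, so compare the output with a real export from your own wiki before relying on it. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_dump(docs):
    """Build a pseudo dump from a mapping of page title -> wikitext."""
    root = ET.Element("mediawiki")
    for title, text in docs.items():
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = text  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")


def docs_from_dir(directory):
    """Use each *.txt file name (without extension) as the page title."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

For example, `build_dump(docs_from_dir("my_texts"))` would produce one string containing every text file in the (hypothetical) `my_texts` folder as a page.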

Altered Display Titles
Hi, I am using MediaWiki 1.18.2 and I have been experimenting with the XML Import process.

I have noticed that the title actually displayed on each page that has been altered by the Import process appears as Special:Import until it is edited and saved again. I assume that this is supposed to indicate that the page has been edited by the Import process, but it can be very confusing to less knowledgeable users, and it also means the page RDF links produced with the Semantic MediaWiki process are incorrectly rendered.

I have noticed a similar process of the display name being cosmetically changed after other forms of mass edits, such as the MassEditRegex extension, so I assume this is probably more of a core MediaWiki process, but I have not been able to find any information about this issue.

I would love to be able to turn this feature off, or perhaps at least be able to hide it for certain groups of users, any help would be greatly appreciated.

Thanks Jpadfield (talk) 11:41, 5 April 2012 (UTC)

No Page error
When I try to import a template XML file from Wikipedia, I receive an error message that says "No page available to import." Any ideas why it won't find the XML file, and what workarounds there are? 12.232.253.2 15:56, 26 April 2012 (UTC)


 * First thing to check is that the XML file actually has any nodes within. --98.210.170.91 18:49, 5 May 2012 (UTC)
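A quick way to run that check, sketched in Python under the assumption that the nodes in question are the `page` elements of the export. The function name is hypothetical; it streams the file, so it also works on large dumps.

```python
import xml.etree.ElementTree as ET


def count_pages(source):
    """Count <page> elements in an export; source is a file path or a binary file object."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Compare the local name so the check works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
            elem.clear()  # free the page's content once it has been counted
    return count
```

If this returns 0 for the downloaded file, the export itself contains no pages and the importer has nothing to work with.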

Manual error -- importImages.php ?
Why on the manual page for importdump are there examples referencing importImages.php? Looks like a cut n paste error to me.

Can't open file error
For some reason I keep getting this error suddenly in my MW 1.18. I did not have this problem in the past, I import xml files almost every month. I tried importing a small file of a single page I exported from the same wiki, but the problem persists. Obviously the files have no permissions problems.

Any idea what could be the cause of this? I'm using PHP 5.2.17, on IIS windows 2008 r2. Thank you Osishkin (talk) 23:17, 19 August 2012 (UTC)

XML Imported pages don't seem to show up in search?
I imported several hundred pages through an xml import (via import pages) and none of these pages appear in the search or auto-suggest when I start typing them into the search box. I tried seeing if somehow this was a queued job (it wasn't) as well as creating new pages afterwards to check if there was some kind of lag before new pages appear.

It seems like imported pages somehow don't get recognized as part of Mediawiki's search or auto-suggest. I specifically created/imported these pages intending to use them as simplified pages that people could see come up in search or auto-suggest, yet it seems that somehow they are not indexed?

Any help would be greatly appreciated.
 * I found an answer. Manually imported pages aren't necessarily added to the search index. I believe you have to run two scripts: updateSearchIndex.php and rebuildtextindex.php.


 * However, you need to specify a few parameters for updateSearchIndex.php such as a starting date. The below command worked for me


 * "php maintenance/updateSearchIndex.php -s 20081020224040"


 * I was only interested in getting the page titles to come up as searchable in the auto-suggest, so I think the updateSearchIndex.php did the trick for me. The date used is some random date that's before when I imported pages but if you have an older wiki you may need to make a modification.
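The start date in the command above looks like a 14-digit YYYYMMDDHHMMSS timestamp. A small Python sketch for producing one from a chosen date, so the value does not have to be typed by hand; the helper names are hypothetical, and only the `-s` option quoted above is used.

```python
from datetime import datetime


def mw_timestamp(dt):
    """Format a datetime as a 14-digit YYYYMMDDHHMMSS string."""
    return dt.strftime("%Y%m%d%H%M%S")


def update_search_index_command(start):
    """Build the argument list for the command quoted above, starting from `start`."""
    return ["php", "maintenance/updateSearchIndex.php", "-s", mw_timestamp(start)]


# Pick any moment before the import took place.
print(" ".join(update_search_index_command(datetime(2008, 10, 20, 22, 40, 40))))
```

The list form can also be handed to `subprocess.run` from the wiki's root directory if you prefer to launch the script from Python.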

Importing pages with question marks in the title
It would seem that when one imports pages with question marks in the title, and then navigates to those pages, one gets: "The requested page title was invalid, empty, or an incorrectly linked inter-language or inter-wiki title. It may contain one or more characters which cannot be used in titles." See also the comment at w:Template talk:?, "If you get the Bad Title page with the following text it means that you tried to enter the url 'yoursite.com/wiki/Template:?' instead of searching for Template:? then clicking on 'create page'". As an example, see http://rationalwikiwikiwiki.org/wiki/RationalWikiWiki:What_is_going_on_at_RationalWiki%3F, which resulted from importing http://rationalwikiwiki.org/wiki/RationalWikiWiki:What_is_going_on_at_RationalWiki%3F. Leucosticte (talk) 19:31, 2 September 2012 (UTC)

Remove a node from a dump after import
Is there an "easy" way to edit Import.php to remove an XML node after it's been properly imported?

Right now my server is parsing the whole en dump, and every time I restart, it has to read through everything it has already imported. Even at ~200 rev/sec, it would take days of non-stop running just to read to the end of the file, let alone import, so I wanted to try deleting everything from the compressed dump on the fly as it's imported. Any ideas?

69.179.85.49 21:40, 7 March 2013 (UTC)

Optimizing of database after import
It says "Optimizing of database after import is recommended: it can reduce database size in two or three times." but doesn't explain how to do that. --Rob Kam (talk) 07:22, 4 March 2015 (UTC)
Okay, here is the answer: https://www.mediawiki.org/wiki/Manual:Reduce_size_of_the_database

Import using MWDumper
Please help! When I import a big wiki dump using MWDumper, after 1-2 hours there is an error and the import process stops. How can I avoid or fix this error? Is it an error in the Wikipedia dumps? Antimatter genesis (talk) 13:00, 26 May 2015 (UTC)

speed. method of copying /var/lib/mysql
"running importDump.php can take quite a long time. For a large Wikipedia dump with millions of pages, it may take days, even on a fast server" - i know a fast way but it works only with myisam but mediawiki installs with innodb by default. it is copying files in /var/lib/mysql. as i remember, copying single databases is possible. probably there are tools that make it possible to copy also innodb. of course, to use this method with wikipedia, wikipedia admins should use it. --Qdinar (talk) 12:06, 2 December 2015 (UTC)
 * Off-topic: I am now importing ttwiki-20151123-pages-meta-current.xml.bz2 ("All pages, current versions only", 39.4 MB) in XAMPP on Windows 10 with an Intel Core i3-2348M CPU. It reported 0.84 pages/sec at the start, 0.37 pages/sec before sleep, 0.19 pages/sec after wake-up, and 0.20 pages/sec now. The tt Wikipedia has nearly 70000 articles. Many hours have passed; I do not know how many. I am now at the 16400th article, so it should be nearly 16400 / 0.4 = 41000 seconds = 11 hours. --Qdinar (talk) 11:46, 3 December 2015 (UTC)
 * A method with rsync that requires stopping the source daemon only for a short time: http://dba.stackexchange.com/a/2893 . --QDinar (talk) 07:28, 12 August 2016 (UTC)

importdump did not work with XMLReader::read etc issues
What helped: I opened the dump XML with Xcode and saw a "Warning: Constant folding feature in Zend OPcache..." line. I removed this line, saved the file, and then the import worked. It seems that if you make the dump while your MediaWiki is in an unstable state (this Zend OPcache issue), you are not able to do the importDump.--Donxello (talk) 11:01, 21 July 2016 (UTC)

Recent Changes not showing up?
After importing a bunch of pages with  I run   and even   but see not a trace of the new pages. They're "there"; I can see them just fine in All Pages, and each individual page is there (complete with revision history from the XML), but they just don't show up (nor do their revisions) in Special:RecentChanges. Any suggestions for where to start looking? Version 1.26.4.

If this is a known bug that's been fixed in 1.27, that's fine, I'll upgrade at some point, but a pointer to how I could backport would be appreciated!

Sk4p (talk) 19:22, 30 September 2016 (UTC)


 * If the dump is older than $wgRCMaxAge, imported entries won't be displayed. --Ciencia Al Poder (talk) 09:28, 3 October 2016 (UTC)

Wrong Red links
Dear all, I've imported a system of interdependent templates and normal pages depending on them. My wiki does not resolve the dependencies: newly imported pages are displayed red, so the pages depending on other pages (templates) don't function. How can I tell my wiki to look for the pages? Thank you in advance. Yours, Ciciban (talk) 10:37, 19 April 2017 (UTC)


 * The refresh of those links is done in the Manual:Job queue. Try runJobs.php. If you imported the dump with importDump.php using --no-updates, you should run rebuildall.php instead. --Ciencia Al Poder (talk) 09:26, 20 April 2017 (UTC)

MWDumper
MWDumper is no longer maintained and has unaddressed bugs that result in import failure. Should this page really be providing links to something that doesn't work? - X201 (talk) 09:21, 20 November 2017 (UTC)

Pages don't appear in categories after an XML dump import
Problem related to this topic and this topic. When will this bug be fixed ? Users can't launch the refreshLinks.php script after an XML import... Thanks in advance ! --Megajoule (talk) 09:53, 11 April 2018 (UTC)


 * What's the problem? Categorization occurs in the job queue. If the wiki is running jobs continuously, the pages will be categorized sooner or later, and it shouldn't take too much time (from minutes to a couple of hours depending on the quantity of pages imported). No need to run refreshLinks.php. --Ciencia Al Poder (talk) 09:15, 12 April 2018 (UTC)


 * No, categorization doesn't occur in the job queue when you import an XML dump. runJobs.php doesn't do anything. You need to run refreshLinks.php. Moreover, the fact that the categorization is asynchronous is a real problem. This change in the categorization process is a bad thing, especially when your wiki uses semantic extensions. This could have been a config choice; it should not have been imposed. --Megajoule (talk) 10:53, 12 April 2018 (UTC)


 * If categorization doesn't occur in the job queue when you import an XML dump, you should report a bug about that (if it doesn't already exist). The links you posted to bugs have been resolved since. This should be done in a job queue. There's no way a server can handle the import of lots of pages and parse the wikitext, including filling various related tables like pagelinks, externallinks, categorylinks, templatelinks, etc and their extensions on the main import thread, in a timely manner without bringing the entire site down. --Ciencia Al Poder (talk) 09:17, 13 April 2018 (UTC)