| Data Transfer
Release status: stable
|Description||Allows for importing and exporting the contents of a wiki's pages in XML and CSV form, using template calls to define the fields|
|Author(s)||Yaron Koren <firstname.lastname@example.org>|
|Latest version||1.5 (November 2022)|
|Compatibility policy||Master maintains backward compatibility.|
|License||GNU General Public License 2.0 or later|
|Example||The "view XML" page for Discourse DB|
Data Transfer is an extension to MediaWiki that allows users to both export and import data from and to the wiki, with export done in XML format and import possible in XML, CSV, and some spreadsheet formats.
Note that Data Transfer is not well suited to backing up a wiki or transferring wiki pages from one MediaWiki site to another; for those tasks, MediaWiki's built-in "Special:Export" and "Special:Import" pages are the better choice.
Code and download
You can download the Data Transfer code, in .zip format, here.
Or download the code via Git from the MediaWiki source code repository by running this command from the extensions directory:
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/DataTransfer
To view the code online, including version history for each file, you can go here.
After you've obtained a 'DataTransfer' directory (either by extracting a compressed file or downloading via Git), place this directory within the main MediaWiki 'extensions' directory. Then, in the file 'LocalSettings.php' in the main MediaWiki directory, add the following line:
wfLoadExtension( 'DataTransfer' );
By default, only administrators (sysops) are allowed to import files. To let other groups import files, add further lines to LocalSettings.php. This line, for example, allows all logged-in users to import files:
$wgGroupPermissions['user']['datatransferimport'] = true;
To allow anyone reading the wiki, including users who are not logged in, to import files, you could add the following (though this is not usually recommended):
$wgGroupPermissions['*']['datatransferimport'] = true;
Data Transfer defines a special page, "Special:ViewXML", that lets users view (and thus save) the pages in any combination of the wiki's categories and namespaces in XML form. The fields and values in the XML are taken from the fields and values in any template calls contained in the page; any non-template text is put into one or more "free text" tags. In addition, an "ID" field is displayed for every page, using MediaWiki's internal "article ID" for that page; this is done so that outside systems can track a page with a more stable identifier than its name (which can change often). The XML contains only the current state of each page: authors, modification dates and previous versions of the page are not recorded.
Two formats for export are supported. The first, or standard, one contains tags of the form <Template name="template-name"> and <Field name="field-name">. The second, or "simplified", one contains tags of simply the form <template-name> and <field-name>.
Special:ViewXML can also be used to generate XML for individual pages, by adding a "&titles=" parameter to the URL, like "&titles=Page 1|Page 2|Page 3".
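As a rough illustration, such a URL could be assembled with a short Python script. This is only a sketch: the wiki address is a placeholder, and it assumes the wiki is reached through "index.php" with a "title=" query parameter, which may differ on your installation. Only the "titles" parameter and its "|" separator come from the description above.

```python
from urllib.parse import urlencode


def view_xml_url(index_php_url, titles):
    """Build a Special:ViewXML URL for a list of page titles.

    index_php_url is a hypothetical address of the wiki's index.php;
    the page titles are joined with "|" as described above.
    """
    query = urlencode(
        {"title": "Special:ViewXML", "titles": "|".join(titles)},
        safe=":",  # keep the colon in "Special:ViewXML" readable
    )
    return f"{index_php_url}?{query}"


# Placeholder wiki address, used for illustration only.
url = view_xml_url(
    "https://wiki.example.org/index.php",
    ["Page 1", "Page 2", "Page 3"],
)
print(url)
```

Spaces and the "|" separator are percent-encoded here so that the URL can be pasted into scripts or command-line tools without further quoting.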
By default, the "free text" (non-template) part of a page is parsed by the MediaWiki parser, so that wikitext gets converted into HTML; whereas the values within template calls are not. To disable parsing for the free text, add the following to LocalSettings.php:
$wgDataTransferViewXMLParseFreeText = false;
Conversely, to add parsing for template field values, add the following:
$wgDataTransferViewXMLParseFields = true;
Data Transfer defines three special pages, "Special:ImportXML", "Special:ImportCSV" and "Special:ImportSpreadsheet", that let users with administrator privileges upload XML, CSV and assorted spreadsheet files, respectively. Once uploaded, the data is turned into pages in the wiki (or, if pages with those names already existed in the wiki, new versions of those pages).
Importing XML files
The XML import requires the standard (i.e., non-simplified) XML format that "ViewXML" produces, with two differences: the "ID" attribute for each page should not be present, and tags called "Category" or "Namespace" (in whatever language the wiki is in) should not be present.
XML simplified output
<pages>
<page>
<id>28</id>
<title>Limburger</title>
<Free_Text id="1">
<p><b>Limburger</b> is a cheese that originated in the Herve area of the historical Duchy of Limburg.</p>
</Free_Text>
</page>
</pages>
XML standard format (input and output)
This is both an output format and the format that is needed for importing data in XML format.
<Pages>
<Page Title="Limburger">
<Free_Text>
<p><b>Limburger</b> is a cheese that originated in the Herve area of the historical Duchy of Limburg.</p>
</Free_Text>
</Page>
</Pages>
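If an outside system needs to read this format, any XML parser should do. The following Python sketch assumes a file shaped exactly like the sample above (a "Pages" root, "Page" elements with a "Title" attribute, and one "Free_Text" element per page); real exports may contain template tags and other details that it ignores. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample above, used as test input.
SAMPLE = """<Pages>
<Page Title="Limburger">
<Free_Text>
<p><b>Limburger</b> is a cheese.</p>
</Free_Text>
</Page>
</Pages>"""


def free_text_by_title(xml_text):
    """Return a dict mapping each page title to its free text, markup removed."""
    result = {}
    root = ET.fromstring(xml_text)
    for page in root.iter("Page"):
        free = page.find("Free_Text")
        # itertext() walks the nested markup and yields only the text parts.
        text = "".join(free.itertext()).strip() if free is not None else ""
        result[page.get("Title")] = text
    return result


pages = free_text_by_title(SAMPLE)
print(pages)
```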
You can also import content into page "slots" other than the main one, using MediaWiki's Multi-Content Revisions feature, by adding the "Slot" attribute, like so:
<Pages>
<Page Title="Limburger" Slot="text-notes">
<Free_Text>
<p><b>Limburger</b> is a cheese that originated in the Herve area of the historical Duchy of Limburg.</p>
</Free_Text>
</Page>
</Pages>
The text within the "Free_Text" field cannot be indented; indentation there has the same effect as indenting text in an article. If the Free_Text is in HTML and is indented, all of the records will fail to import. If the HTML text is not indented, the records will import fine.
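If you generate the import file with a script, one way to avoid accidental indentation is to strip leading whitespace from every line of the free text before writing it. The Python sketch below does this with the standard library; the function name and the input structure are invented for this example, and it only produces the "Pages", "Page" and "Free_Text" elements shown in the samples above. Note that ElementTree escapes characters such as "<" in text, so the sketch uses wikitext rather than HTML.

```python
import xml.etree.ElementTree as ET


def build_import_xml(pages):
    """Build standard-format XML from (title, free_text) pairs.

    Leading whitespace is removed from each line of free text,
    so that nothing inside Free_Text ends up indented.
    """
    root = ET.Element("Pages")
    for title, free_text in pages:
        page = ET.SubElement(root, "Page", {"Title": title})
        free = ET.SubElement(page, "Free_Text")
        free.text = "\n".join(line.lstrip() for line in free_text.splitlines())
    # No pretty-printing: that would add indentation around the elements.
    return ET.tostring(root, encoding="unicode")


xml_out = build_import_xml(
    [("Limburger", "   '''Limburger''' is a cheese.\n      It originated in Herve.")]
)
print(xml_out)
```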
Importing CSV files
For CSV import to work:
- The file must be in standard CSV format (i.e., separated by commas, as opposed to semicolons or anything else)
- If the file contains non-ASCII characters it must be encoded in either UTF-8 or UTF-16 (the latter being simply called "Unicode" in some Windows programs)
- The file's line breaks should contain line feeds ("\n"), as opposed to just carriage returns ("\r"). This is especially relevant if you're using Mac OS
- The top row must contain the name of each column
- One of the columns must contain the title of each page, and so its column name must be "Title" (in whatever language the wiki is in)
- Another column can contain all the free, non-template text in the page: the title of this column must be "Free Text" (again, in the language of the wiki)
- Every other column must represent the contents of a single field of a single template call; the name of such a column should be of the form "template-name[field-name]" (whitespace allowed). There is no need to separately specify the names of the template(s) called in the page.
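Before uploading, the column names can be checked against the rules above with a small script. This Python sketch is not part of Data Transfer; the function name and the regular expression are assumptions made for illustration, and it assumes an English-language wiki, where the special columns are called "Title" and "Free Text".

```python
import re

# template-name[field-name], with optional whitespace around each part.
COLUMN_PATTERN = re.compile(r"^\s*([^\[\]]+?)\s*\[\s*([^\[\]]+?)\s*\]\s*$")


def check_header(columns):
    """Return a list of problems found in a CSV header row (empty if none)."""
    problems = []
    if "Title" not in columns:
        problems.append('missing "Title" column')
    for name in columns:
        if name in ("Title", "Free Text"):
            continue  # the two special columns need no template[field] form
        if not COLUMN_PATTERN.match(name):
            problems.append(
                f'column "{name}" is not of the form template-name[field-name]'
            )
    return problems


good = check_header(["Title", "Cheese[Country]", "Cheese [Texture]", "Free Text"])
bad = check_header(["Name", "Country"])
print(good)
print(bad)
```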
A brief tutorial on the CSV format: if a value contains a comma, you must enclose it in double quotes. If a value that is enclosed in double quotes itself contains one or more double quotes, each of those must be escaped by doubling it (""). An empty field can either be left empty or be written as a pair of double quotes (""). You can see here for the full CSV specification.
Here is an example of a CSV file that can be parsed by Data Transfer:
Title,Cheese[Country],Cheese[Texture],Free Text
Mozarella,Italy,Semi-soft,It's good on pizzas!
Cheddar,England,Hard/semi-hard,"Often sharp, but not always."
Gorgonzola,Italy,"buttery or firm, crumbly","salty, with a ""bite"" from its blue veining"
Stilton,,"",needs more data
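A file like this does not have to be written by hand. As a minimal sketch, Python's standard csv module applies the quoting rules described above automatically; the helper name and the choice of columns are assumptions taken from the cheese example, not something Data Transfer prescribes.

```python
import csv
import io

# Column names follow the example above (English-language wiki assumed).
FIELDNAMES = ["Title", "Cheese[Country]", "Cheese[Texture]", "Free Text"]


def pages_to_csv(pages):
    """Serialize a list of dicts to CSV text with "\n" line endings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(pages)
    return buffer.getvalue()


csv_text = pages_to_csv([
    {"Title": "Cheddar", "Cheese[Country]": "England",
     "Cheese[Texture]": "Hard/semi-hard",
     "Free Text": "Often sharp, but not always."},
    {"Title": "Gorgonzola", "Cheese[Country]": "Italy",
     "Cheese[Texture]": "buttery or firm, crumbly",
     "Free Text": 'salty, with a "bite" from its blue veining'},
    {"Title": "Stilton", "Cheese[Country]": "", "Cheese[Texture]": "",
     "Free Text": "needs more data"},
])
print(csv_text)
```

When saving the result, open the file with encoding="utf-8" and newline="" so that the encoding and line endings stay as generated.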
Importing spreadsheet files
For the spreadsheet import, Data Transfer requires the presence of the PhpSpreadsheet library, which does the actual spreadsheet processing. PhpSpreadsheet can handle spreadsheet files in formats including .xls, .xlsx, .ods, Gnumeric, and even PDF and HTML. The titles of the columns should be the same as for CSV files.
Data Transfer was mostly written by Yaron Koren, reachable at yaron57 -at- gmail.com. Important functionality was also added by Stephan Gambke and Sahaj Khandelwal.
Data Transfer is currently at version 1.5. See the entire version history.
- The import of each page is a MediaWiki background "job". This means that the page creations will not be done immediately, and may take minutes, hours or even longer to complete. Normally, jobs get activated every time a page is viewed on the wiki; to speed up the process (or slow it down), you can change the number of jobs run when a page is viewed, by setting $wgJobRunRate; the default is 1. A job run rate that is too high can conceivably cause a problem such that some jobs don't run. To have the wiki run all jobs immediately, execute the script runJobs.php from the operating system command line.
Customizing the export XML
You can specify that any specific page not be included in the XML produced, by adding the category tag "[[Category:Excluded from XML]]" to that page. You can also add this tag to a template, to exclude any page that uses that template from the XML.
Contributing to the project
Bugs and feature requests
You should use the MediaWiki mailing list, mediawiki-l, for any questions, suggestions or bug reports about Data Transfer.
Contributing patches to the project
If you have found a bug and fixed it, or if you have written code for a new feature, please either do a Git commit of it (if you have a developer account), or create a patch by going to your local "DataTransfer" directory and typing:
git diff > descriptivename.patch
If you create a patch, please send it, with a description, to Yaron Koren.
Translation of Data Transfer is done through translatewiki.net. The translation for this extension can be found here. To add language values or change existing ones, you should create an account on translatewiki.net, then request permission from the administrators to translate a certain language or languages on this page (this is a very simple process). Once you have permission for a given language, you can log in and add or edit whatever messages you want to in that language.
- Page import - overview of all page import tools