Extension talk:Data Transfer

From mediawiki.org

Adding prefix to title?

I'm looking to import a CSV file with at least 1,000 entries in it, but I want them to go into a separate namespace and provide transclusion only. Is there any way to achieve this?

What do you mean by transclusion? Yaron Koren (talk) 14:35, 18 October 2019 (UTC)
Like this: Transclusion. I don't want the actual page searchable in the wiki, but I want the data available for other pages to use. So I'm creating a custom namespace, "Asset", where all the asset data will be dumped and stored in Cargo tables; the actual article pages will then run a query to find their relevant asset data.
Oh, I thought the transclusion thing was related to the data transfer part. You just need to add the namespace to every title in the CSV file. There are various ways to do that - one is by editing the data in a spreadsheet, then saving it back to CSV. Within a spreadsheet, you can create a separate column with just the namespace (and colon), then merge that column and the title column into one. Yaron Koren (talk) 16:47, 18 October 2019 (UTC)
That's what I was afraid of, was hoping I could apply something at time of import. But I think I can get awk to do what I need it to do. Thanks!
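For anyone who would rather script this step than use a spreadsheet or awk, here is a minimal Python sketch of the same idea. It assumes the title column is named "Title"; the "Asset[Serial]" column and the sample rows are made up for illustration.

```python
import csv
import io


def add_namespace(csv_text, namespace, title_column="Title"):
    """Return csv_text with 'namespace:' prepended to every value in title_column."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        title = row[title_column]
        # Skip titles that already carry the prefix, so the script can be re-run safely.
        if not title.startswith(namespace + ":"):
            row[title_column] = namespace + ":" + title
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data; only the "Title" column name comes from the extension.
sample = 'Title,Asset[Serial]\n"Laptop 01",A123\n"Laptop 02",B456\n'
prefixed = add_namespace(sample, "Asset")
print(prefixed)
```

For a real file, read it with open(path, newline="", encoding="utf-8") and write the returned text back out the same way.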

Error with MW 1.34

/Special:ImportCSV Error from line 54 of ...extensions/DataTransfer/specials/DT_ImportCSV.php: Cannot access private property ImportStreamSource::$mHandle

Backtrace:

#0 ...extensions/DataTransfer/specials/DT_ImportCSV.php(29): DTImportCSV->importFromUploadAndModifyPages()
#1 ...includes/specialpage/SpecialPage.php(575): DTImportCSV->execute()
#2 ...includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run()
#3 ...includes/MediaWiki.php(296): MediaWiki\Special\SpecialPageFactory->executePath()
#4 ...includes/MediaWiki.php(900): MediaWiki->performRequest()
#5 ...includes/MediaWiki.php(527): MediaWiki->main()
#6 ...index.php(44): MediaWiki->run()
#7 {main}
Sorry about that - I just checked in what I think is a fix for this. Please let me know if there's still a problem! Yaron Koren (talk) 22:48, 7 January 2020 (UTC)
Thanks. I am now able to import the data, but there's an unrelated problem having to do with LF handling. Acnetj (talk) 00:59, 8 January 2020 (UTC)
What is the appropriate version to use for 1.34.x? REL1_34 results in v1.0.1 (1fc1c61) 04:42, 20 September 2019 - Revansx (talk) 23:41, 5 May 2020 (UTC)
You should never use the REL version of any of my extensions. You should either use the most recent version, or just the latest code. Yaron Koren (talk) 23:44, 5 May 2020 (UTC)
roger that .. now the trick is to remember to always check to see if an extension is one of yours. thx - Revansx (talk) 00:24, 6 May 2020 (UTC)
It's not just my extensions; it's any extension that has the "master" compatibility policy. Yaron Koren (talk) 02:18, 6 May 2020 (UTC)
Gotcha. I just learned something new. cool. - Revansx (talk) 02:59, 6 May 2020 (UTC)

LF handling

With the older version, a CSV field could include a newline (LF) without being parsed as a separate entry (this is needed to include multi-line free text for the wiki). The current master (which fixes the error above), however, handles LF as if it were CRLF.

I changed this in DT_ImportCSV.php, and the handling is now correct:

line 133 (before): 		$table = str_getcsv( $csvString, "\n" );

line 133 (after): 		$table = str_getcsv( $csvString, "\r\n" );

But it ignored the extra LF in the field. Acnetj (talk) 01:31, 8 January 2020 (UTC)

This fix works for me in this particular instance (because I use Ctrl+Enter to add an extra line in LibreOffice). I don't know whether it works in other cases. I think the parser should respect the double quotation marks for line breaks, the same way it does for commas.

Acnetj (talk) 01:52, 8 January 2020 (UTC)
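To illustrate the quote-aware behaviour being asked for, here is a small Python sketch (not the extension's PHP code) that compares splitting on every LF with a parser that keeps line breaks inside double-quoted fields. The sample data is made up.

```python
import csv
import io

# A CSV export where the second field of "Page A" contains an LF inside quotes,
# while records themselves end in CRLF.
raw = 'Title,Free Text\r\n"Page A","first line\nsecond line"\r\n"Page B","single line"\r\n'

# Naive approach: treat every LF as the end of a record.
# The quoted field is cut in two, giving one row too many.
naive_rows = raw.replace("\r\n", "\n").strip("\n").split("\n")

# Quote-aware approach: the csv module keeps the LF inside the quoted field.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # 4
print(len(parsed_rows))  # 3
print(parsed_rows[1])    # ['Page A', 'first line\nsecond line']
```

The point is only that record splitting and quote handling have to happen in the same pass; splitting on a line terminator first loses the information about which line breaks were quoted.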

Sorry about that! What a strange bug in PHP. I just checked in code that I think works better. Yaron Koren (talk) 19:35, 8 January 2020 (UTC)
I am getting
Unable to construct a valid title from "".

with the latest update. Acnetj (talk) 02:13, 10 January 2020 (UTC)

What encoding is the file in, do you know? Yaron Koren (talk) 04:12, 10 January 2020 (UTC)
UTF-8. Just a simple file with one line of data plus header. Acnetj (talk) 19:18, 10 January 2020 (UTC)
I can't reproduce that problem. Was this exact file working before? Yaron Koren (talk) 20:21, 10 January 2020 (UTC)
Sorry, I just found it was a problem on my end. Just bad data. Things are working as they should for now. Acnetj (talk) 21:40, 10 January 2020 (UTC)
Great, that's a relief! Yaron Koren (talk) 21:43, 10 January 2020 (UTC)

Problem with overwriting fields of the template

MW 1.31, importing a CSV file encoded as UTF-16 LE with signature

I am trying to update some content for a specific template.

For example, existing content for page "Acilius":

{{WoodhouseENELnames
 |Text=[[File:woodhouse_999.jpg|thumb
 |link={{filepath:woodhouse_999.jpg}}]]Ἀκύλιος, ὁ.
 }}

Or it could look like this:

{{WoodhouseENELnames
 |Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Ἀκύλιος, ὁ.
 }}

To update it, I use this file content (with the option "Overwrite only fields contained in the file"; other templates also exist on the page):

 Title,WoodhouseENELnames[Text]
 "Acilius","[[Ἀκύλιος]], ὁ."

Result:

{{WoodhouseENELnames
 |Text=[[Ἀκύλιος]], ὁ.|thumb
 |link=&# 123;&# 123;filepath:woodhouse_999.jpg&# 125;&# 125;]]Ἀκύλιος, ὁ.
  }}

(I deliberately added a space in the middle of each entity so that it does not get parsed here.) So, for some strange reason, the thumbnail, link and old text are retained, and some text is corrupted and/or turned into HTML entities. When I select "Overwrite existing content" or "Append to existing content", no such problems occur, but I cannot do that, as other templates exist on those pages.

I suspect that the "|link=" bit is parsed as an extra field, when in fact it isn't.

I even tried first removing this bit from the content:

[[File:woodhouse_999.jpg|thumb
 |link={{filepath:woodhouse_999.jpg}}]]

And then I tried the import again, and got this error:

Error: the column 0 header, 'ÿþTitle', must be either 'Title', 'Free Text' or of the form 'template_name[field_name]'
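The "ÿþ" in front of "Title" is what the two bytes of a UTF-16 LE byte-order mark (FF FE) look like when they are read as Latin-1, which suggests the signature was being treated as part of the first header. One way to sidestep this, sketched below in Python, is to re-save the file as UTF-8 without a BOM before importing. The function name and the sample header line are only for illustration.

```python
import codecs


def to_utf8_without_bom(data):
    """Decode UTF-16 (with BOM) or UTF-8 (with or without BOM) bytes; return plain UTF-8."""
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        text = data.decode("utf-16")      # the "utf-16" codec consumes the BOM
    else:
        text = data.decode("utf-8-sig")   # strips a UTF-8 BOM if one is present
    return text.encode("utf-8")


# Simulated start of a UTF-16 LE file "with signature".
utf16_file = codecs.BOM_UTF16_LE + "Title,WoodhouseENELnames[Text]\n".encode("utf-16-le")

# The BOM bytes, misread as Latin-1, give the two stray characters from the error.
misread_bom = utf16_file[:2].decode("latin-1")
print(misread_bom)  # ÿþ

converted = to_utf8_without_bom(utf16_file)
print(converted)    # b'Title,WoodhouseENELnames[Text]\n'
```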

Edit: I have managed to partly resolve the field corruption on import by commenting out some regexes in DataTransfer\includes\DT_PageStructure.php:

// $page_contents = preg_replace( '/{{(#.+)}}/', '&# 123;&# 123;$1&# 125;&# 125;', $page_contents );
// escape out transclusions, and calls like "DEFAULTSORT"
// $page_contents = preg_replace( '/{{(.*:.+)}}/', '&# 123;&# 123;$1&# 125;&# 125;', $page_contents );
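For readers following along, here is a rough Python rendering of the second regex, applied to one line of the existing page text. It assumes the real replacement string uses the entities without the spaces that were added for display here; the function name is made up.

```python
import re


def escape_transclusions(page_contents):
    # Python equivalent of the pattern '/{{(.*:.+)}}/' quoted above.
    # Assumption: the real replacement uses "&#123;" and "&#125;" without spaces.
    return re.sub(r"\{\{(.*:.+)\}\}", r"&#123;&#123;\1&#125;&#125;", page_contents)


line = "|link={{filepath:woodhouse_999.jpg}}]]Ἀκύλιος, ὁ."
escaped = escape_transclusions(line)
print(escaped)
```

The output has the same shape as the "|link=" line in the corrupted result shown earlier, which is consistent with this regex being the source of the entities.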

But I still cannot find out how to stop the import from breaking the line with the image link, turning

[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]

into

[[File:woodhouse_999.jpg|thumb
 |link={{filepath:woodhouse_999.jpg}}]]

Or when the existing text contains instances like

[[comparing|compare]] 

They get turned into

[[comparing
 |compare]] 

or even mixed with other words.

For example this text:

[[leave in the lurch]]: [[prose|P.]] and [[verse|V.]] [[λείπω]], [[λείπειν]] 

Looked like this when imported:

 [[leave in the lurch]]: [[prose
 |προλείπειν]], ἀμείβειν (Plat. but rare P.), V. ἐξαμείβειν, ἐκλιμπάνειν.

Also, when importing text that contains headings (marked with equals signs, for example a third-level heading), line breaks are lost, the heading is joined to the text above it, and the text gets mangled. For example, when importing something like:

 [[of what kind]]? [[prose|P.]] and [[verse|V.]] [[ποῖος]]; indirect: [[prose|P.]] and [[verse|V.]] [[οἷος]], [[ὁποῖος]]. 
  ===adjective===
 
  [[prose|P.]] and [[verse|V.]] [[πρᾶος]], [[ἤπιος]]

The result is:

 of what kind? P. and V. ποῖος; indirect: P. and [[verse |V.]] οἷος, ὁποῖος.===adjective===

So I am guessing that there is some pre-processing going on before the import which is affected by what the currently existing text looks like, which is quite strange.


I'm having similar problems. MW 1.33.1 / Data Transfer 1.1.1. Using ImportCSV with the option "Overwrite only fields contained in the file", the output has braces like { in my existing content replaced by &# 123; etc. Commenting out the relevant lines in DataTransfer\includes\DT_PageStructure.php introduces new lines and hence whitespace. Am I using the feature incorrectly? Is there any workaround? Any comments? --Fehlerx (talk) 17:29, 30 May 2020 (UTC)
In the file DT_WikiTemplate.php, line 33:
                //if ( $multi_line_template )
                //      $text .= "\n";

Comment out those two lines, and it should take care of the unwanted line breaks. I had the same issue, and it was REALLY annoying. 05:42, 5 October 2020 (UTC)

I believe this problem has now finally been fixed in the Data Transfer code. Sorry for the very long delay. Yaron Koren (talk) 17:22, 18 February 2021 (UTC)

Internal error on Special:ImportSpreadsheet in MW 1.34.x

  • MediaWiki 1.34.1 (b1f6480) 18:15, 30 April 20
  • PHP 7.2.30 (apache2handler)
  • Data Transfer 1.1.1 (1fc1c61) 04:42, 20 September 2019
  • phpoffice/phpexcel dev-master
[XrIAjzTNAOBUFE21T0p@MQAAAAo] /test/Special:ImportSpreadsheet PHPExcel_Reader_Exception from line 73 of /opt/htdocs/mediawiki/vendor/phpoffice/phpexcel/Classes/PHPExcel/Reader/Excel2007.php: Could not open for reading! File does not exist.

Backtrace:

#0 /opt/htdocs/mediawiki/vendor/phpoffice/phpexcel/Classes/PHPExcel/IOFactory.php(281): PHPExcel_Reader_Excel2007->canRead(NULL)
#1 /opt/htdocs/mediawiki/vendor/phpoffice/phpexcel/Classes/PHPExcel/IOFactory.php(191): PHPExcel_IOFactory::createReaderForFile(NULL)
#2 /opt/htdocs/mediawiki/extensions/DataTransfer/specials/DT_ImportSpreadsheet.php(42): PHPExcel_IOFactory::load(NULL)
#3 /opt/htdocs/mediawiki/extensions/DataTransfer/specials/DT_ImportCSV.php(60): DTImportSpreadsheet->importFromFile(ImportStreamSource, NULL, array)
#4 /opt/htdocs/mediawiki/extensions/DataTransfer/specials/DT_ImportCSV.php(29): DTImportCSV->importFromUploadAndModifyPages()
#5 /opt/htdocs/mediawiki/includes/specialpage/SpecialPage.php(575): DTImportCSV->execute(NULL)
#6 /opt/htdocs/mediawiki/includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run(NULL)
#7 /opt/htdocs/mediawiki/includes/MediaWiki.php(296): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#8 /opt/htdocs/mediawiki/includes/MediaWiki.php(900): MediaWiki->performRequest()
#9 /opt/htdocs/mediawiki/includes/MediaWiki.php(527): MediaWiki->main()
#10 /opt/htdocs/mediawiki/index.php(44): MediaWiki->run()
#11 {main}

- Revansx (talk) 00:22, 6 May 2020 (UTC)

Here are the pertinent lines from the debug data:

Unstubbing $wgLang on call of $wgLang::_unstub from ParserOptions->__construct

[error] [XrIKvf9myczl4IVZkwmjCgAAAAs] /test/Special:ImportSpreadsheet ErrorException from line 39 of /opt/htdocs/mediawiki/extensions/DataTransfer/specials/DT_ImportSpreadsheet.php: PHP Warning: stream_get_meta_data() expects parameter 1 to be resource, object given

[exception] [XrIKvf9myczl4IVZkwmjCgAAAAs] /test/Special:ImportSpreadsheet PHPExcel_Reader_Exception from line 73 of /opt/htdocs/mediawiki/vendor/phpoffice/phpexcel/Classes/PHPExcel/Reader/Excel2007.php: Could not open for reading! File does not exist.

- Revansx (talk) 01:01, 6 May 2020 (UTC)

Any plans to switch to phpspreadsheet? (versus phpexcel)

  • MediaWiki 1.34.1 (b1f6480) 18:15, 30 April 20
  • PHP 7.2.30 (apache2handler)
  • Data Transfer 1.1.1 (1fc1c61) 04:42, 20 September 2019
  • phpoffice/phpexcel dev-master

I upgraded phpexcel to phpspreadsheet, and Data Transfer refused, saying that phpexcel is required; however, all the documentation for phpexcel says that it is obsolete. Is there any talk of getting Data Transfer to use phpspreadsheet soon? - Revansx (talk) 00:22, 6 May 2020 (UTC)

The code was updated 5 months ago to support phpspreadsheet. It checks whether either one (phpspreadsheet or phpexcel) is installed. Sen-Sai (talk) 12:50, 12 November 2020 (UTC)

Problems with importing code with {{

I wanted to import a bunch of articles that already had template calls in them (e.g. Template:My templates). I do not want to have to separate them all out into separate items, because I want the template results to be in a list. But when I run "Import XML" on my 6000 pages, it says that I only have 11 pages. So obviously something is wrong. I wish there were a way to simply do a direct import without messing with any template stuff.

I finally figured out how to properly set up an XML file so that it imports correctly:
<Pages>
  <Page Title="My mom">
    <Free_Text>
* {{he-l|שלום|t=Hello World}}

    </Free_Text>
  </Page>
</Pages>
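With thousands of pages, a file in this shape is easier to generate than to write by hand. Below is a minimal Python sketch that builds the same Pages / Page / Free_Text structure from a mapping of titles to wikitext. The function name and the in-memory dictionary are assumptions for illustration; only the element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET


def build_import_xml(pages):
    """Build a <Pages> document with one <Page Title="..."><Free_Text> per entry."""
    root = ET.Element("Pages")
    for title, wikitext in pages.items():
        page = ET.SubElement(root, "Page", {"Title": title})
        free_text = ET.SubElement(page, "Free_Text")
        # ElementTree escapes &, < and > on output; braces and pipes pass through as-is.
        free_text.text = wikitext
    return ET.tostring(root, encoding="unicode")


pages = {"My mom": "* {{he-l|שלום|t=Hello World}}\n"}
xml_text = build_import_xml(pages)
print(xml_text)
```

Whether the importer treats leading and trailing whitespace inside Free_Text exactly like the hand-written example is not something this sketch verifies, so it is worth trying on one or two pages first.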

Error on Import XML?

MW 1.35.3, PHP 7.4.21, SMW 3.2.3, Data Transfer 1.2

I can import a CSV file, but I hit the error message below when I import an XML file.

[952b888cb21a4ba6e34ca73a] /demo/index.php/Special:%E5%AF%BC%E5%85%A5XML Error from line 66 of /home/sjkcyuhu/public_html/tbpedia.org/demo/extensions/DataTransfer/includes/specials/DT_ImportXML.php: Cannot access private property DTXMLParser::$mPages

Backtrace:

#0 /home/sjkcyuhu/public_html/tbpedia.org/demo/extensions/DataTransfer/includes/specials/DT_ImportXML.php(35): DTImportXML->modifyPages(ImportStreamSource, string, string)
#1 /home/sjkcyuhu/public_html/tbpedia.org/demo/includes/specialpage/SpecialPage.php(600): DTImportXML->execute(NULL)
#2 /home/sjkcyuhu/public_html/tbpedia.org/demo/includes/specialpage/SpecialPageFactory.php(635): SpecialPage->run(NULL)
#3 /home/sjkcyuhu/public_html/tbpedia.org/demo/includes/MediaWiki.php(307): MediaWiki\SpecialPage\SpecialPageFactory->executePath(Title, RequestContext)
#4 /home/sjkcyuhu/public_html/tbpedia.org/demo/includes/MediaWiki.php(940): MediaWiki->performRequest()
#5 /home/sjkcyuhu/public_html/tbpedia.org/demo/includes/MediaWiki.php(543): MediaWiki->main()
#6 /home/sjkcyuhu/public_html/tbpedia.org/demo/index.php(53): MediaWiki->run()
#7 /home/sjkcyuhu/public_html/tbpedia.org/demo/index.php(46): wfIndexMain()
#8 {main}
Sorry about that - this was fixed about a week ago. Yaron Koren (talk) 17:32, 22 September 2021 (UTC)

ImportCSV too slow

I'm importing a CSV with 650 rows. It only imported about 50 after 10 hours. 1. What can I do to accelerate this import process?

I have noticed that the process goes into the background and keeps creating pages. 2. How can I kill the import process?

You can speed things up by calling the script runJobs.php, within MediaWiki's /maintenance directory. Conversely, if you want to stop the import, go into MediaWiki's "job" database table and delete the rows in that table. Yaron Koren (talk) 17:30, 22 September 2021 (UTC)

Spreadsheet import does not work / InvalidArgumentException

Setup
  • MediaWiki 1.35.4 (9e24c44) 08:21, 18. Okt. 2021
  • PHP 7.3.29-1~deb10u1 (apache2handler)
  • MariaDB 10.3.31-MariaDB-0+deb10u1
  • Data Transfer 1.2.1 (138d06f) 16:58, 24. Sep. 2021
  • phpoffice/phpspreadsheet 1.18.0
Issue
[b9a401e363fc376318d38743] /whn/Special:ImportSpreadsheet InvalidArgumentException from line 146 of /../w/vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Shared/File.php: File "" does not exist. 

Not sure if it is phpspreadsheet or Data Transfer. Also tracked with task T294585.

-- [[kgh]] (talk) 18:34, 28 October 2021 (UTC)

This was fixed a few weeks ago, and the fix is now in version 1.4. Yaron Koren (talk) 19:28, 17 February 2022 (UTC)

Import into embedded templates

When importing a CSV file, it is not a problem to import data into two different templates as long as they come one after the other. You could have a CSV file for a page "Jane Doe" with the columns "Person[First name]", "Person[Last name]" and "Skill[Level]". But how can this be done with embedded templates? Is this even possible? How should the CSV file be set up to achieve something like

{{Person
...
|Skills={{Skill|Name=Cooking|Level=5}}{{Skill|Name=Guitar|Level=2}}{{Skill|Name=Karate|Level=4}}
...
}}
You would need to have the embedded templates wikitext as text in the CSV file - with the curly brackets and all. There is no better way to do it, unfortunately. Yaron Koren (talk) 15:52, 8 February 2022 (UTC)
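If the skills already exist as structured data somewhere, the embedded wikitext for that one cell can be generated rather than typed. A minimal Python sketch, assuming the outer field is addressed as "Person[Skills]" (following the |Skills= parameter in the example above) and using made-up names for the person columns:

```python
import csv
import io


def skill_calls(skills):
    """Turn (name, level) pairs into consecutive {{Skill|...}} template calls."""
    return "".join(
        "{{Skill|Name=" + name + "|Level=" + str(level) + "}}"
        for name, level in skills
    )


cell = skill_calls([("Cooking", 5), ("Guitar", 2), ("Karate", 4)])

out = io.StringIO(newline="")
writer = csv.writer(out, lineterminator="\n")
# Column names are assumptions modelled on the example in this thread.
writer.writerow(["Title", "Person[First name]", "Person[Last name]", "Person[Skills]"])
writer.writerow(["Jane Doe", "Jane", "Doe", cell])
csv_text = out.getvalue()
print(csv_text)
```

The csv module adds quoting automatically if a skill name ever contains a comma, a double quote or a line break.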
Ok. The next thing I tried was to not embed it in a field, but just add additional multiple-instance data. The problem is that when you enter fields with the same name, only the first entry gets imported:
In this example the CSV would look like:
"Jane Doe", "Person[First name]", "Person[Last Name]", "Skill[Name]", "Skill[Level]", "Skill[Name]", "Skill[Level]".
and it would have "Cooking, 5, Guitar, 2" in the Skill columns. However, if you import this, only the first skill's fields get added. This is not wrong, but when you want to import multiple-instance data, it is unfortunate.
There is a workaround: import it from a separate CSV file and select "Append to existing content". But this is not a very stable approach, I guess, because then the sequence of imports becomes crucial. Krabina (talk) 18:48, 17 February 2022 (UTC)
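The separate-file workaround can at least be automated. The Python sketch below splits one wide CSV with repeated "Skill[...]" columns into a base table plus one table per skill instance, each meant to be imported afterwards with "Append to existing content". It is a variant of the approach described above (here every skill instance goes into its own file), it assumes the title is in the first column, and all names in it are hypothetical.

```python
import csv
import io
import re

# Matches headers of the form template_name[field_name].
HEADER_RE = re.compile(r"^(?P<template>[^\[\]]+)\[(?P<field>[^\[\]]+)\]$")


def split_repeated_columns(csv_text, template):
    """Return [base_table, instance_1, instance_2, ...] as lists of rows."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    header, body = rows[0], rows[1:]

    base_cols = []       # column indexes that stay in the first file
    instances = []       # one list of column indexes per template instance
    seen_fields = set()  # field names already used by the current instance
    for index, name in enumerate(header):
        match = HEADER_RE.match(name)
        if match is None or match.group("template") != template:
            base_cols.append(index)
            continue
        field = match.group("field")
        # A repeated field name marks the start of the next instance.
        if not instances or field in seen_fields:
            instances.append([])
            seen_fields = set()
        instances[-1].append(index)
        seen_fields.add(field)

    def pick(cols):
        return [[header[i] for i in cols]] + [[row[i] for i in cols] for row in body]

    tables = [pick(base_cols)]
    for cols in instances:
        table = pick([0] + cols)  # column 0 is assumed to hold the page title
        # Drop pages that have no data for this instance.
        tables.append([table[0]] + [r for r in table[1:] if any(r[1:])])
    return tables


wide = (
    "Title,Person[First name],Person[Last name],"
    "Skill[Name],Skill[Level],Skill[Name],Skill[Level]\n"
    "Jane Doe,Jane,Doe,Cooking,5,Guitar,2\n"
)
base, first_skill, second_skill = split_repeated_columns(wide, "Skill")
print(base)
print(first_skill)
print(second_skill)
```

This does not remove the ordering problem mentioned above: the base file still has to be imported first, and the append files in instance order.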
What would you suggest as syntax in the spreadsheet for this kind of import of what you could call "three-dimensional data"? Yaron Koren (talk) 19:30, 17 February 2022 (UTC)

Improvements for CSV import

Hi Yaron. I finally ported my personal improvements to the CSV import form after meaning to share them for a couple of years now. I have them merged in my local copy of the github repo. Should I send them to you for review as a pull request or do you prefer another way? Here is a summary of the changes I made:

  • Improvement - en.json language file - Added new form elements definitions
  • Improvement - DT_Utils.php - Added new options to form: Follow redirects, clear empty cells instead of skip, dry run
  • Improvement - DT_Utils.php - Set 'overwrite only fields in file' by default -> safer option
  • Improvement - DT_Utils.php - Added internal printImportCheckbox function
  • Bug - DT_Utils.php - Moved input html elements outside of label elements -> display elements vertically in form instead of horizontally
  • Improvement - DT_ImportCSV.php - print form at end of the page to allow processing of another file
  • Improvement - DT_ImportCSV.php - added support for follow redirects, clear empty cells and dry run
  • Improvement - DT_ImportCSV.php - added support for Bootstrap classes to make errors and warnings pop in results page Lalquier (talk) 22:09, 12 March 2022 (UTC)