Extension talk:DumpHTML/LQT Archive 1

Current issues:
--92.195.50.177 14:17, 20 April 2008 (UTC)
 * 1) Notice: Use of OutputPage::setParserOptions is deprecated in ...\GlobalFunctions.php on line 2480
 * 2) Pagenames with nonstandard characters (äöüß etc.) crash the script with a can't open file error
 * I have found that special characters crash the script because it tries to write to a directory that does not exist. Add  to the function writeArticle in the dumpHTML.inc file and non-US characters seem to work just fine. -- Seán Prunka

Installation instructions
Some installation instructions would be helpful. Here's what I did. Hopefully someone more knowledgeable than me can edit this and move it to the article page.

If you have web access from your MediaWiki server, this should suffice:

cd /whatever/mediawiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML

I don't, so I had to do this on a separate machine:

cd /tmp
svn export http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML

Subversion retrieves files, and reports their names and the revision number.

tar cjvf ~/DumpHTML-version.tar.bz2 DumpHTML
rm -rf DumpHTML

Then on my MediaWiki machine:

cd /whatever/mediawiki/extensions
tar xjvf ~/DumpHTML-version.tar.bz2

Invocation:

php /whatever/mediawiki/extensions/DumpHTML/dumpHTML.php options

Localsettings.php
Is it possible to call this extension with a line in localsettings.php? --Rovo 01:43, 13 June 2008 (UTC)


 * Rovo, sorry no. It's built to be run from a shell. --Gadlen 09:12, 18 August 2008 (UTC)

Usage Instructions
DumpHTML.php expects to be run from the maintenance directory. The "skin" directory won’t get included in the HTML package if you run it from another directory. So if you are running it on a cron job and putting together a .tar.gz of your wiki for downloading, your shell script might look something like this:


#!/bin/sh

cd /YourWikiDirectory/extensions/DumpHTML

/bin/php dumpHTML.php -d /YourTargetDirectoryForHTML -k monobook --image-snapshot --force-copy

/bin/tar -czf /TemporaryDirectoryForTarball/YourWikiAllWrappedUp.tar.gz /YourTargetDirectoryForHTML

mv /TemporaryDirectoryForTarball/YourWikiAllWrappedUp.tar.gz /YourWebAccessibleDirectory/YourWikiAllWrappedUp.tar.gz

--Gadlen 09:12, 18 August 2008 (UTC)

Downloading external links
I'm developing a resource that will most likely include links to other sites or at least the reports hosted on those other sites. Is it possible for this extension to perhaps download the first level/depth of external links too? --Charlener 02:53, 15 September 2008 (UTC)


 * You could perhaps modify my script (see Bugzilla 8147) accordingly. --Wikinaut 06:43, 15 September 2008 (UTC)

Dumping pages for commons images
I'm able to generate image pages for local images. What data dumps must be loaded to generate the shared (commons) image pages? Best regards. Naudefj 15:06, 7 October 2008 (UTC)


 * I don't know, I haven't used that feature yet. --Wikinaut 16:25, 7 October 2008 (UTC)

I've noticed that the provided HTML dumps include commons image pages. Are they generated with this script? If not, what dumper program is used to generate them? Best regards. Naudefj 21:05, 8 October 2008 (UTC)


 * Have a look at the source. As far as I understand, they _are_ indeed copied. --Wikinaut 06:08, 9 October 2008 (UTC)

Usage with symlinked MW core?
I inherited a setup that has a single MW install and a number of wikis. Each wiki is set up as symlinks for all of MW except ./images and LocalSettings.php. My problem is that this script wants to refer to the installation directory of MW instead of the "home" of each wiki. I tried export MW_INSTALL_PATH=".../home_of_wiki", which almost works, except that the image code still tries to go to the actual install path. I've noticed this behavior in other MW scripts. The core maintenance scripts are supposed to have an option to solve this problem. --Cymen 18:13, 21 October 2008 (UTC)
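
A minimal sketch of the MW_INSTALL_PATH approach described above, wrapped in a dry-run helper so the command can be inspected before it is executed. The function name dump_wiki, the DRY_RUN variable and every path are hypothetical; only MW_INSTALL_PATH and the -d option come from this page. It does not address the image code still resolving the real install path.

```shell
#!/bin/sh
# Run (or just print) one dumpHTML.php invocation for a single wiki home.
# dump_wiki and DRY_RUN are illustrative names, not part of DumpHTML.
dump_wiki() {
    wiki_home=$1    # directory holding LocalSettings.php and ./images
    target_dir=$2   # where the static HTML should be written
    script=$3       # full path to dumpHTML.php

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only show the command that would be run.
        printf 'MW_INSTALL_PATH=%s php %s -d %s\n' \
            "$wiki_home" "$script" "$target_dir"
    else
        # Set MW_INSTALL_PATH for this one php process only.
        MW_INSTALL_PATH=$wiki_home php "$script" -d "$target_dir"
    fi
}
```

Calling dump_wiki once per wiki home in a loop, with DRY_RUN=0 once the printed commands look right, keeps the variable from leaking between wikis.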

getFriendlyName
What's the delay in moving to getFriendlyName? Why is getHashedFilename used?


 * article page section "Filename problems solved by a modified version of DumpHTML" explains it. --Wikinaut 07:32, 7 January 2009 (UTC)

What is the purpose of this? I commented it out, it seems to work fine, but did I create the potential for disaster?


 * Use my version from Bugzilla 8147, as it fixes several problems with non-ASCII article and image filenames. --Wikinaut 19:01, 7 January 2009 (UTC)
 * when you register an account here on mediawiki, you can make use of the email notification and you will receive an e-mail when the page(s) you are watching is changed, for example, when answering your questions. Make sure to have your e-mail address confirmed and the correct settings in your preferences enabled. --Wikinaut 19:03, 7 January 2009 (UTC)

Inline CSS
Why is the CSS from the page not transferred? --DaSch 21:32, 12 February 2009 (UTC)
 * What do you mean exactly ? --Wikinaut 01:03, 13 February 2009 (UTC)
 * Compare http://www.wecowi.de and http://static.wecowi.de --DaSch 10:59, 13 February 2009 (UTC)


 * I visited your site. One remark: the loading times of your site are very long.
 * Question: What version of DumpHTML have you used? My version (see the section on the article page; it can be downloaded via the URL mentioned in Bugzilla 8147) has fixed some problems, especially with filenames. Can you please try it? Regarding the "standard" version, I have no idea. Perhaps it would be a good idea to ask Tim Starling during the Developers meeting in Berlin. --Wikinaut 11:59, 13 February 2009 (UTC)

I just can't dump
Well, I tried tonight to dump my wiki to HTML. My wiki's language is Spanish. There was an error on a page called "Flashback (animación de Flash)". My operating system is Windows 7. What do I have to do? --MisterWiki 16:15, 24 July 2009 (UTC)
 * Please help me --MisterWiki 21:17, 28 July 2009 (UTC)


 * Special characters crash the script. My guess is it's the "ó" in the title. If there aren't many, rename those pages to leave out the ó or any other accents.


 * I have found that special characters crash the script because it tries to write to a directory that does not exist. Add  to the function writeArticle in the dumpHTML.inc file and non-US characters seem to work just fine. -- Seán Prunka
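
For the renaming workaround suggested above, it helps to know which titles are affected. This is a rough sketch that assumes you already have the page titles in a plain text file, one per line; the function name and the file are hypothetical, and how you export the titles is up to you. It relies on grep's -a option, which GNU, BSD and BusyBox grep provide but POSIX does not require.

```shell
#!/bin/sh
# Print every line of a title list that contains a byte outside
# printable ASCII (space through tilde), e.g. accented letters.
# list_non_ascii_titles is an illustrative name, not part of DumpHTML.
list_non_ascii_titles() {
    # LC_ALL=C makes grep compare raw bytes; -a forces text mode.
    LC_ALL=C grep -a '[^ -~]' "$1"
}
```

The function exits with a non-zero status when no such title is found, which is the usual grep convention.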

Error: Cannot modify header information
I can't dump either. I'm using Windows XP Pro, MediaWiki 1.13.4, PHP 5.2.8 (apache2handler), MySQL 5.1.30-community. It is still the same error as the old one from 2005 (https://bugzilla.wikimedia.org/show_bug.cgi?id=4132).

Error message: Warning: Cannot modify header information - headers already sent by (output started at \extensions\DumpHTML\dumpHTML.inc:619) in \includes\WebResponse.php on line 10

--Wissenslogistiker 10:36, 14 August 2009 (UTC)


 * Remove ?> from the end of dumpHTML.inc. --Emufarmers 02:01, 15 August 2009 (UTC)
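
The fix above can also be applied from a shell. This is a minimal sketch, assuming the closing tag sits alone on the last non-blank line of the file; the function name is hypothetical, and you should keep a backup of dumpHTML.inc before running it. Whitespace after a closing ?> is sent as output, which is a common cause of "headers already sent" warnings.

```shell
#!/bin/sh
# Remove a trailing "?>" line (and any blank lines after it) from a PHP file.
# strip_closing_php_tag is an illustrative name, not part of DumpHTML.
strip_closing_php_tag() {
    src=$1
    tmp=$(mktemp) || return 1
    awk '
        { lines[NR] = $0 }
        END {
            last = NR
            # Find the last line that is not blank.
            while (last > 0 && lines[last] ~ /^[ \t\r]*$/) last--
            if (last > 0 && lines[last] ~ /^[ \t]*\?>[ \t\r]*$/) {
                keep = last - 1   # drop the tag and everything after it
            } else {
                keep = NR         # no trailing tag: keep every line
            }
            for (i = 1; i <= keep; i++) print lines[i]
        }
    ' "$src" > "$tmp" && cat "$tmp" > "$src"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Writing back with cat rather than mv keeps the original file's permissions and ownership.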

Blows up on PostgreSQL
...on MediaWiki 1.15.1, though this may not be the extension's fault:

WARNING: destination directory already exists, skipping initialisation
Creating static HTML dump in directory /my/target/directory.
Using database localhost
Starting from page_id 1 of 727
Processing ID: 1
Warning: pg_query: Query failed: ERROR: column "mwuser.user_id" must appear in the GROUP BY clause or be used in an aggregate function in /usr/share/mediawiki/includes/db/DatabasePostgres.php on line 580
Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information.
zsh: segmentation fault MW_INSTALL_PATH=/my/symlinkfarm/path php5 dumpHTML.php  -d

I don't know where to report issues, so here it goes. --Wwwwolf 11:14, 12 October 2009 (UTC)

PostgreSQL and Mediawiki 1.12.0
If the error is "DB connection error: No database connection", it may be a problem logging into the database. Looking in the PostgreSQL logs revealed that I had to adapt pg_hba.conf.

--Albert25 11:45, 18 January 2011 (UTC)

This is the error message I get when I try to execute the dumpHTML.php file on my local machine. Does anybody know a fix for that?


 * I have a similar problem which I can't solve:

DB connection error: A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond. (localhost)
 * Mediawiki 13.3, Win7, Mowes webserver; I just installed php5.3 without webserver support to execute dumpHTML.php
 * I can trace the problem to includes/db/Loadbalancer.php, function reallyOpenConnection: $db = new $class( $host, $user, $password, $dbname, 1, $flags ); just times out
 * The database class is DatabaseMysql (I can't find the source); host, database name, user and password are correct
 * Any help?

Minor: pages vs. page titles
"If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and if your wiki pages or files had non-ASCII characters, which is likely, then you probably need to change the link references,"

Should this be "if your wiki page titles or filenames had non-ASCII characters"? If correct, it would be much clearer.

How to export Monobook Skin to the HTMLdump?
When I export with the HTMLdump extension it always uses the offline skin, and it's really ugly. How can I make my export use the Monobook skin? Thank you.

How to provide login data
I have MW 1.16.0 with the Lockdown extension installed, so a username and password are needed to view the wiki. It is a Windows system, so I used the modified version of DumpHTML that produces hashed filenames. How can I provide a username and password to DumpHTML? On the first run, DumpHTML asked me to provide a --group parameter, and I did. DumpHTML then produced a lot of output, but I can't log in from the index page of the static wiki. Furthermore, many (most) pages and their paths are missing from the static wiki. The page "Login required" ("Anmeldung erforderlich" in German) exists many times over: whenever a page is shown, it is this one, each time under a different path and filename. I tried giving 'read' permission to * in LocalSettings.php, but then the extension produces a static wiki without any styles or pictures, and many linked pages and paths still do not exist.
 * get the version for MW 1.16 - there is a --group parameter. use the group "user" (sysop doesn't work for me) --212.114.205.190 11:48, 29 September 2011 (UTC)

Error
When I try to run the script, I always get the error message: default users are not allowed to read, please specify (--group=sysop). I also tried it with this option, but then I get the error message "the specified user group is not allowed to read". Any ideas? :)
 * The group "user" works for me --212.114.205.190 11:49, 29 September 2011 (UTC)

Bug with german umlauts in filenames of images
When there is an umlaut in the filename of an image, the image will be saved in the dump but with a wrong name - the link in the HTML is not working. Does somebody know how to fix this? --212.114.205.190 12:51, 29 September 2011 (UTC)

Unicode diacritic character on dumped html
My characters, mostly those with diacritics, appear garbled in the dumped HTML files.

Examples:
 * Saṃyutta Nikāya -> Sa峁儁utta Nik膩ya
 * … -> 鈥�
 * Soṇadaṇḍa Sutta -> So峁嘺da峁囜笉a

Does anyone have a similar issue?

Thanks. Benzwu
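
The examples above look like UTF-8 bytes being displayed as GBK (for instance "ṃ" is the bytes E1 B9 83 in UTF-8, and E1 B9 followed by 83 79 reads as "峁儁" in GBK). That is an inference from the examples, not something confirmed for this setup. A small sketch to test the hypothesis with iconv; the function names are hypothetical, and it assumes your iconv build supports the GBK charset.

```shell
#!/bin/sh
# Reproduce the garbling: treat the UTF-8 bytes of a string as GBK text.
# Both function names are illustrative only.
utf8_read_as_gbk() {
    printf '%s' "$1" | iconv -f GBK -t UTF-8
}

# Undo it: encode the garbled text as GBK to recover the original UTF-8 bytes.
gbk_garble_to_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t GBK
}
```

If utf8_read_as_gbk reproduces what you see, the dumped files are probably correct UTF-8 and the viewer or editor is opening them with the wrong encoding. The reverse function cannot repair text where bytes were already replaced by "�", as in the second example.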