Extension:DumpHTML
This extension is currently not actively maintained! Although it may still work, any bug reports or feature requests will more than likely be ignored. If you are interested in taking on the task of developing and maintaining this extension, you can request repository ownership. As a courtesy, you may want to contact the author. You should also remove this template and list yourself as maintaining the extension in the page's {{extension}} infobox.
dumpHTML | |
---|---|
Release status | unmaintained |
Implementation | Data extraction |
Description | Creates a simple HTML dump of a MediaWiki installation. |
Author(s) | Tim Starling, Kelson |
Latest version | 1.18+ |
License | GNU General Public License 2.0 or later |
Issues | Open tasks · Report a bug |
dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.
Beware, cowboy!
DumpHTML needs a lot of work and has been broken since August 2008. It stopped working shortly after it was split from core in 2008; the situation was complicated further in 2009, worsened by ResourceLoader in 2010, and worsened again in 2011 and later.
The only person known to have used dumpHTML successfully is Kelson, who produced Kiwix ZIM files with it (with a lot of hacks). There were plans [1] [2] [3] to fix dumpHTML, but they were abandoned in 2013.
A simple, functioning solution for producing static HTML from MediaWiki does not currently exist! Modern developers use Parsoid and mwoffliner: [4] [5]. PHP developers brave enough to fix dumpHTML should probably plan for some weeks of work; sysadmins may try enabling the file cache and checking the HTML files produced in the cache directory.
Parameters
dumpHTML does not function like a normal extension; you must run it from the command line.
Option/Parameter | Description |
---|---|
-d <dest> | destination directory |
-s <start> | start ID |
-e <end> | end ID |
-k <skin> | skin to use (defaults to offline) |
--no-overwrite | skip existing HTML files |
--checkpoint <file> | use a checkpoint file to allow restarting of interrupted dumps |
--slice <n/m> | split the job into m segments and do the n'th one |
--images | only do image description pages |
--shared-desc | only do shared (commons) image description pages |
--no-shared-desc | don't do shared image description pages |
--categories | only do category pages |
--redirects | only do redirects |
--special | only do miscellaneous stuff |
--interlang | allow interlanguage links |
--image-snapshot | copy all images used to the destination directory |
--compress | generate compressed version of the html pages |
--udp-profile <N> | profile 1/N rendering operations using ProfilerSimpleUDP |
--munge-title <HOW> | available munging algorithms: none, md5, windows |
Example: create a complete snapshot, including image files, media files, and image thumbnails, in the directory wikidump (Linux):
/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot
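For large wikis, the --slice and --checkpoint options from the table above suggest running the dump in several independent pieces. The following Python sketch only builds the command lines and prints them; it does not run anything. It assumes that slice numbers run from 1 to m (the table says only "the n'th one"), and the per-slice checkpoint file names are hypothetical. The paths are copied from the example above.

```python
import shlex

# Paths copied from the example above; adjust them for your installation.
PHP = "/usr/bin/php"
SCRIPT = "/srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php"
DEST = "/srv/www/mediawiki/wikidump"


def slice_commands(total_slices, skin="monobook"):
    """Return one dumpHTML argument list per slice.

    Assumption: slices are numbered 1..total_slices.
    """
    commands = []
    for n in range(1, total_slices + 1):
        commands.append([
            PHP, SCRIPT,
            "-d", DEST,
            "-k", skin,
            "--slice", f"{n}/{total_slices}",
            # Hypothetical checkpoint file name, one per slice, so that an
            # interrupted slice can be restarted on its own.
            "--checkpoint", f"{DEST}/checkpoint-{n}.txt",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in slice_commands(4):
        print(shlex.join(cmd))
```

Printing the commands first makes it possible to review them, or to paste them into separate terminals, before committing to a long-running dump.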
Known issues
Warning! This extension is not properly maintained at the moment! You may encounter a number of issues. Any help fixing these (especially by sending patches to Gerrit) is greatly appreciated!
Filename problems solved by a modified version of DumpHTML
- Fixed in r115629 via the --munge-title <HOW> option (available munging algorithms: none, md5, windows).
If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and the wiki pages or files have non-ASCII characters in their names (which is likely), you will probably need to convert the link references, directories, and filenames from UTF-8 to your Windows character encoding (for example, codepage 1252 for Western European systems). Even then, browsers may still have difficulty accessing the files.
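To estimate how many page or file names would be affected, one option is to test whether each name can be represented in the target codepage at all. This is a minimal sketch and not part of DumpHTML; the function names are hypothetical, and cp1252 is used only because it is the codepage mentioned above.

```python
def fits_codepage(name, codepage="cp1252"):
    """Return True if every character of name exists in the codepage."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


def problematic_titles(titles, codepage="cp1252"):
    """Return the titles that cannot be written in the given codepage."""
    return [title for title in titles if not fits_codepage(title, codepage)]


if __name__ == "__main__":
    sample = ["Main_Page", "Zürich", "Москва"]
    print(problematic_titles(sample))
```

Names that fail this check cannot be converted to the codepage without loss, so they are the ones for which a munging algorithm such as md5 matters most.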
Bugzilla 8147 "Filenames in the HTML static dump" has a patch for DumpHTML.inc that converts article, image, thumbnail image, and media filenames to their MD5-hashed versions, which avoids character encoding problems across different operating systems.
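The idea behind MD5 munging can be illustrated in a few lines of Python. This is only a sketch of the general technique: the exact input, encoding, and file layout used by the patch or by --munge-title md5 are not documented here, so hashing the UTF-8 title and appending an extension are assumptions, and munge_md5 is a hypothetical name.

```python
import hashlib


def munge_md5(title, extension=".html"):
    """Map a page title to a pure-ASCII file name via its MD5 digest.

    Assumption: the title is hashed as UTF-8 bytes; the real patch may
    hash a different string or lay files out in subdirectories.
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return digest + extension


if __name__ == "__main__":
    for title in ["Main_Page", "Zürich", "Москва"]:
        print(title, "->", munge_md5(title))
```

Because the digest contains only the characters 0-9 and a-f, the resulting names are safe on any filesystem, at the cost of no longer being human-readable.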
Skin hacking
If you have modified your skin (e.g. monobook), this script will likely fail. Update your MediaWiki installation, replace any "hacked" skins, and then retry.
Extensions compatibility
For the same reason, some extensions that modify the output, such as Extension:SyntaxHighlight_GeSHi, are not compatible with DumpHTML.
If you use InstantCommons
If you run the dump on a custom MediaWiki installation that uses InstantCommons, the script assumes that your image files are in the images/wikimediacommons folder of the target directory.
As a result, you may encounter a message such as:
Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg): failed to open stream: No such file or directory in [...]/w/extensions/DumpHTML/dumpHTML.inc on line 1377
In that case, download http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg to /tmp/yourdump/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg (where /tmp/yourdump stands for your destination directory) and restart the dump operation.
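When many such warnings appear, the mapping from a warning line to the file that has to be fetched can be derived mechanically. The sketch below only parses the message and returns the Commons URL together with the local path; it does not download anything. The regular expression is based solely on the sample warning above, and the function name is hypothetical.

```python
import re

COMMONS_BASE = "http://upload.wikimedia.org/wikipedia/commons/"
MARKER = "/images/wikimediacommons/"

# Matches the path inside file_put_contents(...) in the PHP warning.
WARNING_RE = re.compile(
    r"file_put_contents\((?P<path>.+?)\): failed to open stream"
)


def missing_commons_file(warning_line):
    """Return (commons_url, local_path) for a warning line, or None."""
    match = WARNING_RE.search(warning_line)
    if match is None:
        return None
    local_path = match.group("path")
    if MARKER not in local_path:
        return None
    # Everything after the marker is the hashed path used on Commons.
    relative = local_path.split(MARKER, 1)[1]
    return COMMONS_BASE + relative, local_path


if __name__ == "__main__":
    line = (
        "Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/"
        "7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg): failed to open stream: "
        "No such file or directory in [...]/w/extensions/DumpHTML/"
        "dumpHTML.inc on line 1377"
    )
    print(missing_commons_file(line))
```

The returned pair can then be passed to any download tool; remember to create the missing directories under the destination before restarting the dump.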
Static Wikipedia
See http://dumps.wikimedia.org/ and for example http://dumps.wikimedia.org/other/static_html_dumps/ for static snapshot examples. The last HTML dumps there were generated in 2008 (task T17017).