Extension:DumpHTML

From MediaWiki.org
MediaWiki extensions manual
dumpHTML

Release status: unstable

Implementation: Data extraction
Description: Creates a simple HTML dump of a MediaWiki installation.
Author(s): Tim Starling, Kelson
Latest version: 1.18.0+
License: GPL or any OSI-approved license

dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.

Beware, cowboy!

DumpHTML needs a lot of work and has been broken since August 2008. It stopped working shortly after it was split from core in 2008; the situation was complicated further in 2009, worsened by ResourceLoader in 2010, and worsened again in 2011 and later.

The only person known to have used dumpHTML successfully is Kelson, who produced Kiwix ZIM files with it (with a lot of hacks). There were plans [1] [2] [3] to fix dumpHTML, but they were abandoned in 2013.

A simple, working way to produce static HTML from MediaWiki does not currently exist! Modern developers use Parsoid and mwoffliner: [4] [5]. Very brave PHP developers willing to fix dumpHTML should plan on several weeks of work; sysadmins may instead try enabling the file cache and checking the HTML files produced in the cache directory.

Parameters

dumpHTML does not function like a normal extension; you must run it from the command line.

Option/Parameter       Description
-d <dest>              destination directory
-s <start>             start ID
-e <end>               end ID
-k <skin>              skin to use (defaults to offline)
--no-overwrite         skip existing HTML files
--checkpoint <file>    use a checkpoint file to allow restarting of interrupted dumps
--slice <n/m>          split the job into m segments and do the n'th one
--images               only do image description pages
--shared-desc          only do shared (commons) image description pages
--no-shared-desc       don't do shared image description pages
--categories           only do category pages
--redirects            only do redirects
--special              only do miscellaneous stuff
--interlang            allow interlanguage links
--image-snapshot       copy all images used to the destination directory
--compress             generate compressed version of the html pages
--udp-profile <N>      profile 1/N rendering operations using ProfilerSimpleUDP
--munge-title <HOW>    available munging algorithms: none, md5, windows

Example: create a complete snapshot, including image files, media files, and image thumbnails, in the directory wikidump (Linux):

/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot
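
The --slice and --checkpoint options listed above suggest a way to split a long dump into parts that can be restarted independently. The following is only a sketch, not a tested recipe. It assumes slices are numbered from 1 to m, and the per-slice checkpoint file names are a hypothetical naming scheme chosen for illustration. The paths are the ones from the example above. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Paths copied from the example above; adjust them for your installation.
PHP=/usr/bin/php
SCRIPT=/srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php
DEST=/srv/www/mediawiki/wikidump

# slice_commands M: print one dumpHTML command line per slice (1/M .. M/M).
# Assumption: slice numbers start at 1.
slice_commands() {
    total=$1
    n=1
    while [ "$n" -le "$total" ]; do
        # One checkpoint file per slice (hypothetical naming scheme).
        echo "$PHP $SCRIPT -d $DEST -k monobook --slice $n/$total --checkpoint $DEST/checkpoint-$n.txt"
        n=$((n + 1))
    done
}

# Dry run: show the commands for a three-way split without executing them.
slice_commands 3
```

Each printed line can then be run as its own job. If checkpointing behaves as the option description says, rerunning a line after an interruption should resume that slice from its checkpoint file.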

Known issues

Warning! This extension is not properly maintained at the moment! You may encounter a number of issues. Any help fixing these (especially by sending patches to Gerrit) is greatly appreciated!

Filename problems solved by a modified version of DumpHTML

  • Fixed in r115629 via the --munge-title <HOW> option (available munging algorithms: none, md5, windows).

If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and the wiki pages or files have non-ASCII characters in their names (which is likely), you will probably need to convert the link references, directory names, and filenames from UTF-8 to your Windows character encoding (for example, code page 1252 for Western European systems). Even then, browsers may still have difficulty accessing the files.

Bugzilla 8147 "Filenames in the HTML static dump" has a patch for DumpHTML.inc that converts article, image, thumbnail image, and media filenames to their MD5-hashed versions, which avoids character encoding problems on different operating systems.
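
The idea behind MD5 munging can be illustrated with a short shell sketch. This is not the code from the patch or from --munge-title md5, whose exact scheme is not shown here. The munge_md5 helper, the sample title, and the .html suffix are hypothetical, and md5sum from GNU coreutils is assumed to be installed.

```shell
#!/bin/sh
# munge_md5 TITLE: print the MD5 hex digest of TITLE's bytes.
# Hypothetical helper for illustration; DumpHTML's own scheme may differ.
munge_md5() {
    # printf avoids the trailing newline that echo would add to the hashed input.
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}

# A title with non-ASCII characters becomes an ASCII-only file name.
title='Café_München'
echo "$(munge_md5 "$title").html"
```

Because the digest consists only of the characters 0-9 and a-f, the resulting name does not depend on the character encoding of the target filesystem. The price is that file names are no longer human-readable.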

Skin hacking

If you have modified your skin (e.g. monobook), this script will likely fail. Upgrade or update your MediaWiki installation, replace any "hacked" skins, and then retry.

Extensions compatibility

For the same reason, some extensions that modify the output, such as Extension:SyntaxHighlight_GeSHi, are not compatible with DumpHTML.

If you use InstantCommons

If you run the dump on a custom MediaWiki install that uses InstantCommons, the script assumes that your image files are in the images/wikimediacommons folder of the target directory.

If you encounter a message such as:

Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg):
failed to open stream: No such file or directory in [...]/w/extensions/DumpHTML/dumpHTML.inc on line 1377

You have to download http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg to /tmp/yourdump/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg (where /tmp/yourdump stands for your dump's destination directory, /tmp/wiki in the warning above) and restart the dump operation.
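
When many files are missing, the mapping from the path in the warning to the download URL can be scripted. The sketch below only mirrors the single example above: it assumes every missing file lives under images/wikimediacommons/ and that the rest of the path can be appended unchanged to the upload.wikimedia.org base URL. The function names commons_url and fetch_missing are hypothetical, and curl is assumed to be installed.

```shell
#!/bin/sh
# commons_url PATH: map .../images/wikimediacommons/<rest> to the Commons URL
# used in the example above (assumption: <rest> carries over unchanged).
commons_url() {
    rel=${1#*/images/wikimediacommons/}
    echo "http://upload.wikimedia.org/wikipedia/commons/$rel"
}

# fetch_missing PATH: create the missing directories, then download the file
# to the exact path named in the warning.
fetch_missing() {
    dest=$1
    mkdir -p "$(dirname "$dest")"
    curl -fsSL -o "$dest" "$(commons_url "$dest")"
}

# Example call, using the path from the warning above (left commented out so
# that nothing is downloaded by accident):
# fetch_missing /tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg
```

After fetching the files named in the warnings, restart the dump operation as described above.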

Static Wikipedia

See http://dumps.wikimedia.org/ and for example http://dumps.wikimedia.org/other/static_html_dumps/ for static snapshot examples. The last HTML dumps there were generated in 2008 (bug 15017).