Extension:DumpHTML

From MediaWiki.org
dumpHTML

Release status: stable

Implementation: Data extraction
Description: Creates a simple HTML dump of a MediaWiki installation.
Author(s): Tim Starling
Last version: 1.12.0+
License: GPL or any OSI-approved license

dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.

Parameters

dumpHTML does not function like a normal extension; you must run it from the command line.

Option/Parameter      Description
-d <dest>             destination directory
-s <start>            start ID
-e <end>              end ID
-k <skin>             skin to use (defaults to htmldump)
--no-overwrite        skip existing HTML files
--checkpoint <file>   use a checkpoint file to allow restarting of interrupted dumps
--slice <n/m>         split the job into m segments and do the nth one
--images              only do image description pages
--shared-desc         only do shared (commons) image description pages
--no-shared-desc      don't do shared image description pages
--categories          only do category pages
--redirects           only do redirects
--special             only do miscellaneous stuff
--force-copy          copy commons instead of symlink, needed for Wikimedia
--interlang           allow interlanguage links
--image-snapshot      copy all images used to the destination directory
--compress            generate compressed version of the HTML pages
--udp-profile <N>     profile 1/N rendering operations using ProfilerSimpleUDP

Example (Linux): create a complete snapshot, including image, media, and thumbnail files, in the directory wikidump:

/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot --force-copy
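For a large wiki, the --slice and --checkpoint options from the table can be combined so that each segment runs as a separate, restartable job. The following is a minimal dry-run sketch that only prints the commands. It assumes slices are numbered 1 through m, reuses the paths from the example above, and uses a hypothetical per-slice checkpoint file name; whether your version of dumpHTML combines the two options cleanly is worth checking before relying on it.

```shell
#!/usr/bin/env bash
# Dry run: print one dumpHTML command per slice instead of executing it.
# PHP, SCRIPT, and DEST reuse the paths from the example above; adjust them.
PHP=/usr/bin/php
SCRIPT=/srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php
DEST=/srv/www/mediawiki/wikidump

# slice_cmd N M: print the command for slice N of M.
# The checkpoint file name is a hypothetical choice (one file per slice).
slice_cmd() {
  local n=$1 m=$2
  echo "$PHP $SCRIPT -d $DEST -k monobook --slice $n/$m --checkpoint $DEST/checkpoint-$n-of-$m.txt"
}

# print_slices M: print the commands for all M slices, assuming 1-based numbering.
print_slices() {
  local m=$1 n
  for ((n = 1; n <= m; n++)); do
    slice_cmd "$n" "$m"
  done
}

print_slices 4
```

Each printed line can be run in its own terminal or job; rerunning a line with the same checkpoint file is intended to resume an interrupted slice.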

Known issues

Filename problems solved by a modified version of DumpHTML

If you intend to use the dump on a CD/DVD or on a Windows filesystem, and your wiki page or file names contain non-ASCII characters (which is likely), you will probably need to convert the link references, directory names, and filenames from UTF-8 (as used on Linux) to the character encoding of your Windows system, for example code page 1252 on Western European systems. Even then, browsers may still have difficulty accessing the files.

Bugzilla 8147, "Filenames in the HTML static dump", provides a patch for DumpHTML.inc that converts article, image, thumbnail, and media filenames to their MD5-hashed versions. Snapshots can then be written to CD/DVD, and filename encoding problems across different operating systems are avoided.
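To illustrate why hashing sidesteps the encoding problem, here is a minimal sketch that maps a filename to a pure-ASCII name. The function hashed_name is hypothetical, and hashing the name without its extension is an assumption made for this illustration; the actual patch may hash a different string or lay out files differently.

```shell
#!/usr/bin/env bash
# Illustrative only: not the patch's actual naming scheme.

# hashed_name NAME: print md5(name without extension) plus the original
# extension, if any. The result contains only ASCII hex digits and the extension.
hashed_name() {
  local name=$1 stem ext hash
  case "$name" in
    *.*) stem=${name%.*}; ext=.${name##*.} ;;
    *)   stem=$name;      ext= ;;
  esac
  # printf avoids the trailing newline that echo would add to the hashed input.
  hash=$(printf '%s' "$stem" | md5sum | cut -d' ' -f1)
  printf '%s%s\n' "$hash" "$ext"
}

hashed_name "hello.png"   # prints 5d41402abc4b2a76b9719d911017c592.png
```

Because the output consists only of hexadecimal digits and the extension, the same name is valid on any filesystem regardless of its character encoding.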

Skin hacking

If you have modified your skin (e.g. monobook), this script will likely fail. Upgrade your MediaWiki installation, replace any "hacked" skins with the stock versions, and retry.

Extensions compatibility

For the same reason, some extensions that modify page output, such as Extension:SyntaxHighlight_GeSHi, are not compatible with DumpHTML.

If you use InstantCommons

If you run the dump on a custom MediaWiki install that uses InstantCommons, the script assumes your image files are in the images/wikimediacommons folder of the target directory.

Thus, if you encounter a warning such as:

Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg):
failed to open stream: No such file or directory in [...]/w/extensions/DumpHTML/dumpHTML.inc on line 1377

You have to download http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg to the path named in the warning inside your dump directory (here /tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg), creating the directories if needed, and then restart the dump operation.
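The manual step above can be scripted. The following is a minimal sketch that derives the Commons URL from the local path shown in the warning. The function names commons_url_for and fetch_missing are hypothetical, and the sketch assumes the part of the path after images/wikimediacommons/ matches the path on upload.wikimedia.org, as it does in the example above.

```shell
#!/usr/bin/env bash
# Sketch: map a path from the warning to its Commons URL, then fetch it.
COMMONS_BASE=http://upload.wikimedia.org/wikipedia/commons

# commons_url_for PATH: print the Commons URL for a missing local file.
commons_url_for() {
  local path=$1
  # Keep only the part after images/wikimediacommons/ (e.g. 7/75/Name.jpg).
  local rel=${path#*/images/wikimediacommons/}
  printf '%s/%s\n' "$COMMONS_BASE" "$rel"
}

# fetch_missing PATH: create the missing directories and download the file.
# Not called below; invoke it yourself once the printed URL looks right.
fetch_missing() {
  local path=$1
  mkdir -p "$(dirname "$path")"
  curl -fsSL -o "$path" "$(commons_url_for "$path")"
}

# Print the URL for the file named in the warning above.
commons_url_for /tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg
```

After fetch_missing has been run for each path reported in the warnings, restart the dump; adding --no-overwrite from the options table should let it skip the HTML files already written.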

Static Wikipedia

See http://static.wikipedia.org for static snapshot examples such as enwiki.
