Extension:DumpHTML
| dumpHTML | Release status: stable |
|---|---|
| Implementation | Data extraction |
| Description | Creates a simple HTML dump of a MediaWiki installation. |
| Author(s) | Tim Starling |
| Last version | 1.12.0+ |
| License | GPL or any OSI-approved license |
| Download | Download snapshot, Subversion |
dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.
Parameters
dumpHTML does not function like a normal extension; you must run it from the command line.
| Option/Parameter | Description |
|---|---|
| -d <dest> | destination directory |
| -s <start> | start ID |
| -e <end> | end ID |
| -k <skin> | skin to use (defaults to htmldump) |
| --no-overwrite | skip existing HTML files |
| --checkpoint <file> | use a checkpoint file to allow restarting of interrupted dumps |
| --slice <n/m> | split the job into m segments and do the n'th one |
| --images | only do image description pages |
| --shared-desc | only do shared (commons) image description pages |
| --no-shared-desc | don't do shared image description pages |
| --categories | only do category pages |
| --redirects | only do redirects |
| --special | only do miscellaneous stuff |
| --force-copy | copy commons instead of symlink, needed for Wikimedia |
| --interlang | allow interlanguage links |
| --image-snapshot | copy all images used to the destination directory |
| --compress | generate compressed version of the html pages |
| --udp-profile <N> | profile 1/N rendering operations using ProfilerSimpleUDP |
Example: create a complete snapshot, including image, media, and thumbnail files, in the directory wikidump (Linux):
/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot --force-copy
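The --slice and --checkpoint options from the table above can be combined to split a long dump into restartable segments. The following is a minimal dry-run sketch, not a tested recipe: it only prints the commands. It assumes that n in --slice n/m counts from 1 (as "the n'th one" suggests) and that each slice gets its own checkpoint file; the checkpoint file names and the helper slice_commands are hypothetical.

```shell
#!/bin/sh
# Dry run: print one dumpHTML.php command per slice instead of executing it.
# The script and destination paths are copied from the example above;
# the checkpoint file names are hypothetical.
SCRIPT=/srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php
DEST=/srv/www/mediawiki/wikidump

slice_commands() {
    total=$1    # m: number of segments to split the job into
    n=1         # assumed to count from 1 ("the n'th one")
    while [ "$n" -le "$total" ]; do
        echo "/usr/bin/php $SCRIPT -d $DEST --slice $n/$total --checkpoint /tmp/dumphtml-checkpoint-$n.txt"
        n=$((n + 1))
    done
}

# Print the four commands for a job split into four segments.
slice_commands 4
```

Each printed line can be run in its own terminal; if a slice is interrupted, rerunning the same line should let the checkpoint file resume it.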
Known issues
Filename problems solved by a modified version of DumpHTML
If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and your wiki pages or files have non-ASCII characters in their names, which is likely, you will probably need to convert the link references, directory names, and filenames from UTF-8 (on Linux) to the character encoding used by your Windows system (for example, codepage 1252 on Western European systems). Even then, browsers may still have difficulty accessing the files.
Bugzilla 8147, "Filenames in the HTML static dump", includes a patch for DumpHTML.inc that converts article, image, thumbnail, and media filenames to their MD5-hashed versions. Snapshots can then be written to CD/DVDs, and filename character-encoding problems across operating systems are avoided.
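To illustrate why MD5-hashed names sidestep the encoding problem, the sketch below maps an arbitrary UTF-8 filename to a 32-character hexadecimal name using md5sum from GNU coreutils. It is an illustration only: the helper hashed_name is hypothetical, and hashing the whole name while keeping the original extension is an assumption, not necessarily the scheme the Bugzilla 8147 patch uses.

```shell
#!/bin/sh
# Illustration only: map a UTF-8 file name to a pure-ASCII name.
# Keeping the original extension is an assumption, not necessarily
# what the Bugzilla 8147 patch does.
hashed_name() {
    name=$1
    # Hash the bytes of the name; md5sum prints "<hash>  -", keep the hash.
    hash=$(printf '%s' "$name" | md5sum | cut -d' ' -f1)
    # Preserve the extension, if there is one.
    case $name in
        *.*) ext=.${name##*.} ;;
        *)   ext= ;;
    esac
    echo "$hash$ext"
}

# A name with non-ASCII characters becomes 32 hex digits plus ".jpg".
hashed_name 'Épée.jpg'
```

Because the result contains only the characters 0-9, a-f, and the extension, it reads the same under UTF-8, codepage 1252, or an ISO 9660 CD filesystem.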
Skin hacking
If you have modified your skin (e.g. monobook), this script will likely fail. Upgrade or update your MediaWiki installation, replace any "hacked" skins, then retry.
Extensions compatibility
For the same reason, some extensions that modify output, such as Extension:SyntaxHighlight_GeSHi, aren't compatible with DumpHTML.
If you use InstantCommons
If you run the dump on a custom MediaWiki install that uses InstantCommons, the script assumes that your image files are in the images/wikimediacommons folder of the target directory.
If you encounter a message such as:
Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg): failed to open stream: No such file or directory in [...]/w/extensions/DumpHTML/dumpHTML.inc on line 1377
you have to download http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg to /tmp/yourdump/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg (where /tmp/yourdump stands for your dump's destination directory) and restart the dump operation.
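The manual fix above can be sketched as a small helper that turns the path from the warning into the two commands needed. This is a dry run that only prints the commands, under the assumption that the part of the path after images/wikimediacommons/ mirrors the path on upload.wikimedia.org, as it does in the example; the helper name fetch_commands is hypothetical, and wget is one possible download tool.

```shell
#!/bin/sh
# Dry run: print the commands that fetch one missing InstantCommons file.
# Assumes the sub-path after images/wikimediacommons/ matches the path
# under http://upload.wikimedia.org/wikipedia/commons/, as in the example.
fetch_commands() {
    warned_path=$1   # file path copied from the warning message
    dest=$2          # dump destination directory (the -d argument)
    # Strip everything up to and including images/wikimediacommons/.
    rel=${warned_path#*/images/wikimediacommons/}
    target="$dest/images/wikimediacommons/$rel"
    echo "mkdir -p '$(dirname "$target")'"
    echo "wget -O '$target' 'http://upload.wikimedia.org/wikipedia/commons/$rel'"
}

fetch_commands \
    /tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg \
    /tmp/yourdump
```

After reviewing the printed commands, pipe the output to sh to run them, then restart the dump.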
Static Wikipedia
See http://static.wikipedia.org for static snapshot examples such as enwiki.
This extension is being used on one or more of Wikimedia's wikis. This means that the extension is stable and works well enough to be used by such high-traffic websites. A full list of the extensions installed on a particular wiki can be seen on the wiki's Special:Version page.