Extension:DumpHTML
From MediaWiki.org
| Field | Value |
|---|---|
| Release status | unknown |
| Implementation | Data extraction |
| Description | Creates a simple HTML dump of a MediaWiki installation. |
| Author(s) | Tim Starling |
| Last Version | 1.12.0+ |
| License | GPL or Any OSI approved license |
| Download | Download snapshot |
dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.
Parameters
dumpHTML does not function like a normal extension; you must run it from the command line.
| Option/Parameter | Description |
|---|---|
| -d <dest> | destination directory |
| -s <start> | start ID |
| -e <end> | end ID |
| -k <skin> | skin to use (defaults to htmldump) |
| --no-overwrite | skip existing HTML files |
| --checkpoint <file> | use a checkpoint file to allow restarting of interrupted dumps |
| --slice <n/m> | split the job into m segments and do the n'th one |
| --images | only do image description pages |
| --shared-desc | only do shared (commons) image description pages |
| --no-shared-desc | don't do shared image description pages |
| --categories | only do category pages |
| --redirects | only do redirects |
| --special | only do miscellaneous stuff |
| --force-copy | copy commons instead of symlink, needed for Wikimedia |
| --interlang | allow interlanguage links |
| --image-snapshot | copy all images used to the destination directory |
| --compress | generate compressed version of the html pages |
| --udp-profile <N> | profile 1/N rendering operations using ProfilerSimpleUDP |
Example (Linux): create a complete snapshot, including image, media, and thumbnail files, in the directory wikidump:
/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot --force-copy
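The --slice option listed above lends itself to splitting a large dump into several independent runs. The following is a minimal sketch, not part of DumpHTML itself: `slice_commands` is a hypothetical helper that only prints one command per slice (a dry run), and it assumes slices are numbered from 1 to m, as the wording "the n'th one" suggests. The script path and destination are placeholders to be replaced with your own.

```shell
#!/bin/sh
# Hypothetical helper: print one dumpHTML command per slice.
# Nothing is executed; review the output, then run the lines yourself.
# Arguments: <path to dumpHTML.php> <destination directory> <number of slices>
slice_commands() {
    script="$1"
    dest="$2"
    m="$3"
    n=1   # assumption: slices are numbered 1..m
    while [ "$n" -le "$m" ]; do
        printf 'php %s -d %s --slice %s/%s\n' "$script" "$dest" "$n" "$m"
        n=$((n + 1))
    done
}

# Example: show the three commands for a dump split into three slices.
slice_commands ./dumpHTML.php ./wikidump 3
```

Each printed line can then be started in its own terminal or job, so that an interrupted slice can be rerun without repeating the others.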
Filename problems solved by a modified version of DumpHTML
If you intend to use the dump on a CD/DVD or on a Windows filesystem, and your wiki page or file names contain non-ASCII characters (which is likely), you will probably need to convert the link references, directory names, and filenames from UTF-8 (as used on Linux) to the character encoding of the target Windows system, for example codepage 1252 on Western European systems. Even then, browsers may still have difficulty accessing the files.
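To judge whether a dump is affected, it can help to list the files whose names contain non-ASCII bytes. The sketch below is an illustration only: `non_ascii_paths` is a hypothetical helper, not a DumpHTML feature, and it assumes `find` and `grep` are available and that filenames contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: list every path under a dump directory whose name
# contains a byte outside printable ASCII (space through tilde).
# LC_ALL=C makes grep compare byte by byte instead of decoding UTF-8.
# grep exits with status 1 when nothing matches, i.e. all names are ASCII.
non_ascii_paths() {
    find "$1" | LC_ALL=C grep '[^ -~]'
}

# Usage (the directory is a placeholder):
#   non_ascii_paths /srv/www/mediawiki/wikidump
```

An empty result suggests the dump can be copied to other filesystems without renaming; any listed path is a candidate for the encoding problems described above.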
The patch attached to Bugzilla 8147, "Filenames in the HTML static dump", modifies DumpHTML.inc to convert article, image, thumbnail, and media filenames to their MD5-hashed versions. Snapshots can then be written to CD/DVD, and filename character-encoding problems on different operating systems are avoided.
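The idea behind MD5-hashed filenames can be sketched in a few lines of shell. This is only an illustration of the general technique: the exact naming scheme used by the patch is not described here, so the `hashed_name` helper, the choice to hash the raw title bytes, and the `.html` suffix are all assumptions. It relies on `md5sum` from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: map a page title to an ASCII-only filename by
# hashing the title bytes with MD5. Assumption: the result is the
# 32-character hex digest followed by ".html"; the real patch may differ.
hashed_name() {
    digest=$(printf '%s' "$1" | md5sum | cut -d' ' -f1)
    printf '%s.html\n' "$digest"
}

# Example: a title with a non-ASCII character yields a plain ASCII name.
hashed_name "Zürich"
```

Because the digest consists only of the characters 0-9 and a-f, the resulting name is valid on any common filesystem, at the cost of no longer being human-readable.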
Skin hacking
If you have modified your skin (e.g. monobook), this script will likely fail. Upgrade or update your MediaWiki installation and replace any "hacked" skins, then retry.
Static Wikipedia
See http://static.wikipedia.org for static snapshot examples such as enwiki.