Manual:Grabbers

This page describes a series of grabber scripts designed to get a wiki's content without direct database access. If you don't have a database dump or access to the database and you need to move or back up a wiki, the MediaWiki API provides access to most of what you need.

Appropriate access on the wiki you are grabbing from is required to get private or deleted data. This document was originally compiled, and the scripts assembled, in order to move Uncyclopedia; because the overall goal was just to get the damn thing moved, 'pretty' was not exactly in our vocabulary when we were setting this up, so some of the scripts are still kind of a horrible mess.

Stuff to get
If you're moving an entire wiki, these are probably what you need to get. More information on the tables can be found on Manual:Database layout; the secondary tables can be rebuilt from these. Otherwise you probably know what you want.


 * Revisions: text, revision, page, page_restrictions, protected_titles, archive (most hosts will provide an XML dump of at least the text, revision, and page tables, which add up to the bulk of a wiki)
 * Logs: logging
 * Interwiki: interwiki
 * Files (including deleted) and file data: image, oldimage, filearchive
 * Users: user, user_groups, user_properties, watchlist, ipblocks
 * Other stuff probably (this is just core; extensions often add other tables and stuff).
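
If you want a rough idea of how much of this there is before you start, the same API can tell you. A minimal sketch in Python (using the requests library; the URL is just a placeholder for whatever wiki you're grabbing from):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'meta': 'siteinfo',
    'siprop': 'general|statistics',
    'format': 'json',
}
result = requests.get(API, params=params).json()['query']
print(result['general']['sitename'], result['general']['generator'])
stats = result['statistics']
print('pages:', stats['pages'], 'edits:', stats['edits'],
      'files:', stats['images'], 'users:', stats['users'])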

Scripts

 * PHP files should be in the mediawiki/tools/grabbers repo on Gerrit.
 * Perl files are from MediaWikiDumper.
 * Python files have been added to the mediawiki/tools/grabbers repo.
 * The Java tool is MWDumper.
 * No Ruby is involved. So far.

grabText.php
Page content.


 * Maintenance script, requires mediawikibot.class.php

Affects text, revision, page, page_restrictions(?) tables, originally based on text.pl. Probably needs to be rewritten again.

There are separate modes for getting all live revisions and for starting from after a specified revision (for instance, only getting the more recent revisions after importing a dump). Only the latter has been verified to work, and it can make a bit of a mess.
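
For reference, the kind of API query this is built around looks roughly like the sketch below (requests library, placeholder URL; the actual script also walks every page, handles errors, and writes the results to the database):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'titles': 'Main Page',          # the real script walks every page
    'prop': 'revisions',
    'rvprop': 'ids|timestamp|user|userid|comment|content',
    'rvlimit': 'max',
    'rvdir': 'newer',               # oldest first
    # 'rvstartid': 12345,           # or resume from a known revision id
    'format': 'json',
}
while True:
    data = requests.get(API, params=params).json()
    for page in data['query']['pages'].values():
        for rev in page.get('revisions', []):
            print(rev['revid'], rev['timestamp'], rev.get('user'))
    if 'continue' not in data and 'query-continue' not in data:
        break
    # old and new MediaWiki versions signal continuation differently
    cont = data.get('continue') or data['query-continue']['revisions']
    params.update(cont)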


 * Lacks support for revision deletion and oversight, so probably skips affected revisions (or possibly just dies, don't remember).

After doing this (with a dump or otherwise), if page revisions include user IDs, which they should, you will probably need to set the user table's auto-increment value past the highest rev_user in the revision table. Otherwise new accounts may be handed IDs that already appear in imported revisions, leading to weird attribution issues.
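
Something along these lines should do it - a sketch using the oursql module the python grabbers already depend on, with placeholder connection details; add your table prefix if you use one:

import oursql

# placeholder connection details
conn = oursql.connect(host='localhost', user='wikiuser',
                      passwd='secret', db='wikidb')
curs = conn.cursor()
curs.execute('SELECT MAX(rev_user) FROM revision')
(highest,) = curs.fetchone()
if highest:
    # start handing out new user ids above anything already referenced
    curs.execute('ALTER TABLE `user` AUTO_INCREMENT = %d' % (highest + 1))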

It is recommended to import a dump first if you can and just fill in the missing stuff - revisions are huge and take a long time to download - so use MWDumper for that (the importDump.php maintenance script included with MediaWiki was broken as of 1.20, and probably still is). Missing stuff generally includes deleted revisions (archive), protection information (page_restrictions, protected_titles), and obviously anything that changed since the dump was created.

Note that such dumps, particularly of older wikis, can be unreliable and contain inaccurate information. No idea how to get around this properly.

grabDeletedText.php
Deleted content.


 * Maintenance script, requires mediawikibot.class.php

Affects revision, archive (?) tables, originally based on text.pl. Probably also needs to be rewritten again.


 * Lacks support for revision deletion and oversight, so probably skips affected revisions (or possibly just dies, don't remember).
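
The underlying API query looks roughly like this sketch (placeholder URL; list=deletedrevs needs the deletedhistory right, so the session has to be logged in, and the real script pages through the results and writes them to archive):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki
session = requests.Session()                # log this session in first

params = {
    'action': 'query',
    'list': 'deletedrevs',
    'drprop': 'revid|user|comment|content',
    'drlimit': 'max',
    'format': 'json',
}
data = session.get(API, params=params).json()
for page in data['query']['deletedrevs']:
    for rev in page.get('revisions', []):
        print(page['title'], rev.get('user'), rev.get('comment', ''))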

grabLogs.php
Stuff that shows up on Special:Log.


 * Maintenance script, requires mediawikibot.class.php

Affects logging table, originally based on logs.pl.


 * Skips revdeleted entries
 * Uses legacy log_params format, not technically correct for 1.19+
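
Roughly the query it is built around (a sketch with a placeholder URL; the real script turns each entry into a row in the logging table):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'list': 'logevents',
    'leprop': 'ids|type|title|user|timestamp|comment|details',
    'lelimit': 'max',
    'ledir': 'newer',    # oldest entries first
    'format': 'json',
}
data = requests.get(API, params=params).json()
for entry in data['query']['logevents']:
    print(entry['logid'], entry['type'], entry['action'],
          entry.get('title'), entry.get('user'), entry['timestamp'])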

grabInterwikiMap.php
Supported interwiki prefixes - these show up on Special:Interwiki if Extension:Interwiki is installed.


 * requires mediawikibot.class.php

Affects interwiki table, originally based on interwiki.pl.

Can either import all interwikis or just the interlanguage links, though getting all the interwikis is generally recommended to maintain compatibility.
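
The whole map comes back from a single siteinfo request, which is also how you can tell the interlanguage prefixes apart from the rest (a sketch with a placeholder URL):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'meta': 'siteinfo',
    'siprop': 'interwikimap',
    'format': 'json',
}
iwmap = requests.get(API, params=params).json()['query']['interwikimap']
for iw in iwmap:
    # interlanguage prefixes carry a 'language' key, plain interwikis don't
    kind = 'language' if 'language' in iw else 'interwiki'
    print(kind, iw['prefix'], iw['url'])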

grabFiles.php, grabImages.php
Files and file info (descriptions are page content).


 * Maintenance scripts, require mediawikibot.class.php

If you just want to get the files off a wiki and don't care about descriptions or old revisions, or something along those lines, you can use grabImages.php to only download them, without touching the database (then import them normally with the importImages.php maintenance script that comes with MediaWiki).
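
A sketch of that sort of files-only run (placeholder URL; current versions only, no old revisions or description pages):

import os
import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'list': 'allimages',
    'aiprop': 'url|timestamp|sha1',
    'ailimit': 'max',
    'format': 'json',
}
if not os.path.isdir('images'):
    os.makedirs('images')
while True:
    data = requests.get(API, params=params).json()
    for img in data['query']['allimages']:
        # current version only; old versions need prop=imageinfo per file
        with open(os.path.join('images', img['name']), 'wb') as out:
            out.write(requests.get(img['url']).content)
    if 'continue' not in data and 'query-continue' not in data:
        break
    cont = data.get('continue') or data['query-continue']['allimages']
    params.update(cont)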

Use grabFiles.php for a full dump - it imports files directly (so that the log entries and file descriptions pulled in by the other scripts still line up) and includes old revisions.

Affects image and oldimage tables, originally based on images.pl and download-images.pl.

grabDeletedFiles.php
Deleted files and file info.


 * Maintenance script, requires mediawikibot.class.php

Affects filearchive table

Only works if the wiki you are grabbing from uses a known deleted file hashing configuration (the default is assumed). If you don't know it, you apparently need a screenscraper, since the API doesn't support actually downloading deleted files.
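
For reference, this is how the default layout appears to work: the storage key is the base-36 SHA-1 of the file contents (31 characters, the same value as fa_storage_key) plus the extension, stored three hash levels deep under the deleted directory. The sketch below is a reading of that default configuration, not something the script itself exposes:

import hashlib

def deleted_path(contents, ext):
    # storage key: base-36 SHA-1 of the file contents, padded to 31 chars,
    # plus the lowercased extension (the same value as fa_storage_key)
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    n = int(hashlib.sha1(contents).hexdigest(), 16)
    key = ''
    while n:
        n, rem = divmod(n, 36)
        key = digits[rem] + key
    key = key.rjust(31, '0') + '.' + ext.lower()
    # default layout: three hash levels under $wgDeletedDirectory
    return '%s/%s/%s/%s' % (key[0], key[1], key[2], key)

with open('Example.png', 'rb') as f:
    print(deleted_path(f.read(), 'png'))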

Extension:MediaWikiAuth
Imports user accounts on login. Note that this requires the site you are copying from to still be online, since it authenticates against that site.

Affects user, user_properties, watchlist tables


 * Uses screenscraping as well as the API due to incomplete functionality.
 * Updates user IDs in most other tables to match the imported ID, though apparently not the user ID in log_params for account creation entries

Python scripts

 * The python scripts will currently populate the ipblocks, user_groups, page_restrictions, and protected_titles tables.

It's recommended that you use Python 2.7.2+. You will need to install oursql and requests.

You need to edit the settings file to set the site you want to import from and your database information.

The easiest way to run everything is the wrapper script, which executes all four individual scripts. You can also run each script individually if you choose (so you can run them concurrently).

Note: Autoblocks will not be imported, since the data about which IP address is actually being blocked is not available.
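
For reference, the block data comes from list=blocks, and autoblocks are the entries flagged as automatic (a sketch with a placeholder URL):

import requests

API = 'https://wiki.example.org/w/api.php'  # placeholder for the source wiki

params = {
    'action': 'query',
    'list': 'blocks',
    'bkprop': 'id|user|by|timestamp|expiry|reason|flags',
    'bklimit': 'max',
    'format': 'json',
}
data = requests.get(API, params=params).json()
for block in data['query']['blocks']:
    if 'automatic' in block:
        continue  # autoblock: no usable target address, skipped on import
    print(block.get('user'), block['expiry'], block.get('reason', ''))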

Other stuff
Not grabbers, but things to potentially worry about.


 * Configuration stuff - groups, namespaces, etc.
 * Extensions
 * Extension stuff - AbuseFilter, AJAXPoll, CheckUser, SocialProfile, and others have their own tables and stuff
 * Secondary tables - the above grabber scripts generally just set the primary tables; secondary tables such as category, redirect, site_stats, etc. can be rebuilt with the maintenance scripts included with MediaWiki (rebuildall.php, refreshLinks.php, initSiteStats.php, and so on).