This page describes a series of grabber scripts designed to get a wiki's content without direct database access. If you don't have a database dump or access to the database and you need to move or back up a wiki, the MediaWiki API provides access to most of what you need. These scripts require MediaWiki 1.29+ since Gerrit change 376957, and have also been tested on MediaWiki 1.30.
Appropriate access on the target wiki is required to get private or deleted data, but most scripts will just work without such access. This document was originally compiled, and the scripts assembled, in order to move Uncyclopedia; because the overall goal was to just get the damn thing moved, 'pretty' was not exactly in our vocabulary when we were setting this up, so some of the scripts are still kind of a horrible mess. However, many of them have since been revised and made more robust, and were used successfully to move several wikis from Wikia to a new host.
These scripts replicate the database with the same public identifiers (revision ID, log ID, article ID), so most of them must be run on a clean, empty database (with just the table structure) or on a database that has the same IDs as the remote wiki being replicated.
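To illustrate what "same public identifiers" means in practice, here is a minimal Python sketch of how revision and page IDs can be read from the MediaWiki API so that they can be stored unchanged. It is not taken from the grabber scripts themselves; the function names and the example URL are made up for illustration, and only standard `action=query` / `prop=revisions` parameters are used.

```python
from urllib.parse import urlencode


def revisions_query_url(api_url, page_id, limit=50):
    """Build an API URL that lists a page's revisions together with their IDs.

    `api_url` is the remote wiki's api.php endpoint (hypothetical here).
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "pageids": page_id,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "rvdir": "newer",  # oldest first, so parents are seen before children
    }
    return api_url + "?" + urlencode(params)


def rows_from_response(data):
    """Turn a decoded API response into (page_id, rev_id, parent_id) tuples.

    The IDs are kept exactly as the remote wiki reports them, which is why
    the local database must not already use those IDs for something else.
    """
    rows = []
    for page in data.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            rows.append((page["pageid"], rev["revid"], rev.get("parentid", 0)))
    return rows
```

Because the remote IDs are written as-is, inserting them into a database that already contains unrelated rows with the same IDs would cause collisions, hence the requirement for an empty or ID-compatible database.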
Stuff to get
If you're moving an entire wiki, these are probably what you need to get. More information on the tables can be found on Manual:database layout, but the secondary tables can be rebuilt based on these. Otherwise you probably know what you want.
- Revisions: text, revision, page, page_restrictions, protected_titles, archive (most hosts will provide an xml dump of at least text, revision, and page tables, which add up to the bulk of a wiki)
- Logs: logging
- Interwiki: interwiki
- Files (including deleted) and file data: image, oldimage, filearchive
- Users: user, user_groups, user_properties, watchlist, ipblocks
- Other stuff probably (this is just core; extensions often add other tables and stuff).
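The list above can be kept as a small checklist while a move is in progress. The following Python sketch simply encodes the core table groups named above and reports which ones have not been populated yet; the `missing_tables` helper is hypothetical and not part of the grabbers repository, and extension tables are deliberately left out.

```python
# Core tables grouped as in the list above (extensions add more).
TABLE_GROUPS = {
    "revisions": ["text", "revision", "page", "page_restrictions",
                  "protected_titles", "archive"],
    "logs": ["logging"],
    "interwiki": ["interwiki"],
    "files": ["image", "oldimage", "filearchive"],
    "users": ["user", "user_groups", "user_properties", "watchlist",
              "ipblocks"],
}


def missing_tables(populated):
    """Return, per group, the core tables that are not in `populated`."""
    have = set(populated)
    gaps = {}
    for group, tables in TABLE_GROUPS.items():
        absent = [t for t in tables if t not in have]
        if absent:
            gaps[group] = absent
    return gaps
```

For example, after importing only an XML dump of text, revision and page, `missing_tables(["text", "revision", "page"])` lists everything that still has to be grabbed.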
- php files should be in the code repository.
- python files have been added to the repository too.
- java is MWDumper.
- No ruby is involved. So far.
These are maintenance scripts and write what they grab straight into the wiki's database. To "install" them:
- get MediaWiki core,
- from the same path, download the scripts, e.g. with
git clone https://gerrit.wikimedia.org/r/mediawiki/tools/grabbers.git
- It needs a working LocalSettings.php with database credentials, and a working MediaWiki database, so be sure you've set up the wiki first
- You can create it quickly by running
php maintenance/install.php --server="http://dummy/" --dbname=grabber --dbserver="localhost" --installdbuser=root --installdbpass=rootpassword --lang=en --pass=aaaaa --dbuser=grabber --dbpass=grabber --scriptpath=/ GrabberWiki Admin
- Some configuration variables in LocalSettings.php that those scripts support: $wgDBtype, Manual:$wgCompressRevisions, External storage.
- If you're importing all the contents with grabText.php, be sure to remove all rows from the text tables prior to running the script.
- If you need to log in on the target wiki on recent versions of MediaWiki (which is sometimes required, for example when grabbing deleted text, or desirable because of the higher API limits), you need to set up a bot password on the external wiki.
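As a rough sketch of what logging in with a bot password involves, the Python code below performs the standard two-step API login: fetch a login token, then post it with the bot credentials. It is an illustration under assumptions, not the code the grabber scripts use; the function names are hypothetical, and the bot name is assumed to be in the usual `Username@botname` form.

```python
import json
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def token_request():
    """Parameters for fetching a login token."""
    return {"action": "query", "meta": "tokens", "type": "login",
            "format": "json"}


def login_request(bot_name, bot_password, token):
    """Parameters for action=login with a bot password."""
    return {"action": "login", "lgname": bot_name,
            "lgpassword": bot_password, "lgtoken": token, "format": "json"}


def login(api_url, bot_name, bot_password):
    """Log in and return an opener that carries the session cookies."""
    # Cookies must persist between the token request and the login request.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    def post(params):
        body = urlencode(params).encode("utf-8")
        with opener.open(api_url, body) as resp:
            return json.load(resp)

    token = post(token_request())["query"]["tokens"]["logintoken"]
    result = post(login_request(bot_name, bot_password, token))
    if result["login"]["result"] != "Success":
        raise RuntimeError("login failed: %r" % (result["login"],))
    return opener
```

The returned opener can then be reused for later API requests so that they run with the logged-in account's rights and limits.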
|grabText.php||Page content (live).||
|grabNewText.php||New content, for filling in edits from after a dump was created and imported.||
|grabLogs.php||Stuff that shows up on Special:Log.||
|grabInterwikiMap.php||Supported interwiki links - show up on Special:Interwiki if Extension:Interwiki is installed.||
|grabFiles.php||Files and file info, including old versions (descriptions are page content).||
|grabNewFiles.php||Files and file info to update a site that had used grabFiles.php already.||
|grabImages.php||Current file versions, without database info||n/a||
|grabDeletedFiles.php||Deleted files and file info.||
|grabUserGroups.php||Groups users belong to.||
|grabProtectedTitles.php||Pages protected from creation.||
- The tools/grabbers/ python scripts will currently populate the ipblocks, user_groups, page_restrictions, and protected_titles tables.
You need to edit settings.py and set the site you want to import from, and your database information.
The easiest way to run everything is just
$ python python_all.py
which executes all four individual scripts. You can also run each script individually if you choose (so you can run them concurrently).
Note: Autoblocks will not be imported, since we do not have the data about which IP address is actually being blocked.
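If you do want to run the individual scripts concurrently, one possible approach is a small launcher like the Python sketch below. It is only an illustration: `run_concurrently` is a hypothetical helper, not something shipped in the repository, and the script paths are whatever the individual scripts are called in your checkout.

```python
import subprocess
import sys


def run_concurrently(script_paths, python=sys.executable):
    """Start every script at once, wait for all of them to finish,
    and return a dict mapping each path to its exit code."""
    # Launch all processes first so that they really overlap in time.
    procs = {path: subprocess.Popen([python, path]) for path in script_paths}
    # Then wait for each one; a non-zero code means that script failed.
    return {path: proc.wait() for path, proc in procs.items()}
```

Checking the returned exit codes matters here, because a failure in one script does not stop the others.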
- Wiki-Export - another set of scripts to grab pages and files, in python
If you used grabbers to import all revisions and files, the tables will already be populated with the same user IDs as the original wiki. Use the StubUserWikiAuth extension to populate the user table with "stub" rows that don't contain password information, then configure the extension as a PasswordAuthenticationProvider as described in the extension's manual. If a user attempts to log in and their row in the user table is still a stub, the extension performs the login on the remote wiki instead. If that login is successful, it creates a new hash of the password and stores it in the database. Any further logins are handled locally only.
This is recommended over Extension:MediaWikiAuth, since all user IDs will already be correct.
Download source code: https://github.com/ciencia/mediawiki-extensions-StubUserWikiAuth
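The stub-login flow described above can be sketched as follows. This is a simplified Python model of the logic only, not the extension's actual (PHP) implementation: the in-memory `users` dict, the `remote_check` callback and the PBKDF2 parameters are all assumptions made for illustration.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, 100_000)
    return salt, digest


def log_in(users, name, password, remote_check):
    """`users` maps a user name to None (stub row) or (salt, digest).
    `remote_check(name, password)` stands in for a login on the remote wiki.
    """
    if name not in users:
        return False
    stored = users[name]
    if stored is None:
        # Stub row: no local password yet, so ask the remote wiki.
        if not remote_check(name, password):
            return False
        # Remote login worked: store a local hash, the row is no longer a stub.
        users[name] = hash_password(password)
        return True
    # Regular row: verify locally, the remote wiki is not contacted.
    salt, digest = stored
    return hmac.compare_digest(hash_password(password, salt)[1], digest)
```

The point of the design is that the remote wiki is contacted at most until the first successful login of each user; after that the account behaves like any local one.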
Imports user accounts on login. Note that this requires the site you are copying from to still be active to use their authentication.
Affects user, user_properties, watchlist tables
- Uses screen scraping as well as the API, because the API's functionality is incomplete.
- Updates user IDs in most other tables to match the imported ID, though apparently not the userid in log_params for user creation entries.
Not grabbers, but things to potentially worry about.
- Configuration stuff - groups, namespaces, etc
- Extension stuff - ajaxpoll, checkuser, socialprofile, and others have their own tables and stuff
- Secondary tables - the above grabber scripts generally just set the primary tables; secondary tables such as category, redirect, site_stats, etc can be rebuilt using other maintenance scripts included with MediaWiki, usually rebuildall.php.