Thread:Project:Support desk/Cleaning up externallinks table

I'm looking for a way to find and delete all records in the externallinks table with a page identifier el_from that doesn't exist as a page_id in the page table.

I'm having:


 * MediaWiki 1.19.1
 * PHP 5.3.21-1
 * MySQL 5.1.61

I'll explain why I want to do this.

I'm running a fairly new wiki about genealogy in Brussels (Belgium). A few days ago, the wiki was under attack and about 50,000 spam accounts were created. This was due to the fact that anyone could create his own account without a bureaucrat or administrator having to approve it (my mistake, it's the very first wiki ever I installed).

I took away the initial cause of the problem by installing Extension:ConfirmAccount. That seems alright, no new (spam) accounts have been created since.

Now I'm working on cleaning up the database. I'm doing this with phpMyAdmin, directly into the MySQL itself, since I couldn't find any MediaWiki extension that not only removes the spam accounts, but also all other data (in other tables) related to them (e.g. the spam accounts created pages, user pages and all other sorts of data in quite a few tables). I do know little about MySQL and PHP, but let's say I have sufficient insight to proceed in a reasoned way.

To clean up the mess, I go through all the tables and I remove records related to the spam accounts. Luckely, the wiki is fairly young and only two real user accounts exist, so I could filter on user_id to remove the spam accounts in the user table itself, and filter on foreign keys to remove related records in all other tables. After removing large portings of records from a table, I defragment and optimize the table. Although still time consuming, this seems to work fine and I believe I didn't mess up things.

Now I'm at the stage that only two more tables should be cleaned up. These are the page table and the externallinks table.

There are about 35,000 spam records in the page table, I will have to manually remove them by sight, since there is no formal criterion to tell if a page is real or spam. But spam pages are grouped in large chuncks, so this simplifies the process a bit.

'''The externallinks table contains about 200,000 links. There's no way to clean up the table manually by sight, because I can't always tell which link is real and which is spam (even if I could, this process would be very error-prone). So, this is why I'm looking for a way to find and delete all records in the externallinks table that refer to non-existent / no longer existing pages in the page table.'''

Thank you in advance! Kind regards.