User:Halz~mediawikiwiki/Mass despamming

From mediawiki.org

Draft page. Would it be useful to turn this into a proper documentation page?


Approaches to removing spam from a heavily spam-clogged wiki. This is a related but distinct challenge from preventing the spam in the first place.

Extensions

Extension:Nuke - Performs page deletions (sysop deletes, which can still be undone), but lets you do lots of them quickly. A few things to note about using it:

  • It only queries newly created pages, retrieving the most recent first. For de-spamming this can be very useful, since the newest pages are likely to be the spam while older ones are not. With all filter fields left blank, it is well suited to combating spam as it comes in.
  • The "pattern" field accepts SQL LIKE patterns, e.g. "%viagra%" for any page title containing that word (not the same as a Perl regexp). It applies only to the page title, not to the page content.
  • It only queries the recentchanges database table. On a big or seriously spam-bloated wiki this table may be large, yet still not large enough to include every page. For example you might find 500 pages mentioning viagra in the title, delete them all, repeat several times... and after seeming to clean them all, still have lots of pages left with it in the title. The rebuildrecentchanges.php maintenance script rebuilds the recentchanges table, which may help with this.
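The SQL LIKE semantics of the "pattern" field can be sketched with a throwaway SQLite table (the titles here are hypothetical; % matches any run of characters, unlike a regexp's .*):

```python
import sqlite3

# In-memory stand-in for the page table, with hypothetical titles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?)",
    [("Cheap_viagra_now",), ("Main_Page",), ("Buy_viagra_online",), ("Help:Editing",)],
)

# The same pattern you would type into Nuke's "pattern" field:
# % is a wildcard for any run of characters.
spam = [row[0] for row in conn.execute(
    "SELECT page_title FROM page WHERE page_title LIKE '%viagra%'"
)]
print(spam)  # the two viagra titles, not Main_Page or Help:Editing
```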

Extension:BatchDelete - Deletes pages given a list of page titles (sysop delete, which can still be undone).

  • Easy to install and does what you'd expect.
  • Generate the list of pages to delete by some other means, e.g. the SQL query described below.
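One way to produce that list is to dump matching titles one per line, ready to paste into BatchDelete. A minimal sketch using an in-memory SQLite stand-in (the table contents, pattern, and output filename are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your wiki's database
conn.execute("CREATE TABLE page (page_namespace INTEGER, page_title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(0, "Cheap_viagra_now"), (0, "Main_Page"), (0, "Discount_viagra")],
)

# One title per line, ready to paste into BatchDelete.
titles = [t for (t,) in conn.execute(
    "SELECT page_title FROM page "
    "WHERE page_namespace = 0 AND page_title LIKE '%viagra%'"
)]
with open("spam_titles.txt", "w") as f:
    f.write("\n".join(titles))
```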

Extension:NukeDPL - Offers a more advanced query language for finding pages to delete.

  • A small extension which allows mass deleting based on the query cleverness of another, much more complicated third-party extension, DPL.
  • This small glue extension doesn't seem very well maintained. It has an awkward install procedure and generally feels a bit flaky, but it can be made to work.
  • The query language seems powerful, but the display gives no useful debug info.
  • Weirdly, straightforward content queries don't work.

Find spammy pages

Surprisingly hard. If spam is flooding in, it becomes a challenge of finding the non-spammy pages (otherwise you can assume it's all spam and blindly use Extension:Nuke).

SQL

A handy query if you have raw SQL access. It can be used to supply page titles to BatchDelete. See the filter conditions at the bottom; the wiki_ table prefix here is the configurable $wgDBprefix, so adjust it to match your installation.

SELECT page_namespace AS ns,
       page_title,
       first_edit,
       rev_timestamp AS last_edit,
       rev_count_query.rev_count,
       /*rev_users,*/
       user_name,
       user_registration
FROM wiki_page
JOIN wiki_revision ON page_latest = rev_id
JOIN wiki_text ON rev_text_id = old_id
JOIN (
  SELECT rev_page,
         MIN(rev_timestamp) AS first_edit,
         COUNT(*) AS rev_count,
         COUNT(DISTINCT rev_user) AS rev_users,
         MAX(rev_user) AS the_user_id /* Warning: depends on rev_users=1 clause */
  FROM wiki_revision
  GROUP BY rev_page
) AS rev_count_query ON rev_count_query.rev_page=page_id
JOIN wiki_user ON user_id=the_user_id
WHERE old_text LIKE '%viagra%'
AND rev_users=1                         /* pages with only one user involved */
/*AND first_edit > 20120401000000 */      /* pages first edited after date */
/*AND user_registration > 20110801000000 */ /* where the user first registered after date */
AND user_name NOT IN ('Halz', 'OtherName', 'AndAnother') /* user whitelist */
LIMIT 15;
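The core trick in the subquery above — grouping revisions per page and keeping only pages touched by a single user, so that MAX(rev_user) safely identifies "the" author — can be checked on a toy dataset. A sketch in SQLite with the schema stripped down (real MediaWiki tables carry the prefix and many more columns; the rev_users=1 filter is written as HAVING here, equivalent to the outer WHERE in the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_page INTEGER, rev_user INTEGER, rev_timestamp TEXT);
-- Page 1: edited only by user 7 (single-author spam suspect).
INSERT INTO revision VALUES (1, 7, '20120401120000'), (1, 7, '20120401120500');
-- Page 2: edited by two users (leave it alone).
INSERT INTO revision VALUES (2, 7, '20120301000000'), (2, 3, '20120302000000');
""")

# Same aggregate shape as the rev_count_query subquery:
# MAX(rev_user) only names "the" author because rev_users = 1.
rows = conn.execute("""
SELECT rev_page,
       MIN(rev_timestamp) AS first_edit,
       COUNT(*) AS rev_count,
       COUNT(DISTINCT rev_user) AS rev_users,
       MAX(rev_user) AS the_user_id
FROM revision
GROUP BY rev_page
HAVING rev_users = 1
""").fetchall()
print(rows)  # only page 1 survives the single-user filter
```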