Thread:Project:Support desk/need help cleaning up spammed wiki

Greetings,

We've run a university-based MediaWiki instance

https://www.artsrn.ualberta.ca/fwa_mediawiki/index.php?title=FolkwaysAlive!_Wiki_Main_Page

for years, but for several of those years it was unfortunately subject to heavy spam attacks, resulting in the daily insertion of thousands of junk pages containing links to commercial websites (presumably to boost Google PageRank). I noticed and reported the problem, but nothing was done. For a long time, I believe, a user didn't even need to create an account to create a page.

After an administrative change, new IS people eventually removed much of the junk and closed the wiki to any new posts except from admins. The result is a less useful but still well-used read-only wiki; however, much spam remains.

Now I'm being told (due to liability issues) either to eliminate all the spam, or to manually copy the good pages to a new, better-secured instance. But I've created thousands of useful pages, there are probably tens of thousands of spam pages, and I don't have unlimited time either to hand-copy or to hand-delete, so neither option looks good.

The spam pages are not trivial to identify, except perhaps by an external algorithm (see 2 below). The spam bots created new users, so we can't just delete the pages of particular users (a blacklist), though we might have better luck deleting all pages except those created by a particular set of users (a whitelist), since there are far fewer of the latter. Spam bots typically did not overwrite good pages, but created new ones. These are not usually orphans or regular pages; most often they are Talk pages, which are a well-defined class of page, and Talk pages are rarely if ever useful in our wiki. So it may be easy to avoid harmful false positives if we want to catch most of the spam; it is harder to avoid harmful false negatives if we want to catch all of it.

In any case, what I think I need are tools allowing me to mass delete pages likely to contain spam. The more flexible, the better.

1) For instance, again, nearly all the spam is on Talk pages. I wouldn't mind deleting them all, as I've never used them, but I don't know how. Is there a way to flexibly delete all pages that meet a certain condition, e.g. that are Talk pages, or that contain a certain string, or that were created by a certain class of user (for instance, allowing me to provide a whitelist of good users and deleting the pages of all other users)?
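One approach I could imagine (a sketch only, with invented sample data): build a list of page titles to delete and feed it to MediaWiki's maintenance/deleteBatch.php script, which takes a file of titles, one per line. Here the page list and creators are hand-made placeholders; on a real wiki they would come from api.php (list=allpages, with apnamespace=1 for Talk pages) and from each page's first revision. Detecting Talk pages by title prefix is a simplification of MediaWiki's namespace system.

```python
# Sketch: build a deletion list for MediaWiki's maintenance/deleteBatch.php.
# The (title, creator) pairs below are invented samples; real data would come
# from api.php?action=query&list=allpages (apnamespace=1 etc. for talk
# namespaces) plus each page's first-revision author.

# Talk-type namespaces, identified here by title prefix (a simplification).
TALK_NAMESPACES = {
    "Talk", "User talk", "File talk", "Template talk",
    "Help talk", "Category talk", "MediaWiki talk", "Project talk",
}

def should_delete(title, creator, whitelist):
    """Delete a page if it is a Talk page, or its creator is not whitelisted."""
    prefix = title.split(":", 1)[0] if ":" in title else ""
    return prefix in TALK_NAMESPACES or creator not in whitelist

# Invented sample pages (title, creator):
pages = [
    ("Talk:Some article", "SpamBot123"),
    ("FolkwaysAlive! Wiki Main Page", "Michael"),
    ("Recordings of Egypt", "Michael"),
    ("Talk:Recordings of Egypt", "AnotherBot"),
]
whitelist = {"Michael"}  # the set of known-good users

to_delete = [t for t, c in pages if should_delete(t, c, whitelist)]

# Write one title per line; then, on the server, something like:
#   php maintenance/deleteBatch.php -u Admin -r "spam cleanup" delete_list.txt
with open("delete_list.txt", "w") as f:
    f.write("\n".join(to_delete) + "\n")
```

(deleteBatch.php logs each deletion normally, so individual pages could still be undeleted afterwards if a good page slips into the list.)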

2) We were prompted by a report from Research and Education Networking ISAC (http://www.ren-isac.net) providing URLs for spam pages. If their algorithms (I don't know what they are) could generate a complete list, perhaps I could delete the corresponding pages - but I'd need a tool for this as well, one taking a URL list as input and deleting the corresponding pages.
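Converting such a URL list into something deletable might look like this sketch: extract the title= parameter from each of our index.php-style URLs and write the titles out for deleteBatch.php. The sample URLs are invented stand-ins for the report's contents.

```python
# Sketch: turn a list of spam-page URLs (e.g. from a REN-ISAC report) into
# page titles that maintenance/deleteBatch.php accepts, one per line.
# The sample URLs below are invented; the real input would be the report file.
from urllib.parse import urlparse, parse_qs

def url_to_title(url):
    """Extract the MediaWiki page title from an index.php?title=... URL."""
    qs = parse_qs(urlparse(url).query)  # parse_qs also decodes %-escapes
    if "title" not in qs:
        return None
    # MediaWiki uses underscores for spaces in URLs; either form works for
    # deleteBatch.php, but spaces read more naturally in the list file.
    return qs["title"][0].replace("_", " ")

urls = [  # invented examples in this wiki's URL style
    "https://www.artsrn.ualberta.ca/fwa_mediawiki/index.php?title=Talk:Cheap_Widgets",
    "https://www.artsrn.ualberta.ca/fwa_mediawiki/index.php?title=Buy_Pills_Now&action=view",
]
titles = [t for t in (url_to_title(u) for u in urls) if t]

with open("spam_titles.txt", "w") as f:
    f.write("\n".join(titles) + "\n")
# Then, on the server, something like:
#   php maintenance/deleteBatch.php -u Admin -r "REN-ISAC spam report" spam_titles.txt
```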

Prior to any mass deletions we could archive a copy of the wiki, allowing us to restore false positives later. Then we could mass copy the remaining (good) pages to a new mediawiki instance with better protection, and any missing pages could be manually retrieved from the archive.

I'd really appreciate any advice on this, whether for (1) and (2) above or for new strategies, so long as they won't require exorbitant amounts of time. I don't want to lose so many years of work, but I don't have years to put into saving it either!

many, many thanks,

Michael Frishkopf
michaelf@ualberta.ca
wiki: https://www.artsrn.ualberta.ca/fwa_mediawiki/index.php?title=FolkwaysAlive!_Wiki_Main_Page

PS: Here is our version information:

Product     Version
MediaWiki   1.21.1
PHP         5.3.10-1ubuntu3.11 (cgi-fcgi)
MySQL       5.5.37-0ubuntu0.12.04.1

Entry point URLs

Entry point    URL
Article path   /fwa_mediawiki/index.php?title=$1
Script path    /fwa_mediawiki
index.php      /fwa_mediawiki/index.php
api.php        /fwa_mediawiki/api.php
load.php       /fwa_mediawiki/load.php