Manual:Combating spam

Because of their open editing nature, wikis are a common target for spammers wishing to promote products or web sites. MediaWiki offers a number of features designed to help block wiki spam.

Common tools used to combat wiki spam typically fall into these categories:
 * Requiring user validation and/or CAPTCHA on certain operations, such as edits, new external links or new user creation
 * Blocking robots and open proxies operating on known blacklisted IP addresses
 * Blocking edits which add specific unwanted keywords or external links
 * Blocking specific username and page title patterns commonly used by spambots
 * Blocking edits by new or anonymous users to specific often-targeted pages
 * Blocking registrations from known spammer usernames or e-mail addresses
 * Whitelisting known-good editors (such as admins and regular contributors) while applying CAPTCHA or other restrictions only to new, unknown or anonymous users
 * Cleanup scripts or bulk deletion (Nuke) of existing posts from recently-banned spambots

Normally a combination of various methods will be used, in an attempt to keep the number of spam, robot and open-proxy edits to a minimum while limiting the amount of disruption caused to legitimate users of the site.

Individual page protection
Frequently-spammed pages may be protected from editing by new and anonymous users by using semi-protection of individual pages. Often, the same page is hit repeatedly by spambots. As most abusive edits on wikis which don't require registration to edit come from anonymous sources, blocking edits to these specific pages by anyone other than established users can prevent re-creation of deleted spamdump pages.

Common patterns observed in spambot-created pagenames include:
 * Name of a legit content page on the same wiki, appended with '/' or with '/index.php'
 * Talk page for an article, often outside main space (Forum_talk: or Category_talk: are little-used, so make common targets)
 * Forum: and discussion pages

Typically, any page which is already a regular visitor to Special:Log/delete on an individual wiki is a good candidate for page protection.

Edit filtering
MediaWiki provides a means to filter the text of edits in order to block undesirable additions, through the $wgSpamRegex configuration variable. You can use this to block additional snippets of text or markup associated with common spam attacks. For example, a pattern which matches hidden or overflowing elements will block edits which attempt to add them; such elements are a common "trick" used in a lot of mass-edit attacks to attempt to hide the spam from viewers.
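As a minimal sketch of such a filter, the following LocalSettings.php fragment rejects edits containing an HTML tag whose inline style hides the element or clips its overflow. The pattern is illustrative only and is an assumption, not necessarily the expression originally shown here; it may need tuning to avoid blocking legitimate markup on your wiki.

```php
// LocalSettings.php fragment (illustrative pattern, adjust before use).
// Blocks any edit whose text contains a tag with an inline style that
// sets display:none or overflow:auto / overflow:hidden.
$wgSpamRegex = '/<[^>]*style\s*=[^>]*(display\s*:\s*none|overflow\s*:\s*(auto|hidden))[^>]*>/i';
```

A broad pattern like this trades some false positives for coverage, so it is worth testing it against existing pages that use inline styles legitimately.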

SpamBlacklist
A popular extension for MediaWiki is the SpamBlacklist extension which blocks edits that add blacklisted URLs to pages. The TitleBlacklist extension may also be useful, as a means to prevent re-creation of specific groups of pages which are being used by the 'bots to dump linkspam.

CAPTCHA
One of the more common methods of weeding out automated submissions is to use a CAPTCHA. The ConfirmEdit extension for MediaWiki provides an extensible CAPTCHA framework which can be triggered on a number of events, including:

 * all edits
 * edits adding new, unrecognized external links
 * user registration
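The triggers listed above might be configured along the following lines. This is a sketch under assumptions: it assumes a MediaWiki version that loads extensions with wfLoadExtensions(), picks the QuestyCaptcha module purely as an example, and uses a made-up question and answer.

```php
// LocalSettings.php fragment (sketch; the module choice is an example).
wfLoadExtensions( [ 'ConfirmEdit', 'ConfirmEdit/QuestyCaptcha' ] );

// Hypothetical question/answer pair; use something specific to your wiki.
$wgCaptchaQuestions = [
    'What is the name of this wiki?' => 'ExampleWiki',
];

$wgCaptchaTriggers['edit']          = false; // do not challenge every edit
$wgCaptchaTriggers['addurl']        = true;  // edits adding new external links
$wgCaptchaTriggers['createaccount'] = true;  // user registration

// Let established users skip the challenge entirely.
$wgGroupPermissions['autoconfirmed']['skipcaptcha'] = true;
```

Challenging only link-adding edits and registrations keeps the burden on ordinary editing low while still covering the operations spambots perform most often.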

The extension ships with a default test, but this is a reference implementation, and is not intended for production use. Wiki operators installing ConfirmEdit on a public wiki are advised to use one of the CAPTCHA modules contained within the extension (there are five in total).

It is important to note that CAPTCHAs can block more than undesirable bots: if a script is unable to pass a CAPTCHA, then so is a screen reader, or other software or aid used by the blind or visually impaired. One of the options in ConfirmEdit, the "reCAPTCHA" widget, includes an alternative audio CAPTCHA for such cases. Still, you should consider the implications of such a barrier, and possibly provide an alternative means for affected users to create accounts and contribute.

IP address blacklists
Much of the most problematic spam received on MediaWiki sites comes from addresses long known by other webmasters as bot or open proxy sites. These bots typically generate large numbers of automated registrations to forum sites, comment spam to blogs and page vandalism to wikis (most often linkspam, although existing content is sometimes blanked, prepended with random gibberish characters or edited in such a way as to break existing Unicode text). Left unchecked, Special:RecentChanges can quickly fill with random-character edit summaries from IP addresses in foreign countries far outside your target audience, with "content" turned into either random text or linkspam.

A relatively simple CAPTCHA may significantly reduce the problem, as may blocking the creation of certain often-spammed pages (such as names with trailing '/', '/index.php' and specific, targeted talk: or forum: pages). These measures do not eliminate the problem, however, and at some point tightening security for all users will inconvenience legitimate contributors.

It would be preferable, instead of relying solely on CAPTCHA or other precautions which affect all users, to target specifically those IPs already known by other site masters to be havens of net.abuse. Many lists are already available; for instance, stopforumspam.com has a list of "All IPs in CSV" which (as of 2009) contains about seventy-five thousand IPs of known spambots.

An example: importing stopforumspam's IP list
Banning that many spambots individually would be prohibitively slow, but grabbing a copy of the existing stopforumspam list from a *nix host shell and importing it into MediaWiki can be done in mere minutes.

The list of spambot IPs will appear as a single line of text about one million bytes long, with the addresses separated by commas. Do a search-and-replace on the separators to break this huge line into a 75000-line list of bad IPs, one address per line.

Then add an opening line at the top of the list and a closing line at the end, so that the file is valid PHP which can be included from LocalSettings.php.
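The conversion described above could be sketched in PHP roughly as follows. It assumes the addresses are assigned to the $wgProxyList configuration array; the function name buildProxyListFile and the sample addresses are hypothetical and used only for illustration.

```php
<?php
// Sketch: turn one comma-separated line of IPs into the text of a PHP
// file that assigns them to $wgProxyList (an assumed target variable).
function buildProxyListFile(string $csv): string
{
    // Split on commas, trim whitespace, drop empty entries.
    $ips = array_filter(
        array_map('trim', explode(',', $csv)),
        fn(string $ip): bool => $ip !== ''
    );

    // One quoted address per line, as in the hand-edited version.
    $lines = array_map(
        fn(string $ip): string => '    ' . var_export($ip, true) . ',',
        $ips
    );

    return "<?php\n\$wgProxyList = [\n" . implode("\n", $lines) . "\n];\n";
}

// Usage with documentation-range example addresses:
echo buildProxyListFile('192.0.2.1,198.51.100.7');
```

Generating the file from a script rather than by hand makes it easier to refresh the list whenever a newer copy is downloaded.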

Alternately, a few *nix commands (issued in your extensions directory) can do all of the above.

You may want to save these commands in a file called e.g. updateBannedIPs.sh, so you can run it periodically.

Once you've saved your new-found list of bad bots, simply include it from LocalSettings.php as if it were any standard MediaWiki extension.
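For illustration, the include might look like the line below. The file name bannedips.php is an assumption; use whatever name you saved the list under.

```php
// LocalSettings.php fragment; "bannedips.php" is a hypothetical file name.
require_once "$IP/extensions/bannedips.php";
```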

You have just banned seventy-five thousand spammers, all hopefully without any disruptive effect on your legitimate users. That should make things a wee bit quieter, at least for a while...

¡Adios!

The time you save by not having to block each of these known spambots and their favourite target pages from each of your wikis individually can now be used to allocate a few more megabytes of APC cache... as, with this 1300000-byte file added to your configuration, you'll need it. You've just said «adieu» to a lot of the worst of the known spammers on the Internet. Good riddance!

Honeypots, DNS BLs and HTTP BLs
75000 dead spammers. Not bad, but any proper BOFH at this point would be bored and eagerly looking for the 75001st spam IP to randomly block. And why not?

Fortunately, dynamically-updated lists of spambots, open proxies and other problem IPs are widely available. Many also allow usernames or e-mail addresses (for logged-in users) to be automatically checked against the same blacklists.

One form of blacklist which may be familiar to MediaWiki administrators is the DNS BL. Hosted on a domain name server, a DNS blacklist is a database of IP addresses. An address lookup determines if an IP attempting to register or edit is an already-known source of net abuse.

The $wgEnableSorbs and $wgSorbsUrl options in MediaWiki provide a primitive example of access to a DNS blacklist. Set $wgEnableSorbs to true and $wgSorbsUrl to 'http.dnsbl.sorbs.net.' in LocalSettings.php, and IP addresses listed as open proxies by SORBS are blocked. (Note: the trailing '.' in 'http.dnsbl.sorbs.net.' is required.)

The DNS blacklist operates as follows:
 * A wiki gets an edit or new-user registration request from some random IP address (for example, in the format '123.45.67.89')
 * The four IP address bytes are placed into reverse order, then followed by the name of the desired DNS blacklist server
 * The resulting address is requested from the domain name server (in this example, '89.67.45.123.http.dnsbl.sorbs.net.')
 * The server returns not found (NXDOMAIN) if the address is not on the blacklist; any successful answer means that the address is listed
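The steps above can be sketched in PHP as follows. This is only an illustration of the lookup mechanism, not MediaWiki's own implementation; the function names dnsblQueryName and isListed are hypothetical, and the optional access key parameter reflects the key-prefixed form that some lists require.

```php
<?php
// Build the DNS name to query: reversed IPv4 octets, then the list zone.
function dnsblQueryName(string $ip, string $zone, string $accessKey = ''): string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        throw new InvalidArgumentException("Not an IPv4 address: $ip");
    }
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    $prefix   = $accessKey === '' ? '' : $accessKey . '.';

    return $prefix . $reversed . '.' . ltrim($zone, '.');
}

// An address counts as listed when the name resolves to an A record;
// NXDOMAIN (no record) means it is not on the list.
function isListed(string $ip, string $zone, string $accessKey = ''): bool
{
    return checkdnsrr(dnsblQueryName($ip, $zone, $accessKey), 'A');
}

echo dnsblQueryName('123.45.67.89', 'http.dnsbl.sorbs.net.'), "\n";
// prints 89.67.45.123.http.dnsbl.sorbs.net.
```

Only isListed() touches the network, so the name-building step can be checked on its own without contacting any blacklist server.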

The lookup in an externally-hosted blacklist typically adds no more than a few seconds to the time taken to save an edit. Unlike $wgProxyKey settings, which must be loaded on each page read or write, the use of the DNS blacklist only takes place during registration or page edits. This leaves the speed at which the system can service page read requests (the bulk of your traffic) unaffected.

While the original SORBS was primarily intended for dealing with open web proxies and e-mail spam, there are other lists specific to web spam (forums, blog comments, wiki edits) which therefore may be more suitable:
 * .opm.tornevall.org. operates in a very similar manner to SORBS DNSBL, but targets open proxies and web-form spamming. Much of its content is consolidated from other existing lists of abusive IPs.
 * .dnsbl.httpbl.org. specifically targets 'bots which harvest e-mail addresses from web pages for bulk mail lists, leave comment spam or attempt to steal passwords using dictionary attacks. It requires that the user register with projecthoneypot.org for a 12-character API key. If this key (for example) were 'myapitestkey', a lookup which would otherwise look like '89.67.45.123.http.dnsbl.sorbs.net.' or '89.67.45.123.opm.tornevall.org.' would need to be 'myapitestkey.89.67.45.123.dnsbl.httpbl.org.'
 * Web-based blacklists can identify spammers' e-mail addresses and user information beyond a simple IP address, but there is no standard format for the reply from an HTTP blacklist server. For instance, a request for http://botscout.com/test/?ip=123.45.67.89 would return "Y|IP|4" if the address is blacklisted ('N' or blank if OK), while a web request for http://www.stopforumspam.com/api?ip=123.45.67.89 would return "ip yes  2009-04-16 23:11:19  41" if the address is blacklisted (the time, date and count can be ignored) or blank if the address is good.
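Because each HTTP blacklist answers in its own format, a client needs one small parser per service. The sketch below interprets only the two reply formats quoted above; the function names are hypothetical, and the formats are taken from the examples in this text, so the services' current APIs may well differ.

```php
<?php
// BotScout-style reply: "Y|IP|4" when listed, "N..." or blank otherwise.
function isListedByBotScoutReply(string $reply): bool
{
    return strncmp(trim($reply), 'Y', 1) === 0;
}

// StopForumSpam-style reply: "ip yes <date> <time> <count>" when listed,
// blank otherwise. Only the second field matters; the rest is ignored.
function isListedByStopForumSpamReply(string $reply): bool
{
    $fields = preg_split('/\s+/', trim($reply));

    return isset($fields[1]) && $fields[1] === 'yes';
}
```

Keeping the parsers separate from the HTTP request itself makes it straightforward to add or replace a service when its reply format changes.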

With no one standard format by which a blacklist server responds to an enquiry, the stock MediaWiki package has no built-in support for most on-line lists of known spambots. The built-in $wgEnableSorbs and $wgSorbsUrl options were originally further limited by the inability to specify more than one blacklist server; since revision 58061, MediaWiki has been able to check multiple DNSBLs when $wgSorbsUrl is defined as an array.
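A configuration checking two lists at once might look like the fragment below. The choice of the second list is only an example taken from the lists named earlier. Later MediaWiki releases are understood to have renamed these settings to $wgEnableDnsBlacklist and $wgDnsBlacklistUrls, so check which names your version expects.

```php
// LocalSettings.php fragment (sketch; the list selection is an example).
$wgEnableSorbs = true;
$wgSorbsUrl = [
    'http.dnsbl.sorbs.net.', // note the required trailing dot
    'opm.tornevall.org.',
];
```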

As most blacklist operators provide very limited software support (often targeted to non-wiki applications, such as phpBB or WordPress), third-party adaptations of these clients have been built and deployed on some wikis to check spambots. As the same spambots create similar problems on most open-content websites, the worst offenders attacking MediaWiki sites will also be busily targeting thousands of non-wiki sites with spam in blog comments, forum posts and guestbook entries.

Automatic query of multiple blacklist sites is therefore already in widespread use protecting various other forms of open-content sites, and the spambot names, ranks and IP addresses are by now all too well known. A relatively small number of spambots appear to be behind a large percentage of the overall problem. Even where admins take no prisoners, a clear pattern has been noted: the same spambot IP which posted linkspam to the wiki a second ago is spamming blog comments somewhere else now, and will be spamming forum posts a few seconds from now on a site half a world away. One shared external blacklist entry can silence one problematic 'bot from posting on thousands of sites.

This greatly reduces the number of individual IPs which need to be manually blocked, one wiki and one forum at a time, by local administrators.

But what's this about honeypots?
Some anti-spam sites, such as projecthoneypot.org, provide code which you are invited to include in your own website pages. Typically, the pages contain one or more unique, randomised and hidden e-mail addresses or links, intended not for your human visitors but for spambots. Each time the page is served, the embedded addresses are automatically changed, allowing individual pieces of spam to be directly and conclusively matched to the IP address of bots which harvested the addresses from your sites. The IP address which the bot used to view your site is automatically submitted to the operators of the blacklist service. Often a link to a fake 'comment' or 'guest book' is also hidden as a trap to bots which post spam to web forms. See Honeypot (computing).

Once the address of the spammer is known, it is added to the blacklists (see above) so that you and others will in future have one less unwanted robotic visitor to your sites.

While honeypot scripts and blacklist servers can automate much of the task of identifying and dealing with spambot IPs, most blacklist sites do provide links to web pages on which one can manually search for information about an IP address or report an abusive IP as a spambot. It may be advisable to include some of these links on the Special:Blockip page of your wiki for the convenience of your site's administrators.

More lists of proxy and spambot IPs
Typically, feeding the address of any bot or open proxy into a search engine will return many lists on which these abusive IPs have already been reported. In some cases, the lists will be part of anti-spam sites; in others, a site advocating the use of open proxies will list not only the proxy which has been abused to spam your wiki installation but hundreds of other proxies like it which are also open for abuse.

While any plain-text lists of open proxies must still be imported into your wiki manually, a Spambot Search Tool may be configured as an automated script to query any of the following databases:


 * 1) fSpamlist - fspamlist.com
 * 2) StopForumSpam - stopforumspam.com
 * 3) Sorbs - sorbs.net
 * 4) Spamhaus - spamhaus.org
 * 5) SpamCop - spamcop.net
 * 6) ProjectHoneyPot - projecthoneypot.org
 * 7) Bot Scout - botscout.com
 * 8) DroneBL - dronebl.org
 * 9) AHBL - ahbl.org

It is also possible to block wiki registrations from anonymised sources such as Tor proxies (Tor Project - torproject.org), from bugmenot users or from e-mail addresses (listed by undisposable.net) intended solely for one-time use.

See also Blacklists Compared - 1 March 2008 and spamfaq.net for lists of blacklists. Do keep in mind that lists intended for spam e-mail abatement will generate many false positives if installed to block comment spam on wikis or other web forms. Automated use of a list that blacklists all known dynamic user IP address blocks, for instance, could render your wiki all but unusable.

To link to IP blacklist sites from the Special:Blockip page of your wiki (as a convenience to admins wishing to manually check if a problem address is an already-known 'bot):
 * 1) Add one line to LocalSettings.php to set: $wgNamespacesWithSubpages[NS_SPECIAL] = true;
 * 2) Add the following text in MediaWiki:Blockiptext to display: " Check this IP at Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam. "

This will add an invitation to "check this IP at: Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam" to the page from which admins ask to block an IP. An IP address is sufficient information to make comments on Project Honeypot against spambots; Stop Forum Spam is less suited to reporting anon-IP problems, as it requires the username, IP and e-mail address under which a problem 'bot is attempting to register on your sites. The policies and capabilities of other blacklist-related websites may vary.

Note that blocking the address of the spambot posting to your site is not the same as blocking the URLs of specific external links being spammed in the edited text. Do both. Used in combination, the two approaches supplement (but do not replace) other anti-spam tools, such as title or username blacklists and tests which attempt to determine whether an edit is made by a human or a robot (CAPTCHA, Bad Behavior or Akismet), and can be a very effective means to separate spambots from real, live human visitors.

rel="nofollow"
Under the default configuration, MediaWiki adds rel="nofollow" to external links in wiki pages, to indicate that these are user-supplied, might contain spam, and should therefore not be used to influence page ranking algorithms. Popular search engines such as Google honour this attribute.

You can switch off this behaviour on a site-wide basis using $wgNoFollowLinks or on a per-namespace basis using the $wgNoFollowNsExceptions configuration variable.
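As a brief sketch, the two settings could be used as follows. The choice of the project namespace as an exception is only an example, and in practice you would normally pick one of the two lines rather than both.

```php
// LocalSettings.php fragment (sketch; choose one approach).

// Option 1: stop adding rel="nofollow" anywhere on the wiki.
$wgNoFollowLinks = false;

// Option 2: keep nofollow in general, but exempt selected namespaces
// (NS_PROJECT is an arbitrary example).
$wgNoFollowNsExceptions = [ NS_PROJECT ];
```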

Use of the rel="nofollow" attribute alone will not stop spammers attempting to add marketing to a page, but it will at least prevent them from benefiting through increased page rank. Nonetheless, it should never be relied upon as the primary method of controlling spam as its effectiveness is inherently limited. It does not keep spam off your site.

Restrict editing
In some cases, it is sufficient (and appropriate) to restrict editing pages to those users who have created an account. This restriction will halt a number of automated attacks. This approach can be coupled with, for example, requiring a captcha during account registration, as described above, or blocking usernames matching a certain regular expression using the Username Blacklist extension.

It is also possible to configure MediaWiki to require e-mail verification before editing certain pages; if this capability is used, it is best combined with one of the blacklists of known spambot e-mail addresses in order to prevent automated registrations.
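A minimal sketch of these restrictions in LocalSettings.php is shown below. Note the assumption: $wgEmailConfirmToEdit applies to editing across the whole wiki rather than to selected pages, so it is a coarser tool than the per-page restriction described above.

```php
// LocalSettings.php fragment (sketch).

// Anonymous visitors may read but not edit; they can still register.
$wgGroupPermissions['*']['edit'] = false;
$wgGroupPermissions['*']['createaccount'] = true;

// Require a confirmed e-mail address before an account may edit.
$wgEmailConfirmToEdit = true;
```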


 * "Preventing access" overview
 * $wgGroupPermissions configuration

Extensions

 * Extension:AbuseFilter
 * Extension:AkismetKlik
 * Extension:Bad Behavior
 * Extension:Check Spambots
 * Extension:ConfirmEdit
 * Extension:EmailAddressImage and Extension:EmailObfuscator
 * Extension:GlobalBlocking
 * Extension:Nuke
 * Extension:SpamBlacklist and Extension:SpamRegex
 * Extension:TitleBlacklist