Manual:Combating spam

Because anyone can edit them, wikis are a common target for spammers wishing to promote products or web sites. MediaWiki offers a number of features designed to help block wiki spam.

Common tools used to combat wiki spam typically fall into these categories:
 * Requiring user validation and/or captcha on certain operations, such as edits, new external links or new user creation
 * Blocking robots and open proxies operating on known blacklisted IP addresses
 * Blocking edits which add specific unwanted keywords or external links
 * Blocking specific username and page title patterns commonly used by spambots
 * Blocking edits by new or anonymous users to specific often-targeted pages
 * Blocking registrations from known spammer usernames or e-mail addresses
 * Whitelisting known-good editors (such as admins and regular contributors) while applying captcha or other restrictions to new, unknown or anonymous users only
 * Cleanup scripts or bulk deletion (Extension:Nuke) of existing posts from recently-banned spambots

Normally a combination of various methods will be used, in an attempt to keep the number of spam, robot and open-proxy edits to a minimum while limiting the amount of disruption caused to legitimate users of the site.

Individual page protection
Frequently-spammed pages may be protected from editing by new and anonymous users by using semi-protection of individual pages. Often, the same page is hit repeatedly by spambots. As most abusive edits on wikis which don't require registration to edit are from anonymous sources, blocking edits to these specific pages by anyone other than established users can prevent re-creation of deleted spamdump pages.

Common patterns observed in spambot-created pagenames include:
 * Name of a legit content page on the same wiki, appended with '/' or with '/index.php'
 * Talk page for an article, often outside main space (Forum_talk: or Category_talk: are little-used, so make common targets)
 * Forum: and discussion pages
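
The naming patterns listed above can be checked mechanically. The following is a minimal sketch in Python, not part of MediaWiki: the function name and the exact pattern are assumptions made for illustration, and any real rule set would need tuning for the individual wiki. It is meant for picking out candidates for review or protection, not for automatic blocking.

```python
import re

# Illustrative pattern (an assumption, not a MediaWiki default):
# titles ending in "/" or "/index.php", or talk pages in the little-used
# Forum_talk: and Category_talk: namespaces mentioned in the list above.
SUSPICIOUS_TITLE = re.compile(
    r"(?:/(?:index\.php)?$)|^(?:Forum_talk|Category_talk):",
    re.IGNORECASE,
)


def looks_like_spam_title(title: str) -> bool:
    """Return True if the page title matches one of the spambot naming patterns."""
    return SUSPICIOUS_TITLE.search(title) is not None
```

Ordinary subpages such as "Help:Contents/Editing" do not match, because only a trailing '/' or '/index.php' is treated as suspicious.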

Typically, any page which is already a regular visitor to special:log/delete on an individual wiki is a good candidate for page protection.

Edit filtering
MediaWiki provides a means to filter the text of edits in order to block undesirable additions, through the $wgSpamRegex configuration variable. You can use this to block additional snippets of text or markup associated with common spam attacks. For example, a regular expression that matches inline style attributes setting properties such as display or overflow will block edits which attempt to add hidden or overflowing elements, a common "trick" used in many mass-edit attacks to hide the spam from viewers.
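
To get a feel for what such a filter catches before putting a pattern into $wgSpamRegex, the idea can be tried out offline. The sketch below uses Python's re module rather than PHP, and the pattern is an illustrative assumption, not the expression shipped in any MediaWiki documentation; the function name is likewise hypothetical.

```python
import re

# Illustrative pattern (an assumption): any tag whose style attribute sets a
# property commonly used to hide or overflow content.
HIDDEN_STYLE = re.compile(
    r"<[^>]*style\s*=[^>]*\b(?:display|visibility|overflow|position|height)\s*:[^>]*>",
    re.IGNORECASE,
)


def adds_hidden_markup(text: str) -> bool:
    """Return True if the edit text contains a tag styled to hide or overflow."""
    return HIDDEN_STYLE.search(text) is not None
```

A pattern this broad will also reject some legitimate markup (for instance a style setting line-height), so it is worth testing against real pages from your own wiki first.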

SpamBlacklist
A popular extension for MediaWiki is the SpamBlacklist extension which blocks edits that add blacklisted URLs to pages. The TitleBlacklist extension may also be useful, as a means to prevent re-creation of specific groups of pages which are being used by the 'bots to dump linkspam.

Captcha
One of the more common methods of weeding out automated submissions is to use a CAPTCHA. The ConfirmEdit extension for MediaWiki provides an extensible captcha framework which can be triggered on a number of events, including


 * all edits
 * edits adding new, unrecognized external links
 * user registration

The extension ships with a default test, but this is a reference implementation, and is not intended for production use. Wiki operators installing ConfirmEdit on a public wiki are advised to either
 * use the FancyCaptcha plugin, and generate a set of decent captcha images using the supplied Python script, or
 * use the ReCAPTCHA plugin.
Instructions on how to do this are supplied with the extension.

It is important to note that captchas can block more than undesirable bots; if a script is unable to pass a captcha, then so is a screen reader, or other software or aid used by the blind or visually impaired. You should therefore consider the implications of such a barrier, and provide an alternative means for affected users to create accounts and contribute.


 * ConfirmEdit overview
 * ConfirmEdit extension
 * Extension:ReCAPTCHA

IP address blacklists
Much of the most problematic spam received on MediaWiki sites comes from addresses long known by other webmasters as 'bot or open proxy sites. These 'bots typically generate large numbers of automated registrations to forum sites, comment spam to blogs and page vandalism to wikis (most often linkspam, although existing content is sometimes blanked, prepended with random gibberish characters or edited in such a way as to break existing Unicode text). Left unchecked, special:recentchanges can quickly fill with random-character edit summaries from IP addresses in foreign countries far outside your target audience, with "content" turned into either random text or linkspam.

A relatively simple 'captcha' may significantly reduce the problem, as may blocking the creation of certain often-spammed pages (such as names with trailing '/', '/index.php' and specific, targeted talk: or forum: pages). These measures do not eliminate the problem, however, and at some point tightening security for all users will inconvenience legitimate contributors.

Instead of relying solely on 'captcha' or other precautions which affect all users, it would be preferable to target specifically those IP's already known by other site masters to be havens of net.abuse. Many lists are already available; for instance, stopforumspam.com has a list of "All IP's in CSV" which (as of 2009) contains about seventy-five thousand IP's of known spambots.

An example: importing stopforumspam's IP list
To ban that many spambots individually would be prohibitively slow, but grabbing a copy of the existing list and importing it into MediaWiki (assuming a *nix host shell) can be done in mere minutes:

The list of spambot IP's will appear as a single line of text about one million bytes long, with the addresses separated by commas. Now do a search-and-replace which inserts a newline (\n) at each comma, to break this huge line into one 75000-line list of bad IP's. (In this example, Joe's Own Editor 'joe' is used as it tolerates unlimited line length and allows \n to be used to substitute the newline character. Other standard means of search-and-replacement may also be available, depending on which packages are installed on your system.)

Go to the top of the list and add:

Go to the end of the list and add:

Note: Alternatively, the following *nix commands do all of the above without needing an editor:

Once you've saved your new-found list of bad 'bots, simply include it from LocalSettings.php as if it were any standard MediaWiki extension:

You have just banned seventy-five thousand spammers, all hopefully without any disruptive effect on your legitimate users. That should make things a wee bit quieter, at least for a while...
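
The conversion described in this section (one long comma-separated line turned into a PHP file that can be included from LocalSettings.php) can also be sketched in Python. This is only an illustration under stated assumptions: the function name is hypothetical, and the choice of $wgProxyList as the target variable is an assumption, since the text above does not name the variable the list is loaded into. Each field is validated as an IP address so that a damaged download cannot inject arbitrary text into the generated PHP.

```python
import ipaddress


def ip_csv_to_php(csv_text: str, variable: str = "wgProxyList") -> str:
    """Turn a comma-separated IP list into PHP source defining an array.

    The default variable name is an assumption made for this sketch.
    """
    addresses = []
    for field in csv_text.split(","):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing comma or newline
        # Raises ValueError if the field is not a valid IP address.
        addresses.append(str(ipaddress.ip_address(field)))

    lines = ["<?php", f"${variable} = array("]
    lines.extend(f"\t'{addr}'," for addr in addresses)
    lines.append(");")
    return "\n".join(lines) + "\n"
```

The returned string would be written to a file of your choosing and included from LocalSettings.php in the same way as the hand-edited list.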

¡Adios!

The time you save by not having to block each of these known spambots and their favourite target pages from each of your wikis individually can now be used to allocate a few more megabytes of APC cache... as, with this 1300000-byte file added to your configuration, you'll need it. You've just said «adieu» to a lot of the worst of the known spammers on the Internet. Good riddance!

Honeypots, DNS BL's and HTTP BL's
75000 dead spammers. Not bad, but any proper BOFH at this point would be bored and eagerly looking for the 75001st spam IP to randomly block. And why not?

Fortunately, dynamically-updated lists of spambots, open proxies and other problem IP's are widely available. Many also allow usernames or e-mail addresses (for logged-in users) to be automatically checked against the same blacklists.

One form of blacklist which may be familiar to MediaWiki administrators is the DNS BL. Hosted on a domain name server, a DNS blacklist is a database of IP addresses. An address lookup determines if an IP attempting to register or edit is an already-known source of net abuse.

The $wgEnableSorbs and $wgSorbsUrl options in MediaWiki provide a primitive example of access to a DNS blacklist. Set $wgEnableSorbs = true; and $wgSorbsUrl = 'http.dnsbl.sorbs.net.'; in LocalSettings.php and IP addresses listed as open proxies by SORBS are blocked. (Note: the trailing '.' in 'http.dnsbl.sorbs.net.' is required.)

DNS BL will block only problematic IP addresses; it does not blacklist e-mail addresses, usernames, or browser ("user-agent") names. It does allow new spambots to be blacklisted automatically as they are discovered, as well as being an effective tool against open-proxy abuse.

The DNS blacklist operates as follows:
 * A wiki gets an edit or new-user registration request from some random IP address (for example, in the format '123.45.67.89')
 * The four IP address bytes are placed into reverse order, then followed by the name of the desired DNS blacklist server
 * The resulting address is requested from the domain name server (in this example, '89.67.45.123.http.dnsbl.sorbs.net.')
 * The server returns not found (NXDOMAIN) if the address is not on the blacklist; any other reply means the address is listed, and the edit or registration is blocked
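
The lookup steps above can be sketched in a few lines of Python. This is an illustration of the general DNS blacklist technique, not MediaWiki's own code; the function names are hypothetical. Note that the sketch treats any resolver failure (not only NXDOMAIN) as "not listed", which is an assumption chosen so that a DNS outage does not lock out every editor.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Reverse the four IPv4 octets and append the blacklist zone name."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone.lstrip(".")


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has a record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # NXDOMAIN (or any other resolution failure): treat as not listed.
        return False
    return True
```

For the example address used above, dnsbl_query_name('123.45.67.89', 'http.dnsbl.sorbs.net.') yields '89.67.45.123.http.dnsbl.sorbs.net.'.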

The lookup in an externally-hosted blacklist typically adds no more than a few seconds to the time taken to save an edit. Unlike $wgProxyKey settings, which must be loaded on each page read or write, the use of the DNS blacklist only takes place during registration or page edits. This leaves the speed at which the system can service page read requests (the bulk of your traffic) unaffected.

While the original SORBS was primarily intended for dealing with open web proxies and e-mail spam, there are other lists which are oriented to web spam (forums, blog comments, wiki edits) and which therefore may be more suitable:
 * .opm.tornevall.org. operates in a very similar manner to SORBS DNSBL, but targets open proxies and web-form spamming. Much of its content is consolidated from other existing lists of abusive IP's.
 * .dnsbl.httpbl.org. differs slightly, in that it requires the user register with projecthoneypot.org to get a 12-character API key. If this key (for example) were 'myapitestkey', a lookup for '123.45.67.89' which would otherwise look like '89.67.45.123.http.dnsbl.sorbs.net.' or '89.67.45.123.opm.tornevall.org.' would need to be changed to be like 'myapitestkey.89.67.45.123.dnsbl.httpbl.org.' - adding the API key before the address to be tested.

There are also small differences in the values returned if the requested address is on the list. In all cases, a not found (NXDOMAIN) reply means the IP address in question is not on the list and is therefore likely legitimate.

This leaves a few evident limitations in MediaWiki's built-in DNS blacklist handler:
 * The code in MediaWiki's User.php and SpecialUserLogon only looks for one DNS BL server. It does not allow for automated checks to multiple servers from the same wiki.
 * There is nowhere to add the API key for servers (such as Project Honeypot's "httpbl.org") which require it
 * The response from the DNS blacklist is not checked beyond an address being found or not found (NXDOMAIN). If a server returns a value to indicate why an address is on the list, this extra information is discarded. Servers which list anything other than spambots or open proxies must therefore not be used - some blacklists intended for spam e-mail control list many known dynamic-IP ranges which are otherwise harmless.
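
A handler without these limitations might look something like the following Python sketch. It is an assumption-laden illustration, not an existing MediaWiki extension: the function names and the shape of the zones mapping are hypothetical. It queries several blacklist zones, prepends an API key where one is configured, and keeps the returned address so the caller can decide what the listing means.

```python
import socket
from typing import Callable, Dict, Optional


def query_name(ip: str, zone: str, api_key: Optional[str] = None) -> str:
    """Build the lookup name: [key.]reversed-ip.zone"""
    reversed_ip = ".".join(reversed(ip.split(".")))
    prefix = f"{api_key}." if api_key else ""
    return f"{prefix}{reversed_ip}.{zone.lstrip('.')}"


def check_blacklists(
    ip: str,
    zones: Dict[str, Optional[str]],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, str]:
    """Return {zone: returned address} for every zone that lists the IP.

    zones maps each blacklist zone to its API key, or None if no key is needed.
    """
    hits = {}
    for zone, key in zones.items():
        try:
            hits[zone] = resolve(query_name(ip, zone, key))
        except OSError:
            # Not found (NXDOMAIN) or resolver failure: not listed here.
            continue
    return hits
```

The resolve parameter exists so the logic can be exercised with a stand-in resolver instead of live DNS queries. How to interpret each returned address differs per list and has to be taken from that list's own documentation.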

Servers which use formats other than DNS to deliver updated blacklist information are also not directly compatible. For instance (based on a search for an IP address '123.45.67.89'):
 * a web request for http://botscout.com/test/?ip=123.45.67.89 would return "Y|IP|4" if the address is blacklisted, "N..." or blank if not. An API key from the site is required if making more than twenty requests per day.
 * a web request for http://www.stopforumspam.com/api?ip=123.45.67.89 would return "ip yes  2009-04-16 23:11:19  41" if the address is blacklisted (the time, date and count can be ignored) or blank if the address is good.
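
Because each service replies in its own format, a client needs one small parser per service. The sketch below handles only the two reply formats quoted above, as they were described in 2009; the function names are hypothetical, the current APIs of these sites may differ, and no network request is made here.

```python
def botscout_listed(reply: str) -> bool:
    """Reply of the form 'Y|IP|4' means listed; 'N...' or blank means not listed."""
    return reply.strip().upper().startswith("Y|")


def stopforumspam_listed(reply: str) -> bool:
    """Reply of the form 'ip yes <date> <time> <count>' means listed; blank means not."""
    fields = reply.split()
    # The date, time and count fields are ignored, as the text above suggests.
    return len(fields) >= 2 and fields[0] == "ip" and fields[1] == "yes"
```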

While the web-based blacklist has the advantage of being able to list e-mail addresses or problem user information beyond a simple IP address, there is no standard for the reply format of an HTTP blacklist server and no built-in support for HTTP BL in MediaWiki. It would in theory be possible to create a MediaWiki GetBlockedStatus extension to make these or any other blacklist sites interoperable with MediaWiki, but (as of 2009) the client code currently being offered by operators of these lists appears targeted for individual forum packages (such as phpBB) only. MediaWiki will not automatically query an HTTP server to check if an address is blacklisted.

But what's this about honeypots?
Some anti-spam sites, such as projecthoneypot.org, provide code which you are invited to include in your own website pages. Typically, the pages contain one or more unique, randomised and hidden e-mail addresses or links, intended not for your human visitors but for spambots. Each time the page is served, the embedded addresses are automatically changed, allowing individual pieces of spam to be directly and conclusively matched to the IP address of 'bots which harvested the addresses from your sites. The IP address which the 'bot used to view your site is automatically submitted to the operators of the blacklist service. See Honeypot (computing).

Once the address of the spammer is known, it is added to the blacklists (see above) so that you and others will in future have one less unwanted robotic visitor to your sites.

While honeypot scripts and blacklist servers can automate much of the task of identifying and dealing with spambot IP's, most blacklist sites do provide links to web pages on which one can manually search for information about an IP address or report an abusive IP as a spambot. It may be advisable to include some of these links on the special:blockip pages of your wiki for the convenience of your site's administrators.

More lists of proxy and spambot IP's
Typically, feeding the address of any 'bot or open proxy into a search engine will return many lists on which these abusive IP's have already been reported. In some cases, the lists will be part of anti-spam sites; in others, a site advocating the use of open proxies will list not only the proxy which has been abused to spam your wiki installation but hundreds of other proxies like it which are also open for abuse.

rel="nofollow"
Under the default configuration, MediaWiki adds rel="nofollow" to external links in wiki pages, to indicate that these are user-supplied, might contain spam, and should therefore not be used to influence page ranking algorithms. Popular search engines such as Google honour this attribute.

You can switch off this behaviour on a site-wide basis by setting the $wgNoFollowLinks configuration variable to false in LocalSettings.php.

You can also configure a list of namespaces for which the rel="nofollow" attribute will not be set, using the $wgNoFollowNsExceptions configuration variable; for example, listing NS_MAIN in that array will switch this off for the main namespace.

Use of the rel="nofollow" attribute alone will not stop spammers attempting to add marketing to a page, but it will prevent them from benefiting through increased page ranks. Nonetheless, it should never be relied upon as the primary method of controlling spam as its effectiveness is inherently limited. It does not keep spam off your site.

Restrict editing
In some cases, it is sufficient (and appropriate) to restrict editing pages to those users who have created an account. This restriction will halt a number of automated attacks. This approach can be coupled with, for example, requiring a captcha during account registration, as described above, or blocking usernames matching a certain regular expression using the Username Blacklist extension.

It is also possible to configure MediaWiki to require e-mail verification before editing certain pages; if this capability is used, it is best combined with one of the blacklists of known spambot e-mail addresses in order to prevent automated registrations.


 * "Preventing access" overview
 * $wgGroupPermissions configuration

Extensions

 * Extension:Bad Behavior
 * Extension:ConfirmEdit
 * Extension:Nuke
 * Extension:ReCAPTCHA
 * Extension:SpamBlacklist
 * Extension:SpamRegex
 * Extension:TitleBlacklist