Manual:Combating spam

Like all current dynamic web sites, wikis are a common target for spammers wishing to promote products or web sites. MediaWiki offers a number of features designed to combat vandalism in general (see); on this page we deal specifically with wiki spam.

Overview
Common tools used to combat wiki spam typically fall into these categories:
 * Requiring log in and/or a CAPTCHA on certain operations, such as edits, adding external links, or new user creation
 * Blocking edits from known blacklisted IP addresses or IPs running open proxies
 * Blocking edits which add specific unwanted keywords or external links
 * Blocking specific username and page title patterns commonly used by spambots
 * Blocking edits by new or anonymous users to specific often-targeted pages
 * Whitelisting known-good editors (such as admins, regular contributors) while placing restrictions on new or anonymous users
 * Cleanup scripts or bulk deletion (Nuke) of existing posts from recently-banned spambots

Normally a combination of various methods will be used, in an attempt to keep the number of spam, robot and open-proxy edits to a minimum while limiting the amount of disruption caused to legitimate users of the site.

Note that many of these features are not activated by default. If you are running a MediaWiki installation on your server/host, then you are the only one who can make the necessary configuration changes! By all means ask your users to help watch out for wiki spam (and do so yourself) but these days spam can easily overwhelm small wiki communities. It helps to raise the bar a little. You should also note however, that none of these solutions can be considered completely spam-proof. Always visit 'Recent changes' (Special:RecentChanges) periodically!

Quickest solutions to try first
Fighting spam should not be too hard. If you want to quickly and drastically reduce spam, try these few steps first.
 * Install and setup the basic antivandalism extensions (already shipped with last releases), in particular Extension:ConfirmEdit.
 * Configure QuestyCaptcha and change default settings if needed for your wiki.

If you still have problems, read the rest of this page for further solutions and post a message on mediawiki-l for help.

CAPTCHA
One of the more common methods of weeding out automated submissions is to use a CAPTCHA, a system that tries to distinguish humans from automated systems by asking the user to solve a task that is difficult for machines. The ConfirmEdit extension for MediaWiki provides an extensible CAPTCHA framework which can be triggered on a number of events, including
 * all edits,
 * edits adding new, unrecognized external links,
 * user registration.

The extension ships with a default test, but this is a reference implementation, and is not intended for production use. Wiki operators installing ConfirmEdit on a public wiki are advised to use one of the CAPTCHA modules contained within the extension (there are five in total).

The most robust CAPTCHAs available today are your custom QuestyCaptcha questions, if you tailor them tightly to your wiki's audience and update them frequently. ReCaptcha is nowadays beaten by most spammers ; the Asirra CAPTCHA, which asks the user to distinguish cats and dogs, is particularly obnoxious to users but may be effective.

It is important to note that CAPTCHAs can block more than undesirable bots: if a script is unable to pass a CAPTCHA, then so is a screen reader, or other software or aid used by the blind or visually impaired. One of the options in CAPTCHA, the "reCAPTCHA" widget, includes an alternative audio CAPTCHA for such cases - but some computer users fail hearing tests and reading tests, so this is not a complete solution. You should consider the implications of such a barrier, and possibly provide an alternative means for affected users to create accounts and contribute, which is a legal requirement in some jurisdictions.

CAPTCHAs have some disadvantages in terms of accessibility and inconvenience to your real human users: it may block users who are blind or visually impaired (reCAPTCHA includes an audio CAPTCHA for such cases). For this reason it is recommended not to use them on every edit, but only on account creation and anonymous edits that insert links (these are the default settings for ConfirmEdit, used by Wikimedia Foundation projects). Consider providing an alternative means for affected users to create accounts and contribute, which is a legal requirement in some jurisdictions.

Also it will not completely spam-proof your wiki; according to Wikipedia "Spammers pay about $0.80 to $1.20 for each 1,000 solved CAPTCHAs to companies employing human solvers in Bangladesh, China, India, and many other developing nations." For this reason it should be combined with other mechanisms.

rel="nofollow"
Under the default configuration, MediaWiki adds  to external links in wiki pages, to indicate that these are user-supplied, might contain spam, and should therefore not be used to influence page ranking algorithms. Popular search engines such as Google honour this attribute.

You can switch off this behaviour on a site-wide basis using $wgNoFollowLinks or on a per-namespace basis using the $wgNoFollowNsExceptions configuration variable.

Use of the  attribute alone will not stop spammers attempting to add marketing to a page, but it will at least prevent them from benefiting through increased page rank; we know for sure that some check this. Nonetheless, it should never be relied upon as the primary method of controlling spam as its effectiveness is inherently limited. It does not keep spam off your site.

See NoIndexHistory. Note that putting it on all external links is a rather heavy handed anti-spam tactic, which you may decide not to use (switch off the rel=nofollow option). See Nofollow for a debate about this. It's good to have this as the installation default though. It means lazy administrators who are not thinking about spam problems, will tend to have this option enabled. For more information, see Manual:Costs and benefits of using nofollow.

Antispam routine: tailored measures
Every spammer is different, even though they all look boringly similar. If the general countermeasures are not enough, before taking extreme steps make use of the tools which allow you to deal with the specific problems you have.

Individual page protection
Often, the same page will be hit repeatedly by spambots. Common patterns observed in spambot-created pagenames include talk page, often outside main space (e.g. Category_talk: are little-used, so make common targets), and other discussion pages

As most abusive edits on wikis which don't require registration to edit are from anonymous sources, blocking edits to these specific pages by anyone other than established users can prevent re-creation of deleted spamdump pages. Typically, any page which is already a regular visitor to special:log/delete on an individual wiki is a good candidate for page protection.
 * Semi-protection of individual pages.
 * In addition, this can be combined with changing the minimum requirements for MediaWiki to identify users as 'autoconfirmed'.
 * One may apply cascading protection to one or more pages that have links to the most frequently spammed pages. One can also use this trick to set up a handy list for use by admins.

Abuse filter
Extension:AbuseFilter allows privileged users to create rules to target the specific type of spam your wiki is receiving, and automatically prevent the action and/or block the user. It can examine many properties of the edit, such as the username, user's age, text added, links added, and so on. It is most effective in cases where you have one or more skilled administrators who are willing to assist in helping you fight spam. The abuse filter can be effective even against human-assisted spammers, but requires continual maintenance to respond to new types of attacks.

Examples for combating automatic spam can be found on Manual:Combating spam/AbuseFilter examples.

SpamBlacklist
The above approach will become too cumbersome if you attempt to block more than a handful of spammy URLs. A better approach is to have a long blacklist identifying many known spamming URLs.

A popular extension for MediaWiki is the SpamBlacklist extension which blocks edits that add blacklisted URLs to pages: it allows such a list to be constructed on-wiki with the assistance of privileged users, and allows the use of lists retrieved from external sources (by default, it uses the extensive Spam blacklist).

The TitleBlacklist extension may also be useful, as a means to prevent re-creation of specific groups of pages which are being used by the 'bots to dump linkspam.

Open proxies
Open proxies are a danger mostly because they're used as a way to circumvent countermeasures targeted to specific abuser; see also No open proxies.

Some bots exist, e.g. on Wikimedia wikis, to detect and block open proxies IPs, but their code is often not public. Most such blocks are performed manually, when noticing the abuse. It's hence important to be able to tell whether an abusing IP is an open proxy or something else, to decide how to deal with it; even more so if it's an IP used by a registered user, retrieved with the CheckUser extension.

Since 1.22, $wgApplyIpBlocksToXff is available to make blocks more effective.

Hardcore measures
The following measures are for the more technical savvy sysadmins who know what they're doing: they're harder to set up properly and monitor; if implemented bad, they may be too old to be still effective, or even counterproductive for your wiki.

$wgSpamRegex
MediaWiki provides a means to filter the text of edits in order to block undesirable additions, through the $wgSpamRegex configuration variable. You can use this to block additional snippets of text or markup associated with common spam attacks.

Typically it's used to exclude URLs (or parts of URLS) which you do not want to allow users to link to. Users are presented with an explanatory message, indicating which part of their edit text is not allowed. Extension:SpamRegex allows editing of this variable on-wiki.

This prevents any mention of 'online-casino' or 'buy-viagra' or 'adipex' or 'phentermine'. The '/i' at the end makes the search case insensitive. It will also block edits which attempt to add hidden or overflowing elements, which is a common "trick" used in a lot of mass-edit attacks to attempt to hide the spam from viewers.

$wgBlockOpenProxies
By setting $wgBlockOpenProxies to true in your LocalSettings.php, MediaWiki will automatically scan each editing IP for open HTTP proxies. Such scans may be interpreted as hostile by some system administrators, and so this measure is not recommended.

Apache configuration changes
In addition to changing your MediaWiki configuration, if you are running MediaWiki on Apache, you can make changes to your Apache web server configuration to help stop spam. These settings are generally either placed in your virtual host configuration file, or in a file called .htaccess in the same location as LocalSettings.php (note that if you have a shared web host, they must enable AllowOverride to allow you to use an .htaccess file).

Filtering by user agent
When you block a spammer on your wiki, search your site's access log by IP to determine what user agent string that IP supplied. For example:

The access log location for your virtual host is generally set using the CustomLog directive. Once you find the accesses, you'll see some lines like this:

The user agent is the last quoted string on the line, in this case an empty string. Some spammers will use user agent strings used by real browsers, while others will use malformed or blank user agent strings. If they are in the latter category, you can block them by adding this to your .htaccess file (adapted from this page):

SetEnvIf User-Agent ^regular expression matching user agent string goes here$ spammer=yes

Order allow,deny allow from all deny from env=spammer

This will return a 403 Forbidden error to any IP connecting with a user agent matching the specified regular expression. Take care to escape all necessary regexp characters in the user agent string such as. - with backslashes (\). To match blank user agents, just use "^$".

Even if the spammer's user agent string is used by real browsers, if it is old or rarely encountered, you can use rewrite rules to redirect users to an error page, advising them to upgrade their browser:

RewriteCond %{HTTP_USER_AGENT} "Mozilla/5\.0 \(Windows; U; Windows NT 5\.1; en\-US; rv:1\.9\.0\.14\) Gecko/2009082707 Firefox/3\.0\.14 \(\.NET CLR 3\.5\.30729\)" RewriteCond %{REQUEST_URI} !^/forbidden/pleaseupgrade.html RewriteRule ^(.*)$ /forbidden/pleaseupgrade.html [L]

Preventing blocked spammers from consuming resources
A persistent spammer or one with a broken script may continue to try to spam your wiki after they have been blocked, needlessly consuming resources. By adding a deny from pragma such as the following to your .htaccess file, you can prevent them from loading pages at all, returning a 403 Forbidden error instead: Order allow,deny allow from all deny from 195.230.18.188

IP address blacklists
Much of the most problematic spam received on MediaWiki sites comes from addresses long known by other webmasters as bot or open proxy sites, though there's only anecdotal evidence for this. These bots typically generate large numbers of automated registrations to forum sites, comment spam to blogs and page vandalism to wikis: most often linkspam, although existing content is sometimes blanked, prepended with random gibberish characters or edited in such a way as to break existing Unicode text.

A relatively simple CAPTCHA may significantly reduce the problem, as may blocking the creation of certain often-spammed pages. These measures do not eliminate the problem, however, and at some point tightening security for all users will inconvenience legitimate contributors.

It may be preferable, instead of relying solely on CAPTCHA or other precautions which affect all users, to target specifically those IPs already known by other site masters to be havens of net.abuse. Many lists are already available, for instance stopforumspam.com has a list of "All IPs in CSV" which (as of feb. 2012) contains about 200,000 IPs of known spambots.

CPU usage and overload
Note that, when many checks are performed on attempted edits or pageviews, bots may easily overload your wiki disrupting it more than they would if it was unprotected. Keep an eye on the resource cost of your protections.

DNSBL
You can set MediaWiki to check each editing IP address against one or more DNSBLs (DNS-based blacklists), which requires no maintenance but slightly increases edit latency. For example, you can add this line to your LocalSettings.php to block many open proxies and known forum spammers:

For details of these DNSBLs, see Spamhaus: XBL and dnsbl.tornevall.org. For a list of DNSBLs, see Comparison of DNS blacklists. See also Manual:$wgEnableDnsBlacklist, Manual:$wgDnsBlacklistUrls.

Bad Behavior and Project HoneyPot
Bad Behavior is a first defense line blocking all requests by known spammers identified via HTTP headers, IP address, and other metadata; it is available as a MediaWiki extension, see Extension:Bad Behavior.

For maximum effectiveness, it should be combined with an http:BL API Key, which you can get by signing up for Project Honey Pot, a distributed spam tracking project. To join Project HoneyPot you will need to add a publicly accessible file to your webserver, then use the following extension code in your LocalSettings.php (or an included PHP file) to embed a link to it in every page:

Set $wgHoneyPotPath to the path of the honeypot page in your LocalSettings.php (e.g. "/ciralix.php"). You may change the form of the link above to any of the alternatives suggested by Project HoneyPot.

Once you're signed up, choose Services&rarr;HTTP Blacklist to get an http:BL API Key, and put your key in Bad Behavior's settings.ini.

$wgProxyList
Warning: This particular technique will substantially increase page load time and server load if the IP list is large. Use with caution.

You can set the variable $wgProxyList to a list of IPs to ban. This can be populated periodically from an external source using a cron script such as the following:

You then set in your LocalSettings.php:

If you do this and you use APC cache for caching, you may need to increase apc.shm_size in your php.ini to accommodate such a large list.

If your distribution's awk exits with a Segmentation fault, you may be more successful with replacing awk by sed.

You may want to save these commands in a file called e.g. updateBannedIPs.sh, so you can run it periodically.

You can also use a PHP-only solution to download the ip-list from stopforumspam. Todo so check the PHP script available here.

You have just banned one hundred forty thousand spammers, all hopefully without any disruptive effect on your legitimate users, and said «adieu» to a lot of the worst of the known spammers on the Internet. Good riddance! That should make things a wee bit quieter, at least for a while… You've just

Honeypots, DNS BL's and HTTP BL's
140,000 dead spammers. Not bad, but any proper BOFH at this point would be bored and eagerly looking for the 140,001st spam IP to randomly block. And why not?

Fortunately, dynamically-updated lists of spambots, open proxies and other problem IP's are widely available. Many also allow usernames or e-mail addresses (for logged-in users) to be automatically checked against the same blacklists.

One form of blacklist which may be familiar to MediaWiki administrators is the DNS BL. Hosted on a domain name server, a DNS blacklist is a database of IP addresses. An address lookup determines if an IP attempting to register or edit is an already-known source of net abuse.

The $wgEnableDnsBlacklist and $wgDnsBlacklistUrls options in MediaWiki provide a primitive example of access to a DNS blacklist. Set  in LocalSettings.php and IP addresses listed as HTTP spam are blocked.)

The DNS blacklist operates as follows:
 * A wiki gets an edit or new-user registration request from some random IP address (for example, in the format '123.45.67.89')
 * The four IP address bytes are placed into reverse order, then followed by the name of the desired DNS blacklist server
 * The resulting address is requested from the domain name server (in this example, '89.67.45.123.zen.spamhaus.org.' and '89.67.45.123.dnsbl.tornevall.org.')
 * The server returns not found (NXDOMAIN) if the address is not on the blacklist. If is on either blacklist, the edit is blocked.

The lookup in an externally-hosted blacklist typically adds no more than a few seconds to the time taken to save an edit. Unlike $wgProxyKey settings, which must be loaded on each page read or write, the use of the DNS blacklist only takes place during registration or page edits. This leaves the speed at which the system can service page read requests (the bulk of your traffic) unaffected.

While the original SORBS was primarily intended for dealing with open web proxies and e-mail spam, there are other lists specific to web spam (forums, blog comments, wiki edits) which therefore may be more suitable:
 * .opm.tornevall.org. operates in a very similar manner to SORBS DNSBL, but targets open proxies and web-form spamming. Much of its content is consolidated from other existing lists of abusive IP's.
 * .dnsbl.httpbl.org. specifically targets 'bots which harvest e-mail addresses from web pages for bulk mail lists, leave comment spam or attempt to steal passwords using dictionary attacks. It requires the user register with projecthoneypot.org for a 12-character API key. If this key (for example) were 'myapitestkey', a lookup which would otherwise look like '89.67.45.123.http.dnsbl.sorbs.net.' or '89.67.45.123.opm.tornevall.org.' would need to be 'myapitestkey.89.67.45.123.dnsbl.httpbl.org.'
 * Web-based blacklists can identify spammer's e-mail addresses and user information beyond a simple IP address, but there is no standard format for the reply from an HTTP blacklist server. For instance, a request for http://botscout.com/test/?ip=123.45.67.89 would return "Y|IP|4" if the address is blacklisted ('N' or blank if OK), while a web request for http://www.stopforumspam.com/api?ip=123.45.67.89 would return "ip yes  2009-04-16 23:11:19  41" if the address is blacklisted (the time, date and count can be ignored) or blank if the address is good.

With no one standard format by which a blacklist server responds to an enquiry, no built-in support for most on-line lists of known spambots exists in the stock MediaWiki package. The inability to specify more than one blacklist server further limits the usefulness of the built-in $wgEnableDnsBlacklist and $wgDnsBlacklistUrls options. Since 58061, MediaWiki has been able to check multiple DNSBLs by defining $wgDnsBlacklistUrls as an array.

As most blacklist operators provide very limited software support (often targeted to non-wiki applications, such as phpBB or Wordpress), third-party adaptations of these clients have been built and deployed on some wikis to check spambots. As the same spambots create similar problems on most open-content websites, the worst offenders attacking MediaWiki sites will also be busily targeting thousands of non-wiki sites with spam in blog comments, forum posts and guestbook entries.

Automatic query of multiple blacklist sites is therefore already in widespread use protecting various other forms of open-content sites and the spambot names, ranks and IP addresses are by now already all too well known. A relatively small number of spambots appear to be behind a large percentage of the overall problem. Even where admins take no prisoners, a pattern where the same spambot IP which posted linkspam to the wiki a second ago is spamming blog comments somewhere else now and will be spamming forum posts a few seconds from now on a site half a world away has been duly noted. One shared external blacklist entry can silence one problematic 'bot from posting on thousands of sites.

This greatly reduces the number of individual IP's which need to be manually blocked, one wiki and one forum at a time, by local administrators.

But what's this about honeypots?
Some anti-spam sites, such as projecthoneypot.org, provide code which you are invited to include in your own website pages. Typically, the pages contain one or more unique, randomised and hidden e-mail addresses or links, intended not for your human visitors but for spambots. Each time the page is served, the embedded addresses are automatically changed, allowing individual pieces of spam to be directly and conclusively matched to the IP address of bots which harvested the addresses from your sites. The IP address which the bot used to view your site is automatically submitted to the operators of the blacklist service. Often a link to a fake 'comment' or 'guest book' is also hidden as a trap to bots which post spam to web forms. See Honeypot (computing).

Once the address of the spammer is known, it is added to the blacklists (see above) so that you and others will in future have one less unwanted robotic visitor to your sites.

While honeypot scripts and blacklist servers can automate much of the task of identifying and dealing with spambot IPs, most blacklist sites do provide links to web pages on which one can manually search for information about an IP address or report an abusive IP as a spambot. It may be advisable to include some of these links on the special:blockip pages of your wiki for the convenience of your site's administrators.

More lists of proxy and spambot IPs
Typically, feeding the address of any bot or open proxy into a search engine will return many lists on which these abusive IP's have already been reported. In some cases, the lists will be part of anti-spam sites, in others a site advocating the use of open proxies will list not only the proxy which has been being abused to spam your wiki installation but hundreds of other proxies like it which are also open for abuse.

While any plain-text lists of open proxies must still be imported into your wiki manually, a Spambot Search Tool may be configured as an automated script to query any of the following databases:


 * 1) fSpamlist - fspamlist.com
 * 2) StopForumSpam - stopforumspam.com
 * 3) Sorbs - sorbs.net
 * 4) Spamhaus - spamhaus.org
 * 5) SpamCop - spamcop.net
 * 6) ProjectHoneyPot - projecthoneypot.org
 * 7) Bot Scout - botscout.com
 * 8) DroneBL - dronebl.org
 * 9) AHBL - ahbl.org
 * 10) s5h spam - all.s5h.net

It is also possible to block wiki registrations from anonymised sources such as Tor proxies (Tor Project - torproject.org), from bugmenot users or from e-mail addresses (listed by undisposable.net) intended solely for one-time use.

See also Blacklists Compared - 1 March 2008 and spamfaq.net for lists of blacklists. Do keep in mind that lists intended for spam e-mail abatement will generate many false positives if installed to block comment spam on wikis or other web forms. Automated use of a list that blacklists all known dynamic user IP address blocks, for instance, could render your wiki all but unusable.

To link to IP blacklist sites from the Special:Blockip page of your wiki (as a convenience to admins wishing to manually check if a problem address is an already-known 'bot):
 * 1) Add one line to LocalSettings.php to set: $wgNamespacesWithSubpages[NS_SPECIAL] = true;
 * 2) Add the following text in MediaWiki:Blockiptext to display: " Check this IP at Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam. "

This will add an invitation to "check this IP at: Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam" to the page from which admins ask to block an IP. An IP address is sufficient information to make comments on Project Honeypot against spambots, Stop Forum Spam is less suited to reporting anon-IP problems as it requires username, IP and e-mail under which a problem 'bot is attempting to register on your sites. The policies and capabilities of other blacklist-related websites may vary.

Note that blocking the address of the spambot posting to your site is not the same as blocking the URL's of specific external links being spammed in the edited text. Do both. Both approaches used in combination, as a means to supplement (but not replace) other anti-spam tools such as title or username blacklists and tests which attempt to determine whether an edit is made by a human or a robot (captcha, bad behaviour or akismet) can be a very effective means to separate spambots from real, live human visitors.

If spam has won the battle
You can still win the war! MediaWiki offers you the tools to do so; just consolidate your positions until you're ready to attack again. See Manual:Combating vandalism and in particular Cleaning up, Restrict editing.

Other ideas
This page lists features which are currently included, or available as patches, but on the discussion page you will find many other ideas for anti-spam features which could be added to MediaWiki, or which are under development.

Extensions

 * AbuseFilter &mdash; allows edit prevention and blocking based on a variety of criteria
 * Bad Behavior
 * Check Spambots &mdash; queries online databases and DNSRBLs to detect known spam vectors
 * CheckUser &mdash; allows, among other things, the checking of the underlying IP addresses of account spammers to block them. Allows mass-blocking of spammers from similar locations.
 * FlaggedRevs
 * SimpleAntiSpam &mdash; adds an invisible input field into the edit view and checks if the box was filled; if it was, the extension disallows the edit. Won't affect human users in any way. This functionality is now part of MediaWiki core since 1.22.
 * SpamRegex &mdash; allows basic blocking of edits containing spam domains with a single regex
 * Category:Spam management extensions &mdash; category exhaustively listing spam management extensions

Useful only on some wiki farms:
 * AntiBot &mdash; a simple framework for spambot checks and trigger payloads.
 * EmailAddressImage
 * GlobalBlocking

Bundled in the installer
The standard tarball available for download now contains most of the main anti-spam extensions, including the following:


 * ConfirmEdit &mdash; adds various types of CAPTCHAs to your wiki
 * Asirra &mdash; CAPTCHA based on distinguishing cats and dogs
 * QuestyCaptcha &mdash; CAPTCHA based on answering questions
 * Nuke &mdash; removes all contributions by a user or IP
 * SpamBlacklist &mdash; prevents edits containing spam domains, list is editable on-wiki by privileged users

Settings

 * Manual:$wgBlockOpenProxies
 * Manual:$wgDnsBlacklistUrls
 * Manual:$wgEmailConfirmToEdit
 * Manual:$wgEnableDnsBlacklist
 * Manual:$wgGroupPermissions
 * Manual:$wgProxyList
 * Manual:$wgSpamRegex
 * Manual:$wgApplyIpBlocksToXff