Manual:Combating spam

From mediawiki.org

Like all current dynamic web sites, wikis are a common target for spammers wishing to promote products or websites. MediaWiki offers a number of features designed to combat vandalism in general. On this page we deal specifically with wiki spam, which is often automated.

Overview

Common tools used to combat wiki spam typically fall into these categories:

  • Requiring log in and/or a CAPTCHA on certain operations, such as edits, adding external links, or new user creation
  • Blocking edits from known blacklisted IP addresses or IPs running open proxies
  • Blocking edits which add specific unwanted keywords or external links
  • Blocking specific username and page title patterns commonly used by spambots
  • Blocking edits by new or anonymous users to specific often-targeted pages
  • Whitelisting known-good editors (such as admins, regular contributors) while placing restrictions on new or anonymous users
  • Cleanup scripts or bulk deletion (Extension:Nuke ) of existing posts from recently-banned spambots

Normally a combination of various methods will be used, in an attempt to keep the number of spam, robot and open-proxy edits to a minimum while limiting the amount of disruption caused to legitimate users of the site.

Note that many of these features are not activated by default. If you are running a MediaWiki installation on your server/host, then you are the only one who can make the necessary configuration changes! By all means ask your users to help watch out for wiki spam (and do so yourself) but these days spam can easily overwhelm small wiki communities. It helps to raise the bar a little. You should also note however, that none of these solutions can be considered completely spam-proof. Checking "Recent changes" (Special:RecentChanges) regularly is an effective practice.

Quickest solutions to try first

Fighting spam should not be too hard. If you want to quickly and drastically reduce spam, try these few steps first.

If you still have problems, read the rest of this page for further solutions and post a message on mediawiki-l for help.

Antispam setup basics

CAPTCHA

One of the more common methods of weeding out automated submissions is to use a CAPTCHA, a system that tries to distinguish humans from automated systems by asking the user to solve a task that is difficult for machines. The ConfirmEdit extension for MediaWiki provides an extensible CAPTCHA framework which can be triggered on a number of events, including

  • all edits
  • edits adding new, unrecognized external links
  • user registration

The extension ships with a default test, but this is a reference implementation, and is not intended for production use. Wiki operators installing ConfirmEdit on a public wiki are advised to use one of the CAPTCHA modules contained within the extension (there are five in total).

The most robust CAPTCHAs available today are your custom QuestyCaptcha questions, if you tailor them tightly to your wiki's audience and update them frequently. ReCaptcha is nowadays beaten by most spammers[1]; the Asirra CAPTCHA, which asks the user to distinguish cats and dogs, is particularly obnoxious to users but may be effective.

It is important to note that CAPTCHAs can block more than undesirable bots: if a script is unable to pass a CAPTCHA, then so is a screen reader, or other software or aid used by the blind or visually impaired. One of the options in CAPTCHA, the "reCAPTCHA" widget, includes an alternative audio CAPTCHA for such cases - but some computer users fail hearing tests and reading tests, so this is not a complete solution. You should consider the implications of such a barrier, and possibly provide an alternative means for affected users to create accounts and contribute, which is a legal requirement in some jurisdictions.[2]

Also it will not completely spam-proof your wiki; Spammers pay about $0.80 to $1.20 for each 1,000 solved CAPTCHAs to companies employing human solvers in Bangladesh, China, India, and many other developing nations.[3] For this reason it should be combined with other mechanisms.

rel="nofollow"

Under the default configuration, MediaWiki adds rel="nofollow" to external links in wiki pages, to indicate that these are user-supplied, might contain spam, and should therefore not be used to influence page ranking algorithms. Popular search engines such as Google honour this attribute.

You can switch off this behaviour on a site-wide basis using $wgNoFollowLinks or on a per-namespace basis using the $wgNoFollowNsExceptions configuration variable.

Use of the rel="nofollow" attribute alone will not stop spammers attempting to add marketing to a page, but it will at least prevent them from benefiting through increased page rank; we know for sure that some check this. Nonetheless, it should never be relied upon as the primary method of controlling spam as its effectiveness is inherently limited. It does not keep spam off your site.

See NoIndexHistory. Note that putting it on all external links is a rather heavy handed anti-spam tactic, which you may decide not to use (switch off the rel=nofollow option). See Nofollow for a debate about this. It's good to have this as the installation default though. It means lazy administrators who are not thinking about spam problems, will tend to have this option enabled. For more information, see Manual:Costs and benefits of using nofollow.

Antispam routine: tailored measures

Every spammer is different, even though they all look boringly similar. If the general countermeasures are not enough, before taking extreme steps make use of the tools which allow you to deal with the specific problems you have.

Individual page protection

Often, the same page will be hit repeatedly by spambots. Common patterns observed in spambot-created pagenames include talk page, often outside main space (e.g. Category_talk: are little-used, so make common targets), and other discussion pages

As most abusive edits on wikis which don't require registration to edit are from anonymous sources, blocking edits to these specific pages by anyone other than established users can prevent re-creation of deleted spamdump pages. Typically, any page which is already a regular visitor to special:log/delete on an individual wiki is a good candidate for page protection.

  • Semi-protection of individual pages.
    • In addition, this can be combined with changing the minimum requirements for MediaWiki to identify users as 'autoconfirmed'.
  • One may apply cascading protection to one or more pages that have links to the most frequently spammed pages. One can also use this trick to set up a handy list for use by admins.

Abuse filter

Extension:AbuseFilter allows privileged users to create rules to target the specific type of spam your wiki is receiving, and automatically prevent the action and/or block the user.

It can examine many properties of the edit, such as the username, user's age, text added, links added, and so on. It is most effective in cases where you have one or more skilled administrators who are willing to assist in helping you fight spam. The abuse filter can be effective even against human-assisted spammers, but requires continual maintenance to respond to new types of attacks.

Examples for combating automatic spam can be found on Manual:Combating spam/AbuseFilter examples .

SpamBlacklist

The above approach will become too cumbersome if you attempt to block more than a handful of spammy URLs. A better approach is to have a long blacklist identifying many known spamming URLs.

A popular extension for MediaWiki is the SpamBlacklist extension which blocks edits that add blacklisted URLs to pages: it allows such a list to be constructed on-wiki with the assistance of privileged users, and allows the use of lists retrieved from external sources (by default, it uses the extensive m:Spam blacklist).

The TitleBlacklist extension may also be useful, as a means to prevent re-creation of specific groups of pages which are being used by the bots to dump linkspam.

Open proxies

Open proxies are a danger mostly because they're used as a way to circumvent countermeasures targeted to specific abuser; see also No open proxies.

Some bots exist, e.g. on Wikimedia wikis, to detect and block open proxies IPs, but their code is often not public. Most such blocks are performed manually, when noticing the abuse. It's hence important to be able to tell whether an abusing IP is an open proxy or something else, to decide how to deal with it; even more so if it's an IP used by a registered user, retrieved with the CheckUser extension.

Several extensions, particularly the Tor block extension, blocks a range of open proxies.

Since 1.22, $wgApplyIpBlocksToXff is available to make blocks more effective.

Hardcore measures

The following measures are for the more technical savvy sysadmins who know what they're doing: they're harder to set up properly and monitor; if implemented badly, they may be too old to be still effective, or even counterproductive for your wiki.

$wgSpamRegex

MediaWiki provides a means to filter the text of edits in order to block undesirable additions, through the $wgSpamRegex configuration variable. You can use this to block additional snippets of text or markup associated with common spam attacks.

Typically it's used to exclude URLs (or parts of URLS) which you do not want to allow users to link to. Users are presented with an explanatory message, indicating which part of their edit text is not allowed. Extension:SpamRegex allows editing of this variable on-wiki.

$wgSpamRegex = "/online-casino|buy-viagra|adipex|phentermine|adult-website\.com|display:none|overflow:\s*auto;\s*height:\s*[0-4]px;/i";

This prevents any mention of 'online-casino' or 'buy-viagra' or 'adipex' or 'phentermine'. The '/i' at the end makes the search case insensitive. It will also block edits which attempt to add hidden or overflowing elements, which is a common "trick" used in a lot of mass-edit attacks to attempt to hide the spam from viewers.

Apache configuration changes

In addition to changing your MediaWiki configuration, if you are running MediaWiki on Apache, you can make changes to your Apache web server configuration to help stop spam. These settings are generally either placed in your virtual host configuration file, or in a file called .htaccess in the same location as LocalSettings.php (note that if you have a shared web host, they must enable AllowOverride to allow you to use an .htaccess file).

Filtering by user agent

When you block a spammer on your wiki, search your site's access log by IP to determine which user agent string that IP supplied. For example:

grep ^195.230.18.188 /var/log/apache2/access.log

The access log location for your virtual host is generally set using the CustomLog directive. Once you find the accesses, you'll see some lines like this:

195.230.18.188 - - [16/Apr/2012:16:50:44 +0000] "POST /index.php?title=FlemmingCoakley601&action=submit HTTP/1.1" 200 24093 "-" ""

The user agent is the last quoted string on the line, in this case an empty string. Some spammers will use user agent strings used by real browsers, while others will use malformed or blank user agent strings. If they are in the latter category, you can block them by adding this to your .htaccess file (adapted from this page):

SetEnvIf User-Agent ^regular expression matching user agent string goes here$ spammer=yes

Order allow,deny
allow from all           
deny from env=spammer

This will return a 403 Forbidden error to any IP connecting with a user agent matching the specified regular expression. Take care to escape all necessary regexp characters in the user agent string such as . ( ) - with backslashes (\). To match blank user agents, just use "^$".

Even if the spammer's user agent string is used by real browsers, if it is old or rarely encountered, you can use rewrite rules to redirect users to an error page, advising them to upgrade their browser:

RewriteCond %{HTTP_USER_AGENT} "Mozilla/5\.0 \(Windows; U; Windows NT 5\.1; en\-US; rv:1\.9\.0\.14\) Gecko/2009082707 Firefox/3\.0\.14 \(\.NET CLR 3\.5\.30729\)"
RewriteCond %{REQUEST_URI} !^/forbidden/pleaseupgrade.html
RewriteRule ^(.*)$ /forbidden/pleaseupgrade.html [L]

Preventing blocked spammers from consuming resources

A persistent spammer or one with a broken script may continue to try to spam your wiki after they have been blocked, needlessly consuming resources. By adding a deny from pragma such as the following to your .htaccess file, you can prevent them from loading pages at all, returning a 403 Forbidden error instead:

Order allow,deny
allow from all
deny from 195.230.18.188

IP address blacklists

Much of the most problematic spam received on MediaWiki sites comes from addresses long known by other webmasters as bot or open proxy sites, though there's only anecdotal evidence for this. These bots typically generate large numbers of automated registrations to forum sites, comment spam to blogs and page vandalism to wikis: most often linkspam, although existing content is sometimes blanked, prepended with random gibberish characters or edited in such a way as to break existing Unicode text.

A relatively simple CAPTCHA may significantly reduce the problem, as may blocking the creation of certain often-spammed pages. These measures do not eliminate the problem, however, and at some point tightening security for all users will inconvenience legitimate contributors.

It may be preferable, instead of relying solely on CAPTCHA or other precautions which affect all users, to target specifically those IPs already known by other site masters to be havens of net.abuse. Many lists are already available, for instance stopforumspam.com has a list of "All IPs in CSV" which (as of feb. 2012) contains about 200,000 IPs of known spambots.

CPU usage and overload

Note that, when many checks are performed on attempted edits or pageviews, bots may easily overload your wiki disrupting it more than they would if it was unprotected. Keep an eye on the resource cost of your protections.

DNSBL

You can set MediaWiki to check each editing IP address against one or more DNSBLs (DNS-based blacklists), which requires no maintenance but slightly increases edit latency. For example, you can add this line to your LocalSettings.php to block many open proxies and known forum spammers:

$wgEnableDnsBlacklist = true;
$wgDnsBlacklistUrls = array( 'xbl.spamhaus.org', 'dnsbl.tornevall.org' );

For details of these DNSBLs, see Spamhaus: XBL and dnsbl.tornevall.org. For a list of DNSBLs, see Comparison of DNS blacklists. See also Manual:$wgEnableDnsBlacklist , Manual:$wgDnsBlacklistUrls .

$wgProxyList

Warning Warning: This particular technique will substantially increase page load time and server load if the IP list is large. Use with caution.

You can set the variable $wgProxyList to a list of IPs to ban. This can be populated periodically from an external source using a cron script such as the following:

#!/bin/bash
cd /your/web/root
wget https://www.stopforumspam.com/downloads/listed_ip_30_ipv46.gz
gzip -d listed_ip_30_ipv46.gz
cat > bannedips.php << 'EOF'
<?php
$wgProxyList = array(
EOF
sed -e 's/^/  "/; s/$/",/' < listed_ip_30_ipv46 >> bannedips.php
printf '%s\n' '");' >> bannedips.php
rm -f listed_ip_30_ipv46

You then set in your LocalSettings.php:

require_once "$IP/bannedips.php";

You may want to save these commands in a file called e.g. updateBannedIPs.sh, so you can run it periodically.

You can also use a PHP-only solution to download the ip-list from stopforumspam. To do so check the PHP script available here.

If you do this and you use APC cache for caching, you may need to increase apc.shm_size in your php.ini to accommodate such a large list.

You have just banned one hundred forty thousand spammers, all hopefully without any disruptive effect on your legitimate users, and said «adieu» to a lot of the worst of the known spammers on the Internet. Good riddance! That should make things a wee bit quieter, at least for a while…

Honeypots, DNS BLs and HTTP BLs

140,000 dead spammers. Not bad, but any proper BOFH at this point would be bored and eagerly looking for the 140,001st spam IP to randomly block. And why not?

Fortunately, dynamically-updated lists of spambots, open proxies and other problem IPs are widely available. Many also allow usernames or email addresses (for logged-in users) to be automatically checked against the same blacklists.

One form of blacklist which may be familiar to MediaWiki administrators is the DNS BL. Hosted on a domain name server, a DNS blacklist is a database of IP addresses. An address lookup determines if an IP attempting to register or edit is an already-known source of net abuse.

The $wgEnableDnsBlacklist and $wgDnsBlacklistUrls options in MediaWiki provide a primitive example of access to a DNS blacklist. Set the following settings in LocalSettings.php and IP addresses listed as HTTP spam are blocked:

$wgEnableDnsBlacklist = true;
$wgDnsBlacklistUrls = array( 'xbl.spamhaus.org', 'opm.tornevall.org' );

The DNS blacklist operates as follows:

  • A wiki gets an edit or new-user registration request from some random IP address (for example, in the format '123.45.67.89')
  • The four IP address bytes are placed into reverse order, then followed by the name of the desired DNS blacklist server
  • The resulting address is requested from the domain name server (in this example, '89.67.45.123.zen.spamhaus.org.' and '89.67.45.123.dnsbl.tornevall.org.')
  • The server returns not found (NXDOMAIN) if the address is not on the blacklist. If it is on either blacklist, the edit is blocked.

The lookup in an externally-hosted blacklist typically adds no more than a few seconds to the time taken to save an edit. Unlike $wgProxyKey settings, which must be loaded on each page read or write, the use of the DNS blacklist only takes place during registration or page edits. This leaves the speed at which the system can service page read requests (the bulk of your traffic) unaffected.

While the original SORBS was primarily intended for dealing with open web proxies and email spam, there are other lists specific to web spam (forums, blog comments, wiki edits) which therefore may be more suitable:

  • .opm.tornevall.org. operates in a very similar manner to SORBS DNSBL, but targets open proxies and web-form spamming. Much of its content is consolidated from other existing lists of abusive IPs.
  • .dnsbl.httpbl.org. specifically targets bots which harvest email addresses from web pages for bulk mail lists, leave comment spam or attempt to steal passwords using dictionary attacks. It requires the user register with projecthoneypot.org for a 12-character API key. If this key (for example) were 'myapitestkey', a lookup which would otherwise look like '89.67.45.123.http.dnsbl.sorbs.net.' or '89.67.45.123.opm.tornevall.org.' would need to be 'myapitestkey.89.67.45.123.dnsbl.httpbl.org.'
  • Web-based blacklists can identify spammer's email addresses and user information beyond a simple IP address, but there is no standard format for the reply from an HTTP blacklist server. For instance, a request for http://botscout.com/test/?ip=123.45.67.89 would return "Y|IP|4" if the address is blacklisted ('N' or blank if OK), while a web request for http://www.stopforumspam.com/api?ip=123.45.67.89 would return "ip yes 2009-04-16 23:11:19 41" if the address is blacklisted (the time, date and count can be ignored) or blank if the address is good.

With no one standard format by which a blacklist server responds to an enquiry, no built-in support for most on-line lists of known spambots exists in the stock MediaWiki package. Since rev:58061, MediaWiki has been able to check multiple DNSBLs by defining $wgDnsBlacklistUrls as an array.

Most blacklist operators provide very limited software support (often targeted to non-wiki applications, such as phpBB or Wordpress). As the same spambots create similar problems on most open-content websites, the worst offenders attacking MediaWiki sites will also be busily targeting thousands of non-wiki sites with spam in blog comments, forum posts and guestbook entries.

Automatic query of multiple blacklist sites is therefore already in widespread use protecting various other forms of open-content sites and the spambot names, ranks and IP addresses are by now already all too well known. A relatively small number of spambots appear to be behind a large percentage of the overall problem. Even where admins take no prisoners, a pattern where the same spambot IP which posted linkspam to the wiki a second ago is spamming blog comments somewhere else now and will be spamming forum posts a few seconds from now on a site half a world away has been duly noted. One shared external blacklist entry can silence one problematic bot from posting on thousands of sites.

This greatly reduces the number of individual IPs which need to be manually blocked, one wiki and one forum at a time, by local administrators.

But what's this about honeypots?

Some anti-spam sites, such as projecthoneypot.org, provide code which you are invited to include in your own website pages.

Typically, the pages contain one or more unique, randomised and hidden email addresses or links, intended not for your human visitors but for spambots. Each time the page is served, the embedded addresses are automatically changed, allowing individual pieces of spam to be directly and conclusively matched to the IP address of bots which harvested the addresses from your sites. The IP address which the bot used to view your site is automatically submitted to the operators of the blacklist service. Often a link to a fake 'comment' or 'guest book' is also hidden as a trap to bots which post spam to web forms. See Honeypot (computing).

Once the address of the spammer is known, it is added to the blacklists (see above) so that you and others will in future have one less unwanted robotic visitor to your sites.

While honeypot scripts and blacklist servers can automate much of the task of identifying and dealing with spambot IPs, most blacklist sites do provide links to web pages on which one can manually search for information about an IP address or report an abusive IP as a spambot. It may be advisable to include some of these links on the special:blockip pages of your wiki for the convenience of your site's administrators.

More lists of proxy and spambot IPs

Typically, feeding the address of any bot or open proxy into a search engine will return many lists on which these abusive IPs have already been reported.

In some cases, the lists will be part of anti-spam sites, in others a site advocating the use of open proxies will list not only the proxy which has been being abused to spam your wiki installation but hundreds of other proxies like it which are also open for abuse. It is also possible to block wiki registrations from anonymised sources such as Tor proxies (Tor Project - torproject.org), from bugmenot fake account users or from email addresses (listed by undisposable.net) intended solely for one-time use.

See also Blacklists Compared - 1 March 2008 and spamfaq.net for lists of blacklists. Do keep in mind that lists intended for spam email abatement will generate many false positives if installed to block comment spam on wikis or other web forms. Automated use of a list that blacklists all known dynamic user IP address blocks, for instance, could render your wiki all but unusable.

To link to IP blacklist sites from the Special:Blockip page of your wiki (as a convenience to admins wishing to manually check if a problem address is an already-known bot):

  1. Add one line to LocalSettings.php to set: $wgNamespacesWithSubpages [NS_SPECIAL] = true;
  2. Add the following text in MediaWiki:Blockiptext to display:
"Check this IP at [http://whois.domaintools.com/{{SUBPAGENAME}} Domain Tools], [http://openrbl.org/?i={{SUBPAGENAME}} OpenRBL], [http://www.projecthoneypot.org/ip_{{SUBPAGENAME}} Project Honeypot], [http://www.spamcop.net/w3m?action=checkblock&ip={{SUBPAGENAME}} Spam Cop], [http://www.spamhaus.org/query/bl?ip={{SUBPAGENAME}} Spamhaus], [http://www.stopforumspam.com/ipcheck/{{SUBPAGENAME}} Stop Forum Spam]."

This will add an invitation to "check this IP at: Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam" to the page from which admins ask to block an IP. An IP address is sufficient information to make comments on Project Honeypot against spambots, Stop Forum Spam is less suited to reporting anon-IP problems as it requires username, IP and email under which a problem bot is attempting to register on your sites. The policies and capabilities of other blacklist-related websites may vary.

Note that blocking the address of the spambot posting to your site is not the same as blocking the URLs of specific external links being spammed in the edited text. Do both. Both approaches used in combination, as a means to supplement (but not replace) other anti-spam tools such as title or username blacklists and tests which attempt to determine whether an edit is made by a human or a robot (captchas or akismet) can be a very effective means to separate spambots from real, live human visitors.

If spam has won the battle

You can still win the war! MediaWiki offers you the tools to do so; just consolidate your positions until you're ready to attack again. See Manual:Combating vandalism and in particular Cleaning up, Restrict editing.

See External links for other tools without MediaWiki support.

Other ideas

This page lists features which are currently included, or available as patches, but on the discussion page you will find many other ideas for anti-spam features which could be added to MediaWiki, or which are under development.

See also

Extensions

  • AbuseFilter — allows edit prevention and blocking based on a variety of criteria
  • A slimmed down ConfirmAccount can be used to moderate new user registrations, (doesn't require captchas).
  • CheckUser — allows, among other things, the checking of the underlying IP addresses of account spammers to block them. Allows mass-blocking of spammers from similar locations.
  • FlaggedRevs
  • SpamRegex — allows basic blocking of edits containing spam domains with a single regex
  • StopForumSpam — allows for checking edits against the StopForumSpam service and allows for submitting data back to it when blocking users.
  • Category:Spam management extensions — category exhaustively listing spam management extensions
  • Moderation — don't show edits to normal users until approved by a moderator. This extension has the advantage that spam links are never shown to the public, so not creating incentive to post spam.

Useful only on some wiki farms:

Commercial services:

Bundled in the installer

The standard tarball available for download now contains most of the main anti-spam extensions, including the following:

  • ConfirmEdit — adds various types of CAPTCHAs to your wiki
  • Nuke — removes all contributions by a user or IP
  • SpamBlacklist — prevents edits containing spam domains, list is editable on-wiki by privileged users

Settings

External links


References

  1. Example: «Automatically solve captchas: GSA Captcha Breaker + Mega Ocr (Solves Recaptcha!)» says user Senukexcr.
  2. For example, Section 508 Standards for Electronic and Information Technology
  3. The New York Times 25 April 2010 Spammers Pay Others to Answer Security Tests by Vikas Bajaj