Manual:$wgSpamRegex

Details
Text matching this regular expression (or "regex") will be recognised as Wiki Spam.

This is one of MediaWiki's most effective built-in anti-spam features. It will not block all spam, but it can reduce spam dramatically, with almost no negative impact on legitimate users. This configuration variable controls how MediaWiki examines the text of each contribution and answers one question: is this spam, yes or no?

A Large Example
The following example is a good setting to try out if your wiki is small or medium-sized and suffering from spam attacks. Paste the following into your LocalSettings.php file:

 $wgSpamRegex = "/".
     "s-e-x|zoofilia|sexyongpin|grusskarte|geburtstagskarten|animalsex|".
     "sex-with|dogsex|adultchat|adultlive|camsex|sexcam|livesex|sexchat|".
     "chatsex|onlinesex|adultporn|adultvideo|adultweb.|hardcoresex|hardcoreporn|".
     "teenporn|xxxporn|lesbiansex|livegirl|livenude|livesex|livevideo|camgirl|".
     "spycam|voyeursex|casino-online|online-casino|kontaktlinsen|cheapest-phone|".
     "laser-eye|eye-laser|fuelcellmarket|lasikclinic|cragrats|parishilton|".
     "paris-hilton|paris-tape|2large|fuel-dispenser|fueling-dispenser|huojia|".
     "jinxinghj|telematicsone|telematiksone|a-mortgage|diamondabrasives|".
     "reuterbrook|sex-plugin|sex-zone|lazy-stars|eblja|liuhecai|".
     "buy-viagra|-cialis|-levitra|boy-and-girl-kissing|". //These match spammy words
     "dirare\.com|".          //This matches dirare.com, a spammer's domain name
     "overflow:\s*auto|".     //This matches against overflow:auto
     "height:\s*[0-4]px|".    //This matches against height:0px (most CSS hidden spam)
     "\<\s*a\s*href|".        //This blocks <a href links entirely, forcing wiki syntax
     "display\s*:\s*none".    //This matches against display:none
     "/i";

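Note that the second-last line (the display:none pattern) does not have the "|" at the end of the string. A trailing "|" there would add an empty alternative, and an empty alternative matches every edit.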

This example incorporates common spamming keywords (some taken from Meta Spam Blacklist) and also techniques for blocking CSS hidden spam.
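
The note about the trailing "|" matters more than it might seem. The following is a minimal sketch, assuming the filter behaves like a plain preg_match() call on the submitted text (MediaWiki's internal handling may differ). The sample sentence is made up for illustration.

```php
<?php
// Pattern with a stray trailing "|": the empty alternative after it
// matches at position 0 of any text.
$broken = '/buy-viagra|/i';
// Same pattern without the trailing "|".
$fixed  = '/buy-viagra/i';

// Hypothetical legitimate edit.
$text = 'A perfectly ordinary sentence.';

var_dump(preg_match($broken, $text)); // int(1): flagged as spam
var_dump(preg_match($fixed, $text));  // int(0): allowed
```

If every edit on your wiki is suddenly rejected after you change $wgSpamRegex, a stray "|" at the start or end of the pattern (or a doubled "||" in the middle) is a likely cause.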

Using regular expressions to block spam
Hopefully you can guess how the above example works. Experiment with the $wgSpamRegex setting, and test out some edits on your SandBox page to see what gets blocked. But beware! Take care to avoid false positives, i.e. incorrectly matching legitimate edits. More on this later, but first let's understand what's going on here.

The value you assign to $wgSpamRegex is a regular expression (see Wikipedia's 'Regular_expression' article). The above example shows a regexp being built up over several lines, using PHP's dot operator to concatenate strings. This makes the long regexp look a bit tidier in some ways, but also a bit more complicated.
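
As an alternative to concatenating strings by hand, the pattern could be assembled from a list. This is only a sketch: buildSpamRegex() is a hypothetical helper, not part of MediaWiki, and it assumes every fragment is already valid regex syntax.

```php
<?php
// Hypothetical helper (not a MediaWiki function): join regex fragments
// into one case-insensitive alternation.
function buildSpamRegex(array $fragments): string
{
    // Drop empty fragments so the pattern never contains an empty
    // alternative, which would match every edit.
    $fragments = array_filter($fragments, function ($f) {
        return $f !== '';
    });
    if (count($fragments) === 0) {
        throw new InvalidArgumentException('No regex fragments given');
    }
    return '/' . implode('|', $fragments) . '/i';
}

// Single-quoted strings keep the backslashes exactly as typed.
$wgSpamRegex = buildSpamRegex([
    'buy-viagra',
    'online-casino',
    'dirare\.com',         // escaped dot: a literal "."
    'display\s*:\s*none',  // CSS hidden spam
]);

echo $wgSpamRegex, "\n";
// Prints: /buy-viagra|online-casino|dirare\.com|display\s*:\s*none/i
```

Keeping one fragment per array entry makes it harder to leave a stray "|" behind when you add or remove a term.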

Simple Example
Here's a simpler example:

 $wgSpamRegex = "/buy-viagra/"; 

Remember, the idea is to decide: is this spam, yes or no? With this example, any contribution text containing 'buy-viagra' will match as spam. The '/' symbols at the beginning and end are delimiters, which are part of the regular expression syntax.
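
To see the yes-or-no decision in isolation, the pattern can be tried outside the wiki. This is a rough sketch that assumes the check is equivalent to a single preg_match() call; looksLikeSpam() and the sample sentences are hypothetical.

```php
<?php
$wgSpamRegex = '/buy-viagra/';

// Hypothetical stand-in for the spam check: true when the regex
// matches anywhere in the text.
function looksLikeSpam(string $regex, string $text): bool
{
    return preg_match($regex, $text) === 1;
}

var_dump(looksLikeSpam($wgSpamRegex, 'Click here to buy-viagra now')); // bool(true)
var_dump(looksLikeSpam($wgSpamRegex, 'The weather was lovely today')); // bool(false)
var_dump(looksLikeSpam($wgSpamRegex, 'Click here to Buy-Viagra now')); // bool(false)
```

The last line is not caught because this pattern has no "i" modifier. The large example at the top of this page ends in "/i" to make the match case-insensitive.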

Block several different words/domains
Let's extend our example to try to match more kinds of spam:

 $wgSpamRegex = "/buy-viagra|adultporn|online-casino|dirare\.com|sexcluborgy\.net/"; 

Using a '|' symbol between words, the above example will block several different spammy words, and also some domain names which are promoted by spammers.

The $wgSpamRegex is applied to all contributed text, including the spam link URLs. As such, blocking domain names can be a very effective way of getting rid of a particular spammer.
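
The sketch below tries the multi-term pattern against a URL and reports which part matched. It assumes plain preg_match() behaviour; firstSpamMatch() and the sample strings are hypothetical.

```php
<?php
$wgSpamRegex = '/buy-viagra|adultporn|online-casino|dirare\.com|sexcluborgy\.net/';

// Hypothetical helper: return the text that triggered the match,
// or null when nothing matched.
function firstSpamMatch(string $regex, string $text): ?string
{
    return preg_match($regex, $text, $m) === 1 ? $m[0] : null;
}

var_dump(firstSpamMatch($wgSpamRegex, 'See http://www.dirare.com/page'));
// string(10) "dirare.com"

var_dump(firstSpamMatch($wgSpamRegex, 'dirareXcom is not the domain'));
// NULL, because "\." only matches a literal dot
```

Escaping the dot as "\." is what keeps the domain pattern precise. An unescaped "." matches any single character, which widens the pattern and raises the risk of false positives.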

AVOID FALSE POSITIVES!
Avoiding false positives is the real challenge here, and it's best illustrated with a bad example:

 $wgSpamRegex = "/cialis/"; # Don't do this!

Lots of spammers like to talk about 'cialis' (some kind of drug. Who cares? Not us!) and so you might be tempted to match the word as spam, but...

...this will prevent users mentioning the word 'specialist'.

Look closely... See how easy it is to make this kind of mistake?

Be careful with your regexp setting. You want to stop spammers, without inconveniencing your users.
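
One common way to narrow such a pattern is the regex word boundary "\b". This is a sketch of that general technique rather than a recommendation from this manual, and the sample sentences are invented.

```php
<?php
$naive   = '/cialis/i';      // matches the letters anywhere, even inside a word
$bounded = '/\bcialis\b/i';  // matches only when "cialis" stands as a whole word

// Hypothetical edits.
$legit = 'Please consult a specialist.';
$spam  = 'Cheap cialis, best price!';

var_dump(preg_match($naive, $legit));   // int(1): false positive
var_dump(preg_match($bounded, $legit)); // int(0): allowed
var_dump(preg_match($bounded, $spam));  // int(1): still caught
```

There is a trade-off: the bounded pattern no longer catches run-together text such as 'cialisonline'. The large example at the top of this page takes a different route and matches '-cialis', which only fires when the word follows a hyphen.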

Other regexp tricks
Regular expressions are very powerful. $wgSpamRegex matching is applied to all text added by a user while they are editing a page on your wiki, not just URLs. This gives you the power to block anything you don't like, if you can work out a good regexp to match it (be as specific as possible to avoid false positives). In the following section on CSS Hidden Spam we make use of this power.

Spam match message
Normally when the $wgSpamRegex setting matches some spam, the following message is displayed:

''The page you wanted to save was blocked by the spam filter. This is probably caused by a link to an external site.''

The following text is what triggered our spam filter: [word/domain name which was blocked]

You can change this message if you like. The text is on an editable wiki page in the MediaWiki namespace. Simply click 'Special Pages' -> 'All System Messages' and then follow the links for 'Spamprotectionmatch' or 'Spamprotectiontext'. If you get 'View Source' instead of 'Edit' on the top tab, then you don't have permission to edit. You need to log in as a sysop user (or the WikiSysop user which you configured during installation).

Displaying/Hiding the matched text
If you've made a regex which is too restrictive, or you've made some other mistake in the setting, then you may get false positives. Indeed, the full example above might match legitimate text in some rare circumstances (maybe your users really do want to talk about buying viagra). By displaying the text which matched, the MediaWiki:Spamprotectionmatch message helps to reduce problems caused by false positives. It allows your users to report problems with your $wgSpamRegex setting to you accurately. It also allows them to figure out a workaround, so they can continue with their wiki editing.

Unfortunately it's also a very useful bit of information for spammers visiting your site. Some spammers are automated bots, so they won't be seeing this information anyway. However, many spammers (believe it or not) are humans. These humans could go to the trouble of looking at the matching information and trying to devise a workaround (e.g. just leaving out the domain name that you have blocked, but linking to various other domains). It's difficult to know how prevalent this kind of behaviour is, but if you wanted to make life more difficult for them...

...you could hide the spam matching information by simply setting your MediaWiki:Spamprotectionmatch message as empty. You should only do this if you are very aware of the above points about false positives, and have carefully designed your regexp to avoid them.

A message to the spammers
Occasionally spammers have openly discussed their behaviour with the people who fight spam, and the people who are victims of it. From these discussions it's clear that they really believe they are not doing anything wrong. We should tell them otherwise.

Edit your 'MediaWiki:Spamprotectiontext' page, and write a message to the spammers. It's better if it's in your own words. If a spammer visits many different wikis and gets many different messages telling them to quit, who knows? Maybe they'll start to think about what they are doing. It's probably better to keep the language reasonably polite. You are attempting to reason with them, after all. Also remember that your legitimate users might end up getting this message in the case of false positives.

Example:

''This website is not here to help you promote your site in search engine rankings. Going around wiki sites like this, adding irrelevant messages with links, is called 'wiki spamming'. It is a thoroughly anti-social thing to do. Just because something isn't illegal doesn't mean it isn't wrong.''

In many cases it's a waste of time, but it would be nice if just a few of these people put their talents to better uses.

CSS Hidden Spam
MediaWiki is quite permissive when it comes to HTML tags and CSS style definitions (see Meta's Help:HTML_in_wikitext).

This has given spammers the opportunity to invent a sneaky trick to hide their spam from view. It doesn't show up on your pages, but it does show up in your edit boxes, and the changes show up in your 'recent changes' display. As such it causes confusion to your legitimate users, and that's before you consider the effects of helping a spammer by hosting their links. Generally 'CSS Hidden Spam' is all bad. Just because you can't see it (easily), doesn't mean you can ignore it.

The problem was identified by the folks at chongqed.org in 2005, but got a lot worse in 2006, to the point where it seems most MediaWiki spammers are using this trick.

We can use a regular expression to prevent the CSS tricks which they are using. Two of these are incorporated in the full example above (combined using the '|' symbol):

To prevent CSS hidden spam that combines overflow:auto with a height of 0px to 4px:

 overflow:\s*auto;\s*height:\s*[0-4]px;

To prevent CSS hidden spam of the form style="display:none;":

 style\s*=\s*\"\s*display\s*:\s*none";

For a slightly stricter setting, you might prefer to disallow various properties in a style attribute altogether:

 $wgSpamRegex = "/\<.*style.*?(display|position|overflow|visibility|height)\s*:.*? *>/i";

...but you may find this starts to restrict your users more than you would like.
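
Before deploying CSS patterns like these, it can help to try them against sample markup. The sketch below uses the three CSS-related alternatives from the large example at the top of this page and assumes plain preg_match() behaviour; cssBlocked() and the wikitext samples are hypothetical.

```php
<?php
// The CSS-related alternatives from the large example, on their own.
$cssRegex = '/overflow:\s*auto|height:\s*[0-4]px|display\s*:\s*none/i';

// Hypothetical helper: true when the markup would be rejected.
function cssBlocked(string $regex, string $wikitext): bool
{
    return preg_match($regex, $wikitext) === 1;
}

// Hypothetical wikitext samples.
$samples = [
    'hidden-overflow' => '<div style="overflow:auto; height:1px;">[http://spam.example buy]</div>',
    'hidden-display'  => '<span style="display: none">[http://spam.example buy]</span>',
    'tall-box'        => '<div style="height:200px">A legitimate box</div>',
    'plain-colour'    => '<span style="color:red">Warning</span>',
];

foreach ($samples as $name => $wikitext) {
    echo str_pad($name, 16), cssBlocked($cssRegex, $wikitext) ? 'blocked' : 'allowed', "\n";
}
// hidden-overflow blocked
// hidden-display  blocked
// tall-box        allowed
// plain-colour    allowed
```

Note that a legitimate thin element such as style="height:3px" would also be blocked, so check the patterns against the kind of markup your own users write.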

Block ALL external links
You can block all external links by using this regex:

 # Block ALL external links
 $wgSpamRegex = "/^http:|^\[[^][]*\]$/";

Note that the ^ and $ anchors tie this pattern to the start and end of the text being checked, so test it on a sandbox page before relying on it.

Obviously this is restrictive to your real users (they can't link to anything anymore). As such, it is a poor solution to the spam problem (although marginally better than a complete lockdown).

If you are going to use this, make sure your 'MediaWiki:Spamprotectiontext' page has an explanation of what you've done.
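
The sketch below checks what the pattern above matches when it is applied with a single preg_match() call to a whole piece of text. That is an assumption, since this page does not say exactly how MediaWiki slices the text before matching. The unanchored $anywhere pattern is an assumed alternative for comparison, not a setting taken from this manual, and it only covers http and https URLs.

```php
<?php
// Pattern from the text above (single-quoted so PHP leaves the
// backslashes and the "$" untouched).
$anchored = '/^http:|^\[[^][]*\]$/';
// Assumed alternative: match "http://" or "https://" anywhere in the text.
$anywhere = '/https?:\/\//i';

// Hypothetical edits.
$samples = [
    'http://example.com',
    '[http://example.com label]',
    'See http://example.com for details',
];

foreach ($samples as $text) {
    printf(
        "%-40s anchored=%d anywhere=%d\n",
        $text,
        preg_match($anchored, $text),
        preg_match($anywhere, $text)
    );
}
// http://example.com                       anchored=1 anywhere=1
// [http://example.com label]               anchored=1 anywhere=1
// See http://example.com for details       anchored=0 anywhere=1
```

Under that assumption, the anchored pattern lets through a link that appears in the middle of the text, while the unanchored one catches it. Either way, legitimate users lose the ability to add links, so the trade-off described above still applies.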

Other ways of fighting wiki spam
$wgSpamRegex is one of the most effective anti-spam features available in MediaWiki, but there are some other tricks. See Anti-spam Features.