Extension:SpamBlacklist

From MediaWiki.org

SpamBlacklist

Release status: experimental
Implementation: Page action
Description: Regex-based spam filter.
Author(s): Tim Starling
MediaWiki: 1.6.0+
License: Any OSI approved license

The SpamBlacklist extension prevents edits that contain URL hosts that match regular expression patterns defined in specified files or wiki pages. When someone tries to save the page, it checks the text against a potentially very large list of illegal host names. If there is a match, it displays an error message to the user and refuses to save the page.

Installation

The extension is intended for MediaWiki 1.6.0 or later. Because it is experimental, compatibility with any particular release is not guaranteed.

Basic installation

  1. Save the SpamBlacklist files to a subdirectory called SpamBlacklist in your extensions directory. You should have at least the following three files in that SpamBlacklist directory.
    • SpamBlacklist/SpamBlacklist.php
    • SpamBlacklist/SpamBlacklist_body.php
    • SpamBlacklist/SpamBlacklist.i18n.php
  2. Add the following line to LocalSettings.php in your MediaWiki root directory:
require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );

The list of bad URLs can be drawn from multiple sources. These sources are configured with the $wgSpamBlacklistFiles global variable, which can be set in LocalSettings.php after the line that includes SpamBlacklist.php.

$wgSpamBlacklistFiles is an array in which each value is a URL, a filename or a database location. Specifying a database location allows you to draw the blacklist from a page on your wiki. The format of the database location specifier is "DB: [db name] [title]".

The local pages MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist will always be used, whatever additional files are listed.

Once the extension is required in LocalSettings.php (and $wgSpamBlacklistFiles is set, if you want to override the default sources), the filter should be active.

Custom blacklist sources

The primary source for a MediaWiki-compatible blacklist file is the Wikimedia spam blacklist on Meta-Wiki, at http://meta.wikimedia.org/wiki/Spam_blacklist. The default configuration loads this list once every 10-15 minutes. However, the Wikimedia spam blacklist can only be edited by trusted administrators. Since the list is used by large, diverse wikis with many thousands of external links, the Wikimedia blacklist is comparatively conservative in the links it blocks. You can suggest modifications to the blacklist at http://meta.wikimedia.org/wiki/Talk:Spam_blacklist.

If you'd like to draw the list of bad host names from multiple or different sources, set the $wgSpamBlacklistFiles array after the line that includes the extension. Note that once you define $wgSpamBlacklistFiles, the default behaviour (checking the Meta-Wiki blacklist) no longer takes place, so list the Meta-Wiki source explicitly if you still want it. Each entry is a URL, a filename, or a database location, as described above.

The format of the database location specifier is "DB: [db name] [title]". [db name] should exactly match the value of $wgDBname in LocalSettings.php. You should create the required page name [title] in the default namespace of your wiki. If you do this, it is strongly recommended that you protect the page from general editing. Besides the obvious danger that someone may add a regex that matches everything, please note that an attacker with the ability to input arbitrary regular expressions may be able to generate segfaults in the PCRE library.

For example:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );
$wgSpamBlacklistFiles = array(
   "$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list
   //  database      title
   "DB: wikidb My_spam_blacklist",    
);

In the above example, the spam blacklist is constructed from two sources: a file called wikimedia_blacklist in the SpamBlacklist directory of the wiki installation, and the contents of a page on the wiki called My_spam_blacklist. If you are not hosting the Wikimedia blacklist locally, you'll need to change that line to something like:

"http://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1", // Wikimedia's list

Whitelist

A corresponding whitelist can be maintained by editing the MediaWiki:Spam-whitelist message. This is useful if you would like to override select entries from another wiki's blacklist that you are using.
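
The following standalone PHP sketch illustrates the idea of a whitelist taking precedence over a blacklist. It is illustrative only: the function name findBlockedLinks, both regexes and the sample links are made up for this example, and the extension's own implementation may apply the whitelist differently.

```php
<?php
// Illustrative only: one way a whitelist can take precedence over a blacklist.
// The function name, both regexes and the links are made up for this example.
function findBlockedLinks( array $links, $blacklistRegex, $whitelistRegex ) {
    $blocked = array();
    foreach ( $links as $link ) {
        if ( preg_match( $whitelistRegex, $link ) ) {
            continue; // whitelisted links are never reported
        }
        if ( preg_match( $blacklistRegex, $link ) ) {
            $blocked[] = $link;
        }
    }
    return $blocked;
}

// A broad blacklist entry, and a whitelist entry that exempts one subdomain.
$blacklist = '/https?:\/\/[a-z0-9\-.]*(example\.com)/Si';
$whitelist = '/https?:\/\/[a-z0-9\-.]*(docs\.example\.com)/Si';

$blocked = findBlockedLinks(
    array( 'http://docs.example.com/manual', 'http://shop.example.com/' ),
    $blacklist,
    $whitelist
);
print_r( $blocked );
```

In this sketch only http://shop.example.com/ is reported, because the docs.example.com link matches the whitelist before the blacklist is consulted.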

Notes

  • The extension and this documentation were written by Tim Starling and are ambiguously licensed.
  • This extension examines only new external links being added by wiki editors. To check user agents, add Bad Behaviour or Akismet; to check an editor's IP address against lists of known spambots, supplement this extension with Check Spambots. As the various tools for combating spam on MediaWiki use different methods to spot abuse, the safeguards are best used in combination.
  • The Extension:SpamBlacklist/update script is a cron script that can automate updates from shared blacklists. If you are using memcached, you will also have to delete the spam_blacklist_regexes key (for example, using maintenance/mcc.php).
  • If you're sharing a server and cache with several wikis, you may improve your cache performance by modifying getSharedBlacklists and clearCache in SpamBlacklist_body.php to use $wgSharedUploadDBname (or a specific DB if you do not have a shared upload DB) rather than $wgDBname. Be sure to get all references! The regexes from the separate MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist pages on each wiki will still be applied.

Usage

Syntax

Everything on a line after a '#' character is ignored (for comments). All other strings are regex fragments which will only match inside URLs.

Notes

  • Do not add "http://"; this would fail, since the regex matches after "http://" (or "https://") inside URLs.
  • "www" is also unnecessary, since the regex matches any subdomain. Giving "www\." explicitly restricts the match to that specific subdomain.
  • The '^' and '$' anchors match the beginning and end of the page, not the beginning and end of the URL.
  • Slashes do not need to be escaped with backslashes; the script does this automatically.

Example

If you add the following line to the blacklist:

\bexample\.com

It will block all URLs that contain the string example.com, unless a letter, digit or underscore immediately precedes it. For example, http://www.example.com, http://www.this-example.com and http://www.google.de/search?q=example.com will be blocked, while http://www.thisexample.com will not be blocked by this line.
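
The comment rule and the effect of \b can be tried out with a short standalone PHP sketch. The helper name parseBlacklistLines is hypothetical and not part of the extension, and the sketch applies a single fragment directly to each URL, which is a simplification of what the extension does.

```php
<?php
// Hypothetical helper, not part of the extension: turn raw blacklist text
// into a list of regex fragments, following the syntax rules above.
function parseBlacklistLines( $text ) {
    $fragments = array();
    foreach ( explode( "\n", $text ) as $line ) {
        // Everything after a '#' is a comment and is ignored.
        $line = trim( preg_replace( '/#.*$/', '', $line ) );
        if ( $line !== '' ) {
            $fragments[] = $line;
        }
    }
    return $fragments;
}

$fragments = parseBlacklistLines( "# my local list\n\\bexample\\.com  # sample entry\n\n" );

// Simplification: apply the single fragment directly to each URL.
$urls = array(
    'http://www.example.com',
    'http://www.this-example.com',
    'http://www.thisexample.com',
);
foreach ( $urls as $url ) {
    $blocked = preg_match( '/' . $fragments[0] . '/i', $url ) === 1;
    echo $url, ' => ', $blocked ? 'blocked' : 'allowed', "\n";
}
```

Only the last URL is reported as allowed, because the "s" before "example" means there is no word boundary at that position.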

Performance

The extension creates a single regex statement which looks like /https?:\/\/[a-z0-9\-.]*(line 1|line 2|line 3|....)/Si (where all slashes within the lines are escaped automatically). It saves this in a small "loader" file to avoid loading all the code on every page view. Page view performance will not be affected even if you're not using a bytecode cache like MMCache, although using a cache is strongly recommended for any MediaWiki installation.

The regex match itself generally adds an insignificant overhead to page saves (on the order of 100ms in our experience). However, loading the spam file from disk or the database, and constructing the regex, may take a significant amount of time depending on your hardware. If you find that enabling this extension slows down saves excessively, try installing a supported bytecode cache. The SpamBlacklist extension will cache the constructed regex if such a system is present.
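
A minimal sketch of how a combined regex of the shape quoted in this section could be assembled is shown here. The function name buildBlacklistRegex and the sample fragments are hypothetical; the extension's actual code may differ, for instance in how it validates or batches fragments.

```php
<?php
// Hypothetical sketch, not the extension's actual code: join blacklist
// fragments into one regex of the shape described in this section.
function buildBlacklistRegex( array $fragments ) {
    $escaped = array();
    foreach ( $fragments as $fragment ) {
        // Escape forward slashes so they cannot end the '/'-delimited regex early.
        // Assumes the slashes in the fragment are not already escaped.
        $escaped[] = str_replace( '/', '\/', $fragment );
    }
    return '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $escaped ) . ')/Si';
}

$regex = buildBlacklistRegex( array( '\bexample\.com', 'spam\.example\.org/offers' ) );

// One match against the whole text is enough to reject the edit.
$text = 'See http://www.example.com and http://wiki.example.net/page for details.';
if ( preg_match( $regex, $text, $m ) ) {
    echo "Blocked link: {$m[0]}\n";
}
```

Joining all fragments into a single alternation means the text is scanned once per save rather than once per blacklist line.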

External blacklist servers (RBLs)

In its standard form, this extension requires that the blacklist be constructed manually. While regular expression wildcards are permitted, and a blacklist originated on one wiki may be re-used by many others, there is still some effort required to add new patterns in response to spam or remove patterns which generate false-positives.

Much of this effort may be reduced by supplementing the spam regex with lists of known domains advertised in spam e-mail. The regex will catch common patterns (like "casino-" or "-viagra") while the external blacklist server will automatically update with names of specific sites being promoted through spam.

In the filter() function in SpamBlacklist_body.php, approximately halfway between the file start and end, are the lines:

       # Do the match
       wfDebugLog( 'SpamBlacklist', "Checking text against " . count( $blacklists ) .
           " regexes: " . implode( ', ', $blacklists ) . "\n" );

Directly above this section (which does the actual regex test on the extracted links), one could add additional code to check the external RBL servers:

        # Do RBL checks
        # DNS zones of the external blacklists to query; the trailing dot makes each name fully qualified
        $rblZones = array( 'l1.apews.org.', 'block.rhs.mailpolice.com.', 'multi.surbl.org.', 'multi.uribl.com.' );
        foreach ( $addedLinks as $link ) {
            # parse_url() returns false for malformed URLs and omits 'host' for relative ones
            $parts = parse_url( $link );
            if ( !$parts || empty( $parts['host'] ) ) {
                continue;
            }
            $linkHost = $parts['host'];
            foreach ( $rblZones as $base ) {
                # A listed domain resolves as <domain>.<zone>; an unlisted one does not resolve
                $host = "$linkHost.$base";
                $ipList = gethostbynamel( $host );
                if ( $ipList ) {
                    wfDebug( "RBL match: Hostname $host is {$ipList[0]}, it's spam says $base!\n" );
                    $ip = wfGetIP();
                    wfDebugLog( 'SpamBlacklistHit', "$ip caught submitting spam: {$linkHost} per RBL {$base}\n" );
                    wfProfileOut( $fname );
                    return $linkHost . ' (blacklisted by ' . $base . ')';
                }
            }
        }
 
        # if no match found on RBL server, continue normally with regex tests...

This ensures that, if an edit contains URLs from already-blacklisted spam domains, an error is returned to the user indicating which link cannot be saved due to its appearance on an external spam blacklist. If nothing is found, the remaining regex tests are allowed to run normally, so that any manually specified 'suspicious pattern' in the URL may be identified and blocked.

Note that the RBL servers list just the base domain names, not the full URL path, so http://example.com/casino-viagra-lottery.html will trigger the RBL check only if "example.com" itself is blacklisted by name by the external server. The regex, however, can block on any of the text in the URL and path, from "example" to "lottery" and everything in between. Both approaches carry some risk of false positives: the regex because of its use of wildcard expressions, and the external RBL because these servers are often created for other purposes (such as control of abusive spam e-mail) and may include domains which are not engaged in forum, wiki, blog or guestbook comment spam per se.
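
The lookup logic can be exercised outside MediaWiki with a small standalone sketch. The names rblLookup and fakeResolver are hypothetical, and the zone rbl.example. is a placeholder rather than a real service. Passing the resolver in as a parameter is an assumption made here so that the logic can be tried without real DNS queries; by default it falls back to gethostbynamel().

```php
<?php
// Hypothetical standalone sketch of the RBL check; not part of the extension.
// $resolver is any callable taking a hostname and returning an array of IPs
// or false, mirroring gethostbynamel().
function rblLookup( $url, array $zones, $resolver = 'gethostbynamel' ) {
    $parts = parse_url( $url );
    if ( !$parts || empty( $parts['host'] ) ) {
        return false; // no usable host name in this link
    }
    foreach ( $zones as $zone ) {
        // A listed domain resolves under the zone; an unlisted one does not.
        $query = $parts['host'] . '.' . $zone;
        if ( call_user_func( $resolver, $query ) ) {
            return $parts['host'] . ' (blacklisted by ' . $zone . ')';
        }
    }
    return false;
}

// Fake resolver for demonstration: pretends only example.com is listed.
function fakeResolver( $host ) {
    return $host === 'example.com.rbl.example.' ? array( '127.0.0.2' ) : false;
}

$zones = array( 'rbl.example.' );
var_dump( rblLookup( 'http://example.com/casino-viagra-lottery.html', $zones, 'fakeResolver' ) );
var_dump( rblLookup( 'http://www.mediawiki.org/', $zones, 'fakeResolver' ) );
```

The first call returns a message naming the host and the zone that listed it, regardless of the path in the URL; the second returns false because the fake resolver does not list that host.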

Stability

This extension has not been widely tested outside the Wikimedia Foundation. Although it has been in use on Wikimedia websites since December 2004, it should be considered experimental. Its design is simple, with little input validation, so unexpected behaviour due to incorrect regular expression input or non-standard configuration is entirely possible.

See also

This extension is being used on one or more of Wikimedia's wikis. This means that the extension is stable and works well enough to be used by such high-traffic websites. A full list of the extensions installed on a particular wiki is produced by Special:Version on that wiki.