Extension:SpamBlacklist

From mediawiki.org
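This extension comes with MediaWiki 1.21 and later, so you do not need to download it again. However, you still need to follow the other instructions provided.
MediaWiki extensions manual
SpamBlacklist
Release status: stable
Implementation: Page action
Description: Provides a regex-based spam filter
Author: Tim Starling
Latest version: continuous updates
Compatibility policy: Snapshots are released along with MediaWiki. Master is not backward compatible.
MediaWiki: 1.31+
License: GNU General Public License 2.0 or later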
Parameters:
  • $wgBlacklistSettings
  • $wgLogSpamBlacklistHits
Added rights:
  • sboverride
  • spamblacklistlog
Quarterly downloads 51 (Ranked 100th)
Public wikis using 4,601 (Ranked 180th)
A proposal to rename this extension is being discussed in task T254649.

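The SpamBlacklist extension prevents edits that contain URLs whose domains match regular expression patterns defined in specified files or wiki pages, and prevents registration by users using specified email addresses.

When someone tries to save a page, this extension checks the text against a (potentially very large) list of illegitimate host names. If there is a match, the extension displays an error message to the user and refuses to save the page.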

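Installation and setup

Installation

  • Download the extension and place the files in a directory called SpamBlacklist in your extensions/ folder.
    Developers and code contributors should instead install the extension from Git, using:
    cd extensions/
    git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/SpamBlacklist
  • Add the following code at the bottom of your LocalSettings.php file:
    wfLoadExtension( 'SpamBlacklist' );
  • Adjust the configuration as required.
  • Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.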

Setting the block list

The following local pages are always used, whatever additional sources are listed:

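The default additional source for the URL block list is the Wikimedia spam block list on Meta-Wiki, at m:Spam blacklist. By default, the extension uses this list and reloads it every 10 to 15 minutes. For many wikis, using this list is enough to block most spamming attempts. However, because the Wikimedia block list is used by a diverse group of large wikis with hundreds of thousands of external links, it is comparatively conservative in the links it blocks.

The Wikimedia spam block list can only be edited by administrators. However, you can suggest changes to the block list at m:Talk:Spam blacklist.

You can add other bad URLs on your own wiki. List them in the global variable $wgBlacklistSettings in LocalSettings.php. See the examples below.

$wgBlacklistSettings is a two-level array. The top-level key is spam or email. Each takes an array whose values are a URL, a file name, or a database location.

If you use $wgBlacklistSettings in LocalSettings.php, the default value of [[m:Spam blacklist]] is no longer used. To keep using that block list, you have to add it back manually. See the examples below.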

Specifying a database location allows you to draw the block list from a page on your wiki.

The format of the database location specifier is "DB: [db name] [title]". [db name] should exactly match the value of $wgDBname in LocalSettings.php.

You should create the required page name [title] in the default namespace of your wiki. If you do this, it is strongly recommended that you protect the page from general editing. Besides the obvious danger that someone may add a regex that matches everything, please note that an attacker with the ability to input arbitrary regular expressions may be able to generate segfaults in the PCRE library.

If you want to use, for instance, the English-language Wikipedia's spam block list in addition to the standard Meta-Wiki one, you could add the following to LocalSettings.php, after the wfLoadExtension( 'SpamBlacklist' ); call:

$wgBlacklistSettings = [
	'spam' => [
		'files' => [
			"https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1",
			"https://en.wikipedia.org/w/index.php?title=MediaWiki:Spam-blacklist&action=raw&sb_ver=1"
		],
	],
];

Here is an example of an entirely local set of block lists: the administrator is using the update script to generate a local file called "wikimedia_blacklist" that holds a copy of the Meta-Wiki blacklist, and has an additional block list on the wiki page "My spam block list":

$wgBlacklistSettings = [
	'spam' => [
		'files' => [
			"$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list
			// database, title
			'DB: wikidb My_spam_block_list',    
		],
	],
];
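
Both examples above set only the spam key. As a rough sketch of the two-level structure, the following fragment configures spam and email side by side. It assumes that the email key accepts the same files sub-key as spam, which this page implies but does not show. The database name wikidb and the page title My_email_block_list are placeholders:

```php
$wgBlacklistSettings = [
	// Sources for the URL block list
	'spam' => [
		'files' => [
			'https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1',
		],
	],
	// Sources for the email address block list (assumed to mirror 'spam')
	'email' => [
		'files' => [
			// database, title (hypothetical page name)
			'DB: wikidb My_email_block_list',
		],
	],
];
```

As with the other examples, this would go in LocalSettings.php after the wfLoadExtension( 'SpamBlacklist' ); call.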

Logging

By default, the extension does not log hits to a spam blacklist log. To enable logging, set $wgLogSpamBlacklistHits = true;. You can use the spamblacklistlog user right to control access to the logs. By default, every signed-in user can view the logs.
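
A minimal LocalSettings.php sketch of this setup might look as follows. It assumes the log right is named spamblacklistlog, as listed at the top of this page, and the policy of limiting the log to the sysop group is only an illustration:

```php
// Enable the spam blacklist log (off by default).
$wgLogSpamBlacklistHits = true;

// Illustrative policy: stop ordinary signed-in users from reading the log
// and allow administrators to read it instead.
$wgGroupPermissions['user']['spamblacklistlog'] = false;
$wgGroupPermissions['sysop']['spamblacklistlog'] = true;
```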


Issues

Backtrack Limit

If you encounter issues with the block list, you may want to increase the PCRE backtrack limit.

Bear in mind that the backtrack limit is a performance safeguard, so raising it can reduce your protection against denial-of-service (DoS) attacks [1]. The limit can be raised as follows:

// Raise the Perl Compatible Regular Expressions (PCRE) backtrack limit
// (the PHP 5.3.x default, 1000K, is too low for SpamBlacklist)
ini_set( 'pcre.backtrack_limit', '8M' );
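
Before raising the limit, it can help to confirm that the backtrack limit is really what is failing. The following standalone sketch is not part of the extension, and describePcreResult() is a hypothetical helper name. It uses PHP's preg_last_error() to tell a backtrack-limit failure apart from an ordinary non-match:

```php
<?php
// Hypothetical helper: reports how a single preg_match() call ended.
function describePcreResult( string $regex, string $text ): string {
	$result = @preg_match( $regex, $text );
	if ( $result === false ) {
		// preg_match() returns false on error; ask PCRE which one it was.
		return preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
			? 'backtrack limit exhausted'
			: 'other PCRE error';
	}
	return $result === 1 ? 'match' : 'no match';
}

echo 'pcre.backtrack_limit = ' . ini_get( 'pcre.backtrack_limit' ) . "\n";
echo describePcreResult( '/\bexample\.com\b/i', 'http://www.example.com' ) . "\n";
```

If a pattern from your block list reports an exhausted backtrack limit against a long page, that suggests the limit, rather than the pattern syntax, is the problem.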

Hardened Wikis

SpamBlacklist will not allow editing if the wiki is hardened. Hardening here includes limiting open_basedir so that curl is not on the path, and setting allow_url_fopen=Off in php.ini.

In the hardened case, SpamBlacklist causes an exception when Guzzle attempts to make a network request. The Guzzle exception message is: "GuzzleHttp requires cURL, the allow_url_fopen ini setting, or a custom HTTP handler."
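
To check whether a server falls into this case, a small diagnostic along the following lines may help. It is a sketch only: canFetchRemoteLists() is a hypothetical helper, and the checks approximate, rather than reproduce, the conditions Guzzle tests for. Run it under the same PHP configuration as the web server, since the command-line php.ini can differ:

```php
<?php
// Hypothetical helper: true if PHP has at least one way to make HTTP requests.
function canFetchRemoteLists(): bool {
	// cURL must be loaded and not switched off via disable_functions.
	$hasCurl = extension_loaded( 'curl' ) && function_exists( 'curl_exec' );
	// ini_get() returns a string such as "1" or ""; normalise it to a boolean.
	$hasUrlFopen = filter_var( ini_get( 'allow_url_fopen' ), FILTER_VALIDATE_BOOLEAN );
	return $hasCurl || $hasUrlFopen;
}

echo canFetchRemoteLists()
	? "cURL or allow_url_fopen is available.\n"
	: "Neither cURL nor allow_url_fopen is available; remote block lists cannot be fetched.\n";
```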

Safe list

A corresponding safe list can be maintained by editing the MediaWiki:Spam-whitelist page. This is useful if you would like to override certain entries from another wiki's block list that you are using. Wikimedia wikis, for instance, sometimes use the spam block list for purposes other than combating spam.

It is questionable how effective the Wikimedia spam block lists are at keeping spam off of third-party wikis. Some spam might be targeted only at Wikimedia wikis, or only at third-party wikis, which would make Wikimedia's blacklist of little help to said third-party wikis in those cases. Also, some third-party wikis might prefer that users be allowed to cite sources that Wikipedia does not allow. Sometimes what one wiki considers useless spam, another wiki might consider useful.

Users may not always realize that, when a link is rejected as spammy, it does not necessarily mean that the individual wiki they are editing has specifically chosen to ban that URL. Therefore, wiki system administrators may want to edit the system messages at MediaWiki:Spamprotectiontext and/or MediaWiki:Spamprotectionmatch on their wiki to invite users to make suggestions at MediaWiki talk:Spam-whitelist for pages that an administrator should add to the safe list. For example, you could put the following in MediaWiki:Spamprotectiontext:

The text you wanted to save was blocked by the spam filter. This is probably caused by a link to a blacklisted external site. {{SITENAME}} maintains [[MediaWiki:Spam-blacklist|its own block list]]; however, most blocking is done by means of [[metawikimedia:Spam-blacklist|Meta-Wiki's block list]], so this block should not necessarily be construed as an indication that {{SITENAME}} made a decision to block this particular text (or URL). If you would like this text (or URL) to be added to [[MediaWiki:Spam-whitelist|the local spam safe list]], so that {{SITENAME}} users will not be blocked from adding it to pages, please make a request at [[MediaWiki talk:Spam-whitelist]]. A [[Project:Sysops|sysop]] will then respond on that page with a decision as to whether it should be listed as safe.

Notes

  • This extension examines only new external links added by wiki editors. To check user agents, add Akismet. As the various tools for combating spam on MediaWiki use different methods to spot abuse, the safeguards are best used in combination.
  • Users with the sboverride right can override the block list and add blocked links to pages. By default this right is only given to bots.
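
If trusted humans should also be able to bypass the block list, the right can be granted to another group in LocalSettings.php. The choice of the sysop group here is only an illustration:

```php
// Illustrative policy: let administrators add blocked links, as bots can by default.
$wgGroupPermissions['sysop']['sboverride'] = true;
```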

Usage

Syntax

If you would like to create a block list of your own, or modify an existing one, here is the syntax:

Everything on a line after a '#' character is ignored (for comments). All other strings are regex fragments which will only match inside URLs.

Notes
  • Do not add "http://"; this would fail, since the regex matches after "http://" (or "https://") inside URLs.
  • Furthermore, "www" is not needed, since the regex matches any subdomain. By giving "www\." explicitly, one can match specific subdomains.
  • The (?<=//|\.) and $ anchors match the beginning and end of the domain name, not the beginning and end of the URL. The regular anchor ^ won't be of any use.
  • Slashes don't need to be escaped with backslashes; the script does this automatically.

The following line will block all URLs that contain the string "example.com", except where it is immediately preceded or followed by a letter or a number.

\bexample\.com\b

These are blocked:

  • http://www.example.com
  • http://www.this-example.com
  • http://www.google.de/search?q=example.com

These are not blocked:

  • http://www.goodexample.com
  • http://www.google.de/search?q=example.commodity
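
The word-boundary behaviour of this fragment can be tried out in plain PHP. The sketch below tests the fragment on its own, case-insensitively, against the five URLs listed above. It does not reproduce the larger expression that the extension builds around each fragment, so treat it as an illustration of \b rather than of the extension's exact verdicts:

```php
<?php
// The example fragment, wrapped in delimiters and tested in isolation.
$fragment = '/\bexample\.com\b/i';

$urls = [
	'http://www.example.com',
	'http://www.this-example.com',
	'http://www.google.de/search?q=example.com',
	'http://www.goodexample.com',
	'http://www.google.de/search?q=example.commodity',
];

foreach ( $urls as $url ) {
	// \b needs a non-word character (or the string edge) next to the fragment.
	$verdict = preg_match( $fragment, $url ) === 1 ? 'matches' : 'no match';
	echo str_pad( $verdict, 9 ) . ' ' . $url . "\n";
}
```

The first three URLs match because "example.com" is bordered by ".", "-" or "=" and by the end of the string. The last two do not, because a letter sits directly before or after the fragment.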

Performance

The extension creates a single regex statement which looks like /https?:\/\/[a-z0-9\-.]*(line 1|line 2|line 3|....)/Si (where all slashes within the lines are escaped automatically). It saves this in a small "loader" file to avoid loading all the code on every page view. Page view performance will not be affected even if you're not using a bytecode cache, although using a cache is strongly recommended for any MediaWiki installation.
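
To make the shape of that combined expression concrete, here is a simplified sketch of how block list lines could be turned into one regex. It is not the extension's own code: buildSpamRegex() is a hypothetical helper, and details such as batching and validation of individual lines are left out.

```php
<?php
// Hypothetical helper: joins block list lines into one combined regex.
function buildSpamRegex( array $lines ): ?string {
	$fragments = [];
	foreach ( $lines as $line ) {
		// Everything after '#' is a comment.
		$line = trim( preg_replace( '/#.*$/', '', $line ) );
		if ( $line === '' ) {
			continue;
		}
		// Drop backslashes already placed before a slash, then escape
		// every slash exactly once for the '/' delimiter.
		$line = preg_replace( '|\\\\*/|', '/', $line );
		$fragments[] = str_replace( '/', '\/', $line );
	}
	if ( !$fragments ) {
		return null;
	}
	return '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $fragments ) . ')/Si';
}

$regex = buildSpamRegex( [
	'# local additions',
	'\bexample\.com\b   # test entry',
	'spam-site\.org/offers',
] );

echo $regex . "\n";
var_dump( preg_match( $regex, 'http://www.example.com/page' ) === 1 );
var_dump( preg_match( $regex, 'http://www.goodexample.com/' ) === 1 );
```

The domain names in the sample list are placeholders. Because every fragment ends up in one alternation, a single preg_match() call per text is enough, which is why the match itself is cheap compared with loading and building the list.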

The regex match itself generally adds an insignificant overhead to page saves (on the order of 100ms in our experience). However, loading the spam file from disk or the database, and constructing the regex, may take a significant amount of time depending on your hardware. If you find that enabling this extension slows down saves excessively, try installing a supported bytecode cache. This extension will cache the constructed regex if such a system is present.

If you're sharing a server and cache with several wikis, you may improve your cache performance by modifying getSharedBlacklists and clearCache in SpamBlacklist_body.php to use $wgSharedUploadDBname (or a specific DB if you do not have a shared upload DB) rather than $wgDBname. Be sure to get all references! The regexes from the separate MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist pages on each wiki will still be applied.

External block list servers (RBLs)

In its standard form, this extension requires that the block list be constructed manually. While regular expression wildcards are permitted, and a block list originated on one wiki may be re-used by many others, there is still some effort required to add new patterns in response to spam or remove patterns which generate false-positives.

Much of this effort may be reduced by supplementing the spam regex with lists of known domains advertised in spam email. The regex will catch common patterns (like "casino-" or "-viagra") while the external block list server will automatically update with names of specific sites being promoted through spam.

In the filter() function in includes/SpamBlacklist.php, approximately halfway between the file start and end, are the lines:

       # Do the match
       wfDebugLog( 'SpamBlacklist', "Checking text against " . count( $blacklists ) .
           " regexes: " . implode( ', ', $blacklists ) . "\n" );

Directly above this section (which does the actual regex test on the extracted links), one could add additional code to check the external RBL servers [2]:

        # Do RBL checks
        $rblServers = [ 'l1.apews.org.', 'multi.surbl.org.', 'multi.uribl.com.' ];
        foreach ( $addedLinks as $link ) {
              # parse_url() yields null or false when the link has no usable host
              $linkHost = parse_url( $link, PHP_URL_HOST );
              if ( $linkHost ) {
                   foreach ( $rblServers as $base ) {
                        # A listed domain resolves under the RBL zone; an unlisted one does not
                        $host = "$linkHost.$base";
                        $ipList = gethostbynamel( $host );
                        if ( $ipList ) {
                           wfDebug( "RBL match: Hostname $host is {$ipList[0]}, it's spam says $base!\n" );
                           # wfGetIP() and wfProfileOut() no longer exist in current MediaWiki
                           $ip = \RequestContext::getMain()->getRequest()->getIP();
                           wfDebugLog( 'SpamBlacklistHit', "$ip caught submitting spam: {$linkHost} per RBL {$base}\n" );
                           return $linkHost . ' (blacklisted by ' . $base . ')';
                        }
                   }
              }
        }

        # if no match found on RBL server, continue normally with regex tests...

This ensures that, if an edit contains URLs from already blocked spam domains, an error is returned to the user indicating which link cannot be saved due to its appearance on an external spam block list. If nothing is found, the remaining regex tests are allowed to run normally, so that any manually-specified 'suspicious pattern' in the URL may be identified and blocked.

Note that the RBL servers list just the base domain names - not the full URL path - so http://example.com/casino-viagra-lottery.html will trigger RBL only if "example.com" itself were blocked by name by the external server. The regex, however, would be able to block on any of the text in the URL and path, from "example" to "lottery" and everything in between. Both approaches carry some risk of false-positives - the regex because of the use of wildcard expressions, and the external RBL as these servers are often created for other purposes - such as control of abusive spam email - and may include domains which are not engaged in forum, wiki, blog or guestbook comment spam per se.

Other spam-fighting tools

There are various helpful manuals on mediawiki.org on combating spam and other vandalism:

Other anti-spam, anti-vandalism extensions include:

References