Admin tools development/Phalanx

From mediawiki.org

Thoughts and lessons learned from the Phalanx extension

Phalanx is an integrated anti-spam extension originally written for and by Wikia, but nowadays also used by ShoutWiki. It integrates a bunch of anti-spam extensions — BadWords, FilterWords, regexBlock, SpamBlacklist, spamRegex, TextRegex and TitleBlacklist — into one easy-to-use extension.

Pros:

  • easy to use
  • pretty effective
  • plenty of different filters
  • adding a new filter is relatively easy
  • ability to block something on a per-language basis (not sure how stable this is; we at ShoutWiki usually block everything for "all languages")

Cons:

  • unlike with AbuseFilter rules, not everyone can view Phalanx logs and rules (then again, what would be the point of an anti-spam extension if spammers could just easily see what's blocked?)
  • IP blocking interface is a bit unstable, thanks to the recent rewrite (in MediaWiki 1.18) of the Block class and the related interfaces

If we were to use Phalanx on Wikimedia sites...:

  • it would need to be updated (the ShoutWiki fork is at r25850, while SVN HEAD on Wikia's SVN right now is r57491)
  • the hacks specific to a certain wiki farm setup would need to be removed and replaced with something generic/more flexible
  • we'd need to set up a git repo for it and I would need to learn to use git ;)
  • shared DB stuff; Phalanx would need a new global or two (like how AbuseFilter or CentralAuth do things) to define the database where Phalanx DB tables will be stored (done)
    • this is assuming that we use it as a global solution. While it definitely makes sense, there are some important questions, too:
      1. who would be allowed to access Phalanx and change the rules? Stewards, probably, but I imagine that there'd be some complaints about turning the stewards into decision-makers instead of neutral observers...
      2. would this create complicated bureaucracy about the management? E.g. something spammed on enwiki is a legitimate phrase on plwiki, and Polish editors are upset that legitimate edits are being blocked.
      • solution: new user group (like how we have global editinterface, rollback, sysop, etc. groups)
  • for now (current codebase as of 20:59, 10 August 2012 (UTC)), there is the option to block by language, but I'm not sure how effective it'd be. For cases like this, having an option to "block this [phrase/e-mail address/user(name)/etc.] for all languages except pl" would be useful (HT Isarra). --Jack Phoenix (Contact) 20:59, 10 August 2012 (UTC)
    • elaborating a bit on the effectiveness: the Answers-specific QuestionTitleBlock module declares $wgLanguageCode as a global and then calls $blocksData = Phalanx::getFromFilter( Phalanx::TYPE_ANSWERS_QUESTION_TITLE, $wgLanguageCode ); If the second parameter is not passed to getFromFilter() (its default is null), Phalanx will treat it as 'all' and will get blocks affecting all languages. Given that the language selection menu is somewhat of a hack for Answer sites, I suspect that further work will be needed if we are to make that option useful. --Jack Phoenix (Contact) 09:45, 24 August 2012 (UTC)
  • moving BadImageList and whatnot to Phalanx should be entirely possible; it's a different, more political, question if that's wanted (on the other hand, transparency is important, but in anti-spam/anti-vandalism work, transparency can easily be used against you...)
    • well, BadImageList itself is a bit of a special case because it allows whitelisting on the same page. We could, however, move the actual image blacklist into Phalanx and create a new whitelist MediaWiki: page... --Jack Phoenix (Contact) 23:50, 9 August 2012 (UTC)
  • logging stuff should be converted to new-style calls (see Logging to Special:Log#1.19 and later) and a migration script should be written to convert pre-existing old-style log entries (which ShoutWiki and Wikia will have)
  • maintenance scripts etc. interacting with memcached should be investigated further to make sure they're doing everything correctly
    • IIRC there is some weird manual key building in the code
    • code should use wfForeignMemcKey( $wgPhalanxDatabase, $wgPhalanxDatabasePrefix, ... ); when appropriate
  • expiry dropdown menu sucks, make it not suck
    • the list of possible expiries is hard-coded (Phalanx::getExpireValues()), so it's not possible to add new options by editing the associated MediaWiki message (MediaWiki:Phalanx-expire-durations)
    • it is also not possible to block something for a custom duration (say, 7 hours)
    • should lift whatever we can from core SpecialBlock.php, because that form rocks
  • blocking of global accounts should detect CentralAuth's presence and adapt to it accordingly
    • this may (will) need changes to CentralAuth itself
    • again, we will want to keep Phalanx useful for all parties (WMF, ShoutWiki, Wikia, and smaller third parties); WMF has CentralAuth set up, most other sites probably don't; global blocking with CA is different than without CA (when CA isn't installed, sites typically use $wgSharedDB & $wgSharedTables and share the user table, but on WMF, we can't rely on that because the setup is different)
  • information such as what has been blocked, and by whom, must be publicly or semi-publicly available if Phalanx is to be used on Wikimedia sites
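
To make the language-scoping point above a bit more concrete, here is a minimal sketch of how a lookup with a null-means-'all' default and an "all languages except pl" option could behave. It is written in Python purely for illustration and is not Phalanx's PHP code: the Block class, the except_langs field and get_from_filter() are hypothetical names. It also assumes that a language-specific lookup includes the 'all' blocks, which the notes above do not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional

ALL_LANGUAGES = "all"  # assumed sentinel, mirroring the 'all' value mentioned above


@dataclass(frozen=True)
class Block:
    """One blocked phrase (hypothetical structure, for illustration only)."""
    text: str
    lang: str = ALL_LANGUAGES              # language the block applies to, or "all"
    except_langs: frozenset = frozenset()  # hypothetical: languages exempt from an "all" block


def get_from_filter(blocks: List[Block], lang: Optional[str] = None) -> List[Block]:
    """Return the blocks that apply to `lang`.

    A missing language (None) is treated as "all", so only blocks that
    affect all languages are returned in that case.
    """
    if lang is None or lang == ALL_LANGUAGES:
        return [b for b in blocks if b.lang == ALL_LANGUAGES]
    # Assumption: a language-specific lookup also picks up "all" blocks,
    # unless the block explicitly exempts that language.
    return [
        b for b in blocks
        if b.lang == lang
        or (b.lang == ALL_LANGUAGES and lang not in b.except_langs)
    ]


blocks = [
    Block("casino"),                                        # blocked everywhere
    Block("example-phrase", except_langs=frozenset({"pl"})),  # everywhere except pl
    Block("tylko-pl", lang="pl"),                           # blocked on pl only
]
pl_texts = [b.text for b in get_from_filter(blocks, "pl")]
en_texts = [b.text for b in get_from_filter(blocks, "en")]
default_texts = [b.text for b in get_from_filter(blocks)]
```

With an exemption list stored on the block itself, the enwiki/plwiki conflict described earlier would not need a separate block per language; whether that fits the real DB schema is a separate question.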

Lessons learned:

  • Phalanx is handy!
  • it's easy to use and regexes are surprisingly easy to learn (the basic stuff anyway)
  • but like everything else, it's not perfect; some spambots will always slip through, so it can't totally eliminate the human factor in anti-spam work
  • the statistics interface (Special:PhalanxStats/Filter-id-goes-here) allows seeing who triggered the filter when (and where) and thus a human can make the decision to block a spambot account even before it has successfully submitted any spam to any wiki
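
As a rough illustration of the last two points, the sketch below shows a regex filter that records who triggered it, when, and on which wiki, so a human can review the hits later. It is a Python sketch under stated assumptions, not how Phalanx or Special:PhalanxStats is implemented: the Filter class, its fields and the sample pattern are all made up for the example.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class Filter:
    """A single regex filter with a hit log (hypothetical structure)."""
    filter_id: int
    pattern: str
    hits: List[Tuple[str, str, datetime]] = field(default_factory=list)

    def check(self, text: str, user: str, wiki: str) -> bool:
        """Return True and record a hit if `text` matches the pattern."""
        if re.search(self.pattern, text, re.IGNORECASE):
            # Record who triggered the filter, where, and when (UTC).
            self.hits.append((user, wiki, datetime.now(timezone.utc)))
            return True
        return False


# Basic regex: "cheap", some whitespace, then either "pills" or "watches".
spam_filter = Filter(filter_id=1, pattern=r"cheap\s+(pills|watches)")

blocked = spam_filter.check("Buy CHEAP  pills now!!!", user="ExampleBot", wiki="enwiki")
allowed = spam_filter.check("A history of watchmaking", user="Alice", wiki="enwiki")
```

After these two calls, spam_filter.hits holds a single entry for ExampleBot on enwiki, which is the sort of record a human could use to block the account before any spam gets saved.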