Extension:BayesianFilter/GSoC 2013

Introduction
Spambots are currently overwhelming small wikis and becoming increasingly disruptive to large ones. Often a spammer will completely replace the legitimate content of a page with spam, and may add many different links with a range of URLs and keywords. Other spammers (such as spambots) will edit a large number of pages within a few minutes, even replacing good links with bad ones, for example by vandalizing article references.

Many separate anti-spam extensions exist. They mostly scan the text for blacklisted links and keywords, extract links and match them against blacklisted URLs, or block spambots with a CAPTCHA. I intend to create a Bayesian spam filter that performs token-based (word-based) Bayesian filtering, as is done for e-mail.

Project goals

 * Create a tool for wiki users to report spam, as a simple way to train the Bayesian database.
 * Understand the metadata (IP address, links, user).
 * Build a token-based Naive Bayes classifier. At this stage it will be an offline classifier, classifying data from wiki dump files.
 * Enhance spam discrimination using the CRM114 Discriminator.
 * After this, make it an 'online' classifier, integrated into the workflow of wiki users.
 * For MediaWiki sites with very high traffic, such as Wikipedia, develop a job scheduler so that concurrent edits go into a queue.
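
The token-based Naive Bayes goal above can be illustrated with a minimal sketch. This is not the extension's implementation (which is yet to be written); it assumes a simple regex tokenizer, two labels named "spam" and "ham", and add-one (Laplace) smoothing. The class name `NaiveBayesSpamFilter`, its methods, and the training texts are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical token-based Naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one reported edit under 'spam' or 'ham'."""
        self.token_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Return the estimated probability that the text is spam."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed log prior over the two classes.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            for token in tokenize(text):
                count = self.token_counts[label][token]
                # Smoothed per-token likelihood; the +1 in the denominator
                # reserves probability mass for tokens never seen in training.
                score += math.log((count + 1) / (total_tokens + len(vocab) + 1))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Illustrative training data (made up for this sketch).
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills online now", "spam")
spam_filter.train("cheap watches buy now", "spam")
spam_filter.train("the article cites a reliable reference", "ham")
spam_filter.train("added a reference to the history section", "ham")

spam_score = spam_filter.spam_probability("buy cheap pills")
ham_score = spam_filter.spam_probability("reference for the history article")
print(spam_score, ham_score)
```

In an offline setting, `train` would be fed edits extracted from wiki dump files together with user spam reports; an online version could call `spam_probability` on each incoming edit and compare the result with a configurable threshold.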

Implementation
Yet to be added.