Extension:BayesianFilter/GSoC 2013

Identity:
Name: Anubhav Agarwal

Email: anubhav914@gmail.com

Project Title: Bayesian Spam Filter

Contact/working info
Timezone: UTC + 5:30

Typical working hours: 10:00 PM - 4:00 AM

IRC or IM networks/handle(s): anubhav

Project summary
Wikis are a common target for spammers wishing to promote products or web sites, because anyone can edit them. Often a spammer will completely replace the legitimate content of a page with spam, and may add many different links with a range of URLs and keywords. Other spammers (such as spambots) edit a mass of pages in a few minutes, even replacing good links with bad ones, for example by vandalizing article references.

MediaWiki already provides several extensions for combating spam. I had a look at the current spam management extensions. They mostly scan the text for blacklisted links and keywords, extract links and match them against blacklisted URLs, or try to stop spambots with CAPTCHAs. I intend to create a Bayesian spam filter extension for combating wiki spam under the GSoC 2013 program for MediaWiki. It will be based on token (word) filtering using Bayesian techniques. I think it could be useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits and, in particular, very few editors and tools able and willing to deal with spam.

For more information refer to this discussion on the Wikitech-l mailing list

Model


In words, the MediaWiki extension makes a curl call to a Python daemon process. The job goes into an edit queue, which will be implemented using Advanced Python Scheduler. A number of pre-forked filter threads (for example, 10) will serve as consumers of the edit queue. Each filter thread calls the main classifying function and returns the response to the MediaWiki extension. The extension will be hooked into the AbuseFilter extension to log spam edits, much like the recent changes log. The response handler of the extension also displays a warning to the user and reverts the page to the previous version.
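The consumer side of this model can be sketched as follows. This is a minimal illustration, not the planned implementation: it uses the standard library's `queue.Queue` and `threading` in place of Advanced Python Scheduler, and `classify` and `run_filter_pool` are hypothetical names with a placeholder rule standing in for the real classifier.

```python
import queue
import threading


def classify(text):
    """Placeholder classifier (hypothetical): flags any edit containing a link."""
    return "spam" if "http://" in text else "ham"


def run_filter_pool(edits, num_workers=4):
    """Feed (edit_id, text) pairs through one shared queue and a fixed thread pool."""
    edit_queue = queue.Queue()
    result_queue = queue.Queue()

    def worker():
        while True:
            item = edit_queue.get()
            if item is None:              # sentinel: this worker should stop
                edit_queue.task_done()
                break
            edit_id, text = item
            result_queue.put((edit_id, classify(text)))
            edit_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for edit in edits:
        edit_queue.put(edit)
    for _ in threads:
        edit_queue.put(None)              # one sentinel per worker
    edit_queue.join()
    for thread in threads:
        thread.join()

    # All workers have exited, so the result queue can be drained safely.
    results = {}
    while not result_queue.empty():
        edit_id, verdict = result_queue.get()
        results[edit_id] = verdict
    return results
```

Calling `run_filter_pool([(1, "hello world"), (2, "buy http://x.example now")])` returns one verdict per edit ID. In the real daemon the verdict would be sent back to the extension instead of being collected in a dictionary.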

Drawbacks in Current Spam Extensions
As previously stated, MediaWiki has various anti-spam extensions, but they suffer from serious drawbacks.

A Bayesian Filter would overcome all these drawbacks.
 * Many extensions, such as Asirra, KeyCaptcha and QuestyCaptcha, rely on CAPTCHA techniques. CAPTCHAs have been broken for some time. Not only do they create a hassle for genuine users, they also let many spambots and malicious users slip through.
 * Another technique used by extensions is blacklisting words and IP addresses. This is dangerous in the sense that a word like "sex", though listed in blacklists, does not necessarily indicate spam when it appears inside other words such as "Sensex" or "sexism". A Bayesian classifier only assigns a high probability to that word, not to the whole text. A Bayesian filter can also learn alternative spellings of the word, such as "s-e-x".
 * Blacklisting IP addresses is a problem too, because many computers often share the same IP address when they are behind a proxy. Also, spambots usually work around this check by changing their IP addresses or by using Tor.
 * SpamBlacklist is a very useful extension that blocks all edits containing bad links. However, administrators need to create a blacklist and maintain it constantly. There is no learning mechanism involved.

Deliverables

 * Automated training of the Bayesian filter database from rollbacks and previous wiki data.
 * A traditional token-based offline Bayes classifier.
 * Integrating the extension with AbuseFilter to provide logging and tagging features.
 * Adding a job queue that stores the concurrent edits on a high-traffic site like Wikipedia and reverts the changes that were detected as spam.

Project Schedule
Until June 17 Community Bonding Period

Setting up my environment. Understanding the source code and the MediaWiki architecture. Getting to know the mentors and fixing bugs.

June 17 - June 24 Gather a corpus of text labelled as spam or ham. This includes:
 * Gathering reverts from STiki. STiki labels documents as vandalism or innocent, so it will be easy to gather the classified data.
 * Adding a hook in the rollback API, so that a function registerSpam is called whenever a rollback is executed.
 * Gathering data from MediaWiki deleted pages.

June 24th - June 27th

Preparing a basic skeleton for the Extension.
 * Register a hook in the MediaWiki codebase to call the extension whenever an edit is saved.
 * Call the filter bot with the appropriate arguments.
 * Add a response handler that reports on the page when spam has been detected.

June 27th - July 4th Implementing a Parser Class.
 * Learn NLTK for implementing many parsing functions.
 * Read text from a file and tokenize it, using whitespace, commas and periods as separators.
 * Learn Porter stemming. Apply Porter stemming to words after stripping punctuation marks.
 * Calculate the following attributes:
 ** Total number of characters (C)
 ** Ratio of alphabetic characters
 ** Ratio of digits
 ** Ratio of whitespace characters
 ** Frequency of special characters (10 chars: *, _ ,+,=,%,$,@,ـ, \,/ )
 ** Total number of words (M)
 ** List of links
 ** Total number of links / total number of words
 ** Number of short words (two letters or fewer) / M
 ** Ratio of capital letters
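The attributes above could be computed along the following lines. This is only a sketch under stated assumptions: `extract_features` and the dictionary keys are hypothetical names, character ratios are taken relative to the total character count C, only the ASCII subset of the listed special characters is counted, links are recognized by a simple `http(s)://` pattern, and Porter stemming is left out so that the example needs nothing beyond the standard library.

```python
import re

# ASCII subset of the special characters listed above (an assumption of this sketch).
SPECIAL_CHARS = set("*_+=%$@\\/")
LINK_RE = re.compile(r"https?://\S+")
# Separators from the plan: whitespace, commas and periods.
SEPARATOR_RE = re.compile(r"[\s,.]+")


def _ratio(count, total):
    """Return count / total, or 0.0 when total is zero."""
    return count / total if total else 0.0


def extract_features(text):
    """Compute the per-edit attributes as a dictionary."""
    total_chars = len(text)
    # Drop trailing punctuation that the greedy link pattern picks up.
    links = [link.rstrip(".,;") for link in LINK_RE.findall(text)]
    # Remove links before tokenizing so that URLs are not split on periods.
    words = [w for w in SEPARATOR_RE.split(LINK_RE.sub(" ", text)) if w]
    total_words = len(words)
    return {
        "total_chars": total_chars,
        "alpha_ratio": _ratio(sum(c.isalpha() for c in text), total_chars),
        "digit_ratio": _ratio(sum(c.isdigit() for c in text), total_chars),
        "whitespace_ratio": _ratio(sum(c.isspace() for c in text), total_chars),
        "special_char_ratio": _ratio(sum(c in SPECIAL_CHARS for c in text), total_chars),
        "total_words": total_words,
        "links": links,
        "link_ratio": _ratio(len(links), total_words),
        "short_word_ratio": _ratio(sum(len(w) <= 2 for w in words), total_words),
        "capital_ratio": _ratio(sum(c.isupper() for c in text), total_chars),
    }
```

For example, `extract_features("Buy NOW at http://spam.example/x, ok 42")` finds one link and five words, three of which are short.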

July 4th - July 11th Adding two tables in the database.
 * The first will contain the link to the text file, the attributes mentioned above and the class label (spam or ham).
 * The second will contain unique words and links, their count in spam and their count in ham.
 * As links are the main reason an edit is spam, and most spammers post links to their own websites on MediaWiki sites, links will be weighted more heavily than ordinary words when deciding between spam and ham.
 * Integrating the database with the workflow.
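A possible shape for the two tables is sketched below. It uses SQLite from the Python standard library purely as a stand-in for the wiki's database; the table names `bf_document` and `bf_token`, the column names, and the helper functions are all hypothetical, and only two of the attributes are shown as columns.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS bf_document (
    doc_id      INTEGER PRIMARY KEY,
    text_path   TEXT NOT NULL,           -- link to the stored text file
    total_chars INTEGER,                 -- one column per attribute in practice
    link_ratio  REAL,
    label       TEXT NOT NULL CHECK (label IN ('spam', 'ham'))
);
CREATE TABLE IF NOT EXISTS bf_token (
    token      TEXT PRIMARY KEY,         -- a unique word or link
    is_link    INTEGER NOT NULL DEFAULT 0,
    spam_count INTEGER NOT NULL DEFAULT 0,
    ham_count  INTEGER NOT NULL DEFAULT 0
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_tokens(conn, tokens, label):
    """Increment the spam or ham count of every token in one labelled edit."""
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")
    column = "spam_count" if label == "spam" else "ham_count"
    for token in tokens:
        conn.execute("INSERT OR IGNORE INTO bf_token (token) VALUES (?)", (token,))
        # The column name comes from the fixed pair above, never from user input.
        conn.execute(
            f"UPDATE bf_token SET {column} = {column} + 1 WHERE token = ?", (token,)
        )
    conn.commit()
```

Keeping per-token spam and ham counts in one row means the classifier needs a single lookup per token at filtering time.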

July 11th - July 22nd Implement the Filter functionality.
 * Divide the labelled data into 80% training data and 20% cross-validation data.
 * Implement the DetectSpam functionality, which takes into account the attributes calculated above.
 * Train the database using the training data.
 * Check the effectiveness of the spam classifier by testing it on the cross-validation data.
 * Choose a Laplacian smoothing parameter by repeating the above steps until performance on the cross-validation data stops improving.
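The training and validation loop above might look roughly like this. It is a sketch of a plain token-count naive Bayes classifier with additive (Laplace) smoothing, not the final DetectSpam function: it ignores the extra attributes, and the function names, the candidate values for the smoothing parameter and the use of accuracy as the selection criterion are assumptions.

```python
import math
import random
from collections import Counter


def split_corpus(docs, train_fraction=0.8, seed=0):
    """Shuffle the labelled documents and split them into training and held-out sets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(docs):
    """Count tokens per class; docs is a list of (tokens, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    doc_totals = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        doc_totals[label] += 1
    return counts, doc_totals


def predict(model, tokens, alpha=1.0):
    """Return the more likely label; alpha > 0 is the Laplace smoothing parameter."""
    counts, doc_totals = model
    vocab_size = len(set(counts["spam"]) | set(counts["ham"]))
    n_docs = sum(doc_totals.values())
    scores = {}
    for label in ("spam", "ham"):
        total_tokens = sum(counts[label].values())
        # Smoothed class prior, then one smoothed likelihood term per token.
        score = math.log((doc_totals[label] + alpha) / (n_docs + 2 * alpha))
        for token in tokens:
            # The "+ 1" reserves probability mass for tokens never seen in training.
            score += math.log(
                (counts[label][token] + alpha) / (total_tokens + alpha * (vocab_size + 1))
            )
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, docs, alpha=1.0):
    """Fraction of held-out documents that are labelled correctly."""
    if not docs:
        return 0.0
    return sum(predict(model, tokens, alpha) == label for tokens, label in docs) / len(docs)


def choose_alpha(model, held_out, candidates=(0.01, 0.1, 0.5, 1.0, 2.0)):
    """Pick the candidate smoothing value with the best held-out accuracy."""
    return max(candidates, key=lambda a: accuracy(model, held_out, a))
```

A typical run would call `split_corpus` on the gathered corpus, `train` on the 80% part, and `choose_alpha` on the remaining 20%.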

July 22nd - July 29th Code review; extra period for suggested changes and unforeseen delays.

July 29th- August 2nd Mid-term Evaluation Period

August 3rd - August 10th

Writing unit tests. Testing the whole workflow. Fixing bugs.

August 10th - August 17th Submit the DB patches. Code Review by mentors. Roll out the feature for some wiki users. Fixing bugs.

August 17th- August 24th
 * Integrating the Spam Filter Extension with Abuse Filter.
 * Create Log pages of spam reports
 * Learn about Query Optimization. Indexing the tables, and optimizing query to improve the speed.

August 24th - September 3rd Implement a Job Queue.
 * Implementing a multi-server, single-queue model. All the concurrent edits go into the queue.
 * A fixed number of processing threads pull out the edits, call the DetectSpam utility.
 * Implement Locking techniques.
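One reason locking is needed is that several filter threads may update the same token counts at once. The sketch below shows the idea with an in-memory counter guarded by a `threading.Lock`; the class name `TokenCounts` and its methods are hypothetical, and a real deployment would more likely rely on database transactions for the same purpose.

```python
import threading
from collections import Counter


class TokenCounts:
    """Spam and ham token counts that can be updated safely from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._spam = Counter()
        self._ham = Counter()

    def record(self, tokens, is_spam):
        """Add the tokens of one classified edit to the matching counter."""
        target = self._spam if is_spam else self._ham
        with self._lock:                  # only one thread updates at a time
            target.update(tokens)

    def snapshot(self, token):
        """Return (spam_count, ham_count) for a token as one consistent pair."""
        with self._lock:
            return self._spam[token], self._ham[token]
```

Holding the lock in `snapshot` as well means a reader never sees the spam count from before an update paired with the ham count from after it.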

September 3rd to September 10th One week of buffer for suggested changes, bug fixes and unforeseen delays.

September 10th to September 17th

Writing unit tests. Testing the entire workflow.

September 17th - Soft Pencil Down date

September 17th - September 23rd Improve documentation.

September 23rd - Hard Pencils Down.

Motivations
Wiki pages are a great tool for knowledge management and improving documentation. Wikimedia has far more users and participating organizations than any comparable project. This means that a lot of people use it, that its pages rank higher in Google search results than other sites, and hence that a lot of spammers and spambots hammer its sites. This makes an anti-spam extension for MediaWiki a very important piece of functionality. My main motive for choosing MediaWiki is that it is the most widely used application of its kind, deployed in almost every organization, perhaps including the one where I might work in the future. There is no greater joy than seeing your code running smoothly and helping a million users.

I chose the spam filter because I am keenly interested in machine learning and data mining. I have theoretical knowledge of these fields, which I now want to put to the test. I feel Summer of Code is a very good first step towards contributing to the open source community. I see this project as a way to fulfil all of these aims.

Open source experience
This is the first time I will be contributing to an open source community. I have fixed a small bug in the SpamBlacklist extension. Within a week I wish to fix:

The bug about logging SpamBlacklist hits, to learn how to create special pages and integrate the database with the code.

A bug in the AbuseFilter extension, particularly this one, to learn how AbuseFilter works and how I can incorporate its metadata in vandalism detection.

Participation
I keep a steady pace and check my work as I go along. I plan to keep my mentors updated about my progress via IRC. I will create a GitHub repository for the project, where they can follow the code as it develops. At the end of each milestone listed in the deliverables I will push the code to Gerrit for review. I consider IRC, the wikitech-l mailing list and Stack Overflow to be the best resources for answering my queries. I am very dedicated in whatever I do. I would like to continue on this project even after GSoC, and to work on getting the code into a stable release as soon as possible. I plan to actively maintain my code and fix bugs even after my GSoC period is over.

About Me
I am a 4th-year student pursuing a B.Tech in Computer Science at IIT Roorkee. I am proficient in web development, PHP and Python. My previous work includes:

Questionnaire, a tool modelled on SurveyMonkey for creating surveys. Here is the UI of it.

A Facebook app based on the idea of the Orkut Crush List. You can check out the code here.

A social CRM tool for Mygola, a startup funded by 500 Startups, where I worked as an intern.

I am very dedicated in whatever I do. I am keenly interested in machine learning and data mining, which is why I chose this project. I would like to continue on this project even after GSoC, and to work on getting the code into a stable release as soon as possible. I plan to actively maintain my code and fix bugs even after my GSoC period is over.

Proof of Concept
Bayesian spam filtering is based on Bayes' theorem. Suppose the suspected text contains the word "discount". Such a message is likely to be spam, for example an offer to sell counterfeit copies of well-known watch brands. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.

The formula used by the software to compute this probability is derived from Bayes' theorem:


 * $$\Pr(S|W) = \frac{\Pr(W|S) \cdot \Pr(S)}{\Pr(W|S) \cdot \Pr(S) + \Pr(W|H) \cdot \Pr(H)}$$

where:


 * $$\Pr(S|W)$$ is the probability that a message is a spam, knowing that the word "discount" is in it;
 * $$\Pr(S)$$ is the overall probability that any given message is spam;
 * $$\Pr(W|S)$$ is the probability that the word "discount" appears in spam messages;
 * $$\Pr(H)$$ is the overall probability that any given message is not spam (is "ham");
 * $$\Pr(W|H)$$ is the probability that the word "discount" appears in ham messages.
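As a worked illustration of the formula, the following sketch evaluates it for a single word. The function name and the example numbers (the word appearing in 40% of spam and 5% of ham, with equal priors) are made up for illustration and are not measurements.

```python
def spam_probability_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Evaluate Pr(S|W) from Pr(W|S), Pr(W|H) and the prior Pr(S)."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    if denominator == 0:
        # The word was never seen in either class: fall back to the prior.
        return p_spam
    return numerator / denominator


# Illustrative numbers only: "discount" seen in 40% of spam and 5% of ham.
p = spam_probability_given_word(0.40, 0.05, p_spam=0.5)
print(round(p, 3))  # 0.889
```

With these numbers the word alone moves the estimate from the prior of 0.5 to about 0.89.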

The Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. That is wrong in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun. With that assumption, one can derive another formula from Bayes' theorem:


 * $$p = \frac{p_s^{1-N} \, p_1 p_2 \cdots p_N}{p_s^{1-N} \, p_1 p_2 \cdots p_N + (1 - p_s)^{1-N} (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where:
 * $$p$$ is the probability that the suspect message is spam;
 * $$p_s$$ is the overall probability that any message is spam (for $$p_s = 1/2$$ the factors involving $$p_s$$ cancel);
 * $$p_1$$ is the probability $$\Pr(S|W_1)$$ that it is a spam knowing it contains a first word (for example "replica");
 * $$p_2$$ is the probability $$\Pr(S|W_2)$$ that it is a spam knowing it contains a second word (for example "watches");
 * etc...
 * $$p_N$$ is the probability $$\Pr(S|W_N)$$ that it is a spam knowing it contains an Nth word (for example "home").

The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.
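A sketch of the combination and threshold step follows. It assumes the standard naive Bayes combination in which an explicit prior enters as a factor raised to the power 1 - N (so that it cancels when the prior is 0.5), works in log space to avoid underflow on long edits, and clamps the per-word probabilities away from 0 and 1. The function names, the clamp value and the 0.9 threshold are illustrative assumptions.

```python
import math


def combine(word_probs, p_spam=0.5, eps=1e-6):
    """Combine per-word probabilities Pr(S|W_i) into one spam probability."""
    n = len(word_probs)
    # Prior terms: p_s^(1-N) for spam and (1-p_s)^(1-N) for ham, as logarithms.
    log_spam = (1 - n) * math.log(p_spam)
    log_ham = (1 - n) * math.log(1.0 - p_spam)
    for p in word_probs:
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:                        # exp() would overflow; result is ~0
        return 0.0
    # e^ls / (e^ls + e^lh) rewritten as 1 / (1 + e^(lh - ls)).
    return 1.0 / (1.0 + math.exp(diff))


def is_spam(word_probs, threshold=0.9, p_spam=0.5):
    """Apply the threshold rule: at or above the threshold means likely spam."""
    return combine(word_probs, p_spam) >= threshold
```

For instance, two words with probabilities 0.9 and 0.8 combine to roughly 0.97 under equal priors, which would be classified as spam at a threshold of 0.9.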