Extension:BayesianFilter/GSoC 2013

Identity:
Name: Anubhav Agarwal

Email: anubhav914@gmail.com

Project Title: Bayesian Spam Filter

Contact/working info
Timezone: UTC + 5:30

Typical working hours: 10:00 PM - 4:00 AM

IRC or IM networks/handle(s): anubhav

Project summary
Wikis are a common target for spammers wishing to promote products or websites because of their open editing model. Often a spammer will completely replace the legitimate content of a page with spam, adding many different links with a range of URLs and keywords. Others, such as spambots, will edit a mass of pages within a few minutes, even replacing good links with bad ones and vandalizing article references.

MediaWiki already provides several extensions for combating spam. I had a look at the current spam-management extensions: they mostly scan text for blacklisted keywords, extract links and match them against blacklisted URLs, or block spambots with a captcha. I intend to create a Bayesian Spam Filter extension for combating wiki spam under the GSoC 2013 program for MediaWiki. It will be based on token (word) filtering using Bayesian techniques. I think it could be especially useful to the 99% of MediaWiki instances that have a much smaller number of daily edits and, in particular, very few editors and tools able or willing to deal with spam.
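To illustrate the token-based Bayesian idea, here is a minimal Python sketch (the extension itself would be PHP; all names here are illustrative, not the extension's actual API). Each token's spamminess is estimated from labelled corpus counts, and per-token probabilities are combined under a naive independence assumption:

```python
from math import prod

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token appears
    in the spam corpus vs. the ham corpus."""
    p_t_spam = spam_counts.get(token, 0) / n_spam
    p_t_ham = ham_counts.get(token, 0) / n_ham
    if p_t_spam + p_t_ham == 0:
        return 0.5  # unseen token carries no evidence either way
    return p_t_spam / (p_t_spam + p_t_ham)

def classify(tokens, spam_counts, ham_counts, n_spam, n_ham, threshold=0.9):
    """Combine per-token probabilities naively and compare the
    normalized spam score against a decision threshold."""
    probs = [token_spam_prob(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in tokens]
    spam_score = prod(probs)
    ham_score = prod(1 - p for p in probs)
    if spam_score + ham_score == 0:
        return False
    return spam_score / (spam_score + ham_score) > threshold
```

A real filter would work in log-space to avoid underflow on long documents; this sketch only shows the shape of the computation.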

For more information, refer to this discussion on the Wikitech-l mailing list.

Deliverables

 * Automated training of the Bayesian filter database from rollbacks and previous wiki data.
 * A traditional token-based offline Bayes classifier.
 * Integration with AbuseFilter to provide logging and tagging features.
 * A job queue that stores concurrent edits on a high-traffic site like Wikipedia and reverts the changes detected as spam.

Project Schedule
Until June 17 Community Bonding Period

Setting up my environment; understanding the source code and MediaWiki architecture; getting to know the mentors; fixing bugs.

June 17 - June 24 Gather a corpus of text labelled as spam or ham. This includes:
 * Gathering reverts from STiki. STiki labels edits as vandalism or innocent, so it will be easy to gather classified data.
 * Adding a hook to the rollback API so that a registerSpam function is called whenever a rollback is executed.
 * Gathering data from MediaWiki deleted pages.
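Mining past history for training labels could be sketched as follows (a Python illustration under stated assumptions: the real hook would fire inside MediaWiki's rollback path, and `is_rollback`'s edit-summary heuristic is only a stand-in for that signal):

```python
def is_rollback(comment):
    """Heuristic: rollback/undo edit summaries typically mention the action.
    Only an illustration for mining past revision history offline."""
    comment = comment.lower()
    return comment.startswith("reverted") or "undid revision" in comment

def label_revisions(history):
    """history: chronological list of (rev_id, parent_id, comment, text).
    A revision whose successor is a rollback is labelled spam; a revision
    that survived is labelled ham."""
    labels = {}
    for prev, curr in zip(history, history[1:]):
        if is_rollback(curr[2]):
            labels[prev[0]] = "spam"
        else:
            labels.setdefault(prev[0], "ham")
    return labels
```

The same labelling pass could be run over STiki's vandalism/innocent judgements, which already provide the class directly.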

June 24th - June 27th

Preparing a basic skeleton for the extension.
 * Register a hook in the MediaWiki codebase to call the extension whenever an edit is saved.
 * Call the filter with appropriate arguments.
 * Add a response handler that reports on the page that spam was detected.

June 27th - July 4th Implementing a parser class.
 * Read the text and tokenize it, with whitespace, commas and periods as separators.
 * Learn Porter stemming. Apply it to words after stripping punctuation marks.
 * Calculate the following attributes:
 * Total number of characters (C)
 * Ratio of alphabetic characters
 * Ratio of digits
 * Ratio of whitespace characters
 * Frequency of special characters (*, _, +, =, %, $, @, \, /)
 * Total number of words (M)
 * List of links
 * Total number of links / total number of words
 * Total number of short words (two letters or fewer) / M
 * Ratio of capital letters
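The attribute computation above can be sketched in Python (feature names and the link regex are illustrative choices, not the parser class's final design):

```python
import re

SPECIAL = set("*_+=%$@\\/")  # the special characters listed above

def extract_features(text):
    """Compute the surface attributes listed above for one revision's text."""
    chars = len(text)
    words = text.split()
    links = re.findall(r"https?://\S+", text)
    n_chars = chars or 1    # avoid division by zero on empty text
    n_words = len(words) or 1
    return {
        "total_chars": chars,                                        # C
        "alpha_ratio": sum(c.isalpha() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "space_ratio": sum(c.isspace() for c in text) / n_chars,
        "special_freq": sum(c in SPECIAL for c in text),
        "total_words": len(words),                                   # M
        "links": links,
        "link_word_ratio": len(links) / n_words,
        "short_word_ratio": sum(len(w) <= 2 for w in words) / n_words,
        "caps_ratio": sum(c.isupper() for c in text) / n_chars,
    }
```

Porter stemming would be applied to the tokens before they feed the token tables; it is omitted here since it does not affect these surface ratios.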

July 4th - July 11th Adding two tables to the database.
 * The first will contain the link to the text file, the attributes mentioned above, and the class label (spam or ham).
 * The second will contain unique words and links, their count in spam, and their count in ham.
 * As links are the main reason a post is spam, and most spammers post links to their websites on the wiki, links will be weighted more heavily as spam/ham evidence.
 * Integrating the database with the workflow.
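The two tables could look roughly like this (a SQLite sketch in Python; table and column names are illustrative, and the production schema would go through MediaWiki's own DB layer and patch process):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bsf_document (
    doc_id      INTEGER PRIMARY KEY,
    text_path   TEXT NOT NULL,        -- link to the stored revision text
    total_chars INTEGER,
    total_words INTEGER,
    link_ratio  REAL,
    label       TEXT CHECK (label IN ('spam', 'ham'))
);

CREATE TABLE bsf_token (
    token      TEXT PRIMARY KEY,      -- unique word or link
    is_link    INTEGER DEFAULT 0,     -- links weigh more heavily as evidence
    spam_count INTEGER DEFAULT 0,
    ham_count  INTEGER DEFAULT 0
);
""")
conn.execute(
    "INSERT INTO bsf_token (token, is_link, spam_count) VALUES (?, 1, 3)",
    ("http://spam.example",))
row = conn.execute(
    "SELECT spam_count FROM bsf_token WHERE token = ?",
    ("http://spam.example",)).fetchone()
```

Keeping links in the same token table with an `is_link` flag lets the classifier apply a heavier weight to link tokens without a separate join.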

July 11th - July 22nd Implement the filter functionality.
 * Divide the data into 80% training data and 20% cross-validation data.
 * Implement the DetectSpam functionality, which takes into account the attributes calculated above.
 * Train the database using the training data.
 * Check the efficiency of the spam classifier by testing it on the cross-validation data.
 * Choose a Laplacian smoothing parameter by re-iterating the above steps until accuracy converges to a maximum.
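The training and smoothing steps above can be sketched as follows (illustrative Python; the smoothing parameter alpha would be the one tuned against the 20% cross-validation split):

```python
from collections import Counter

def train(docs):
    """docs: list of (tokens, label) pairs from the 80% training split.
    Returns per-class token counts and per-class total token counts."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = {"spam": 0, "ham": 0}
    for tokens, label in docs:
        counts[label].update(tokens)
        totals[label] += len(tokens)
    return counts, totals

def smoothed_prob(token, label, counts, totals, vocab_size, alpha):
    """Laplace-smoothed P(token | label): adding alpha to every count
    keeps unseen tokens from zeroing out the whole product."""
    return (counts[label][token] + alpha) / (totals[label] + alpha * vocab_size)
```

Candidate values of alpha would be scored by classification accuracy on the held-out 20%, keeping the value at which accuracy peaks.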

July 22nd - July 29th Code review; extra period for suggested changes and unforeseen delays.

July 29th- August 2nd Mid-term Evaluation Period

August 3rd - August 10th

Writing unit tests. Testing the whole workflow. Fixing bugs.

August 10th - August 17th Submit the DB patches. Code review by mentors. Roll out the feature to some wiki users. Fixing bugs.

August 17th - August 24th
 * Integrating the spam filter extension with AbuseFilter.
 * Creating log pages of spam reports.
 * Learning about query optimization: indexing the tables and optimizing queries to improve speed.

August 24th - September 3rd Implement a job queue.
 * Implement a multi-server, single-queue model. All concurrent edits go into the queue.
 * A fixed number of processing threads pull out edits and call the DetectSpam utility.
 * Implement locking techniques.
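The worker-pool model above can be sketched in Python (`detect_spam` is a stand-in for the real classifier; the production queue would use MediaWiki's job queue infrastructure rather than in-process threads):

```python
import queue
import threading

def detect_spam(text):
    """Stand-in for the real classifier: flags edits containing a bad link."""
    return "http://spam.example" in text

def worker(edits, flagged):
    """Each worker pulls edits off the shared queue; queue.Queue's internal
    lock provides the mutual exclusion between consumers."""
    while True:
        edit = edits.get()
        if edit is None:           # sentinel: shut this worker down
            edits.task_done()
            break
        page, text = edit
        if detect_spam(text):
            flagged.append(page)   # the extension would revert/tag the edit
        edits.task_done()

edits = queue.Queue()
flagged = []
threads = [threading.Thread(target=worker, args=(edits, flagged))
           for _ in range(4)]
for t in threads:
    t.start()
for edit in [("PageA", "a good edit"), ("PageB", "buy at http://spam.example")]:
    edits.put(edit)
for _ in threads:
    edits.put(None)                # one sentinel per worker
edits.join()                       # block until every edit is processed
```

A fixed worker count bounds the classifier load even when a burst of concurrent edits arrives, which is the point of queueing rather than classifying inline.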

September 3rd to September 10th One week of buffer for suggested changes, bug fixes and unforeseen delays.

September 10th to September 17th

Writing unit tests. Testing the entire workflow.

September 17th - Soft Pencils Down date

September 17th - September 23rd Improve documentation.

September 23rd - Hard Pencils Down.

Motivations
Wiki pages are a great tool for knowledge management and documentation. Wikimedia's community of organizations and users is far larger than that of any comparable platform. This means many people use it, its pages rank far higher in Google search results than other sites, and hence many spammers and spambots hammer those sites. This makes an anti-spam extension one of the most important pieces of functionality for MediaWiki. My main motive for choosing MediaWiki is that it is so widely used, by almost every organization, perhaps the very one where I might work in the future. There is no greater joy than seeing your code run smoothly and help a million users.

I chose the spam filter because I am keenly interested in machine learning and data mining. I have theoretical knowledge of both, which I now want to put to the test. I feel Summer of Code is a very good first step toward contributing to the open-source community, and this project is a way to fulfil all of those aims.

Open source experience
This is the first time I will be contributing to an open-source community. I have fixed a small bug in the SpamBlacklist extension. Within a week I wish to fix:

A bug about logging SpamBlacklist hits, to learn how to create special pages and integrate the database with the code.

A bug in the AbuseFilter extension, particularly this one, to learn how AbuseFilter works and how I can incorporate its metadata into vandalism detection.

Participation
I keep a steady pace and check my work as I go along. I plan to keep my mentors updated on progress via IRC. I will create a GitHub repository for the project, where they can see the code as it develops. At the end of each milestone listed in the deliverables I will push the work to Gerrit for review. I consider IRC, the wikitech-l mailing list and Stack Overflow the best resources for answering my queries. I am very dedicated in whatever I do. I would like to continue on this project even after GSoC and work to push the code into a stable release as soon as possible. I plan to actively maintain my code and fix bugs even after the GSoC period is over.

About Me
I am a 4th-year student pursuing a B.Tech in Computer Science at IIT Roorkee. I am proficient in web development, PHP and Python. My previous work includes:

Questionnaire, a tool on the model of SurveyMonkey for creating surveys. Here is its UI.

A Facebook app based on the idea of Orkut's crush list. You can check out the code here.

A social CRM tool for Mygola, a 500 Startups-funded startup where I worked as an intern.
