User:Anubhav iitr

Identity:
Name: Anubhav Agarwal

Email: anubhav914@gmail.com

Project Title: Bayesian Spam Filter

Contact/working info
Timezone: UTC + 5:30

Typical working hours: 10:00 PM - 4:00 AM

IRC or IM networks/handle(s): anubhav

Project summary
Wikis are a common target for spammers wishing to promote products or websites because of their open editing model. Often a spammer will completely replace the legitimate content of a page with spam, and may add many different links with a range of URLs and keywords. Other spammers, such as spambots, will edit a large number of pages within a few minutes, even replacing good links with bad ones and vandalizing article references.

MediaWiki already provides several extensions for combating spam. I had a look at the current spam management extensions. They mostly scan the text for blacklisted links and keywords, extract links and match them against blacklisted URLs, or prevent spambots by using CAPTCHAs. I intend to create a Bayesian spam filter extension for combating wiki spam under the GSoC 2013 program for MediaWiki. It will be an offline and online spam classifier based on token (word) filtering using Bayesian techniques. I think it could be especially useful to the 99% of MediaWiki instances that have a much smaller number of daily edits, and in particular very few editors and tools able or willing to deal with spam.

For more information, refer to this discussion on the wikitech-l mailing list.

Deliverables

 * Automated training of the Bayesian filter database from rollbacks and previous wiki data.
 * A traditional token-based offline Bayes classifier.
 * Making it an 'online' classifier, i.e. integrating it into the editing workflow of the wiki so that spambots can be stopped immediately.
 * A job queue that stores the concurrent edits on a high-traffic site like Wikipedia and reverts the changes that were detected as spam.

Project Schedule
June 17 - June 24 Gather a corpus of text labeled as spam or ham (a collection sketch follows this list). This includes:
 * Gathering reverts from STiki. STiki labels edits as vandalism or innocent, so it will be easy to gather classified data.
 * Adding a hook to the rollback API so that a function registerSpam is called whenever a rollback is executed.
 * Gathering data from MediaWiki deleted pages.
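The exact collection pipeline is still open; a minimal sketch of the idea, assuming revision IDs and labels come from an exported STiki/rollback list (the hypothetical stiki_labels.txt file and output layout below are placeholders; only the MediaWiki query API is standard), could look like this:

<syntaxhighlight lang="python">
# Minimal corpus-collection sketch. The label file format, file names and
# output layout are assumptions; only the MediaWiki query API is real.
import json
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # any MediaWiki api.php endpoint

def fetch_revision_text(rev_id):
    """Fetch the raw wikitext of a single revision through the API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

def build_corpus(label_file, out_file):
    """Write one JSON object per line: {"label": "spam"|"ham", "text": ...}."""
    with open(label_file) as labels, open(out_file, "w") as out:
        for line in labels:
            rev_id, label = line.split()          # e.g. "553412345 spam"
            text = fetch_revision_text(rev_id)
            out.write(json.dumps({"label": label, "text": text}) + "\n")

if __name__ == "__main__":
    build_corpus("stiki_labels.txt", "corpus.jsonl")
</syntaxhighlight>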

June 24th - June 30th Preparing a basic skeleton for the extension.
 * Register a hook in the MediaWiki codebase to call the extension whenever an edit is saved.
 * Call the filter bot with the appropriate arguments.
 * Add a response handler that reports on the page when spam is detected.

June 24 - July 4th Implementing a parser class (a rough sketch follows this list).
 * Read text from a file and tokenize it, using whitespace, commas and periods as separators.
 * Learn about Porter stemming. Implement Porter stemming on the words after stripping punctuation marks.
 * Calculate the following attributes:
 * Total number of characters (C)
 * Ratio of alphabetic characters
 * Ratio of digits
 * Ratio of whitespace characters
 * Frequency of special characters (10 chars: *, _, +, =, %, $, @, ـ, \, /)
 * Total number of words (M)
 * Total number of links / total number of words
 * Ratio of short words (two letters or less) to M
 * Ratio of capital letters
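A rough sketch of the parser class, assuming NLTK's PorterStemmer is available for the stemming step; the class name, tokenization regex and attribute keys below are my own placeholders, only the attribute list itself comes from the schedule above:

<syntaxhighlight lang="python">
# Parser sketch: tokenize, stem, and compute the attributes listed above.
import re
import string
from nltk.stem.porter import PorterStemmer   # assumed dependency

SPECIAL_CHARS = set("*_+=%$@\\/")            # from the special-character list above

class EditParser:
    def __init__(self, text):
        self.text = text

    def tokens(self):
        """Split on whitespace, commas and periods; strip punctuation; stem."""
        stemmer = PorterStemmer()
        raw = re.split(r"[\s,.]+", self.text)
        words = [w.strip(string.punctuation) for w in raw]
        return [stemmer.stem(w.lower()) for w in words if w]

    def attributes(self):
        text = self.text
        c = len(text) or 1                           # total characters (C)
        words = [w for w in re.split(r"[\s,.]+", text) if w]
        m = len(words) or 1                          # total words (M)
        links = len(re.findall(r"https?://|\[", text))  # crude URL / wiki-link count
        return {
            "total_chars": len(text),
            "alpha_ratio": sum(ch.isalpha() for ch in text) / c,
            "digit_ratio": sum(ch.isdigit() for ch in text) / c,
            "space_ratio": sum(ch.isspace() for ch in text) / c,
            "special_freq": sum(ch in SPECIAL_CHARS for ch in text) / c,
            "total_words": len(words),
            "link_word_ratio": links / m,
            "short_word_ratio": sum(len(w) <= 2 for w in words) / m,
            "capital_ratio": sum(ch.isupper() for ch in text) / c,
        }
</syntaxhighlight>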

July 4th - July 11th Adding two tables to the database (a schema sketch follows this list).
 * The first will contain a link to the text file, the attributes mentioned above, and the class label (spam or ham).
 * The second will contain unique words, their count in spam, and their count in ham.
 * Integrating the database with the workflow.
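A minimal schema sketch of these two tables, using SQLite purely for illustration; the real extension would go through MediaWiki's database layer, and the table and column names are assumptions:

<syntaxhighlight lang="python">
# Schema sketch for the two tables described above (SQLite for illustration).
import sqlite3

conn = sqlite3.connect("spamfilter.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS edit_sample (
    sample_id        INTEGER PRIMARY KEY,
    text_path        TEXT,      -- link to the stored text file
    total_chars      INTEGER,
    alpha_ratio      REAL,
    digit_ratio      REAL,
    space_ratio      REAL,
    special_freq     REAL,
    total_words      INTEGER,
    link_word_ratio  REAL,
    short_word_ratio REAL,
    capital_ratio    REAL,
    label            TEXT       -- 'spam' or 'ham'
);

CREATE TABLE IF NOT EXISTS word_count (
    word       TEXT PRIMARY KEY,  -- unique token
    spam_count INTEGER DEFAULT 0, -- occurrences in spam samples
    ham_count  INTEGER DEFAULT 0  -- occurrences in ham samples
);
""")
conn.commit()
</syntaxhighlight>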

July 11th - July 22nd Implement the offline classifier (a scoring sketch follows this list).
 * Divide the data into 80% training data and 20% cross-validation data.
 * Implement the DetectSpam functionality, which takes into account the attributes calculated above.
 * Train the database using the training data.
 * Check the accuracy of the spam classifier by testing it on the cross-validation data.
 * Choose a Laplace smoothing parameter by re-iterating the above steps until accuracy converges to a maximum.
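The core scoring step is standard multinomial naive Bayes with Laplace (add-alpha) smoothing. A sketch under those assumptions, with DetectSpam mapped to a detect_spam method and the smoothing parameter exposed as alpha so it can be tuned on the cross-validation split:

<syntaxhighlight lang="python">
# Token-level naive Bayes sketch with Laplace smoothing. Class and method
# names are placeholders; only the underlying formula is standard.
import math
from collections import Counter

class BayesClassifier:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing parameter
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_docs = 0
        self.ham_docs = 0

    def train(self, tokens, label):
        """Update token counts for one labeled edit."""
        if label == "spam":
            self.spam_counts.update(tokens)
            self.spam_docs += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_docs += 1

    def detect_spam(self, tokens, threshold=0.0):
        """Return True when log P(spam|tokens) - log P(ham|tokens) > threshold."""
        vocab = set(self.spam_counts) | set(self.ham_counts)
        v = len(vocab) or 1
        spam_total = sum(self.spam_counts.values())
        ham_total = sum(self.ham_counts.values())
        docs = (self.spam_docs + self.ham_docs) or 1
        # smoothed class priors
        log_spam = math.log((self.spam_docs + self.alpha) / (docs + 2 * self.alpha))
        log_ham = math.log((self.ham_docs + self.alpha) / (docs + 2 * self.alpha))
        # smoothed per-token likelihoods
        for t in tokens:
            log_spam += math.log((self.spam_counts[t] + self.alpha) / (spam_total + self.alpha * v))
            log_ham += math.log((self.ham_counts[t] + self.alpha) / (ham_total + self.alpha * v))
        return log_spam - log_ham > threshold
</syntaxhighlight>

Choosing the smoothing parameter then simply means retraining with different alpha values and keeping the one that gives the best accuracy on the held-out 20%.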

July 22nd - July 29th Code review, extra period for suggested changes and unforeseen delays, improving documentation, fixing bugs.

July 29th- August 2nd Mid-term Evaluation Period

The offline classifier mentioned above will be a Python utility.

August 3rd - August 10th Register a hook in MediaWiki. Create the extension. Integrate it with the offline classifier Python process (a sketch of a possible command-line interface follows). Create special log pages that show the log of spam entries submitted.
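One possible integration point is a small command-line wrapper that the PHP extension calls as an external process. The protocol sketched here (wikitext on stdin, "spam" or "ham" on stdout, non-zero exit code for spam) and the pickled model file are assumptions for illustration:

<syntaxhighlight lang="python">
# Hypothetical CLI wrapper around the offline classifier. The stdin/stdout
# protocol and the model file name are assumptions, not settled design.
import pickle
import sys

def main():
    text = sys.stdin.read()
    with open("bayes_model.pkl", "rb") as fh:   # assumed: model trained offline
        model = pickle.load(fh)                 # e.g. the BayesClassifier above
    tokens = text.lower().split()
    verdict = "spam" if model.detect_spam(tokens) else "ham"
    print(verdict)
    return 1 if verdict == "spam" else 0

if __name__ == "__main__":
    sys.exit(main())
</syntaxhighlight>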

August 10th - August 17th Integrating the workflow with online users.
 * Reporting spam when an edit is saved.
 * Learn about query optimization. Index the tables and optimize the queries to improve speed (a small indexing sketch follows this list).
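A small sketch of the indexing step, in SQLite syntax and assuming the tables from the schema sketch above already exist; which columns actually need indexes would be decided by profiling the real queries:

<syntaxhighlight lang="python">
# Indexing sketch (SQLite for illustration); column choices are assumptions.
import sqlite3

conn = sqlite3.connect("spamfilter.db")
conn.executescript("""
-- speed up "count samples per class" style queries used during training
CREATE INDEX IF NOT EXISTS idx_edit_sample_label ON edit_sample (label);
-- word_count is already keyed by word, so token lookups need no extra index
""")
conn.commit()
</syntaxhighlight>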

August 17th - August 24th Submit the DB patches. Code Review by mentors. Roll out the feature for some wiki users. Fixing bugs.

August 24th - September 3rd Implement a job queue (a worker sketch follows this list).
 * Implement a multi-server, single-queue model. All concurrent edits go into the queue.
 * A fixed number of processing threads pull out the edits and call the DetectSpam utility.
 * Implement locking techniques.
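A sketch of the queue-and-worker model, using Python's thread-safe Queue for illustration; on a real high-traffic wiki this would map onto MediaWiki's own job queue, and detect_spam / revert_edit below are placeholders for the DetectSpam utility and the rollback call described above:

<syntaxhighlight lang="python">
# Queue-and-worker sketch. detect_spam() and revert_edit() are placeholders.
import queue
import threading

edit_queue = queue.Queue()   # all concurrent edits go into this queue
NUM_WORKERS = 4              # fixed number of processing threads
db_lock = threading.Lock()   # placeholder for locking around shared state

def detect_spam(tokens):
    """Placeholder for the DetectSpam utility."""
    return "pills" in tokens

def revert_edit(rev_id):
    """Placeholder for reverting an edit that was classified as spam."""
    print("would revert revision", rev_id)

def worker():
    while True:
        edit = edit_queue.get()              # blocks until an edit is available
        try:
            if detect_spam(edit["tokens"]):
                with db_lock:                # serialize writes to shared state
                    revert_edit(edit["rev_id"])
        finally:
            edit_queue.task_done()

if __name__ == "__main__":
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    edit_queue.put({"rev_id": 1, "tokens": ["buy", "cheap", "pills"]})  # example edit
    edit_queue.join()
</syntaxhighlight>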

September 3rd to September 16th Two weeks of buffer for suggested changes, fixing bugs and unforeseen delays.

September 17th - Soft Pencils Down date

September 17th - September 23rd Improve documentation.

September 23rd - Hard Pencils Down.

Motivations
Wiki pages are a great tool for knowledge management and documentation. Wikimedia's user base is far larger than that of any other wiki organization. That means a lot of people using it, pages that rank much higher in Google search results than other sites, and hence a lot of spammers and spambots hammering the sites. This makes an anti-spam extension one of the most important pieces of functionality for MediaWiki. My main motive for choosing MediaWiki is that it is the most widely used wiki application, deployed in almost every organization, perhaps the very one where I might be working in the future. There is no greater joy than seeing your code running smoothly and helping a million users.

I chose the spam filter because I am keenly interested in machine learning and data mining. I have theoretical knowledge of these fields which I now want to put to the test. I feel Summer of Code is a very good first step towards contributing to the open source community, and I figure this project is a way to fulfil all of these aims.

Open source experience
This is the first time I will be contributing to an open source community. I have fixed a small bug in the SpamBlacklist extension. In the coming week I wish to fix:

A bug about logging SpamBlacklist hits, to learn how to create special pages and integrate the database with the code.

A bug in the AbuseFilter extension, particularly this one, to learn how AbuseFilter works and how I can incorporate its metadata into vandalism detection.

About Me
I am a 4th year student pursuing a B.Tech in Computer Science at IIT Roorkee. I am proficient in web development, PHP and Python. My previous work includes:

Questionnaire, a tool modeled on SurveyMonkey for creating surveys. Here is its UI.

A Facebook app based on the idea of the Orkut crush list. You can check out the code here.

A social CRM tool I built for Mygola, a 500 Startups-funded startup where I worked as an intern.

I am very dedicated to whatever I do. I am keenly interested in machine learning and data mining, which is why I chose this project. I would like to continue working on this project even after GSoC, and put in the effort to push the code to a stable version as soon as possible. I plan to actively maintain my code and fix bugs even after my GSoC period is over.