User:Anubhav iitr

Identity:
Name: Anubhav Agarwal

Email: anubhav914@gmail.com

Project Title: Bayesian Spam Filter

Contact/working info
Timezone: UTC + 5:30

Typical working hours: 10:00 PM - 4:00 AM

IRC or IM networks/handle(s): anubhav

Project summary
Wikis are a common target for spammers wishing to promote products or web sites because of their open editing nature. Often a spammer will completely replace the legitimate content of a page with spam, and may add many different links with a range of URLs and keywords. Other spammers (such as spambots) will edit a large number of pages within a few minutes, even replacing good links with bad ones, for example by vandalizing article references.

MediaWiki already provides several extensions for combating spam. I had a look at the current spam management extensions. They mostly scan the text for blacklisted links and keywords, extract links and match them against blacklisted URLs, or prevent spambots by using CAPTCHAs. I intend to create a Bayesian Spam Filter extension for combating wiki spam under the GSoC 2013 program for MediaWiki. It will be an offline and online spam classifier based on token (word) filtering using Bayesian techniques. I think it could be potentially useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits, and especially a very small number of editors and tools able or willing to deal with spam.

For more information, refer to this discussion on the Wikitech-l mailing list.
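The core scoring idea behind token-based Bayesian filtering can be sketched as follows. This is an illustrative sketch, not the final extension code: the function names, the Laplace smoothing parameter `k`, and the spam threshold are all assumptions made here for the example.

```python
# Sketch of token-based Bayesian spam scoring (illustrative, not final code).
# Each token's spamminess P(spam | token) is estimated from training counts,
# and per-token probabilities are combined in log space (naive Bayes style).
import math

def token_spam_prob(spam_count, ham_count, total_spam, total_ham, k=1.0):
    """P(spam | token) with Laplace smoothing parameter k."""
    p_token_spam = (spam_count + k) / (total_spam + 2 * k)
    p_token_ham = (ham_count + k) / (total_ham + 2 * k)
    return p_token_spam / (p_token_spam + p_token_ham)

def classify(tokens, counts, total_spam, total_ham, threshold=0.9):
    """Combine per-token probabilities in log space to avoid underflow.
    counts maps token -> (spam_count, ham_count)."""
    log_spam = log_ham = 0.0
    for t in tokens:
        sc, hc = counts.get(t, (0, 0))
        p = token_spam_prob(sc, hc, total_spam, total_ham)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Convert the two log scores back into a probability of spam.
    score = 1.0 / (1.0 + math.exp(log_ham - log_spam))
    return score >= threshold, score
```

Smoothing keeps unseen tokens from producing probabilities of exactly 0 or 1, which would otherwise dominate the log-space sum.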

Deliverables

 * Automated training of the Bayesian DB from rollbacks and previous wiki data.
 * A traditional token-based offline Bayes classifier.
 * After this, making it an 'online' classifier, actually integrating it into the workflow of the wiki's users so that spambots can be stopped immediately.
 * Adding a job queue that stores the concurrent edits on a high-traffic site like Wikipedia and reverts the changes that were detected as spam.

Project Schedule
June 17 - June 24 Gather a corpus of text labelled as spam or ham. This includes:
 * Adding a "Report as spam" button that can only be used by wiki administrators.
 * Adding a hook to the rollback API so that a function registerSpam is called whenever a rollback is executed.
 * Gathering data from MediaWiki deleted pages.

June 24 - July 4th Implementing a Parser class.
 * Read text from a file and tokenize it, using whitespace, commas and periods as separators.
 * Learn Porter stemming. Implement Porter stemming on words after stripping punctuation marks.
 * Calculate and return the following attributes:
 * Total number of characters (C)
 * Ratio of alphabetic characters
 * Ratio of digits
 * Ratio of whitespace characters
 * Frequency of special characters (10 chars: *, _, +, =, %, $, @, ـ, \, /)
 * Total number of words (M)
 * Ratio of short words (two letters or fewer) to M
 * Ratio of capital letters
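The parser described above could look roughly like this. The function names and the exact attribute keys are assumptions for the sketch; Porter stemming is omitted here and would be applied to the tokens after punctuation stripping.

```python
# Sketch of the Parser step: tokenization plus the attribute features
# listed above (names are illustrative, not the final API).
import re
import string

SPECIAL_CHARS = "*_+=%$@\\/"  # subset of the special characters listed above

def tokenize(text):
    """Split on whitespace, commas and periods, then strip punctuation."""
    words = re.split(r"[\s,.]+", text)
    return [w.strip(string.punctuation).lower() for w in words if w]

def attributes(text):
    """Return the attribute dict used as classifier features."""
    words = tokenize(text)
    c = len(text) or 1   # avoid division by zero on empty input
    m = len(words) or 1
    return {
        "total_chars": len(text),
        "alpha_ratio": sum(ch.isalpha() for ch in text) / c,
        "digit_ratio": sum(ch.isdigit() for ch in text) / c,
        "whitespace_ratio": sum(ch.isspace() for ch in text) / c,
        "special_freq": sum(text.count(ch) for ch in SPECIAL_CHARS),
        "total_words": len(words),
        "short_word_ratio": sum(len(w) <= 2 for w in words) / m,
        "capital_ratio": sum(ch.isupper() for ch in text) / c,
    }
```

For example, `attributes("Buy NOW, only $9.99!")` yields five tokens and a high digit and special-character signal relative to ordinary article prose.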

July 4th - July 11th Adding two tables in the database.
 * The first will contain the link to the text file, the attributes mentioned above, and the class label (spam or ham).
 * The second will contain unique words, their count in spam, and their count in ham.

Integrating the database with the workflow.
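A hypothetical schema for the two tables might look like the following. SQLite is used here purely for the sketch (the real extension would go through the MediaWiki database layer), and all table and column names are placeholders.

```python
# Hypothetical schema sketch for the two tables described above.
# SQLite in-memory DB used only for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spam_sample (
    sample_id   INTEGER PRIMARY KEY,
    text_path   TEXT NOT NULL,      -- link to the stored text file
    total_chars INTEGER,
    alpha_ratio REAL,
    digit_ratio REAL,
    -- remaining attributes from the parser would go here
    label       TEXT CHECK (label IN ('spam', 'ham'))
);
CREATE TABLE spam_token (
    token      TEXT PRIMARY KEY,    -- unique word (after stemming)
    spam_count INTEGER DEFAULT 0,
    ham_count  INTEGER DEFAULT 0
);
""")
conn.execute("INSERT INTO spam_token VALUES ('cheap', 5, 1)")
row = conn.execute(
    "SELECT spam_count, ham_count FROM spam_token WHERE token = 'cheap'"
).fetchone()
```

Keeping the token counts in their own table lets training updates be simple per-word increments rather than rewrites of whole documents.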

July 11th - July 22nd Implement the offline classifier.
 * Divide the data into 80% training data and 20% cross-validation data.
 * Implement the DetectSpam functionality, which takes into account the attributes calculated above.
 * Train the database using the training data.
 * Check the efficiency of the spam classifier by testing it on the cross-validation data.
 * Choose a Laplace smoothing parameter by re-iterating the above steps until the cross-validation accuracy converges to a maximum.
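The split-train-validate loop for picking the smoothing parameter can be sketched as below. The `train` and `evaluate` callables are stand-ins for the real utility's training and accuracy-measurement steps, and the candidate `k` values are arbitrary examples.

```python
# Sketch of the 80/20 split and the smoothing-parameter search described above.
# train(train_set, k) and evaluate(model, cv_set) are placeholders for the
# real training and cross-validation accuracy functions.
import random

def split_data(samples, train_frac=0.8, seed=42):
    """Shuffle and split labelled samples into training and CV sets."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

def choose_smoothing(samples, train, evaluate, ks=(0.1, 0.5, 1.0, 2.0)):
    """Return the smoothing parameter k with the best CV accuracy."""
    train_set, cv_set = split_data(samples)
    best_k, best_acc = None, -1.0
    for k in ks:
        model = train(train_set, k)    # train Bayes counts with smoothing k
        acc = evaluate(model, cv_set)  # accuracy on the held-out 20%
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```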

July 22nd - July 29th Code review, extra period for suggested changes, unforeseen delays, improving documentation, fixing bugs.

July 29th - August 2nd Mid-term Evaluation Period

The offline classifier mentioned above will be a Python utility.

August 3rd - August 10th Register a hook in MediaWiki. Creating an extension. Integrating it with the offline classifier Python process. Creating special log pages that will show the log of spam entries submitted.

August 10th - August 17th Integrating the workflow with online users
 * Reporting spam when an edit is saved.
 * Learn about query optimization. Index the tables and optimize queries to improve speed.

August 17th - August 24th Submit the DB patches. Code review by mentors. Roll out the feature for some wiki users. Fix bugs.

August 24th - September 3rd Implement a job queue.
 * Implement a multi-server, single-queue model. All concurrent edits go into the queue.
 * A fixed number of processing threads pull out the edits and call the DetectSpam utility.
 * Implement locking techniques.
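The single-queue, fixed-worker-pool model above can be sketched with Python's standard threading primitives. Everything here is illustrative: `detect_spam` is a stand-in for the real classifier call, and flagged edits are collected in a list where the real system would issue reverts.

```python
# Sketch of the multi-server single-queue model: concurrent edits go into
# one queue and a fixed pool of worker threads pulls them out and runs the
# (placeholder) DetectSpam check.
import queue
import threading

edit_queue = queue.Queue()
flagged = []
flagged_lock = threading.Lock()  # locking around the shared revert list

def detect_spam(edit_text):
    """Placeholder for the real Bayesian classifier call."""
    return "viagra" in edit_text.lower()

def worker():
    while True:
        edit = edit_queue.get()
        if edit is None:              # sentinel: shut this worker down
            edit_queue.task_done()
            break
        if detect_spam(edit):
            with flagged_lock:
                flagged.append(edit)  # would trigger a revert in practice
        edit_queue.task_done()

NUM_WORKERS = 4
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for edit in ["fix typo", "Buy VIAGRA now", "add reference"]:
    edit_queue.put(edit)
for _ in threads:                     # one sentinel per worker
    edit_queue.put(None)
edit_queue.join()
for t in threads:
    t.join()
```

`queue.Queue` is already thread-safe for puts and gets, so the explicit lock is only needed around the shared list of flagged edits.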

September 3rd to September 16th Two weeks of backup time for suggested changes, fixing bugs and unforeseen delays.

September 17th - Soft Pencils Down date

September 17th - September 23rd Improve documentation.

September 23rd - Hard Pencils Down.