User:Anubhav iitr

My name is Anubhav Agarwal. I am a fourth-year student at IIT Roorkee, India, majoring in Computer Science. I am proficient in programming and web development. My contact information is at the end of this page.

I wish to apply to Wikimedia this summer under Google Summer of Code 2013. This is my proposal:

Description
I intend to create a Bayesian spam filter to combat the wiki spam problem. I had a look at the current spam management extensions. They mostly scan the text for blacklisted links and keywords, extract links and match them against URL blacklists, or block spambots with CAPTCHAs. None of them does token (word) based Bayesian filtering, as is done for e-mail.

Tasks

 * Create a tool for wiki users to report spam, as a simple way to train the Bayesian database. It should be accessible to any user with permission to "undo" or "rollback" those changes or to delete the new page/file. This also involves understanding the metadata (IP, links, user) I can extract from the data (perhaps harnessing other services such as blacklists).
 * Build a token-based Naive Bayes classifier. At this stage it will be an offline classifier, classifying data from wiki dump files.
 * Enhance it using various tools for spam discrimination, for instance CRM114.
 * After this, make it an 'online' classifier, integrating it into the workflow of the wiki's users so that spam bots can be stopped immediately.
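
To make the token-based Naive Bayes idea in the task list more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the planned implementation: the class name `NaiveBayesSpamFilter`, the `tokenize` helper, the "spam"/"ham" labels and the add-one smoothing are all hypothetical choices, and a real filter would be trained on reported edits and wiki dump data rather than on a handful of strings.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesSpamFilter:
    """Illustrative token-based Naive Bayes classifier with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        # Per-label token frequencies and number of training documents.
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one labelled example, e.g. an edit a user reported as spam."""
        self.token_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        counts = self.token_counts[label]
        total_tokens = sum(counts.values())
        # Vocabulary size, plus one slot reserved for unseen tokens so the
        # denominator is never zero (even before any training).
        vocab = len(set(self.token_counts["spam"]) | set(self.token_counts["ham"])) + 1
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior for the label.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        # Sum of smoothed log likelihoods, one term per token occurrence.
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_tokens + vocab))
        return score

    def spam_probability(self, text):
        """Return the estimated P(spam | text) as a value between 0 and 1."""
        tokens = tokenize(text)
        log_spam = self._log_score(tokens, "spam")
        log_ham = self._log_score(tokens, "ham")
        # Subtract the larger log score before exponentiating to avoid underflow.
        m = max(log_spam, log_ham)
        p_spam = math.exp(log_spam - m)
        p_ham = math.exp(log_ham - m)
        return p_spam / (p_spam + p_ham)


if __name__ == "__main__":
    nb = NaiveBayesSpamFilter()
    nb.train("buy cheap pills online now", "spam")
    nb.train("cheap pills cheap casino", "spam")
    nb.train("the article needs a citation for the history section", "ham")
    nb.train("fixed a typo in the history section", "ham")
    print(nb.spam_probability("cheap pills online"))
    print(nb.spam_probability("history section citation"))
```

In this sketch the "report spam" tool would simply call `train(text, "spam")`, and reverted or deleted edits could feed the same method. The offline stage would run `spam_probability` over revisions read from dump files, while the online stage would call it when an edit is saved and act on a threshold that would have to be tuned on real data.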

Contact
Feel free to start a discussion on my talk page or mail me at anubhav914@gmail.com with "wikimedia gsoc" as the subject.