Extension:BayesianFilter/GSoC 2013/Project updates

Monthly Reports
June


 * Searched for data sets of wiki spam, downloaded the STiki data set, and analyzed whether it could be used as training data. Unfortunately it did not work out, as STiki labels vandalism, not spam.
 * Investigated the MediaWiki API and its control flow for developing my extension.
 * Thoroughly examined the SpamBlacklist extension and how it works, as it is very close to my extension.

July

Till Now


 * After discussion on IRC and with Chris, decided to build an extension that records spam edits, rather than a gadget.
 * Created the skeleton of the BayesianFilter extension, which currently records reverted edits.
 * Studied database access and implemented functionality to record undo and rollback edits.
 * Implemented a checkbox "Mark this Spam" beside "Watch this page" for undo actions.
 * The source code can be found here https://github.com/anubhav914/BayesianFilter.
 * Made changes to the reverted_edits table and the code as suggested by Platonides
 * My earlier plan was to gather the data and build the training model first, but the STiki data did not work out, and writing the data-gathering extension took time, so I have built the basic skeleton of the extension showing how it will look.
 * Added the checkSpam functionality, which cleans the text and then calculates the Bayesian probability that it is spam.
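
The checkSpam step described above could be sketched roughly as follows. This is a minimal Python illustration of the Bayesian probability calculation, not the extension's actual PHP code; the names `spam_probability` and `word_probs` and the example probabilities are assumptions for illustration only.

```python
import math

def spam_probability(tokens, word_probs, default=0.4):
    """Combine per-word spam probabilities naive-Bayes style.

    word_probs maps a word to P(spam | word); words never seen in
    training get a near-neutral default. Sums of logs are used instead
    of products of probabilities to avoid floating-point underflow.
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in tokens:
        p = word_probs.get(word, default)
        p = min(max(p, 0.01), 0.99)  # clamp so log() never sees 0 or 1
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # P(spam) = prod(p) / (prod(p) + prod(1 - p)), computed in log space
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical per-word probabilities: spammy words push the score up
probs = {"viagra": 0.99, "cheap": 0.9, "wiki": 0.2}
print(spam_probability(["viagra", "cheap"], probs))  # near 1.0
print(spam_probability(["wiki"], probs))             # 0.2
```

A text is then flagged as spam when this score crosses a chosen threshold (e.g. 0.9).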

August

I have largely rewritten the source code: the previous version was just an extension to gather data, but now I will pull training data from MediaWiki's deleted pages.


 * Applied the changes discussed with Platonides
 * Wrote BayesianFilter.DBHandler. I created two database tables, spam_ham_text and word_frequency: the former stores texts with spam/ham labels, while the latter stores each word's frequency in spam and ham texts.
 * Wrote BayesianFilter.Tokenizer. It provides functions to sanitize and tokenize content fetched from wiki pages. As discussed with Platonides, I wrote the tokenize function as an iterator.
 * Added functionality to train the database.
 * Almost all the coding is complete. The source code can be found here https://github.com/anubhav914/BayesianFilter.
 * Got MediaWiki sysop permissions. I will now fetch deleted MediaWiki pages and use them to train the database. Once it is trained, I will submit the patch to Gerrit for review.
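
The tokenizer-as-iterator and the word_frequency training described above could look roughly like this. Again this is a Python sketch, not the extension's PHP; the names `tokenize` and `train`, and the sanitization rules, are assumptions for illustration. The generator yields tokens lazily, and `train` counts each word's spam/ham occurrences the way the word_frequency table does.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Sanitize wiki text and yield tokens lazily (a generator/iterator)."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML-like markup
    for match in re.finditer(r"[a-z0-9']+", text.lower()):
        yield match.group(0)

def train(word_frequency, text, is_spam):
    """Update per-word spam/ham counts, mirroring the word_frequency table."""
    key = "spam" if is_spam else "ham"
    for word in tokenize(text):
        word_frequency[word][key] += 1

word_frequency = defaultdict(lambda: {"spam": 0, "ham": 0})
train(word_frequency, "Buy cheap <b>pills</b> now", is_spam=True)
train(word_frequency, "Edited the article about pills", is_spam=False)
print(word_frequency["pills"])  # {'spam': 1, 'ham': 1}
```

Because tokenize is an iterator, the full token list never has to be materialized in memory, which matters for large page texts.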

Future Plan of Action
 * Plug the extension into MediaWiki
 * Gather training data and sample it (a huge task) for an effective training model