Extension:BayesianFilter/GSoC 2013/Project updates
- Investigated data sets corresponding to spamming on wikis, downloaded the STiki data set, and analyzed whether it could be used as a training data set. Unfortunately it did not work out, as STiki labels vandalism and not spam.
- Investigated the MediaWiki API and its control flow for developing my extension.
- Thoroughly examined the SpamBlacklist extension and how it works, as it is very close to my extension.
- After discussion on IRC and with Chris, decided to make an extension that registers spam rather than a gadget.
- Created the skeleton of the BayesianFilter extension, which as of now registers reverted edits.
- Studied MediaWiki database access and implemented functionality for registering undo and rollback edits.
- Implemented a "Mark this Spam" checkbox beside "Watch this page" for undo actions.
- The source code can be found here https://github.com/anubhav914/BayesianFilter.
- Made changes to the reverted_edits table and the code as suggested by Platonides.
- My earlier plan was to gather the data first and then build the training model, but the STiki data did not work out, and writing the data-gathering extension took time, so for now I have built a basic skeleton showing how the extension will look.
- Added a checkSpam function, which cleans the text and then calculates the Bayesian probability that it is spam.
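The probability combination behind a check like this can be sketched as follows. This is a conceptual Python illustration of a naive Bayes spam score over word counts, not the extension's actual PHP code; the function and parameter names are assumptions for the sketch.

```python
import math

def spam_probability(tokens, word_counts, spam_total, ham_total):
    """Combine per-word spam likelihoods into an overall spam probability.

    word_counts maps a token to (spam_count, ham_count); spam_total and
    ham_total are the numbers of spam/ham training texts. Illustrative
    names, not the extension's real API.
    """
    log_odds = 0.0
    for token in tokens:
        spam_count, ham_count = word_counts.get(token, (0, 0))
        # Laplace smoothing so unseen words do not zero out the estimate.
        p_word_spam = (spam_count + 1) / (spam_total + 2)
        p_word_ham = (ham_count + 1) / (ham_total + 2)
        log_odds += math.log(p_word_spam) - math.log(p_word_ham)
    # Convert accumulated log-odds back to a probability in [0, 1].
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Working in log-odds rather than multiplying raw probabilities avoids floating-point underflow on long texts.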
- Largely rewrote the source code, as the previous version was just an extension to gather data; now I will be pulling training data from MediaWiki deleted pages.
- Applied the changes discussed with Platonides.
- Wrote BayesianFilter.DBHandler. I have made two tables in the DB, spam_ham_text and word_frequency. The first stores the texts with spam/ham labels, while the latter stores the frequency of each word in spam and ham texts.
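A minimal sketch of the two tables described above, expressed here in SQLite via Python for illustration. The real extension goes through MediaWiki's database layer, and the exact column names below are assumptions, not the extension's actual schema.

```python
import sqlite3

# In-memory SQLite stand-in for the two tables: spam_ham_text holds
# labeled training texts, word_frequency holds per-word spam/ham counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spam_ham_text (
    id      INTEGER PRIMARY KEY,
    text    TEXT NOT NULL,
    is_spam INTEGER NOT NULL            -- 1 = spam label, 0 = ham label
);
CREATE TABLE word_frequency (
    word       TEXT PRIMARY KEY,
    spam_count INTEGER NOT NULL DEFAULT 0,  -- occurrences in spam texts
    ham_count  INTEGER NOT NULL DEFAULT 0   -- occurrences in ham texts
);
""")
```

Keeping the aggregated counts in word_frequency means classification never has to rescan the raw texts in spam_ham_text.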
- Wrote BayesianFilter.Tokenizer. It provides functions to sanitize and tokenize the content obtained from the wiki text. As discussed with Platonides, I wrote the tokenize function as an iterator.
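The iterator design can be sketched with a Python generator: tokens are yielded one at a time instead of building the whole list in memory, which keeps memory flat on large pages. The sanitizing regex here is an illustrative assumption, not the Tokenizer's real cleaning rules.

```python
import re

def tokenize(text):
    """Lazily yield lowercase word tokens from raw wiki text.

    First strips common wiki/HTML markup remnants, then emits one token
    per word match. A generator plays the role of the iterator mentioned
    above: callers pull tokens on demand.
    """
    cleaned = re.sub(r"<[^>]+>|\[\[|\]\]|\{\{|\}\}", " ", text)
    for match in re.finditer(r"[a-z0-9']+", cleaned.lower()):
        yield match.group(0)
```

A consumer can loop over the generator directly, e.g. `for word in tokenize(page_text): ...`, without ever holding the full token list.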
- Added the functionality to train the DB
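The training step amounts to updating the word counts with each labeled text. The sketch below shows that update against in-memory dicts; the names and structure are hypothetical stand-ins for the extension's actual DB update code.

```python
from collections import Counter

def train(word_counts, totals, tokens, is_spam):
    """Fold one labeled text into the frequency tables.

    word_counts maps word -> [spam_count, ham_count]; totals is a
    two-element list [spam_texts, ham_texts]. Illustrative in-memory
    version of the DB training step.
    """
    column = 0 if is_spam else 1
    totals[column] += 1
    for word, n in Counter(tokens).items():
        counts = word_counts.setdefault(word, [0, 0])
        counts[column] += n
```

Each call increments the per-class text total and the per-word counts, which are exactly the quantities the Bayesian probability calculation consumes.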
- Almost all the coding is complete. The source code can be found here https://github.com/anubhav914/BayesianFilter.
- Got the MediaWiki sysop permissions. I will now fetch MediaWiki deleted pages and use them to train the DB. Once the DB is trained, I will submit the patch in Gerrit for review.
Future Plan of Action
- Plug the extension into MediaWiki
- Gather training data and sample it (a huge task) for an effective training model