Extension:BayesianFilter/GSoC 2013/Project updates

Monthly Reports

June


  • Investigated data sets corresponding to spamming on wikis, downloaded the STiki data set, and analyzed whether it could be used as training data. Unfortunately it did not work out, as STiki labels vandalism rather than spam.
  • Investigated the MediaWiki API and its control flow for developing my extension.
  • Thoroughly examined the SpamBlacklist extension and how it works, as it is very close to my extension.

July

Till Now

  • After discussion on IRC and with Chris, decided to make an extension that registers spam rather than a gadget.
  • Created the skeleton of the BayesianFilter extension, which for now registers reverted edits.
  • Studied database access and implemented functionality for registering undo and rollback edits
  • Implemented a checkbox "Mark this Spam" beside "Watch this page" for undo actions.
  • The source code can be found at https://github.com/anubhav914/BayesianFilter.
  • Made changes to the reverted_edits table and the code as suggested by Platonides
  • My earlier plan was to get the data and build the training model, but the STiki data did not work out, and writing the data-gathering extension took time, so I have built a basic skeleton showing how the extension will look.
  • Added the checkSpam functionality, which cleans the text and then calculates the Bayesian probability that it is spam (a minimal sketch of this calculation follows this list).
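
The following is a minimal sketch of the kind of naive Bayes calculation checkSpam performs, in the style of Graham-type spam filtering: each token gets a spamicity from its spam/ham counts, and the per-token values are combined into one probability that the text is spam. The function names, smoothing, and toy counts here are illustrative assumptions, not the extension's actual API; in the extension the counts come from the training data.

  <?php
  // Illustrative sketch only: combine per-word spam/ham counts into a
  // single spam probability for a cleaned, tokenized text.
  function wordSpamicity( $spamCount, $hamCount, $totalSpam, $totalHam ) {
      // P(word|spam) vs P(word|ham), with simple add-one smoothing
      $pWordSpam = ( $spamCount + 1 ) / ( $totalSpam + 2 );
      $pWordHam  = ( $hamCount + 1 ) / ( $totalHam + 2 );
      return $pWordSpam / ( $pWordSpam + $pWordHam );
  }

  function spamProbability( array $tokens, array $spamFreq, array $hamFreq, $totalSpam, $totalHam ) {
      // Combine per-token spamicities in log space to avoid underflow
      $logSpam = 0.0;
      $logHam  = 0.0;
      foreach ( $tokens as $token ) {
          $p = wordSpamicity(
              isset( $spamFreq[$token] ) ? $spamFreq[$token] : 0,
              isset( $hamFreq[$token] ) ? $hamFreq[$token] : 0,
              $totalSpam, $totalHam );
          $logSpam += log( $p );
          $logHam  += log( 1 - $p );
      }
      // P(spam) = prod(p) / ( prod(p) + prod(1 - p) ), computed via logs
      return 1 / ( 1 + exp( $logHam - $logSpam ) );
  }

  // Toy example: three tokens, hypothetical frequency tables over 100 spam and 100 ham texts
  $tokens   = array( 'cheap', 'viagra', 'wiki' );
  $spamFreq = array( 'cheap' => 40, 'viagra' => 60 );
  $hamFreq  = array( 'wiki' => 80, 'cheap' => 5 );
  echo spamProbability( $tokens, $spamFreq, $hamFreq, 100, 100 ); // ~0.84, i.e. likely spam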

August

I have largely rewritten the source code: the previous version was just an extension to gather data, but now I will be pulling training data from MediaWiki deleted pages.

  • Applied the changes discussed with Platonides
  • Wrote BayesianFilter.DBHandler. I have created two tables in the DB, spam_ham_text and word_frequency. The former stores the texts with spam/ham labels, while the latter stores the frequency of each word in spam and ham texts.
  • Wrote BayesianFilter.Tokenizer. It provides functions to sanitize and tokenize the content obtained from the wiki text. As discussed with Platonides, I wrote the tokenize function as an iterator; the tokenizer and the training update are sketched after this list.
  • Added the functionality to train the DB
  • Almost all the coding is complete. The source code can be found at https://github.com/anubhav914/BayesianFilter.
  • Got MediaWiki sysop permissions. I will now fetch MediaWiki deleted pages and use them to train the DB. Once the DB is trained, I will submit the patch to Gerrit for review.
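
Below are two minimal sketches of the components described above. The first illustrates a tokenizer in the spirit of BayesianFilter.Tokenizer: it sanitizes the text and then walks the resulting words lazily through PHP's Iterator interface. The class name and sanitization rules are assumptions for illustration, not the extension's actual code.

  <?php
  // Illustrative sketch only: lazy, iterator-style tokenizer.
  class TokenIterator implements Iterator {
      private $tokens;
      private $pos = 0;

      public function __construct( $text ) {
          // Sanitize: drop tags, punctuation and case, then split on whitespace
          $clean = strtolower( strip_tags( $text ) );
          $clean = preg_replace( '/[^a-z0-9\s]/', ' ', $clean );
          $this->tokens = preg_split( '/\s+/', trim( $clean ), -1, PREG_SPLIT_NO_EMPTY );
      }

      public function current() { return $this->tokens[$this->pos]; }
      public function key()     { return $this->pos; }
      public function next()    { $this->pos++; }
      public function rewind()  { $this->pos = 0; }
      public function valid()   { return $this->pos < count( $this->tokens ); }
  }

  foreach ( new TokenIterator( "''Buy'' cheap [http://example.com pills]" ) as $word ) {
      echo $word, "\n"; // buy, cheap, http, example, com, pills
  }

Training then amounts to walking the tokens of each labelled text and bumping the matching counter in word_frequency, roughly as in the helper below. The column names (wf_word, wf_spam_count, wf_ham_count) and the helper itself are assumed for illustration; only the table name comes from the description above.

  <?php
  // Illustrative sketch only: increment the spam or ham count of one word
  // in word_frequency, using MediaWiki's database abstraction layer.
  function incrementWordCount( $word, $isSpam ) {
      $dbw = wfGetDB( DB_MASTER );
      $column = $isSpam ? 'wf_spam_count' : 'wf_ham_count';
      // Create the row if it does not exist yet, then bump the chosen counter
      $dbw->insert( 'word_frequency',
          array( 'wf_word' => $word, 'wf_spam_count' => 0, 'wf_ham_count' => 0 ),
          __METHOD__, array( 'IGNORE' ) );
      $dbw->update( 'word_frequency',
          array( "$column = $column + 1" ),
          array( 'wf_word' => $word ),
          __METHOD__ );
  }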

Future Plan of Action

  • Plug the extension into MediaWiki
  • Gather training data and sample it (a huge task) to build an effective training model