User talk:Anubhav iitr

--Qgil (talk) 00:43, 24 March 2013 (UTC)

some feedback
Thanks for your draft proposal. It would help you to link to some examples of your proficiency in web programming -- do you have past open source contributions or school projects that you could show us? Showing us the code is a good step. Also, have you tried playing with MediaWiki's code yet?

best, Sharihareswara (WMF) (talk) 17:28, 8 April 2013 (UTC)


 * You're welcome. I made a Facebook app based on Orkut Crushlist. You can check out the code here. Other than that, I have made some school projects, which are currently deployed on a LAN. I was an intern, and will be an employee, at Mygola, a startup funded by 500 Startups, where I develop a CRM tool. Unfortunately I can't show these to you in working form as of now :( . No, I don't have any open source contributions, but I feel GSoC is a good first step to bond with an open source community. Yes, I have gone through the MediaWiki codebase and submitted a patch for review for this bug.

--Anubhav iitr (talk) 07:12, 12 April 2013 (UTC)

Proposal Comments
Some quick initial comments:
 * Updating the UI to collect the corpus is going to be hard, much more work than one week. Getting a button added to the UI would need design review and approval from the administrators. Alternatively, you may be able to collect reverts from Cluebot-ng or STiki, or possibly look at reverted edits by users who have been blocked for spam. You could also add a button to the page using JavaScript that tags the revision just before the revert; convincing a few administrators to use your JavaScript will be much easier than convincing them all that another button is needed in the interface.
 * Thanks for the suggestion. I think I will use STiki; it labels texts as vandalism or innocent, so it would be easier to gather classified data.


 * For the offline processing, you may want to focus on implementing the filter as a bot, which reads all of the incoming edits, and does the processing outside of the WMF cluster. The data handling will need to be pretty mature before we can run it on the production servers. Running this on a wmflabs instance shouldn't be a problem.
 * That is what I am doing. The filter will be a Python daemon, called from a PHP script in the SpamFilter extension, which will pass it all the incoming edits. The filter will evaluate each edit as spam or ham, update the DB, and return the result.
 * So the difference is where the Python script actually runs. To get it on the Wikimedia cluster, it will need to be pretty mature, and go through a rigorous review for performance and security before it can be deployed. This can take several weeks. If, instead, it's actually running on a wmflabs instance, and just consuming a feed (using IRC or the API) of recent changes, then there are almost no security or performance requirements. So I'd recommend starting with that, with the goal of having it run on the cluster (either from a hook, or as a job runner) during the second half of the program.
 * After talking with Anubhav, the goal will not be to run this on the WMF cluster this summer, but just to develop the extension, and the WMF can evaluate its usefulness on WMF sites when it's done. So the above comments about the cluster are irrelevant. CSteipp (talk) 17:49, 25 April 2013 (UTC)
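
For reference, the "consume a feed of recent changes through the API" idea from this thread could look roughly like the sketch below. It only builds the request URL for the MediaWiki `list=recentchanges` query and turns a JSON response into flat records that a classifier could read. The endpoint in `API_URL`, the function names, and the choice of fields are assumptions made for illustration, not part of the proposal; fetching, polling, and continuation handling are left out.

```python
import json
from urllib.parse import urlencode

# Assumption: any MediaWiki api.php endpoint works here; this one is illustrative.
API_URL = "https://www.mediawiki.org/w/api.php"


def build_recentchanges_url(api_url, limit=50, rccontinue=None):
    """Build a list=recentchanges query URL for edits and page creations."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcprop": "ids|title|user|comment|timestamp|sizes",
        "rclimit": str(limit),
        "format": "json",
    }
    if rccontinue:
        # Continuation token taken from a previous response, if any.
        params["rccontinue"] = rccontinue
    return api_url + "?" + urlencode(params)


def parse_recentchanges(payload):
    """Turn the JSON text of an API response into flat records."""
    data = json.loads(payload)
    changes = data.get("query", {}).get("recentchanges", [])
    return [
        {
            "revid": change["revid"],
            "title": change["title"],
            "user": change.get("user", ""),
            "comment": change.get("comment", ""),
            # Size change in bytes, similar in spirit to AbuseFilter's edit_delta.
            "delta": change.get("newlen", 0) - change.get("oldlen", 0),
        }
        for change in changes
    ]
```

A bot running outside the cluster would fetch the URL periodically, pass each record to the filter, and store the verdict; none of that needs access to the production servers.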

Platonides (talk) 21:44, 25 April 2013 (UTC)
 * Why create the classifier in Python? As you're developing it from scratch, it may be better to write it in PHP, unless Python has some advantage for the task.
 * You may find some revision metadata interesting, too. The variables recorded by AbuseFilter are: user_editcount, user_name, user_groups, article_article_id, article_namespace, article_text, article_prefixedtext, article_recent_contributors, action, summary, minor_edit, old_wikitext, new_wikitext, edit_diff, new_size, old_size, edit_delta, added_lines, removed_lines, added_links, all_links, old_links, tor_exit_node, timestamp. Some values which could be interesting: aggregated user_editcount, article_namespace, summary, minor_edit, tokenized added_lines, time since last edit...
 * The alpha/special/whitespace characters should be configurable/depend on the language. Perhaps Unicode properties could be used.
 * What's the reasoning behind the short words % attribute? It seems more likely that a problematic edit contains a 25-character "word".
 * Looking up words in a dictionary may be interesting (% of the words found in the language dictionary) as an alternative method.
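
A few of the text attributes discussed above can be sketched as follows: character classes decided by Unicode general category rather than by an ASCII-only list, the percentage of short words, the length of the longest "word", and the share of words found in a dictionary. The function names, the `short_len=3` threshold, and the dictionary passed as a plain set are all assumptions for illustration; the proposal does not fix any of them.

```python
import string
import unicodedata


def char_class_ratios(text):
    """Share of letters, whitespace, and other characters in the text.

    Letters are detected through the Unicode general category (L*),
    so non-Latin scripts count as alphabetic too.
    """
    if not text:
        return {"alpha": 0.0, "space": 0.0, "special": 0.0}
    counts = {"alpha": 0, "space": 0, "special": 0}
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            counts["alpha"] += 1
        elif ch.isspace():
            counts["space"] += 1
        else:
            counts["special"] += 1
    return {name: count / len(text) for name, count in counts.items()}


def word_features(text, dictionary, short_len=3):
    """Word-level attributes; short_len is a hypothetical threshold."""
    # Strip surrounding ASCII punctuation and normalise case before comparing.
    words = [w.strip(string.punctuation).casefold() for w in text.split()]
    if not words:
        return {"short_word_ratio": 0.0, "longest_word": 0, "dictionary_ratio": 0.0}
    short = sum(1 for w in words if len(w) <= short_len)
    known = sum(1 for w in words if w in dictionary)
    return {
        "short_word_ratio": short / len(words),
        "longest_word": max(len(w) for w in words),
        "dictionary_ratio": known / len(words),
    }
```

For example, `word_features("Buy cheap pills at " + "x" * 25, {"buy", "cheap", "pills", "at"})` reports a longest word of 25 characters and a dictionary ratio of 4 out of 5, so the long-"word" signal and the dictionary signal can be used alongside the short-word percentage rather than instead of it.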