User talk:AalekhN/GSoC proposal 2014

Asirra
Hi, isn't this feature basically the same as what Asirra is doing with cats and dogs? See Extension:Asirra, and a real example running in MediaWiki here.--Qgil (talk) 23:06, 6 March 2014 (UTC)

--Hi, what I described in this proposal differs from Asirra in two ways:
 * --I proposed using data mining to extract images from the Wikimedia Commons database through its API. Asirra, on the other hand, is more about image recognition; moreover, the use of image recognition is not preferred, as discussed here: https://www.mediawiki.org/wiki/Talk:CAPTCHA#lqt_thread_id_39998
 * --Asirra is proprietary and can't be self-hosted. Wikimedia Commons, on the other hand, is part of Wikimedia, as mentioned in a reply here: https://www.mediawiki.org/wiki/Talk:CAPTCHA#lqt_thread_id_40174 --AalekhN (talk)
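
As a rough illustration of the "through its API" part of the first point, the sketch below only builds a request URL for listing the files in a Commons category via the MediaWiki Action API (`list=categorymembers`). The helper name `category_files_url`, the example category, and the limit are assumptions made for illustration; the proposal itself does not say which queries it would use.

```python
from urllib.parse import urlencode

# Public Action API endpoint of Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def category_files_url(category, limit=50):
    """Build (but do not send) a query URL listing files in a Commons category.

    Hypothetical helper: the category name and limit are illustrative only.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",      # only files, no subcategories or pages
        "cmlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"

url = category_files_url("White cats")
```

Fetching and parsing the response is left out on purpose; any HTTP client could be used for that step.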

ConfirmEdit / node
Two Questions: CSteipp (talk) 01:13, 7 March 2014 (UTC)
 * Are you proposing to extend or integrate with the existing ConfirmEdit extension, or would this be a competing extension?
 * You mentioned node.js as a part of your development environment-- are you planning to do part of this in JavaScript instead of PHP? Having the captchas require languages other than PHP has been an issue with fancy captcha (generation is in Python), and I would caution against it unless there is a really compelling reason for it.

--Hi, I plan to build it as an integration with the ConfirmEdit extension. If required, we can also make it a separate extension.
 * -The mention of a node server was a typing mistake and has since been corrected. Since this proposal is mostly about data mining, I don't think there is a need to use other languages such as Python or node, as discussed in a comment here: https://www.mediawiki.org/wiki/Talk:CAPTCHA#lqt_thread_id_39998 AalekhN (talk)


 * Thanks. Please just integrate with ConfirmEdit-- a separate extension will make things more difficult in the future, which is why I was concerned about it. It sounds like you're on a good track. Good luck.


 * Also, when you consider your proposal complete and you've settled on a specific idea, please move it to the appropriate title as a subpage of CAPTCHA (per instructions). --Nemo 13:08, 8 March 2014 (UTC)

Watch the complexity
Just reading the captcha question gave me a headache... I'm not a designer so I'm unable to comment on specific ideas; but, again, you'll need some scientific research/usability testing to support whatever approach you choose. --Nemo 17:31, 13 March 2014 (UTC)
 * -- Thank you. I got similar advice from various community members, so I have changed my approach to make the captcha question mentioned in approach 1 more user-friendly. --aalekhN 11:28, 14 March 2014 (UTC)

Layout
21.20 < aalekhN1> Nemo_bis: any suggestions regarding new layout of proposal?

Layout is certainly not the main problem. --Nemo 20:34, 19 March 2014 (UTC)

Mentors
Please, as I told you:
 * ask on the related bugzilla reports who's interested in the project and in mentoring it;
 * ask Siebrand if he's available to be the primary mentor and/or he prefers/recommends you (based on your skills etc.) to apply for another project he's mentoring. --Nemo 20:36, 19 March 2014 (UTC)


 * --Yes, thank you. I emailed Siebrand yesterday and have just posted on the Bugzilla report. AalekhN 03:30PM, 20 March 2014 (UTC)

Unrealistic without a data source
This proposal concentrates on the easy part of the image-recognition CAPTCHA problem (finding tasks where humans drastically outperform machines) and completely ignores the hard part (how to create images + verification data in a way that is not exploitable by spambots). You need a very large set of CAPTCHAs (hundreds of thousands, ideally), otherwise an attacker can just map your CAPTCHA database. If you use a public image repository (such as Commons) or a public data source (such as Commons categories), chances are an attacker can match the CAPTCHA to the source and figure out the solution from that.

Asirra works because Microsoft has an agreement with petfinder.com, who provide them with an endless stream of new animal photos and manually created classifications; the classifications are initially not public and only available to the CAPTCHA software. How do you intend to obtain a similarly robust data source? As far as I can see, this would be the real challenge in the project, not the choice of image transformation.

I suppose one could try the reCAPTCHA way and create some sort of bootstrap data set, then show people a mix of captchas with known and unknown solutions, and use the known ones for verification and the unknown ones for generating more data. But that is not easy and should get significant focus in the project if you want a CAPTCHA system which is of any practical use at the end. --Tgr (WMF) (talk) 01:42, 20 March 2014 (UTC)
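
To make the bootstrap idea above concrete, here is a minimal sketch, not a proposed implementation. The class `BootstrapPool`, its method names, and the `promote_after` vote threshold are all hypothetical; a real system would also need rate limiting, disagreement handling, and protection against coordinated voting.

```python
import random
from collections import Counter, defaultdict

class BootstrapPool:
    """Hypothetical pool mixing images with known and unknown solutions."""

    def __init__(self, known, unknown, promote_after=3):
        self.known = dict(known)            # image id -> trusted label
        self.unknown = set(unknown)         # image ids without a trusted label
        self.votes = defaultdict(Counter)   # image id -> label vote counts
        self.promote_after = promote_after  # assumed threshold, not from the thread

    def make_challenge(self, rng=random):
        """Pick one known (control) and one unknown (probe) image.

        The user is shown both without being told which is which.
        """
        control = rng.choice(sorted(self.known))
        probe = rng.choice(sorted(self.unknown)) if self.unknown else None
        return control, probe

    def submit(self, control, control_answer, probe=None, probe_answer=None):
        """Grade only the control image; record the probe answer as a vote."""
        if self.known[control] != control_answer:
            return False  # failed verification: the probe answer is discarded
        if probe is not None and probe in self.unknown:
            self.votes[probe][probe_answer] += 1
            label, count = self.votes[probe].most_common(1)[0]
            if count >= self.promote_after:
                # Enough agreeing answers: treat the probe as known from now on.
                self.known[probe] = label
                self.unknown.discard(probe)
        return True
```

Only answers from users who pass the control image count as votes, which is what lets the known set grow without publishing the labels up front.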


 * --First of all, thank you for raising this question. To address the problem of the images being recognized by bots, I have opted for graphical modification of the images. To support this, I have gone through various research papers and run some tests. One such test was made on this image, https://www.mediawiki.org/wiki/File:Cat123789_(1).gif, by applying photobooth and pencil-sketch effects; to my surprise, the image was not recognizable even by Google Images or the TinEye API. This argument is supported by this research paper: http://web.media.mit.edu/~mehoque/Publications/Captcha-CameraReady.pdf . A few other approaches that support secure captchas have also been discussed, one of them on the Google Research Blog: http://googleresearch.blogspot.in/2009/04/socially-adjusted-captchas.html . In addition, questions such as selecting the odd one out leave little scope for bots to identify the answer. AalekhN (talk) 03:24, 20 March 2014 (UTC)


 * Another approach that may be worth trying is to play a little (or a lot) with categories: rely not on the basic categories directly but on more general meta-categories. For example, here are three images and their categories, where the "different one" is easy for humans to find but hard for bots to solve (I guess):

––Pginer (talk) 15:29, 26 March 2014 (UTC)
 * http://commons.wikimedia.org/wiki/File:Pablo_picasso_1.jpg
    * Pablo Picasso
    * Portraits of Pablo Picasso
    * 1962
    * Flat caps
    * Black and white photographic portraits of painters
    * 20th-century portrait photographs
 * http://commons.wikimedia.org/wiki/File:1pastor_Shevchenko_w.jpg
    * Portraits
    * Priests from Ukraine
    * 1974 births
    * Men wearing neckties
    * Men with glasses
    * Wikiusers
 * http://commons.wikimedia.org/wiki/File:Cat_Briciola_with_pretty_and_different_colour_of_eyes.jpg
    * White cats
    * Odd-eyed cats
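
The meta-category idea can be sketched roughly as follows, using the three images listed above. The `META` mapping is a hypothetical, hand-written table invented for this illustration (Commons does not provide it in this form), and `odd_one_out` is an assumed helper name.

```python
# Hypothetical hand-curated mapping from Commons categories to meta-categories.
META = {
    "Portraits of Pablo Picasso": "people",
    "Black and white photographic portraits of painters": "people",
    "Priests from Ukraine": "people",
    "Men wearing neckties": "people",
    "Men with glasses": "people",
    "White cats": "animals",
    "Odd-eyed cats": "animals",
}

def meta_categories(categories):
    """Map an image's categories to the set of meta-categories they fall under."""
    return {META[c] for c in categories if c in META}

def odd_one_out(images):
    """Return the one image sharing no meta-category with any other image.

    `images` maps a file name to its list of categories. Returns None when
    there is not exactly one such image.
    """
    metas = {name: meta_categories(cats) for name, cats in images.items()}
    odd = [
        name for name, m in metas.items()
        if m and all(m.isdisjoint(other) for o, other in metas.items() if o != name)
    ]
    return odd[0] if len(odd) == 1 else None

images = {
    "Pablo_picasso_1.jpg": ["Pablo Picasso", "Portraits of Pablo Picasso", "Flat caps"],
    "1pastor_Shevchenko_w.jpg": ["Portraits", "Priests from Ukraine", "Men with glasses"],
    "Cat_Briciola_with_pretty_and_different_colour_of_eyes.jpg": ["White cats", "Odd-eyed cats"],
}
answer = odd_one_out(images)
```

Because the categories are public, an attacker who can match an image back to its Commons page could run the same lookup, which is the concern raised earlier in this thread.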

A few questions/remarks

 * What specifically will be the data (image) source? Wikidata or did you mean Commons?
 * I don't quite understand how the image indexing system would work. What does "downrate given image's in option" mean?
 * Have you thought about implementing it for mobile too? It shouldn't be much more work.
 * Try to clean the proposal up a little bit. It will read much better with proper punctuation (in particular with spaces after, not before a period).