Talk:Requests for comment/CAPTCHA

Issue: image classification CAPTCHAs need a secret corpus
Image classification CAPTCHAs like ASIRRA, by definition, require the user to classify images in a way that is difficult for a computer. Unfortunately, this implies that a computer cannot easily verify the correctness of the classification either. Therefore, in order to work, any such scheme needs a large corpus of human-classified images. (If the corpus is too small, spammers can just learn it.)
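To put a rough number on "too small", here is a minimal sketch, not a measurement. It assumes each challenge shows a fixed number of images drawn uniformly at random (with replacement) from the corpus, and that an attacker records every image seen. The function name and parameters are made up for illustration.

```python
def expected_fraction_seen(corpus_size: int,
                           images_per_challenge: int,
                           challenges: int) -> float:
    """Expected fraction of the corpus an attacker has seen.

    Assumes uniform sampling with replacement, so each draw misses a
    given image with probability (1 - 1/corpus_size).
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    draws = images_per_challenge * challenges
    miss_probability = (1.0 - 1.0 / corpus_size) ** draws
    return 1.0 - miss_probability


if __name__ == "__main__":
    # Hypothetical numbers: 12 images per challenge, 10,000 challenges.
    for size in (10_000, 1_000_000, 100_000_000):
        fraction = expected_fraction_seen(size, 12, 10_000)
        print(f"corpus of {size:>11,} images: {fraction:.2%} seen")
```

Under these assumptions, the fraction learned depends only on the ratio of total draws to corpus size, so the corpus has to be large relative to the number of challenges an attacker can afford to request.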

Now, as it happens, the WMF does have a large corpus of human-classified images: Commons and its category system. Unfortunately, because this corpus is public, anyone could, in principle at least, just download it and apply existing image recognition tools to compile a reverse index mapping images to categories. Worse, they may not even need to: instead, they can use existing public image search engines like TinEye or Google Image Search to find the images' description pages on Commons, from which they can then extract the categories or whatever other information they need.
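For illustration, the reverse-index attack could look roughly like the sketch below. It is a toy, not a description of any real tool: images are plain grids of grayscale values, the fingerprint is a simple average hash (standing in for whatever image recognition an attacker would actually use), and the corpus, category names and function names are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Image = Sequence[Sequence[int]]  # rows of grayscale values, 0-255


def average_hash(img: Image) -> int:
    """Fingerprint: one bit per pixel, set if the pixel is above the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def build_reverse_index(
    corpus: Dict[str, Tuple[Image, List[str]]]
) -> Dict[int, List[str]]:
    """Map each image fingerprint to the categories of that image."""
    return {average_hash(img): cats for img, cats in corpus.values()}


def lookup(index: Dict[int, List[str]], img: Image,
           max_distance: int = 2) -> Optional[List[str]]:
    """Return the categories of the nearest indexed image, if close enough."""
    if not index:
        return None
    h = average_hash(img)
    best = min(index, key=lambda k: hamming(k, h))
    return index[best] if hamming(best, h) <= max_distance else None


# Hypothetical public corpus: file name -> (pixels, categories).
corpus = {
    "File:Cat.png": ([[10, 10, 200, 200], [10, 10, 200, 200],
                      [200, 200, 10, 10], [200, 200, 10, 10]], ["Cats"]),
    "File:Map.png": ([[200] * 4, [200] * 4, [10] * 4, [10] * 4], ["Maps"]),
}
index = build_reverse_index(corpus)

# A slightly brightened copy, as a CAPTCHA might serve it.
noisy_cat = [[p + 5 for p in row] for row in corpus["File:Cat.png"][0]]
print(lookup(index, noisy_cat))  # ['Cats']
```

The point is only that once the corpus and its categories are public, answering a challenge reduces to a nearest-match lookup; no actual image understanding is needed.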

Now, granted, TinEye's and Google's coverage of Commons images is not currently perfect, but incomplete coverage is not something we should want to persist. Furthermore, based on a quick test, at least Google's coverage is actually pretty good: out of ten images selected using the random file link on Commons, Google more or less correctly identified eight from their thumbnails, while TinEye found matches for four. (For two of the images, which happened to be locator maps, both Google and TinEye returned other maps of the same region using the same base map. Otherwise, Google mostly returned exact matches to Commons or Wikipedia, whereas TinEye mostly found copies from other sites.)

ASIRRA gets around this problem by using a proprietary image database contributed to Microsoft by Petfinder.com; only a small fraction of this database is publicly viewable, making database-cloning attacks infeasible. In principle, the WMF could do the same, either by relying on a third-party database or by compiling its own. However, both methods have their problems: using a third party would introduce an external dependency, something which the WMF has been unwilling to do in the past, while spending precious volunteer effort to compile a massive secret image classification database would seem perverse, given that the same effort could be spent on improving our public image categorization. --Ilmari Karonen (talk) 13:27, 4 September 2012 (UTC)