Thread:Talk:Requests for comment/Database field for checksum of page text/Similarity measure/reply (5)

We need a convincing use case before considering this :) And I really think that Lucene, Mahout, and the XML dump files are the way to go. Lucene has built-in similarity functionality, and so does Mahout. You can create an index using Lucene and provide it to Mahout as input, then calculate similarity scores for different measures. Drdee 18:29, 16 November 2011 (UTC)