Thread:Extension talk:Lucene-search/Spellcheck an entire wiki/reply

Hi

Interesting suggestions. Search engines do have a good capability for spell checking search queries. This roughly works by looking at cases where a specific word or phrase returns zero (or almost zero) results, and then finding the best-scoring close match, filtering on the other, well-matching words in the query.

As you can see, this allows the search engine to do two special things: detect a spelling mistake without knowing about the language, and suggest a correction based on the context of the other words, again without actually knowing anything about the language. So this is basically a semantic algorithm.
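To make the idea concrete, here is a minimal sketch of that zero-hit approach. The toy index, the `hit_count` helper, and the thresholds are all invented for illustration; a real search engine would consult its actual index and scoring instead.

```python
from difflib import get_close_matches

# Toy "index": term -> number of documents containing it.
# In a real engine this would be the live search index.
INDEX = {
    "mediawiki": 120, "lucene": 80, "search": 300,
    "extension": 150, "spellcheck": 40,
}

def hit_count(term):
    """How many documents match the term in the toy index."""
    return INDEX.get(term, 0)

def suggest(query, zero_threshold=1):
    """For each query term with (almost) zero hits, propose the
    closest-spelled indexed term that actually returns results."""
    corrected = []
    for term in query.lower().split():
        if hit_count(term) >= zero_threshold:
            corrected.append(term)  # term returns results; leave it alone
            continue
        # Candidates close in spelling, ranked by how many hits they return.
        candidates = get_close_matches(term, INDEX, n=5, cutoff=0.7)
        best = max(candidates, key=hit_count, default=term)
        corrected.append(best)
    return " ".join(corrected)
```

For example, `suggest("lucene serch")` returns `"lucene search"`, because "serch" has zero hits while its close match "search" scores well. Note that the corrector never consulted a dictionary or any knowledge of English.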


 * This algorithm is not very useful for regular spell checking: when it meets unknown words it will generate false positives, trying to "correct" perfectly good words. Even with a good spell checker, there is a huge risk in running it unsupervised on a real wiki. Wikipedia (more than the Britannica) combines several varieties of English, so which spell checker should be used: British English, Australian, US, Indian?

It would be possible to code a hybrid contextual spell checker, especially if you had a large, robustly POS-tagged ngram set to compare against. However, this is not a small project and I do not know of an existing open-source implementation.
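The contextual part of such a hybrid checker might look roughly like this: rank candidate corrections by how often they appear next to the surrounding words in an ngram table. The bigram counts and helper names here are invented for the example; a real system would use a large corpus-derived (and ideally POS-tagged) ngram collection.

```python
# Toy bigram table: (previous word, word) -> corpus count.
# Purely illustrative; a real table would come from a large corpus.
BIGRAMS = {
    ("search", "engine"): 500, ("search", "query"): 320,
    ("spell", "checker"): 210, ("spell", "checking"): 180,
}

def context_score(prev_word, candidate):
    """How often the candidate follows prev_word in the ngram table."""
    return BIGRAMS.get((prev_word, candidate), 0)

def best_in_context(prev_word, candidates):
    """Pick the candidate correction that best fits the left context."""
    return max(candidates, key=lambda c: context_score(prev_word, c))
```

So given the context word "search" and the spelling candidates `["engine", "enjoy"]`, the ranker picks "engine", since it is the one that actually co-occurs with "search". The dictionary-based part of the checker would supply the candidate list; the ngram part only disambiguates.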

We have a related project on our roadmap to generate the ngram collections, but the GSoC project was canceled this month, so this is now unscheduled.