User:KStoller-WMF/sandbox

Evaluation:
We have completed an initial evaluation of sample copyedits generated with LanguageTool and Hunspell. To compare the two tools, our research team created a list of about 100 sample copyedits from Wikipedia articles for arwiki, bnwiki, cswiki, and eswiki (the pilot wikis) and for enwiki (as a test case for debugging).

Methodology:

 * Started with a subset consisting of the first 10,000 articles from the HTML dumps, using the 20220801 snapshot of the respective wiki.
 * Extracted the plain text from the HTML version of each article (trying to remove any tables, images, etc.).
 * Ran LanguageTool and the Hunspell-spellchecker on the plain text.
 * Applied a series of filters to decrease the number of false positives (further details are available in this Phabricator task).
 * Selected the first 100 articles with at least one error remaining after filtering, considering only articles that had not been edited in at least one year. For each article, we picked one error at random, giving 100 errors from 100 different articles.
 * Growth Ambassadors evaluated the samples in their first language and judged whether each suggested copyedit was accurate, incorrect, or unclear (the suggestion wasn't clearly right or wrong), or whether they were unsure.
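
The sampling step can be sketched roughly as follows. This is a minimal illustration, not the research team's actual code: the record fields (`title`, `last_edited`, `errors`), the function name, and the choice to measure the one-year window back from the snapshot date are all assumptions made for the example.

```python
import random
from datetime import date, timedelta


def sample_errors(articles, snapshot=date(2022, 8, 1), n=100, seed=0):
    """Pick one random error from each of the first n eligible articles.

    `articles` is an ordered list of dicts with hypothetical keys:
    "title" (str), "last_edited" (date), and "errors" (list of errors
    that survived filtering).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    cutoff = snapshot - timedelta(days=365)
    sample = []
    for article in articles:
        if len(sample) == n:
            break
        # Skip articles edited within the last year (assumed to be
        # measured relative to the dump snapshot date).
        if article["last_edited"] > cutoff:
            continue
        # Skip articles with no errors left after filtering.
        if not article["errors"]:
            continue
        # One error per article, so n errors come from n different articles.
        sample.append((article["title"], rng.choice(article["errors"])))
    return sample
```

Taking the first eligible articles in dump order, rather than a random subset, mirrors the "first 100 articles" wording above; only the choice of error within each article is randomized.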

Results:

 * LanguageTool supports 31 languages, but only two of the Growth team pilot languages are among them: Spanish and Arabic. LanguageTool's copyedits were judged over 50% accurate across all wikis.
 * Hunspell's copyedits were judged less than 50% accurate across all wikis (the best case was English at 39%, while Czech reached only 19% and Bengali yielded 0% correct suggestions).

[Figure: Copy edit research - Hunspell.png]
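
For reference, a per-wiki accuracy figure like the ones above could be derived from the ambassadors' labels along these lines. This is a sketch under an assumption the write-up does not state: that "unsure" and "unclear" judgments stay in the denominator, so accuracy is the share of all judged samples labeled accurate. The label strings and function name are hypothetical.

```python
from collections import Counter


def accuracy(labels):
    """Share of judged samples labeled "accurate".

    Assumption: "incorrect", "unsure", and "unclear" all count against
    the tool, i.e. they remain in the denominator.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no judgments, nothing to report
    return counts["accurate"] / total


# Toy labels for illustration only, not real evaluation data.
toy = ["accurate", "incorrect", "unsure", "accurate"]
print(accuracy(toy))  # 0.5
```

If unsure or unclear judgments were instead excluded from the denominator, the reported percentages would be higher, so the convention matters when comparing tools.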

Next Steps:
Comments from the Growth Ambassadors provide good starting points for refining the filters to reduce the number of false positives and thus improve accuracy.

For languages not supported by an open-source copyediting tool, we might consider a rule-based approach that looks only for very specific errors, for example those drawn from a list of common misspellings.
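
A rule-based check of this kind might look like the sketch below. It is only an illustration of the idea: the misspelling list, the function name, and the simple word-matching rule are placeholders, and a real list would need to be curated per language.

```python
import re

# Hypothetical example list; a real one would be curated per language.
COMMON_MISSPELLINGS = {
    "recieve": "receive",
    "teh": "the",
    "seperate": "separate",
}


def find_misspellings(text, misspellings=COMMON_MISSPELLINGS):
    """Return (offset, word, suggestion) for each listed misspelling found."""
    suggestions = []
    for match in re.finditer(r"\w+", text):
        word = match.group(0)
        # Case-insensitive lookup; only exact list entries are flagged,
        # which keeps false positives low at the cost of recall.
        correction = misspellings.get(word.lower())
        if correction is not None:
            suggestions.append((match.start(), word, correction))
    return suggestions


print(find_misspellings("Teh cat will recieve it"))
# [(0, 'Teh', 'the'), (13, 'recieve', 'receive')]
```

Because only words on the list are ever flagged, this approach trades coverage for precision, which fits the goal of reducing false positives.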