Growth/Personalized first day/Structured tasks/Copyedit/ha

This page describes work on a "copyedit" structured task, which is a type of structured task that the Growth team may offer through the newcomer homepage. This page contains major assets, designs, open questions, and decisions. Most incremental updates on progress will be posted on the general Growth team updates page, with some large or detailed updates posted here.

Current status

 * 2021-07-19: create project page and begin background research.
 * 2022-08-12: add initial research results.
 * Next: complete manual evaluation.

Summary
Structured tasks are meant to break down editing tasks into step-by-step workflows that make sense for newcomers and make sense on mobile devices. The Growth team believes that introducing these new kinds of editing workflows will allow more new people to begin participating on Wikipedia, some of whom will learn to do more substantial edits and get involved with their communities. After discussing the idea of structured tasks with communities, we decided to build the first structured task: "add a link".

Even as we built that first task, we have been thinking about what subsequent structured tasks could be. We want newcomers to have multiple task types to choose from so that they can find the ones they like to do and can move on to more difficult ones as they learn more. The second task we started working on was "add an image". But in our initial community discussions of the idea of structured tasks, the task type that communities desired most was a task around copyediting: something related to spelling, grammar, punctuation, tone, etc. Here are our initial notes from looking into this and discussing with community members.

We know that there are many open questions around how this would work, and many potential reasons that it might not go right: what kind of copyediting are we talking about? Just spelling, or something more? Is there any sort of algorithm that will work well across all languages? These questions are why we are hoping to hear from lots of community members and have an ongoing discussion as we decide how to proceed.

Goals

 * We want to understand which types of copyediting tasks algorithms might be able to assist with.
 * We want to use an algorithm that can suggest tasks for a type of copyediting in articles across different languages.
 * We want to know how well the algorithm works (e.g. know which model works best from a set of existing models).

Literature review

 * What different subtasks are considered copyediting?
 * Identify different aspects of copyediting across the spectrum: typo/spelling to grammar to style/tone
 * What are existing approaches to copyediting in Wikipedia?
 * Communities such as Guild of Copy Editors or the Typo Team.
 * Maintenance-templates such as the copyedit-template.
 * Tools such as the moss-tool to identify typos (also JarBot in Arabic Wikipedia)
 * What existing, publicly available, commonly used tools are there for spell-checking, grammar, etc., such as hunspell, LanguageTool, or Grammarly?
 * We know that our communities prefer transparent algorithms, so it is easy for everyone to understand where suggestions come from.
 * What models are available from research in NLP and ML, for example for the task of Grammatical Error Correction?

Defining the task

 * Which aspect of copyediting will we model for the structured task?
 * Type of task: spelling, grammar, tone/style
 * For example: What can browser-spellcheckers do?
 * Granularity -- highlighting task on the level of: article, section, paragraph, sentence, word, sub-word
 * Depends on the task
 * Surface known items (e.g. from templates) or predict new ones?
 * Only suggest that improvement is needed, or suggest how to improve?
 * Suggesting improvement is easier for simpler tasks.
 * Simply highlighting that work is needed is easier for more complex tasks (e.g. style or tone)
 * Language support: how many languages do we aim to support?
 * Include Spanish and Portuguese as target languages alongside Arabic, Vietnamese, Bengali, Czech.
 * We ideally want to cover all languages, but will realistically need to evaluate solutions based on the depth of their language coverage.

Building a dataset for evaluation

 * Generate a test dataset (ideally in multiple languages) for the task, on which we can compare different algorithms. This can be achieved in different ways:
 * An existing benchmark dataset, such as CoNLL-2014 Shared Task on Grammatical Error Correction, or approaches for corpora generation (from Wikipedia)
 * Generate our own dataset from revision history using templates (copyedit) or edit summaries (typo)
 * Manual evaluation of output of models run on a set of sentences from Wikipedia.
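As a rough illustration of the edit-summary option above, the sketch below keeps only revisions whose summary hints at a copyedit. The revision dictionaries, the `comment` field name, and the keyword list are assumptions made for this example; they are not taken from the research.

```python
import re

# Hypothetical keyword list; a real one would need tuning per language.
COPYEDIT_SUMMARY = re.compile(r"\b(?:typo|spelling|copyedit)", re.IGNORECASE)


def candidate_revisions(revisions):
    """Return the revisions whose edit summary mentions a copyedit keyword."""
    return [rev for rev in revisions if COPYEDIT_SUMMARY.search(rev.get("comment", ""))]


# Invented sample data, only to show the shape of the input.
revisions = [
    {"id": 1, "comment": "fix typo"},
    {"id": 2, "comment": "added section on history"},
    {"id": 3, "comment": "Spelling"},
    {"id": 4, "comment": ""},
]
candidates = candidate_revisions(revisions)
```

Comparing each candidate revision with its parent would then give before/after sentence pairs, though many summaries are missing or misleading, so some manual checking would still be needed.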

Research results
A full summary of Research is available on MetaWiki: Research:Copyediting as a structured task.

Literature Review
Background research and literature review can be viewed here.

Main findings:


 * Simple spell- and grammar-checkers such as LanguageTool or Enchant are the most suitable for supporting copyediting across many languages, and they are open/free.
 * Some adaptation to the context of Wikipedia and of the structured task will be needed to reduce the sensitivity of the models; common approaches are to ignore everything in quotes or in linked text.
 * The challenge will be to develop a ground-truth dataset for testing. Most likely, some manual evaluation will be needed.
 * Long term: develop a model to highlight sentences that need editing (without suggesting a correction) based on copyedit templates. This could provide a set of more challenging copyediting tasks compared to spellchecking.

LanguageTool
We have identified LanguageTool as a candidate for surfacing possible copyedits in articles because:


 * It is open, actively developed, and supports 30+ languages.
 * The rule-based approach has the advantage that errors come with an explanation of why they were highlighted, rather than only a high score from an ML model. In addition, it lets the community add custom rules: https://community.languagetool.org/
 * Copyedits from LanguageTool go beyond dictionary-based spellchecking and also capture grammar and style errors.

We can get a very rough approximation of how well LanguageTool detects copyedits in Wikipedia articles by comparing the number of errors in featured articles with the number in articles that carry a copyedit template. We find that the performance is reasonable in many languages after applying a post-processing step in which we filter some of the errors from LanguageTool (e.g. those overlapping with links or bold text).
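The post-processing filter can be pictured as a span-overlap check. The sketch below is a minimal illustration, assuming that both the reported errors and the link/bold regions are available as character offsets into the same plain text; the `Span` type and function names are hypothetical and not the research team's code.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def filter_errors(errors: list[Span], protected: list[Span]) -> list[Span]:
    """Keep only errors that do not touch any protected span (links, bold text)."""
    return [e for e in errors if not any(overlaps(e, p) for p in protected)]


text = "The Eifel Tower is in Paris, teh capital."
protected = [Span(4, 15)]            # "Eifel Tower" is link text in this example
errors = [Span(4, 9), Span(29, 32)]  # "Eifel" and "teh"
kept = filter_errors(errors, protected)
```

Only the error on "teh" survives here. The trade-off is that a genuine typo inside link text is also discarded, which lowers coverage in exchange for fewer false positives.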

We also compared the performance of simple spellcheckers, which are available for more languages than LanguageTool supports. They can also surface many meaningful errors for copyediting but suffer from a much higher rate of false positives. This can be partially addressed by post-processing steps that filter the errors. Another disadvantage is that spellcheckers perform considerably worse than LanguageTool at suggesting the correct improvement for an error.

One potentially substantial improvement would be to develop a model that assigns a confidence score to the errors surfaced by LanguageTool or the spellchecker. This would allow us to prioritize, for the structured copyediting task, those errors for which we have high confidence that they are true copyedits. Some initial thoughts are in T299245.

Read more here: Research:Copyediting as a structured task/LanguageTool

Evaluation
We have completed an initial evaluation of sample copy edits utilizing LanguageTool and Hunspell. To compare how each tool worked for Wikipedia articles, our research team created a list of sample copy edits for 5 languages: Arabic, Bengali, Czech, Spanish (Growth pilot wikis) and English (as a test-case for debugging).

Methodology

 * Started with a subset of the first 10,000 articles from the HTML dumps, using the 20220801 snapshot of the respective wiki (arwiki, bnwiki, cswiki, eswiki, and enwiki).
 * Extracted the plain text from the HTML-version of the article (trying to remove any tables, images, etc).
 * Ran LanguageTool and the Hunspell-spellchecker on the plain text.
 * Applied a series of filters to decrease the number of false positives (further details available in this Phabricator task).
 * Selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article, only one error was selected randomly; thus for each language we had 100 errors from 100 different articles.
 * Growth Ambassadors evaluated the samples in their first language and decided whether each suggested edit was accurate, incorrect, or unclear (the suggestion wasn't clearly right or wrong), or whether they were unsure.
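The selection step in the methodology above can be sketched as follows. This is an illustration only: the `Article` record, its field names, and the fixed random seed are assumptions, and "not edited in at least 1 year" is approximated as 365 days.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    days_since_last_edit: int
    errors: list[str] = field(default_factory=list)  # errors left after filtering


def sample_errors(articles, n_articles=100, min_idle_days=365, seed=0):
    """Pick one random error from each of the first n eligible articles."""
    rng = random.Random(seed)  # seeded so the sample can be reproduced
    sample = []
    for article in articles:
        if article.days_since_last_edit < min_idle_days or not article.errors:
            continue  # recently edited, or nothing survived the filters
        sample.append((article.title, rng.choice(article.errors)))
        if len(sample) == n_articles:
            break
    return sample


# Invented articles, in dump order.
articles = [
    Article("A", 400, ["e1", "e2"]),
    Article("B", 10, ["e3"]),   # edited too recently
    Article("C", 500, []),      # no errors left
    Article("D", 800, ["e4"]),
]
sample = sample_errors(articles, n_articles=2)
```

Taking a single error per article keeps one long, error-heavy article from dominating the 100-item sample for a language.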

Hunspell
Hunspell's copy edits were judged less than 40% accurate on every wiki. Suggestions were accurate for 39% of English suggestions, 11% for Spanish, 32% for Arabic, 16% for Czech, and 0% for Bengali.

LanguageTool
LanguageTool first evaluation (V1 sample): LanguageTool currently supports ~30 languages, so only two of the Growth team pilot languages are supported: Spanish and Arabic. Including the English test case, LanguageTool's copy edits were judged 50% accurate or higher on all three wikis. Suggestions were accurate for 51% of English suggestions, 50% for Spanish, and 57% for Arabic.

LanguageTool second evaluation (V2 sample): We completed a second evaluation of LanguageTool as a way to surface copy edits in Wikipedia articles. We evaluated suggested errors in Arabic, English, and Spanish. In the previous evaluation we determined that certain rules often resulted in incorrect suggestions, so we added functionality to filter out those rules. The resulting suggestions were more accurate than in the V1 sample.
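For reference, an accuracy figure of this kind can be computed from the evaluators' labels as in the sketch below. It assumes that the denominator is every evaluated suggestion, including those marked unsure or unclear; the page does not state which denominator was used, and the labels shown are invented.

```python
from collections import Counter


def accuracy(labels):
    """Share of suggestions labelled 'accurate' among all evaluated suggestions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["accurate"] / total if total else 0.0


# Invented labels for four suggestions; not real evaluation data.
labels = ["accurate", "accurate", "incorrect", "unclear"]
share = accuracy(labels)
```

Excluding the unsure and unclear labels from the denominator would give a higher figure, so the choice should be stated alongside any reported percentage.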

Common Misspellings
For this evaluation we simply used a list of common misspellings curated by Growth pilot Ambassadors, and then checked for those misspellings in Wikipedia articles. Results looked promising, but we ended up with a fairly small sample in some languages. This might be a way to provide coverage for languages that aren't supported by LanguageTool. However, if we pursue this option further, we will test again with a longer list of misspellings to see whether we can get more representative and significant results (and a better sense of what level of coverage this solution would provide).
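A list-based check of this kind can be sketched in a few lines. The function name and the tiny English list are hypothetical, the list's keys are assumed to be lowercase, and whole-word matching with `\b` is a simplification that may need replacing for scripts where combining marks break word boundaries.

```python
import re


def find_misspellings(text, misspellings):
    """Return (offset, matched word, suggestion) for each whole-word match."""
    if not misspellings:
        return []
    # Longest entries first so a longer misspelling wins over a shorter prefix.
    words = sorted(misspellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), misspellings[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Invented list; a curated list would be supplied per language.
misspellings = {"teh": "the", "recieved": "received"}
matches = find_misspellings("Teh cat recieved a letter.", misspellings)
```

Because every entry is an explicit rule, each suggestion can be traced back to the list, which fits the preference for transparent suggestions noted earlier on this page.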

Next Steps
Consider how to better handle highly inflected and agglutinated languages, which likely won't benefit much from standard spell-checking approaches.

Further improve the LanguageTool filters to decrease the number of false positives and thus improve accuracy.

For languages not supported by an open source copy editing tool, we will consider a rule-based approach, i.e. only looking for very specific errors which could be based on a list of common misspellings. We will set up an additional test to estimate the accuracy and coverage of this type of approach.