Growth/Personalized first day/Structured tasks/Copyedit

Group:	Growth
Start:	2021-07-19
Team members:	Martin Gerlach (researcher), Gergő Tisza (software engineer), Benoît Evellin (community relations specialist), Elena Tonkovidova (QA engineer), Kosta Harlan (software engineer), Morten Warncke-Wang (data analyst), Rita Ho (senior user experience designer), Max Binder (coach), Mew Ophaswongse (software engineer)
Lead:	Kosta Harlan
Management:	Marcella Florence (engineering), Marshall Miller (product)
Updates:	Update summaries are posted here.

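This page describes the "copyedit" structured task, one of the structured tasks that the Growth team plans to offer through the newcomer homepage. It covers the major assets, designs, open questions, and decisions. Most incremental progress updates are posted on the general Growth team updates page; this page carries some of the larger or more detailed updates.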

Current status

2021-07-19: Project page created and background research started.
2022-08-12: Initial research results added.
Next: Complete the manual evaluation.

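Summary

Structured tasks are meant to break editing tasks down into small steps, packaging each workflow in a form that makes sense to newcomers and works well on mobile devices. The Growth team expects that introducing these new kinds of editing workflows will allow more new people to begin participating on Wikipedia, and that it will give them a starting point for learning more complex edits and getting involved with their communities. After discussing the idea of structured tasks with communities, we decided that the first structured task to build would be "add a link".

Even while building that first task, we were thinking about which structured tasks could come next. We want newcomers to be able to choose among several task types, so that they can find the ones that interest them and move on to increasingly difficult ones. The second task we started working on was "add an image". However, in our initial community discussions about structured tasks, the task that communities requested most was copyediting: spelling, word usage, grammar, punctuation, tone, and so on. Our notes from those early discussions are collected in these initial notes.

We know there are many open questions about how this would work, and many reasons it might not work well. What exactly do we mean by copyediting: spelling only, or something more? Is there an algorithm that works well across all languages? Because of these questions, we want to hear from community members broadly and keep the discussion going as we decide how to proceed.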

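Background research

Research plan

Goals

We want to understand which kinds of copyediting tasks could be assisted by an algorithm.
We want an algorithm that can suggest tasks for specific kinds of copyediting in articles across many languages.
We want to know how well such an algorithm works (for example, which of the existing models performs best).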

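Literature review

Which subtasks count as copyediting?
Identify the different aspects of copyediting: typos, spelling, grammar, style, tone
Which approaches to copyediting currently exist on Wikipedia?
- Communities such as the Guild of Copy Editors or the Typo Team.
- Maintenance templates such as the copyedit-template
- Tools for detecting typos, such as the moss-tool (Arabic Wikipedia also has JarBot)
Which publicly available tools are commonly used for spelling and grammar checking, for example hunspell, LanguageTool, or Grammarly?
- We understand that communities ideally want a transparent algorithm, that is, one where anyone can easily see where a suggestion came from.
- Which models already exist in NLP and ML research, for example for the task known as Grammatical Error Correction?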

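Defining the task

Which aspects of copyediting should the structured task model?
Type of task: spelling, grammar, tone or style
- Example: what can a browser spellchecker do?
Granularity, that is, the level at which the task is posed: article, section, paragraph, sentence, word, sub-word
- Depends on the task
Surface items that are already known (for example, from templates), or predict new ones?
Only flag that an improvement is needed, or also suggest how to improve?
- The simpler the task, the easier it is to suggest an improvement.
- For more complex tasks (for example, style or tone), it is easier to just highlight that work is needed.
Language support: how many languages will be supported?
- The target languages are Arabic, Vietnamese, Bengali, and Czech, plus Spanish and Portuguese.
- Ideally we would cover every language, but in practice we need to evaluate solutions by the depth of their language coverage.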

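Building a dataset for evaluation

Build a test dataset for a specific task (in as many languages as possible) so that different algorithms can be compared. This can be done in several ways:
- Existing benchmark datasets, such as the CoNLL-2014 Shared Task on Grammatical Error Correction or corpus-building efforts based on Wikipedia
- Generate our own dataset from revision histories, using templates (copyedit) or edit summaries (typos)
- Manually evaluate the output of models run on sentences extracted from Wikipedia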
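As a rough illustration of the edit-summary route, the sketch below picks out revisions whose edit summary suggests a typo fix. It is only a sketch under stated assumptions: the revision records are plain dictionaries with hypothetical `revid` and `comment` keys, and the keyword list is an English-only placeholder that would need a per-language equivalent.

```python
import re

# Assumption: English keywords only; other wikis would need their own list.
TYPO_RE = re.compile(r"\b(typo|typos|spelling|misspell\w*)\b", re.IGNORECASE)


def typo_revisions(revisions):
    """Return the revisions whose edit summary mentions a typo or spelling fix.

    `revisions` is an iterable of dicts with a (possibly missing or None)
    "comment" entry holding the edit summary. The dict layout is hypothetical.
    """
    return [r for r in revisions if TYPO_RE.search(r.get("comment") or "")]


revisions = [
    {"revid": 1, "comment": "fix typo"},
    {"revid": 2, "comment": "added section"},
    {"revid": 3, "comment": None},
    {"revid": 4, "comment": "Spelling correction"},
]
candidates = typo_revisions(revisions)
```

Each candidate revision would still need to be diffed against its parent to recover the actual before/after text, and some summaries will be misleading, so a manual check of a sample is probably still needed.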

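Research results

A full summary of the research is available on Meta-Wiki: Research:Copyediting as a structured task.

Literature review

See the full background research and literature review.

Main findings:

Simple spelling and grammar checkers such as LanguageTool and Enchant are open, free, and available for many languages, which makes them the most suitable option for supporting copyediting.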
Some adaptation to the context of Wikipedia and the structured task will be required in order to decrease the sensitivity of the models; common approaches are to ignore everything in quotes or text that is linked.
The challenge will be to develop a ground-truth dataset for backtesting. Likely, some manual evaluation will be needed.
Long-term: Develop a model to highlight sentences that require editing (without necessarily suggesting a correction) based on copyediting templates. This could provide a set of more challenging copyediting tasks compared to spellchecking.
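The adaptation mentioned in the findings above (ignoring quoted or linked text) could look roughly like the following sketch. It assumes that a checker returns flagged spans as character offsets into the plain text and that link positions are already known as `(start, end)` pairs; the `Match` type, the function names, and the simple quote pattern are all illustrative and not taken from LanguageTool or from the team's code.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    """A flagged span (hypothetical shape, not a real checker's output type)."""
    offset: int   # start position of the flagged span in the text
    length: int   # number of characters flagged
    message: str  # explanation attached to the flag


# Deliberately simple: straight or curly double quotes on a single line.
QUOTE_RE = re.compile(r'"[^"\n]*"|“[^”\n]*”')


def overlaps(start, end, spans):
    """True if [start, end) intersects any (span_start, span_end) pair."""
    return any(start < s_end and s_start < end for s_start, s_end in spans)


def filter_matches(text, matches, link_spans=()):
    """Drop matches that fall inside quoted text or inside a link span."""
    protected = [m.span() for m in QUOTE_RE.finditer(text)]
    protected.extend(link_spans)
    return [
        m for m in matches
        if not overlaps(m.offset, m.offset + m.length, protected)
    ]
```

Filtering on offsets keeps the checker itself untouched, so the same step can be reused for different tools as long as their output can be mapped to this shape.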

LanguageTool

We have identified LanguageTool as a candidate to surface possible copyedits in articles because:

It is open, is being actively developed, and supports 30+ languages
The rule-based approach has the advantage that errors come with an explanation of why they were highlighted, rather than just a high score from an ML model. In addition, it provides functionality for the community to add custom rules: https://community.languagetool.org/
The copyedits from LanguageTool go beyond dictionary-based spellchecking of single words and also capture grammatical and style errors.

We can get a very rough approximation of how well LanguageTool works for detecting copyedits in Wikipedia articles by comparing the number of errors in featured articles with the number in articles containing a copyedit-template. We find that the performance is reasonable in many languages after applying a post-processing step in which we filter some of the errors from LanguageTool (e.g. those overlapping with links or bold text).
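One way to make such a comparison concrete is to normalize error counts by article length before comparing the two groups. The sketch below does this with errors per 1,000 words; the normalization, the function name, and the input shape are assumptions for illustration, and the numbers in the usage lines are invented, not results from the research.

```python
def errors_per_1000_words(articles):
    """Aggregate error rate for a group of articles.

    `articles` is a list of (word_count, error_count) pairs, where the
    error count is taken after any post-processing filters.
    """
    total_words = sum(words for words, _ in articles)
    total_errors = sum(errors for _, errors in articles)
    if total_words == 0:
        raise ValueError("cannot compute a rate for zero words")
    return 1000 * total_errors / total_words


# Invented numbers, for illustration only.
featured = [(2000, 4), (3000, 6)]
copyedit_tagged = [(1000, 9), (1500, 11)]

featured_rate = errors_per_1000_words(featured)
tagged_rate = errors_per_1000_words(copyedit_tagged)
```

If the tool is picking up real problems, the rate for articles carrying a copyedit-template should come out clearly higher than the rate for featured articles.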

We also compared the performance of simple spellcheckers which are available for more languages than supported by LanguageTool. They can also surface many meaningful errors for copyediting but suffer from a much higher rate of false positives. This can be partially addressed by post-processing steps to filter the errors. Another disadvantage is that spellcheckers perform considerably worse than LanguageTool in suggesting the correct improvement for the error.

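One potentially substantial improvement would be to develop a model that assigns a confidence score to the errors surfaced by LanguageTool or a spellchecker. This would let us prioritize, for the copyedit structured task, the errors that we are highly confident are true copyedits. Some initial thoughts are collected in T299245.

For more details, see: Research:Copyediting as a structured task/LanguageTool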

Evaluation

We have completed an initial evaluation of sample copy edits utilizing LanguageTool and Hunspell. To compare how each tool worked for Wikipedia articles, our research team created a list of sample copy edits for 5 languages: Arabic, Bengali, Czech, Spanish (Growth pilot wikis) and English (as a test-case for debugging).

Methodology

Started with a subset of the first 10,000 articles from the HTML dumps, using the 20220801 snapshot of the respective wiki (arwiki, bnwiki, cswiki, eswiki, and enwiki).
Extracted the plain text from the HTML-version of the article (trying to remove any tables, images, etc).
Ran LanguageTool and the Hunspell-spellchecker on the plain text.
Applied a series of filters to decrease the number of false positives (further details available in this Phabricator task).
Selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article, only one error was selected randomly; thus for each language we had 100 errors from 100 different articles.
Growth Ambassadors evaluated the samples in their first language and decided if the suggested edit was accurate, incorrect, if they were unsure, or if it was unclear (the suggestion wasn't clearly right or wrong).
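The sampling step described above can be sketched as follows. This is a minimal illustration, not the team's pipeline: it assumes each article is a dictionary with hypothetical `title`, `last_edited` (a `datetime`), and `errors` (the list left after filtering) keys, approximates "1 year" as 365 days, and measures it relative to the snapshot date.

```python
import random
from datetime import datetime, timedelta


def build_sample(articles, snapshot_date, n_articles=100, seed=0):
    """Pick one random error from each of the first `n_articles` eligible articles.

    An article is eligible if it has at least one error left after filtering
    and was last edited at least 365 days before `snapshot_date`.
    Articles are visited in the order given (for example, dump order).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    cutoff = snapshot_date - timedelta(days=365)
    sample = []
    for article in articles:
        if article["last_edited"] > cutoff:
            continue  # edited too recently
        if not article["errors"]:
            continue  # nothing left after filtering
        sample.append((article["title"], rng.choice(article["errors"])))
        if len(sample) == n_articles:
            break
    return sample
```

Taking only one error per article keeps a single long or error-heavy article from dominating the sample that the Ambassadors rate.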

Results

Hunspell

Hunspell copy edits were judged less than 40% accurate across all wikis. Suggestions were accurate for 39% of English suggestions, 11% of Spanish, 32% of Arabic, 16% of Czech, and 0% of Bengali.
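For reference, a percentage like the ones above can be derived from the Ambassadors' ratings as the share of samples rated accurate. The sketch below assumes that every rated sample counts in the denominator, including those rated unsure or unclear; the source does not say how those were treated, and the label strings and toy ratings are illustrative only.

```python
from collections import Counter

LABELS = ("accurate", "incorrect", "unsure", "unclear")


def accuracy_share(ratings):
    """Share of ratings labelled "accurate", over all rated samples.

    Assumption: "unsure" and "unclear" ratings stay in the denominator.
    """
    counts = Counter(ratings)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings given")
    return counts["accurate"] / total


# Toy ratings, not taken from the evaluation.
toy_ratings = ["accurate"] * 4 + ["incorrect"] * 5 + ["unsure"]
share = accuracy_share(toy_ratings)
```

Excluding unsure and unclear ratings from the denominator instead would raise the reported share, so the choice should be stated alongside any published figure.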

LanguageTool

LanguageTool first evaluation (V1 sample): LanguageTool currently supports ~30 languages, so only two of the Growth team pilot languages are supported: Spanish and Arabic. LanguageTool's copy edits were judged 50% accurate or higher across all three wikis evaluated (English, Spanish, and Arabic). Suggestions were accurate for 51% of English suggestions, 50% of Spanish, and 57% of Arabic.

LanguageTool second evaluation (V2 sample): We completed a second evaluation of LanguageTool as a way to surface copy edits in Wikipedia articles. We evaluated suggested errors in Arabic, English, and Spanish. In the previous evaluation we determined that certain rules often resulted in incorrect suggestions, so we added functionality to filter those rules. As a result, we ended up with a higher level of accuracy than in the V1 sample.

Common Misspellings

For this evaluation we simply used a list of common misspellings curated by Growth pilot Ambassadors, and then checked for those misspellings in Wikipedia articles. Results looked promising, but we ended up with a fairly small sample in some languages. This might be a way to provide coverage for languages that aren't supported by LanguageTool. However, if we pursue this option further, we will test again with a longer list of misspellings to see if we can get more representative and significant results (and a better sense of what level of coverage this solution would provide).
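A list-based check of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the evaluation code: the misspelling list is a two-entry English placeholder, and the `\w+` tokenization is a simplification that may split words in scripts that rely on combining marks.

```python
import re


def find_misspellings(text, misspellings):
    """Return (offset, word, suggestion) for each listed misspelling in `text`.

    `misspellings` maps a lowercase misspelling to its correction.
    Words are matched whole and case-insensitively.
    """
    hits = []
    for match in re.finditer(r"\w+", text):
        word = match.group(0)
        suggestion = misspellings.get(word.lower())
        if suggestion is not None:
            hits.append((match.start(), word, suggestion))
    return hits


# Placeholder list; a real one would be curated per language.
common_misspellings = {"recieve": "receive", "teh": "the"}
hits = find_misspellings("They recieve teh award.", common_misspellings)
```

Because every hit traces back to an explicit list entry, the source of each suggestion stays easy to explain, at the cost of only catching errors that someone has already listed.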

Next steps

Consider how to better handle highly inflected and agglutinated languages, which likely won't benefit much from standard spell-checking approaches.

Further improve the LanguageTool filters to decrease the number of false positives and thus further improve accuracy.

For languages not supported by an open source copy editing tool, we will consider a rule-based approach, i.e. only looking for very specific errors which could be based on a list of common misspellings. We will set up an additional test to estimate the accuracy and coverage of this type of approach.