Growth/Personalized first day/Structured tasks/Add an image/ja

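This page describes the work on the "add an image" structured task, one of the structured tasks that the Growth team plans to offer through the newcomer homepage.

This page covers major assets, designs, open questions, and decisions.

Most incremental updates on progress are posted on the general Growth team updates page. Some large or detailed updates are posted here.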



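Current status

 * 2020-06-22: initial thinking about ideas for creating a simple algorithm to recommend images
 * 2020-09-08: evaluated a first attempt at a matching algorithm in English, French, Arabic, Korean, Czech, and Vietnamese
 * 2020-09-30: evaluated a second attempt at a matching algorithm in English, French, Arabic, Korean, Czech, and Vietnamese
 * 2020-10-26: internal engineering discussion around a feasible image recommendation service
 * 2020-12-15: ran an initial round of user tests to begin understanding whether newcomers would be successful with this task
 * 2021-01-20: Platform Engineering team begins building a proof-of-concept API for image recommendations
 * 2021-01-21: Android team begins working on a minimal viable version for learning purposes
 * 2021-01-28: posted user test findings
 * 2021-02-04: posted summary of community discussions and coverage statistics
 * 2021-05-07: Android MVP released to users
 * 2021-08-06: posted results from Android and mockups for Iteration 1
 * 2021-08-17: backend work on Iteration 1 begins
 * 2021-08-23: posted interactive prototypes and began user tests in English and Spanish
 * 2021-10-07: posted findings from user tests and final designs based on those findings
 * 2021-11-19: ambassadors begin testing the feature in their production Wikipedias
 * 2021-11-22: image suggestion dataset refreshed in advance of releasing Iteration 1 to users
 * 2021-11-29: Iteration 1 deployed to 40% of mobile accounts on Arabic, Czech, and Bengali Wikipedias
 * 2021-12-22: posted leading indicators
 * 2022-01-28: deployed on desktop to 40% of new accounts on Arabic, Czech, and Bengali Wikipedias
 * 2022-02-16: Spanish Wikipedia newcomers start getting the "add an image" task
 * 2022-03-22: Portuguese, Persian, French, and Turkish Wikipedia newcomers start getting the "add an image" task
 * Next: deploy to the next group of wikis and analyze the conversion funnel in detail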

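Summary
Structured tasks are meant to break editing down into step-by-step workflows that make sense for newcomers and work on mobile devices. The Growth team believes that introducing these new kinds of editing workflows will allow more new people to begin participating on Wikipedia, some of whom will learn to do more substantial edits and get involved with their communities. After discussing the idea of structured tasks with communities, we decided to build the first structured task: "add a link".

After deploying "add a link" in May 2021, we collected initial data showing that the task was engaging for newcomers and that their edits had low revert rates. This suggests that structured tasks are valuable both for the newcomer experience and for the wikis.

Even as we built that first task, we were thinking about what the next structured task could be, and whether adding images would be a good fit for newcomers. The idea is that a simple algorithm would recommend images from Commons to place on articles that have no images. To start with, it would use only existing connections that can be found in Wikidata, and newcomers would use their own judgment to decide whether to place the image on the article.

We know that there are many open questions about whether this task would work, and many potential reasons it might not succeed. That is why we want to hear from as many community members as possible and to keep the discussion going as we decide how to proceed.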



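Related projects
The Android team is considering a minimal version of a similar task for the Wikipedia Android app, using the same underlying components. In addition, the Structured Data team is in the early stages of exploring something similar, aimed only at more experienced users and taking advantage of Structured Data on Commons. The discussion and progress described on this page are relevant to all of these teams.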



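Why images?
Looking for substantial contributions

When we first discussed structured tasks with community members, many pointed out that adding wikilinks is not a particularly high-value type of edit. Community members also brought up ideas for how newcomers could make more substantial contributions. One idea is images. Wikimedia Commons contains 65 million images, but in many Wikipedias over 50% of articles have no images. We believe that many images from Commons could make Wikipedia substantially more illustrated.

Interest from newcomers

We know that many newcomers are interested in adding images. "To add an image" is a common response newcomers give in the welcome survey when asked why they created their account. Questions about how to add images are also among the most frequent help panel questions, across all the wikis we work with. Most of these newcomers probably bring their own image that they want to add, but this still hints at how engaging and exciting images can be. That makes sense, given how image-heavy the other platforms that newcomers participate in are (such as Instagram and Facebook).

Difficulty of working with images

The many help panel questions about images reflect that the process of adding them to articles is too difficult. Newcomers have to understand the difference between Wikipedia and Commons, the rules around copyright, and the technical parts of inserting and captioning an image in the right place. Finding an image in Commons for an unillustrated article requires even more skills, such as knowledge of Wikidata and categories.

Success of the "Wikipedia Pages Wanting Photos" campaign

The Wikipedia Pages Wanting Photos (WPWP) campaign was a surprising success: 600 users added images to 85,000 articles. They did this with the help of several community tools that identify pages with no images and suggest potential images through Wikidata. While there are important lessons to be learned about how to help newcomers succeed with adding images, this gives us confidence that users can be enthusiastic about adding images and that they can be assisted by tools.

Taking this together

Thinking about all this information together, we believe it is possible to build an "add an image" structured task that is both fun for newcomers and productive for Wikipedia.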

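Idea validation
''From June 2020 through July 2021, the Growth team worked on community discussions, background research, evaluations, and proofs of concept around the "add an image" task. This led to the decision to start building a first iteration in August 2021 (see Iteration 1). This section contains all of the background work leading up to Iteration 1.''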

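Algorithm
Our ability to make a structured task for adding images depends on whether we can create an algorithm that generates sufficiently good recommendations. We definitely do not want to urge newcomers to add the wrong images to articles, leaving patrollers to clean up after them. Therefore, seeing whether we could make a good algorithm was one of the first things we worked on.

Logic
We have been working with the Wikimedia Research team, and so far we have been testing an algorithm that prioritizes accuracy and human judgment. Rather than using any computer vision, which could generate unexpected results, it simply aggregates existing information in Wikidata, drawing on connections made by experienced contributors. These are the three main ways that it suggests matches for unillustrated articles:


 * Look at the Wikidata item for the article. If it has an image (P18), choose that image.
 * Look at the Wikidata item for the article. If it has an associated Commons category (P373), choose an image from that category.
 * Look at the articles about the same topic in other language Wikipedias. Choose a lead image from those articles.

The algorithm also includes logic to do things like exclude images that are likely to be icons or that are present on an article as part of a navbox.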
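The three sources above can be sketched as a simple fallback chain. This is only an illustration, not the production algorithm: the `ArticleData` container, its field names, and the idea of trying the sources in this fixed order are all assumptions made for the example, and the inputs are assumed to have been fetched already (no Wikidata or Commons API calls are shown).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ArticleData:
    """Hypothetical, pre-fetched inputs for one unillustrated article."""
    wikidata_image: Optional[str] = None                               # value of P18, if any
    commons_category_images: List[str] = field(default_factory=list)   # files in the P373 category
    other_wiki_lead_images: List[str] = field(default_factory=list)    # lead images on other wikis


def suggest_image(article: ArticleData) -> Optional[Tuple[str, str]]:
    """Return (filename, source) from the first source that has a candidate.

    The priority order (P18, then Commons category, then cross-wiki)
    is an assumption for this sketch.
    """
    if article.wikidata_image:
        return (article.wikidata_image, "wikidata")
    if article.commons_category_images:
        return (article.commons_category_images[0], "commons-category")
    if article.other_wiki_lead_images:
        return (article.other_wiki_lead_images[0], "cross-wiki")
    return None  # no suggestion for this article
```

Recording the source alongside the filename makes it possible to compare how accurate each source is in later evaluations.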

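Accuracy
As of August 2021, we have gone through three rounds of testing the algorithm, each time looking at matches to articles in six languages: English, French, Arabic, Vietnamese, Czech, and Korean. The evaluations were done by our team's ambassadors and other expert Wikimedians who are native speakers of the languages being tested.

First two evaluations

Looking at 50 suggested matches in each language, we classified them into these groups:

A question runs throughout the work on the algorithm: how accurate does it need to be? If 75% of matches are good, is that enough? Does it need to be 90% accurate? Or could accuracy be as low as 50%? This depends on how good the judgment of the newcomers using it is, and on how much patience they have for weak matches. We will learn more about this when we test the algorithm with real newcomers.

In the first evaluation, the most important outcome was that we found many easy improvements to make to the algorithm, including types of articles and images to exclude. Even without those improvements, about 20-40% of matches were "2s", meaning great matches for the article (depending on the wiki). You can see the full results and notes from the first evaluation here.

For the second evaluation, many improvements were incorporated and accuracy increased. Between 50% and 70% of matches were "2s" (depending on the wiki). But increasing accuracy can decrease coverage, i.e. the number of articles for which we can make matches. Using conservative criteria, the algorithm may only be able to suggest tens of thousands of matches in a given wiki, even if that wiki has hundreds of thousands or millions of articles. We believe that this volume would be sufficient to build an initial version of the feature. You can see the full results and notes from the second evaluation here.

Third evaluation

In May 2021, the Structured Data team ran a much larger test of the image matching algorithm (and the MediaSearch algorithm) in Arabic, Cebuano, English, Vietnamese, Bengali, and Czech Wikipedias. In this test, about 500 matches from both the image matching algorithm and MediaSearch were evaluated by experts in each language, who classified each match as "Good", "Okay", or "Bad". The results detailed below show that:


 * The image matching algorithm is 65-80% accurate, depending on whether one counts "Good" or "Good+Okay", and depending on the wiki and evaluator. Interestingly, in our experience of evaluating image matches, expert Wikimedians frequently disagree with each other, because everyone has their own standards for whether an image belongs in an article.
 * Wikidata P18 ("Wikidata") is the strongest source of matches, ranging from 85% to 95% accurate. Images from other Wikipedias ("Cross-wiki") and images from Commons categories attached to Wikidata items ("Commons category") are both less accurate, to a similar degree.
 * Images from other Wikipedias ("Cross-wiki") are the most common source of matches. In other words, more of those are available to the algorithm than from the other two sources.

The full dataset of results can be found here.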
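To make the difference between counting "Good" and counting "Good+Okay" concrete, here is a minimal sketch of how such an accuracy figure could be computed from expert ratings. The function name and the rating labels as lowercase strings are assumptions for illustration, and the sample ratings are invented toy values, not data from the evaluation.

```python
from collections import Counter
from typing import Iterable


def accuracy(ratings: Iterable[str], count_okay: bool = False) -> float:
    """Share of matches rated "good" (optionally "good" or "okay")."""
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings to evaluate")
    accepted = counts["good"] + (counts["okay"] if count_okay else 0)
    return accepted / total


# Invented toy ratings, for illustration only.
toy_ratings = ["good"] * 6 + ["okay"] * 2 + ["bad"] * 2
strict = accuracy(toy_ratings)                     # counts only "good"
lenient = accuracy(toy_ratings, count_okay=True)   # counts "good" and "okay"
```

The same set of ratings yields two different accuracy figures, which is why the reported range depends on the counting rule as well as on the wiki and evaluator.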

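Coverage
The accuracy of the algorithm is clearly a very important component. Equally important is its "coverage", meaning how many image matches it can make. Accuracy and coverage tend to be inversely related: the more accurate an algorithm is, the fewer suggestions it makes (because it only makes suggestions when it is confident). We need to answer these questions: can the algorithm provide enough matches to make it worthwhile to build a feature with it? Would it be able to make a substantial impact on the wikis? We looked at 22 Wikipedias to get a sense of the answers. The table is below these summary points:


 * The coverage numbers reflected in the table seem sufficient for a first version of an "add an image" feature. There are enough candidate matches in each wiki that (a) users won't run out, and (b) the feature could make a substantial impact on how illustrated a wiki is.
 * Wikis range from 20% unillustrated (Serbian) to 69% unillustrated (Vietnamese).
 * We can find between 7,000 (Bengali) and 155,000 (English) unillustrated articles with match candidates. In general, this is a sufficient volume for a first version of the task, so that users have plenty of matches to work on. In some of the sparser wikis, like Bengali, the numbers might become small once users narrow to topics of interest. That said, Bengali only has about 100,000 articles in total, so we would be proposing matches for 7% of them, which is substantial.
 * In terms of how much improvement in illustration we could make to the wikis with this algorithm, the ceiling ranges from 1% (cebwiki) to 9% (trwiki). That is the overall percentage of additional articles that would end up illustrated if every match were good and were added to the wiki.
 * The wikis with the lowest percentage of unillustrated articles for which we can find matches are arzwiki and cebwiki, which both have a high volume of bot-created articles. This makes sense because many of those articles are about specific towns or species that wouldn't have images in Commons. But because those wikis have so many articles, there are still tens of thousands for which the algorithm has matches.
 * Further in the future, we hope that improvements to the image matching algorithm, to MediaSearch, or to workflows for uploading, captioning, and tagging images will yield more candidate matches.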
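The "ceiling" figure above is a simple ratio, and the Bengali example can be checked directly. The sketch below uses the two rounded Bengali numbers quoted in the text (about 7,000 candidate articles out of about 100,000 total); the function name is a hypothetical helper, not part of any existing tool.

```python
def coverage_ceiling(articles_with_candidates: int, total_articles: int) -> float:
    """Upper bound on the share of a wiki's articles that could gain an image,
    assuming every candidate match is good and is actually added."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return articles_with_candidates / total_articles


# Rounded Bengali Wikipedia figures quoted above.
bengali = coverage_ceiling(7_000, 100_000)
print(f"{bengali:.0%}")  # 7%
```

Because the calculation assumes every match is accepted, the real improvement would be lower once accuracy and newcomer judgment are taken into account.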

MediaSearch
As mentioned above, the Structured Data team is exploring using the MediaSearch algorithm to increase coverage and yield more candidate matches.

MediaSearch works by combining traditional text-based search and structured data to provide relevant results for searches in a language-agnostic way. By using the Wikidata statements added to images as part of Structured Data on Commons as a search ranking input, MediaSearch is able to take advantage of aliases, related concepts, and labels in multiple languages to increase the relevance of image matches. You can find more information about how MediaSearch works here.

As of February 2021, the team is experimenting with how to provide a confidence score for MediaSearch matches that the image recommendations algorithm can consume and use to determine whether a match from MediaSearch is of sufficient quality to use in image matching tasks. We want to be sure that users are confident in the recommendations that MediaSearch provides before incorporating them into the feature.
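One way a consumer could use such a confidence score is a plain threshold filter, sketched below. The `Match` type, the assumption that scores fall between 0 and 1, and the default threshold of 0.8 are all hypothetical; the text does not specify how the score is defined or what cut-off would be appropriate.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Match:
    article: str
    image: str
    confidence: float  # hypothetical score, assumed to lie in [0, 1]


def usable_matches(matches: Iterable[Match], threshold: float = 0.8) -> List[Match]:
    """Keep only matches whose confidence meets the (assumed) threshold."""
    return [m for m in matches if m.confidence >= threshold]
```

Raising the threshold trades coverage for accuracy, the same tension described in the Coverage section.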

The Structured Data team is also exploring and prototyping a way for user-generated bots to use the results generated by both the image recommendations algorithm and MediaSearch to automatically add images to articles. This will be an experiment in bot-heavy wikis, in partnership with community bot writers. You can learn more about that effort or express interest in participating in the Phabricator task.

In May 2021, in the same evaluation cited in the "Accuracy" section above, MediaSearch was found to be far less accurate than the image matching algorithm. Where the image matching algorithm was about 78% accurate, matches from MediaSearch were about 38% accurate. Therefore, the Growth team is not planning to use MediaSearch in its first iteration of the "add an image" task.

Open questions
Images are such an important and visible part of the Wikipedia experience. It is critical that we think hard about how a feature enabling the easy adding of images would work, what the potential pitfalls might be, and what the implications would be for community members. To that end, we have many open questions, and we want to hear of more that community members can bring up.


 * Will our algorithm be sufficiently accurate such that plenty of good matches are provided?
 * What metadata from Commons and the unillustrated article do newcomers need in order to make a decision about whether to add the image?
 * Will newcomers have sufficiently good judgment when looking at recommendations?
 * Will newcomers who don't read English be equally able to make good decisions, given that much of Commons metadata is in English?
 * Will newcomers be able to write good captions to go along with images that they place in the articles?
 * How much should newcomers judge images based on their "quality" as opposed to their "relevance"?
 * Will newcomers think this task is interesting? Fun? Difficult? Easy? Boring?
 * How exactly should we determine which articles have no images?
 * Where in the unillustrated article should the image be placed? Is it sufficient to put it at the top of the article?
 * How can we be mindful of potential bias in the recommendations? For example, the algorithm may make many more matches for topics in Europe and North America.
 * Will such a workflow be a vector for vandalism? How can this be prevented?

Notes from community discussions 2021-02-04
Starting in December 2020, we invited community members to talk about the "add an image" idea in five languages (English, Bengali, Arabic, Vietnamese, Czech). The English discussion mostly took place on the discussion page here, with local language conversations on the other four Wikipedias. We heard from 28 community members, and this section summarizes some of the most common and interesting thoughts. These discussions are heavily influencing our next set of designs.


 *  Overall : community members are generally cautiously optimistic about this idea. In other words, people seem to agree that it would be valuable to use algorithms to add images to Wikipedia, but that there are many potential pitfalls and ways this can go wrong, especially with newcomers.
 * Algorithm
 * Community members seemed to have confidence in the algorithm because it is only drawing on associations coded into Wikidata by experienced users, rather than some sort of unpredictable artificial intelligence.
 * Of the three sources for the algorithm (Wikidata P18, interwiki links, and Commons categories), people agreed that Commons categories are the weakest (and that Wikidata is the strongest). This has been borne out in our testing, and we may exclude Commons categories from future iterations.
 * We got good advice on excluding certain kinds of pages from the feature: disambiguations, lists, years, good, and featured articles. We may also want to exclude biographies of living persons.
 * We should also exclude images that have a deletion template on Commons and that have been previously removed from the Wikipedia page.
 * Newcomer judgment
 * Community members were generally concerned that newcomers would apply poor judgment and give the algorithm the benefit of the doubt. We know from our user tests that newcomers are capable of using good judgment, and we believe that the right design will encourage it.
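 * In discussing the Wikipedia Pages Wanting Photos (WPWP) campaign, we learned that while many newcomers showed good judgment, some overzealous users can make many bad matches quickly, creating extra work for patrollers. We may want to add some sort of validation to prevent users from adding images too quickly, or from continuing to add images after being repeatedly reverted.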
 * Most community members affirmed that "relevance" is more important than "quality" when it comes to whether an image belongs. In other words, if the only photo of a person is blurry, that is usually still better than having no image at all.  Newcomers need to be taught this norm as they do the task.
 * Our interface should convey that users should move slowly and take care, as opposed to trying to get as many matches done as they can.
 * We should teach users that images should be educational, not merely decorative.
 * User interface
 * Several people proposed that we show users several image candidates to choose from, instead of just one. This would make it more likely that good images are attached to articles.
 * Many community members recommended that we allow newcomers to choose topic areas of interest (especially geographies) for articles to work with. If newcomers choose areas where they have some knowledge, they may be able to make stronger choices.  Fortunately, this would automatically be part of any feature the Growth team builds, as we already allow users to choose between 64 topic areas when choosing suggested edit tasks.
 * Community members recommend that newcomers should see as much of the article context as possible, instead of just a preview. This will help them understand the gravity of the task and have plenty of information to use in making their judgments.
 * Placement in the article
 * We learned about Wikidata infoboxes. We learned that for wikis that use them, the preference is for images to be added to Wikidata, instead of to the article, so that they can show up via the Wikidata infobox.  In this vein, we will be researching how common these infoboxes are on various wikis.
 * In general, it sounds like a rule of "place an image under the templates and above the content" in an article will work most of the time.
 * Some community members advised us that even if placement in an article isn't perfect, other users will happily correct the placement, since the hard work of finding the right image will already be done.
 *  Non-English users 
 * Community members reminded us that some Commons metadata elements can be language agnostic, like captions and depicts statements. We looked at exactly how common that was in this section.
 * We heard the suggestion that even if users aren't fluent with English, they may still be able to use the metadata if they can read Latin characters. This is because to make many of the matches, the user is essentially just looking for the title of the article somewhere in the image metadata.
 * Someone also proposed the idea of using machine translation (e.g. Google Translate) to translate metadata to the local language for the purposes of this feature.
 *  Captions 
 * Community members (and Growth team members) are skeptical about the ability of newcomers to write appropriate captions.
 * We received advice to show users example captions, and guidelines tailored to the type of article being captioned.
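The exclusion advice in the Algorithm notes above could be expressed as an eligibility check along these lines. This is a sketch only: the `Candidate` type, its field names, and the page-type labels are hypothetical, and it reads the advice about images as excluding an image if it either carries a deletion template or was previously removed from the page.

```python
from dataclasses import dataclass

# Page types that community members suggested excluding (labels are hypothetical).
EXCLUDED_PAGE_TYPES = {"disambiguation", "list", "year", "good", "featured"}


@dataclass(frozen=True)
class Candidate:
    page_type: str                     # hypothetical classification of the article
    image_has_deletion_template: bool  # image is tagged for deletion on Commons
    image_previously_removed: bool     # image was removed from this article before


def is_eligible(candidate: Candidate) -> bool:
    """Return True if the article/image pair may be suggested to a newcomer."""
    if candidate.page_type in EXCLUDED_PAGE_TYPES:
        return False
    if candidate.image_has_deletion_template or candidate.image_previously_removed:
        return False
    return True
```

Biographies of living persons are left out of the set because the notes describe that exclusion only as a possibility.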

Plan for user testing


Thinking about the open questions above, in addition to community input, we want to generate some quantitative and qualitative information to help us evaluate the feasibility of building an "add an image" feature. Though we have been evaluating the algorithm amongst staff and Wikimedians, it is important to see how newcomers react to it, and to see how they use their judgment when deciding on whether an image belongs in an article.

To that end, we are going to run tests with usertesting.com, in which people new to Wikipedia editing can go through potential image matches in a prototype and respond "Yes", "No", or "Unsure". We built a quick prototype for the test, backed with real matches from the current algorithm. The prototype just shows one match after another, all in a feed. The images are shown along with all the relevant metadata from Commons:


 * Filename
 * Size
 * Date
 * User
 * Description
 * Caption
 * Categories
 * Tags

Though this may not be what the workflow would be like for real users in the future, the prototype was made so that testers could go through lots of potential matches quickly, generating lots of information.

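'''To try out the interactive prototype, use this link.''' Note that this prototype is primarily for viewing the matches from the algorithm; we have not yet thought in detail about the actual user experience. It does not create any edits. It contains 60 real matches proposed by the algorithm.

Here is what we will be looking for in the test:


 * 1) Are participants able to confidently confirm matches based on the suggestions and data provided?
 * 2) How accurate are participants at evaluating suggestions? And how does their actual aptitude compare to their perceived ability in evaluating suggestions?
 * 3) How do participants feel about the task of adding images to articles this way? Do they find it easy or hard, interesting or boring, rewarding or irrelevant?
 * 4) What information do participants find most valuable in helping them evaluate image and article matches?
 * 5) Are participants able to write good captions for images using the data provided?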

Concept A vs. B
In thinking about design for this task, we have a similar question as we faced for "add a link" with respect to Concept A and Concept B. In Concept A, users would complete the edit at the article, while in Concept B, they would do many edits in a row all from a feed. Concept A gives the user more context for the article and editing, while Concept B prioritizes efficiency.

In the interactive prototype above, we used Concept B, in which the users proceed through a feed of suggestions. We did that because in our user tests we wanted to see many examples of users interacting with suggestions. That's the sort of design that might work best for a platform like the Wikipedia Android app. For the Growth team's context, we're thinking more along the lines of Concept A, in which the user does the edit at the article. That's the direction we chose for "add a link", and we think that it could be appropriate for "add an image" for the same reasons.

Single vs. Multiple
Another important design question is whether to show the user a single proposed image match, or give them multiple image matches to choose from. When giving multiple matches, there's a greater chance that one of the matches is a good one. But it also may make users think they should choose one of them, even if none of them are good. It will also be a more complicated experience to design and build, especially for mobile devices. We have mocked up three potential workflows:


 *  Single : in this design, the user is given only one proposed image match for the article, and they only have to accept or reject it. It is simple for the user.
 *  Multiple : this design shows the user multiple potential matches, and they could compare them and choose the best one, or reject all of them. A concern would be if the user feels like they should add the best one to the article, even if it doesn't really belong.
 *  Serial : this design offers multiple image matches, but the user looks at them one at a time, records a judgment, and then chooses a best one at the end if they indicated that more than one might match. This might help the user focus on one image at a time, but adds an extra step at the end.



User tests December 2020
 Background 

During December 2020, we used usertesting.com to conduct 15 tests of the mobile interactive prototype. The prototype contained only a rudimentary design, little context or onboarding, and was tested only in English with users who had little or no previous Wikipedia editing experience. We deliberately tested a rudimentary design earlier in the process so that we could gather lots of learnings. The primary questions we wanted to address with this test were around feasibility of the feature as a whole, not around the finer points of design:


 * 1) Are participants able to confidently confirm matches based on the suggestions and data provided?
 * 2) How accurate are participants at evaluating suggestions? And how does the actual aptitude compare to their perceived ability in evaluating suggestions?
 * 3) How do participants feel about the task of adding images to articles this way? Do they find it easy/hard, interesting/boring, rewarding/irrelevant?
 * 4) What metadata do participants find most valuable in helping them evaluate image and article matches?
 * 5) Are participants able to write good captions for images they deem a match using the data provided?

In the test, we asked participants to annotate at least 20 article-image matches while talking out loud. When they tapped yes, the prototype asked them to write a caption to go along with the image in the article. Overall, we gathered 399 annotations.

 Summary 

We think that these user tests confirm that we could successfully build an "add an image" feature, but it will only work if we design it right. Many of the testers understood the task well, took it seriously, and made good decisions -- this gives us confidence that this is an idea worth pursuing. On the other hand, many other users were confused about the point of the task, did not evaluate as critically, and made weak decisions -- but for those confused users, it was easy for us to see ways to improve the design to give them the appropriate context and convey the seriousness of the task.

 Observations 

'' To see the full set of findings, feel free to browse the slides. The most important points are written below the slides. ''




 * General understanding of the task of matching images to Wikipedia articles was reasonably good, given the minimal context provided for the tool and limited knowledge of Commons and Wikipedia editing. There are opportunities to boost understanding once the tool is redesigned in a Wikipedia UX.
 * The general pattern we noticed was: a user would look at an article's title and first couple sentences, then look at the image to see if it could plausibly match (e.g. this is an article about a church and this is an image of a church). Then they would look for the article's title somewhere in the image metadata, either in the filename, description, caption, or categories.  If they found it, they would confirm the match.
 * Each image matching task could be done quickly by someone unfamiliar with editing. On average, it took 34 seconds to review an image.
 * All said they would be interested in doing such a task, with a majority rating it as easy or very easy.
 * Perceived quality of the images and suggestions was mixed. Many participants focused on the image composition and other aesthetic factors, which affected their perception of the suggestion accuracy.
 * Only a few pieces of image metadata from Commons were critical for image matching: filename, description, caption, categories.
 * Many participants would, at times, incorrectly try to match an image to its own data, rather than to the article (e.g. "Does this filename seem right for the image?"). Layout and visual hierarchy changes that better focus on the article context for the suggested image should be explored.
 * “Streaks” of good matches made some participants more complacent with accepting more images -- if many in a row were "Yes", they stopped evaluating as critically.
 * Users did a poor job of adding captions. They frequently would write their explanation for why they matched the image, e.g. "This is a high quality photo of the guy in the article." This is something we believe can be improved with design and explanation for the user.

 Metrics 


 * Members of our team annotated all the image matches that were shown to users in the test, and we recorded the answers the users gave. In this way, we developed some statistics on how good of a job the users did.
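 * Of the 399 suggestions users encountered, they tapped "Yes" 192 times (48%).
 * Of those, 33 were not good matches and would probably be reverted if they were actually added to articles. This is 17%, and we call it the "likely revert rate".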
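The two percentages above follow directly from the counts reported in the list. The short check below reproduces them; the helper name `rate` is just for illustration.

```python
def rate(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


shown = 399         # suggestions shown to testers
accepted = 192      # suggestions where the tester tapped "Yes"
bad_accepted = 33   # accepted suggestions that our team judged to be poor matches

print(f"acceptance rate: {rate(accepted, shown):.0%}")            # 48%
print(f"likely revert rate: {rate(bad_accepted, accepted):.0%}")  # 17%
```

Note that the likely revert rate is measured against accepted suggestions (192), not against all suggestions shown (399).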

 Takeaways 


 * The "likely revert rate" of 17% is a really important number, and we want it to be as low as possible. On one hand, this number is close to or lower than the average revert rate for newcomer edits in Wikipedia (English is 36%, Arabic is 26%, French is 22%, Vietnamese is 11%). On the other hand, images have higher impact and higher visibility than small changes to the text or wording of an article. Taking into account that the workflow we tested was optimized for volume, not quality, and the kinds of changes we plan to make to it, we think that this revert rate would come down significantly.
 * We think that this task would work much better in a workflow that takes the user to the full article, as opposed to quickly showing them one suggestion after another in the feed. By taking them to the full article, the user would see much more context to decide if the image matches and see where it would go in the article. We think they would absorb the importance of the task: that they will actually be adding an image to a Wikipedia article. Rather than going for speed, we think the user would be more careful when adding images. This is the same decision we came to for "add a link" when we decided to build the "Concept A" workflow.
 * We also think outcomes will be improved with onboarding, explanation, and examples. This is especially true for captions.  We think if we show users some examples of good captions, they'll realize how to write them appropriately.  We could also prompt them to use the Commons description or caption as a starting point.
 * Our team has lately been discussing whether it would be better to adopt a "collaborative decision" framework, in which an image would not be added to an article until two users confirm it, rather than just one. This would increase the accuracy, but raises questions around whether such a workflow aligns with Wikipedia values, and which user gets credit for the edit.

Metadata
The user tests showed us that image metadata from Commons (e.g. filename, description, caption, etc.) is critical for a user to confidently make a match. For instance, though the user can see that the article is about a church, and that the photo is of a church, the metadata allowed them to tell if it is the church discussed in the article. In the user tests, we saw that these items of metadata were most important: filename, description, caption, categories. Items that were not useful included size, upload date, and uploading username.

Given that metadata is a critical part of making a strong decision, we have been thinking about whether users will need to have metadata in their own language in order to do this task, especially in light of the fact that the majority of Commons metadata is in English. For 22 wikis, we looked at the percentage of the image matches from the algorithm that have metadata elements in the local language. In other words, for the images that can be matched to unillustrated articles in Arabic Wikipedia, how many of them have Arabic descriptions, captions, and depicts? The table is below these summary points:


 * In general, local language metadata coverage is very low. English is the exception.
 * For all wikis except English, fewer than 7% of image matches have local language descriptions (English is at 52%).
 * For all wikis except English, fewer than 0.5% of image matches have local language captions (English is at 3.6%).
 * For depicts statements, the wikis range between 3% (Serbian) and 10% (Swedish) coverage for their image matches.
 * The low coverage of local language descriptions and captions means that in most wikis, there are very few images we could suggest to users with local language metadata. Some of the larger wikis have a few thousand candidates with local language descriptions.  But no non-English wikis have over 1,000 candidates with local language captions.
 * Though depicts coverage is higher, we expect that depicts statements don’t usually contain sufficient detail to positively make a match. For instance, a depicts statement applied to a photo of St. Paul’s Church in Chicago is much more likely to be “church”, than “St. Paul’s Church in Chicago”.
 * We may want to prioritize image suggestions with local language metadata in our user interfaces, but until other features are built to increase the coverage, relying on local languages is not a viable option for these features in non-English wikis.

Given that local-language metadata has low coverage, our current idea is to offer the image matching task to just those users who can read English, which we could ask the user as a quick question before beginning the task. This unfortunately limits how many users could participate. It's a similar situation to the Content Translation tool, in that users need to know the language of the source wiki and the destination wiki in order to move content from one wiki to another. We also believe there will be sufficient numbers of these users based on results from the Growth team's welcome survey, which asks newcomers which languages they know. Depending on the wiki, between 20% and 50% of newcomers select English.

Android MVP
''See this page for details on the Android MVP.''

Background
After lots of community discussion, many internal discussions, and the user test results from above, we believe that this "add an image" idea has enough potential to continue to pursue. Community members have been generally positive, but also cautionary -- we also know that there are still many concerns and reasons the idea might not work as expected. The next step we want to take in order to learn more is to build a "minimum viable product" (MVP) for the Wikipedia Android app. The most important thing about this MVP is that it will not save any edits to Wikipedia. Rather, it will only be used to gather data, improve our algorithm, and improve our design.

The Android app is where "suggested edits" originated, and that team has a framework to build new task types easily. These are the main pieces:


 * The app will have a new task type that users know is only for helping us improve our algorithms and designs.
 * It will show users image matches, and they will select "Yes", "No", or "Skip".
 * We'll record the data on their selections to improve the algorithm, determine how to improve the interface, and think about what might be appropriate for the Growth team to build for the web platform later on.
 * No edits will happen to Wikipedia, making this a very low-risk project.

Results
The Android team released the app in May 2021, and over several weeks, thousands of users evaluated tens of thousands of image matches from the image matching algorithm. The resulting data allowed the Growth team to decide to proceed with Iteration 1 of the "add an image" task. In looking at the data, we were trying to answer two important questions around "Engagement" and "Efficacy".

Engagement: do users of all languages like this task and want to do it?
 * On average, users in the Android MVP did about 11 annotations each. While this is less than image descriptions and description translations, it is greater than the other four kinds of Android tasks.
 * Image matching edits showed a substantially lower retention rate than other kinds of Android suggested edits, but there are concerns that it’s not possible to calculate an apples-to-apples comparison. Further, we think that the fact that the edits from this MVP do not actually change the wikis would lead to lower retention, because users would be less motivated to return and do more.
 * With respect to language, data was collected for users in English Wikipedia as well as from users who exclusively use non-English Wikipedia, including large numbers of evaluations from German, Turkish, French, Portuguese, and Spanish Wikipedias. We expected English and non-English users to have quite different experiences, because the majority of metadata on images in Commons is in English. But metrics were remarkably similar across the two groups, including number of tasks completed, time spent on task, retention, and judgment. This bodes well for this task being usable across wikis, although it's likely that many of the non-English Android users are actually bilingual.

Efficacy: will resulting edits be of sufficient quality?
 * 80% of the matches for which newcomers said "yes" are actually good matches according to experts. This is an improvement of about 5 percentage points over the algorithm alone.
 * This number goes up to 82-83% when we remove newcomers who have very low median time for evaluations.
 * Experts tend to agree with each other only about 85% of the time.
 * Because newcomer accuracy goes up when certain kinds of newcomers are removed (those who evaluate too quickly or who accept too many suggestions), we think that automated “quality gates” could boost newcomer performance to levels acceptable by communities.
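
The effect described in the last two bullets can be illustrated with a small sketch. It is not the analysis code used for the MVP. The records, the field layout, and the 5-second cutoff are all made up for illustration. It computes the share of newcomer "yes" answers that experts also rate as good, before and after removing users whose median evaluation time is very low.

```python
from statistics import median

# Hypothetical records, invented for illustration:
# (user, seconds spent, newcomer said "yes", expert rates the match as good)
evaluations = [
    ("u1", 2.0, True, False),
    ("u1", 3.0, True, True),
    ("u1", 2.5, True, False),
    ("u2", 12.0, True, True),
    ("u2", 15.0, False, False),
    ("u2", 9.0, True, True),
    ("u3", 20.0, True, True),
    ("u3", 18.0, True, False),
]


def yes_precision(rows):
    """Share of newcomer "yes" answers that experts also rate as good matches."""
    yes_rows = [r for r in rows if r[2]]
    return sum(r[3] for r in yes_rows) / len(yes_rows)


def drop_fast_users(rows, min_median_seconds):
    """Remove all evaluations by users whose median time is below the cutoff."""
    times = {}
    for user, seconds, _, _ in rows:
        times.setdefault(user, []).append(seconds)
    slow_enough = {u for u, t in times.items() if median(t) >= min_median_seconds}
    return [r for r in rows if r[0] in slow_enough]


# The 5-second cutoff is an assumption, not a value from the MVP analysis.
before = yes_precision(evaluations)
after = yes_precision(drop_fast_users(evaluations, min_median_seconds=5.0))
print(f"before: {before:.2f}, after: {after:.2f}")
```

In this toy data the fastest user (u1) accepts everything and is mostly wrong, so removing that user raises the precision of the remaining "yes" answers.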

The full results are here.

Engineering
This section contains links on how to follow along with technical aspects of this project:


 * Work on the "proof of concept" API by the Platform Engineering team, built to back the Android MVP
 * Phabricator tasks around the Android team's MVP
 * Phabricator tasks and evaluations of the image matching algorithm

Iteration 1
In July 2021, the Growth team decided to move forward with building a first iteration of an "add an image" task for the web. This was a difficult decision, because of the many open questions and risks around encouraging newcomers to add images to Wikipedia articles. But after going through a year of idea validation, and looking through the resulting community discussions, evaluations, tests, and proofs-of-concepts around this idea, we decided to build a first iteration so that we could continue learning. These are the main findings from the idea validation phase that led us to move forward:


 * Cautious community support: community members are cautiously optimistic about this task, agreeing that it would be valuable, but pointing out many risks and pitfalls that we think we can address with good design.
 * Accurate algorithm: the image matching algorithm has been shown to be 65-80% accurate through multiple different tests, and we have been able to refine it over time.
 * User tests: many newcomers who experienced prototypes found the task fun and engaging.
 * Android MVP: the results from the Android MVP showed that newcomers generally applied good judgment to the suggestions, but more importantly, gave us clues about how to improve their results in our designs. The results also hinted that the task could work well across languages.
 * Overall learnings: having bumped into many pitfalls through our various validation steps, we'll be able to guard against them in our upcoming designs. This background work has given us lots of ideas on how to lead newcomers to good judgment, and how to avoid damaging edits.

Hypotheses
We're not certain that this task will work well -- that's why we plan to build it in small iterations, learning along the way. We do think that we can make a good attempt using our learnings so far to build a lightweight first iteration. One way to think about what we're doing with our iterations is hypothesis testing. Below are five optimistic hypotheses we have about the "add an image" task. Our aim in Iteration 1 will be to see if these hypotheses are correct.


 * 1) Captions: users can write satisfactory captions. This is our biggest open question, since images that get placed into Wikipedia articles generally require captions, but the Android MVP did not test the ability of newcomers to write them well.
 * 2) Efficacy: newcomers will have strong enough judgment that their edits will be accepted by the communities.
 * 3) Engagement: users like to do this task on mobile, do many, and return to do more.
 * 4) Languages: users who don’t know English will be able to do this task. This is an important question, since the majority of metadata on Commons is in English, and it is critical for users to read the filename, description, and caption from Commons in order to confidently confirm a match.
 * 5) Paradigm: the design paradigm we built for the "add a link" structured task will extend to images.

Scope
Because our main objective with Iteration 1 is learning, we want to get an experience in front of users as soon as we can. This means we want to limit the scope of what we build so that we can release it quickly. Below are the most important scope limitations we think we should impose on Iteration 1.


 * Mobile only: while many experienced Wikimedians do most of the wiki work from their desktop/laptop, the newcomers who are struggling to contribute to Wikipedia are largely using mobile devices, and they are the more important audience for the Growth team's work. If we build Iteration 1 only for mobile, we'll concentrate on that audience while saving the time it would take to additionally design and build the same workflow for desktop/laptop.
 * Static suggestions: rather than building a backend service to continuously run and update the available image matches using the image matching algorithm, we'll run the algorithm once and use the static set of suggestions for Iteration 1. While this won't make the newest images and freshest data available, we think it will be sufficient for our learning.
 * Add a link paradigm: our design will generally follow the same patterns as the design for our previous structured task, "add a link".
 * Unillustrated articles: we'll limit our suggestions only to articles that have no illustrations in them at all, as opposed to including articles that have some already, but could use more. This will mean that our workflow will not need to include steps for the newcomer to choose where in the article to place the image. Since it will be the only image, it can be assumed to be the lead image at the top of the article.
 * No infoboxes: we'll limit our suggestions only to articles that have no infoboxes. That's because if an unillustrated article has an infobox, its first image should usually be placed in the infobox. But it is a major technical challenge to make sure we can identify the correct image and image caption fields in all infoboxes in many languages. This also avoids articles that have Wikidata infoboxes.
 * Single image: although the image matching algorithm can propose multiple image candidates for a single unillustrated article, we'll limit Iteration 1 to only proposing the highest-confidence candidate. This will make for a simpler experience for the newcomer, and for a simpler design and engineering effort for the team.
 * Quality gates: we think we should include some sort of automatic mechanism to stop a user from making a large number of bad edits in a short time. Ideas around this include (a) limiting users to a certain number of "add an image" edits per day, (b) giving users additional instructions if they spend too little time on each suggestion, (c) giving users additional instructions if they seem to be accepting too many images. This idea was inspired by English Wikipedia's 2021 experience with the Wikipedia Pages Wanting Photos campaign.
 * Pilot wikis: as with all new Growth developments, we will deploy first only to our four pilot wikis, which are Arabic, Vietnamese, Bengali, and Czech Wikipedias. These are communities who follow along with the Growth work closely and are aware that they are part of experiments. The Growth team employs community ambassadors to help us correspond quickly with those communities. We may add Spanish and Portuguese Wikipedias to the list in the coming year.

We're interested to hear community members' opinions on whether these scoping choices sound good, or whether any of them would greatly limit our learnings in Iteration 1.
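
To make the "quality gates" idea from the scope list more concrete, here is a minimal sketch of how the three ideas (a), (b), and (c) could be combined into one check. It is only an illustration, not the implemented feature. The function name, the returned action strings, and every threshold value are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # All thresholds are illustrative assumptions, not values from the project.
    max_edits_per_day: int = 25
    min_seconds_per_suggestion: float = 5.0
    max_acceptance_rate: float = 0.9
    min_decisions_for_rate: int = 5


DEFAULT_CONFIG = GateConfig()


def check_quality_gates(edits_today, seconds_spent, decisions, config=DEFAULT_CONFIG):
    """Decide what to do before showing the next suggestion.

    edits_today: number of "add an image" edits the user already saved today
    seconds_spent: time spent on each recent suggestion, in seconds
    decisions: recent decisions, True for accepted and False for rejected
    """
    # (a) Daily cap on "add an image" edits.
    if edits_today >= config.max_edits_per_day:
        return "daily_limit_reached"
    # (b) The user is moving through suggestions too quickly on average.
    if seconds_spent:
        mean_seconds = sum(seconds_spent) / len(seconds_spent)
        if mean_seconds < config.min_seconds_per_suggestion:
            return "show_instructions_too_fast"
    # (c) The user accepts nearly everything, judged only once there is enough data.
    if len(decisions) >= config.min_decisions_for_rate:
        if sum(decisions) / len(decisions) > config.max_acceptance_rate:
            return "show_instructions_over_accepting"
    return "ok"


print(check_quality_gates(3, [2.0, 3.0], [True, False]))
```

The checks are ordered so that the hard limit (a) takes priority over the two softer, instructional gates.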

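Mockups and prototypes
Building on the designs from our previous user tests and on the Android MVP, we are considering multiple design concepts for Iteration 1. For each of the five parts of the user flow, we have two alternatives. We plan to user test both to gain information from newcomers. Our user tests will take place in English and Spanish, which is our team's first time testing in a non-English language. We also hope community members will consider the designs and share their thoughts on the talk page.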

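Prototypes for user testing

The easiest way to experience what we are considering building is through the interactive prototypes. We have built prototypes for both the "Concept A" and "Concept B" designs, and they are available in both English and Spanish. These are not actual wiki software, but rather a simulation of it. This means that no edits are actually saved and not all of the buttons work, but the most important ones for the "add an image" task do work.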


 * Concept A (English)
 * Concept B (English)
 * Concept A (Spanish)
 * Concept B (Spanish)

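Mockups for user testing

Below are static images of the mockups that we used for user testing in August 2021. Community members are welcome to explore the Figma file created by the Growth team's designer, which contains the mockups in the lower right of the canvas, as well as the design ideas and notes that led to them.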

Feed

These designs cover the very first part of the workflow, in which the user chooses an article to work on from the suggested edits feed. We want the card to be attractive, but it must not confuse the user.

Final designs for Iteration 1

Based on the user test findings above, we created the set of designs that we are implementing for Iteration 1. The best way to explore those designs is here in the Figma file, which always contains the latest version.

Leading indicators
Whenever we deploy new features, we define a set of "leading indicators" that we will keep track of during the early stages of the experiment. These help us quickly identify if the feature is generally behaving as expected and allow us to notice if it is causing any damage to the wikis. Each leading indicator comes with a plan of action in case the defined threshold is reached, so that the team knows what to do.

We collected data on usage of "add an image" from deployment on November 29, 2021 until December 14, 2021. "Add an image" has only been made available on the mobile website, and is given to a random 50% of registrations on that platform (excluding our 20% overall control group). We therefore focus on mobile users registered after deployment. This dataset excludes known test accounts, and does not contain data from users who block event logging (e.g. through their ad blocker).

Overall: The most notable thing about the leading indicator data is how few edits have been completed so far: only 89 edits over the first two weeks. Over the first two weeks of "add a link", almost 300 edits were made. That feature was deployed to both desktop and mobile users, but that alone is not enough to make up the difference. The leading indicators below give some clues. For instance, task completion rate is notably low. We also notice that people do not do many of these tasks in a row, whereas with "add a link", users do dozens in a row. This is a prime area for future investigation.

Revert rate: We use edit tags to identify edits and reverts, and reverts have to be done within 48 hours of the edit. The latter is in line with common practices for reverts.
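
The 48-hour rule can be sketched as follows. This is only an illustration with made-up timestamps, not the team's analysis code. In practice edits and reverts are identified through edit tags, which this sketch replaces with a simple pair of timestamps. Treating a revert at exactly 48 hours as "within" the window is an assumption.

```python
from datetime import datetime, timedelta

REVERT_WINDOW = timedelta(hours=48)

# Hypothetical edits, invented for illustration: (saved_at, reverted_at or None)
edits = [
    (datetime(2021, 12, 1, 10, 0), datetime(2021, 12, 1, 18, 0)),  # reverted after 8 hours
    (datetime(2021, 12, 2, 9, 0), None),                           # never reverted
    (datetime(2021, 12, 3, 9, 0), datetime(2021, 12, 6, 9, 0)),    # reverted after 72 hours
    (datetime(2021, 12, 4, 12, 0), datetime(2021, 12, 6, 12, 0)),  # reverted after exactly 48 hours
]


def is_counted_revert(saved_at, reverted_at, window=REVERT_WINDOW):
    """A revert counts only if it happened within the window after the edit."""
    return reverted_at is not None and reverted_at - saved_at <= window


def revert_rate(rows):
    """Share of edits that were reverted within the window."""
    reverted = sum(is_counted_revert(saved, rev) for saved, rev in rows)
    return reverted / len(rows)


print(f"revert rate: {revert_rate(edits):.0%}")
```

The edit reverted after 72 hours does not count toward the rate, which is the practical consequence of the time window.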

The "add an image" revert rate is comparable to the copyedit revert rate, and it’s significantly higher than "add a link" (using a test of proportions). Because "add an image" has a comparable revert rate to unstructured tasks, the threshold described in the leading indicator table is not met, and we do not have cause for alarm. That said, we are still looking into why reverts are occurring in order to make improvements. One issue we've noticed so far is a large number of users saving edits from outside the "add an image" workflow. They can do this by toggling to the visual editor, but it is happening so much more often than for "add a link" that we think there is something confusing about the "caption" step that is causing users to wander outside of it.
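
The text does not say which test of proportions was used. One common choice is the two-sided two-proportion z-test with a pooled standard error, sketched below using only the standard library. The counts in the example are invented and are not the project's revert data.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 20 reverts out of 100 edits versus 15 reverts out of 300 edits.
z, p_value = two_proportion_z_test(20, 100, 15, 300)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

With small samples such as the first two weeks of a deployment, the normal approximation is rough, so an exact test may be a better fit than this sketch.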

Rejection rate: We define an edit “session” as reaching the edit summary dialogue or the skip dialogue, at which point we count whether the recommended image was accepted, rejected, or skipped. Users can reach this dialogue multiple times, because we think that choosing to go back and review an image or edit the caption is a reasonable choice.

The threshold in the leading indicator table was a rejection rate of 40%, and this threshold has not been met. This means that users are rejecting suggestions at about the same rate as we expected, and we don't have reason to believe the algorithm is underperforming.
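
As a rough sketch of this indicator, the rejection rate can be computed from one outcome per edit session and compared with the 40% threshold. The session outcomes below are invented. Counting skipped sessions in the denominator, and treating "at or above 40%" as meeting the threshold, are both assumptions.

```python
from collections import Counter

REJECTION_THRESHOLD = 0.40  # threshold from the leading indicator table


def rejection_rate(outcomes):
    """Share of edit sessions in which the suggested image was rejected."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["rejected"] / total if total else 0.0


# Hypothetical session outcomes, invented for illustration.
sessions = [
    "accepted", "rejected", "skipped", "accepted", "rejected",
    "accepted", "accepted", "skipped", "accepted", "accepted",
]

rate = rejection_rate(sessions)
threshold_met = rate >= REJECTION_THRESHOLD
print(f"rejection rate: {rate:.0%}, threshold met: {threshold_met}")
```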

Over-acceptance rate: We reuse the concept of an "edit session" from the rejection rate analysis, and count the number of users who only have sessions where they accepted the image. In order to understand whether these users make many edits, we measure this for all users as well as for those with multiple edit sessions and those with five or more edit sessions. In the table below, the "N total" column shows the total number of users with that number of edit sessions, and "N accepted all" the number of users who only have edit sessions where they accepted all suggested images.

It is clear that over-acceptance is not an issue in this dataset, because there are no users who have 5 or more completed image edits, and for those who have more than one, 38% of the users accepted all their suggestions. This is in the expected range, given that the algorithm is usually expected to make good suggestions.
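
The "N total" and "N accepted all" counts can be derived from per-user session outcomes, as in the sketch below. The users and outcomes are made up, and the function name is hypothetical.

```python
from collections import defaultdict


def over_acceptance(sessions, min_sessions):
    """Return (n_total, n_accepted_all) for users with at least min_sessions sessions.

    sessions: iterable of (user, outcome) pairs, one per edit session
    """
    by_user = defaultdict(list)
    for user, outcome in sessions:
        by_user[user].append(outcome)
    eligible = [o for o in by_user.values() if len(o) >= min_sessions]
    accepted_all = sum(all(x == "accepted" for x in o) for o in eligible)
    return len(eligible), accepted_all


# Hypothetical edit sessions, invented for illustration.
sessions = [
    ("u1", "accepted"), ("u1", "accepted"),
    ("u2", "accepted"), ("u2", "rejected"), ("u2", "accepted"),
    ("u3", "accepted"),
    ("u4", "skipped"), ("u4", "accepted"),
]

for minimum in (1, 2, 5):
    n_total, n_accepted_all = over_acceptance(sessions, minimum)
    print(f"{minimum}+ sessions: N total = {n_total}, N accepted all = {n_accepted_all}")
```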

Task completion rate: We define "starting a task" as having an impression of "machine suggestions mode". In other words, the user is loading the editor with an "add an image" task. Completing a task is defined as clicking to save the edit, or confirming that you skipped the suggested image.

The threshold defined in the leading indicator table is "lower than 55%", and this threshold has been met. This means we are concerned about why users do not make their way through the whole workflow, and we want to understand where they get stuck or drop out.
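
The definition above can be sketched as a small calculation over logged events. The event names and the event tuples are hypothetical stand-ins for the real instrumentation, and only the 55% threshold comes from the text.

```python
COMPLETION_THRESHOLD = 0.55  # "lower than 55%" in the leading indicator table

# Hypothetical event names standing in for the real instrumentation.
START_EVENT = "impression_machine_suggestions_mode"
COMPLETE_EVENTS = {"save_click", "skip_confirm"}


def task_completion_rate(events):
    """Share of started tasks that were completed.

    events: iterable of (task_id, event_name) pairs
    """
    started, completed = set(), set()
    for task_id, name in events:
        if name == START_EVENT:
            started.add(task_id)
        elif name in COMPLETE_EVENTS:
            completed.add(task_id)
    completed &= started  # ignore completions without a recorded start
    return len(completed) / len(started) if started else 0.0


# Hypothetical events, invented for illustration.
events = [
    ("t1", START_EVENT), ("t1", "save_click"),
    ("t2", START_EVENT),
    ("t3", START_EVENT), ("t3", "skip_confirm"),
    ("t4", START_EVENT),
    ("t5", START_EVENT),
]

rate = task_completion_rate(events)
threshold_met = rate < COMPLETION_THRESHOLD
print(f"completion rate: {rate:.0%}, threshold met: {threshold_met}")
```

Breaking the same events down by workflow step would show where users drop out, which is the open question raised above.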