Wikimedia Apps/Team/Android/Add an image MVP

Objective
The Android, Structured Data, and Growth teams aim to offer "Add an Image" as a "structured task". More about the motivations for pursuing this project can be found on the main page created by the Growth team. In order to roll out Add an Image and have the output of the task show up on wiki, a "minimum viable product" (MVP) for the Wikipedia Android app will be created. The MVP will enhance the algorithm provided by the research team and answer questions about user behavior to further explore the concerns raised by the community.

The most important thing about this MVP is that it will not save any edits to Wikipedia. Rather, it will only be used to gather data, improve our algorithm, and improve our design.

The Android app is where "suggested edits" originated, and our team has a framework to build new task types easily. The main pieces include:


 * The app will have a new task type that users know is only for helping us improve our algorithms and designs.
 * It will show users image matches, and they will select "Yes", "No", or "Skip".
 * We'll record the data on their selections to improve the algorithm, determine how to improve the interface, and think about what might be appropriate for the Growth team to build for the web platform later on.
 * No edits will happen to Wikipedia, making this a very low-risk project.

The Android team will be working on this in February and March 2021. Our hope is the Growth team will learn enough to deploy the feature on mobile web. Based on the success and lessons of the Growth team's deployment, the Android team will refine the MVP and turn it into a feature that produces edits to Wikipedia.

Product Requirements

As a first step in the implementation of this project, the Android team will develop an MVP with the purpose of:


 * 1) Improving the Image Matching Algorithm developed by the research team by answering "how accurate is the algorithm?" We want to set confidence levels for the sources in the algorithm, so that we can say that suggestions from Wikidata are X% accurate, from Commons categories are Y% accurate, and from other Wikipedias are Z% accurate.
 * 2) Learning about our users by evaluating:
    * The stickiness of Add an Image across editing tenure, Commons familiarity, and language
    * The difficulty of Add an Image as a task, and whether we can determine if certain matches are harder than others
    * The implications of language preference on the ability to complete the task
    * Accuracy levels of users judging the matches. Because we’re not sure how accurate the users are, we want to receive multiple ratings on each image match (i.e. “voting”).
    * The optimal design and user workflow to encourage accurate matches and task retention
    * What, if any, measures need to be in place to discourage bad matches
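
The per-source confidence levels described above could be computed along the following lines. This is only a minimal sketch: the function name `accuracy_by_source`, the source labels, and the sample ratings are hypothetical and are not taken from the MVP's actual data pipeline.

```python
from collections import defaultdict


def accuracy_by_source(ratings):
    """Return {source: share of matches rated correct}.

    `ratings` is an iterable of (source, is_correct) pairs. The source
    labels are hypothetical stand-ins for the three algorithm sources.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for source, is_correct in ratings:
        totals[source] += 1
        if is_correct:
            correct[source] += 1
    return {source: correct[source] / totals[source] for source in totals}


# Made-up sample ratings, for illustration only.
sample = [
    ("wikidata", True), ("wikidata", True),
    ("wikidata", False), ("wikidata", True),
    ("commons_category", True), ("commons_category", False),
    ("other_wikipedia", True), ("other_wikipedia", True),
]

for source, share in accuracy_by_source(sample).items():
    print(f"{source}: {share:.0%}")
```

With real evaluation data in place of the sample, the resulting shares would play the role of the X%, Y%, and Z% figures.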

How to Follow Along
We have created T272872 as our Phabricator Epic to track the work of the MVP. We encourage your collaboration there or on our Talk Page.

There will also be periodic updates to this page as we make progress on the MVP.

2021 Jun 25 - Final Report and Next Steps
The Android team completed the Train Image Algorithm experiment. The findings can be found below. There were enough favorable insights from the experiment that the Growth team decided to proceed with the next phase of this work. You can read more about the Growth team building a Mobile Web feature to place images in articles on their project page. In the interim, the Android team will sunset the Train Image Algorithm task, and will add an Image Recommendations task to Suggested Edits based on the work from the Growth team.

The two most important questions to answer in making a decision to proceed with image recommendations work for newcomers are around engagement and efficacy. Each of those has more detailed questions underneath.

Engagement: do users like this task and want to do it?


 * Edits per session: do users do many of these edits in a row?
 * Retention: do users return on multiple days to do the task again?
 * Algorithm experience: is the algorithm accurate enough that users feel productive, but not so accurate that they feel superfluous?
 * Qualitative: is there anything we can see about the task in Play Store comments?

Efficacy: will resulting edits be of sufficient quality?


 * Accuracy of algorithm: what is the baseline accuracy before users are involved?
 * Algorithm improvement: what did we learn about the algorithm’s weak points?
 * Judgment: can newcomers identify the good matches from the bad, thereby improving the overall accuracy of the feature placing images on articles?
 * Effort: do newcomers seem to spend adequate time and care evaluating each match?

Engagement: do users like this task and want to do it?
Edits per session: we want to see users do many of these edits in a row, indicating that they like the task enough to keep on going.


 * On average, they do about 9 annotations per user and 10 annotations per session
 * We want to compare to the other Android tasks, using a 30 day sample of data, only Logged In, Suggested Edit editors.
 * We want to look at these numbers for English and Non-English users, if possible.
 * Note on positive reinforcement: the experience recommends that users do 10 per day as their “daily goal”.  Perhaps the fact that this number is close to 10 is an indication that the daily goal is influencing users.

Average Edits per Unique User:
* Image tag edits are on Commonswiki, we don’t track language for those edits

Retention: we want to see users return on multiple days to do the task again.

 * The most recent discussion is in this Phab comment, on how to make an apples-to-apples comparison between the various Android tasks
 * Using a 30 day sample of data from only Logged In, Suggested Edit editors.
 * We want to compare to the other Android tasks.
 * We want to look at these numbers for English and Non-English users, if possible.

All users
English Non-English

Algorithm experience: is the algorithm accurate enough that users feel productive, but not so accurate that they feel unnecessary?

 * If users were saying “yes” or “no” over 90% of the time, we might worry that they’re bored.  If they say “unsure” more than a third of the time, we might worry that they’re frustrated.
 * Users say “yes” 65% of the time, “no” 20% of the time, and “not sure” 15% of the time.  In other words, they perceive the algorithm to be correct about two-thirds of the time, and they’re only unsure rarely.
 * It would be helpful to find research from industry or academia on how to think about and tune this ratio.
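
The rule of thumb above can be sketched as a small check. The function names and flag wording are hypothetical, and the sketch assumes that "over 90%" applies to a single answer ("yes" alone or "no" alone); the sample reproduces the 65% / 20% / 15% split reported above.

```python
from collections import Counter

ANSWERS = ("yes", "no", "not sure")


def response_shares(responses):
    """Return the share of each answer among all responses."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    return {answer: counts[answer] / len(responses) for answer in ANSWERS}


def experience_flags(shares, bored_above=0.90, frustrated_above=1 / 3):
    """Apply the two thresholds from the text to the answer shares."""
    flags = []
    # Assumption: one dominant answer ("yes" or "no") suggests boredom.
    if shares["yes"] > bored_above or shares["no"] > bored_above:
        flags.append("possibly bored")
    if shares["not sure"] > frustrated_above:
        flags.append("possibly frustrated")
    return flags


# 100 responses with the split reported above.
observed = ["yes"] * 65 + ["no"] * 20 + ["not sure"] * 15
shares = response_shares(observed)
print(shares, experience_flags(shares))
```

Neither threshold is crossed for the observed split, which matches the reading that users are neither bored nor frustrated.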

Efficacy: will resulting edits be of sufficient quality?
Accuracy of algorithm: what is the baseline accuracy before users are involved?


 * Our best estimate comes from the SDAW test, which tested in six languages, and ranges from 65-80% accurate depending on whether you count “Good” or “Good+Okay”, and depending on the wiki/evaluator (source).
 * The three sources in the algorithm have substantially different accuracy (source) and make up different shares of the coverage (source):


 * Through the Android MVP, experts evaluated 2,397 matches. On average, experts assessed 76% of the matches to be correct. This is in line with the results above.
 * WMF staff also manually evaluated 230 image matches which were marked as “correct” by newcomer editors (<50 edits). We found that 80% of these matches are actually correct, which is in line with the numbers above.

Algorithm improvement: what did we learn about the algorithm’s weak points?

 * What is the distribution of responses for the follow-up questions for “no” and “not sure”?
 * We want to look at these numbers for English and Non-English users, if possible.

“No” responses “Not sure” responses

Judgment: can newcomers identify the good matches from the bad, thereby improving the overall accuracy of the feature placing images on articles?

 * Comparison with WMF staff annotations
 * 80% of the matches for which newcomers said "yes" are actually good matches
 * This number goes up to 82-83% when we remove newcomers who have very low median time for evaluations.
 * The algorithm alone is 65-80% accurate, and algorithm+newcomers is 80% accurate. We think we can boost that by screening out the worst newcomers (those who go too fast; those who say yes too often), so perhaps newcomer+algorithm could reach 85%+.
 * 85% of the matches for which Avg/Expert users said "yes" are actually good matches
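
The screening idea above (dropping users with a very low median evaluation time before measuring how often "yes" answers are good matches) could look roughly like this. It is a sketch under stated assumptions: the field names, the 5-second cut-off, and the sample data are all hypothetical.

```python
from statistics import median


def yes_precision(annotations, min_median_seconds=None):
    """Share of "yes" answers that are good matches.

    `annotations` is a list of dicts with the hypothetical keys
    "user", "answer", "seconds" and "is_good_match". When
    `min_median_seconds` is given, users whose median evaluation
    time is below it are excluded first.
    """
    seconds_by_user = {}
    for a in annotations:
        seconds_by_user.setdefault(a["user"], []).append(a["seconds"])

    if min_median_seconds is None:
        kept = set(seconds_by_user)
    else:
        kept = {
            user for user, secs in seconds_by_user.items()
            if median(secs) >= min_median_seconds
        }

    yes = [a for a in annotations if a["user"] in kept and a["answer"] == "yes"]
    if not yes:
        return None  # nothing to measure
    return sum(a["is_good_match"] for a in yes) / len(yes)


# Made-up data: one careful user and one very fast user.
data = [
    {"user": "careful", "answer": "yes", "seconds": 12, "is_good_match": True},
    {"user": "careful", "answer": "yes", "seconds": 15, "is_good_match": True},
    {"user": "careful", "answer": "yes", "seconds": 10, "is_good_match": True},
    {"user": "fast", "answer": "yes", "seconds": 2, "is_good_match": True},
    {"user": "fast", "answer": "yes", "seconds": 1, "is_good_match": False},
    {"user": "fast", "answer": "yes", "seconds": 3, "is_good_match": False},
]

print(yes_precision(data))
print(yes_precision(data, min_median_seconds=5))
```

The same filter could be extended to users who say "yes" too often, the second screening criterion mentioned above.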


 * Comparison with expert users (users with 1000+ Wikipedia edits)


 * There was agreement amongst users


 * Newcomers are more likely to select yes than experienced users.

Effort: do newcomers seem to spend adequate time and care evaluating each match?


 * What percent of users have a mean response time of less than five seconds?

All users This table is at the task level (not the user level).


 * The more experienced someone is, the more time they spend evaluating

English This table is at the task level (not the user level).

Non-English This table is at the task level (not the user level).
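
The question of what percent of users have a mean response time under five seconds can be answered with a user-level aggregation such as the one below. This is a sketch only: the function name and the sample timings are hypothetical, and unlike the task-level tables it groups by user first.

```python
from collections import defaultdict
from statistics import mean


def share_of_fast_users(task_times, threshold_seconds=5.0):
    """Share of users whose mean response time is below the threshold.

    `task_times` is an iterable of (user, seconds) pairs, one per
    completed task.
    """
    per_user = defaultdict(list)
    for user, seconds in task_times:
        per_user[user].append(seconds)
    if not per_user:
        return 0.0
    fast = sum(1 for secs in per_user.values() if mean(secs) < threshold_seconds)
    return fast / len(per_user)


# Made-up timings for four users.
timings = [
    ("a", 2), ("a", 4),    # mean 3.0, fast
    ("b", 10), ("b", 6),   # mean 8.0
    ("c", 4.5),            # mean 4.5, fast
    ("d", 20),             # mean 20.0
]
print(f"{share_of_fast_users(timings):.0%} of users average under 5 seconds")
```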


 * How often do users open the article to read more, and open the image to see details?

All users This table is at the task level (not the user level).

English This table is at the task level (not the user level).

Non-English This table is at the task level (not the user level).

2021 May 25 - Initial Data Insights
The Android team met with members of the Growth, Platform Engineering and Research teams to have a high level review of our data thus far and make determinations of what adjustments we should make now for the MVP, as opposed to later phases of this project.

With the experiment officially running for two weeks, the Train Image Algorithm task has received engagement from over 2,800 unique users on over 20,000 image titles across several language wikis. Below you will find which language wikis have at least 200 completed tasks, in order by the number of tasks completed:


 * English
 * German
 * Turkish
 * French
 * Portuguese
 * Spanish
 * Persian
 * Arabic
 * Russian
 * Italian
 * Hebrew
 * Ukrainian
 * Czech
 * Vietnamese

The average number of Train Image Algorithm tasks completed per day by a user is 10, which is consistent with the daily goal set in the feature by the team. This data tells us that participants in this task are motivated by the daily goal, a positive reinforcement element unique to Suggested Edits.

The Train Image Algorithm feature appears to be popular with both new users, as well as power editors.

47.85% of contributors to this task downloaded the app within the last 30 days, while 20.86% of users completing the Train Image Algorithm task have more than 50 edits across platforms.

2021 May 7 - Production Release
The team incorporated minor tweaks to the Beta version and released the Train Image Algorithm task to the production version of the Wikipedia app. In two weeks we will check the data to ensure it is coming in the way it should and share a few initial insights. We will also monitor our android-support@wikimedia.org email, the Play Store and our Phabricator board for any bugs that may arise.

2021 April 27 - Release to Beta and FAQ page
The team incorporated user testing feedback and released the feature to Beta. Our QA Analyst will review the feature in Beta for the rest of the week, and if there are no major blockers, the feature will become available in the production version of the app. We also created an FAQ page which is accessible in the app. We encourage feedback on this project's talk page.

2021 April 5 - User Testing Prioritization
Based on our analysis of the user testing feedback, the team is making updates to the prototype ahead of the release of the MVP at the end of the month. The tweaks we are making, which are captured in T272872, include:

Required


 * T278455 The bottom sheet for image suggestions needs to be draggable in order to reveal the article contents below it. Also, participants tried to interact with the handle bar at the top of the bottom sheet.
 * If draggable sheet is not feasible: Consider a max height of the bottom sheet in order to not cover the article completely.
 * T278490 Optimize tooltip positioning and handling on smaller screens, as they are cut off on smaller screens.
 * T278493 Ensure words are not cut off and overflow gracefully
 * T278526 Create more suitable 'Train image algorithm' onboarding illustrations for all different themes.
 * T278527 The checkbox items in the 'No' and 'Not sure' dialogs have issues in dark/black theme and need to be optimized.
 * T278528 The element of positive reinforcement/counter has display issues in the dark/black theme and needs to be optimized.
 * T278529 Provide an easy way to access the entire article from the feed, e.g. by incorporating a 'Read more' link, tappable article title or showing the entire article right from the beginning.
 * T278494 Optimize copy 'Suggestion reason' meta information as the current copy ('Found in the following Wiki: trwiki') is not clear enough.
 * T278530 It might be worth exploring making the 'Suggestion reason' more prominent, as participants rated its usefulness the lowest (likely due to low discoverability)
 * T278532 Optimize the 'No' and 'Not sure' dialog copy to reflect that multiple options can be selected. Some participants weren’t aware that multiple reasons can be selected.
 * T278496 Optimize copy of the 'opt-in' onboarding screen, as there’s an unnecessary word at the moment ('We would you like (...)').
 * T278497 Suppress “Sync reading list” dialog within Suggested edits as it’s distracting from the task at hand.
 * T278501 Incorporate gesture to swipe back and forth between image suggestions in the feed, as participants were intuitively applying the gestures.
 * T278533 Optimize design of positive reinforcement element/counter on the Suggested edits home screen, as it was positioned too close to the task’s title.
 * T275613 Write FAQ page
 * T278534 Make it clear that reviewing the image metadata is a core part of the task. We can potentially do that by increasing the visual prominence and/or increase the affordance to promote always opening the metadata screen.
 * T278535 Optimize the discoverability of 'info' button at the top right as 2/5 participants had issues finding it.
 * T278555 Save previous answer state: Given users are able to go back, the selection made in the previous image or images should be retained
 * T278556 Reduce the font-size of the fields of the More details screen
 * T278545 Change the goal count to 10/10

Nice to Have


 * T278546 Add "Cannot read the language" as a reason for rejection and unsure
 * T278557 Show the full image contained instead of a cropped image
 * T278548 Include the same metadata in the card - notably the suggestion reason (in addition to filename, image description and caption) on the more details screen as well.
 * T278549 Show success screen (see designs on Zeplin) when users complete daily goal (10/10 image suggestions)
 * T278550 Explore tooltip "Got it" button
 * T278552 Incorporate pinch to zoom functionality, as participants tried to zoom the image directly from the image suggestions feed.
 * T278558 Remove full screen overlay when transitioning to next image suggestion. This allows users to orient better and keep context after submitting an answer.
 * T278561 Provide clear information that images come from Commons, or some more overt message about the image source and access to more metadata

2021 March 25 - User Testing Analysis
The team released an update to production that included minor bug fixes for TalkPage and Watchlist. We also show non-main namespace pages in-app through a mobile web treatment.

The Android team leveraged usertesting.com to gain a better understanding of what aspects of the Image Recommendations MVP worked well and what things should be improved prior to release in English, German, French, Portuguese, Russian, Persian, Turkish, Ukrainian, Arabic, Vietnamese, Cebuano, Hebrew, Hungarian, Swedish, Polish, Czech, Basque, Korean, Serbian, Armenian, Bangla and Spanish.

We completed the analysis in partnership with the Growth team. Below is the Android team analysis.

Analysis of tasks T277861
🥰 = Good (participant had no issues)
😡 = Bad (participant had issues)
🤔 = Not sure if good or bad (participant might have had difficulties understanding the question, did not explicitly interact with it, or ignored the task completely)

Onboarding and understanding of Suggested edits

Do participants understand the tooltip? 😡
 * 2/5 discovered the tooltip but had issues understanding it.
 * 2/5 did not see the tooltip since it disappeared too quickly.
 * 1/5 discovered and understood the tooltip completely.

Can participants explain the difference between tasks? 🥰
 * 5/5 were able to explain their understanding of the tasks in a sufficient way.

Do participants understand what the 'Train image algorithm' task is all about? 🥰
 * 5/5 were able to describe the task in their own words well.

What do participants associate with the robot icon? 🥰
 * 4/5 associated the robot icon with an algorithm, artificial intelligence (AI) or computer program
 * 1/5 didn’t know what it means

Train AI task - Onboarding and understanding

Do participants understand the two onboarding screens? 🥰
 * 4/5 understand both onboarding screens.
 * 1/5 wasn’t reacting to the second onboarding screen (opt-in).

How do participants interact with onboarding tooltips? 🥰
 * 3/5 understand the task due to the tooltips.
 * 1/5 mentioned that the tooltips are very helpful to understand the task.
 * 1/5 understands the task but did not pay attention to the tooltips.
 * 1/5 probably did not see or understand the tooltips.

Is the tooltip copy clear enough? How’s the timing and positioning of the tooltips on various devices / screen sizes? 🤔
 * 3/5 read and understand the tooltip copy.
 * 2/5 did not interact with the tooltips.
 * 2/5 had tooltip display issues on a smaller phone.
 * 1/5 likes that the tooltip mentions the impact (help readers understand a topic)

Do participants know what to do after all these onboarding measures? 🥰
 * 5/5 understand what to do now.

Train images task

Do participants interact with the prototype naturally? 🥰
 * 4/5 are mostly comfortable interacting with the UI and make educated decisions.
 * 3/5 do not navigate to the file page without being prompted.
 * 2/5 navigate between the article and file page intuitively and without issues.
 * 1/5 is intimidated to make decisions that affect Wikipedia articles, doesn’t know how to interact with the article (RS: possibly due to small screen size) and doesn’t use file detail page intuitively.

Do participants know how to navigate to the file detail page? 🥰
 * 5/5 successfully navigated to the file detail page after being prompted.
 * 1/5 tapped the 'info i' icon in the feed view first.

How helpful is the meta information on the file detail page? 🥰
 * 3/5 consider the information on the file page as helpful.
 * 2/5 mention that the author is helpful.
 * 2/5 mention that the date is helpful.
 * 1/5 mentions that licensing info is helpful.
 * 1/5 mentions that the image description is helpful.

Do participants know how to enlarge / zoom an image? 🥰
 * 5/5 tapped the image and used a pinch to zoom gesture to zoom the image.
 * 2/5 tried to zoom the image directly from the feed experience.

Do participants know how to go back and forth between image suggestions? 🥰
 * 5/5 use swipe gestures to navigate back and forth between image suggestions.
 * 2/5 tapped the back button at the top left before using the swipe gesture.
 * 1/5 tapped the 'info i' button at the top right before using the swipe gesture.

Do participants understand the 'Not sure' options? 🥰
 * 5/5 understand the 'Not sure' options.
 * 3/5 were selecting multiple reasons at once.

Do participants understand the 'No' options? 🥰
 * 5/5 understand the 'Not sure' options.

Do participants scroll or know how to reveal more of the article contents? 🥰
 * 4/5 were successful in scrolling the article to reveal more information
 * 2/5 wanted to use the pull indicator at the top of the image suggestion to reveal the article below before they scrolled the article
 * 2/5 tried to tap the article title (1/5 scrolled afterwards)
 * 1/5 looked for a 'More' button to reveal more of the article’s content, then tapped the 'info i' button at the top right

Do participants know how to access the FAQ? 🤔
 * 3/5 tap the 'info i' button at the top right to reveal the FAQ.
 * 1/5 explained that she would tap the back button and look for an FAQ there (RS: a possible way to success as there’s an FAQ section in the SE home screen)
 * 1/5 did not notice the 'info i' button at the top right

How do participants interpret the element of positive reinforcement? 🥰
 * 5/5 understand what it is and identified the element as motivational, encouraging and/or daily goal
 * 1/5 wasn’t 100% sure about it but then identified it as a motivational element.

Do participants notice the element of positive reinforcement that has been added to the card? 🥰
 * 5/5 participants identified the added progress indication in the card

3. Analysis of rating scale

1 = Not at all useful information
5 = Very useful information

4. Analysis of follow-up questions

1. How do you think the suggested images for articles are being found? And how would you rate the overall quality of the suggestions?
 * 5/5 mentioned that the images presented were relevant.
 * 4/5 associated the image suggestions with an algorithm or computer program.
 * 2/5 mentioned that the suggestions are associated with keywords.
 * 1/5 mentioned these are random suggestions.

2. Was there anything that you found frustrating or confusing, that you would like to change about the way this tool works?
 * 3/5 replied that it’s easy to use.
 * 1/5 that it’s tedious and cumbersome.
 * 1/5 suggested to show more than 1 image choice per article.

3. How easy or hard did you find this task of reviewing whether images suggested were a good match for articles?
 * 4/5 find it very easy to evaluate if it’s a good match for the article.
 * 1/5 think it’s hard and time consuming but well worth it.

4. Would you be interested in adding images to Wikipedia articles this way? Please explain why or why not.
 * 4/5 are interested in such a feature
 * 1/5 mentions he would not be interested
 * 1/5 mentions that she wants to know how accurate she is when reviewing images

2021 February 23 - Finalizing Designs ahead of Usability Testing
The Android team has created designs that are currently being turned into a prototype for usability testing prior to deployment.

Once the prototype is created for user testing we will update this page with a link that anyone following along with this project can use and provide us feedback on our talk page.

2021 February 1 - Designs, Product Decisions and APIs
This week the Platform Engineering Team began building the API needed for this project with the projection of completion in early March, which is when we hope to deploy the MVP.

There were open Product questions, which the team's new Product Manager answered in T273055.

Initial Product Decisions


 * We will have one suggested image per article instead of multiple images
 * This iteration of the MVP will not include Image Captions
 * There are no language constraints for this task. As long as there is an article available in the language we will surface it. We want to be deliberate in ensuring this task is completed in a variety of languages. For this MVP to be considered a success, we want the task completed in at least five different languages, including English, an Indic language, and a Latin language.
 * We will have a check point two weeks after the launch of the feature to check if the feature is working properly and if modifications need to be made in order to ensure we are getting the answers to our core questions. The check point is not intended to introduce scope creep.
 * We aren't able to filter by article categories in this iteration of the MVP, but it could be a possibility in the future through the PET API
 * We will surface a survey each time a user says no to a match and sparingly surface a survey when a user clicks Not Sure or Skip
 * We need three annotations from 3000 different users on 3000 different matches. By having these three annotations, the tasks will self grade.
 * We will know people like the task if they return to complete it on three distinct dates. We will compare frequency of return by date across user type to understand whether stickiness for this task varies with how experienced a user is.
 * Once we pull the data we will be able to compare the habits of English vs. Non English users. We can not / do not need to show the same image to both non English and English users. Non English users will have different articles and images. We will know if a task was hard due to language based on their response to the survey if they click no or not sure. We will check task retention to see how popular the task is by language.
 * In order to know if the task is easy or hard, we would like to be able to see how long it is taking them to complete it. NOTE: This only works if we can see if someone backgrounds the app. Of the people that got it right, how long did it take them?
 * In order to know if the task is easy or hard we should also track if they click to see more information about the task, in order to make a decision
 * We determined that it is not worth adding extra clicks to see which metadata users find helpful. Perhaps we allow people to swipe up for more information, which would generally provide the metadata? We will need to see designs to compare this.
 * It is too hard, at least for this MVP, to track if experienced users use this tool to add images to articles manually without using the tool, so we aren't going to track that.
 * In the designs we want to track if someone skips or presses No on an image because the image is offensive, in order to learn how often NSFW or offensive material appears
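
Two of the decisions above, self-grading from three annotations and measuring return visits on three distinct dates, can be sketched as follows. The source does not say how three annotations are combined, so the strict majority vote here is an assumption; the function names and sample values are hypothetical as well.

```python
from collections import Counter
from datetime import date


def self_grade(votes, required=3):
    """Grade a match from its annotations.

    Assumption: a match is graded by strict majority once it has at
    least `required` annotations. Returns None when there are too few
    annotations or no answer has a majority.
    """
    if len(votes) < required:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count > len(votes) / 2 else None


def is_retained(completion_dates, min_distinct_days=3):
    """True if the user completed the task on enough distinct dates."""
    return len(set(completion_dates)) >= min_distinct_days


print(self_grade(["yes", "yes", "no"]))
print(self_grade(["yes", "no", "not sure"]))
print(is_retained([date(2021, 5, 7), date(2021, 5, 7), date(2021, 5, 9)]))
```

A three-way split has no majority, so such a match would stay ungraded until more annotations arrive.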

The Android Designer began work on mockups for the MVP and has started to receive feedback at T269594. The designer is creating mockups in response to the following user stories:

2.1. Discovery
When I am using the Wikipedia Android app, am logged in,

and discover a tooltip about a new edit feature,

I want to be educated about the task,

so I can consider trying it out.

2.2. Education
When I want to try out the image recommendations feature,

I want to be educated about the task,

so my expectations are set correctly.

2.3. Adding images
When I use the image recommendations feature,

I want to see articles without an image,

I want to be presented with a suitable image,

so I can select images to add to multiple articles in a row.

2.4. Positive reinforcement
When I use the image recommendations feature,

I want feedback/encouragement that what I am doing is right/helping,

so that I am motivated to do more.