Wikimedia Apps/Team/Android/Machine Assisted Article Descriptions

Experiment Background
The Android team is teaming up with Research and EPFL to improve article descriptions, also known as short descriptions.

Currently, Android app users can create and edit article descriptions via Suggested Edits. Article descriptions are saved to Wikidata, with the exception of article descriptions for English Wikipedia. The Android team has received feedback that new users produce low-quality article descriptions (T279702). In 2022 the team placed a temporary restriction on Suggested Edits for English Wikipedia users with fewer than 3 edits (T304621), with the intent of finding ways to improve the quality of article descriptions written by new users.

EPFL and Research reached out to the Android team with a model called Descartes that can generate descriptions on par with those written by human editors. Descartes takes the information on a Wikipedia article page and provides a short description of the article while adhering to the guidance on what makes an article description helpful. During initial evaluation of the model, its descriptions were preferred over human-generated article descriptions more than 50% of the time. Additionally, Descartes held a 91.3% accuracy rate in testing. Despite these very promising results, we wanted to do our due diligence by conducting an ABC test to ensure that the suggestions improve the quality of article descriptions when offered to new editors, without introducing new bias or increasing existing bias. We created an API, hosted on Toolforge, and will integrate the model into our existing interface in order to conduct our experiment. We will patrol edits made through the experiment in partnership with volunteers so as not to burden patrollers.

Product Requirements:

 * Allow users to provide feedback on individual suggestions should they detect issues
 * Accommodate two machine-generated suggestions to test which beam is more accurate
 * Onboard users to machine-generated suggestions
 * Show reminder popups about checking for bias when a user clicks a suggestion on a biography
 * Show suggestions for biographies only to experienced users
 * Allow users to write in their own response and to edit a suggestion
 * Incorporate an icon that identifies that the product uses machine learning
 * Multilingual compatibility with mBART25

Objective and Indicators
As a first step in the implementation of this project, the Android team will develop an MVP in order to:


 * 1) Determine if suggestions made through the Descartes model increase the quality of article description additions and edits made using the Wikipedia Android app. To understand how the suggested article descriptions change user behavior, we will evaluate:
    * Whether the introduction of suggestions alters the stickiness of the task type across editing tenure
    * Variability in task completion time relative to quality of edits
    * How often users modify suggestions before hitting publish
    * The optimal design and user workflow to encourage accuracy and task retention
    * What, if any, additional measures need to be in place to discourage bad or biased suggestions
 * 2) Determine if the algorithm holds up when exposed to more users:
    * Do the accuracy and preference rates change when exposed to more users
    * Do the accuracy and preference rates of using the suggestion vary greatly across languages
    * Is the algorithm introducing bias (e.g. misgendering) or not accurately representing critical nuance for Biographies of Living Persons
    * How do the accuracy rate and performance change when showing more than one suggestion
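
As an illustration of the indicator "how often users modify suggestions before hitting publish", the rate could be computed roughly as follows. This is only a sketch: the `PublishedEdit` record and its field names are hypothetical and do not reflect the team's actual instrumentation, and it assumes that whitespace-only changes should not count as a modification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PublishedEdit:
    # Hypothetical record: neither field name comes from the real event schema.
    suggestion: str  # machine-generated text the user selected
    published: str   # text the user actually published


def was_modified(edit: PublishedEdit) -> bool:
    """Return True if the published text differs from the suggestion.

    Assumption: leading/trailing whitespace is ignored; case changes count.
    """
    return edit.suggestion.strip() != edit.published.strip()


def modification_rate(edits: Iterable[PublishedEdit]) -> Optional[float]:
    """Fraction of published edits where the suggestion was changed first.

    Returns None when there are no edits, so callers can tell "no data"
    apart from a rate of 0.0.
    """
    edits = list(edits)
    if not edits:
        return None
    return sum(was_modified(e) for e in edits) / len(edits)
```

Comparing the same rate across experiment groups and languages would be one way to relate it to the other indicators in this list.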

Should the 30-day experiment show promising results based on the indicators above, the team will introduce the feature to all users and remove our 3-edit requirement for Suggested Edits. We will also take steps to expand the number of languages to mBART 50 and migrate the API from Toolforge to a more permanent home.

How to Follow Along
We have created T316375 as our Phabricator Epic to track this work. We encourage your collaboration there or on our Talk Page.

There will also be periodic updates to this page as we make progress. You can also test the model at https://ml-article-descriptions.toolforge.org/.

Updated Designs
After determining that the suggestions could be embedded in the existing article descriptions task, the Android team made updates to our design. If a user reports a suggestion, they will see the same dialog that our August 2022 update proposed for when someone clicks Not Sure.

This new design does mean we will allow users to publish their edits, as they would be able to without the machine generated suggestions. However, our team will patrol the edits that are made through this experiment to ensure we do not overwhelm volunteer patrollers. Additionally, new users will not receive suggestions for Biographies of Living Persons.

November 2022: API Development
The Research team put the model on Toolforge and tested the performance of the API. Initial testing found that it took 5-10 seconds to generate suggestions, and that the time varied with how many suggestions were being shown: performance improved as the number of suggestions generated decreased. Ways of addressing this problem include preloading some suggestions, restricting the number of suggestions shown when integrated into article descriptions, and altering user flows to ensure suggestions can be generated in the background.
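
The preloading idea can be sketched as follows. This is a minimal illustration, not the app's implementation (the Android app itself is not written in Python). The class name `SuggestionPrefetcher` and the injected `fetch` callable are hypothetical, and the real API call is deliberately left out. The sketch assumes `prefetch` and `get` are called from a single thread.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class SuggestionPrefetcher:
    """Start slow suggestion requests early so the result is ready on demand."""

    def __init__(self, fetch: Callable[[str], List[str]], max_workers: int = 2):
        # `fetch` is a hypothetical function that takes an article title and
        # returns a list of suggested descriptions (e.g. by calling the API).
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: Dict[str, Future] = {}

    def prefetch(self, title: str) -> None:
        """Begin generating suggestions in the background, once per title."""
        if title not in self._pending:
            self._pending[title] = self._pool.submit(self._fetch, title)

    def get(self, title: str, timeout: float = 10.0) -> List[str]:
        """Return suggestions, waiting only for whatever time is still left."""
        self.prefetch(title)
        return self._pending[title].result(timeout=timeout)

    def close(self) -> None:
        """Wait for outstanding requests and release the worker threads."""
        self._pool.shutdown(wait=True)
```

In a flow like this, `prefetch` would be called for the next article in the queue while the user is still working on the current one, so that most of the 5-10 seconds elapse before the suggestion is needed.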

August 2022: Initial Design Concepts and Guardrails for Bias
User story for Discovery

When I am using the Wikipedia Android app, am logged in, and discover a tooltip about a new edit feature, I want to be educated about the task, so I can consider trying it out. Open Question: When should this tooltip be seen in relation to other tooltips?

User story for Education

When I want to try out the article descriptions feature, I want to be educated about the task, so my expectations are set correctly.

Guardrails for bias and harm
The team generated possible guardrails for bias and harm:


 * Harm: problematic text recommendations
    * Guardrail: blocklist of words never to use
    * Guardrail: check for stereotypes – e.g., gendered language + occupations
 * Harm: poor quality of recommendations
    * Guardrail: minimum amount of information in article
    * Guardrail: verify performance by knowledge gap
 * Harm: recommendations only for some types of articles
    * Guardrail: monitor edit distribution by topic
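
Two of the guardrails above, the word blocklist and the minimum amount of information in the article, lend themselves to a simple pre-display check. The sketch below is only an assumption about how such a check might look: the function name, the placeholder blocklist entry, and the 500-character threshold are all hypothetical, and the team has not specified actual values.

```python
import re
from typing import AbstractSet

# Placeholder values for illustration only; not the team's actual settings.
EXAMPLE_BLOCKLIST = frozenset({"placeholderterm"})
EXAMPLE_MIN_ARTICLE_CHARS = 500


def passes_guardrails(
    suggestion: str,
    article_text: str,
    blocklist: AbstractSet[str] = EXAMPLE_BLOCKLIST,
    min_article_chars: int = EXAMPLE_MIN_ARTICLE_CHARS,
) -> bool:
    """Return True if a suggestion may be shown to the user.

    Checks two guardrails:
    1. The article has enough text for the model to work from.
    2. The suggestion contains no blocklisted word (case-insensitive,
       whole-word match).
    """
    if len(article_text.strip()) < min_article_chars:
        return False
    blocked = {word.casefold() for word in blocklist}
    words = set(re.findall(r"\w+", suggestion.casefold()))
    return words.isdisjoint(blocked)
```

A whole-word match keeps the check from rejecting harmless words that merely contain a blocked string. The stereotype and knowledge-gap guardrails would need more than string matching and are not covered here.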