Wikimedia Research/Showcase/Archive/2022/03

March 2022

Theme: Patterns and dynamics of article quality

March 16, 2022 Video: YouTube

Quality monitoring in Wikipedia - A computational perspective

By Animesh Mukherjee (Indian Institute of Technology, Kharagpur)

In this talk, I shall summarize the highlights of our five years of research on Wikipedia. In particular, I shall take a deep dive into two of our recent works: the first attempts to understand the early indications of which editors will soon go "missing" (aka missing editors) [1]; the second investigates how the quality of a Wikipedia article transitions over time and whether computational models can be built to understand the characteristics of future transitions [2]. In each case, I will present a suite of key results and the main insights we obtained from them.

[1] When expertise gone missing: Uncovering the loss of prolific contributors in Wikipedia, ICADL 2021 (pdf)
[2] Quality Change: norm or exception? Measurement, Analysis and Detection of Quality Change in Wikipedia, CSCW 2022 (pdf)
Slides on figshare

Automatically Labeling Low Quality Content on Wikipedia by Leveraging Editing Behaviors

By Sumit Asthana (University of Michigan, Ann Arbor)

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles are rated high quality on its quality scale, owing to an insufficient number of Wikipedia editors and an enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of the quality of individual Wikipedia sentences. However, current labeling approaches are tedious and produce noisy labels. In this talk, I will discuss an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences, as they stood prior to the edit, as examples that require that semantic improvement. Sentences from the highest-rated articles serve as examples that no longer need semantic improvements. I will discuss the performance of models trained with this labeling approach compared with models trained with existing labeling approaches, and also the implications of such a large-scale semi-supervised labeling approach for capturing the editing practices of Wikipedia editors and helping them improve articles faster.
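The labeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Edit` type, the label names, and the rule in `categorize_edit` are all hypothetical stand-ins. In particular, detecting an added `<ref` tag and treating every other change as a clarification is a deliberately crude assumption standing in for a real semantic edit classifier.

```python
from dataclasses import dataclass

# Hypothetical label names; the abstract only gives citations and
# clarifications as example categories.
NEEDS_CITATION = "needs_citation"
NEEDS_CLARIFICATION = "needs_clarification"
NO_IMPROVEMENT = "no_improvement_needed"


@dataclass(frozen=True)
class Edit:
    before: str  # sentence text prior to the edit
    after: str   # sentence text after the edit


def categorize_edit(edit):
    """Toy stand-in for a semantic edit classifier (assumption, not the paper's model)."""
    if "<ref" in edit.after and "<ref" not in edit.before:
        return NEEDS_CITATION
    if edit.after != edit.before:
        # Crude assumption: any other change counts as a clarification.
        return NEEDS_CLARIFICATION
    return None  # unchanged sentence: no label can be inferred


def build_labels(historic_edits, top_rated_sentences):
    """Return (sentence, label) pairs for training a sentence-quality model."""
    labels = []
    for edit in historic_edits:
        category = categorize_edit(edit)
        if category is not None:
            # The pre-edit sentence is an example that needed this improvement.
            labels.append((edit.before, category))
    # Sentences from the highest-rated articles need no further improvement.
    labels.extend((s, NO_IMPROVEMENT) for s in top_rated_sentences)
    return labels
```

Calling `build_labels` with a list of `Edit` objects and a list of sentences drawn from top-rated articles yields a labeled dataset without any manual annotation, which is the point of the approach described above.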

Related paper: Automatically Labeling Low Quality Content on Wikipedia By Leveraging Patterns in Editing Behaviors, CSCW 2021 (pdf)