Structured Data Across Wikimedia/Section-level Image Suggestions/Data Pipeline

Tracked in Phabricator
Task T311814

General architecture

The general data workflow is the same both for ALIS and for SLIS, as illustrated in the following diagram:

SLIS

SLIS-specific data processing steps are shown in the following diagram: inputs go into the main component, namely a Spark job, which is then executed as one task of the image suggestions Airflow parent job:

How it works

SLIS leverages two principal algorithms to generate suggestions: section alignment and section topics. Given a language and a Wikipedia article section:

the former retrieves images that already exist in the corresponding section of other languages;
the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones.

We consider section alignment-based suggestions to be fairly relevant in general, since they represent a projection of community-curated content. On the other hand, the more connections a wikilink has, the more confident a section topics-based suggestion is.

What it looks like

The following table displays a SLIS example that stems from both section alignment and section topics:

page title	section title	image
Boombox	Design	Ghettoblaster-family

The section alignment algorithm found the image in Japanese Wikipedia's equivalent section.

The section topics algorithm obtained it through the following path:

Sharp Corporation section wikilink → Sharp Corporation (Q53227) Wikidata item → Commons category (P373) Wikidata property → Sharp boomboxes Commons category.

Suggestion algorithms

We dive below into the main aspects of the core SLIS components.

Section alignment

This algorithm is based on a machine-learned model that classifies (i.e., aligns) equivalent section titles across Wikipedia language editions.

Given a target section title, it looks up images available in all equivalent sections and suggests them. The workflow breaks down into the following steps:

gather aligned section titles from the model's output
extract existing section images from all Wikipedias through a wikitext parser
combine the above data to generate suggestions
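The three steps above can be sketched as follows. This is a minimal illustration rather than the project's Spark code: the dictionary layouts, the `suggest_by_alignment` name, the `Q_BOOMBOX` identifier, and the placeholder section titles are all assumptions made up for the example.

```python
def suggest_by_alignment(aligned_titles, section_images, item, lang, title):
    """Suggest images already used in sections aligned with (lang, title).

    aligned_titles: {(lang, title): [(other_lang, other_title), ...]}
        hypothetical shape for the alignment model's output.
    section_images: {(item, lang, title): [image, ...]}
        hypothetical shape for the wikitext parser's output, where `item`
        identifies the same article across languages.
    """
    suggestions = []
    for other_lang, other_title in aligned_titles.get((lang, title), []):
        for image in section_images.get((item, other_lang, other_title), []):
            if image not in suggestions:  # keep the first occurrence only
                suggestions.append(image)
    return suggestions


# Toy data loosely modeled on the Boombox example; all keys are made up.
aligned_titles = {("en", "Design"): [("ja", "design-ja"), ("es", "design-es")]}
section_images = {
    ("Q_BOOMBOX", "ja", "design-ja"): ["Ghettoblaster-family"],
    ("Q_BOOMBOX", "es", "design-es"): [],
}

print(suggest_by_alignment(aligned_titles, section_images, "Q_BOOMBOX", "en", "Design"))
```

Keying the parsed images by a language-independent article identifier is one way to make sure that only sections of the same article in other languages contribute images.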

Section topics

This algorithm builds on top of the section topics data pipeline and aims at constructing a visual representation of wikilinks available in Wikipedia article sections.

To achieve this goal, it follows two kinds of paths that connect a given wikilink to a Commons image, namely:

wikilink → Wikidata item → Wikidata image property → Commons image
wikilink → Wikipedia article's lead image

The former path consumes the available image (P18) and Commons category (P373) Wikidata properties. Note that we explored the use of additional properties without success; see phab:T311832.

During the development and internal evaluation of this component, we observed that these paths alone weren't generating accurate suggestions. The intuition is that the visual representation is indeed relevant to a wikilink, but may be unrelated to the actual content where that wikilink originates. Therefore, we applied the paths to the article page as well, and matched the results against the wikilink ones. In this way, we ensured that an image holds the same relationship with both the section wikilink and the article, thus yielding better suggestions at the cost of fewer of them.
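The matching step can be pictured with the sketch below. It rests on an assumption about what "the same relationship" means, namely that an image counts only when the same kind of path reaches it from both the wikilink and the article; the pipeline's actual matching logic may differ. The `suggest_by_topics` name, the `images_by_path` layout, and the image names are hypothetical.

```python
def suggest_by_topics(images_by_path, article, wikilinks):
    """Keep images reachable through the same path from a wikilink and the article.

    images_by_path: {page: {path_name: {image, ...}}}
        hypothetical lookup of the images each path kind (e.g. "P18", "P373",
        "lead") yields for a page.
    """
    article_paths = images_by_path.get(article, {})
    suggestions = set()
    for link in wikilinks:
        for path_name, link_images in images_by_path.get(link, {}).items():
            # Assumed matching rule: same path kind, same image on both sides.
            suggestions |= link_images & article_paths.get(path_name, set())
    return suggestions


# Toy data: image names are invented for illustration.
images_by_path = {
    "Boombox": {"P373": {"img_a", "img_b"}, "P18": {"img_c"}},
    "Sharp Corporation": {"P373": {"img_a"}, "P18": {"img_d"}},
}

print(suggest_by_topics(images_by_path, "Boombox", ["Sharp Corporation"]))
```

Here only `img_a` survives: `img_d` is relevant to the wikilink but has no connection to the article, which is the kind of suggestion the matching step is meant to drop.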

Intersection

This is simply the set of suggestions produced by both of the above algorithms, which yields the most solid suggestions at the cost of a much lower volume.
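In set terms, this can be sketched as below. The `split_by_source` name and the three labels are hypothetical and only serve to show how a suggestion found by both algorithms differs from one found by a single algorithm.

```python
def split_by_source(alignment_images, topics_images):
    """Group suggested images by the algorithm(s) that produced them."""
    both = alignment_images & topics_images
    return {
        "intersection": both,
        "alignment_only": alignment_images - both,
        "topics_only": topics_images - both,
    }


# Toy inputs: image names are invented for illustration.
groups = split_by_source({"img_a", "img_b"}, {"img_a", "img_c"})
print(groups["intersection"])
```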

Pruning

The content of Wikipedia articles is highly heterogeneous: while unstructured textual information is generally a suitable candidate for image suggestions, semi-structured data such as tables and lists are not. Furthermore, articles don't exclusively describe one entity, and might be lists, disambiguation pages, redirects, etc.

Hence, we devoted substantial effort to filtering out undesired content. We group filters by target as follows:

Articles
- disambiguation pages;
- redirects;
- dates;
- numbers;
- years;
- lists;
- names.
Sections
- lead sections;
- tabular data;
- lists;
- text that is too short;
- sections where an image already exists;
- title denylist, typically References, External links, Further reading.
Images
- non-image file extensions;
- images already on the page;
- placeholder categories;
- images that are used so frequently that they are likely icons.
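A few of these filters can be sketched as plain predicates. This is an illustration under stated assumptions, not the pipeline's code: the extension set, the `MIN_SECTION_CHARS` and `MAX_IMAGE_USAGE` thresholds, and the function names are hypothetical. The article-level filters, the tabular and list checks, and the placeholder categories are left out.

```python
import os

# Titles named in the text; the real denylist is likely longer.
TITLE_DENYLIST = {"references", "external links", "further reading"}
# Assumed set of image file extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".tif", ".tiff", ".webp"}
MIN_SECTION_CHARS = 500  # hypothetical "too short" threshold
MAX_IMAGE_USAGE = 1000   # hypothetical "likely an icon" threshold


def keep_section(title, text, has_image, is_lead=False):
    """Return True if a section is a candidate for suggestions."""
    if is_lead or has_image:
        return False
    if title.strip().lower() in TITLE_DENYLIST:
        return False
    return len(text) >= MIN_SECTION_CHARS


def keep_image(file_name, images_on_page, usage_count):
    """Return True if an image may be suggested for a page."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return False
    if file_name in images_on_page:
        return False
    return usage_count <= MAX_IMAGE_USAGE


print(keep_section("References", "x" * 600, has_image=False))
print(keep_image("Photo.JPG", images_on_page=set(), usage_count=3))
```

Each check is independent, so a section or image is dropped as soon as any filter rejects it.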

Confidence scores

Every suggestion ships with a confidence score, which is based on the following idea: the more sources agree on the same suggestion, the greater the confidence.

This is supported by evidence: when we elicited human judgments over a random sample, results indicated a correlation between suggestions rated as good and their confidence scores. See phab:T330784 for more details.

More specifically, we assign an integer in the $[0, 100]$ range depending on how many sources match a given suggestion, through the following formula:

$100*(1-\left(1-{\frac {\mathcal {C_{a}}}{100}}\right)*\left(1-{\frac {\mathcal {C_{t}}}{100}}\right))$

where ${\mathcal {C_{a}}}$ is the section alignment confidence score and ${\mathcal {C_{t}}}$ is the section topics one.

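Informally, if a given suggestion matches only one source, it inherits that source's score; if it matches both, the score is their combined probabilistic OR, i.e., the probability that at least one source is right, assuming the two are independent.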

Based on manual evaluation evidence, we set ${\mathcal {C_{a}}}=80$, while we compute ${\mathcal {C_{t}}}$ by combining its sub-sources, namely a wikilink image ${\mathcal {w}}$ and an article one ${\mathcal {a}}$, as follows:

${\mathcal {L}}*\left({\frac {\mathcal {w}}{100}}\right)*\left({\frac {\mathcal {a}}{100}}\right)$

where ${\mathcal {L}}$ is the likelihood of the general section topic relevance, which we empirically set to $90$.
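The two formulas can be checked with a short sketch. It assumes that ${\mathcal {w}}$ and ${\mathcal {a}}$ are scores in $[0, 100]$, that a source which does not match contributes $0$, and that the final value is rounded to the nearest integer; none of these details are stated above. The function names are hypothetical.

```python
SECTION_ALIGNMENT_SCORE = 80  # C_a, as set in the text
TOPIC_LIKELIHOOD = 90         # L, as set in the text


def topics_score(w, a, likelihood=TOPIC_LIKELIHOOD):
    """C_t = L * (w / 100) * (a / 100), with w and a assumed to lie in [0, 100]."""
    return likelihood * (w / 100) * (a / 100)


def combined_score(c_a=0.0, c_t=0.0):
    """100 * (1 - (1 - C_a / 100) * (1 - C_t / 100)), rounded to an integer.

    A source that does not match is assumed to contribute 0; the rounding
    step is also an assumption.
    """
    return round(100 * (1 - (1 - c_a / 100) * (1 - c_t / 100)))


print(combined_score(c_a=SECTION_ALIGNMENT_SCORE))                       # alignment only
print(combined_score(c_t=topics_score(100, 100)))                        # topics only
print(combined_score(SECTION_ALIGNMENT_SCORE, topics_score(100, 100)))  # both sources
```

With a single source the combined score equals that source's score (80 or 90 here), while agreement between both sources at these values gives 98, which is higher than either score alone.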

Remediating bias

SLIS might generate or propagate different kinds of bias, which we acknowledge and attempt to remediate when possible.

First, articles that are candidates for suggestions can bear typical Wikipedia selection biases, e.g., towards men and English-speaking countries. Such bias can be mitigated by surfacing suggestions to end users for articles that do not represent those selection biases. Helpful examples include:

topical filters and edit tags;
incorporating category-based notifications into events run by local groups, such as this one in Portuguese by Wiki Editoras Lx and Wikimedia Portugal.

On the other hand, the system may propagate (but not worsen) existing bias in users' watchlists, as notifications are based on them. This is similar to Growth's newcomer tools, where users can pick their topics of interest.

From a different perspective, a complement to bias is harm: we argue that its likelihood is low, since the suggestion space is quite constrained and already patrolled by the community.

A second aspect revolves around potential bias propagation towards topics that are generally more illustrated on Wikipedia. Recent work on visual knowledge gaps shows that most of the bias happens at the selection stage, while the quality of articles and the proportion of illustrated articles are similar across different genders. Therefore, we claim that the algorithm may replicate, but not amplify, existing gender bias. Furthermore, an initial investigation on illustrated articles by country shows non-significant trends of western countries being on average more illustrated than the rest.

A third aspect concerns section alignment's core idea: images available in one Wikipedia project are projected to other ones. Hence, content that is acceptable for one language community is also assumed to be culturally appropriate for others. While we acknowledge this is a strong assumption, the latest evaluation data showed that only 0.07% of suggested images were judged as offensive.

Last but not least, section alignment's machine-learned component may suffer from language bias: multilingual language models are known to work better on languages with a large presence on the Internet (Wikipedia size in our case) compared to under-resourced ones. For instance, the English-Spanish pair's measured precision is 95%, while the English-Japanese one is 83%. This translates into more confident suggestions for larger Wikipedias. Nevertheless, the model learns section alignments for every Wikipedia project pair, thus enabling eventual expansion to new languages.

Code base

Spark job at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/section_image_suggestions.py
Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image_suggestions_dag.py
Section alignment at https://gitlab.wikimedia.org/repos/structured-data/section-image-recs
Section topics at https://gitlab.wikimedia.org/repos/structured-data/section-topics