Structured Data Across Wikimedia/Section-level Image Suggestions/Data Pipeline

General architecture
The general data workflow is the same both for ALIS and for SLIS, as illustrated in the following diagram:

SLIS
SLIS-specific data processing steps are shown in the following diagram: inputs go into the main component, namely a Spark job, which is then executed as one task of the image suggestions Airflow parent job:

How it works
SLIS leverages two principal algorithms to generate suggestions: section alignment and section topics. Given a language and a Wikipedia article section:
 * the former retrieves images that already exist in the corresponding section of other languages;
 * the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones.

We consider section alignment-based suggestions to be fairly relevant in general, since they represent a projection of community-curated content. On the other hand, the more connections a wikilink has, the more confident a section topics-based suggestion is.

What it looks like
The following table displays a SLIS example that stems from both section alignment and section topics:

The section alignment algorithm found the image in Japanese Wikipedia's equivalent section.

The section topics algorithm obtained it through the following path:

Sharp Corporation section wikilink → Wikidata item → Wikidata property → Sharp boomboxes Commons category.

Suggestion algorithms
Below we dive into the main aspects of the core SLIS components.

Section alignment
This algorithm is based on a machine-learned model that classifies (i.e., aligns) equivalent section titles across Wikipedia language chapters.

Given a target section title, it looks up images available in all equivalent sections and suggests them. The workflow breaks down into the following steps:
 * 1) gather aligned section titles from the model's output
 * 2) extract existing section images from all Wikipedias through a wikitext parser
 * 3) combine the above data to generate suggestions
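
The three steps above can be sketched in plain Python. This is only an illustration, not the actual Spark job: the input dictionaries, the `suggest_by_alignment` function, and the use of a Wikidata item to tie together the same article across languages are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical inputs, for illustration only.
# Step 1: aligned section titles, as (wiki, title) -> equivalent (wiki, title) pairs.
ALIGNED_TITLES = {
    ("enwiki", "History"): {("itwiki", "Storia"), ("frwiki", "Histoire")},
}

# Step 2: images found in existing sections, keyed by
# (Wikidata item of the article, wiki, section title).
SECTION_IMAGES = {
    ("Q1", "itwiki", "Storia"): {"Old_map.jpg"},
    ("Q1", "frwiki", "Histoire"): {"Old_map.jpg", "Town_hall.png"},
    ("Q2", "frwiki", "Histoire"): {"Unrelated.jpg"},
}


def suggest_by_alignment(item, wiki, title, aligned_titles, section_images):
    """Step 3: map each candidate image to the wikis whose equivalent section holds it."""
    suggestions = defaultdict(set)
    for other_wiki, other_title in aligned_titles.get((wiki, title), set()):
        for image in section_images.get((item, other_wiki, other_title), set()):
            suggestions[image].add(other_wiki)
    return dict(suggestions)


result = suggest_by_alignment("Q1", "enwiki", "History", ALIGNED_TITLES, SECTION_IMAGES)
```

Here `result` holds the candidate images for the History section of the English article about item Q1, together with the wikis they were found in. The image attached to item Q2 is left out because it belongs to a different article.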

Section topics
This algorithm builds on top of the section topics data pipeline and aims at constructing a visual representation of wikilinks available in Wikipedia article sections.

To achieve this goal, it follows two kinds of paths that connect a given wikilink to a Commons image, namely:
 * 1) wikilink → Wikidata item → Wikidata image property → Commons image
 * 2) wikilink → Wikipedia article's lead image

The former path relies on the Wikidata properties currently available for this purpose. Note that we explored the use of additional ones with no success; see phab:T311832.

During the development and internal evaluation of this component, we observed that these paths alone didn't generate accurate suggestions. The intuition is that the visual representation is indeed relevant to a wikilink, but is not necessarily related to the content of the section where that wikilink appears. Therefore, we applied the same paths to the article page as well, and matched the results against the wikilink ones. In this way, we ensured that an image holds the same relationship with both the section wikilink and the article, thus yielding better suggestions at the cost of fewer of them.
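
The matching step can be illustrated with a minimal sketch. It reads "the same relationship" as "the same kind of path", which is one interpretation and not necessarily what the production code does. The `match_paths` function, the path names, and the file names are hypothetical.

```python
def match_paths(wikilink_paths, article_paths):
    """Keep images reached through the same kind of path from both
    the section wikilink and the article.

    Both arguments map a path kind to the set of images it reaches.
    Returns a mapping from each matched image to its shared path kinds.
    """
    matched = {}
    for kind, images in wikilink_paths.items():
        common = images & article_paths.get(kind, set())
        for image in common:
            matched.setdefault(image, set()).add(kind)
    return matched


# Hypothetical data: images reached from one section wikilink...
wikilink_paths = {
    "wikidata_property": {"A.jpg", "B.jpg"},
    "lead_image": {"C.jpg"},
}
# ...and images reached from the article that contains the section.
article_paths = {
    "wikidata_property": {"A.jpg"},
    "lead_image": {"B.jpg"},
}

matched = match_paths(wikilink_paths, article_paths)
```

Only `A.jpg` survives: `B.jpg` is reached from both sides but through different kinds of paths, and `C.jpg` is reached from the wikilink only. This shows why the matching improves precision while reducing the number of suggestions.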

Intersection
This is simply the set of suggestions that both of the above algorithms produce. It yields the most solid suggestions at the cost of a much lower volume.

Pruning
The content of Wikipedia articles is highly heterogeneous: while unstructured textual information is generally a suitable candidate for image suggestions, semi-structured data such as tables and lists are not. Furthermore, articles don't exclusively describe one entity, and might be lists, disambiguation pages, redirects, etc.

Hence, we devoted substantial effort to filtering out undesired content. We group filters depending on their target as follows:
 * Articles:
   * disambiguation pages;
   * redirects;
   * dates;
   * numbers;
   * years;
   * lists;
   * names.
 * Sections:
   * lead sections;
   * tabular data;
   * lists;
   * too short text;
   * sections where an image already exists;
   * title denylist, typically References, External links, Further reading.
 * Images:
   * non-image file extensions;
   * images already on the page;
   * placeholder categories;
   * images that are used so frequently that they are likely icons.
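
A few of the section and image filters above can be sketched as simple predicates. This is a hedged illustration only: the `Section` class, the function names, the thresholds, and the extension set are assumptions, and the filters for tabular data, lists, and placeholder categories are left out for brevity.

```python
from dataclasses import dataclass, field

# Titles named in the text; a real denylist would likely be longer and per-language.
TITLE_DENYLIST = {"references", "external links", "further reading"}
MIN_TEXT_LENGTH = 500  # hypothetical threshold, in characters
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg", "tif", "tiff"}  # hypothetical
MAX_USAGE = 1000  # hypothetical: above this many uses, treat the file as an icon


@dataclass
class Section:
    index: int  # 0 is assumed to be the lead section
    title: str
    text: str
    images: list = field(default_factory=list)


def keep_section(section: Section) -> bool:
    """Return True if the section is a candidate for image suggestions."""
    if section.index == 0:
        return False  # lead section
    if section.title.strip().lower() in TITLE_DENYLIST:
        return False  # denylisted title
    if section.images:
        return False  # an image already exists
    if len(section.text) < MIN_TEXT_LENGTH:
        return False  # too short text
    return True


def keep_image(file_name: str, images_on_page: set, usage_count: int) -> bool:
    """Return True if the file is a plausible suggestion for the page."""
    extension = file_name.rsplit(".", 1)[-1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return False  # non-image file extension
    if file_name in images_on_page:
        return False  # already on the page
    if usage_count > MAX_USAGE:
        return False  # used so often that it is likely an icon
    return True
```

For instance, `keep_section(Section(3, "External links", "x" * 600))` is rejected by the title denylist, while `keep_image("Map.JPG", set(), 3)` passes all three image checks.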

Confidence scores
Every suggestion ships with a confidence score, which is based on the following idea: the more sources agree on the same suggestion, the greater the confidence.

This is supported by evidence: when we elicited human judgments over a random sample, results indicated a correlation between suggestions rated as good and their confidence scores. See phab:T330784 for more details.

More specifically, we assign an integer in the $$[0, 100]$$ range depending on how many sources match a given suggestion, through the following formula:

$$100 * ( 1 - \left ( 1 - \frac{\mathcal{C_a}}{100} \right ) * \left (1 - \frac{\mathcal{C_t}}{100} \right ) )$$

where $$\mathcal{C_a}$$ is the section alignment confidence score and $$\mathcal{C_t}$$ is the section topics one.

Informally, if a given suggestion matches only one source, it inherits that source's score; if it matches both, the two scores are combined as a probabilistic OR.

Based on manual evaluation evidence, we set $$\mathcal{C_a} = 80$$, while we compute $$\mathcal{C_t}$$ through a combination of its sub-sources, namely a wikilink image $$\mathcal{w}$$ and an article one $$\mathcal{a}$$ as follows:

$$\mathcal{L} * \left ( \frac{\mathcal{w}}{100} \right ) * \left ( \frac{\mathcal{a}}{100} \right )$$

where $$\mathcal{L}$$ is the likelihood of the general section topic relevance, which we empirically set to $$90$$.
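
The two formulas above can be checked with a short Python sketch. The function and constant names are hypothetical, a score of 0 is used to mean that a source did not match, and rounding to the nearest integer is an assumption, since the text does not say how the final integer is obtained.

```python
ALIGNMENT_SCORE = 80   # C_a, as set in the text
TOPIC_LIKELIHOOD = 90  # L, as set in the text


def section_topics_score(wikilink_score, article_score, likelihood=TOPIC_LIKELIHOOD):
    """C_t = L * (w / 100) * (a / 100), with w and a on a 0-100 scale."""
    return likelihood * (wikilink_score / 100) * (article_score / 100)


def combined_score(alignment_score, topics_score):
    """Probabilistic OR of the two sources, on a 0-100 scale.

    Pass 0 for a source that did not match the suggestion.
    Rounding to an integer is an assumption of this sketch.
    """
    combined = 100 * (1 - (1 - alignment_score / 100) * (1 - topics_score / 100))
    return round(combined)


# Only section alignment matches: the suggestion inherits C_a.
only_alignment = combined_score(ALIGNMENT_SCORE, 0)

# Only section topics matches, with w = a = 100: the suggestion inherits C_t.
only_topics = combined_score(0, section_topics_score(100, 100))

# Both sources match: 100 * (1 - 0.2 * 0.1) = 98.
both = combined_score(ALIGNMENT_SCORE, section_topics_score(100, 100))
```

With these inputs, `only_alignment` is 80, `only_topics` is 90, and `both` is 98, so a suggestion backed by both sources always scores higher than one backed by either source alone.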

Code base

 * Spark job at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/section_image_suggestions.py
 * Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image_suggestions_dag.py
 * Section alignment at https://gitlab.wikimedia.org/repos/structured-data/section-image-recs
 * Section topics at https://gitlab.wikimedia.org/repos/structured-data/section-topics