Structured Data Across Wikimedia/Section-level Image Suggestions/Data Pipeline

General architecture
The general data workflow is the same for both ALIS and SLIS, as illustrated in the following diagram:

SLIS
SLIS-specific data processing steps are shown in the following diagram: inputs go into the main component, namely a Spark job, which is then executed as one task of the image suggestions Airflow parent job:

How it works
SLIS leverages two principal algorithms to generate suggestions: section alignment and section topics. Given a language and a Wikipedia article section:
 * the former retrieves images that already exist in the corresponding section of other languages;
 * the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones.

We consider section alignment-based suggestions to be fairly relevant in general, since they represent a projection of community-curated content. On the other hand, the more connections a wikilink has, the more confident a section topics-based suggestion is.
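As a rough illustration of the section alignment idea described above, the following plain-Python sketch collects images from equivalent sections in other languages. The page names, file names, the `SECTION_IMAGES` and `ALIGNED_SECTIONS` lookups, and the choice to skip images already present in the target section are all assumptions made for this example; the real Spark job may work differently.

```python
from collections import defaultdict

# Hypothetical toy data, not taken from the real pipeline.
# Images already present in sections, keyed by (wiki, page, section).
SECTION_IMAGES = {
    ("jawiki", "Boombox", "History"): {"Boombox_A.jpg", "Boombox_B.jpg"},
    ("dewiki", "Boombox", "Geschichte"): {"Boombox_A.jpg"},
    ("enwiki", "Boombox", "History"): {"Boombox_B.jpg"},
}

# Assumed alignment: target section -> equivalent sections in other wikis.
ALIGNED_SECTIONS = {
    ("enwiki", "Boombox", "History"): [
        ("jawiki", "Boombox", "History"),
        ("dewiki", "Boombox", "Geschichte"),
    ],
}


def alignment_suggestions(target):
    """Map each candidate image to the wikis whose aligned section uses it."""
    already_there = SECTION_IMAGES.get(target, set())
    found_in = defaultdict(set)
    for other in ALIGNED_SECTIONS.get(target, []):
        for image in SECTION_IMAGES.get(other, set()):
            # Assumption: an image already in the target section is not suggested.
            if image not in already_there:
                found_in[image].add(other[0])
    return dict(found_in)


suggestions = alignment_suggestions(("enwiki", "Boombox", "History"))
```

In this toy case, only `Boombox_A.jpg` is suggested, since `Boombox_B.jpg` already illustrates the target section.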

What it looks like
The following table displays a SLIS example that stems from both section alignment and section topics:

The section alignment algorithm found the image in Japanese Wikipedia's equivalent section.

The section topics algorithm obtained it through the following path:

Sharp Corporation section wikilink → Wikidata item → Wikidata property → Sharp boomboxes Commons category.

Suggestion algorithms
Below, we dive into the main aspects of the core SLIS components.

Section alignment
TODO

Section topics
This algorithm builds on the section topics data pipeline and aims to construct a visual representation of the wikilinks available in a Wikipedia article section.

To achieve this goal, it follows two kinds of paths that connect a given wikilink to a Commons image:
 * 1) wikilink → Wikidata item → Wikidata image property → Commons image
 * 2) wikilink → Wikipedia article's lead image

The former path relies on the Wikidata properties available for this purpose. We explored the use of additional ones without success; see phab:T311832.
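The two paths can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the actual Spark implementation: the lookup tables, the placeholder item identifiers (`Q_SHARP`, `Q_TAPE`), the file names, and the idea of ranking images by how many connections reach them are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lookups; identifiers and file names are placeholders.
WIKILINK_TO_ITEM = {"Sharp Corporation": "Q_SHARP", "Cassette tape": "Q_TAPE"}
# Path 1: Wikidata item -> images reached through an image property.
ITEM_IMAGES = {"Q_SHARP": {"Sharp_logo.svg"}, "Q_TAPE": {"Tape.jpg"}}
# Path 2: wikilink -> lead image of the linked Wikipedia article.
LEAD_IMAGES = {"Sharp Corporation": "Sharp_HQ.jpg", "Cassette tape": "Tape.jpg"}


def topic_suggestions(section_wikilinks):
    """Return image -> set of (wikilink, path) connections that reach it."""
    connections = defaultdict(set)
    for link in section_wikilinks:
        item = WIKILINK_TO_ITEM.get(link)
        for image in ITEM_IMAGES.get(item, set()):
            connections[image].add((link, "wikidata"))
        lead = LEAD_IMAGES.get(link)
        if lead is not None:
            connections[lead].add((link, "lead_image"))
    return dict(connections)


connections = topic_suggestions(["Sharp Corporation", "Cassette tape"])
# Assumed ranking: more connections first, ties broken alphabetically.
ranked = sorted(connections, key=lambda img: (-len(connections[img]), img))
```

Here `Tape.jpg` ranks first because both paths reach it, which mirrors the intuition that more connections make a suggestion more confident.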

Intersection
This is the intersection of the two algorithms above: only suggestions produced by both section alignment and section topics are kept. It yields the most solid suggestions, at the cost of a much lower volume.
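A minimal sketch of this step, assuming each algorithm's output is available as a mapping from a section to a set of suggested images (the section keys and file names below are hypothetical):

```python
def intersect_suggestions(alignment, topics):
    """Keep, per section, only the images suggested by both algorithms."""
    common = {}
    for section in alignment.keys() & topics.keys():
        images = alignment[section] & topics[section]
        if images:
            common[section] = images
    return common


# Hypothetical outputs of the two algorithms.
alignment = {
    "enwiki/Boombox/History": {"A.jpg", "B.jpg", "C.jpg"},
    "enwiki/Boombox/Design": {"D.jpg"},
}
topics = {
    "enwiki/Boombox/History": {"B.jpg", "E.jpg"},
    "enwiki/Boombox/Design": {"F.jpg"},
}

both = intersect_suggestions(alignment, topics)
```

Out of six distinct candidate images across two sections, only `B.jpg` survives, which illustrates the drop in volume.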

Pruning
TODO

Confidence scores
TODO

Code base

 * Spark job at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/section_image_suggestions.py
 * Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image_suggestions_dag.py