Structured Data Across Wikimedia/Section-level Image Suggestions/Data Pipeline

General architecture
The general data workflow is the same for both ALIS and SLIS, as illustrated in the following diagram:

SLIS
The following diagram shows the SLIS-specific data processing steps. Inputs feed the main component, a Spark job, which runs as one task of the image suggestions Airflow parent job.

How it works
SLIS leverages two principal algorithms to generate suggestions: section alignment and section topics. Given a language and a Wikipedia article section:
 * the former retrieves images that already exist in the corresponding section of other languages;
 * the latter takes the section's wikilinks and looks up images that are connected to them via several properties, typically Wikidata ones.

We consider section alignment-based suggestions to be fairly relevant in general, since they represent a projection of community-curated content. Section topics-based suggestions, on the other hand, vary in confidence: the more connections a wikilink has, the more confident the suggestion is.
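
The two algorithms can be illustrated with toy in-memory data. This is a minimal sketch and not the code of the actual Spark job: the function names, the data shapes, and the file names are hypothetical. It also assumes that "connections" means the number of property links between a section's wikilinks and an image, which the text above does not spell out.

```python
from collections import defaultdict

# Hypothetical toy data: (language, aligned section key) -> images in that section.
SECTION_IMAGES = {
    ("ja", "products"): {"Sharp_boombox.jpg"},
    ("de", "products"): {"Sharp_boombox.jpg", "Sharp_logo.svg"},
    ("en", "products"): set(),
}

# Hypothetical toy data: wikilink -> list of (property label, image) connections.
WIKILINK_IMAGES = {
    "Sharp Corporation": [
        ("commons category", "Sharp_boombox.jpg"),
        ("image", "Sharp_headquarters.jpg"),
        ("lead image", "Sharp_headquarters.jpg"),
    ],
}


def section_alignment(target_lang, section):
    """Collect images that exist in the corresponding section of other languages."""
    found = defaultdict(set)
    for (lang, key), images in SECTION_IMAGES.items():
        if key == section and lang != target_lang:
            for image in images:
                found[image].add(lang)
    return dict(found)


def section_topics(wikilinks):
    """Count, per image, the connections that tie it to the section's wikilinks."""
    counts = defaultdict(int)
    for link in wikilinks:
        for _prop, image in WIKILINK_IMAGES.get(link, []):
            counts[image] += 1
    return dict(counts)


aligned = section_alignment("en", "products")
topics = section_topics(["Sharp Corporation"])
```

In this sketch, `aligned` maps each candidate image to the languages whose section already contains it, while `topics` maps each candidate image to a connection count that could serve as a rough confidence signal.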

What it looks like
The following table displays a SLIS example that stems from both section alignment and section topics:

The section alignment algorithm found the image in Japanese Wikipedia's equivalent section.

The section topics algorithm obtained it through the following path:

Sharp Corporation section wikilink → Wikidata item → Wikidata property → Sharp boomboxes Commons category.
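
The path above can be pictured as a chain of lookups. The following is a hedged sketch only: the lookup tables, the identifiers `Q_SHARP` and `P_EXAMPLE`, and the image file name are placeholders invented for illustration, not real Wikidata IDs or Commons files, and the actual job may resolve the path differently.

```python
# Hypothetical lookup tables; all identifiers are placeholders.
WIKILINK_TO_ITEM = {"Sharp Corporation": "Q_SHARP"}
ITEM_PROPERTIES = {"Q_SHARP": {"P_EXAMPLE": "Category:Sharp boomboxes"}}
CATEGORY_IMAGES = {"Category:Sharp boomboxes": ["Example_boombox.jpg"]}


def follow_path(wikilink):
    """Return (item, property, category, image) tuples reachable from a wikilink."""
    item = WIKILINK_TO_ITEM.get(wikilink)
    if item is None:
        # The wikilink has no known Wikidata item, so no path exists.
        return []
    paths = []
    for prop, category in ITEM_PROPERTIES.get(item, {}).items():
        for image in CATEGORY_IMAGES.get(category, []):
            paths.append((item, prop, category, image))
    return paths


paths = follow_path("Sharp Corporation")
```

Each returned tuple records one full wikilink-to-image path, so the number of tuples per image gives the connection count for that image.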

Section alignment
TODO

Section topics
TODO

Intersection
TODO

Pruning
TODO

Confidence scores
TODO

Code base

 * Spark job at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/section_image_suggestions.py
 * Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/image_suggestions_dag.py