Structured Data Across Wikimedia/Section Topics/Data Pipeline

What's a topic?
We define a topic as the Wikidata item that a wikilink points to, where the wikilink is extracted from a given piece of Wikipedia content.

How it works
The data pipeline is implemented as an Airflow job and breaks down into the following steps:
 * 1) two sensors that give the green light as soon as fresh data is available in the Data Lake;
 * 2) one Python Spark task that takes Wikipedia wikitext and Wikidata item page links as input, and outputs the section topics dataset.

A look at the data
Here is what a row of data looks like (manually hyperlinked, should the reader wish to check it):

Data processing flow

 * 1) Gather the content of top-level sections, lead section included;
 * 2) filter out sections that don't convey relevant topics, such as External links. See phab:T318092 and phab:T323504 for more details;
 * 3) extract Wikidata items from wikilinks: the so-called section topics;
 * 4) filter out noisy topics, such as dates. See phab:T323597 and phab:T323036 for more details;
 * 5) compute each topic's relevance score.

Note that we:
 * resolve redirect pages;
 * optionally separate media links from the main dataset.
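
The steps above can be illustrated with a minimal, pure-Python sketch. It is not the actual Spark job: the lookup tables, the excluded section titles, the placeholder item ID Q_DATE, and every function name are assumptions made up for this example, and the sketch always separates media links, whereas the pipeline does so optionally. Relevance scoring (step 5) is left out.

```python
import re

# All tables below are hypothetical stand-ins for the pipeline's real inputs.
EXCLUDED_SECTIONS = {"External links"}                   # step 2
PAGE_TO_QID = {"Douglas Adams": "Q42", "England": "Q21",
               "1952": "Q_DATE"}                         # Q_DATE is a placeholder ID
REDIRECTS = {"D. Adams": "Douglas Adams"}                # redirect -> target page
NOISY_QIDS = {"Q_DATE"}                                  # step 4, e.g. dates
MEDIA_PREFIXES = ("File:", "Image:")

# A top-level heading has exactly two equals signs on each side.
HEADING = re.compile(r"^==(?!=)\s*(.+?)\s*==(?!=)\s*$", re.MULTILINE)
# Capture the link target, ignoring any anchor or label.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)[^\]]*\]\]")


def top_level_sections(wikitext):
    """Step 1: return (title, body) pairs, lead section included."""
    parts = HEADING.split(wikitext)
    return [("(lead)", parts[0])] + list(zip(parts[1::2], parts[2::2]))


def section_topics(wikitext):
    """Steps 2 to 4: return (topics, media) as lists of (section, value)."""
    topics, media = [], []
    for title, body in top_level_sections(wikitext):
        if title in EXCLUDED_SECTIONS:
            continue
        for match in WIKILINK.finditer(body):
            target = match.group(1).strip()
            if target.startswith(MEDIA_PREFIXES):
                media.append((title, target))        # kept apart from topics
                continue
            target = REDIRECTS.get(target, target)   # resolve redirect pages
            qid = PAGE_TO_QID.get(target)            # step 3
            if qid is None or qid in NOISY_QIDS:
                continue
            topics.append((title, qid))
    return topics, media


SAMPLE = """He was born in [[England]] in [[1952]].
== Career ==
[[File:Adams.jpg|thumb]] See [[D. Adams|the author]].
=== Early work ===
He lived in [[England]].
== External links ==
[[Douglas Adams]]
"""
topics, media = section_topics(SAMPLE)
```

Because only two-equals headings split the text, the subsection "Early work" stays inside its parent section "Career", which matches the top-level granularity described in step 1.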

Relevance score
We define relevance as a score that measures to what extent a given topic helps summarize and understand a given piece of Wikipedia content. The score enables topic ranking and is computed as a term frequency-inverse document frequency (TF-IDF) weight over the distribution of topics.

We must distinguish between article-level and section-level relevance, which summarize a Wikipedia article and a Wikipedia article section respectively. They follow slightly different implementations:
 * the former is a custom weight, where the TF component is computed across Wikipedias by leveraging the language-agnostic nature of Wikidata items;
 * the latter is a classic one, i.e., computed within the same Wikipedia;
 * both compute the IDF component within the same Wikipedia.

As a result, we expect article-level relevance to be much more meaningful than section-level relevance, because a much larger number of topics contributes to its computation. Moreover, TF-IDF performs poorly on short content, which is likely to affect the relevance scores of short sections with few topics.
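
To make the difference between the two flavors concrete, here is a small sketch. It assumes the textbook formulas (TF as a relative frequency, IDF as the natural logarithm of the number of documents divided by the number of documents containing the topic); the actual job may use different weighting or smoothing. The corpus layout and all function names are hypothetical.

```python
import math
from collections import Counter


def tf(topic, topics):
    """Relative frequency of a topic in a list of topics."""
    return Counter(topics)[topic] / len(topics)


def idf(topic, documents):
    """log(N / df), computed over the documents of one Wikipedia."""
    containing = sum(1 for doc in documents if topic in doc)
    return math.log(len(documents) / containing) if containing else 0.0


def article_relevance(topic, page, wiki, corpus):
    """Custom weight: TF pooled across Wikipedias, IDF within one Wikipedia.

    corpus maps wiki -> {page Wikidata item -> list of topic items}.
    """
    pooled = [t for pages in corpus.values() for t in pages.get(page, [])]
    return tf(topic, pooled) * idf(topic, list(corpus[wiki].values()))


def section_relevance(topic, section, sections):
    """Classic weight: TF and IDF both within the same Wikipedia."""
    return tf(topic, section) * idf(topic, sections)


# Toy corpus: the same article (item Q42) exists in two Wikipedias.
corpus = {
    "enwiki": {"Q42": ["Q21", "Q21", "Q84"], "Q5": ["Q21"]},
    "itwiki": {"Q42": ["Q84", "Q84", "Q84"]},
}
cross_wiki = article_relevance("Q84", "Q42", "enwiki", corpus)
single_wiki = tf("Q84", corpus["enwiki"]["Q42"]) * idf(
    "Q84", list(corpus["enwiki"].values())
)
```

In this toy corpus, pooling raises the TF of Q84 from 1/3 to 4/6 because the other language edition also links to it, so cross_wiki exceeds single_wiki. A topic that appears in every article of a Wikipedia, such as Q21 in the toy enwiki, gets an IDF of zero and therefore a relevance of zero.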

Code base

 * Spark job at https://gitlab.wikimedia.org/repos/structured-data/section-topics;
 * Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/section_topics_dag.py.