Structured Data Across Wikimedia/Section Topics/Data Pipeline
This page in a nutshell: The section topics data pipeline gathers Wikidata identifiers from Wikipedia article sections. It is owned by the Structured Data team.
What's a topic?
We define a topic as the Wikidata item that a wikilink points to, where the wikilink is extracted from a given piece of Wikipedia content.
How it works
The data pipeline is implemented as an Airflow job and breaks down into the following steps:
- two sensors that give green lights as soon as fresh data is available in the Data Lake;
- one Python Spark task that takes Wikipedia wikitext and Wikidata item page links as input, and outputs the section topics dataset.
A look at the data
Here is what a row of data looks like (manually hyperlinked, in case the reader wishes to check it):
Data processing flow
- Gather the content of top-level sections, lead section included;
- filter out sections that don't convey relevant topics, such as External links. See phab:T318092 and phab:T323504 for more details;
- extract Wikidata items from wikilinks: the so-called section topics;
- filter out noisy topics, such as dates. See phab:T323597 and phab:T323036 for more details;
- compute each topic's relevance score.
Note that we:
- resolve redirect pages;
- optionally separate media links from the main dataset.
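The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline's Spark code: the function name, the section denylist, the title-based date heuristic, the wikilink regular expression, and the placeholder QIDs are all hypothetical. Relevance scoring is left out.

```python
import re

# All names and values below are illustrative assumptions, not the pipeline's own.
SKIPPED_SECTIONS = {"external links", "references", "see also"}  # assumed denylist
MEDIA_PREFIXES = ("file:", "image:")
MONTHS = "january|february|march|april|may|june|july|august|september|october|november|december"
# Assumed heuristic for "noisy" date topics: bare years and day/month titles.
DATE_LIKE = re.compile(
    rf"^(\d{{1,4}}|(?:{MONTHS}) \d{{1,2}}|\d{{1,2}} (?:{MONTHS}))$", re.IGNORECASE
)
# Captures the link target of [[Target]], [[Target|label]] or [[Target#Anchor|label]].
WIKILINK = re.compile(r"\[\[([^\]\[|#]+)[^\]\[]*\]\]")


def extract_section_topics(sections, title_to_qid, redirects):
    """Return (topics, media) from a list of (section title, wikitext) pairs.

    `sections` holds top-level sections only, lead section included.
    """
    topics, media = [], []
    for section_title, wikitext in sections:
        # Filter out sections that don't convey relevant topics.
        if section_title.strip().lower() in SKIPPED_SECTIONS:
            continue
        for match in WIKILINK.finditer(wikitext):
            target = match.group(1).strip().replace("_", " ")
            # Keep media links apart from the main dataset.
            if target.lower().startswith(MEDIA_PREFIXES):
                media.append((section_title, target))
                continue
            # Resolve redirect pages, then look up the Wikidata item.
            target = redirects.get(target, target)
            qid = title_to_qid.get(target)
            # Drop links without an item and noisy topics such as dates.
            if qid is None or DATE_LIKE.match(target):
                continue
            topics.append((section_title, target, qid))
    return topics, media


# Toy input; the QIDs are placeholders, not real Wikidata identifiers.
sections = [
    ("Lead", "An [[England|English]] mathematician born in [[1815]]. [[File:Ada.jpg|thumb]]"),
    ("Work", "She worked with [[Charles Babbage]] on the [[Analytical engine#Design|engine]]."),
    ("External links", "[[Computer History Museum]]"),
]
title_to_qid = {
    "England": "Q900001",
    "Charles Babbage": "Q900002",
    "Analytical Engine": "Q900003",
    "1815": "Q900004",
}
redirects = {"Analytical engine": "Analytical Engine"}

topics, media = extract_section_topics(sections, title_to_qid, redirects)
```

With this toy input, `topics` keeps England, Charles Babbage and Analytical Engine (the latter through its redirect). The year link is dropped as noise, the External links section is skipped, and the file link ends up in `media`.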
We define relevance as a score that measures to what extent a given topic helps summarize and understand a given piece of Wikipedia content. The score enables topic ranking and is computed as a term frequency-inverse document frequency (TF-IDF) weight based on the distribution of topics.
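For reference, the textbook form of the weight is given below, with topics playing the role of terms. The symbols are our own notation, and the exact variant used by the pipeline (smoothing, logarithm base, normalization) is not stated on this page, so take it as an assumption:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\mathrm{relevance}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Here $f_{t,d}$ is the number of times topic $t$ occurs in the piece of content $d$, $N$ is the total number of pieces of content, and $n_t$ is the number of those that contain $t$.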
We must distinguish between article-level and section-level relevance, which summarize a Wikipedia article and a Wikipedia article section respectively. They follow slightly different implementations:
- the former is a custom weight, where the TF component is computed across Wikipedias by leveraging the language-agnostic nature of Wikidata items;
- the latter is a classic one, i.e., computed within the same Wikipedia;
- both compute the IDF component within the same Wikipedia.
As a result, we expect article-level relevance to be much more meaningful than section-level relevance, because a much larger number of topics contributes to the computation. Moreover, TF-IDF doesn't perform well on short content, which is likely to affect the relevance of short sections with few topics.
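The difference between the two relevance levels can be sketched in plain Python as follows. This is a guess at one possible reading of the description above, not the pipeline's implementation: the row layout, the function names, the unsmoothed natural-log IDF, and the choice to identify an article across Wikipedias by its own Wikidata item are all assumptions, and the identifiers are placeholders.

```python
import math
from collections import Counter, defaultdict

# Placeholder rows: (wiki, article item, section title, topic item).
ROWS = [
    ("enwiki", "A1", "Lead", "T1"),
    ("enwiki", "A1", "Lead", "T2"),
    ("enwiki", "A1", "Work", "T2"),
    ("enwiki", "A2", "Lead", "T2"),
    ("itwiki", "A1", "Lead", "T1"),
    ("itwiki", "A1", "Lead", "T1"),
]


def tf(counts):
    """Relative frequency of each topic within one piece of content."""
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def idf(docs_topics):
    """Unsmoothed IDF over pieces of content that all belong to one wiki."""
    n_docs = len(docs_topics)
    doc_freq = Counter(t for topics in docs_topics.values() for t in topics)
    return {topic: math.log(n_docs / df) for topic, df in doc_freq.items()}


def section_relevance(rows, wiki):
    """Classic TF-IDF: both TF and IDF come from the same wiki."""
    counts = defaultdict(Counter)
    for w, article, section, topic in rows:
        if w == wiki:
            counts[(article, section)][topic] += 1
    weights = idf({doc: set(c) for doc, c in counts.items()})
    return {
        doc: {t: f * weights[t] for t, f in tf(c).items()}
        for doc, c in counts.items()
    }


def article_relevance(rows, wiki):
    """Custom weight: TF across all wikis, IDF within the given wiki."""
    cross_wiki = defaultdict(Counter)  # topic counts over every language version
    local = defaultdict(set)           # topics seen in this wiki's version only
    for w, article, _section, topic in rows:
        cross_wiki[article][topic] += 1
        if w == wiki:
            local[article].add(topic)
    weights = idf(local)
    # Assumption: topics that never occur in this wiki have no IDF and are skipped.
    return {
        article: {
            t: f * weights[t]
            for t, f in tf(cross_wiki[article]).items()
            if t in weights
        }
        for article in local
    }


sections_en = section_relevance(ROWS, "enwiki")
articles_en = article_relevance(ROWS, "enwiki")
```

In this toy data, the TF of `T1` for article `A1` draws on five topic occurrences across two Wikipedias, while the lead section of the English version contributes only two. A section with a single topic always has a TF of 1, so its score depends on the IDF alone, which illustrates why short sections are hard to rank.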
- Spark job at https://gitlab.wikimedia.org/repos/structured-data/section-topics;
- Airflow DAG at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/platform_eng/dags/section_topics_dag.py.