Structured Data Across Wikimedia/Section Topics/Data Pipeline


What's a topic?

We define a topic as the Wikidata item corresponding to a wikilink extracted from a given piece of Wikipedia content. For example, the wikilink to Arnegisclus in the Solitary kingship section of the English Wikipedia article Attila yields the topic Q3623581, as shown in the example row below.

General architecture

How it works

The data pipeline is implemented as an Airflow job and breaks down into the following steps:

  1. two sensors that give green lights as soon as fresh data is available in the Data Lake;
  2. one Python Spark task that takes Wikipedias' wikitext and Wikidata item page links as input and outputs the section topics dataset (see the sketch below).
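
For illustration, a minimal Airflow DAG wired up this way might look like the sketch below. The DAG id, task ids, file paths, and operator choices are assumptions made for the example, not the production configuration, which may use different sensors and Spark operators.

  # Illustrative sketch only: DAG id, file paths, and operator choices are
  # assumptions, not the production configuration.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator
  from airflow.sensors.filesystem import FileSensor

  with DAG(
      dag_id="section_topics",
      start_date=datetime(2023, 1, 1),
      schedule_interval="@weekly",
      catchup=False,
  ) as dag:
      # 1. Two sensors wait for fresh snapshots to land in the Data Lake.
      wait_for_wikitext = FileSensor(
          task_id="wait_for_wikitext",
          filepath="/wmf/data/wikitext/_SUCCESS",  # hypothetical success flag
          poke_interval=3600,
      )
      wait_for_item_page_links = FileSensor(
          task_id="wait_for_wikidata_item_page_links",
          filepath="/wmf/data/wikidata_item_page_link/_SUCCESS",  # hypothetical
          poke_interval=3600,
      )

      # 2. One PySpark task consumes both inputs and writes the section topics dataset.
      compute_section_topics = BashOperator(
          task_id="compute_section_topics",
          bash_command="spark-submit section_topics.py --snapshot {{ ds }}",
      )

      [wait_for_wikitext, wait_for_item_page_links] >> compute_section_topics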

A look at the data

Here is what a row of data looks like:

  • snapshot: 2023-01-16
  • wiki_db: enwiki
  • page_namespace: 0
  • revision_id: 1127523670
  • page_qid: Q36724
  • page_id: 841
  • page_title: Attila
  • section_index: 5
  • section_title: Solitary kingship
  • topic_qid: Q3623581
  • topic_title: Arnegisclus
  • topic_score: 1.13
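
Assuming the dataset is exposed as a Hive table (the table name section_topics below is hypothetical), the row above could be retrieved with PySpark as follows:

  # Illustrative only: the table name and its availability are assumptions.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("section_topics_example").getOrCreate()

  spark.sql("""
      SELECT page_title, section_title, topic_qid, topic_title, topic_score
      FROM section_topics
      WHERE snapshot = '2023-01-16'
        AND wiki_db = 'enwiki'
        AND page_qid = 'Q36724'     -- Attila
        AND section_index = 5       -- Solitary kingship
      ORDER BY topic_score DESC
  """).show(truncate=False)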

Data processing flow

  1. Gather the content of top-level sections, lead section included;
  2. filter out sections that don't convey relevant topics, such as External links. See phab:T318092 and phab:T323504 for more details;
  3. extract Wikidata items from wikilinks: the so-called section topics;
  4. filter out noisy topics, such as dates. See phab:T323597 and phab:T323036 for more details;
  5. compute each topic's relevance score (a simplified sketch of the whole flow follows the note below).

Note that we:

  • resolve redirect pages;
  • optionally separate media links from the main dataset.
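
A heavily simplified PySpark sketch of this flow is given below. Table names, column names, and the regex-based wikilink extraction are all assumptions made for illustration; the production job parses wikitext properly, resolves redirects, handles media links, and applies the full set of section and topic filters tracked in the Phabricator tasks above.

  # Simplified sketch: the input tables and their schemas are hypothetical,
  # and a regex stands in for real wikitext parsing.
  import re

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("section_topics_sketch").getOrCreate()

  IRRELEVANT_SECTIONS = ["External links", "References", "See also"]  # example subset
  WIKILINK = re.compile(r"\[\[([^|\]#]+)")

  @F.udf("array<string>")
  def extract_link_titles(section_wikitext):
      """Return the target titles of the wikilinks found in a section."""
      return [m.strip() for m in WIKILINK.findall(section_wikitext or "")]

  # 1-2. Top-level sections (lead included), minus sections that don't convey relevant topics.
  sections = (
      spark.table("article_sections")  # hypothetical: one row per top-level section
      .where(~F.col("section_title").isin(IRRELEVANT_SECTIONS))
  )

  # 3. Wikilinks become candidate topics, resolved to Wikidata items via the
  #    item/page-link mapping (assumed columns: wiki_db, link_title, topic_qid).
  topics = (
      sections
      .withColumn("link_title", F.explode(extract_link_titles("section_wikitext")))
      .join(spark.table("wikidata_item_page_link"), on=["wiki_db", "link_title"])
  )

  # 4. Drop noisy topics, e.g. items representing dates.
  topics = topics.join(
      spark.table("date_items"),  # hypothetical list of date-like items (column: topic_qid)
      on="topic_qid",
      how="left_anti",
  )

  # 5. Relevance scores are computed on top of this (see the TF-IDF sketch below).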

Relevance score

We define relevance as a score that measures to what extent a given topic helps summarize and understand a given piece of Wikipedia content. This enables topic ranking and is computed as a term frequency-inverse document frequency (TF-IDF) weight based on the distribution of topics.

We must distinguish between article-level and section-level relevance, which capture how well a topic summarizes a whole Wikipedia article and a single article section respectively. The two follow slightly different implementations:

  • the former is a custom weight, where the TF component is computed across Wikipedias by leveraging the language-agnostic nature of Wikidata items;
  • the latter is a classic TF-IDF weight, i.e., computed within the same Wikipedia;
  • both compute the IDF component within the same Wikipedia.

As a result, we expect article-level relevance to be much more meaningful than section-level relevance, due to the much larger number of topics that contribute to the computation. Moreover, TF-IDF doesn't perform well on short content, which is likely to affect the relevance of short sections with few topics.
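
As an illustration of the classic, section-level variant, a TF-IDF score could be computed roughly as follows. This is a sketch over a hypothetical intermediate table of topic occurrences, not the production implementation; the article-level variant would instead aggregate the TF component across Wikipedias via the language-agnostic Wikidata QIDs.

  # Illustrative classic TF-IDF, computed within one wiki; sections play the
  # role of documents. Column names follow the example row above.
  from pyspark.sql import SparkSession, Window, functions as F

  spark = SparkSession.builder.appName("section_relevance_sketch").getOrCreate()

  # Hypothetical intermediate: one row per wikilink occurrence in a section.
  occurrences = spark.table("section_topic_occurrences")
  section = ["wiki_db", "page_id", "section_index"]

  # Term frequency: occurrences of a topic in its section, normalised by the
  # total number of topic occurrences in that section.
  tf = (
      occurrences.groupBy(*section, "topic_qid")
      .agg(F.count("*").alias("occurrences"))
      .withColumn(
          "tf",
          F.col("occurrences") / F.sum("occurrences").over(Window.partitionBy(*section)),
      )
  )

  # Inverse document frequency: how rare a topic is across all sections of the wiki.
  n_sections = (
      occurrences.select(*section).distinct()
      .groupBy("wiki_db").count()
      .withColumnRenamed("count", "n_sections")
  )
  idf = (
      occurrences.select(*section, "topic_qid").distinct()
      .groupBy("wiki_db", "topic_qid")
      .agg(F.count("*").alias("n_sections_with_topic"))
      .join(n_sections, "wiki_db")
      .withColumn("idf", F.log(F.col("n_sections") / F.col("n_sections_with_topic")))
  )

  relevance = (
      tf.join(idf, ["wiki_db", "topic_qid"])
      .withColumn("topic_score", F.col("tf") * F.col("idf"))
  )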

Code base

See also