Structured Data Across Wikimedia/Image Suggestions/Data Pipeline

The Structured Data side
Our main duties are:
 * extract relevant content from Commons, Wikidata and Wikipedias;
 * transform it into datasets suitable for Commons and Wikipedias search indices;
 * load image suggestions into the database that will serve the API.

In terms of the architecture diagram above, we:
 * own the Airflow job;
 * feed specific Hive tables in the Data Lake;
 * provide content for Data Persistence, namely input for Elasticsearch updates and Cassandra.

The Airflow job breaks down into the following steps:
 * 1) a set of sensors that give the green light as soon as fresh data is available in the Data Lake;
 * 2) a Python Spark task that gathers Elasticsearch weighted tags for the Commons search index;
 * 3) a Python Spark task that gathers actual image suggestions, filtering out those rejected by the user community;
 * 4) a Python Spark task that gathers suggestion flags for Wikipedias search indices;
 * 5) a set of Scala Spark tasks that feed Cassandra with suggestions.

Get good suggestions from Commons
First, we aim to retrieve relevant image candidates for unillustrated Wikipedia articles. To do so, we build sets of (property, value, score) triples that serve as tags to weight images in the Commons search index. We leverage three properties:
 * two from Wikidata;
 * one from Wikipedias, namely article lead images.
For each property, we retrieve the corresponding Wikidata item and compute a confidence score.

Scores computation
When available, we consider the first Wikidata property crucial, and therefore assign a constant maximum score of 1,000 to its values.

For the second Wikidata property, which points to a category, we implement the following simple intuition: a category with few members is more important than one with many members. As a result, given a Wikidata category item, its score is inversely proportional to the logarithm of the total number of images it holds.
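
The intuition above can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the function name, the choice of the natural logarithm, the proportionality constant, and the +1 offset inside the logarithm are all assumptions. Only the inverse-logarithmic shape and the maximum score of 1,000 come from the text.

```python
import math

MAX_SCORE = 1000  # maximum score stated in the text


def category_score(total_images: int, constant: float = MAX_SCORE) -> int:
    """Score a category: the fewer images it holds, the higher the score.

    `constant` is a hypothetical proportionality constant.
    """
    if total_images < 1:
        raise ValueError("a category must hold at least one image")
    # Assumption: log(n + 1) avoids dividing by log(1) == 0
    # for categories that hold a single image.
    raw = constant / math.log(total_images + 1)
    # Cap the result at the maximum score.
    return min(MAX_SCORE, round(raw))
```

Under these assumptions, a category with 10 images scores higher than one with 1,000 images, and very small categories saturate at the maximum.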

For article lead images, the score is based on the number of main namespace pages that link to articles with the given lead image, grouped by the Wikidata item of the article. For instance, let an image be the lead image of two articles that map to the same Wikidata item. The score for that item is proportional to the sum of incoming links to the two articles. Based on empirical evidence, we pick a scaling factor of 0.2 and a threshold of 5,000 for incoming links. Hence, if the sum of incoming links is:
 * less than the threshold, then we set the score to incoming links * scaling factor;
 * greater than or equal to the threshold, then we set the score to its maximum value, i.e., 1,000.
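
The thresholding rule above can be sketched in a few lines. The scaling factor, threshold, and maximum score are taken from the text. The function name, the input shape (one incoming-link count per article sharing the lead image and the Wikidata item), and the rounding to an integer are assumptions made for illustration.

```python
from typing import Iterable

SCALING_FACTOR = 0.2  # from the text
THRESHOLD = 5000      # incoming links, from the text
MAX_SCORE = 1000      # from the text


def lead_image_score(incoming_links_per_article: Iterable[int]) -> int:
    """Score a lead image for one Wikidata item.

    Each element is the number of main namespace pages linking to
    one article that uses the image as its lead image.
    """
    total = sum(incoming_links_per_article)
    if total >= THRESHOLD:
        return MAX_SCORE
    # Assumption: scores are rounded to the nearest integer.
    return round(total * SCALING_FACTOR)


# Two articles with 1,200 and 300 incoming links share a lead image.
print(lead_image_score([1200, 300]))   # below the threshold
print(lead_image_score([4000, 2500]))  # at or above the threshold
```

Note that the threshold multiplied by the scaling factor equals the maximum score (5,000 * 0.2 = 1,000), so the score grows linearly up to the cap.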

Find suggestions for Wikipedia articles
The second major step involves mining images that serve as suitable candidates for unillustrated Wikipedia articles. We consider an article to be unillustrated if:
 * it has no images at all;
 * its images are used so widely across Wiki projects that they are likely to be icons or placeholders;
 * the corresponding Wikidata item is an instance of a list, a year, a number, or a name.

We then apply the following heuristics to filter out irrelevant candidates. A candidate image is discarded if:
 * the file name contains substrings that indicate icons or placeholders;
 * the image is in the placeholder category.
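
A minimal sketch of these two filtering heuristics follows. The substrings and the category name below are hypothetical stand-ins chosen for illustration, as is the function name. The actual values used by the pipeline are not given here.

```python
from typing import Iterable

# Hypothetical values: not the pipeline's real configuration.
PLACEHOLDER_SUBSTRINGS = ("icon", "placeholder")
PLACEHOLDER_CATEGORY = "Image placeholders"


def is_irrelevant(file_name: str, categories: Iterable[str]) -> bool:
    """Return True if a candidate image should be filtered out."""
    name = file_name.lower()
    # Heuristic 1: the file name hints at an icon or placeholder.
    if any(substring in name for substring in PLACEHOLDER_SUBSTRINGS):
        return True
    # Heuristic 2: the image belongs to the placeholder category.
    return PLACEHOLDER_CATEGORY in set(categories)


candidates = [
    ("Flag_icon.svg", []),
    ("Eiffel_Tower.jpg", ["Towers in Paris"]),
    ("Blank.png", ["Image placeholders"]),
]
kept = [name for name, cats in candidates if not is_irrelevant(name, cats)]
print(kept)
```

Matching on lowercased file names is a design choice of this sketch, so that variants such as "Icon" and "ICON" are caught by the same substring.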

Tell whether an article has suggestions
The final stage consists of adding markers to Wikipedia articles for which we have suggestion candidates. We do this by injecting boolean flags into the respective search indices.