Structured Data Across Wikimedia/Image Suggestions/Data Pipeline

The Structured Data Side
Our main duties are (a minimal end-to-end sketch follows the list):
 * extract relevant content from Commons, Wikidata, and the Wikipedias;
 * transform it into datasets suitable for the Commons and Wikipedia search indices;
 * load image suggestions into the database that will serve the API.
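At a glance, each run follows a conventional Spark ETL shape. The sketch below is illustrative only: the table and column names (<code>discovery.commons_entities</code>, <code>confidence</code>, and so on) are hypothetical stand-ins, not the production schema.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("image-suggestions-etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Extract: read source content from the Data Lake (hypothetical tables).
commons = spark.read.table("discovery.commons_entities")
wikidata = spark.read.table("discovery.wikidata_items")

# Transform: derive candidate (article, image) pairs.
suggestions = (
    commons.join(wikidata, "item_id")
    .select("wiki", "page_id", "image", "confidence")
)

# Load: persist a dataset that downstream consumers pick up.
suggestions.write.mode("overwrite").saveAsTable("analytics.image_suggestions")
</syntaxhighlight>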

In terms of the architecture diagram above, we:
 * own the Airflow job;
 * feed specific Hive tables in the Data Lake;
 * provide content for Data Persistence, namely the input for Elasticsearch updates and for Cassandra.
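Concretely, each run publishes its output as a new partition of a Hive table that downstream consumers can wait on and read. The snippet below is a hypothetical illustration of such a write, reusing the <code>suggestions</code> DataFrame from the sketch above; the real table names and partitioning scheme may differ.

<syntaxhighlight lang="python">
from pyspark.sql import functions as F

# Hypothetical table and partition names: one snapshot partition per run,
# which downstream consumers (the Elasticsearch updater, the Cassandra
# loader) can sense and read.
(
    suggestions  # DataFrame from the transform step above
    .withColumn("snapshot", F.lit("2024-01-01"))
    .write.mode("overwrite")
    .partitionBy("snapshot")
    .saveAsTable("analytics.image_suggestions_weighted_tags")
)
</syntaxhighlight>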

The Airflow job breaks down into the following steps, each sketched below the list:
 * 1) a set of sensors that give the green light as soon as fresh data is available in the Data Lake;
 * 2) a Python Spark task that gathers Elasticsearch weighted tags for the Commons search index;
 * 3) a Python Spark task that gathers the actual image suggestions, filtering out those rejected by the user community;
 * 4) a Python Spark task that gathers suggestion flags for the Wikipedias' search indices;
 * 5) a set of Scala Spark tasks that feed Cassandra with the suggestions.
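Step 1 can be pictured as partition sensors wired ahead of the Spark tasks in the DAG. The skeleton below is an assumption-laden sketch: the sensor class, partition names, schedule, and task ids are illustrative, not the production DAG.

<syntaxhighlight lang="python">
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)
from airflow.providers.apache.spark.operators.spark_submit import (
    SparkSubmitOperator,
)

with DAG(
    dag_id="image_suggestions",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
) as dag:
    # Step 1: block until the upstream snapshot partition exists in the Lake.
    wait_for_wikidata = NamedHivePartitionSensor(
        task_id="wait_for_wikidata_snapshot",
        partition_names=["wmf.wikidata_entity/snapshot={{ ds }}"],
        poke_interval=60 * 60,  # re-check hourly
    )

    # Step 2 (and similarly 3-5): a Spark job launched once the sensor fires.
    gather_weighted_tags = SparkSubmitOperator(
        task_id="gather_commons_weighted_tags",
        application="gather_weighted_tags.py",  # hypothetical script name
    )

    wait_for_wikidata >> gather_weighted_tags
</syntaxhighlight>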
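Step 2 produces, per Commons file, the weighted tags to be indexed into the Commons search cluster. A minimal sketch of the shape of that computation, with hypothetical table, column, and tag names:

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical input: Commons files linked from Wikidata items.
links = spark.read.table("analytics.commons_wikidata_links")

# One weighted tag per file; scaling the score to an integer range is
# an assumption about what the search update pipeline expects.
weighted_tags = links.select(
    F.col("commons_page_id").alias("page_id"),
    F.lit("image.linked.from.wikidata").alias("tag"),  # illustrative tag name
    (F.col("confidence") * 1000).cast("int").alias("score"),
)

weighted_tags.write.mode("overwrite").saveAsTable("analytics.commons_weighted_tags")
</syntaxhighlight>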
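Step 3 is essentially a join of the candidate suggestions against community feedback, dropping pairs that users have rejected. A left anti-join expresses that directly; the table and column names are again illustrative:

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

candidates = spark.read.table("analytics.image_suggestion_candidates")
feedback = spark.read.table("analytics.image_suggestion_feedback")

rejected = feedback.where(feedback.is_rejected)

# Keep only candidates with no matching rejection on record.
accepted = candidates.join(
    rejected,
    on=["wiki", "page_id", "image"],
    how="left_anti",
)
</syntaxhighlight>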
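Step 4 reduces the surviving suggestions to a per-article flag, so that a Wikipedia's search index can surface articles that have suggestions available. A sketch, assuming one flag-like weighted tag per article (the tag name is illustrative):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

accepted = spark.read.table("analytics.image_suggestions_accepted")

# One flag per article that has at least one surviving suggestion.
flags = (
    accepted
    .select("wiki", "page_id")
    .distinct()
    .withColumn("tag", F.lit("recommendation.image"))  # illustrative tag name
)
</syntaxhighlight>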
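Step 5 is implemented as Scala Spark tasks in production; for consistency with the other sketches, here is the equivalent write expressed through the Spark Cassandra Connector's DataFrame API from Python. The keyspace and table names are made up.

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

accepted = spark.read.table("analytics.image_suggestions_accepted")

# Requires the spark-cassandra-connector package on the Spark classpath.
(
    accepted.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="image_suggestions", table="suggestions")
    .mode("append")
    .save()
)
</syntaxhighlight>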