Platform Engineering Team/Data Value Stream/Airflow Coding Convention

This page describes the coding conventions for batch jobs, such as algorithms, intended to run on Platform Engineering's Airflow instance. For Python coding conventions, see Platform Engineering Team/Data Value Stream/Data Pipeline Onboarding/

Jupyter Notebooks
While the use of Jupyter notebooks is expected during development, running notebooks directly on Airflow instances is NOT recommended. We recommend converting Jupyter notebooks into executable scripts.

To convert a Jupyter notebook into a script, use `nbconvert` on the command line. This will convert the Jupyter notebook file notebook.ipynb into an executable script.
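For example, a command along these lines (assuming the notebook is named notebook.ipynb):

```
jupyter nbconvert --to script notebook.ipynb
```

This writes a plain Python script (notebook.py) next to the notebook, which can then be run like any other script.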

Spark
We recommend using Spark's cluster mode to run jobs on the Airflow instance. As of this writing, wmfdata does not support cluster mode, so we recommend initializing Spark directly:
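A minimal sketch of creating a SparkSession with pyspark directly rather than through wmfdata; the application name and resource settings are illustrative placeholders, and cluster deploy mode is ultimately selected when the job is submitted (for example via spark-submit --deploy-mode cluster):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession directly with pyspark.
# The app name and resource settings below are placeholders.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("my_airflow_job")
    .config("spark.executor.memory", "4g")
    .config("spark.dynamicAllocation.maxExecutors", "64")
    .getOrCreate()
)
```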

Local computations and storage
The Airflow instance has limited memory and storage, which is shared among all Airflow jobs. For that reason, we recommend against performing computations or storing data locally on the Airflow host.

Computation
We recommend running computations on Spark instead of locally.

For example, instead of pulling the results down to the Airflow host with pandas and running the computation locally:
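A hypothetical example of the pattern to avoid; the table and column names are placeholders:

```python
# Pull the full result set down to the Airflow host and compute in pandas.
local_df = spark.sql("SELECT page_id, page_title FROM my_database.my_table").toPandas()
local_df["title_length"] = local_df["page_title"].apply(len)
```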

We recommend transforming functions into Spark’s User Defined Functions (UDFs) and then applying the changes to the column(s):
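A sketch of the same computation expressed as a UDF so it runs on the cluster; the table and column names are the same placeholders as above:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Wrap the Python function as a Spark UDF and apply it to the column,
# keeping the computation on the cluster instead of the Airflow host.
title_length = F.udf(lambda title: len(title), IntegerType())

df = spark.sql("SELECT page_id, page_title FROM my_database.my_table")
df = df.withColumn("title_length", title_length(F.col("page_title")))
```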

Storage
We recommend saving files on HDFS. For example, instead of saving a data frame as a local file as shown below:
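A hypothetical example of the pattern to avoid; the output path is a placeholder:

```python
# Collect the data to the Airflow host and write it to the local filesystem.
df.toPandas().to_csv("/srv/airflow/output/results.csv", index=False)
```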

We recommend saving files directly to HDFS:
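A sketch of writing the same DataFrame straight to HDFS; the output path is a placeholder:

```python
# Write the Spark DataFrame directly to HDFS as Parquet.
df.write.parquet("hdfs:///user/analytics/output/results", mode="overwrite")
```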

Idempotency
We recommend making all Airflow jobs idempotent. In practice, this generally means making sure any randomized operations are seeded so they can be reproduced.
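A minimal sketch of seeding randomized operations so repeated runs of the same task produce the same output; the seed value and the sampling step are illustrative:

```python
import random

import numpy as np

# Fix the seeds so a re-run of this task reproduces the same results.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Spark operations that sample also accept an explicit seed.
sample = df.sample(fraction=0.01, seed=SEED)
```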

Additional Resources:

 * DAG best practices