User:GModena (WMF)/Data Pipeline Onboarding

This page describes how to onboard a new data pipeline to the Generated Data Platform.

A five-slide overview of our capabilities can be found at https://docs.google.com/presentation/d/1gNDKy_XNgbXtZqNHCd2m63JSzURqL_s-Fzc6SBBKtxA/edit#slide=id.g1083d4ca5ed_0_0

Onboarding
You have developed a notebook or a set of pyspark scripts, and would now like to orchestrate and schedule them on the Generated Data Platform.

The onboarding process requires the following steps:


 * 1) Create a new Phabricator Task on the Generated Data Platform board and describe your use case. It's important to include (TODO: turn this into a phab template):
   * What technologies are used by your project? (e.g. Jupyter, Hive, etc.)
   * What are the data sources?
   * Where do you plan to store data?
   * What is the pipeline schedule?
   * Privacy review?
   * Data volumes, cluster resources?
 * 2) Your code will need to be refactored according to our conventions. See the Create a new datapipeline section below.
 * 3) We'll ask that your code passes a number of code checks. Once done, open a Draft merge request.
 * 4) During code review we'll assist you with deploying your pipeline to our systems.
 * 5) Once the request is merged, your data pipeline will be operated according to the agreed upon schedule and SLOs.

A flow chart of the proposed onboarding process is available at https://miro.com/app/board/uXjVOZRVVsQ=/

Create a new datapipeline
This section will describe:


 * 1) How to use our scaffolding tools to get started
 * 2) How to organize project code
 * 3) How to orchestrate tasks with Airflow

Scaffolding
Fork (or clone) our data pipelines monorepo (TODO: this will move to the Generated Data Platform group once that's in place):

    git clone git@gitlab.wikimedia.org:gmodena/platform-airflow-dags.git

Create a branch for your new data pipeline. Each new branch should reference its Phabricator task:

    git checkout -b -your-data-pipeline

Scaffold a new datapipeline project with:

    make datapipeline

This will create a project template under your_data_pipeline and a new Airflow DAG template under dags/your_data_pipeline_dag.py.

Organize project code
your_data_pipeline/README.md will contain getting started information.

This directory contains the transformations (Tasks) that you want to orchestrate and schedule. Project code is organized as follows:


 * conf contains Spark job specific config files. `spark.properties` lets you define your cluster topology and desired resources. We default to a [yarn-regular](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Spark_Resource_Settings) sized cluster.
 * pyspark contains Spark based data processing tasks.
 * sql contains SQL/HQL queries.

Python tasks must be located under pyspark and follow these conventions:


 * runtime dependencies must be declared in requirements.txt
 * test dependencies must be declared in requirements-test.txt
 * all code belongs to the src module.
 * src/transform.py contains boilerplate to help you get started with a pyspark job.
 * conftest.py and tests contain pytest boilerplate to test spark code.
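
As an illustration of the layout above, a job in src/transform.py might look roughly like the sketch below. This is not the scaffold's actual boilerplate: the `parse_args` and `run` names, the assumption that the input and output paths arrive as two positional command-line arguments, and the use of Parquet are all hypothetical. The pyspark import is deferred so that the module can be imported, and its argument handling unit-tested, without a Spark installation.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the (assumed) positional input and output paths."""
    parser = argparse.ArgumentParser(description="Sample pyspark transformation")
    parser.add_argument("input_path", help="HDFS directory holding the input data")
    parser.add_argument("output_path", help="HDFS directory where output is written")
    return parser.parse_args(argv)


def run(input_path: str, output_path: str) -> None:
    """Read the input, apply a placeholder transformation, write the output."""
    # Imported here so that importing this module does not require pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Assumption: the data is stored as Parquet.
    df = spark.read.parquet(input_path)
    # Placeholder transformation: drop exact duplicate rows.
    df.dropDuplicates().write.mode("overwrite").parquet(output_path)
    spark.stop()
```

A script entry point would then call `run` with the two paths returned by `parse_args`. Keeping argument parsing separate from the Spark logic makes both halves easy to cover with pytest.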

Task orchestration with Airflow
dags/your_data_pipeline_dag.py contains a template to orchestrate sequential pyspark (and SQL) tasks according to our conventions. A Spark job can be configured as an instance of factory.sequence.PySparkTask. PySparkTasks to be executed sequentially can be appended to a task list, which is then passed to a generate_dag method that stitches them together into an Airflow DAG. The example below is taken from our sample-project pipeline:

    import os

    # Configure a Spark environment to run sample-project
    # in yarn-cluster mode.
    # SparkConfig will take care of configuring PYSPARK_SUBMIT_ARGS,
    # as well as Python dependencies.
    spark_config = SparkConfig(
        pipeline="sample-project",
        pipeline_home=config["pipeline_home"],
    )

    # A spark job is a script that takes some input
    # and produces some output.
    # The script should be provided in your project src module.
    pyspark_script = os.path.join(
        config["pipeline_home"], "sample-project", "pyspark", "src", "transform.py"
    )

    # You should specify the HDFS directory
    # where a task input data resides.
    input_path = "/path/to/hdfs/input"

    # You should specify the HDFS directory
    # where a task output data should be saved.
    output_path = "/path/to/hdfs/output"

    # PySparkTask is a helper class that
    # helps you submit a pyspark_script to the cluster.
    t1 = PySparkTask(
        main=pyspark_script,
        input_path=input_path,
        output_path=output_path,
        config=spark_config,
    )

    tasks = [t1]
    # generate_dag will chain and execute tasks in sequence (t1 >> t2 >> ... >> tn).
    # The generated dag is appended to the global dags namespace.
    globals()["sample-project"] = generate_dag(
        pipeline="sample-project", tasks=tasks, dag_args=dag_args
    )

From the top level directory, you can now run the DAG validation tests. They check that dags/your_data_pipeline_dag.py is a valid Airflow DAG. The output should look like this:

    ---------- coverage: platform linux, python 3.7.11-final-0 -----------
    Name                                    Stmts   Miss  Cover
    -----------------------------------------------------------
    dags/factory/sequence.py                   70      3    96%
    dags/ima.py                                49      5    90%
    dags/similarusers-train-and-ingest.py      20      0   100%
    dags/your_data_pipeline_dag.py             19      0   100%
    -----------------------------------------------------------
    TOTAL                                     158      8    95%

    ============== 8 passed, 8 warnings in 12.75s ==============
    _________________________ summary _________________________
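
To make the `t1 >> t2 >> ... >> tn` chaining concrete, here is a dependency-free sketch of the idea. It is not the implementation of generate_dag or of Airflow's operators: `ToyTask` and `chain_sequentially` are hypothetical stand-ins that only mimic how `>>` records a downstream dependency.

```python
from typing import List


class ToyTask:
    """Hypothetical stand-in for an Airflow operator."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: List["ToyTask"] = []

    def __rshift__(self, other: "ToyTask") -> "ToyTask":
        # Record `other` as running after `self`; return it so chains compose.
        self.downstream.append(other)
        return other


def chain_sequentially(tasks: List[ToyTask]) -> List[str]:
    """Link tasks so each one runs after the previous; return the run order."""
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
    return [task.task_id for task in tasks]


t1, t2, t3 = ToyTask("extract"), ToyTask("transform"), ToyTask("load")
order = chain_sequentially([t1, t2, t3])
```

Because each task is linked only to its immediate successor, the DAG stays a simple linear sequence, which matches the "thin orchestration layer" described on this page.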

What is a data pipeline
A Generated Datasets Platform pipeline is made up of two components:


 * 1) Project specific tasks and data transformations that operate on input (sources) and produce output (sinks). We depend on Apache Spark for elastic compute.
 * 2) An Airflow DAG, which is a thin orchestration layer that composes and executes tasks.

Data pipelines are executed on Hadoop. Elastic compute is provided by Spark (jobs are deployed in cluster mode). Scheduling and orchestration are delegated to Apache Airflow. Currently we support Python based projects. Scala support is planned.

Some caveats apply.

Task Orchestration

 * Airflow dags are a thin layer of (declarative) execution steps.
 * Airflow dags must not contain any logic.
 * Airflow tasks must not perform local compute.
 * Airflow tasks must not persist data locally.

Compute

 * pyspark jobs must be deployed in cluster-mode.
 * pyspark jobs can declare their dependencies and virtual environments, but they must not vendor libraries or third-party modules (e.g. research algos).
 * pyspark code should be idiomatic
 * Avoid computation on the driver; use UDFs instead.

Conventions, code style
We favour test-driven development with pytest, lint with flake8, and type check with mypy. We encourage, but do not yet enforce, the use of automatic code formatters. We log errors and information messages with the Python logging library.
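
A small sketch of what these conventions look like together: a fully typed function, a message logged through the standard logging library, and a pytest-style test. The function `count_by_key` and its test are hypothetical examples, not code from the repository.

```python
import logging
from collections import Counter
from typing import Dict, Iterable

logger = logging.getLogger(__name__)


def count_by_key(keys: Iterable[str]) -> Dict[str, int]:
    """Count how many times each key occurs."""
    counts = dict(Counter(keys))
    logger.info("Counted %d distinct keys", len(counts))
    return counts


def test_count_by_key() -> None:
    # pytest collects functions named test_*; plain asserts are enough.
    assert count_by_key(["a", "b", "a"]) == {"a": 2, "b": 1}
    assert count_by_key([]) == {}
```

In a scaffolded project the test function would normally live under the tests directory rather than next to the code it exercises.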

Codechecks
A valid project is expected to pass the following code checks:


 * Compile time type-checking
 * Unit tests
 * Linting
 * DAG validation tests

Code checks are triggered automatically.

DAG validation tests live in their own directory at the top level of the repository. They can also be triggered manually.
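
One cheap check that can run before any Airflow-based validation is to confirm that every DAG file at least parses as valid Python. The sketch below is an illustration under that assumption, not the project's actual test suite; the `find_unparsable` helper is hypothetical and does not import Airflow or execute the DAG files.

```python
import ast
from pathlib import Path
from typing import List


def find_unparsable(dag_dir: Path) -> List[Path]:
    """Return the .py files directly under dag_dir that fail to parse."""
    broken: List[Path] = []
    for path in sorted(dag_dir.glob("*.py")):
        try:
            # Parse only; the file is never imported or executed.
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            broken.append(path)
    return broken
```

A pytest test could simply assert that `find_unparsable(Path("dags"))` returns an empty list. This catches syntax errors only; it says nothing about whether the file defines a well-formed DAG.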

Continuous Integration
GitLab Pipelines are currently unavailable in Wikimedia's instance. To automate CI, for demo purposes, we propose to mirror this repo to GitHub and execute tests on a GitHub runner.

A GitHub Actions workflow (build.yml) implements CI and runs on every push. Output is available at https://github.com/gmodena/wmf-platform-airflow-dags/actions/workflows/build.yml

Python Code Style
The most up-to-date spec of our code style can be found at https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/blob/multi-project-dags-repo/datapipeline-scaffold/%7B%7Bcookiecutter.pipeline_directory%7D%7D/pyspark/tox.ini

We lint with flake8 and the following (conservative) settings:


 * McCabe complexity threshold: 10
 * maximum allowed line length: 127 (Default PEP8: 79)
 * check for syntax errors or undefined names

We perform compile time type checks with mypy and the following rule set:

    [mypy]
    python_version = 3.7
    disallow_untyped_defs = True    # methods signature should be typed
    disallow_any_unimported = True  # disallows usage of types that come from unfollowed imports
    no_implicit_optional = True     # <- Explicit is better than implicit. Open to debate :)
    check_untyped_defs = True       # Type-checks the interior of functions without type annotations.
    warn_return_any = True          # Shows a warning when returning a value with type Any from a function declared with a non-Any return type.
    show_error_codes = True         # Show error codes in output
    warn_unused_ignores = True      # Warns about unneeded # type: ignore comments.
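
For illustration, the hypothetical function below is written to satisfy these rules: its signature is fully annotated (disallow_untyped_defs), the parameter that defaults to None is declared Optional explicitly (no_implicit_optional), and it returns a concrete str rather than Any (warn_return_any). The name `output_partition` and its path layout are invented for this example.

```python
from typing import Optional


def output_partition(base_path: str, snapshot: Optional[str] = None) -> str:
    """Build an output path, optionally scoped to a snapshot partition."""
    # Writing `snapshot: str = None` would be rejected under no_implicit_optional.
    if snapshot is None:
        return base_path
    return f"{base_path}/snapshot={snapshot}"
```

The `typing.Optional` spelling is used because the rule set targets Python 3.7, which predates the `str | None` syntax.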

Deployment
This section describes how to add a new pipeline to our deployment targets. This is a step that currently requires a member of the Generated Data Platform team in the loop.

Pipelines to deploy are declared in a list variable. To deploy a new pipeline, append its project directory name to that list, so that the new pipeline's directory name becomes the last entry.

Example
TODO: link to sample-project example

Operation & SLOs
The systems and software are currently a work in progress. No SLOs are available at this stage.