User:LBowmaker (WMF)/Airflow job in 5 mins

The following page is aimed at creating very simple jobs that are scheduled via Airflow. In this context, simple means running a query and writing the results to another table. If the query and output table are already known, it should be possible to copy and paste your query into the examples below in less than 5 minutes (pushing to the required repos and testing might take a bit longer, but the intention is that migrating an existing job shouldn't take more than a few hours, and you should only need to modify fewer than 10 lines of Python code). It is intended to be a very simple version of this developer guide.

Pre-requisite knowledge:

  • Your job will be scheduled using a tool called Airflow (to learn more click here).
  • The process will require you to copy and edit three files: a query file containing your Hive query, a Python file that schedules your query, and another Python file containing tests for your job.
  • Some basic knowledge of Git commands and code repositories.

How to schedule your first simple job

Step 1: Create your query file

From your team's chosen repo, run the following commands:

git clone <<your-git-url.git>>
git checkout -b your-new-job

Make a copy of the example .hql file. Make changes as directed in the file. Save your .hql file under the appropriate folder.

Make sure to test your query on a stat box, replacing my_file.hql with your file name and passing your query's parameters with -d (each -d name=value pair substitutes ${name} in the query):

spark3-sql -f my_file.hql -d your_arg=1

If everything works as expected then run:

git add your_file_name.hql
git commit your_file_name.hql -m "Some useful message"
git push

Now go to your Git repo and click 'New pull request'. Leave the base branch as master, set the compare branch to your-new-job, then click 'Create pull request'.

Have other teammates or Data Engineers review the code; once it is approved, click 'Merge' to merge it into the main branch.

Step 2: Create your scheduling file

Run the following commands:

git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
git checkout -b your-new-job

Make a copy of the example Python file here. Make changes as directed in the file. Save your file under the product_analytics/dag folder.
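For orientation, here is a minimal sketch of what a scheduling file boils down to, written against the generic Airflow API. The dag_id, schedule, and paths below are hypothetical placeholders, and the real example file uses WMF's shared helpers rather than a bare BashOperator, so treat this as an illustration of the moving parts, not something to copy verbatim:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="your_dag_id",             # hypothetical: name this after your job
    start_date=datetime(2023, 1, 1),  # first date the job should run for
    schedule="@daily",                # how often to run the query
    catchup=False,                    # don't automatically backfill past runs
) as dag:
    # Run the same spark3-sql command you tested on the stat box.
    run_query = BashOperator(
        task_id="run_hql_query",
        bash_command="spark3-sql -f my_file.hql -d your_arg=1",
    )

Once your DAG file is ready, run: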

git add your_file_name_dag.py
git commit your_file_name_dag.py -m "Some useful message"

Step 3: Create your test file

Make a copy of the example Python file here. Make changes as directed in the file. Save your file under the tests/product_analytics/your_dag_id folder.
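Before committing, it helps to check the test locally. Below is a minimal, hypothetical sketch of such a test, using Airflow's DagBag to verify that your DAG file parses without import errors and defines the expected DAG; the real example file may rely on the repo's own fixtures, so follow the instructions in that file:

from airflow.models import DagBag

def test_dag_loads_without_errors():
    # The folder and dag_id below are hypothetical placeholders; point
    # DagBag at the folder that contains your DAG file.
    dagbag = DagBag(dag_folder="product_analytics/dag", include_examples=False)
    assert dagbag.import_errors == {}
    assert "your_dag_id" in dagbag.dags

You can run it with pytest from the repo root (e.g. pytest tests/product_analytics/your_dag_id/). Now run: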

git add your_file_name_dag_test.py
git commit your_file_name_dag_test.py -m "Some useful message"
git push

Now go to: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests

Click 'New merge request', select the branch you just pushed as the source branch, and leave the target as main. Continue to create the MR.

Step 4: Create a ticket for DE review

Once the above steps are complete, create a ticket tagged with 'data-engineering' that includes links to the two merge/pull requests you submitted.

Someone from the DE team will review the changes, and if all looks good your job will be deployed.