User:LBowmaker (WMF)/Airflow job in 5 mins
Prerequisite knowledge:
- Your job will be scheduled with Apache Airflow, a workflow scheduling tool.
- The process requires you to copy and edit three files: a query file containing your Hive query, a Python file that schedules the query, and a second Python file containing tests for your job.
- You will need some basic knowledge of Git commands and code repositories.
How to schedule your first simple job
Step 1: Create your query file
Clone your team's chosen repo and create a branch for your job:
git clone <<your-git-url.git>>
cd <<your-repo-directory>>
git checkout -b your-new-job
Make a copy of the example .hql file. Make changes as directed in the file. Save your .hql file under the appropriate folder.
Test your query on a stat box, replacing the file name and arguments in the command below with your own:
spark3-sql -f my_file.hql -d your_arg=1
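If the query takes several parameters, or you want to try it with more than one set of inputs, the command above can be wrapped in a small script. The following is a minimal sketch and not part of the official workflow: the helper names `build_spark_sql_command` and `run_query` are hypothetical, and it assumes each parameter is passed with its own `-d key=value` flag, in the same form as the command above.

```python
import shlex
import subprocess


def build_spark_sql_command(hql_file, params):
    """Return the spark3-sql argument list for one test run.

    Each entry in `params` becomes its own `-d key=value` flag
    (assumption: same form as the single-parameter command above).
    """
    command = ["spark3-sql", "-f", hql_file]
    for key, value in params.items():
        command += ["-d", f"{key}={value}"]
    return command


def run_query(hql_file, params, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    command = build_spark_sql_command(hql_file, params)
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError if spark3-sql exits non-zero.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Placeholder file name and parameter, matching the command above.
    run_query("my_file.hql", {"your_arg": 1})
```

By default the script only prints the command, so you can check it before running it for real with `dry_run=False` on the stat box.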
If everything works as expected, run:
git add your_file_name.hql
git commit -m "Some useful message"
git push --set-upstream origin your-new-job
Now open your Git repo in the browser and click 'New pull request'. Leave the base as master, set compare to your branch (your-new-job), then click 'Create pull request'.
Have teammates or Data Engineers review the code. Once everything is approved, click 'Merge' to merge it into the base branch.
Step 2: Create your scheduling file
Run the following commands:
git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
cd airflow-dags
git checkout -b your-new-job
Make a copy of the example Python file. Make changes as directed in the file. Save your file under the product_analytics/dag folder. Now run:
git add your_file_name_dag.py
git commit -m "Some useful message"
Step 3: Create your test file
Make a copy of the example Python test file. Make changes as directed in the file. Save your file under the tests/product_analytics/your_dag_id folder. Now run:
git add your_file_name_dag_test.py
git commit -m "Some useful message"
git push --set-upstream origin your-new-job
Now go to: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests
Click 'New Merge Request', select the branch you just pushed as the source branch, and leave the target as main. Then continue through the form to create the merge request (MR).
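As a rough illustration of what the test file in this step usually checks, here is a minimal sketch. It is not the repository's example file, which remains the template to copy. The helper `check_dag_loaded` is hypothetical; the only Airflow features it relies on are `DagBag.import_errors`, `DagBag.get_dag` and `DAG.tasks`.

```python
def check_dag_loaded(dagbag, dag_id, expected_task_count):
    """Basic sanity checks for one DAG.

    `dagbag` is expected to be an airflow.models.DagBag (or a pytest
    fixture that provides one) built from the folder holding your DAG file.
    """
    # Any syntax or import problem in a DAG file is recorded here.
    assert dagbag.import_errors == {}, dagbag.import_errors

    # get_dag returns None when no DAG with this id was loaded.
    dag = dagbag.get_dag(dag_id=dag_id)
    assert dag is not None, f"DAG {dag_id!r} was not loaded"

    # Guards against tasks being added or dropped by accident.
    assert len(dag.tasks) == expected_task_count
    return dag
```

In a pytest test this might be called as `check_dag_loaded(DagBag(dag_folder="product_analytics/dag", include_examples=False), "your_dag_id", 1)`, where the folder, DAG id and task count are placeholders for your own values.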
Step 4: Create a ticket for DE review
Once the above steps are complete, create a ticket tagged with 'data-engineering' that includes links to the two requests you submitted: the pull request from Step 1 and the merge request from Step 3.
Someone from the DE team will review it and, if all looks good, it will be deployed.