Analytics/Archive/Infrastructure/Oozie

Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie is a job scheduler whose most relevant feature for us is data-driven scheduling: jobs may be scheduled based on the existence of data in HDFS. A job can therefore be triggered not only by the current timestamp, but also by the availability of the data it needs.

= Terms =
 * action: An action generally represents a single step in a job workflow. Examples include Pig scripts, failure notifications, map-reduce jobs, etc.
 * workflow: Workflows are used to chain actions together. A workflow is synonymous with a job. Workflows describe how actions should run and how control should flow between them. Actions can be chained based on success and failure conditions.
 * coordinator: Coordinators are used to schedule recurring runs of workflows. They can abstractly describe input and output datasets based on periodicity. Coordinators will submit workflow jobs based on the existence of data.

= An Example =
For our simple example, we'll create a workflow that runs a Pig script which takes any number of files/directories as input and saves its output to a file. A second Pig script will then take the output of the first as its input and concatenate it all together into a single file. We'll then learn to use a coordinator to schedule this workflow hourly based on the existence of input data.

Note: replace 'dummy' with your user name.

== Actions ==
Our actions are just two Pig scripts. It isn't important what they do, except that they take input and output files as CLI parameters. Our example Pig scripts also take in a regular expression, '$hour_regex', as a CLI parameter. This allows us to filter out content that doesn't match the hour for which we want to generate statistics.

This example was written while referring to webrequest_loss_hourly.pig and concat_sort.pig.

== Workflow ==
We can define and parameterize a workflow before we start to schedule it for regular runs using a coordinator.

A workflow is just a series of parameterized actions. Parameters can be set on the Oozie CLI, in a .properties file, or by the coordinator XML file.

In workflow.xml, first we name the workflow:
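
A minimal sketch of this step, assuming schema version 0.4 and the name webrequest_loss_by_hour (both are illustrative guesses, not taken from the original file):

```xml
<!-- Sketch only: the name and the schema version are assumptions. -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="webrequest_loss_by_hour">
    <!-- start, action, kill, and end nodes go here -->
</workflow-app>
```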

Then we tell the workflow which action should be run first:
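
As a sketch, the start node might simply point at the first Pig action, using the action name mentioned in the note further down:

```xml
<!-- The "to" attribute must match the name of an action defined in the same workflow. -->
<start to="webrequest_loss_by_hour_action"/>
```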

Now we define our actions and their flow:
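
A sketch of the two actions could look like the following. The action names and script file names come from this page; the ${jobTracker} and ${nameNode} properties, the OUTPUT parameter, the Pig parameter names other than hour_regex, and the "kill" and "end" node names are assumptions made for illustration.

```xml
<!-- First step: compute the hourly statistics. -->
<action name="webrequest_loss_by_hour_action">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>webrequest_loss_hourly.pig</script>
        <!-- Each <param> is passed to the Pig script as name=value (names assumed). -->
        <param>input=${INPUT}</param>
        <param>output=${OUTPUT}</param>
        <param>hour_regex=${HOUR_REGEX}</param>
    </pig>
    <!-- On success continue with the concat action, on failure go to the kill node. -->
    <ok to="concat"/>
    <error to="kill"/>
</action>

<!-- Second step: concatenate all output into a single file. -->
<action name="concat">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>concat_sort.pig</script>
        <param>input=${CONCAT_INPUT}</param>
        <param>dest=${DEST}</param>
    </pig>
    <ok to="end"/>
    <error to="kill"/>
</action>
```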

Note that if the "webrequest_loss_by_hour_action" action finishes with an "ok" status, the "concat" action will be run.

And finally, the failure and success end conditions:
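
A sketch of these two nodes, assuming they are named "kill" and "end" (the names and the message text are illustrative):

```xml
<!-- Failure: the workflow ends here with status KILLED and reports the failing node's error. -->
<kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>

<!-- Success: reaching this node ends the workflow with status SUCCEEDED. -->
<end name="end"/>
```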

The full file can be found in the repository linked at the end of this page.

This workflow can be manually submitted via the Oozie CLI. You'll need to create a job.properties file in which you set the parameterized values, as well as some properties that Oozie itself requires in order to run the job.

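The oozie.wf.application.path property is important: it tells Oozie where to find your workflow.xml definition. You'll need to copy your workflow.xml file to this directory in HDFS, and then submit the workflow by passing your job.properties file to the Oozie CLI (for example, oozie job -config job.properties -run, with the server set via -oozie or the OOZIE_URL environment variable).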

Note also that oozie.libpath includes /user/dummy/pig. If your Pig script imports any external Pig scripts, make sure that those external scripts are in HDFS and in the oozie.libpath.

In this example job.properties file, we are generating statistics for the 10:00 hour on February 25th. Since our data is imported every 15 minutes, for each hour we will want to take 6 individual imports as our INPUT data. For the 10:00 hour, that will be 09:45, 10:00, 10:15, 10:30, 10:45, and 11:00.

You can use "oozie job -info <job-id>" and other Oozie CLI commands to get information about the running job.

== Coordinator ==
A coordinator is submitted to Oozie in the same way as a workflow, but it is defined differently.

First we name our coordinator and specify how often we want it to run:
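
A minimal sketch of the opening element for an hourly coordinator. The name, the schema version, and the ${start_time} and ${stop_time} properties are assumptions; they would have to be set in the .properties file used for submission.

```xml
<!-- Sketch only: name, schema version, and the start/end properties are assumptions. -->
<coordinator-app xmlns="uri:oozie:coord:0.1"
                 name="webrequest_loss_by_hour"
                 frequency="${coord:hours(1)}"
                 start="${start_time}"
                 end="${stop_time}"
                 timezone="UTC">
    <!-- datasets, input-events, output-events, and action go here -->
</coordinator-app>
```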

=== Defining coordinator datasets and input events ===
Our example input dataset is imported every 15 minutes. This data is saved in directories that are named by timestamp, but the data is not bucketed into these directories by the timestamps in the data itself. For example, we cannot guarantee that the data stored in a directory named 2013-02-25_10.15.00 contains only data for the 10:00 - 10:15 range. It will most likely also contain a few log lines from the minutes before or after.

==== Datasets ====
The Oozie coordinator needs to know the path and import frequency of our input dataset, as well as when our dataset begins. Note that the initial-instance attribute is not in the same format as the input directory.
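
A sketch of such a dataset definition. The dataset name and the directory naming pattern come from this page; the base path and the initial-instance value are placeholders chosen for illustration.

```xml
<datasets>
    <!-- initial-instance uses ISO 8601 (yyyy-MM-ddTHH:mmZ), unlike the directory names. -->
    <dataset name="webrequest-wikipedia-mobile"
             frequency="${coord:minutes(15)}"
             initial-instance="2013-02-25T00:00Z"
             timezone="UTC">
        <!-- Hypothetical base path; resolves to directories like .../2013-02-25_10.15.00 -->
        <uri-template>${nameNode}/path/to/input/${YEAR}-${MONTH}-${DAY}_${HOUR}.${MINUTE}.00</uri-template>
    </dataset>
</datasets>
```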

We also need to define our output dataset.
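
A sketch of an output dataset, placed inside the same datasets element as the input dataset. The dataset name, base path, and initial-instance are hypothetical; the 15 minute frequency mirrors the value discussed in the note that follows.

```xml
<!-- Hypothetical name and path; goes inside the <datasets> element. -->
<dataset name="webrequest-loss-by-hour"
         frequency="${coord:minutes(15)}"
         initial-instance="2013-02-25T00:00Z"
         timezone="UTC">
    <uri-template>${nameNode}/path/to/output/${YEAR}-${MONTH}-${DAY}_${HOUR}.${MINUTE}.00</uri-template>
</dataset>
```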

''Note: I'm not sure that a frequency of 15 minutes is correct here. We will be running this coordinator every hour, so defining this dataset with a frequency of 15 minutes is probably incorrect.''

==== Input/Output Events ====
Now that we've described how the data on disk looks, we can define input and output events based on ranges of this data.

The following defines the input event. We tell it that this input event is made up of the webrequest-wikipedia-mobile dataset we defined above. We then give the range of data that belongs to this input event. current(0) will resolve to the current hour at which we are running. For example, if the current hour is 11:00, we will want to look at the previous 6 data imports to generate statistics for the 10:00 hour. So the start-instance is current(-5) (11:00 - 5 * 15 minute intervals == 09:45), and the end-instance is current(0), the current hour.
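
The range described above can be sketched as follows, assuming the event is named INPUT:

```xml
<input-events>
    <!-- Six 15-minute instances: current(-5) through current(0), e.g. 09:45 to 11:00. -->
    <data-in name="INPUT" dataset="webrequest-wikipedia-mobile">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(0)}</end-instance>
    </data-in>
</input-events>
```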

We also define an output event so that the coordinator knows what each workflow run should output. If the coordinator is scheduling a run at 11:00, then current(-4) will be 10:00, which will eventually be in the name of the output directory.
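
A sketch of the output event, assuming it is named OUTPUT and refers to an output dataset with the hypothetical name webrequest-loss-by-hour:

```xml
<output-events>
    <!-- With a 15-minute dataset, current(-4) is one hour before the run time. -->
    <data-out name="OUTPUT" dataset="webrequest-loss-by-hour">
        <instance>${coord:current(-4)}</instance>
    </data-out>
</output-events>
```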

Finally, we need to tell the coordinator the workflow that it should submit for each run. By setting app-path to ${wf_application_path}, we tell Oozie to look for workflow.xml in that directory (which we specify in a .properties file later).
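
In sketch form, the coordinator's action element wraps a workflow element that carries the app-path:

```xml
<action>
    <workflow>
        <!-- HDFS directory containing workflow.xml, set in the .properties file. -->
        <app-path>${wf_application_path}</app-path>
        <!-- an optional <configuration> element with workflow parameters follows here -->
    </workflow>
</action>
```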

Now for the workflow configuration. The workflow works with variable parameters that depend on which time period is being computed, so we need to set those based on the named input and output events. We also set HOUR_REGEX to the hour for which each run generates statistics, which is the current hour minus one hour (11:00 - 1 == 10:00). (Remember that this is a peculiarity/feature of our Pig script.) CONCAT_INPUT is defined to be the whole of all output data, and DEST is a single .tsv file that will contain all of the output data.
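
A sketch of such a configuration block, which sits inside the coordinator's workflow element after app-path. The event names INPUT and OUTPUT, the time format used for HOUR_REGEX, and the ${output_dir} property are assumptions, and the sketch assumes an Oozie version that provides coord:formatTime and coord:dateOffset.

```xml
<configuration>
    <!-- Resolves to a comma-separated list of the input instance URIs. -->
    <property>
        <name>INPUT</name>
        <value>${coord:dataIn('INPUT')}</value>
    </property>
    <property>
        <name>OUTPUT</name>
        <value>${coord:dataOut('OUTPUT')}</value>
    </property>
    <!-- Nominal run time minus one hour; the format string is a guess at what the Pig script expects. -->
    <property>
        <name>HOUR_REGEX</name>
        <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), 'yyyy-MM-dd_HH')}</value>
    </property>
    <!-- Hypothetical locations for the concatenation step. -->
    <property>
        <name>CONCAT_INPUT</name>
        <value>${output_dir}</value>
    </property>
    <property>
        <name>DEST</name>
        <value>${output_dir}/webrequest_loss_by_hour.tsv</value>
    </property>
</configuration>
```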

Now that we've got our coordinator.xml file, we can copy it (and workflow.xml) to HDFS and create a coordinator.properties file to use for submission to Oozie.

Full versions of these .properties and .xml files can be found at https://github.com/wikimedia/kraken/tree/master/oozie/webrequest_loss_by_hour.