Analytics/Kraken/Hadoop Tools

This page is meant as a bucket for tips and notes on using the Hadoop toolchain.


Pig

Pig is a dataflow language adept at analyzing unstructured data and converting it into a regular structure.

Best Practices

  • Push filters up (apply them as early as possible), even if you have to reprocess fields.
  • Drop all unneeded fields as soon as you can. (A sketch of both habits follows this list.)
  • Syntax-check your script locally; use DESCRIBE to understand data shape.
  • Don't explicitly set PARALLEL unless you have a reason. It only affects reduce operations, and Pig already has heuristics based on data size for that; you are unlikely to help performance, and forcing PARALLEL higher than necessary uses up more slots and increases job overhead.
  • Don't Be Afraid To:
    • ...Run stupid, half-broken jobs from a grunt shell just to see the data shape or test your UDF-tinkertoys. (Just please save to your home dir.)
    • ...DUMP after a complex statement to check output shape.
    • ...Use Hive to check your data; its aggregation and filtering features are way more advanced and SQL is easy.
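
A minimal sketch of the filter-early, project-early pattern above (the path, schema, and field names are made up for illustration):

    -- Load raw logs (hypothetical path and schema).
    logs = LOAD '/wmf/raw/webrequest' USING PigStorage('\t')
           AS (ts:chararray, ip:chararray, uri:chararray, status:int, bytes:long);

    -- Push the filter up: every later operator sees less data.
    ok = FILTER logs BY status == 200;

    -- Drop unneeded fields as soon as possible.
    slim = FOREACH ok GENERATE uri, bytes;

    DESCRIBE slim;   -- check the data shape before building on it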

Reuse & Metaprogramming

    • Macros aren't really all that interesting as they're severely limited in scope.
    • Be careful with exec and run -- they're more flexible and powerful, but very expensive, as every call spawns more MR jobs!
    • Parameters are inserted via literal string substitution, which lets you do some pretty wacky metaprogramming via workflows. This means you basically have meta-macros; see rollup.pig as an example, and the sketch below.
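
Because substitution is literal, a parameter can carry a fragment of Pig rather than just a value. A hedged sketch (the script name, parameter names, and schema are hypothetical):

    -- report.pig: $FILTER_EXPR is pasted in verbatim, so the caller can
    -- supply any boolean expression -- effectively a meta-macro.
    logs = LOAD '$INPUT' AS (uri:chararray, status:int, bytes:long);
    kept = FILTER logs BY $FILTER_EXPR;
    STORE kept INTO '$OUTPUT';

Invoked, say, as:

    pig -p INPUT=/tmp/in -p OUTPUT=/tmp/out \
        -p 'FILTER_EXPR=status == 200 AND bytes > 1024L' report.pig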

Gotchas

  • A MATCHES regex must match the whole input (Java String.matches() semantics), not just a substring; wrap the pattern in .* on both sides for a contains-style match (see the sketch after this list).
  • Don't expect relations to stay sorted -- relations derived after a reduce operation (notably GROUP) need re-ordering!
  • Implicit coercion from a bag of single tuples to a scalar ONLY works for relations -- it does NOT work for grouped records or other actual bags!
  • Escaping quotes in a parameter is impossible. I swear it. It's worse than several layers of `ssh box -- bash -lc "eval '$WAT'"`. I gave up; no combination of backslashes and quotes made any difference.
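
A quick sketch of the first two gotchas (relation and field names are hypothetical):

    -- MATCHES is whole-string: this finds URIs *containing* ".png" ...
    pngs = FILTER logs BY uri MATCHES '.*\\.png.*';
    -- ... while MATCHES '\\.png' would match almost nothing.

    -- GROUP throws away any prior ordering; sort again afterwards.
    grouped = GROUP pngs BY uri;
    counts  = FOREACH grouped GENERATE group AS uri, COUNT(pngs) AS n;
    ranked  = ORDER counts BY n DESC;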


Oozie

Best Practices

  • Test each layer, working outward; I always make a $JOB-wf.properties to test the workflow alone before moving on to the coordinator (with a test-$JOB-coord.properties and a $JOB-coord.properties).
  • Everything can formally declare parameters using a <parameters> block at the beginning. DO IT! And avoid pointless defaults -- it's better to fail early. (A sketch follows this list.)
  • Check the xmlns on your root element!
    • Coordinators: xmlns="uri:oozie:coordinator:0.4"
    • Workflows: xmlns="uri:oozie:workflow:0.4"
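
A minimal sketch of a formal <parameters> block (the app and parameter names are examples; omitting <value> means submission fails early if a parameter is unset):

    <workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
        <parameters>
            <property>
                <name>jobQueue</name>
                <!-- no <value> default: fail early if unset -->
            </property>
            <property>
                <name>outputDir</name>
            </property>
        </parameters>
        <!-- start node, actions, end node ... -->
    </workflow-app>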

Workflows

  • Know your <action> options: control flow, sub-workflow, fs, shell, java, streaming all have uses!
  • Sub-workflows are like functions -- compose and reuse!
  • <prepare> should probably be in all our jobs -- delete the output dir before starting work to ensure you don't pointlessly fail due to temporary cruft.
  • <global> allows you to set properties for all actions. All jobs should set <job-tracker> and <name-node> here. (See the sketch after this list.)
  • job.xml(s) -- they cascade -- will be useful once we start profiling and tuning jobs. Save those tweaks together as job-confs for similarly structured jobs to reuse!
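
A sketch of a workflow combining <global> and <prepare> (the names, paths, and Pig script are placeholders); note <configuration> preceding <script>, which matters (see Gotchas below):

    <workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
        <global>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </global>
        <start to="compute"/>
        <action name="compute">
            <pig>
                <prepare>
                    <!-- clear stale output so reruns don't fail pointlessly -->
                    <delete path="${nameNode}${outputDir}"/>
                </prepare>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${jobQueue}</value>
                    </property>
                </configuration>
                <script>rollup.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig action failed.</message>
        </kill>
        <end name="end"/>
    </workflow-app>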

Coordinators

  • <dataset> initial instances should always predate the job. The initial instance only bounds which instances are considered valid; it doesn't dictate anything about where the job starts.
  • Always create coordinator parameters for jobStart, jobEnd, jobName, jobQueue! This lets you easily fire off the job as a backfill in a different queue, run one-off instances of the job, etc. (A sketch follows this list.)
  • datasets.xml lets you share <dataset> definitions. It's worth investigating as the number of jobs grows.
  • Chaining datasets between coordinators is fussy. I haven't seen it worth the energy so far.
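
A sketch of the parameterized coordinator shape suggested above (the daily frequency and names are examples):

    <coordinator-app name="${jobName}" frequency="${coord:days(1)}"
                     start="${jobStart}" end="${jobEnd}" timezone="Universal"
                     xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <app-path>${workflowPath}</app-path>
                <configuration>
                    <property>
                        <name>jobQueue</name>
                        <value>${jobQueue}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>

To backfill, submit the same XML with different jobStart/jobEnd/jobQueue values in the .properties file; nothing in the coordinator changes.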

Gotchas

  • Some workflow action elements are order-sensitive (!!). Ex: <configuration> must come before <script> inside a <pig> action (as in the workflow sketch above), and yes, the error message is oblique and unhelpful.


Hadoop

  • All jobs keep performance counters and stats. These can be extremely helpful for improving job speed.
  • Be familiar with the hdfs shell tool (hdfs dfs) -- it's a lot more expressive than you might expect; a few examples follow.
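
A few of the handier subcommands (paths are examples):

    hdfs dfs -ls /wmf/raw              # list a directory
    hdfs dfs -du -h /user/$USER        # human-readable usage per entry
    hdfs dfs -cat out/part-* | head    # peek at plain-text job output
    hdfs dfs -text out/part-r-00000    # also decodes compressed/SequenceFiles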


Tutorials and Guides