Analytics/Kraken/Hadoop Tools

This page is meant as a bucket for tips and notes on using the Hadoop toolchain.

= Pig =

Pig is a dataflow language well suited to analyzing unstructured data and converting it into a regular structure.

== Best Practices ==

 * Push filters up (as close to the <code>LOAD</code> as possible), even if you have to reprocess fields.
 * Drop all unneeded fields as soon as you can (see the sketch after this list).
 * Syntax-check your script locally; use <code>ILLUSTRATE</code> to understand data shape.
 * Don't explicitly set <code>PARALLEL</code> unless you have a reason. It only affects reduce operations, and Pig has heuristics based on data size for that; you are unlikely to help performance, and forcing parallelism higher than necessary uses up more slots on the cluster and increases job overhead.
 * Don't Be Afraid To:
 ** ...Run stupid, half-broken jobs from a grunt shell just to see the data shape or test your UDF-tinkertoys. (Just please save to your home dir.)
 ** ...<code>DESCRIBE</code> after a complex statement to check the output shape.
 ** ...Use Hive to check your data; its aggregation and filtering features are far more advanced, and SQL is easy.
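For example, a minimal sketch along those lines (the path and field names are purely hypothetical): filter and project immediately after the <code>LOAD</code>, then check the shape before building anything on top of it.

<source lang="pig">
-- Hypothetical input; path and field names are illustrative only.
logs = LOAD '/user/me/sample_webrequest' USING PigStorage('\t')
    AS (dt:chararray, ip:chararray, uri:chararray, status:int, bytes:long);

-- Filter and drop fields as early as possible so later operators see less data.
ok   = FILTER logs BY status == 200;
slim = FOREACH ok GENERATE dt, uri, bytes;

-- Check the shape before building on it.
DESCRIBE slim;
ILLUSTRATE slim;

by_uri = GROUP slim BY uri;
counts = FOREACH by_uri GENERATE group AS uri, COUNT(slim) AS hits, SUM(slim.bytes) AS total_bytes;

STORE counts INTO '/user/me/uri_counts';
</source>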

== Reuse & Metaprogramming ==

 * Macros aren't really all that interesting as they're severely limited in scope.
 * Be careful with the heavier reuse mechanisms (running one script from another, embedding) -- they're more flexible and powerful, but very expensive, as every call spawns more MR jobs!
 * Parameters are inserted via literal string substitution, which lets you do some pretty wacky metaprogramming via workflows -- you basically have meta-macros. See rollup.pig as an example, and the sketch just below.
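A minimal illustration of that substitution trick (the relation names, parameter names, and paths here are made up; rollup.pig is the real example). Because substitution is plain text replacement, a parameter can carry a whole expression, not just a value:

<source lang="pig">
-- Invoked as something like:
--   pig -p INPUT=/user/me/hourly -p OUTPUT=/user/me/hourly_rollup \
--       -p KEY=uri -p AGG='SUM(records.bytes)' rollup.pig
records = LOAD '$INPUT' AS (dt:chararray, uri:chararray, bytes:long);
grouped = GROUP records BY $KEY;
rolled  = FOREACH grouped GENERATE group AS $KEY, $AGG AS value;
STORE rolled INTO '$OUTPUT';
</source>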

== Gotchas ==

 * A regexp (e.g. with <code>MATCHES</code>) must match the whole input string, not just part of it -- anchor with <code>.*</code> yourself (see the sketch after this list).
 * Don't expect relations to stay sorted -- derived relations after a reduce-op (namely, <code>GROUP</code>) need re-ordering!
 * Implicit coercion from a bag of single tuples to a scalar ONLY works for relations -- it does NOT work for grouped records or other actual bags.
 * Escaping quotes in a parameter is impossible. I swear it. It's worse than several layers of <code>ssh box -- bash -lc "eval '$WAT'"</code>. I gave up; no combination of backslashes and quotes made any difference.
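The first two gotchas in miniature (field names and paths hypothetical):

<source lang="pig">
logs = LOAD '/user/me/sample' AS (dt:chararray, uri:chararray, bytes:long);

-- MATCHES must cover the whole string, so anchor it yourself:
pngs = FILTER logs BY uri MATCHES '.*\\.png';   -- works
-- pngs = FILTER logs BY uri MATCHES '\\.png';  -- only matches the literal string ".png"

-- Derived relations are not sorted after a reduce operation such as GROUP:
by_uri  = GROUP logs BY uri;
counts  = FOREACH by_uri GENERATE group AS uri, COUNT(logs) AS hits;
ordered = ORDER counts BY hits DESC;            -- re-order explicitly
</source>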

= Oozie =

== Best Practices ==

 * Test each layer and work outward; make a separate properties file to test the workflow alone before moving on to the coordinator.
 * Everything can formally declare its parameters using a <code>&lt;parameters&gt;</code> block at the beginning. DO IT! And avoid pointless defaults -- better to fail early. (See the sketch after this list.)
 * Check the <code>xmlns</code> (schema version) on your root element!
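A minimal sketch of a <code>&lt;parameters&gt;</code> block, assuming a schema version that supports it (0.4+); the property names are made up. With no <code>&lt;value&gt;</code> defaults, a missing parameter fails at submission time instead of halfway through the run:

<source lang="xml">
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">

    <!-- Declare every parameter up front; leaving out <value> means a
         missing parameter fails at submission, not mid-run. -->
    <parameters>
        <property><name>name_node</name></property>
        <property><name>job_tracker</name></property>
        <property><name>input_dir</name></property>
        <property><name>output_dir</name></property>
    </parameters>

    <!-- ...actions go here... -->
    <start to="done"/>
    <end name="done"/>
</workflow-app>
</source>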

== Workflows ==

 * Know your <code>action</code> options: control flow, sub-workflow, fs, shell, java, and streaming all have uses!
 * Sub-workflows are like functions -- compose and reuse!
 * <code>&lt;prepare&gt;</code> should probably be in all our jobs -- delete the output dir before starting work to ensure you don't pointlessly fail due to temporary cruft.
 * <code>&lt;global&gt;</code> allows you to set properties for all actions. All jobs should set the job-tracker and namenode here. (See the sketch after this list.)
 * <code>job-xml</code>(s) -- they cascade -- will be useful once we start profiling and tuning jobs. Save those tweaks together as job-confs for similarly structured jobs to reuse!
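For instance, a stripped-down workflow showing both a <code>&lt;global&gt;</code> section and a <code>&lt;prepare&gt;</code> delete (names, paths, and the Pig params are illustrative, and this assumes a schema version with <code>&lt;global&gt;</code> support, 0.4+):

<source lang="xml">
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">

    <!-- <global> properties apply to every action below. -->
    <global>
        <job-tracker>${job_tracker}</job-tracker>
        <name-node>${name_node}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queue_name}</value>
            </property>
        </configuration>
    </global>

    <start to="aggregate"/>

    <action name="aggregate">
        <pig>
            <!-- Clear stale output so a retry doesn't die on "directory exists". -->
            <prepare>
                <delete path="${output_dir}"/>
            </prepare>
            <script>rollup.pig</script>
            <param>INPUT=${input_dir}</param>
            <param>OUTPUT=${output_dir}</param>
        </pig>
        <ok to="done"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pig failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="done"/>
</workflow-app>
</source>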

== Coordinators ==

 * A dataset's <code>initial-instance</code> should always predate the job. It only restricts the possible valid results; it doesn't dictate anything about where the job starts.
 * Always create coordinator parameters for the start time, end time, and queue! This lets you easily fire off the job as a backfill in a different queue, run one-off instances of the job, etc. (See the sketch after this list.)
 * Pulling shared dataset definitions in from a common file lets you reuse them across coordinators. It's worth investigating as the number of jobs grows.
 * Chaining datasets between coordinators is fussy. I haven't seen it worth the energy so far.
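A skeletal coordinator along those lines (all names, paths, and times are hypothetical). <code>start_time</code>, <code>stop_time</code>, and <code>queue_name</code> come in as job properties, so a backfill is just a different properties file:

<source lang="xml">
<coordinator-app name="example-coord" frequency="${coord:hours(1)}"
                 start="${start_time}" end="${stop_time}"
                 timezone="Universal" xmlns="uri:oozie:coordinator:0.4">

    <datasets>
        <!-- initial-instance predates any start we would ever use. -->
        <dataset name="webrequest" frequency="${coord:hours(1)}"
                 initial-instance="2013-01-01T00:00Z" timezone="Universal">
            <uri-template>${data_dir}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>

    <input-events>
        <data-in name="input" dataset="webrequest">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

    <action>
        <workflow>
            <app-path>${workflow_path}</app-path>
            <configuration>
                <property>
                    <name>input_dir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
                <property>
                    <name>queue_name</name>
                    <value>${queue_name}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
</source>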

== Gotchas ==

 * Some workflow action elements are order-sensitive (!!): the schema fixes the child order -- e.g. <code>&lt;prepare&gt;</code> must come before <code>&lt;configuration&gt;</code> inside an action -- and yes, the error message is oblique and unhelpful.
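For illustration, the fixed child order inside a <code>java</code> action looks roughly like this (class and property names are made up):

<source lang="xml">
<action name="aggregate">
    <java>
        <job-tracker>${job_tracker}</job-tracker>
        <name-node>${name_node}</name-node>
        <!-- <prepare> must come before <configuration>; swapping them
             only earns you an opaque schema-validation error. -->
        <prepare>
            <delete path="${output_dir}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queue_name}</value>
            </property>
        </configuration>
        <main-class>org.example.Aggregate</main-class>
    </java>
    <ok to="done"/>
    <error to="fail"/>
</action>
</source>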

= Hadoop =


 * All jobs keep performance counters and stats. These can be extremely helpful to improve job speed.
 * Be familiar with the hdfs shell tool -- it's a lot more expressive than you might expect.
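A few commands worth knowing (the paths and job id are made up):

<source lang="bash">
# Browse and sanity-check job output without leaving the terminal.
hadoop fs -ls /user/me/uri_counts
hadoop fs -du -s /user/me/uri_counts
hadoop fs -text /user/me/uri_counts/part-r-00000 | head

# Dig into a finished job's status and counters from the CLI
# (the web UI shows the same numbers with less typing).
hadoop job -status job_201301010000_0042
</source>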

= Tutorials and Guides =


 * The Coalesce Workflow: Concatenate and rename job results
 * The Rollup Workflow: Aggregate a result field into a rollup