Analytics/Archive/Infrastructure/Overview

= Terms / Software =
 * Hadoop
 * A collection of services for batch processing of large data sets. Its core components are a distributed filesystem (HDFS) and MapReduce. Written in Java.


 * HDFS
 * Hadoop Distributed Filesystem.


 * YARN
 * Yet Another Resource Negotiator. A general-purpose manager for cluster compute resources. MapReduce (v2) runs as a YARN application. YARN is sometimes referred to as MRv2, although that is not technically correct.


 * MapReduce
 * Programming model for parallelizing batch processing of large data sets. Really good for counting things.  Hadoop ships with a Java implementation of MapReduce.
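
The counting use case can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases running in a single process; it is not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog", "The end"]))
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data between them; the sketch only shows how the work is divided.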


 * Hue
 * Hadoop User Experience. Web GUI for various Hadoop services (HDFS, Oozie, Pig, Hive, etc.).


 * Pig
 * High-level language abstraction for implementing common MapReduce programs without thinking about the MapReduce model (feels like a mix between SQL and awk). Generates MapReduce programs that are run in Hadoop.


 * Hive
 * Projects structure onto flat data (text or binary) in HDFS and allows this data to be queried in an SQL-like manner.
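
The idea of projecting structure onto flat data can be illustrated with a small Python sketch. It is not Hive or HiveQL: the column names in SCHEMA, the tab delimiter, and the helper names are assumptions made up for this example. The point is only that the schema is applied when the data is read, not when it is written.

```python
from collections import Counter

# Hypothetical schema: column names and types applied at read time.
SCHEMA = [("timestamp", str), ("project", str), ("bytes", int)]


def project_row(line, schema=SCHEMA, delimiter="\t"):
    """Turn one flat text line into a typed row using the schema."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


def bytes_per_project(lines):
    """Roughly: SELECT project, SUM(bytes) FROM logs GROUP BY project."""
    totals = Counter()
    for line in lines:
        row = project_row(line)
        totals[row["project"]] += row["bytes"]
    return dict(totals)


if __name__ == "__main__":
    raw = [
        "2013-01-01T00:00:00\ten.wikipedia\t100",
        "2013-01-01T00:00:01\tde.wikipedia\t50",
        "2013-01-01T00:00:02\ten.wikipedia\t25",
    ]
    print(bytes_per_project(raw))
```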


 * Oozie
 * Scheduler system for Hadoop jobs. Used for automated reporting based on data availability.


 * Storm
 * Distributed stream processing system. Useful for transformation and computation of streaming data.


 * Kafka
 * Distributed pub/sub message queue. Useful as a big ol' reliable log buffer.
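
The "log buffer" idea can be sketched as an append-only list where each consumer keeps its own read position. This is a toy, in-memory illustration and not Kafka's client API: the class LogBuffer and its methods are hypothetical, and partitions, persistence, and replication are all left out.

```python
class LogBuffer:
    """Toy append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._messages = []  # messages are appended and never modified
        self._offsets = {}   # consumer name -> index of next unread message

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread batch for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    log = LogBuffer()
    for event in ["a", "b", "c"]:
        log.publish(event)
    print(log.poll("reporting", max_messages=2))
    print(log.poll("archiver"))
```

Because reading does not remove anything, a slow consumer and a fast consumer can work through the same messages at different speeds.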


 * ZooKeeper
 * Highly reliable dynamic configuration store. Instead of keeping configuration and simple state in flat files or a database, applications keep them in ZooKeeper and are notified of configuration changes on the fly.
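
The notify-on-change behaviour can be sketched with a small in-memory store. This is a single-process illustration and not the ZooKeeper client API: the class ConfigStore, its methods, and the key /kraken/log_level are hypothetical names chosen for this example.

```python
from collections import defaultdict


class ConfigStore:
    """Toy key/value store that calls back registered watchers on change."""

    def __init__(self):
        self._data = {}
        self._watchers = defaultdict(list)  # key -> list of callbacks

    def watch(self, key, callback):
        """Register callback(key, value) to run when the key's value changes."""
        self._watchers[key].append(callback)

    def set(self, key, value):
        """Store a value and notify watchers only if it actually changed."""
        old = self._data.get(key)
        self._data[key] = value
        if old != value:
            for callback in self._watchers[key]:
                callback(key, value)

    def get(self, key, default=None):
        return self._data.get(key, default)


if __name__ == "__main__":
    store = ConfigStore()
    store.watch("/kraken/log_level", lambda k, v: print("changed:", k, v))
    store.set("/kraken/log_level", "INFO")
    store.set("/kraken/log_level", "DEBUG")
```

The application never re-reads a file or polls a database; it reacts when the store tells it something changed.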


 * Cloudera
 * An organization that provides Hadoop ecosystem packages and documentation. Kraken uses Cloudera's CDH4 distribution.