Analytics/Kraken/Software

This page is meant to be a brief overview of the software and libraries used to build Kraken.

Major Software and Libraries

 * Hadoop
   * HDFS: NameNode, DataNode
   * YARN: NodeManager, ResourceManager
   * Hadoop tools:
     * Hue
     * Oozie
     * Pig, Hive
     * HttpFS, WebHDFS
   * LDAP integration?
   * Kerberos integration?
 * Storm
   * Nimbus, Supervisors, Bolts, Spouts
   * (We will be writing bolts to perform the ETL tasks, and configuring the topology that links them together; see the sketch after this list.)
 * ETL libraries:
   * MaxMind GeoIP database
   * dClass: User Agent parser
   * Avro: Data serialization framework
   * Camus: Kafka -> HDFS import library
 * Kafka
 * Zookeeper
   * Required by Hadoop, Storm, and Kafka
 * Monitoring tools:
   * jmxtrans
   * Ganglia
   * Graphite
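Since the ETL work will be written as Storm bolts wired together into a topology, here is a minimal sketch of what that looks like. It assumes the pre-Apache Storm Java API (the 0.8-era backtype.storm packages); the class names (EtlTopologySketch, FakeLogSpout, LogParseBolt) and the tab-separated field layout are illustrative, not the actual Kraken topology.

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class EtlTopologySketch {

    // Stand-in spout emitting a canned log line; in the real topology this
    // would be a Kafka spout reading the raw webrequest stream.
    public static class FakeLogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("2013-01-01T00:00:00\thttp://en.wikipedia.org/wiki/Kraken"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // One ETL bolt: split a raw tab-separated log line into named fields.
    // Further bolts (GeoIP lookup, user agent parsing, Avro writing) would
    // follow the same pattern, each consuming the previous bolt's fields.
    public static class LogParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String[] parts = input.getStringByField("line").split("\t", 2);
            collector.emit(new Values(parts[0], parts[1]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("timestamp", "url"));
        }
    }

    public static void main(String[] args) {
        // The topology wires spout and bolts together and sets per-component parallelism.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("logs", new FakeLogSpout(), 1);
        builder.setBolt("parse", new LogParseBolt(), 2).shuffleGrouping("logs");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("etl-sketch", new Config(), builder.createTopology());
    }
}
```

In the real topology, the spout would read the webrequest stream from Kafka, and additional bolts would handle GeoIP lookup (MaxMind), user agent parsing (dClass), and Avro serialization on the way to HDFS.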

See the Kraken Overview for a glossary of terms and detail on how these pieces are connected into components.

Today

 * Hadoop + tools, the Kafka brokers, and Zookeeper are all set up
 * Logs: edge -> awk -> Kafka (see the producer sketch below)
 * No ETL yet: cron-driven import to HDFS from Kafka via an MR job (this means raw IPs are in the logs!)
 * User authorization: via shell account, or group membership in labs LDAP
 * Monitoring via jmxtrans into Ganglia; no Nagios
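As a rough illustration of the last step of the edge -> awk -> Kafka path, the sketch below publishes one log line to a Kafka topic. It uses the standard Java producer client; the broker address and the topic name "webrequest" are assumptions, and the production pipeline pipes lines from awk into a producer process rather than using this client directly.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EdgeLogProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; the real broker hostnames are cluster-specific.
        props.put("bootstrap.servers", "kafka-broker.example.org:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Kafka supports compressing message batches on the wire.
        props.put("compression.type", "snappy");

        Producer<String, String> producer = new KafkaProducer<>(props);
        String logLine = "2013-01-01T00:00:00\t10.0.0.1\thttp://en.wikipedia.org/wiki/Kraken";
        // Publish one raw log line to an assumed "webrequest" topic.
        producer.send(new ProducerRecord<>("webrequest", logLine));
        producer.close();
    }
}
```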

Network Usage
Network usage by node type:


 * Kafka Brokers: 2x the full incoming stream (in from edge, out to ETL). Kafka uses its own protocol over TCP, and supports message bundling, compression, and live failover.
 * Storm ETL workers: full stream sharded among nodes; Storm is smart about avoiding inter-node communication if a topology can completely fit on a single worker.
 * A file in HDFS is made up of blocks; these blocks can be both local and remote to any given DataNode. HDFS replicates data 3x (configurable) for availability and durability – this means it will copy blocks between nodes to maintain the replication invariant. The NameNode acts as the coordinator for replication, and the authority for addressing files to blocks.
 * HDFS is immutable (though it allows appends) – this means that "change" rewrites and replicates the entire block (our blocksize is configured at 256MB).
 * A MapReduce job has two phases: the Mappers each read their split of the input from HDFS (Hadoop schedules each Mapper on a node holding that data where possible), perform some work, and emit (key, value) pairs; the Reducers then receive these tuples grouped and sorted by key, perform aggregation, and each write their result back to HDFS via the local DataNode. This result is then replicated as normal (see the sketch after this list).
 * At the moment, we plan regular data export to a dedicated host that acts exclusively as a public webserver.
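To make the Map/Reduce description above concrete, here is a minimal sketch of a job that counts requests per URL over tab-separated log lines stored in HDFS. It uses the org.apache.hadoop.mapreduce API; the class names, field positions, and paths are illustrative and this is not the actual Kraken import job.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RequestsPerUrl {

    // Map phase: each Mapper reads its input split (scheduled near the data
    // where possible) and emits (url, 1) pairs.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {
                url.set(fields[1]);          // assumes the URL is the second column
                context.write(url, ONE);
            }
        }
    }

    // Reduce phase: all values for a given key arrive at one Reducer, which
    // aggregates them and writes its output back to HDFS via the local DataNode.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "requests-per-url");
        job.setJarByClass(RequestsPerUrl.class);
        job.setMapperClass(UrlMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each Mapper processes one input split, the shuffle groups and sorts the emitted pairs by key, and each Reducer writes its output file back to HDFS, where it is replicated like any other file.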

Public Surfaces

 * ACL as outlined in the RT ticket.
 * Cross-datacenter connectivity (esams) requires a solution (public IP, or perhaps the existing bridge?)

Services and Access Needs
Many internal service dashboards and control panels:


 * Hue: HDFS web access; Hadoop job scheduling (via Oozie); Hive query dashboard (Beeswax)
   * Login to the Hue web UI is authenticated against LDAP
   * Privileges can be controlled within the dashboard to granularly restrict access to particular services
   * Limited control can be exerted over resource use
   * Analysts need dashboard access for job monitoring/control and for data access
 * Hadoop Admin pages
   * NameNode: provides HDFS logs and a system health overview. Cluster administrator access only.
   * JobTracker, DataNode: provide logs and debugging output for Hadoop jobs. Analysts need access to debug their jobs.
 * Storm's Nimbus: Storm job monitoring and scheduling. Cluster administrator access only.
 * Graphite: Application and host monitoring for the cluster. Cluster administrator access only.