Analytics/Kraken/Firehose

This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!

This does not replace our desired final Kraken architecture, or the Request Logging proposal outlined at [[Request Logging]]. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and the ways we have yet to try.

= Pipeline Components =

Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.

== Sources / Producers ==

 * udp2log
 * Flume UDPSource (custom)
 * Flume Spooling Directory Source
 * KafkaProducer Shell (kafka-console-producer; see the sketch after this list)
 * Ori's UDPKafka
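
For example, the udp2log + KafkaProducer shell combination can be wired together with a single udp2log filter entry. A minimal sketch, assuming udp2log's "pipe <factor> <command>" filter syntax; the broker address, topic name, and producer flags are illustrative (--broker-list is the Kafka 0.8-style option; 0.7's console producer took --zookeeper instead):

<pre>
# Hypothetical udp2log filter line: keep every request (sampling factor 1)
# and pipe the stream into the Kafka console producer. Host, port, and
# topic are placeholders, not our actual config.
pipe 1 /usr/bin/kafka-console-producer.sh --broker-list an02:9092 --topic webrequest
</pre>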

== Agents / Buffers / Brokers ==

 * Flume Memory Channel (volatile)
 * Flume File Channel (durable; see the config sketch after this list)
 * KafkaBroker
 * plain old files
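
To make the volatile/durable distinction above concrete, here is a minimal sketch of Flume NG channel definitions; the agent and channel names are made up, and only channel-related properties are shown:

<pre>
# Flume NG properties sketch: one volatile memory channel, one durable
# file channel. "agent1", "mem", and "fileCh" are illustrative names.
agent1.channels = mem fileCh

agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 100000                  # events held in RAM; lost on crash

agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.fileCh.dataDirs = /var/lib/flume/data  # events survive restarts
</pre>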

== Sinks / Consumers ==

 * Flume HDFS Sink
 * kafka-hadoop-consumer (3rd party, has Zookeeper support)
 * Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
 * plain old cron jobs + hadoop fs -put (see the sketch below)
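
The "plain old cron jobs" option is the least clever but the easiest to reason about: let something write rotated files to local disk, then periodically ship them into HDFS. A minimal sketch, with a hypothetical schedule and hypothetical paths:

<pre>
# Hypothetical crontab entry: every 15 minutes, put each completed
# (rotated, gzipped) log file into HDFS, deleting the local copy only
# if the put succeeded. All paths are placeholders.
*/15 * * * * for f in /var/log/webrequest/*.log.gz; do hadoop fs -put "$f" /wmf/raw/webrequest/ && rm "$f"; done
</pre>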

= Possible Pipelines =

'''udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS'''
This is our main solution. It works most of the time, but it drops data. udp2log and the Kafka producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
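
Since the problem with this pipeline is dropped data, one way to quantify the drops is to look for gaps in the per-host sequence numbers carried on each udp2log line. A sketch, assuming the hostname is the first whitespace-separated field and the sequence number is the second (the real positions depend on the log format), with a placeholder HDFS path:

<pre>
# Print one line for every gap in a host's sequence numbers.
# Field positions ($1 = host, $2 = sequence number) are assumptions.
hadoop fs -cat /wmf/raw/webrequest/SOME_IMPORT_DIR/* | awk '
  { if (($1 in last) && $2 != last[$1] + 1)
      print $1 " dropped " ($2 - last[$1] - 1) " lines before seq " $2
    last[$1] = $2 }'
</pre>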

'''UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS'''
= Ideal Pipeline =

The ideal pipeline is the originally proposed architecture, which includes modifying the frontend production nodes as well as using Storm.

'''C KafkaProducers (on the frontends) -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS'''