Analytics/Kraken/Firehose
This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!
This does not replace our desired final Kraken architecture, or the proposal outlined at Request Logging. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we still have to try.
Pipeline Components
Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.
Sources / Producers
- udp2log
- Flume UDPSource (custom)
- Flume Spooling Directory Source
- KafkaProducer Shell (kafka-console-producer)
- Ori's UDPKafka
Agents / Buffers / Brokers
- Flume Memory Channel (volatile)
- Flume File Channel
- KafkaBroker
- plain old files
Sinks / Consumers
- Flume HDFS Sink
- kafka-hadoop-consumer (3rd party, has Zookeeper support)
- Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
- plain old cron jobs + hadoop fs -put
Possible Pipelines
udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
This is our main solution; it works most of the time, but it drops data. udp2log and the producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer is running as a cron job on an02.
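For reference, the moving parts of this pipeline can be sketched roughly as below. The broker/ZooKeeper host, topic name, flags, and HDFS path here are assumptions (and the console producer's flags differ between Kafka versions), so treat this as the shape of the setup, not a working config:

```
# On an03-an06, udp2log fans the multicast stream out to the console
# producer via a pipe filter in its config (host/topic are hypothetical):
pipe 1 /usr/bin/kafka-console-producer --zookeeper zk1:2181 --topic webrequest

# On an02, a crontab entry runs the consumer periodically to land
# accumulated messages in HDFS (flags here are hypothetical):
*/15 * * * * kafka-hadoop-consumer --topic webrequest --output /wmf/raw/webrequest
```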
Flume UDPSource -> HDFS
udp2log -> files + logrotate -> Flume Spooling Directory Source -> HDFS
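A hedged sketch of what the Flume agent config for this pipeline might look like. Agent name, spool directory, and HDFS path are assumptions; the spooling directory source only picks up completed, immutable files, which is why logrotate would hand rotated udp2log files into the spool directory:

```
# flume.conf -- sketch only; names and paths are assumptions.
agent1.sources  = spool
agent1.channels = fileCh
agent1.sinks    = hdfsSink

# Watch the directory logrotate moves completed udp2log files into.
agent1.sources.spool.type     = spooldir
agent1.sources.spool.spoolDir = /var/spool/udp2log
agent1.sources.spool.channels = fileCh

# Durable file channel, so a crash doesn't lose buffered events.
agent1.channels.fileCh.type = file

# Write batches into HDFS, bucketed by day.
agent1.sinks.hdfsSink.type                   = hdfs
agent1.sinks.hdfsSink.hdfs.path              = /wmf/raw/webrequest/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType          = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsSink.channel                = fileCh
```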
udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
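The "plain old cron" version of this pipeline might look like the script below, run from a crontab entry. The spool directory, HDFS layout, and cleanup behavior are assumptions, not our actual config (and `hadoop fs -mkdir -p` is not available on every Hadoop version):

```shell
# Sketch: push udp2log files that logrotate has finished into HDFS,
# bucketed by date.  Paths here are hypothetical.
SPOOL_DIR=${SPOOL_DIR:-/var/log/udp2log/archive}
HDFS_DIR="/wmf/raw/webrequest/$(date +%Y/%m/%d)"

for f in "$SPOOL_DIR"/*.gz; do
    [ -e "$f" ] || continue              # no rotated files yet
    if command -v hadoop >/dev/null; then
        hadoop fs -mkdir -p "$HDFS_DIR"
        # Remove the local copy only if the upload succeeded.
        hadoop fs -put "$f" "$HDFS_DIR/" && rm "$f"
    fi
done
echo "target: $HDFS_DIR"
```

This is the simplest and most robust option, at the cost of latency: data is only as fresh as the logrotate interval plus the cron interval.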
UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
Storm Pipeline
The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes and using Storm.
Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
Links
- Kafka Spout - a Kafka consumer that emits Storm tuples.
- Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
- HDFS API Docs for FileSystem.append()