Analytics/Kraken/Firehose

This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!

This does not replace our desired final Kraken architecture, or the proposal outlined at Request Logging. It is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other approaches we still have to try.

Pipeline Components

Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.

Sources / Producers

Agents / Buffers / Brokers

Sinks / Consumers


Possible Pipelines

udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS

This is our main solution, and it works most of the time, but it still drops data. udp2log and the Kafka producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
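
For reference, the "KafkaProducer shell" step is essentially a process that reads log lines from a udp2log pipe on stdin and publishes each one to a Kafka topic. Below is a minimal sketch of such a producer written against the current Kafka Java client; the broker address, topic name, and class name are assumptions for illustration, not the code actually deployed.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Reads webrequest log lines from stdin (e.g. piped from a udp2log filter)
    // and publishes each line to a Kafka topic.
    public class StdinKafkaProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");  // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // "webrequest" is an assumed topic name.
                    producer.send(new ProducerRecord<>("webrequest", line));
                }
            }
        }
    }

In the udp2log config this would be hooked up with a filter line roughly of the form pipe 1 /usr/bin/java ... StdinKafkaProducer, so that every (or every Nth) log line lands on the producer's stdin.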

Flume UDPSource -> HDFS

udp2log -> files + logrotate -> Flume SpoolingFileSource -> HDFS

udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS

UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
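
UDPKafka isn't documented here; presumably the idea is a small daemon that listens on the webrequest multicast group directly and produces each datagram to a Kafka broker, skipping udp2log entirely. A rough Java sketch under that assumption (port, multicast group, topic, and broker address are all made up):

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;
    import java.util.Arrays;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Joins the webrequest multicast group and forwards each datagram straight
    // to a Kafka topic, taking the place of udp2log plus a separate producer.
    public class UdpToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");  // assumed broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
                 MulticastSocket socket = new MulticastSocket(8420)) {   // assumed port
                socket.joinGroup(InetAddress.getByName("239.1.2.3"));    // assumed group
                byte[] buf = new byte[65535];
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet);
                    byte[] payload = Arrays.copyOfRange(buf, 0, packet.getLength());
                    producer.send(new ProducerRecord<>("webrequest", payload));
                }
            }
        }
    }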

Storm Pipeline

The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes to run native Kafka producers, as well as using Storm for processing:

Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
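
None of this exists yet; as a sketch only, the Kafka Spout -> ETL -> HDFS part of that chain might be wired together with Storm's Java API roughly as below. The storm-kafka-client spout, broker address, topic name, and parallelism numbers are assumptions, and the HDFS writer bolt is left as a comment.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class WebrequestTopology {

        // Placeholder ETL bolt: parse/clean a raw webrequest line and re-emit it.
        public static class EtlBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // storm-kafka-client's default translator puts the message in "value".
                String line = input.getStringByField("value");
                // Real ETL (field extraction, filtering, anonymization, ...) goes here.
                collector.emit(new Values(line));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("line"));
            }
        }

        public static void main(String[] args) throws Exception {
            // Broker address and topic name are assumptions.
            KafkaSpoutConfig<String, String> spoutConfig =
                    KafkaSpoutConfig.builder("kafka-broker:9092", "webrequest").build();

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka", new KafkaSpout<>(spoutConfig), 4);
            builder.setBolt("etl", new EtlBolt(), 8).shuffleGrouping("kafka");
            // The HDFS writer would be another bolt subscribed to "etl", e.g.
            // storm-hdfs's HdfsBolt or a custom bolt using the Hadoop FileSystem API.

            StormSubmitter.submitTopology("webrequest-firehose", new Config(),
                    builder.createTopology());
        }
    }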

Links

  • Kafka Spout - a Kafka consumer that emits Storm tuples.
  • Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
  • HDFS API Docs for FileSystem.append() (a short usage sketch follows below)
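
For reference, a minimal sketch of appending to an existing HDFS file through that API. The path and cluster configuration are assumptions; append support also has to be enabled on the cluster, and older HDFS versions had known issues with it.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path logFile = new Path("/kraken/webrequest/current.log");  // assumed path
            // Create the file if it doesn't exist yet, otherwise append to it.
            try (FSDataOutputStream out = fs.exists(logFile)
                    ? fs.append(logFile)
                    : fs.create(logFile)) {
                out.write("one webrequest log line\n".getBytes(StandardCharsets.UTF_8));
                out.hsync();  // flush the write out to the datanodes
            }
        }
    }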