Analytics/Archive/Infrastructure/Logging Solutions Overview

This page is meant as a workspace to evaluate replacements for udp2log. Whichever option is chosen will also carry the main input stream into the Analytics Kraken Cluster.

= Context =

== udp2log ==
The Wikimedia Foundation currently uses a custom logging daemon, udp2log, to transport logs from several different sources to 3 (as of July 2012) dedicated logging machines. Any host can send log lines over UDP to a udp2log daemon. The daemon forks several 'filter' processes, each responsible for filtering and transforming the incoming logs. Usually these logs are then saved to local log files, but they can also be relayed over UDP to another destination using log2udp.
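To illustrate how simple the producer side is, here is a minimal Python sketch of sending a log line to a udp2log daemon (the host and port below are hypothetical placeholders, not production endpoints):

<pre>
import socket

# udp2log accepts plain-text log lines as individual UDP datagrams.
UDP2LOG_HOST = 'logger.example.org'  # hypothetical host
UDP2LOG_PORT = 8420                  # hypothetical port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
line = 'hostname 1234 2012-07-01T00:00:00 GET /wiki/Main_Page 200\n'
sock.sendto(line.encode('utf-8'), (UDP2LOG_HOST, UDP2LOG_PORT))
sock.close()
</pre>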

While udp2log is a simple C program that generally works well for what it was designed to do, there are several problems with continuing to use it. The most obvious is that it does not scale. udp2log daemons are mainly used to collect web access logs for all Wikimedia sites. Traffic to these sites is so great that the incoming log streams must be sampled. None of the 3 log machines currently configured has the capacity to write unsampled log streams to disk, let alone enough storage space to keep those logs. udp2log was not designed as a distributed solution to this problem.

The Analytics team is tasked with building a distributed analytics cluster that can ingest and store unsampled logs from any number of sources, and then do both stream and batch processing on this data. We could attempt to enhance udp2log so that it works in a more distributed fashion. However, this problem has already been solved by some very smart people, so we see no need to spend our own resources re-solving it. So: what can we use instead?

== Requirements for a distributed logger ==

 * Hard Requirements:
 ** Open source
 ** Horizontally scalable
 ** Durable
 * Would be nice:
 ** Publish / subscribe model
 ** Ability to modify stream data (AKA content transformation, stream processing) for IP anonymization, geocoding, etc.
 ** Good documentation and an active community.

= Distributed Logging Solutions =

We've put together a table comparing features of quite a few options. Below is a summary of the ones we are seriously considering. See this spreadsheet for a more complete comparison.

== Scribe ==
Scribe is a distributed, push-based logging service. It guarantees reliability and availability by using a tree hierarchy of Scribe servers combined with local failover file buffers.

=== Reliability ===
Scribe's message durability is provided by its buffer store type. A buffer store has a primary and a secondary store, each of which can be any store type. Typically, if the Scribe host in question is an aggregator, the primary store is a network store that simply forwards messages over to the next Scribe server in the chain. The secondary is then configured as a simple file store. If the network store goes down and is unavailable, logs are buffered in the secondary store until the primary comes back online. The buffered logs are then re-read from disk and sent over to the primary store.
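As a concrete sketch, a buffer store for an aggregator might be configured roughly as follows (hostnames, paths, and tuning values are hypothetical placeholders):

<pre>
<store>
category=default
type=buffer
# how long to wait before retrying the primary store
retry_interval=30

<primary>
type=network
remote_host=scribe-aggregator.example.org
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/var/spool/scribe
base_filename=buffered
max_size=1000000
</secondary>
</store>
</pre>

If the network store in <primary> is unreachable, messages accumulate in the <secondary> file store and are replayed once the connection recovers.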

A multi store tells Scribe to send its messages to two different stores. This allows fanout to multiple endpoints, but only two branches at a time. It is possible to configure each of the two branches as a buffer store, each using the configuration (primary = network, secondary = local file buffer) described above.

With multi network stores, Scribe essentially achieves message replication. If one of the two network stores in an immediate hierarchy explodes, the second will still continue receiving messages. In this setup, downstream consumers need to be configured so as to avoid consuming duplicate logs, but this is the only way to guarantee 100% message durability with Scribe.
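A sketch of such a replicating multi store, with both branches configured as network stores pointing at different (hypothetical) aggregators:

<pre>
<store>
category=default
type=multi

<store0>
type=network
remote_host=aggregator-a.example.org
remote_port=1463
</store0>

<store1>
type=network
remote_host=aggregator-b.example.org
remote_port=1463
</store1>
</store>
</pre>

Every message is written to both branches, which is what makes downstream deduplication necessary.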

=== Option 1 ===
Tail log files into scribe_cat or use scribe_tail. This has the disadvantage of (probably) being pretty inefficient.
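For example, assuming the scribe_cat helper script from Scribe's examples directory and a hypothetical log file and category, this would look something like:

<pre>
tail -F /var/log/squid/access.log | scribe_cat webrequest
</pre>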

=== Option 2 ===
Modify the source software to use Scribe client code and log directly to Scribe. This is similar to what we do now for udp2log.
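Since Scribe speaks Thrift, this only requires a generated client. A minimal Python sketch, assuming the scribe module has been generated from scribe.thrift and a local Scribe daemon is listening on the default port 1463:

<pre>
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe  # generated from scribe.thrift

# Connect to the local Scribe daemon over a framed Thrift transport.
socket = TSocket.TSocket(host='127.0.0.1', port=1463)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(trans=transport,
                                           strictRead=False,
                                           strictWrite=False)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
# Scribe routes each message by its category.
entry = scribe.LogEntry(category='webrequest', message='hello scribe')
result = client.Log(messages=[entry])  # scribe.ResultCode.OK on success
transport.close()
</pre>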

=== Pros ===

 * C++
 * Simple configuration files
 * Generally works well
 * Uses Thrift, so clients are available for nearly any language.

=== Cons ===

 * No longer an active project; Facebook is phasing it out
 * HDFS integration might be buggy
 * Difficult to package
 * Static configuration
 * No publish / subscribe
 * Routing topologies are limited; only two-branch trees are possible.

== Flume ==
Flume is a distributed logging service built by Cloudera primarily as a reliable way of getting stream and log data into HDFS. Its pluggable architecture supports any consumer.

Flume is configured via 'data flows'. “A data flow describes the way a single stream of data is transferred and processed from its point of generation to its eventual destination.”
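A sketch of what a data flow looks like in Flume's 0.9.x-era configuration language; the node names, file path, and HDFS URL below are hypothetical:

<pre>
agent1 : tail("/var/log/squid/access.log") | agentBESink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/webrequest/", "access-") ;
</pre>

Here agent1 tails a local log file and forwards events (best effort) to collector1, which writes them into HDFS.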

== Kafka ==