Analytics/Archive/Infrastructure/Logging Solutions Overview

This page is meant as a workspace to evaluate replacements for udp2log. The options here will also be used as the main data firehose into the Analytics Kraken Cluster.

= Context =

udp2log
Currently the Wikimedia Foundation uses a custom logging daemon, udp2log, to transport logs from several different sources to 3 (as of July 2012) dedicated logging machines. Any host can send UDP packets to a udp2log daemon. The udp2log daemon forks several 'filter' processes, each of which is responsible for filtering and transforming all incoming logs. Usually these logs are then saved to local log files, but they can also be sent out over UDP again to another destination using log2udp.

While udp2log is a simple C program, and generally works well as it was designed, there are several problems with continuing to use it. The most obvious is that it does not scale. udp2log daemons are mainly being used to collect web access logs for all Wikimedia sites. The traffic for these sites is so great that sampling on the incoming log streams needs to be done. None of the 3 log machines currently configured have the capacity to write unsampled log streams to disk, let alone enough storage space to keep these logs. udp2log was not designed as a distributed solution to this problem.
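To make the sampling concrete, here is a minimal sketch of a 1-in-N sampler applied to a stream of log lines. It is an illustration only, not udp2log's actual implementation; the function name `sample_one_in_n` and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator


def sample_one_in_n(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield every n-th line, giving a deterministic 1-in-n sample of the stream."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, line in enumerate(lines):
        if index % n == 0:
            yield line


# 10,000 hypothetical access-log lines, sampled at 1:1000.
stream = (f"request {i}" for i in range(10_000))
sampled = list(sample_one_in_n(stream, 1000))
```

Everything that is not sampled is discarded before it reaches disk, which is why a sampled stream cannot answer questions that need every request.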

The Analytics team is tasked with building a distributed analytics cluster that can intake and store unsampled logs from any number of sources, and then subsequently do stream and batch processing on this data. We could attempt to enhance udp2log so that it works in a more distributed fashion. However, this problem has already been solved by some very smart people, so we see no need to spend our own resources solving it again. So what can we use instead?

Requirements for a distributed logger

 * Hard Requirements:
   * Open source
   * Horizontally scalable
   * Durable
 * Would be nice:
   * Publish / subscribe model
   * Ability to modify stream data (AKA content transformation, stream processing) for IP anonymization, GeoCoding, etc.
   * Good documentation, active community
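As an illustration of the kind of stream transformation meant by "IP anonymization", the sketch below zeroes the host bits of a client address in each log line. The prefix lengths (/24 for IPv4, /48 for IPv6) and the assumption that the client IP is the first whitespace-separated field are choices made for this example, not part of any of the loggers discussed here.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Zero the host bits: keep a /24 for IPv4 and a /48 for IPv6 (assumed policy)."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


def anonymize_line(line: str) -> str:
    """Anonymize the first field of a log line, assumed here to be the client IP."""
    ip, sep, rest = line.partition(" ")
    return anonymize_ip(ip) + sep + rest
```

A logger with content transformation support could run a function like `anonymize_line` on every message before it is stored.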

= Distributed Logging Solutions =

The following sections provide an overview of the logging solutions we are considering.

We've put together a table comparing features of quite a few options. Below is a summary of the ones we are seriously considering. See this spreadsheet for a more complete comparison.

= Scribe =

Scribe is a distributed, push-based logging service. It guarantees reliability and availability by using a tree hierarchy of Scribe servers combined with local failover file buffers.

Reliability
Scribe's message durability is provided by its  store type. A  store has a   and   store, each of which can be any store type. Typically, if the Scribe host in question is an aggregator, the  store is a   store that simply forwards messages over to the next Scribe server in the chain. The  is then configured as a simple file store. If the    store goes down and is unavailable, logs will be buffered in the  store until the   comes back online. The buffered logs will then be re-read from disk and sent over to the  store.

A  store tells Scribe to send its messages to two different stores. This allows for fanout to multiple endpoints, but only with two branches at a time. It is possible to set up each of the two branches as buffered stores, each using the configuration ( ,   local file buffer) described above.

With    stores, Scribe essentially achieves message replication. If one of the two    stores in an immediate hierarchy explodes, the second will still continue receiving messages. In this setup, the downstream consumers need to be configured in such a way as to avoid consuming duplicate logs, but this is the only way to guarantee 100% message durability with Scribe.
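The forward-then-buffer behaviour described in this section can be sketched as follows. This is a simplified model and not Scribe code: the class names `Downstream` and `BufferedForwarder` are hypothetical, and an in-memory queue stands in for the local file buffer.

```python
from collections import deque
from typing import Deque, List


class Downstream:
    """Hypothetical next hop in the chain; raises ConnectionError while down."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def send(self, message: str) -> None:
        if not self.up:
            raise ConnectionError("downstream unavailable")
        self.received.append(message)


class BufferedForwarder:
    """Forward messages downstream; spool them locally while downstream is down."""

    def __init__(self, downstream: Downstream) -> None:
        self.downstream = downstream
        self.spool: Deque[str] = deque()  # stands in for the local file buffer

    def log(self, message: str) -> None:
        self.spool.append(message)
        self.flush()

    def flush(self) -> None:
        """Replay spooled messages in order, stopping at the first failure."""
        while self.spool:
            try:
                self.downstream.send(self.spool[0])
            except ConnectionError:
                return  # still down: keep everything spooled
            self.spool.popleft()


downstream = Downstream()
store = BufferedForwarder(downstream)
downstream.up = False
store.log("a")
store.log("b")       # both messages stay in the spool
downstream.up = True
store.log("c")       # replays "a" and "b", then sends "c"
```

A message is removed from the spool only after the send succeeds, so an outage delays messages without dropping them, as long as the local buffer itself survives.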

How to get data in

 * Tail log files into scribe_cat or use scribe_tail. This has the disadvantage of (probably) being pretty inefficient.
 * Modify source software to use scribe client code and log directly to scribe. This is similar to what we are doing now for udp2log.

Pros

 * C++
 * Simple configuration files
 * Generally works well
 * Uses thrift so clients are available for any language.

Cons

 * No longer an active project; Facebook is phasing it out
 * HDFS integration might be buggy
 * Difficult to package
 * Static configuration
 * No publish / subscribe
 * Routing topologies limited. Can only do two branch trees.

= Flume =

Data Flows
Flume is configured via 'data flows'. “A data flow describes the way a single stream of data is transferred and processed from its point of generation to its eventual destination.” Data flows are a centralized, high-level way of expressing how data gets from producer machines to consumer machines. These flows are configured at the Flume Master, and do not need to be configured at each producer. Each producer needs to run a Flume agent. The Flume Master then configures the agents to read data from the producers into the data flow.

Data flows are expressed in static config files, or given as commands to a Flume CLI or web GUI. In general, a flow is of the form

Example data flow:

In this example, two squid nodes are configured to tail their access logs into an End to End agent chain with two node failover. 's access logs will be sent to node. defines a as a source, which simply tells node that it should act as a Flume Collector. The logs that arrive to 's  will be written to HDFS.

and  work similarly. Notice that each of the s specify two nodes. The additional nodes are failover nodes. If  goes down, then   will be notified by the Flume Master (via Zookeeper), and will begin sending its logs to  until   comes back online.

Note that this setup explicitly specifies failover chains. Flume can take care of automatically configuring the failover chains; however, a note in the documentation says that automatic failover chains do not work with multiple Flume Masters.
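The idea of an explicit failover chain can be sketched as picking the first node in an ordered list that is currently alive. This is a conceptual illustration only, not Flume's implementation; the function `pick_collector` and the node names are hypothetical, and the `live` set stands in for the liveness information that the master distributes.

```python
from typing import Iterable, Optional, Set


def pick_collector(chain: Iterable[str], live: Set[str]) -> Optional[str]:
    """Return the first collector in the failover chain that is currently live."""
    for node in chain:
        if node in live:
            return node
    return None  # every node in the chain is down


chain = ["collector1", "collector2"]  # hypothetical node names
normal = pick_collector(chain, {"collector1", "collector2"})
failed_over = pick_collector(chain, {"collector2"})
```

Because the chain is ordered, an agent returns to its preferred collector as soon as that collector is reported live again.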

Flume's centralized configuration makes it easy to modify data flows on the fly, without having to change config files on producers.

Reliability
Each data flow's reliability level is configurable.

End-to-end
Uses a write-ahead log to buffer each event until the final sink has ACKed receiving and storing it. ACKs are handled through master nodes, rather than up through the chain. This guarantees that a message will reach the end of a data flow at least once. However, if a sink somewhere blocks for some reason, and an upstream source times out waiting for an ACK, the upstream source will resend the message. This can cause duplicate messages to be stored. Since this mode requires Flume Master coordination, it is the least efficient option.
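The at-least-once behaviour, and why consumers then have to tolerate duplicates, can be sketched with a small simulation. This is not Flume code: the function names are hypothetical, and the existence of a unique event ID that consumers can deduplicate on is an assumption of this example.

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (event_id, payload); the ID scheme is an assumption


def deliver_at_least_once(events: Iterable[Event], ack_lost_for: Set[str]) -> List[Event]:
    """Simulate a sink that stores every delivery it receives.

    For event IDs in ack_lost_for, the first ACK never reaches the sender,
    so the sender times out and resends the event once.
    """
    stored: List[Event] = []
    for event in events:
        stored.append(event)          # first delivery is stored
        if event[0] in ack_lost_for:
            stored.append(event)      # resend after timeout is stored again
    return stored


def deduplicate(stored: Iterable[Event]) -> List[Event]:
    """Keep only the first occurrence of each event ID, preserving order."""
    seen: Set[str] = set()
    unique: List[Event] = []
    for event in stored:
        if event[0] not in seen:
            seen.add(event[0])
            unique.append(event)
    return unique


events = [("1", "a"), ("2", "b"), ("3", "c")]
stored = deliver_at_least_once(events, ack_lost_for={"2"})
clean = deduplicate(stored)
```

No event is lost, but event "2" is stored twice, so any exact counts have to be computed after a deduplication step like the one shown.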

Store on failure
Works like Scribe’s buffer store. The agent requires an ACK from the immediate downstream receiver. If the downstream receiver is down, the agent stores data locally until the downstream comes back up, or until a failover downstream is selected and the event is ACKed. This can also cause duplicate messages, for the same reason.

Best effort
This mode sends a message to the receiver, without any acknowledgement. This mode is good only if you need high throughput but not high reliability.

Pros

 * Centralized configuration
 * Reliable
 * Highly available (with a multi Flume Master setup).
 * Backed by Cloudera, this has been the logs -> HDFS solution of choice.
 * Source agents can track logrotated files, and HDFS sinks write file names with configurable granularity.

Cons

 * Java: using only Flume would require running a JVM on production servers (squids, varnish, etc.).
 * A slow sink could cause a large backlog.
 * Development of Flume has come to a stop. Flume-NG is the next release, and it is not production ready.
 * No publish/subscribe feature.