Analytics/Archive/Infrastructure/Logging Solutions Overview

This page is meant as a workspace to evaluate replacements for udp2log. The options discussed here will be used as the main data firehose into the Analytics Kraken Cluster.

= Context =

udp2log
Currently the Wikimedia Foundation uses a custom logging daemon, udp2log, to transport logs from several different sources to 3 (as of July 2012) dedicated logging machines. Any host can send UDP packets to a udp2log daemon. The udp2log daemon forks several 'filter' processes, each responsible for filtering and transforming the incoming logs. Usually these logs are then saved to local log files, but they can also be sent out over UDP again to another destination using log2udp.
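The transport model is fire-and-forget UDP: a producer just sends datagrams at the daemon, with no handshake or acknowledgement. A minimal sketch of both ends in Python (the loopback address and OS-picked port are illustrative; a real deployment uses its own host/port):

```python
import socket

# Illustrative addresses; a real udp2log deployment has its own host/port.
HOST, PORT = "127.0.0.1", 0   # port 0: let the OS pick a free port for this demo

# "Daemon" side: a UDP socket standing in for the udp2log listener.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind((HOST, PORT))
addr = server.getsockname()

# "Producer" side: any host fires datagrams at the daemon, no handshake.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
log_line = "frontend.example GET /wiki/Main_Page 200"
client.sendto(log_line.encode("utf-8"), addr)

data, _ = server.recvfrom(65535)
print(data.decode("utf-8"))   # the daemon would hand this line to its filters
client.close()
server.close()
```

Note there is no delivery guarantee anywhere in this path; if the daemon is down or the kernel buffer overflows, the datagram is silently lost.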

While udp2log is a simple C program and generally works as designed, there are problems with continuing to use it. The most obvious is that it does not scale. udp2log daemons are mainly used to collect web access logs for all Wikimedia sites. Traffic for these sites is so high that the incoming log streams must be sampled. None of the 3 log machines currently configured has the capacity to write unsampled log streams to disk, let alone enough storage space to keep them. udp2log was not designed as a distributed solution to this problem.
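For concreteness, the sampling a udp2log filter performs amounts to passing every Nth line through. A minimal sketch (the factor 1000 is illustrative, not the production setting):

```python
def sample(lines, n=1000):
    """Yield every n-th line, mimicking a 1-in-n udp2log sampling filter."""
    for i, line in enumerate(lines, start=1):
        if i % n == 0:
            yield line

# 10,000 simulated request lines reduced to 10 by 1-in-1000 sampling.
kept = list(sample((f"req {i}" for i in range(1, 10001)), n=1000))
print(len(kept))  # 10
```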

The Analytics team is tasked with building a distributed analytics cluster that can intake and store unsampled logs from any number of sources, and then subsequently do stream and batch processing on this data. We could attempt to enhance udp2log so that it works in a more distributed fashion. However, this problem has already been solved by some very smart people, so we see no need to spend our own resources solving it again. So: what can we use instead?

= Distributed Logging Solutions =

The following sections provide an overview of the logging solutions we are considering.

Configuration

 * How do clients find out about servers? (Bad: static configuration. Good: ZooKeeper, pub-sub.)
 * What kind of configuration is there for routing messages? (ex. brokers route messages; point-to-point; multicast.)

Failure and Recovery

 * What happens when an aggregation endpoint fails? Does the system support failover for aggregation endpoints? What kind of configuration is there for local buffering?
 * Does the system guarantee exactly once delivery, or at least once delivery?
 * Does the system support trees of aggregation? If so, is this chaining DC-aware, or do we have to build the awareness into our design/config?
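To illustrate the delivery-guarantee question above: an at-least-once system retries each message until it is acknowledged, which means the consumer may see duplicates and must deduplicate them itself. A sketch with a hypothetical flaky channel (none of this is any particular system's API):

```python
def deliver_at_least_once(messages, send, max_retries=5):
    """Retry each message until the receiver ACKs it (at-least-once delivery)."""
    for msg_id, payload in messages:
        for _ in range(max_retries):
            if send(msg_id, payload):   # send() returns True on ACK
                break

received, seen = [], set()
drop_first = {"flag": True}

def flaky_send(msg_id, payload):
    # Simulated channel: the very first attempt is "lost", forcing a retry.
    if drop_first["flag"]:
        drop_first["flag"] = False
        return False
    if msg_id not in seen:              # consumer-side dedup by message id
        seen.add(msg_id)
        received.append(payload)
    return True

deliver_at_least_once([(1, "a"), (2, "b")], flaky_send)
print(received)  # ['a', 'b'] — delivered despite the lost attempt, no duplicates
```

Exactly-once delivery would push this dedup burden into the system itself, which is why few systems promise it.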

Maintenance

 * Local logs should be configurably durable. What options are there for automatic file cleanup/deletion?
 * If the system ensures delivery of queued messages via durability, using local log or buffer files, then log rotation is a must; rotation must support per-minute granularity (we will probably use a larger window, but this is a reasonable minimum bucket).
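Per-minute rotation usually amounts to bucketing writes by a minute-granular timestamp embedded in the file name; "rotation" is then just switching buckets. A sketch (the path layout and prefix are made up):

```python
from datetime import datetime

def bucket_path(ts: datetime, prefix: str = "/var/log/firehose") -> str:
    """Map a timestamp to a per-minute log file; rotation = switching buckets."""
    return f"{prefix}/access-{ts.strftime('%Y%m%d-%H%M')}.log"

print(bucket_path(datetime(2012, 7, 1, 14, 30, 59)))
# /var/log/firehose/access-20120701-1430.log
```

Cleanup/deletion then reduces to removing buckets older than the retention window.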

Terminology

; Message
: A generic term for a single piece of data, usually a log line. (A.k.a. event.)
; Producer
: Usually the original source of a message. It can also refer to any upstream source of messages, but for our purposes it identifies an original message source (e.g. squid, varnish, an application). (A.k.a. source.)
; Consumer
: The final destination of a message. This can also refer to any downstream consumer in a chain, but in this document it means the final destination. (A.k.a. sink.)
; Agent
: Any daemon process that takes part in passing messages within a given logging or message system, e.g. a Scribe server, a Flume collector, a Kafka broker.
; Durable
: Durable messages are persisted to permanent storage and will survive server failure.
; Reliable
: A reliable system guarantees that each message will be delivered at least once.

Feature Comparison
We've put together a table comparing features of quite a few options. Below is a summary of the ones we are seriously considering. See this spreadsheet for a more complete comparison.

= Scribe =

Scribe is a distributed, push-based logging service. It provides reliability and availability via a tree hierarchy of Scribe servers combined with local failover file buffers.

Reliability
Scribe's message durability is provided by its buffer store type. A buffer store has a primary and a secondary store, each of which can be any store type. Typically, if the Scribe host in question is an aggregator, the primary store is a network store that simply forwards messages to the next Scribe server in the chain. The secondary is then configured as a simple file store. If the primary network store goes down and is unavailable, logs will be buffered in the secondary file store until the primary comes back online. The buffered logs will then be re-read from disk and sent on to the primary store.

A multi store tells Scribe to send its messages to two different stores. This allows fanout to multiple endpoints, but only two branches at a time. It is possible to set up each of the two branches as buffer stores, each using the configuration (network primary, local file secondary) described above.

With multi + buffer stores, Scribe essentially achieves message replication. If one of the two buffer stores in an immediate hierarchy dies, the second will continue receiving messages. In this setup, the downstream consumers need to be configured to avoid consuming duplicate logs, but this is the only way to guarantee 100% message durability with Scribe.
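The primary/secondary failover behavior described above can be sketched as follows. This is a Python stand-in for Scribe's actual C++ buffer store, with an in-memory list standing in for the on-disk secondary file store:

```python
class BufferStore:
    """Sketch of Scribe's buffer store: forward to the primary (network) store,
    spill to a secondary (file) buffer while it is down, replay on recovery."""

    def __init__(self, primary):
        self.primary = primary
        self.spill = []            # stands in for the secondary file store on disk

    def store(self, msg):
        if self.primary.up:
            while self.spill:              # replay buffered messages first, in order
                self.primary.messages.append(self.spill.pop(0))
            self.primary.messages.append(msg)
        else:
            self.spill.append(msg)

class FakeNetworkStore:
    """Stands in for the downstream Scribe server."""
    def __init__(self):
        self.up, self.messages = True, []

downstream = FakeNetworkStore()
buf = BufferStore(downstream)
buf.store("m1")
downstream.up = False
buf.store("m2")                # downstream unavailable: spilled locally
downstream.up = True
buf.store("m3")                # m2 replayed from the buffer, then m3 delivered
print(downstream.messages)     # ['m1', 'm2', 'm3']
```

Note that in the real system a retry after a lost ACK can re-deliver a message, which is why consumers must tolerate duplicates.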

How to get data in

 * Tail log files into scribe_cat or use scribe_tail. This has the disadvantage of (probably) being pretty inefficient.
 * Modify source software to use scribe client code and log directly to scribe. This is similar to what we are doing now for udp2log.

Pros

 * C++
 * Simple configuration files
 * Generally works well
 * Uses thrift so clients are available for any language.

Cons

 * No longer an active project; Facebook is phasing it out
 * HDFS integration might be buggy
 * Difficult to package
 * Static configuration
 * No publish / subscribe
 * Routing topologies are limited: only two-branch trees are possible.

= Flume =

Data Flows
Flume is configured via 'data flows'. “A data flow describes the way a single stream of data is transferred and processed from its point of generation to its eventual destination.” Data flows are a centralized, high-level way of expressing how data gets from producer machines to consumer machines. These flows are configured at the Flume Master and do not need to be configured on each producer. Each producer needs to run a Flume agent; the Flume Master then configures the agents to read data from the producers into the data flow.

Data flows are expressed in static config files, or given as commands to a Flume CLI or web GUI. In general, a flow is of the form <code>nodeName : source | sink ;</code>

Example data flow:
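A sketch of such a configuration in Flume (0.9.x) flow syntax. Node names, the port, and the HDFS path are illustrative, not an actual deployment:

```
squid1 : tail("/var/log/squid/access.log") | agentE2EChain("collector1:35853", "collector2:35853") ;
squid2 : tail("/var/log/squid/access.log") | agentE2EChain("collector2:35853", "collector1:35853") ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/squid/%Y-%m-%d/", "access-") ;
collector2 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/squid/%Y-%m-%d/", "access-") ;
```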

In this example, two squid nodes (call them squid1 and squid2) are configured to tail their access logs into an end-to-end agent chain with two-node failover. Each squid's access logs are sent to a collector node. The collector defines a collector source as its source, which simply tells the node that it should act as a Flume Collector. The logs that arrive at the collector's source are then written to HDFS.

The second squid node and the second collector work similarly. Notice that each squid's agent chain specifies two collector nodes; the additional node is a failover node. If the first collector goes down, the squid agent will be notified by the Flume Master (via ZooKeeper) and will begin sending its logs to the failover collector until the first comes back online.

Note that this setup explicitly specifies failover chains. Flume can take care of automatically configuring the failover chains, however there is a note in the documentation that says automatic failover chains do not work with multiple Flume Masters.

Flume's centralized configuration makes it easy to modify data flows on the fly, without having to change config files on producers.

Reliability
Each data flow's reliability level is configurable.

End-to-end
Uses a write-ahead log to buffer each event until the final sink has ACKed receiving and storing it. ACKs are handled through the master nodes, rather than up through the chain. This guarantees that a message will reach the end of a data flow at least once. However, if a sink blocks for some reason and an upstream source times out waiting for an ACK, the upstream source will resend the message. This can cause duplicate messages to be stored. Since this mode requires Flume Master coordination, it is the least efficient option.

Store on failure
Works like Scribe's buffer store. The agent requires an ACK from its immediate downstream receiver. If the downstream receiver is down, the agent stores data locally until it comes back up, or until a failover downstream is selected and the event is ACKed. This can also cause duplicate messages, for the same reason.

Best effort
This mode sends a message to the receiver without any acknowledgement. It is only suitable when you need high throughput but not high reliability.

How to get data in
Named pipes or Flume tail.



Pros

 * Centralized configuration
 * Reliable
 * Highly available (with a multi-Flume-Master setup).
 * Backed by Cloudera; this has been the logs-to-HDFS solution of choice.
 * Source agents can track logrotated files; HDFS sinks write file names with configurable time granularity.

Cons

 * Java; using Flume alone would require running a JVM on production servers (squids, varnish, etc.).
 * A slow sink could cause a large backlog.
 * Development of Flume has come to a stop; Flume NG is the next release, and it is not yet production ready.
 * No publish/subscribe feature.

= Kafka =

Reliability
Kafka makes no reliability guarantees at the producer or broker (agent) level. Consumers are responsible (via ZooKeeper, not manually) for saving state about what has been consumed. Brokers always save buffered messages for a configurable amount of time. LinkedIn keeps a week of data on each broker. This means that if a consumer fails, and we notice it within a week, the consumer should be able to start back up and continue consuming messages from where it left off.
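Consumer-side state tracking can be sketched as follows: a toy broker log and a persisted offset standing in for the consumer position Kafka stores in ZooKeeper (nothing here is the real Kafka API):

```python
broker_log = [f"msg-{i}" for i in range(10)]   # broker retains messages for a window
saved_offset = {"value": 0}                     # stands in for the ZooKeeper-stored offset

def consume(batch_size):
    """Read from the last committed offset; commit the new position afterwards."""
    start = saved_offset["value"]
    batch = broker_log[start:start + batch_size]
    saved_offset["value"] = start + len(batch)  # commit new position
    return batch

first = consume(4)        # ['msg-0' .. 'msg-3']
# ... consumer crashes and restarts; the committed offset survives ...
resumed = consume(4)      # picks up at 'msg-4': nothing lost, nothing repeated
print(first[-1], resumed[0])   # msg-3 msg-4
```

Because the broker keeps no per-consumer state, a restarted consumer only needs its committed offset to resume, as long as the broker's retention window has not expired.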

However, LinkedIn built an audit trail to track data loss. Each of their producers and consumers emits statistics on how many messages it has processed for a given topic in a given time period into a special Kafka topic for audit purposes. This topic then contains the sum of messages processed for each topic at each tier (producer and consumer in the hierarchy) for a given time window (e.g. 10 minutes). If the counts at each tier match, it is assumed no data loss has occurred. LinkedIn states: “We are able to run this pipeline with end-to-end correctness of 99.9999%. The overwhelming majority of data loss comes from the producer batching where hard kills or crashes of the application process can lead to dropping unsent messages.”
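The audit check reduces to comparing per-tier message counts for each (topic, window) pair. A sketch of that bookkeeping (not LinkedIn's implementation; topic and window values are made up):

```python
from collections import defaultdict

# counts[tier][(topic, window)] = number of messages that tier processed
counts = defaultdict(lambda: defaultdict(int))

def record(tier, topic, window, n=1):
    counts[tier][(topic, window)] += n

# Producers and consumers emit their tallies for the same 10-minute window.
for _ in range(1000):
    record("producer", "squid", "2012-07-01T00:00")
for _ in range(997):
    record("consumer", "squid", "2012-07-01T00:00")

key = ("squid", "2012-07-01T00:00")
lost = counts["producer"][key] - counts["consumer"][key]
print(lost)   # 3 messages unaccounted for in this window
```

A mismatch does not say *which* messages were lost, only how many, which is sufficient to monitor loss rates over time.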

How to get data in

 * tail -f /var/log/squid/access.log | kafka-console-producer.sh --topic squid
 * Kafka clients already exist for these languages: C++, C#, Go, PHP, Python, Ruby

Pros

 * publish / subscribe
 * Efficient memory management (JVM memory usage low) by leveraging OS pagecache + sendfile.
 * Clients written in many languages.
 * Brokers do not keep state.
 * Uses ZooKeeper for configuration and failover.
 * Consumer parallelization
 * No masters, all brokers are peers.

Cons

 * Scala, so runs in JVM.
 * No per message reliability. Have to monitor data loss.
 * No built-in 'data flows' like Flume. Have to set these up via subscribed consumers.