Analytics/Archive/Infrastructure/Logging Solutions Recommendation

A prerequisite to providing answers to rich, structured queries about our readers and editors is to capture the incoming request stream. Modest though this goal ("keep the logs") might appear, building the system to collect, aggregate, and process this firehose of banality is probably one of the most challenging tasks in bringing Kraken from the realm of myth into reality.

The current system has been scaled to its limit and then beyond, such that we currently accept regular failures and a 1:1000 sampling rate. The existing collector, a C program so simple its name really is all you need to know about its operation, cannot be used to capture this data stream if we wish it to be used for business-critical decision-making. UDP and sampling introduce unacceptable uncertainty into the data, and hobble the queries we can handle in good faith. Our present stopgap, ad-hoc data collection, is neither sustainable nor desirable.
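
To make the cost of 1:1000 sampling concrete, here is a rough sketch. It assumes each request is kept independently with probability 1/1000, which is an idealization rather than a description of the current collector, and the function name and example counts are purely illustrative. Under that assumption the sampled count is binomial, so scaling it back up gives an estimate whose relative standard error is sqrt((1 - p) / (p * N)) for a true count N.

```python
import math

SAMPLING_RATE = 1 / 1000  # the 1:1000 rate mentioned above


def relative_standard_error(true_count: int, rate: float = SAMPLING_RATE) -> float:
    """Relative standard error of a count estimated by scaling up a sample.

    Assumption: each request is sampled independently with probability `rate`,
    so the sampled count is binomial and the estimate is sampled_count / rate.
    """
    return math.sqrt((1 - rate) / (rate * true_count))


if __name__ == "__main__":
    # Hypothetical true request counts for a single page or query bucket.
    for true_count in (1_000, 100_000, 10_000_000):
        err = relative_standard_error(true_count)
        print(f"{true_count:>10,} requests -> relative error {err:.1%}")
```

Under these assumptions, anything with only about a thousand real requests is measured with a relative error near 100%, while it takes around ten million requests to get down to about 1%. This is why sampling limits the questions that can be answered about the long tail.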

The request logging system is an imposing mountain of requirements. It has to handle Big Data: the full stream of production requests, but likely replicated multiple times as it touches different points in the serve-chain. This system needs to operate in near-realtime, as it is impossible to catch up if the system consistently falls behind. These real performance needs are, as always, at odds with the needs of stability. As a distributed system, it must expect and account for frequent, unplanned failure. As a system that must interface with all our user-facing servers, it must not impact production stability.

Though the mountain is imposing, it has been scaled before. Our recommendation is Apache Kafka, a distributed pub-sub messaging system designed for throughput. We evaluated about a dozen best-of-breed systems drawn from the domains of distributed log collection, CEP / stream processing, and real-time messaging systems. While these systems offer surprisingly similar features, they differ substantially in implementation, and each is specialized to a particular work profile (a more thorough technical discussion is available as an appendix).

Kafka stands out because it is specialized for throughput and explicitly distributed in all tiers of its architecture. Interestingly, it is also concerned enough with resource conservation to offer sensible tradeoffs that loosen guarantees in exchange for performance, something that may not strike Facebook or Google as an important feature in the systems they design. Constraints breed creativity.

In addition, Kafka has several perks of particular interest to Operations readers. First, while it is written in Scala, it ships with a native C++ producer library that can be embedded in a module for our cache servers, obviating the need to run the JVM on those servers. Second, producers can be configured to batch requests to optimize network traffic, but do not create a persistent local log which would require additional maintenance.

Kafka was written by LinkedIn and is now an Apache project. In production at LinkedIn, approximately 10,000 producers are handled by eight Kafka servers per datacenter. These clusters consolidate their streams into a single analytics datacenter, which Kafka supports out of the box via a simple mirroring configuration.

These features are a very apt fit for our intended use cases; even those we don't intend to use (such as sharding and routing by "topic" categories) are interesting and might prove useful in the future as we expand our goals.

The rest of this document dives into these topics in greater detail.

The Firehose
Distributed log aggregation is a non-trivial problem for us because of The Firehose. The success of Wikipedia is a double-edged sword, as we are all well aware. At present, the entire request stream is estimated to be ~150k requests/second, handled by 635 servers, which generate around 117 GB/hour (uncompressed), or roughly 2.75 TB/day. Further, the load is geographically distributed, requiring that disparate streams be aggregated. As with all distributed programming challenges, we must expect connections to experience periods of high variance, and participants to randomly fail or disappear.
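
As a sanity check on these figures, the arithmetic can be sketched as follows. The three input numbers are the estimates quoted above. The binary unit convention (1 TB = 1024 GB) is an assumption, chosen because it brings 117 GB/hour close to the quoted 2.75 TB/day. The per-request and per-server averages are derived here for illustration and are not measured values.

```python
REQUESTS_PER_SECOND = 150_000  # estimate quoted above
GB_PER_HOUR = 117              # uncompressed, quoted above
SERVERS = 635                  # quoted above

BYTES_PER_GB = 2 ** 30         # assumption: binary units
GB_PER_TB = 1024               # assumption: binary units


def firehose_summary() -> dict:
    """Derive daily volume and rough per-request / per-server averages."""
    gb_per_day = GB_PER_HOUR * 24
    bytes_per_second = GB_PER_HOUR * BYTES_PER_GB / 3600
    return {
        "tb_per_day": gb_per_day / GB_PER_TB,
        "bytes_per_request": bytes_per_second / REQUESTS_PER_SECOND,
        "requests_per_server_per_second": REQUESTS_PER_SECOND / SERVERS,
    }


if __name__ == "__main__":
    for name, value in firehose_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the daily volume comes out at about 2.74 TB, consistent with the figure above, and each log record averages a little over 230 bytes.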

In this light, it is easy to understand how we have outgrown the current system. It is simple and efficient, with a kind of admirable innocence about it. But while efficiency born of simplicity has much to recommend it, we are faced with a complex and difficult problem. The result is a series of unsatisfying compromises. It offers no reliability guarantees. There is no failover. It seems to resist engaging with its parents: UDP makes it difficult to monitor failure or verify success. And somehow we end up doing its chores: if you wish to partition or replicate the work, it is not interested in helping you. "How about *YOU* go statically configure all the producers and consumers yourself?" "I don't wanna clean up my logfiles!" Our innocent child is a bit of a stubborn brat.

Requirements
With this clearer view of the problem, here are what we see as the requirements:


 * Performance: high throughput so as to handle the firehose.
 * Horizontal Scaling: clustered and distributed, providing fault tolerance and automated recovery, transport reliability, and durability for received messages.
 * Producer Client Performance & Stability: agent running on production servers must not impact stability of user-facing servers.
 * Battle Tested: must be in production, handling Big Data somewhere else.
 * Simplicity: low maintenance cost, amenable to automation.
 * Independence: not tied to any producer platform or storage method (e.g., not specialized for HDFS).

Kafka
Apache Kafka is a distributed pub-sub messaging system designed for throughput. The homepage contains a concise statement of its virtues:


 * Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
 * High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
 * Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
 * Support for parallel data load into Hadoop.
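
The partitioning point above can be illustrated with a small in-memory sketch. This is not Kafka's partitioner or storage format: the class and function names are hypothetical, and hashing a key with CRC32 is an assumption made only so the example is deterministic. What it shows is the general idea that messages sharing a key land in the same partition, and that each partition is an append-only sequence whose order is preserved.

```python
import zlib
from typing import List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedLog:
    """Toy model: one append-only list per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, message: bytes) -> Tuple[int, int]:
        """Append a message and return (partition, offset within partition)."""
        partition = partition_for(key, self.num_partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> List[bytes]:
        """Return every message in a partition from `offset` onward, in order."""
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=4)
    for line in (b"GET /a", b"GET /b", b"GET /c"):
        print(log.append("cache-01", line))
```

Ordering is only guaranteed within a partition. Two messages with different keys may end up in different partitions, and nothing in this sketch orders them relative to each other.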

Project
The project was developed by LinkedIn for its activity stream, and forms the backbone of its data processing pipeline. That is to say: several years ago, LinkedIn needed to solve exactly the problem we're looking at, and found all the other solutions lacking. They didn't put emphasis on throughput or resource efficiency, or they had awkward logfile-based semantics. So they wrote Kafka, and now it is used by dozens of companies who found themselves in the same place.

Kafka has an active community and a solid roadmap. The team is small (about 5 core committers, it seems) but well-organized and productive. The authors also published several research papers on the system when it was first open sourced.

Architecture
Kafka is a publish-subscribe message passing system. This section attempts to provide the relevant architectural details without getting into technical detail for its own sake.

The elemental unit of transmission in Kafka is a single message -- it has no concept of log files when interacting with producers, nor databases when sending data to consumers. This eliminates a number of awkward problems in serializing or streaming data, and decouples the concept of a producer from its machine and application. All messages are categorized with a "topic", which is used for routing messages to consumers. For the most part, we can ignore the details of this for our discussion.

The agents in a Kafka system are split into three roles:
 * A Producer originates messages; in other systems this may be called a "source" or "client". Each application that generates log data of analytic interest will be acting as a producer -- thus each squid, apache, and/or varnish is a candidate.
 * A Broker (with "Kafka Server" used interchangeably) receives messages from producers, handles message persistence, and routes them to the appropriate topic queues.
 * A Consumer processes messages from a topic queue. Though much of Kafka's architecture concerns the relationship between consumers, message consumption, and brokers, we'll be eliding that detail as nearly all consumers will live in the analytics cluster.
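
The three roles above can be sketched as a toy, single-process model. This is not the Kafka client API: the class and method names are hypothetical, everything lives in memory, and having the consumer track its own read position is a simplification of the consumer behavior this document deliberately elides. It is only meant to show how the responsibilities divide.

```python
from typing import Dict, List


class Broker:
    """Receives messages and keeps one ordered queue per topic."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[bytes]] = {}

    def publish(self, topic: str, message: bytes) -> None:
        self._topics.setdefault(topic, []).append(message)

    def fetch(self, topic: str, offset: int) -> List[bytes]:
        """Return messages for `topic` starting at position `offset`."""
        return self._topics.get(topic, [])[offset:]


class Producer:
    """Thin client: it only knows how to hand a message to the broker."""

    def __init__(self, broker: Broker) -> None:
        self._broker = broker

    def send(self, topic: str, message: bytes) -> None:
        self._broker.publish(topic, message)


class Consumer:
    """Reads one topic and remembers how far it has read."""

    def __init__(self, broker: Broker, topic: str) -> None:
        self._broker = broker
        self._topic = topic
        self._offset = 0

    def poll(self) -> List[bytes]:
        messages = self._broker.fetch(self._topic, self._offset)
        self._offset += len(messages)
        return messages


if __name__ == "__main__":
    broker = Broker()
    Producer(broker).send("requests", b"GET /wiki/Main_Page")
    print(Consumer(broker, "requests").poll())
```

Note that the producer holds no state beyond a reference to the broker, which matches the "thin client" role described for production servers.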

This separation of concerns has several beneficial effects:
 * The publish-subscribe model means that producers only need to know how to connect to the broker cluster. The cluster itself can determine the optimal distribution of producers to brokers, conduct load-balancing, and mediate failover. This simplifies all configuration to the point that one uniform config can be used for all hosts in a producer cluster.
 * Producers are intended to be thin, dumb clients: no logs are stored on disk before being sent to the broker; no acks are sent, and there are no retries. To mitigate the possibility of messages encountering network difficulties, Kafka uses TCP as the transport and layers its own transmission format on top of it.
 * Kafka's message format includes compression (gzip or snappy) and the ability to recursively embed messages, enabling very efficient network transmission. The buffer window and/or size can be controlled via configuration.
 * The clear separation of concerns allows for easy multi-datacenter mirroring: a consumer from one cluster can be a producer for another. Kafka ships with this functionality built in.
 * As messages are ordered by arrival timestamp as determined by the broker, there is no need for a single master anywhere in Kafka; all brokers are peers.
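
The batching and compression behavior described in the list above can be sketched roughly as follows. This is not Kafka's wire format or producer API. The length-prefixed framing, the `max_batch` parameter, and all class and function names are assumptions made for illustration. The sketch shows a producer that buffers messages in memory, never on disk, and ships them as one gzip-compressed payload once a configured batch size is reached.

```python
import gzip
import struct
from typing import Callable, List


def pack_batch(messages: List[bytes]) -> bytes:
    """Frame each message with a 4-byte length prefix, then gzip the batch."""
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    return gzip.compress(framed)


def unpack_batch(payload: bytes) -> List[bytes]:
    """Inverse of pack_batch: decompress and split on the length prefixes."""
    framed = gzip.decompress(payload)
    messages, pos = [], 0
    while pos < len(framed):
        (length,) = struct.unpack_from(">I", framed, pos)
        pos += 4
        messages.append(framed[pos:pos + length])
        pos += length
    return messages


class BatchingProducer:
    """Buffers messages in memory and sends them in compressed batches."""

    def __init__(self, send: Callable[[bytes], None], max_batch: int = 100) -> None:
        self._send = send            # hypothetical transport callback
        self._max_batch = max_batch  # hypothetical config knob
        self._buffer: List[bytes] = []

    def publish(self, message: bytes) -> None:
        self._buffer.append(message)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single compressed payload."""
        if self._buffer:
            self._send(pack_batch(self._buffer))
            self._buffer = []


if __name__ == "__main__":
    sent: List[bytes] = []
    producer = BatchingProducer(sent.append, max_batch=3)
    for i in range(3):
        producer.publish(b"GET /wiki/Page_%d HTTP/1.1" % i)
    print(len(sent), unpack_batch(sent[0]))
```

Because request log lines are highly repetitive, compressing a whole batch generally saves far more than compressing messages individually. The tradeoff is that anything still in the buffer is lost if the producer dies before flushing.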

Rejected Alternatives
(in progress)