Analytics/Archive/Infrastructure/Logging Solutions Recommendation

A prerequisite to answering rich, structured queries about our readers and editors is capturing the incoming request stream. Modest though this goal ("keep the logs") might appear, building the system to collect, aggregate, and process this firehose of banality is probably one of the most challenging tasks in bringing Kraken from the realm of myth into reality.

The current system has been scaled to its limit and then beyond, such that we currently accept regular failures and a 1:1000 sampling rate. The existing collector, a C program so simple its name really is all you need to know about its operation, cannot be used to capture this data stream if we wish the data to be used for business-critical decision-making. UDP and sampling introduce unacceptable uncertainty into the data, and hobble the queries we can handle in good faith. Our present stopgap (ad-hoc data collection) is neither sustainable nor desirable.
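To put a rough number on the uncertainty that sampling alone introduces, here is a minimal sketch. It assumes each request is kept independently with probability 1/1000, which may not match how the current collector actually samples, and it ignores UDP packet loss entirely. The function name is ours, for illustration only.

```python
import math

SAMPLING_RATE = 1 / 1000  # the 1:1000 rate quoted above


def relative_error(true_count, rate=SAMPLING_RATE):
    """Relative standard error of a scaled-up count.

    Assumption: independent (Bernoulli) sampling, so the number of sampled
    events is binomial and the estimate is sampled_count / rate.
    """
    expected_sampled = true_count * rate
    return math.sqrt((1 - rate) / expected_sampled)


if __name__ == "__main__":
    for n in (10_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} true events -> +/- {relative_error(n):.1%} (1 sigma)")
```

Under these assumptions, anything that occurs only about ten thousand times in the full stream is estimated with an error of roughly 30%, which is why queries about the long tail cannot be answered in good faith from sampled data.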

The request logging system is an imposing mountain of requirements. It has to handle Big Data: the full stream of production requests, likely replicated multiple times as it touches different points in the serve-chain. This system needs to operate in near-realtime, as it is impossible to catch up if the system consistently falls behind. These real performance needs are, as always, at odds with the needs of stability. As a distributed system, it must expect and account for frequent, unplanned failure. As a system that must interface with all our user-facing servers, it must not impact production stability.

Though the mountain is imposing, it has been scaled before. Our recommendation is Apache Kafka, a distributed pub-sub messaging system designed for throughput. We evaluated about a dozen best-of-breed systems drawn from the domains of distributed log collection, CEP / stream processing, and real-time messaging systems. While these systems offer surprisingly similar features, they differ substantially in implementation, and each is specialized to a particular work profile (a more thorough technical discussion is available as an appendix).

Kafka stands out because it is specialized for throughput and explicitly distributed in all tiers of its architecture. Interestingly, it is also concerned enough with resource conservation to offer sensible tradeoffs that loosen guarantees in exchange for performance, something that may not strike Facebook or Google as an important feature in the systems they design. Constraints breed creativity.

In addition, Kafka has several perks of particular interest to Operations readers. First, while it is written in Scala, it ships with a native C++ producer library that can be embedded in a module for our cache servers, obviating the need to run the JVM on those servers. Second, producers can be configured to batch requests to optimize network traffic, but do not create a persistent local log, which would require additional maintenance.
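The batching behaviour described above can be illustrated with a small sketch. This is not the Kafka producer API: the class `BatchingProducer`, its parameters, and the `send` callback are all hypothetical, and only show the general idea of buffering messages in memory and shipping them in groups without ever writing a local log.

```python
import time


class BatchingProducer:
    """Hypothetical in-memory batcher; NOT the Kafka producer API."""

    def __init__(self, send, max_batch=200, max_delay_s=1.0, clock=time.monotonic):
        self._send = send              # callable that ships one list of messages
        self._max_batch = max_batch    # flush once this many messages are buffered
        self._max_delay_s = max_delay_s  # or once the oldest message is this old
        self._clock = clock
        self._buffer = []              # memory only: nothing is persisted locally
        self._oldest = None            # arrival time of the oldest buffered message

    def publish(self, message):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(message)
        # Simplification: the age limit is only checked when a message arrives;
        # a real producer would also flush from a timer.
        too_many = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._oldest >= self._max_delay_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


if __name__ == "__main__":
    producer = BatchingProducer(lambda batch: print("sending", len(batch), "messages"),
                                max_batch=3)
    for line in ["GET /a", "GET /b", "GET /c", "GET /d"]:
        producer.publish(line)
    producer.flush()  # ship whatever is left
```

The tradeoff is the one the text names: fewer, larger network writes, at the cost of losing whatever is still buffered if the host dies before the next flush.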

Kafka was written by LinkedIn and is now an Apache project. In production at LinkedIn, approximately 10,000 producers are handled by eight Kafka servers per datacenter. These clusters consolidate their streams into a single analytics datacenter, which Kafka supports out of the box via a simple mirroring configuration.

These features are a very apt fit for our intended use cases; even those we don't intend to use (such as sharding and routing by "topic" categories) are interesting and might prove useful in the future as we expand our goals.

The rest of this document dives into these topics in greater detail.

The Firehose
Distributed log aggregation is a non-trivial problem for us because of The Firehose. The success of Wikipedia is a double-edged sword, as we are all well aware. At present, the entire request stream is estimated to be ~150k requests/second, handled by 635 servers, which generate around 117 GB/hour (uncompressed), or about 2.75 TB/day. Further, the load is geographically distributed, requiring that disparate streams be aggregated. As with all distributed programming challenges, we must expect connections to experience periods of high variance, and participants to randomly fail or disappear.
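As a back-of-the-envelope check, the figures above can be turned into per-second and per-request numbers. The sketch below uses only the estimates quoted in this section; the choice of binary units (1 GB = 2^30 bytes) is our assumption, and the results are approximations of approximations.

```python
# Estimates quoted in the text above.
REQUESTS_PER_SECOND = 150_000
SERVERS = 635
GB_PER_HOUR = 117  # uncompressed

# Assumption: binary units (1 GB = 2**30 bytes, 1 TB = 1024 GB).
BYTES_PER_GB = 2**30


def firehose_summary():
    """Derive daily volume, sustained bandwidth, and per-request size."""
    bytes_per_second = GB_PER_HOUR * BYTES_PER_GB / 3600
    return {
        "tb_per_day": GB_PER_HOUR * 24 / 1024,
        "mb_per_second": bytes_per_second / 2**20,
        "bytes_per_request": bytes_per_second / REQUESTS_PER_SECOND,
        "requests_per_server_per_second": REQUESTS_PER_SECOND / SERVERS,
    }


if __name__ == "__main__":
    for name, value in firehose_summary().items():
        print(f"{name:>32}: {value:,.2f}")
```

With these inputs the stream works out to roughly 2.7 TB/day, a sustained rate in the low tens of MB/s, and a couple of hundred bytes per logged request, before any replication along the serve-chain is counted.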

In this light, it is easy to understand how we have outgrown the current collector. It is simple and efficient, with a kind of admirable innocence about it. But while efficiency born of simplicity has much to recommend it, we are faced with a complex and difficult problem. The result is a series of unsatisfying compromises. It offers no reliability guarantees. There is no failover. It seems to resist engaging with its parents: UDP makes it difficult to monitor failure or verify success. And somehow we end up doing its chores: if you wish to partition or replicate the work, it is not interested in helping you. "How about *YOU* go statically configure all the producers and consumers yourself?" "I don't wanna clean up my logfiles!" Our innocent child is a bit of a stubborn brat.

Requirements
With this clearer view of the problem, here are what we see as the requirements:


 * Performance: high throughput so as to handle the firehose.
 * Horizontal Scaling: clustered and distributed, providing fault tolerance and automated recovery, transport reliability, and durability for received messages.
 * Producer Client Performance & Stability: agent running on production servers must not impact stability of user-facing servers.
 * Battle Tested: must be in production, handling Big Data somewhere else.
 * Simplicity: low maintenance cost, amenable to automation.
 * Independence: not tied to any producer platform or storage method (e.g., not specialized for HDFS).

Kafka
(in progress)

Rejected Alternatives
(in progress)