Analytics/Archive/Infrastructure/Logging Solutions Overview

This page is meant as a workspace to evaluate replacements for udp2log. The options here will also be used as the main stream into the Analytics Kraken Cluster.

= Context =

udp2log
Currently the Wikimedia Foundation uses a custom logging daemon, udp2log, to transport logs from several different sources to 3 (as of July 2012) dedicated logging machines. Any host can send UDP packets to a udp2log daemon. The udp2log daemon forks several 'filter' processes that are each responsible for filtering and transforming all incoming logs. Usually these logs are then saved to local log files, but they can also be sent out again over UDP to another destination using log2udp.
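As a rough illustration of this flow (not udp2log's actual implementation, which is written in C and pipes lines to forked filter processes), here is a minimal Python sketch. The names `send_log_line` and `receive_and_filter`, and the example filters, are hypothetical; filters are modelled as plain in-process functions for brevity.

```python
import socket
from typing import Callable, Dict, List, Optional

Filter = Callable[[str], Optional[str]]

def send_log_line(line: str, host: str, port: int) -> None:
    """Fire-and-forget: send one log line as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))

def receive_and_filter(sock: socket.socket,
                       filters: Dict[str, Filter],
                       count: int) -> Dict[str, List[str]]:
    """Read `count` datagrams and hand every line to every filter.

    Each filter sees the full stream and returns either a (possibly
    transformed) line to keep, or None to drop it.
    """
    kept: Dict[str, List[str]] = {name: [] for name in filters}
    for _ in range(count):
        data, _addr = sock.recvfrom(65535)
        line = data.decode("utf-8", errors="replace")
        for name, keep in filters.items():
            result = keep(line)
            if result is not None:
                kept[name].append(result)
    return kept

# Two hypothetical filters: one keeps everything, one keeps only 404s.
EXAMPLE_FILTERS: Dict[str, Filter] = {
    "all": lambda line: line,
    "errors": lambda line: line if line.endswith(" 404") else None,
}
```

Because UDP is connectionless, the sender never learns whether a line arrived; that keeps the sending hosts simple but means lines can be lost silently.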

While udp2log is a simple C program and generally works as it was designed, there are several problems with continuing to use it. The most obvious is that it does not scale. udp2log daemons are mainly being used to collect web access logs for all Wikimedia sites. The traffic for these sites is so great that the incoming log streams have to be sampled. None of the 3 log machines currently configured has the capacity to write unsampled log streams to disk, let alone enough storage space to keep these logs. udp2log was not designed as a distributed solution to this problem.
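To make the sampling idea concrete, here is a minimal sketch of 1-in-N sampling over a stream of log lines. The function name `sample_every_nth` is hypothetical, and the actual sampling logic in udp2log may differ; this only shows why sampling reduces the write load by roughly a factor of N.

```python
from typing import Iterable, Iterator

def sample_every_nth(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield every n-th line of the stream and drop the rest."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for position, line in enumerate(lines, start=1):
        if position % n == 0:
            yield line

# With n=1000, only about 0.1% of the incoming lines reach the disk.
sampled = list(sample_every_nth((f"request {i}" for i in range(1, 5001)), 1000))
```

The trade-off is that anything depending on rare events or exact counts can no longer be computed from the stored logs, which is why unsampled intake is a goal for the new cluster.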

The Analytics team is tasked with building a distributed analytics cluster that can intake and store unsampled logs from any number of sources, and then subsequently do stream and batch processing on this data. We could attempt to enhance udp2log so that it works in a more distributed fashion. However, this problem has already been solved by some very smart people, so we see no need to spend our own resources solving it again. So, what can we use instead?

Requirements for a distributed logger

 * Hard Requirements:
   * Open source
   * Horizontally scalable
   * Durable
 * Would be nice:
   * Publish / subscribe model
   * Ability to modify stream data (AKA content transformation, stream processing) for IP anonymization, GeoCoding, etc.
   * Good documentation, active community
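As an example of the kind of content transformation meant here, the following is a minimal sketch of IP anonymization by prefix truncation. The function names, the default prefix lengths (/24 for IPv4, /48 for IPv6), and the assumption that the client IP is a whitespace-separated field of the log line are all illustrative choices, not a stated Wikimedia policy or log format.

```python
import ipaddress

def anonymize_ip(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero out the host bits of an address, keeping only the network prefix."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

def anonymize_line(line: str, ip_field: int = 0) -> str:
    """Anonymize the IP found in one field of a whitespace-separated log line.

    `ip_field` is a hypothetical field position; adjust it to the real format.
    """
    fields = line.split()
    fields[ip_field] = anonymize_ip(fields[ip_field])
    return " ".join(fields)
```

A logging system that supports stream processing could apply a function like `anonymize_line` to every message before it is stored, so that raw addresses never reach disk.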

= Distributed Logging Solutions =

We've put together a table comparing features of quite a few options. Below is a summary of the ones we are seriously considering. See this spreadsheet for a more complete comparison.

Scribe
Scribe is a distributed, push-based logging service. It guarantees reliability and availability by using a tree hierarchy of Scribe servers combined with local failover file buffers.
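The failover-buffer idea can be sketched as follows. This is not Scribe's code or API; `BufferedForwarder` and its `send` callback are hypothetical, and the sketch assumes single-line messages and a single writer.

```python
import os
from typing import Callable

class BufferedForwarder:
    """Forward messages upstream; spill to a local file while upstream is down."""

    def __init__(self, send: Callable[[str], None], buffer_path: str) -> None:
        self.send = send                # raises OSError when upstream is unreachable
        self.buffer_path = buffer_path  # local failover buffer file

    def log(self, message: str) -> None:
        # Replay the backlog first so messages go upstream in arrival order.
        if self._has_backlog() and not self._flush():
            self._append(message)
            return
        try:
            self.send(message)
        except OSError:
            self._append(message)

    def _has_backlog(self) -> bool:
        return (os.path.exists(self.buffer_path)
                and os.path.getsize(self.buffer_path) > 0)

    def _append(self, message: str) -> None:
        with open(self.buffer_path, "a", encoding="utf-8") as f:
            f.write(message + "\n")

    def _flush(self) -> bool:
        """Try to send every buffered message; return True if the buffer emptied."""
        with open(self.buffer_path, encoding="utf-8") as f:
            pending = f.read().splitlines()
        for index, message in enumerate(pending):
            try:
                self.send(message)
            except OSError:
                # Keep only the messages that have not been sent yet.
                with open(self.buffer_path, "w", encoding="utf-8") as f:
                    f.writelines(m + "\n" for m in pending[index:])
                return False
        os.remove(self.buffer_path)
        return True
```

In a tree of such forwarders, each level would buffer locally when its parent is unreachable, so a temporary outage higher up does not lose messages at the leaves, as long as local disk space holds out.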

See Also:

Is Scribe Still Maintained?

Why did Facebook develop a new logging service? In particular, check out Sam Rash's answer and presentation on Calligraphus and Facebook's data freeway.

What's Up Scribe - Otto's blog post on Scribe packaging and summary of Calligraphus.

Pros

 * C++
 * Simple configuration files
 * Generally works well
 * Uses Thrift, so clients can be generated for many languages

Cons

 * No longer an active project; Facebook is phasing it out
 * HDFS integration might be buggy
 * Difficult to package
 * Static configuration
 * No publish / subscribe
 * Routing topologies are limited: can only do two-branch trees