Analytics/Archive/Infrastructure/Overview

= Glossary =


 * Kraken
 * Kraken is the code-name for the distributed computing and data-services platform under construction by the Wikimedia Analytics Team.


 * Hadoop
 * A collection of services for batch processing of large data. Core concepts are a distributed filesystem (HDFS) and MapReduce.


 * HDFS
 * Hadoop Distributed Filesystem.


 * YARN
 * Yet-Another-Resource-Negotiator. Compute resource manager and API for distributed applications.  MapReduce (v2) is a YARN application.  Although not technically correct, YARN is sometimes referred to as MRv2.


 * MapReduce
 * Programming model for parallelizing batch processing of large data sets. Really good for counting things.  Hadoop ships with a Java implementation of MapReduce.


 * Hue
 * Hadoop User Experience. Web GUI for various Hadoop services.  (HDFS, Oozie, Pig, Hive, etc.)  Written in Python Django.


 * Pig
 * High-level language abstraction for implementing common MapReduce programs without thinking about the MapReduce model. (Feels like a mix between SQL and awk.)  Generates MapReduce programs that are run in Hadoop.


 * Hive
 * Projects structure onto flat data (text or binary) in HDFS and allows this data to be queried in an SQL-like manner. Analytics is not currently using Hive, although we believe it will be useful for certain types of analysis as more clients need it.


 * Oozie
 * Scheduler for Hadoop jobs. Used for automated reporting based on data availability.


 * Storm
 * Distributed real-time stream processing system. Useful for transformation and computation of streaming data.


 * Kafka
 * Distributed pub/sub message queue. Useful as a big ol' reliable log buffer.


 * Zookeeper
 * Highly reliable dynamic configuration store. Instead of keeping configuration and simple state in flat files or a database, Zookeeper allows applications to be notified of configuration changes on the fly.


 * Cloudera
 * An organization that provides Hadoop ecosystem packages and documentation. Kraken uses Cloudera's CDH4 distribution.


 * webrequest log stream (aka firehose)
 * The Analytics team refers to the web access logs generated by all WMF frontend cache webservers as 'webrequest logs'. The existing UDP stream is often called the 'firehose', or the webrequest log stream.


 * event log stream
 * The bits Varnish caches are currently configured to send data from certain requests made to an endpoint that they serve. This is intended as a way to get generic application data into Kraken for analysis; the EventLogging extension also uses this endpoint.


 * JMX
 * JMX (Java Management Extensions) is a widely-adopted JVM mechanism to expose VM and application metrics, state, and management features. Nearly all the above-listed technologies provide JMX metrics; for example, Kafka brokers expose BytesIn, BytesOut, FailedFetchRequests, FailedProduceRequests, and MessagesIn aggregated across all topics (as well as individually). Kraken is using jmxtrans to send relevant metrics to Ganglia and elsewhere.
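To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to counting words. It only models the programming idea; it is not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, count_words) are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        for key, value in mapper(record):
            yield key, value


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(lines):
    """Count word occurrences: the mapper emits (word, 1), the reducer sums."""
    def mapper(line):
        return ((word, 1) for word in line.split())

    def reducer(_key, values):
        return sum(values)

    return reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the sketch runs everything in one process so the data flow is easy to follow.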

= Architecture =

First! A summary of Kraken's current (as of February 2013) setup. The end of this document contains a wishlist of the pieces of the architecture that we'd like to change.

Kraken can be used for many things other than processing the webrequest logs, but as that is our primary source of big data and this is an overview document, the focus will be on describing the architecture to import, store, and analyze this data stream.

== UDP Webrequest Log Stream ==
Frontend cache servers (varnish, squid, and nginx) are all configured to send webrequest access logs over UDP to three destination hosts: emery, locke, and oxygen. oxygen runs a socat multicast relay that is consumable by nodes in eqiad.

Analytics' spring 2013 deliverables require running analysis on unsampled mobile webrequest logs. Node(s) in the analytics cluster run udp2log instances that filter for traffic from the 4 frontend varnish mobile hosts (cp1041-cp1044). udp2log pipes this unsampled data into Kafka shell producers.
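As a rough illustration of the filtering step, the sketch below keeps only the log lines produced by the four mobile hosts. It assumes that the first whitespace-separated field of each log line is the hostname of the cache server that emitted it; that assumption, the fully qualified example hostnames, and the function names are ours and not taken from the actual filter configuration.

```python
# Hypothetical filter: keep only lines emitted by the mobile varnish hosts.
# Assumption: the first field of every log line is the sending host's name.
MOBILE_HOSTS = {"cp1041", "cp1042", "cp1043", "cp1044"}


def is_mobile_line(line, hosts=MOBILE_HOSTS):
    """Return True if the line's first field names one of the given hosts."""
    fields = line.split(None, 1)
    if not fields:
        return False  # blank line, nothing to match
    # Compare on the short hostname so a fully qualified name also matches.
    short_name = fields[0].split(".", 1)[0]
    return short_name in hosts


def filter_mobile(lines, hosts=MOBILE_HOSTS):
    """Yield only the lines that came from the mobile cache hosts."""
    for line in lines:
        if is_mobile_line(line, hosts):
            yield line
```

A filter like this would typically read lines from standard input and write the matching ones to standard output, so that its output can be piped straight into a producer.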

== Log Buffering - Kafka ==
Kraken is using kafka-producer-shell.sh to send data from udp2log into Kafka. The 2 Kafka brokers retain this data on disk for up to a week. It is available for consumption by any Kafka consumer. Kafka consumers specify a byte offset in a particular Kafka topic at which they would like to start reading, as well as a byte limit. A byte limit of -1 means to read until the end of the topic is reached.
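The offset and limit semantics described above can be modelled in a few lines. This is a toy stand-in that treats a topic as one in-memory bytes object; read_topic is a hypothetical name, not a Kafka client call.

```python
def read_topic(log, offset, limit=-1):
    """Return (data, next_offset) for a read starting at a byte offset.

    log    -- bytes object standing in for one topic's retained log
    offset -- byte position at which to start reading
    limit  -- maximum number of bytes to return; -1 reads to the end
    """
    if offset < 0 or offset > len(log):
        raise ValueError("offset outside the retained log")
    if limit < -1:
        raise ValueError("limit must be -1 or a non-negative byte count")
    end = len(log) if limit == -1 else min(offset + limit, len(log))
    data = log[offset:end]
    # The caller keeps next_offset and passes it back on its next read.
    return data, offset + len(data)
```

Because the consumer, not the broker, remembers the offset, several independent consumers can read the same topic at their own pace.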

Cron jobs running on a single analytics node are configured to ask Zookeeper for existing Kafka topics, and to consume particular topics into HDFS. This is done via kafka-hadoop-consumer, a Hadoop MapReduce job that consumes from Kafka and writes to HDFS. kafka-hadoop-consumer saves its per-topic read state in Zookeeper, and uses this to know where it left off after the previous import.
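The resume-from-saved-state behaviour can be sketched as follows. Plain dictionaries stand in for Kafka, Zookeeper, and HDFS, and the function names are hypothetical; this shows the checkpointing idea only, not how kafka-hadoop-consumer is implemented.

```python
def import_topic(topic, brokers, offsets, sink):
    """Consume everything new in one topic and append it to the sink.

    brokers -- dict: topic name -> bytes (stand-in for Kafka)
    offsets -- dict: topic name -> next byte offset (stand-in for Zookeeper)
    sink    -- dict: topic name -> list of imported chunks (stand-in for HDFS)
    """
    start = offsets.get(topic, 0)   # resume where the previous run stopped
    data = brokers[topic][start:]   # read until the end of the topic
    if data:
        sink.setdefault(topic, []).append(data)
    # Record the new position only after the data has been written.
    offsets[topic] = start + len(data)
    return len(data)


def run_import(brokers, offsets, sink):
    """One cron-style run: import every known topic, return bytes read per topic."""
    return {
        topic: import_topic(topic, brokers, offsets, sink)
        for topic in sorted(brokers)
    }
```

Running run_import twice in a row without new data imports nothing the second time, because the saved offset already points at the end of each topic.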

== Batch Analysis - Hadoop ==
Raw imported data is saved in HDFS in the /wmf/raw directory. Files in /wmf/raw are not world readable. Analysis of these logs is done by various Pig scripts scheduled with Oozie, although we are (as of mid February 2013) working to clean up and automate this process. Hue is used extensively to browse data in HDFS and to edit, schedule, and monitor Oozie jobs as they run.

Oozie is used to schedule and chain runs of Pig scripts based on the availability of data in HDFS. Workflows are triggered once the data is available, and are configured to provide the Pig scripts with the paths of HDFS files to use for a particular run.
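The data-availability trigger can be pictured with a small sketch. It is not Oozie configuration; it only models the decision of which runs are ready to start. The run identifiers, the paths under /wmf/raw, and the function name are hypothetical.

```python
def ready_runs(expected_inputs, available_paths, already_run):
    """Return (run_id, paths) for runs whose inputs are all present.

    expected_inputs -- dict: run id -> list of HDFS paths the run needs
    available_paths -- set of paths that currently exist
    already_run     -- set of run ids that have been triggered before
    """
    ready = []
    for run_id in sorted(expected_inputs):
        if run_id in already_run:
            continue  # never trigger the same run twice
        paths = expected_inputs[run_id]
        if all(path in available_paths for path in paths):
            # These paths are what would be handed to the Pig script.
            ready.append((run_id, paths))
    return ready
```

A scheduler would call something like this periodically, start a workflow for each returned run, and add the run id to already_run.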

Oozie and Pig are generally configured to save the output of their analysis in the /wmf/public directory of HDFS. This directory is world readable and is intended to be available for visualization using Limn, or for analysts to download for further custom analysis or processing. It could also be used to trigger additional automated Oozie workflows.

= Wishlist =

Oh there are so many hacky pieces of this setup! Here's what we'd like to change:


 * We'd love to run Kafka producers on the frontend cache webservers, rather than relying on a single udp2log stream. A single stream means that the NIC of every consuming host must receive every webrequest log byte.  Kafka is meant to be dynamically configured via Zookeeper, and horizontal scaling and automatic failover are built in.  If the frontends sent their data directly to Kafka, we could do away with the hacky and inefficient unsampled udp2log -> Kafka shell producer setup we have currently.


 * The existing ETL phase has got to change. The crappy kafka-hadoop-consumer spawned by a cron job to consume from Kafka topics into HDFS just sucks.  Replacement contenders are LinkedIn's Camus and/or Storm.


 * We're importing raw webrequest logs, which contain sensitive data. We would like to use Storm topologies to sanitize this data, as well as to do other transformations before the data is stored in HDFS.

Below is the envisioned architecture in diagram form:

= Envisioned Architecture Diagram =