Analytics/Hardware

David Schoonover, March 2012

The Wiki Movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. We are starved for it. This document describes the resources needed to sate our data-hunger; in addition, it outlines several operational proposals for partially feeding this need, though I suspect we will underestimate how hungry we are until after we get the first byte.

Overview
It is best to think of data processing in terms of searching for relationships. Relationships might not be fully realized within one record, so it is often necessary to scan the dataset multiple times, or to be computationally tricky. Searching a lot of data is hard. Searching for a complex relationship is hard. Searching a lot of data for a complex relationship is exponentially hard.

Mo data, mo processing².

Available query capacity, on the other hand, scales linearly with the size of the cluster dedicated to stream and query processing. We can often efficiently partition a hard task, but this does not make it less hard; more machines can make a problem intractable rather than impossible, but not easy rather than hard.

With this in mind, I have broken down the hardware requirements into tranches using the proportion of the incoming data stream that could be processed, sometimes for only a subset of high-level metric-groups.

Tranche A: Incipient Data Services Platform

 * 30 commodity servers for stream and job processing.
 * 10 high-performance servers for the query workload and some stream processing.
 * 5-10 low-performance servers required for cluster support.
 * 3-5 commodity servers for test and staging cluster.

We strongly recommend this hardware profile, as it enables the Analytics team to meet all of its goals.

This would provide the processing, storage, and query capacity for:
 * All current request streams, including all wiki, media/upload, and mobile traffic.
 * Future data from click tracking, A/B testing, mobile instrumentation, MediaWiki internal instrumentation, and extension instrumentation via a MediaWiki internal API.
 * Up to 30% traffic growth.

This hardware profile would be capable of providing all typical web traffic analytics fully joined with wiki userdata classifiers, as well as campaign tracking, "clickstream" or referrer chain analysis, and conversion funnels. The same level of detail would be available for edit activity, combined with a limited amount of graph-processing on the article content dataset.

In addition, this cluster would support a public Query API. Sufficient hardware begets the apotheosis of analytics: a true data services platform, capable of providing realtime insight into community activity and a new view of humanity's knowledge to power applications, mash up into websites, and stream to devices.

Tranche B: Non-Media Analytics

 * 10 commodity servers for stream and job processing.
 * 10 high-performance servers for the query workload and some stream processing.
 * 5-10 low-performance servers required for cluster support.
 * 3-5 commodity servers for test and staging cluster.

To reduce hardware demands, we must reduce both the volume of data and the complexity of our processing. This profile retains the ability to provide the space of analytics queries described in tranche A, but it elides the Upload/Media data stream to reduce volume by approximately 55%. It could still handle projected 2013 traffic growth and new instrumentation as described, though it would be unwise to attempt extensive graph processing on the content dataset. It would not be capable of servicing a public API.

Tranche C: Web Analytics

 * 10 high-performance servers for all query, stream, and job processing.
 * 5 low-performance servers required for cluster support.
 * 3-5 commodity servers for test and staging cluster.

This profile represents the minimum hardware allocation necessary to perform the analytics tasks identified as high priority for the Foundation. It would process only edit activity, and traffic from wiki, search, and mobile. Limited instrumentation could be serviced if campaigns are carefully scoped and monitored. Likewise, the full query space could still be explored, but ad-hoc queries would be executed serially; only cached, materialized data would be available for exploration in a dashboard. It would not be capable of servicing a public API.

Extrapolated Cluster Workload
Requests represent a decent heuristic for the cluster workload, as each request becomes a single event entering the cluster. Events are buffered, bundled, and compressed before being written to the cluster ETL queue. Bundling reduces the number of writes by the size of the buffer window (at least 10:1), and compression reduces the byte volume by about 3:1; each bundle is then replicated at least three times.
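As a rough illustration of how those ratios combine, the sketch below computes the daily write count and stored bytes for an event stream. Only the 10:1 bundling, 3:1 compression, and 3x replication figures come from the text above; the function name, the event count, and the average event size are hypothetical placeholders.

```python
def daily_cluster_footprint(events_per_day, avg_event_bytes,
                            bundle_size=10, compression_ratio=3.0, replicas=3):
    """Estimate writes and stored bytes per day for a bundled event stream.

    bundle_size, compression_ratio and replicas default to the ratios quoted
    in the text; events_per_day and avg_event_bytes are caller-supplied
    assumptions.
    """
    raw_bytes = events_per_day * avg_event_bytes
    # One write per bundle; ceiling division covers a final partial bundle.
    writes = -(-events_per_day // bundle_size)
    compressed_bytes = raw_bytes / compression_ratio
    stored_bytes = compressed_bytes * replicas
    return {
        "raw_bytes": raw_bytes,
        "writes": writes,
        "compressed_bytes": compressed_bytes,
        "stored_bytes": stored_bytes,
    }


# Hypothetical inputs: one million events of 300 bytes each.
footprint = daily_cluster_footprint(1_000_000, 300)
print(footprint)
```

Note that with exactly 3:1 compression and 3x replication the two factors cancel, so the stored volume roughly equals the raw event volume; the bundling ratio affects the number of writes, not the bytes stored.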

Extrapolating from these numbers, we can expect approximately 23.235 GB/day of storage consumed by the current full event stream (about 30 GB/day with 30% growth). The instrumentation stream, estimated from the above, would add a further 7-10 GB/day.

Using those estimates, the storage currently attached to the Cisco machines (16.2TB in 300GB SAS 10k drives) would be exhausted after approximately 11 months.

System Overview
"Analytics cluster" is something of a misnomer for sufficiently big data; most of the cleverness goes into machine coordination, and most of the engineering energy goes into fault tolerance. The tools we use to perform these tasks are not specialized for analytics in the way of most data warehousing servers of the past. That said, a well-built cluster is still tremendously complex (even if it is also a powerful computational platform). This section traces the oft-convoluted path of data and processing.

Data Import
There are two continuous vectors of data import:
 * Pixel Service endpoint, a REST API for submitting events.
 * Streaming Agents on web and cache servers.

Both import vectors converge on the Bundler Gateway, a cluster of stateless aggregators ("Bundlers") that shard and buffer incoming data before writing it as a page to the ETL queue.
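A minimal sketch of the shard-and-buffer behaviour described above, assuming a CRC32 hash for sharding, JSON plus zlib for page encoding, and an in-memory list standing in for the ETL queue. None of these choices are specified in this document; the `Bundler` class and its parameters are hypothetical.

```python
import json
import zlib
from collections import defaultdict


class Bundler:
    """Hypothetical aggregator: shards events, buffers them, writes pages."""

    def __init__(self, num_shards=4, window=10, sink=None):
        self.num_shards = num_shards
        self.window = window            # events per page (the buffer window)
        self.sink = sink if sink is not None else []  # stand-in for the ETL queue
        self.buffers = defaultdict(list)

    def shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        return zlib.crc32(key.encode("utf-8")) % self.num_shards

    def submit(self, key, event):
        shard = self.shard_for(key)
        self.buffers[shard].append(event)
        if len(self.buffers[shard]) >= self.window:
            self.flush(shard)

    def flush(self, shard):
        events = self.buffers.pop(shard, [])
        if not events:
            return
        # One compressed page per flush, written as a single unit.
        page = zlib.compress(json.dumps(events).encode("utf-8"))
        self.sink.append((shard, page))


bundler = Bundler(window=10)
for i in range(25):
    bundler.submit("enwiki", {"seq": i})
print(len(bundler.sink), "pages written so far")
```

A real gateway would also flush on a timer so that a quiet shard does not hold events indefinitely; that is omitted here for brevity.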

Stream and Job Processing Workers
All workers are homogeneous in their storage membership and replication duties, but are further partitioned for processing jobs into several virtual or logical clusters that are each optimized for different work profiles due to data locality.


 * ETL Cluster: The so-called "short-request" cluster, configured for a write-dominated (low-read) workload. This cluster is the target of the Bundler aggregations; its workload is near-continuous, and it aims for a realtime execution profile. The ETL topology will continuously process new blocks into canonicalized form and write them back to the cluster; the blocks have a TTL set for future deletion. Standing queries beyond normal ETL will be serviced by additional Storm topologies running against this cluster. This cluster needs lots of compute, but not as much RAM (nearly no repeated reads).
 * Analytics Cluster: Configured for a moderately read-heavy load (probably 65/35 read/write, but it depends on the jobs), as this cluster will be the target of MapReduce jobs, Pig and Hive jobs, Storm topologies for CEP, and Storm DRPC topologies to service graph-traversal queries. This cluster needs lots of RAM and CPU, as read speed will depend on the hot-set, and write speed will depend on CPU.

ETL Phase (Hard Realtime)
The Extract-Transform-Load step parses events into a compact binary format, performs IP lookup for geo and/or mobile carrier along with subsequent anonymization, and canonicalizes verbose data (like User Agents) before writing the result back to the database. Additional high-priority, hard realtime standing queries (such as event-driven notifications or triggers) will run as independent-but-linked Storm topologies, triggered after the canonical ETL phase.
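The steps above can be sketched as a single function. This is only an illustration under stated assumptions: the geo table, the User Agent family table, the 16-byte record layout, and the salted-hash scheme are all hypothetical, since this document does not specify the lookup source, the binary format, or the anonymization method. A truncated salted hash is closer to pseudonymization than to strong anonymization.

```python
import hashlib
import ipaddress
import struct

# Hypothetical lookup tables (documentation IP ranges, made-up country codes).
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "AA"),
    (ipaddress.ip_network("198.51.100.0/24"), "BB"),
]
UA_FAMILIES = [("Firefox", 1), ("Chrome", 2), ("Safari", 3)]  # order matters

# Assumed layout: uint32 timestamp, 2-byte country, uint16 UA id, 8-byte IP hash.
RECORD = struct.Struct("!I2sH8s")


def lookup_country(ip):
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE:
        if addr in network:
            return country
    return "ZZ"  # unknown


def canonical_ua(user_agent):
    for token, family_id in UA_FAMILIES:
        if token in user_agent:
            return family_id
    return 0  # unrecognized


def etl(event, salt):
    """Geo lookup first, then drop the raw IP in favour of a salted hash."""
    country = lookup_country(event["ip"])
    ip_hash = hashlib.sha256(salt + event["ip"].encode("ascii")).digest()[:8]
    return RECORD.pack(event["ts"], country.encode("ascii"),
                       canonical_ua(event["ua"]), ip_hash)


record = etl({"ts": 1330560000, "ip": "192.0.2.7",
              "ua": "Mozilla/5.0 (X11; Linux) Gecko/20100101 Firefox/10.0"},
             salt=b"rotate-me")
print(len(record), RECORD.unpack(record)[:3])
```

The ordering matters: the geo lookup needs the raw address, so anonymization has to come after it, as the text describes.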

The data stream switches from push to pull at this point, like a waterfall terminating in a lake. Subsequent processing steps are batched by their staleness constraints.

Short-Batch Phase (Soft Realtime)
Jobs that desire an almost-realtime view (such as the index on edits, or events by geo/mobile) will run frequently to batch-process only new data. They are expected to be able to do their work on a subset of data, and to be able to terminate quickly. These jobs can be implemented on top of either Storm or Hadoop.
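One way to picture "batch-process only new data" is a job that remembers a high-water mark and skips everything at or below it. The sketch below assumes each event carries a monotonically increasing `seq` field and a `country` field; both names, and the `ShortBatchJob` class, are hypothetical and not taken from Storm or Hadoop.

```python
from collections import Counter


class ShortBatchJob:
    """Hypothetical incremental job: counts events by country, new data only."""

    def __init__(self):
        self.high_water_mark = -1   # highest sequence number already processed
        self.counts = Counter()

    def run(self, events):
        # Touch only the events that arrived since the previous run.
        new = [e for e in events if e["seq"] > self.high_water_mark]
        for event in new:
            self.counts[event["country"]] += 1
        if new:
            self.high_water_mark = max(e["seq"] for e in new)
        return len(new)


job = ShortBatchJob()
log = [{"seq": 0, "country": "AA"}, {"seq": 1, "country": "BB"}]
job.run(log)
log.append({"seq": 2, "country": "AA"})
job.run(log)          # processes only seq 2
print(dict(job.counts))
```

In a real deployment the scan would be pushed down to the storage layer rather than filtering an in-memory list, so that each run reads only the new subset.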

Batch
Jobs that require a full view of the data are relegated to MapReduce, and will use the normal Hadoop scheduler. Early on, this will likely include a variety of batch imports, such as MySQL exports, an XML article dump, fundraising data, community datasets (from, for example, Toolserver or stats.grok.se), historical logs, or third-party data (like WikiTrust or academic datasets).

Queries
Query Gateways will receive queries both internally and externally, handle API keys, perform usage-tracking and enforce rate-limiting. The internal query gateway can execute queries utilizing a custom Storm topology as DRPC for short-running scans or point-queries in addition to executing Hive/Pig queries and notifying the user on completion. External API queries will use Storm DRPC and be restricted in runtime and resource use.
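The usage-tracking and rate-limiting duties could be sketched with a per-key token bucket. This is an assumption about one reasonable mechanism, not a description of the planned gateway; the class names, the `rate` and `capacity` parameters, and the injectable clock are all hypothetical.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class QueryGateway:
    """Hypothetical gateway front-end: one bucket and one counter per API key."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}
        self.usage = {}

    def admit(self, api_key):
        if api_key not in self.buckets:
            self.buckets[api_key] = TokenBucket(self.rate, self.capacity, self.clock)
        allowed = self.buckets[api_key].allow()
        if allowed:
            self.usage[api_key] = self.usage.get(api_key, 0) + 1
        return allowed


gateway = QueryGateway(rate=1.0, capacity=2)
print([gateway.admit("demo-key") for _ in range(3)])
```

A bucket permits short bursts up to `capacity` while holding the long-run average to `rate`, which suits point-queries better than a fixed per-minute counter. Restricting runtime and resource use for external queries would need separate enforcement inside the query engine.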

Support Servers
Distributed computing clusters are aptly thought of as an ecosystem, and thus a number of exotic creatures occupy niche-but-necessary support roles to maintain the health of the whole:
 * 3-5 ZooKeeper nodes (used by Hadoop and Storm to manage cluster membership)
 * Hadoop Job Tracker
 * Storm Job Tracker (Nimbus)
 * JMX Monitoring Server (probably Zabbix; should be switch-local as it is very chatty)

Secondary Testing Cluster
A small (3-5 node) secondary cluster would be physically and logically distinct from the production cluster, and would be loaded with a subset of the data to reduce job times.

It would be aimed at providing:
 * Pre-deploy integration testing
 * Job testing and debugging
 * A platform for staging and utility jobs

Server Hardware Guidelines
Generally speaking, we prefer Quantity over Quality: a larger cluster of typical commodity machines is better than a smaller cluster of high-performance machines. That said, there are a few places where high-performance servers are necessary (the NameNodes, the stream processors) or will greatly reduce job execution time (query servers).

Compute
12+ cores, 16+ preferred.


 * It’s best to have more cores than spindles, or IO throughput suffers.
 * Cassandra and HBase are highly parallelized and often see CPU bottlenecks before memory bottlenecks.
 * Everything is compressed, often even in memory. While snappy is fairly compute-friendly, bzip2 is not.
 * Processing, indexing, and querying operations all make heavy use of hash-driven and probabilistic data-structures (bitmap indices, cardinality estimators).
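To give a sense of why such structures are CPU-bound, here is a minimal sketch of one cardinality estimator, linear counting: hash each item to a bit in a fixed bitmap, then estimate the number of distinct items from the fraction of bits still zero. It is chosen only for brevity; the document does not say which estimator would be used, and the class name and bitmap size are hypothetical.

```python
import hashlib
import math


class LinearCounter:
    """Linear-counting estimator: n is approximately -m * ln(zero_bits / m)."""

    def __init__(self, num_bits=1 << 16):
        if num_bits % 8 != 0:
            raise ValueError("num_bits must be a multiple of 8")
        self.num_bits = num_bits
        self.bitmap = bytearray(num_bits // 8)

    def add(self, item):
        # Every insert costs one hash: this is where the CPU time goes.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        index = int.from_bytes(digest, "big") % self.num_bits
        self.bitmap[index // 8] |= 1 << (index % 8)

    def estimate(self):
        set_bits = sum(bin(byte).count("1") for byte in self.bitmap)
        zero_bits = self.num_bits - set_bits
        if zero_bits == 0:
            raise ValueError("bitmap saturated; use a larger num_bits")
        return -self.num_bits * math.log(zero_bits / self.num_bits)


counter = LinearCounter()
for i in range(5000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")   # duplicates do not change the bitmap
print(round(counter.estimate()))
```

The bitmap here occupies 8 KB regardless of how many items are added, trading a small estimation error for fixed memory.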

Memory
32GB+ RAM, 48GB+ preferred.


 * The HDFS NameNode is a memory-consuming monster.
 * Stream processing benefits greatly from increased memory: more memory permits larger batches, and larger batches lower the per-event processing overhead, which determines throughput.
 * Batch processing / MapReduce doesn’t benefit as much from high RAM, as there are few repeated reads for caching.
 * Query processing benefits greatly from increased RAM, as it would love to store everything in memory.

Storage
8-10x 2TB disks, with total cluster storage above 160TB (0.5PB+ preferred).


 * Data is replicated 3x across the cluster.
 * Data size on a given node may double during a major compaction.
 * Intermediate “shuffle” files used for MapReduce are often up to 25% of data size.
 * Data is denormalized and indexed multiple times to provide reasonable query speeds on large datasets.
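The multipliers above compound quickly. The sketch below turns raw disk capacity into a pessimistic estimate of logical (pre-replication) data capacity. It assumes that every replica may need room to double during compaction, and that shuffle files are local and unreplicated; both are modeling assumptions of mine rather than statements from this document. The extra space consumed by denormalized and indexed copies is not modeled.

```python
def logical_capacity_tb(raw_tb, replication=3, compaction_factor=2.0,
                        shuffle_fraction=0.25):
    """Worst-case logical data (TB) that fits in raw_tb of cluster disk.

    Each logical TB is assumed to need:
      replication * compaction_factor   TB for replicas plus compaction headroom
      + shuffle_fraction                TB for unreplicated MapReduce shuffle files
    """
    raw_per_logical_tb = replication * compaction_factor + shuffle_fraction
    return raw_tb / raw_per_logical_tb


for raw in (160, 500):
    print(raw, "TB raw ->", round(logical_capacity_tb(raw), 1), "TB logical")
```

Under these assumptions each logical terabyte costs 6.25 TB of raw disk, which is why the raw figures quoted above are so much larger than the expected daily data volume would suggest.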

Network
To quote Cloudera, talking about HBase:

"A typical configuration is to organize the nodes into racks with a 1GE Top Of Rack (TOR) switch. The racks are typically interconnected by one or more low-latency high-throughput dedicated Layer-2 10GE core switches. Many customers are happy with ~40 node clusters that can fit onto one rack with a typical 48-port switch. Even if all of your nodes can fit into one rack but you plan to scale beyond one rack, Cloudera recommends to go with at least two racks from the start to enforce proper practices and network topology scripting."

I would leave the rack configuration up to ops; I expect we will grow above 40 nodes, but not within FY2012-2013.

Additionally, as we plan to do substantial stream processing due to the need to anonymize all request data, these considerations apply to us:

"When we encounter applications that produce large amounts of intermediate data–on the order of the same amount as is read in–we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Alternatively for customers who have already moved to 10 Gigabit Ethernet or Infiniband, these solutions can be used to address network bound workloads."

Relevant Feature Requests

 * Collect enwiki clickstream data (we could use it to automatically fix links to disambiguation pages and more)
 * Make image views statistics available through wikistats
 * stats.wikimedia.org needs an API
 * Provide statistics about how many books/PDFs have been created/ordered through the Collection extension
 * Track number of third parties making requests for the raw text of our spam blacklist