Analytics/Kraken

Kraken infrastructure documentation has been moved to Wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Kraken

Kraken is the code name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.

Rationale
The Wikimedia movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. Analytics permeates nearly all our goals, yet our current capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in an easily accessible way; our efforts are distributed among different departments; our data is fragmented across different systems and databases; and our tools are ad hoc.

Rather than merely improving existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all data streams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

Documentation



 * Getting Access
 * Overview
 * Software Overview
 * Data
 * Product Codes
 * Data Streams
 * Data Formats
 * Pixel Service Endpoint
 * Request Logging
 * Hardware Planning
 * Notes
 * Kraken Benchmarks
 * Hadoop Setup Notes
 * Test Cluster Setup Notes
 * Meeting Notes
 * Security Review Meeting
 * Architecture Review Meeting

Request Logging
A public data import endpoint and a system for capturing the incoming firehose from our front-end servers.


 * Request Logging
 * System Recommendation
 * Distributed Logging Systems Research
 * Feature Comparison Spreadsheet
 * Pixel Service Endpoint

Tasks


 * Work with Ori, Patrick, and ops to figure out the Kafka producer situation for production [otto, dsc, ori, preilly, ops]
 * MediaWiki EventLogging integration

Data Tools
Data processing library and toolkit provided and maintained by the Analytics Team, especially for use with Kraken.


 * Tips and Notes on Hadoop Tools
 * Oozie Tutorial
 * Data Formats

Infrastructure
Bucket for general cluster infrastructure and maintenance tasks.
 * Kraken Puppetization
 * Ports
 * Hardware Planning
 * JMX Monitoring Research

Meeting Notes


 * Security Review Meeting
 * Architecture Review Meeting

Tasks


 * Fix the one busted Cisco machine (an07) [otto]
 * Data owner services: export, dashboard, visualizations (on Hue?) [dsc, dan]
 * Hue plugin for Limn integration? [dsc]
 * Get analysts experimenting with Hive/Pig [DvL]