Analytics/Kraken/Data Formats

The storage and transport formats for data are an implementation detail that data consumers should have no reason to know or care about. That said, the content of records and the ease of accessing that data are of great interest to everyone. This page outlines the technologies involved, for the benefit of project contributors, as well as the structure of the data.

So! If you're an analyst, an engineer interested in instrumenting your application, or any other data consumer, we need your feedback on these structures! Head to the talk page and have your say.

= Implementation =

So far, the Analytics team proposes two data schemas: one to describe general web request traffic, and another for event data. Presently, the plan is to use Apache Avro as our data definition and serialization library. (For comparison, see Google's Protocol Buffers and Apache Thrift.)

Avro offers many benefits:
 * 1) Avro is natively supported by the entire Hadoop stack, and bindings exist for most major languages.
 * 2) Avro schemas are written in JSON, using a simple DDL.
 * 3) Avro supports highly efficient binary serialization and block-level compression of data files (deflate or snappy).
 * 4) Avro datafiles contain the schema to describe the data, so there is never ambiguity regarding the types of data.
 * 5) Avro supports schema evolution: adding a field (with a default value) or renaming a field (via an alias) is possible without breaking backwards compatibility.
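
To make the schema evolution point concrete, here is a toy sketch of the two resolution rules it relies on: a field missing from old data takes the reader's default, and a renamed field is found through an alias. It uses plain Python dicts rather than the Avro library, and the `resolve` helper, the `referrer` alias, and the sample record are all hypothetical, chosen only for illustration.

```python
# Toy illustration of Avro-style schema resolution; this is NOT the Avro library.
_MISSING = object()


def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        # Accept the field's current name or any alias it was written under.
        names = [field["name"], *field.get("aliases", [])]
        value = next((record[n] for n in names if n in record), _MISSING)
        if value is _MISSING:
            # A field absent from the old data needs a default in the reader schema.
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']!r}")
            value = field["default"]
        resolved[field["name"]] = value
    return resolved


# Hypothetical reader schema: "referrer" was renamed to "referer",
# and "carrier" was added later with a default.
reader_fields = [
    {"name": "url", "type": "string"},
    {"name": "referer", "type": ["null", "string"], "default": None,
     "aliases": ["referrer"]},
    {"name": "carrier", "type": ["null", "string"], "default": None},
]

old_record = {"url": "https://example.org/wiki/Foo", "referrer": "https://example.org/"}
new_record = resolve(old_record, reader_fields)
```

In this sketch `new_record` carries the old `referrer` value under `referer` and a `carrier` of `None`, so data written before the change remains readable. Removing a field without a default from the reader schema, by contrast, would raise here.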

= Schemas =

Web Request Schema
Schema for the storage of normal web request traffic, such as views of a wiki page. (The Avro schema specification has more information on the details of the types and formats you see here.) JSON source: https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/WebRequest.avro.json

Notes on the fields below are forthcoming.

{
    "name": "WebRequest",
    "namespace": "org.wikimedia.analytics.kraken",
    "doc": "Represents a web request.",
    "type": "record",
    "fields": [
        { "name": "timestamp", "type": "long", "doc": "micros since the epoch" },
        { "name": "product_code", "type": "string", "default": "web", "doc": "Product that generated the data for this request" },
        { "name": "ip", "type": ["int", "string"], "order": "ignore", "doc": "int == IPv4; string == IPv6 or hash" },
        { "name": "uid", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "User UUIDv4" },
        { "name": "url", "type": "string" },
        { "name": "referer", "type": ["null", "string"], "default": null, "order": "ignore" },
        { "name": "method", "type": { "type": "enum", "name": "HTTPRequestMethod", "symbols": ["GET", "POST", "PUT", "DELETE", "UPDATE"] }, "order": "ignore" },
        { "name": "ua", "type": "string", "order": "ignore" },
        { "name": "ua_flags", "type": "int", "default": 0, "order": "ignore", "doc": "Bitfield of UA components; 0 when empty" },
        { "name": "carrier", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Mobile carrier for Zero project; from X-CARRIER header" },
        { "name": "locale", "type": ["null", "string"], "default": null, "doc": "Accept-language header as supplied by browser" },
        { "name": "response_server", "type": "string", "order": "ignore" },
        { "name": "response_status", "type": "int", "order": "ignore" },
        { "name": "response_time", "type": "long", "order": "ignore" },
        { "name": "response_size", "type": "long", "order": "ignore" },
        { "name": "response_mime", "type": "string" },
        { "name": "metadata", "type": { "type": "map", "values": "string" }, "order": "ignore" },
        { "name": "tags", "type": { "type": "array", "items": "string" }, "order": "ignore" }
    ]
}
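
The `timestamp` and `ip` fields carry encoding conventions that are easy to get wrong, so a minimal sketch of one way to produce them may help. It assumes IPv4 addresses are stored as their 32-bit value; because Avro's `int` is a signed 32-bit type, addresses at or above 128.0.0.0 are folded into the negative range here. That folding, and the helper names `encode_ip`, `decode_ipv4`, and `now_micros`, are assumptions for illustration and are not taken from the Kraken source.

```python
import ipaddress
import time


def encode_ip(address):
    """Return an int for IPv4 and a string for IPv6, matching the `ip` union."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        value = int(ip)
        # Avro `int` is signed 32-bit, so fold values >= 2**31 into the
        # negative range (assumed convention, not confirmed by the schema).
        return value - (1 << 32) if value >= (1 << 31) else value
    return ip.compressed


def decode_ipv4(value):
    """Invert encode_ip for the IPv4 branch of the union."""
    return str(ipaddress.IPv4Address(value & 0xFFFFFFFF))


def now_micros():
    """Current time as microseconds since the Unix epoch."""
    return time.time_ns() // 1000


# A few of the WebRequest fields, filled in with illustrative values.
partial_record = {
    "timestamp": now_micros(),
    "product_code": "web",
    "ip": encode_ip("192.168.0.1"),   # -1062731775
    "url": "https://en.wikipedia.org/wiki/Kraken",
    "method": "GET",
}
```

Decoding masks the stored value back to an unsigned 32-bit integer, so `decode_ipv4(encode_ip("192.168.0.1"))` returns the original dotted address. IPv6 addresses (or hashed values) pass through as strings and take the `string` branch of the union.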

Event Data Schema
Schema for the storage of event data, logged via the Pixel Service endpoint. (The Avro schema specification has more information on the details of the types and formats you see here.) JSON source: https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/EventData.avro.json

Notes on the fields below are forthcoming.

{
    "name": "EventData",
    "namespace": "org.wikimedia.analytics.kraken",
    "doc": "Represents a logged event.",
    "type": "record",
    "fields": [
        { "name": "timestamp", "type": "long", "doc": "Microseconds since the epoch." },
        { "name": "product_code", "type": "string", "doc": "Product that generated the data for this request." },
        { "name": "uid", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "User UUIDv4" },
        { "name": "visit_id", "type": "string", "order": "ignore", "doc": "Visit/session identifier, representing a continuous (without significant idle time) set of pageloads by a user." },
        { "name": "pageload_id", "type": "string", "order": "ignore", "doc": "Pageload identifier, representing a particular pageload." },
        { "name": "event", "type": "string", "doc": "Event name." },
        { "name": "data", "type": { "type": "map", "values": "string" }, "order": "ignore", "doc": "Data payload of the event." },
        { "name": "ip", "type": ["int", "string"], "order": "ignore", "doc": "int == IPv4; string == IPv6 or hash" },
        { "name": "url", "type": "string", "doc": "URL of the page that generated the event. (Appears as the referer in the actual HTTP request.)" },
        { "name": "referer", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Referer of the page that generated the event. (Sent in the data payload as the key `ref`.)" },
        { "name": "ua", "type": "string", "order": "ignore" },
        { "name": "ua_flags", "type": "int", "default": 0, "order": "ignore", "doc": "Bitfield of UA components; 0 when empty" },
        { "name": "carrier", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Mobile carrier for Zero project; from X-CARRIER header" },
        { "name": "metadata", "type": { "type": "map", "values": "string" }, "order": "ignore", "doc": "Additional metadata annotations." },
        { "name": "tags", "type": { "type": "array", "items": "string" }, "order": "ignore", "doc": "Tags that identify the request as a particular type." }
    ]
}
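
The `url` and `referer` fields of an event are easy to confuse: per the field docs, `url` comes from the HTTP `Referer` header of the pixel request, while `referer` comes from the `ref` key in the payload. The sketch below shows that mapping. It assumes the payload travels as query-string parameters on the pixel request; that transport, the `event_fields` helper, and the example endpoint URL are hypothetical and not taken from the Pixel Service itself.

```python
from urllib.parse import parse_qsl, urlsplit


def event_fields(request_url, headers):
    """Derive the `url`, `referer`, and `data` fields from a pixel request."""
    # Assumption: the event payload is carried in the query string.
    data = dict(parse_qsl(urlsplit(request_url).query))
    return {
        # The page that fired the event shows up as the HTTP Referer header.
        # Falling back to "" when the header is absent is an assumption.
        "url": headers.get("Referer", ""),
        # The page's own referer is sent inside the payload under `ref`.
        "referer": data.get("ref"),
        "data": data,
    }


fields = event_fields(
    "https://pixel.example.org/event.gif?ref=https%3A%2F%2Fexample.org%2F&action=click",
    {"Referer": "https://en.wikipedia.org/wiki/Kraken"},
)
```

Here `fields["url"]` is the wiki page that generated the event and `fields["referer"]` is the decoded `ref` value. Whether `ref` should also remain in `data` is left open; this sketch keeps it.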