Analytics/Kraken/Data Formats

The storage and transport formats for data are an implementation detail that data consumers should have no reason to know or care about. That said, the content of records and the ease of accessing that data are of great interest to everyone. This page outlines the technologies involved (for the benefit of project contributors) as well as the structure of the data.

So! If you're an analyst, an engineer interested in instrumenting your application, or any other data consumer, we need your feedback on these structures! Head to the talk page and have your say.

= Incoming Log Formats =

Kraken consumes two main data streams, each of which is generated by front-end web cache servers. Web Requests come from many different cache sources (squid, varnish, nginx) and contain the full stream of web request access logs. Event Data is generated by specific products and is useful for logging custom data not available in plain web access logs.

== Web Request Format ==

 * 1) Server Hostname
 * 2) Sequence number
 * 3) Timestamp (ISO 8601 format with milliseconds, according to the server's clock)
 * 4) Request service time in ms
 * 5) Client IP
 * 6) HTTP request status code
 * 7) Reply size including HTTP headers
 * 8) Request method (GET/POST etc)
 * 9) URL
 * 10) Squid hierarchy status, peer IP
 * 11) Content Type
 * 12) Referer
 * 13) X-Forwarded-For
 * 14) User-Agent
 * 15) Accept-Language
 * 16) X-CS (generic analytics header, with serialized key/value pairs)

== Event Data Format ==
Event data is generated by Varnish running on the bits servers. Each has a varnishncsa instance logging the following information.


 * 1) Request path
 * 2) Query params
 * 3) HTTP host (aka request hostname)
 * 4) Timestamp (ISO 8601 format with milliseconds, according to the server's clock)
 * 5) Client IP  (aka remote address/host)
 * 6) X-Forwarded-For
 * 7) Referer
 * 8) Accept-Language
 * 9) Cookie
 * 10) X-WAP-Profile
 * 11) User-Agent
 * 12) Server-Hostname
 * 13) Sequence Number (generated by varnishncsa instance).

The corresponding varnishncsa log format string is:

%U %q	%{Host}i	%t	%h	%{X-Forwarded-For}i	%{Referer}i	%{Accept-Language}i	%{Cookie}i	%{X-WAP-Profile}i	%{User-agent}i	%l	%n

(There are literal tabs in this string; varnishncsa does not translate "\t".) Note that fields such as User-Agent are URL-encoded, whereas the query params are not.
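As a rough illustration of consuming this format, the sketch below splits one tab-separated line into the 13 fields listed above. The field names, the set of fields treated as URL-encoded, and the handling of the path/query column are all assumptions made for this example; they are not taken from any Kraken code.

```python
from urllib.parse import unquote

# Illustrative names for the 13 fields, in the order given by the format string.
EVENT_FIELDS = [
    "path", "query", "host", "timestamp", "client_ip", "x_forwarded_for",
    "referer", "accept_language", "cookie", "x_wap_profile", "user_agent",
    "server_hostname", "sequence_number",
]

# Assumption: the header-derived fields are the URL-encoded ones; the query is not.
URL_ENCODED = {
    "x_forwarded_for", "referer", "accept_language",
    "cookie", "x_wap_profile", "user_agent",
}


def parse_event_line(line):
    """Split one log line into a dict keyed by EVENT_FIELDS."""
    columns = line.rstrip("\n").split("\t")
    # Assumption: if path and query share the first column, the query
    # starts at the first "?" and may be preceded by whitespace.
    if len(columns) == len(EVENT_FIELDS) - 1:
        path, _, query = columns[0].partition("?")
        columns = [path.strip(), query] + columns[1:]
    if len(columns) != len(EVENT_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(EVENT_FIELDS), len(columns)))
    record = dict(zip(EVENT_FIELDS, columns))
    for name in URL_ENCODED:
        record[name] = unquote(record[name])
    return record
```

The query string is left undecoded here, matching the note that query params are not URL-encoded by the logger.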

The request path is placed at the beginning of this format so that consumers can easily select the messages relevant to them by string prefix. The request path is expected to contain a product_code, i.e. /event/.
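Because the path leads each line, a consumer can select its own messages without parsing the rest of the record. A minimal sketch, where the prefix `/event/demo` is a hypothetical value chosen only for illustration:

```python
def filter_by_prefix(lines, prefix):
    """Yield only the log lines whose leading request path starts with prefix."""
    for line in lines:
        if line.startswith(prefix):
            yield line


# Hypothetical usage: keep only messages for one product.
sample = [
    "/event/demo ?a=1\thost-a",
    "/event/other ?b=2\thost-b",
]
mine = list(filter_by_prefix(sample, "/event/demo"))
```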

= Serialization =

== Implementation ==
So far, the Analytics team proposes two data schemas: one to describe general web request traffic, and another for event data. Presently, the plan is to use Apache Avro as our data definition and serialization library. (For comparison, see Google's Protocol Buffers and Apache Thrift.)

Avro offers many benefits:
 * 1) Avro is natively supported by the entire Hadoop stack and has bindings for most major languages.
 * 2) Avro schemas are written in JSON, using a simple DDL.
 * 3) Avro supports highly efficient binary serialization and block-level compression of data files (deflate or snappy).
 * 4) Avro datafiles contain the schema to describe the data, so there is never ambiguity regarding the types of data.
 * 5) Avro supports schema evolution: adding a field (with a default value) or renaming a field (via aliases) is possible without breaking backwards compatibility.
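To make the schema evolution point concrete, here is a plain-Python sketch of the idea behind it: a reader schema supplies defaults for fields that older records lack, and aliases for fields that were renamed. This only mimics the concept and is not Avro's implementation; the `resolve` helper and the sample field names are hypothetical.

```python
_MISSING = object()


def resolve(record, reader_schema):
    """Project an older record onto a newer reader schema.

    A field absent from the record takes the schema's default; a renamed
    field is found through the names listed under "aliases".
    """
    resolved = {}
    for field in reader_schema["fields"]:
        names = [field["name"]] + field.get("aliases", [])
        value = next((record[n] for n in names if n in record), _MISSING)
        if value is _MISSING:
            if "default" not in field:
                raise KeyError("no value or default for " + field["name"])
            value = field["default"]
        resolved[field["name"]] = value
    return resolved


# Hypothetical reader schema: "locale" was added later, and "referer"
# used to be written under the name "referrer".
reader_schema = {
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "locale", "type": ["null", "string"], "default": None},
        {"name": "referer", "type": ["null", "string"], "default": None,
         "aliases": ["referrer"]},
    ]
}
old_record = {"url": "http://example.org/", "referrer": "http://example.com/"}
new_view = resolve(old_record, reader_schema)
```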

== Web Request Schema ==
Schema for the storage of normal web request traffic, such as views of a wiki page. (The Avro schema specification has more information on the details of the types and formats you see here.) JSON source: https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/WebRequest.avro.json

Notes on the fields below are forthcoming.

{
  "name": "WebRequest",
  "namespace": "org.wikimedia.analytics.kraken",
  "doc": "Represents a web request.",
  "type": "record",
  "fields": [
    { "name": "timestamp", "type": "long", "doc": "micros since the epoch" },
    { "name": "product_code", "type": "string", "default": "web", "doc": "Product that generated the data for this request" },
    { "name": "ip", "type": ["int", "string"], "order": "ignore", "doc": "int == IPv4; string == IPv6 or hash" },
    { "name": "uid", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "User UUIDv4" },
    { "name": "url", "type": "string" },
    { "name": "referer", "type": ["null", "string"], "default": null, "order": "ignore" },
    { "name": "method", "type": { "type": "enum", "name": "HTTPRequestMethod", "symbols": ["GET", "POST", "PUT", "DELETE", "UPDATE"] }, "order": "ignore" },
    { "name": "ua", "type": "string", "order": "ignore" },
    { "name": "ua_flags", "type": "int", "default": 0, "order": "ignore", "doc": "Bitfield of UA components; 0 when empty" },
    { "name": "carrier", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Mobile carrier for Zero project; from X-CARRIER header" },
    { "name": "locale", "type": ["null", "string"], "default": null, "doc": "Accept-language header as supplied by browser" },
    { "name": "response_server", "type": "string", "order": "ignore" },
    { "name": "response_status", "type": "int", "order": "ignore" },
    { "name": "response_time", "type": "long", "order": "ignore" },
    { "name": "response_size", "type": "long", "order": "ignore" },
    { "name": "response_mime", "type": "string" },
    { "name": "metadata", "type": { "type": "map", "values": "string" }, "order": "ignore" },
    { "name": "tags", "type": { "type": "array", "items": "string" }, "order": "ignore" }
  ]
}
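The schema stores `timestamp` as microseconds since the epoch and `ip` as either an int (IPv4) or a string (IPv6 or hash), while the incoming logs carry ISO 8601 timestamps and textual addresses. The sketch below shows one possible conversion. It assumes timestamps without an offset are UTC, and that an IPv4 address is stored as the signed 32-bit reading of its four bytes (Avro's `int` is a signed 32-bit type); neither choice is stated in the schema, and the helper names are hypothetical.

```python
import ipaddress
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(iso_timestamp):
    """Convert an ISO 8601 timestamp to microseconds since the epoch."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(microseconds=1)


def encode_ip(text):
    """Return an int for IPv4 addresses and a string for IPv6 addresses."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # Assumption: reinterpret the 4 address bytes as a signed 32-bit
        # integer so that the value fits an Avro "int".
        return struct.unpack("!i", addr.packed)[0]
    return str(addr)
```

Addresses at or above 128.0.0.0 come out negative under this encoding, so a reader would need to apply the inverse reinterpretation to recover the dotted form.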

== Event Data Schema ==
Schema for the storage of event data, logged via the Pixel Service endpoint. (The Avro schema specification has more information on the details of the types and formats you see here.) JSON source: https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/EventData.avro.json

Notes on the fields below are forthcoming.

{
  "name": "EventData",
  "namespace": "org.wikimedia.analytics.kraken",
  "doc": "Represents a logged event.",
  "type": "record",
  "fields": [
    { "name": "timestamp", "type": "long", "doc": "Microseconds since the epoch." },
    { "name": "product_code", "type": "string", "doc": "Product that generated the data for this request." },
    { "name": "uid", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "User UUIDv4" },
    { "name": "visit_id", "type": "string", "order": "ignore", "doc": "Visit/session identifier, representing a continuous (without significant idle time) set of pageloads by a user." },
    { "name": "pageload_id", "type": "string", "order": "ignore", "doc": "Pageload identifier, representing a particular pageload." },
    { "name": "event", "type": "string", "doc": "Event name." },
    { "name": "data", "type": { "type": "map", "values": "string" }, "order": "ignore", "doc": "Data payload of the event." },
    { "name": "ip", "type": ["int", "string"], "order": "ignore", "doc": "int == IPv4; string == IPv6 or hash" },
    { "name": "url", "type": "string", "doc": "URL of the page that generated the event. (Appears as the referer in the actual HTTP request.)" },
    { "name": "referer", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Referer of the page that generated the event. (Sent in the data payload as the key `ref`.)" },
    { "name": "ua", "type": "string", "order": "ignore" },
    { "name": "ua_flags", "type": "int", "default": 0, "order": "ignore", "doc": "Bitfield of UA components; 0 when empty" },
    { "name": "carrier", "type": ["null", "string"], "default": null, "order": "ignore", "doc": "Mobile carrier for Zero project; from X-CARRIER header" },
    { "name": "metadata", "type": { "type": "map", "values": "string" }, "order": "ignore", "doc": "Additional metadata annotations." },
    { "name": "tags", "type": { "type": "array", "items": "string" }, "order": "ignore", "doc": "Tags that identify the request as a particular type." }
  ]
}
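The `url` and `referer` doc strings describe an indirection that is easy to get backwards: the page URL arrives as the HTTP Referer header of the event request, and the page's own referer arrives in the payload under the key `ref`. A minimal sketch of that mapping follows; the helper name and the decision to leave `ref` in the `data` map are assumptions made for this example.

```python
from urllib.parse import parse_qsl


def event_url_fields(http_referer, query):
    """Derive the url, referer and data fields for one event request.

    http_referer: the (decoded) Referer header of the event request.
    query: the raw query string of the event request.
    """
    data = dict(parse_qsl(query.lstrip("?"), keep_blank_values=True))
    return {
        "url": http_referer,          # page that generated the event
        "referer": data.get("ref"),   # None when the payload has no `ref`
        "data": data,                 # assumption: `ref` stays in the payload
    }


fields = event_url_fields(
    "http://en.wikipedia.org/wiki/Kraken",
    "?ref=http%3A%2F%2Fexample.org%2F&action=click",
)
```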