Analytics/Archive/Pixel Service

'''This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics'''

The Pixel service is the "front door" to the analytics system: a public endpoint with a simple interface for getting data into the datastore.

Components

 * Request Endpoint : HTTP server that handles  requests to the PIXEL service endpoint, responding with   or an actual, honest-to-god 1x1 transparent gif. Data is submitted into the cluster via query parameters.
 * Messaging System : routes messages (request information + data content) from the request endpoint to the datastore. This component is intended to be implemented by Apache Kafka.
 * Datastore Consumer : consumes messages, shunting them into the datastore utilizing HDFS staging and/or append.
 * Processing Toolkit : a standard template for a Pig job to process (count, aggregate, etc.) event data query string params, handling standard indirection for referrer and timestamp, Apache Avro de/serialization, and providing tools for conversion funnel and A/B testing analysis.
 * Event Logging Library : a JS library with an easy interface to abstract the sending of data to the service. Handles event data conventions for proxied timestamp and referrer, plus the normal web request components.
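
As a rough illustration of the Request Endpoint component, here is a minimal Python sketch. It assumes the endpoint answers GET requests, always replies with the 1x1 transparent GIF, and hands the parsed query string to a stand-in for the messaging system. The names build_message, publish, OUTBOX, PixelHandler and serve, and the message fields, are hypothetical and not part of the actual service.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qsl, urlsplit

# A 1x1 transparent GIF (42 bytes), spelled out field by field.
TRANSPARENT_GIF = bytes.fromhex(
    "474946383961"          # "GIF89a" signature
    "01000100"              # logical screen: 1 px wide, 1 px high
    "800000"                # global color table present, 2 entries
    "000000ffffff"          # color table: black, white
    "21f9040100000000"      # graphic control: color index 0 is transparent
    "2c000000000100010000"  # image descriptor: 1x1 image at (0, 0)
    "02014400"              # LZW-compressed pixel data
    "3b"                    # trailer
)

# Hypothetical stand-in for the messaging system (Kafka in the design above).
OUTBOX = []


def publish(message):
    """Pretend to route a message onward; here it is only kept in memory."""
    OUTBOX.append(message)


def build_message(request_path, client_ip, received_at):
    """Combine request information with the data sent as query parameters."""
    query = urlsplit(request_path).query
    # If a key is repeated, dict() keeps the last value.
    params = dict(parse_qsl(query, keep_blank_values=True))
    return {"received_at": received_at, "client_ip": client_ip, "params": params}


class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):  # assumption: data arrives via GET query parameters
        publish(build_message(self.path, self.client_address[0], time.time()))
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(TRANSPARENT_GIF)))
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(TRANSPARENT_GIF)


def serve(port=8080):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("", port), PixelHandler).serve_forever()
```

Calling serve() and requesting a URL such as /event.gif?product=demo&action=click (a made-up path and parameters) would append one message to OUTBOX and return the GIF.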

Service prototype
To get up and running right away, we're going to start with an alpha prototype, and work with teams to see where it goes.


 * on bits multicast stream -> udp2log (1:1) running in Analytics cluster
 * Until bits caches are ready, we'll also have a publicly accessible endpoint on
 * Kafka consumes udp2log, creating topic per product-code -- no intermediate aggregation at cache DC
 * Cron to run Kafka-Hadoop consumer, importing all topics into Hadoop to datetime+product-code paths
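
To make the last two steps concrete, here is a small Python sketch of how a topic name and an import path could be derived from a product code and a timestamp. The topic prefix, the root directory and the exact path layout are assumptions chosen for illustration; the prototype's real naming scheme is not recorded on this page.

```python
from datetime import datetime, timezone


def topic_for(product_code, prefix="pixel"):
    """One topic per product code (the prefix is a hypothetical convention)."""
    return f"{prefix}-{product_code}"


def import_path(product_code, when, root="/data/pixel"):
    """Build a date-time plus product-code path for an hourly import.

    The layout root/YYYY-MM-DD/HH/code is an assumption, not the real one.
    """
    return f"{root}/{when:%Y-%m-%d}/{when:%H}/{product_code}"


if __name__ == "__main__":
    stamp = datetime(2013, 1, 12, 0, 39, tzinfo=timezone.utc)
    print(topic_for("demo"))
    print(import_path("demo", stamp))
```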

EventLogging Integration TODOs

 * Make sure all event data goes into Kraken (I think it may only be esams at the moment, not sure). [ottomata] (Dec)
 * Divvy up some TODOs with Ori:
 * Keeping udplog seq id counters for each bits host and emitting some alert if gaps detected
 * Until https://rt.wikimedia.org/Ticket/Display.html?id=4094 is resolved, monitor for truncated URIs (detectable because missing trailing ';') and set up some alerting scheme
 * Speaking of that RT ticket: check w/Mark if we can do something useful to move that along (like update the patch so it applies against the versions deployed to prod).
 * Figure out a useful arrangement for server-side events (basic idea: call wfDebugLog(..) on hooks that represent "business" events, have wfDebugLog write to UDP / TCP socket pointing at Kraken. See EventLogging extension for some idea of what I mean.)
 * already done? EventLogging's efLogServerSideEvent validates events against a versioned schema on meta-wiki and writes them using wfDebugLog (currently to UDP). E3 logs all AccountCreation events on all servers using this. -- S Page (WMF) (talk) 00:39, 12 January 2013 (UTC)
 * Things Ori needs and would repay in dev time and/or sexual favors: - Puppetization of stuff on Vanadium - Help w/MySQL admin
 * Other EventLogging TODOs: mw:Extension:EventLogging/Todos
 * Figure out how to map event schemas to Avro(?) or some other way to make Hadoop schema-aware so the data is actually useful rather than just blob-like
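
The two monitoring TODOs above (sequence-id gaps per bits host, and truncated URIs recognisable by a missing trailing ';') could be prototyped along these lines. This is only a sketch: the record shape (host, sequence id), the host names and the function names are hypothetical, and the alerting itself is left out.

```python
def find_gaps(records):
    """Return (host, first_missing_seq, next_seen_seq) for each forward jump.

    `records` is an iterable of (host, seq) pairs in arrival order.
    A late, out-of-order packet is ignored rather than reported, so a
    reordered stream shows up as a gap when the later packet arrives first.
    """
    last_seen = {}
    gaps = []
    for host, seq in records:
        prev = last_seen.get(host)
        if prev is not None and seq > prev + 1:
            gaps.append((host, prev + 1, seq))
        # Track the highest sequence id seen so far for this host.
        last_seen[host] = seq if prev is None else max(prev, seq)
    return gaps


def is_truncated(uri):
    """A complete URI is expected to end with ';', so anything else is suspect."""
    return not uri.endswith(";")


if __name__ == "__main__":
    stream = [("bits1", 1), ("bits2", 10), ("bits1", 2), ("bits1", 5), ("bits2", 11)]
    for host, expected, got in find_gaps(stream):
        print(f"{host}: expected seq {expected}, got {got}")
```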

Getting to production
We're pretty settled on Kafka as the messaging transport, but to use the dynamic load-balancing and failover features we need a ZooKeeper-aware producer. Unfortunately, only the Java and C# clients have this functionality. (This is a blocker for both the Pixel Service AND general request logging.)

Three options:
 * 1) Pipe logging output from Squid & Varnish into the console producer (which implies running the JVM in production);
 * 2) Write code (a Varnish plugin plus configuration, as well as a Squid module, both in something C-like) to do ZK-integration and publish to Kafka;
 * 3) Continue to use udp2log -> Kafka with the caveat that the stream is unreliable until it gets to Kafka.

What HTTP actions will the service support?
.

What about s?
No. Only. Other than content-length, there's no real justification for a, and if you're sending strings that are greater than 2k, you kind of already have a problem.

Can I send JSON?
Sure, but we're probably not going to do anything special with it -- the JSON values will show up as strings that you'll have to parse to aggregate, count, etc. Ex:  (and recall you'll have to  ).

As we want to build tools to cover the normal cases first, this is not really recommended. (Just use  KV pairs as usual.) If anyone has a REEEEALLY good use-case, we can talk about having a key-convention for sending a json payload, like, say, calling the key.
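
For anyone who does go the JSON route, the round trip looks roughly like the Python sketch below: the payload is serialized, URL-encoded into a single query parameter, and comes back out as a plain string that has to be parsed again before it can be counted or aggregated. The key name "data" and the helper names are hypothetical; no key convention has been agreed.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_event(product, payload, key="data"):
    """Build a query string with the JSON payload under one (hypothetical) key."""
    # urlencode percent-encodes the JSON text so it is safe in a URL.
    return urlencode({"product": product, key: json.dumps(payload)})


def decode_json_param(query, key="data"):
    """What a processing job would have to do: find the string, then parse it."""
    raw = parse_qs(query)[key][0]  # still just a string at this point
    return json.loads(raw)


if __name__ == "__main__":
    query = encode_event("demo", {"page": "Main_Page", "clicks": 3})
    print(query)
    print(decode_json_param(query))
```

Plain KV pairs skip the json.loads step entirely, which is why they are the easier case for shared tooling.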

If I send crazy headers, will the service record them?
No. We will not parse anything other than the query string.

Custom headers are exactly what we want to avoid -- think of the metadata in an HTTP request as being an interface. You want it to be minimal and well-defined, so little custom parsing needs to occur. KV-pairs in the query string are both flexible and generic enough to meet all reasonable use-cases. If you really need typing, send JSON as the value (as mentioned above).