Analytics/Archive/Pixel Service

The Pixel service is the "front door" to the analytics system: a public endpoint with a simple interface for getting data into the datastore.

Components

 * Request Endpoint : HTTP server that handles  requests to the PIXEL service endpoint, responding with   or an actual, honest-to-god 1x1 transparent gif. Data is submitted into the cluster via query parameters.
 * Messaging System : routes messages (request information + data content) from the request endpoint to the datastore. This component is intended to be implemented by Apache Kafka.
 * Datastore Consumer : consumes messages, shunting them into the datastore using HDFS staging and/or append.
 * Processing Toolkit : a standard template for a Pig job to process (count, aggregate, etc.) event data query string params. It handles standard indirection for referrer and timestamp, handles Avro de/serialization, and provides tools for conversion funnel and A/B testing analysis.
 * Event Logging Library : a JS library with an easy interface that abstracts sending data to the service. It handles the event data conventions for the proxied timestamp and referrer, as well as the normal web request components.
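
To make the Request Endpoint's job concrete, here is a minimal sketch of the request-handling logic, not the actual implementation. It assumes the endpoint always answers with the 1x1 transparent gif, and that the "message" handed to the messaging system is a plain dict. The function names, the message layout, and the received_at field are all hypothetical.

```python
import base64
import time
from urllib.parse import parse_qsl, urlsplit

# A widely used 1x1 transparent GIF (42 bytes), stored base64-encoded.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def build_message(request_url, received_at=None):
    """Turn a request URL into a message for the messaging system.

    Only the query string is read. Nothing else in the request is parsed.
    The message layout here is an assumption made for illustration.
    """
    query = urlsplit(request_url).query
    # keep_blank_values=True records "flag=" instead of silently dropping it.
    # dict() keeps the last value when a key is repeated.
    params = dict(parse_qsl(query, keep_blank_values=True))
    if received_at is None:
        received_at = time.time()
    return {"received_at": received_at, "params": params}


def handle_request(request_url, received_at=None):
    """Return (status, headers, body, message) for one pixel request."""
    message = build_message(request_url, received_at)
    headers = {
        "Content-Type": "image/gif",
        "Content-Length": str(len(PIXEL_GIF)),
    }
    return 200, headers, PIXEL_GIF, message
```

In a real server the returned message would be handed to the messaging system rather than returned to the caller; it is returned here only so the sketch stays self-contained.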

Service prototype
To get up and running right away, we're going to start with an alpha prototype, and work with teams to see where it goes.


 * on bits multicast stream -> udp2log (1:1) running in Analytics cluster
 * Until bits caches are ready, we'll also have a publicly accessible endpoint on
 * Kafka consumes udp2log, creating topic per product-code -- no intermediate aggregation at cache DC
 * Cron to run Kafka-Hadoop consumer, importing all topics into Hadoop to datetime+producer-code paths
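
As a rough illustration of the topic-per-code routing and the datetime-based import paths in the list above, here is a small sketch. Everything specific in it is an assumption: the query parameter that carries the code (called product here), the "unknown" fallback topic, the hourly granularity, and the path layout are all hypothetical. It also works on request URLs rather than on raw udp2log lines.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

CODE_PARAM = "product"  # hypothetical name of the parameter carrying the code


def topic_for(request_url, default="unknown"):
    """Pick the topic for a request from its query string."""
    params = parse_qs(urlsplit(request_url).query)
    # parse_qs maps each key to a list of values; take the first one.
    return params.get(CODE_PARAM, [default])[0]


def group_by_topic(request_urls):
    """Group requests into one list per topic."""
    topics = defaultdict(list)
    for url in request_urls:
        topics[topic_for(url)].append(url)
    return dict(topics)


def import_path(base_dir, topic, when):
    """Build a datetime + code import path (the layout is an assumption)."""
    return f"{base_dir}/{when:%Y-%m-%d_%H}/{topic}"
```

For example, under these assumptions import_path("/pixel", "mobile", datetime(2012, 7, 4, 15)) gives "/pixel/2012-07-04_15/mobile".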

Getting to production
We're pretty settled on Kafka as the messaging transport, but to use the dynamic load-balancing and failover features we need a ZooKeeper-aware (ZK-aware) producer -- unfortunately, only the Java and C# clients have this functionality. (This is a blocker for both the Pixel Service AND general request logging.)

Three options:
 * 1) Pipe logging output from Squid & Varnish into the console producer (which implies running the JVM in production).
 * 2) Write code (a Varnish plugin plus configuration as described here, as well as a Squid module, both in something C-like) to do ZK integration and publish to Kafka.
 * 3) Continue to use udp2log -> Kafka, with the caveat that the stream is unreliable until it gets to Kafka.

What HTTP actions will the service support?
.

What about POSTs?
No. GET only. Other than content-length, there's no real justification for a POST, and if you're sending strings that are greater than 2k, you kind of already have a problem.

Can I send JSON?
Sure, but we're probably not going to do anything special with it -- the JSON values will show up as strings that you'll have to parse to aggregate, count, etc. Ex:  (and recall you'll have to  ).

As we want to build tools to cover the normal cases first, this is not really recommended. (Just use  KV pairs as usual.) If anyone has a REEEEALLY good use-case, we can talk about having a key-convention for sending a json payload, like, say, calling the key.
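
To show what "JSON values will show up as strings" means in practice, here is a small sketch of sending JSON as a query string value and getting it back out. The key name data and the function names are hypothetical and are not a convention of the service. The JSON has to be URL-encoded on the way in and explicitly parsed on the way out.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_event(action, payload):
    """Build a query string with one plain KV pair and one JSON value."""
    # urlencode percent-encodes the JSON text, so braces and quotes are safe.
    return urlencode({"action": action, "data": json.dumps(payload)})


def decode_event(query):
    """Read the query string back; the JSON value arrives as a plain string."""
    params = parse_qs(query)
    raw = params["data"][0]  # still just a string at this point
    return params["action"][0], raw, json.loads(raw)


query = encode_event("click", {"experiment": "signup-button", "variant": "B"})
action, raw, payload = decode_event(query)
```

Here raw is only text until json.loads is called on it, which is the extra step every counting or aggregation job would need. Plain KV pairs avoid that step.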

If I send crazy headers, will the service record them?
No. We will not parse anything other than the query string.

Custom headers are exactly what we want to avoid -- think of the metadata in an HTTP request as being an interface. You want it to be minimal and well-defined, so little custom parsing needs to occur. KV-pairs in the query string are both flexible and generic enough to meet all reasonable use-cases. If you really need typing, send JSON as the value (as mentioned above).