Extension:EventLogging

In order to evaluate Editor engagement experiments, code needs to track user actions. E3 Experiments and some other extensions currently use Extension:ClickTracking to do this; there is an overview in Extension:E3 Experiments/Architecture. ClickTracking has numerous problems associated with making a request to api.php that results in multiple DB calls.

This page describes an improved method that E3 is developing.

Event information
Event tracking represents events in the browser as a key-value data structure:
 * all events include the keys event_id, timestamp, token
 * can include any information in additional keys, such as link targets, user data made available in JS variables, computed values such as time on page, etc.

To track an event, JS client code simply makes a request to http://bits.wikimedia.org/event.gif?key=val&foo=bar&...
 * URL-encode nested structures such as JSON
 * Ajax XMLHttpRequest is subject to cross-domain policies and other limitations
 * so just create an img and set this URL as its src
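
As a rough illustration of the URL shape described above, here is a minimal Python sketch that assembles an event.gif request. The real client would be JavaScript setting an img src; Python is used here only to show the encoding. The function name `build_beacon_url`, the event name, the token value and the `link` key are all hypothetical.

```python
import json
import time
from urllib.parse import urlencode

BEACON_URL = "http://bits.wikimedia.org/event.gif"  # endpoint named in the text


def build_beacon_url(event_id, token, timestamp=None, **extra):
    """Return an event.gif URL carrying one event as a query string."""
    params = {
        "event_id": event_id,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "token": token,
    }
    for key, value in extra.items():
        # Nested structures are serialized to JSON first; urlencode() then
        # percent-encodes the JSON text like any other value.
        if isinstance(value, (dict, list)):
            value = json.dumps(value, separators=(",", ":"))
        params[key] = value
    return BEACON_URL + "?" + urlencode(params)


# Hypothetical event: a click on a sign-up link.
url = build_beacon_url(
    "signup-click", "abc123", timestamp=1350000000,
    link={"target": "Special:UserLogin"},
)
```

The three required keys always come first, and any additional keys follow in the order the caller supplied them.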

What happens to the request?
bits.w.o is served by a separate cluster with a varnish front-end (not squid). Varnish can generate simple responses itself, so configure it to return HTTP 204 "No Content".
 * 204 is supported even by older browsers

Varnish "logging" just fills a circular buffer in a shared memory segment. A separate command-line tool, `varnishlog`, can filter on the pattern "event.gif" and pipe the matches to a tool that sends its stdin to the log collection machine "vanadium".
 * (see for syntax)

limits
 * the query string is limited to 2000 characters in some cases (by whom? browsers, network layers... ?), so the system will emit an error if the data goes over.
 * Unicode data can't be sent as-is, so it has to be encoded somehow.

For more than 2000 characters, and in general, send a hash or some other ID that the back-end can relate to database information. E.g. a username hash to relate to previous actions, or the user ID of a logged-in user to relate to edit counts.
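
A minimal sketch of both ideas, assuming the limit is enforced on the encoded query string: percent-encode the values (which also takes care of non-ASCII text), refuse anything over 2000 characters, and send a short hash in place of a username. The names `encode_event` and `user_key`, the salt, and the choice of SHA-1 truncated to 16 hex digits are illustrative assumptions, not part of the design.

```python
import hashlib
from urllib.parse import urlencode

MAX_QUERY_CHARS = 2000  # limit quoted in the text


def user_key(username, salt="example-salt"):
    """Short, stable identifier for a username (hash choice is arbitrary)."""
    digest = hashlib.sha1((salt + username).encode("utf-8")).hexdigest()
    return digest[:16]


def encode_event(params):
    """Encode params as a query string, refusing anything over the limit."""
    # urlencode() percent-encodes non-ASCII text as UTF-8 bytes,
    # so the resulting query string is plain ASCII.
    query = urlencode(params)
    if len(query) > MAX_QUERY_CHARS:
        raise ValueError(
            "event query string is %d characters, limit is %d"
            % (len(query), MAX_QUERY_CHARS)
        )
    return query
```

The back-end can compute the same hash from its own user table to relate events to database information.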

sending using ZeroMQ
The sending tool is zpub and the receiver on vanadium is zsub. These are very simple C programs Ori wrote that use the ZeroMQ intelligent light-weight transport layer.
 * ZeroMQ has lots of useful features.
 * other teams are interested in it

is packaged, puppetized, and primed for production.

Processing logged events
The listener on the log collection machine "vanadium" is zsub, subscribing to the ZeroMQ publisher.

So the listener receives the log of the http://bits.wikimedia.org/event.gif?key=val&foo=bar&... request, extracts the key-value pairs, and stores them in:

redis key-value store
vanadium will run redis, an in-memory key-value store. redis has features beyond memcache:
 * can nest a set of key-value pairs inside a key, e.g. 'ori' can have user_id => xy, lastview => abc
 * has sorted sets
 * a sorted set is sorted by a score, e.g. timestamp or userid
 * say an event comes in: it first goes into the sorted set for its event_id
 * this is what's useful for rev tagging, which is annotating an edit with additional information (rather than trying to add columns to the page table)
 * the redis server can override the timestamp if it's too far out of sync
 * has its own pub-sub, so a connector can watch for certain kinds of events coming in and hook them back into MySQL. Or it can do a batch import.

So vanadium's redis stores all the key-value info from events. For analysis, we could import data sets from vanadium into another redis instance on another system, or import into a conventional SQL database.
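
To make the intended layout concrete, here is a small in-memory stand-in written in plain Python. It is not Redis and does not use a Redis client; it only mimics a hash per event plus one sorted set per event_id scored by timestamp (in Redis itself these would roughly correspond to HSET, ZADD and ZRANGEBYSCORE). The class name `EventStore`, the `event:N` key scheme and the 300-second skew tolerance are assumptions made for the sketch.

```python
import bisect
from collections import defaultdict


class EventStore:
    """In-memory stand-in for the layout described above (not Redis itself)."""

    def __init__(self, max_skew=300):
        self.max_skew = max_skew           # hypothetical tolerance, in seconds
        self.fields = {}                   # key -> event fields (like a hash)
        self.by_event = defaultdict(list)  # event_id -> sorted [(score, key)]
        self._seq = 0

    def add(self, event, now):
        """Store one event; `now` is the server's clock in epoch seconds."""
        ts = int(event.get("timestamp", now))
        if abs(ts - now) > self.max_skew:
            ts = now  # server overrides a timestamp that is too far out of sync
        self._seq += 1
        key = "event:%d" % self._seq
        self.fields[key] = dict(event, timestamp=ts)
        # Keep each per-event_id list ordered by score, like a sorted set.
        bisect.insort(self.by_event[event["event_id"]], (ts, key))
        return key

    def range_by_score(self, event_id, lo, hi):
        """Events for event_id with lo <= timestamp <= hi, in score order."""
        members = self.by_event[event_id]
        start = bisect.bisect_left(members, (lo, ""))
        end = bisect.bisect_right(members, (hi, chr(0x10FFFF)))
        return [self.fields[key] for _, key in members[start:end]]
```

A query such as `store.range_by_score("signup-click", start, end)` then returns all events of one kind inside a time window, which is the access pattern the sorted set is meant to serve.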

This redis part might be replaced by the Kraken project from Analytics.

Hadoop might be a more efficient solution for much larger datasets, but redis is very performant.

current implementation
is a test Node implementation of serializing query strings into redis.

It's not currently hooked into ZeroMQ; it responds directly to HTTP requests. However, it's a simple matter to change it to read URLs from stdin, and then pipe the output of zsub to it.
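
The stdin-reading variant could look roughly like the Python sketch below, which only covers the parsing step and leaves storage out. It assumes each input line contains the requested URL as one whitespace-separated field; the actual format of the lines zsub emits is not specified here, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_event(line):
    """Extract the key-value pairs from one logged event.gif request.

    Returns None when the line does not contain an event.gif request.
    """
    for word in line.split():
        # Log lines may carry other fields; pick the one holding the URL.
        if "event.gif?" in word:
            query = urlsplit(word).query
            # parse_qsl() also undoes the percent-encoding of each value.
            return dict(parse_qsl(query, keep_blank_values=True))
    return None


def parse_stream(lines):
    """Yield one dict per event found in an iterable of log lines."""
    for line in lines:
        event = parse_event(line)
        if event is not None:
            yield event
```

Passing `sys.stdin` to `parse_stream` would let the output of zsub be piped straight in, with each yielded dict handed on to whatever store is in use.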

New approach vs Extension:ClickTracking

 * the log line generated by CT always includes the user data isLoggedIn and (if logged in) edit count and edit counts over three time periods. The new approach doesn't have the overhead of retrieving these, but
 * the JS code issuing track events has access to most of the same information in JavaScript variables and can add it to the track action.
 * or back-end analytics code can relate a logged-in user to the same info from DB tables.
 * CT sets a non-persistent token in the clicktracking-session cookie when it issues a track action
 * E3 Experiments tend to set their own userbuckets cookie.

persistent cookie

 * Currently Article Feedback Tool 4 is the only supplier of a persistent identifier token; we think it's set for a year and is based on time.
 * we think AFT4 generates a different cookie if the user clears cookies.
 * is it OK to try to create a persistent user ID for anonymous readers (e.g. by user agent, etc.)?
 * as Article Feedback Tool version 5 starts to replace AFT4, the number of readers getting a persistent cookie will slowly decrease

other cookies

 * ClickTracking has clicktracking-session (but not persistent)
 * E3Experiments has its murmurhash3 cookie

If E3 starts to set its own persistent cookie (outside of AFT4), it may need an OK from legal.

Could start using this for the Account Creation User Experience experiment.

transition
Diderich's team is happy to have E3 develop this

Analytics is happy for E3 to use this in the interim while Kraken is developed, and is happy with the general approach of using zpub to send to a back-end.

Hoping to test this on labs; varnish and redis are in place on various servers.

Currently WMF has imposed a limit on ClickTracking usage, so we only sample a small percentage of impressions. The new system should be better, but we can only identify its limit once it is in production.

Probably leave the existing ClickTracking extension running, and transition E3 experiments over to using this improved version. Clients continue to call $.trackAction and similar JavaScript functions, but their implementation will change from XMLHttpRequests to api.php to requests for event.gif.

early feedback
Tim Starling:
 * Instead respond with status 200 and an empty response? But an empty response can make the browser wait, and 204 is well supported.
 * Instead use the upcoming varnish patch to send UDP messages? But using a message queue decouples the notification from varnish.
 * Why not then use this approach for other Wikimedia projects? Their call.

alternative
We could cut out the middlemen and request the beacon image from a new analytics.wikimedia.org domain, route this directly to the Node server, and have Node return the status 204 as well as feeding the key-value pairs into Redis. Fewer layers, but:
 * Node is not optimized for traffic
 * we can rely on varnish front-end to filter out extraneous and bogus requests
 * by piggybacking off bits.w.o, we get its geographical replication