Extension:EventLogging

To measure the effects of changes in MediaWiki's user interface we need to collect aggregate data on how users (both readers and authenticated editors) interact with specific UI elements. Several WMF projects have used Extension:ClickTracking for this purpose – an extension originally developed in the context of the Usability Initiative that suffers from numerous problems and limitations. This page describes an improved method for logging user-generated events that the E3 team is developing.

Background
The ClickTracking extension has been used in the past few years to collect data about user-generated events in the context of product development analysis. Examples include:
 * counting impressions and measuring click-through and conversion rates of various calls to action;
 * measuring the breakdown of clicks on inbound links pointing to a special page;
 * ensuring that the sampling/bucketing methods used for the purpose of A/B testing are accurate.
When used for this purpose, the format and type of data collected has been documented on Meta; see for example the clicktracking specifications for the Article Feedback project. Logging user-generated events via the ClickTracking API imposes several constraints on the data format and raises major scalability concerns. The new event logger is redesigned to be agnostic about the type of event-related data that needs to be collected and to overcome performance/scalability issues.

Event data
All events collected by the logger include the following set of required data:
 * 1) an event identifier
 * 2) a timestamp
 * 3) an anonymous token assigned to the client
Event identifiers fall into one of these broad categories:
 * impressions of a page or a UI element
 * clicks on specific UI elements
 * form submit actions
 * server-generated events
Event identifiers can also include bucket or user group information, used in the context of A/B tests. The log can include additional data for each event, such as link targets, user data made available in JS variables, computed values such as time on page, etc. All data collected by the event logger is subject to the Wikimedia Foundation's privacy policy.
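As a rough illustration (not a fixed schema – the field names here are hypothetical), an event record assembled on the client might look like this before being serialized into a query string:

```javascript
// Hypothetical event record; field names are illustrative only.
var event = {
    event_id: 'account-creation-impression', // event identifier
    timestamp: 1343865600,                   // Unix timestamp
    token: '8f14e45fceea167a',               // anonymous token assigned to the client
    bucket: 'treatment-b',                   // optional A/B test bucket
    link_target: '/wiki/Special:UserLogin'   // optional additional data
};
```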

Logging events in JavaScript
Event data is logged in the form of key-value pairs that the client sends as requests to a URL such as http://bits.wikimedia.org/event.gif?key=val&foo=bar&... (a plain image request avoids the cross-domain restrictions that apply to Ajax XMLHttpRequests).
 * URL-encode nested structures such as JSON
 * in JavaScript (jQuery or other DOM manipulation) just create an img element and set this URL as its src (see the sketch below)
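A minimal client-side sketch of this beacon technique; the endpoint path and field names are assumptions for illustration, not the extension's final API:

```javascript
// Minimal beacon sketch: serialize key-value pairs and request event.gif.
// Field names and the endpoint path are illustrative assumptions.
function logEvent( data ) {
    var pairs = [];
    for ( var key in data ) {
        var value = data[ key ];
        // URL-encode nested structures as JSON strings.
        if ( typeof value === 'object' ) {
            value = JSON.stringify( value );
        }
        pairs.push( encodeURIComponent( key ) + '=' + encodeURIComponent( value ) );
    }
    // Setting the src of an <img> issues a simple GET request that works
    // cross-domain and in older browsers, unlike an Ajax XMLHttpRequest.
    var img = document.createElement( 'img' );
    img.src = 'http://bits.wikimedia.org/event.gif?' + pairs.join( '&' );
}

logEvent( { event_id: 'signup-impression', token: 'abc123', timestamp: Date.now() } );
```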

bits.w.o is served by a separate cluster with a Varnish front-end (not Squid). Varnish can generate simple responses itself, so configure it to return HTTP 204 "No Content"
 * supported even by older browsers

Varnish "logging" just fills a shared memory segment circular buffer. A separate command-line tool `varnishlog` can filter on the pattern "event.gif" and pipe to a tool that sends stdin to the log collection machine "vanadium"

Limits
 * the query string is limited (by whom? browsers, network layers?) to 2000 characters in some cases, so the system will emit an error if the data goes over that length.
 * the query string can't carry raw Unicode data, so it has to be encoded somehow.

For payloads over 2000 characters, and in general, send a hash or some other ID that the back-end can relate to database information: e.g. a username hash to relate to previous actions, or the user ID of a logged-in user to relate to edit counts (see the sketch below).
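A sketch of how a client might guard against the length limit and send only an identifier rather than the full data; the 2000-character cutoff, field names, and use of mw.config are assumptions for illustration:

```javascript
// Guard against over-long query strings and prefer IDs over raw data.
// The limit, field names, and endpoint are illustrative assumptions.
var MAX_QUERY_LENGTH = 2000;

function buildQuery( data ) {
    var query = $.param( data ); // jQuery's standard query-string serializer
    if ( query.length > MAX_QUERY_LENGTH ) {
        throw new Error( 'Event payload exceeds ' + MAX_QUERY_LENGTH + ' characters' );
    }
    return query;
}

// Rather than sending edit counts or other profile data with every event,
// send only the user ID; back-end analysis can join it against DB tables.
var query = buildQuery( { event_id: 'edit-save', user_id: mw.config.get( 'wgUserId' ) } );
document.createElement( 'img' ).src = 'http://bits.wikimedia.org/event.gif?' + query;
```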

Transmitting event data
The sending tool is zpub and the receiver on vanadium is zsub; these are very simple C programs Ori wrote that use the ZeroMQ intelligent lightweight transport layer. One publishes the events in the varnish logs, the other subscribes to them.
 * ZeroMQ has lots of useful features: it's lightweight, it handles one end or the other going away, etc.
 * other teams have expressed interest in it

The code is packaged, puppetized, and primed for production.

Processing logged events
The listener on the log collection machine "vanadium" is zsub, subscribing to the ZeroMQ publisher. The listener receives the log line for the http://bits.wikimedia.org/event.gif?key=val&foo=bar&... request, extracts the key-value pairs, and stores them in the data store described below.

Storing event data
Vanadium will run Redis, an in-memory key-value store, to store user-generated event data.

Redis has features beyond memcached:
 * it can nest a set of key-value pairs inside a key (a hash), e.g. 'ori' can have user_id => xy, lastview => abc
 * it has sorted sets
 * a sorted set is ordered by a score, e.g. a timestamp or a user ID
 * say an event comes in: it first goes into the sorted set for its event_id (see the sketch after this list)
 * this is what's useful for rev tagging, which is annotating an edit with additional information (rather than trying to add columns to the page table)
 * the Redis server can override the timestamp if the client's clock is too far out of sync
 * it has its own pub-sub, so a connector can watch for certain kinds of events coming in and hook them back into MySQL, or do a batch import
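A minimal sketch of that data model using the node_redis client; the key names ('user:ori', 'events:<event_id>') and payload layout are assumptions for illustration, not the project's actual schema:

```javascript
// Sketch of the Redis data model described above, using the node_redis client.
// Key names and field layout are illustrative assumptions.
var redis = require( 'redis' );
var client = redis.createClient();

// A hash nests key-value pairs under a single key, e.g. per-user data.
client.hset( 'user:ori', 'user_id', 'xy' );
client.hset( 'user:ori', 'lastview', 'abc' );

// Each incoming event goes into the sorted set for its event_id,
// scored by timestamp so events can be ranged and ordered by time.
function storeEvent( eventId, timestamp, payload ) {
    client.zadd( 'events:' + eventId, timestamp, JSON.stringify( payload ) );
}

storeEvent( 'signup-impression', Date.now(), { token: 'abc123', bucket: 'treatment-b' } );
```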

So vanadium's Redis stores all the key-value info from events. For analysis, we could import data sets from vanadium into another Redis instance on another system, or import them into a conventional SQL database.

This Redis part might be replaced by the Kraken project from Analytics.

Hadoop might be a more efficient solution for much larger datasets, but Redis is very performant.

Data Model

 * transcluded from subpage /Data model

Current implementation
There is a test Node implementation that serializes query strings into Redis.

It's not currently hooked into ZeroMQ; it responds directly to HTTP requests. However, it's a simple matter to change it to read URLs from stdin and then pipe the output of zsub to it (a sketch follows).
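A minimal sketch of that stdin-driven variant, assuming one logged event.gif URL per line; the module choices and Redis key names are illustrative assumptions:

```javascript
// Read logged event.gif URLs from stdin (e.g. piped from zsub),
// extract the key-value pairs, and write them into Redis.
// Key naming and required fields are illustrative assumptions.
var readline = require( 'readline' );
var url = require( 'url' );
var redis = require( 'redis' );

var client = redis.createClient();
var rl = readline.createInterface( { input: process.stdin, terminal: false } );

rl.on( 'line', function ( line ) {
    if ( line.indexOf( 'event.gif' ) === -1 ) {
        return; // ignore unrelated log lines
    }
    var params = url.parse( line.trim(), true ).query; // key-value pairs from the query string
    if ( !params.event_id || !params.timestamp ) {
        return; // skip malformed events
    }
    client.zadd( 'events:' + params.event_id, Number( params.timestamp ), JSON.stringify( params ) );
} );
```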

Event logging vs Extension:ClickTracking

 * the log line generated by CT always includes the user data isLoggedIn and (if logged in) the edit count and edit counts over three time periods. The new approach doesn't have the overhead of retrieving these, but:
 * the JS code issuing track events has access to most of the same information in JavaScript variables and can add it to the track action.
 * or back-end analytics code can relate a logged-in user to the same info from DB tables.
 * CT sets a non-persistent token in the clicktracking-session cookie when it issues a track action
 * E3 Experiments tend to set their own userbuckets cookie.

Persistent cookies

 * Currently Article Feedback Tool 4 is the only supplier of a persistent identifier token; we think it's set for a year and is based on time.
 * we think AFT4 regenerates a different cookie if the user clears cookies.
 * is it OK to try to create a persistent user ID for anonymous readers (e.g. by user agent, etc.)?
 * as Article Feedback Tool version 5 starts to replace AFT4, the number of readers getting a persistent cookie will slowly decrease

Other cookies

 * ClickTracking has the clicktracking-session cookie (but it is not persistent)
 * E3Experiments has its murmurhash3 cookie

If E3 starts to set its own persistent cookie (outside of AFT4), we may need approval from Legal.

We could start using this for the Account Creation User Experience experiment.

Transition
Diederik's team is happy to have E3 develop this.

Analytics is happy for E3 to use this in the interim while Kraken is developed, and with the general approach of zpub sending to the back-end.

We hope to test this on Labs; Varnish and Redis are in place on various servers.

Currently WMF has imposed a limit on ClickTracking usage, so we only sample a small percentage of impressions. The new system should be better, but we can only identify its limits once it is in production.

Probably leave the existing ClickTracking extension running, and transition E3 experiments over to using this improved version. Clients continue to call $.trackAction and similar JavaScript functions, but their implementation will change: instead of XMLHttpRequests to api.php, they will make requests for event.gif (see the sketch below).
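A sketch of how the $.trackAction implementation might swap transports while keeping the same call signature; the parameter handling, cookie lookup, and endpoint are assumptions for illustration:

```javascript
// Keep the $.trackAction call signature but swap its transport
// from an api.php XMLHttpRequest to a simple event.gif image request.
// Parameter handling, cookie name, and endpoint are illustrative assumptions.
( function ( $ ) {
    $.trackAction = function ( actionId, extraData ) {
        var data = $.extend( {}, extraData, {
            event_id: actionId,
            timestamp: Math.round( Date.now() / 1000 ),
            token: ( $.cookie && $.cookie( 'clicktracking-session' ) ) || ''
        } );
        // Old implementation: an XMLHttpRequest to api.php.
        // New implementation: request the beacon image instead.
        document.createElement( 'img' ).src =
            'http://bits.wikimedia.org/event.gif?' + $.param( data );
    };
}( jQuery ) );
```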

Early feedback
Tim Starling:
 * Why not respond with a status 200 empty response instead? But an empty response can make the browser wait, and 204 is well supported.
 * Why not use the upcoming Varnish patch to send UDP messages instead? But using a message queue decouples the notification from Varnish.
 * Why not then use this approach for other Wikimedia projects? Their call.

Alternatives
We could cut out the middlemen and request the beacon image from a new analytics.wikimedia.org domain, route this directly to the Node server, and have Node return the status 204 as well as feed the key-value pairs into Redis (sketched below). Fewer layers, but:
 * Node is not optimized for high traffic
 * we can rely on the Varnish front-end to filter out extraneous and bogus requests
 * by piggybacking off bits.w.o, we get its geographical replication
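A minimal sketch of that alternative, assuming a hypothetical direct route from analytics.wikimedia.org to a Node process; the port and Redis key names are illustrative:

```javascript
// "Cut out the middlemen": Node answers the beacon request directly with 204
// and feeds the key-value pairs into Redis.
// Hostname routing, port, and key names are illustrative assumptions.
var http = require( 'http' );
var url = require( 'url' );
var redis = require( 'redis' );

var client = redis.createClient();

http.createServer( function ( req, res ) {
    var parsed = url.parse( req.url, true );
    if ( parsed.pathname === '/event.gif' && parsed.query.event_id ) {
        client.zadd( 'events:' + parsed.query.event_id, Date.now(), JSON.stringify( parsed.query ) );
    }
    // Respond with 204 No Content so the browser doesn't wait for a body.
    res.writeHead( 204 );
    res.end();
} ).listen( 8080 );
```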