Extension:EventLogging

To measure the effects of changes in MediaWiki's user interface we need to collect aggregate data on how users (both readers and authenticated editors) interact with specific UI elements. Several WMF projects have used Extension:ClickTracking for this purpose. That extension was originally developed in the context of the Usability Initiative and suffers from numerous problems and limitations. This page describes an improved method for logging user-generated events that the E3 team is developing.

Background
The ClickTracking extension has been used in the past few years to collect data about user-generated events in the context of product development analysis. Examples include:
 * counting impressions and measuring click-through and conversion rates of various calls to action;
 * measuring the breakdown of clicks on inbound links pointing to a special page;
 * ensuring that the sampling/bucketing methods used for the purpose of A/B testing are accurate.
When used for this purpose, the format and type of data collected have been documented on Meta; see for example the clicktracking specifications for the Article Feedback project. Logging user-generated events via the ClickTracking API imposes several constraints on the data format and raises major scalability concerns. The new event logger is designed to be agnostic about the type of event-related data that needs to be collected and to overcome performance/scalability issues.

Event data
All events collected by the logger include the following set of required data:
 * 1) an event identifier
 * 2) a timestamp
 * 3) an anonymous token assigned to the client
Event identifiers fall into one of these broad categories:
 * impressions of a page or a UI element
 * clicks on specific UI elements
 * form submit actions
 * server-generated events
Event identifiers can also include bucket or user group information, used in the context of A/B tests. The log can include additional data for each event, such as link targets, user data made available in JS variables, computed values such as time on page, etc. All data collected by the event logger is subject to the Wikimedia Foundation's privacy policy.
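
As a rough illustration of the required fields described in this section, an event record could be assembled as in the sketch below. The function name `makeEvent` and the field names `event`, `timestamp` and `token` are hypothetical and are not taken from the extension; the optional `now` parameter exists only to make the example deterministic.

```javascript
// Hypothetical helper: all names here are illustrative assumptions.
function makeEvent(eventId, token, extra, now) {
    var record = {
        // Event identifier, e.g. an impression, click or form submit name.
        event: eventId,
        // Milliseconds since the epoch; callers may pass a fixed value.
        timestamp: now === undefined ? Date.now() : now,
        // Anonymous token assigned to the client.
        token: token
    };
    // Copy optional additional data (link target, bucket, time on page, ...)
    // without letting it overwrite the required fields.
    Object.keys(extra || {}).forEach(function (key) {
        if (!Object.prototype.hasOwnProperty.call(record, key)) {
            record[key] = extra[key];
        }
    });
    return record;
}
```

Keeping the required fields protected from the additional data means a caller cannot accidentally replace the token or timestamp with an experiment-specific value.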

Logging events in JavaScript
Event data is logged in the form of key-value pairs that the client sends as a request to a URL such as http://bits.wikimedia.org/event.gif?key=val&foo=bar&...
 * Nested structures such as JSON must be URL-encoded.
 * Ajax XMLHttpRequest is subject to cross-domain restrictions, so in JavaScript (jQuery or other DOM manipulation) just create an img element and set this URL as its src.
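
The approach described above can be sketched as follows. This is a minimal illustration, not the extension's actual client code: the names `EVENT_ENDPOINT`, `buildEventUrl` and `logEvent` are assumptions, and the endpoint is simply the example URL quoted in the text.

```javascript
// Assumed endpoint, copied from the example URL in the text.
var EVENT_ENDPOINT = 'http://bits.wikimedia.org/event.gif';

// Serialize an event object into a URL with an encoded query string.
function buildEventUrl(endpoint, data) {
    var pairs = Object.keys(data).map(function (key) {
        var value = data[key];
        // Nested structures are serialized as JSON before URL-encoding.
        if (value !== null && typeof value === 'object') {
            value = JSON.stringify(value);
        }
        return encodeURIComponent(key) + '=' + encodeURIComponent(String(value));
    });
    return endpoint + '?' + pairs.join('&');
}

// Fire-and-forget request: loading an image is not subject to the
// XMLHttpRequest cross-domain restrictions.
function logEvent(data) {
    var url = buildEventUrl(EVENT_ENDPOINT, data);
    // Only touch the DOM when running in a browser.
    if (typeof document !== 'undefined') {
        document.createElement('img').src = url;
    }
    return url;
}
```

For example, `logEvent({ key: 'val', foo: 'bar' })` requests `http://bits.wikimedia.org/event.gif?key=val&foo=bar`. The image element never needs to be attached to the page, since assigning `src` is enough to trigger the request.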

bits.wikimedia.org is served by a separate cluster with a Varnish front-end (not Squid). Varnish can generate simple responses itself, so it can be configured to return HTTP 204 "No Content" for these requests.
 * supported even by older browsers

Varnish "logging" just fills a circular buffer in a shared memory segment. A separate command-line tool, `varnishlog`, can filter on the pattern "event.gif" and pipe the matching lines to a tool that sends its stdin to the log collection machine "vanadium".

Limits
 * The query string is limited to 2000 characters in some cases (it is not yet clear whether the limit comes from browsers, network layers, or elsewhere), so the system will emit an error if the data exceeds it.
 * Unicode data cannot be sent as-is, so it has to be encoded somehow.
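
A minimal sketch of how a client could respect both limits is shown below. The names `MAX_QUERY_LENGTH` and `checkedQueryString` are hypothetical. Using `encodeURIComponent` is one possible encoding, chosen here as an assumption: it percent-encodes non-ASCII characters as UTF-8 bytes, so the resulting query string contains only ASCII.

```javascript
// Limit quoted in the text; where exactly it is enforced is unclear.
var MAX_QUERY_LENGTH = 2000;

// Build an ASCII-only query string and refuse to produce one that is too long.
function checkedQueryString(data) {
    var query = Object.keys(data).map(function (key) {
        return encodeURIComponent(key) + '=' + encodeURIComponent(String(data[key]));
    }).join('&');
    if (query.length > MAX_QUERY_LENGTH) {
        throw new Error('Event data too long: ' + query.length + ' characters');
    }
    return query;
}
```

Note that the length is checked after encoding, because percent-encoding can expand a single non-ASCII character into several characters.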

For more than 2000 characters, and in general, send a hash or some other ID that the back-end can relate to database information, e.g. a username hash to relate to previous actions, or the user ID of a logged-in user to relate to edit counts.

Transmitting event data
The sending tool is udp2log.

Processing logged events
The listener on the log collection machine "vanadium" is zsub, subscribing to the ZeroMQ publisher. The listener receives the log of the http://bits.wikimedia.org/event.gif?key=val&foo=bar&... request, extracts the key-value pairs, and stores them (see Storing event data below).
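
The extraction step can be illustrated with the sketch below. It is written in JavaScript only for consistency with the client-side examples; the actual listener may be implemented differently, and the function name `parseEventRequest` is hypothetical. It assumes the logged line has already been reduced to the request URL.

```javascript
// Recover the key-value pairs from a logged event.gif request URL.
function parseEventRequest(requestUrl) {
    var result = {};
    var queryStart = requestUrl.indexOf('?');
    if (queryStart === -1) {
        return result; // no query string, so no event data
    }
    // URLSearchParams splits on "&" and "=" and undoes the percent-encoding.
    var params = new URLSearchParams(requestUrl.slice(queryStart + 1));
    params.forEach(function (value, key) {
        result[key] = value;
    });
    return result;
}
```

Values that were sent as URL-encoded JSON come back as JSON strings and would still need `JSON.parse` before use.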

Storing event data
EventLogging currently uses Redis, an in-memory key-value store, to store user-generated event data.

Data Model

 * transcluded from subpage /Data model

Current implementation
See wikitech:Event logging

Event logging vs Extension:ClickTracking

 * The log line generated by ClickTracking (CT) always includes the user data isLoggedIn and, if the user is logged in, the edit count and the edit counts over three time periods. The new approach doesn't have the overhead of retrieving these, but:
 * the JS code issuing track events has access to most of the same information in JavaScript variables and can add it to the track action;
 * or back-end analytics code can relate a logged-in user to the same info from DB tables.
 * CT sets a non-persistent token in the clicktracking-session cookie when it issues a track action
 * E3 Experiments add data to the userbuckets cookie when bucketing users in an experiment
 * there are many other tokens available, e.g. the mediaWiki.user.id cookie issued by JS mw.user.id, etc.