Extension:EventLogging/Guide

From MediaWiki.org

What is EventLogging?

If you have a question about how visitors are interacting with our site, EventLogging can capture the data required to answer it. EventLogging gathers the information each querier needs, and makes that data easily available for analysis.

The goal of EventLogging is not to capture every action on the site, but to capture rich, well-documented data that can be efficiently analyzed and used over the long term. Data captured by EventLogging are validated, versioned, and carefully described so that error and misunderstanding are minimized.

Every data collection job has a public-facing page, where the data to capture are specified and refined. Users can scrutinize the data-collection process, offer insights, and discuss concerns. This public page contains a new type of wiki content: a JSON Schema, which provides a data description that is legible to both humans and machines. The schema does not automatically grab data; rather, it provides a place for analysts, engineers, product managers and others to work collaboratively. The schema becomes a contract, a consensus as to the meaning and implementation of a data model:

  • Developers write code to log events that match the schema.
  • The schema tells analysts what information is in the logged events.

EventLogging helps ensure that the data collected and, ultimately, analyzed to answer our questions are the desired data. The system validates data on the client side (to immediately alert developers to errors) and on the server side (to ensure that the collected data are truly valid). If a schema is updated and redeployed, the system assigns a new revision ID. Collected data are transmitted to a data warehouse, where valid data are stored in a MySQL table that is automatically created by the system. The data can be easily accessed by additional subscribers for other purposes, such as visualizations.

EventLogging replaces the previous event-logging system, ClickTracking, which uses a single format (a string) to describe all data not captured by a small set of defined fields. Over time, we discovered that the string format failed to adequately describe collected data, and, in fact, made comparing and understanding the data costly and difficult. Simply parsing the format could present problems, as the ‘@’ delimiter sometimes appeared in the data themselves. Additionally, the system made too many API calls to scale, and provided no mechanism for validating the data or describing them well. By using collaboratively edited schemas to define data, EventLogging provides a much more robust framework for creating and documenting data models, and performing iterative revisions.

Underlying technology

This section is about the general technology behind MediaWiki's EventLogging extension. If you'd like to know how the WMF-specific EventLogging system is backed, you can read more here: http://wikitech.wikimedia.org/wiki/Analytics/EventLogging.

EventLogging is a MediaWiki extension that performs server- and client-side logging. It also has "back-end" code to transmit, process, and store the collected information. The system uses schema files to define, describe, and validate collected data. Schemas are JSON schema objects that you store and edit on a wiki; schemas in use on WMF wikis are located in the ‘Schema:’ namespace on Meta-Wiki. Note that JSON is a new type of wiki content, and EventLogging uses ContentHandler, developed by Wikimedia Deutschland for Wikidata, to store and version the JSON content like wikitext.

To use EventLogging, you call a function and pass it a versioned schema name and a plain object that matches that schema. To log an event on the server in PHP, call EventLogging::logEvent(). To log an event in client-side JavaScript code, call mw.eventLog.logEvent(). If you are unsure of whether to log an event on the client or server side, consider the data you are collecting. Transaction information (e.g., the addition, deletion, or modification of information stored in the MediaWiki database) is easiest to capture on the server-side. Information about how a user is interacting with the browser environment (e.g., a page view or notification) is easiest to capture client-side.

Server-side logging:
EventLogging::logEvent( string $schemaName, integer $revId, array $event )

$schemaName

The name of the schema. For example, "NavigationTiming"

$revId

The schema's revision ID. For example, 5336845

$event

An associative array of event data, conforming to the schema. Example:
 [ "userId" => 1234 ]

Client-side logging:
mw.eventLog.logEvent( schemaName, event )

schemaName

The resource name of the schema. For example, "NavigationTiming"

event

A plain JavaScript object of event data conforming to the schema. Example:
{ userId: 1234 }

Events can only be logged as wholes. If a schema's properties span both server-only and client-only data, the schema can be split into two complementary schemas, or the developer can do some extra work to make sure all values are available on one side or another. The developer may also suggest an alternate combination of schema values that would also capture the required information.

Once the schema has been finalized and implemented, EventLogging can be deployed to collect event records. A client- or server-side event will trigger the implemented code, which grabs the data and triggers EventLogging. EventLogging will then validate each event record and annotate it with additional information (via the m:Schema:EventCapsule). EventLogging handles the URL-encoding and decoding required to transmit and, ultimately, broadcast valid event records in JSON format. This stream of valid records produced by EventLogging is available to any and all subscribers (MySQL, MongoDB, Visualization, etc).
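
As a rough illustration of the annotate-and-encode step described above, the following sketch wraps an event in a capsule-like object and round-trips it through URL encoding. The function names (encapsulate, encodeForTransport, decodeFromTransport) and the capsule fields are assumptions made for this example; they are not taken from the actual EventCapsule schema or from EventLogging's code.

```javascript
// Hypothetical sketch: wrap an event in a capsule, then URL-encode it for transport.
// The capsule field names are illustrative assumptions, not the real EventCapsule.
function encapsulate( schemaName, revision, event, wiki ) {
    return {
        schema: schemaName,
        revision: revision,
        wiki: wiki,
        event: event
    };
}

// Serialize the capsule to JSON, then percent-encode it so it can travel in a URL.
function encodeForTransport( capsule ) {
    return encodeURIComponent( JSON.stringify( capsule ) );
}

// Reverse the encoding on the receiving side.
function decodeFromTransport( encoded ) {
    return JSON.parse( decodeURIComponent( encoded ) );
}

var capsule = encapsulate( 'BannerImpression', 5329872, { wasClicked: false }, 'enwiki' );
var encoded = encodeForTransport( capsule );
var decoded = decodeFromTransport( encoded );
```

After the round trip, decoded holds the same schema name, revision, and event as the original capsule, which is the property a transport layer of this kind needs to preserve.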

For example, suppose the fundraising team has created a new banner and is interested in capturing information about whether or not users have clicked it. To use EventLogging to capture this information, the team would first create a schema file (BannerImpression) that defines the event data to capture (e.g., Was the banner clicked? true/false):

[Image: BannerImpressionSchema.png]

We will look in more depth at schema creation and best practices in later sections. For now, just note the property “wasClicked”, which defines a Boolean value that describes whether or not a user has clicked the fundraising banner.

Note that a developer must implement the schema by creating the code that will programmatically assemble the event data and invoke EventLogging.

Once the schema file and its implementation are complete, the schema can be deployed and EventLogging will begin to capture event records. For example, if a user views the fundraising banner, but does not click the banner, the captured event will be:

{wasClicked: false}

MySQL

Because many of our existing analytical tools are designed to work with MySQL, and because the user and transaction data generated by MediaWiki are stored in MySQL tables, using MySQL for EL data is often useful. EventLogging data are currently stored in MySQL on the s1-analytics server.

The json2sql client, which writes EL events to MySQL, subscribes to the stream of valid JSON published over ZeroMQ. When json2sql receives an event record, the client checks to see if a MySQL table exists for the data. If so, json2sql places the record in the existing table. If no table exists, json2sql generates one automatically: it uses the schema from Meta-Wiki to construct a SQL statement instructing the database to create a table that has appropriate columns. For example:

CREATE TABLE `BannerImpression_5329872` (
    id INTEGER NOT NULL AUTO_INCREMENT,
    uuid VARCHAR(255),
    `clientIp` VARCHAR(255),
    `clientValidated` BOOL,
    `isTruncated` BOOL,
    timestamp VARCHAR(14),
    `webHost` VARCHAR(255),
    wiki VARCHAR(255),
    `event_wasClicked` BOOL,
    PRIMARY KEY (id),
    CHECK (`clientValidated` IN (0, 1)),
    CHECK (`isTruncated` IN (0, 1)),
    CHECK (`event_wasClicked` IN (0, 1))
) ENGINE=InnoDB CHARSET=utf8

Once the table has been created (or if the table exists already), json2sql issues a SQL statement instructing the database to insert the event as a new record in the table:

INSERT INTO `BannerImpression_5329872`
    (uuid, `clientIp`, `clientValidated`, timestamp, `webHost`, wiki, `event_wasClicked`)
  VALUES
    ('fb378cdda3fe58799c334f9565365246', 'e6553bbd10a51a2c6270147ea8617a5080863ac6',
        1, '20130318192909', '127.0.0.1', 'enwiki', 0)
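
To make the table-generation idea concrete, here is a minimal sketch of how a schema could be turned into SQL of the kind shown above. It is not json2sql's real code: the helper names (tableName, createTableSql, insertSql) and the type mapping are assumptions for illustration, and the capsule columns are reduced to id and uuid for brevity.

```javascript
// Hypothetical sketch of deriving SQL from a schema; not json2sql's actual code.
// The JSON-type-to-SQL-type mapping below is an assumption for illustration.
var SQL_TYPES = {
    boolean: 'BOOL',
    integer: 'INTEGER',
    number: 'FLOAT',
    string: 'VARCHAR(255)'
};

// Table names combine the schema name and its revision ID.
function tableName( schemaName, revision ) {
    return schemaName + '_' + revision;
}

// Build a CREATE TABLE statement with one event_-prefixed column per schema property.
function createTableSql( schemaName, revision, schema ) {
    var columns = [ 'id INTEGER NOT NULL AUTO_INCREMENT', 'uuid VARCHAR(255)' ];
    Object.keys( schema.properties ).forEach( function ( name ) {
        var sqlType = SQL_TYPES[ schema.properties[ name ].type ];
        columns.push( '`event_' + name + '` ' + sqlType );
    } );
    columns.push( 'PRIMARY KEY (id)' );
    return 'CREATE TABLE `' + tableName( schemaName, revision ) + '` (\n    ' +
        columns.join( ',\n    ' ) + '\n) ENGINE=InnoDB CHARSET=utf8';
}

// Build a parameterized INSERT; values are returned separately from the SQL text.
function insertSql( schemaName, revision, event ) {
    var names = Object.keys( event );
    var columns = names.map( function ( name ) { return '`event_' + name + '`'; } );
    var placeholders = names.map( function () { return '?'; } );
    return {
        sql: 'INSERT INTO `' + tableName( schemaName, revision ) + '` (' +
            columns.join( ', ' ) + ') VALUES (' + placeholders.join( ', ' ) + ')',
        values: names.map( function ( name ) { return event[ name ]; } )
    };
}

var bannerSchema = { properties: { wasClicked: { type: 'boolean' } } };
var ddl = createTableSql( 'BannerImpression', 5329872, bannerSchema );
var insert = insertSql( 'BannerImpression', 5329872, { wasClicked: false } );
```

Keeping the values apart from the SQL text, as insertSql does, lets a database driver bind them as parameters rather than splicing them into the statement.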

See #Analyzing EventLogging data for more information about working with EventLogging data.

Installing the EventLogging extension

Please see Extension:EventLogging for information about downloading and configuring EventLogging and setting up a developer environment.

About this document

This document is intended to capture the current best practices for using EventLogging. We encourage questions, comments, suggestions and concerns as part of the ongoing process of identifying how to most effectively use the system.

Events and schemas briefly defined

A couple of terms to know:

Events: An event is a record of a user action on the site, such as previewing an edit, or collapsing the "Toolbox" section in the sidebar. Events are often triggered by a click, but they can represent any item of interest (the state of an application or the presence of a notification, for example). EventLogging can also capture event information, such as the time it takes to fetch an image, directly from interactions with the browser itself. We capture and analyze event data in aggregate to better understand how readers and editors interact with our site, to identify usability problems, and to provide feedback for features engineers.

Schemas: A schema describes the structure of an event record by enumerating properties and specifying constraints on types of values.

Using EventLogging: The workflow

The EventLogging workflow is a flexible process that facilitates asynchronous collaboration among contributors. Analysts, engineers, product managers, and others can work in parallel, implementing the code required to grab the data, and/or contributing to the process of schema refinement at any stage of the process.

The schema file provides a centralized place for all development. Though conversations about how to specify or collect data may occur in person or over email, the dialogue will ultimately be reflected in the schema itself and, if best practices are followed, in its documentation.

By clearly defining data, the schema helps all users understand which data need to be collected (pageTitle or pageID?) and, later, what the collected data represent (new users, or new users served an experimental treatment?). Data definitions help minimize error and ensure that the correct information is captured.

Posing a question

Though questions about how users are interacting with the site are most often posed by a product manager or UI developer, questions can be posed by anyone. One of the benefits of EventLogging is that the tool can be used by anyone—both within and outside of the Wikimedia Foundation—to initiate an inquiry.

A question may require an experiment, or not. For example, a product manager might be interested in seeing how users, or a subset of users, currently interact with the site. ‘How well do Talk pages work for new users?’ or ‘How many new users successfully complete an edit within 24 hours?’ are questions that require only that the schema capture usage patterns that can be analyzed to provide an answer.

Other questions imply an experiment. For example, “Which Create Account button design is most effective?” or “Which onboarding experience better helps people edit?” In these cases, the inquiry would involve an experiment, and possibly multiple experimental iterations. We talk more about working with schemas and experiments in later sections.

Once you have identified your question, get the ball rolling by creating a Schema page, where the process can continue. See Creating a schema wiki page for more information.

Identifying metrics

Once you have identified your question, it’s time to start thinking about the metrics that can answer it. When thinking about metrics, make sure to clearly define each one, and to limit the measurements to only those necessary for answering the posed question.

For example, to answer the question, “Which onboarding experience better helps people edit?” we would need to capture the number of users that are exposed to each onboarding experience, and the number that subsequently edit successfully.

Now is the time to even further refine these desired metrics. For example, the users we are interested in are newly registered users (not anonymous users, or users who registered last week). Successful editing means completing one edit (or five edits? or ten?) successfully within 24 hours (or 24 days? or 24 minutes?). The more clearly defined the desired metrics, the more obvious their implementation becomes.
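
A precisely defined metric translates almost directly into code. The sketch below computes, per onboarding experience, how many users were exposed and how many completed a first edit within 24 hours of registering. The record fields (experience, registeredAt, firstEditAt) and the sample values are made up for this example and do not come from any real schema.

```javascript
// Hypothetical records: the field names are assumptions for this sketch.
var DAY_MS = 24 * 60 * 60 * 1000;

// A user "converts" if a first edit happens within the window after registration.
function convertedWithinWindow( record, windowMs ) {
    return record.firstEditAt !== null &&
        record.firstEditAt - record.registeredAt <= windowMs;
}

// Tally exposed and converted users per onboarding experience.
function conversionByExperience( records, windowMs ) {
    var result = {};
    records.forEach( function ( record ) {
        var bucket = result[ record.experience ] ||
            ( result[ record.experience ] = { exposed: 0, converted: 0 } );
        bucket.exposed += 1;
        if ( convertedWithinWindow( record, windowMs ) ) {
            bucket.converted += 1;
        }
    } );
    return result;
}

// Made-up sample data; timestamps are milliseconds since registration opened.
var sample = [
    { userId: 1, experience: 'A', registeredAt: 0, firstEditAt: 2 * 60 * 60 * 1000 },
    { userId: 2, experience: 'A', registeredAt: 0, firstEditAt: null },
    { userId: 3, experience: 'B', registeredAt: 0, firstEditAt: 30 * 60 * 60 * 1000 }
];
var summary = conversionByExperience( sample, DAY_MS );
```

Changing the definition (five edits instead of one, or 24 minutes instead of 24 hours) would change only convertedWithinWindow and its arguments, which is one reason to settle the definition before implementation starts.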

Although you may feel the impulse to collect additional information because it seems interesting or exciting, resist the temptation. Collecting unnecessary data dilutes the focus and meaning of the schema, and adds implementation cost. Be particularly careful about personally identifiable information: any such information, such as IP addresses or browsing behavior, might trigger the need to throw out schema data on a rolling basis (e.g., a 30-day rolling window) to comply with law or internal policy.

If you have any questions about identifying appropriate metrics, please contact an analyst.

Drafting a schema

The schema is a JSON schema object collaboratively edited by the analysts, engineers, and product managers contributing to the project. Over time, and as each of the developing parties contributes expertise, the schema begins to more precisely define the data necessary for each analysis.

As the schema is being developed, engineers work on developing the code required to grab the specified data. At any point in the schema-drafting process, the schema can be deployed locally to test the implementation. Use the EventLogging devserver to print collected data to a terminal window.

For more information about schemas, please see Creating a schema.

Peer review

As the schema is developed, it benefits from the expertise of each of the contributing parties. Analysts, engineers, and product managers each bring their perspectives and knowledge to the schema, helping to ensure that it is viable and sound.

An analyst makes sure that the specified metrics are, in fact, appropriate and required for answering the posed question. For example, an early version of the GettingStarted schema specified that the system log the protection-level of editable pages. Although this information is interesting, it is not directly relevant to the original question (Which onboarding experience better helps people edit?), and the analyst knew to remove the property from the schema.

Engineers are experts on implementation cost, and can see which data will be easy to collect and which might require extensive processing, or even take down the website were the system to attempt to capture them. Engineers know which metrics can be collected reliably, and which not, and can make recommendations accordingly.

The product manager, who tracks how much time everyone puts into the project, has a good sense for when the cost of an engineering effort outweighs its value. The product manager will weigh in if an analysis, however interesting, is outside the needs of the organization.

Security/Privacy concerns

If you don't want all your data purged on a rolling 90-day basis, you will need to whitelist the table and the specific fields in the schema you want to keep. If you add new fields to an existing schema, you will need to whitelist those as well. Even given this purging, there are some fields we never collect.

This section is a work in progress, but here are some data policies to be aware of:

Analytics/EventLogging/Data retention and auto-purging

Analytics/EventLogging/Publishing

Finalizing a schema

Before a schema can be deployed, it must get a final review from an analyst, who ensures that it does not violate the Wikimedia Foundation’s privacy policies.

Deploying a schema

Once a schema has been completed to the satisfaction of its contributors and approved by an analyst, production code that logs to it can be deployed.

The Underlying technology section summarized the code needed to log to a schema. The schema's wiki page helps by providing actual code that loads the schema and logs to it. To view it, click the red ‘<>’ button at the top-right corner of the schema page. In extension.json, you'll add something like:

{
    "attributes": {
        "EventLogging": {
            "Schemas": {
                "GettingStarted": 5285779
            }
        }
    }
}

Under manifest_version 1 it would look like:

{
    "EventLoggingSchemas": {
        "GettingStarted": 5285779
    }
}

And in PHP to log an event:

EventLogging::logEvent( 'GettingStarted', 5285779, $event );

And the JavaScript:

mw.eventLog.logEvent( 'GettingStarted', { /* ... */ } );

You'll need to fill in the event with appropriate values for your schema's fields. Logging an event will transmit collected data to a central area where they are warehoused.

Note that the schema's wiki page does not automatically indicate whether it is in use. We recommend that you note this information on the schema’s Talk page (see Collaboration: Schema Talk pages for more information). Deployed schemas will automatically create a MySQL table (named SchemaName_versionNumber) for the collected data on the data store.

QA

EventLogging uses both client- and server-side validation to help ensure that the data collected are the correct ones. Although machine validation captures many errors, the data must also be reviewed by a human before they are deemed sound.

If client-side validation fails, in debug mode it displays a warning in the browser's JavaScript console (if it is open), immediately alerting developers that a schema is not working properly. To make validation errors more obvious, you can run user JavaScript that displays them in a banner:

// Show EventLogging validation errors in a dismissible bar at the top of page.
var $el = $( '<pre style="background: yellow; margin: 0; padding: 8px; position: fixed; top: 0; width: 100%; z-index: 99"></pre>' );
$el.click( function () { $el.empty().detach(); } );
mw.trackSubscribe( 'eventlogging.error', function ( topic, err ) {
    $el.text( function ( idx, text ) {
        return ( text && text + '\n' ) + err;
    } ).appendTo( 'body' );
} );

If you're developing on WMF wikis, add this snippet to https://meta.wikimedia.org/wiki/Special:MyPage/global.js and http://meta.wikimedia.beta.wmflabs.org/wiki/Special:MyPage/global.js

Server-side validation occurs on the server where events are ultimately stored. EventLogging uses a tiered model of data handling:

Events that meet the requirements specified in the schema are unpacked and broadcast so that they are available to subscribers. One subscriber stores valid events in a MySQL table, where analysts can easily query them in isolation or in joins with other data sets.

Validation failures trigger exceptions, which are written to standard error. In WMF's server configuration, stderr is captured by Ubuntu Upstart, so the failures are at /var/log/upstart/eventlogging-processor.log on the logging machine.

Events that do not conform to the schema are logged, but only in a raw bin.
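
The tiered handling described above can be pictured with a small sketch. Everything here is hypothetical: createProcessor, the isValid predicate, and the in-memory rawBin array stand in for the real processor, validator, and raw log, and whether valid events are also kept in raw form is not covered by this example.

```javascript
// Hypothetical sketch of tiered event handling; not the real EventLogging processor.
function createProcessor( isValid, subscribers ) {
    var rawBin = [];
    return {
        rawBin: rawBin,
        process: function ( event ) {
            if ( !isValid( event ) ) {
                // Non-conforming events are kept only in the raw bin,
                // and the failure is reported on standard error.
                rawBin.push( event );
                console.error( 'Validation failed: ' + JSON.stringify( event ) );
                return false;
            }
            // Valid events are broadcast to every subscriber.
            subscribers.forEach( function ( subscriber ) {
                subscriber( event );
            } );
            return true;
        }
    };
}

// A stand-in subscriber that stores events in memory instead of MySQL.
var stored = [];
var processor = createProcessor(
    function ( event ) { return typeof event.wasClicked === 'boolean'; },
    [ function ( event ) { stored.push( event ); } ]
);

var acceptedValid = processor.process( { wasClicked: true } );
var acceptedInvalid = processor.process( { wasClicked: 'yes' } );
```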

Analysts also examine the data to catch the errors machines cannot. A value can conform to the letter of the schema (which specifies that it have a string type), but be meaningless (a string of gibberish). Unit tests should test the soundness of the data, though sometimes these are difficult to implement.

Because analysts are familiar with the known patterns of human activity on the site, they are able to determine if the collected data are unrealistic and indicate an error in implementation or assumption. Analysts will be able to identify edge cases—the times when the data are unexpected, either because users have behaved in a way that was not anticipated, or because of a glitch in an experiment. Sometimes, these edge cases can simply be flagged; other times, they require a change in implementation. The analyst will decide how best to respond to the inconsistencies.

Analysis

Once the data have been collected and validated, they are ready to be analyzed. We are currently developing tools that will permit users to easily generate reports for funnel analyses and visualizations. We are also working on tools that will generate statistical tests, so that users can determine if an observed difference is significant, or random.

Creating a schema

Schemas describe data in a human- and machine-readable way. They specify the names and data types (integer, string, boolean, etc.) of event data, and are used to clarify, understand, and validate information. Every schema includes documentation about the referenced data so that all users can understand what a given data set contains, both before and long after the data are collected.

Schemas represent a new type of MediaWiki content, JSON schema objects, which adhere to version 3 of the JSON schema specification. Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be edited by all autoconfirmed users. The Talk page attached to each schema is unprotected and is used both to share information about the schema and its status, and as a forum for public discussion. Passers-by can, and are encouraged to, make good-faith improvements to schemas (improving the wording of a property description, for example). However, just because a passer-by can alter a schema does not mean that all changes are appropriate. Each schema requires much care and coordination to design, and drastically altering an existing schema to meet the needs of a different project can be disruptive.

It is important to note that schemas do not automatically generate data. Schemas are a definition of data and can be used to validate and describe data, but an engineer must still programmatically grab the event data. We encourage you to think of the schema as a contract between analysts, developers, and product managers. This contract, collaboratively developed, makes the choice of what data to collect explicit (titleID? Or pageTitle?) to minimize confusion both when implementing the model and, later, when analyzing the data that conform to it.

Creating a schema wiki page

Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be created and edited by all autoconfirmed users.

The schema page name is the name of the schema. It determines things like the name of its module in JavaScript and the name of the SQL table holding processed events, so choose carefully. When creating a schema wiki page, please adhere to the following naming conventions:

  • Use CamelCase (e.g., GettingStarted, Echo, AccountCreation)
  • No spaces

Once you have created the schema wiki page, edit its Talk ("Discussion") page and add the {{SchemaDoc}} template. This template identifies a contact person, information about the project, and the status of the schema itself. Specifying this information in a template automatically categorizes schemas and makes it easy to sort schemas by the template's parameters.

Using JSON schema syntax

Schemas are JSON schema objects, which define the required and optional properties of captured event data. The quickest way to get a sense for the syntax is to look at some existing schemas. When you first look at a schema, the human-readable version is displayed. Click the Edit tab to view and/or update the JSON schema code itself. You will notice a schema description followed by an array of schema properties.

Note that currently, only the JSON schema features most relevant to EventLogging have been implemented. These include:

  • description
  • type: boolean, integer, number, string, timestamp
  • required: true/false (if not set, defaults to false)
  • enum
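
To show how these few features fit together, here is a minimal sketch of a schema object and a validator that understands only type, required, and enum. The validate and matchesType functions are written for this example and are not EventLogging's validation code; the sample schema, including its descriptions, is likewise an illustration rather than a copy of a real schema page. Flagging undeclared properties is an assumption of this sketch.

```javascript
// Minimal, hypothetical validator covering only `type`, `required`, and `enum`.
// It sketches the idea and is not EventLogging's validation code.
function matchesType( value, type ) {
    switch ( type ) {
        case 'boolean': return typeof value === 'boolean';
        case 'integer': return typeof value === 'number' && Number.isInteger( value );
        case 'number': return typeof value === 'number' && isFinite( value );
        case 'string': return typeof value === 'string';
        default: return true; // Other types are not checked in this sketch.
    }
}

// Returns a list of human-readable problems; an empty list means the event is valid.
function validate( event, schema ) {
    var errors = [];
    Object.keys( schema.properties ).forEach( function ( name ) {
        var rule = schema.properties[ name ];
        if ( !( name in event ) ) {
            if ( rule.required ) {
                errors.push( 'Missing required property: ' + name );
            }
            return;
        }
        if ( rule.type && !matchesType( event[ name ], rule.type ) ) {
            errors.push( 'Wrong type for property: ' + name );
        }
        if ( rule.enum && rule.enum.indexOf( event[ name ] ) === -1 ) {
            errors.push( 'Value not in enum for property: ' + name );
        }
    } );
    // Assumption of this sketch: properties the schema does not declare are flagged.
    Object.keys( event ).forEach( function ( name ) {
        if ( !( name in schema.properties ) ) {
            errors.push( 'Undeclared property: ' + name );
        }
    } );
    return errors;
}

var schema = {
    description: 'Logs whether a fundraising banner was clicked.',
    properties: {
        wasClicked: {
            type: 'boolean',
            required: true,
            description: 'Did the user click the banner?'
        }
    }
};
var ok = validate( { wasClicked: false }, schema );
var bad = validate( { wasClicked: 'no' }, schema );
```

Here ok is an empty list, while bad contains a single type error for wasClicked.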

Using descriptions

Schemas use JSON descriptions to clarify the purpose of the schema itself as well as each of its individual properties.

The description of the schema should be brief—a line or two about the data the schema captures:

[Image: TemplateDescription.png]

More detailed information about the project (information about experimental conditions, for example) can be noted in the schema Talk page, and we encourage you to include additional context there.

The description of schema properties should focus on meaning, not implementation details. Although describing a property by the name of a corresponding variable might seem like a good idea, the meaning of the variable is only understood by a subset of schema users, and the description becomes meaningless if and when the variable name changes.

In the case of a value that is a boolean type, the property description is often best posed as a question:

[Image: SchemaIsAnon.png]

Using enum

JSON enums are used when a value is required to be one of a known set of values. A good example of a situation for which we recommend the use of enum is a funnel, which captures information about a known flow of possible user actions:

[Image: SchemaEnum.png]

For instances in which a value is required to be a single known value (e.g., that all users be newly registered), it is better to use a boolean type to describe the field:

[Image: SchemaBoolean.png]
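
As a textual counterpart to the two cases above, the following sketch shows what an enum property for a funnel and a boolean property for a single known condition might look like. The property names, the funnel step names, and the isAllowed helper are hypothetical and chosen only for illustration.

```javascript
// Hypothetical schema properties; the funnel step names are made up for illustration.
var funnelProperties = {
    // An enum fits a known set of possible values, such as the steps of a funnel.
    action: {
        type: 'string',
        required: true,
        enum: [ 'impression', 'click', 'submit' ],
        description: 'Which step of the funnel did the user reach?'
    },
    // A boolean fits a single known condition that is either met or not.
    isNew: {
        type: 'boolean',
        required: true,
        description: 'Is the user newly registered?'
    }
};

// An enum check is a membership test against the allowed values.
function isAllowed( property, value ) {
    return property.enum.indexOf( value ) !== -1;
}

var accepted = isAllowed( funnelProperties.action, 'click' );
var rejected = isAllowed( funnelProperties.action, 'hover' );
```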

Useful tools

  • JSONLint is a web-based JSON validator and formatter. Use this to catch simple errors in your schema's syntax.
  • JSONschemaLint is a web-based tool that validates a JSON structure against a JSON schema. You can paste your schema into this and see if it validates a sample event such as { "wasClicked": false } (note the quotation marks; JSON is stricter than a plain JavaScript object literal).
  • JSONschema.net will create a JSON schema from an existing JSON object. Note that this tool generates schema elements that are not currently implemented by EventLogging (e.g., "id").

Choosing which data to capture

One of the choices that is made and/or refined as a schema is developed is which specific data to capture. Some data are ‘expensive,’ requiring database queries and processing to obtain; other data are easy to capture, and so require fewer resources.

When thinking about your schema and which specific data to grab, a good place to start is this list of easily available information.

In addition to using low-cost data whenever possible, the data you choose to gather should be the most reliable data possible. A pageID is more reliable than an articleTitle, for example, as a page can be renamed, while the pageID is constant. Page impressions are more reliable than page clicks, and userIDs are more reliable than userStrings, which are strings of characters (English, Chinese, etc.) that can be difficult to handle.

Choosing schema property names

EventLogging does not enforce a standard vocabulary, and you are welcome to create property names (i.e., schema keywords) that best describe the data you’d like to collect. That said, we strongly encourage you to standardize the property names used by your project, and, if relevant, used commonly across the organization. For example, many analyses require information about users (e.g., whether they are newly registered or anonymous). By using a consistent vocabulary to refer to these qualities (isNew, isAnon), the data collected over time and with multiple schemas can be more easily compared and understood.

Currently, the best place to see how schema properties have been specified and used in the past is here.

When creating your own schema property names, please adhere to the following naming conventions:

  • use headlessCamelCase (e.g., pageTitle, userID, editCount)
  • no spaces

Editing a deployed schema

It’s perfectly fine to make edits to a deployed schema, and doing so will not compromise a currently running data collection job. When you update a schema, MediaWiki automatically gives it a new revision number, just like any other wiki page. Since event logging code always references the schema by name + revision, all current data-collection jobs will continue to point to the previous schema version, until source code is explicitly updated to refer to the newer version. At that point the system will automatically create a new SQL table named SchemaName_revisionNNN.
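
The effect of pinning a revision can be sketched in a few lines. The pinnedRevisions object and destinationTable helper are hypothetical stand-ins for the schema registration and table-naming logic, and the second revision ID is made up; the table name follows the name-plus-revision pattern seen in the MySQL example earlier in this guide.

```javascript
// Hypothetical registry mirroring the idea of pinning a schema to a revision.
var pinnedRevisions = { GettingStarted: 5285779 };

// The destination table is derived from the schema name and the pinned revision.
function destinationTable( schemaName, registry ) {
    return schemaName + '_' + registry[ schemaName ];
}

var before = destinationTable( 'GettingStarted', pinnedRevisions );

// Editing the schema page creates a new revision on the wiki, but nothing changes
// for the logging code until the pinned revision is updated in source.
var newRevisionId = 5285780; // Made-up revision ID for illustration.
var unchanged = destinationTable( 'GettingStarted', pinnedRevisions );

// Only an explicit update in source code switches logging to the new revision.
pinnedRevisions.GettingStarted = newRevisionId;
var after = destinationTable( 'GettingStarted', pinnedRevisions );
```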

Schemas and schema versions

When running an experiment—testing to see which of two user interfaces is more effective, for example—you will often run multiple iterations to test different factors and/or interfaces, or to correct an error in the original implementation. Though the schema itself might not change drastically, the experimental conditions could be dramatically different.

We recommend that you create a new schema file for experimental iterations that reflect a substantial change to the schema or to the experiment/experimental conditions. This way, the information that is specific to the iteration can be documented in the schema Talk page, and the new iterations will not be cluttered with legacy data.

The page names of new schemas created for iterations should include the name of the original schema, along with a brief notation that reflects the specific iteration:

  • GettingStarted
  • GettingStarted0B1
  • GettingStarted0B2
  • …

Small changes to the schema or its implementation do not necessarily warrant a new schema page, and can be identified with a schema property. For example, the GettingStarted schema uses a property called ExperimentID to identify minor bug fixes made over the course of a single experimental iteration. In this case, neither the schema nor the experiment changes, but the analyst still wishes to capture information about the implementation change in case it impacts the data.

If you are unsure whether a schema edit calls for creating a new schema, please ask an analyst.

Collaboration

One of the strengths of EventLogging is that it facilitates collaboration via the schema file. Because many people are involved in creating a schema, it is also important that all parties be diligent about documenting their work. Documentation appears, or should appear, in each of the following places:

JSON descriptions

The JSON descriptions clarify both the purpose of the schema and the meaning of its individual values. Use clear, concise language in each description so that people developing the schema and referring to it in the future can understand the meaning.

Edit summary for Wiki schema edits

Each time you make a change to a schema file, please document what you have changed in the edit summary field. Providing this information helps other users understand how and why the file was changed.

Schema Talk pages

The Talk page associated with each schema is a place for free-form discussion and a good venue in which to raise questions and make comments about a schema. We also encourage schema authors to use a schema's Talk page to:

  • Provide additional details about the schema or the considerations that went into its design.
  • If the schema is used for an experiment, note details about the experimental conditions.
  • Log and justify major changes to a schema.

Put {{SchemaDoc}} at the top of each schema's Talk page (see sample). This template standardizes the presentation of the schema's status (draft, active, inactive, or deprecated) and the name of a contact person.

Programming/Instrumenting your code

Extension:EventLogging/Programming has tips and suggestions for developers writing code to log events.

Analyzing EventLogging data

EventLogging captures event records and broadcasts them so that they are available to be processed and stored in the fashion that best meets your needs. Currently, data are stored in MySQL tables and in MongoDB (in their native JSON format), but new clients can subscribe to the stream of EventLogging records using Kafka and process or store them in other ways (see the simple Python sample code for doing so). An existing mobile client is used to generate real-time metrics about mobile app usage. See also https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Accessing_Data

Many of our existing analytical tools are designed to work with SQL tables, including event records. Using MySQL for EventLogging records also permits them to be easily joined with MediaWiki's user and transaction records, which are also stored in MySQL tables.

The MySQL tables for EventLogging data are created on the s1-analytics server; see wikitech:Analytics/Data access for more details about accessing this and other data sources. These tables contain valid event records that can be crunched and analyzed in a variety of ways. In the following sections, we will look at strategies for approaching some common types of analyses (e.g., event counting, data visualization, funnel analyses, and cohort definition).

Counting events

Counting events—how many users click a particular link or view an experimental treatment, for example—is one of the most common types of EventLogging analyses. Counts can be either ‘raw’ or ‘unique.’

Raw counts are used when there is either no need to deduplicate records (e.g., an analyst is concerned only with the overall number of edits, not the number of editors), or when the records themselves are unique because of how they are defined (e.g., records are only collected for new users viewing a page for the first time).

Unique counts are used when records must be deduplicated (e.g., we are interested in counting only one event per unique userID) or if the analysis depends on additional MediaWiki or page request data. The way in which records are deduplicated depends on the nature of the data and what makes them meaningfully unique. In many cases, records will be deduplicated by unique userIDs, though sometimes, it may make more sense to deduplicate by IP address or tokens.
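The difference between the two kinds of count can be sketched in a few lines of JavaScript. This is only an illustration: the sample records, the field names under event, and the helper names rawCount and uniqueCount are assumptions made for the example, not part of EventLogging.

```javascript
// Hypothetical sample of decoded event records (not real data).
const records = [
  { schema: 'Example', event: { action: 'click', userId: 7 } },
  { schema: 'Example', event: { action: 'click', userId: 7 } },
  { schema: 'Example', event: { action: 'click', userId: 12 } },
  { schema: 'Example', event: { action: 'impression', userId: 12 } }
];

// Raw count: every matching record counts once, with no deduplication.
function rawCount(records, action) {
  return records.filter((r) => r.event.action === action).length;
}

// Unique count: matching records are deduplicated by a caller-supplied key
// (a user ID here, but it could equally be a token or an IP address).
function uniqueCount(records, action, keyOf) {
  const seen = new Set();
  for (const r of records) {
    if (r.event.action === action) {
      seen.add(keyOf(r));
    }
  }
  return seen.size;
}

const rawClicks = rawCount(records, 'click');
const uniqueClicks = uniqueCount(records, 'click', (r) => r.event.userId);
console.log(rawClicks, uniqueClicks); // 3 2
```

Passing the key as a function keeps the choice of what makes a record "meaningfully unique" out of the counting logic itself.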

Visualizing EventLogging data

Visually representing data so they can be quickly and more easily understood is another common type of EventLogging analysis. Typically, visualizations represent a time series, showing the number of raw or unique events that occur in a given time period (e.g., days or hours).
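As a rough sketch of how such a time series might be prepared, the following groups records into per-day counts. It assumes that each record carries a timestamp in seconds since the Unix epoch (as in the sample record shown under Debugging); the function name dailyCounts and the sample values are made up for the example.

```javascript
// Group event records into raw per-day counts, keyed by 'YYYY-MM-DD' (UTC).
// Assumption: record.timestamp is in seconds since the Unix epoch.
function dailyCounts(records) {
  const counts = {};
  for (const r of records) {
    const day = new Date(r.timestamp * 1000).toISOString().slice(0, 10);
    counts[day] = (counts[day] || 0) + 1;
  }
  return counts;
}

// Hypothetical records: two on one day, one on the next.
const series = dailyCounts([
  { timestamp: 1363637584 },
  { timestamp: 1363641184 },
  { timestamp: 1363723984 }
]);
console.log(series); // { '2013-03-18': 2, '2013-03-19': 1 }
```

The resulting object maps directly onto the x and y values of a daily chart; an hourly series would only need a longer slice of the ISO string.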

Many WMF teams use a dashboard to visualize data; the research team maintains a list of dashboards. Many of these render counts of events from EventLogging.

wikitech:Analytics/Dashboards describes the steps involved in setting up a dashboard using the analytics team's infrastructure. You will need help, so please send e-mail to analytics@wikimedia.org or ask in #wikimedia-analytics.

Funnel analysis

A funnel analysis, which provides information about the number of people who complete and fail to complete a defined activity flow (e.g., selecting a page, making an edit, and then successfully saving the edit) is often used to describe user activity and/or to compare a test group to a control group. For example, a funnel analysis might be used to compare users presented with a new UI to those interacting with the existing one to see which group is more likely to complete a process of interest.

A ‘descriptive’ funnel analysis looks at the number of users who enter a funnel, and reports how many of those users complete the funnel (e.g., by successfully saving an edit) or complete an intermediary step of the funnel (e.g., by opting to edit and then doing so successfully). Each completion rate (CMP) is specified as a ratio: the number of users who complete the funnel (or funnel step) divided by the total number of users who have an opportunity to do so. Note that the total number of users entering each funnel step will shrink along the way, as users drop out instead of finishing the process (instead of saving an edit, users might navigate to another page, for example, or they might attempt to save an edit and fail). The proportion of users lost at each step of the process is called the ‘bounce rate.’ The proportion of users who attempt each step of the process is called the ‘click-through rate.’ The ‘conversion rate’ represents the number of users who successfully complete the funnel divided by the total number of funnel impressions (which reflects both the users who enter the funnel and those who could have entered the funnel but chose not to do so).

When creating a schema for descriptive funnel analysis, it is helpful to create a consistent funnel ID that allows one to track a single user's movement through a funnel. This is separate from a user ID, as a single user may enter the funnel more than once (when editing an article several times, for example).
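The ratios described above can be sketched as follows. The step names, the funnelId field, and the helper names are assumptions chosen for the illustration; a real schema would define its own.

```javascript
// Ordered funnel steps; the names are illustrative only.
const STEPS = ['impression', 'edit', 'save'];

// Count the distinct funnel IDs seen at each step.
function funnelCounts(events, steps) {
  const seen = steps.map(() => new Set());
  for (const e of events) {
    const i = steps.indexOf(e.action);
    if (i !== -1) {
      seen[i].add(e.funnelId);
    }
  }
  return seen.map((s) => s.size);
}

// Completion rate of each step, relative to the step before it.
function stepRates(counts) {
  return counts.slice(1).map((n, i) => (counts[i] === 0 ? 0 : n / counts[i]));
}

// Hypothetical events: four funnels start, two reach 'edit', one saves.
const events = [
  { funnelId: 'a', action: 'impression' },
  { funnelId: 'a', action: 'edit' },
  { funnelId: 'a', action: 'save' },
  { funnelId: 'b', action: 'impression' },
  { funnelId: 'b', action: 'edit' },
  { funnelId: 'c', action: 'impression' },
  { funnelId: 'd', action: 'impression' }
];

const counts = funnelCounts(events, STEPS);
const rates = stepRates(counts);
const conversionRate = counts[counts.length - 1] / counts[0];
console.log(counts, rates, conversionRate); // [ 4, 2, 1 ] [ 0.5, 0.5 ] 0.25
```

Counting by funnel ID rather than by user ID means that a user who enters the funnel twice contributes two independent passes.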

A ‘test’ funnel analysis compares two groups of users. For a test period, each user group interacts with a different version of a funnel (an existing UI and a test UI intended to improve performance, for example). At the end of the experiment, the EventLogging data are analyzed to see if the conversion rates for the two funnels differ, and if that difference is statistically significant. The funnel conversion rates can then be used to generate predictions: for example, what would happen if the experimental UI were presented to all users (instead of just the test group) for the next six months, and how that performance would compare to the baseline also captured by the experiment.
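This guide does not prescribe a particular significance test. One common choice for comparing two conversion rates is a pooled two-proportion z-test, sketched below with made-up counts; the function name and the 5% threshold are assumptions for the example, and an analyst may well prefer a different method.

```javascript
// Pooled two-proportion z-test: one common way (not necessarily the one
// used by any given analyst) to compare two conversion rates.
function twoProportionZ(conversions1, n1, conversions2, n2) {
  const p1 = conversions1 / n1;
  const p2 = conversions2 / n2;
  const pooled = (conversions1 + conversions2) / (n1 + n2);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p2 - p1) / standardError;
}

// Hypothetical results: control converts 100 of 1000, test 130 of 1000.
const z = twoProportionZ(100, 1000, 130, 1000);

// |z| > 1.96 corresponds to a two-sided test at the 5% level.
const significant = Math.abs(z) > 1.96;
console.log(z.toFixed(2), significant); // 2.10 true
```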

Defining a cohort from EventLogging data

Sometimes only a specific subset of users is of interest. For example, an outreach program might generate two hundred new users. In order to evaluate the success of the program, we would like to know how active those users are over time. To do this, we must identify the users of interest, measure their activity, and compare it to a baseline. For this, we use cohorts.

A cohort is a set of user IDs for users that share a trait, or a combination of traits, that are of interest. A cohort might consist of new users that joined in response to a particular campaign and that are (or are not) attached to another wiki. A cohort could consist only of active editors that joined at a specific time. A cohort could consist of users that were part of an experiment. A cohort can be defined by whatever common traits are interesting and relevant.
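In code, a cohort reduces to a set of user IDs selected by some predicate. The sketch below uses invented user records and an invented campaign name purely to show the shape of the operation; it does not reflect how cohorts are stored.

```javascript
// Hypothetical user records; the fields and values are made up.
const users = [
  { userId: 1, campaign: 'outreach', editCount: 12 },
  { userId: 2, campaign: 'outreach', editCount: 0 },
  { userId: 3, campaign: null, editCount: 40 },
  { userId: 4, campaign: 'outreach', editCount: 7 }
];

// A cohort is the set of user IDs whose records share the traits of interest.
function defineCohort(users, hasTraits) {
  return new Set(users.filter(hasTraits).map((u) => u.userId));
}

// Users who joined through the campaign and have made at least five edits.
const cohort = defineCohort(
  users,
  (u) => u.campaign === 'outreach' && u.editCount >= 5
);
console.log([...cohort]); // [ 1, 4 ]
```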

Current cohorts reside in MySQL (db1047.prod.usertags and db1047.prod.usertags_meta) and are available for use with the Metrics API. You can perform cohort analysis at https://metrics.wmflabs.org/

New cohorts can be generated by following these steps:

  •  ???

Best practices: Do's and don'ts

Parsimony

Don't capture data that are not required to answer the question, or that you can easily obtain from the database, as grabbing them adds an extra implementation cost and affects the readability of the schema. An analyst will know which values are easier to reconstruct than to capture.

Be bold and prune

Don't be afraid to prune schema properties that are neither directly relevant to the question nor used to control or validate data.

Redundancy

Always keep parsimony and pruning in mind, but know that there are cases for which it is less costly to have some redundancy, or for which redundancy is required to help validate unreliable data. For example, click events are notoriously difficult to capture, and it is good practice to grab a targetTitle with each one to help validate the data. In other cases, a value (whether or not a user is new, for example) may prove costly to reconstruct from a database. In such cases, it is more efficient to simply add an isNew property to the schema as a control variable.

Focus on high-level description, not implementation details

Do not tie a description of a value to a current implementation. Do not use a button to describe an action (unless you are comparing buttons), but instead try to think of the user action that is relevant and will be comparable long after the specific button has been replaced or moved. Obscure variable names are confusing to analysts and others not directly involved in implementation, and they become meaningless if the codebase changes and the variable name is changed.

Standardization

Whenever possible, use consistently named properties (e.g., always userId for a user ID). Don't reinvent a schema or format: use the schema library if appropriate, and use properties and meanings previously defined by your project, or by others in the organization (see #Data fields for some common fields). Standardization helps make schemas more readable, and permits analysts to better intersect and analyze data.

Enum is your friend

Use enum to require that a value be one of an array of specified values. See Creating a schema: Using JSON schema syntax for an example.
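As a small illustration of what enum buys you, the sketch below defines a hypothetical schema fragment and checks events against it. The enumErrors helper is written for this example and checks enum constraints only; it is not the validator that EventLogging itself uses.

```javascript
// Hypothetical schema fragment: "action" must be one of three values.
const schema = {
  properties: {
    action: {
      type: 'string',
      enum: ['impression', 'click', 'submit']
    }
  }
};

// Minimal enum check only; this is not a full JSON Schema validator.
function enumErrors(event, schema) {
  const errors = [];
  for (const [name, spec] of Object.entries(schema.properties)) {
    if (spec.enum && name in event && !spec.enum.includes(event[name])) {
      errors.push(name + ' must be one of: ' + spec.enum.join(', '));
    }
  }
  return errors;
}

const okErrors = enumErrors({ action: 'click' }, schema);
const badErrors = enumErrors({ action: 'hover' }, schema);
console.log(okErrors.length, badErrors);
```

With the enum in place, a misspelled or undocumented action value is rejected at validation time instead of silently polluting the data.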

Be wary of platform differences

Implementing different schemas on different platforms is often necessary, and different actions and layouts can lead to different structures. When planning a schema, look at multiple platforms to see if a schema already exists, consider whether counting or comparisons across platforms will be relevant, and consider whether the same or a similar schema, or a combined schema, should be used.

Know the tradeoffs

There can be more than one way to capture an event, and it's good to be aware of the tradeoffs involved for each way. For example, when logging users, you may rely on token, username, or userId. Tokens are unique and totally anonymized, but they cannot be joined with production databases to give you more information about your users; usernames can come with special characters and encoding that break in analysis; userIds are usually ideal, but sometimes you may need to filter out a large group of inconsistently named test/staff user accounts that you won't recognize by number, in which case username is the better option.

Schema library

{placeholder} Repository for standardized, best-practice, commonly used schema elements (e.g., funnels, buckets, user identification, etc.).

For now, see

Data fields

Built-in data fields

In addition to the object you pass to it, EventLogging logs additional information about the request in both the client-side mw.eventLog.logEvent() and the server-side EventLogging::logEvent(), such as isValid, which indicates whether the object you pass in matched the schema+id you specified. Event stream processing logs additional information, such as whether an event was truncated in transmission, a server timestamp, and an obfuscated client IP address.

All these fields are described in m:Schema:EventCapsule.

Standard data fields

Here are names and idioms used by convention in similar fashion in current schemas.

isAnon
boolean; true if user has not logged in (opposite of "authenticated"). In JavaScript, call mw.user.isAnon()

If the user has logged in (isAnon is false), then we sometimes log:

editCount
integer; how many edits a logged-in user has made. In JavaScript, mw.config.get( 'wgEditCount' ).
revId, revisionId
integer; the revision ID of the current page (meaningless for special pages, actions like View history, etc.). Note that revisionId is unique in a wiki, so it alone is enough to identify a page. In JavaScript, mw.config.get( 'wgCurRevisionId' ).
pageTitle
string; the title of the page the user is editing. In JavaScript, mw.config.get( 'wgTitle' ).
Note that this doesn't work for Special pages and other namespaces.
pageId
integer; the article ID of the current page. In JavaScript, mw.config.get( 'wgArticleId' ).
pageNs
integer; the namespace of the current page. In JavaScript, mw.config.get( 'wgNamespaceNumber' ).
userId
integer; the user ID of a logged-in user. Privacy note: information about the activities of logged-in users is already available in Special:RecentChanges, Special:UserContributions, etc. In JavaScript, mw.config.get( 'wgUserId' ) (starting with MediaWiki 1.21).
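The fields above might be gathered into an event object along these lines. The helper name standardFields is hypothetical, and the stand-in config and user objects exist only so the sketch runs outside a browser; on a wiki page you would pass mw.config and mw.user instead.

```javascript
// Collect the conventional fields listed above into one object.
// "config" needs a get( key ) method and "user" an isAnon() method,
// which is the shape of mw.config and mw.user in the browser.
function standardFields(config, user) {
  const fields = {
    isAnon: user.isAnon(),
    pageId: config.get('wgArticleId'),
    pageNs: config.get('wgNamespaceNumber'),
    revId: config.get('wgCurRevisionId')
  };
  // User-specific fields only make sense for logged-in users.
  if (!fields.isAnon) {
    fields.userId = config.get('wgUserId');
    fields.editCount = config.get('wgEditCount');
  }
  return fields;
}

// Stand-ins with made-up values, used here in place of mw.config / mw.user.
const fakeConfig = new Map([
  ['wgArticleId', 1],
  ['wgNamespaceNumber', 0],
  ['wgCurRevisionId', 153],
  ['wgUserId', 42],
  ['wgEditCount', 7]
]);
const loggedIn = standardFields(fakeConfig, { isAnon: () => false });
const anonymous = standardFields(fakeConfig, { isAnon: () => true });
console.log(loggedIn, anonymous);
```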

Common data fields

There are no standard values for these, but different data models use the same field name for their own values.

action
string, enum; identifies the different actions the data model logs, such as 'impression', 'click', 'submit', 'accept' (a task), 'create' (a page).
bucket
string, enum; this records which alternative is presented to a user. For example, the Account Creation User Experience randomly showed users either 'control_3' (original form) or 'acux_3' (fancy validating form). Sometimes the bucket is derived algorithmically and need not be logged; e.g. newly signed-up users with an even userId are presented the test, odd users are presented the control.
campaign
string; value of incoming query parameter identifying the source of an action. For example the Article Feedback Tool's "create an account" call to action links to the account creation with ?campaign=aftv5_cta4 in the query string.
token
string; a unique random token per browser, stored in the mediaWiki.user.sessionId cookie. In JavaScript, calling mw.user.id() will generate this for an anonymous user.
version
integer; a number representing changes to the conditions (not the data model), e.g. bump it when deploying code that presents a different experience. #Schemas and schema versions discusses when to use this versus creating a new schema.
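The even/odd rule mentioned under bucket can be written as a one-line function. The function name and the bucket labels are assumptions for the example; because the bucket is recomputable from userId, it need not be logged.

```javascript
// Derive the bucket from the user ID: even IDs get the test experience,
// odd IDs get the control. The labels are illustrative.
function bucketFor(userId) {
  return userId % 2 === 0 ? 'test' : 'control';
}

console.log(bucketFor(1024), bucketFor(1025)); // test control
```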

Debugging

EventLogging provides a debugging tool, the eventlogging-devserver. It is often useful for monitoring event records during schema development, because it outputs captured event records to a terminal window (instead of to a data store).

Installing the eventlogging-devserver

  1. SSH into Vagrant.
  2. Go to /vagrant/mediawiki/extensions/EventLogging
  3. Run 'git submodule update --init'
  4. Go to /vagrant/mediawiki/extensions/EventLogging/server/
  5. Run 'sudo python setup.py install'
  6. After the install has finished, run 'eventlogging-devserver --port 8100 --verbose'
  7. First-time users should copy the configuration values output by the eventlogging-devserver and place them in the LocalSettings.php file.

(Note that this also works just fine if you run the extension outside Vagrant.)

If the installation is successful, this is what you will see after running 'eventlogging-devserver':

   ___                        _
  / (_)                    \_|_)                 o
  \__        _   _  _  _|_   |     __   __,  __,     _  _    __,
  /    |  |_|/  / |/ |  |   _|    /  \_/  | /  | |  / |/ |  /  |
  \___/ \/  |__/  |  |_/|_/(/\___/\__/ \_/|/\_/|/|_/  |  |_/\_/|/
-----------------------------------------/|---/|--------------/|----------
  (C) Wikimedia Foundation, 2013         \|   \|              \|

# Ensure the following values are set in LocalSettings.php:
require_once( "$IP/extensions/EventLogging/EventLogging.php" );
$wgEventLoggingSchemaIndexUri = 'http://meta.wikimedia.org/w/index.php';

# Listening to events.


Now when you log events, the eventlogging-devserver will output the URL-encoded JSON sent by the browser (under the heading 'request'), the decoded event record (under the heading 'event'), and whether or not the event validates against the schema (under the heading 'validation'):


-- request ---------------------------------------------------------------
?%7B%22event%22%3A%7B%22userAgent%22%3A%22Mozilla%2F5.0%20(Macintosh%3B%20Intel%20Mac%20OS%20X%2010_8_2)%20AppleWebKit%2F537.33%20(KHTML%2C%20like%20Gecko)%20Chrome%2F27.0.1431.0%20Safari%2F537.33%22%2C%22isHttps%22%3Afalse%2C%22isAnon%22%3Atrue%2C%22waiting%22%3A485%2C%22receiving%22%3A1%2C%22rendering%22%3A3536%2C%22pageId%22%3A1%2C%22revId%22%3A153%2C%22action%22%3A%22view%22%7D%2C%22clientValidated%22%3Atrue%2C%22revision%22%3A5323808%2C%22schema%22%3A%22NavigationTiming%22%2C%22webHost%22%3A%22127.0.0.1%22%2C%22wiki%22%3A%22testwiki%22%7D; 1.0.0.127.in-addr.arpa 0 2013-03-18T20:13:04 127.0.0.1
-- event -----------------------------------------------------------------
{
  "clientIp": "8cb1d219fef9bf92b556c59951381e24c40a7aa7",
  "clientValidated": true,
  "event": {
    "action": "view",
    "isAnon": true,
    "isHttps": false,
    "pageId": 1,
    "receiving": 1,
    "rendering": 3536,
    "revId": 153,
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1431.0 Safari/537.33",
    "waiting": 485
  },
  "recvFrom": "1.0.0.127.in-addr.arpa",
  "revision": 5323808,
  "schema": "NavigationTiming",
  "seqId": 0,
  "timestamp": 1363637584,
  "webHost": "127.0.0.1",
  "wiki": "testwiki"
}
-- validation ------------------------------------------------------------
Valid.
--------------------------------------------------------------------------
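The relationship between the 'request' and 'event' sections can be illustrated with a short round trip. This sketch only decodes the URL-encoded JSON payload; it does not reproduce what the devserver does with the trailing fields (host, sequence number, timestamp, IP address). The function name decodeRequest and the shortened sample payload are assumptions for the example.

```javascript
// Build a request line the way the sample above is laid out:
// '?' + URL-encoded JSON, followed by '; ' and some trailing fields.
const payload = {
  event: { action: 'view', isAnon: true, pageId: 1 },
  revision: 5323808,
  schema: 'NavigationTiming',
  wiki: 'testwiki'
};
const requestLine =
  '?' + encodeURIComponent(JSON.stringify(payload)) +
  '; 1.0.0.127.in-addr.arpa 0 2013-03-18T20:13:04 127.0.0.1';

// Recover the JSON payload: drop the leading '?', cut at the first ';'
// (a literal ';' inside the payload is always escaped as %3B), then decode.
function decodeRequest(line) {
  const end = line.indexOf(';');
  const query = line.slice(1, end === -1 ? undefined : end);
  return JSON.parse(decodeURIComponent(query));
}

const decoded = decodeRequest(requestLine);
console.log(decoded.schema, decoded.event.action); // NavigationTiming view
```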


See logging errors on console

A sample event:

a = {
 "type": "image",
 "contentHost": "pl.wikipedia.org",
 "isHttps": false,
 "total": 492,
 "urlHost": "upload.wikimedia.org",
 "status": 200,
 "XCache": "cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)",
 "varnish1": "cp1063",
 "varnish1hits": 0,
 "varnish2": "cp3016",
 "varnish2hits": 1,
 "varnish3": "cp3018",
 "varnish3hits": 0,
 "XVarnish": "265764709, 583884279 576752145, 4160384575",
 "contentLength": 111608,
 "age": 3968,
 "timestamp": 1425247200,
 "lastModified": 1383480710,
 "redirect": 0,
 "dns": 0,
 "tcp": 38,
 "request": 221,
 "response": 205,
 "cache": 28,
 "uploadTimestamp": "20070823000000",
 "imageWidth": 1024,
 "country": "PL"
}

Validate:

mw.loader.using('schema.MultimediaViewerNetworkPerformance', function () {
   mw.eventLog.validate(a, mw.eventLog.schemas.MultimediaViewerNetworkPerformance.schema);
} );

Log an event that will result in errors:

mw.eventLog.logEvent(a)


Errors should be visible in console.

Also, logging errors are sent to mw.track (see https://gerrit.wikimedia.org/r/#/c/99547/6). Running this JavaScript in the console shows the errors sent by EventLogging when processing events:

mw.trackSubscribe('eventlogging.error', function(topic, data) {console.log(data)});

See logging in your browser

Events logged by JavaScript can be exposed both in the JavaScript console and as popup notifications. To activate this feature, update your user options by running the following in your JavaScript console. When you are logged in, the setting is saved and applies to all future page loads; it does not persist for an anonymous user:

   mw.loader.using( 'mediawiki.api' ).then( function () {
       new mw.Api().saveOption( 'eventlogging-display-web', '1' );
   });

The popup notifications contain a shortened version of the event. Clicking the popup will bring up a modal box that contains the full event.