Extension:EventLogging/Guide

What is EventLogging?

If you have a question about how visitors are interacting with our site, EventLogging can capture the data required to answer it. EventLogging can be used to send desired information to a backend event service, which can then be used to make that data easily available for analysis.

The goal of EventLogging is not to capture every action on the site, but to capture rich, well-documented data that can be efficiently analyzed and used over the long term. Data captured by EventLogging are validated, versioned, and carefully described so that error and misunderstanding are minimized.

Every event that EventLogging logs must have a well-defined schema. Users can scrutinize the data modeling and collection process, offer insights, and discuss concerns. A schema does not automatically grab data; rather, it provides a way for analysts and engineers to explicitly model their data, and to integrate their data into downstream systems for later analysis. The schema becomes a contract, a consensus as to the meaning and implementation of a data model:

  • Developers write code to log events that match the schema.
  • The schema tells analysts what information is in the logged events.

EventLogging helps ensure that the data collected and, ultimately, analyzed to answer our questions, are the desired data. To ensure both reliable delivery and the validity of the collected data, events are POSTed to a backend event service. WMF uses EventGate. Collected data are eventually ingested into a data warehouse, where valid events are stored in SQL tables that are automatically created by the system.

Underlying technology

This section is about the general technology behind MediaWiki's EventLogging extension. If you'd like to know how the WMF-specific EventLogging system is backed, you can read more here: wikitech:Analytics/EventLogging and wikitech:Event_Platform.

EventLogging is a MediaWiki extension that performs server-side and client-side logging. It does not include "backend" code to transmit, process, or store the collected information; that is implemented by other services. EventLogging expects to POST events to an HTTP event intake service. This service accepts a JSON array of event objects. Each event must minimally specify the schema it conforms to, as well as the name of the event data stream it belongs in. A schema is like a type declaration for your event data. Schemas can be reused for multiple streams of events. (More on schemas and event streams later.)
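As a rough illustration of the request body described above, the sketch below batches two events into a JSON array. The `meta.stream` field used to carry the stream name and the helper name `buildIntakeBody` are assumptions made for this example; the text above only says that each event must name its schema and its stream.

```javascript
// Hypothetical helper: attach to each event the two things it must declare,
// its schema and its stream. The `meta.stream` location is an assumption
// made for illustration.
function buildIntakeBody( streamName, schemaUri, eventDataList ) {
	const events = eventDataList.map( ( data ) => Object.assign(
		{ $schema: schemaUri, meta: { stream: streamName } },
		data
	) );
	// The intake service is described as accepting a JSON array of events.
	return JSON.stringify( events );
}

const body = buildIntakeBody(
	'analytics.button_click',
	'/analytics/button/click/1.0.0',
	[ { user_id: 1234 }, { user_id: 5678 } ]
);
```

The resulting string is what an HTTP client would send as the POST body; the transport itself is left out of the sketch.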

To use EventLogging, you call a function and pass it the name of the event stream the event belongs to, and the event data object.

To log an event on the server in PHP, call EventLogging::submit() (TBD). To log an event in client-side JavaScript code, call mw.eventLog.submit(). If you are unsure whether to log an event on the client or the server, consider the data you are collecting. Transaction information (e.g., the addition, deletion, or modification of information stored in the MediaWiki database) is easiest to capture on the server side. Information about how a user is interacting with the browser environment (e.g., a page view or notification) is easiest to capture on the client side. NOTE: If you are attempting to model a core MediaWiki state change event, consider using Extension:EventBus instead.

Server-side logging:

EventLogging::submit( string $streamName, array $event )

$streamName

The name of the event stream. For example, "analytics.edit_button_click".

$event

An associative array of event data conforming to the schema, including the versioned $schema URI. Example:

[ "$schema" => "/analytics/button/click/1.0.0", "user_id" => 1234 ]

Client-side logging:

mw.eventLog.submit( streamName, event )

streamName

The name of the event stream. For example, "analytics.edit_button_click".

event

A plain JavaScript object of event data conforming to the schema, including the versioned $schema URI. Example:

{ "$schema": "/analytics/button/click/1.0.0", "user_id": 1234 }

Events can only be logged as wholes. If a schema's properties span both server-only and client-only data, the schema can be split into two complementary schemas, or the developer can do some extra work to make sure all values are available on one side or the other. The developer may also suggest an alternate combination of schema values that would capture the required information.

Once the schema has been finalized and implemented, instrumentation code can be deployed to collect event records. A client-side or server-side event triggers the implemented code, which gathers the data and invokes EventLogging. EventLogging then POSTs the event to the backend event intake service endpoint configured by EventLoggingServiceUri.

For example, suppose the fundraising team has created a new banner and is interested in capturing information about whether or not users have clicked it. To use EventLogging to capture this information, the team would first create a schema file (perhaps at /analytics/banner/impression) that defines the event data to capture (e.g., Was the banner clicked? true/false).
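As a hypothetical sketch only, such a schema might look something like the following, written here as a JavaScript object for readability. The title, the `banner_id` and `was_clicked` property names, and the descriptions are all invented for illustration and do not describe an existing schema.

```javascript
// Hypothetical JSON Schema for the banner example. Every name below is
// illustrative and is not taken from an existing WMF schema.
const bannerSchema = {
	title: 'analytics/banner/impression',
	description: 'Records whether a fundraising banner shown to a user was clicked.',
	type: 'object',
	required: [ '$schema', 'was_clicked' ],
	properties: {
		$schema: { type: 'string', description: 'Versioned URI of this schema.' },
		banner_id: { type: 'string', description: 'Identifier of the banner shown.' },
		was_clicked: { type: 'boolean', description: 'True if the user clicked the banner.' }
	}
};
```

Each property carries a description so that analysts can tell what a value means long after the data were collected.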

We will look in more depth at schema creation and best practices in later sections.

Note that a developer must implement the schema by creating the code that will programmatically assemble the event data and invoke EventLogging. Most of WMF's event instrumentation code lives in Extension:WikimediaEvents.

Once the schema has been merged and deployed and its implementation is complete, the instrumentation code can be deployed and EventLogging will begin to submit event records.

Hadoop

Because WMF's analytical tools are built around the Hadoop ecosystem, data is currently stored in Hive databases and sometimes loaded into Druid for exploratory analysis.

See #Analyzing EventLogging data for more information about working with EventLogging data.

Installing the EventLogging extension

Please see Extension:EventLogging for information about downloading and configuring EventLogging and setting up a developer environment.

About this document

This document is intended to capture the current best practices for using EventLogging. We encourage questions, comments, suggestions and concerns as part of the ongoing process of identifying how to most effectively use the system.

Events and schemas briefly defined

A few terms to know:

Events: An event is a strongly typed piece of data that conforms to a schema, usually representing something happening at a definite time, such as previewing an edit or collapsing the "Toolbox" section in the sidebar. Though often triggered by a click, events can represent any item of interest (the state of an application or the presence of a notification, for example). EventLogging can also capture event information, such as the time it takes to fetch an image, directly from interactions with the browser itself. We capture and analyze event data in aggregate to better understand how readers and editors interact with our site, to identify usability problems, and to provide feedback for features engineers.

Schemas: A schema describes the structure of an event record by enumerating properties and specifying constraints on types of values.

Streams: A stream, more accurately an 'event stream', is a contiguous (often unending) collection of events (loosely) ordered by time.
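To make the relationship between an event and its schema concrete, here is a minimal sketch that checks an event's required properties and primitive types against a schema. It is not a full JSON Schema validator, and the function names, schema, and events are all hypothetical.

```javascript
// Minimal illustration only: checks required properties and primitive
// types (string, boolean, integer, number). It is NOT a full JSON Schema
// validator.
function jsonType( value ) {
	return Number.isInteger( value ) ? 'integer' : typeof value;
}

function findViolations( schema, event ) {
	const problems = [];
	for ( const name of schema.required || [] ) {
		if ( !( name in event ) ) {
			problems.push( 'missing required property: ' + name );
		}
	}
	for ( const [ name, value ] of Object.entries( event ) ) {
		const spec = schema.properties[ name ];
		if ( !spec ) {
			continue; // properties the schema does not mention are ignored here
		}
		const actual = jsonType( value );
		const matches = actual === spec.type ||
			( spec.type === 'number' && actual === 'integer' );
		if ( !matches ) {
			problems.push( 'wrong type for ' + name + ': expected ' + spec.type );
		}
	}
	return problems;
}

// Hypothetical schema and events.
const clickSchema = {
	required: [ '$schema' ],
	properties: {
		$schema: { type: 'string' },
		user_id: { type: 'integer' }
	}
};
const good = findViolations( clickSchema, { $schema: '/analytics/button/click/1.0.0', user_id: 1234 } );
const bad = findViolations( clickSchema, { user_id: 'abc' } );
```

Here `good` is an empty list, while `bad` reports both the missing `$schema` and the wrongly typed `user_id`.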

Using EventLogging: The workflow

The EventLogging workflow is a flexible process that facilitates asynchronous collaboration among contributors. Analysts and engineers can work in parallel, implementing the code required to grab the data, and/or contributing to the process of data modeling.

The schema provides a centralized place for all data modeling development. Though conversations about how to specify or collect data may occur in person or over email, the dialogue will ultimately be reflected in the schema and schema review process itself and, if best practices are followed, in its documentation.

By clearly defining data, the schema helps all users understand which data need to be collected (page_title or page_id?) and, later, what the collected data represent (new users, or new users served an experimental treatment?). Data definitions help minimize error and ensure that the correct information is captured.

Posing a question

A question may require an experiment, or not. For example, a product manager might be interested in seeing how users, or a subset of users, currently interact with the site. ‘How well do Talk pages work for new users?’ or ‘How many new users successfully complete an edit within 24 hours?’ are questions that require only that the schema capture usage patterns that can be analyzed to provide an answer.

Other questions imply an experiment. For example, “Which Create Account button design is most effective?” or “Which onboarding experience better helps people edit?” In these cases, the inquiry would involve an experiment, and possibly multiple experimental iterations. We talk more about working with schemas and experiments in later sections.

Once you have identified your question, get the ball rolling by creating a JSONSchema patch review where the process can continue. See here for more information.

Identifying metrics

Once you have identified your question, it’s time to start thinking about the metrics that can answer it. When thinking about metrics, make sure to clearly define each one, and to limit the measurements to only those necessary for answering the posed question.

For example, to answer the question, “Which onboarding experience better helps people edit?” we would need to capture the number of users that are exposed to each onboarding experience, and the number that subsequently edit successfully.

Now is the time to even further refine these desired metrics. For example, the users we are interested in are newly registered users (not anonymous users, or users who registered last week). Successful editing means completing one edit (or five edits? or ten?) successfully within 24 hours (or 24 days? or 24 minutes?). The more clearly defined the desired metrics, the more obvious their implementation becomes.
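Once a metric is pinned down this precisely, it can be written as a small computation. The sketch below assumes hypothetical per-user records (the field names `experience`, `registeredAt`, and `editTimes` are invented) and computes the share of newly registered users in one onboarding experience who complete a minimum number of edits within a time window.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Share of users exposed to `experience` who made at least `minEdits`
// edits within `windowMs` of registering. Returns null if nobody was exposed.
function editSuccessRate( users, experience, minEdits, windowMs ) {
	const exposed = users.filter( ( u ) => u.experience === experience );
	const succeeded = exposed.filter( ( u ) =>
		u.editTimes.filter( ( t ) => t - u.registeredAt <= windowMs ).length >= minEdits
	);
	return exposed.length === 0 ? null : succeeded.length / exposed.length;
}

// Hypothetical records; times are milliseconds since an arbitrary origin.
const users = [
	{ experience: 'A', registeredAt: 0, editTimes: [ 1000 ] },
	{ experience: 'A', registeredAt: 0, editTimes: [ 2 * DAY_MS ] },
	{ experience: 'B', registeredAt: 0, editTimes: [] },
	{ experience: 'B', registeredAt: 0, editTimes: [ 5000, 6000 ] }
];

const rateA = editSuccessRate( users, 'A', 1, DAY_MS ); // 0.5
const rateB = editSuccessRate( users, 'B', 1, DAY_MS ); // 0.5
```

Changing `minEdits` or `windowMs` changes the answer, which is why those choices belong in the metric definition rather than being left to the analysis.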

Although you may feel the impulse to collect additional information because it seems interesting or exciting, resist the temptation. Collecting unnecessary data dilutes the focus and meaning of the schema and adds implementation cost. Be particularly careful about personally identifiable information: any such information, such as IP addresses or browsing behavior, might trigger the need to throw out schema data on a rolling basis (e.g., a 30-day rolling window) to comply with law or internal policy.

If you have any questions about identifying appropriate metrics, please contact an analyst.

Drafting a schema

The schema is a JSON schema object collaboratively edited by the analysts and engineers contributing to the project. Over time, and as each of the developing parties contributes expertise, the schema begins to more precisely define the data necessary for each analysis.

As the schema is being developed, engineers work on developing the code required to grab the specified data. At any point in the schema-drafting process, the schema can be deployed locally to test the implementation.

For more information about schemas, please see Creating a schema.

Peer review

As the schema is developed, it benefits from the expertise of each of the contributing parties. Analysts, engineers, and product managers each bring their perspectives and knowledge to the schema, helping to ensure that it is viable and sound.

An analyst makes sure that the specified metrics are, in fact, appropriate and required for answering the posed question. For example, an early version of the GettingStarted schema specified that the system log the protection-level of editable pages. Although this information is interesting, it is not directly relevant to the original question (Which onboarding experience better helps people edit?), and the analyst knew to remove the property from the schema.

Engineers are experts on implementation cost, and can see which data will be easy to collect and which might require extensive processing, or even take down the website were the system to attempt to capture them. Engineers know which metrics can be collected reliably, and which not, and can make recommendations accordingly.

The product manager, who tracks how much time everyone puts into the project, has a good sense for when the cost of an engineering effort outweighs its value. The product manager will weigh in if an analysis, however interesting, is outside the needs of the organization.

Security/Privacy concerns

If you don't want all your data purged on a rolling 90-day basis, you will need to whitelist the table and the specific fields in the schema you want to keep. If you add new fields to an existing schema, you will need to whitelist those as well. Even given this purging, there are some fields we never collect.
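The effect of whitelisting can be pictured with a small sketch. This is not how the purging is actually implemented; the function name, the `timestampMs` field, and the sample fields are assumptions made for illustration.

```javascript
const RETENTION_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// Sketch of whitelist-based retention: once an event is older than the
// retention window, only explicitly whitelisted fields survive.
function applyRetention( event, whitelist, nowMs ) {
	const ageMs = nowMs - event.timestampMs;
	if ( ageMs <= RETENTION_DAYS * DAY_MS ) {
		return event; // still inside the window: keep everything
	}
	const kept = {};
	for ( const field of whitelist ) {
		if ( field in event ) {
			kept[ field ] = event[ field ];
		}
	}
	return kept;
}

// Hypothetical event that is 200 days old.
const now = 200 * DAY_MS;
const oldEvent = { timestampMs: 0, page_id: 7, user_agent: 'ExampleBrowser' };
const retained = applyRetention( oldEvent, [ 'page_id' ], now );
```

With an empty whitelist nothing is kept, which mirrors the default of purging everything; a field added to the schema later is likewise dropped until it is added to the whitelist.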

This section is a work in progress, but here are some data policies to be aware of:

Analytics/EventLogging/Data retention and auto-purging

Analytics/EventLogging/Publishing

Finalizing a schema

Before a schema can be deployed, it must get a final review from an analyst, who ensures that it does not violate the Wikimedia Foundation’s privacy policies.

Deploying an event stream

Once a schema has been completed to the satisfaction of its contributors and approved by an analyst, it can be merged. Production code that logs events of that schema can then be deployed.

EventLogging uses Extension:EventStreamConfig to support dynamically configuring things like sampling rates of specific event streams. You must first add an entry for your new event stream to the wgEventStreams variable in mediawiki-config.

$wgEventStreams = [
    // ...,
    [
        'stream' => 'analytics.button_click',
        'schema_title' => '/analytics/button/click',
    ],
];

This 'stream config' is used by EventLogging as well as other systems to specify configuration of event streams. The schema_title MUST be specified and must match the title field of your event schema. The stream name will be the value that you pass to the mw.eventLog.submit() function.

Because wgEventStreams is used by many services, you need to tell EventLogging what streams it needs to know about. This is done by adding entries to the wgEventLoggingStreamNames config variable.

$wgEventLoggingStreamNames = [
    // ...,
    'analytics.button_click',
];

By adding 'analytics.button_click' to wgEventLoggingStreamNames, you tell EventLogging to request the stream config for that stream on page load and use it to configure logging behaviors like sampling rate (TBD).
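Sampling configuration is marked TBD above, so the following is only a guess at how a client might consult a stream config before logging. The `sampleRate` key, the rule that unconfigured streams are skipped, and the function name are all assumptions and not documented settings.

```javascript
// Hypothetical client-side copy of stream configs. `sampleRate` is an
// assumed key, not a documented setting.
const streamConfigs = [
	{ stream: 'analytics.button_click', schema_title: '/analytics/button/click', sampleRate: 0.1 }
];

// randomValue is expected to be in [0, 1), e.g. from Math.random().
function shouldLog( configs, streamName, randomValue ) {
	const config = configs.find( ( c ) => c.stream === streamName );
	if ( !config ) {
		return false; // assumption: unconfigured streams are not logged
	}
	// Assumption: a stream without a sample rate logs every event.
	const rate = config.sampleRate === undefined ? 1 : config.sampleRate;
	return randomValue < rate;
}
```

A caller would pass `Math.random()` as the third argument; taking the random value as a parameter keeps the decision easy to test.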

In your instrumentation code, you can then log this event either server side (TBD) or client side. NOTE: The event data must always set an explicit $schema URI. This is an explicit schema version URI, NOT the schema title. E.g. "/analytics/button/click/1.0.0". More on schema versions and URIs later.

EventLogging::submit( 'analytics.button_click', $event );

And the JavaScript:

mw.eventLog.submit( 'analytics.button_click', event );

You'll need to fill in the event with appropriate values for your schema's fields. Logging an event will transmit collected data to a central area where it is warehoused.
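Putting the pieces together, a client-side instrumentation helper might look like the sketch below. The helper name `logButtonClick` and the `user_id` field are illustrative; the submit function is passed in so the example runs on its own, outside a MediaWiki page.

```javascript
// Hypothetical instrumentation helper. In a MediaWiki page, `submit` would
// wrap mw.eventLog.submit; it is injected here so the sketch is runnable
// without MediaWiki.
function logButtonClick( submit, userId ) {
	const event = {
		// Versioned schema URI, not the schema title.
		$schema: '/analytics/button/click/1.0.0',
		user_id: userId
	};
	submit( 'analytics.button_click', event );
}

// Stand-in for mw.eventLog.submit that simply records each call.
const submitted = [];
logButtonClick( ( streamName, event ) => submitted.push( { streamName, event } ), 1234 );
```

On a page where EventLogging is loaded, the call would instead be something like `logButtonClick( ( s, e ) => mw.eventLog.submit( s, e ), 1234 )`.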

QA

EventLogging POSTs to EventGate, which validates the events against their schemas to help ensure that the data collected are the correct ones. Although schema validation captures many errors, the data must also be reviewed by a human before they are deemed sound.

Events that fail validation are logged to a validation error event stream. For events from EventLogging at WMF, this stream is called 'eventgate-analytics-external.error.validation', which is eventually ingested into the Hive table event.eventgate_analytics_external_error_validation.

Analysts also examine the data to catch the errors machines cannot. A value can conform to the letter of the schema (which specifies that it have a string type), but be meaningless (a string of gibberish). Unit tests should test the soundness of the data, though sometimes these are difficult to implement.

Because analysts are familiar with the known patterns of human activity on the site, they are able to determine if the collected data are unrealistic and indicate an error in implementation or assumption. Analysts will be able to identify edge cases—the times when the data are unexpected, either because users have behaved in a way that was not anticipated, or because of a glitch in an experiment. Sometimes, these edge cases can simply be flagged; other times, they require a change in implementation. The analyst will decide how best to respond to the inconsistencies.
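Some of this review can be supported by simple plausibility checks run over events that have already passed schema validation. The sketch below is only an example of the idea: the field names, the sample events, and the ten-minute bound are invented, and real checks would depend on the schema and on what an analyst knows about the site.

```javascript
// Hypothetical events that would all pass a schema requiring an integer
// load_time_ms and a string page_title.
const events = [
	{ load_time_ms: 350, page_title: 'Main Page' },
	{ load_time_ms: -20, page_title: 'Main Page' },
	{ load_time_ms: 400, page_title: '   ' }
];

// Arbitrary illustrative bound: treat loads over ten minutes as implausible.
const MAX_PLAUSIBLE_LOAD_MS = 10 * 60 * 1000;

// True when the values are not just well typed but also believable.
function isPlausible( event ) {
	return event.load_time_ms >= 0 &&
		event.load_time_ms <= MAX_PLAUSIBLE_LOAD_MS &&
		event.page_title.trim() !== '';
}

const suspicious = events.filter( ( e ) => !isPlausible( e ) );
```

Flagged events still need a human decision: some are edge cases to annotate, others point to a bug in the instrumentation.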

Creating a schema

Schemas describe data in a human and machine readable way. They specify the names and data types (integer, string, boolean, etc.) of event data, and are used to clarify, understand, and validate information. Every schema includes documentation about the referenced data so that all users can understand what a given data set contains—both before and long after the data are collected.

It is important to note that schemas do not automatically generate data. Schemas are a definition of data and can be used to validate and describe data, but an engineer must still programmatically submit the event data. We encourage you to think of the schema as a contract between analysts, developers, and product managers. This contract, collaboratively developed, makes the choice of what data to collect explicit (title_id? Or page_title?) to minimize confusion both when implementing the model and, later, when analyzing the data that conforms to it.

Creating (or editing) a schema and event stream

See

Choosing which data to capture

One of the choices that is made and/or refined as a schema is developed is which specific data to capture. Some data are ‘expensive,’ requiring database queries and processing to obtain; other data are easy to capture, and so require fewer resources.

When thinking about your schema and which specific data to grab, a good place to start is this list of easily available information.

In addition to using low-cost data whenever possible, choose the most reliable data available. A page_id is more reliable than an article title, for example, because a page can be renamed while its page_id stays constant. Page impressions are more reliable than page clicks, and user_ids are more reliable than user strings, which are strings of characters (English, Chinese, etc.) that can be difficult to handle.
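The difference in reliability shows up as soon as events are aggregated. In this sketch, with invented events for a page that was renamed between two views, counting by page_id keeps the views together while counting by title splits them.

```javascript
// Hypothetical events: page 42 was renamed between its two views.
const views = [
	{ page_id: 42, page_title: 'Old name' },
	{ page_id: 42, page_title: 'New name' },
	{ page_id: 7, page_title: 'Other page' }
];

// Count events grouped by the value of one property.
function countBy( events, key ) {
	const counts = {};
	for ( const e of events ) {
		counts[ e[ key ] ] = ( counts[ e[ key ] ] || 0 ) + 1;
	}
	return counts;
}

const byId = countBy( views, 'page_id' );
const byTitle = countBy( views, 'page_title' );
```

`byId` reports two views for page 42, whereas `byTitle` treats the old and new names as two unrelated pages with one view each.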

Programming/Instrumenting your code

Extension:EventLogging/Programming has tips and suggestions for developers writing code to log events.

Best practices: Dos and don'ts

Parsimony

Don't capture data that are not required to answer the question, or that you can easily obtain from the database, as grabbing them adds extra implementation cost and affects the readability of the schema. An analyst will know which values are easier to reconstruct than to capture.

Be bold and prune

Don't be afraid to prune schema properties that are not directly relevant to the question, or used to control or validate data.

Redundancy

Always keep parsimony and pruning in mind, but know that there are cases for which it is less costly to have some redundancy, or for which redundancy is required to help validate unreliable data. For example, click events are notoriously difficult to capture, and it is good practice to grab a target title with each one to help validate the data. In other cases, a value (whether or not a user is new, for example) may prove costly to reconstruct from a database. In such cases, it is more efficient to simply add a user_is_new property to the schema as a control variable.

Focus on high-level description, not implementation details

Do not tie a description of a value to a current implementation. Do not use a button to describe an action (unless you are comparing buttons), but instead try to think of the user action that is relevant and will be comparable long after the specific button has been replaced or moved. Obscure variable names are confusing to analysts and others not directly involved in implementation, and they become meaningless if the codebase changes and the variable name is changed.

Be wary of platform differences

Implementing different schemas on different platforms is often necessary, and different actions and layouts can lead to different structures. When planning a schema, look to other platforms to see whether a schema already exists, consider whether counts or comparisons across platforms will be relevant, and decide whether to use the same schema, a similar one, or a new one.

Know the tradeoffs

There can be more than one way to capture an event, and it's good to be aware of the tradeoffs involved in each. For example, when logging users, you may rely on a token, user_name, or user_id. Tokens are unique and totally anonymized, but they cannot be joined with production databases to give you more information about your users. User names can come with special characters and encodings that break in analysis. User IDs are usually ideal, but sometimes you may need to filter out a large group of inconsistently named test/staff user accounts that you won't recognize by number, in which case the user name is the better option.