Extension:EventLogging/Guide

What is EventLogging?
If you have a question about how visitors are interacting with our site, EventLogging can capture the data required to answer it. EventLogging gathers the information each querier needs, and makes that data easily available for analysis.

The goal of EventLogging is not to capture every action on the site, but to capture rich, well-documented data that can be efficiently analyzed and used over the long term. Data captured by EventLogging are validated, versioned, and carefully described so that error and misunderstanding are minimized.

Every data collection job has a public-facing page, where the data to capture are specified and refined. Users can scrutinize the data-collection process, offer insights, and discuss concerns. This public page contains a new type of wiki content: a JSON schema, which provides a data description that is legible to both humans and machines. The schema does not automatically grab data, rather it provides a place for analysts, engineers, product managers and others to work collaboratively. The schema becomes a contract, a consensus as to the meaning and implementation of a data model.

EventLogging is designed to help ensure that the data collected and, ultimately, analyzed to answer our questions, are the desired data. The system validates data on the client-side (to immediately alert developers of errors) and server-side (to ensure that the collected data are truly valid). If a schema is updated and redeployed, the system assigns a new version number. Collected data are transmitted to a data warehouse, where valid data are stored in a MySQL table that is automatically created by the system. The data can be easily accessed by additional subscribers for other purposes, such as visualizations.

EventLogging replaces the previous event-logging system, ClickTracking, which used a single format (a string) to describe data. Over time, we discovered that the string format failed to adequately describe collected data, and, in fact, made comparing and understanding the data costly and difficult. Simply parsing the format could present problems, as the ‘@’ delimiter sometimes appeared in the data themselves. Additionally, the system provided no mechanism for validating the data or describing them well. By using collaboratively edited schemas to define data, EventLogging provides a much more robust framework for creating and documenting data models, and performing iterative revisions.

Underlying technology
EventLogging is a MediaWiki extension that generates the code required to perform server- and client-side logging and to transmit the collected information to a data store.

The schemas describing the data are located in the ‘Schema:’ namespace on Meta-Wiki and represent JSON schema objects, a new content type for MediaWiki.

EventLogging uses ContentHandler, developed by Wikimedia Deutschland for Wikidata, to store and version the JSON content like wikitext.

EventLogging automatically wraps each schema-defined record in an event capsule that provides additional standard information (e.g., schema version, server timestamp and obfuscated client IP address).

Collected data are transmitted to a data store, where the system unpacks and validates the events, and publishes valid records so they can be accessed by subscribers. The MySQL subscriber is always running and will automatically create a table for valid records. Other subscribers may use data in other ways.

About this document
This document is intended to capture the current best practices for using EventLogging. We encourage questions, comments, suggestions and concerns as part of the ongoing process of identifying how to most effectively use the system.

Events and schemas briefly defined
A couple of terms to know:

Events: An event is a record of a user action on the site, such as previewing an edit, or collapsing the "Toolbox" section in the sidebar. We capture and analyze event data in aggregate to better understand how readers and editors interact with our site, to identify usability problems, and to provide feedback for features engineers.

Schemas: A schema describes the structure of an event record by enumerating properties and specifying constraints on types of values.

Using EventLogging: The workflow
The EventLogging workflow is a flexible process that facilitates asynchronous collaboration among contributors. Analysts, engineers, product managers, and others can work in parallel, implementing the code required to grab the data, and/or contributing to the process of schema refinement at any stage of the process.

The schema file provides a centralized place for all development. Though conversations about how to specify or collect data may occur in person or over email, the dialogue will ultimately be reflected in the schema itself and, if best practices are followed, in its documentation.

By clearly defining data, the schema helps all users understand which data need to be collected (pageTitle or pageID?) and, later, what the collected data represent (new users, or new users served an experimental treatment?). Data definitions help minimize error and ensure that the correct information is captured.

Posing a question
Though questions about how users are interacting with the site are most often posed by a product manager or UI developer, questions can be posed by anyone. One of the benefits of EventLogging is that the tool can be used by anyone—both within and outside of the Wikimedia Foundation—to initiate an inquiry.

A question may require an experiment, or not. For example, a product manager might be interested in seeing how users, or a subset of users, currently interact with the site. ‘How well do Talk pages work for new users?’ or ‘How many new users successfully complete an edit within 24 hours?’ are questions that require only that the schema capture usage patterns that can be analyzed to provide an answer.

Other questions imply an experiment. For example, “Which Create Account button design is most effective?” or “Which onboarding experience better helps people edit?” In these cases, the inquiry would involve an experiment, and possibly multiple experimental iterations. We talk more about working with schemas and experiments in later sections.

Once you have identified your question, get the ball rolling by creating a Schema page, where the process can continue. See Creating a schema for more information.

Identifying metrics
Once you have identified your question, it’s time to start thinking about the metrics that can answer it. When thinking about metrics, make sure to clearly define each one, and to limit the measurements to only those necessary for answering the posed question.

For example, to answer the question, “Which onboarding experience better helps people edit?” we would need to capture the number of users that are exposed to each onboarding experience, and the number that subsequently edit successfully.

Now is the time to even further refine these desired metrics. For example, the users we are interested in are newly registered users (not anonymous users, or users who registered last week). Successful editing means completing one edit (or five edits? or ten?) successfully within 24 hours (or 24 days? or 24 minutes?). The more clearly defined the desired metrics, the more obvious their implementation becomes.

Although you may feel the impulse to collect additional information because it seems interesting or exciting, resist the temptation. Collecting unnecessary data dilutes the focus and meaning of the schema, and adds implementation cost.

If you have any questions about identifying appropriate metrics, please contact an analyst.

Drafting a schema
The schema is a JSON schema object collaboratively edited by the analysts, engineers, and product managers contributing to the project. Over time, and as each of the developing parties contributes expertise, the schema begins to more precisely define the data necessary for each analysis.

As the schema is being developed, engineers work on developing the code required to grab the specified data. At any point in the schema-drafting process, the schema can be deployed locally to test the implementation. Use ORI’S TOOL to print collected data to a terminal window.

For more information about schemas, please see Creating a schema.

Peer review
As the schema is developed, it benefits from the expertise of each of the contributing parties. Analysts, engineers, and product managers each bring their perspectives and knowledge to the schema, helping to ensure that it is viable and sound.

An analyst makes sure that the specified metrics are, in fact, appropriate and required for answering the posed question. For example, an early version of the GettingStarted schema specified that the system log the protection-level of editable pages. Although this information is interesting, it is not directly relevant to the original question (Which onboarding experience better helps people edit?), and the analyst knew to remove the property from the schema.

Engineers are experts on implementation cost, and can see which data will be easy to collect and which might require extensive processing, or even take down the website were the system to attempt to capture them. Engineers know which metrics can be collected reliably, and which cannot, and can make recommendations accordingly.

The product manager, who tracks how much time everyone puts into the project, has a good sense for when the cost of an engineering effort outweighs its value. The product manager will weigh in if an analysis, however interesting, is outside the needs of the organization.

Finalizing a schema
Before a schema can be deployed, it must get a final review from an analyst, who ensures that it does not violate the Wikimedia Foundation’s privacy policies.

Deploying a schema
Once a schema has been completed to the satisfaction of its contributors and approved by an analyst, it can be deployed.

To deploy a schema, copy and paste the PHP code generated by the schema into the code base. To view the schema-generated code, click the red ‘<>’ button at the top right corner of the schema page. The code will look something like this:
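The exact output depends on the schema, but a registration snippet in the style EventLogging generates might be (the schema name and revision number below are hypothetical):

```php
// Hypothetical example: register the schema as a ResourceLoader module.
// The schema name and revision number will match your schema page on
// Meta-Wiki.
$wgResourceModules[ 'schema.GettingStarted' ] = array(
	'class'    => 'ResourceLoaderSchemaModule',
	'schema'   => 'GettingStarted',
	'revision' => 5183269,
);
```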

This PHP code, in conjunction with the code necessary to grab the specified fields, will automatically transmit collected data to a central area where they are warehoused.

Note that the schema does not automatically indicate that it has been deployed. We recommend that you note this information on the schema’s Talk page (see Collaboration: Talk pages for more information). Deployed schemas will automatically create a MySQL table (named SchemaName_versionNumber) for the collected data on the data store.
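As a sketch of what querying such a table looks like, assume a schema named GettingStarted at version 5183269 that defines an action property (EventLogging prefixes schema-defined columns with event_; the names here are illustrative):

```sql
-- Illustrative only: table name follows SchemaName_versionNumber,
-- and schema-defined fields appear as event_-prefixed columns.
SELECT event_action, COUNT(*) AS events
FROM GettingStarted_5183269
GROUP BY event_action;
```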

QA
EventLogging uses both client- and server-side validation to help ensure that the data collected are the correct ones. Although machine validation captures many errors, the data must also be reviewed by a human before they are deemed sound.

Client-side validation output appears in the JavaScript console, immediately alerting developers when a schema is not working properly.
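On the client, events are logged through the extension's JavaScript API. The sketch below only runs inside MediaWiki with EventLogging installed, and the schema name and fields are hypothetical:

```javascript
// Hypothetical example: log an event against a schema named
// "SidebarCollapse". mw.eventLog validates the event against the
// schema on the client and reports violations in the console.
mw.eventLog.logEvent( 'SidebarCollapse', {
    isAnon: !mw.config.get( 'wgUserId' ),
    action: 'toolbox-collapse'
} );
```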

Server-side validation occurs on the server where the data are ultimately stored. EventLogging uses a tiered model of data handling:

Data that meet the requirements specified in the schema are unpacked and broadcast so that they are available to subscribers. Valid data are automatically stored in a MySQL table, where they can be easily queried and intersected with other data sets.

Data that do not conform to the schema are logged, but only in a raw bin.
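EventLogging's own validator is internal to the extension, but the valid/invalid sorting step can be sketched with the third-party Python jsonschema library (the schema below is hypothetical and for illustration only):

```python
from jsonschema import Draft3Validator  # third-party: pip install jsonschema

# A hypothetical draft-3 schema, in the style EventLogging uses.
schema = {
    "description": "Logs clicks on a hypothetical sidebar link",
    "type": "object",
    "properties": {
        "isAnon": {"type": "boolean", "required": True},
        "action": {
            "type": "string",
            "required": True,
            "enum": ["link-impression", "link-click"],
        },
    },
}

validator = Draft3Validator(schema)

valid_event = {"isAnon": False, "action": "link-click"}
invalid_event = {"isAnon": "yes", "action": "link-click"}  # wrong type

# A server following this tiered model would store the first event
# and divert the second to the raw bin.
print(validator.is_valid(valid_event))    # True
print(validator.is_valid(invalid_event))  # False
```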

Analysts also examine the data to catch the errors machines cannot. A value can conform to the letter of the schema (which specifies that it have a string type), but be meaningless (a string of gibberish). If possible, unit tests are used to test the soundness of the data, though sometimes these are difficult to implement.

Because analysts are familiar with the known patterns of human activity on the site, they are able to determine if the collected data are unrealistic and indicate an error in implementation or assumption. Analysts will be able to identify edge cases—the times when the data are unexpected, either because users have behaved in a way that was not anticipated, or because of a glitch in an experiment. Sometimes, these edge cases can simply be flagged; other times, they require a change in implementation. The analyst will decide how best to respond to the inconsistencies.

Analysis
Once the data have been collected and validated, they are ready to be analyzed. We are currently developing tools that will permit users to easily generate reports for funnel analyses and visualizations. We are also working on tools that will generate statistical tests, so that users can determine if an observed difference is significant, or random.

Creating a schema
Schemas describe data in a human and machine readable way. They specify the names and data types (integer, string, boolean, etc) of event data, and are used to clarify, understand, and validate information. Every schema includes documentation about the referenced data so that all users can understand what a given data set contains—both before and long after the data are collected.

Schemas represent a new type of MediaWiki content, JSON schema objects, which adhere to the JSON schema specification, version 3. Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be edited by all autoconfirmed users. The Talk page attached to each schema is unprotected and is used both to share information about the schema and its status, and as a forum for public discussion. Passers-by can, and are encouraged to, make good-faith improvements to schemas (improving the wording of a property description, for example). However, just because a passer-by can alter a schema does not mean that all changes are appropriate. Each schema requires much care and coordination to design, and drastically altering an existing schema to meet the needs of a different project can be disruptive.

It is important to note that schemas do not automatically generate data. Schemas are a definition of data and can be used to validate and describe data, but an engineer must still programmatically grab the event data. We encourage you to think of the schema as a contract between analysts, developers, and product managers. This contract, collaboratively developed, makes the choice of what data to collect explicit (pageID? or pageTitle?) to minimize confusion both when implementing the model and, later, when analyzing the data that conform to it.

Creating a schema wiki page
Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be created and edited by all autoconfirmed users.

When creating a schema wiki page, please adhere to the following naming conventions:


 * Use CamelCaps (e.g., GettingStarted, Echo, AccountCreation)
 * No spaces

The schema page name is the name of the schema, and is used when naming the MySQL table created by the system.

Once you have created the schema wiki page, please add the schema Talk template, which is available here. The Talk template identifies a contact person, information about the project, and the status of the schema itself. Specifying this information in the template makes it easy to sort schemas by the defined criteria.

Using JSON schema syntax
Schemas are JSON schema objects, which define the required and optional properties of captured event data. The quickest way to get a sense for the syntax is to look at some existing schemas. When you first look at a schema, the human-readable version is displayed. Click the Edit tab to view and/or update the JSON schema code itself. You will notice a schema description followed by an array of schema properties.

Note that currently, only the JSON features most relevant to EventLogging have been implemented. These include:
 * description
 * type: boolean, integer, number, string, timestamp
 * required: true/false
 * enum
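A minimal schema using these features might look like the following (the property names and descriptions here are illustrative, not taken from a real schema):

```json
{
    "description": "Logs clicks on a hypothetical sidebar link",
    "properties": {
        "isAnon": {
            "type": "boolean",
            "required": true,
            "description": "Is the user anonymous?"
        },
        "action": {
            "type": "string",
            "required": true,
            "enum": [ "link-impression", "link-click" ],
            "description": "Which step of the funnel the event records."
        }
    }
}
```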

Using descriptions
Schemas use JSON descriptions to clarify the purpose of the schema itself as well as each of its individual properties.

The description of the schema should be brief—a line or two about the data the schema captures:


 * description: "Logs events related to tasks assigned to new registered users via the GettingStarted extension"

More detailed information about the project (information about experimental conditions, for example) can be noted in the schema Talk page, and we encourage you to include additional context there.

The description of schema properties should focus on meaning, not implementation details. Although describing a property by the name of a corresponding variable might seem like a good idea, the meaning of the variable is only understood by a subset of schema users, and the description becomes meaningless if and when the variable name changes.

In the case of a value that is a boolean type, the property description is often best posed as a question:
 * isAnon
 * type: "boolean"
 * required: true
 * description: "Is the user anonymous? Anonymous users are not logged in."

Using enum
JSON enums are used when a value is required to be one of a known set of values. A good example of a situation for which we recommend the use of enum is a funnel, which captures information about a known flow of possible user actions:


 * action
 * type: "string"
 * required: true
 * enum: ["page-impression", "page-edit-impression", "page-save-attempt", "page-save-success"]
 * description: "The actions involved in accepting a task and completing the corresponding edit funnel."

For instances in which a value is required to be a single known value (e.g., that all users be newly registered), it is better to use a boolean type to describe the field:


 * isNew
 * type: "boolean"
 * required: true
 * description: "Is the user new? True if the user just created an account."

Useful tools
We recommend the following tools for working with JSON:

JSONLint is a web-based JSON validator and formatter.

JSONschema.net will create a JSON schema from an existing JSON object. Note that this tool generates schema elements that are not currently implemented by EventLogging (e.g., "id").

Choosing which data to capture
One of the choices that is made and/or refined as a schema is developed is which specific data to capture. Some data are ‘expensive,’ requiring database queries and processing to obtain; other data are easy to capture, and so require fewer resources.

When thinking about your schema and which specific data to grab, a good place to start is [[Manual:Interface/JavaScript|this list]] of easily available information.

In addition to using low-cost data whenever possible, the data you choose to gather should be the most reliable data possible. A pageID is more reliable than an articleTitle, for example, as a page can be renamed while its pageID remains constant. Page impressions are more reliable than page clicks, and userIDs are more reliable than userStrings, which represent strings of characters (English, Chinese, etc.) that can be difficult to handle.

Choosing schema property names
EventLogging does not enforce a standard vocabulary, and you are welcome to create property names (i.e., schema keywords) that best describe the data you’d like to collect. That said, we strongly encourage you to standardize the property names used by your project, and, if relevant, used commonly across the organization. For example, many analyses require information about users (e.g., whether they are newly registered or anonymous). By using a consistent vocabulary to refer to these qualities (isNew, isAnon), the data collected over time and with multiple schemas, can be more easily compared and understood.

Currently, the best place to see how schema properties have been specified and used in the past is here.

When creating your own schema property names, please adhere to the following naming conventions:
 * use headlessCamelCase (e.g., pageTitle, userID, editCount)
 * no spaces

Editing a deployed schema
It’s perfectly fine to make edits to a deployed schema, and doing so will not compromise a currently running data collection job. When changes are saved to a schema, the system automatically assigns it a new version number. Any currently running data-collection jobs will continue to point to the previous schema version, unless the code is explicitly updated to refer to the newer version.

If a revised schema is deployed, the system will automatically create a new MySQL table for the new data, using SchemaName_versionNumber to name the table.

Schemas and experiments
When running an experiment—testing to see which of two user interfaces is more effective, for example—you will often run multiple iterations to test different factors and/or interfaces, or to correct an error in the original implementation. Though the schema itself might not change drastically, the experimental conditions could be dramatically different.

We recommend that you create a new schema file for experimental iterations that reflect a substantial change to the schema or to the experiment/experimental conditions. This way, the information that is specific to the iteration can be documented in the schema Talk page, and the new iterations will not be cluttered with legacy data.

The page names of new schemas created for iterations should include the name of the original schema, along with a brief notation that reflects the specific iteration:

 * GettingStarted
 * GettingStarted0B1
 * GettingStarted0B2
 * …

Small changes to the schema or its implementation do not necessarily warrant a new schema page, and can be identified with a schema property. For example, the GettingStarted schema uses a property called ExperimentID to identify minor bug fixes made over the course of a single experimental iteration. In this case, neither the schema nor the experiment change, but the analyst still wishes to capture information about the implementation change in case it impacts the data. If you are unsure whether a schema edit implies creating a new schema or not, please ask an analyst.

JSON descriptions
Use clear, concise descriptions, so that everyone developing a schema, or referring to it in order to understand the data, can understand it.

Edit summary for Wiki schema edits
Document what you have changed in the edit summary field when making changes to the schema file.

Schema Talk page
Use the schema's Talk page to link to experiments that use the schema, to discuss details, and so on. Always document what code logs the event, and under what circumstances. See Extension:EventLogging/Schemas.
 * The Talk template includes a status (draft, active, inactive, deprecated), which is used to display schemas by category.
 * The Talk template also includes details about the experimental design, if the schema is used for an experiment.

It might be useful to have a quick look at Using talk pages. Talk pages for schemas are the same as talk pages for normal wiki articles: each talk page is associated with a specific article, and they're a venue for rather free-form discussion about the content of the article they are associated with. (Discussion happens through people writing blocks of wikitext and signing them.) Use the Talk page as a venue for raising questions and making comments about a schema, and as a place for any kind of prose description of the schema or the considerations that went into its design. Talk pages are also a good place to log and justify major changes to schemas.

Working with EventLogging data

 * Valid data are unpacked by the system and broadcast so that they are available to subscribers.
 * The MySQL subscriber automatically creates a table for each schema, named SchemaName_versionNumber.
 * Other subscribers (e.g., real-time visualizations, or a process that writes to a bug-tracking system) can receive and use data in other ways.
 * More to come when new tools are ready!

Creating a cohort from EL data
{placeholder}

Parsimony

 * Don't add fields that you do not require to answer the question, or that you can easily obtain from the database. Grabbing them imposes an extra implementation cost on engineers and affects the readability of the schema.

Be bold and prune
For example, EditCount in the GettingStarted schema is not used to control or validate the data. Capturing it might save a step in analysis (events could be grouped by edit count directly), but this is not necessary, and the information can easily be obtained from the database (at the analyst's discretion).
 * Prune all information not required to answer the question (don't go butterfly collecting).

Redundancy
That said, there may be times when it is less costly to allow some redundancy, or when redundancy is required to help validate an unreliable field. For example, a schema might log both targetTitle and pageId with a click event, because pageID is only available when people actually reach the article, and is the more reliable of the two.

Some extra fields that are not strictly needed can also serve as control or validation variables. For example, GettingStarted logs isNew because it is costly to reconstruct after the fact whether someone was new; it is less costly to simply add isNew as a control variable.

Focus on high-level description, not implementation details

 * Do not tie the description of a value to its current implementation.
 * Example: a property like isEditable. If the code base changes and a variable name changes, a description written in terms of that implementation detail becomes meaningless. Obscure variable names are also confusing for analysts and others not involved in implementation, and implementation-level details are not generic enough to represent what we are actually measuring.

Standardization

 * Whenever possible, use standardized fields (a consistent keyword with a consistent meaning), so that analysts can intersect and analyze data across schemas, and to help with readability.
 * For example, call a userID a userID consistently; it is then easy to find schemas that have that field, and to look back at historic data and make sense of them.
 * Don't reinvent a schema or format: use the schema library if appropriate, or keywords and meanings previously defined in your project.

Enum is your friend
Use enum to require that a value be one of an array of specified values.

Know the tradeoffs
There can be more than one way to capture an event, and it's good to be aware of the tradeoffs involved for each way. For example, when logging users, you may rely on token, username, or userId. Tokens are unique and totally anonymized, but they cannot be joined with production databases to give you more information about your users; usernames can come with special characters and encoding that break in analysis; userIds are usually ideal, but sometimes you may need to filter out a large group of inconsistently named test/staff user accounts that you won't recognize by number, in which case username is the better option.

Schema library
{placeholder} Repository for standardized, best practice, commonly used schema elements (e.g., funnels, buckets, user identification, etc)