Extension:EventLogging/Guide

What is EventLogging?
If you have a question about how visitors are interacting with our site, EventLogging can capture the data required to answer it. EventLogging gathers the information each querier needs, and makes that data easily available for analysis.

The goal of EventLogging is not to capture every action on the site, but to capture rich, well-documented data that can be efficiently analyzed and used over the long term. Data captured by EventLogging are validated, versioned, and carefully described so that error and misunderstanding are minimized.

Every data collection job has a public-facing page, where the data to capture are specified and refined. Users can scrutinize the data-collection process, offer insights, and discuss concerns. This public page contains a new type of wiki content: a JSON schema, which provides a data description that is legible to both humans and machines. The schema does not automatically grab data; rather, it provides a place for analysts, engineers, product managers, and others to work collaboratively. The schema becomes a contract, a consensus as to the meaning and implementation of a data model.

EventLogging is designed to help ensure that the data collected and, ultimately, analyzed to answer our questions, are the desired data. The system validates data on the client-side (to immediately alert developers of errors) and server-side (to ensure that the collected data are truly valid). If a schema is updated and redeployed, the system assigns a new revision ID. Collected data are transmitted to a data warehouse, where valid data are stored in a MySQL table that is automatically created by the system. The data can be easily accessed by additional subscribers for other purposes, such as visualizations.

EventLogging replaces the previous event-logging system, ClickTracking, which uses a single format (a string) to describe all data not captured by a small set of defined fields. Over time, we discovered that the string format failed to adequately describe collected data, and, in fact, made comparing and understanding the data costly and difficult. Simply parsing the format could present problems, as the ‘@’ delimiter sometimes appeared in the data themselves. Additionally, the system made too many API calls to scale, and provided no mechanism for validating the data or describing them well. By using collaboratively edited schemas to define data, EventLogging provides a much more robust framework for creating and documenting data models, and performing iterative revisions.

Underlying technology
EventLogging is a MediaWiki extension that performs server- and client-side logging. It also has "back-end" code to transmit, process, and store the collected information. The system uses schema files to define, describe, and validate collected data. Schemas are JSON schema objects that you store and edit on a wiki; schemas in use on WMF wikis are located in the ‘Schema:’ namespace on Meta-Wiki. Note that JSON is a new type of wiki content, and EventLogging uses ContentHandler, developed by Wikimedia Deutschland for Wikidata, to store and version the JSON content like wikitext.

To use EventLogging, you call a function and pass it a versioned schema name and a plain object that matches that schema. To log an event on the server in PHP, call efLogServerSideEvent. To log an event in client-side JavaScript code, call the extension's JavaScript logging function. If you are unsure whether to log an event on the client or the server side, consider the data you are collecting. Transaction information (e.g., the addition, deletion, or modification of information stored in the MediaWiki database) is easiest to capture on the server side. Information about how a user is interacting with the browser environment (e.g., a page view or notification) is easiest to capture on the client side.

Events can only be logged as wholes. If a schema's properties span both server-only and client-only data, the schema can be split into two complementary schemas, or the developer can do some extra work to make sure all values are available on one side or another. The developer may also suggest an alternate combination of schema values that would also capture the required information.

Once the schema has been finalized and implemented, EventLogging can be deployed to collect event records. A client- or server-side event will trigger the implemented code, which grabs the data and invokes EventLogging. EventLogging will then validate each event record and annotate it with additional information (via the m:Schema:EventCapsule). EventLogging handles the URL-encoding and decoding required to transmit and, ultimately, broadcast valid event records in JSON format. This stream of valid records produced by EventLogging is available to any and all subscribers (MySQL, MongoDB, visualization tools, etc.).



For example, suppose the fundraising team has created a new banner and is interested in capturing information about whether or not users have clicked it. To use EventLogging to capture this information, the team would first create a schema file (BannerImpression) that defines the event data to capture (e.g., Was the banner clicked? true/false):


 * BannerImpressionSchema.png

We will look in more depth at schema creation and best practices in later sections. For now, just note the property “wasClicked”, which defines a Boolean value that describes whether or not a user has clicked the fundraising banner.
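As a rough sketch, a schema of this kind might look like the following, written here as a Python dictionary that mirrors the JSON stored on the schema page. Only the wasClicked property comes from the example above; the description strings and the variable names are illustrative assumptions, not the actual BannerImpression schema.

```python
import json

# Hypothetical sketch of a BannerImpression-style schema. Only the
# "wasClicked" property is taken from the guide; the description
# strings are made up for illustration.
BANNER_IMPRESSION = {
    "description": "Records whether a fundraising banner was clicked.",
    "properties": {
        "wasClicked": {
            "type": "boolean",
            "required": True,
            "description": "Did the user click the banner?",
        }
    },
}

# Serialize the dictionary to the JSON text that would be edited on the wiki.
as_json = json.dumps(BANNER_IMPRESSION, indent=4)
print(as_json)
```

Phrasing the description of a boolean property as a question makes the meaning of true and false unambiguous to later readers.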

Note that a developer must implement the schema by creating the code that will programmatically grab the event data and invoke EventLogging.

Once the schema file and its implementation are complete, the schema can be deployed and EventLogging will begin to capture event records. For example, if a user views the fundraising banner, but does not click the banner, the captured event will be:

 {wasClicked: false}

When EventLogging is invoked, JavaScript EventLogging code validates the captured record and annotates it with event capsule information (i.e., additional fields, such as schema version, server timestamp, and obfuscated client IP address). EventLogging then URL-encodes the record so that it can be sent to the bits servers:

 http://bits.wikimedia.org/event.gif?%7B%22event%22%3A%7B%22wasClicked%22%3Afalse%7D%2C%22clientValidated%22%3Atrue%2C%22revision%22%3A5329872%2C%22schema%22%3A%22BannerImpression%22%2C%22webHost%22%3A%22127.0.0.1%22%2C%22wiki%22%3A%22enwiki%22%7D;

Note that EventLogging generates a request to an image (‘event.gif’) on the bits servers. Each request to ‘event.gif’ includes all of the parameters defined by the schema and the event capsule as URL-encoded JSON.
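The encoding step can be illustrated with a short Python stand-in for the client-side JavaScript. This is a sketch under stated assumptions: the function name build_beacon_url is hypothetical, the capsule fields are limited to those visible in the example URL, and clientValidated is simply set to true rather than computed.

```python
import json
from urllib.parse import quote

# Base URL taken from the example request above.
BEACON_URL = "http://bits.wikimedia.org/event.gif"


def build_beacon_url(schema, revision, event, web_host, wiki):
    """Wrap an event in capsule fields and return the event.gif request URL.

    Hypothetical helper: the real client is JavaScript, and this sketch
    assumes the event has already passed client-side validation.
    """
    capsule = {
        "event": event,
        "clientValidated": True,
        "revision": revision,
        "schema": schema,
        "webHost": web_host,
        "wiki": wiki,
    }
    # Compact separators avoid spaces in the payload; safe="" makes quote()
    # percent-encode every reserved character.
    payload = json.dumps(capsule, separators=(",", ":"))
    # The trailing ';' mirrors the example URL above.
    return BEACON_URL + "?" + quote(payload, safe="") + ";"


url = build_beacon_url(
    "BannerImpression", 5329872, {"wasClicked": False}, "127.0.0.1", "enwiki"
)
print(url)
```

Because the whole capsule travels in the query string of an image request, no response body has to be read by the client.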

The Varnish software running on the bits servers has been configured to recognize, process, and route EventLogging data. Extraneous information, such as the portion of the URL that precedes the query string (http://bits.wikimedia.org/event.gif?), is removed; additional relevant information, such as the name of the bits server that processes the record, is added; and the record is then passed to Vanadium (on port 8422) using udp2log. The transmitted record looks like this:

 %7B%22event%22%3A%7B%22wasClicked%22%3Afalse%7D%2C%22clientValidated%22%3Atrue%2C%22revision%22%3A5329872%2C%22schema%22%3A%22BannerImpression%22%2C%22webHost%22%3A%22127.0.0.1%22%2C%22wiki%22%3A%22enwiki%22%7D; niobium.wikimedia.org 12363 2013-03-18T19:32:47 216.38.130.161

The above record includes the event data (wasClicked: false) as encoded by the user's browser, the event capsule data (webHost, clientValidated, wiki, etc.) as encoded by the user's browser, and the extra annotations added by the bits server.
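To make the layout of such a record concrete, here is a minimal Python sketch that splits the example line into its parts and decodes the JSON. The function name parse_record is hypothetical, and the annotation names (recvFrom, seqId, timestamp, clientIp) are borrowed from the devserver output shown later in this guide; unlike the real pipeline, this sketch leaves the timestamp as text and does not obfuscate the IP address.

```python
import json
from urllib.parse import unquote


def parse_record(line):
    """Split a raw record into the decoded capsule plus server annotations."""
    # Five whitespace-separated fields: encoded capsule, receiving host,
    # sequence number, timestamp, client IP.
    query, recv_from, seq_id, timestamp, client_ip = line.split()
    # Drop an optional leading '?' and the trailing ';' before decoding.
    capsule = json.loads(unquote(query.lstrip("?").rstrip(";")))
    capsule.update(
        recvFrom=recv_from,
        seqId=int(seq_id),
        timestamp=timestamp,
        clientIp=client_ip,
    )
    return capsule


RAW = (
    "%7B%22event%22%3A%7B%22wasClicked%22%3Afalse%7D%2C%22clientValidated%22"
    "%3Atrue%2C%22revision%22%3A5329872%2C%22schema%22%3A%22BannerImpression"
    "%22%2C%22webHost%22%3A%22127.0.0.1%22%2C%22wiki%22%3A%22enwiki%22%7D; "
    "niobium.wikimedia.org 12363 2013-03-18T19:32:47 216.38.130.161"
)
record = parse_record(RAW)
print(record["schema"], record["event"], record["recvFrom"], record["seqId"])
```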

Server-side events, which are captured on MediaWiki machines via efLogServerSideEvent, are also transmitted to Vanadium (on port 8421) and are processed in the same way as client-side events when they arrive.

EventLogging code running on Vanadium validates each record against its corresponding schema. Invalid records are written to a log file on Vanadium. Valid records are decoded and broadcast using ZeroMQ to any and all interested clients running on the cluster.

Clients that currently (or will soon) subscribe to the stream of decoded, validated events include:
 * json2sql-db1047, which writes all events into a MySQL database (db1047) on Vanadium;
 * mongo.py, which writes all events into a MongoDB database;
 * a mobile client that is generating real-time metrics about mobile app usage;
 * a Hadoop client that writes the data into Kraken / HDFS;
 * … your client here!

If you would like access to Vanadium, please speak to Ori or Dario.

MySQL
Because many of our existing analytical tools are designed to work with MySQL, and because the user and transaction data generated by MediaWiki are stored in MySQL tables, using MySQL for EL data is often useful. EventLogging data are currently stored in MySQL on the s1-analytics server.

The json2sql client, which writes EL events to MySQL, subscribes to the stream of valid JSON broadcast over ZeroMQ. When json2sql receives an event record, the client checks to see if a MySQL table exists for the data. If so, json2sql places the record in the existing table. If no table exists, json2sql generates one automatically, using the schema from Meta-Wiki to construct a SQL statement instructing the database to create a table that has appropriate columns.

Once the table has been created (or if the table exists already), json2sql issues a SQL statement instructing the database to insert the event as a new record in the table.
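The create-then-insert behavior can be sketched in Python as follows. This is not the json2sql code: it uses an in-memory SQLite database as a stand-in for MySQL, the mapping from schema types to column types is an assumption, capsule fields are left out, and the helper names are hypothetical.

```python
import sqlite3

# Assumed mapping from schema types to SQL column types; the mapping used
# by the real json2sql client may differ.
SQL_TYPES = {
    "boolean": "BOOLEAN",
    "integer": "INTEGER",
    "number": "REAL",
    "string": "TEXT",
}


def create_table_sql(table, schema):
    """Build a CREATE TABLE statement with one column per schema property."""
    columns = ", ".join(
        "{} {}".format(name, SQL_TYPES[spec["type"]])
        for name, spec in sorted(schema["properties"].items())
    )
    return "CREATE TABLE IF NOT EXISTS {} ({})".format(table, columns)


def insert_event(conn, table, schema, event):
    """Create the table if it is missing, then insert the event as a row."""
    conn.execute(create_table_sql(table, schema))
    names = sorted(event)
    # Table and column names come from the trusted schema; the event values
    # are passed as bound parameters.
    statement = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(names), ", ".join("?" for _ in names)
    )
    conn.execute(statement, [event[name] for name in names])


schema = {"properties": {"wasClicked": {"type": "boolean"}}}
conn = sqlite3.connect(":memory:")  # stand-in for the MySQL data store
insert_event(conn, "BannerImpression_5329872", schema, {"wasClicked": False})
rows = conn.execute("SELECT wasClicked FROM BannerImpression_5329872").fetchall()
print(create_table_sql("BannerImpression_5329872", schema))
print(rows)
```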

See Analyzing EventLogging data for more information about working with EventLogging data.

Installing the EventLogging extension
Please see Extension:EventLogging for information about downloading and configuring EventLogging and setting up a developer environment.

Installing the EventLogging devserver
The EventLogging devserver outputs captured event records to a terminal window (instead of to a data store). This tool is often useful for monitoring event records during schema development.

To get the EventLogging devserver working locally:


 * 1) Download the EventLogging extension from the MediaWiki source code repository (see Extension:EventLogging for instructions).
 * 2) Go to the 'server/' sub-folder.
 * 3) Run 'python setup.py install'.
 * 4) After the install has finished, run 'eventlogging-devserver'.
 * 5) First-time users should copy the configuration values output by the EventLogging devserver and place them in the LocalSettings.php file.

If the installation is successful, running 'eventlogging-devserver' prints its startup output, including the configuration values mentioned in step 5.

Now when you log events, the EventLogging devserver will output the URL-encoded JSON sent by the browser (under the heading 'request' ), the decoded event record (under the heading 'event') and whether or not the event validates against the schema (under the heading ‘validation’):

 -- request ---
 ?%7B%22event%22%3A%7B%22userAgent%22%3A%22Mozilla%2F5.0%20(Macintosh%3B%20Intel%20Mac%20OS%20X%2010_8_2)%20AppleWebKit%2F537.33%20(KHTML%2C%20like%20Gecko)%20Chrome%2F27.0.1431.0%20Safari%2F537.33%22%2C%22isHttps%22%3Afalse%2C%22isAnon%22%3Atrue%2C%22waiting%22%3A485%2C%22receiving%22%3A1%2C%22rendering%22%3A3536%2C%22pageId%22%3A1%2C%22revId%22%3A153%2C%22action%22%3A%22view%22%7D%2C%22clientValidated%22%3Atrue%2C%22revision%22%3A5323808%2C%22schema%22%3A%22NavigationTiming%22%2C%22webHost%22%3A%22127.0.0.1%22%2C%22wiki%22%3A%22testwiki%22%7D; 1.0.0.127.in-addr.arpa 0 2013-03-18T20:13:04 127.0.0.1
 -- event -
 {
     "clientIp": "8cb1d219fef9bf92b556c59951381e24c40a7aa7",
     "clientValidated": true,
     "event": {
         "action": "view",
         "isAnon": true,
         "isHttps": false,
         "pageId": 1,
         "receiving": 1,
         "rendering": 3536,
         "revId": 153,
         "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1431.0 Safari/537.33",
         "waiting": 485
     },
     "recvFrom": "1.0.0.127.in-addr.arpa",
     "revision": 5323808,
     "schema": "NavigationTiming",
     "seqId": 0,
     "timestamp": 1363637584,
     "webHost": "127.0.0.1",
     "wiki": "testwiki"
 }
 -- validation
 Valid.
 --

About this document
This document is intended to capture the current best practices for using EventLogging. We encourage questions, comments, suggestions and concerns as part of the ongoing process of identifying how to most effectively use the system.

Events and schemas briefly defined
A couple of terms to know:

Events: An event is a record of a user action on the site, such as previewing an edit, or collapsing the "Toolbox" section in the sidebar. Though often triggered by a click, an event can represent any item of interest (the state of an application or the presence of a notification, for example) and need not involve a click at all. For example, EventLogging can capture timing information, such as the time it takes the browser to fetch an image, directly from the browser itself. We capture and analyze event data in aggregate to better understand how readers and editors interact with our site, to identify usability problems, and to provide feedback for features engineers.

Schemas: A schema describes the structure of an event record by enumerating properties and specifying constraints on types of values.

Using EventLogging: The workflow
The EventLogging workflow is a flexible process that facilitates asynchronous collaboration among contributors. Analysts, engineers, product managers, and others can work in parallel, implementing the code required to grab the data, and/or contributing to the process of schema refinement at any stage of the process.

The schema file provides a centralized place for all development. Though conversations about how to specify or collect data may occur in person or over email, the dialogue will ultimately be reflected in the schema itself and, if best practices are followed, in its documentation.

By clearly defining data, the schema helps all users understand which data need to be collected (pageTitle or pageID?) and, later, what the collected data represent (new users, or new users served an experimental treatment?). Data definitions help minimize error and ensure that the correct information is captured.

Posing a question
Though questions about how users are interacting with the site are most often posed by a product manager or UI developer, questions can be posed by anyone. One of the benefits of EventLogging is that the tool can be used by anyone—both within and outside of the Wikimedia Foundation—to initiate an inquiry.

A question may require an experiment, or not. For example, a product manager might be interested in seeing how users, or a subset of users, currently interact with the site. ‘How well do Talk pages work for new users?’ or ‘How many new users successfully complete an edit within 24 hours?’ are questions that require only that the schema capture usage patterns that can be analyzed to provide an answer.

Other questions imply an experiment. For example, “Which Create Account button design is most effective?” or “Which onboarding experience better helps people edit?” In these cases, the inquiry would involve an experiment, and possibly multiple experimental iterations. We talk more about working with schemas and experiments in later sections.

Once you have identified your question, get the ball rolling by creating a Schema page, where the process can continue. See Creating a schema wiki page for more information.

Identifying metrics
Once you have identified your question, it’s time to start thinking about the metrics that can answer it. When thinking about metrics, make sure to clearly define each one, and to limit the measurements to only those necessary for answering the posed question.

For example, to answer the question, “Which onboarding experience better helps people edit?” we would need to capture the number of users that are exposed to each onboarding experience, and the number that subsequently edit successfully.

Now is the time to even further refine these desired metrics. For example, the users we are interested in are newly registered users (not anonymous users, or users who registered last week). Successful editing means completing one edit (or five edits? or ten?) successfully within 24 hours (or 24 days? or 24 minutes?). The more clearly defined the desired metrics, the more obvious their implementation becomes.
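Writing a metric down as code is one way to check that its definition is complete. The sketch below computes, for each onboarding experience, the share of newly registered users who make at least one edit within 24 hours of registering. The function name, the record layouts, and the sample data are all hypothetical; the threshold values are exactly the choices the paragraph above asks you to pin down.

```python
from datetime import datetime, timedelta

# The time window is part of the metric definition, so it is named explicitly.
WINDOW = timedelta(hours=24)


def conversion_by_experience(users, edits):
    """Return {experience: share of users with an edit inside WINDOW}.

    users: list of (user_id, experience, registered_at)
    edits: list of (user_id, edited_at)
    """
    registered = {uid: (experience, at) for uid, experience, at in users}
    converted = set()
    for uid, edited_at in edits:
        if uid in registered:
            delta = edited_at - registered[uid][1]
            if timedelta(0) <= delta <= WINDOW:
                converted.add(uid)
    exposed, successes = {}, {}
    for uid, (experience, _) in registered.items():
        exposed[experience] = exposed.get(experience, 0) + 1
        if uid in converted:
            successes[experience] = successes.get(experience, 0) + 1
    return {exp: successes.get(exp, 0) / count for exp, count in exposed.items()}


# Made-up sample data, for illustration only.
t0 = datetime(2013, 3, 18, 12, 0)
users = [(1, "A", t0), (2, "A", t0), (3, "B", t0)]
edits = [
    (1, t0 + timedelta(hours=2)),    # inside the window
    (2, t0 + timedelta(hours=30)),   # too late to count
    (3, t0 + timedelta(minutes=5)),  # inside the window
]
result = conversion_by_experience(users, edits)
print(result)
```

Changing WINDOW or the minimum number of edits changes the answer, which is why these choices belong in the schema's documentation rather than in an analyst's head.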

Although you may feel the impulse to collect additional information because it seems interesting or exciting, resist the temptation. Collecting unnecessary data dilutes the focus and meaning of the schema, and adds implementation cost.

If you have any questions about identifying appropriate metrics, please contact an analyst.

Drafting a schema
The schema is a JSON schema object collaboratively edited by the analysts, engineers, and product managers contributing to the project. Over time, and as each of the developing parties contributes expertise, the schema begins to more precisely define the data necessary for each analysis.

As the schema is being developed, engineers work on developing the code required to grab the specified data. At any point in the schema-drafting process, the schema can be deployed locally to test the implementation. Use the EventLogging devserver to print collected data to a terminal window.

For more information about schemas, please see Creating a schema.

Peer review
As the schema is developed, it benefits from the expertise of each of the contributing parties. Analysts, engineers, and product managers each bring their perspectives and knowledge to the schema, helping to ensure that it is viable and sound.

An analyst makes sure that the specified metrics are, in fact, appropriate and required for answering the posed question. For example, an early version of the GettingStarted schema specified that the system log the protection-level of editable pages. Although this information is interesting, it is not directly relevant to the original question (Which onboarding experience better helps people edit?), and the analyst knew to remove the property from the schema.

Engineers are experts on implementation cost, and can see which data will be easy to collect and which might require extensive processing, or even take down the website were the system to attempt to capture them. Engineers know which metrics can be collected reliably, and which not, and can make recommendations accordingly.

The product manager, who tracks how much time everyone puts into the project, has a good sense for when the cost of an engineering effort outweighs its value. The product manager will weigh in if an analysis, however interesting, is outside the needs of the organization.

Finalizing a schema
Before a schema can be deployed, it must get a final review from an analyst, who ensures that it does not violate the Wikimedia Foundation’s privacy policies.

Deploying a schema
Once a schema has been completed to the satisfaction of its contributors and approved by an analyst, production code that logs to it can be deployed.

The Underlying technology section summarizes the code used to log to a schema. The schema's wiki page helps by providing actual code that loads the schema and logs to it. To view it, click the red ‘<>’ button at the top-right corner of the schema page; sample code is provided for both PHP and JavaScript.

You'll need to fill in the event with appropriate values for your schema's fields. Logging an event will transmit collected data to a central area where it is warehoused.

Note that the schema's wiki page does not automatically indicate whether it is in use. We recommend that you note this information on the schema’s Talk page (see Collaboration: Schema Talk pages for more information). Deployed schemas will automatically create a MySQL table (named SchemaName_versionNumber) for the collected data on the data store.

QA
EventLogging uses both client- and server-side validation to help ensure that the data collected are the correct ones. Although machine validation captures many errors, the data must also be reviewed by a human before they are deemed sound.

If client-side validation fails, it displays a warning in the browser's JavaScript console (if it is open), immediately alerting developers that a schema is not working properly.

Server-side validation occurs on the server where the data are ultimately stored. EventLogging uses a tiered model of data handling:

Data that meet the requirements specified in the schema are unpacked and broadcast so that they are available to subscribers. Valid data are automatically stored in a MySQL table, where they can be easily queried and intersected with other data sets.

Data that do not conform to the schema are logged, but only in a raw bin.
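The two tiers described above can be sketched as a small routing function. This is an illustrative Python model, not the EventLogging implementation: the names route, subscribers, and invalid_log are hypothetical, and plain lists stand in for the ZeroMQ broadcast and the log of invalid records.

```python
def route(records, is_valid, subscribers, invalid_log):
    """Send valid records to every subscriber; set invalid ones aside."""
    for record in records:
        if is_valid(record):
            # Every subscriber receives its own delivery of each valid record.
            for deliver in subscribers:
                deliver(record)
        else:
            # Invalid records are kept, but only in the raw log.
            invalid_log.append(record)


stored, invalid = [], []
route(
    [{"wasClicked": False}, {"wasClicked": "maybe"}],
    is_valid=lambda record: isinstance(record.get("wasClicked"), bool),
    subscribers=[stored.append],
    invalid_log=invalid,
)
print(len(stored), len(invalid))
```

Keeping invalid records rather than discarding them makes it possible to diagnose implementation bugs after the fact.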

Analysts also examine the data to catch the errors machines cannot. A value can conform to the letter of the schema (which specifies that it have a string type), but be meaningless (a string of gibberish). If possible, unit tests are used to test the soundness of the data, though sometimes these are difficult to implement.

Because analysts are familiar with the known patterns of human activity on the site, they are able to determine if the collected data are unrealistic and indicate an error in implementation or assumption. Analysts will be able to identify edge cases—the times when the data are unexpected, either because users have behaved in a way that was not anticipated, or because of a glitch in an experiment. Sometimes, these edge cases can simply be flagged; other times, they require a change in implementation. The analyst will decide how best to respond to the inconsistencies.

Analysis
Once the data have been collected and validated, they are ready to be analyzed. We are currently developing tools that will permit users to easily generate reports for funnel analyses and visualizations. We are also working on tools that will generate statistical tests, so that users can determine if an observed difference is significant, or random.

Creating a schema
Schemas describe data in a human- and machine-readable way. They specify the names and data types (integer, string, boolean, etc.) of event data, and are used to clarify, understand, and validate information. Every schema includes documentation about the referenced data so that all users can understand what a given data set contains, both before and long after the data are collected.

Schemas represent a new type of MediaWiki content, JSON schema objects, which adhere to the JSON schema specification, version 3. Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be edited by all autoconfirmed users. The Talk page attached to each schema is unprotected and is used both to share information about the schema and its status, and as a forum for public discussion. Passers-by can, and are encouraged to, make good-faith improvements to schemas (improving the wording of a property description, for example). However, just because a passer-by can alter a schema does not mean that all changes are appropriate. Each schema requires much care and coordination to design, and drastically altering an existing schema to meet the needs of a different project can be disruptive.

It is important to note that schemas do not automatically generate data. Schemas are a definition of data and can be used to validate and describe data, but an engineer must still programmatically grab the event data. We encourage you to think of the schema as a contract between analysts, developers, and product managers. This contract, collaboratively developed, makes the choice of what data to collect explicit (titleID? Or pageTitle?) to minimize confusion both when implementing the model and, later, when analyzing the data that conform to it.

Creating a schema wiki page
Schemas are maintained in the ‘Schema:’ namespace on Meta-Wiki, and can be created and edited by all autoconfirmed users.

The schema page name is the name of the schema. It determines things like the name of its module in JavaScript and the name of the SQL table holding processed events, so choose carefully. When creating a schema wiki page, please adhere to the following naming conventions:


 * Use CamelCase (e.g., GettingStarted, Echo, AccountCreation)
 * No spaces

Once you have created the schema wiki page, edit its Talk ("Discussion") page and add the schema documentation template. This template identifies a contact person, information about the project, and the status of the schema itself. Specifying this information in a template automatically categorizes schemas and makes it easy to sort schemas by the template's parameters.

Using JSON schema syntax
Schemas are JSON schema objects, which define the required and optional properties of captured event data. The quickest way to get a sense for the syntax is to look at some existing schemas. When you first look at a schema, the human-readable version is displayed. Click the Edit tab to view and/or update the JSON schema code itself. You will notice a schema description followed by an array of schema properties.

Note that currently, only the JSON schema features most relevant to EventLogging have been implemented. These include:
 * description
 * type: boolean, integer, number, string, timestamp
 * required: true/false
 * enum
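To show how these features interact, here is a minimal validator sketch in Python. It is not EventLogging's validator: the validate function and the sample schema (including the action values) are hypothetical, and the timestamp type is left out because this guide does not say how it is represented.

```python
# Checks for the schema types listed above. bool is excluded from the
# numeric checks because Python treats bool as a subclass of int.
TYPE_CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def validate(event, schema):
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        if name not in event:
            if spec.get("required", False):
                problems.append("missing required property: " + name)
            continue
        value = event[name]
        check = TYPE_CHECKS.get(spec.get("type"))
        if check is not None and not check(value):
            problems.append("{}: expected {}".format(name, spec["type"]))
        if "enum" in spec and value not in spec["enum"]:
            problems.append("{}: not one of {}".format(name, spec["enum"]))
    return problems


# Hypothetical schema that uses description, type, required, and enum.
SCHEMA = {
    "description": "Steps in a hypothetical editing funnel.",
    "properties": {
        "action": {
            "type": "string",
            "required": True,
            "enum": ["impression", "click", "save"],
            "description": "Which funnel step was reached?",
        },
        "isAnon": {
            "type": "boolean",
            "required": True,
            "description": "Is the user anonymous?",
        },
    },
}

ok = validate({"action": "click", "isAnon": True}, SCHEMA)
bad = validate({"action": "hover"}, SCHEMA)
print(ok)
print(bad)
```

Returning a list of problems instead of a single true/false value makes it easier to see everything that is wrong with an event in one pass.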

Using descriptions
Schemas use JSON descriptions to clarify the purpose of the schema itself as well as each of its individual properties.

The description of the schema should be brief—a line or two about the data the schema captures:
 * TemplateDescription.png

More detailed information about the project (information about experimental conditions, for example) can be noted in the schema Talk page, and we encourage you to include additional context there.

The description of schema properties should focus on meaning, not implementation details. Although describing a property by the name of a corresponding variable might seem like a good idea, the meaning of the variable is only understood by a subset of schema users, and the description becomes meaningless if and when the variable name changes.

In the case of a value that is a boolean type, the property description is often best posed as a question:
 * SchemaIsAnon.png

Using enum
JSON enums are used when a value is required to be one of a known set of values. A good example of a situation for which we recommend the use of enum is a funnel, which captures information about a known flow of possible user actions:


 * SchemaEnum.png

For instances in which a value is required to be a single known value (e.g., that all users be newly registered), it is better to use a boolean type to describe the field:


 * SchemaBoolean.png

Useful tools

 * JSONLint is a web-based JSON validator and formatter. Use this to catch simple errors in your schema's syntax.
 * JSONschemaLint is a web-based tool that validates a JSON structure against a JSON schema. You can paste your schema into it and check whether it validates a sample event such as { "wasClicked": false } (note the quotation marks; JSON is stricter than a plain JavaScript object).
 * JSONschema.net will create a JSON schema from an existing JSON object. Note that this tool generates schema elements that are not currently implemented by EventLogging (e.g., "id").

Choosing which data to capture
One of the choices that is made and/or refined as a schema is developed is which specific data to capture. Some data are ‘expensive,’ requiring database queries and processing to obtain; other data are easy to capture, and so require fewer resources.

When thinking about your schema and which specific data to grab, a good place to start is the list of easily available information at Manual:Interface/JavaScript.

In addition to using low-cost data whenever possible, choose the most reliable data available. A pageID is more reliable than an articleTitle, for example, because a page can be renamed while its pageID stays constant. Page impressions are more reliable than page clicks, and userIDs are more reliable than userStrings, which are strings of characters (English, Chinese, etc.) that can be difficult to handle.

Choosing schema property names
EventLogging does not enforce a standard vocabulary, and you are welcome to create property names (i.e., schema keywords) that best describe the data you’d like to collect. That said, we strongly encourage you to standardize the property names used by your project and, where relevant, to reuse names that are common across the organization. For example, many analyses require information about users (e.g., whether they are newly registered or anonymous). When a consistent vocabulary is used to refer to these qualities (isNew, isAnon), data collected over time and with multiple schemas can be more easily compared and understood.

Currently, the best place to see how schema properties have been specified and used in the past is here.

When creating your own schema property names, please adhere to the following naming conventions:
 * use headlessCamelCase (e.g., pageTitle, userID, editCount)
 * no spaces

Editing a deployed schema
It’s perfectly fine to make edits to a deployed schema, and doing so will not compromise a currently running data collection job. When you update a schema, MediaWiki automatically gives it a new revision number, just like any other wiki page. Since event logging code always references the schema by name + revision, all current data-collection jobs will continue to point to the previous schema version, unless the code is explicitly updated to refer to the newer version. At that point the system will automatically create a new SQL table named SchemaName_revisionNNN.

Schemas and schema versions
When running an experiment—testing to see which of two user interfaces is more effective, for example—you will often run multiple iterations to test different factors and/or interfaces, or to correct an error in the original implementation. Though the schema itself might not change drastically, the experimental conditions could be dramatically different.

We recommend that you create a new schema file for experimental iterations that reflect a substantial change to the schema or to the experiment/experimental conditions. This way, the information that is specific to the iteration can be documented in the schema Talk page, and the new iterations will not be cluttered with legacy data.

The page names of new schemas created for iterations should include the name of the original schema, along with a brief notation that reflects the specific iteration:

 * GettingStarted
 * GettingStarted0B1
 * GettingStarted0B2
 * …

Small changes to the schema or its implementation do not necessarily warrant a new schema page, and can be identified with a schema property. For example, the GettingStarted schema uses a property called ExperimentID to identify minor bug fixes made over the course of a single experimental iteration. In this case, neither the schema nor the experiment changes, but the analyst still wishes to capture information about the implementation change in case it impacts the data. If you are unsure whether a schema edit calls for a new schema, please ask an analyst.

Collaboration
One of the strengths of EventLogging is that it facilitates collaboration via the schema file. Because many people are involved in creating a schema, it is also important that all parties be diligent about documenting their work. Documentation appears, or should appear, in each of the following places:

JSON descriptions
The JSON descriptions clarify both the purpose of the schema and the meaning of its individual values. Use clear, concise language in each description so that people developing the schema and referring to it in the future can understand the meaning.

Edit summary for Wiki schema edits
Each time you make a change to a schema file, please document what you have changed in the edit summary field. Providing this information helps other users understand how and why the file was changed.

Schema Talk pages
The Talk page associated with each schema is a place for free-form discussion and a good venue in which to raise questions and make comments about a schema. We also encourage schema authors to use a schema's Talk page to:
 * Provide additional details about the schema or the considerations that went into its design.
 * If the schema is used for an experiment, note details about the experimental conditions.
 * Log and justify major changes to a schema.

Put the schema status template at the top of each schema's Talk page (see sample). This template standardizes the presentation of the schema's status (draft, active, inactive, or deprecated) and the name of a contact person.

Analyzing EventLogging data
EventLogging captures event records and broadcasts them so that they are available to be processed and stored in the fashion that best meets your needs. Currently, data are stored in MySQL tables and MongoDB (in their native JSON format), but new clients can subscribe to the stream of EL records to process or store them in other ways. We are currently developing a Hadoop client, for example, and an existing mobile client is used to generate real-time metrics about mobile app usage.

Because many of our existing analytical tools are designed to work with MySQL tables, MySQL is often used for analyzing event records. Using MySQL for EventLogging records also permits them to be easily joined with the user and transaction records generated by MediaWiki, which are also stored in MySQL tables.
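
As a rough illustration of such a join, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The event table name (example_events) and its columns are hypothetical, and the user table is reduced to two columns; real EventLogging tables will differ.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL analytics server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the MediaWiki user table.
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    -- Hypothetical EventLogging table; name and columns are assumptions.
    CREATE TABLE example_events (event_userID INTEGER, event_action TEXT);
    INSERT INTO user VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO example_events VALUES (1, 'save'), (1, 'save'), (2, 'click');
""")

# Join event records to user records and count 'save' events per user name.
saves_per_user = conn.execute("""
    SELECT u.user_name, COUNT(*) AS saves
    FROM example_events AS e
    JOIN user AS u ON u.user_id = e.event_userID
    WHERE e.event_action = 'save'
    GROUP BY u.user_name
    ORDER BY u.user_name
""").fetchall()
```

The same SELECT statement should carry over to MySQL with little or no change, since it uses only standard join and aggregate syntax.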

The MySQL tables for EventLogging data are created on the s1-analytics server. These tables contain valid event records that can be crunched and analyzed in a variety of ways. In the following sections, we will look at strategies for approaching some common types of analyses (e.g., event counting, data visualization, funnel analyses, and cohort definition).

Counting events
Counting events—how many users click a particular link or view an experimental treatment, for example—is one of the most common types of EventLogging analyses. Counts can be either ‘raw’ or ‘unique.’

Raw counts are used when there is either no need to deduplicate records (e.g., an analyst is concerned only with the overall number of edits, not the number of editors), or when the records themselves are unique because of how they are defined (e.g., records are only collected for new users viewing a page for the first time).

Unique counts are used when records must be deduplicated (e.g., we are interested in counting only one event per unique userID) or if the analysis depends on additional MediaWiki or page request data. The way in which records are deduplicated depends on the nature of the data and what makes them meaningfully unique. In many cases, records will be deduplicated by unique userIDs, though sometimes, it may make more sense to deduplicate by IP address or tokens.
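
The difference between the two kinds of count can be sketched in a few lines of Python. The records and field names below are made up for illustration and do not correspond to a real schema.

```python
# Hypothetical event records; field names are illustrative only.
sample_events = [
    {"userID": 1, "action": "edit"},
    {"userID": 1, "action": "edit"},
    {"userID": 2, "action": "edit"},
    {"userID": 3, "action": "view"},
]

def raw_count(records, action):
    """Count every matching record, duplicates included."""
    return sum(1 for r in records if r["action"] == action)

def unique_count(records, action, key="userID"):
    """Count matching records once per distinct value of `key`."""
    return len({r[key] for r in records if r["action"] == action})

edits_raw = raw_count(sample_events, "edit")        # number of edits
edits_unique = unique_count(sample_events, "edit")  # number of editors
```

Passing a different `key` (a token or an IP address field, if the records carry one) changes what counts as a duplicate without changing the rest of the logic.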

Visualizing EventLogging data
Visually representing data so they can be quickly and more easily understood is another common type of EventLogging analysis. Typically, visualizations represent a time series, showing the number of raw or unique events that occur in a given time period (e.g., days or hours). In this section, we will look at how to take raw and unique event records from EventLogging and create a visual representation of them.
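
Before anything can be plotted, the records have to be grouped into time buckets. The following is a minimal sketch of that aggregation step only, assuming each record carries an ISO-formatted timestamp and a user ID; the sample data and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, userID) pairs.
timestamped_events = [
    ("2013-03-01T09:15:00", 1),
    ("2013-03-01T17:40:00", 1),
    ("2013-03-01T18:05:00", 2),
    ("2013-03-02T08:30:00", 2),
]

def daily_counts(records, unique=False):
    """Return {date: count}. With unique=True, each userID counts once per day."""
    rows = [(datetime.fromisoformat(ts).date(), uid) for ts, uid in records]
    if unique:
        rows = set(rows)  # drop repeated (day, userID) pairs
    return dict(Counter(day for day, _ in rows))

raw_series = daily_counts(timestamped_events)
unique_series = daily_counts(timestamped_events, unique=True)
```

The resulting dictionaries map each day to a count and can be handed to whichever plotting tool is available.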

Step-by-step instructions are coming from Dario. Tools for doing this will come later.

Funnel analysis
A funnel analysis, which provides information about the number of people who complete and fail to complete a defined activity flow (e.g., selecting a page, making an edit, and then successfully saving the edit), is often used to describe user activity and/or to compare a test group to a control group. For example, a funnel analysis might be used to compare users presented with a new UI to those interacting with the existing one to see which group is more likely to complete a process of interest.

A ‘descriptive’ funnel analysis looks at the number of users who enter a funnel and reports how many of them complete the funnel (e.g., by successfully saving an edit) or complete an intermediary step of the funnel (e.g., by opting to edit and then doing so successfully). Each completion rate (CMP) is specified as a ratio: the number of users who complete the funnel (or funnel step) divided by the total number of users who have an opportunity to do so. Note that the number of users entering each funnel step shrinks along the way, as users drop out instead of finishing the process (instead of saving an edit, users might navigate to another page, for example, or they might attempt to save an edit and fail). The proportion of users lost at each step of the process is called the ‘bounce rate’. The proportion of users who attempt each step of the process is called the ‘click-through rate’. The ‘conversion rate’ is the number of users who successfully complete the funnel divided by the total number of funnel impressions (which reflects both the users who enter the funnel and those who could have entered it but chose not to).
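
The arithmetic behind these ratios can be sketched as follows. The step names and counts are invented, and the function is an illustrative helper rather than part of EventLogging.

```python
# Hypothetical counts of distinct users reaching each stage of an edit funnel.
funnel_impressions = 1000  # users who were shown the funnel's entry point
funnel_steps = [
    ("click edit", 400),
    ("attempt save", 200),
    ("save succeeded", 150),
]

def funnel_report(impressions, steps):
    """Return per-step completion rates and the overall conversion rate."""
    report = []
    eligible = impressions  # users with an opportunity to complete the step
    for name, completed in steps:
        report.append({
            "step": name,
            "completion_rate": completed / eligible,
            "lost": eligible - completed,  # users who dropped out at this step
        })
        eligible = completed  # only these users can attempt the next step
    conversion_rate = steps[-1][1] / impressions
    return report, conversion_rate

step_report, conversion = funnel_report(funnel_impressions, funnel_steps)
```

With these numbers, the step completion rates are 0.4, 0.5, and 0.75, and the conversion rate is 0.15. The denominator shrinks at every step because only users who completed the previous step can attempt the next one.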

A ‘test’ funnel analysis compares two groups of users. For a test period, each user group interacts with a different version of a funnel (an existing UI and a test UI intended to improve performance, for example). At the end of the experiment, the EventLogging data are analyzed to see if the conversion rates for the two funnels differ, and if that difference is statistically significant. The funnel conversion rates can then be used to generate predictions (e.g., what would happen if the experimental UI were presented to all users, instead of just the test group, for the next six months, and how that performance would compare to the baseline also captured by the experiment).
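
This guide does not prescribe a particular significance test. One common choice for comparing two conversion rates is a two-proportion z-test, sketched below with invented counts; treat it as an illustration of the idea, not as the method used by any specific analysis.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    conv_a / n_a is the control funnel; conv_b / n_b is the test funnel.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical experiment: 150 of 1000 control users convert, 190 of 1000 test users.
z_score, p_value = two_proportion_z_test(150, 1000, 190, 1000)
```

A small p-value suggests the difference between the two funnels is unlikely to be due to chance alone. The normal approximation used here is only reasonable when both groups are fairly large.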

Defining a cohort from EventLogging data
Sometimes only a specific subset of users is of interest. For example, an outreach program might generate two hundred new users. In order to evaluate the success of the program, we would like to know how active those users are over time. To do this, we must identify the users of interest, measure their activity, and compare it to a baseline. For this, we use cohorts.

A cohort is a set of user IDs for users that share a trait, or a combination of traits, that are of interest. A cohort might consist of new users that joined in response to a particular campaign and that are (or are not) attached to another wiki. A cohort could consist only of active editors that joined at a specific time. A cohort could consist of users that were part of an experiment. A cohort can be defined by whatever common traits are interesting and relevant.
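
In code, a cohort is simply the set of user IDs that satisfy some condition. The sketch below assumes the relevant traits have already been gathered into one record per user; the records, field names, and campaign label are hypothetical.

```python
from datetime import date

# Hypothetical per-user records assembled from event and user data.
user_records = [
    {"userID": 101, "registered": date(2013, 2, 3), "campaign": "outreach", "edits": 12},
    {"userID": 102, "registered": date(2013, 2, 5), "campaign": "outreach", "edits": 0},
    {"userID": 103, "registered": date(2013, 2, 5), "campaign": None, "edits": 40},
]

def define_cohort(records, predicate):
    """Return the set of userIDs whose record satisfies `predicate`."""
    return {r["userID"] for r in records if predicate(r)}

# Everyone who joined through the campaign.
outreach_cohort = define_cohort(user_records, lambda u: u["campaign"] == "outreach")

# Campaign users who have also made at least five edits.
active_outreach = define_cohort(
    user_records, lambda u: u["campaign"] == "outreach" and u["edits"] >= 5
)
```

Because cohorts are plain sets, a baseline group can be built the same way and compared using ordinary set operations.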

Current cohorts reside in MySQL (db1047.prod.usertags and db1047.prod.usertags_meta) and are available for use with the Metrics API. New cohorts can be generated by following these steps:

Parsimony
Don't capture data that are not required to answer the question, or that you can easily obtain from the database, as grabbing them adds an extra implementation cost and affects the readability of the schema. An analyst will know which values are easier to reconstruct than to capture.

Be bold and prune
Don't be afraid to prune schema properties that are neither directly relevant to the question nor used to control or validate data.

Redundancy
Always keep parsimony and pruning in mind, but know that there are cases for which it is less costly to have some redundancy, or for which redundancy is required to help validate unreliable data. For example, click events are notoriously difficult to capture, and it is good practice to grab a targetTitle with each one to help validate the data. In other cases, a value (whether or not a user is new, for example) may prove costly to reconstruct from a database. In such cases, it is more efficient to simply add an isNew property to the schema as a control variable.

Focus on high-level description, not implementation details
Do not tie a description of a value to a current implementation. Obscure variable names are confusing to analysts and others not directly involved in implementation, and they become meaningless if the codebase changes and the variable name is changed.

Standardization
Whenever possible, use consistently named properties (e.g., always userID for a user ID). Don't reinvent a schema or format. Use the schema library if appropriate, or properties/meanings previously defined by your project or by others in the organization. Standardization helps make schemas more readable, and permits analysts to better intersect and analyze data.

Enum is your friend
Use enum to require that a value be one of an array of specified values. See Creating a schema: Using JSON schema syntax for an example.
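
To show the effect of enum, here is a hypothetical schema fragment written as a Python dict, together with a small helper that checks only the enum constraint. The property name and allowed values are invented, and the helper is not a full JSON Schema validator.

```python
# Hypothetical schema fragment; the property and its values are illustrative.
example_schema = {
    "properties": {
        "action": {
            "type": "string",
            "enum": ["impression", "click", "save"],
            "description": "The step of the funnel that the user reached.",
        }
    }
}

def enum_violations(event, schema):
    """Return the names of properties whose value is not in the schema's enum."""
    bad = []
    for name, spec in schema["properties"].items():
        if "enum" in spec and name in event and event[name] not in spec["enum"]:
            bad.append(name)
    return bad

ok_event = {"action": "click"}
bad_event = {"action": "clik"}  # a typo that the enum constraint catches
```

Here `enum_violations(ok_event, example_schema)` returns an empty list, while the misspelled value in `bad_event` is reported. Restricting a value to a fixed list keeps stray spellings out of the data before analysis begins.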

Know the tradeoffs
There can be more than one way to capture an event, and it's good to be aware of the tradeoffs involved for each way. For example, when logging users, you may rely on token, username, or userId. Tokens are unique and totally anonymized, but they cannot be joined with production databases to give you more information about your users; usernames can come with special characters and encoding that break in analysis; userIds are usually ideal, but sometimes you may need to filter out a large group of inconsistently named test/staff user accounts that you won't recognize by number, in which case username is the better option.

Schema library
Placeholder: a repository for standardized, best-practice, commonly used schema elements (e.g., funnels, buckets, user identification).

For now, see
 * Schemas category on Meta, which indexes the Talk pages of schemas in use on WMF wikis.