# Analytics/Wikimetrics/Help

Wikimetrics is a web application developed by the Wikimedia Foundation to measure user activity based on a set of standardized metrics. Using this system, a set of key metrics can be selected and applied to a cohort of users to measure their overall productivity. The system is designed for extensibility (creating new metrics, modifying metric parameters) and to support various types of cohort analysis and program evaluation in a user-friendly way. Reports are returned in both JSON and CSV format.

As of September 2013, Wikimetrics is used internally at the Wikimedia Foundation and by external customers, researchers, and community members. If you are interested in using Wikimetrics, you can try out the application here: http://metrics.wmflabs.org/.

The project home page is the main developer hub for the project, hosting updates and resources for developers and end users.

### Contributing to the project: code repository and bug reports

If you would like to contribute to the project, please see:

Code repository: https://phabricator.wikimedia.org/diffusion/ANWM/ (Internal git/gerrit repository)

https://github.com/wikimedia/analytics-wikimetrics (mirror on github)

Bug reports: Bugs and feature suggestions should be reported via Bugzilla.

Wikimetrics mailing list: https://lists.wikimedia.org/mailman/listinfo/wikimetrics

## Rationale

The Wikimetrics application grew out of a need to study data collected via user tagging, which is used to identify groups of users (i.e., ‘cohorts’) so that they can be studied collectively--all subjects of an experiment, for example, or all users who created accounts at an outreach event. The Wikimetrics application permits us to easily and efficiently generate reports that provide information about how a group of users behaves as a whole, for example, how quickly a group of users became productive editors, or how likely a group is to remain active over time.

A second important aim of the project is to develop a standardized set of metrics that permits everyone in the organization to have the same understanding of what we mean when we say “an editor has been retained” or “an editor is active.” Because the Wikimetrics application uses a standardized set of metrics, the reports generated by the system can be used together—either within a project (to compare an experimental group to a control group, for example) or across the organization (to give an overall sense of the productivity of various efforts). The system is designed to be both flexible and extensible, so that existing metrics can be customized as needed, and new metrics can be added over time.

A third aim of the project is to provide an intuitive workflow that can be used by any internal team to access and analyze the data required to evaluate an initiative. Metrics can be easily retrieved via the application’s user-friendly form interface.

### User Tagging

The Wikimetrics application is designed to leverage the information assigned and stored via user tags, which permit us to permanently associate an arbitrary set of metadata (e.g., “subject of e3 experiment”) to a registered user of a specific project (e.g., "enwiki"). Tags are associated with a userId at the time of account creation or at the time a user undergoes a specific treatment or participates in a given initiative, and are stored in a repository where they can be accessed by the Wikimetrics application and used to generate cohorts. Once a tag has been assigned, it cannot be removed or changed.

User tags can represent any number of user attributes. Tags can identify users as experimental subjects, or users who have created accounts in response to either calls to action or outreach events, or users who are part of a specific program (e.g., Global Education). User tags do not reflect any data that conflicts with our privacy policy.

### Metrics Standardization

Standardized metrics can be applied to any Wikimedia project to help evaluate the impact of initiatives in an unambiguous and consistent way. The set of metrics used by the Wikimetrics application can be used at the project level to measure the success of an experimental treatment or outreach initiative, or on the organizational level to compare the impact of projects across the organization. In each case, the qualities of interest—user retention or user contribution (quality, quantity, type)--are measured consistently and clearly defined so that all users can see what the numbers mean.

For more background information, please see: http://meta.wikimedia.org/wiki/Research:Metrics

### Workflow

The Wikimetrics application is designed to streamline the process of obtaining and analyzing the data needed to evaluate projects and initiatives. Any authorized Wikimetrics user can use the system to generate reports about the behavior of users in the cohorts she or he created. Broadly, the workflow can be described as follows:

1. Define cohorts. Cohorts can be defined by specifying custom lists of usernames/userIds. Examples of cohorts:
• Users in E3 experimental group
• Students enrolled in a Global Education class
• New users registered on mobile devices.
2. Measure the quality, productivity, or retention of these cohorts via a standard set of metrics:
• revert rate: proportion of reverted edits within 24 hours of registration
• threshold: reached if a user makes 1 edit to the main namespace within 24 hours of registration
• blocks: number of times user blocked within 24 hours of registration
• … or other metrics.
3. Compare cohorts against each other or against a baseline.

## Definitions

### Cohort

A cohort is a set of users sharing one or more properties or attributes: the time of account creation, for example, or participation in an outreach event or experimental group. The users in a cohort can belong to the same wiki project, or to different projects (enwiki, arwiki, etc.). Examples of useful cohorts might be Wikipedia editors who participated in an outreach event, Wikimedia Commons users who are also active on other wikis, or users who underwent a particular treatment.

The Wikimetrics application generates cohorts based on user tag information. Each cohort is identified by a single user tag (e.g., “e3_experimental_group”). At this time, all cohorts are private; if you create a cohort by uploading a list of users via the application, for example, only you will have access to that information.

### Metric

Metrics are well-defined values or sets of values that can be computed for any user registered in Wikimedia projects, and are typically used in aggregate to compare different user groups (i.e., cohorts) against each other. The metrics computed by the Wikimetrics application help us understand user activity and behavior--from the quality, quantity and type of user contribution, to how well our editors are retained. For example, we could look at the value of the “bytes_added” metric to see how many bytes of content a student has added to a given wiki in the last week, but if we are interested in evaluating the success of her class, we would more likely look at the number of bytes added by the entire class (i.e., the “enwiki_editing_class” cohort). In this case, the bytes_added metric is used to help determine if the class is successful. We could look at additional metrics to provide a fuller picture: the revert rate of student edits, for example, or the survival rate of users in the student cohort. We can’t directly measure the class’s “success,” but we can measure a number of more concrete quantities that help us determine it and compare it with other classes or other similar initiatives.

All metrics are standardized and clearly defined so that we can easily understand what their values mean and consistently use the same standards to evaluate the efficacy of programs and initiatives over time. Note that metrics are dependent on the context in which they are measured and therefore only make sense in these contexts. An editor with a high revert rate could be a vandal, or an advanced user removing vandalized text. In the case of our class of new enwiki users, a high revert rate is more likely vandalism. The value of a metric returned for each user may be defined (e.g., “true” to indicate that a user reached a threshold of 1 edit in her first 24 hours of account activity) or undefined, which would be the case if a user has not been active for a full 24 hours, and we do not yet know if she will reach the threshold or not. Defined values may be of different types: Boolean (a true or false value indicating whether a threshold has been reached, for example), integer (e.g., edit count), or float (e.g., proportion of reverted edits to total edits).
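The distinction between defined and undefined values can be sketched in a few lines of Python. This is only an illustration of the threshold example above, not the Wikimetrics implementation: the function name `threshold_reached` and its parameters are hypothetical, and we assume that a value of `None` stands for "undefined".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


def threshold_reached(
    registration: datetime,
    edit_times: Sequence[datetime],
    now: datetime,
    min_edits: int = 1,
    window: timedelta = timedelta(hours=24),
) -> Optional[bool]:
    """Return True or False once the value is known, or None while undefined."""
    window_end = registration + window
    # Count only edits that fall inside the measurement window.
    edits_in_window = sum(1 for t in edit_times if registration <= t < window_end)
    if edits_in_window >= min_edits:
        return True  # threshold already reached, so the value is defined
    if now < window_end:
        return None  # window still open: we do not yet know the outcome
    return False  # window closed without enough edits


registered = datetime(2013, 9, 1, 12, 0, tzinfo=timezone.utc)
first_edit = registered + timedelta(hours=3)

print(threshold_reached(registered, [first_edit], now=registered + timedelta(hours=5)))
print(threshold_reached(registered, [], now=registered + timedelta(hours=5)))
print(threshold_reached(registered, [], now=registered + timedelta(days=2)))
```

In this sketch the value becomes defined as soon as the threshold is met, even if the 24 hours have not yet elapsed; whether Wikimetrics reports an early `true` in that situation is an assumption on our part.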

The value of a metric may change over time. As a user makes additional edits, for example, the size of his contribution changes and the value of metrics such as ‘bytes_added’ will change accordingly. However, once the time over which a given metric is defined has elapsed (e.g., the first week after registration), the metric should always return the same value.

The set of metrics supported by the Wikimetrics application is in no way exhaustive. The system has been designed to be easily extensible, so that new metrics can be added and parameterized in different ways. Metrics are easy to implement if you develop in Python, or if you can show that a new type of measurement might be useful. In the latter case, either the analytics team or community members are likely to help you implement the new metric. To contribute code, please have a look at our repository.

### Reports

The Wikimetrics application returns information in the form of a report. Reports contain the values of a selected metric for a specified cohort, as well as the settings used to generate the data. For example, a report might contain the number of new pages created by each member of a cohort over a two week period. The name of the cohort, metric, as well as the start and end date of the time interval will be included with the retrieved information.

Reports are available as either JSON or a CSV file. They are available for thirty days after generation, and can be accessed from the reports page (you must be logged in). If you would like to keep a report for longer than thirty days, please download and save it.


## Wikimetrics

The Wikimetrics application allows users to easily run reports that contain information about how groups of users are interacting with the Wikimedia site. Metrics can be generated for any predefined group of users (i.e., cohort), and reports are returned as either formatted JSON or as a CSV file. Reports can be accessed from the Report queue (you must be logged in to access the screen), where they are archived for thirty days.

In this section, we will look more closely at how to create a report—how to select a cohort, how to use the available metrics and their parameters, and how to configure and understand the report output.

### Overview of workflow

Whether you are running an experiment and are interested in looking at the behavior of a test and control group, or heading an outreach program and are curious about how effective the program is in retaining productive users, the Wikimetrics application can be used to gather data that sheds light on how the relevant users are interacting with the Wikimedia site.

The entire workflow is carried out in the Wikimetrics application’s easy-to-use form interface:

1. Create an analysis report
2. Create and/or select a cohort
3. Select a metric and customize its parameters
4. Configure the report output
5. Run the report

When a report is submitted, the Wikimetrics application validates and processes it. The system will return the requested data as either a CSV file or formatted JSON once the data has been extracted from the relevant databases. Reports, like cohorts, are private, and may only be viewed (at least in the context of the application) by the user who generates them.

### Accessing Wikimetrics

Anyone can access and start using Wikimetrics by logging in here: http://metrics.wmflabs.org/. Wikimetrics uses OAuth to authenticate users, and you may choose to log in to the application with any Google account. Down the line, we will support logging in via Twitter accounts as well.

#### Staging

We deploy features to our staging area before we deploy them to production. The staging server runs the same software as production, plus features we have recently developed. The staging server can be found here: https://metrics-staging.wmflabs.org/. Please keep in mind that data on staging may be wiped from time to time.

### Defining and selecting cohorts

Cohorts are private, meaning that they are created by each user and available only to that person. Only the cohorts you create will be available to you.

New cohorts can easily be uploaded to the system via the cohort upload functionality, and your existing cohorts will be listed in the Wikimetrics interface, where they can be easily selected and used when running reports. All currently available (to you) cohorts appear on the Wikimetrics home page, the cohorts screen, and the Create Analysis Report screen.

Each cohort is identified by a single user tag, and information about it is stored in the UserTag repository. In addition, each cohort has a unique numerical Id as well as a human-legible name, which is displayed in the Wikimetrics interface. Additional information about each cohort—a description, as well as the name of the person using it, for example—can be found in the UserTag repository.

If you would like to add a new cohort of users, you can do so via the Wikimetrics application’s cohort upload feature. Cohorts may consist of users of a single project (e.g., a list of enwiki users) or of multiple projects (e.g., a list of users of either enwiki or arwiki projects).

Currently, metrics can only be generated for a single project at a time, but future functionality will support multiple projects.

To add a cohort of users:

1. Create a CSV file that includes the usernames (or userIds) of cohort members, one user per line. e.g.,
13234584
18487945
If your cohort consists of users of multiple projects, create a CSV file that includes the usernames (or userIds) of cohort members and the name of the project (e.g., ‘enwiki’ or 'mediawikiwiki') to which each belongs. Username and project should be separated by a comma, one user/project per line. e.g.,
13234584, enwiki
835346, mediawikiwiki
2. Navigate to the “My Cohorts” screen, then click the “Upload Cohort” button.
3. Enter a cohort name. The cohort name can contain the following ASCII characters: A-Z, a-z, numbers, hyphens (-), and underscores (_). The Wikimetrics application will automatically confirm that the name is unique. If a cohort with the specified name exists already, you will be prompted to choose another name.
4. Add a description of the cohort. The description will be saved in the cohort repository.
5. Select a default wiki project (e.g., enwiki). If your cohort includes users from multiple projects, leave this field empty.
6. Click the “Choose File” button beside the CSV File setting to select the CSV file containing the list of users created in step 1.
7. Click the “Upload CSV” button. Wikimetrics will then validate the userIds. Invalid users will be flagged and displayed in an “Invalid Users” tab, where they can be reviewed. Note that invalid users will not be included in the new cohort. You may choose to upload a new CSV file, or to review the valid users by opening the “Valid Users” tab.
8. Under the “Valid Users” tab, click the “Upload Only These Valid Users” button to create a cohort of those users. Once you have created the cohort, you cannot further edit it via the Wikimetrics interface. If you made an error, you must create a new cohort.

The new cohort will appear wherever your cohorts are listed (Wikimetrics home page, cohorts page, etc.). It is immediately available for use.
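If you prepare cohort files with a script, steps 1 and 3 above can be sketched as follows. This is a hedged illustration only: the helper names `is_valid_cohort_name` and `build_cohort_csv` are hypothetical and not part of Wikimetrics, and the final validation is always performed by the application after upload.

```python
import csv
import io
import re

# Characters allowed in a cohort name, as listed in step 3:
# A-Z, a-z, numbers, hyphens, and underscores.
COHORT_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_cohort_name(name: str) -> bool:
    """Check a cohort name locally before entering it in the upload form."""
    return COHORT_NAME_PATTERN.fullmatch(name) is not None


def build_cohort_csv(members) -> str:
    """Build CSV text from (user, project) pairs, one member per line.

    Pass None as the project for single-project cohorts.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for user, project in members:
        if project is None:
            writer.writerow([user])
        else:
            writer.writerow([user, project])
    return buffer.getvalue()


print(is_valid_cohort_name("outreach_event"))
print(build_cohort_csv([("13234584", "enwiki"), ("835346", "mediawikiwiki")]))
```

Note that this sketch writes `13234584,enwiki` without a space after the comma, and that Python's `csv` module will quote any username containing a comma. We assume, but have not confirmed, that the upload form accepts both.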

#### Combining multiple cohorts

Wikimetrics does not currently support ‘cohort combinations’--the ability to generate a single report from the union (e.g., all users who appear in either of two cohorts) or intersection (e.g., only those users who appear in two cohorts) of multiple cohorts. You may select multiple cohorts when you create a report; the Wikimetrics application will then generate a report based on the selected metric and output settings for each cohort.

We anticipate that this functionality will be added in the future. The documentation will be updated at that time.
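Until then, one possible workaround is to combine the member lists yourself before uploading the result as a new cohort. The sketch below assumes each cohort is available locally as a list of (user, project) pairs; the function name `combine_cohorts` is hypothetical.

```python
def combine_cohorts(first, second, mode="union"):
    """Combine two cohorts given as iterables of (user, project) pairs."""
    a, b = set(first), set(second)
    if mode == "union":
        combined = a | b  # users who appear in either cohort
    elif mode == "intersection":
        combined = a & b  # only users who appear in both cohorts
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    # Sort so that the resulting file is stable between runs.
    return sorted(combined)


outreach = [("13234584", "enwiki"), ("18487945", "enwiki")]
experiment = [("18487945", "enwiki"), ("835346", "mediawikiwiki")]

print(combine_cohorts(outreach, experiment, mode="union"))
print(combine_cohorts(outreach, experiment, mode="intersection"))
```

The combined list can then be written to a CSV file and uploaded as described in the steps for adding a cohort.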


#### Tagging a cohort

Wikimetrics is planning a feature that will allow users to tag a cohort so similar cohorts can be easily searched. Tagging will rely on users to tag their own cohorts.

### Creating an Analysis Report

The Wikimetrics application returns information in the form of a report. Reports contain the values of a selected metric for a specified cohort, as well as the settings used to generate the data. To create a new report, log in to the Wikimetrics application and click the ‘Analyze’ button.

In order to create a report, you must:

1. Select a cohort
2. Select and configure a metric
3. Configure the output

The Wikimetrics application will display a sample report based on the selected configurations, which you can review before clicking the ‘Run Report’ button to generate the report.

#### Selecting a cohort

All cohorts that you have uploaded to the Wikimetrics application appear in the ‘Pick Cohorts’ section of the Create Analysis Report interface. Check the box beside the name of the cohort(s) that you would like to generate reports with. Note that Wikimetrics does not currently combine cohorts; if you select multiple cohorts, the application will generate a separate report for each cohort.

#### Selecting and configuring metrics

All currently supported metrics appear in the ‘Pick Metrics’ section of the Create Analysis Report interface. To select and configure a metric, click its linked name and then check the box beside “Compute different aggregations…” Each metric has default settings, which will then be displayed.

#### Configuring output

The data Wikimetrics reports depends on the way the report output is configured by the user. The application can output a value for each user in the cohort, or one aggregate value for the entire group. The output configuration settings will be accessible once a cohort and metric have been selected and configured. For more information about output configurations, please see Understanding different output configurations.


### Understanding different output configurations

The data Wikimetrics reports depends on the way the report output is configured by the user. Once a cohort has been selected and a metric chosen and configured, you can open the output configuration settings by clicking the linked name of the metric that appears beneath the Configure Output heading on the Create Analysis Report screen.

The basic output types are:

• Individual Results: report the value of the metric for each user in the cohort. Users are identified by userId.
• Aggregate Results: report an aggregate value for the entire cohort (e.g., the total number of edits made by all cohort users). You must select one or more aggregators (Sum, Average, Standard Deviation).
• Time-series: ‘slice’ the specified time interval (the last month, for example) by day, week, hour, or whatever unit is most relevant to the analysis. An aggregate metric value is reported for each time slice.
• Single-user: PLACEHOLDER
• All-user: PLACEHOLDER

In this section, we will look at each type of report in more depth.
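As a rough illustration of the time-series option, the "slicing" of an interval can be sketched as below. This is not the Wikimetrics implementation: the helper names are hypothetical, and we assume half-open slices (start included, end excluded) with a final slice that may be shorter than the others.

```python
from datetime import datetime, timedelta, timezone


def time_slices(start, end, step):
    """Yield (slice_start, slice_end) pairs that cover the interval [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        slice_end = min(current + step, end)
        yield current, slice_end
        current = slice_end


def count_per_slice(event_times, start, end, step):
    """Count events in each slice, e.g. a cohort's edits per day."""
    return [
        (slice_start, sum(1 for t in event_times if slice_start <= t < slice_end))
        for slice_start, slice_end in time_slices(start, end, step)
    ]


start = datetime(2013, 9, 1, tzinfo=timezone.utc)
end = datetime(2013, 9, 4, tzinfo=timezone.utc)
edits = [
    datetime(2013, 9, 1, 10, 0, tzinfo=timezone.utc),
    datetime(2013, 9, 1, 18, 0, tzinfo=timezone.utc),
    datetime(2013, 9, 3, 9, 0, tzinfo=timezone.utc),
]

for slice_start, count in count_per_slice(edits, start, end, timedelta(days=1)):
    print(slice_start.date(), count)
```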

#### Individual Results

When the “Individual Results” output configuration is selected, Wikimetrics reports the value of the specified metric for each user in the cohort.

To create an “Individual Results” report, you must specify a cohort, select and configure a metric, and configure the output configuration to be “Individual Results.”

For example, if fifty users create accounts at an outreach event, and you’d like to see how many bytes of content each of those users has added or modified over the last month, you would:

1. Select the ‘outreach_event’ cohort. If this cohort does not yet exist, or is not yet known to Wikimetrics, it will have to be created and/or added to the system first.
2. Select the ‘Bytes Added’ metric and configure it to return values for the desired time interval.
3. Set the output configuration to “Individual Results,” which instructs Wikimetrics to return a value for each cohort member.

Report components:

```
R_r = {‘outreach_event’ cohort, ‘Bytes Added’ metric, ‘Individual Results’}
```

Report returned (abstract):

```
{
…

}
```

Sample Report (JSON format)

```
{
  "result": {
    "Individual Results": {
      "13234584": {
        "net_sum": 1525.0,
        "positive_only_sum": 1528.0,
        "absolute_sum": 1531.0,
        "negative_only_sum": -3.0
      },
      "18487945": {
        "net_sum": null,
        "positive_only_sum": null,
        "absolute_sum": null,
        "negative_only_sum": null
      }
    }
  }
}
```

The above data is excerpted from a report in JSON format, and reflects for each userId (e.g., ‘13234584’), the net bytes added (bytes added minus bytes removed, e.g., ‘1525.0’), the bytes added (‘1528.0’), the absolute sum of all bytes added (bytes added plus bytes removed, e.g., ‘1531.0’), and the bytes removed (‘-3.0’). The metric is measured over the time interval configured in the metric settings.

We look at full JSON reports in more depth in the Understanding the report section.
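If you post-process a downloaded JSON report, reading the per-user values might look like the sketch below. It assumes the nesting shown in the excerpt above ("result", then "Individual Results", then one entry per userId); the full report may contain additional fields, and the helper name `split_by_defined` is hypothetical. We also assume that a `null` value means no value could be computed for that user.

```python
import json

# The excerpt from this section, wrapped in braces so that it parses as JSON.
SAMPLE_REPORT = """
{
  "result": {
    "Individual Results": {
      "13234584": {
        "net_sum": 1525.0,
        "positive_only_sum": 1528.0,
        "absolute_sum": 1531.0,
        "negative_only_sum": -3.0
      },
      "18487945": {
        "net_sum": null,
        "positive_only_sum": null,
        "absolute_sum": null,
        "negative_only_sum": null
      }
    }
  }
}
"""


def split_by_defined(report_text, field="net_sum"):
    """Return ({user_id: value}, [user_ids with no value]) for one field."""
    per_user = json.loads(report_text)["result"]["Individual Results"]
    defined = {}
    undefined = []
    for user_id, sums in per_user.items():
        if sums[field] is None:  # JSON null is loaded as None
            undefined.append(user_id)
        else:
            defined[user_id] = sums[field]
    return defined, undefined


values, missing = split_by_defined(SAMPLE_REPORT)
print(values)
print(missing)
```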

#### Aggregate Results

When a report is configured to output ‘Aggregate Results’, the Wikimetrics application will generate a single aggregate result for the cohort. You must select at least one aggregator: Sum, Average, or Standard Deviation. If you select multiple aggregators, Wikimetrics will output an aggregate result for each of the selected aggregators.

| Aggregator | Value |
|---|---|
| Sum | Add the metric values returned for each cohort member and return that amount. Implemented for the Bytes Added, Edits, and Pages Created metrics. |
| Average | Average the metric values returned for each cohort member and return that amount. Implemented for the Bytes Added, Edits, and Pages Created metrics. |
| Standard Deviation | Return the standard deviation. A high standard deviation indicates variability among individual values; a lower deviation indicates that values are closer to the mean. |

To create an Aggregate report, you must:

1. specify a cohort
2. select and configure a metric
3. configure the output as “Aggregate Results”
4. select an aggregator.

For example, if we would like to look at the total number of pages created by all members of the ‘outreach_event’ cohort, we would specify:

Report components:

```
R_r = {‘outreach_event’ cohort, ‘Pages Created’ metric, ‘Aggregate Results’, aggregator=Sum}
```

Report returned (abstract):

```
{
‘Sum’: [sum of the number of pages created by all cohort members]
}
```

Sample Report (JSON format)

```
{
  "result": {
    "Sum": {
      "pages_created": 1.0
    }
  }
}
```

The above report is excerpted from a JSON report, and reflects the aggregated data for the entire cohort. We look at full JSON responses in more depth in the Understanding the report section.
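To make the three aggregators concrete, here is a minimal sketch of how they could be computed from individual values. Two details are assumptions rather than documented Wikimetrics behavior: users without a value (`None`) are skipped, and the population standard deviation is used. The function name `aggregate` is hypothetical.

```python
import statistics


def aggregate(values):
    """Compute Sum, Average, and Standard Deviation over the defined values."""
    defined = [v for v in values if v is not None]  # assumption: skip undefined
    if not defined:
        return {"Sum": None, "Average": None, "Standard Deviation": None}
    return {
        "Sum": sum(defined),
        "Average": statistics.mean(defined),
        # Assumption: population standard deviation (pstdev), not sample (stdev).
        "Standard Deviation": statistics.pstdev(defined),
    }


# Hypothetical per-user 'pages_created' values for a small cohort.
pages_created = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, None]
print(aggregate(pages_created))
```

If you need to reproduce a reported Standard Deviation exactly, compare both `statistics.pstdev` and `statistics.stdev` against a report you have generated.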


### Understanding the report

Wikimetrics returns reports in JSON format and as CSV files. In either case, the report contains the metric data as well as information about the report itself (i.e., the configurations used to generate it).

(Placeholder until JSON response finalized)

### Accessing reports

Reports are stored in Wikimetrics for thirty days and are available via links from the Current Report screen. If you would like to save a report for longer than 30 days, or to share a report with other users (Wikimetrics reports are available only to the user who generated them), please download the CSV file.


## Available metrics

Currently, Wikimetrics supports three standardized metrics that provide information about user retention, the volume of user contribution, the quality of the contribution, and the type of contribution.

Each metric has default settings that can be overridden with configurations set via the Create Analysis Report interface. In addition, each metric has a set of aggregators (e.g., ‘proportion’ or ‘sum’) that can be used with a request to return an aggregate value for a cohort. Note that by default, metrics do not reflect the activity stored in the archive tables (e.g., edits made to pages that have been deleted will not be included in the value of the bytes_added metric).

### Volume metrics

Volume metrics help measure the quantity of an editor’s wiki work: Bytes Added, Edits, Pages Created.

#### Bytes Added

By default, the Bytes Added metric returns an array of numerical values that reflect the amount of content an editor has added, removed, and modified within a given time period. The default time period is reflected in the settings that are displayed (and may be modified) when the metric is selected from the ‘Create Analysis Report’ interface and a user has checked the box beside “Compute different aggregations of the bytes contributed or removed from a mediawiki project.”

Users may configure the metric’s Start Date, End Date, and Namespaces parameters via the application interface. In addition, a user may choose to include Positive Only Sum, Negative Only Sum, Absolute Sum, Net Sum values in the report. By default, all four sums are included in the report, and each provides different insight into user behavior. Please see the table below for more information.

Currently, the following aggregators are implemented for this metric: Sum, Average

Default parameter values for the Bytes Added metric

| Parameter | Value |
|---|---|
| Namespace | The Namespace parameter specifies a category of pages (e.g., ‘Talk’ pages or ‘User’ pages). Note that each Namespace is identified by a numerical ID, which is how it is specified in the application. By default, the Namespace is ‘0’ (the main namespace). Examples of Namespaces can be found here: |
| Start Date/End Date | The Start and End Date parameters specify a time interval over which the metric will be measured. Use the date selector (the downward arrow at the far right of the field) to choose a date from a drop-down calendar, or enter a date manually (MM/DD/YYYY). The Wikimetrics application will check to ensure that specified dates are valid. To specify a single day, select the same date for both start and end. Dates and times are all expressed in UTC. |
| Positive Only Sum | The Positive Only Sum is the number of bytes a user added to the specified Namespace over the time interval. This value reflects the amount of new content added by the user. |
| Negative Only Sum | The Negative Only Sum is the number of bytes a user has removed from the Namespace over the time interval. This value reflects the amount of content a user has changed and/or deleted. If a user edits a page, for example, the Negative Only Sum value might reflect the amount of redundant or erroneous content removed. |
| Absolute Sum | The Absolute Sum is the sum of the number of bytes added and the number of bytes removed. The value reflects a user’s total impact on the content of the selected Namespace over the time period (e.g., both the new content added, and the existing content edited). |
| Net Sum | The Net Sum is the number of bytes added minus the number of bytes removed. For example, if a user creates a 400 word article, and then removes 350 words from it, the Net Sum would reflect the (much smaller) length of the final version. |
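The relationship between the four sums can be sketched as follows. This is an illustration under stated assumptions, not the Wikimetrics implementation: we assume each edit contributes a signed byte difference (positive when content grows, negative when it shrinks), and that the Negative Only Sum is reported as a negative number, as in the sample report where it is -3.0. The function name `bytes_added_sums` is hypothetical.

```python
def bytes_added_sums(deltas):
    """Compute the four Bytes Added sums from signed per-edit byte differences."""
    positive = sum(d for d in deltas if d > 0)  # bytes added
    negative = sum(d for d in deltas if d < 0)  # bytes removed, kept negative
    return {
        "net_sum": positive + negative,       # added minus removed
        "positive_only_sum": positive,
        "absolute_sum": positive - negative,  # added plus removed
        "negative_only_sum": negative,
    }


# One edit adding 1528 bytes and one edit removing 3 bytes.
print(bytes_added_sums([1528, -3]))
```

With these two edits the sketch yields a net sum of 1525 and an absolute sum of 1531, which matches the values for user 13234584 in the sample report.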

#### Edits

By default, the Edits metric returns the number of edits made by a user or cohort of users within a given time period. The default time period is reflected in the settings that are displayed (and may be modified) when the metric is selected from the ‘Create Analysis Report’ interface and a user has checked the box beside “Compute the number of edits in a specific namespace of a mediawiki project.”

Users may configure the metric’s Start Date, End Date, and Namespaces parameters via the application interface. Please see the table below for more information about these parameters.

Currently, the following aggregators are implemented for this metric: Sum, Average

Default parameter values for the Edits metric

| Parameter | Value |
|---|---|
| Namespace | The Namespace parameter specifies a category of pages (e.g., ‘Talk’ pages or ‘User’ pages). Note that each Namespace is identified by a numerical ID, which is how it is specified in the application. By default, the Namespace is ‘0’ (the main namespace). Examples of Namespaces can be found here: |
| Start Date/End Date | The Start and End Date parameters specify a time interval over which the metric will be measured. Use the date selector (the downward arrow at the far right of the field) to choose a date from a drop-down calendar, or enter a date manually (MM/DD/YYYY). The Wikimetrics application will check to ensure that specified dates are valid. To specify a single day, select the same date for both start and end. Dates and times are all expressed in UTC. |
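If you prepare date parameters in a script before entering them, the MM/DD/YYYY format and the UTC convention described above can be handled as in this sketch. The helper names are hypothetical, and the check shown here only mirrors what we assume the application's own validation does.

```python
from datetime import datetime, timezone


def parse_report_date(text):
    """Parse an MM/DD/YYYY date and mark it as UTC."""
    # strptime raises ValueError for impossible dates such as 02/30/2013.
    return datetime.strptime(text.strip(), "%m/%d/%Y").replace(tzinfo=timezone.utc)


def parse_interval(start_text, end_text):
    """Parse a Start Date/End Date pair; equal dates denote a single day."""
    start = parse_report_date(start_text)
    end = parse_report_date(end_text)
    if end < start:
        raise ValueError("End Date must not be earlier than Start Date")
    return start, end


start, end = parse_interval("09/01/2013", "09/30/2013")
print(start.isoformat(), end.isoformat())
```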

#### Pages Created

By default, the Pages Created metric returns the number of pages created by a user or cohort of users within a given time period. The default time period is reflected in the settings that are displayed (and may be modified) when the metric is selected from the ‘Create Analysis Report’ interface and a user has checked the box beside “Compute the number of edits in a specific namespace of a mediawiki project.”

Note that pages that were created and then deleted within the time interval will not be counted. By default, Wikimetrics does not query the archive tables where deleted pages are stored.

The Pages Created metric can also be used to count the number of files uploaded to Commons. Because a new page is created only the first time a file is uploaded, uploaded revisions to existing files will not be counted. To return a count of uploaded files, specify the File namespace (namespace=6) in the request.

Users may configure the metric’s Start Date, End Date, and Namespaces parameters via the application interface. Please see the table below for more information about these parameters.