Analytics/Wikimetrics/Help

About
Wikimetrics is a web application developed by the Wikimedia Foundation to measure user activity based on a set of standardized metrics. Using this system, a set of key metrics can be selected and applied to a cohort of users to measure their overall productivity. The system is designed for extensibility (creating new metrics, modifying metric parameters) and to support various types of cohort analysis and program evaluation in a user-friendly way. Reports are returned in both JSON and CSV format.

As of September 2013, Wikimetrics is used internally at the Wikimedia Foundation and by external customers, researchers, and community members. If you are interested in using Wikimetrics, you can try out the application here: http://metrics.wmflabs.org/.

Project and application home page
The project home page and Wikimetrics home page are here:

Project home page: https://www.mediawiki.org/wiki/Analytics/Wikimetrics

The project home page is the main developer hub for the project, hosting updates and resources for developers and end users.

Wikimetrics application home page: http://metrics.wmflabs.org/

The Wikimetrics home page is the public URL of the application. To use the application, log in using one of the supported services (e.g., a Google account). See Accessing Wikimetrics for more information.

Contributing to the project: code repository and bug reports
If you would like to contribute to the project, please see:

Code repository: http://git.wikimedia.org/summary/analytics%2Fwikimetrics.git (Internal git/gerrit repository)
 * https://github.com/wikimedia/analytics-wikimetrics (mirror on github)

Bug reports: Bugs and feature suggestions should be reported via Bugzilla.

https://bugzilla.wikimedia.org/buglist.cgi?component=Wikimetrics&product=Analytics

Contact us
To reach us and/or obtain help, please join our mailing list.

Wikimetrics mailing list: https://lists.wikimedia.org/mailman/listinfo/wikimetrics

Rationale
The Wikimetrics application grew out of a need to study data collected via user tagging, which is used to identify groups of users (i.e., ‘cohorts’) so that they can be studied collectively--all subjects of an experiment, for example, or all users who created accounts at an outreach event. The Wikimetrics application permits us to easily and efficiently generate reports that provide information about how a group of users behaves as a whole, for example, how quickly a group of users became productive editors, or how likely a group is to remain active over time.

A second important aim of the project is to develop a standardized set of metrics that permits everyone in the organization to have the same understanding of what we mean when we say “an editor has been retained” or “an editor is active.” Because the Wikimetrics application uses a standardized set of metrics, the reports generated by the system can be used together—either within a project (to compare an experimental group to a control group, for example) or across the organization (to give an overall sense of the productivity of various efforts). The system is designed to be both flexible and extensible, so that existing metrics can be customized as needed, and new metrics can be added over time.

A third aim of the project is to provide an intuitive workflow that can be used by any internal team to access and analyze the data required to evaluate an initiative. Metrics can be easily retrieved via the application’s user-friendly form interface.

User Tagging
The Wikimetrics application is designed to leverage the information assigned and stored via user tags, which permit us to permanently associate an arbitrary set of metadata (e.g., “subject of e3 experiment”) with a registered user of a specific project (e.g., "enwiki"). Tags are associated with a userId at the time of account creation or at the time a user undergoes a specific treatment or participates in a given initiative, and are stored in a repository where they can be accessed by the Wikimetrics application and used to generate cohorts. Once a tag has been assigned, it cannot be removed or changed.

User tags can represent any number of user attributes. Tags can identify users as experimental subjects, or users who have created accounts in response to either calls to action or outreach events, or users who are part of a specific program (e.g., Global Education). User tags do not reflect any data that conflicts with our privacy policy.

For more information about user tagging, please see: http://www.mediawiki.org/wiki/Usertagging.

Metrics Standardization
Standardized metrics can be applied to any Wikimedia project to help evaluate the impact of initiatives in an unambiguous and consistent way. The set of metrics used by the Wikimetrics application can be used at the project level to measure the success of an experimental treatment or outreach initiative, or at the organizational level to compare the impact of projects across the organization. In each case, the qualities of interest, such as user retention or user contribution (quality, quantity, type), are measured consistently and clearly defined so that all users can see what the numbers mean.

For more background information, please see: http://meta.wikimedia.org/wiki/Research:Metrics

Workflow
The Wikimetrics application is designed to streamline the process of obtaining and analyzing the data needed to evaluate projects and initiatives. Any authorized Wikimetrics user can use the system to generate reports, using cohorts she or he created. Broadly, the workflow can be described as follows:


 * 1) Define cohorts. Cohorts can be defined by specifying custom lists of usernames/userIds. Examples of cohorts:
    * Users in E3 experimental group
    * Students enrolled in a Global Education class
    * VisualEditor adopters
    * New users registered on mobile devices
 * 2) Measure the quality, productivity, or retention of these cohorts via a standard set of metrics:
    * revert rate: proportion of reverted edits within 24 hours of registration
    * threshold: reached if a user makes 1 edit to the main namespace within 24 hours of registration
    * blocks: number of times a user was blocked within 24 hours of registration
    * … or other metrics
 * 3) Compare cohorts against each other or against a baseline.
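As a rough illustration of the kind of metric listed above, the following Python sketch computes a revert rate over the first 24 hours after registration. It is not the Wikimetrics implementation: the function name `revert_rate` and the `(timestamp, was_reverted)` input shape are assumptions made for this example only.

```python
from datetime import datetime, timedelta


def revert_rate(registration, edits, window=timedelta(hours=24)):
    """Proportion of reverted edits within `window` of registration.

    `edits` is a list of (timestamp, was_reverted) pairs. This input shape
    is hypothetical and chosen only for illustration.
    Returns None when the user made no edits inside the window.
    """
    cutoff = registration + window
    in_window = [reverted for ts, reverted in edits if registration <= ts < cutoff]
    if not in_window:
        return None
    # True counts as 1 and False as 0, so sum() counts the reverted edits.
    return sum(in_window) / len(in_window)


registered = datetime(2013, 9, 1, 12, 0)
edits = [
    (datetime(2013, 9, 1, 13, 0), False),
    (datetime(2013, 9, 1, 18, 30), True),
    (datetime(2013, 9, 2, 9, 0), False),
    (datetime(2013, 9, 3, 9, 0), True),  # outside the 24-hour window, ignored
]
rate = revert_rate(registered, edits)
```

Here three edits fall inside the window and one of them was reverted, so `rate` is one third.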

Cohort
A cohort is a set of users sharing one or more properties or attributes, such as the time of account creation or participation in an outreach event or experimental group. The users in a cohort can belong to the same wiki project, or to different projects (enwiki, arwiki, etc.). Examples of useful cohorts might be Wikipedia editors who participated in an outreach event, Wikimedia Commons users who are also active on other wikis, or users who underwent a particular treatment.

The Wikimetrics application generates cohorts based on user tag information. Each cohort is identified by a single user tag (e.g., “e3_experimental_group”). At this time, all cohorts are private; if you create a cohort by uploading a list of users via the application, for example, only you will have access to that information.

Metric
Metrics are well-defined values or sets of values that can be computed for any user registered in Wikimedia projects, and are typically used in aggregate to compare different user groups (i.e., cohorts) against each other. The metrics computed by the Wikimetrics application help us understand user activity and behavior--from the quality, quantity and type of user contribution, to how well our editors are retained. For example, we could look at the value of the “bytes_added” metric to see how many bytes of content a student has added to a given wiki in the last week, but if we are interested in evaluating the success of her class, we would more likely look at the number of bytes added by the entire class (i.e., the “enwiki_editing_class” cohort). In this case, the bytes_added metric is used to help determine if the class is successful. We could look at additional metrics to provide a fuller picture: the revert rate of student edits, for example, or the survival rate of users in the student cohort. We can’t directly measure the class’s “success,” but we can measure a number of more concrete quantities that help us determine it and compare it with other classes or other similar initiatives.

All metrics are standardized and clearly defined so that we can easily understand what their values mean and consistently use the same standards to evaluate the efficacy of programs and initiatives over time. Note that metrics are dependent on the context in which they are measured and therefore only make sense in these contexts. An editor with a high revert rate could be a vandal, or an advanced user removing vandalized text. In the case of our class of new enwiki users, a high revert rate is more likely vandalism. The value of a metric returned for each user may be defined (e.g., “true” to indicate that a user reached a threshold of 1 edit in her first 24 hours of account activity) or undefined, which would be the case if a user has not been active for a full 24 hours, and we do not yet know if she will reach the threshold or not. Defined values may be of different types: Boolean (a true or false value indicating whether a threshold has been reached, for example), integer (e.g., edit count), or float (e.g., proportion of reverted edits to total edits).
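The distinction between defined and undefined values can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's code: the function name `threshold`, its parameters, and the use of `None` for "undefined" are all hypothetical, and `edit_times` is assumed to contain only main-namespace edits.

```python
from datetime import datetime, timedelta


def threshold(registration, edit_times, now, n=1, window=timedelta(hours=24)):
    """Return True or False once the outcome is known, or None while undefined.

    True:  the user made at least `n` edits within `window` of registration.
    False: the window has fully elapsed and the user made fewer than `n` edits.
    None:  the window is still open and the threshold has not been reached yet.
    """
    cutoff = registration + window
    count = sum(1 for ts in edit_times if registration <= ts < cutoff and ts <= now)
    if count >= n:
        return True
    if now < cutoff:
        return None  # not enough time has passed to know the final value
    return False


registered = datetime(2013, 9, 1, 12, 0)
still_open = threshold(registered, [], now=datetime(2013, 9, 1, 20, 0))
reached = threshold(registered, [datetime(2013, 9, 1, 13, 0)], now=datetime(2013, 9, 1, 20, 0))
missed = threshold(registered, [datetime(2013, 9, 2, 13, 0)], now=datetime(2013, 9, 3, 0, 0))
```

In this sketch `still_open` is `None`, `reached` is `True`, and `missed` is `False` because the only edit came after the 24-hour window closed.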

The value of a metric may change over time. As a user makes additional edits, for example, the size of the contribution changes and the value of metrics such as ‘bytes_added’ will change accordingly. However, once the time over which a given metric is defined has elapsed (e.g., the first week after registration), the metric should always return the same value.

The set of metrics supported by the Wikimetrics application is in no way exhaustive. The system has been designed to be easily extensible, so that new metrics can be added and parameterized in different ways. Metrics are easy to implement if you develop in Python or if you can show that a new type of measurement might be useful. In the latter case, either the analytics team or community members are likely to help you implement the new metric. To contribute code, please have a look at our repository.

Reports
The Wikimetrics application returns information in the form of a report. Reports contain the values of a selected metric for a specified cohort, as well as the settings used to generate the data. For example, a report might contain the number of new pages created by each member of a cohort over a two-week period. The names of the cohort and metric, as well as the start and end dates of the time interval, are included with the retrieved information.

Reports are available as either JSON or a CSV file. They are available for thirty days after generation, and can be accessed from the reports page (you must be logged in). If you would like to keep a report for longer than thirty days, please download and save it.

Technical overview
….To come...

The UserTag repository
….To come...

The Wikimetrics application
….To come...

Wikimetrics
The Wikimetrics application allows users to easily run reports that contain information about how groups of users are interacting with the Wikimedia site. Metrics can be generated for any predefined group of users (i.e., cohort), and reports are returned as either formatted JSON or as a CSV file. Reports can be accessed from the Report queue (you must be logged in to access the screen), where they are archived for thirty days.

In this section, we will look more closely at how to create a report—how to select a cohort, how to use the available metrics and their parameters, and how to configure and understand the report output.

Overview of workflow
Whether you are running an experiment and are interested in looking at the behavior of a test and control group, or heading an outreach program and are curious about how effective the program is in retaining productive users, the Wikimetrics application can be used to gather data that sheds light on how the relevant users are interacting with the Wikimedia site.

The entire workflow is carried out in the Wikimetrics application’s easy-to-use form interface:
 * 1) Log in to Wikimetrics
 * 2) Create an analysis report
 * 3) Create and/or select a cohort
 * 4) Select a metric and customize its parameters
 * 5) Configure the report output
 * 6) Run the report

Once a report has been submitted, the Wikimetrics application validates and processes all valid requests. The system will return a report once the data has been extracted from the relevant databases. Reports, like cohorts, are private, and may only be viewed (at least in the context of the application) by the user who generates them.

Accessing Wikimetrics
Anyone can access and start using Wikimetrics by logging in here: http://metrics.wmflabs.org/. Wikimetrics uses OAuth to authenticate users, and you may log in to the application with any Google account. Support for logging in with Twitter accounts is planned.

When you choose to log in with Google, you will be prompted to select an account (or to enter the name of an account, if the one you wish to use is not listed). The first time you log in to Wikimetrics with the selected account, you will be prompted to enter your account password and then to accept the access terms. The Wikimetrics application will be able to view your email address and associate you with your public Google+ profile. Note, however, that we will only access your email address and name to verify your identity. We will not share any of your data with Google or anyone else. Since Wikimetrics is an open source project, you can see exactly how we interact with Google and the other providers. As of this writing, that logic lives here.

Defining and selecting cohorts
Cohorts are private, meaning that they are created by each user and available only to that person. Only the cohorts you create will be available to you.

New cohorts can easily be uploaded to the system via the cohort upload functionality, and your existing cohorts will be listed in the Wikimetrics interface, where they can be easily selected and used when running reports. All cohorts currently available to you appear on the Wikimetrics home page, the cohorts screen, and the Create Analysis Report screen.

Each cohort is identified by a single user tag, and information about it is stored in the UserTag repository. In addition, each cohort has a unique numerical Id as well as a human-legible name, which is displayed in the Wikimetrics interface. Additional information about each cohort—a description, as well as the name of the person using it, for example—can be found in the UserTag repository.

Adding a custom cohort
If you would like to add a new cohort of users, you can do so via the Wikimetrics application’s cohort upload feature. Cohorts may consist of users of a single project (e.g., a list of enwiki users) or of multiple projects (e.g., a list of users of either enwiki or arwiki projects).

Currently, metrics can only be generated for a single project at a time, but future functionality will support multiple projects.

To add a cohort of users:
 * 1) Create a CSV file that includes the usernames (or userIds) of cohort members, one user per line. e.g.,
 * 13234584
 * 18487945
 * If your cohort consists of users of multiple projects, create a CSV file that includes the usernames (or userIds) of cohort members and the name of the project (e.g., ‘enwiki’ or 'mediawikiwiki') to which each belongs. Username and project should be separated by a comma, one user/project per line. e.g.,
 * 13234584, enwiki
 * 835346, mediawikiwiki
 * 2) Navigate to the "Create a Cohort by Uploading a CSV" screen (from the Wikimetrics home page, select the “Cohort” link near the bottom of the screen, then click the “Upload Cohort" button). The "Create a Cohort by Uploading a CSV" screen looks like this:
 * [[File:WikimetricsCohortUpload.png]]
 * 3) Enter a cohort name. The cohort name can contain the following ASCII characters: A-Z, a-z, numbers, hyphens (-), and underscores (_). The Wikimetrics application will automatically confirm that the name is unique. If a cohort with the specified name exists already, you will be prompted to choose another name.
 * 4) Add a description of the cohort. The description will be saved in the cohort repository.
 * 5) Select a default wiki project (e.g., enwiki). If your cohort includes users from multiple projects, leave this field empty.
 * 6) Click the “Choose File” button beside the CSV File setting to select the CSV file containing the list of users created in step 1.
 * 7) Click the “Upload CSV” button. Wikimetrics will then validate the userIds. Invalid users will be flagged and displayed in an “Invalid Users” tab, where they can be reviewed. Note that invalid users will not be included in the new cohort. You may choose to upload a new CSV file, or to review the valid users by opening the “Valid Users” tab.
 * 8) Under the “Valid Users” tab, click the “Upload Only These Valid Users” button to create a cohort of those users. Once you have created the cohort, you cannot further edit it via the Wikimetrics interface. If you made an error, you must create a new cohort.

The new cohort will appear wherever your cohorts are listed (Wikimetrics home page, cohorts page, etc.). It is immediately available for use.
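If you prepare cohort files with a script, the steps above can be sketched in Python along the following lines. The helper names `is_valid_cohort_name` and `cohort_csv` are hypothetical and are not part of Wikimetrics; the name check simply mirrors the character rules described in the upload steps, and the application performs its own validation regardless.

```python
import csv
import io
import re

# Allowed characters as described in the upload steps:
# A-Z, a-z, numbers, hyphens and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_cohort_name(name):
    """Check a cohort name locally before uploading (illustrative only)."""
    return NAME_PATTERN.fullmatch(name) is not None


def cohort_csv(members):
    """Build CSV text with one cohort member per line.

    Each member is either a bare user (single-project cohort) or a
    (user, project) tuple (multi-project cohort).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for member in members:
        if isinstance(member, tuple):
            writer.writerow(member)
        else:
            writer.writerow([member])
    return buffer.getvalue()


single_project = cohort_csv([13234584, 18487945])
multi_project = cohort_csv([(13234584, "enwiki"), (835346, "mediawikiwiki")])
```

Using the `csv` module rather than joining strings by hand means that a username containing a comma is quoted correctly. Note that this sketch writes no space after the comma, unlike the hand-written example in step 1.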

Combining multiple cohorts
Wikimetrics does not currently support ‘cohort combinations’--the ability to generate a single report from the union (e.g., all users who appear in either of two cohorts) or intersection (e.g., only those users who appear in two cohorts) of multiple cohorts. You may select multiple cohorts when you create a report; the Wikimetrics application will then generate a report based on the selected metric and output settings for each cohort.
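Until such functionality exists, one possible workaround is to combine the user lists yourself and upload the result as a new cohort. The Python sketch below shows the idea with sets; the function names and the `(userId, project)` representation of a cohort member are assumptions made for this example.

```python
def cohort_union(cohort_a, cohort_b):
    """All users who appear in either cohort, without duplicates."""
    return sorted(set(cohort_a) | set(cohort_b))


def cohort_intersection(cohort_a, cohort_b):
    """Only those users who appear in both cohorts."""
    return sorted(set(cohort_a) & set(cohort_b))


# Members are (userId, project) pairs so that the same numeric id on two
# different wikis is not mistaken for the same user.
outreach_event = [(13234584, "enwiki"), (18487945, "enwiki")]
editing_class = [(18487945, "enwiki"), (835346, "mediawikiwiki")]

either = cohort_union(outreach_event, editing_class)
both = cohort_intersection(outreach_event, editing_class)
```

The combined list can then be written to a CSV file and uploaded as described in Adding a custom cohort.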

We anticipate that this functionality will be added in the future. The documentation will be updated at that time.

Static (fixed-membership) cohorts vs dynamic cohorts
Placeholder.

Creating an Analysis Report
The Wikimetrics application returns information in the form of a report. Reports contain the values of a selected metric for a specified cohort, as well as the settings used to generate the data. To create a new report, log in to the Wikimetrics application and click the ‘Analyze’ button.

In order to create a report, you must:
 * 1) Select a cohort
 * 2) Select and configure a metric
 * 3) Configure the output

The Wikimetrics application will display a sample report based on the selected configurations, which you can review before clicking the ‘Run Report’ button to generate the report.

Selecting a cohort
All cohorts that you have uploaded to the Wikimetrics application appear in the ‘Pick Cohorts’ section of the Create Analysis Report interface. Check the box beside the name of the cohort(s) that you would like to generate reports with. Note that Wikimetrics does not currently combine cohorts; if you select multiple cohorts, the application will generate a separate report for each cohort.

You may also upload a new cohort from this interface. For more information about uploading cohorts, please see Adding a custom cohort.

Selecting and configuring metrics
All currently supported metrics appear in the ‘Pick Metrics’ section of the Create Analysis Report interface. To select and configure a metric, click its linked name and then check the box beside “Compute different aggregations…” Each metric has default settings, which will then be displayed.

For more information about the metrics and their settings, please see Available Metrics.

Configuring output
The data Wikimetrics reports depends on the way the report output is configured by the user. The application can output a value for each user in the cohort, or one average value for the entire group. The output configuration settings will be accessible once a cohort and metric have been selected and configured. For more information about output configurations, please see Understanding different output configurations.

Configuring metrics
Placeholder

Understanding different output configurations
The data Wikimetrics reports depends on the way the report output is configured by the user. Once a cohort has been selected and a metric chosen and configured, you can open the output configuration settings by clicking the linked name of the metric that appears beneath the Configure Output heading on the Create Analysis Report screen.

The basic output types are:


 * Individual Results: report the value of the metric for each user in the cohort. Users are identified by userId.
 * Aggregate Results: report an aggregate value for the entire cohort (e.g., the total number of edits made by all cohort users). You must select one or more aggregators (Sum, Average, Standard Deviation).
 * Time-series: ‘slice’ the specified time interval (the last month, for example) by day, week, hour, or whatever unit is most relevant to the analysis. An aggregate metric value is reported for each time slice.
 * Single-user: PLACEHOLDER
 * All-user: PLACEHOLDER
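To make the idea of time slicing concrete, here is a small Python sketch that sums a metric per slice over an interval. It only illustrates the concept: the function `time_series`, its parameters, and the `(day, value)` event shape are hypothetical, and the interval is assumed to include the start date and exclude the end date.

```python
from datetime import date, timedelta


def time_series(events, start, end, slice_days=1):
    """Sum event values per time slice over the interval [start, end).

    `events` is a list of (day, value) pairs. Returns a list of
    (slice_start, total) pairs, one per slice, including empty slices.
    """
    total_days = (end - start).days
    n_slices = (total_days + slice_days - 1) // slice_days  # round up
    totals = [0] * n_slices
    for day, value in events:
        if start <= day < end:
            totals[(day - start).days // slice_days] += value
    return [
        (start + timedelta(days=i * slice_days), total)
        for i, total in enumerate(totals)
    ]


events = [
    (date(2013, 9, 2), 3),
    (date(2013, 9, 5), 1),
    (date(2013, 9, 10), 4),
    (date(2013, 9, 20), 9),  # outside the interval, ignored
]
weekly = time_series(events, date(2013, 9, 1), date(2013, 9, 15), slice_days=7)
```

With weekly slices, the first two events land in the slice starting 1 September and the third in the slice starting 8 September.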

In this section, we will look at each type of report in more depth.

Individual Results
When the “Individual Results” output configuration is selected, Wikimetrics reports the value of the specified metric for each user in the cohort.

To create an “Individual Results” report, you must specify a cohort, select and configure a metric, and set the output configuration to “Individual Results.”

For example, if fifty users create accounts at an outreach event, and you’d like to see how many bytes of content each of those users has added or modified over the last month, you would:


 * 1) Select the ‘outreach_event’ cohort. If this cohort does not yet exist, or is not yet known to Wikimetrics, it will have to be created and/or added to the system first.
 * 2) Select the ‘Bytes Added’ metric and configure it to return values for the desired time interval.
 * 3) Set the output configuration to “Individual Results,” which instructs Wikimetrics to return a value for each cohort member.

Report components: $$R_r$$ {‘outreach_event’ cohort, ‘Bytes Added’ metric, ‘Individual Results’}

Report returned (abstract): Sample Report (JSON format)

The above data is excerpted from a report in JSON format and shows, for each userId (e.g., ‘13234584’), the net bytes added (bytes added minus bytes removed, e.g., ‘1525.0’), the bytes added (‘1528.0’), the absolute sum of all bytes changed (bytes added plus the magnitude of bytes removed, e.g., ‘1531.0’), and the bytes removed (‘-3.0’). The metric is measured over the time interval configured in the metric settings.

We look at full JSON reports in more depth in the Understanding the report section.
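The relationship between the four values in the example can be checked with a short Python sketch. It assumes that each revision contributes a signed byte delta (positive when content is added, negative when removed); the function name and the dictionary keys are chosen for this illustration and are not the field names of the actual report.

```python
def bytes_added_sums(deltas):
    """Compute the four sums from a list of signed per-revision byte deltas."""
    positive = sum(d for d in deltas if d > 0)  # bytes added
    negative = sum(d for d in deltas if d < 0)  # bytes removed (zero or negative)
    return {
        "net_sum": positive + negative,        # added minus removed
        "positive_only_sum": positive,
        "absolute_sum": positive - negative,   # added plus magnitude of removed
        "negative_only_sum": negative,
    }


# Three hypothetical revisions: +1500 bytes, -3 bytes, +28 bytes.
sums = bytes_added_sums([1500, -3, 28])
```

These deltas reproduce the figures quoted above: 1528 bytes added, 3 bytes removed, a net sum of 1525, and an absolute sum of 1531.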

Aggregate Results
When a report is configured to output ‘Aggregate Results’, the Wikimetrics application will generate a single aggregate result for the cohort. You must select at least one aggregator: Sum, Average, or Standard Deviation. If you select multiple aggregators, Wikimetrics will output an aggregate result for each of the selected aggregators.

To create an Aggregate report, you must:
 * 1) specify a cohort
 * 2) select and configure a metric
 * 3) configure the output as “Aggregate Results”
 * 4) select an aggregator.

For example, if we would like to look at the total number of pages created by all members of the ‘outreach_event’ cohort, we would specify:

Report components: $$R_r$$ {‘outreach_event’ cohort, ‘Pages Created’ metric, ‘Aggregate Results’ aggregator=Sum}

Report returned (abstract): Sample Report (JSON format)

The above report is excerpted from a JSON report, and reflects the aggregated data for the entire cohort. We look at full JSON responses in more depth in the Understanding the report section.
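Conceptually, an aggregate result is computed from the individual results of the cohort members. The Python sketch below shows one way this could look; it is not the Wikimetrics implementation. The aggregator keys are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since this page does not say whether the population or sample formula is used.

```python
import statistics

# Hypothetical mapping from aggregator name to the function that computes it.
AGGREGATORS = {
    "sum": sum,
    "average": statistics.mean,
    "standard_deviation": statistics.pstdev,  # assumption: population formula
}


def aggregate(individual_results, aggregators=("sum",)):
    """Reduce per-user metric values to one value per selected aggregator.

    `individual_results` maps userId to that user's metric value.
    """
    values = list(individual_results.values())
    return {name: AGGREGATORS[name](values) for name in aggregators}


# Pages created by three hypothetical cohort members.
pages_created = {13234584: 2, 18487945: 4, 835346: 6}
result = aggregate(pages_created, ("sum", "average", "standard_deviation"))
```

Selecting several aggregators yields one value per aggregator, matching the behaviour described at the start of this section.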

Time series
Placeholder

Single user
Placeholder

All user (magic cohorts!)
Placeholder

Understanding the report
Wikimetrics returns reports in JSON format and as CSV files. In either case, the report contains the metric data as well as information about the report itself (i.e., the configurations used to generate it).

(Placeholder until JSON response finalized)

Accessing reports
Reports are stored in Wikimetrics for thirty days and are available via links from the Current Report screen. If you would like to save a report for longer than 30 days, or to share a report with other users (Wikimetrics reports are available only to the user who generated them), please download the CSV file.

Response status
Placeholder

Available metrics
Currently, Wikimetrics supports three standardized metrics that provide information about user retention, the volume of user contribution, the quality of the contribution, and the type of contribution.

Each metric has default settings that can be overridden with configurations set via the Create Analysis Report interface. In addition, each metric has a set of aggregators (e.g., ‘proportion’ or ‘sum’) that can be used with a request to return an aggregate value for a cohort. Note that by default, metrics do not reflect the activity stored in the archive tables (e.g., edits made to pages that have been deleted will not be included in the value of the bytes_added metric).

Volume metrics
Volume metrics help measure the quantity of an editor’s wiki work: Bytes Added, Edits, Pages Created.

Bytes Added
By default, the Bytes Added metric returns an array of numerical values that reflect the amount of content an editor has added, removed, and modified within a given time period. The default time period is reflected in the settings that are displayed (and may be modified) when the metric is selected from the ‘Create Analysis Report’ interface and a user has checked the box beside “Compute different aggregations of the bytes contributed or removed from a mediawiki project.”

Users may configure the metric’s Start Date, End Date, and Namespaces parameters via the application interface. In addition, a user may choose to include Positive Only Sum, Negative Only Sum, Absolute Sum, Net Sum values in the report. By default, all four sums are included in the report, and each provides different insight into user behavior. Please see the table below for more information.

For more information, see: http://metrics.wmflabs.org/metrics/.