UserMetrics/Guide

About
UserMetrics is the name of a platform developed by the Wikimedia Foundation to measure user activity based on a set of standardized metrics. Using this platform, a set of key metrics can be selected and applied to a cohort of users to measure their overall productivity. The platform is designed for extensibility (creating new metrics, modifying metric parameters) and to support various types of cohort analysis and program evaluation in a user-friendly way. It accepts requests via a RESTful API and returns responses in JSON format.

As of May 2013, the UserMetrics API is used internally at the Wikimedia Foundation by the Editor Engagement, Global Education and Grantmaking programs. The scope of the project is being extended to include external customers, researchers, and community members. If you are interested in using the UserMetrics API, please contact [mailto:usermetrics@wikimedia.org usermetrics@wikimedia.org].

Project home page and API home
The project home page and UserMetricsAPI home are here:

Project home page: http://mediawiki.org/wiki/UserMetrics

The project home page is the main developer hub for the project, hosting updates and resources for developers and end users.

UserMetrics API home page: http://metrics.wikimedia.org

The UserMetrics API home page is the public URL of the API. Access to the API is currently restricted to internal users and early testers. To obtain credentials, please contact us at [mailto:usermetrics@wikimedia.org usermetrics@wikimedia.org].

Additional information: code repository and bug reports
If you would like additional information about the project, please see:

Code repository: https://github.com/wikimedia/user_metrics

Bug reports: Bugs and feature suggestions should be reported via Bugzilla. https://bugzilla.wikimedia.org/buglist.cgi?component=User%20Metrics&product=Analytics

Contact us
To reach us and/or obtain help, please write to: [mailto:usermetrics@wikimedia.org usermetrics@wikimedia.org].

To receive updates about any UserMetrics service or data quality issues, please join the umapi-alerts list.

Rationale
The UserMetrics API grew out of a need to study data collected via user tagging (see User Tagging), which is used to identify groups of users so that they can be studied collectively: all subjects of an experiment, for example, or all users who created accounts at an outreach event. The API lets us easily and efficiently generate reports on how groups of users behave, for example, how quickly a group of users became productive editors, or how likely a group is to remain active over time.

A second important aim of the project is to develop a standardized set of metrics (see Metrics Standardization) that gives everyone in the organization the same understanding of what we mean when we say “an editor has been retained” or “an editor is active.” Because the UserMetrics API uses a standardized set of metrics, the reports generated by the system can be used together, either within a project (to compare an experimental group to a control group, for example) or across the organization (to give an overall sense of the productivity of various efforts). The system is designed to be both flexible and extensible, so that existing metrics can be customized as needed, and new metrics can be added over time.

A third aim of the project is to provide an intuitive workflow (see Workflow) that can be used by any internal team to access and analyze the data required to evaluate an initiative. Metrics can be retrieved from the UserMetrics API home page, or via a client that automatically generates reports based on the metrics of interest.

User Tagging
The UserMetrics API is designed to leverage the information assigned and stored via userTags, which permit us to permanently associate an arbitrary set of metadata (e.g., “subject of e3 experiment”) with a registered user of a specific project (e.g., "enwiki"). Tags are associated with a userId at the time of account creation or at the time a user undergoes a specific treatment or participates in a given initiative, and are stored in a repository where they can be accessed by the UserMetricsAPI and used to generate cohorts. Once a tag has been assigned, it cannot be removed or changed.

UserTags can represent any number of user attributes. Tags can identify users as experimental subjects, or users who have created accounts in response to either calls to action or outreach events, or users who are part of a specific program (e.g., Global Education). UserTags do not reflect data that is already captured by MediaWiki or any data that conflict with our privacy policy.

For more information about userTagging, please see: http://www.mediawiki.org/wiki/Usertagging.

Metrics Standardization
Standardized metrics can be applied to any Wikimedia project to help evaluate the impact of initiatives in an unambiguous and consistent way. The set of metrics used by the UserMetricsAPI can be used at the project level to measure the success of an experimental treatment or outreach initiative, or on the organizational level to compare the impact of projects across the organization. In each case, the qualities of interest, namely user retention or user contribution (quality, quantity, type), are measured consistently and clearly defined so that all users can see what the numbers mean.

For more background information, please see: http://meta.wikimedia.org/wiki/Research:Metrics

Workflow
The UserMetricsAPI is designed to streamline the process of obtaining and analyzing the data needed to evaluate projects and initiatives. Any authorized UserMetrics user can access the API to generate reports, depending on permissions associated with her user account. Broadly, the workflow can be described as follows:


 * 1) Define cohorts. Cohorts can be defined by specifying custom lists of usernames/userIds, or by selecting and combining existing userTags. Examples of cohorts: users in an E3 experimental group, students enrolled in a Global Education class, VisualEditor adopters, new users registered on mobile devices.
 * 2) Measure the quality, productivity, or retention of these cohorts via a standard set of metrics:
    * revert rate: proportion of reverted edits within 24 hours of registration
    * threshold: reached if a user makes 1 edit to the main namespace within 24 hours of registration
    * blocks: number of times the user was blocked within 24 hours of registration
    * … or other metrics
 * 3) Compare cohorts against each other or against a baseline.

Cohort
A cohort is a set of users sharing one or more properties or attributes, such as the time of account creation, or participation in an outreach event or experimental group.

The UserMetricsAPI generates cohorts based on userTag information. At its most basic, a cohort can be identified by a single userTag (e.g., “e3_experimental_group”). Cohorts can also be generated from a combination of multiple tags (“e3_experimental_group” and “e3_control_group”). Tags are combined using Boolean operators to reflect either the union or intersection of the groups. For more information, see multiple-tag cohorts.

Metric
Metrics are well-defined values or sets of values that can be computed for any user registered in Wikimedia projects, and are typically used in aggregate to compare different user groups (i.e., cohorts) against each other. The metrics computed by the UserMetricsAPI help us understand user activity and behavior, from the quality, quantity and type of user contribution, to how well our editors are retained. For example, we could look at the value of the “bytes_added” metric to see how many bytes of content a student has added to a given wiki in the last week, but if we are interested in evaluating the success of her class, we would more likely look at the number of bytes added by the entire class (i.e., the “enwiki_editing_class” cohort). In this case, the bytes_added metric is used to help determine if the class is successful. We could look at additional metrics to provide a fuller picture: the revert rate of student edits, for example, or the survival rate of users in the student cohort. We can’t directly measure the class’s “success,” but we can measure a number of more concrete quantities that help us determine it and compare it with other classes or other similar initiatives.

All metrics are standardized and clearly defined so that we can easily understand what their values mean and consistently use the same standards to evaluate the efficacy of programs and initiatives over time. Note that metrics are dependent on the context in which they are measured and therefore only make sense in these contexts. An editor with a high revert rate could be a vandal, or an advanced user removing vandalized text. In the case of our class of new enwiki users, a high revert rate is more likely vandalism.

The value of a metric returned for each user may be defined (e.g., “true” to indicate that a user reached a threshold of 1 edit in her first 24 hours of account activity) or undefined, which would be the case if a user has not been active for a full 24 hours and we do not yet know whether she will reach the threshold. Defined values may be of different types: Boolean (a true or false value indicating whether a threshold has been reached, for example), integer (e.g., edit count), or float (e.g., proportion of reverted edits to total edits). The value of a metric may change over time. As a user makes additional edits, for example, the size of her contribution changes, and the value of metrics such as ‘bytes_added’ will change accordingly. However, once the time over which a given metric is defined has elapsed (e.g., the first week after registration), the metric should always return the same value.
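The distinction between defined and undefined values can be illustrated with a minimal sketch of a threshold-style metric. This is not the engine's implementation: the function name, its arguments, and the parameter names `n` and `t_hours` are hypothetical, and we assume that a value becomes defined either when the threshold is reached or when the measurement window closes.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def threshold(registration: datetime,
              edit_times: List[datetime],
              now: datetime,
              n: int = 1,
              t_hours: int = 24) -> Optional[bool]:
    """Return True/False when the value is defined, or None while it is undefined."""
    window_end = registration + timedelta(hours=t_hours)
    # Count only edits made inside the measurement window.
    edits_in_window = sum(1 for ts in edit_times if registration <= ts < window_end)
    if edits_in_window >= n:
        return True    # threshold reached; later activity cannot change this
    if now < window_end:
        return None    # window still open: we do not know the outcome yet
    return False       # window closed without enough edits


reg = datetime(2013, 5, 1, 12, 0)
print(threshold(reg, [reg + timedelta(hours=2)], now=reg + timedelta(hours=3)))  # True
print(threshold(reg, [], now=reg + timedelta(hours=3)))                          # None
print(threshold(reg, [], now=reg + timedelta(hours=30)))                         # False
```

Once `now` is past the end of the window, repeated calls return the same value, which matches the expectation that a metric is stable after its measurement period has elapsed.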

The set of metrics supported by the UserMetricsAPI is in no way exhaustive. The system has been designed to be easily extensible, so that new metrics can be added and parameterized in different ways.

The backend
The backend of the UserMetrics API consists of two main pieces: the UserTag repository, which contains information about each userTag and the users to which it has been applied, and the UserMetrics engine, which receives metric request URLs and returns metric data as JSON objects.

The UserTag repository
The UserTag repository contains information about each user tag and the users to which it has been applied. In addition, the repository stores the name of the team using the tag, as well as the name of the cohort owner (the person responsible for maintaining cohort membership and keeping cohort information up-to-date). The repository consists of four MySQL tables (usertags_meta, ut_tags, api_user, and api_group). Because these tables may contain sensitive information, they are not publicly accessible. If you need access to specific information stored in the repository, please e-mail [mailto:usermetrics@wikimedia.org usermetrics@wikimedia.org].

The definition of each usertag, as well as other relevant metadata, is stored in the usertags_meta table:



Note that each tag has a unique Id (e.g., 64, in the above example) and a human-readable name (e3_ob4a). All tags are project-specific (e.g., ‘enwiki’). Currently, each usertag is applied to only one project; in the future, tags may be applied to several projects. In this case, the value of the ‘utm_project’ field would contain the names of all relevant projects. The usertags_meta table also contains a description of each tag and numerical codes representing the owner and the team using the tag. Owner and team names can be looked up in the api_user and api_group tables, respectively. The ‘utm_touched’ column represents the date the tag was most recently applied to a user. This information—the Ids of users associated with each tag—is stored in the ut_tag table. Finally, the ‘utm_enabled’ value indicates whether or not the tag is current. A value of ‘1’ indicates that the tag is relevant and should be included in the UserMetrics API application (where it can be selected from a menu and used to define a cohort). A value of ‘0’ indicates that the tag is no longer relevant, and should be archived. All tags—both current and archived—remain in the usertags_meta table.

The usertags table contains the name of the project (e.g., ‘enwiki’ or ‘arwiki’) and the userId of the users associated with each tag. Tags are identified by a unique numeric identifier referenced in the usertags_meta table; further information about each tag can be referenced in the usertags_meta table.

The api_user and api_group tables identify the groups and individuals using the repository. Each group (e.g., ‘e3’ or ‘mobile’) is assigned a unique numerical identification.

The UserMetrics engine and API
The UserMetrics engine is a Python package that includes several modules to process and handle "jobs" or "requests" and return a response and the associated information. The engine takes care of creating metric request objects and querying the databases (primarily the revision, user, page, and logging MySQL tables). The UserMetrics engine can be run as a standalone package, or with the UserMetrics API, which translates request URLs from authorized users into calls to the engine, creates and manages a job queue to best utilize system resources, and caches the response so that it is available for reuse.



When a request is submitted to the UserMetrics engine via the API, the request is first validated by the API to ensure that the URL is well formed and the parameters valid. The API also looks to the cache to see if the query has been submitted previously. If the query already exists, the API returns the cached response. If the request does not yet exist, the API creates a unique object representing the request and schedules a new job.
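The validate, look up, and schedule sequence described above can be sketched roughly as follows. This is an illustration only: `request_key`, `handle_request`, the in-memory `cache` and `job_queue`, and the simple path check are all assumptions made for the example, not the API's actual code.

```python
import hashlib
from collections import OrderedDict
from urllib.parse import parse_qsl, urlsplit

cache = OrderedDict()   # request key -> cached response (hypothetical store)
job_queue = []          # keys of requests waiting to be processed


def request_key(url: str) -> str:
    """Build a stable key from the path and the sorted query parameters."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))
    canonical = parts.path.rstrip("/") + "?" + "&".join(f"{k}={v}" for k, v in params)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def handle_request(url: str) -> dict:
    """Return the cached response if one exists; otherwise schedule a new job."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Minimal validation (assumed shape): /cohorts/<cohort>/<metric>
    if len(segments) != 3 or segments[0] != "cohorts":
        return {"status": "invalid"}
    key = request_key(url)
    if key in cache:
        return {"status": "cached", "response": cache[key]}
    if key not in job_queue:
        job_queue.append(key)   # new request: queue it exactly once
    return {"status": "scheduled", "key": key}


url = "http://metrics.wikimedia.org/cohorts/test2/bytes_added?aggregator=sum"
first = handle_request(url)
print(first["status"])                  # scheduled
cache[first["key"]] = {"type": "aggregator"}
print(handle_request(url)["status"])    # cached
```

Sorting the query parameters before hashing means that two URLs that differ only in parameter order map to the same cache entry.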

All new jobs are handled by the Job Scheduler, which queues jobs on arrival and sends those jobs to a helper process to be run. The Job Scheduler knows how many jobs are currently running and, if the maximum number has been reached, will add the new job to a wait queue. As existing jobs complete, the Job Scheduler selects new jobs from the wait queue in the order in which they were added. The Job Scheduler communicates with the Request Notifications Listener, which maintains the job queue, gets jobs, gets the status of jobs, adds jobs, and flags jobs as complete. When a job is complete, the Response Builder constructs a response for caching, caches the response, and notifies the system that the job is complete.
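As a rough illustration of the queueing behavior described above (a cap on running jobs plus a first-in, first-out wait queue), consider the following sketch. The class and method names are hypothetical, and the real Job Scheduler hands jobs to helper processes rather than tracking identifiers in memory.

```python
from collections import deque


class JobScheduler:
    """Run at most `max_running` jobs; additional jobs wait in FIFO order."""

    def __init__(self, max_running: int = 2):
        self.max_running = max_running
        self.running = set()
        self.waiting = deque()
        self.completed = []

    def submit(self, job_id: str) -> None:
        if len(self.running) < self.max_running:
            self.running.add(job_id)      # a slot is free: start immediately
        else:
            self.waiting.append(job_id)   # otherwise join the wait queue

    def complete(self, job_id: str) -> None:
        self.running.remove(job_id)
        self.completed.append(job_id)
        if self.waiting:                  # promote the oldest waiting job
            self.running.add(self.waiting.popleft())


scheduler = JobScheduler(max_running=2)
for job in ["a", "b", "c", "d"]:
    scheduler.submit(job)
print(sorted(scheduler.running), list(scheduler.waiting))   # ['a', 'b'] ['c', 'd']
scheduler.complete("a")
print(sorted(scheduler.running), list(scheduler.waiting))   # ['b', 'c'] ['d']
```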

In addition to the job-handling components, the API consists of four primary modules: Views, Session, RequestMeta, and Data. The Views module handles the Flask URL-routing, view implementation, and subsequent template rendering. The Session module extends the Flask log-in package to create an interface for registering and authenticating users. The RequestMeta (RM) module parses request-URLs and creates from each a uniquely identified object that can be passed among the various components of the UserMetrics engine. The Data module handles reading and writing data to the cache, constructing hashes from the RM objects, constructing URLs from RM objects, and rebuilding JSON HTTP responses from cached responses.

Once the Job Scheduler has determined that a job should run, the relevant RM object is passed to a separate helper process, process_metrics (PM), which constructs and validates a cohort from the given information. The PM takes care of mapping the RM parameters to user metrics objects, specifying the parameters that determine the number of worker threads to be used by the metrics object, and whether or not to log the status of the call. Further processing depends on the request type: raw, aggregate, or time series.

Raw requests are handled by a metric object’s ‘process’ method, which implements the logic necessary to generate numerical results by userId. This method is defined for all UserMetric derived classes and takes a list of userIds as an argument. If metric parameters are not passed to the UserMetrics engine as keyword arguments, the metrics object applies its own default values.

Aggregate requests are handled in much the same way as raw requests, only an ‘aggregator’ wrapper method is used to combine the numerical results in a specified way (e.g., sum, mean, median). Aggregators are defined for each metric using method definitions in the Aggregator module.

Time series requests are handled with the ‘timeseries_method’ module. PM makes a call to this module with the time-series request data. See ‘Understanding different types of requests’ for more information about time-series requests.

All metrics objects use the Query module (QM) to extract data from the backend datastore (currently MySQL, though the system may be extended to use additional data sources, as the query logic is abstracted from the metric classes). The Query module formats and builds MySQL queries, enforces security on MySQL requests, and abstracts away the details of obtaining data from the MediaWiki databases. The Query module provides a single hub through which all queries to sensitive data are run, making it possible to secure the databases more easily.

The API writes the metric data objects to cache, both to a persistent backend store (HDFS) and to memory as an ordered dictionary. The system can be easily extended to use memcached or another caching system. API responses are returned via HTTP as JSON objects.

The UserMetrics API
The UserMetrics API receives request URLs and returns metrics data in formatted JSON. Metrics can be generated for any registered user or predefined group of users (i.e., cohort). Each request URL specifies the name of the cohort of interest as well as a metric handle (the friendly name of the metric to be computed, e.g., ‘threshold’) and, optionally, request parameters.

The type of data returned depends on the nature of the metric of interest as well as the type of request submitted (raw, aggregate, time series). All responses are stored in cache, and can be viewed as soon as the data has been retrieved, or at a later date.

In this section, we will look more closely at how to build a request—how to select a cohort, how to use the available metrics and their parameters, and how to use additional parameters to customize the request—as well as how to understand the response.

Overview of workflow
Whether you are running an experiment and are interested in looking at the behavior of a test and control group, or heading an outreach program and are curious about how effective the program is in retaining productive users, the UserMetrics API can be used to gather data that sheds light on how the relevant users are interacting with the Wikimedia site.

If one hundred new users create accounts at an outreach event, for example, and you would like to evaluate how productive those users are, you could use the UserMetrics API to retrieve the number of edits made by each individual in the group (or, alternatively, the total number of edits made by all individuals in the group) to give a measure of productivity. If the outreach event is focused on training new users, you could look at how quickly those users became active users (or reached some predefined threshold of activity) as an indication of how effective the training was. If you are interested in examining the quality of the contribution, or have concerns that a group of users is vandalizing articles, you could look at the revert rate for the group. Whether your question has to do with the volume of user contribution, the quality of that contribution, user retention, or the contribution type, user metrics can be used, and customized, to help answer it.

Once you have settled on the metric that will best answer your question about a particular cohort of users, you can customize your request with parameters. Parameters are used to specify the time period of interest (one day, one week, three months, one year, etc.), the start and end dates of a time interval of interest, the namespace and project of interest, or an event count (e.g., the number of edits that represent a threshold of interest). A program administrator usually presets the subset of metrics and default parameters most relevant to the project. Unless you need to (or have been asked to) override the defaults, you don't need to make changes before generating a request.

Parameters are also used to indicate the type of the request itself. By default, the system returns the metric value for each user in the cohort (a raw response). Specifying an aggregator parameter (‘sum’ or ‘average,’ for example) instructs the UserMetrics engine to return an aggregate value for the entire cohort. Similarly, the time series parameters instruct the system to return a series of aggregate values, one for each time slice of the defined interval.
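A small sketch may help make the relationship between parameters and request type concrete. The function below is hypothetical and simplified: it assumes the parameter names ‘aggregator’ and ‘time_series’ used elsewhere in this guide, assumes that the presence of ‘time_series’ is what marks a time series request, and falls back to a raw request when the aggregator is missing or not supported by the metric.

```python
from typing import Dict, Iterable


def request_type(params: Dict[str, str], supported_aggregators: Iterable[str]) -> str:
    """Classify a request as 'raw', 'aggregate', or 'time_series'."""
    if "time_series" in params:
        return "time_series"
    if params.get("aggregator") in set(supported_aggregators):
        return "aggregate"
    return "raw"   # the default, and the fallback for unknown aggregators


supported = ["sum", "average"]   # illustrative list only
print(request_type({}, supported))                                             # raw
print(request_type({"aggregator": "sum"}, supported))                          # aggregate
print(request_type({"aggregator": "summ"}, supported))                         # raw
print(request_type({"aggregator": "sum", "time_series": "true"}, supported))   # time_series
```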

The UserMetrics API validates every request and processes those that are valid. If the request has been submitted previously, the data is immediately pulled from the cache and returned as JSON objects. If a request is new, the engine will process the job, and return a report once the data has been extracted from the relevant databases. Metric data can then be viewed and used by any authenticated UserMetrics user.

Accessing the API
All users can read about the UserMetrics project and available metrics on the UserMetrics API home page, which is publicly accessible.

Only authenticated users may make requests, view cohorts, or retrieve the response generated by a previous request. Currently, authentication credentials are available to internal Wikimedia Foundation staff, early beta testers, and a number of trusted individuals from chapters and other organizations affiliated with the Foundation.

To obtain credentials, please contact [mailto:usermetrics@wikimedia.org usermetrics@wikimedia.org].

Defining and selecting cohorts
Cohorts, which are constructed from data stored in the UserTag repository, are exposed in the UserMetrics application as selectable items. Available cohorts may be used singly (to identify members of a particular test group, for example) or in combination (to identify members of a test group who also responded to a particular outreach event, for example). We are currently working on an interface for adding new cohorts to the system, and the documentation will be updated to reflect that functionality when it becomes available.

Information about cohorts is stored in the UserTag repository, which contains information about the tags that define cohorts. Each userTag has a unique numerical Id as well as a human-legible name, which is displayed in the UserMetrics interface. Additional information about each tag—a description, as well as the name of the person and team using it, for example—can be found in the usertags_meta table.

Single-tag cohorts
A single-tag cohort, as its name implies, is defined by a single userTag. All currently available single-tag cohorts are included in the “List of cohorts” menu in the UserMetrics application. To use one of these cohorts (e.g., the ‘test2’ cohort), simply select it from the cohort menu. The request URL will then reflect the name of the selected cohort:

http://metrics.wikimedia.org/cohorts/test2/

Multiple-tag cohorts
A multiple-tag cohort is based on a combination of two or more userTags. Multiple-tag cohorts are defined in the request string using Boolean operators, which combine single-tag cohorts in the specified way (either ‘union’ or ‘intersection’, though other operators may be supported in the future). Note that each userTag in a multiple-tag cohort must be identified by its unique numerical Id (not the human-legible name displayed in the application).

To create a multiple-tag cohort that is the union of two cohorts (i.e., that includes users who appear in either of the two single-tag cohorts) use the ‘~’ syntax. A multiple-tag cohort derived from the union of ‘newUI’ (userTag Id ‘20’) and ‘revisedNewUI’ (userTag Id ‘29’), for example, would be represented by ‘20~29’ in the request URL:

http://metrics.wikimedia.org/cohorts/20~29/

To create a multiple-tag cohort that is the intersection of two cohorts (i.e., that includes only those users that appear in both cohorts), use the ‘&’ syntax:

http://metrics.wikimedia.org/cohorts/20&29/

Multiple-tag cohorts may consist of any number of individual tags and may contain a combination of ‘&’ and ‘~’ operators. If both operators are used, the intersection operator (‘&’) will take precedence. Currently, there is no support for nesting precedence.
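The operator rules above can be sketched in a few lines. The function name and the tag-to-users mapping are hypothetical, and the userIds are made up for illustration; the sketch only shows how ‘&’ groups more tightly than ‘~’ when no nesting is available.

```python
from typing import Dict, Set


def resolve_cohort(expression: str, tags: Dict[str, Set[int]]) -> Set[int]:
    """Evaluate an expression such as '20~29&31'.

    '&' (intersection) takes precedence over '~' (union), so the expression is
    split on '~' first and each resulting term is intersected before the union.
    """
    result: Set[int] = set()
    for union_term in expression.split("~"):
        tag_ids = union_term.split("&")
        members = set(tags[tag_ids[0]])
        for tag_id in tag_ids[1:]:
            members &= tags[tag_id]
        result |= members
    return result


# Hypothetical tag membership, keyed by userTag Id.
tags = {"20": {1, 2, 3}, "29": {3, 4}, "31": {4, 5}}
print(sorted(resolve_cohort("20~29", tags)))      # [1, 2, 3, 4]
print(sorted(resolve_cohort("20&29", tags)))      # [3]
print(sorted(resolve_cohort("20~29&31", tags)))   # [1, 2, 3, 4], i.e. 20 ~ (29 & 31)
```

Had the union been evaluated first, the last expression would have produced only user 4, which shows why the precedence rule matters.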

Future functionality will include support for working with dynamic cohorts and adding new cohorts via the UserMetrics interface.

Static (fixed-membership) cohorts vs dynamic cohorts
PLACEHOLDER

Adding a custom cohort
PLACEHOLDER

Understanding different types of requests
The data the UserMetrics API returns depend on the type of request submitted to the system:


 * Raw request: By default, the system returns ‘raw data,’ which consists of the userId of the specified users, as well as the value of the metric for each user.
 * Aggregate request: Requests may also contain an aggregator, such as ‘average’ or ‘sum’. When an aggregator is applied, the system returns the aggregated data (e.g., the average edit rate for an entire cohort, rather than individual values for each user).
 * Time-series request: Aggregated data can be further processed via a time-series request. A time-series request returns data from a specified interval (the last month, for example) sliced by day, week, hour, or whatever unit is most relevant to the analysis. The UserMetrics engine calculates an aggregate metric value for each time slice. The aggregate values can be further customized to reflect either the activity by all registered users during the time slice, or only the activity of those users who registered within each slice.
 * Single-user request: A single-user request returns the value of the specified metric for one user. The user is identified by username (userId will be supported in the future) and project (e.g., ‘enwiki’ or ‘arwiki’). If no project is specified, the system returns results from ‘enwiki’ by default.
 * All-user request: An all-user request returns the value of the specified metric for all users of a project (‘enwiki,’ by default). An all-user request is created with the ‘all’ magic keyword, which specifies that metrics be returned for all project users instead of for a cohort drawn from the UserTag repository.

All requests contain the name of the user or cohort of interest, as well as the name of the metric to process and report. Additional parameters that further qualify the request (e.g., an aggregator or time series) or the metric itself (e.g., the number of edits that define a threshold: 1, 5, 10, etc.) may also be specified. If no additional parameters are specified, the UserMetrics API will use default values, which we look at in more depth in the ‘Available Metrics’ section.

In this section, we will look at the components of each type of request abstractly and also provide a working example for each type of request.

Raw
A raw request returns the value of the specified metric for each user. A raw request consists of a cohort name and the name of the desired metric. Metric-specific parameters may be included as well:

$$R_r$$ {cohort, metric(params)}

For example, if a number of users (50, in this example) created accounts at an outreach event, and you’d like to see how many edits each of those users made in their first 24 hours, you would specify the ‘outreach_event_cohort’ cohort. If this cohort does not yet exist, or is not yet known to UserMetrics, it will have to be created and/or added to the system first. See “Defining and selecting cohorts” for more information. The edit count for each user in the cohort is returned as part of the ‘bytes_added’ metric:

Request components: $$R_r$$ {outreach_event_cohort, bytes_added}

Request return:
 {
 ‘user_ID1’: [bytes_added data]
 ‘user_ID2’: [bytes_added data]
 ‘user_ID3’: [bytes_added data]
 ‘user_ID4’: [bytes_added data]
 …
 ‘user_ID50’: [bytes_added data]
 }

Example of a raw request (using the ‘test2’ cohort, which consists of 3 users):

Request URL: http://metrics.wikimedia.org/cohorts/test2/bytes_added

Returned data:
 "type": "raw",
 "data": {"15972203": [683, 1133, 908, -225, 5], "13234584": [0, 0, 0, 0, 0], "15972135": [0, 0, 0, 0, 0]}

The above data is excerpted from the actual JSON response. For each user, it shows the userId (e.g., ‘15972203’), the net bytes added (‘683’), the absolute number of bytes added (‘1133’), positive bytes added (‘908’), negative bytes added (‘-225’), and edit count (‘5’). The metric is measured over the default time period: the first 24 hours after registration for each user. We look at full JSON responses in more depth in the “Understanding the response” section. If you are an authenticated UserMetrics API user, you can view the full JSON response for this request by clicking the request URL.
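When working with a raw ‘bytes_added’ response in a client, it can help to give the positional values names. The sketch below uses the excerpted data; the field names (`net`, `absolute`, `positive`, `negative`, `edit_count`) are our own labels for the positions described above, not identifiers defined by the API.

```python
from collections import namedtuple

# Labels for the five positional values; hypothetical names for illustration.
BytesAdded = namedtuple(
    "BytesAdded", ["net", "absolute", "positive", "negative", "edit_count"]
)

raw_data = {
    "15972203": [683, 1133, 908, -225, 5],
    "13234584": [0, 0, 0, 0, 0],
    "15972135": [0, 0, 0, 0, 0],
}

by_user = {user_id: BytesAdded(*values) for user_id, values in raw_data.items()}

user = by_user["15972203"]
print(user.net, user.edit_count)                     # 683 5
# In this excerpt the values are internally consistent:
print(user.positive + user.negative == user.net)     # True
print(user.positive - user.negative == user.absolute)  # True
```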

If you would like to see the total edit count for the group—perhaps to compare that sum to a baseline to determine if the cohort is more or less productive—you would use an aggregator. We will extend this example to use an aggregator in the next section.

Aggregate
An aggregator combines the data associated with individual cohort members in the specified way—returning an aggregate response like an average, for example, or a sum. Aggregators are metric-specific. See ‘Metric-specific parameters and aggregators’ for a list of aggregators currently supported by each metric. If you specify an aggregator that is not supported by a metric, or misspell the aggregator keyword, the UserMetrics API will return a raw response for the request.

To create an aggregate request, supply the cohort name, the name of the metric (and, optionally, any metric-specific parameters), and an aggregator:

$$R_a$$ {cohort, metric(params), aggregator}

If we would like to look at the sum of the bytes_added values returned for the users in the ‘outreach_event_cohort’, we would specify:

Request components: $$R_a$$ {outreach_event_cohort, bytes_added(params), aggregator=sum}

Request return: { ‘bytes_added_sum’: [sum of bytes_added by cohort members] }

The UserMetrics system returns one value, the ‘bytes_added_sum’ for the entire cohort.
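Raw and aggregate requests share the same URL pattern, with the aggregator passed as a query parameter. The helper below is a hypothetical client-side convenience, not part of the API; it only assumes the `/cohorts/<cohort>/<metric>` path and the `aggregator` parameter that appear in the request URLs in this guide.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://metrics.wikimedia.org"


def build_request_url(cohort: str, metric: str, **params) -> str:
    """Compose /cohorts/<cohort>/<metric>[?key=value...] from its parts."""
    # Keep '~' and '&' unescaped so multiple-tag cohorts such as '20~29' survive.
    url = f"{BASE_URL}/cohorts/{quote(cohort, safe='~&')}/{quote(metric)}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


print(build_request_url("test2", "bytes_added"))
# http://metrics.wikimedia.org/cohorts/test2/bytes_added
print(build_request_url("test2", "bytes_added", aggregator="sum"))
# http://metrics.wikimedia.org/cohorts/test2/bytes_added?aggregator=sum
```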

Example of an aggregate request (using the ‘test2’ cohort, which consists of 3 users):

Request URL: http://metrics.wikimedia.org/cohorts/test2/bytes_added?aggregator=sum

Returned data:
 "type": "aggregator",
 "data": [683, 1133, 908, -225, 5]

The above data is excerpted from the actual JSON response, and reflects the aggregated data for the entire cohort (not the metric values for each individual user, which are not included in the response). In this example, the UserMetrics API returns the sum of the ‘bytes_added’ metric for all cohort users: net bytes added (‘683’), absolute number of bytes added (‘1133’), positive bytes added (‘908’), negative bytes added (‘-225’), and edit count (‘5’).
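To see how the aggregate relates to the raw response for the same cohort, the sum can be reproduced by adding the per-user values position by position. This is a sketch of the idea only; `aggregate_sum` is a hypothetical helper, not the engine's aggregator.

```python
raw_data = {
    "15972203": [683, 1133, 908, -225, 5],
    "13234584": [0, 0, 0, 0, 0],
    "15972135": [0, 0, 0, 0, 0],
}


def aggregate_sum(data):
    """Sum each position of the per-user value lists across the cohort."""
    return [sum(column) for column in zip(*data.values())]


print(aggregate_sum(raw_data))   # [683, 1133, 908, -225, 5]
```

Because two of the three users in ‘test2’ have no activity, the cohort sum equals the values of the single active user.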

We look at full JSON responses in more depth in the “Understanding the response” section. If you are an authenticated UserMetrics API user, you can view the full JSON response for this request by clicking the request URL.

Time series
A time series is used to look at the behavior of users over time. A time-series request returns an aggregate metric value for each slice of a specified time interval. The interval is specified with a ‘start’ and ‘end’ date parameter, and the time slice is specified with a ‘slice’ parameter. Two methods are currently available to specify how activity within each slice should be counted: a time series request can be run with a ‘group’ parameter to return either data that reflects all user activity for each slice (group=‘activity’), or only the activity of users who registered within that slice (group=‘registered’). By default, all activity for each slice is returned.

For example, if we are interested in looking at how much content a cohort of users has contributed to each namespace every year over the course of two years, we would use a time series request, which is indicated with a ‘time_series’ parameter. To build the request:

 * 1) Select the cohort of interest and the metric to use. In this case, we would use the ‘namespace_edits’ metric, which breaks down editor contributions by namespace.
 * 2) Select an aggregator. Because we are interested in total contribution, we would use a ‘sum’ aggregator to generate total contributions for the cohort; the UserMetrics engine will generate one aggregate metric value (i.e., sum, in this case) for each slice of the time interval.
 * 3) Define the time interval and time slice. The time-series parameters define the interval start and end date (e.g., 01/01/2011 and 01/01/2013) as well as the length of the slice to examine, in hours (e.g., a year, which is roughly equivalent to 8760 hours). Note: the way slice parameters are passed may be reconsidered in the future, so that they are represented as keywords (e.g., "day") rather than numeric values.
 * 4) Specify the group-by method. The ‘group’ parameter (group=”activity”) indicates that all activity for each slice should be included in the response. (If we were interested only in the activity of users who registered within each time slice, we would specify the ‘registration’ group instead.)
 * 5) Include a ‘time_series’ parameter. The request must also contain the ‘time_series’ parameter to identify the request as a time series.

See 'Parameters used in time series requests' for more information about each time-series parameter.

Request components: $$R_t$$ {cohort_name, metric=namespace_edits, aggregator=sum, time_series(start=”01/01/2011”, end=”01/01/2013”, slice=”8760”, group=”activity”)}

Request return: { ‘2011-01-01 00:00:00’: [namespace_edits_sum {aggregate namespace data for time slice}], ‘2012-01-01 00:00:00’: [namespace_edits_sum {aggregate namespace data for time slice}], ‘2012-12-31 00:00:00’: [namespace_edits_sum {aggregate namespace data for time slice}] }

Example of a Time Series request (using the ‘test2’ cohort): Note: this request is returning ‘registration’ instead of ‘activity’

Request URL: http://metrics.wikimedia.org/cohorts/test2/namespace_edits?project=enwiki&aggregator=sum&time_series&slice=8760&start=20110101&end=20130101&group=activity

Returned data:

"type": "time_series",
"data": {
  "2011-01-01 00:00:00": [
    "namespace_edits_sum",
    { "-1": 0, "-2": 0, "0": 5, "1": 1, "2": 1, "3": 2, "4": 2, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0, "11": 0, "12": 0, "13": 0, "14": 0, "15": 0, "100": 0, "101": 0, "108": 0, "109": 0 }
  ],
  "2012-01-01 00:00:00": [
    "namespace_edits_sum",
    { "-1": 0, …

Note that the time series response consists of the aggregated data for each time slice (a one-year period). Each slice is identified in the response by its start time (e.g., "2011-01-01 00:00:00"), and each namespace is identified by its unique numerical ID. We look at full JSON responses in more depth in the “Understanding the response” section. If you are an authenticated UserMetrics API user, you can view the full JSON response for this request by clicking the request URL.
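To show how the per-slice structure of a time-series response can be traversed, here is a minimal Python sketch that totals the edits across all namespaces for each slice. It assumes the "data" object has already been parsed from JSON into a dictionary; the function name is hypothetical and the sample input is an abbreviated copy of the excerpt above.

```python
def total_edits_per_slice(data):
    """Sum the per-namespace edit counts for each time slice."""
    totals = {}
    for slice_start, (_label, by_namespace) in data.items():
        totals[slice_start] = sum(by_namespace.values())
    return totals


# Abbreviated copy of the first slice from the excerpt above
# (namespaces with zero edits are omitted here for brevity).
example = {
    "2011-01-01 00:00:00": [
        "namespace_edits_sum",
        {"0": 5, "1": 1, "2": 1, "3": 2, "4": 2},
    ],
}
print(total_edits_per_slice(example))
```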

Single user
A single-user request returns the value of the specified metric for one user. The user is identified by username (support for userId is planned) and project (e.g., enwiki or arwiki). If no project is specified, the system returns results from enwiki by default. Note: the single-user endpoint is likely to change to a dedicated "view" in the future (e.g., metrics.wikimedia.org/user/Foo).

$$R_s$$ {user_name, project, metric(params)}

For example, if you are interested in seeing how many bytes you added to the main namespace of the MediaWiki wiki over the first three months of the year, you would specify your username, the project ‘mediawikiwiki’, the ‘bytes_added’ metric, and the start and end date of the interval of interest. You would also need to set an ‘is_user’ parameter to “True” to alert the UserMetrics system that the request is for a single user:

Request components: $$R_s$$ {your_username, project=mediawikiwiki, bytes_added(start date, end date), is_user=True}

Request return: {‘your_userId’: [bytes data]}

Example of a single-user request:

Request URL: http://metrics.wikimedia.org/cohorts/Kmenger/bytes_added?project=mediawikiwiki&start=20130101000000&end=20130401000000&is_user=True

Returned data: "data": { "835346": [ 43, 43, 43, 0, 1 ] }

The above data is excerpted from the actual JSON response and reflects the userId (‘835346’) of the specified user, the net bytes added (‘43’), the absolute number of bytes added (‘43’), positive bytes added (‘43’), negative bytes added (‘0’), and edit count (‘1’). We look at full JSON responses in more depth in the “Understanding the response” section. If you are an authenticated UserMetrics API user, you can view the full JSON response for this request by clicking the request URL.

All user (magic cohorts!)
An all-user request returns the value of the specified metric for all users of a project (‘enwiki,’ by default). An all-user request is created with the ‘all’ magic keyword, which tells the UserMetrics engine to select all users instead of using a cohort drawn from the UserTag repository. Currently, the ‘all’ keyword is the only magic keyword, but future functionality will include keywords that generate a random sample of users from within the specified project.

An all-user request can be used to retrieve information about project activity over a quarter or year (returning the number of active new editors who created accounts over that time period, for example). In the following example, we look at how many arwiki users made five edits in the month following registration. To return this information, we use a ‘threshold’ metric and specify the ‘all’ magic keyword, as well as the time and event-count parameters of interest:

Request components: $$R_m$$ {'all' magic keyword, project=arwiki, threshold(time=720 hours, event_count=5 edits)}

Request return: { ‘arwiki_user_ID1’: [true, 5 edits made within 1st month after registration ] ‘arwiki_user_ID2’: [false, the user did not reach the threshold ] ‘arwiki_user_ID3’: […] ‘arwiki_user_ID4’: […] … ‘arwiki_user_IDlast’: […] }

Example of an all-user request:

Request URL: http://metrics.wikimedia.org/cohorts/all/threshold?project=arwiki&t=720&n=5

Returned data: "type": "raw", "data": { "xxx": [ 0 ], "xxx": [ 0 ], "xxx": [1]  …},

The returned data reflects the userId of each arwiki user as well as the value of the threshold metric for that user: true (‘1’) or false (‘0’) depending on whether or not that user met the threshold (5 edits within the first 720 hours after registration). We look at full JSON responses in more depth in the “Understanding the response” section. If you are an authenticated UserMetrics API user, you can view the full JSON response for this request by clicking the request URL.

Building a request
Requests are built and passed to the UserMetrics engine as HTTP requests. Each request (whether single-user, raw, aggregate, or time series) includes the name of a cohort (or username, in the case of a single-user request) as well as the name of the metric to return. Additional parameters—metric-specific or global parameters, aggregators, and time-series parameters—can be included as well.

An example of a basic raw request:


 * http://metrics.wikimedia.org/cohorts/test2/threshold

The above request returns the value of the ‘threshold’ metric for every member of the ‘test2’ cohort. If no additional parameters are set, the threshold metric returns true (‘1’) for every user who has reached the default editing-threshold (1 edit) within the default time period (24 hours since registration). Users who failed to reach the threshold receive a false (‘0’).

To change the threshold—to see how many users have made 5 edits in the first month since registration (i.e., approximately 720 hours)—use global parameters to specify the time period of interest (‘t’) and event count (‘n’) in the request URL:


 * http://metrics.wikimedia.org/cohorts/test2/threshold?t=720&n=5

The value of the ‘t’ parameter (720 hours) and ‘n’ parameter (5 edits) in the request URL override the default metric settings. For a complete list of metrics and their default settings and parameters, please see ‘Available Metrics.’ For more information about the global parameters, please see ‘Global parameters.’
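The URL pattern described above can be assembled programmatically. The following Python sketch is one possible way to do so using only the standard library; the helper name `build_request_url` and its `flags` argument (for valueless parameters such as ‘time_series’ or ‘refresh’) are assumptions made for this example and are not part of the UserMetrics API or its client.

```python
from urllib.parse import urlencode

BASE_URL = "http://metrics.wikimedia.org/cohorts"


def build_request_url(cohort, metric, params=None, flags=()):
    """Build a request URL from a cohort (or username), a metric, and parameters.

    `params` maps parameter names to values; `flags` lists valueless
    parameters such as 'time_series' or 'refresh'.
    """
    url = "%s/%s/%s" % (BASE_URL, cohort, metric)
    query = urlencode(params or {})
    parts = [p for p in (query,) if p] + list(flags)
    if parts:
        url += "?" + "&".join(parts)
    return url


print(build_request_url("test2", "threshold"))
print(build_request_url("test2", "threshold", {"t": 720, "n": 5}))
```

The two calls reproduce the basic raw request and the request with overridden ‘t’ and ‘n’ values shown above. Note that the sketch does not escape the cohort or metric names, so it is only suitable for simple names such as ‘test2’.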

The UserMetrics API archives all executed requests, which are available here: http://metrics.wikimedia.org/all_requests. This archive is a good place to browse individual requests, and to view their responses. Click an archived request link to view its JSON response.

A few additional examples of requests:

Number of bytes added by all users in the ‘test2’ cohort for a given time period (6/1/2011-7/1/2012). This is a raw request, and the metric value will be returned for each user in the cohort:


 * http://metrics.wikimedia.org/cohorts/test2/bytes_added?start=20110601&end=20120701&group=activity

Total number of bytes added (summed) each month over the given time period (11/1/2011-2/1/2012) for all users in the ‘test2’ cohort [i.e., a time series request]:


 * http://metrics.wikimedia.org/cohorts/test2/bytes_added?time_series&start=20111101&end=20120201&group=activity&slice=720&aggregator=sum

Proportion of users in the ‘test2’ cohort making at least 1 edit (n=1) in the given time period (12/1/2011-3/1/2012). Note that only users who registered in the given period are included in the response (i.e., ‘group=registration’):


 * http://metrics.wikimedia.org/cohorts/test2/threshold?start=20111201&end=20120301&group=registration&aggregator=proportion

Global parameters
Global parameters can be used in any request URL to customize how it is processed. Global parameters are used to specify the start and end dates of the time period of interest, for example, as well as the namespace in which activity occurs (by default, the UserMetrics API returns metrics representing activity in the main namespace).

In addition to global parameters, request URLs may contain metric-specific parameters and aggregators. For more information about these, please see ‘Metric-specific parameters and aggregators’.

Metric-specific parameters and aggregators
Each metric has a corresponding set of metric-specific parameters and aggregators that can be used with it. Though any metric can in principle be used with an aggregator (“mean” or “median”, for example), not every aggregator is relevant (or currently implemented) for each metric. The following aggregators are used in the UserMetrics system: sum, mean, std, median, min, max, proportion and dist.

Group-by methods
The group-by methods, which are specified with the “group” parameter, determine how to measure activity within a slice in a time-series request or when specifying an interval via start/end parameters. As of now, the UserMetrics API supports two group-by methods: ‘activity’ and ‘registration.’ If no group-by method is specified, ‘activity’ is used by default.

When the ‘activity’ group-by method is used, the API returns all user activity relevant to the request. For example, the following raw request will return the value of the ‘bytes_added’ metric over the month of December 2011 for each user in the ‘test2’ cohort:


 * http://metrics.wikimedia.org/cohorts/test2/bytes_added?start=20111201&end=20120101&group=activity

To better understand how this request works, we must take a closer look at the ‘test2’ cohort, which consists of three users, two of whom created accounts in December 2011, and one of whom created an account in 2010. The ‘activity’ method tells the API to ignore the date of account creation, and to measure all of the bytes-added activity by these three users between the start and end date of the interval.

Grouping by “registration” returns a different set of values. The following request is identical to the previous one except for the group-by method, which is now ‘registration’:


 * http://metrics.wikimedia.org/cohorts/test2/bytes_added?start=20111201&end=20120101&group=registration

In the above example, the API returns only the activity of the users who registered within the specified interval (i.e., December 2011). Two of the three users in the ‘test2’ cohort registered in December of 2011, and their bytes-added activity will be returned; the third user, who registered before the specified time interval, will be excluded from the response. Note that when the ‘registration’ group-by method is used, the activity is measured from the time of account creation. In the above example, the response will reflect the bytes added over the first 24 hours of account activity (the default) for all users who registered in December. To measure the activity over the first week of account activity (or more), specify a ‘t’ parameter with the desired number of hours. Note that when grouping by registration, the interval ‘t’ is honored even when it ends past the interval set by the start/end parameters.
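The difference between the two group-by methods can be sketched as a simple selection rule. The Python example below is an illustration only, not the API's implementation: the user names and registration dates are hypothetical values that mirror the description of the ‘test2’ cohort, and the interval is assumed to be half-open (start included, end excluded), which this guide does not specify.

```python
from datetime import datetime


def users_in_scope(registrations, start, end, group="activity"):
    """Return the users whose activity would be considered for the interval.

    `registrations` maps a user name to that user's registration time.
    Assumption: the interval is treated as half-open, [start, end).
    """
    if group == "activity":
        return sorted(registrations)
    if group == "registration":
        return sorted(u for u, reg in registrations.items() if start <= reg < end)
    raise ValueError("unknown group-by method: %r" % group)


# Hypothetical registration dates mirroring the 'test2' description above.
test2 = {
    "user_a": datetime(2011, 12, 3),
    "user_b": datetime(2011, 12, 20),
    "user_c": datetime(2010, 6, 15),
}
start, end = datetime(2011, 12, 1), datetime(2012, 1, 1)
print(users_in_scope(test2, start, end, group="activity"))
print(users_in_scope(test2, start, end, group="registration"))
```

With ‘activity’, all three users are in scope; with ‘registration’, only the two users who registered in December 2011 remain.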

Understanding the response
The UserMetrics API receives each submitted request and returns either the cached response (for requests that have been previously processed and that do not specify a ‘refresh’ parameter) or a new response generated by the UserMetrics engine (for new requests). In either case, the data is returned as JSON objects. Each JSON object contains the metric data as well as information about the request itself (i.e., request metadata).

The metric data consists of a header and the data itself:

Request metadata consists of the values of global and local parameters used in the request, as well as information about the request itself, such as the time it was generated:

Below is an example of a JSON response returned for the following simple request:


 * http://metrics.wikimedia.org/cohorts/test2/threshold

The submitted request is a raw request, meaning the API will return the value of the metric for every user in the specified cohort, ‘test2’ (which consists of three users). As no parameters are explicitly specified, default parameter values will be used to process the request. The values of the default parameters are reflected in the response:

At the top of the response is information that describes the time the response was generated by the UserMetrics API [2], when the cohort was last modified [3], the name of the metric returned [4], the name of the cohort [5], and the type of request, in this case ‘raw’ [6]. The values of the parameters used in the request [7]-[17] are also returned. If a parameter value was not specified explicitly, the default value for the parameter is returned. As no parameters were specified in the example request string, the parameter values in this case are all default values.

The “header” section [19]-[22] identifies the columns of data, in this case ‘userId’ and ‘has_reached_threshold'. The data itself is in the “data” section [24]-[34]. Because the request is raw, data is returned for each user: the userId, and a Boolean value that indicates whether each user has (‘1’) or has not (‘0’) reached the threshold.

Adding an aggregator to the previous example returns an aggregate response:

In an aggregate response, the name of the aggregator used [8] is included along with the request parameters and their values [7]-[17]. Note that the type of request is now “aggregator” [6].

The “header” [19]-[23] identifies the fields of aggregated data returned: total_users (total number of users in the cohort), threshold_reached (the number of users who reached the threshold) and rate (i.e., threshold_reached/total_users). The “data” section [25]-[29] contains only the aggregate values for the cohort (no individual values are included).

A time-series request returns an aggregate value (e.g., a sum or proportion) for each slice of a time interval:

http://metrics.wikimedia.org/cohorts/test2/threshold?aggregator=proportion&time_series=present&project=enwiki&start=20111201&end=20120301&slice=720&group=registration

The time-series parameters in the above request identify the request as a time series and specify the start and end date of the time interval, as well as the length of the time slice to use. In this example, the request specifies a three-month interval (between 12/01/2011 and 03/01/2012) sliced into one-month chunks (720 hours). The response will contain the aggregate data for each one-month slice of the interval (there are three full slices in this case, plus a fourth “left-over” chunk of about a day):

TODO: Document how slices are treated (left-closed vs. right-closed) and how slice labels are assigned.

Each slice is identified by its start date (lines [27], [32], [37], [42]) and the data consists of the number of users who registered during that slice [28], the number of those users who reached the threshold [29], and the proportion of users who successfully reached the threshold [30].
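The slice arithmetic for the example request can be checked with a short Python sketch. It simply steps through the interval in 720-hour increments; how the API itself treats slice boundaries and assigns labels is not documented here, so this should be read as an approximation. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def slice_starts(start, end, slice_hours):
    """Return the start time of each slice between start and end."""
    step = timedelta(hours=slice_hours)
    starts = []
    current = start
    while current < end:
        starts.append(current)
        current += step
    return starts


starts = slice_starts(datetime(2011, 12, 1), datetime(2012, 3, 1), 720)
for s in starts:
    print(s.strftime("%Y-%m-%d %H:%M:%S"))
```

For the interval above this yields four slice starts: three full 720-hour slices and a final slice that begins on 2012-02-29 and covers only the remaining day.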

Accessing cached requests
The UserMetrics API caches all previously run requests and their responses. If a user submits a request that has been previously processed, the system will, by default, return the cached response rather than rerunning the job. This behavior can be overridden with a ‘refresh’ request parameter, which specifies that the request be processed anew, regardless of whether or not it has been processed in the past. For example, if you are interested in how many new users reach a certain threshold of activity each week, you could rerun the same request URL every week to retrieve the relevant data provided that you add the refresh parameter (otherwise, the cached data from the previous week will be returned by default):


 * http://metrics.wikimedia.org/cohorts/test2/threshold?n=1&t=154&group=registration&refresh

Cached requests can also be accessed via their archived request links, which are available on the “Recently completed requests” page (http://metrics.wikimedia.org/all_requests).
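The caching behavior described above can be modeled in a few lines. The following Python sketch is a toy model, not the UserMetrics implementation: `CachingClient` and `fake_engine` are hypothetical names, and the "engine" simply records how many times it was asked to compute a response.

```python
class CachingClient:
    """Toy model of the caching behavior described above (not the real API)."""

    def __init__(self, engine):
        self._engine = engine      # callable that computes a response
        self._cache = {}           # request key -> cached response

    def request(self, key, refresh=False):
        # Recompute only for new requests or when a refresh is asked for.
        if refresh or key not in self._cache:
            self._cache[key] = self._engine(key)
        return self._cache[key]


calls = []


def fake_engine(key):
    """Stand-in for the UserMetrics engine; records each computation."""
    calls.append(key)
    return {"request": key, "run": len(calls)}


client = CachingClient(fake_engine)
first = client.request("test2/threshold?n=1")
second = client.request("test2/threshold?n=1")                # served from cache
third = client.request("test2/threshold?n=1", refresh=True)   # recomputed
```

In this model the second call returns the stored response without invoking the engine, while the third call, which sets the refresh option, triggers a new computation and replaces the cached entry.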

Response status
The API returns a ‘-1’ value each time a metric is undefined. If a threshold metric is used to measure the number of new users who make 1 edit in the first 24 hours since registration, for example, the API will return ‘-1’ for any user who has not yet been around the full 24 hours. The API will never return a NULL response.

Note: see https://mingle.corp.wikimedia.org/projects/analytics/cards/573
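Because ‘-1’ marks an undefined value rather than a measurement, client code will usually want to set it aside before computing its own summaries. The Python sketch below shows one way to do this for a raw threshold response. The user IDs and values are hypothetical, and excluding undefined values is a client-side choice made for this example; this guide does not state how the API's own ‘proportion’ aggregator treats them.

```python
UNDEFINED = -1  # value the API returns when a metric is undefined


def proportion_reached(raw_data):
    """Share of users with a defined value who reached the threshold.

    `raw_data` maps a user ID to a one-element list, as in a raw
    threshold response. Returns None if no user has a defined value.
    """
    defined = [v[0] for v in raw_data.values() if v[0] != UNDEFINED]
    if not defined:
        return None
    return sum(defined) / len(defined)


# Hypothetical user IDs and values, for illustration only.
sample = {"1001": [1], "1002": [0], "1003": [-1], "1004": [1]}
print(proportion_reached(sample))
```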

Available metrics
Currently, the UserMetrics API supports nine standardized metrics that provide information about user retention, the volume of user contribution, the quality of the contribution, and the type of contribution.

Each metric has default settings that can be overridden with global or metric-specific parameters. In addition, each metric has a set of aggregators (e.g., ‘proportion’ or ‘sum’) that can be used with a request to return an aggregate value for a cohort. Please see “Building a request” for more information about using parameters.

Retention metrics
Retention metrics help measure whether an editor is active and how active an editor is during or after a given timespan: threshold, survival, live_account.

threshold
The threshold metric returns a Boolean value that indicates whether a user has reached a given level of activity within a specified amount of time. By default, the metric will return true (‘1’) for a user if that user has made one edit within 24 hours of registration, and false (‘0’) otherwise. If the metric is not defined for a user (e.g., 24 hours has not yet elapsed since the time of registration), the API will return (‘-1’) for the user.

Currently, the following aggregators are implemented for this metric: proportion.

Example: The following request will return a true value (‘1’) for all users in the ‘test2’ cohort who have made 10 edits to the main namespace of the enwiki wiki (the default) in their first month (i.e., 720 hours after registration). Note that the values assigned to the ‘t’ and ‘n’ parameters in the request replace the metric’s default settings:


 * http://metrics.wikimedia.org/cohorts/test2/threshold?t=720&n=10

If you are an authenticated UserMetrics user, click the request URL to see the response.
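The three possible values of the threshold metric can be expressed as a small decision rule. The following Python sketch follows the description above; it is not the UserMetrics implementation, and the function name, the `now` argument, and the choice to treat the window as inclusive at both ends are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def has_reached_threshold(registration, edit_times, now, t=24, n=1):
    """Return 1, 0, or -1 following the threshold description above.

    t is the time window in hours after registration and n is the
    number of edits required within that window.
    """
    window_end = registration + timedelta(hours=t)
    if now < window_end:
        return -1  # window has not fully elapsed: metric undefined
    edits_in_window = sum(1 for e in edit_times if registration <= e <= window_end)
    return 1 if edits_in_window >= n else 0


# Hypothetical user: registered on 2013-01-01, edits 1, 5 and 30 hours later.
reg = datetime(2013, 1, 1)
edits = [reg + timedelta(hours=h) for h in (1, 5, 30)]
print(has_reached_threshold(reg, edits, now=datetime(2013, 2, 1)))
print(has_reached_threshold(reg, edits, now=datetime(2013, 2, 1), t=720, n=3))
```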

survival
The survival metric returns a Boolean value that indicates whether or not a user has participated on Wikipedia beyond a specified time period (e.g., a week, month, year, etc). By default, the time period begins at the time of account creation and lasts 24 hours. Users are considered "surviving" if they continue to edit after the time period has elapsed.

Currently, the following aggregators are implemented for this metric: proportion

Example: The following request will return a true value (‘1’) for all users in the ‘test2’ cohort who have made edits to the main namespace of the enwiki wiki (the default) after the specified time period has elapsed, in this case 30 days (or 720 hours) after account creation:


 * http://metrics.wikimedia.org/cohorts/test2/survival?t=720

If you are an authenticated UserMetrics user, click the request URL to see the response.

live_account
The live_account metric returns a Boolean value that indicates whether or not a user has clicked the edit button within a specified amount of time. By default, the time period begins at the time of account creation and ends 24 hours later. The UserMetrics API will return a value of true (‘1’) if the edit button was clicked within the specified time period, false (‘0’) if the edit button was clicked after the specified time period, and undefined (‘-1’) if the edit button was not clicked at all after registration.

Currently, the following aggregators are implemented for this metric: proportion

Example: The following request will return a true value (‘1’) for all users in the ‘test2’ cohort who clicked the edit button within the default time period (within the first 24 hours after account creation):


 * http://metrics.wikimedia.org/cohorts/test2/live_account

If you are an authenticated UserMetrics user, click the request URL to see the response.

Volume metrics
Volume metrics help measure the quantity of an editor’s wiki work: edit_rate, bytes_added, time_to_threshold.

edit_rate
The edit_rate metric returns an array of numerical values that reflect the quantity of an editor’s contributions over a given time period (i.e., edits/time period). The time period defaults to one day (24 hours, beginning at the time of account creation). When the metric is used with a raw request, the system returns for each user the edit count, edit rate, and the length of the time period over which the edits have been measured.

Currently, the following aggregators are implemented for this metric: mean, dist

Example: The following request will return the edit count, edit rate, and the time period of the measurement for all users in the ‘test2’ cohort. As no parameters are stated explicitly in this request, the UserMetrics system will use the default values for all parameters:


 * http://metrics.wikimedia.org/cohorts/test2/edit_rate

If you are an authenticated UserMetrics user, click the request URL to see the response.

TODO: Explain the "time_unit": 0 and "time_unit_count": 1 fields returned in the response for this metric.

bytes_added
The bytes_added metric returns an array of numerical values that reflect the amount of content an editor has added, removed, and modified within a given time period. The time period defaults to one day (24 hours, beginning at the time of account creation). When the metric is used with a raw request, the system returns for each user the net bytes contributed, absolute bytes contributed, bytes added (positive), bytes removed (negative), and edit count (e.g., [100, 200, 150, -50, 2]).

Currently, the following aggregators are implemented for this metric: sum, mean, std, median, min, max

Example: The following request will return an array of bytes-added values for each user in the ‘test2’ cohort. The ‘start’ and ‘end’ parameters specify a time period of three months (01/01/2013-04/01/2013). Note that when ‘start’ and ‘end’ date parameters are specified, the system ignores the ‘t’ parameter.


 * http://metrics.wikimedia.org/cohorts/test2/bytes_added?start=20130101000000&end=20130401000000

If you are an authenticated UserMetrics user, click the request URL to see the response.
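To make the relationship between the five bytes_added values concrete, the Python sketch below derives them from a list of per-edit size changes. The function name and the input values are hypothetical; the output order (net, absolute, positive, negative, edit count) follows the example responses excerpted earlier in this guide.

```python
def bytes_added_values(deltas):
    """Compute [net, absolute, positive, negative, edit count] from size changes.

    `deltas` holds the change in page size, in bytes, for each edit.
    """
    positive = sum(d for d in deltas if d > 0)
    negative = sum(d for d in deltas if d < 0)
    return [positive + negative, positive - negative, positive, negative, len(deltas)]


# Hypothetical per-edit size changes chosen to reproduce the aggregate example.
print(bytes_added_values([300, 400, 208, -200, -25]))
```

With these inputs the sketch yields the same five numbers as the aggregate ‘test2’ example (683, 1133, 908, -225, 5), which shows that the net value is the sum of the positive and negative bytes and the absolute value is the sum of their magnitudes.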

time_to_threshold
The time_to_threshold metric measures the time a user takes to reach a given level of activity. By default, the metric will return the number of minutes that pass between the time of account creation and a user’s first edit. These defaults can be overridden with the metric’s ‘first_edit’ and ‘threshold_edit’ parameters, which take integer values that reflect a number of edits (e.g., ‘1’ or ‘5’).

Note: the implementation of this metric is still under review; the metric should honor the generic ‘n’ parameter for representing the threshold.

Currently, the following aggregators are implemented for this metric: mean, dist

Example: The following request returns the number of minutes that pass between each ‘test2’ cohort user’s first and second edit. The metric-specific ‘first_edit’ and ‘threshold_edit’ parameters are used to override the default settings.


 * http://metrics.wikimedia.org/cohorts/test2/time_to_threshold?first_edit=1&threshold_edit=2

If you are an authenticated UserMetrics user, click the request URL to see the response.

Note: this example may not currently work as described. In testing, the response reported "threshold_edit": 1 and "first_edit": 0 despite the parameters set in the request URL.
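The intended behavior of the ‘first_edit’ and ‘threshold_edit’ parameters can be sketched as follows. This Python example is an interpretation of the description above, not the UserMetrics implementation: it assumes that event 0 is account creation and event k is the user's k-th edit, and it returns ‘-1’ when the user has not yet made enough edits. The function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def time_to_threshold(registration, edit_times, first_edit=0, threshold_edit=1):
    """Minutes between two events in a user's history, or -1 if undefined.

    Assumption: event 0 is account creation and event k is the user's
    k-th edit, so the defaults measure creation to first edit.
    """
    events = [registration] + sorted(edit_times)
    if threshold_edit >= len(events) or first_edit > threshold_edit:
        return -1
    delta = events[threshold_edit] - events[first_edit]
    return int(delta.total_seconds() // 60)


# Hypothetical user: first edit 45 minutes after registration, second after 2 hours.
registered = datetime(2013, 1, 1)
user_edits = [registered + timedelta(minutes=45), registered + timedelta(hours=2)]
print(time_to_threshold(registered, user_edits))
print(time_to_threshold(registered, user_edits, first_edit=1, threshold_edit=2))
```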

Quality of contribution metrics
Quality of contribution metrics allow us to measure the quality of an editor's wiki work: revert_rate, blocks.

revert_rate
The revert_rate metric is expressed as a ratio: ‘number of reverted edits/total edits’, measured over a given period of time. When used with a raw request, the metric returns each user’s revert rate as well as the total number of reverted edits made by that user in the specified time period. By default, the time period is 24 hours, beginning at the time of account creation.

The metric can be customized with the ‘look_ahead’ and ‘look_back’ parameters, which specify how far ahead and back in the revision history the system should look for revisions that exactly duplicate a previous revision (i.e., that ‘revert’ a page to its previous condition). The default is 15 revisions.

Currently, the following aggregators are implemented for this metric: mean

Example: The following request returns the revert rate and the number of reverted edits made by each ‘test2’ cohort user over the default time period (the first twenty-four hours following account creation).


 * http://metrics.wikimedia.org/cohorts/test2/revert_rate

If you are an authenticated UserMetrics user, click the request URL to see the response.
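The role of the ‘look_back’ and ‘look_ahead’ windows can be illustrated with a simplified revert check based on exact content duplicates. The Python sketch below is not the UserMetrics implementation: it assumes that each revision of a page is represented by a content hash, that both windows default to 15 revisions, and that a revision counts as reverted when a later revision within ‘look_ahead’ duplicates an earlier revision within ‘look_back’. All names and data are hypothetical.

```python
def is_reverted(hashes, k, look_back=15, look_ahead=15):
    """Check whether revision k of a page was undone by a later revision.

    `hashes` lists a content hash for each revision of one page, oldest
    first. Revision k counts as reverted if some revision within
    `look_ahead` after it duplicates a revision within `look_back` before it.
    """
    earlier = set(hashes[max(0, k - look_back):k])
    later = hashes[k + 1:k + 1 + look_ahead]
    return any(h in earlier for h in later)


def revert_rate(hashes, user_revisions, look_back=15, look_ahead=15):
    """Return (rate, reverted count) for the revision indexes made by one user."""
    reverted = sum(is_reverted(hashes, k, look_back, look_ahead) for k in user_revisions)
    rate = reverted / len(user_revisions) if user_revisions else 0.0
    return rate, reverted


# Hypothetical page history: revision 2 restores the content of revision 0.
history = ["aaa", "bbb", "aaa", "ccc"]
print(revert_rate(history, [1, 3]))
```

In this history, revision 1 is reverted (revision 2 restores the content of revision 0), while revision 3 is not, so a user who made revisions 1 and 3 has one reverted edit out of two.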

blocks
The blocks metric returns an array representing the number of times a user has been blocked, as well as the date of the first and the most recent block, if relevant. If the user has been banned, the date of the ban will be included as well. If the user has not been blocked or banned within the specified time period, the system will return ‘-1’.

Currently, the following aggregators are implemented for this metric: proportion

Example: The following request returns an array of ‘blocks’ data for each user in the ‘test2’ cohort.


 * http://metrics.wikimedia.org/cohorts/test2/blocks

If you are an authenticated UserMetrics user, click the request URL to see the response.

Type of contribution metrics
Type of contribution metrics allow us to measure the diversity of an editor's wiki work: namespace_edits

namespace_edits
The namespace_edits metric measures the breakdown of an editor’s contribution by namespace. When the metric is used with a raw request, the UserMetrics engine returns for each user an ordered dictionary containing the namespaces (identified by numerical ID) and the number of edits made to each over the specified time period. The default time period is 24 hours, beginning at the time of account creation.

Currently, the following aggregators are implemented for this metric: sum

Example: The following request returns the number of edits made by each user in the ‘test2’ cohort to each namespace. The ‘t’ parameter specifies that activity is measured over the first 30 days (720 hours) following account creation.


 * http://metrics.wikimedia.org/cohorts/test2/namespace_edits?t=720

If you are an authenticated UserMetrics user, click the request URL to see the response.

The UserMetrics client
The UserMetrics client generates and submits UserMetrics requests automatically, and can be used to retrieve metrics at regular intervals—daily or weekly—without need for human intervention.

The client is currently under development. More information can be found here: https://github.com/wikimedia/umapi_client

How to contribute
If you would like to contribute to the project, please see the technical documentation http://stat1.wikimedia.org/rfaulk/pydocs/_build/ for detailed information about installing and configuring the UserMetrics API.