Revtagging

What's revtagging
Revtagging is the ability to associate a set of metadata (or tags) with a revision at the time of its creation, for analytics purposes. Every edit to a page is stored in MediaWiki with a unique rev_id; revtagging allows us to attach supplementary, schema-free data to that rev_id.
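As a rough illustration of what "schema-free data attached to a rev_id" could look like, the following Python sketch keeps tags in a plain dictionary. The `revtags` store, the `tag_revision` helper and the tag keys are all hypothetical names chosen for this example; they are not part of MediaWiki.

```python
import json

# Hypothetical in-memory stand-in for a revtagging backend: rev_id -> tags.
revtags = {}

def tag_revision(rev_id, **tags):
    """Attach free-form key/value tags to a revision when it is created."""
    revtags[rev_id] = dict(tags)
    return revtags[rev_id]

# Two revisions tagged with different, unrelated sets of keys (no fixed schema).
tag_revision(1234, source="edit call to action", tool="Article Feedback Tool")
tag_revision(7890, source="mobile app", contribution="microcontribution")

print(json.dumps(revtags[7890], sort_keys=True))
```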

What is it for
By default, MediaWiki stores a set of metadata about edits (or "revisions") in the revision table. The metadata held in this table is needed to operate the website and, in particular, to populate article revision histories. This metadata (for example the revision timestamp, length, deletion status or SHA1 hash) can also be used for analytics purposes, to study edit survival or the quantity and quality of edits generated by an experiment. With several projects and product experiments generating new edits, we need to store a broader and different set of revision metadata to be able to assess the impact of these projects.

The following are examples of revision metadata that are not included in the revision table and that we may need to capture as part of analytics for product experimentation and community outreach activities:


 * revision 1234 was generated via an "edit call to action" served to all anonymous readers who posted feedback via the Article Feedback Tool;
 * revision 2345 was a part of a bulk-revert generated by a bot;
 * revision 3456 was generated via a power tool like Twinkle adding maintenance tags to an article;
 * revision 4567 was created by a participant in an editathon;
 * revision 5678 was created as a result of the bulk-upload of media donated to Commons by a GLAM institution;
 * revision 6789 was generated by a gadget enabled by default for all registered users on the English Wikipedia;
 * revision 7890 was a microcontribution generated via a mobile app.

What revtagging is not
The following types of data fall outside the scope of revtagging:
 * metadata that is needed by MediaWiki to operate the website and that should be stored in the revision table or other MediaWiki tables. This includes revision tags and metadata generated by the abuse filter, which is supported in MediaWiki via a dedicated extension.
 * metadata attached retrospectively to a rev_id: revtagging (at least in its preliminary version) will only support metadata attached to a revision by the process that creates it and at the time of its creation.

Why do we need it
Revtagging is a fundamental part of our community and product analytics as it will help us understand:


 * 1) given a feature (or a project), what edits are generated by this feature (or project)
 * 2) given a set of revisions, what features (or projects) generated them

At a single-project/single-feature level (1), storing additional metadata with article revisions will allow us to answer the following questions:

How many edits are generated from a specific source?
For example: how many edits were created via a specific call to action displayed to article feedback users in a given month?

What is the revert rate of edits generated from a specific source?
A new feature or an outreach program may generate a large number of revisions, but a large part of these revisions may be vandalism or abusive edits that need to be reverted or redacted. Revtagging will allow us to answer questions such as "how does the revert rate of edits created via a new mobile interface compare to the revert rate of regular edits?".
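A minimal sketch of such a comparison, assuming the revert status of each revision has already been determined elsewhere (revert detection is outside the scope of revtagging). The function name, the "untagged" label and the sample data are illustrative only.

```python
from collections import defaultdict

def revert_rate_by_source(revisions, revtags, default="untagged"):
    """Return {source: share of reverted edits}.

    revisions: iterable of (rev_id, was_reverted) pairs.
    revtags:   hypothetical mapping rev_id -> source label.
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [reverted, total]
    for rev_id, was_reverted in revisions:
        source = revtags.get(rev_id, default)
        counts[source][0] += int(was_reverted)
        counts[source][1] += 1
    return {source: reverted / total
            for source, (reverted, total) in counts.items()}

# Made-up sample: two tagged mobile edits, four regular (untagged) edits.
sample_revisions = [(1, True), (2, False), (3, False),
                    (4, True), (5, False), (6, False)]
sample_tags = {1: "mobile", 2: "mobile"}

rates = revert_rate_by_source(sample_revisions, sample_tags)
print(rates)
```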

What type of edits are generated from a specific source?
We may need to study, via qualitative handcoding, whether a new feature generates a specific type of edit (such as typo corrections, text expansion, image inclusion, inclusion in a category etc.) to assess whether it is fit for its purpose. A feature or a program generating an inappropriate type of edit may need to be discontinued.

At an aggregate level (2), revtagging will allow us to break down the volume of edits generated by a feature, a series of experiments or a program. If we observe a peak in activity in a given month (see figure 1) we want to be able to drill down to the specific program, project, feature or experimental condition (when applicable) that generated this increase in edits instead of speculating about its possible causes.
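The drill-down described here amounts to counting edits per month and per source. The sketch below assumes revision timestamps in MediaWiki's 14-digit `YYYYMMDDHHMMSS` form and a hypothetical `revtags` mapping from rev_id to a source label; the sample data is made up.

```python
from collections import Counter

def edits_by_month_and_source(revisions, revtags, default="untagged"):
    """Count edits per (month, source).

    revisions: iterable of (rev_id, timestamp) pairs, timestamp as
               a 'YYYYMMDDHHMMSS' string.
    revtags:   hypothetical mapping rev_id -> source label.
    """
    breakdown = Counter()
    for rev_id, timestamp in revisions:
        month = timestamp[:6]  # 'YYYYMM'
        breakdown[(month, revtags.get(rev_id, default))] += 1
    return breakdown

sample_revisions = [(1, "20120403101500"),
                    (2, "20120417080000"),
                    (3, "20120502120000")]
sample_tags = {1: "editathon", 3: "editathon"}

breakdown = edits_by_month_and_source(sample_revisions, sample_tags)
for (month, source), count in sorted(breakdown.items()):
    print(month, source, count)
```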

Is revtagging supposed to replace the revision table?
No. To answer the above questions we will need to combine revtagging data with data from the MediaWiki database. The purpose of revtagging is solely to identify the source associated with a set of revisions and to filter MediaWiki's revision table in order to analyze the quantity, quality and longevity of these revisions.
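One way to picture this combination is a join between the revision table and a separate tag store. The sketch below uses an in-memory SQLite database with a heavily simplified `revision` table (only `rev_id`, `rev_timestamp` and `rev_len`) and a hypothetical `revtag` table; the table name `revtag`, its `rt_` columns and the source labels are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for MediaWiki's revision table.
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_timestamp TEXT,
    rev_len       INTEGER
);
-- Hypothetical revtagging table, kept separate from revision.
CREATE TABLE revtag (
    rt_rev_id INTEGER PRIMARY KEY,
    rt_source TEXT
);
INSERT INTO revision VALUES
    (1234, '20120501120000', 5120),
    (2345, '20120502090000', 300),
    (3456, '20120503100000', 4800);
INSERT INTO revtag VALUES (1234, 'aft-cta'), (3456, 'twinkle');
""")

def revisions_from(source):
    """Return revision rows whose tag identifies the given source."""
    cursor = conn.execute(
        """SELECT r.rev_id, r.rev_timestamp, r.rev_len
           FROM revision AS r
           JOIN revtag AS t ON t.rt_rev_id = r.rev_id
           WHERE t.rt_source = ?
           ORDER BY r.rev_id""",
        (source,),
    )
    return cursor.fetchall()

print(revisions_from("aft-cta"))
```

The tag store only answers "which revisions came from this source"; length, timestamp and everything else still comes from the revision table.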

Bots

 * Motivation
   * Mark edits by bots in page histories, not just in RecentChanges
   * Simplify data analysis using the revision table: do not rely on edit comments for parsing bot edits
   * Filter bot-edits from high-level reports
 * Required data
   * Campaign: Bot
   * Subcampaign: bot name
   * Additional info: bot version
 * Refs
   * Mark bot edits in histories 1118
   * Interface support to mark bot edits in histories 13516

MediaWiki features

 * Motivation
   * This use case applies to edits generated by a variety of MediaWiki features, including: the VisualEditor; Undo and Rollback links; Article Feedback Tool and associated calls-to-action; editor engagement experiments.
   * Avoid storing undo/rollback actions in edit summaries.
   * Measure the revert rate of edits generated by new interfaces that lower contribution barriers (e.g. VisualEditor)
 * Required data
   * Type: "feature"
   * Name: (e.g. "Article Creation Wizard")
   * Version (if applicable)
 * Optional data
   * Campaign: (e.g. "AFT Phase 3")
   * Subcampaign: (e.g. "a" or "b")
 * Refs
   * Quality of AFT-generated contributions

Third-party tools

 * Motivation
   * Identify which tool produced a given set of revisions (third-party editor interfaces, offline editors, or photo upload tools)
   * Avoid tracking uploads via categories (e.g. https://commons.wikimedia.org/wiki/Category:Uploaded_with_Commonist or https://commons.wikimedia.org/wiki/Category:Uploaded_with_UploadWizard)
   * Avoid parsing revision summaries in order to identify Huggle- or AWB-generated edits, e.g. https://pt.wikipedia.org/?diff=prev&oldid=28916224&uselang=en
 * Required data
   * Type: "3rd party tool"
   * Name: (e.g. Huggle, AWB)
   * Version:

User scripts

 * Motivation
   * Measure how many edits are generated using a gadget (e.g. WikiGnome/Twinkle) and identify experimental conditions associated with gadget-based experiments
 * Required data
   * Type: "User script"
   * Name: (e.g. Wikignome)
   * Script: (e.g. wiki/User:EpochFail/wikignome.js)
   * Version:
 * Optional data
   * Campaign: (e.g. Wikignome-1)
   * Subcampaign: (e.g. "a" or "b")
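The required fields listed for the typed use cases so far ("feature", "3rd party tool" and "User script") could be checked with a small helper before a tag is stored. This is only a sketch: the lower-case field keys, the `REQUIRED_FIELDS` table and the `missing_fields` function are assumptions made for illustration, and additional free-form keys are deliberately allowed because revtagging data is schema-free.

```python
# Required fields per tag type, transcribed from the use cases above.
# "Version (if applicable)" for features is treated as optional here.
REQUIRED_FIELDS = {
    "feature": {"name"},
    "3rd party tool": {"name", "version"},
    "User script": {"name", "script", "version"},
}

def missing_fields(tag):
    """Return the sorted list of required fields absent from a tag dict."""
    tag_type = tag.get("type")
    if tag_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown tag type: {tag_type!r}")
    return sorted(REQUIRED_FIELDS[tag_type] - set(tag))

complete = {
    "type": "User script",
    "name": "Wikignome",
    "script": "wiki/User:EpochFail/wikignome.js",
    "version": "1",           # made-up version string
    "campaign": "Wikignome-1",
    "subcampaign": "a",
}
incomplete = {"type": "User script", "name": "Wikignome"}

print(missing_fields(complete))
print(missing_fields(incomplete))
```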

Mobile contributions

 * Motivation
   * Measure edits from mobile devices to compare against edits by other means
   * Measure the survival of edits from mobile sources, as well as the number of edits of each type (for example, photo upload vs. text corrections)
 * Required data
   * User agent to identify browser/app and ideally device
   * Type of contribution or edit, e.g., photo upload
   * Application or context of edit
   * Locale, if available

Secondary use cases
Third-party edits by authenticated apps
 * Diagnostics
 * Mass-revert

Revtagging solutions
There is no support in the revision table for storing arbitrary metadata (such as an experiment id or bucket id used in the context of an A/B test). Modifying the schema of this table is costly and impacts the functioning of our projects. As a result, revision metadata that is needed solely for data analysis purposes should not be stored in the revision table.

Other solutions?
 * clicktracking log
 * dedicated MediaWiki table
 * dedicated key-value store
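To make the key-value option more concrete, here is a minimal sketch of a write-once store keyed by rev_id, backed by SQLite and storing tags as a JSON blob. The class name `RevtagStore`, its methods and the table layout are hypothetical; the write-once behaviour mirrors the constraint that tags are attached only at revision creation time, not retrospectively.

```python
import json
import sqlite3

class RevtagStore:
    """Hypothetical write-once key-value store: rev_id -> JSON tags."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS revtag ("
            "rev_id INTEGER PRIMARY KEY, tags TEXT NOT NULL)"
        )

    def put(self, rev_id, tags):
        """Store tags for a new revision; refuse to overwrite existing ones."""
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT INTO revtag (rev_id, tags) VALUES (?, ?)",
                    (rev_id, json.dumps(tags)),
                )
        except sqlite3.IntegrityError:
            raise ValueError(f"revision {rev_id} is already tagged") from None

    def get(self, rev_id):
        """Return the tags for a revision, or None if it was never tagged."""
        row = self.conn.execute(
            "SELECT tags FROM revtag WHERE rev_id = ?", (rev_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

store = RevtagStore()
store.put(6789, {"type": "User script", "name": "Wikignome", "subcampaign": "a"})
print(store.get(6789))
```

Keeping the tags in an opaque JSON column avoids any schema change when a new experiment needs a new key, at the cost of making per-key queries less efficient than a dedicated column would be.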