Revtagging

What's revtagging
Revtagging is the ability to associate a set of metadata (or tags) with a revision at the time of its creation, for analytics purposes. Every edit to a page is stored in MediaWiki with a unique rev_id; revtagging allows us to attach supplementary data to a rev_id.
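As a rough illustration, a revtag record could be thought of as a small dictionary of supplementary data keyed by rev_id, held outside MediaWiki's own tables. This is a hypothetical sketch; the field names are illustrative, not a committed schema.

```python
# Hypothetical revtag record: supplementary metadata attached to a rev_id
# at creation time, stored separately from MediaWiki's revision table.
# All field names below are illustrative, not a final schema.
revtag = {
    "rev_id": 1234,                    # the MediaWiki revision being annotated
    "type": "feature",                 # broad category: feature, bot, 3rd party tool, ...
    "name": "Article Feedback Tool",   # which feature/tool created the edit
    "campaign": "AFT Phase 3",         # optional: the experiment or program
    "subcampaign": "a",                # optional: e.g. an A/B test bucket
}
```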

What is it for
By default, MediaWiki stores a set of metadata about edits (or "revisions") in the revision table. The metadata held in this table is needed to operate the website and, in particular, to populate article revision histories. This metadata (for example the revision timestamp, length, deletion status or SHA1 hash) can also be used for analytics purposes, to study edit survival or the quantity and quality of edits generated by an experiment. With several projects and product experiments generating new edits, we need to store a broader and different set of revision metadata to be able to assess the impact of these projects.

The following are examples of revision metadata that are not included in the revision table and that we may need to capture as part of product experimentation or community outreach activities:


 * revision 1234 was generated via an "edit call to action" served to all anonymous readers who posted feedback via the Article Feedback Tool;
 * revision 2345 was a part of a bulk-revert generated by a bot;
 * revision 3456 was generated via a power tool like Twinkle adding maintenance tags to an article;
 * revision 4567 was created by a participant in an editathon;
 * revision 5678 was created as a result of the bulk-upload of media donated to Commons by a GLAM institution;
 * revision 6789 was generated by a gadget enabled by default for all registered users on the English Wikipedia;
 * revision 7890 was a microcontribution generated via a mobile app.

What revtagging is not
The following types of data fall outside the scope of revtagging:
 * metadata that is needed by MediaWiki to operate the website and that should be stored in the revision table or other MediaWiki tables. This includes revision tags and metadata generated by the abuse filter, which is supported in MediaWiki via a dedicated extension.
 * metadata attached retrospectively to a rev_id: revtagging (at least in its preliminary version) will only support metadata attached to a revision by the process that creates it and at the time of its creation.

Why do we need it
Revtagging is a fundamental part of our community and product analytics, as it will allow us to:

 * 1) given a feature (or project), identify the edits generated by that feature (or project);
 * 2) given a set of revisions, identify the features (or projects) that generated them.

At a single-project/single-feature level (1), storing additional metadata with article revisions will allow us to answer the following questions:

How many edits are generated from a specific source?
For example: how many edits were created via a specific call to action displayed to article feedback users in a given month?

What is the revert rate of edits generated from a specific source?
A new feature or an outreach program may generate a large number of revisions, but a large part of these revisions may be vandalism or abusive edits that need to be reverted or redacted. Revtagging will allow us to answer questions such as "how does the revert rate of edits created via a new mobile interface compare to the revert rate of regular edits?".
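The revert-rate comparison above can be sketched as a simple join between revisions and their tags. The data and field names here are hypothetical, assuming only that tagged revisions can be looked up by rev_id:

```python
# Sketch of the revert-rate comparison revtagging would enable.
# Revisions and tag values below are made-up illustration data.
revisions = [
    {"rev_id": 1, "reverted": True,  "tag": "mobile"},
    {"rev_id": 2, "reverted": False, "tag": "mobile"},
    {"rev_id": 3, "reverted": False, "tag": None},   # regular (untagged) edit
    {"rev_id": 4, "reverted": True,  "tag": None},
    {"rev_id": 5, "reverted": False, "tag": None},
]

def revert_rate(revs, tag):
    """Fraction of revisions carrying the given tag that were reverted."""
    matching = [r for r in revs if r["tag"] == tag]
    if not matching:
        return 0.0
    return sum(r["reverted"] for r in matching) / len(matching)

mobile_rate = revert_rate(revisions, "mobile")   # 1 of 2 reverted -> 0.5
regular_rate = revert_rate(revisions, None)      # 1 of 3 reverted
```

With real tag data, the same computation could be run as a SQL aggregate over a revtag table joined to the revision table.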

What type of edits are generated from a specific source?
We may need to study, via qualitative hand-coding, whether a new feature generates a specific type of edit (such as typo corrections, text expansion, image inclusion, addition to a category, etc.) to assess whether it is fit for its purpose. A feature or a program generating an inappropriate type of edit may need to be discontinued.

At an aggregate level (2), revtagging will allow us to break down the volume of edits generated by a feature, a series of experiments or a program. Given a peak in activity in a given month, we want to be able to drill down to the specific program, project, feature or experimental condition (when applicable) that generated this increase in edits.

Primary use cases
General use cases
 * Measure the revert and survival rate of edits, based on the tool that was used to generate them
 * Quantify how many revisions are produced by different tools/devices
 * Allow mass-reversion of edits produced by abusive/disruptive tools, if needed (QUESTION: why is this part of RevisionTags?)
 * Filter recent changes by specific tags (similar to AbuseFilter)

Question: how does http://en.wikipedia.org/wiki/Wikipedia:Tags fit into this?

Bots
Motivation
 * Mark edits by bots in page histories, not just in RecentChanges
 * Simplify data analysis using the revision table
 * Avoid relying on edit comments for parsing bot edits
Required data
 * Campaign: Bot
 * Subcampaign: bot name
 * Additional info: bot version
Refs
 * Mark bot edits in histories: https://bugzilla.wikimedia.org/show_bug.cgi?id=11181
 * Interface support to mark bot edits in histories: https://bugzilla.wikimedia.org/show_bug.cgi?id=13516

MediaWiki features (e.g. "Undo", "Rollback", "Visual editor")
Motivation
This category could apply to: Visual editor, Undo and Rollback, Article Creation Wizard, Article Feedback Tool, AbuseFilter.
 * Flag edits that are generated via AFT to measure their survival/revert rate on top of conversions; this is a requirement for phase 3 of the AFT test plan.
 * Measure how many new articles are created via a specific funnel of the article creation workflow. New articles are only represented in the DB as plain revisions, so we need to tag the revision associated with the first edit that creates an article.
 * Use of MediaWiki features (e.g. Visual editor) is currently difficult to track.
 * Editors should be able to tag certain AbuseFilter log entries as false positives, to be able to track the accuracy of each filter (and disable filters that wrongly filter too many edits). https://bugzilla.wikimedia.org/show_bug.cgi?id=28213
Required data
 * Type: "feature"
 * Name: (e.g. "Article Creation Wizard")
 * Version (not sure if this is relevant --Aaron)
Optional data
 * Campaign: (e.g. "AFT Phase 3")
 * Subcampaign: (e.g. "a" or "b")
 * Number of edits at time of revision
Refs
 * Undo and rollback are noted in the default edit summaries.
 * AFT data and metrics, Phase 3 - effects on edits: http://meta.wikimedia.org/wiki/Research:Article_feedback/Data_and_metrics
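The required and optional fields for this use case could be serialized as a single tag payload sent by the feature when it saves the edit. This is a hypothetical sketch: the keys mirror the field names listed above, and all values are illustrative.

```python
import json

# Hypothetical tag payload for a feature-generated edit. Keys mirror the
# "Required data" / "Optional data" fields of this use case; the values
# are made up for illustration and do not represent a final schema.
payload = {
    "type": "feature",
    "name": "Article Creation Wizard",
    "version": "1.0",              # may turn out to be irrelevant (see note)
    "campaign": "AFT Phase 3",
    "subcampaign": "b",            # e.g. an experimental bucket
    "edit_count_at_revision": 42,  # "number of edits at time of revision"
}

# What the creating process would attach to the new rev_id:
serialized = json.dumps(payload)
```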

3rd party tools (e.g. Huggle, AWB and Pywikipediabot)
Motivation
 * Measure which tool produced a revision and whether those revisions include a particular template that is part of an A/B test.
 * We're likely to see an increase of contributions through third-party apps using the API in the future: third-party editors, offline editors, or custom mass-upload tools for various uses (generic, photographer-specific, GLAM-specific, etc.).
 * In both cases, it's currently difficult to track the volume of contributions made with each tool. Tracking would allow us to identify promising tools that get traction, and also to spot misbehaving/buggy/expensive tools.
Required data
 * Type: "3rd party tool"
 * Name: (e.g. Huggle, AWB)
 * Version:
Optional data
 * Campaign: (e.g. Wikignome-1)
 * Subcampaign: (e.g. "a" or "b")
Refs
 * Currently found via parsing revision comments; e.g. this edit summary: https://pt.wikipedia.org/?diff=prev&oldid=28916224&uselang=en
 * Subcampaign (or whatever we want to call it) names are found only by parsing revision text; it would be extremely helpful to have these tagged as well, to avoid this in future experiments involving these tools.
 * In some cases, upload tracking is done through categories (e.g. https://commons.wikimedia.org/wiki/Category:Uploaded_with_Commonist or https://commons.wikimedia.org/wiki/Category:Uploaded_with_UploadWizard ), but in most cases (esp. editing) there's no tracking at all.

User scripts (e.g. WikiGnome/Twinkle)
Motivation
 * Measure how many edits are made (and by whom) using a gadget.
 * Identify experimental conditions.
 * Similar to power tools above.
Required data
 * Type: "User script"
 * Name: (e.g. Wikignome)
 * Script: (e.g. wiki/User:EpochFail/wikignome.js)
 * Version:
Optional data
 * Campaign: (e.g. Wikignome-1)
 * Subcampaign: (e.g. "a" or "b")
Refs
 * Currently identified by notes at the end of the edit summary, e.g.: https://pt.wikipedia.org/?diff=prev&oldid=28884558&uselang=en
 * Experimental group conditions are identified via a logging-based JSONP request. (Cross-server scripting... ew...)

Mobile
Motivation
 * Measure edits from mobile devices to compare against edits made by other means.
 * In particular, measure the durability of edits from mobile sources, but also the number and types of edits (for example, photo upload vs. text correction).
Required data
 * User agent, to identify browser/app and ideally device
 * Type of contribution or edit, e.g. photo upload
 * Application or context of edit
 * Locale, if available
 * Time of edit is also potentially useful
Refs
 * Generally dependent on main site projects, such as Article Feedback

Translate
Motivation
 * Tagging fuzzy translations
 * Tagging which revision of the source text the translation was done against
Required data
 * Tagname, revision, pageid (for finding the latest tagged revision for a given page), payload (the revision id of the source text)
Refs
 * https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Translate/sql/revtag.sql?revision=101801&view=markup (this was originally intended to end up in core, but I gave up on that idea a few months ago; now the idea seems to have surfaced again!)

Notes:
 * We could merge the spec for 3rd party tools and user scripts, since they both interact with Wikipedia using the API. The only difference with user scripts is that they use the browser itself as the external tool rather than building one from the ground up (e.g. Huggle). By the same token, I don't see why Huggle couldn't be implemented as a user script. --Aaron
 * I've made a merged category for MediaWiki "features" since there are a lot that need tracking. --Aaron
 * Could we wrap "mobile" into 3rd party tools? It seems like there should be a difference between Huggle and the Wikipedia mobile app, but they interact in similar ways and will have similar metadata. Maybe we should have separate categories for "power tools" and "browsers". --Aaron

Secondary use cases
 * Third-party generated edits (OAuth)
 * Diagnostics
 * Mass-revert

Revtagging solutions
The revision table has no support for storing arbitrary metadata (such as an experiment id or bucket id used in the context of an A/B test). Modifying the schema of this table is costly and impacts the functioning of our projects. As a result, revision metadata that is needed solely for data analysis purposes should not be stored in the revision table.
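One way to keep such metadata out of the revision table is an ad-hoc key-value store keyed by rev_id. The sketch below uses SQLite purely for illustration; the table name, column names and example values are all assumptions, not a proposed production schema.

```python
import sqlite3

# Illustrative key-value store for revision metadata, kept entirely
# outside the MediaWiki revision table. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE revtag (
        rt_rev_id INTEGER NOT NULL,   -- MediaWiki rev_id being annotated
        rt_key    TEXT    NOT NULL,   -- e.g. 'campaign', 'subcampaign'
        rt_value  TEXT,               -- e.g. 'AFT Phase 3', 'a'
        PRIMARY KEY (rt_rev_id, rt_key)
    )
""")

# The process creating the revision attaches its tags at save time:
conn.executemany(
    "INSERT INTO revtag VALUES (?, ?, ?)",
    [(1234, "campaign", "AFT Phase 3"), (1234, "subcampaign", "a")],
)

# Analytics side: fetch all tags for a given revision.
tags = dict(
    conn.execute(
        "SELECT rt_key, rt_value FROM revtag WHERE rt_rev_id = ?", (1234,)
    )
)
```

Because the store is joined to revisions only by rev_id, it can grow or change shape without any modification to MediaWiki's own schema.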

(overview of revtagging solutions, from clicktracking to the mw logging table to an ad-hoc key value store)