Revtagging

What's revtagging
Revtagging is the ability to associate a set of metadata or tags with a revision at the time of its creation for analytics purposes.

What is it for

 * MediaWiki stores several kinds of metadata in the revision table.
 * There are two types of revision metadata: metadata captured at the time a revision is created, and metadata added after the fact.

What revtagging is not

 * Metadata that should be stored in the revision table
 * Post-creation tagging
 * Abuse filter-generated metadata

Why do we need it
Revtagging is an essential part of product analytics, as it will allow us to:

 * given a feature, identify the edits generated by that feature
 * given a set of revisions, identify the features that generated them

Revtagging will allow us to answer the following questions:
 1. How many edits are generated via a specific source? For example: how many edits were created via a specific call to action displayed to Article Feedback users in a given month?
 2. What is the revert rate of edits generated from a specific source? A feature may generate a large number of revisions, but many of them may be vandalism or abusive edits. Revtagging will allow us to answer questions such as "how does the revert rate of edits created via a mobile interface compare to the revert rate of regular edits?"
 3. What type of edits are generated from a specific source? An experiment may generate

The purpose of revtagging is solely to identify the source of a set of revisions and to filter MediaWiki's revision table for the purpose of analytics.

At an aggregate level, revtagging will allow us to quantify the volume of edits (or non-reverted edits) generated by a feature, a series of experiments or a program.
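The aggregate metrics described above are straightforward once revisions carry source tags. The sketch below assumes a hypothetical record format in which each revision has a `tag` (its source) and a `reverted` flag; it is an illustration of the analytics revtagging enables, not an existing MediaWiki API.

```python
from collections import defaultdict

def revert_rates(revisions):
    """Return {tag: (edit_count, revert_rate)} for a list of tagged revisions."""
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for rev in revisions:
        tag = rev["tag"]          # e.g. "mobile", "AFT", "Huggle"
        totals[tag] += 1
        if rev["reverted"]:
            reverted[tag] += 1
    return {tag: (totals[tag], reverted[tag] / totals[tag]) for tag in totals}

# Hypothetical tagged revisions:
revs = [
    {"rev_id": 1, "tag": "mobile", "reverted": False},
    {"rev_id": 2, "tag": "mobile", "reverted": True},
    {"rev_id": 3, "tag": "AFT", "reverted": False},
    {"rev_id": 4, "tag": "AFT", "reverted": False},
]
print(revert_rates(revs))  # mobile: 2 edits, 50% reverted; AFT: 2 edits, 0%
```

The same grouping answers both the volume question (edit counts per tag) and the revert-rate question from the list above.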

Primary use cases
General use cases
 * Measure the revert and survival rate for edits, based on the tool that was used to generate them
 * Quantify how many revisions are produced by different tools/devices
 * Allow mass-reversion for edits produced by abusive/disruptive tools, if needed (QUESTION: why is this part of RevisionTags?)
 * Filter recent changes by specific tags (similar to AbuseFilter)

Question: how does http://en.wikipedia.org/wiki/Wikipedia:Tags fit into this?

Bots
Motivation
 * Mark edits by bots in page histories, not just in RecentChanges
 * Simplify data analysis using the revision table
 * Do not rely on edit comments for identifying bot edits
Required data
 * Campaign: Bot
 * Subcampaign: bot name
 * Additional info: bot version
Refs
 * Mark bot edits in histories: https://bugzilla.wikimedia.org/show_bug.cgi?id=11181
 * Interface support to mark bot edits in histories: https://bugzilla.wikimedia.org/show_bug.cgi?id=13516

MediaWiki features (e.g. "Undo", "Rollback", "Visual editor")
Motivation
This category could apply to Visual editor, Undo and Rollback, Article Creation Wizard, Article Feedback Tool, and AbuseFilter.
 * Flag edits that are generated via AFT to measure their survival/revert rate on top of conversions; this is a requirement for phase 3 of the AFT test plan.
 * Measure how many new articles are created via a specific funnel of the article creation workflow. New articles are only represented in the DB as plain revisions, so we need to tag the revision associated with the first edit that creates an article.
 * Use of MediaWiki features (e.g. Visual editor) is currently difficult to track.
 * Editors should be able to tag certain AbuseFilter log entries as false positives, to be able to track the accuracy of each filter (and disable filters that wrongly catch too many edits). https://bugzilla.wikimedia.org/show_bug.cgi?id=28213
Required data
 * Type: "feature"
 * Name: (e.g. "Article Creation Wizard")
 * Version: (not sure if this is relevant --Aaron)
Optional data
 * Campaign: (e.g. "AFT Phase 3")
 * Subcampaign: (e.g. "a" or "b")
 * Number of edits at time of revision
Refs
 * Undo and rollback are noted in the default edit summaries.
 * AFT data and metrics: Phase 3 - effects on edits: http://meta.wikimedia.org/wiki/Research:Article_feedback/Data_and_metrics

3rd party tools (e.g. Huggle, AWB and Pywikipediabot)
Motivation
 * Measure which tool produces a revision, and whether those revisions include a particular template that is part of an A/B test.
 * We're likely to see an increase of contributions through third-party apps using the API in the future. These can be third-party editors, offline editors, or custom mass-upload tools for various uses (generic, photographer-specific, GLAM-specific, etc.).
 * In both cases, it's currently difficult to track the volume of contributions made with each tool. Tracking would allow us to identify promising tools that get traction, and also track misbehaving/buggy/expensive tools.
Required data
 * Type: "3rd party tool"
 * Name: (e.g. Huggle, AWB)
 * Version:
Optional data
 * Campaign: (e.g. Wikignome-1)
 * Subcampaign: (e.g. "a" or "b")
Refs
 * Currently found via parsing revision comments. Example edit summary: https://pt.wikipedia.org/?diff=prev&oldid=28916224&uselang=en
 * Subcampaign (or whatever we want to call it) names are found only by parsing revision text; it would be extremely helpful to have these tagged as well, to avoid this for future experiments involving these tools.
 * In some cases, upload tracking is done through categories (e.g. https://commons.wikimedia.org/wiki/Category:Uploaded_with_Commonist or https://commons.wikimedia.org/wiki/Category:Uploaded_with_UploadWizard ), but in most cases (especially editing) there's no tracking at all.
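To illustrate the fragility of the status quo, here is a sketch of identifying a tool by parsing its edit summary. The "(using <tool> <version>)" suffix is a hypothetical convention; real tools each use their own summary format, which is exactly why a structured tag at revision-creation time would be more reliable.

```python
import re

# Hypothetical convention: summaries end with "(using <tool> <version>)".
TOOL_SUFFIX = re.compile(r"\(using (?P<tool>[\w ]+?) (?P<version>[\d.]+)\)\s*$")

def tool_from_summary(summary):
    """Best-effort extraction of (tool, version) from an edit summary."""
    m = TOOL_SUFFIX.search(summary)
    return (m.group("tool"), m.group("version")) if m else None

print(tool_from_summary("Reverted vandalism (using Huggle 3.2)"))
print(tool_from_summary("fixed a typo"))  # no match: tool goes untracked
```

Any tool that omits or changes the suffix silently drops out of the analysis, whereas a revtag attached at save time cannot be lost this way.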

User scripts (e.g. WikiGnome/Twinkle)
Motivation
 * Measure how many edits are made (and by whom) using a gadget.
 * Identify experimental conditions. Similar to power tools above.
Required data
 * Type: "User script"
 * Name: (e.g. Wikignome)
 * Script: (e.g. wiki/User:EpochFail/wikignome.js)
 * Version:
Optional data
 * Campaign: (e.g. Wikignome-1)
 * Subcampaign: (e.g. "a" or "b")
Refs
 * Currently identified by notes at the end of the edit summary. E.g.: https://pt.wikipedia.org/?diff=prev&oldid=28884558&uselang=en
 * Experimental group conditions are identified via a JSONP-based logging request. (Cross-server scripting... ew...)
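The "feature", "3rd party tool", and user-script categories above share the same field shape. The sketch below models that common tag payload; field names come from the Required/Optional data lists in this proposal, and the builder function itself is illustrative, not an existing API.

```python
def make_revtag(type_, name, **optional):
    """Build the metadata to attach to a revision at save time.

    type_ and name are required for every category; the optional fields
    below are the ones listed in the use cases above.
    """
    allowed = {"version", "script", "campaign", "subcampaign"}
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError("unknown tag fields: %s" % sorted(unknown))
    tag = {"type": type_, "name": name}
    tag.update({k: v for k, v in optional.items() if v is not None})
    return tag

tag = make_revtag("User script", "Wikignome",
                  script="wiki/User:EpochFail/wikignome.js",
                  campaign="Wikignome-1", subcampaign="a")
```

A shared shape like this is what makes merging the categories (as suggested in the notes below) attractive: only the `type` value differs.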

Mobile
Motivation
 * Measure edits from mobile devices to compare against edits made by other means.
 * In particular, measure the durability of edits from mobile sources, but also the numbers and types of edits (for example, photo upload vs. text correction).
Required data
 * User agent, to identify browser/app and ideally device
 * Type of contribution or edit, e.g. photo upload
 * Application or context of edit
 * Locale, if available
 * Time of edit is also potentially useful
Refs
 * Generally dependent on main site projects, such as Article Feedback

Translate
Motivation
 * Tagging fuzzy translations
 * Tagging which revision of the source text the translation was done against
Required data
 * Tagname, revision, pageid (for finding the latest tagged revision for a given page), payload (the revision id of the source text)
Refs
 * https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Translate/sql/revtag.sql?revision=101801&view=markup (this was originally intended to end up in core; I gave up on that idea a few months ago, but now it seems to have surfaced again!)
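The Translate extension's revtag table (linked above) can be modelled roughly as follows. Column names follow the fields listed under "Required data"; the column types and example tag names are guesses for illustration, modelled here in SQLite rather than MediaWiki's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE revtag (
        tagname  TEXT    NOT NULL,
        pageid   INTEGER NOT NULL,
        revision INTEGER NOT NULL,
        payload  TEXT              -- e.g. the revision id of the source text
    )
""")
conn.executemany("INSERT INTO revtag VALUES (?, ?, ?, ?)", [
    ("fuzzy",    42, 100, None),
    ("transver", 42, 100, "95"),    # translated against source revision 95
    ("transver", 42, 130, "120"),
])

# Find the latest tagged revision for a given page, as described above.
row = conn.execute(
    "SELECT revision, payload FROM revtag "
    "WHERE tagname = ? AND pageid = ? ORDER BY revision DESC LIMIT 1",
    ("transver", 42),
).fetchone()
print(row)  # latest tagged revision and the source-text revision it used
```

The payload column is what distinguishes this use case: the tag carries data (the source revision id), not just a label.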

Notes
 * We could merge the spec for 3rd party tools and user scripts, since they both interact with Wikipedia using the API. The only difference with user scripts is that they use the browser itself as the external tool, rather than building one from the ground up (e.g. Huggle). In the same way, I don't see why Huggle couldn't be implemented as a user script. --Aaron
 * I've made a merged category for MediaWiki "features", since there are a lot that need tracking. --Aaron
 * Could we wrap "mobile" into 3rd party tools? It seems like there should be a difference between Huggle and the Wikipedia mobile app, but they interact in similar ways and will have similar metadata. Maybe we should have a separate category for "power tools" and "browsers". --Aaron

Secondary use cases
 * Third-party-generated edits (OAuth)
 * Diagnostics
 * Mass-revert

Revtagging solutions
(overview of revtagging solutions, from clicktracking to the mw logging table to an ad-hoc key value store)