Article feedback/Public Policy Pilot/Workgroup

The goal of this workgroup is to assess the experimental Article feedback tool during its trial period on Public Policy articles. See the blog post and the Frequently asked questions.

Sign-up
Please sign up below if you'd like to join the workgroup. You don't need to be a developer; we're mostly looking for users. Thanks!


 * « Saper // @talk »
 * Fetchcomms (Not a developer, but working with the Public Policy Initiative.)
 * Eloquence
 * guillom
 * DGG 23:30, 29 September 2010 (UTC) (also not a developer, but interested in improving article quality)
 * ARoth (Public Policy Initiative) 00:11, 30 September 2010 (UTC)
 * Howief 00:08, 1 October 2010 (UTC)
 * Jorm (WMF) 01:44, 1 October 2010 (UTC)
 * Alolitas 22:32, 6 October 2010 (UTC)
 * pjoef ~ (I'm a programmer, but I have never written or debugged programs for Wikipedia or other Wikimedia projects; I have neither the rights nor the time. I'm also participating in the Public Policy Initiative project. If you need help, have any questions, or want to bring something to my attention, please contact me on my talk page on the English Wikipedia, with an interwiki link to the page where you want my attention. I'm always busy, but I hope I can help.) –pjoef (talk • contribs) 08:49, 8 October 2010 (UTC)

Draft action plan

 * Write down questions we want to answer
 * Collect pointers & scattered pieces of information:
  * Talk page
  * Responses to the announcement on foundation-l
  * Survey data
 * Summarize feedback

Questions
"Users" means people who use the tool, whether they're readers or editors.


 * What motivates users to use the tool?
  * Possible sources of answers:
   * Data from the survey


 * How useful are the ratings for readers and editors?
  * Possible sources of answers:
   * Data from the survey
   * Voluntary general qualitative feedback
   * Voluntary qualitative feedback provided by editors of public policy articles
   * Additional qualitative research such as interviews / UX study (deferred for now)


 * How understandable is the Likert scale used for the feature?
  * Possible sources of answers:
   * Voluntary general qualitative feedback
   * Data from the Phase 2 survey (deferred for now)


 * Does the feature have an impact on account creation and editing?
  * Possible sources of answers:
   * Raw data from the DB can provide the following metrics (a sketch of one such computation follows this list):
    * percentage of raters who create an account within a month after rating (baseline: ?)
    * percentage of raters who edit an article within a month after rating (baseline: ?)
    * percentage of raters who create an account within a month of rating and then edit an article within a month of account creation (baseline: percentage of new accounts whose owners edit an article within a month of account creation)
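
As a sketch of how one of these metrics might be computed from the raw data (Python, purely illustrative; the event structures, and the idea of tracking anonymous raters by a token, are assumptions rather than the actual schema):

 from datetime import timedelta

 MONTH = timedelta(days=30)

 def pct_raters_creating_accounts(ratings, account_creations):
     """Percentage of raters who create an account within a month of rating.

     ratings: list of (rater_token, rated_at) pairs; rater_token is a
         hypothetical identifier for an anonymous rater.
     account_creations: dict mapping rater_token -> account creation time.
     """
     if not ratings:
         return 0.0
     converted = sum(
         1 for token, rated_at in ratings
         if token in account_creations
         and timedelta(0) <= account_creations[token] - rated_at <= MONTH
     )
     return 100.0 * converted / len(ratings)

The baselines would come from applying the same one-month window to a comparison population.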


 * How does this reader rating compare to other rating systems (the existing Wikipedia article assessment system, expert opinion, the Public Policy Initiative system)?
  * Possible sources of answers:
   * outreachwiki:Public Policy Initiative Evaluation and Research


 * More generally, how can the tool be improved so it better helps users achieve their goals?
  * Possible sources of answers:
   * Data from the survey
   * Voluntary general qualitative feedback

Bugs

 * Possible errors in calculations

Feature requests

 * Compatibility with Lupin's pop-ups
 * Explicit reference to the Public Policy Initiative to provide context
 * Charts of the evolution of ratings over time
 * Graceful degradation for users without JavaScript
 * A more visible placement than the bottom of articles (maybe in the sidebar), but not too intrusive (smaller?)
 * A more intuitive UI; the current one may scare users away
 * Ability to query articles based on their ratings
 * Ability to leave comments
 * Change colors to avoid resembling Communist stars
 * Use tags like TED's

Project requests

 * Run pilots on other wikis
 * Use better metrics

On Stale and Expired Ratings
Hi guys!

(Apologies in advance if this is unclear. It's a difficult topic, and I recognize that my ability to communicate with non-native English speakers may be lacking. I will gladly attempt to be clearer if the need arises.)

I'd like to start a conversation about a couple of specific issues involving reader feedback. I have my own ideas about how this should work (and I'm gonna tell you them), but I want to hear everyone else's ideas as well. This concerns the idea of "stale" and "expired" ratings (the tool does not currently handle "expired" ratings). There are two vectors to this conversation, and while they can be addressed independently, they must be understood holistically:


 * 1) The user experience of Fresh, Stale, and Expired ratings, and
 * 2) The mechanism of calculation of those three states.

A bit of background that may be retreading some ground: Wikipedia articles are moving targets. Since they are dynamic, a user can only ever rate an article according to its state at the time of rating. The article's quality could easily be improved (hopefully) or worsened in a short timeframe after the ratings were applied. This means that, in reality, a user's ratings are truly only valid for a specific version of the article.

Obviously, this creates a large gap between the user's expectation of the tool's behavior and what I am going to call "rating accuracy." As the article moves forward in time (gets edited), the accuracy of the user's rating decreases. However, the user will (forever) expect that the ratings they applied to the article will remain - that we will remember that they rated the article.

To combat this, we introduced the idea of stale ratings. A user's ratings become "stale" when the article has been modified sufficiently that we believe that the ratings are reflective of an invalid or old version of the article. When a user's ratings are "stale", they are prompted to re-evaluate the article on its current merits.

The aggregate rating for an article, by necessity, includes "stale" ratings (if we did not include stale ratings in our calculations, the article would have values of "0" in all categories whenever it gets edited, and the function [and value] of the tool would be entirely negated). To this end, we have chosen to use a form of "moving average" (though that term is imprecise).

To me, a "moving average" implies that the aggregate is calculated purely on a time-based scale. Since many (most) articles have infrequent edits, basing the "freshness" of a rating on its age is inaccurate. Instead, we must base our averages on article edit activity, which is independent of time.

For phase 2, we intend to introduce the concept of expired ratings. These are ratings that were applied to the article at one point but, over time, have ceased to be relevant (the article has been modified too much). These ratings are still stored, and the user will see that the ratings exist, but they will not be included in the aggregate calculation. An easy-to-understand example of why this is important is as follows:


 * 1) On August 1, 2011, Joe reads the article on Pokemon, and gives it a 1 in "accuracy." He leaves the page.
 * 2) Between August 1 and August 10, 2011, the article undergoes some revision. Joe's ratings are now "stale", but still used in the calculation of the aggregate.
 * 3) On August 11, 2011, the article is completely rewritten by an expert. Joe's ratings are now effectively useless: he rated an article that, for all intents and purposes, never existed.

Clearly, Joe's ratings should be thrown out at this time. They are no longer relevant. However, Joe, as a user, wants to know that he did, at one point in time, rate the article. To this end, we are required to surface to Joe that yes, he did rate the article, but no, his ratings are no longer considered accurate.

So, we have four states for a user's ratings:


 * 1) Never Rated.  The user has never rated the article.
 * 2) Fresh.  The user's ratings have been placed on the current (or close to current) revision of the article.  These ratings are used in the aggregate calculation.
 * 3) Stale. The user's ratings were placed on a revision of the article which approximates the current revision, but should be revisited.   These ratings are still used in the aggregate calculation.
 * 4) Expired. The user's ratings were placed on a revision of the article which no longer represents the article in its current form. These ratings are not used in the aggregate calculation.
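
A minimal sketch of these four states and the aggregation rule, in illustrative Python (the names are mine, not the tool's actual code):

 from enum import Enum

 class RatingState(Enum):
     NEVER_RATED = 0  # the user has never rated the article
     FRESH = 1        # counted in the aggregate calculation
     STALE = 2        # still counted, but the user is prompted to re-rate
     EXPIRED = 3      # stored and surfaced to the user, excluded from the aggregate

 # Only fresh and stale ratings contribute to the aggregate.
 AGGREGATED_STATES = {RatingState.FRESH, RatingState.STALE}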

As with all things Wikipedia, there is no easy, clear-cut answer to what the term "stale" should mean. Currently, with revision 1 of the Article Feedback Tool, a user's rating is considered "stale" when there have been 5 or more edits to the article in question. Up until then, the ratings are fresh. This formula was chosen for expedience: it was easy to code and understand, and we believe that it solved the immediate need (which was to prevent moving averages from disappearing on articles that do not get traffic). I do not feel this is the correct formula; we can do better.
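
In code, that revision-1 rule amounts to a single comparison (a sketch; edits_since_rating is an assumed input, not the extension's real variable):

 def is_stale_v1(edits_since_rating):
     # Revision 1 of the Article Feedback Tool: a rating goes stale
     # once there have been 5 or more edits to the article.
     return edits_since_rating >= 5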

The number of edits to an article is not a valuable metric for determining article change. If I were to add five categories to an article in five edits, the article's content would not change, but our mechanism for stale detection would fire.

So that brings us to the following questions (and I'll give my stabs at the answers below):


 * 1) How should we indicate to a user that their ratings have expired, and
 * 2) What should our formula be for determining Fresh, Stale, and Expired?

The answer to the first, in my opinion, is very similar to the way that we express stale ratings: a different color with specific language. This is actually, in my opinion, the easiest of the problems.

The answer to the second, however, is the stickiest wicket. My proposal is this (a rough code sketch follows the two lists below):


 * A user's ratings are considered "stale" when:
 * The article has achieved 10 revisions or
 * The article has been modified +/- 20% of its size or
 * The article has achieved 5 revisions and +/- 15% of its size


 * A user's ratings are considered "expired" when:
 * The article has achieved 30 revisions or
 * The article has been modified +/- 35% of its size or
 * The article has achieved 15 revisions and +/- 20% of its size
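
Here is that rough sketch of the combined rules (illustrative Python; I treat the size change as an absolute percentage, and the exact inputs are assumptions):

 def classify_rating(revisions_since, size_change_pct):
     """Classify a rating under the proposed thresholds.

     revisions_since: revisions to the article since the rating was placed.
     size_change_pct: percentage change in article size since the rating
         (sign ignored, so +/- 20% is treated as 20).
     """
     change = abs(size_change_pct)
     # Expired: 30 revisions, or +/- 35% size, or 15 revisions and +/- 20%.
     if (revisions_since >= 30
             or change >= 35
             or (revisions_since >= 15 and change >= 20)):
         return "expired"
     # Stale: 10 revisions, or +/- 20% size, or 5 revisions and +/- 15%.
     if (revisions_since >= 10
             or change >= 20
             or (revisions_since >= 5 and change >= 15)):
         return "stale"
     return "fresh"

For example, classify_rating(7, 16.0) returns "stale" (5+ revisions combined with a 15%+ size change), while classify_rating(16, 22.0) returns "expired".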

Those seem like big numbers (and I think they are), but we cannot think about articles in terms of the largest and most trafficked pages. We have to look at the average article and its type of traffic. These numbers (and the formula used) can be modified based on the traffic and/or size of a given page, but I think we should start with a baseline and work from there.

Please feel free to tear my ideas apart. I'd love to hear dissenting opinions.

(There are also other factors to take into consideration about this, such as "when do averages get calculated", but these are performance questions, and can be dealt with when they come into scope.)

Thoughts appreciated. -Jorm (WMF) 06:57, 9 October 2010 (UTC)


 * Hi Brandon! I think that keeping all historical data plus the current data would be a better solution. Only two states, Fresh and Expired (plus Never Rated, of course), which would be more easily manageable. About the second question, I was thinking of something similar. Time and number of edits are not good parameters. Some articles are edited rarely, while others are edited continuously. The first thing that comes to my mind is to use a scale ratio [X / total number of edits]. If the result is above a predefined value, then the state of a user's ratings is changed. Some FAs (Featured Articles) (and some GAs, As, and Bs too) are edited continuously, but it is a sequence of an edit and a revert, an edit and a revert, and so on; the end result is that the article has changed very little in form, or not at all. Is there the possibility of adding the class value of a page to the other parameters? I think it could help. The articles' size would work fine (vandalism and blanking of a page excluded). But please read this when you have the time; I wrote it yesterday, and there are other suggestions and thoughts there. I hope it will help. Isn't it better to move this discussion to the talk page? Sorry, but I'm new to MediaWiki. Cheers. –pjoef (talk • contribs) 14:52, 9 October 2010 (UTC)
 * I think that there is a clear need for an "expired" state. Without one, the aggregate ratings will trend towards the center, whether they deserve to be there or not.  Consider the following scenario:
 * An article is found to be in extremely poor condition and starts with several low ratings. For ease of the scenario, we'll only address one metric - say, Readability. It gets 10 ratings: one value of 3, four values of 2, and five values of 1, resulting in an aggregate rating of 1.6.
 * Several editors see this and bring the article into a project. They begin improving it. During this time, it gets ten more ratings: two values of 2, six values of 3, and two values of 4 (so now it has 5 x 1, 6 x 2, 7 x 3, 2 x 4). Now the article's aggregate value is 2.3.
 * After a while, the article is dramatically improved - perhaps even reaching Featured Article status. It gets an additional 10 ratings: four values of 4, and six values of 5 (so now it has 5 x 1, 6 x 2, 7 x 3, 6 x 4, 6 x 5). Now the article's aggregate value is still only about 3.1 - and it's unlikely to get much better, because the early low values will always pull down the average. It will never reach a perfect 5, even if everyone rates it that way. (The arithmetic is checked in the sketch at the end of this thread.)
 * Obviously, the opposite can occur: well-written articles can decline in quality. One of the values of this system could be that it serves as a warning system for editors, and retaining clearly outdated ratings in the formula works against that utility.
 * I think that retaining ratings that are clearly outdated greatly reduces the accuracy of the tool. -Jorm (WMF) 19:33, 10 October 2010 (UTC)
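
The arithmetic in the scenario above can be checked in a few lines (a sketch; a simple unweighted mean stands in for the tool's actual aggregate calculation):

 def aggregate(ratings):
     # Simple unweighted mean, standing in for the tool's aggregate calculation.
     return sum(ratings) / len(ratings)

 early = [3] + [2] * 4 + [1] * 5                # aggregate: 1.6
 middle = early + [2] * 2 + [3] * 6 + [4] * 2   # aggregate: 2.3
 late = middle + [4] * 4 + [5] * 6              # aggregate: ~3.07, pulled down by the early 1s and 2s
 print(aggregate(early), aggregate(middle), aggregate(late))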