Talk:Article feedback/Public Policy Pilot/Workgroup

This page is a place for you to tell the Wikimedia Tech team what issues you encounter when using the Article feedback experimental tool during its trial period on Public Policy articles. See also the Frequently asked questions.

Please help us and leave a comment below and be done with it, or join the workgroup if you'd like to be further involved down the road.

We welcome your ideas, but please focus on the issues rather than on possible solutions.



 * Archives:
 * September 24, 2010 – October 11, 2010

On Stale and Expired Ratings
Hi guys!

(Apologies in advance if this is unclear. It's a difficult topic, and I recognize that my ability to communicate to those who are non-native English speakers may be lacking. I will gladly attempt to be clearer if need arises.)

I'd like to start a conversation about a couple specific issues involving reader feedback. I have my own ideas about how this should work (and I'm gonna tell you them) but I want to hear everyone else's ideas as well. This concerns the idea of "stale" and "expired" ratings (the tool does not currently handle "expired" ratings). There are two vectors to this conversation, and while they can be addressed independently they must be understood holistically:


 * 1) The user experience of Fresh, Stale, and Expired ratings, and
 * 2) The mechanism of calculation of those three states.

A bit of background that may be re-treading some ground: Wikipedia articles are moving targets. Since they are dynamic, a user can only ever rate an article according to its state when it was rated. The article's quality could easily be improved (hopefully) or worsened in a short timeframe after the ratings were applied. This means that, in reality, a user's ratings are truly only valid for a specific version of the article.

Obviously, this creates a large gap between the user's expectation of the tool's behavior and what I am going to call "rating accuracy." As the article moves forward in time (gets edited), the accuracy of the user's rating decreases. However, the user will (forever) expect that the ratings they applied to the article will remain - that we will remember that they rated the article.

To combat this, we introduced the idea of stale ratings. A user's ratings become "stale" when the article has been modified sufficiently that we believe that the ratings are reflective of an invalid or old version of the article. When a user's ratings are "stale", they are prompted to re-evaluate the article on its current merits.

The aggregate rating for an article, by necessity, includes "stale" ratings (if we did not include stale ratings in our calculations, the article would have values of "0" in all categories whenever it gets edited, and the function [and value] of the tool would be entirely negated). To this end, we have chosen to use a form of "moving average" (though that term is imprecise).

To me, a "moving average" implies that the aggregate is calculated purely on a time-based scale. Since many (most) articles have infrequent edits, basing the "freshness" of a rating on its age is inaccurate. Instead, we must base our averages on article edit activity, which is time-irrelevant.

For phase 2, we intend to introduce the concept of expired ratings. These are ratings that were applied to the article at one point but, over time, have ceased to be relevant (the article has been modified too much). These ratings are still stored, and the user will see that the ratings exist, but they will not be included in the aggregate calculation. An easy-to-understand example as to why this is important is described thusly:


 * 1) On August 1st, 2011, Joe reads the article on Pokemon, and gives it a 1 in "accuracy." He leaves the page.
 * 2) Between August 1, 2011 and August 10, 2011, the article undergoes some revision. Joe's ratings are now "stale", but still used in the calculation of the aggregate.
 * 3) On August 11, 2011, the article is completely rewritten by an expert. Joe's ratings are now effectively useless: he rated an article that, for all intents and purposes, never existed.

Clearly, Joe's ratings should be thrown out at this time. They are no longer relevant. However, Joe, as a user, wants to know that he did, at one point in time, rate the article. To this end, we are required to surface to Joe that yes, he did rate the article, but no, his ratings are no longer considered accurate.

So, we have four states for a user's ratings:


 * 1) Never Rated.  The user has never rated the article.
 * 2) Fresh.  The user's ratings have been placed on the current (or close to current) revision of the article.  These ratings are used in the aggregate calculation.
 * 3) Stale. The user's ratings were placed on a revision of the article which approximates the current revision, but should be revisited.   These ratings are still used in the aggregate calculation.
 * 4) Expired. The user's ratings were placed on a revision of the article which no longer represents the article in its current form. These ratings are not used in the aggregate calculation.
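The four states and their effect on the aggregate can be sketched roughly as below. This is an illustration only, not the tool's code: the names `RatingState`, `COUNTED_STATES`, and `aggregate` are made up for the example, and it assumes the aggregate is a plain mean.

```python
from enum import Enum


class RatingState(Enum):
    NEVER_RATED = "never rated"
    FRESH = "fresh"
    STALE = "stale"
    EXPIRED = "expired"


# Per the list above, only Fresh and Stale ratings feed the aggregate.
COUNTED_STATES = {RatingState.FRESH, RatingState.STALE}


def aggregate(ratings):
    """Mean of the ratings that still count, or None if none do.

    `ratings` is an iterable of (value, state) pairs (hypothetical shape).
    """
    counted = [value for value, state in ratings if state in COUNTED_STATES]
    return sum(counted) / len(counted) if counted else None


# Joe's expired 1 is still stored, but it no longer drags the mean down.
print(aggregate([
    (1, RatingState.EXPIRED),
    (4, RatingState.FRESH),
    (5, RatingState.STALE),
]))  # 4.5
```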

As with all things Wikipedia, there is no easy, clear-cut answer to what the term "stale" should mean. Currently, with revision 1 of the Article Feedback Tool, a user's rating is considered "stale" when there have been 5 or more edits to the article in question. Up until then, the ratings are fresh. This formula was chosen for expedience: it was easy to code and understand, and we believe that it solved the immediate need (which was to prevent moving averages from disappearing on articles that do not get traffic). I do not feel this is the correct formula; we can do better.

The number of edits to an article is not a valuable metric to determine article change. If I were to add five categories to an article in five edits, the article's content would not change but our mechanism for stale detection would fire.
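To make the category example concrete, here is a small sketch of the edit-count rule as described above (stale at 5 or more edits). The function name and the byte sizes are invented for illustration; only the threshold of 5 comes from the description.

```python
STALE_EDIT_THRESHOLD = 5  # "5 or more edits", as described above


def is_stale_by_edit_count(edits_since_rating: int) -> bool:
    """Edit-count-only staleness check (a sketch, not the deployed code)."""
    return edits_since_rating >= STALE_EDIT_THRESHOLD


# Hypothetical article sizes in bytes: the rated revision, followed by
# five edits that each add a single category link.
sizes = [10_000, 10_020, 10_040, 10_060, 10_080, 10_100]
edits_since_rating = len(sizes) - 1
relative_change = abs(sizes[-1] - sizes[0]) / sizes[0]

print(is_stale_by_edit_count(edits_since_rating))  # True
print(f"{relative_change:.1%}")  # 1.0%
```

The rating is flagged as stale even though the article grew by only about one percent.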

So that brings us to the following questions (and I'll give my stabs at the answers below):


 * 1) How should we indicate to a user that their ratings have expired, and
 * 2) What should our formula be for determining Fresh, Stale, and Expired?

The answer to the first, in my opinion, is very similar to the way that we express stale ratings. A different color with specific language. This is actually, in my opinion, the easiest of the problems.

The answer to the second, however, is the stickiest wicket. My proposal for them is this:


 * A user's ratings are considered "stale" when:
 * The article has achieved 10 revisions or
 * The article has been modified +/- 20% of its size or
 * The article has achieved 5 revisions and +/- 15% of its size


 * A user's ratings are considered "expired" when:
 * The article has achieved 30 revisions or
 * The article has been modified +/- 35% of its size or
 * The article has achieved 15 revisions and +/- 20% of its size
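One possible reading of these thresholds, as a sketch: it assumes "+/- N% of its size" means the absolute change in article size since the rated revision, relative to the size at rating time, and that "expired" is checked before "stale". The function and parameter names are hypothetical.

```python
def classify(revisions_since_rating: int, size_at_rating: int, current_size: int) -> str:
    """Return "fresh", "stale", or "expired" under the proposed thresholds."""
    if size_at_rating <= 0:
        raise ValueError("size_at_rating must be positive")
    # Assumption: size change is measured relative to the rated revision.
    change = abs(current_size - size_at_rating) / size_at_rating

    if (revisions_since_rating >= 30
            or change >= 0.35
            or (revisions_since_rating >= 15 and change >= 0.20)):
        return "expired"
    if (revisions_since_rating >= 10
            or change >= 0.20
            or (revisions_since_rating >= 5 and change >= 0.15)):
        return "stale"
    return "fresh"


print(classify(5, 10_000, 10_100))   # fresh: five small edits, ~1% change
print(classify(5, 10_000, 11_600))   # stale: 5 revisions and 16% change
print(classify(15, 10_000, 12_500))  # expired: 15 revisions and 25% change
```

Under this reading, five category-only edits leave a rating fresh, because neither the revision count alone nor the size change crosses a threshold.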

Those seem like big numbers (and I think they are) but we cannot think about articles in terms of the largest and most trafficked pages. We have to look at the average article and its type of traffic. These numbers (and the formula used) can be modified based on the traffic and/or size of a given page but I think we should start with a baseline and work from there.

Please feel free to tear my ideas apart. I'd love to hear dissenting opinions.

(There are also other factors to take into consideration about this, such as "when do averages get calculated", but these are performance questions, and can be dealt with when they come into scope.)

Thoughts appreciated. -Jorm (WMF) 06:57, 9 October 2010 (UTC)


 * Hi Brandon! I think that all-inclusive historical data plus the current data would be a better solution: only two states, Fresh and Expired (plus Never Rated, of course), which are more easily manageable. About the second question, I was thinking of something similar. Time and number of edits are not good parameters. Some articles are edited rarely, while others are edited continuously. The first thing that comes to my mind is to use a scale ratio [X / total number of edits]. If the result is above a predefined value, then the state of a user's ratings is changed. Some FAs (Featured Articles) (and some GAs, As, and Bs too) are edited continuously, but it is a sequence of an edit and a revert, an edit and a revert, and so on. The end result is that the article has changed very, very little in its form, or not at all. Is there the possibility of adding the class value of a page to the other parameters? I think it could help. The article's size would work fine (vandalism and blanking of a page excluded). But please read this when you have the time. I wrote it yesterday and there are other suggestions and thoughts. I hope it will help. Isn't it better to move this discussion to the talk page? Sorry, but I'm new to MediaWiki. Cheers. –p joe f (talk • contribs) 14:52, 9 October 2010 (UTC)
 * I think that there is a clear need for an "expired" state. Without one, the aggregate ratings will trend towards the center, whether they deserve to be there or not.  Consider the following scenario:
 * An article is found to be in extremely poor condition and starts with several low ratings. For ease of the scenario, we'll only address 1 metric - say, Readability. It gets 10 ratings:  one 3, four values of 2, and five values of 1, resulting in an aggregate rating of 1.6.
 * Several editors see this and bring the article into a project. They begin improving it.  During this time, it gets ten more ratings:  two values of 2, six values of 3, and two values of 4 (so now it has 5 x 1, 6 x 2, 7 x 3, 2 x 4).  Now the article's aggregate value is 2.3.
 * After a while, the article is dramatically improved - perhaps even reaching Featured Article status. It gets an additional 10 ratings:  four values of 4, and six values of 5 (so now it has 5 x 1, 6 x 2, 7 x 3, 6 x 4, 6 x 5).  Now the article's aggregate value is only just over 3.0 (about 3.07) - and it's unlikely that it will get much better because the early low values will always pull down the average.  It will never be able to reach a perfect 5, even if everyone rates it that way.
 * Obviously, the opposite can occur: well written articles can decline in quality.  One of the values of this system could be that it serves as a warning system for editors, and retaining clearly outdated ratings in the formula works against that utility.
 * I think that retaining ratings that are clearly outdated greatly reduces the accuracy of the tool.-Jorm (WMF) 19:33, 10 October 2010 (UTC)
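The scenario's arithmetic can be checked directly from the rating counts given. This is a quick sketch; the batch names are made up, and a plain arithmetic mean is assumed.

```python
def mean(values):
    return sum(values) / len(values)


batch_1 = [3] + [2] * 4 + [1] * 5      # article in poor condition
batch_2 = [2] * 2 + [3] * 6 + [4] * 2  # while it is being improved
batch_3 = [4] * 4 + [5] * 6            # after the improvements

print(mean(batch_1))                                # 1.6
print(mean(batch_1 + batch_2))                      # 2.3
print(round(mean(batch_1 + batch_2 + batch_3), 2))  # 3.07
print(mean(batch_3))                                # 4.6
```

If the first two batches were treated as expired, the aggregate would be 4.6 rather than roughly 3.07.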

Influence on account creation & editing
I've been trying to wrap my head around some of the metrics that Howie was interested in, in particular:

 * 1) percentage of raters who create an account within a after the rating
 * 2) percentage of raters who edit an article within a after the rating

In both cases, I'm still looking for an appropriate baseline, i.e. what to compare it to. For 1., the ideal baseline I can think of would be the number of readers who created an account in the after they read the page (but didn't rate it). For 2., it would be the number of readers who edited in the after they read the page (but didn't rate it). For both cases, it seems to me we're pretty much stuck, because I don't think we have access to that kind of information. Am I mistaken? guillom 22:40, 19 October 2010 (UTC)


 * Having a baseline for these comparisons would be great, but you're right -- we don't have access to that kind of information (or if we do, it would be news to me). Part of this exercise is to use these numbers to establish a baseline for future work.  In other words, given our current implementation, we can expect x% of users to edit an article after rating.  Then if we change the interface (e.g., after rating submission, include a message asking users to edit the article), we can determine how much the change affects the likelihood of users editing after they rate an article.


 * The user behavior question that I'd like to shed light on is "how effectively do ratings serve as an on-ramp for other forms of involvement." For example,  a percentage of users that edit an article after rating could suggest that ratings are a good on-ramp for editing.  I understand we don't have a baseline for comparison -- some users may have edited anyway, even if the ratings weren't there.  But I do think having this data will help push the discussion forward. Howief 18:24, 20 October 2010 (UTC)
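As a rough illustration of the "x% of users edit after rating" figure, the percentage could be computed along these lines. Everything here is hypothetical: the data shapes, the function name, and the `window` parameter, whose length is not specified in this discussion.

```python
from datetime import datetime, timedelta


def percent_of_raters_who_edit(rating_times, edit_times, window):
    """Percentage of raters with at least one edit within `window` after rating.

    rating_times: {user: datetime of the rating}        (hypothetical shape)
    edit_times:   {user: [datetime of each edit, ...]}  (hypothetical shape)
    """
    if not rating_times:
        return None
    followed_up = sum(
        1
        for user, rated_at in rating_times.items()
        if any(rated_at < t <= rated_at + window for t in edit_times.get(user, []))
    )
    return 100 * followed_up / len(rating_times)


ratings = {
    "alice": datetime(2010, 10, 1, 12, 0),
    "bob": datetime(2010, 10, 1, 13, 0),
}
edits = {"alice": [datetime(2010, 10, 1, 12, 30)]}

# The one-day window is an arbitrary choice for the example.
print(percent_of_raters_who_edit(ratings, edits, timedelta(days=1)))  # 50.0
```

Running the same calculation before and after an interface change would give the before/after comparison described above.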