Talk:Article feedback/Public Policy Pilot/Workgroup

This page is a place for you to tell the Wikimedia Tech team what issues you encounter when using the Article feedback experimental tool during its trial period on Public Policy articles. See also the Frequently asked questions.

Please help us by leaving a one-off comment below, or join the workgroup if you'd like to be further involved down the road.

We welcome your ideas, but please focus on the issues rather than on possible solutions.



 * Archives:
 * September 24, 2010 – October 11, 2010

On Stale and Expired Ratings
Hi guys!

(Apologies in advance if this is unclear. It's a difficult topic, and I recognize that my ability to communicate to those who are non-native English speakers may be lacking. I will gladly attempt to be clearer if need arises.)

I'd like to start a conversation about a couple specific issues involving reader feedback. I have my own ideas about how this should work (and I'm gonna tell you them) but I want to hear everyone else's ideas as well. This concerns the idea of "stale" and "expired" ratings (the tool does not currently handle "expired" ratings). There are two vectors to this conversation, and while they can be addressed independently they must be understood holistically:


 * 1) The user experience of Fresh, Stale, and Expired ratings, and
 * 2) The mechanism of calculation of those three states.

A bit of background that may be re-treading some ground: Wikipedia articles are moving targets. Since they are dynamic, a user can only ever rate an article according to its state when it was rated. The article's quality could easily be improved (hopefully) or worsened in a short timeframe after the ratings were applied. This means that, in reality, a user's ratings are truly only valid for a specific version of the article.

Obviously, this creates a large gap between the user's expectation of the tool's behavior and what I am going to call "rating accuracy." As the article moves forward in time (gets edited), the accuracy of the user's rating decreases. However, the user will (forever) expect that the ratings they applied to the article will remain - that we will remember that they rated the article.

To combat this, we introduced the idea of stale ratings. A user's ratings become "stale" when the article has been modified sufficiently that we believe that the ratings are reflective of an invalid or old version of the article. When a user's ratings are "stale", they are prompted to re-evaluate the article on its current merits.

The aggregate rating for an article, by necessity, includes "stale" ratings (if we did not include stale ratings in our calculations, the article would have values of "0" in all categories whenever it gets edited, and the function [and value] of the tool would be entirely negated). To this end, we have chosen to use a form of "moving average" (though that term is imprecise).

To me, a "moving average" implies that the aggregate is calculated purely on a time-based scale. Since many (most) articles have infrequent edits, basing the "freshness" of a rating on its age is inaccurate. Instead, we must base our averages on article edit activity, which is time-irrelevant.

For phase 2, we intend to introduce the concept of expired ratings. These are ratings that were applied to the article at one point but, over time, have ceased to be relevant (the article has been modified too much). These ratings are still stored, and the user will see that the ratings exist, but they will not be included in the aggregate calculation. Here is an easy-to-understand example of why this is important:


 * 1) On August 1st, 2011, Joe reads the article on Pokemon, and gives it a 1 in "accuracy." He leaves the page.
 * 2) Between August 1, 2011 and August 10, 2011, the article undergoes some revision. Joe's ratings are now "stale", but still used in the calculation of the aggregate.
 * 3) On August 11, 2011, the article is completely rewritten by an expert. Joe's ratings are now effectively useless: he rated an article that, for all intents and purposes, never existed.

Clearly, Joe's ratings should be thrown out at this time. They are no longer relevant. However, Joe, as a user, wants to know that he did, at one point in time, rate the article. To this end, we are required to surface to Joe that yes, he did rate the article, but no, his ratings are no longer considered accurate.

So, we have four states for a user's ratings:


 * 1) Never Rated.  The user has never rated the article.
 * 2) Fresh.  The user's ratings have been placed on the current (or close to current) revision of the article.  These ratings are used in the aggregate calculation.
 * 3) Stale. The user's ratings were placed on a revision of the article which approximates the current revision, but should be revisited.   These ratings are still used in the aggregate calculation.
 * 4) Expired. The user's ratings were placed on a revision of the article which no longer represents the article in its current form. These ratings are not used in the aggregate calculation.
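As a rough illustration of the four states above, here is a minimal Python sketch of how they could feed the aggregate. The names `RatingState`, `COUNTED` and `aggregate` are hypothetical and are not taken from the tool's code; the only rule carried over from the list is that Fresh and Stale ratings count and Expired ones do not.

```python
from enum import Enum
from statistics import mean


class RatingState(Enum):
    NEVER_RATED = "never rated"
    FRESH = "fresh"
    STALE = "stale"
    EXPIRED = "expired"


# Per the list above: fresh and stale ratings count, expired ones do not.
COUNTED = {RatingState.FRESH, RatingState.STALE}


def aggregate(ratings):
    """Average the rating values that still count.

    `ratings` is an iterable of (value, RatingState) pairs for one category.
    Returns None when nothing counts, rather than reporting a misleading 0.
    """
    counted = [value for value, state in ratings if state in COUNTED]
    return mean(counted) if counted else None


# Joe's expired 1 is still stored, but it no longer drags the aggregate down.
example = [
    (1, RatingState.EXPIRED),
    (4, RatingState.FRESH),
    (2, RatingState.STALE),
]
print(aggregate(example))  # 3
```

Returning `None` for "nothing counted" is a design choice of this sketch; it keeps an unrated article distinguishable from one that was rated badly.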

As with all things Wikipedia, there is no easy, clear-cut answer to what the term "stale" should mean. Currently, with revision 1 of the Article Feedback Tool, a user's rating is considered "stale" when there have been 5 or more edits to the article in question. Up until then, the ratings are fresh. This formula was chosen for expedience: it was easy to code and understand, and we believe that it solved the immediate need (which was to prevent moving averages from disappearing on articles that do not get traffic). I do not feel this is the correct formula; we can do better.

The number of edits to an article is not a valuable metric to determine article change. If I were to add five categories to an article in five edits, the article's content would not change but our mechanism for stale detection would fire.

So that brings us to the following questions (and I'll give my stabs at the answers below):


 * 1) How should we indicate to a user that their ratings have expired, and
 * 2) What should our formula be for determining Fresh, Stale, and Expired?

The answer to the first, in my opinion, is very similar to the way that we express stale ratings. A different color with specific language. This is actually, in my opinion, the easiest of the problems.

The answer to the second, however, is the stickiest wicket. My proposal for them is this:


 * A user's ratings are considered "stale" when:
   * The article has achieved 10 revisions or
   * The article has been modified +/- 20% of its size or
   * The article has achieved 5 revisions and +/- 15% of its size


 * A user's ratings are considered "expired" when:
   * The article has achieved 30 revisions or
   * The article has been modified +/- 35% of its size or
   * The article has achieved 15 revisions and +/- 20% of its size

Those seem like big numbers (and I think they are) but we cannot think about articles in terms of the largest and most trafficked pages. We have to look at the average article and its type of traffic. These numbers (and the formula used) can be modified based on the traffic and/or size of a given page but I think we should start with a baseline and work from there.
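To make the proposal easier to poke at, here is a minimal Python sketch of it next to the current 5-edit rule. The function names are hypothetical, and two readings are assumptions rather than anything settled in this thread: "achieved N revisions" is taken to mean revisions since the user rated, and "+/- N% of its size" is taken to mean the net size change relative to the rated revision.

```python
def is_stale_current_rule(revisions_since_rating: int) -> bool:
    """Current behaviour as described: stale once 5 or more edits have landed."""
    return revisions_since_rating >= 5


def classify_proposed(revisions_since_rating: int,
                      size_at_rating: int,
                      current_size: int) -> str:
    """Return "fresh", "stale" or "expired" under the proposed thresholds."""
    if size_at_rating <= 0:
        raise ValueError("size_at_rating must be positive")

    # Assumption: size change is the net difference in article size,
    # as a percentage of the size of the revision that was rated.
    change_pct = 100 * abs(current_size - size_at_rating) / size_at_rating
    revs = revisions_since_rating

    # Check "expired" first, since anything expired also meets a stale threshold.
    if revs >= 30 or change_pct >= 35 or (revs >= 15 and change_pct >= 20):
        return "expired"
    if revs >= 10 or change_pct >= 20 or (revs >= 5 and change_pct >= 15):
        return "stale"
    return "fresh"


# Five category-only edits barely change the size of the article.
print(is_stale_current_rule(5))               # True
print(classify_proposed(5, 10_000, 10_150))   # fresh
# Two edits that cut the article by 40%.
print(classify_proposed(2, 10_000, 6_000))    # expired
```

The thresholds are hard-coded here only to mirror the baseline numbers; per-article tuning by traffic or size would mean passing them in as parameters.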

Please feel free to tear my ideas apart. I'd love to hear dissenting opinions.

(There are also other factors to take into consideration about this, such as "when do averages get calculated", but these are performance questions, and can be dealt with when they come into scope.)

Thoughts appreciated. -Jorm (WMF) 06:57, 9 October 2010 (UTC)


 * Hi Brandon! I think that all-inclusive historical data plus the current data would be a better solution: only two states, Fresh and Expired (plus Never Rated, of course), which are more easily manageable. About the second question, I was thinking of something similar. Time and number of edits are not good parameters. Some articles are edited rarely, while others are edited continuously. The first thing that comes to my mind is to use a scale ratio [X / total number of edits]. If the result is above a predefined value, then the state of a user's ratings is changed. Some FAs (Featured Articles) (and some GAs, As, and Bs too) are edited continuously, but it is a sequence of an edit and a revert, an edit and a revert, and so on. The end result is that the article has changed very, very little in its form, or not at all. Is there a possibility of adding the class value of a page to the other parameters? I think it could help. The article's size would work fine (vandalism and blanking of a page excluded). But please read this when you have the time. I wrote it yesterday and there are other suggestions and thoughts. I hope it will help. Isn't it better to move this discussion to the talk page? Sorry, but I'm new to MediaWiki. Cheers. –p joe f (talk • contribs) 14:52, 9 October 2010 (UTC)
 * I think that there is a clear need for an "expired" state. Without one, the aggregate ratings will trend towards the center, whether they deserve to be there or not.  Consider the following scenario:
 * An article is found to be in extremely poor condition and starts with several low ratings. For ease of the scenario, we'll only address 1 metric - say, Readability. It gets 10 ratings:  one 3, four values of 2, and five values of 1, resulting in an aggregate rating of 1.6.
 * Several editors see this and bring the article into a project. They begin improving it.  During this time, it gets ten more ratings:  two values of 2, six values of 3, and two values of 4 (so now it has 5 x 1, 6 x 2, 7 x 3, 2 x 4).  Now the article's aggregate value is 2.3.
 * After a while, the article is dramatically improved - perhaps even reaching Featured Article status. It gets an additional 10 ratings:  four values of 4, and six values of 5 (so now it has 5 x 1, 6 x 2, 7 x 3, 6 x 4, 6 x 5).  Now the article's aggregate value is only about 3.1 - and it's unlikely that it will get much better because the early low values will always pull down the average.  It will never be able to reach a perfect 5, even if everyone rates it that way from now on.
 * Obviously, the opposite can occur: well written articles can decline in quality.  One of the values of this system could be that it serves as a warning system for editors, and retaining clearly outdated ratings in the formula works against that utility.
 * I think that retaining ratings that are clearly outdated greatly reduces the accuracy of the tool.-Jorm (WMF) 19:33, 10 October 2010 (UTC)
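For anyone who wants to check the arithmetic in the scenario above, here is a short Python sketch. The three batches are exactly the ratings listed; which batches would count as expired after the rewrite is an assumption made only to show the effect.

```python
from statistics import mean

# Ratings from the scenario, one list per phase of the article's life.
batch_1 = [3] + [2] * 4 + [1] * 5       # article in poor condition
batch_2 = [2] * 2 + [3] * 6 + [4] * 2   # while editors are improving it
batch_3 = [4] * 4 + [5] * 6             # after the dramatic improvement

# Every rating kept forever (no "expired" state).
after_phase_1 = mean(batch_1)
after_phase_2 = mean(batch_1 + batch_2)
after_phase_3 = mean(batch_1 + batch_2 + batch_3)
print(round(after_phase_1, 2))  # 1.6
print(round(after_phase_2, 2))  # 2.3
print(round(after_phase_3, 2))  # 3.07

# Assumption: the first batch has expired by the end of the scenario.
without_batch_1 = mean(batch_2 + batch_3)
print(round(without_batch_1, 2))  # 3.8

# Assumption: only the most recent batch still counts.
latest_only = mean(batch_3)
print(round(latest_only, 2))  # 4.6
```

With nothing expired, the thirty ratings average out near the middle of the scale even though the latest ten average 4.6.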

Influence on account creation & editing
I've been trying to wrap my head around some of the metrics that Howie was interested in, in particular:

 * 1) percentage of raters who create an account within a given period after the rating
 * 2) percentage of raters who edit an article within a given period after the rating

In both cases, I'm still looking for an appropriate baseline, i.e. what to compare it to. For 1., the ideal baseline I can think of would be the number of readers who created an account in the same period after they read the page (but didn't rate it). For 2., it would be the number of readers who edited in the same period after they read the page (but didn't rate it). For both cases, it seems to me we're pretty much stuck, because I don't think we have access to that kind of information. Am I mistaken? guillom 22:40, 19 October 2010 (UTC)


 * Having a baseline for these comparisons would be great, but you're right -- we don't have access to that kind of information (or if we do, it would be news to me). Part of this exercise is to use these numbers to establish a baseline for future work.  In other words, given our current implementation, we can expect x% of users to edit an article after rating.  Then if we change the interface (e.g., after rating submission, include a message asking users to edit the article), we can determine how much the change affects the likelihood of users editing after they rate an article.


 * The user behavior question that I'd like to shed light on is "how effectively do ratings serve as an on-ramp for other forms of involvement." For example,  a percentage of users that edit an article after rating could suggest that ratings are a good on-ramp for editing.  I understand we don't have a baseline for comparison -- some users may have edited anyway, even if the ratings weren't there.  But I do think having this data will help push the discussion forward. Howief 18:24, 20 October 2010 (UTC)
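To pin down what the "x% of users edit after rating" number could mean, here is a minimal Python sketch. Everything about the data shape is an assumption: the function name, the `(user, timestamp)` event tuples and the `window` parameter are hypothetical, and the length of the window is left to the caller because it has not been fixed in this thread.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def percent_of_raters_who_edit(ratings, edits, window):
    """Percentage of distinct raters with an edit within `window` after a rating.

    `ratings` and `edits` are lists of (user, timestamp) tuples (assumed shape).
    Returns None when there are no raters at all.
    """
    rated_at = defaultdict(list)
    for user, when in ratings:
        rated_at[user].append(when)
    if not rated_at:
        return None

    converted = set()
    for user, edited in edits:
        # Only edits that follow one of the user's ratings, inside the window, count.
        if any(timedelta(0) <= edited - rated <= window
               for rated in rated_at.get(user, ())):
            converted.add(user)

    return 100 * len(converted) / len(rated_at)


ratings = [
    ("alice", datetime(2010, 10, 20, 10, 0)),
    ("bob", datetime(2010, 10, 20, 11, 0)),
    ("carol", datetime(2010, 10, 20, 12, 0)),
    ("dave", datetime(2010, 10, 20, 13, 0)),
]
edits = [
    ("alice", datetime(2010, 10, 21, 9, 0)),   # one day after rating: counts
    ("bob", datetime(2010, 10, 19, 9, 0)),     # before rating: does not count
    ("carol", datetime(2010, 11, 30, 9, 0)),   # long after rating: outside window
]
# The 7-day window is an arbitrary value chosen for this example only.
print(percent_of_raters_who_edit(ratings, edits, timedelta(days=7)))  # 25.0
```

The account-creation metric would have the same shape, with account-creation events in place of `edits`. As noted above, the result is a number without a baseline, since some of these users may have edited anyway.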

I see this is a student project, but
I see this is a student project, so I will try to be really, really nice. Who started this? Who got consensus at VP or a centralized discussion or similar to splash self-references all over encyclopedia articles.. would the answer be "no one"? Ling.Nut 07:21, 21 October 2010 (UTC)


 * I'm not sure what you're talking about. Are you referring to the Article feedback tool, or the Public Policy Initiative? You mention "self-references" so I'm inclined to think it's the latter. If I'm right, I believe outreach:Public Policy Initiative should answer most of your questions, or at least direct you to people who can answer them. guillom 16:59, 21 October 2010 (UTC)


 * Ling.Nut, this isn't a student project. The Public Policy Initiative (more precisely, the articles in WikiProject United States Public Policy) are essentially serving as the testing ground for software to get article quality ratings from readers.  Later iterations will have other features, aimed at getting readers even more involved, e.g., by making suggestions for article improvement.  With this pilot, we also have the chance to compare, in similar terms, how readers evaluate articles versus how Wikipedians and non-Wikipedian topic experts do, through the research effort described at w:WP:USPP/ASSESS.
 * As to your question of who started this, the short answer is, the Wikimedia Foundation. It's been driven mainly by the goals articulated in the strategic planning process, to find better ways to evaluate article quality and to involve readers.  So this pilot is a step in that direction.  But it's more of a conversation-starter and starting point for further evolution than a finished feature.--Sageross 17:56, 21 October 2010 (UTC)


 * I'm interpreting "self-reference" as something that references the fact that an article is an article, somewhat breaking the 4th wall. I think that would fall under the category of w:WP:NOT. Devourer09 03:27, 22 October 2010 (UTC)
 * Whereas all the maintenance and warning banners at the top of articles don't reference the fact that an article is an (unsourced, biased, what-have-you) article? guillom 04:46, 22 October 2010 (UTC)
 * Pokemon defence? 81.178.144.143
 * Could you try to be even more cryptic? guillom 05:23, 23 October 2010 (UTC)

e.g., by making suggestions for article improvement. - we have a thing called a talk page for that. maybe we just need to change the name from "discussion" to "Suggest improvements"? 81.178.144.143

Systemic bias
US public policy? Does this have a universal audience? Why not pick 2000 articles at random. 81.178.144.143
 * Public Policy Initiative, that's why. Nifboy 08:00, 25 October 2010 (UTC)

Zero
It's possible to add a vote of zero on up to three criteria, by simply not selecting them. However, it is not possible to un-check the first option to set a zero-vote. Nor does it seem possible to vote zero stars for all 4 thingys. Chzz 16:54, 23 October 2010 (UTC)

What to do if the category is added to another article?
An editor added this category to the article Philippines which is not a US public policy article that's part of the pilot. Should it be removed or should it be left alone? Five people have already used it at the time of this post. Lambanog 17:39, 23 October 2010 (UTC)


 * I removed it. For now, the scope of the pilot is controlled by the Public Policy Initiative. guillom 00:57, 24 October 2010 (UTC)

Template misbehaving ?
This was spotted on the English-language Wikipedia using the "classic" skin. Shown here in Firefox, but it looks more or less the same in IE8. Most likely a problem with the template(s). - Topbanana 19:24, 3 November 2010 (UTC)



Additional Pages
In order to get a better understanding of how ratings reflect substantial changes in articles, we'd like to put the article feedback on approximately 50 pages outside the public policy initiative. Please take a look here and provide comments. Howief 00:52, 10 November 2010 (UTC)
 * Cool. Until now, I've been removing the tool from articles outside public policy.  Should I stop that, and allow it to spread a bit into other articles just based on people seeing the code and wanting it for articles that interest them?  There were about 30 articles that had it that I removed it from, which had been added in the last 3 weeks or so.
 * On a related note, will it be easy to separate out the public policy ratings from the others, so as to retain a dataset for analyzing the public policy initiative?--Sross (Public Policy) 15:04, 10 November 2010 (UTC)