Talk:Article feedback/Public Policy Pilot/Workgroup

__NEWSECTIONLINK__  This page is a place for you to tell the Wikimedia Tech team what issues you encounter when using the Article feedback experimental tool during its trial period on Public Policy articles. See also the Frequently asked questions.

Please help us: leave a comment below and be done with it, or join the workgroup if you'd like to be further involved down the road.

We welcome your ideas, but please focus on the issues rather than on possible solutions.


Other projects
What about testing the extension on smaller wikis? Some Wiktionaries use a JavaScript-based tool to gather feedback, I suppose they would be interested. --Nemo 06:55, 15 September 2010 (UTC)
 * Ok, now I'll add the link: wikt:en:Wiktionary:Feedback (see also interwiki); and don't forget strategy:Special:RatedPages (there are lots of comments on the wiki about it). I don't understand why you're developing a new feature with a pilot on (a small part of) Wikipedia while there are several other projects that eagerly need such a feature, and in fact are already using something similar (but much more crappy). --Nemo 07:01, 24 September 2010 (UTC)
 * Hey Federico!
 * I totally missed this given the flurry I've been trapped in over the past couple days and for that I apologize.
 * The answer is this: part of the reason we are doing this on such a small article subset is to actually ensure that the technology works and see immediate problems. While I don't see any moral or political reasons not to enable it in other places, the extension is slated for a series of rather rapid, iterative changes (hopefully improvements). So my advice is to wait a bit; I'm about to start on design for phase 2 (some of the feedback we've already gotten dovetails with what was expected, and we're going ahead and implementing it).
 * In the meantime, I'd love it if you joined the workgroup and gave some ideas. You're a smart guy and can see around corners a lot.
 * I know that's not the answer you were looking for but I hope that helps.--Jorm (WMF) 19:16, 24 September 2010 (UTC)

Assorted comments
This feature has a lot of potential, but the current implementation sucks.

A bit of background
First, we need to establish that rating articles/entries is not a new idea. The English Wiktionary, for example, has been doing this for years. You can look at wikt:User:Conrad.Irwin/feedback.js for the code that the English Wiktionary uses. A few key points about the English Wiktionary's implementation:
 * because it's implemented in JavaScript, it only works for users who load and run JavaScript on this domain;
 * it only displays for anonymous users;
 * it displays in the sidebar;
 * it appears to only work in the (now antiquated) Monobook skin currently;
 * it uses a number of simple metrics for articles with a simple one-click interface; the options to choose from are: [this entry is] "good," "bad," "messy," "mistake in definition," "confusing," "could not find the word I want," "incomplete," "entry has inaccurate information," "definition is too complicated," and finally "if you have time, leave us a note."

Current ArticleAssessment implementation
The current implementation of ArticleAssessment has a few niceties:
 * it's implemented in PHP with a proper database backend;
 * it has a nice UI for rating an article (the stars are pretty).

But the main issues I see with it are:
 * it's enormous; the entire "view results" box shouldn't be shown at all until the user clicks something;
 * the metrics are terrible;
 * it's located at the bottom of lengthy articles, making it unlikely that anyone will see it; those who do see it will likely not want to participate because it looks complicated (as opposed to the one-click system that the English Wiktionary uses).

Room for improvement
My suggestions:
 * look at how a site like ted.com uses user feedback; the Wikipedias have hundreds of awesome articles that nobody knows about and they aren't sorted by anything useful currently; this tool could be adapted to create useful metrics, e.g., [this article is] informative, interesting, sloppy, boring, unintelligible, confusing (math articles, anyone?), biased
 * once you have ratings from users, you can generate all sorts of nifty tools; you can have the most interesting articles listed in a dynamic report; or you can have "select a random informative, well-sourced history article"; this is actually something that would be useful;
 * I understand and appreciate the desire to be unobtrusive, but the rating system needs to be more visible somehow; the sidebar is a good place to look at (esp. if you can reasonably collapse some of the interwiki links on long articles); it might also be possible to put an unobtrusive icon near the top of the page (the central focal point for nearly any article); mashable.com has been using a blue box at the top of articles, which is a bit much, I think;
 * further simplify the interface, but allow for more in-depth comments if the user wants to provide them.

Hope that helps, --MZMcBride 22:54, 24 September 2010 (UTC)

Response to MZMcBride's Comments
A couple of responses, so that the design rationale is better understood:

First, we decided specifically against allowing user comments with ratings. My opinion was (and I still hold it) that such comments will be either a) of little value or b) better as comments on the corresponding Discussion page. The options thus left us with two directions:


 * 1) They aren't stored anywhere except some random table. They would quickly become outdated or useless. They would require additional development to allow them to be visible, likely resulting in yet another tab ("View Rating Comments" or somesuch);
 * 2) We inject the comments as a new item on the Discussion page (either standard Talk or LiquidThreads). They are visible to be sure but since they are going to be left by users who do not normally engage in Discussion pages, any responses will be either ignored/unseen or become confusing to the user. (Users who understand Discussion pages will already know to leave comments there).

Further, comments in such a form are likely (at this stage) to be about the tool and not the article.

I agree that, from the viewpoint of a Wiktionary, comments with ratings would be valuable, but Wiktionary entries do not spawn the same types of discussions that Wikipedia entries have, and this tool is targeted at encyclopedic content.

Second, the placement of the ratings box. The placement of the box at the bottom of the article is not by chance; it is very specifically by design. Placing it above, within the article space (or even in the side bar) does not help to ensure that the article has actually been read. If the article is 7 screens long and the tool is located on the first screen (say, below the language links), then users will be encouraged to rate the article before they have read it completely.

I agree that the current placement is sub-optimal; I'd prefer it to arrive before the reference list. However, we decided for ease-of-impact to place it as low as possible.

We are not mashable, nor are we Netflix or even Yelp. They have entirely different motivations for their ratings tools (they boil down to generation of clicks, which generates ad revenue [with the exception of Netflix, whose rating system is interestingly outside of scope]).

The exact unobtrusiveness of the tool is specific as well, for a couple reasons:


 * 1) Community Acceptance. It was early on determined that a "loud" ratings box would be received negatively by the community. My initial design had the View Ratings box and the Rate this Page box completely decoupled, with the View box at the top of the article (which is where I expect it will eventually live, should the tool be accepted).  We decided that this was too much for the community to accept in one dollop, so I decided to visually connect them (I personally see the tool as two "tools" with discrete purposes - purposes that, for all intents, are at odds with one another).  The vulnerability of the system to information cascade and anchoring is why results are hidden at the outset.
 * 2) It's Not the Point. The point of the article is the article, not the ratings box.  Sure, the ratings are another aspect of an article, but they are not the article itself (just as the History is not, nor the Discussion - even though I believe those are just as important).  I personally view the ratings histogram to be another vector (hah) within the History.

Regarding the display of the "View Results" pane at the outset: it must be obvious to the user that they can see the results of the article. A primary goal was to make the tool as minimal to use as possible (and a planned design is even more minimal than this one).

I cannot speak to the choice of metrics except to say:


 * 1) They are configurable. We can change them at any time (pending translations, of course)
 * 2) They are an experiment. I personally believe that we can approximate the expected values of three of them using analytics; the outlier is "Neutrality," which we may find is entirely useless on a metric scale (but may still be useful as a type of "honeypot" for reader venom).  One of the answers we hope to get out of the workgroup is a better set of metrics. (The workgroup goes beyond metrics as well: I want to get better formulas for "stale" and "expired", for instance.)

This tool, too, is effectively implemented in JavaScript. The design decision behind that was one of performance: we want to reduce calls to the server database as much as possible. There are two questions we have to ask each time:


 * 1) Should the tool be displayed? (handled in PHP)
 * 2) Does the tool need to display existing ratings? (this is the big one, and handled via JavaScript)

As a result, simply injecting the HTML as the page gets rendered would have been easier but would also have been more of a burden. A full-scale roll-out would clearly be implemented in PHP, but for now it's done client-side.

As far as graphs and histograms go, that's on the plan. We did not have sufficient design or development time to include them (though in the early design comps there are indications as to where they should go).

There's a lot of stuff that is "in the plan" that didn't make it into this revision, by the way. There is a roadmap, and I'm currently working to get a framework written for it. For example, the concept of "expired" ratings isn't in the current version, and I'm keen to get it into the design. Also, the idea of "self-identified experts" - we'd like to track that. Even if we don't apply weight to self-identified experts, it makes for an interesting line in the histogram. There are even more aspects that lie further in The Deep (ways to tie this into discussion systems, or viewing user rating histories and the like).

This became a book. Sorry. --Jorm (WMF) 05:33, 25 September 2010 (UTC)

Thanks for a 95% constructive and helpful comment, Mz. ;-) I'll add to Brandon's note that the design of the system reflects its primary intentions. These are explained in Article feedback/Public Policy Pilot/FAQs, but specifically, the quantitative assessment of change-over-time across defined quality dimensions is one of the objectives for this deployment. That's a lot easier when you're dealing with a four variable / five point scale where the vast majority of ratings submit complete data for all variables, as opposed to a tagging system with forced prioritization, where your objective is to highlight predominant characteristics (you're surfacing that a video is "inspiring", but you end up with very little information about how many people think it's "long-winded").

That is not to say that we didn't discuss tagging systems -- we did, and I thank you for bringing up TED; I had only seen the output side of it, and your post inspired me to look at the input side. It's a very cool system, and I agree that a system like this could be very useful for precisely the kind of purposes you describe: surfacing articles with specific characteristics. Another direction to explore is the system employed by Newstrust, which offers a similar initial rating system to ours, and expands to allow for additional input for those who would like to provide it.--Eloquence 07:45, 29 September 2010 (UTC)

Workgroup open
Who is welcomed to join the workgroup? Thorncrag 05:47, 27 September 2010 (UTC)
 * Hi. I already answered your question in the blog comments. guillom 13:50, 27 September 2010 (UTC)
 * Oh, sorry; I did not see that the last time I checked.  Thorncrag 20:15, 27 September 2010 (UTC)

Colour of the Stars
Good morning, I've come to the test from an article in the German Signpost. Not sure whether this is the right place for feedback, but here goes: I do not like the colour of the rating stars for three - partly cross-cultural - reasons: Would you consider changing the colour of the stars? To a dark green or a blue possibly? --Minderbinder 05:51, 29 September 2010 (UTC)
 * In Central Europe, the Red Star is perceived as a symbol of Communism in general, and the Russian Army in particular. Neither of these are very friendly connotations for large sections of the user population here. Now I realize that your star is a bit more bulky than a pentagram, but a five-pointed red star is a five-pointed red star.
 * When teachers grade term papers and the like over here, red is used to mark errors. The more red in your paper, the worse it is. This runs exactly counter to the meaning implied here.
 * Red means stop, green means go. Again: red flags as markers of quality are trouble signs, not good things.
 * colour significances vary. In the US, red is currently the symbol of the (right-of-center) Republican party  DGG 01:09, 30 September 2010 (UTC)
 * How about using something completely neutral such as Cscr-featured.png (the featured article star) or a tick sign? --Kudpung 23:44, 9 December 2010 (UTC)

Feature Ideas

 * Include some context so that readers know why they are seeing the tool (e.g., a "What's this?" link with an explanation that the tool is part of the public policy project). See original post. Howief 20:08, 22 September 2010 (UTC)


 * Provide the ability to generate a graph of how the ratings have changed over time. The point of the software is to see how good an article is at any given time, but it would be super useful to actually see if the article was deemed to have improved over time. -- Witty lama.
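
As a rough illustration of the ratings-over-time idea, here is a minimal Python sketch that groups ratings by ISO week and averages them, giving the series such a graph would plot. The function name `weekly_averages`, the `(date, score)` input shape, and the sample data are all assumptions made for illustration; nothing here reflects how the extension actually stores ratings.

```python
from collections import defaultdict
from datetime import date


def weekly_averages(ratings):
    """Group (day, score) pairs by ISO (year, week) and return the mean score per week."""
    buckets = defaultdict(list)
    for day, score in ratings:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(score)
    # Sort by (year, week) so the result reads chronologically.
    return {week: sum(scores) / len(scores) for week, scores in sorted(buckets.items())}


# Made-up ratings for one article and one metric (hypothetical data).
sample = [
    (date(2010, 9, 20), 2),
    (date(2010, 9, 22), 3),
    (date(2010, 9, 28), 4),
    (date(2010, 9, 30), 5),
]
trend = weekly_averages(sample)
for week, mean in trend.items():
    print(week, mean)
```

A rising series would suggest that readers judged the article to have improved; in practice each point would probably also need the number of ratings behind it, since a weekly mean built from one or two ratings says very little.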

Shimgray's comments
See http://www.generalist.org.uk/blog/2010/article-ratings/. guillom 03:20, 1 October 2010 (UTC)

Comments from en-wiki

 * These are copied over from an ill-conceived discussion page on English Wikipedia. Sorry for the confusion, Nifboy and Peregrine Fisher. -Sage 


 * The one GA, Yucca_Mountain_nuclear_waste_repository, has mediocre ratings. Whatever that means. - Peregrine Fisher (talk) 20:21, 7 October 2010 (UTC)
 * Looking at the early data, are registered users and IPs rating the same articles or is there a difference in which articles IPs choose to rate versus registered users? If so, would this explain the discrepancy between registered/IP ratings (if e.g. IPs are rating mostly good articles and registered users are rating mostly stubs)? Nifboy (talk) 21:04, 7 October 2010 (UTC)

Comments from Fetchcomms
I haven't seen anything too objectionable, and I'm not familiar with the technical side, but the one thing that bothers me is its placement. Can we move the feedback box after the categories? I think it makes the page flow better a bit. Also, is there a way to turn on the feedback tool for articles in a more "secure" way than just a category? I don't know what (maybe a MediaWiki-space listing of pages that need the feedback tool, or some special page to configure it, or something else), but that might be more useful from keeping people from inadvertently removing the category. Anyway, it seems to have worked fine for me so far. Fetchcomms 02:20, 8 October 2010 (UTC)

Suggestions
I'd like to make a few suggestions: Smallman12q 00:27, 13 December 2010 (UTC)
 * That the "Your Feedback" section be linked to a "Your Feedback" which is listed first in the "Interaction" sidebar.
 * The feedback should have a "Graphics" option for rating the usefulness of images and a "Layout" option for rating how well the graphics have been laid out, and how well they have been positioned. The "Graphics" option could also have a subsection for Caption quality.
 * There should be a "Grammer" option for rating the grammer of the article.
 * There should be a comment textbox at the bottom of the form.
 * The forms should be collapsible.


 * If Smallman12q's suggestions are implemented, I recommend spelling Grammar correctly. EncMstr 09:36, 29 December 2010 (UTC)

Feedback on feedback
Just saw this for the first time and it works fairly well. At first I couldn't figure out which end (left or right) was the high end for the ratings, but once I put the pointer over the circles and the stars showed up it made sense. The results box should probably be hidden until after the input is submitted.

I wonder to what uses you'll put the data. If you answer "Just general feedback and then editors will develop more uses as they go along," then you are doing this wrong. Data collection should be designed with specific uses or questions in mind. As set up now the survey will probably only answer questions like "Do readers like this article?" without much information on which readers do or don't like the article or why they do or don't like the article. In other words, the scores on "Well sourced," "complete," etc. will always come back equal, or perhaps in a fixed pattern, e.g. readability always lower. You might also be interested in who likes or dislikes the article - e.g. by age, sex, or participation on Wikipedia, or why they were reading the article (because of a news story, school assignment, work, ....)

And do remember that you don't have to have the same survey for everybody - different boxes can pop up each time the article is shown - so that readers don't have to answer say 20 questions. Readers might only answer 3 questions per survey, but the overall readership might be answering 20 questions in total. 68.45.215.63 13:15, 19 January 2011 (UTC)
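
The rotating-survey idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pool of 20 placeholder questions, the figure of 3 questions per view, and the helper name `pick_questions` are hypothetical and not part of the tool.

```python
import random


def pick_questions(pool, per_view, rng):
    """Return a random subset of survey questions (no repeats) for one page view."""
    return rng.sample(pool, per_view)


# 20 placeholder questions; real wording would come from the workgroup.
pool = [f"question {i}" for i in range(1, 21)]
rng = random.Random(0)  # seeded only so the sketch is reproducible

seen = set()
for _ in range(200):  # simulate 200 page views
    shown = pick_questions(pool, 3, rng)
    seen.update(shown)

print(len(seen), "of", len(pool), "questions were shown at least once")
```

Each reader answers only 3 questions, yet across many page views the whole pool gets covered. The trade-off is that each individual question collects fewer responses than it would in a fixed survey.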

Article Feedback does not work on secure pages
Using the HTTPS proxy, the Article Feedback tool does not work correctly. For example, shows "An error has occurred. Please try again later" and neither the Show results link nor the Feedback link works. Dtrebbien 18:16, 22 January 2011 (UTC)

Connecting the Likert scale to reality
A major problem with the rating system IMO is that it's too subjective. What's three stars for me might be one star for you. It's also possible that over time people will start to game the ratings as they do at other popular sites, jamming it with 0s or 5s to have the most impact on certain aspects of the article (eg: this article is completely biased because I disagree with it). It's also hard to deal with newer articles. A short article gets a low rating for completeness... but is it neutral? It could be 5 stars for neutrality because what's in there is neutral. It could be a 1 star for neutrality because it doesn't fairly represent all sides of the debate. Same thing with sources. Is a new article written from two sources well sourced because everything is sourced, or poorly sourced because it needs far more expansion?

The answer is NOT better jargon. People will say we need to use more common sense language, to appeal to the average reader. Others will say we need to use the Wikipedia definitions like Verifiability and NPOV to ensure maximum accuracy. Both miss the point. You still open the system up to gaming, and wildly conflicting interpretations of what "3 stars" means.

A better system would be to make room for a literal description of what the ratings mean. It wouldn't have to be cluttered. In fact, the scale would be easier to read if it were all presented down the left hand side. This would make room for verbal descriptions on the right hand side:
 * Well-Sourced: (x)(x)(x)   : Inconsistent use of sources: some parts sourced, some not.
 * Neutral: (x)(x)(x)(x)       : Neutral in tone, but issues in how the debate is presented.
 * Complete: (x)(x)(x)       : Another section of detail is needed.
 * Readable: (x)(x)        : Difficult to read.

As you mouse over the different ratings from 0 to 5, the literal descriptions would automatically change. For example, mousing over "0" for well-sourced would say "Mostly unsourced and dubious statements", whereas "5" for well-sourced would say "all statements appear supported by reliable sources". Clicking on the star would "lock in" the literal description (eg: 4 stars says "mostly sourced, but some might not be reliable"). You could always click on a new star rating to lock in a different rating and description.
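
The mapping from star values to literal descriptions could be held in a simple lookup table. The Python sketch below uses only the four "Well-Sourced" descriptions quoted in this comment; the table name `DESCRIPTIONS`, the function `describe`, and the placeholder text for values nobody has written yet are assumptions for illustration.

```python
# Literal descriptions per metric and star value. Only the values quoted in
# the proposal are filled in; everything else falls back to a placeholder.
DESCRIPTIONS = {
    "Well-Sourced": {
        0: "Mostly unsourced and dubious statements",
        3: "Inconsistent use of sources: some parts sourced, some not",
        4: "Mostly sourced, but some might not be reliable",
        5: "All statements appear supported by reliable sources",
    },
}

PLACEHOLDER = "(description not yet written)"  # hypothetical fallback text


def describe(metric, stars):
    """Return the literal description to show when hovering over or locking in a rating."""
    if not 0 <= stars <= 5:
        raise ValueError("stars must be between 0 and 5")
    return DESCRIPTIONS.get(metric, {}).get(stars, PLACEHOLDER)


print(describe("Well-Sourced", 4))
print(describe("Neutral", 2))
```

Keeping the text in a table rather than in the interface code would also let the descriptions be translated or revised without touching the rating logic.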

The benefit of this new design would flow both ways. Readers would make fewer blatant errors with rating, and it would be easier to find ratings made in bad faith. Editors would gain more consistent feedback with a more clear interpretation. And readers would start to understand the standards of a good article much better... instead of just going off their own feelings of whether they liked it or agreed with it.

I hope you reconsider the current design because I think it presents several problems if applied on a large scale. Bigwikifan 18:09, 22 February 2011 (UTC)

useless crap. Please. Go write some articles.

 * Please. Please. Please. If you do not feel competent to write articles, please go somewhere else to while away your online time; do not clog (and crappify) Wikipedia with self-reference crap. Wikipedia is an encyclopedia, not Face Book. If you want an encyclopedia, you do not want this initiative (or whatever it is). GlitchCraft 14:38, 6 March 2011 (UTC)
 * I'm sure none of us felt competent to write articles when we first started editing on Wikipedia. Everyone needs to start somewhere. The system can also be (more) useful for highlighting articles in need of serious work that we are not already aware of through the WikiProject article assessment. If grammar rating were implemented, it would be highly useful in alerting users who like to clean up bad grammar. Jolly Janner 01:32, 11 March 2011 (UTC)

"trustworthy" and "objective" will not accumulate meaningful results.
people are going to rate the articles in these two categories based on how dissonant or consonant it is with their pre-established beliefs. quite irrespective of the fidelity or neutrality or balance of the information in the article. that is just human nature. the results will be near 50-50 for more controversial articles, and near 100% for more "boring" articles. and that is really what these will end up measuring, how controversial vs. how boring the subject matter is. and that information really is of no use to anyone trying to improve the articles.174.102.197.120 12:59, 18 March 2011 (UTC)


 * oh, and i mention this because i think we should replace these two categories with ones that will give more meaningful results. 174.102.197.120 13:14, 25 March 2011 (UTC)

Any chance we can get it applied across Wikiproject Poland?
Any chance we can have the tag applied across Wikiproject Poland pages? Is there a bot-generated list of articles tagged (specifically one by WikiProject) to aid in management of feedback and method for tracking improvement? Ajh1492 12:48, 25 March 2011 (UTC)

Shakespeare Authorship Question?
This has been added to the SAQ article, which does not meet the stated requirements for inclusion (a lot of edits anticipated in the coming months or an undeveloped article). In addition, it's impossible to tell who added it from the edit history. Can anyone tell me anything about this? Tom Reedy 20:15, 25 March 2011 (UTC)

This feedback device will just prove to be a magnet for POV pushers in this particular case. Paul Barlow 20:28, 25 March 2011 (UTC)


 * Update: I've removed it from the SAQ page. The selection criteria appear to be completely random. --GuillaumeTell 21:44, 25 March 2011 (UTC)
 * Please see the following thread for a summary of the Article Feedback tool, its goals, and the current trial. For research purposes, we've selected 3,000 articles to collect data on. The 3,000 were determined by selecting a random set of articles within 3 different article length-bands, as the early research showed that articles of different lengths showed different distributions of ratings. Howief 22:51, 25 March 2011 (UTC)
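
For readers wondering why the selection can look arbitrary on any single page, the sampling described above can be sketched roughly as follows. This is a Python illustration under loud assumptions: the band boundaries, the equal number of articles per band, and the function name `sample_by_length_band` are all hypothetical, since the thread does not state them.

```python
import random


def sample_by_length_band(articles, bands, per_band, rng):
    """Pick up to per_band random titles from each half-open length band [low, high)."""
    chosen = []
    for low, high in bands:
        in_band = sorted(title for title, size in articles.items() if low <= size < high)
        chosen.extend(rng.sample(in_band, min(per_band, len(in_band))))
    return chosen


# Toy data: 30 articles with lengths from 500 to 15,000 bytes (made up).
articles = {f"Article {i}": 500 * i for i in range(1, 31)}
# Hypothetical band boundaries; the real ones are not given in this thread.
bands = [(0, 5000), (5000, 10000), (10000, 10**9)]

picked = sample_by_length_band(articles, bands, 5, random.Random(42))
print(len(picked), "articles selected")
```

With 1,000 articles per band this would yield the 3,000 mentioned above, though whether the bands were sampled equally is an assumption. Because the draw within each band is random, an individual article's inclusion says nothing about its subject or edit activity.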