Article feedback/Research

November 2011
The new research plan for testing the impact of Article feedback (v.5) on user engagement is described on this page. From now on, all research reports on Article Feedback will be posted to Meta.

July 2011
We posted an update on call-to-action data as well as an overview of responses to the post-rating survey.

Call-to-action and expertise
On March 14, 2011, version 2 of the Article Feedback Tool (AFT) was released and applied to a sample of approximately 3,000 articles from the English Wikipedia. Throughout March and April 2011 we regularly analyzed the volume and quality of ratings collected via the AFT. We also started studying data related to calls to action and the self-identified expertise of raters, two of the new features introduced in v.2 of the Article Feedback Tool.

Decreased rating volume: effects of the expertise checkbox
In the transition between v.1 and v.2 we observed a significant drop (-46.7%) in the weekly volume of ratings and in the conversion rate (the number of ratings per impression) for a sample of 380 articles. These articles are not a random sample of Wikipedia articles (they all formed part of the Public Policy Initiative pilot implementation of AFT), but we selected them because they offer the longest series of longitudinal data we have to date for consistently studying how ratings evolve over time. We controlled for the number of visits these pages received on a weekly basis and ruled out a temporary or seasonal variation in visits as the cause of the decreasing number of ratings (the aggregate number of visits remained roughly constant throughout v.1 and v.2, with the sole exception of the winter season; Christmas is marked by a gray vertical line in the plots below).
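The two quantities discussed above, the conversion rate and its relative change between versions, can be illustrated with a minimal sketch. The function names and the weekly totals below are hypothetical and chosen only for illustration; they are not the figures from the AFT dataset.

```python
def conversion_rate(ratings: int, impressions: int) -> float:
    """Ratings per impression; 0.0 when the page had no impressions."""
    return ratings / impressions if impressions else 0.0


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed as a percentage."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is zero")
    return (after - before) / before * 100.0


# Hypothetical weekly totals (illustrative only, not AFT data).
rate_v1 = conversion_rate(ratings=1200, impressions=100_000)
rate_v2 = conversion_rate(ratings=900, impressions=100_000)
change_pct = relative_change(rate_v1, rate_v2)
print(f"conversion rate changed by {change_pct:.1f}%")
```

Because impressions are held equal in this example, the change in conversion rate mirrors the change in rating volume, which is the situation described above when visits stay roughly constant.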

We hypothesized that this drop might be due to the expertise checkbox introduced in v.2. To check this, on April 28, 2011 we introduced two different versions of AFT, one displaying the expertise checkbox and the other with the checkbox hidden, and randomly assigned visitors to one of these two buckets. We ran an A/B test on articles that had obtained at least one rater per bucket, which resulted in a sample of 7681 articles. We found no statistically significant difference in the number of ratings per article between the two conditions (expertise checkbox hidden vs. expertise checkbox displayed), based on the results of an unpaired t-test (one-tailed p > 0.1). As a result, we can rule out the expertise checkbox as a significant influence on the number of ratings submitted by readers.
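A comparison of this kind can be sketched as follows. This is a minimal illustration that assumes Welch's variant of the unpaired t-test; the report does not say which variant was used, and the per-article counts below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors of each sample mean (sample variance uses n - 1).
    se_a = variance(sample_a) / na
    se_b = variance(sample_b) / nb
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


# Hypothetical ratings-per-article counts for the two buckets (not AFT data).
checkbox_shown = [3, 5, 2, 4, 6, 3, 4]
checkbox_hidden = [4, 4, 3, 5, 5, 2, 5]
t_stat, dof = welch_t(checkbox_shown, checkbox_hidden)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

To obtain a p-value, the statistic would be compared against a t-distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test.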



Reader Feedback and Article Quality
Can readers meaningfully measure the quality of an article, e.g. how complete it is? This is a challenging question to answer because the base measures of quality can be difficult to quantify (e.g., how can article completeness be measured?). We’ve made some simplifying assumptions, and based on the data we’ve analyzed so far, we find that in some categories, reader feedback appears to be correlated with some objective measures of an article.

For example, in articles under 50kb, there appears to be a correlation between the Trustworthy and Complete ratings of an article and the article’s length. The longer the article is, the more Trustworthy and Complete it is considered by readers (analysis may be found here). In other categories, there does not appear to be much of a correlation between objective measures of the article and ratings.

At an aggregate level, there appears to be some alignment between reader assessments and community assessments of an article’s quality. Of the 25 most highly rated articles in the sample (average rating of 4.54), 10 are either Featured Articles or Good Articles (3 Featured and 7 Good). Of the 25 most poorly rated articles (average rating of 3.29), there is only one Good Article and no Featured Articles.



Feedback from Experts

We also provided a way for users to self-identify as knowledgeable about the topic. A set of checkboxes allows readers to indicate whether they are knowledgeable (in general) about a topic and, if so, the specific source of their knowledge. The main goal of this feature is to determine whether self-identified experts rate differently than users who did not self-identify as experts. While there are many more things we can do to verify expertise, self-identification is an easy first step toward understanding expert rating behavior.



Our preliminary results indicate that, overall, experts show a similar distribution of ratings to non-experts (see analysis here). But when individual articles are analyzed, a different pattern emerges. Users who claim general knowledge do not appear to rate substantially differently from non-experts, but users who claim specific knowledge from either their studies or their profession do appear to rate differently (see analysis here). Based on the limited data collected so far, it appears as though expert ratings are more diverse than non-expert ratings of the same article.

We have only scratched the surface of analyzing the correlation between ratings and actual changes in article quality. We hope that the Wikimedia community and the research community will use the rating data dumps, which we will make publicly available, to continue this analysis.

Ratings as a Way to Engage Readers
Invitations to participate

As part of the v2.0 release in March, we introduced some “calls to action”, or invitations to the reader to participate once they’ve submitted a rating. There are three different calls to action currently being tested:
 * 1) Create an account
 * 2) Edit the article
 * 3) Take a survey

Here is a summary of the results (The detailed click-through analysis may be found here):

The data show that 40% of users who are presented with the option to take a survey after completing their rating end up clicking through. And even though the call-to-action asks the user to complete a survey, some readers took the opportunity to provide feedback on the content of the article via the open text field. We observed something similar during the first phase of the feature, when there was a “Give us feedback about this feature” link.

Though the link specifically asked for feedback on the feature (as opposed to the content of the article), some readers provided rather detailed comments on the article content. While the comments field of the survey had its fair share of vandals and useless comments, there are clearly some readers who want to provide constructive feedback about the content of the article. The notion of these readers wanting to contribute to Wikipedia is reflected both in our user interviews as well as the survey results. Forty-four percent of survey respondents indicated that they rated because they hoped that their rating “would positively affect the development of the page” and 37% of respondents rated because they “wanted to contribute to Wikipedia.” These results show that an easy-to-use feedback tool is a promising way to engage these readers.



The “Edit” call-to-action received a 15% click-through rate. While lower than the 40% who clicked through to the survey, a 15% click-through rate is still significant, especially considering that these users probably had no intention of editing the article at the time they submitted the rating. This result suggests that a certain set of users, when presented with the option, would like to edit the article they just rated. More analysis, however, needs to be done on what users do after clicking on the call-to-action. The preliminary data indicate that approximately 17% of these users end up successfully completing an edit, though the absolute numbers are still very small and we need more observations to establish statistical significance. We also need a measure of the quality of the edits: we don’t know whether these edits are constructive, vandalism, or bail actions (the user clicks save just to get out of the screen).

We intend to experiment further with these and other calls to action in the future.

Volume of Ratings

The Article Feedback tool is currently on approximately 3,000 articles, less than 0.1% of the total number of articles on the English Wikipedia. Over the past 1.5 months, however, over 47,000 individual users have rated articles:



In comparison, the English Wikipedia has approximately 35,000-40,000 active editors each month. With this experimental deployment, we can see that the number of users willing to rate an article exceeds the number of users willing to edit an article by at least an order of magnitude. Not only does the feedback tool offer a way to engage more users, but some of these users may also end up editing, as the call-to-action data show.

Additional Findings

 * Three out of six raters in a small-scale user test did not complete their rating action, neglecting to press the "Submit" button. Recent revisions of the feature add a reminder to submit the rating.
 * Based on interviews with raters, in the second version of the feature, the category "readable" was changed to "well-written", "neutral" to "objective", and "well-sourced" to "trustworthy".
 * Raters who completed the survey call-to-action used the "other comments" section to express opinions about the tool itself, about the article they were rating, and about Wikipedia as a whole. Many of these responses would make very useful talk page contributions, and many respondents also seem likely to be the kinds of people who could be motivated to edit. However, there is also a significant percentage of useless/noise responses, highlighting the need for moderation or filtering to the extent that free-text comments are integrated with the tool.
 * Our user studies have highlighted that readers do not consider rating necessarily to be a form of "feedback". The tool does not currently use the term "feedback" in the user interface, and we may add feedback features such as free-text comments in future, so this does not have major implications at this time.

February 2011
A new series of analyses was performed in February 2011 based on a richer dataset and focusing on a broader set of research questions, in preparation for the launch of the Article Feedback Tool v.2.

December 20, 2010
This is a short update focusing on GFAJ-1 and its ratings since the Article Feedback Tool was applied on December 2. GFAJ-1 is an interesting case study because the article had only about a paragraph of information as of December 2, 2010. On December 2, NASA held a press conference announcing the discovery of arsenic-based life. After this announcement, the article was developed more fully. Brandon Harris was able to put the Article Feedback Tool on the page to see if the changes to the article would be reflected in the ratings. GFAJ-1 represents a natural experiment because we were able to put the Article Feedback Tool in place before the article changed substantially.
 * Here is the version of the article when the Article Feedback tool was applied on December 2.
 * Here is the version of the article as of December 20 (date of this analysis).
 * As can be seen from the diff, the article has undergone substantial change.

''Note: The analysis presented here is very cursory, done mainly by manipulating ratings data within a spreadsheet. Ideally we would map ratings to revisions, but doing so requires more intensive data manipulation. Time (a common thread that runs through both ratings and revisions) was therefore used as an approximation. More complex analysis is required to draw firm conclusions about the relationships between ratings and changes in the article.''

The changes in the GFAJ-1 article appear to be reflected in the ratings, though it is unclear how tightly coupled the two are. Here is a time-series of the "Complete" rating:



The article length clearly increases over time, and the trendline of the "Complete" rating slopes upward (please note that the trendline is a regression based on a moving average, not on the individual ratings). The ratings for Well-Sourced also slope upwards:



In the case of GFAJ-1, the Neutral and Readable ratings trend downwards:
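The trendline method noted above, a regression fitted to a moving average rather than to the individual ratings, can be sketched as follows. This is a minimal illustration under stated assumptions: the window size, the function names, and the rating series are hypothetical, and the original analysis was done in a spreadsheet whose exact settings are not given.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def least_squares_slope(ys):
    """Slope of the ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical "Complete" ratings in time order (not the GFAJ-1 data).
ratings = [2, 3, 2, 4, 3, 4, 5, 4, 5, 5]
smoothed = moving_average(ratings, window=3)
trend_slope = least_squares_slope(smoothed)
print(f"trend slope per step: {trend_slope:.3f}")
```

A positive slope corresponds to an upward trendline and a negative slope to a downward one. Smoothing first damps the noise of individual 1-to-5 ratings, at the cost of making the fitted line depend on the chosen window.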





Overview
Here is an update of the Article Feedback Data as of November 8, 2010. It is based on approximately 12,498 ratings submitted from Sep 22 - Nov 8. A running list of articles is maintained here, but please keep in mind the list is subject to change.

A quick summary of the points so far:
 * Ratings by Anonymous users outpace ratings by Registered users by 10x.
 * For many articles, the number of ratings from Registered users is not enough to provide meaningful information about article quality.


 * Ratings by Anonymous users skew high, with most anonymous users giving either a 4 or 5 rating across all dimensions. We intend to measure whether this skew persists over time (e.g., if an article is significantly improved, yet ratings from Anonymous users don't change noticeably).
 * Ratings by Registered users are both lower and show less of a skew compared to ratings by Anonymous users. This could suggest that Registered users are more critical and/or give higher quality ratings, though more data is needed to support this assertion.  We intend to measure how substantial changes in an article affect ratings from Registered users.
 * In its current form, the tool is not a good on-ramp for editing. In the next release of the feature, we will test interface messages to see if they have an effect on editing after a user rates (e.g., "Did you know you can edit this article?").

Comparing Anon Reviewers to Registered Reviewers
Anonymous users submit about 10 times as many ratings as Registered users do. Registered users continue to rate with a lower mean but a higher completion rate.

Here are the distributions for Anon and Registered users:



We continue to see a skew towards 4s and 5s from Anonymous users. Registered users show less of a skew towards high ratings than do Anonymous users.

Length of Articles and Ratings
The Public Policy Project includes articles at various stages of development. It includes short, stub-like articles such as 5 centimeter band and Executive Budget, but also longer articles such as United States Constitution and Don't ask, don't tell. We wanted to see whether the shorter, stub-like articles were rated differently from more developed articles, particularly along the Well-Sourced and Complete dimensions. We defined a shorter article as an article under 1.5kb in length.



Registered users have ratings for Well-Sourced and Complete that skew heavily towards 1s and 2s. Anonymous users do not show quite the same skew towards 1s and 2s.

Articles over 1.5kb in length show the following ratings distribution:



Ratings Volume by Article
While Registered users appear to show less of a skew in their ratings, the volume of ratings from Registered users is very low. Here are the top 10 articles by volume of ratings:

Even Don't ask, don't tell, the article most frequently rated by Registered users, received only 35 ratings over the nearly 7 weeks the feedback tool was on the page. For most articles, the volume of ratings from Registered users is so low that they are not likely to provide meaningful information about quality to Readers.

Rating and Editing
To understand the relationship between rating articles and editing articles, we counted the number of times an article was edited by a user either before or after the user rated the article. One hypothesis we're trying to test is whether rating, as a low-barrier form of participation, is an on-ramp for editing. To test this hypothesis, we looked at the frequency of cases where a user edits an article after rating it, but did not edit the article before rating. Anonymous and Registered users were looked at separately.
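The on-ramp test described above can be sketched as a small classification over an event log. This is only an illustration under stated assumptions: the `Event` record, its field names, and the sample log are hypothetical, and the sketch counts (user, article) pairs, whereas the report may have counted individual ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str        # username or IP address
    article: str
    kind: str        # "rate" or "edit"
    timestamp: int   # any sortable time value


def onramp_pairs(events):
    """Return (user, article) pairs where the user edited only after first rating."""
    first_rating = {}
    for e in events:
        if e.kind == "rate":
            key = (e.user, e.article)
            first_rating[key] = min(e.timestamp, first_rating.get(key, e.timestamp))

    edited_before, edited_after = set(), set()
    for e in events:
        key = (e.user, e.article)
        if e.kind == "edit" and key in first_rating:
            if e.timestamp < first_rating[key]:
                edited_before.add(key)
            else:
                edited_after.add(key)
    # Keep only raters who edited afterwards and had never edited before.
    return edited_after - edited_before


# Hypothetical log: "a" rates then edits, "b" was already an editor, "c" only rates.
log = [
    Event("a", "X", "rate", 10), Event("a", "X", "edit", 20),
    Event("b", "X", "edit", 5), Event("b", "X", "rate", 10), Event("b", "X", "edit", 15),
    Event("c", "Y", "rate", 7),
]
onramp = onramp_pairs(log)
print(onramp)
```

Excluding users who edited before rating matters because those users were already editors, so their later edits say nothing about whether rating led them to edit.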

With the current implementation, the Ratings tool does not look like a good on-ramp for editing; only 0.35% of ratings were followed by an edit. But we should keep in mind that the current interface does not do anything to explicitly suggest to the user that they may edit the article:



Here is the data for Registered users:

Interestingly, 16.1% of Registered users edited the same article they rated. Most of these edits are cases where the user edited the article prior to rating.

Overview
Here is an update of the Article Feedback Data as of October 4, 2010. It is based on approximately 2,800 ratings submitted from Sep 22 - Oct 4. A running list of articles is maintained here, but please keep in mind the list is subject to change.

Overall Ratings Data
The following table summarizes the aggregate rating data:

The mean number of ratings is 7.2. The median is 3.


 * Completion rates for each category (defined as the number of ratings for the category divided by the total number of ratings) are between 90% and 96%.
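The completion rate defined above can be computed as in the following minimal sketch. The dictionary layout, the key names, and the sample submissions are assumptions made for illustration; they do not reflect the actual format of the ratings data.

```python
CATEGORIES = ("well_sourced", "neutral", "complete", "readable")


def completion_rates(submissions):
    """Map each category to the share of submissions that include a value for it."""
    total = len(submissions)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {
        category: sum(1 for s in submissions if s.get(category) is not None) / total
        for category in CATEGORIES
    }


# Hypothetical submissions; None or a missing key means the category was skipped.
sample = [
    {"well_sourced": 4, "neutral": 5, "complete": 3, "readable": 4},
    {"well_sourced": 2, "neutral": None, "complete": 2, "readable": 3},
    {"well_sourced": 5, "complete": 4, "readable": 5},
    {"well_sourced": 3, "neutral": 3, "complete": None, "readable": 4},
]
rates = completion_rates(sample)
print(rates)
```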

Comparing Anon Reviewers to Registered Reviewers
Here are the tables comparing ratings from Anonymous and Registered users:

A few things worth noting:


 * It appears as though registered users are “tougher” in their grading of the articles than are anon users. This is especially notable in the area of “well sourced” (3.8 mean for anon vs. 2.5 mean for registered) and “complete” (3.6 vs. 2.4).  It’s interesting to note that the means for “neutral” are almost identical.


 * The completion rate for reviews continues to be higher for registered users. It’s worth noting that “Neutral” had the lowest completion rate for both registered and anonymous users.


 * The standard deviation of ratings across all categories is lower for registered than for anon. While this appears to suggest that the ratings of registered users are more internally consistent than the ratings of anonymous users, looking at the actual distributions suggests the opposite:



The distributions of the ratings are beginning to show marked differences between Anonymous and Registered Users:
 * Anonymous Users are much more generous with their ratings. 4s and 5s are the most common ratings across all categories. These users are far more likely to give 5's than are registered users. For example, under "Well-Sourced", 45% of the ratings from anonymous users were 5 stars whereas only 10% of registered users rated this category 5 stars.
 * Registered Users show distinct patterns depending on the category:
   * Neutral and Readable: Both these categories show a normal-like distribution around the mean.
   * Well-Sourced and Complete: For these categories, the most common rating is 1, and the ratings fall off in a linear-like fashion from 1 to 5. The perceptions registered users have of these categories appear to be significantly worse than their perceptions of other categories.

10 most frequently rated articles
(Simply sorted by number of submitted "well sourced" ratings.)


 * United_States_Constitution - 80 ratings -- linked from Wikimedia blog post
 * Don't_ask,_don't_tell - 61 ratings -- linked from Wikimedia blog post
 * Capital_punishment - 37 ratings
 * Terrorism - 35 ratings
 * United_States_Declaration_of_Independence - 32 ratings
 * DREAM_Act - 32 ratings
 * LGBT_rights_in_the_United_States - 30 ratings
 * 5_centimeters - 28 ratings -- third item in public policy category
 * Pollution - 27 ratings
 * Abortion - 22 ratings

To Do

 * Breakdown of ratings (particularly num. ratings) by user (username or IP)
 * Top 10 (most rated) article comparison
 * Top 10 (most prolific raters) user comparison
 * Short article (with rating tool visible) Vs. others comparison
   * Short No. 1 (viewable on 1280 X 1024): Executive_Order_11478
   * Short No. 2: 5_centimeters (stub)
   * Short No. 3: 1984_Cable_Franchise_Policy_and_Communications_Act (stub)
   * Short No. 4: David_Ray_Hate_Crimes_Prevention_Act (stub)
   * Short No. 5: Balanced_Budget_Act_of_1997 (stub)
 * Comparison of average ratings to current Wikipedia rating system (FA, GA, etc)
 * Investigate the 87+% 4 metric ratings (forced choice? felt mandatory?  confidence in some over others?)
 * Email questionnaire to users about confidence in the accuracy of their ratings
 * Investigate whether those rating articles have also contributed/edited that article (could be done in the questionnaire)
 * Ask Roan if we can have a cumulative Page View column in our CSV data pull
 * Investigate "neutrality" - changing the word? description? placement?
 * Investigate the relation of "completeness" to article length

Overview
Here is some preliminary data on the Article Feedback tool. It is based on approximately 1,470 ratings across 289 articles during the first ~week of the Pilot (Sep 22-28, 2010). A running list of articles is maintained here, but please keep in mind the list is subject to change. The article-level raw data may also be found here.

Overall Ratings Data
The following table summarizes the aggregate rating data.


 * Overall, it’s difficult to conclude whether the differences in category averages are meaningful.  But on average, raters have a relatively similar view of each category (e.g., the perceptions of the articles in the Pilot, as a whole, are that they are about as well sourced as they are neutral, complete, and readable).
 * Completion rates for each category (defined as the number of ratings for the category divided by the total number of ratings) are between 87% and 93%. From a usability standpoint, it appears as though four categories is an acceptable number of categories for users to rate, though further research would help us better understand this (e.g., users may simply be clicking through, they may think rating all four categories is a requirement, etc.).  Here’s a table that breaks down the number of ratings by the number of categories completed:

The vast majority of ratings (83%) have all four categories rated, while 17% are missing at least one category.
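A breakdown of ratings by the number of categories completed can be produced as in the sketch below. The data layout, names, and sample submissions are hypothetical and serve only to illustrate the tabulation.

```python
from collections import Counter

CATEGORY_KEYS = ("well_sourced", "neutral", "complete", "readable")


def completed_breakdown(submissions, categories=CATEGORY_KEYS):
    """Share of submissions by how many of the categories were actually rated."""
    total = len(submissions)
    if total == 0:
        return {}
    counts = Counter(
        sum(1 for c in categories if s.get(c) is not None) for s in submissions
    )
    return {n_rated: counts[n_rated] / total for n_rated in sorted(counts)}


# Hypothetical submissions; None or a missing key means the category was skipped.
example = [
    {"well_sourced": 4, "neutral": 5, "complete": 3, "readable": 4},
    {"well_sourced": 2, "neutral": None, "complete": 2, "readable": 3},
    {"well_sourced": 5, "complete": 4, "readable": 5},
    {"well_sourced": 3, "neutral": 3, "complete": None, "readable": 4},
]
breakdown = completed_breakdown(example)
print(breakdown)
```

The share for the key 4 corresponds to ratings with all four categories filled in, and the remaining keys together give the share missing at least one category.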

Comparing Anon Reviewers to Registered Reviewers
In total, there were 1,300 users (defined by unique IPs and registered accounts). Of the 1,300, 1,138 (88%) were anon and 162 (12%) were registered accounts. When anonymous and registered reviews are analyzed separately, some interesting patterns start to appear.

A few things worth noting:


 * It appears as though registered users are “tougher” in their grading of the articles than are anon users. This is especially notable in the area of “well sourced” (3.7 mean for anon vs. 2.8 mean for registered) and “complete” (3.5 vs. 2.7).  It’s interesting to note that the means for “neutral” are almost identical.


 * The completion rate for reviews is higher for registered users as well. It’s worth noting that “Neutral” had the lowest completion rate for both registered and anonymous users.


 * The standard deviation of ratings across all categories is lower for registered than for anon. While this appears to suggest that the ratings of registered users are more internally consistent than the ratings of anonymous users, looking at the actual distributions suggests the opposite:



Anonymous users are far more likely to give 5's than are registered users. For example, under "Well-Sourced", 45% of the ratings from anonymous users were 5 stars whereas only 17% of registered users rated this category 5 stars. Registered users also appear to have a (relatively speaking) more even distribution across the 5 stars.

Finally, registered users are more likely to rate multiple articles.

Anon Reviewers

Registered Reviewers