Article feedback/Research

=May 2011 Update=

Call-to-action and expertise
On March 14, 2011, version 2 of the Article Feedback Tool was released and applied to a sample of approximately 3,000 articles from the English Wikipedia. Throughout March and April 2011 we regularly analyzed the volume and quality of ratings collected via the AFT. We also began studying data related to calls to action and the self-identified expertise of raters, two of the new features introduced in v.2 of the Article Feedback Tool.

Decreased rating volume: effects of the expertise checkbox
In the transition between v.1 and v.2 we observed a significant drop (-46.7%) in the weekly volume of ratings and in the conversion rate (the number of ratings per impression) for a sample of 380 articles. While these articles are not a random sample of Wikipedia articles (they all formed part of the Public Policy Initiative pilot implementation of AFT), we selected them because they offer the longest series of longitudinal data we have to date for consistently studying the temporal evolution of ratings. We controlled for the number of visits these pages received on a weekly basis and ruled out the possibility that a temporary or seasonal variation in visits caused the decrease in ratings: the aggregate number of visits remained roughly constant throughout v.1 and v.2, with the sole exception of the winter season (Christmas is marked by a gray vertical line in the plots below).

We hypothesized that this drop might be an effect of the expertise checkbox introduced in v.2. To test this, on April 28, 2011 we deployed two versions of AFT, one displaying the expertise checkbox and one with the checkbox hidden, and randomly assigned visitors to one of the two buckets. We ran an A/B test on articles that had obtained at least one rater per bucket, which resulted in a sample of 7,681 articles. We found no statistically significant difference in the number of ratings per article between the two conditions (expertise checkbox hidden vs. expertise checkbox displayed), based on the results of an unpaired t-test (one-tailed, p > 0.1). As a result, we can rule out that hiding or displaying the expertise checkbox significantly affects the number of ratings submitted by readers.
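The test used here can be sketched as follows. This is a minimal illustration of an unpaired (Student's) t statistic on per-article rating counts; the bucket data below is purely illustrative, not the actual AFT sample:

```python
from statistics import mean, variance

def unpaired_t(a, b):
    """Student's unpaired t statistic (equal-variance form) for two samples."""
    na, nb = len(a), len(b)
    # Pooled variance across the two buckets.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Hypothetical per-article rating counts for the two buckets
# (illustrative numbers only, not the real data).
shown = [3, 5, 2, 4, 6, 3, 4]   # expertise checkbox displayed
hidden = [4, 3, 5, 2, 5, 4, 3]  # expertise checkbox hidden
t = unpaired_t(shown, hidden)   # compare |t| against the critical value
```

The t statistic is then compared against the t distribution with n_a + n_b - 2 degrees of freedom to obtain the p-value; in practice a statistics package handles that lookup.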



=Feb 2011 Update=

A new series of analyses was performed in February 2011, based on a richer dataset and focusing on a broader set of research questions, in preparation for the launch of the Article Feedback Tool v.2.

=Dec 20 Update: GFAJ-1=

This is a short update focusing on GFAJ-1 and its ratings since the Article Feedback Tool was applied on December 2. GFAJ-1 is an interesting case study because the article contained only about a paragraph of information as of December 2, 2010. On that day, NASA held a press conference announcing the discovery of arsenic-based life, and the article was subsequently developed much more fully. Brandon Harris put the Article Feedback Tool on the page to see whether the changes to the article would be reflected in the ratings. GFAJ-1 represents a natural experiment because the tool was in place before substantial changes were made to the article.
 * Here is the version of the article when the Article Feedback tool was applied on December 2.
 * Here is the version of the article as of December 20 (date of this analysis).
 * As can be seen from the diff, the article has undergone substantial change.

''Note: The analysis presented here is very cursory, done mainly by manipulating ratings data within a spreadsheet. We ideally would map ratings to revision, but doing so requires more intensive data manipulation. Time (a common thread that runs through both ratings and revisions) was therefore used as an approximation. More complex analysis is required to draw firm conclusions about the relationships between ratings and changes in the article.''

The changes in the GFAJ-1 article appear to be reflected in the ratings, though it is unclear how tightly coupled the two are. Here is a time series of the "Complete" rating:



The article length clearly increases over time, and the trendline of the "Complete" rating slopes upward (note that the trendline is a regression fit to a moving average of the ratings, not to the individual ratings). The ratings for Well-Sourced also slope upward:
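A trendline of this kind can be sketched as follows: smooth the ratings with a trailing moving average, then fit a least-squares slope to the smoothed series. The ratings below are illustrative, not the actual GFAJ-1 data:

```python
def moving_average(xs, window=5):
    """Trailing moving average; windows are shorter at the start of the series."""
    return [sum(xs[max(0, i - window + 1):i + 1]) / len(xs[max(0, i - window + 1):i + 1])
            for i in range(len(xs))]

def slope(ys):
    """Least-squares slope of ys against their index (0, 1, 2, ...)."""
    n = len(ys)
    xbar = (n - 1) / 2
    ybar = sum(ys) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

# Hypothetical "Complete" ratings in submission order (illustrative only).
ratings = [2, 1, 3, 2, 3, 4, 3, 4, 5, 4]
trend = slope(moving_average(ratings))  # positive value => upward trendline
```

Smoothing first and then fitting dampens the noise of individual 1-5 ratings, which is why the trendline reflects the moving average rather than the raw points.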



In the case of GFAJ-1, the Neutral and Readable ratings trend downwards:





=Nov 8 Update=

Overview
Here is an update of the Article Feedback data as of November 8, 2010. It is based on 12,498 ratings submitted from September 22 to November 8. A running list of articles is maintained here, but please keep in mind the list is subject to change.

A quick summary of the points so far:
 * Ratings by Anonymous users outpace ratings by Registered users by 10x.
 * For many articles, the number of ratings from Registered users is not enough to provide meaningful information about article quality.


 * Ratings by Anonymous users skew high, with most anonymous users giving either a 4 or a 5 across all dimensions. We intend to measure whether this skew persists over time (e.g., whether ratings from Anonymous users fail to change noticeably after an article is significantly improved).
 * Ratings by Registered users are both lower and show less of a skew compared to ratings by Anonymous users. This could suggest that Registered users are more critical and/or give higher quality ratings, though more data is needed to support this assertion.  We intend to measure how substantial changes in an article affect ratings from Registered users.
 * In its current form, the tool is not a good on-ramp for editing. In the next release of the feature, we will test interface messages to see if they have an effect on editing after a user rates (e.g., "Did you know you can edit this article?").

Comparing Anon Reviewers to Registered Reviewers
Anonymous users submit about 10 times as many ratings as Registered users. Registered users continue to rate with a lower mean but a higher completion rate.

Here are the distributions for Anon and Registered users:



We continue to see a skew towards 4s and 5s from Anonymous users. Registered users show less of a skew towards high ratings than do Anonymous users.

Length of Articles and Ratings
The Public Policy Initiative includes articles at various stages of development: short, stub-like articles such as 5 centimeter band and Executive Budget, but also longer articles such as United States Constitution and Don't ask, don't tell. We wanted to see whether the shorter, stub-like articles were rated differently than more developed articles, particularly along the Well-Sourced and Complete dimensions. We defined a short article as one under 1.5 kB in length.
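The split described above amounts to partitioning articles on a byte-length threshold and comparing rating distributions across the two groups. A minimal sketch, with hypothetical records (the titles match articles mentioned above, but the sizes and ratings are illustrative):

```python
# Hypothetical (title, size_in_bytes, well_sourced_ratings) records.
articles = [
    ("Executive Budget", 900, [1, 2, 1]),
    ("5 centimeter band", 1200, [2, 1, 1, 2]),
    ("United States Constitution", 48000, [4, 5, 4]),
]

SHORT_LIMIT = 1500  # the 1.5 kB threshold used in the analysis

short = [a for a in articles if a[1] < SHORT_LIMIT]
long_ = [a for a in articles if a[1] >= SHORT_LIMIT]

def mean_rating(group):
    """Mean of all ratings pooled across the articles in a group."""
    pooled = [r for _, _, rs in group for r in rs]
    return sum(pooled) / len(pooled)
```

Comparing `mean_rating(short)` against `mean_rating(long_)`, dimension by dimension, is the comparison the histograms below visualize.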



For these short articles, Registered users' ratings for Well-Sourced and Complete skew heavily towards 1s and 2s. Anonymous users do not show quite the same skew.

Articles over 1.5 kB in length show the following ratings distribution:



Ratings Volume by Article
While Registered users appear to show less of a skew in their ratings, the volume of ratings from Registered users is very low. Here are the top 10 articles by volume of ratings:

Even Don't ask, don't tell, the article most frequently rated by Registered users, received only 35 ratings over nearly seven weeks of having the feedback tool on the page. For most articles, the volume of ratings from Registered users is so low that they are not likely to provide meaningful information about quality to readers.

Rating and Editing
To understand the relationship between rating articles and editing articles, we counted the number of times an article was edited by a user either before or after that user rated the article. One hypothesis we are trying to test is whether ratings, as a low-barrier form of participation, are an on-ramp for editing. To test this hypothesis, we looked at the frequency of cases where a user edits an article after rating it but did not edit the article before rating. Anonymous and Registered users were analyzed separately.
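The on-ramp metric just described can be sketched as follows: for each (user, article) pair with a rating, check whether the user's edit timestamps fall only after the rating. The event log below is hypothetical, purely to show the shape of the computation:

```python
# Hypothetical log: for each (user, article) pair, when the user rated
# and when (if ever) they edited that article (illustrative data only).
events = [
    {"rated_at": 100, "edits": []},         # rated, never edited
    {"rated_at": 100, "edits": [50]},       # edited before rating only
    {"rated_at": 100, "edits": [150]},      # edited only after rating: "on-ramp"
    {"rated_at": 100, "edits": [50, 150]},  # edited both before and after
]

def on_ramp_rate(events):
    """Share of ratings where the user edited after rating but not before."""
    on_ramp = sum(
        1 for e in events
        if any(t > e["rated_at"] for t in e["edits"])
        and not any(t < e["rated_at"] for t in e["edits"])
    )
    return on_ramp / len(events)
```

Only the third case counts toward the metric: users who already edited before rating (like the last case) are excluded, since rating clearly did not bring them to editing.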

With the current implementation, it does not look like the ratings tool is a good on-ramp for editing -- only 0.35% of ratings resulted in an edit after the rating. But we should keep in mind that the current interface does not do anything to explicitly suggest to the user that they may edit the article:



Here is the data for Registered users:

Interestingly, 16.1% of Registered users edited the same article they rated. Most of these edits are cases where the user edited the article prior to rating.

=Oct 4 Update=

Overview
Here is an update of the Article Feedback Data as of October 4, 2010. It is based on approximately 2,800 ratings submitted from Sep 22 - Oct 4. A running list of articles is maintained here, but please keep in mind the list is subject to change.

Overall Ratings Data
The following table summarizes the aggregate rating data:

The mean number of ratings per article is 7.2; the median is 3.


 * Completion rates for each category (defined as the number of ratings for the category divided by the total number of ratings) are between 90% and 96%.
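The completion rate defined in the bullet above can be sketched directly; the ratings below are hypothetical, with a missing key standing for a skipped category:

```python
# Hypothetical ratings: each is a dict of the four categories; a missing
# key means the rater skipped that category (illustrative data only).
ratings = [
    {"well_sourced": 4, "neutral": 3, "complete": 4, "readable": 5},
    {"well_sourced": 5, "complete": 3, "readable": 4},  # skipped "neutral"
    {"neutral": 4, "readable": 4},                      # skipped two categories
]

def completion_rate(ratings, category):
    """Ratings that include the category, divided by all ratings."""
    return sum(1 for r in ratings if category in r) / len(ratings)
```

Computing this per category for each user group gives the per-category completion rates reported in the tables.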

Comparing Anon Reviewers to Registered Reviewers
Here are the tables comparing ratings from Anonymous and Registered users:

A few things worth noting:


 * It appears as though registered users are “tougher” in their grading of the articles than are anon users. This is especially notable in the area of “well sourced” (3.8 mean for anon vs. 2.5 mean for registered) and “complete” (3.6 vs. 2.4).  It’s interesting to note that the means for “neutral” are almost identical.


 * The completion rate for reviews continues to be higher for registered users. It’s worth noting that “Neutral” had the lowest completion rate for both registered and anonymous users.


 * The standard deviation of ratings across all categories is lower for registered than for anon. While this appears to suggest that the ratings of registered users are more internally consistent than the ratings of anonymous users, looking at the actual distributions suggests the opposite:



The distribution of the ratings are beginning to show marked differences between Anonymous and Registered Users:
 * Anonymous Users are much more generous with their ratings: 4s and 5s are the most common ratings across all categories, and these users are far more likely to give 5s than are registered users. For example, under "Well-Sourced", 45% of the ratings from anonymous users were 5 stars, whereas only 10% of registered users rated this category 5 stars.
 * Registered Users show distinct patterns depending on the category:
   * Neutral and Readable: both categories show a normal-like distribution around the mean.
   * Well-Sourced and Complete: for these categories, the most common rating is 1, and the ratings fall off in a roughly linear fashion from 1 to 5. Registered users' perceptions of these categories appear to be significantly worse than their perceptions of the other categories.

10 most frequently rated articles
(Simply sorted by number of submitted "well sourced" ratings.)


 * http://en.wikipedia.org/wiki/United_States_Constitution - 80 ratings -- linked from Wikimedia blog post
 * http://en.wikipedia.org/wiki/Don't_ask,_don't_tell - 61 ratings -- linked from Wikimedia blog post
 * http://en.wikipedia.org/wiki/Capital_punishment - 37 ratings
 * http://en.wikipedia.org/wiki/Terrorism - 35 ratings
 * http://en.wikipedia.org/wiki/United_States_Declaration_of_Independence - 32 ratings
 * http://en.wikipedia.org/wiki/DREAM_Act - 32 ratings
 * http://en.wikipedia.org/wiki/LGBT_rights_in_the_United_States - 30 ratings
 * http://en.wikipedia.org/wiki/5_centimeters - 28 ratings -- third item in public policy category
 * http://en.wikipedia.org/wiki/Pollution - 27 ratings
 * http://en.wikipedia.org/wiki/Abortion - 22 ratings

To Do

 * Breakdown of ratings (particularly num. ratings) by user (username or IP)
 * Top 10 (most rated) article comparison
 * Top 10 (most prolific raters) user comparison
 * Short article (with rating tool visible) Vs. others comparison
 * Short No. 1 (viewable on 1280 X 1024): http://en.wikipedia.org/wiki/Executive_Order_11478
 * Short No. 2: http://en.wikipedia.org/wiki/5_centimeters (stub)
 * Short No. 3: http://en.wikipedia.org/wiki/1984_Cable_Franchise_Policy_and_Communications_Act (stub)
 * Short No. 4: http://en.wikipedia.org/wiki/David_Ray_Hate_Crimes_Prevention_Act (stub)
 * Short No. 5: http://en.wikipedia.org/wiki/Balanced_Budget_Act_of_1997 (stub)
 * Comparison of average ratings to current Wikipedia rating system (FA, GA, etc)
 * Investigate the 87+% of ratings that cover all four metrics (forced choice? felt mandatory? confidence in some over others?)
 * Email questionnaire to users about confidence in the accuracy of their ratings
 * Investigate whether those rating articles have also contributed/edited that article (could be done in the questionnaire)
 * Ask Roan if we can have a cumulative Page View column in our CSV data pull
 * Investigate "neutrality" - changing the word? description? placement?
 * Investigate the relation of "completeness" to article length

=Sep 28 Update=

Overview
Here is some preliminary data on the Article Feedback tool. It is based on approximately 1,470 ratings across 289 articles during the first week of the Pilot (September 22-28, 2010). A running list of articles is maintained here, but please keep in mind the list is subject to change. The article-level raw data may also be found here.

Overall Ratings Data
The following table summarizes the aggregate rating data.


 * Overall, it’s difficult to conclude whether the differences in category averages are meaningful.  But on average, raters have a relatively similar view of each category (e.g., the perceptions of the articles in the Pilot, as a whole, are that they are about as well sourced as they are neutral, complete, and readable).
 * Completion rates for each category (defined as the number of ratings for the category divided by the total number of ratings) are between 87% and 93%. From a usability standpoint, four categories appears to be an acceptable number for users to rate, though further research would help us better understand this (e.g., users may simply be clicking through, or they may think rating all four categories is required). Here's a table that breaks down the number of ratings by the number of categories completed:

The vast majority of ratings (83%) have all four categories rated, while 17% are missing at least one category.
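The breakdown above can be sketched by counting, for each rating, how many of the four categories were filled in. The ratings below are hypothetical, with None marking a skipped category:

```python
from collections import Counter

# Hypothetical ratings as (well_sourced, neutral, complete, readable);
# None marks a skipped category (illustrative data only).
ratings = [
    (4, 3, 4, 5),
    (5, None, 3, 4),
    (None, 4, None, 4),
    (3, 4, 4, 4),
    (2, 2, 3, None),
    (5, 5, 5, 5),
]

# Histogram of how many categories each rating completed (0-4).
completed = Counter(sum(1 for v in r if v is not None) for r in ratings)
share_all_four = completed[4] / len(ratings)
```

`completed` is the table sketched above (ratings per number of categories completed), and `share_all_four` corresponds to the 83% figure reported for the actual data.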

Comparing Anon Reviewers to Registered Reviewers
In total, there were 1,300 users (defined by unique IPs and registered accounts). Of the 1,300, 1,138 (88%) were anonymous and 162 (12%) were registered accounts. When anonymous and registered reviewers are analyzed separately, some interesting patterns emerge.

A few things worth noting:


 * It appears as though registered users are “tougher” in their grading of the articles than are anon users. This is especially notable in the area of “well sourced” (3.7 mean for anon vs. 2.8 mean for registered) and “complete” (3.5 vs. 2.7).  It’s interesting to note that the means for “neutral” are almost identical.


 * The completion rate for reviews is higher for registered users as well. It’s worth noting that “Neutral” had the lowest completion rate for both registered and anonymous users.


 * The standard deviation of ratings across all categories is lower for registered than for anon. While this appears to suggest that the ratings of registered users are more internally consistent than the ratings of anonymous users, looking at the actual distributions suggests the opposite:



Anonymous users are far more likely to give 5's than are registered users. For example, under "Well-Sourced", 45% of the ratings from anonymous users were 5 stars whereas only 17% of registered users rated this category 5 stars. Registered users also appear to have a (relatively speaking) more even distribution across the 5 stars.

Finally, registered users are more likely to rate multiple articles.

Anon Reviewers

Registered Reviewers