User:DarTar/SandBox

=Feb 2011 Update=

Overview
We ran a new series of analyses based on data from the Phase 1 Article Feedback tool to address a number of research questions that may help inform the design and implementation of this tool. The analysis at this stage is exploratory, and we will delve further into the preliminary findings over the coming weeks. Feedback is always appreciated on the talk page.

The questions that we considered for the present study are the following:
 * Are ratings reliable indicators of article quality?
 * Are there correlations between measurable features of an article (size, number of citations, views, quality-related templates) and the volume/quality of ratings?
 * Do different classes of users (anonymous vs. registered) rate articles differently, and are ratings consistent within each group?
 * What factors drive conversions (i.e. the decision to rate an article after visiting it)?
 * Are there significant changes over time in rating?
 * Do changes in article features produce shifts in ratings or rating volume?

We decided to focus on "well sourced" ratings in particular as an initial case study, to try to understand the relation between the presence of citations and of source/citation needed templates (or lack thereof) on the one hand and the perceived quality of the article on the other.

The dataset
The sample consists of a total of 727 articles selected from the PPI project plus an additional list of articles related to special events. We collected ratings for articles in this sample between September 2010 and January 2011 (hereafter: "observation period"), for a total of 52787 ratings, 94.3% of which were generated by anonymous users and 5.7% by registered users. The mean number of ratings is 72 per article but, as expected, the distribution of ratings per article is very skewed, as detailed below. In addition to the ratings available from the Article Feedback tool, we obtained the following data:
 * daily volume of article views (from http://stats.grok.se)
 * daily changes in article length (via the Wikipedia API)
 * daily changes in number of citations (via the Wikipedia API)
 * daily changes in the number of citation/source needed templates (via the Wikipedia API)

Article Length
The list of articles selected for this study is not a random sample of Wikipedia articles and as such it shouldn't be considered representative of Wikipedia articles at large. In particular, the sample includes articles that were already at a very mature stage at the beginning of the observation period (such as en:United States) as well as articles that were created from scratch and underwent a dramatic volume of edits during the observation period (such as the en:GFAJ-1 article). As a result, articles in the sample differ substantially in initial size and in how much they changed during the observation period, both in absolute terms (total number of bytes added) and in relative terms (proportion of bytes added with respect to the initial size). Figure 1 shows the distribution of initial article lengths (using logarithmic binning). Figure 2 shows the distribution of relative length change during the observation period (top) and the relation between relative length change and initial length (bottom). Length plays a central role in quality ratings but, as the scatterplot in Figure 2 shows, the majority of articles in the sample start with a fairly large size and undergo changes during the observation period that are smaller than the initial length. Cases with a relative change higher than 100% of the initial length tend to occur only for smaller articles, where relatively small contributions can create large relative changes.
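The two measures of change described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis; the function name and the byte counts in the example are hypothetical.

```python
def length_change(initial_bytes, final_bytes):
    """Return (absolute change in bytes, relative change as a fraction of the initial length)."""
    if initial_bytes <= 0:
        raise ValueError("initial length must be positive")
    absolute = final_bytes - initial_bytes
    return absolute, absolute / initial_bytes


# Hypothetical example: the same 4,000-byte addition applied to a
# mature article and to a short one.
mature = length_change(120_000, 124_000)
short = length_change(2_000, 6_000)
print(mature)  # small relative change
print(short)   # relative change above 100% of the initial length
```

The same absolute contribution yields a relative change of a few percent for the large article and of 200% for the small one, which is why relative changes above 100% are concentrated among smaller articles.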



Rating volume
Articles in the sample differ significantly in the volume of ratings they generate, with a strongly skewed distribution of the number of ratings per article across all four rating dimensions. Figure 3 shows a histogram with the distribution of the total number of ratings per article, with logarithmically spaced bins.
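For readers unfamiliar with logarithmic binning, the sketch below shows one way to count values in logarithmically spaced bins. It is an illustration only: the bin layout (a fixed number of bins per decade) and the sample counts are assumptions, not the settings used for Figure 3.

```python
import math


def log_bin_counts(values, bins_per_decade=2):
    """Count positive values in logarithmically spaced bins.

    Returns a list of (lower_edge, upper_edge, count) tuples, where bin k
    covers [10**(k / bins_per_decade), 10**((k + 1) / bins_per_decade)).
    """
    counts = {}
    for v in values:
        if v <= 0:
            continue  # the logarithm is undefined; unrated articles are skipped
        k = math.floor(math.log10(v) * bins_per_decade)
        counts[k] = counts.get(k, 0) + 1
    return [
        (10 ** (k / bins_per_decade), 10 ** ((k + 1) / bins_per_decade), counts[k])
        for k in sorted(counts)
    ]


# Hypothetical ratings-per-article counts, one bin per decade.
for low, high, count in log_bin_counts([1, 3, 5, 20, 50, 700], bins_per_decade=1):
    print(f"[{low:g}, {high:g}): {count}")
```

With a strongly skewed distribution, bins of equal width on a logarithmic scale keep the many low-volume articles and the few high-volume ones visible in the same histogram.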



However, it's interesting to note that the completion rate for articles that get rated is very high, i.e. when people decide to rate an article, they consistently do so along all four dimensions. Figure 4 compares the volume of ratings per article along different dimensions (each dot represents an individual article) and shows a very strong linear correlation between the number of ratings an article receives on any two dimensions. To put it differently, it's very rare for an article to have a very high number of ratings in only one dimension and few ratings in the other three.
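The strength of a linear relation of this kind is commonly summarised with the Pearson correlation coefficient. The sketch below assumes this measure for illustration; the report does not state which statistic was computed, and the per-article counts in the example are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical number of ratings per article on two dimensions.
well_sourced = [12, 40, 95, 310]
readable = [11, 42, 90, 305]
print(pearson(well_sourced, readable))
```

A value close to 1 for every pair of dimensions would correspond to the pattern described above, where articles are rated on all four dimensions at nearly the same volume.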



Views
We included the number of daily views per article as a control variable in the present analysis. Comparing traffic for articles in the sample shows that views vary dramatically not only in volume (popular vs. less popular articles) but also in how they change over time (see Figure 5 below). Some articles (such as en:United States) display regular weekly fluctuations (together with seasonal fluctuations, i.e. fewer views during the winter holiday season), but their monthly volume of views tends to be roughly constant over time. In contrast, articles referring to temporally identifiable events (such as en:Black Friday (shopping)) display a strong peak in views around a specific date, preceded and followed by a long period of silence. Other articles (such as en:DREAM Act) display multiple peaks in daily views as a function of increased media coverage of related issues and events.

Conversions
We will refer to conversions as the number of users who actually decide to submit a rating when viewing an article. The conversion rate will be expressed as the proportion of ratings per number of views over a given period (e.g. one day).
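As a minimal sketch of this definition (the function name and the daily figures are hypothetical), the conversion rate for a given day is the number of ratings divided by the number of views on that day:

```python
def conversion_rate(ratings, views):
    """Proportion of ratings per view over a period; None when there were no views."""
    if views <= 0:
        return None  # the rate is undefined without any views
    return ratings / views


# Hypothetical daily (ratings, views) pairs for a single article.
daily = [(12, 4000), (3, 1500), (0, 0)]
for ratings, views in daily:
    print(ratings, views, conversion_rate(ratings, views))
```

Days with no recorded views are returned as `None` rather than zero, so that they can be excluded when averaging rates over a longer period.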

Views
As one would expect, the volume of daily ratings an article produces is strongly dependent on the total number of views it gets per day. Peaks and dips in the volume of daily ratings are aligned with peaks and dips in views, no matter how irregular visit patterns are.

Article length
Possibly the most interesting finding in the present analysis is that the probability that people rate an article when they visit it decays very quickly with the size of the article, following a power law relation.
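A power law relation has the form rate = c × length^exponent, which becomes a straight line when both variables are plotted on logarithmic axes. One simple way to estimate its parameters, sketched below, is ordinary least squares on the logarithms. This is an assumption made for illustration: the report does not state how the trend was fitted, and the data in the example are synthetic.

```python
import math


def fit_power_law(lengths, rates):
    """Least-squares fit of rate = c * length**exponent in log-log space.

    Pairs with a non-positive length or rate are dropped, because their
    logarithm is undefined. Returns (c, exponent).
    """
    points = [
        (math.log(x), math.log(y))
        for x, y in zip(lengths, rates)
        if x > 0 and y > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    if sxx == 0:
        raise ValueError("all lengths are identical")
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    exponent = sxy / sxx
    c = math.exp(mean_y - exponent * mean_x)
    return c, exponent


# Synthetic data generated from rate = 5 * length**-1, for illustration only.
lengths = [1_000, 10_000, 100_000]
rates = [5.0 / length for length in lengths]
print(fit_power_law(lengths, rates))
```

A negative exponent corresponds to a conversion rate that decays with article length. Note that a fit in log-log space silently discards articles with a zero conversion rate, which can bias the estimate when many articles received no ratings.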



The trend in Figure 8 suggests that:
 * Users don’t seem to bother using the feedback tool for long articles. This could mean either that (1) the current positioning of the feedback tool at the bottom of a long article decreases the chances that people actually see it, or that (2) people have less incentive to rate an article when it's very long. This is partly consistent with the hypothesis defended by some scholars that feedback is more likely to occur when information is of bad quality or contains major inaccuracies.
 * No matter what the explanation for this is, we should expect the volume of ratings to be skewed towards shorter (and presumably lower quality) articles.

Article Length
Article length is a good predictor of rating scores, but it affects individual quality dimensions in different ways. Rating scores for shorter articles, in particular, tend to be sensitive to length only when users assess completeness or well-sourcedness, and insensitive to length in the case of neutrality or readability. Above a specific article length threshold, rating scores tend to become consistent and do not correlate further with length: further increases in article length above this threshold won't produce any difference in the perception of quality by raters.

Figure 9 shows the relation between average length and average rating score along the 4 quality dimensions for longer articles (>50Kb). For articles in this category, average rating scores do not vary with length, suggesting that at this length they are consistently perceived as articles of good quality and that further increases in length do not produce shifts in average article scores. Interestingly, the distribution of average scores for this class of articles is comparable across all 4 dimensions.

Conversely, if we focus on shorter articles (<50Kb, Figure 10) we observe that length tends to be much more correlated with average scores, but it only does so in the case of well sourced and complete ratings (highlighted, light gray). This is consistent with the intuition that article length should not be a critical factor for an article to be perceived as neutral or readable (i.e. there is no reason why short articles should be perceived as less neutral or less readable than longer ones), while there are reasons to believe that this should be the case for well sourced and complete ratings.

If we focus on the case of well sourced scores, in particular, and consider the difference in score between the longest and shortest articles, we find quite a dramatic effect of length.

Figure 11 shows the relation between average length and average quality score (well sourced) for a sample of the 300 articles with the largest volume of ratings. Articles in red are the top 50 by length and articles in blue are the bottom 50 by length; the two classes tend to separate into two fairly distinct groups based on individual average rating scores.
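The comparison between the longest and shortest articles can be sketched as below: sort the articles by length, take the k shortest and the k longest, and compare the mean score of each group. The function name, the input layout and the example values are hypothetical, and the report does not say whether group means were actually computed.

```python
def extreme_group_means(articles, k=50):
    """Mean score of the k shortest and the k longest articles.

    `articles` is a list of (length_bytes, average_score) pairs.
    Returns (mean_score_shortest, mean_score_longest).
    """
    if k <= 0 or len(articles) < 2 * k:
        raise ValueError("need at least 2 * k articles so the groups do not overlap")
    by_length = sorted(articles, key=lambda article: article[0])
    shortest = by_length[:k]
    longest = by_length[-k:]

    def mean_score(group):
        return sum(score for _, score in group) / len(group)

    return mean_score(shortest), mean_score(longest)


# Hypothetical (length in bytes, average "well sourced" score) pairs.
sample = [(1_000, 2.0), (2_000, 3.0), (50_000, 4.0), (90_000, 5.0)]
print(extreme_group_means(sample, k=2))
```

For the sample described above, `k=50` would be applied to the 300 most-rated articles; a large gap between the two means would match the separation between the two groups in the figure.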