User:TJones (WMF)/Notes/How Wrong Would Using Out of Date Page View Data Be?

January 2016 — See TJones_(WMF)/Notes for other projects.

Introduction
We want to integrate page view information into the scoring algorithms we use for both the completion suggester and our regular search results. Our initial idea is we only update this page view information when doing normal document updates after a page edit (for technical reasons, page view data is available/provided when a page is edited). We need to analyze if this page view data will be "good enough" or if we need to do something more.

The goal of this analysis is to look at the distribution of how often pages are edited and compare that against how much page view data tends to change over time.


 * We want to get a sense of how many outliers there are, and how big the spikes are (and report #/% of spikes).
 * We want to to look at week to week changes in daily average over a week, with a one-week gap, two-week gap, and three-week gap, to get a sense of how fast things change week to week (and report #/% of significant changes by time span).
 * We want to calculate the #/% of outlier spikes that do not have an edit the same day.

See Phabricator Ticket, too.

Data Collection
I used Hive on stat1002 to get view counts for all pages, by day, for September 2015. I limited output to 50M rows to avoid the very, very long tail. (I started out with just 5M rows, but it wasn't enough!)

The HQL: use wmf; SELECT SUM(view_count) AS views, year, month, day, project, page_title FROM wmf.pageview_hourly WHERE year = 2015 AND month = 9 AND project = 'en.wikipedia' AND agent_type = 'user' AND page_title NOT LIKE '%\:%' GROUP BY  year, month, day, project, page_title ORDER BY  views DESC LIMIT 50000000;

The last row had 10 page views, and since the list of pages with 10 page views wasn't complete, I deleted all of those, leaving 49,181,835 pages with 11 or more page views for any day in September 2015.

Some stats:
 * There are 49,181,835 examples of pages with 11 or more views some day in September 2015.
 * There are 3,638,889 distinct pages.

Outliers
I looked for outliers in page views for each page using the interquartile range (IQR). High outliers are those that are "far" above the third quartile page view count, where "far" is 1.5x the difference between the first and third quartile. For example, if the first quartile daily page view count is 1000, and the third quartile count is 2000, then an outlier is 1.5x the difference (1000 * 1.5 = 1500) above the third quartile (2000 + 1500 = 3500). Read the IQR wiki page for a more detailed explanation. (Also note that Mikhail suggested using Median Absolute Deviation over IQR to find outliers, but I didn't have the time to code it up and, more importantly, contemplate out how best to handle missing values. At least he tried to save me from myself.)

Because the data is truncated for page counts below 11 page views, I had to extrapolate for incomplete data. Otherwise we would find many "outliers" for pages with small sample sizes. For example, page views of 15, 16, 17, and 25 would make 25 an "outlier", which seems unlikely, and also something we wouldn't feel bad about missing.


 * If data was missing (i.e., below 11 page views) for more than 3/4 of the days in September, I assumed the first quartile page view count was 0 and the third quartile page count was 10. That is, maximize the size of the interquartile range (10), so that at least 25 page views are needed to count as an "outlier". (This undercounts outliers, but at these low page view counts, it doesn't really matter.)


 * If data was missing (i.e., below 11 page views) for more than 1/4 (but less than 3/4) of the days in September, I assumed the first quartile page view count was 0 (and the third quartile remains the third quartile of the available data. That is, page view counts above 1.5x the third quartile of the available data would count as outliers.

I also looked at ignoring outliers compared to a minimum page view count. For example, while a page view count of 99 can technically be an outlier, it is not a serious failure not to catch it (compared to a spike of 500,000 page views). I counted outliers using IQR, and among those outliers notes those with at least 100, 500, 1K, 2K, and 5K daily page views.

Outlier counts:
 * 1,856,745 using IQR.
 * 804,271 with IQR + minimum page views of 100 on that day.
 * 233,694 with IQR + minimum page views of 500 on that day.
 * 123,533 with IQR + minimum page views of 1,000 on that day.
 * 61,361 with IQR + minimum page views of 2,000 on that day.
 * 21,333 with IQR + minimum page views of 5,000 on that day.

It is interesting to note that most "outliers" (>93%) don't even get 1,000 page views.

Pretty Pictures—Median vs Max
One of my goals for this analysis was to generate a plot of median value vs maximum value, to get a sense of how many outliers there are, and "how big the spikes are." Turns out a significant portion of pages have outliers, so the graphs aren't as clear or explanatory as I had hoped, but they are presented here anyway.

There are 3,638,889 distinct pages in the collection, and 1,002,680 (27.6%) have ≥1 outliers.

The Main Page and a few other pages have extremely high page view counts, so I've used a log/log scale. The graph includes a faint line along x=y since the aspect ratio of the graph isn't 1:1.

Plot of median vs max page views (log/log scale) for enwiki pages in September 2015.

R code: qplot(log10(med), log10(max), col=out, alpha=I(1/5), size=I(1)) + stat_function(fun=function(x)(x), geom="line", alpha=0.1) + ggtitle("Sept 2015 Outliers - Median vs Max (log/log)")

Blue dots are pages with ≥1 IQR outliers, and black dots have no outliers. Outliers on or near the x=y line have low counts in the sample (e.g., only one daily page view count over 10).

The plot below shows the detail of the lower left quadrant of the graph (linear scale). The dots are more numerous and overlap more, and so they are more transparent in this view, and have had jitter applied so they don't all overlap. Detail of lower quadrant of plot of median vs max page views (linear scale) for enwiki pages in September 2015.

R code: qplot(jitter(med,1,0.3), jitter(max,1,0.3), col=out, xlim=c(0,100), ylim=c(0,100), alpha=I(1/10), size=I(1)) + stat_function(fun=function(x)(x), geom="line", alpha=0.1) + ggtitle("Sept 2015 Outliers - Median vs Max - Detail")

The obvious non-outlier cutoff at 25 (the "Bermuda Triangle") is for pages with data mising for 3/4 of the days in September, which limits their ability to be considered "outliers". See above. I'm not terribly concerned by the lack of outliers below 25 views in a day. Even if such a page view count is an outlier, it's still not terribly significant.

The cluster of dots above (3.3) on the log/log graph ("Halley's Comet") is unexplained.

Distributions of Outliers by Median Page Views
I also divided the pages into logarithmic (base 2) buckets based on the median page views for each page, in anticipation that outliers would occur at different rates for pages with different numbers of "typical" daily page views. This turned out not to be true, and about 60-70% of pages in each well-represented bucket have outliers.

Note that there are no buckets below Bucket 3 because of the minimum page view limit of 11. Also, Bucket 3's outlier count has been suppressed by the missing data extrapolation above.

Base-2 Buckets, by Median Page Views min      max        pages  outliers     out% Bucket 3: 11        15       2024918    148174     7.3% Bucket 4: 16        31        860004    383357    44.6% Bucket 5: 32        63        314590    201127    63.9% Bucket 6: 64        127       190513    118576    62.2% Bucket 7: 128       255       116662     71343    61.2% Bucket 8: 256       511        67463     41089    60.9% Bucket 9: 512       1023       35417     21491    60.7% Bucket 10: 1024     2047       18821     10761    57.2% Bucket 11: 2048     4095        7351      4631    63.0% Bucket 12: 4096     8181        2400      1619    67.5% Bucket 13: 8199     16344        598       410    68.6% Bucket 14: 16413    32572        118        87    73.7% Bucket 15: 33176    62681         29        12    41.4% Bucket 16: 76659    84088          2         2   100.0% Bucket 17: 155009   155009         1         0     0.0% Bucket 19: 840562   840562         1         1   100.0% Bucket 24: 18545589 18545589       1         0     0.0%

Edits
For my initial analysis, I took a 1-in-182 sample of the full list of outliers, to get about ~10K pages to analyze more closely. There were 10,039 pages in the sample (with 9,983 unique pages—so there were 56 pages with 2 instances of outliers in the sample).

For the 9,983 unique pages, I pulled revision information for those pages using the following URL: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=timestamp&rvlimit=max&format=json&titles=  A number of pages (30) are missing, and seem to have been deleted since September. Those were dropped from the sample.

1300 outliers (12.9% of outliers, for 1292 unique pages) are redirects, and were counted separately.

Same Day Edits
Comparing the edit history to the outliers we see that most pages don't have edits on the day they have spikes in page views:


 * same-day edits for 797 of 8739 article outliers (9.1%)
 * same-day edits for 12 of 1300 redirect outliers (0.9%)

Broken down by Base-2 Buckets (Median Page Views):

min and max are for the buckets as a whole (from above), edits, outliers, and edit% are for the outlier sample.

articles: min       max       edits   outliers    edit% Bucket 3: 11        15        76      761         10% Bucket 4: 16        31        217     3010        7.2% Bucket 5: 32        63        130     1840        7.1% Bucket 6: 64        127       92      1297        7.1% Bucket 7: 128       255       81      809         10% Bucket 8: 256       511       62      510         12.2% Bucket 9: 512       1023      62      282         22% Bucket 10: 1024     2047      38      129         29.5% Bucket 11: 2048     4095      20      65          30.8% Bucket 12: 4096     8181      11      25          44% Bucket 13: 8199     16344     7       10          70% Bucket 14: 16413    32572     1       1           100%

redirects: min      max       edits   outliers    edit% Bucket 3: 11        15        2       184          1.1% Bucket 4: 16        31        3       581          0.5% Bucket 5: 32        63        5       303          1.7% Bucket 6: 64        127       1       133          0.8% Bucket 7: 128       255       0        53          0.0% Bucket 8: 256       511       0        30          0.0% Bucket 9: 512       1023      0        10          0.0% Bucket 10: 1024     2047      0         2          0.0% Bucket 11: 2048     4095      1         3         33.3% Bucket 12: 4096     8181      0         0           - Bucket 13: 8199     16344     0         1          0.0%

"Sizable" Outliers
If we consider some minimum threshold for "sizable" outliers, the stats improve, but are not overwhelmingly better. min_size is the minimum size of the outlier (in page views), regardless of the size of the median number of page views. edits, outliers, and edit% are for the outlier sample. For redirects, the sample is probably too small to be meaningful, but the results are presented here anyway.

Even if we set the bar fairly high (e.g., 5000 page views for a "sizable" outlier) we will only expect to see edits on the same day for about half of outlier pages.

articles: min_size   edits   outliers    edit% 100       480     4130        11.6%    500        246     1261        19.5%   1000        185      678        27.3%   2000        120      342        35.1%   5000         65      137        47.4%  10000         33       57        57.9%  25000          7       13        53.8%  50000          2        3        66.7%

redirects: min_size   edits   outliers    edit% 100       2       367          0.5%   500        1        63          1.6%  1000        1        31          3.2%  2000        1        13          7.7%  5000        1         6         16.7% 10000        1         4         25.0% 25000        0         2          0.0% 50000        0         1          0.0%

Week by Week Changes
One proposal for smoothing out the outliers is to use a weekly average. The question then is how much it matters whether we do updates daily, weekly, or less often.

Using a 1-in-20 subset of the pages I have, I computed the 7-day average for weeks 0 through 3 (September 1-7, 2-14, 15-21, and 22-28). All missing page view counts (≤10) were replaced with 0.

I compared the averages for week 0 against each of week 1, week 2, and week 3 in order to gauge changes over a 1-, 2-, or 3-week interval.

The charts below shows the changes from week 0 to each of the other three weeks. The red line is x=y, and the blue lines represent a change by a factor of 1.25 (increase or decrease) from x=y. The dashed grey lines are x=500 and y=500.

This is more dispersion as the gap between samples increases, and Halley's Comet (see above) can be seen rising just above and to the left of (3,3). There is also a shift of mass away from the line x=y towards the point (1,0) in weeks 2 and 3, indicating that some pages had decreased views from week to week.. However, these are for very low total page counts.

For page counts above 500, there are very few points outside the 1.25x lines.



Weekly average page view comparison for week 0 vs weeks 1-3 of Sept 2015.

R code:

qplot(log10(wk0), log10(wk1), size=I(1), alpha=I(0.1), shape=I(1)) + ggtitle("Sept 2015 Week 0 vs Week 1 (log/log)") + stat_function(fun=function(x)(x), geom="line", alpha=0.7, color="red") + stat_function(fun=function(x)(x + log10(1.25)), geom="line", alpha=0.7, color="blue") + stat_function(fun=function(x)(x + log10(1/1.25)), geom="line", alpha=0.7, color="blue") + geom_vline(xintercept = log10(500), alpha=0.2, linetype = "longdash") + geom_hline(yintercept = log10(500), alpha=0.2, linetype = "longdash")

In order to get a better idea of how things change by week, I grouped weekly counts into the same Binary Buckets as before and plotted the points (with significant jitter to spread them around). Each square in the charts below is made up individual points that should be plotted on the nearest whole number intersection, but have been spread to better show relative density.



Weekly average page view comparison for week 0 vs week 1-3 of Sept 2015, using log2 buckets.

R code:

qplot(jitter(bwk0,1,0.45), jitter(bwk1,1,0.45), size=I(1/6)) + ggtitle("Sept 2015 Week 0 vs Week 1 (by Binary Bucket)")

For those who prefer tables of numbers, here is the same info in tabular format (presented in the same orientation as the charts above).

Week 0 vs Week 1:

16                                                                                               1     2 15                                                                                          1     5      14                                                                                    1     2     1      13                                          1     1                 1     1     12    23    1            12                              2     2           1     2     1     6     30    85    4                  11      1           1           2                 2     9     7     77    312   32    3     2            10      1     1     2                 1     1     9     15    184   677   72    5                        9       4           2     2     5     13    12    37    304   1522  121   8     1           1            8       5                 9     14    15    53    546   2736  260   12    1     1     1                  7       15    4     4     10    27    102   896   4647  411   29    8     2                              6       24    9     14    18    95    1468  7638  703   58    11    4     2           1                  5       59    25    38    160   2398  11685 1106  67    14    5           1                              4       178   102   356   3595  15891 2023  69    16    2     4                                          3       929   1146  3826  9499  3022  105   18    5     1     1     1                                    2       4887  3915  5755  3461  311   31    10    2                                                      1       16902 5767  3776  922   79    17    4                 1     1                                    0       41656 9921  3358  708   106   24    10          1     2     1                                    wk1/wk0 0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16

Week 0 vs Week 2:

16                                         1                                                     1      15                                                                        1                 1     4     2 14                                                1                       1           1           1      13                                    1                       1     3     2     12    19    2     1      12      1                             1     3     1     2     2     10    34    72    7     1            11      2     1                 2     3     1     2     7     10    80    277   40    3     2            10      1     1                 2     1     7     11    29    232   622   92    6                        9       3                       5     9     15    41    304   1406  142   11    4     1                  8       6     1     2     6     17    21    70    560   2551  289   26    5     1                        7       20    5     5     13    34    111   865   4329  510   39    16                1                  6       30    5     19    36    149   1298  7019  930   104   27    4     3                 1            5       77    22    52    147   1816  10620 1568  107   27    10    2     3     1                        4       203   98    271   2242  14027 2959  156   32    12    6                                          3       694   590   2030  7617  4512  230   44    11    5     1     1                 1                  2       2127  1616  3524  4299  750   101   29    5     1                                                1       4329  2265  2950  1659  254   45    11    3                                                      0       57168 16286 8279  2365  384   86    29    3     1     4     3                                    wk2/wk0 0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16

Week 0 vs Week 3:

16                                                     1                                                15                                                                                                2     2 14                                                      2           1           1           2     4      13                                          1           1     1           5     11    19                 12                                          1     1     1     6     15    32    67    5     2     1      11      1                 1           1     1     2     7     13    78    263   44    4     1            10      3     2                 2     2     5     17    22    228   613   96    8     1     1            9       4           2     2     3     12    15    53    320   1341  140   20    2     1                  8       7     1           3     11    42    81    538   2433  335   36    5     1     1                  7       27    3     2     18    41    107   935   4196  605   54    11    1           1                  6       34    14    10    33    169   1370  6771  1024  105   21    9     4     2           1            5       90    28    54    146   1868  10274 1665  135   28    13    1     2                              4       243   143   339   2229  13467 3103  201   41    17    7           1           1                  3       769   566   1892  7280  4693  281   57    13    5     3     1                                    2       2081  1497  3392  4168  829   78    18    3     1     2                                          1       3818  2044  2792  1764  259   49    14    5     1     1     1                                    0       57584 16592 8649  2740  610   167   53    8     4     2     3                                    wk3/wk0 0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16

For page counts above 32 (Bucket 6 and up), there are very few items that change by more than a factor of two in 1, 2, or 3 weeks.

We should assess the impact of changes in page view stats on any scoring algorithm we consider. Does a factor of 2 change in page views have a significant impact in ranking? What about a factor of 4? If moderate changes don't affect the top ranked items, we don't need to worry too much about most blips.

PageviewAPI
Analytics now offers a page view API. It currently can be a day or so behind, but may offer a better way of catching important changes in page views. It's possible to get a list of the top 1000 most-viewed pages for a day (e.g., for Jan 20, 2016). If we ask nicely, we might be able to get more from them, if we promise to ask only once or twice a day.

Based on a quick review of my data for September, the top 50K pages for a day would give us all pages that have >1000 page views, which may very well be everything worth taking note of for scoring.

Conclusions & Recommendations

 * We can't reliably catch day-by-day outliers by using the page view information that comes along with edits because not enough edits happen, esp. on pages that are redirects. It may not be worth it to capture that stream of page view data since it is going to be quite voluminous but also very incomplete (about 30% of outliers above 2K page views—less on smaller page view volumes).


 * Weekly averages (rather than day-by-day counts) don't usually move that much (i.e., by more than a factor of 2). If we can capture daily or weekly page view stats, that should keep us reasonably up-to-date overall, esp. if these moderate swings don't affect scoring much.


 * If we are doing, say, weekly or less frequent page view updates for a rolling daily average, we could also gather daily statistics from the page view API and store the high mark over the last 3-7 for the top 1K to 50K most-viewed articles. The ranking algorithm could use either the rolling daily average or the high mark (which ever is higher).


 * For "Trending" topics, looking at the top 1K page views every hour (unfortunately not currently available through the PageviewAPI) would be the best way to catch suddenly trending topics if we want to be more responsive, but it isn't clear that it's worth it.