# User:TJones (WMF)/Notes/How Wrong Would Using Out of Date Page View Data Be?

January 2016 — See TJones_(WMF)/Notes for other projects.

## Introduction

We want to integrate page view information into the scoring algorithms we use for both the completion suggester and our regular search results. Our initial idea is to update this page view information only when doing normal document updates after a page edit (for technical reasons, page view data is available/provided when a page is edited). We need to determine whether this page view data will be "good enough" or whether we need to do something more.

The goal of this analysis is to look at the distribution of how often pages are edited and compare that against how much page view data tends to change over time.

- We want to get a sense of how many outliers there are, and how big the spikes are (and report #/% of spikes).
- We want to look at week-to-week changes in the daily average over a week, with a one-week gap, two-week gap, and three-week gap, to get a sense of how fast things change from week to week (and report #/% of significant changes by time span).
- We want to calculate the #/% of outlier spikes that do not have an edit the same day.

See Phabricator Ticket, too.

## Data Collection

I used Hive on stat1002 to get view counts for all pages, by day, for September 2015. I limited output to 50M rows to avoid the very, very long tail. (I started out with just 5M rows, but it wasn't enough!)

**The HQL:**

```sql
use wmf;
SELECT SUM(view_count) AS views, year, month, day, project, page_title
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 9
  AND project = 'en.wikipedia'
  AND agent_type = 'user'
  AND page_title NOT LIKE '%\:%'
GROUP BY year, month, day, project, page_title
ORDER BY views DESC
LIMIT 50000000;
```

The last row had 10 page views, and since the list of pages with 10 page views wasn't complete, I deleted all of those rows, leaving 49,181,835 rows for pages with 11 or more page views on some day in September 2015.

Some stats:

- There are 49,181,835 examples of pages with 11 or more views on some day in September 2015.
- There are 3,638,889 distinct pages.

## Outliers

I looked for outliers in page views for each page using the interquartile range (IQR). High outliers are those that are "far" above the third-quartile page view count, where "far" means more than 1.5x the difference between the first and third quartiles. For example, if the first-quartile daily page view count is 1000, and the third-quartile count is 2000, then an outlier is anything more than 1.5x the difference (1000 * 1.5 = 1500) above the third quartile (2000 + 1500 = 3500). Read the IQR wiki page for a more detailed explanation. *(Also note that Mikhail suggested using Median Absolute Deviation over IQR to find outliers, but I didn't have the time to code it up and, more importantly, to contemplate how best to handle missing values. At least he tried to save me from myself.)*
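The IQR rule above can be sketched as follows (illustrative Python, not the code used for this analysis; the nearest-rank quartile convention here is an assumption and may differ from the original):

```python
# Illustrative sketch of the 1.5*IQR high-outlier rule: a day's view
# count is a high outlier if it exceeds Q3 + 1.5 * (Q3 - Q1).

def iqr_high_outliers(views):
    """Return the view counts that are high outliers under the 1.5*IQR rule."""
    s = sorted(views)
    n = len(s)
    # Simple nearest-rank quartiles; the original analysis may have
    # used a different quartile convention.
    q1 = s[n // 4]
    q3 = s[(3 * n) // 4]
    threshold = q3 + 1.5 * (q3 - q1)
    return [v for v in views if v > threshold]

# Worked example from the text: Q1 = 1000, Q3 = 2000 gives a threshold
# of 2000 + 1.5 * 1000 = 3500, so a day with 3600 views is an outlier.
```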

Because the data is truncated for page counts below 11 page views, I had to extrapolate for incomplete data. Otherwise we would find many "outliers" for pages with small sample sizes. For example, page views of 15, 16, 17, and 25 would make 25 an "outlier", which seems unlikely, and also something we wouldn't feel bad about missing.

- If data was missing (i.e., below 11 page views) for more than 3/4 of the days in September, I assumed the first quartile page view count was 0 and the third quartile page count was 10. That is, maximize the size of the interquartile range (10), so that at least 25 page views are needed to count as an "outlier". (This undercounts outliers, but at these low page view counts, it doesn't really matter.)

- If data was missing (i.e., below 11 page views) for more than 1/4 (but less than 3/4) of the days in September, I assumed the first quartile page view count was 0 (and the third quartile remains the third quartile of the available data). That is, since the IQR is then the whole third quartile, page view counts more than 2.5x the third quartile of the available data would count as outliers.
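The two extrapolation rules above can be sketched like this (illustrative Python; `days_in_month=30` for September, and the nearest-rank quartiles are an assumption, not the original convention):

```python
# Sketch of the missing-data adjustment: "missing" means the day fell
# below the 11-view reporting threshold. Returns (q1, q3) to feed into
# the Q3 + 1.5 * (Q3 - Q1) outlier rule.

def adjusted_quartiles(observed, days_in_month=30):
    missing = days_in_month - len(observed)
    s = sorted(observed)
    if missing > days_in_month * 3 / 4:
        # Over 3/4 of days missing: assume Q1 = 0 and Q3 = 10, so the
        # outlier threshold becomes 10 + 1.5 * 10 = 25 views.
        return 0, 10
    if missing > days_in_month / 4:
        # Over 1/4 of days missing: Q1 = 0; Q3 from the available data.
        return 0, s[(3 * len(s)) // 4]
    # Enough data: quartiles straight from the observations.
    return s[len(s) // 4], s[(3 * len(s)) // 4]
```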

I also looked at ignoring outliers below a minimum page view count. For example, while a page view count of 99 can technically be an outlier, failing to catch it is not a serious problem (compared to missing a spike of 500,000 page views). I counted outliers using IQR, and among those outliers noted those with at least 100, 500, 1K, 2K, and 5K daily page views.

**Outlier counts:**

- 1,856,745 using IQR.
- 804,271 with IQR + minimum page views of 100 on that day.
- 233,694 with IQR + minimum page views of 500 on that day.
- 123,533 with IQR + minimum page views of 1,000 on that day.
- 61,361 with IQR + minimum page views of 2,000 on that day.
- 21,333 with IQR + minimum page views of 5,000 on that day.

It is interesting to note that most "outliers" (>93%) don't even get 1,000 page views.

### Pretty Pictures—Median vs Max

One of my goals for this analysis was to generate a plot of median value vs maximum value, to get a sense of how many outliers there are, and "how big the spikes are." Turns out a significant portion of pages have outliers, so the graphs aren't as clear or explanatory as I had hoped, but they are presented here anyway.

There are 3,638,889 distinct pages in the collection, and 1,002,680 (27.6%) have ≥1 outliers.

The Main Page and a few other pages have extremely high page view counts, so I've used a log/log scale. The graph includes a faint line along x=y since the aspect ratio of the graph isn't 1:1.

*Plot of median vs max page views (log/log scale) for enwiki pages in September 2015.*

**R code:**

```r
qplot(log10(med), log10(max), col=out, alpha=I(1/5), size=I(1)) +
  stat_function(fun=function(x)(x), geom="line", alpha=0.1) +
  ggtitle("Sept 2015 Outliers - Median vs Max (log/log)")
```

Blue dots are pages with ≥1 IQR outliers, and black dots have no outliers. Outliers on or near the x=y line have low counts in the sample (e.g., only one daily page view count over 10).

The plot below shows the detail of the lower left quadrant of the graph (linear scale). The dots are more numerous and overlap more, and so they are more transparent in this view, and have had jitter applied so they don't all overlap.

*Detail of lower quadrant of plot of median vs max page views (linear scale) for enwiki pages in September 2015.*

**R code:**

```r
qplot(jitter(med,1,0.3), jitter(max,1,0.3), col=out,
      xlim=c(0,100), ylim=c(0,100), alpha=I(1/10), size=I(1)) +
  stat_function(fun=function(x)(x), geom="line", alpha=0.1) +
  ggtitle("Sept 2015 Outliers - Median vs Max - Detail")
```

The obvious non-outlier cutoff at 25 (the "Bermuda Triangle") is for pages with data missing for 3/4 of the days in September, which limits their ability to be considered "outliers". See above. I'm not terribly concerned by the lack of outliers below 25 views in a day. Even if such a page view count is an outlier, it's still not terribly significant.

The cluster of dots above (3.3) on the log/log graph ("Halley's Comet") is unexplained.

### Distributions of Outliers by Median Page Views

I also divided the pages into logarithmic (base 2) buckets based on the median page views for each page, in anticipation that outliers would occur at different rates for pages with different numbers of "typical" daily page views. This turned out not to be true, and about 60-70% of pages in each well-represented bucket have outliers.

Note that there are no buckets below Bucket 3 because of the minimum page view limit of 11. Also, Bucket 3's outlier count has been suppressed by the missing data extrapolation above.

**Base-2 Buckets, by Median Page Views**

| Bucket | min | max | pages | outliers | out% |
|---|---|---|---|---|---|
| 3 | 11 | 15 | 2024918 | 148174 | 7.3% |
| 4 | 16 | 31 | 860004 | 383357 | 44.6% |
| 5 | 32 | 63 | 314590 | 201127 | 63.9% |
| 6 | 64 | 127 | 190513 | 118576 | 62.2% |
| 7 | 128 | 255 | 116662 | 71343 | 61.2% |
| 8 | 256 | 511 | 67463 | 41089 | 60.9% |
| 9 | 512 | 1023 | 35417 | 21491 | 60.7% |
| 10 | 1024 | 2047 | 18821 | 10761 | 57.2% |
| 11 | 2048 | 4095 | 7351 | 4631 | 63.0% |
| 12 | 4096 | 8181 | 2400 | 1619 | 67.5% |
| 13 | 8199 | 16344 | 598 | 410 | 68.6% |
| 14 | 16413 | 32572 | 118 | 87 | 73.7% |
| 15 | 33176 | 62681 | 29 | 12 | 41.4% |
| 16 | 76659 | 84088 | 2 | 2 | 100.0% |
| 17 | 155009 | 155009 | 1 | 0 | 0.0% |
| 19 | 840562 | 840562 | 1 | 1 | 100.0% |
| 24 | 18545589 | 18545589 | 1 | 0 | 0.0% |
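The bucket boundaries above correspond to floor(log2(median page views)); a minimal sketch (illustrative Python, not the analysis code):

```python
import math

# A page with median page views m goes in bucket floor(log2(m)), so
# bucket 3 covers 8-15 (the data starts at 11 because of the
# minimum-view cutoff), bucket 4 covers 16-31, bucket 5 covers 32-63,
# and so on.

def base2_bucket(median_views):
    return int(math.log2(median_views))
```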

## Edits

For my initial analysis, I took a 1-in-182 sample of the full list of outliers, to get roughly 10K pages to analyze more closely. There were 10,039 outliers in the sample (covering 9,983 unique pages, so 56 pages each had 2 instances of outliers in the sample).

For the 9,983 unique pages, I pulled revision information for those pages using the following URL:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=timestamp&rvlimit=max&format=json&titles=<TITLE>
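The query above can be built per title like this (a small sketch; only the `revisions_url` helper is new, the parameters are exactly those in the URL):

```python
from urllib.parse import urlencode

# Build the MediaWiki API revisions query for a single page title.
def revisions_url(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",   # only revision timestamps are needed
        "rvlimit": "max",
        "format": "json",
        "titles": title,
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)
```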

A number of pages (30) are missing, and seem to have been deleted since September. Those were dropped from the sample.

1300 outliers (12.9% of outliers, for 1292 unique pages) are redirects, and were counted separately.

### Same Day Edits

Comparing the edit history to the outliers, we see that most pages don't have edits on the day they have spikes in page views:

- same-day edits for 797 of 8739 article outliers (9.1%)
- same-day edits for 12 of 1300 redirect outliers (0.9%)

Broken down by Base-2 Buckets (Median Page Views):

*min* and *max* are for the buckets as a whole (from above), *edits, outliers,* and *edit%* are for the outlier sample.

**articles:**

| Bucket | min | max | edits | outliers | edit% |
|---|---|---|---|---|---|
| 3 | 11 | 15 | 76 | 761 | 10% |
| 4 | 16 | 31 | 217 | 3010 | 7.2% |
| 5 | 32 | 63 | 130 | 1840 | 7.1% |
| 6 | 64 | 127 | 92 | 1297 | 7.1% |
| 7 | 128 | 255 | 81 | 809 | 10% |
| 8 | 256 | 511 | 62 | 510 | 12.2% |
| 9 | 512 | 1023 | 62 | 282 | 22% |
| 10 | 1024 | 2047 | 38 | 129 | 29.5% |
| 11 | 2048 | 4095 | 20 | 65 | 30.8% |
| 12 | 4096 | 8181 | 11 | 25 | 44% |
| 13 | 8199 | 16344 | 7 | 10 | 70% |
| 14 | 16413 | 32572 | 1 | 1 | 100% |

**redirects:**

| Bucket | min | max | edits | outliers | edit% |
|---|---|---|---|---|---|
| 3 | 11 | 15 | 2 | 184 | 1.1% |
| 4 | 16 | 31 | 3 | 581 | 0.5% |
| 5 | 32 | 63 | 5 | 303 | 1.7% |
| 6 | 64 | 127 | 1 | 133 | 0.8% |
| 7 | 128 | 255 | 0 | 53 | 0.0% |
| 8 | 256 | 511 | 0 | 30 | 0.0% |
| 9 | 512 | 1023 | 0 | 10 | 0.0% |
| 10 | 1024 | 2047 | 0 | 2 | 0.0% |
| 11 | 2048 | 4095 | 1 | 3 | 33.3% |
| 12 | 4096 | 8181 | 0 | 0 | - |
| 13 | 8199 | 16344 | 0 | 1 | 0.0% |

#### "Sizable" Outliers[edit]

If we consider some minimum threshold for "sizable" outliers, the stats improve, but are not overwhelmingly better. *min_size* is the minimum size of the outlier (in page views), regardless of the size of the median number of page views. *edits, outliers,* and *edit%* are for the outlier sample. For redirects, the sample is probably too small to be meaningful, but the results are presented here anyway.

Even if we set the bar fairly high (e.g., 5000 page views for a "sizable" outlier) we will only expect to see edits on the same day for about half of outlier pages.

**articles:**

| min_size | edits | outliers | edit% |
|---|---|---|---|
| 100 | 480 | 4130 | 11.6% |
| 500 | 246 | 1261 | 19.5% |
| 1000 | 185 | 678 | 27.3% |
| 2000 | 120 | 342 | 35.1% |
| 5000 | 65 | 137 | 47.4% |
| 10000 | 33 | 57 | 57.9% |
| 25000 | 7 | 13 | 53.8% |
| 50000 | 2 | 3 | 66.7% |

**redirects:**

| min_size | edits | outliers | edit% |
|---|---|---|---|
| 100 | 2 | 367 | 0.5% |
| 500 | 1 | 63 | 1.6% |
| 1000 | 1 | 31 | 3.2% |
| 2000 | 1 | 13 | 7.7% |
| 5000 | 1 | 6 | 16.7% |
| 10000 | 1 | 4 | 25.0% |
| 25000 | 0 | 2 | 0.0% |
| 50000 | 0 | 1 | 0.0% |

## Week by Week Changes

One proposal for smoothing out the outliers is to use a weekly average. The question then is how much it matters whether we do updates daily, weekly, or less often.

Using a 1-in-20 subset of the pages I have, I computed the 7-day average for weeks 0 through 3 (September 1-7, 8-14, 15-21, and 22-28). All missing page view counts (≤10) were replaced with 0.

I compared the averages for week 0 against each of week 1, week 2, and week 3 in order to gauge changes over a 1-, 2-, or 3-week interval.
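The weekly averaging can be sketched as follows (illustrative Python, not the analysis code; the input format here is an assumption):

```python
# 7-day averages for weeks 0-3 of September (days 1-7, 8-14, 15-21,
# 22-28), with missing daily counts (the truncated <=10-view days)
# treated as 0.

def week_averages(daily_views):
    """daily_views: dict mapping day-of-month (1-30) to view count;
    days absent from the dict are the truncated below-11-view days."""
    weeks = []
    for start in (1, 8, 15, 22):
        total = sum(daily_views.get(day, 0) for day in range(start, start + 7))
        weeks.append(total / 7)
    return weeks
```

The week-over-week comparison is then just the ratio (or log difference) of a later week's average to week 0's.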

The charts below show the changes from week 0 to each of the other three weeks. The red line is x=y, and the blue lines represent a change by a factor of 1.25 (increase or decrease) from x=y. The dashed grey lines are x=500 and y=500.

There is more dispersion as the gap between samples increases, and Halley's Comet (see above) can be seen rising just above and to the left of (3,3). There is also a shift of mass away from the line x=y towards the point (1,0) in weeks 2 and 3, indicating that some pages had decreased views from week to week. However, these are for very low total page counts.

For page counts above 500, there are very few points outside the 1.25x lines.

*Weekly average page view comparison for week 0 vs weeks 1-3 of Sept 2015.*

**R code:**

```r
qplot(log10(wk0), log10(wk1), size=I(1), alpha=I(0.1), shape=I(1)) +
  ggtitle("Sept 2015 Week 0 vs Week 1 (log/log)") +
  stat_function(fun=function(x)(x), geom="line", alpha=0.7, color="red") +
  stat_function(fun=function(x)(x + log10(1.25)), geom="line", alpha=0.7, color="blue") +
  stat_function(fun=function(x)(x + log10(1/1.25)), geom="line", alpha=0.7, color="blue") +
  geom_vline(xintercept = log10(500), alpha=0.2, linetype = "longdash") +
  geom_hline(yintercept = log10(500), alpha=0.2, linetype = "longdash")
```

In order to get a better idea of how things change by week, I grouped weekly counts into the same base-2 buckets as before and plotted the points (with significant jitter to spread them around). Each square in the charts below is made up of individual points that would otherwise all be plotted at the nearest whole-number intersection, but have been spread to better show relative density.

*Weekly average page view comparison for week 0 vs week 1-3 of Sept 2015, using log2 buckets.*

**R code:**

```r
qplot(jitter(bwk0,1,0.45), jitter(bwk1,1,0.45), size=I(1/6)) +
  ggtitle("Sept 2015 Week 0 vs Week 1 (by Binary Bucket)")
```

For those who prefer tables of numbers, here is the same info in tabular format (presented in the same orientation as the charts above).

**Week 0 vs Week 1:**

16 1 2 15 1 5 14 1 2 1 13 1 1 1 1 12 23 1 12 2 2 1 2 1 6 30 85 4 11 1 1 2 2 9 7 77 312 32 3 2 10 1 1 2 1 1 9 15 184 677 72 5 9 4 2 2 5 13 12 37 304 1522 121 8 1 1 8 5 9 14 15 53 546 2736 260 12 1 1 1 7 15 4 4 10 27 102 896 4647 411 29 8 2 6 24 9 14 18 95 1468 7638 703 58 11 4 2 1 5 59 25 38 160 2398 11685 1106 67 14 5 1 4 178 102 356 3595 15891 2023 69 16 2 4 3 929 1146 3826 9499 3022 105 18 5 1 1 1 2 4887 3915 5755 3461 311 31 10 2 1 16902 5767 3776 922 79 17 4 1 1 0 41656 9921 3358 708 106 24 10 1 2 1 wk1/wk0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

**Week 0 vs Week 2:**

16 1 1 15 1 1 4 2 14 1 1 1 1 13 1 1 3 2 12 19 2 1 12 1 1 3 1 2 2 10 34 72 7 1 11 2 1 2 3 1 2 7 10 80 277 40 3 2 10 1 1 2 1 7 11 29 232 622 92 6 9 3 5 9 15 41 304 1406 142 11 4 1 8 6 1 2 6 17 21 70 560 2551 289 26 5 1 7 20 5 5 13 34 111 865 4329 510 39 16 1 6 30 5 19 36 149 1298 7019 930 104 27 4 3 1 5 77 22 52 147 1816 10620 1568 107 27 10 2 3 1 4 203 98 271 2242 14027 2959 156 32 12 6 3 694 590 2030 7617 4512 230 44 11 5 1 1 1 2 2127 1616 3524 4299 750 101 29 5 1 1 4329 2265 2950 1659 254 45 11 3 0 57168 16286 8279 2365 384 86 29 3 1 4 3 wk2/wk0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

**Week 0 vs Week 3:**

16 1 15 2 2 14 2 1 1 2 4 13 1 1 1 5 11 19 12 1 1 1 6 15 32 67 5 2 1 11 1 1 1 1 2 7 13 78 263 44 4 1 10 3 2 2 2 5 17 22 228 613 96 8 1 1 9 4 2 2 3 12 15 53 320 1341 140 20 2 1 8 7 1 3 11 42 81 538 2433 335 36 5 1 1 7 27 3 2 18 41 107 935 4196 605 54 11 1 1 6 34 14 10 33 169 1370 6771 1024 105 21 9 4 2 1 5 90 28 54 146 1868 10274 1665 135 28 13 1 2 4 243 143 339 2229 13467 3103 201 41 17 7 1 1 3 769 566 1892 7280 4693 281 57 13 5 3 1 2 2081 1497 3392 4168 829 78 18 3 1 2 1 3818 2044 2792 1764 259 49 14 5 1 1 1 0 57584 16592 8649 2740 610 167 53 8 4 2 3 wk3/wk0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

For page counts above 32 (Bucket 6 and up), there are very few items that change by more than a factor of two in 1, 2, or 3 weeks.

We should assess the impact of changes in page view stats on any scoring algorithm we consider. Does a factor of 2 change in page views have a significant impact in ranking? What about a factor of 4? If moderate changes don't affect the top ranked items, we don't need to worry too much about most blips.
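As one way of reasoning about this, consider a log-scaled popularity boost (a hypothetical scoring choice for illustration, not the actual completion-suggester scoring function): under log scaling, a factor-of-2 swing in daily views shifts the boost by a constant log10(2) ≈ 0.3 regardless of the page's popularity level, so moderate blips would barely reorder results.

```python
import math

# Hypothetical log-scaled popularity boost (illustrative only; not the
# actual scoring function). Doubling page views moves this boost by
# roughly log10(2) ~= 0.3 no matter the starting level.

def popularity_boost(page_views):
    return math.log10(1 + page_views)

delta = popularity_boost(2000) - popularity_boost(1000)
# delta is ~log10(2), the same shift a 100K -> 200K spike would produce.
```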

## PageviewAPI

Analytics now offers a page view API. It currently can be a day or so behind, but may offer a better way of catching important changes in page views. It's possible to get a list of the top 1000 most-viewed pages for a day (e.g., for Jan 20, 2016). If we ask nicely, we might be able to get more from them, if we promise to ask only once or twice a day.

Based on a quick review of my data for September, the top 50K pages for a day would give us all pages that have >1000 page views, which may very well be everything worth taking note of for scoring.
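Fetching the daily top-1000 list would look something like this (a sketch assuming the Wikimedia REST API's per-project, per-day "top" endpoint shape; worth verifying against the current Analytics documentation before relying on it):

```python
# Build the Pageview API URL for the top-1000 most-viewed pages of a
# given project and day, e.g. en.wikipedia on Jan 20, 2016.

def top_pages_url(project, year, month, day):
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"{project}/all-access/{year}/{month:02d}/{day:02d}"
    )
```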

## Conclusions & Recommendations

**We can't reliably catch day-by-day outliers by using the page view information that comes along with edits** because not enough edits happen, especially on pages that are redirects. It may not be worth capturing that stream of page view data, since it would be quite voluminous but also very incomplete (catching about 30% of outliers above 2K page views, and fewer at smaller page view volumes).

**Weekly averages** (rather than day-by-day counts) **don't usually move that much** (i.e., rarely by more than a factor of 2). If we can capture daily or weekly page view stats, that should keep us reasonably up to date overall, especially if these moderate swings don't affect scoring much.

- If we are doing, say, weekly or less frequent page view updates for a rolling daily average, we **could also gather daily statistics from the page view API and store the high mark over the last 3-7 days for the top 1K to 50K most-viewed articles.** The ranking algorithm could use either the rolling daily average or the high mark (whichever is higher).

**For "Trending" topics, looking at the top 1K page views every hour**(unfortunately not currently available through the PageviewAPI)**would be the best way to catch suddenly trending topics**if we want to be more responsive, but it isn't clear that it's worth it.