User:MPopov (WMF)/SEO/sameAs test

Group:	Readers Web and Product Analytics
Start:	2018-11-14
End:	2019-06
Team members:	Olga Vasileva (Program Manager), Stephen Niedzielski (Software Engineer), Tilman Bayer (Data Analyst), Mikhail Popov (Data Analyst)
Lead:	Olga Vasileva
Management:	Olga Vasileva (program), Sam Smith (engineering), Kate Zimmerman (analysis)

This page describes Wikimedia Audiences's work to improve Wikipedia's presence in search results by adding a "sameAs" meta property to Wikipedia articles that points to the corresponding Wikidata item (tracked in T209306 on Phabricator). Using hierarchical regression modelling of search engine-referred daily traffic to 269 editions of Wikipedia, we estimated the effect of sameAs to be a 1.4% increase (95% CI: 0.7–2.1%) in average search engine-referred page views per day. This suggests that rolling the feature out to 100% of pages (where applicable) and to all Wikipedias would be a beneficial and impactful decision.

Introduction

Following an SEO consultation, one of the recommendations to improve Wikimedia's presence in search results was to add structured data in the form of the Schema.org "sameAs" meta property, which is defined as:

URL of a reference Web page that unambiguously indicates the item's identity. E.g. the URL of the item's Wikipedia page, Wikidata entry, or official website.
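To make the recommendation concrete, here is a minimal sketch of what such structured data could look like when emitted as Schema.org JSON-LD. The helper name, the use of JSON-LD, the "Article" type, and the example article and Q-item are illustrative assumptions; they are not taken from the deployed implementation.

```python
import json


def build_same_as_markup(article_url, wikidata_id=None):
    """Hypothetical helper: build a Schema.org JSON-LD string for an article.

    The sameAs property is only included when the article has an
    associated Wikidata Q-item, mirroring the "if applicable" condition.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "Article",  # assumed type, for illustration only
        "url": article_url,
    }
    if wikidata_id is not None:
        data["sameAs"] = f"http://www.wikidata.org/entity/{wikidata_id}"
    return json.dumps(data, indent=2)


# Example: the English Wikipedia article on Douglas Adams (Wikidata item Q42).
markup = build_same_as_markup("https://en.wikipedia.org/wiki/Douglas_Adams", "Q42")
print(markup)
```

An article without a Q-item would simply produce the same markup without the sameAs key.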

We decided to assess the possible impact of this recommendation in a scientific and statistically valid way, so we designed a randomized controlled experiment in which Wikipedia articles would be randomly assigned to either the control group or the treatment group, where articles in the treatment group would have the sameAs meta element if applicable. We also came up with a plan to guide our decision making:

Actions based on change in average search engine-referred page views per day
Change	Action	Notes
>1% decrease	remove sameAs
0-1% decrease	discussion, most likely remove	discuss if difference is small enough to warrant full rollout
0-1% increase	discussion, most likely rollout	discuss if difference is small enough to warrant full rollout
>1% increase	rollout	rollout feature to 100%, proceed with other wikis
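The decision plan above can be written as a small lookup function. This is only a sketch: the function name is hypothetical, and the handling of the exact boundary values (0% and ±1%) is an assumption, since the table does not specify which row they fall into.

```python
def decide_action(change_pct):
    """Map the estimated % change in average search engine-referred
    page views per day to the planned action.

    Boundary handling (exactly -1, 0, or 1) is an assumption.
    """
    if change_pct < -1.0:
        return "remove sameAs"
    if change_pct < 0.0:
        return "discussion, most likely remove"
    if change_pct <= 1.0:
        return "discussion, most likely rollout"
    return "rollout"


for change in (-2.0, -0.5, 0.5, 1.4):
    print(change, "->", decide_action(change))
```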

The test was gradually deployed starting on 14 November 2018, initially to a small percentage of pages to make sure that nothing broke (cf. T208755). By 20 November, all articles on each tested Wikipedia were in the test. As of 5 March 2019, the test is still deployed, pending a final decision based on the results of the analysis described here.

Methods

Design of Experiment

Certain languages were excluded from the test because they were reserved for a parallel SEO test, but the sameAs test was rolled out to 270 editions of Wikipedia. On each of those Wikipedias, articles were randomly partitioned into a control group and a treatment group. Test group assignment was performed using the random number that each page has in the MediaWiki database (cf. page_random).
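A minimal sketch of how a per-page random number can drive group assignment is shown below. It assumes page_random is a value between 0 and 1 and that pages below a threshold go to the treatment group; the function name, the 50/50 split, and the direction of the comparison are assumptions for illustration, not the deployed logic.

```python
import random


def assign_group(page_random, treatment_share=0.5):
    """Hypothetical assignment rule based on a page's stored random number.

    Because page_random is fixed per page, the same page always lands
    in the same group, with no extra bookkeeping needed.
    """
    return "treatment" if page_random < treatment_share else "control"


# Simulated page_random values standing in for a wiki's articles.
rng = random.Random(0)
groups = [assign_group(rng.random()) for _ in range(10_000)]
treatment_fraction = groups.count("treatment") / len(groups)
print(f"share of pages in treatment: {treatment_fraction:.3f}")
```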

Articles in the control group did not have a sameAs meta property added, while articles in the treatment group had a sameAs meta property added if one was available (not all articles are associated with Wikidata Q-items). For this test, we were interested in measuring differences in overall traffic between the control and treatment groups. So if a page was in the treatment group but did not have a sameAs meta property added because it was not associated with a Wikidata Q-item, we still included search engine-referred traffic to it in our analysis.

Analysis

In our analysis we:

determined the assignment of pages using the January 2019 snapshot of MediaWiki pages in the Data Lake
calculated the total page views within each tested Wikipedia language, by test group
focused on search engine-referred traffic
excluded known spider traffic (as determined by User-Agent pattern matching), focusing on "user" traffic
split the traffic into "top 100 pages" and "less popular pages" (determined on a day-by-day basis) because traffic to the most popular pages fluctuates wildly and introduces a lot of variability
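The last two steps can be sketched as follows, assuming the input has already been restricted to search engine-referred "user" traffic. The record layout, the function name, and the choice to rank pages per wiki and day across both test groups are assumptions made for illustration.

```python
from collections import defaultdict


def less_popular_totals(rows, top_n=100):
    """Sum daily page views per (wiki, date, group), excluding each
    wiki-day's top_n most viewed pages.

    rows: iterable of (wiki, date, page, group, views) tuples.
    """
    by_wiki_day = defaultdict(list)
    for wiki, date, page, group, views in rows:
        by_wiki_day[(wiki, date)].append((page, group, views))

    totals = defaultdict(int)
    for (wiki, date), pages in by_wiki_day.items():
        # Rank pages by views for this wiki and day, then drop the top_n.
        ranked = sorted(pages, key=lambda record: record[2], reverse=True)
        for _page, group, views in ranked[top_n:]:
            totals[(wiki, date, group)] += views
    return dict(totals)


# Toy data with top_n=1 so the effect is visible on three pages.
example_rows = [
    ("xx", "2019-01-01", "A", "control", 50),
    ("xx", "2019-01-01", "B", "treatment", 10),
    ("xx", "2019-01-01", "C", "control", 5),
]
example_totals = less_popular_totals(example_rows, top_n=1)
print(example_totals)
```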

With the exception of low-traffic Wikipedias which had fewer than 100 visited articles per day, the analysis focused specifically on the more stable traffic to less popular pages. One Wikipedia was excluded because one of its test groups did not receive enough traffic, leaving us with 269 Wikipedias (which included English, Spanish, Japanese, Russian, German, Arabic, Thai, Hindi, Afrikaans, among others) to analyze.

Basic Check

When looking at the ratio of post-deployment to pre-deployment average page views per day, the treatment group had a strictly greater ratio than the control group in 159 languages. As an initial assessment, we specified the following hypothesis test of whether the post-pre ratio of traffic was better in the treatment group than in the control group, where $p$ is the probability that the treatment group's ratio exceeds the control group's ratio in a given language:

$$\begin{aligned}\text{H}_{0}&: p\leq 0.5\\ \text{H}_{a}&: p>0.5\end{aligned}$$

Performing a one-sided exact binomial test (also known as a sign test) on 159 successes (treatment > control) out of 269 total observations yields a p-value of 0.0017, which means that we can reject the null hypothesis with a high degree of confidence.
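The one-sided exact binomial test amounts to summing the upper tail of a Binomial(269, 0.5) distribution from 159 successes upward. The sketch below uses only the Python standard library; the function name is hypothetical, and the original analysis may have used a different tool.

```python
from math import comb


def binom_upper_tail(successes, n, p=0.5):
    """P(X >= successes) for X ~ Binomial(n, p): the p-value of a
    one-sided exact binomial (sign) test against H0: p <= 0.5."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


# 159 languages where treatment beat control, out of 269 languages.
p_value = binom_upper_tail(159, 269)
print(f"one-sided p-value: {p_value:.4f}")
```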

Inferring Impact

During exploratory data analysis (EDA), we saw that the two groups differed in traffic before deployment, which told us that our model of post-deployment traffic should adjust for pre-deployment traffic.

Since the data are highly right-skewed, we applied a log transformation to stabilize the variance and to yield residuals that are approximately normal, as our multilevel model assumes. We use multilevel modeling (also known as hierarchical or mixed-effects regression) because the observations are nested within languages, so we specify a per-language random intercept to identify the fixed effect of treatment more accurately. The final model is:

$$\begin{aligned}\log y_{i}&\sim \mathcal{N}(\alpha_{j[i]}+\beta_{x}\log x_{i}+\beta_{T}T_{i},\ \sigma_{y}^{2}),\quad i=1,\ldots,538\\ \alpha_{j}&\sim \mathcal{N}(\mu_{0},\ \sigma_{\alpha}^{2}),\quad j=1,\ldots,269\end{aligned}$$

where $y$ is the post-deployment average page views per day, $x$ is the pre-deployment average page views per day, $T$ is an indicator variable for the test group (0 = control, 1 = treatment), $\alpha _{j}$ is the random intercept for language $j$, and $\mu _{0}$ is the overall intercept. Each of the 269 languages contributes two observations (one per test group), which gives the 538 observations in total. Because the response variable is on the log scale, $e^{\beta _{T}}$ yields the multiplicative effect of treatment, and the 95% confidence interval can be estimated using the delta method.
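For reference, a first-order delta method approximation for the transformation $g(\beta_T) = e^{\beta_T}$ can be written as below. The use of the normal quantile 1.96 for the 95% interval is an assumption; the source does not state the exact computation that was used.

```latex
\begin{aligned}
\operatorname{Var}\!\left(e^{\hat{\beta}_{T}}\right)
  &\approx \left(e^{\hat{\beta}_{T}}\right)^{2}\operatorname{Var}\!\left(\hat{\beta}_{T}\right)
  \quad\Longrightarrow\quad
  \operatorname{SE}\!\left(e^{\hat{\beta}_{T}}\right)
  \approx e^{\hat{\beta}_{T}}\,\operatorname{SE}\!\left(\hat{\beta}_{T}\right),\\
\text{95\% CI}
  &\approx e^{\hat{\beta}_{T}} \pm 1.96\; e^{\hat{\beta}_{T}}\,\operatorname{SE}\!\left(\hat{\beta}_{T}\right).
\end{aligned}
```

The factor $e^{\hat{\beta}_{T}}$ is the derivative of $g$ evaluated at the estimate, which is what the delta method uses to rescale the standard error from the log scale.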

Results

Fitting the model to the data using the lmer function from the lme4 R package, we obtained the following estimates:

Parameter estimates
Parameter	Estimate	Standard Error	95% Confidence Interval
$\beta _{x}$ (pre-test)	1.012	0.004	(1.00, 1.02)
$\sigma _{y}$ (individual std. dev.)	0.04
$\sigma _{\alpha }$ (group std. dev.)	0.278
$\mu _{0}$ (overall intercept)	-0.157	0.023	(-0.203, -0.112)
$\beta _{T}$ (treatment)	0.014	0.003	(0.007, 0.021)
$e^{\beta _{T}}$ (multiplicative effect)	1.014	0.004	(1.007, 1.021)

That is, we estimate the impact of the sameAs meta property to be a 1.4% increase in search engine-referred traffic on average, after accounting for pre-test traffic within wikis, with a 95% confidence interval of a 0.7–2.1% increase. Based on these results and the decision plan laid out above, we recommend rolling out the feature to 100% of pages (where applicable) and to the remaining languages.
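The percentage figures follow from exponentiating the log-scale treatment estimate. The sketch below converts the rounded values reported in the table into percent changes. Note that it exponentiates the interval endpoints directly, which is a simple alternative to the delta method rather than the computation used in the original analysis; with these rounded inputs it reproduces the reported figures.

```python
import math

# Rounded log-scale estimates for the treatment effect, from the table.
beta_t = 0.014
ci_low, ci_high = 0.007, 0.021


def pct_change(log_effect):
    """Convert a log-scale effect into a percent change."""
    return (math.exp(log_effect) - 1.0) * 100.0


estimate_pct = pct_change(beta_t)
interval_pct = (pct_change(ci_low), pct_change(ci_high))
print(f"{estimate_pct:.1f}% ({interval_pct[0]:.1f}%, {interval_pct[1]:.1f}%)")
```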

References

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Sheather, S. (2009). A Modern Approach to Regression with R. New York, NY: Springer Science & Business Media.