Reading/Search Engine Optimization/Sitemaps test



This page describes Wikimedia Product and Wikimedia Technology's work to improve Wikipedia's presence in search results by creating XML sitemaps for search crawlers (tracked in T198965 on Phabricator). This extended study is a follow-up to the inconclusive analysis of an earlier effort on Italian Wikipedia. We generated sitemaps for Indonesian, Korean, Dutch, Punjabi, and Portuguese Wikipedias and analyzed search-referred traffic to those wikis. Our thorough analysis and statistical models yielded inconclusive results.

Unlike the sameAs A/B test, which was a randomized controlled experiment and therefore enabled us to detect a small, statistically significant change, this test did not allow us to detect a causal impact. Depending on the costs and complexity of maintaining up-to-date sitemaps (see Discussion below for details), we may need to either perform additional evaluations or put this particular SEO strategy on indefinite pause.

Introduction
Sitemaps allow a website's administrator to inform search engines about URLs on a website that are available for crawling. A sitemap is an XML file that lists the URLs for a site. It can include information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. As part of our overall efforts to improve SEO for Wikipedia and its sister projects, we decided to evaluate the potential of sitemaps in increasing traffic from search engines like Google. The following languages have had sitemaps created and submitted to the Google Search Console:


 * Indonesian (idwiki)
 * Italian (itwiki), which already had a sitemap from the earlier effort; its traffic was not included in this analysis
 * Korean (kowiki)
 * Dutch (nlwiki, nds_nlwiki)
 * Punjabi (pawiki, pnbwiki)
 * Portuguese (ptwiki)
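
As an illustration of the sitemap format described in the Introduction, the following is a minimal sketch of generating a sitemap with Python's standard library. It is not the tooling used for this test: the function name `build_sitemap` and the example URLs and field values are hypothetical. The element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (as a string) for an iterable of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq", and "priority"
    are optional, mirroring the optional per-URL information in a sitemap.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries, for illustration only.
example_pages = [
    {"loc": "https://nl.wikipedia.org/wiki/Amsterdam",
     "lastmod": "2018-11-14", "changefreq": "daily", "priority": 0.8},
    {"loc": "https://nl.wikipedia.org/wiki/Voorbeeld"},
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

Only `loc` is required for each URL; the other fields are hints that a crawler may or may not use.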

The following languages have been kept from the sameAs A/B test to be used as "controls" in this test:


 * Bhojpuri (bhwiki)
 * Cherokee (chrwiki)
 * Kazakh (kkwiki)
 * Catalan (cawiki)
 * French (frwiki)
 * Yoruba (yowiki)
 * Kalmyk (xalwiki)

Prior to the analysis, we decided on the following action plan to inform our decision-making based on the data:

Methods
We performed the analysis using the methodology introduced by Brodersen et al. (2015), wherein a Bayesian structural time series (BSTS) model is trained on the pre-intervention period of the set of control time series unaffected by the intervention. That model is then used to generate predictions of the counterfactual time series ("what if sitemaps were not deployed?" in our case), and the predicted time series is compared with the actual time series to infer the impact. This is the same approach employed by Xie et al. (2019) to assess the impact of the Hindi Wikipedia awareness campaign.
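
The counterfactual logic can be sketched in a deliberately simplified form. The example below is not a BSTS model: it swaps in an ordinary least squares line relating one control series to the treated series, fits it on the pre-intervention days only, and compares the forecast with the actual values afterwards. The function names and the daily visit counts are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_impact(control, treated, n_pre):
    """Fit on the first n_pre days, then compare forecast with actuals after."""
    intercept, slope = fit_line(control[:n_pre], treated[:n_pre])
    results = []
    for x, y in zip(control[n_pre:], treated[n_pre:]):
        y_hat = intercept + slope * x  # predicted counterfactual
        results.append({
            "actual": y,
            "predicted": y_hat,
            "absolute": y - y_hat,
            "relative": (y - y_hat) / y_hat,
        })
    return results


# Made-up daily visit counts: treated = 2 * control + 10 before the
# intervention, with 5 extra visits per day added afterwards.
control = [100, 110, 120, 130, 140, 150, 160]
treated = [210, 230, 250, 270, 295, 315, 335]
impact = estimate_impact(control, treated, n_pre=4)
print([round(r["absolute"], 2) for r in impact])
```

A BSTS model additionally carries trend, seasonality, and autoregressive components and yields a full posterior distribution for the counterfactual rather than a single predicted line.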

The model of search engine-referred traffic among treated wikis included a local trend and various seasonality & autoregressive components:


 * AR(5)
 * Day of week
 * Week of year
 * Christmas & New Year as holidays

as well as a "control" time series which we assume to be unaffected by the intervention:


 * search engine-referred traffic to "control" wiki(s)

We also evaluated a version of the model which did not have the seasonality components, which is what Brodersen et al. use in their paper because the assumption there is that the control time series would handle any seasonalities and external factors. However, because the controls were a poor match to the target time series, the seasonality adjustments turned out to be a required part of the model.

We utilized 10-fold (5-fold in case of individual wiki traffic) forward-chaining cross-validation (CV) to estimate the mean absolute percentage error (MAPE) of the models and thereby assess their accuracy in predicting the counterfactual. Since we were analyzing 30 days of traffic post-intervention, we evaluated the model on 10 blocks of 30 days leading up to the intervention, using all the data available relative to each of the evaluation blocks ("folds").
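
The fold layout and the error metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the analysis: the function names are hypothetical, the series is synthetic, and a naive "mean of the last 7 training days" forecast stands in for the real model. The date range is the training window reported in the Results.

```python
from datetime import date


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def forward_chaining_folds(n_days, n_folds=10, horizon=30):
    """Yield (train_indices, test_indices) for blocks ending at day n_days.

    Each fold is evaluated on a `horizon`-day block and trained on every day
    before that block, so training data never comes after test data.
    """
    for k in range(n_folds, 0, -1):
        test_end = n_days - (k - 1) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough history for the requested folds")
        yield range(0, test_start), range(test_start, test_end)


# Days from 2016-02-05 through 2018-11-14, inclusive.
n_days = (date(2018, 11, 14) - date(2016, 2, 5)).days + 1
folds = list(forward_chaining_folds(n_days))

# Synthetic traffic with a weekly pattern, for illustration only.
series = [1000 + (d % 7) * 10 for d in range(n_days)]

fold_mapes = []
for train, test in folds:
    recent = [series[i] for i in train[-7:]]
    forecast = [sum(recent) / len(recent)] * len(test)  # naive placeholder model
    fold_mapes.append(mape([series[i] for i in test], forecast))

print(sum(fold_mapes) / len(fold_mapes))  # average MAPE across folds
```

Because every fold trains only on days that precede its evaluation block, the estimate reflects how the model would have performed when forecasting a genuinely unseen 30-day window.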

Results
Using a model trained on daily traffic from 2016-02-05 (when we began tracking search engine-referred traffic separately from externally-referred traffic in general) through 2018-11-14 (the day before the intervention) to forecast a 30-day counterfactual from 2018-11-15 through 2018-12-15, we found no statistically significant evidence of SEO improvement. We modeled the mobile and desktop traffic separately; the results are shown in the corresponding figures. In each figure, the top half shows the predictions $$\hat{y}$$ (with a 95% credible interval) in yellow and the actual time series $$y$$ in black; the bottom half is split between the estimated absolute impact $$y - \hat{y}$$ in blue and the estimated relative impact $$\frac{y - \hat{y}}{\hat{y}}$$ in green.

Although the estimated impact is above 0 on most days – suggesting a possible positive effect – the daily 95% CI consistently includes 0, which means our model has not found evidence of impact on visits from search engines with the data we have. These results are consistent with what we saw in the previous analysis (which employed different methodology), wherein we did not find convincing evidence of impact.
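
The check described above, whether a day's 95% credible interval for the impact includes 0, can be sketched as follows. The posterior draws here are simulated with made-up parameters purely for illustration, and `daily_impact_summary` is a hypothetical helper, not part of the analysis code.

```python
import random
import statistics


def daily_impact_summary(actual, predicted_draws):
    """Summarize y - y_hat for one day given posterior draws of y_hat."""
    impact_draws = [actual - draw for draw in predicted_draws]
    cuts = statistics.quantiles(impact_draws, n=40)  # 2.5%, 5%, ..., 97.5%
    lower, upper = cuts[0], cuts[-1]
    return {
        "median": statistics.median(impact_draws),
        "lower": lower,
        "upper": upper,
        "includes_zero": lower <= 0 <= upper,
    }


# Simulated draws for illustration only: a forecast centered on 1000 visits
# with a standard deviation of 50, compared against 1030 actual visits.
rng = random.Random(0)
draws = [rng.gauss(1000, 50) for _ in range(4000)]
summary = daily_impact_summary(1030, draws)
print(summary)
```

In this simulated case the median impact is positive, yet the interval still spans 0, which is the same pattern reported for the sitemaps data: a point estimate above 0 is not evidence of an effect when the uncertainty around it is this wide.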

In addition to the aggregated traffic, we also analyzed traffic to the individual languages (with mixed results and accuracy) to see whether any one language dominated or masked the results when combined with others. This separate, per-language analysis did not yield any additional insights. The best predictive model was the model of search-referred traffic to Dutch Wikipedia; its results show a potential positive effect (albeit a relatively small one), but the high uncertainty makes these results inconclusive.

Discussion
Google's Search Console documentation says this about sitemaps:

Search engines like Google read this file to more intelligently crawl your site. A sitemap tells the crawler which files you think are important in your site, and also provides valuable information about these files: for example, for pages, when the page was last updated, how often the page is changed, and any alternate language versions of a page.

If your site’s pages are properly linked, our web crawlers can usually discover most of your site. Even so, a sitemap can improve the crawling of your site, particularly if your site meets one of the following criteria:


 * Your site is really large. As a result, it’s more likely Google web crawlers might overlook crawling some of your new or recently updated pages.
 * Your site has a large archive of content pages that are isolated or not well linked to each other. If your site pages do not naturally reference each other, you can list them in a sitemap to ensure that Google does not overlook some of your pages.

Using a sitemap doesn't guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling. However, in most cases, your site will benefit from having a sitemap, and you'll never be penalized for having one.

If we wish to investigate sitemaps further on a second set of wikis, we may need to rethink our testing approach and design a different experiment. Alternatively, we could narrow our analysis at a substantial time cost. Essentially, sitemaps are most effective for pages that have no links pointing to them and are therefore unreachable by crawlers that rely on links to index sites. This means that obscure or unpopular articles are the ones most likely to be substantially impacted by sitemaps. However, as Google states, "Using a sitemap doesn't guarantee that all the items in your sitemap will be crawled and indexed", which means that even if we were to invest in identifying low-link, high-potential pages to focus our analysis on, there is no guarantee that they would be indexed.

Furthermore, there is still a big, outstanding question: once a sitemap is generated, how often does it need to be updated? Wikipedia and its sister projects have hundreds (thousands?) of pages being added, renamed, or removed every day. How much work would it require to maintain a relatively up-to-date sitemap for each wiki? Would the process be automated entirely and simply require a large initial investment of building out an automated system? What does that initial effort look like in terms of engineer time? What does the automated process require in terms of computational resources? These are aspects we need to consider.