Wikimedia Search Platform/Search/Testing Search

How Do We Test Search Changes?
Testing proposed changes to search can be a complicated business, and how we do it depends on what kinds of changes we need to test. Some of the tools we use include the following (also see the Discovery Search Glossary):
 * There's the Relevance Forge (formerly "Relevance Lab"—but too many things have "lab" in their name). The RelForge servers allow us to host multiple copies of wiki indexes with different indexing schemes, or, more simply, to query the same index with different configurations (weighting schemes, query types, etc.—anything that doesn't require re-indexing). RelForge can measure the zero results rate (ZRR), the "poorly performing" rate (the proportion of queries returning fewer than 3 results), query re-ordering, and a few other automatically computable metrics; it is also possible to manually review the changed results. We can run RelForge against targeted corpora (e.g., queries with question marks) or general regression test corpora. It's a bit of a blunt instrument, but it makes it easy to quickly assess the maximum impact of a change by seeing what percentage of a regression test set shows changes. If that percentage is unexpectedly high, we know something "interesting" is going on that warrants further investigation; if it's unexpectedly low, we can look at all the changed examples and try to figure out what's going on.
 * RelForge also has an optimizer that will take a number of numerical configuration parameters and a range of values for each and do a grid search over the parameter space to find the best config, based on PaulScore, which is derived from actual user click data. (See the Glossary for more on PaulScore, which is a name we gave an otherwise unnamed metric proposed by Paul Nelson.)
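The simpler aggregate metrics mentioned above take very little code to compute. Here is a minimal sketch, assuming per-query result counts have already been collected from the search backend (the counts below are made up):

```python
def zero_results_rate(result_counts):
    """Fraction of queries that returned no results at all (ZRR)."""
    return sum(1 for n in result_counts if n == 0) / len(result_counts)

def poorly_performing_rate(result_counts, threshold=3):
    """Fraction of queries returning fewer than `threshold` results."""
    return sum(1 for n in result_counts if n < threshold) / len(result_counts)

# Illustrative result counts for a seven-query test corpus.
counts = [0, 12, 2, 0, 57, 1, 8]
print(zero_results_rate(counts))       # 2 of 7 queries got nothing
print(poorly_performing_rate(counts))  # 4 of 7 got fewer than 3 results
```

Comparing these rates for the same corpus before and after a configuration change gives the quick "how much changed" signal described above.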


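The optimizer's parameter sweep can be sketched as an exhaustive grid search that maximizes PaulScore. Everything below is hypothetical: the toy two-weight ranker, the documents, and the click data are invented, and the PaulScore formulation shown (a click at zero-indexed rank k contributes b**k, averaged over sessions) is a simplified version of the metric described in the Glossary:

```python
import itertools

def paulscore(sessions, b=0.8):
    """Mean per-session score: a click at zero-indexed rank k contributes
    b**k, so clicks on lower-ranked results count for less."""
    return sum(sum(b ** k for k in clicks) for clicks in sessions) / len(sessions)

def rank(docs, weights):
    """Toy ranker: order docs by a weighted sum of two features."""
    w_title, w_links = weights
    return sorted(docs, key=lambda d: w_title * d["title_match"] + w_links * d["links"],
                  reverse=True)

# Toy corpus: one "query" whose historically clicked document is B.
docs = [
    {"id": "A", "title_match": 1.0, "links": 0.1},
    {"id": "B", "title_match": 0.6, "links": 0.9},
]
clicked = {"B"}

best = None
for weights in itertools.product([0.5, 1.0, 2.0], repeat=2):
    ranking = rank(docs, weights)
    clicks = [k for k, d in enumerate(ranking) if d["id"] in clicked]
    score = paulscore([clicks])
    if best is None or score > best[0]:
        best = (score, weights)
print(best)  # the best (score, weights) pair found on the grid
```

A real sweep evaluates many queries and many parameters, but the shape is the same: enumerate the grid, score each configuration against logged clicks, keep the winner.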
 * For English Wikipedia, we are trying to emulate Google, Bing, and others with Discernatron, which allows users to manually review and rank results. Unfortunately it's tedious to do the ranking and we don't have a lot of data because we aren't paying people to do it. However, we have a moderately sized test set that we can use to test proposed changes and to calibrate other less manual-work–intensive methods, which we hope to then generalize to other projects. We use the Discernatron scores to calculate discounted cumulative gain (DCG) values before and after changes. The DCG scores are based on the ordering of results compared to the manually scored results.
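Using the linear-gain form of DCG (the relevance of the result at 1-indexed position i, discounted by log2(i+1)), the before/after comparison looks roughly like this; the relevance scores and orderings are invented for illustration:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

# Discernatron-style human relevance scores (illustrative), keyed by result id.
scores = {"A": 3, "B": 0, "C": 2}

before = ["B", "A", "C"]   # ordering produced by the current config
after  = ["A", "C", "B"]   # ordering produced by the proposed change

print(dcg([scores[r] for r in before]))  # lower: the best result is buried
print(dcg([scores[r] for r in after]))   # higher: relevant results come first
```

Note that DCG also has an exponential-gain variant (2**rel - 1 in the numerator); which one the analysis uses doesn't change the before/after logic shown here.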


 * We've also been experimenting with other clickstream metrics, in particular a dynamic Bayesian network (DBN) click model. The DBN model uses much more aggregate data, which makes it more precise, but it requires a certain number of people to have issued equivalent queries (ignoring case, extra spaces, etc.), so it must ignore the long tail. It compares favorably with the Discernatron results, though, and requires no additional manual rating work. We've also just started experimenting with learning to rank, but that work is still in its early days.
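The grouping of equivalent queries that feeds the DBN model can be sketched as a normalization step. The exact set of normalizations used in production may differ; the ones below are just those named above, plus trimming:

```python
import re
from collections import Counter

def normalize(query):
    """Collapse variants treated as the same query: lowercase, trim,
    and squeeze runs of whitespace (an illustrative, not exhaustive, list)."""
    return re.sub(r"\s+", " ", query.strip()).lower()

raw = ["Niels Bohr", "niels bohr", "  NIELS   BOHR ", "bohr model"]
groups = Counter(normalize(q) for q in raw)
print(groups)  # "niels bohr" groups three raw variants; "bohr model" only one
```

Only groups with enough sessions get a useful DBN estimate, which is exactly why the long tail of once-seen queries has to be ignored.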

 * We also use A/B tests when there's no other good way to test things, or when we aren't fully confident in other measures. The data is collected and analyzed by our data analysts, and the analysis write-ups are on GitHub and Commons (for example, this one). They use a lot of very cool, high-powered statistical techniques to tease out what's going on, looking at ZRR, PaulScore, and much more. A/B tests can examine either general search performance or performance on specific kinds of queries, similar to how we use RelForge.

For individual search improvements, we use whichever of these methods seems most helpful at the time, depending on the scope and complexity of the improvements being evaluated. For the overall problem of showing results from so many second-try search options, we'll probably need to use A/B tests to see which display method and ordering gets the most user engagement, and refine that over time, since there's no a priori way to determine the best ordering of additional results.
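As a toy illustration of the kind of bucket comparison an A/B test involves (our analysts' actual methods are considerably more sophisticated), here is a two-proportion z-test on ZRR; all the counts are made up:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for comparing two proportions (e.g., ZRR per bucket)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical buckets: zero-result searches out of total searches.
z = two_proportion_z(1200, 10000,   # control: 12% ZRR
                     1000, 10000)   # test:    10% ZRR
print(z)  # positive: the control bucket's ZRR is higher than the test bucket's
```

With these numbers the z statistic comfortably exceeds 1.96, so the drop in ZRR would be significant at the usual 95% level; real analyses also have to worry about multiple comparisons, session effects, and so on.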