User:TJones (WMF)/Notes/Phrase Slop Pre-Test

Introduction
We are continuing our A/B testing of potential Elasticsearch improvements, focusing this time on "phrase slop", a parameter that allows a query phrase, like "quick fox", to match phrases in documents that contain one or more extra words, like "quick brown fox". A slop of one allows one extra word to intervene; a slop of 50 allows 50 extra words to intervene.
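As a rough illustration of the idea, here is a minimal Python sketch of slop matching. It is a simplified model, not the search engine's implementation: it assumes whitespace tokenization and only allows extra words between query terms that appear in order, whereas Lucene's actual slop is a position edit distance that can also tolerate some reordering. The function name `matches_with_slop` is made up for this example.

```python
def matches_with_slop(phrase, text, slop):
    """Simplified model: True if the phrase terms occur in order in text
    with at most `slop` extra words between them in total."""
    terms = phrase.lower().split()
    words = text.lower().split()
    if not terms:
        return False

    def search(term_idx, prev_pos, used):
        # All terms placed: the phrase matched within the slop budget.
        if term_idx == len(terms):
            return True
        # The next term may sit at most (slop - used) words past the
        # position immediately following the previous term.
        limit = min(len(words), prev_pos + 2 + slop - used)
        for pos in range(prev_pos + 1, limit):
            if words[pos] == terms[term_idx]:
                gap = pos - prev_pos - 1  # extra words skipped here
                if search(term_idx + 1, pos, used + gap):
                    return True
        return False

    # Try every occurrence of the first term as a starting point.
    return any(
        word == terms[0] and search(1, start, 0)
        for start, word in enumerate(words)
    )


print(matches_with_slop("quick fox", "the quick brown fox", 0))
print(matches_with_slop("quick fox", "the quick brown fox", 1))
```

Under this model, "quick fox" does not match "the quick brown fox" at slop 0 but does at slop 1.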

Data Sample
I took a random sub-sample of one day's worth (~8 am to ~8 am starting on 2015-08-15) of full text queries against ptwiki and dewiki, sampled at 1-in-10 and 1-in-50 respectively. We chose ptwiki and dewiki because they are the largest wikis we can load in labs for testing at the moment.

I also collected a similar sample from enwiki (1-in-100), for comparison purposes, though we do not have a test copy of enwiki to run slop tests against.

Below are stats on the samples. Because only queries with phrases in them are affected by phrase slop, I only looked at queries with at least one double quote character in them. As a result, not all of them are actually phrase searches: some have only one double quote in them, and so won't be affected by changes to slop.
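The distinction between "contains a double quote" and "contains a complete quoted phrase" can be sketched as follows. This is an illustrative filter only, assuming plain ASCII double quotes; the helper names are hypothetical and this is not the script used to produce the stats.

```python
import re

# A complete phrase is a non-empty run of text between two double quotes.
PHRASE_RE = re.compile(r'"([^"]+)"')


def has_quote(query):
    """The selection criterion used here: at least one double quote."""
    return '"' in query


def quoted_phrases(query):
    """Complete quoted phrases, which are the parts slop can affect."""
    return PHRASE_RE.findall(query)


sample = ['"quick fox" film', 'quick "fox', 'quick fox']
selected = [q for q in sample if has_quote(q)]
with_phrases = [q for q in selected if quoted_phrases(q)]
print(selected)
print(with_phrases)
```

A query with a single unmatched quote passes the first filter but contains no complete phrase, so it is selected for the sample yet cannot be affected by slop.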

In the quoted queries, I found:
 * Many were DOI queries, which I did not test.
 * There were also many media player queries ("..." film).
 * enwiki still has a lot of "title_1" AND "title 2" queries, generally all coming from one IP address; neither ptwiki nor dewiki has any.
 * A very small number of queries seemed to come from malformed log entries; these were excluded.

Note that non-DOI, non–media-player queries with quotes make up less than 1% of queries for all three wikis.

Method
I re-ran the unique quoted queries (ignoring the DOI queries for now) against the lab instance of ptwiki and dewiki, with "precise" slop (the value used when searching for "phrases in quotes") set to 0, 1, 2, and 50, and recorded the total number of hits that resulted from each query for each slop value. Note that 50 is probably not a plausible slop value, but it provides an upper bound for a very loose configuration.
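The collection loop described above might look roughly like the sketch below. The search call itself is left as a caller-supplied function (`count_hits` is a hypothetical name), since the actual mechanism used to query the lab instances and set the slop value is not shown here.

```python
SLOPS = (0, 1, 2, 50)


def collect_hits(queries, slops, count_hits):
    """Return {query: {slop: total_hits}} for each unique query.

    count_hits(query, slop) is a caller-supplied stand-in for whatever
    actually runs the search with the given slop configuration.
    """
    results = {}
    # dict.fromkeys de-duplicates while keeping first-seen order.
    for query in dict.fromkeys(queries):
        results[query] = {slop: count_hits(query, slop) for slop in slops}
    return results


# Usage with a canned lookup in place of a real search backend.
canned = {('"quick fox"', 0): 0, ('"quick fox"', 1): 3,
          ('"quick fox"', 2): 3, ('"quick fox"', 50): 7}
hits = collect_hits(['"quick fox"', '"quick fox"'], SLOPS,
                    lambda q, s: canned[(q, s)])
print(hits)
```

Passing the search function in as a parameter keeps the loop testable without a live index.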

The current production configuration has zero slop, so I computed the differences in results compared to zero slop, both for zero-result queries, which we want to improve, and for all queries, since we don't want to make everything else appreciably worse.
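A minimal sketch of that comparison, assuming hit counts are stored as a nested mapping of query to slop to total hits (a layout chosen for this example, not necessarily the one used in the analysis):

```python
def summarize_deltas(hits, baseline=0):
    """Count, per slop value, how many queries differ from the baseline.

    hits: {query: {slop: total_hits}}
    Returns {slop: {"zero_result_gained": n, "any_changed": n}}.
    """
    summary = {}
    for by_slop in hits.values():
        base = by_slop[baseline]
        for slop, total in by_slop.items():
            if slop == baseline:
                continue
            row = summary.setdefault(
                slop, {"zero_result_gained": 0, "any_changed": 0})
            if total != base:
                row["any_changed"] += 1
                # A baseline of zero hits that now differs means the
                # query went from no results to some results.
                if base == 0:
                    row["zero_result_gained"] += 1
    return summary


example = {
    '"a b"': {0: 0, 1: 0, 2: 3},   # zero-result query, helped at slop 2
    '"c d"': {0: 5, 1: 6, 2: 6},   # already had hits, gains one more
}
print(summarize_deltas(example))
```

The two counters correspond to the two questions above: how many zero-result queries gained results, and how many queries changed at all.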

Results
The numbers below are for unique queries (to save time, since each query takes on the order of 30 seconds to run), so they are somewhat skewed (less so for ptwiki and more so for dewiki), but the trend is quite clear, since the number of non-zero deltas in every category is in the single digits.

ptwiki
ptwiki showed no effect for slop of 1 or 2 on current zero-result queries, though there was a minimal increase in total hits returned for other queries with quotes.

dewiki
dewiki returned more results for zero-result queries with slop set to 1 or 2, but the effect is small: only 4 and 12 queries, respectively, out of 518 unique zero-result queries.

Pretty Pictures
I was hoping for some pretty scatter plots showing the changes in results for various slop values, but they were pretty disappointing. Here they are anyway.

Conclusions
Overall, the effect of setting slop to 1 or 2 was minimal. There aren't that many queries with quotes, and most quoted zero-result queries are not affected by changes in slop, while a small number of quoted queries that already return results pick up additional, arguably less precise results.

To Do

 * Run some DOI queries against ptwiki and dewiki to see what happens
 * Run with higher slop values (3, 4, or 5) if anyone asks
 * Gather quoted query stats for other wikis if anyone asks
 * Fill out any empty stats slots (e.g., for enwiki) if anyone asks