User:TJones (WMF)/Notes/Phrase Slop Pre-Test
August 2015 — See TJones_(WMF)/Notes for other projects.
We are continuing with our A/B testing of potential Elasticsearch improvements, focusing this time on "phrase slop", a parameter that allows query phrases, like "quick fox", to match phrases in documents that have one or more extra words in them, like "quick brown fox". A slop of one allows one extra word to intervene; a slop of 50 allows 50 extra words to intervene.
For more on phrase slop, see the Elasticsearch documentation.
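To make the "extra words" idea concrete, here is a minimal sketch of in-order matching with a slop budget. It is a simplification and not how the search engine implements it: real slop is a position-based edit distance that can also allow reordered terms at a cost, and the function name `matches_with_slop` is made up for this illustration.

```python
def matches_with_slop(query_terms, doc_terms, slop):
    """Return True if query_terms occur in order in doc_terms with at most
    `slop` extra words in between, counted across the whole phrase.

    Simplification: only in-order matches are considered; term reordering,
    which real phrase slop can also permit, is ignored here.
    """
    if not query_terms:
        return True
    n = len(doc_terms)

    def extend(q_idx, prev_pos, budget):
        # Try to place query term q_idx somewhere after prev_pos,
        # spending at most `budget` intervening words.
        if q_idx == len(query_terms):
            return True
        last = min(n - 1, prev_pos + 1 + budget)
        for pos in range(prev_pos + 1, last + 1):
            if doc_terms[pos] == query_terms[q_idx]:
                gap = pos - prev_pos - 1
                if extend(q_idx + 1, pos, budget - gap):
                    return True
        return False

    # Try every occurrence of the first query term as a starting point.
    return any(
        extend(1, start, slop)
        for start, term in enumerate(doc_terms)
        if term == query_terms[0]
    )


doc = "the quick brown fox".split()
print(matches_with_slop(["quick", "fox"], doc, 0))  # False
print(matches_with_slop(["quick", "fox"], doc, 1))  # True
```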
I took a random sub-sample of one day's worth (~8 am to ~8 am starting on 2015-08-15) of full text queries against ptwiki and dewiki, sampled at 1-in-10 and 1-in-50 respectively. We chose ptwiki and dewiki because they are the largest wikis we can load in labs for testing at the moment.
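The notes do not say how the 1-in-N sampling was implemented. One simple way to do it, shown here purely as an assumed sketch, is to keep each query independently with probability 1/N; `sample_one_in_n` and its `seed` parameter are hypothetical names.

```python
import random


def sample_one_in_n(queries, n, seed=None):
    """Keep each query independently with probability 1/n.

    Assumption: independent random selection per query. Taking every
    n-th log line would be another reasonable reading of "1-in-n".
    """
    rng = random.Random(seed)
    return [q for q in queries if rng.randrange(n) == 0]


# Usage: a 1-in-10 sample, as described for ptwiki.
log = [f"query {i}" for i in range(1000)]
sample = sample_one_in_n(log, 10, seed=2015)
print(len(sample))
```

With this approach the sample size is only approximately 1/N of the input, which is usually fine for descriptive statistics like the ones below.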
I also collected a similar sample from enwiki (1-in-100), for comparison purposes, though we do not have a test copy of enwiki to run slop tests against.
Below are stats on the samples. Because only queries with phrases in them are affected by phrase slop, I only looked at queries with at least one double quote character in them. Not all of these are actually phrase searches: some have only one double quote in them, and so won't be affected by changes to slop.
In the quoted queries, I found:
- Many were DOI queries, which I did not test.
- There were also many media player queries ("..." film).
- enwiki still has a lot of "title_1" AND "title 2" queries, generally all coming from one IP address; neither ptwiki nor dewiki has any.
- There were a very small number of queries that seemed to be from malformed log entries, which were excluded.
Note that non-DOI, non–media-player, non-title-AND-title queries with quotes make up less than 1% of queries for all three wikis.
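The categories above could be separated with a few pattern checks along these lines. This is a sketch only: the regular expressions are illustrative guesses at what each kind of query looks like, not the rules actually used for the counts in these notes, and the function names are hypothetical.

```python
import re

# Illustrative guesses, not the patterns used in the original analysis.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")                       # typical DOI shape
MEDIA_PLAYER_RE = re.compile(r'^"[^"]+"\s+film$', re.IGNORECASE)  # "..." film
TITLE_AND_TITLE_RE = re.compile(r'^"[^"]+"\s+AND\s+"[^"]+"$')    # "a" AND "b"


def is_phrase_search(query):
    """A query needs at least two double quotes to contain a quoted phrase."""
    return query.count('"') >= 2


def classify_quoted_query(query):
    """Assign a query containing a double quote to one coarse category."""
    q = query.strip()
    if DOI_RE.search(q):
        return "doi"
    if MEDIA_PLAYER_RE.match(q):
        return "media player"
    if TITLE_AND_TITLE_RE.match(q):
        return "title AND title"
    return "everything else"


for q in ['"10.1000/182"', '"Casablanca" film', '"Foo" AND "Bar baz"', '"quick fox" jumps']:
    print(q, "->", classify_quoted_query(q))
```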
| ||ptwiki||dewiki||enwiki|
|queries with quotes||2,790 (3.70%)||1,321 (1.92%)||11,172 (10.8%)|
|DOI queries (ignored)||2,578||481||365|
|media player queries||135||237||755|
|"title_1" AND "title 2"||0||0||9,668 (9.35%)|
|malformed query log||3||2||3|
|everything else||74 (0.10%)||601 (0.87%)||381 (0.37%)|
|Queries to Re-run|
|well-formed non-DOI queries||209||838|
|unique non-DOI queries||197||663|
|unique media player queries||133||212|
|unique everything else||64||451|
|unique single " queries||14||9||18|
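As a consistency check on the table, the re-run counts can be derived from the category counts, reading the value columns in the order ptwiki, dewiki, enwiki. That column order is an inference from the arithmetic and from the enwiki-only "title_1" AND "title 2" queries. The dictionary keys below are made-up names.

```python
# Counts copied from the table above; column order (ptwiki, dewiki, enwiki)
# is inferred, and the key names are hypothetical.
samples = {
    "ptwiki": {"quoted": 2790, "doi": 2578, "media_player": 135,
               "title_and_title": 0, "malformed": 3},
    "dewiki": {"quoted": 1321, "doi": 481, "media_player": 237,
               "title_and_title": 0, "malformed": 2},
    "enwiki": {"quoted": 11172, "doi": 365, "media_player": 755,
               "title_and_title": 9668, "malformed": 3},
}


def well_formed_non_doi(counts):
    """Quoted queries minus DOI queries and malformed log entries."""
    return counts["quoted"] - counts["doi"] - counts["malformed"]


def everything_else(counts):
    """What remains after also removing media player and title-AND-title queries."""
    return (well_formed_non_doi(counts)
            - counts["media_player"]
            - counts["title_and_title"])


for wiki, counts in samples.items():
    print(wiki, well_formed_non_doi(counts), everything_else(counts))
# ptwiki 209 74
# dewiki 838 601
# enwiki 10804 381
```

The ptwiki and dewiki results reproduce the "well-formed non-DOI queries" and "everything else" rows; the enwiki figure of 10,804 is not in the table because enwiki queries were not re-run.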
I re-ran the unique quoted queries (ignoring the DOI queries for now) against the lab instances of ptwiki and dewiki, with "precise" slop (the value used when searching for "phrases in quotes") set to 0, 1, 2, and 50, and recorded the total number of hits that resulted from each query for each slop value. Note that 50 is probably not a plausible slop value, but it provides an upper bound for a very loose configuration.
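The re-run procedure amounts to a small loop over queries and slop values. The sketch below assumes a caller-supplied `count_hits(query, slop)` function that runs one search and returns its total hit count; that function, like `collect_hit_counts` itself, is hypothetical and stands in for whatever interface the lab instance exposes.

```python
SLOP_VALUES = (0, 1, 2, 50)


def collect_hit_counts(queries, count_hits, slops=SLOP_VALUES):
    """Run every query at every slop value and record total hits.

    `count_hits` is a placeholder for the real search call; it must accept
    (query, slop) and return an integer.
    Returns {query: {slop: total_hits}}.
    """
    results = {}
    for query in queries:
        results[query] = {slop: count_hits(query, slop) for slop in slops}
    return results


# Usage with a stand-in search function (not a real backend).
def fake_count_hits(query, slop):
    return len(query.split()) * slop


hits = collect_hit_counts(['"quick fox"', '"lazy dog" jumps'], fake_count_hits)
print(hits['"quick fox"'])  # {0: 0, 1: 2, 2: 4, 50: 100}
```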
The current production configuration has zero slop, so I computed the differences in results compared to zero slop, both for zero-result queries, which we want to improve, and for all queries, since we don't want to make everything else appreciably worse.
The numbers below are for unique queries (to save CPU), so they are somewhat skewed (less so for ptwiki, more so for dewiki), but the trend is quite clear: the number of non-zero deltas in every category for slop values of 1 and 2 is in the single digits.
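The delta computation can be sketched as follows: for each query, subtract the zero-slop hit count from the hit count at a given slop, then tally the non-zero deltas into ranges. The ranges from +11 upward are the ones used in the tables below; the "below +11" catch-all is an assumption, as are the function names and the `{query: {slop: hits}}` input shape.

```python
from collections import Counter

# Ranges from +11 upward follow the tables; anything smaller falls into an
# assumed catch-all label.
BUCKETS = [(11, 20), (21, 30), (31, 40), (41, 50),
           (51, 100), (101, 500), (501, 1000), (1001, 10000)]


def bucket_label(delta):
    """Map a change in hit count to a range label."""
    if delta >= 10001:
        return "+10001 and up"
    for low, high in BUCKETS:
        if low <= delta <= high:
            return f"+{low} to +{high}"
    return "below +11"  # assumed label for small (or negative) deltas


def tally_deltas(hit_counts, slop, zero_only=False):
    """Count queries per range of (hits at `slop`) - (hits at slop 0).

    With zero_only=True, only queries with no hits at slop 0 are counted.
    Queries whose hit count did not change are skipped.
    """
    tally = Counter()
    for per_slop in hit_counts.values():
        baseline = per_slop[0]
        if zero_only and baseline != 0:
            continue
        delta = per_slop[slop] - baseline
        if delta != 0:
            tally[bucket_label(delta)] += 1
    return tally


# Made-up hit counts, for illustration only.
example = {
    '"a b"': {0: 0, 1: 0, 2: 15, 50: 600},
    '"c d"': {0: 10, 1: 12, 2: 40, 50: 20000},
}
print(tally_deltas(example, 2))
print(tally_deltas(example, 50, zero_only=True))
```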
ptwiki showed no effect for slop of 1 or 2 on current zero queries, though there was a minimal increase in total hits returned for other queries with quotes.
|all quoted queries||____||quoted zero queries|
|+11 to +20||1||5||1|
|+21 to +30||1|
|+31 to +40||1|
|+41 to +50||2||1||1||1|
|+51 to +100||1||4||1|
|+101 to +500||3||3||3|
|+501 to +1000||3|
|+1001 to +10000||5|
dewiki returned more results for zero queries with slop set at 1 or 2, but the effect is small—only 7 and 14 queries out of 362 unique zero-result queries.
|all quoted queries||____||quoted zero queries|
|+11 to +20||8||13||21||1||1||4|
|+21 to +30||1||8||20||1||4|
|+31 to +40||2||8||16||1||2|
|+41 to +50||3||1||9|
|+51 to +100||1||3||29||2|
|+101 to +500||3||7||45||2|
|+501 to +1000||2||8||2|
|+1001 to +10000||3||3||12||1|
|+10001 and up||5|
I was hoping for some pretty scatter plots showing the changes in results for various slop values, but they were pretty disappointing. Here they are anyway.
Some data points for dewiki are not shown on the graph (see above), because the changes were extreme.
Changes in results for ptwiki with slop set to 1, 2, or 50. The straight line represents no change.
Changes in results for dewiki with slop set to 1, 2, or 50. The straight line represents no change.
Overall, the effect of setting slop to 1 or 2 was minimal. There aren't that many queries with quotes, and most quoted zero queries are not affected by changes in slop, while a small number of non-zero quoted queries return additional, arguably less precise, results.
Given the differences in behavior between ptwiki and dewiki, it isn't clear that we can easily extrapolate to enwiki. However, there aren't that many quoted queries (outside of DOI, media player, and "title_1" AND "title 2" queries), so the maximum effect on human users will necessarily be small.
DOI v. slop
DOI queries, as a practical matter, are not affected by slop. I ran 187 DOI queries from dewiki and 145 DOI queries from ptwiki with slop values of 0, 1, 2, 50, and 5000. All returned 0 hits in all cases.
I can run with higher slop values (3, 4, or 5) or gather quoted query stats for other wikis if anyone is interested.