User:TJones (WMF)/Notes/Phrase Slop Pre-Test

From MediaWiki.org

August 2015 — See TJones_(WMF)/Notes for other projects.

Introduction

We are continuing our A/B testing of potential ElasticSearch improvements, focusing this time on "phrase slop", a parameter that allows query phrases, like "quick fox", to match phrases in documents that have one or more extra words in them, like "quick brown fox". A slop of 1 allows one extra word to intervene; a slop of 50 allows 50 extra words to intervene.

For more on phrase slop, see the ElasticSearch documentation.
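To make the idea concrete, here is a minimal sketch of in-order slop matching. It is a simplification, not how ElasticSearch implements it: real slop is defined over term positions and also tolerates reordering at a higher cost, while this version only counts extra intervening words. The function name `matches_with_slop` and the whitespace tokenization are assumptions made for illustration.

```python
def matches_with_slop(phrase, text, slop=0):
    """Simplified in-order phrase match (illustrative only).

    Returns True if the words of `phrase` occur in order in `text`
    with at most `slop` extra words in total between them.
    """
    terms = phrase.lower().split()
    tokens = text.lower().split()
    if not terms:
        return False

    def search(term_idx, prev_pos, remaining):
        # All phrase terms placed: the phrase matches.
        if term_idx == len(terms):
            return True
        # Try skipping 0..remaining extra words before the next term.
        for gap in range(remaining + 1):
            pos = prev_pos + 1 + gap
            if pos >= len(tokens):
                break
            if tokens[pos] == terms[term_idx] and search(term_idx + 1, pos, remaining - gap):
                return True
        return False

    return any(
        tokens[i] == terms[0] and search(1, i, slop)
        for i in range(len(tokens))
    )


print(matches_with_slop("quick fox", "the quick brown fox", slop=0))  # False
print(matches_with_slop("quick fox", "the quick brown fox", slop=1))  # True
```

With slop 0 the phrase must appear exactly; with slop 1 the single intervening word "brown" is tolerated.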

Data Sample

I took a random sub-sample of one day's worth (~8 am to ~8 am starting on 2015-08-15) of full-text queries against ptwiki and dewiki, sampled at 1-in-10 and 1-in-50 respectively. We chose ptwiki and dewiki because they are the largest wikis we can currently load in labs for testing.

I also collected a similar sample from enwiki (1-in-100), for comparison purposes, though we do not have a test copy of enwiki to run slop tests against.

Below are stats on the samples. Because only queries with phrases in them are affected by phrase slop, I only looked at queries with at least one double quote character in them. Thus, not all are actually phrase searches—some have only one double quote in them, and so won't be affected by changes to slop.
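The quote-based filter described above can be sketched roughly as follows. The function names and category labels are hypothetical, and the actual filtering used for these stats may have differed; this only illustrates why a query with a single double quote is counted among the quoted queries but cannot contain a phrase.

```python
import re


def quoted_phrases(query):
    """Return the text inside each complete pair of double quotes."""
    return re.findall(r'"([^"]*)"', query)


def quote_category(query):
    """Classify a query by how many double quote characters it contains."""
    n = query.count('"')
    if n == 0:
        return "no_quotes"      # not part of the quoted-query stats
    if n == 1:
        return "single_quote"   # counted as quoted, but no phrase to apply slop to
    return "phrase"             # at least one complete "..." pair


print(quote_category('quick "brown fox'))   # single_quote
print(quoted_phrases('"quick fox" film'))   # ['quick fox']
```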

In the quoted queries, I found:

  • Many were DOI queries, which I did not test.
  • There were also many media player queries ("..." film).
  • enwiki still has a lot of "title_1" AND "title 2" queries, generally all coming from one IP address; neither ptwiki nor dewiki has any.
  • There were a very small number of queries that seemed to be from malformed log entries, which were excluded.

Note that non-DOI, non–media-player, non-title-AND-title queries with quotes make up less than 1% of queries for all three wikis.

                                ptwiki           dewiki           enwiki
Data
  sample queries                75,457           68,970           103,346
  queries with quotes           2,790 (3.70%)    1,321 (1.92%)    11,172 (10.8%)
  DOI queries (ignored)         2,578            481              365
  media player queries          135              237              755
  "title_1" AND "title 2"       0                0                9,668 (9.35%)
  malformed query log           3                2                3
  everything else               74 (0.10%)       601 (0.87%)      381 (0.37%)
Queries to Re-run
  well-formed non-DOI queries   209              838
  unique non-DOI queries        197              663
  unique media player queries   133              212
  unique everything else        64               451
  unique single " queries       14               9                18
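As a sanity check, the counts in the table can be cross-checked with a few lines of arithmetic. The relationship used for "well-formed non-DOI queries" (quoted queries minus DOI queries minus malformed log entries) is inferred from the numbers rather than stated in the text, and the dictionary keys are names chosen for this sketch.

```python
# Counts copied from the table above.
samples = {
    "ptwiki": {"sample": 75457, "quoted": 2790, "doi": 2578, "media": 135,
               "title_and": 0, "malformed": 3, "other": 74},
    "dewiki": {"sample": 68970, "quoted": 1321, "doi": 481, "media": 237,
               "title_and": 0, "malformed": 2, "other": 601},
    "enwiki": {"sample": 103346, "quoted": 11172, "doi": 365, "media": 755,
               "title_and": 9668, "malformed": 3, "other": 381},
}


def category_total(s):
    """Sum of the five categories that partition the quoted queries."""
    return s["doi"] + s["media"] + s["title_and"] + s["malformed"] + s["other"]


def well_formed_non_doi(s):
    """Inferred relationship: quoted queries minus DOI and malformed ones."""
    return s["quoted"] - s["doi"] - s["malformed"]


for wiki, s in samples.items():
    pct_quoted = 100 * s["quoted"] / s["sample"]
    pct_other = 100 * s["other"] / s["sample"]
    print(wiki, category_total(s) == s["quoted"],
          f"{pct_quoted:.2f}% quoted", f"{pct_other:.2f}% everything else")

print(well_formed_non_doi(samples["ptwiki"]))  # 209
print(well_formed_non_doi(samples["dewiki"]))  # 838
```

The categories sum to the "queries with quotes" row for all three wikis, and the inferred formula reproduces the 209 and 838 in the re-run rows.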

Method

I re-ran the unique quoted queries (ignoring the DOI queries for now) against the lab instance of ptwiki and dewiki, with "precise" slop (the value used when searching for "phrases in quotes") set to 0, 1, 2, and 50, and recorded the total number of hits that resulted from each query for each slop value. Note that 50 is probably not a plausible slop value, but it provides an upper bound for a very loose configuration.
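The re-run loop can be sketched as below. This is an outline under assumptions, not the script actually used: `search_total_hits` is a hypothetical callable standing in for whatever issues the query against the labs instance with a given slop and returns the total hit count, and `fake_search` is a stand-in used only so the sketch runs on its own.

```python
def collect_hits(queries, search_total_hits, slops=(0, 1, 2, 50)):
    """Run each unique query once per slop value and record total hits.

    `search_total_hits(query, slop)` is a hypothetical function supplied
    by the caller; it should return the total number of hits.
    Returns {query: {slop: total_hits}}.
    """
    unique_queries = list(dict.fromkeys(queries))  # de-duplicate, keep order
    return {
        query: {slop: search_total_hits(query, slop) for slop in slops}
        for query in unique_queries
    }


def fake_search(query, slop):
    """Made-up stand-in for a real search call; not real data."""
    return len(query.split()) * slop


hits = collect_hits(['"quick fox"', '"quick fox"', '"lazy dog" film'], fake_search)
print(hits)
```

Passing the search function in as a parameter keeps the loop independent of how the test wiki is actually queried.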

The current production configuration has zero slop, so I computed the differences in results compared to zero slop, both for zero-result queries (those returning no hits at zero slop), which we want to improve, and for all queries, since we don't want to make everything else appreciably worse.
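The comparison against zero slop amounts to computing a per-query delta and tallying it into buckets like those in the results tables. The sketch below assumes a `{query: {slop: total_hits}}` mapping; the function names are hypothetical, the "decrease" label is an added safeguard that does not appear in the tables, and the example data is made up.

```python
from collections import Counter

# Ranged buckets matching the row labels in the results tables.
RANGES = [(11, 20), (21, 30), (31, 40), (41, 50),
          (51, 100), (101, 500), (501, 1000), (1001, 10000)]


def bucket_delta(delta):
    """Map a change in total hits to a results-table row label."""
    if delta < 0:
        return "decrease"  # not expected when slop increases; kept as a safeguard
    if delta == 0:
        return "no_change"
    if delta <= 10:
        return f"+{delta}"
    for low, high in RANGES:
        if low <= delta <= high:
            return f"+{low} to +{high}"
    return "+10001 and up"


def tally_deltas(hits, slop, zero_only=False):
    """Count queries per bucket for one slop value, relative to slop 0.

    With zero_only=True, only queries with no hits at slop 0 are counted.
    """
    counts = Counter()
    for by_slop in hits.values():
        baseline = by_slop[0]
        if zero_only and baseline != 0:
            continue
        counts[bucket_delta(by_slop[slop] - baseline)] += 1
    return counts


# Made-up example data, not taken from the test.
example = {"q1": {0: 0, 1: 0}, "q2": {0: 0, 1: 3}, "q3": {0: 5, 1: 20}}
print(tally_deltas(example, slop=1))
print(tally_deltas(example, slop=1, zero_only=True))
```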

Results

The numbers below are for unique queries (to save CPU), so they are somewhat skewed (less so for ptwiki and more so for dewiki), but the trend is quite clear: for quoted zero-result queries on both wikis, the number of non-zero deltas in every category for slop values of 1 and 2 is in the single digits.

ptwiki

ptwiki showed no effect for slop of 1 or 2 on current zero queries, though there was a minimal increase in total hits returned for other queries with quotes.

all quoted queries ____ quoted zero queries
ptwiki slop=1 slop=2 slop=50 slop=1 slop=2 slop=50
no_change 186 180 165 151 151 147
+1 1 4 3 1
+2 3 3
+3 1
+4 1
+5 2
+6
+7 1
+8 1
+9 2 1
+10 1 1
+11 to +20 1 5 1
+21 to +30 1
+31 to +40 1
+41 to +50 2 1 1 1
+51 to +100 1 4 1
+101 to +500 3 3 3
+501 to +1000 3
+1001 to +10000 5

dewiki

dewiki returned more results for zero queries with slop set at 1 or 2, but the effect is small: only 7 and 14 queries, respectively, out of 362 unique zero-result queries.

all quoted queries ____ quoted zero queries
dewiki slop=1 slop=2 slop=50 slop=1 slop=2 slop=50
no_change 557 448 391 355 348 315
+1 38 51 32 4 8 16
+2 14 22 23 1 5
+3 12 13 13 1 3
+4 6 6 6 1
+5 5 16 8 1 1
+6 2 5 6 1
+7 3 7 4 1
+8 3 4 6 1
+9 1 3 6 1 1
+10 3 2 1
+11 to +20 8 13 21 1 1 4
+21 to +30 1 8 20 1 4
+31 to +40 2 8 16 1 2
+41 to +50 3 1 9
+51 to +100 1 3 29 2
+101 to +500 3 7 45 2
+501 to +1000 2 8 2
+1001 to +10000 3 3 12 1
+10001 and up 5
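The share of zero-result queries that gained at least one hit can be derived from the no_change rows of the two tables. In this sketch the dewiki total of 362 comes from the text; the ptwiki total of 151 is an inference from the statement that slop 1 and 2 had no effect on zero queries, so treat it as an assumption. The names are chosen for illustration.

```python
# "no_change" counts for quoted zero queries, copied from the tables above.
no_change = {
    "ptwiki": {1: 151, 2: 151, 50: 147},
    "dewiki": {1: 355, 2: 348, 50: 315},
}

# Unique zero-result queries per wiki (ptwiki value is inferred, see above).
zero_total = {"ptwiki": 151, "dewiki": 362}


def gained_hits(wiki, slop):
    """Number of zero-result queries that returned hits at this slop."""
    return zero_total[wiki] - no_change[wiki][slop]


for wiki in no_change:
    for slop in (1, 2, 50):
        n = gained_hits(wiki, slop)
        share = 100 * n / zero_total[wiki]
        print(f"{wiki} slop={slop}: {n} of {zero_total[wiki]} ({share:.1f}%)")
```

For dewiki this reproduces the 7 and 14 queries quoted above for slop 1 and 2.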

Pretty Pictures

I was hoping for some pretty scatter plots showing the changes in results for various slop values, but they were pretty disappointing. Here they are anyway.

Some data points for dewiki are not shown on the graph (see above), because the changes were extreme.

ptwiki

Changes in results for ptwiki with slop set to 1, 2, or 50. The straight line represents no change.

PT slop test 20 0-1.png PT slop test 20 0-2.png PT slop test 20 0-50.png

dewiki

Changes in results for dewiki with slop set to 1, 2, or 50. The straight line represents no change.

DE slop test 20 0-1.png DE slop test 20 0-2.png DE slop test 20 0-50.png

Conclusions

Overall, the effect of setting slop to 1 or 2 was minimal. There aren't that many queries with quotes, and most quoted zero queries are not affected by changes in slop, while a small number of non-zero quoted queries return additional, arguably less precise, results.

Given the differences in behavior between ptwiki and dewiki, it isn't clear that we can easily extrapolate to enwiki. However, there aren't that many quoted queries (outside of DOI, media player, and "title_1" AND "title 2" queries), so the maximum effect on human users will necessarily be small.

Additional Notes

DOI v. slop

DOI queries, as a practical matter, are not affected by slop. I ran 187 DOI queries from dewiki and 145 DOI queries from ptwiki with slop values of 0, 1, 2, 50, and 5000. All returned 0 hits in all cases.

To Do

I can run with higher slop values (3, 4, or 5) or gather quoted query stats for other wikis if anyone is interested.