User talk:TJones (WMF)/Notes/Relevance Lab

Relevance Lab
I would even point out a benefit over A/B tests here: with A/B tests we're unable to compare the behavior of the same query under two different configurations. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Query Sets
Is it intentional that the queries in B are the same as in A, just without quotes? DCausse (WMF) (talk) 12:25, 20 October 2015 (UTC)

Query Mungers
It's not clear to me why you highlight query mungers here. Isn't that the whole purpose of the relevance lab: testing various techniques for running a query? IMHO we should not alter user queries outside Cirrus. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Targeted Query Sets
This is extremely interesting; we could even go further and tag queries with labels like: ambiguous queries, poor-recall queries, questions, queries with quotes, queries with special syntax. Some algorithms we implement will certainly address only a specific use case; e.g. the query kennedy is extremely ambiguous, and today the results on enwiki are IMHO quite bad. The current work with pageviews will directly affect this kind of query. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Concerning this section, it's a bit confusing. I'd suggest that we have one corpus of queries (annotated with descriptive tags that let us filter out a specific sub-corpus). But instead of talking about query sets A or B, I would suggest talking only about configs A, B, C... The queries should remain the same; the fact that a query can be munged should be addressed in a config. We could even go a bit further and start to define what a config is:

A config is all the aspects of the system that affect the results:

 * Index time
   * Document model
   * Parser: how we parse wikitext and extract the opening text and other subfields
   * Analysis config: tokenization, stopwords, accent squashing...
   * Mapping: how we map the document model to the analysis config, reverse fields...
 * Query time
   * Query parser: currently we have only one parser and one syntax (the special syntax)
   * Query-dependent scores
     * Fulltext to Elastic DSL: how we map the fulltext query to the elasticsearch DSL; today we have:
       * the default query string with the AND operator and various weights
       * the Common Terms Query with 4 profiles
     * Initial score: the core function that gives the document its initial score according to the user query; today we use the default (simple TF.IDF) but we could switch to other formulas like BM25.
     * Rescores: reorder the top-N docs, allowing more complex algorithms to be applied to a limited number of docs
       * Query-dependent rescores: currently we only have the phrase rescore on the top 512 results, which will over-boost results where the input query appears in exactly the same order in the doc.
       * Query-independent rescores: scores that depend only on the document. Today we have:
         * incoming links (activated by default)
         * boost templates (activated by default on enwiki)
         * recency (activated by default on wikinews)
 * Fallback methods
   * "Did you mean" suggestions
   * Lang detect + query against another wiki
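The dimensions above could be bundled into a single config object that a lab run takes as a parameter. A minimal sketch of what that might look like, assuming a simple dict representation (all field names here are illustrative, not actual Cirrus code):

```python
# Hypothetical sketch: one "config" bundling the index-time and query-time
# aspects listed above. Every key and value here is illustrative.
config_a = {
    "index_time": {
        "document_model": "default",
        "parser": "wikitext-opening-text",
        "analysis": {"tokenizer": "standard", "stopwords": "en", "accent_squashing": True},
        "mapping": {"reverse_fields": True},
    },
    "query_time": {
        "query_parser": "special-syntax",
        "fulltext_to_dsl": "query_string_and",   # or "common_terms"
        "initial_score": "tf_idf",               # could be "bm25"
        "rescores": {
            "phrase_rescore_window": 512,
            "query_independent": ["incoming_links", "boost_templates"],
        },
    },
    "fallbacks": ["did_you_mean", "lang_detect"],
}
```

Comparing runs then means holding the query corpus fixed and varying only this object.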

If I want to work on a query-independent rescore like pagerank/pageviews, I would first select the queries annotated "ambiguous", because I know these are the queries where the rescore will have the most impact.

So here, removing quotes is either a query parser variation or a fallback method, but not a query set. DCausse (WMF) (talk) 12:25, 20 October 2015 (UTC)
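The workflow described above — one annotated corpus, filtered by tag, run under several configs — can be sketched as follows. The corpus entries and tag names are invented for illustration:

```python
# Hypothetical sketch: a single annotated query corpus, filtered by tag.
# A/B are configs, not query sets; the queries themselves never change.
corpus = [
    {"query": "kennedy", "tags": {"ambiguous"}},
    {"query": '"exact phrase"', "tags": {"with_quotes"}},
    {"query": "who invented the telephone", "tags": {"questions"}},
]

def sub_corpus(corpus, tag):
    """Select the sub-corpus relevant to the technique under test."""
    return [entry["query"] for entry in corpus if tag in entry["tags"]]

ambiguous = sub_corpus(corpus, "ambiguous")
# for config in (config_a, config_b):   # same queries, different configs
#     run_queries(ambiguous, config)    # run_queries is hypothetical
```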

Gold Standard Corpus
What's the meaning of SME? DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Not directly related but similar: SearchWiki, a failed/abandoned attempt by Google to allow custom ranking of search results. The interface is interesting: "SearchWiki: make search your own". DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

During the offsite we tried to extract such queries from Hive, but we were unable to filter out bot queries. We should not give up; we should try to find alternatives (filter by referer, don't look at the top-N...). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
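The alternatives mentioned above could start as simple heuristics over the raw log rows. A rough sketch, assuming illustrative field names (real rows would come from Hive with whatever schema the logs actually have):

```python
import re

# Hypothetical sketch of heuristic bot filtering on search log rows.
# Field names ("referer", "user_agent") and patterns are assumptions.
BOT_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def looks_human(row):
    """Keep rows with a plausible browser referer and a non-bot user agent."""
    return (row.get("referer", "").startswith("https://")
            and not BOT_UA.search(row.get("user_agent", "")))

rows = [
    {"query": "kennedy", "referer": "https://en.wikipedia.org/", "user_agent": "Mozilla/5.0"},
    {"query": "kennedy", "referer": "", "user_agent": "ExampleBot/1.0"},
]
human = [row for row in rows if looks_human(row)]
```

This would not catch well-disguised bots, which is why skipping the very top-N queries (where bot traffic concentrates) may also be needed.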

Modified Indexes
There are two variations of this one:
 * Same document model but a change in the mapping (reverse fields): an in-place reindex is needed. We should just need to adapt our maintenance scripts to allow the creation of multiple indices (today the script fails if it detects two versions of the content index).
 * A change in the document model or in the way we parse the input data (a new field, a better parser for opening-text extraction, ...). I have no clue here; it requires access to the prod DB and will perform a full rebuild (5 days for enwiki). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Effectiveness
F-scores are simple and efficient, but they're a bit on/off and won't evaluate the ordering. It'd be great to have multiple evaluation formulas. We should maybe take some time to look at other evaluation techniques, to make sure we have the required data in the gold corpus: normalized DCG (nDCG) seems very interesting. Spending 2-3 hours reading some papers could give us more ideas. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
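To make the ordering-sensitivity point concrete, here is a minimal nDCG sketch. It assumes the gold corpus stores graded relevance judgments per result; a common variant is shown (normalizing by the ideal ordering of the same truncated list):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: higher-ranked docs are discounted less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """Normalized DCG: DCG of the ranking divided by DCG of the ideal ordering."""
    if k is not None:
        ranked_relevances = ranked_relevances[:k]
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded judgments (e.g. 3=vital .. 0=irrelevant) in the order a config returned them:
perfect = ndcg([3, 2, 1, 0])   # ideal ordering -> 1.0
swapped = ndcg([3, 2, 0, 1])   # relevant doc below an irrelevant one -> < 1.0
```

Unlike a plain F-score, swapping two results changes the score even when the set of retrieved documents is identical.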

Performance
Yes, it will allow us to send an early warning if perf is really bad, but I'm not sure we will be able to do real perf tests with the lab. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)