User talk:TJones (WMF)/Notes/Relevance Lab

Relevance Lab
I would even point out a benefit over A/B tests here: with A/B tests we're unable to compare the behavior of the same query under 2 different configurations. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)


 * Good point! I think that's partially why the slop A/B test showed the crazy results it did. The noise in the different sets of queries overwhelmed the very small signal in the small number of queries with quotes that could be affected. TJones (WMF) (talk) 14:24, 20 October 2015 (UTC)

Query Sets
Is it intentional that the queries in B are the same as in A, but without quotes? DCausse (WMF) (talk) 12:25, 20 October 2015 (UTC)
 * Yep—see reply in Query Mungers below. TJones (WMF) (talk) 14:55, 20 October 2015 (UTC)

Query Mungers
It's not clear to me why you highlight query mungers here. Isn't that the whole purpose of the relevance lab: testing various techniques for running a query? IMHO we should not alter user queries outside Cirrus. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Sometimes we will be testing ideas, not their implementation, and sometimes the most efficient way of testing an idea, esp. when you don't know whether it will work, is to hack it together and see what happens. A real implementation of quote stripping would need to be smarter (don't strip the double quote if it's the only thing in the query), and would have to figure out how to merge results if we run it on queries that already get some results. But as a first test, we could run a targeted set of zero-results queries that had quotes, with the quotes stripped, to test the effect on the queries that need the most help. Just stripping quotes in a text file is much easier than anything else (see the sketch after this list).
 * A more realistic use case would be query translation (e.g., translating queries to English to search on enwiki). Assuming human-level language detection (i.e., because we did it manually) and good machine translation (e.g., using Google, Bing, and Babelfish translations as tests), what's the impact? If that only gets one more query out of 1000 to return results, maybe it's not worth it because we'll probably do worse than that in real life (with less accurate language detection and probably not state of the art machine translation), and testing manually created alternate queries is so much easier than finding and integrating a machine translation library, esp. if it turns out not to be worth it.
 * Query mungers would fall between manually created alternate queries and fully integrated options in Cirrus. If you have something that's easy to run externally to Cirrus (like a machine translation library, or a library for extracting contentful keywords from questions), then it's easy to test without going to the trouble of (or knowing how to) integrate that library into Cirrus. Long term, this would also allow people not familiar with Cirrus, like Community Volunteers, to test ideas/algorithms without having to integrate them.
 * If this makes sense, I'll integrate some of it into the main text. TJones (WMF) (talk) 14:55, 20 October 2015 (UTC)
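As a concrete illustration of the "hack it together" approach above, here is a minimal sketch of an external quote-stripping munger that just rewrites a text file of queries, one per line. The file names and the exact quote-handling rule are placeholders, not an actual Relevance Lab interface.
<syntaxhighlight lang="python">
def strip_quotes(query):
    """Remove double quotes unless the query is nothing but quotes and whitespace."""
    stripped = query.replace('"', '')
    return query if not stripped.strip() else stripped

def munge_query_file(in_path, out_path):
    """Read one query per line and write the munged version out."""
    with open(in_path, encoding='utf-8') as src, open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(strip_quotes(line.rstrip('\n')) + '\n')

# e.g., munge_query_file('zero_results_with_quotes.txt', 'zero_results_quotes_stripped.txt')
</syntaxhighlight>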

Targeted Query Sets
This is extremely interesting, we could even go further and tag queries with labels like: ambiguous query, poor-recall query, question, with quotes, with special syntax. Some algorithms we'll implement will certainly address only a specific use case; e.g., the query kennedy is extremely ambiguous and today the results on enwiki are IMHO quite bad. The current work with pageviews will directly affect this kind of query. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
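For instance, the annotated corpus could be as simple as a list of queries with free-form tags, filtered down to a sub-corpus when testing a specific technique. The tag names and structure below are only illustrative, not a proposed schema.
<syntaxhighlight lang="python">
# Hypothetical annotated corpus: each query carries descriptive tags.
corpus = [
    {"query": "kennedy", "tags": {"ambiguous"}},
    {"query": '"some exact phrase"', "tags": {"with quotes", "zero results"}},
    {"query": "who invented the telephone", "tags": {"question"}},
    {"query": "intitle:foo insource:bar", "tags": {"special syntax"}},
]

def sub_corpus(queries, *required_tags):
    """Select the queries carrying all of the requested tags."""
    wanted = set(required_tags)
    return [entry for entry in queries if wanted <= entry["tags"]]

ambiguous_queries = sub_corpus(corpus, "ambiguous")
</syntaxhighlight>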

Concerning this section, it's a bit confusing. I'd suggest that we have a single corpus of queries (annotated with descriptive tags allowing us to select a specific sub-corpus). But instead of talking about query sets A or B, I would suggest talking about configs A, B, C... only. The queries should remain the same; the fact that a query can be munged should be addressed in a config. We could even go a bit further and start to define what a config is:

A config is all the aspects of the system that affect the results:


 * Index time
   * Document model
   * Parser: how we parse wikitext, extract the opening text and other subfields
   * Analysis config: tokenization, stopwords, accent squashing...
   * Mapping: how we map the document model onto the analysis config, reverse field...
 * Query time
   * Query Parser: currently we have only one parser and one syntax (special syntax)
   * Query dependent scores
     * Fulltext to ElasticDSL: how we map the fulltext query to the elasticsearch DSL (see the sketch after this list); today we have:
       * Default query string with AND operator and various weights
       * Common Terms Query with 4 profiles
     * Initial score: the core function that gives the initial score to a document according to the user query; today we use the default (simple TF.IDF) but we could switch to other formulas like BM25.
   * Rescores: reorder the top-N docs, allowing more complex algorithms to be applied to a limited number of docs
     * Query dependent rescores: currently we only have the phrase rescore on the top 512 results, which will over-boost results where the input query's terms appear in exactly the same order in the doc.
     * Query independent rescores: scores that depend only on the document. Today we have:
       * incoming links (activated by default)
       * boost templates (activated by default on enwiki)
       * recency (activated by default on wikinews)
 * Fallback methods
   * Did you mean suggestions
   * Lang detect + query against another wiki
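Roughly, the two fulltext-to-ElasticDSL options listed above correspond to query shapes like the following. The field names, weights, and cutoff frequency are placeholders for illustration, not the exact values Cirrus generates.
<syntaxhighlight lang="python">
# Sketch of the two fulltext query shapes as Python dicts (Elasticsearch DSL).
query_string_and = {
    "query_string": {
        "query": "some user query",
        "fields": ["title^2", "text"],   # "various weights" (placeholder values)
        "default_operator": "AND",
    }
}

common_terms = {
    "common": {
        "text": {
            "query": "some user query",
            "cutoff_frequency": 0.001,   # one of the knobs a profile would set
            "low_freq_operator": "and",
        }
    }
}
</syntaxhighlight>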

If I want to work on a query-independent rescore like pagerank/pageviews, I would first select the queries annotated "ambiguous", because I know that it's on these queries that the rescore will have the most impact.

So here removing quotes is either a Query Parser variation or a fallback method, but not a query set. DCausse (WMF) (talk) 12:25, 20 October 2015 (UTC)
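To make that concrete, a config could be a small descriptor naming a choice for each of the aspects above, so quote stripping lives in the query-parser (or fallback) slot rather than in the query set. The keys and values here are hypothetical labels, not an existing Relevance Lab format.
<syntaxhighlight lang="python">
# Hypothetical config descriptors: the query corpus stays fixed; only configs vary.
CONFIG_A = {
    "query_parser": "default",
    "fulltext_dsl": "query_string_and",
    "rescores": ["phrase", "incoming_links"],
    "fallbacks": ["did_you_mean"],
}

CONFIG_B = {
    "query_parser": "strip_quotes",   # the quote-stripping variation, expressed as config
    "fulltext_dsl": "query_string_and",
    "rescores": ["phrase", "incoming_links"],
    "fallbacks": ["did_you_mean", "lang_detect_other_wiki"],
}
</syntaxhighlight>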

Gold Standard Corpus
What's the meaning of SME? DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

Not directly related but similar: SearchWiki, a failed/abandoned attempt by Google to allow custom ranking of search results; the interface is interesting: SearchWiki: make search your own. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)

During the offsite we tried to extract such queries from Hive, but we were unable to filter out the bot queries. We should not give up, and should try to find alternatives (filter by referer, do not look only at the top N, ...). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
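As a starting point, even crude heuristics on the extracted rows might cut out a lot of bot traffic. The field names and heuristics below are made up for illustration; the real filtering would run against whatever columns the Hive tables actually expose.
<syntaxhighlight lang="python">
# Hypothetical heuristics for dropping likely-bot rows from an extracted query log.
BOT_UA_HINTS = ("bot", "crawler", "spider", "curl", "wget", "python-requests")

def looks_like_bot(record):
    """record is a dict with (assumed) 'user_agent' and 'referer' fields."""
    user_agent = (record.get("user_agent") or "").lower()
    if any(hint in user_agent for hint in BOT_UA_HINTS):
        return True
    # Requests with no referer at all are suspicious for on-wiki search traffic.
    if not record.get("referer"):
        return True
    return False

def filter_bots(records):
    return [r for r in records if not looks_like_bot(r)]
</syntaxhighlight>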

Modified Indexes
There are 2 variations of this one:
 * 1) same document model but a change in the mapping (reverse field): an in-place reindex is needed. We would just need to adapt our maintenance scripts to allow the creation of multiple indices (today the script will fail if it detects 2 versions of the content index); a rough sketch of this case follows below.
 * 2) a change in the document model or the way we parse input data (new field, better parser for opening text extraction, ...). I have no clue here; it requires access to the prod DB and would perform a full rebuild (5 days for enwiki). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
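For the first variation, a throwaway sketch with the elasticsearch-py client could look something like this: create a second copy of the content index with the modified analysis/mapping and reindex the existing documents into it. The index names, type name, and mapping below are placeholders, not Cirrus's real schema.
<syntaxhighlight lang="python">
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a local test cluster

# Hypothetical variant of the content index that adds a reverse title field.
variant_index = {
    "settings": {
        "analysis": {
            "analyzer": {
                "reverse_text": {"tokenizer": "standard", "filter": ["lowercase", "reverse"]},
            }
        }
    },
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "string"},
                "title_reverse": {"type": "string", "analyzer": "reverse_text"},
            }
        }
    },
}

es.indices.create(index="enwiki_content_variant", body=variant_index)

# Client-side reindex of the existing documents into the variant index.
helpers.reindex(es, source_index="enwiki_content", target_index="enwiki_content_variant")
</syntaxhighlight>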

Effectiveness
Fx scores are simple and efficient, but they're a bit on/off and won't evaluate the ordering. It'd be great to have multiple evaluation formulas. We should maybe take some time to look at other evaluation techniques to make sure we have the required data in the gold corpus: normalized DCG seems very interesting. Spending 2-3 hours reading some papers could give us more ideas. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
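For reference, normalized DCG is straightforward to compute once the gold corpus has graded (rather than binary) relevance judgments, which is the extra data it would require. A minimal sketch, with made-up judgment values:
<syntaxhighlight lang="python">
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance judgments."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ordering."""
    actual = dcg(ranked_relevances[:k])
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical 0-3 judgments for the top results returned by one config:
print(ndcg([3, 2, 3, 0, 1, 2], k=5))  # ≈ 0.86: good but imperfect ordering
</syntaxhighlight>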

Performance
Yes, it will allow us to send an early warning if perf is really bad, but I'm not sure we will be able to do real perf tests with the lab. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)