User talk:TJones (WMF)/Notes/Relevance Lab

Relevance Lab
I would even point out a benefit over A/B tests here: with A/B tests we're unable to compare the behavior of the same query under 2 different configurations. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)


 * Good point! I think that's partially why the slop A/B test showed the crazy results it did. The noise in the different sets of queries overwhelmed the very small signal in the small number of queries with quotes that could be affected. TJones (WMF) (talk) 14:24, 20 October 2015 (UTC)

Query Sets
Is it intentional that the queries in B are the same as in A, but without quotes? DCausse (WMF) (talk) 12:25, 20 October 2015 (UTC)
 * Yep—see reply in Query Mungers below. TJones (WMF) (talk) 14:55, 20 October 2015 (UTC)

Query Mungers
It's not clear to me why you highlight query mungers here. Isn't that the whole purpose of the relevance lab: testing various techniques for running a query? IMHO we should not alter user queries outside Cirrus. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Sometimes we will be testing ideas, not their implementation, and sometimes the most efficient way of testing an idea, esp. when you don't know whether it will work, is to hack it together and see what happens. A real implementation of quote stripping would need to be smarter (don't strip the double quote if it's the only thing in the query), and would have to figure out how to merge results if we run it on queries that get some results. But as a first test, we could run a targeted set of zero-results queries that had quotes, with the quotes stripped, to test the effect on the queries that need the most help. Just stripping quotes in a text file is much easier than anything else.
 * A more realistic use case would be query translation (e.g., translating queries to English to search on enwiki). Assuming human-level language detection (i.e., because we did it manually) and good machine translation (e.g., using Google, Bing, and Babelfish translations as tests), what's the impact? If that only gets one more query out of 1000 to return results, maybe it's not worth it because we'll probably do worse than that in real life (with less accurate language detection and probably not state of the art machine translation), and testing manually created alternate queries is so much easier than finding and integrating a machine translation library, esp. if it turns out not to be worth it.
 * Query mungers would fall between manually created alternate queries and fully integrated options in Cirrus. If you have something that's easy to run externally to Cirrus (like a machine translation library, or a library for extracting contentful keywords from questions), then it's easy to test without going to the trouble of (or knowing how to) integrate that library into Cirrus. Long term, this would also allow people not familiar with Cirrus, like Community Volunteers, to test ideas/algorithms without having to integrate them.
 * If this makes sense, I'll integrate some of it into the main text. TJones (WMF) (talk) 14:55, 20 October 2015 (UTC)
 * Good point, so this is a way to optimize the process by allowing quick experimentation. But how will we be able to diff two query sets? I mean, how do we link queries between sets? DCausse (WMF) (talk) 07:16, 21 October 2015 (UTC)
 * Queries to be run are in a text file, so if you run two different text files, the queries would be diffed in the order they appear in the text file. TJones (WMF) (talk) 14:07, 21 October 2015 (UTC)
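To make the discussion above concrete, here is a minimal sketch of what an external quote-stripping munger plus a line-order diff could look like. This is purely illustrative: the function names, file handling, and the idea of representing each run as a list of per-query result counts are all assumptions, not the real Cirrus tooling.

```python
# Hypothetical sketch: an external query "munger" (quote stripping) and a
# diff that links queries between two runs purely by line order.

def strip_quotes(query):
    """Naively strip double quotes, unless the query is nothing but quotes."""
    stripped = query.replace('"', '')
    return query if not stripped.strip() else stripped

def munge_file(in_path, out_path):
    """Write a munged copy of a query file, one query per line."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            dst.write(strip_quotes(line.rstrip('\n')) + '\n')

def diff_by_line(results_a, results_b):
    """Pair per-query result counts from two runs by line order;
    return (line_number, count_a, count_b) for queries that changed."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(results_a, results_b))
            if a != b]
```

Because both runs read the same (or parallel) text files top to bottom, line position alone is enough to link a query in set A to its munged counterpart in set B.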

Targeted Query Sets
This is extremely interesting; we could even go further and tag queries with labels like: ambiguous query, poor-recall query, question, with quotes, with special syntax. Some algorithms we implement will certainly address only a specific use case, e.g. the query kennedy is extremely ambiguous and today the results on enwiki are IMHO quite bad. The current work with pageviews will directly affect this kind of query. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * That's a cool idea! The easiest way to get something equivalent to tagging is just to have different kinds of queries in different files, but of course that's not the same as having, for example, a regression test that has different kinds of queries in it that are tagged, and reports that show you which tags were most affected by your proposed change. I'll add this in, but it'll be lower priority (since it can be crudely simulated by having different query files).
 * Oh, one other thing—an advantage of targeted query sets over tagged query sets is speed during development. If you are only working on queries with quotes, you can save 99% of your runtime by having a targeted query set. But for regression testing or just exploring the effect of a proposed change, tags would be very cool! TJones (WMF) (talk) 14:47, 21 October 2015 (UTC)
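One way the tagging idea above could work in practice: keep the existing one-query-per-line files, but allow optional tab-separated tags after each query. The file format and function names here are hypothetical, just a sketch of how tags could also simulate targeted query sets by filtering.

```python
# Hypothetical tagged query file format: query, then optional tab-separated
# tags, e.g.:  kennedy<TAB>ambiguous   or   "exact phrase"<TAB>quotes

def load_tagged_queries(lines):
    """Parse lines into (query, set_of_tags) pairs; untagged queries get an empty set."""
    queries = []
    for line in lines:
        query, *tags = line.rstrip('\n').split('\t')
        queries.append((query, set(tags)))
    return queries

def select(queries, tag):
    """Simulate a targeted query set by filtering on a single tag."""
    return [q for q, tags in queries if tag in tags]
```

With this, a developer could run only `select(queries, 'quotes')` while iterating, and the full tagged file for regression reports.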

Gold Standard Corpus
What's the meaning of SME? DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Subject Matter Expert. I've added the definition to the first usage. TJones (WMF) (talk) 15:34, 21 October 2015 (UTC)

Not directly related but similar: SearchWiki, a failed/abandoned attempt from Google to allow custom ranking of search results; the interface is interesting: SearchWiki: make search your own DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Interesting. Looks like they replaced it with something simpler: Google Stars, but that may also have gone away. TJones (WMF) (talk) 15:34, 21 October 2015 (UTC)

During the offsite we tried to extract such queries from Hive, but we were unable to filter out bot queries. We should not give up; let's try to find alternatives (filter by referer, don't look at the top N, ...). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * If we start with a random sample, there won't necessarily be a ton of bot traffic, and we can filter it out, some automatically, and some manually. Part of the annotation process might include noting that "hfdkhfkskkkkkkkkkkkkkkkkkkkkkkkk" is not a real query. TJones (WMF) (talk) 15:34, 21 October 2015 (UTC)

Modified Indexes
There are two variations of this one:
 * 1) Same document model but a change in the mapping (e.g., the reverse field): an in-place reindex is needed. We should just need to adapt our maintenance scripts to allow the creation of multiple indices (today the script fails if it detects 2 versions of the content index).
 * 2) A change in the document model or in the way we parse input data (new field, better parser for opening-text extraction, ...). I have no clue here; it requires access to the prod DB and will need a full rebuild (5 days for enwiki). DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)


 * This is better than my description, so I've copied it in. In the second case, couldn't we do it the same way as the first, by having an alternate index with whatever features we want? In the imaginary world of infinite resources, we could have multiple alternate indexes and you could specify them at run time. In the real world, perhaps we could support one alternate index at a time, always leaving the production-clone index available to anyone who needs it. We'd replace the one (?) alternate index we could support only rarely, and only after making sure it isn't still needed. TJones (WMF) (talk) 16:11, 21 October 2015 (UTC)

Effectiveness
F-scores are simple and efficient, but they're a bit on/off and don't evaluate ordering. It'd be great to have multiple evaluation formulas. We should maybe take some time to look at other evaluation techniques to make sure we have the required data in the gold corpus: normalized DCG seems to be very interesting. Spending 2–3 hours reading some papers could give us more ideas. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Of course! Any metrics we like can be included. F-measure was just a placeholder (I actually don't like it all that much), but the little "etc." on the end wasn't obvious enough, so I've added your suggestions. The hard part is going to be creating the corpus to run it against. TJones (WMF) (talk) 16:43, 21 October 2015 (UTC)
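For reference, normalized DCG is small enough to sketch directly. This assumes the gold corpus provides a graded relevance score for each result position; the log2 discount is the common textbook form, not a decision anyone here has made.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the common log2(position + 1) discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (best-possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Unlike an on/off F-score, this rewards putting the most relevant results first: a perfectly ordered list scores 1.0, and the score drops as good results slide down the ranking.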

Performance
Yes, it will allow us to send an early warning if perf is really bad, but I'm not sure we will be able to do real perf tests with the lab. DCausse (WMF) (talk) 11:31, 20 October 2015 (UTC)
 * Yeah, it's just a sanity check. As I mentioned, lots of irrelevant things could cause performance issues. TJones (WMF) (talk) 16:44, 21 October 2015 (UTC)

Minimum Viable Product (MVP)
For the UI, could we build a MediaWiki extension for this purpose, or do we have to build a product outside MW? Leaving out the problem of index changes, isn't it just running runSearch with a config file and a query file? Running everything in the same MW instance would save extra work by avoiding all the ssh boilerplate. On the other hand, having an external tool would allow the lab to work with multiple MW workers... DCausse (WMF) (talk) 08:06, 21 October 2015 (UTC)

Today runSearch just outputs the number of results; we should maybe create a task to update this tool to allow more detailed output. DCausse (WMF) (talk) 08:06, 21 October 2015 (UTC)

What's the easiest solution for reports: doing everything with txt/xml outputs, or would it make sense to just save the raw results of each run in a DB? Storing raw results in a DB would allow more flexibility for future evolution, and sounds easier than dealing with files and directories (maybe I'm wrong). DCausse (WMF) (talk) 08:06, 21 October 2015 (UTC)
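As a point of comparison for the DB-vs-files question, here is a minimal SQLite sketch of what "raw results in a DB" could look like. The schema, column names, and sample values are all invented for illustration; the point is that keyed raw results can be re-queried for new report types without re-running the searches.

```python
import sqlite3

# Hypothetical schema: one row per (run, query), keeping the raw result
# payload so future reports can be regenerated without re-running searches.
conn = sqlite3.connect(':memory:')  # a real file path in actual use
conn.execute('''CREATE TABLE results (
    run_id      TEXT,
    query       TEXT,
    num_results INTEGER,
    raw_json    TEXT,
    PRIMARY KEY (run_id, query))''')
conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
             ('config-A', 'kennedy', 42, '{"hits": []}'))
conn.commit()

# A diff report then becomes a query instead of a directory walk.
row = conn.execute(
    "SELECT num_results FROM results WHERE run_id = ? AND query = ?",
    ('config-A', 'kennedy')).fetchone()
```

The files-and-directories alternative would need a naming convention playing the role of the primary key here, which is roughly the bookkeeping the comment above is hoping to avoid.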

More improvements!
In the same vein as Diff and Stats, I'd really like to have the possibility to tag each query in the corpus. This could be very useful for the developer, and would allow per-label details in reports, e.g.:

Query Set XYZ Config A diff report (vs. Config B): 1% change in the results, details:
 * 1) ambiguous: 0%
 * 2) gibberish: 88% (warning!)
 * 3) questions: 0.1%

This would also allow the developer to extract a subset of queries when he wants to focus on a particular task. I know we could do this with different query sets, but I think it would be more flexible to have tags at the query level. DCausse (WMF) (talk) 08:06, 21 October 2015 (UTC)
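A per-tag breakdown like the one above is cheap to compute once queries carry tags. The sketch below assumes a made-up row shape of (query, tags, results_a, results_b); nothing here reflects the real report tooling.

```python
from collections import defaultdict

def per_tag_change(rows):
    """Given (query, tags, results_a, results_b) rows, return the percentage
    of each tag's queries whose results changed between the two configs."""
    total = defaultdict(int)
    changed = defaultdict(int)
    for query, tags, a, b in rows:
        for tag in tags:
            total[tag] += 1
            if a != b:
                changed[tag] += 1
    return {tag: 100.0 * changed[tag] / total[tag] for tag in total}
```

A report generator could then flag any tag whose change percentage crosses a threshold (like the "gibberish: 88% (warning!)" line above).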