If a change is made to the search algorithm (e.g., deciding to strip question marks from inputs or turn underscores into spaces), how is that change's impact on search result quality measured? Is there a test suite that says "Obama" on the English Wikipedia should return X in the top five results? Or a search for "kidney" should return greater than X results on the Spanish Wikibooks or whatever? Where is this test suite and can users contribute tests to it?
Talk:Wikimedia Discovery/So Many Search Options
How is search being tested?
It depends on what we're testing. We don't quite have the manually curated test suite you are describing, but we have some things that are close. There are a number of problems with such a test suite, which we've been working to overcome for more than a year now. The biggest one is that it tests what the curators think should be tested—which is not always sufficiently representative, and can lead to unexpected side effects elsewhere. Some of the tools we use include (find links and other info on the bolded terms in the Glossary):
There's the Relevance Forge (formerly "Relevance Lab"—but too many things have "lab" in their name). There are RelForge servers set up that allow us to host multiple copies of wiki indexes with different indexing schemes, or, more simply, we can query the same index with different configurations (weighting schemes, query types, etc.—anything that doesn't require re-indexing). RelForge can measure the zero results rate (ZRR), the "poorly performing" rate (fewer than 3 results), query re-ordering, and a few other automatically computable metrics. It is possible to manually review the changed results, too. We can run RelForge against targeted corpora (e.g., queries with question marks) or general regression test corpora. It's a bit of a blunt instrument, but it makes it very easy to assess the maximum impact of a change quickly by seeing what percentage of a regression test set has changes. If it's unexpectedly high, we know something "interesting" is going on that warrants further investigation; when it's low, we can look at all the changed examples and try to figure out what's going on.
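The two corpus-level rates mentioned above are simple to compute. Here's a minimal sketch, with the per-query result counts faked so the example is self-contained:

```python
# Hypothetical sketch of the two corpus-level metrics described above.
# In practice the counts would come from running each regression-corpus
# query against a search backend; here they are hard-coded.

def zero_results_rate(result_counts):
    """Fraction of queries in a corpus that returned no results."""
    return sum(1 for n in result_counts if n == 0) / len(result_counts)

def poorly_performing_rate(result_counts, threshold=3):
    """Fraction of queries that returned fewer than `threshold` results."""
    return sum(1 for n in result_counts if n < threshold) / len(result_counts)

# One result count per query in a small fake regression corpus.
counts = [0, 1, 2, 5, 12, 0, 7, 3]
print(zero_results_rate(counts))       # 2/8 = 0.25
print(poorly_performing_rate(counts))  # 0, 1, 2, 0 fall below 3 -> 0.5
```

Comparing these rates before and after a configuration change is what gives the "what percentage of the test set changed" signal described above.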
RelForge also has an optimizer that will take a number of numerical configuration parameters and a range of values for each and do a grid search over the parameter space to find the best config, based on PaulScore, which is derived from actual user click data. (See the Glossary for more on PaulScore, which is a name we gave an otherwise unnamed metric proposed by Paul Nelson.)
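As we understand the metric, each query contributes the sum of F^position over the result positions the user clicked, and the corpus score is the mean over queries; a hedged sketch (the 0-based position indexing here is an assumption, so check the Glossary for the exact convention):

```python
# Hedged sketch of a PaulScore-style metric: F (0 < F < 1) discounts clicks
# on lower-ranked results. `clicks_per_query` is a list of lists of clicked
# result positions, one inner list per query. Indexing convention assumed.

def paulscore(clicks_per_query, f=0.7):
    per_query = [sum(f ** pos for pos in clicks) for clicks in clicks_per_query]
    return sum(per_query) / len(per_query)

# Two queries: one click on the top result; clicks at positions 1 and 3.
print(paulscore([[0], [1, 3]], f=0.5))  # (1.0 + 0.625) / 2 = 0.8125
```

A grid-search optimizer like the one described would simply evaluate this score for each candidate configuration and keep the best.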
For English Wikipedia, we are trying to emulate Google, Bing, and others with Discernatron, which allows users to manually review and rank results. Unfortunately it's mind-numbingly tedious to do the ranking (I speak from experience) and we don't have a lot of data because we aren't paying people to do it. However, we have a moderately sized test set that we can use to test proposed changes and to calibrate other less manual-work–intensive methods, which we hope to then generalize to other projects. We use the Discernatron scores to calculate discounted cumulative gain (DCG) values before and after changes. The DCG scores are based on the ordering of results compared to the manually scored results.
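The DCG calculation itself is standard; a minimal sketch using hypothetical rater scores in place of real Discernatron data:

```python
import math

# Standard discounted cumulative gain: manual relevance scores, listed in
# the order the engine returned the documents, discounted by log2 of
# position. Higher DCG means better-rated documents were ranked higher.

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

before = [1, 3, 2]  # rater scores in the old config's result order
after = [3, 2, 1]   # same documents, reordered by a proposed change
print(dcg(before), dcg(after))  # the "after" ordering scores higher
```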
We've also been experimenting with other clickstream metrics, in particular using dynamic Bayesian networks (DBN). The DBN model uses a lot more aggregate data, which makes it more precise, but it requires a certain number of people to have looked at equivalent queries (ignoring case, extra spaces, etc.), which means it must ignore the long tail. We've been comparing it to the Discernatron results; it compares favorably, and requires no additional manual rating work. We've also just started experiments with Learning to rank, but those are still early days.
We also use A/B tests when there's no other good way to test things, or we aren't fully confident in other measures. The data is collected and analyzed by our data analysts, and the analysis write-ups are on GitHub and Commons (for example, this one). They use a lot of very cool and high-powered statistical techniques to tease out what's going on, including ZRR, PaulScore, and lots, lots more.
For individual search improvements, we'd use whichever of these methods seems most helpful at the time, depending on the scope and complexity of the improvements being evaluated. For the overall problem of showing results from so many second-try search options, I expect we'll use A/B tests to see what display method and ordering has the most user engagement and refine that over time, since there's no a priori way to determine the best ordering of additional results.
I've adapted this to its own wiki page, so we can point people to a more general answer in the future, and a more up-to-date answer in the more distant future.
Wow, thank you for the very thorough answer! There's quite a bit to process here.
- "Did you mean" part of DYM suggestion is GUI, I don't think we should put it in API response.
- BC is a pain, but I am still inclined toward it by default. In this case it means that if we have a suggestion, we should keep "suggestion", and if we rewrote the query, we should keep "rewrittenquery". I am open to being convinced otherwise, though. In any case, we should ask people actually using the API (mobile team? others?) what they think.
- I would really like to not have numbers in case all languages are different. I am aware this is a bit misleading and may be extra work in case they aren't, but that makes 99% of frequent cases much easier to handle. At the expense of breakage potential in 1% of other cases.
- I am not sure how the rewrite scenario is supposed to work now. Previously, if the original query returned no results, we'd use the rewritten one (if allowed) and return "rewrittenquery". Now what would the original search return? Nothing? That may not be helpful, especially as one doesn't know which additional result is the rewritten one before scanning all of them, and even then there may be more than one, so which one should be displayed? Too many options may not be helpful here. I think having a (configurable) preference order is better than just dumping everything on the user and having them try to sort it out.
- That's reasonable. For the SERP page, is it okay for a module to give back UI text? Otherwise, Cirrus will have to know how to label the results of each module, and I thought we were trying to keep Cirrus from having to know anything about the modules. This applies to DYM and others.
- I'm certainly not against that, though I left it out of the draft to see what people think. Asking API users is a good idea, and I want to ask the Mobile team, but we should get our first draft straight first.
- If we go with multiple result sets in the API, getting two from the same language wiki might not be that uncommon: DYM and quote stripping would both give results from the original wiki. Also, if the numbers are consistently present, everyone will always parse them correctly. If they don't show up in 99% of cases, then when the 1% happens, things will break, as you noted. I'm not sure what to suggest. Always having arbitrary numbers is both easy to code and consistent for the consumer. What's so bad about the numbers?
- One possibility is that "rewrittenquery" goes away in the main search results, and DYM is always a suggestion, even if the original query returns zero results. That allows the consumer to choose which suggestion to take (DYM, or one of the others). Alternatively, we could give DYM special legacy status: keep it exactly as it is, and also put it in the additional search results section. That would require extra effort by new consumers to recognize that the main results are DYM results and not additional results, but it would be backward compatible for everyone else. Too many options is always a potential problem. I don't have strong feelings about a config for preferred order, but it seems reasonable, if it's easy for the consumer to convey to us.
Classify options and decide on the stopping criteria
The table with all the options makes a lot of sense to me. At a glance I completely agree with the order you put there.
We could maybe add a hint (new column?) to indicate when this option is applied. I can think of several big "boxes" to put all these options:
- cirrus: pre-parsing/munging (question mark stripping; could probably be merged with the next one)
- cirrus: parsing (nothing there yet)
- cirrus: query building (nothing there yet, example features: relax the constraint of the default AND operator)
- elastic: language analysis/stemming/...
- cirrus: fallback methods: all the other options, quote stripping, language/keyboard detection
I think the stopping criteria make sense only for fallback methods. Today the strategy is to stop after the first fallback method regardless of its success.
It's hard to tell whether we can try again and again: e.g., if quote stripping did not yield new results, should I try a second fallback method? If yes, should I try the next fallback with the original query or with the quotes stripped?
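Today's stop-after-the-first-applicable-fallback strategy might be sketched like this; the method list and the search stub are illustrative only, not real CirrusSearch code:

```python
# Sketch of the current strategy described above: run fallback methods in a
# fixed order and stop after the first one whose precondition fires, whether
# or not it actually finds anything.

def run_fallbacks(query, methods, search):
    for applies, transform in methods:
        if applies(query):
            return search(transform(query))  # stop here, even on zero results
    return []

methods = [
    (lambda q: '"' in q, lambda q: q.replace('"', '')),  # quote stripping
    (lambda q: True, lambda q: q),                       # language-detection stub
]
# Fake backend: finds a hit only once the quotes are gone.
fake_search = lambda q: ["hit"] if '"' not in q else []

print(run_fallbacks('"some phrase"', methods, fake_search))  # ['hit']
```

The open question above is exactly what this sketch leaves unanswered: whether to fall through to the next method when `search` returns nothing, and with which version of the query.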
I ended up coming up with a similar scheme under "category", but it isn't as detailed. If you think we need more detail there, feel free to add it.
We've been talking about displaying second-try search results, but what about the API? If we do something to the search string and it returns a secondary result, how do we tell the API consumer that we did this? Would we place the modified string (e.g., recoded to a different keyboard, or without quotes) into the suggestion?
I don't have a good answer here, but this is an excellent question, so I've added it to the "Open Questions" section.
Should we clarify/distinguish actual success vs potential success
Some of the fallback methods need to compute something before submitting a query to elastic.
Some examples:
Language detection can have a potential success if we detect a language. Its actual success depends on the results found in the target index.
Quote stripping can have a potential success if quotes are present in the query. Its actual success is also bound to presence of results in the index.
Determining actual success will always have a high cost. But determining potential success could maybe be run on all the fallback methods?
If some of the fallback methods provide a confidence value it could be used to make a decision.
Concerning confidence, I know it's hard but I think it's worth the effort.
For quote stripping we could have some "potential success" confidence, if only one word is quoted then its confidence is very low, but if the number of quoted words is 2 or 3 it's more likely to return results.
Do you think we need explicit confidence measures, or are implicit ones good enough? For language detection, TextCat can be configured to give its best guess no matter what, and that was how I started looking at it (to maximize recall), but it can also be configured to be more conservative, and only give an answer that's likely to be right. The work I've been doing recently is focussing on that kind of configuration (the current prod config is in between). So, there's an implicit confidence built into the process, but it's hard to convert into a numerical score (though I'm thinking about ways to do that).
Even with a numerical score, would we want to do something different based on that score, or would it just be a yes/no decision? If it's a yes/no decision, that can probably be pushed back into the module (as with language detection, which can just return no answer if the confidence isn't there). If it's a more complex process based on the score, I worry about maintaining it, and needing different values for different languages or different wikis.
We'd also have to come up with a useful confidence measure in each case. For quote stripping, for example, it's also language dependent. Do we want to run tokenizing on the string to see how many tokens are inside the quotes? For languages with spaces between words, it's easy, for Chinese, it's hard (unless we can get the tokenization for the original query back from Elasticsearch).
If we're looking at a simple set of criteria (setting aside Chinese tokenization for a moment), they could be folded into the initial criteria—"presence of 2+ tokens inside paired double quotes". (Though it's worth noting that single quoted words stop spelling correction, at least, so even quoted single words might do better without quotes.)
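That criterion is cheap to check for space-delimited languages; a naive sketch (whitespace tokenization, which, as noted, would not work for Chinese):

```python
import re

# Naive check for the "presence of 2+ tokens inside paired double quotes"
# criterion discussed above. Tokenization is plain whitespace splitting,
# which is only valid for languages with spaces between words.

def quoted_multiword(query):
    for phrase in re.findall(r'"([^"]*)"', query):
        if len(phrase.split()) >= 2:
            return True
    return False

print(quoted_multiword('the "quick brown" fox'))  # True: two quoted tokens
print(quoted_multiword('a "single" word'))        # False: one quoted token
```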
In my experience, it's a lot harder to bolt on confidence after the fact if it hasn't been part of the system from the original design, so I'm also worried about the amount of effort. For quotes, I wonder if it's even worth it. They are fairly uncommon, so we wouldn't be wasting a ton of processing time if we just stripped and ran them every time.
I think "potential success" may be part of the accept/reject step I mentioned in the design topic. I.e. quotes part would check if there are quotes and may be if there are "interesting" quotes - e.g. quotes including more than one word - and then either try search or refuse to. Quantifying this beyond yes/no though would be very tricky I think. I don't see a good way to compare two methods without actually running the searches.
My point here is to clarify how the fallback methods are evaluated.
Because a fallback method needs to actually send a query to elastic to determine its actual success, it's unlikely we'll be able to evaluate multiple ones. The cost of sending a query to elastic is always high.
As you said: In my experience, it's a lot harder to bolt on confidence after the fact if it hasn't been part of the system from the original design
It is for this particular reason that I'd prefer to build the logic that evaluates fallback methods based on a confidence value rather than first come, first served.
I'm perfectly fine having fallback methods that return 0 or 1 because they can't, or don't know how to, compute a confidence value. If we want to have a first come, first served approach, we can simply optimize by stating that a confidence of 1 means: stop looking at the others, it'll work, trust me; you can send the query to elastic.
If you think that order is sufficient and that we'll never have to evaluate more than one fallback method at a time, then basing the logic solely on order is probably sufficient.
I'm not strongly opposed to just using the order, but I find the confidence approach a bit more flexible for the future. Note that here I'm just talking about the "orchestration" code that will be in charge of running fallback methods. I completely agree that introducing an accurate confidence value into an existing fallback method is extremely hard, and I'm not advocating putting a lot of effort into this. For instance, with TextCat we can certainly start by returning 0 if the detection failed and 1 otherwise. If we figure out a way to get some confidence hints from the TextCat algorithm, that would be awesome; if not, no big deal.
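The orchestration being proposed, with a confidence of 1 acting as the first come, first served shortcut, might look roughly like this; all method names and confidence values here are illustrative:

```python
# Sketch of confidence-based orchestration: each fallback method reports a
# confidence in [0, 1]; a confidence of exactly 1 short-circuits the scan,
# which recovers the first-come-first-served behaviour as a special case.

def pick_fallback(query, methods):
    best, best_conf = None, 0.0
    for name, confidence_fn in methods:
        conf = confidence_fn(query)
        if conf == 1.0:
            return name  # "trust me" shortcut: stop looking at the others
        if conf > best_conf:
            best, best_conf = name, conf
    return best

methods = [
    ("quote_stripping", lambda q: 0.6 if '"' in q else 0.0),
    ("lang_detection", lambda q: 0.4),  # TextCat stub: detected something
]

print(pick_fallback('no quotes here', methods))    # lang_detection
print(pick_fallback('"two words" here', methods))  # quote_stripping
```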
Mostly agree, one more note - given that there are things we don't exactly know how they'd work out (e.g. which method is better, etc.) I think it's good to move incrementally - i.e. implement something simple first, see how it works and then decide how we proceed. E.g. if some query enhancement doesn't work that well at all, it'd be waste of time to spend any effort on trying to build confidence measure for it now.
on "relevant" search results
@Deskana (WMF) added "relevant" to "search results" in the opening paragraph, but I took it back out. We obviously prefer relevant results, but we do seem to have settled on the idea that, for human searchers (not bots), showing something is probably better than showing nothing, hence some of the semi-desperate searching approaches that work some of the time, but not even close to all the time.
I think "not obviously irrelevant" would be better than "relevant", although that's much less catchy. ;-)
Having thought about it more, the point I'm trying to make is that the trivial solution of showing random results rather than showing nothing is bad, but showing possibly irrelevant results is probably better than showing them nothing.
I'm not sure how best to represent this.
I think we need to be careful here because we still don't know too much (by "we" I mean both the code and us as developers) about how relevant the results we are returning are. It's a very hard question. So we should not commit to something we have little hope of delivering. We can provide more search results, and we can hope more results are better than no results, but claiming that any of this will surely provide more relevant results sounds to me like being bolder than we can (yet?) afford to be.
unescaped ?s are stripped from queries
Does it mean everywhere or only at the end of the sentence?
No, it's everywhere. Question marks are fairly rare, and almost always used in questions. It seemed more straightforward to have them all be escaped than to have inconsistent rules (and try to guess where sentences end in different languages).
Ah ok, makes sense.
I think if DYM returns a good result we should stop there. However, that really depends; I could argue wrong keyboard/language may be better in some cases. I would propose just making the order flexible enough (maybe in code for starters, or in config) and stopping once we've got one good result, then just trying it out and seeing whether anything weird shows up; if it does, maybe change the order.
It's hard to decide whether DYM results are good. As I understand it, it doesn't do any word-level n-grams (bigrams would be reasonable) to decide whether the new suggested word fits the context of the rest of the query. There can be crazy recommendations. Do we know if it is good based on the number of results? I haven't thought about it carefully, so I'm not sure.
I do agree that wrong keyboard could move up the list, but since that is not very-well developed either, I'm not sure how it should play out. The queries generally look like gibberish, but I could imagine spelling correction finding something reasonable to correct to in some cases. (I think the general lack of overlap in applicability has spared us from interactions so far—though I worry in particular about quotes plus anything else.)
I have been thinking of language detection as a last-ditch effort, so I put it last, but it doesn't have to be. It's not super highly accurate—but then maybe none of these really are.
The table there looks good to me, so I'd propose to organize the second-chance searches as a module that would essentially do two things: a) look at the query and the result and see if it can do better (e.g. whether it has quotes to remove), and b) do the search and return the new result.
With this, we can chain the available second-try searches and even allow customized ones. What do you think?
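The proposed two-part module shape might be sketched as follows; the class, helper, and backend stub are all invented for illustration:

```python
# Sketch of the proposed module interface: each second-try module answers
# (a) "can I do better than these results?" and (b) "run the modified
# search". Modules are then chained in order until one improves things.

class QuoteStripModule:
    def can_improve(self, query, results):
        # Only worth trying when there's nothing yet and quotes to remove.
        return not results and '"' in query

    def run(self, query, search):
        return search(query.replace('"', ''))

def chain(query, results, modules, search):
    for mod in modules:
        if mod.can_improve(query, results):
            new = mod.run(query, search)
            if new:
                return new
    return results

# Fake backend: finds a hit only once the quotes are gone.
search = lambda q: ["hit"] if '"' not in q else []
print(chain('"rare phrase"', [], [QuoteStripModule()], search))  # ['hit']
```

Custom second-try searches would just be additional objects implementing the same two methods, appended to the module list.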
Seems reasonable, though I want to invoke @EBernhardson (WMF) for his opinion, since he worked on the lang ID implementation, IIRC. The idea of hooks for customizable second-try searches is interesting, too, if it's easy enough to support.
I'd separate things we want to do on every query - like case folding, etc. - from things we may only want to do on some queries, e.g. low-result queries. Not sure where inter-wiki search falls - it can be both...
I tried to capture that in the "initial criteria". Things we do for every query are "automatic" and things that are only for some queries have specific initial criteria. I like having everything in one place to keep it all in mind, even though at the moment I don't think the language analysis interacts with anything else.