Topic on Talk:Wikimedia Discovery/So Many Search Options

Should we clarify/distinguish actual success vs potential success

DCausse (WMF) (talkcontribs)

Some of the fallback methods need to compute something before submitting a query to elastic.

Some examples:

Language detection has a potential success if we detect a language; its actual success depends on the results found in the target index.

Quote stripping has a potential success if quotes are present in the query; its actual success likewise depends on the presence of results in the index.

Determining actual success will always have a high cost, but could checking potential success be cheap enough to run on all the fallback methods?

If some of the fallback methods provide a confidence value, it could be used to make a decision.
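
A rough sketch of what such an interface could look like (the names `FallbackMethod`, `potential_success`, and `rewrite` are hypothetical, and Python is used purely for illustration; this is not existing CirrusSearch code):

```python
from abc import ABC, abstractmethod
from typing import Optional

class FallbackMethod(ABC):
    """Hypothetical interface for a fallback method."""

    @abstractmethod
    def potential_success(self, query: str) -> float:
        """Cheap local check, no elastic round trip: confidence in
        [0, 1] that rewriting this query is worth trying."""

    @abstractmethod
    def rewrite(self, query: str) -> str:
        """Produce the rewritten query to send to elastic."""


def pick_method(methods: list, query: str,
                threshold: float = 0.5) -> Optional[FallbackMethod]:
    """Return the most promising method, or None if nothing clears
    the (arbitrary) threshold."""
    scored = [(m.potential_success(query), m) for m in methods]
    confidence, best = max(scored, key=lambda x: x[0], default=(0.0, None))
    return best if confidence >= threshold else None
```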

DCausse (WMF) (talkcontribs)

Concerning confidence, I know it's hard, but I think it's worth the effort.

For quote stripping we could have some "potential success" confidence: if only one word is quoted, the confidence is very low, but if the number of quoted words is 2 or 3, the query is more likely to return results.
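
A toy version of that heuristic (the confidence values are made up, and the naive whitespace split only works for space-delimited languages, a caveat raised below):

```python
import re

def quote_strip_confidence(query: str) -> float:
    """Crude 'potential success' score for quote stripping.
    The thresholds and scores are illustrative, not tuned."""
    quoted = re.findall(r'"([^"]+)"', query)
    if not quoted:
        return 0.0                # nothing to strip
    words = max(len(span.split()) for span in quoted)
    if words == 1:
        return 0.2                # single quoted word: low confidence
    if words <= 3:
        return 0.7                # 2-3 quoted words: more promising
    return 0.5                    # long phrases: unclear

print(quote_strip_confidence('"foo bar" baz'))   # 0.7
print(quote_strip_confidence('"foo" bar'))       # 0.2
```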

TJones (WMF) (talkcontribs)

Do you think we need explicit confidence measures, or are implicit ones good enough? For language detection, TextCat can be configured to give its best guess no matter what, and that was how I started looking at it (to maximize recall), but it can also be configured to be more conservative, and only give an answer that's likely to be right. The work I've been doing recently is focussing on that kind of configuration (the current prod config is in between). So, there's an implicit confidence built into the process, but it's hard to convert into a numerical score (though I'm thinking about ways to do that).

Even with a numerical score, would we want to do something different based on that score, or would it just be a yes/no decision? If it's a yes/no decision, that can probably be pushed back into the module (as with language detection, which can just return no answer if the confidence isn't there). If it's a more complex process based on the score, I worry about maintaining it, and needing different values for different languages or different wikis.
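
For illustration, the "return no answer if the confidence isn't there" behaviour could look something like this (the scoring interface is invented, not TextCat's real API, and the margin rule is just one possible conservative configuration):

```python
from typing import Optional

def detect_language(scores: dict[str, float],
                    margin: float = 0.05) -> Optional[str]:
    """Return the best-guess language only when it clearly wins.

    `scores` maps language codes to similarity scores in [0, 1]
    (higher is better); the interface is hypothetical. Returning
    None means "no confident answer", giving callers an implicit
    yes/no instead of a numeric score to maintain.
    """
    if len(scores) < 2:
        return next(iter(scores), None)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best[0] if best[1] - runner_up[1] >= margin else None

print(detect_language({"fr": 0.92, "en": 0.55}))  # 'fr'
print(detect_language({"fr": 0.61, "en": 0.59}))  # None (too close to call)
```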

We'd also have to come up with a useful confidence measure in each case. For quote stripping, for example, it's also language-dependent. Do we want to run tokenizing on the string to see how many tokens are inside the quotes? For languages with spaces between words, it's easy; for Chinese, it's hard (unless we can get the tokenization for the original query back from Elasticsearch).

If we're looking at a simple set of criteria (setting aside Chinese tokenization for a moment), they could be folded into the initial criteria—"presence of 2+ tokens inside paired double quotes". (Though it's worth noting that single quoted words stop spelling correction, at least, so even quoted single words might do better without quotes.)

In my experience, it's a lot harder to bolt on confidence after the fact if it hasn't been part of the system from the original design, so I'm also worried about the amount of effort. For quotes, I wonder if it's even worth it. They are fairly uncommon, so we wouldn't be wasting a ton of processing time if we just stripped and ran them every time.

Smalyshev (WMF) (talkcontribs)

I think "potential success" may be part of the accept/reject step I mentioned in the design topic. I.e. quotes part would check if there are quotes and may be if there are "interesting" quotes - e.g. quotes including more than one word - and then either try search or refuse to. Quantifying this beyond yes/no though would be very tricky I think. I don't see a good way to compare two methods without actually running the searches.

DCausse (WMF) (talkcontribs)

My point here is to clarify how the fallback methods are evaluated.

Because a fallback method needs to actually send a query to elastic to determine its actual success, it's unlikely we'll be able to evaluate multiple ones; the cost of sending a query to elastic is always high.

As you said: "In my experience, it's a lot harder to bolt on confidence after the fact if it hasn't been part of the system from the original design."

It is for this particular reason that I'd prefer to build the logic that evaluates fallback methods on a confidence value rather than on a first come, first served basis.

I'm perfectly fine with having fallback methods that return 0 or 1 because they can't or don't know how to compute a confidence value. If we want a first come, first served approach, we can simply optimize by stating that a confidence of 1 means: stop looking at the others, it'll work, trust me; you can send the query to elastic.

If you think that order is enough and that we'll never have to evaluate more than one fallback method at a time, then basing the logic solely on order is probably sufficient.

I'm not strongly opposed to just using the order, but I find the confidence approach a bit more flexible for the future. Note that here I'm just talking about the "orchestration" code that will be in charge of running fallback methods. I completely agree that introducing an accurate confidence value into an existing fallback method is extremely hard, and I'm not advocating putting a lot of effort into this. For instance, with TextCat we can certainly start by returning 0 if the detection failed and 1 otherwise. If we figure out a way to get some confidence hints from the TextCat algorithm, that would be awesome; if not, no big deal.
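
To make the orchestration idea concrete, here is a minimal sketch reusing the hypothetical interface above (descending confidence order; 0 means the method knows it can't help, and 1 is read as the "trust me" short-circuit described earlier):

```python
from typing import Callable, Optional

def run_fallbacks(methods: list, query: str,
                  search: Callable[[str], Optional[list]]) -> Optional[list]:
    """`search` stands in for the expensive elastic call; it returns
    a result list, or None when the index has nothing."""
    scored = [(m.potential_success(query), m) for m in methods]
    for confidence, method in sorted(scored, key=lambda x: x[0], reverse=True):
        if confidence <= 0.0:
            break                 # remaining methods know they can't help
        results = search(method.rewrite(query))
        if results:
            return results
        if confidence >= 1.0:
            break                 # "trust me" semantics: commit, don't keep probing
    return None
```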

Smalyshev (WMF) (talkcontribs)

Mostly agree; one more note. Given that there are things we don't yet know how they'd work out (e.g. which method is better), I think it's good to move incrementally, i.e. implement something simple first, see how it works, and then decide how to proceed. E.g. if some query enhancement doesn't work well at all, it would be a waste of time to spend any effort now on trying to build a confidence measure for it.
