Wikimedia Discovery/So Many Search Options

December 2016 — See TJones_(WMF)/Notes for other projects.

Background
We’d always prefer providing some sort of search results to users rather than giving no results. And if we can’t give results, helpful suggestions and links are a better alternative than no results at all.

Right now we have, or have in the works, a large enough number of search modifications, extensions, and alternatives that we need to think clearly and carefully about how to order them and how to let them interact.

Current or Near Term Options
A brief summary of the features follows. Items marked with * depend on search results being “poor” enough to trigger an alternative approach; that is not defined for all future projects, but common criteria are fewer than 3 results, or no results.
 * Question mark stripping: ? characters are removed unless they are escaped with a backslash (\?), because most people use them when asking questions, not as one-character wildcards. This happens before searching.
 * ASCII/ICU folding, stemming, case folding, etc.: this happens right before search, and is done by Elasticsearch as part of the language analysis step, but it’s worth mentioning explicitly, since we could theoretically use components of this type outside Elasticsearch at some point. Currently, characters are mapped to other characters (lower case, some apostrophe-like marks are converted to apostrophes), words are reduced to their approximate roots (run, running, ran, runs all become run), etc.
 * Interwiki / Cross-project searching: On Wikipedias, provide one result from each sister project in the same language, if projects and results are available.
 * Did You Mean (DYM) Spelling suggestions: If search terms don’t look very likely, and another similar term does, provide a clickable link with those changes made. If the original query gave zero results, go ahead and try the suggested search.
 * Quote stripping: If a query has quotes and does poorly,* try the query again without the quotes.
 * Language detection / identification (TextCat / cross-language searching): If a query has fewer than 3 results, do language detection on it. If the language detected is not the “host” language (the language of the current wiki), try to get results from the corresponding project in that language, if it exists, and show any results.
 * Wrong keyboard detection: Using the same technique as language detection, detect when a user has typed a query in one language (e.g., Russian) while using the keyboard of another language (e.g., English), if the query does poorly.* This can be run concurrently with language detection, or separately. If a non-host language is detected, convert the query to the correct keyboard and run again.
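
The two simplest options in the list above, question mark stripping and quote stripping, can be sketched in a few lines of Python. This is only an illustration, not CirrusSearch code: the function names are hypothetical, "removed" is assumed to mean plain deletion, and the default threshold of 3 is the "fewer than 3 results" criterion mentioned above.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; these are not CirrusSearch functions.

# A question mark that is not immediately preceded by a backslash.
# Simplification: an escaped backslash followed by ? is treated as escaped too.
_UNESCAPED_QMARK = re.compile(r"(?<!\\)\?")


def strip_question_marks(query: str) -> str:
    """Delete unescaped question marks; escaped ones are left untouched."""
    return _UNESCAPED_QMARK.sub("", query)


def is_poor(result_count: int, threshold: int = 3) -> bool:
    """Assumed definition of "poor" results: fewer than `threshold` hits."""
    return result_count < threshold


def strip_quotes(query: str) -> Optional[str]:
    """Return the query without double quotes, or None if it has none."""
    if '"' not in query:
        return None
    return query.replace('"', "").strip()


if __name__ == "__main__":
    print(strip_question_marks("who is obama?"))
    print(strip_quotes('"los lobos locos"'))
```

Returning None when there is nothing to strip lets a caller skip re-running an identical query.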

Stopping Criteria
Ideally, it would be interesting to run everything and see what gives the best result and show that, but realistically, that’s probably too expensive. So it makes sense to order them carefully and thoughtfully, and consider stopping criteria. Potential stopping criteria include:
 * a certain amount of time has gone by or CPU has been used
 * a certain number of options have been tried (they don’t all have the same initial criteria, so aren’t all eligible to run on every query, and different options could be weighted based on the cost of running them, too)
 * an option achieves “success” (e.g., returns a certain number of results).
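
As a rough sketch of how the three stopping criteria above could be checked together, assuming made-up default limits and a hypothetical `Budget` class (none of this is existing code):

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hypothetical tracker for the three stopping criteria."""
    max_seconds: float = 0.05      # assumed wall-clock limit
    max_cost: int = 5              # assumed total weighted cost of options tried
    success_threshold: int = 3     # assumed "success": at least this many results
    started: float = field(default_factory=time.monotonic)
    spent: int = 0

    def charge(self, cost: int) -> None:
        """Record that an option with the given weighted cost was tried."""
        self.spent += cost

    def should_stop(self, best_result_count: int) -> bool:
        """True if time ran out, the cost budget is used up, or an option succeeded."""
        if time.monotonic() - self.started >= self.max_seconds:
            return True
        if self.spent >= self.max_cost:
            return True
        return best_result_count >= self.success_threshold


if __name__ == "__main__":
    budget = Budget(max_seconds=60)
    budget.charge(2)  # e.g., a "medium" cost option
    print(budget.should_stop(best_result_count=1))
```

Charging a weighted cost rather than counting options covers the case where options differ in how expensive they are to run.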

Off-Wiki Discussions
(January 2017)

David, Stas, Erik, and I recently discussed this for an hour or two on a couple of occasions. We came up with a framework that's a step in the right direction in fleshing out a generic approach.
 * One component is modularity. In the general case, we don't want to have to write code in Cirrus of the "if TextCat then X, else if QuoteStripping then Y" variety. There will be some exceptions, particularly interwiki search, but most query-modifying options should meet this criterion. The modular interface is relatively simple:
 * Input is a query (usually the original query), the source wiki (usually the wiki it was originally submitted to), and the results count (so that different modules can use different results count thresholds if they need to—"Did you mean" (DYM) might use 0, most others might use 3).
 * Output is a list, each element consisting of a modified query, the wiki it should be submitted to, and a human-readable information string (such as "Did You Mean X" or "Results from Russian Wikipedia"). The returned list can be empty if the module has nothing to suggest (e.g., there are no quotes to strip, or the language detected is the language of the current wiki, etc.).
 * Given general modularity, the modules need to be ordered, both for the order of considering modified queries to run and for the order of considering results to show. We can use likelihood of applicability and accuracy of results to initially order the modules, but this is also something that could be A/B tested.
 * Another component is simultaneous search. It takes too long to issue a modified query, check the results, and repeat, say, five times. However, as the recent interwiki search work has shown, issuing multiple queries at once doesn't necessarily bog down the Elasticsearch cluster. So we propose defining some maximum number of simultaneous queries (say, five), and working through the search-modifying modules until we run out of modules or we fill all the available simultaneous search slots. Once the slots are filled (which should be fast compared to searching), we issue the five queries at once.
 * The final component, which we did not fully work through, is displaying the results. I'd originally written that it does not seem helpful to the users who most need help searching to display seven kinds of results all smashed together, but we're going to chat with UX pros about that and make sure.
 * Earlier notes: In general, it seems that later modules are more likely to be more "desperate" (i.e., more likely to favor recall over precision so as to give some results), so raw number of results is not a good selection criterion. It's also possible for some results to overlap with previous results (e.g., a query with quotes may get one result, while the same query without quotes may get that same result, plus many others). Until we think of something better, the current draft proposal is to stick with the "success criteria": original query results (if any) would be shown, along with the first "successful" modified query (that probably means having 3+ results). If there is no successful modified query, we could show the results of the earliest modified query.
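
The module interface and the slot-filling step described above might look roughly like the following Python sketch. Everything here is hypothetical naming for illustration (`Suggestion`, `fill_slots`, `run_all`, and the wording of the information string); the `search` callable stands in for whatever actually issues a query, and a thread pool is just one assumed way to issue the queries at once.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Suggestion:
    query: str  # modified query
    wiki: str   # wiki the modified query should be submitted to
    info: str   # human-readable information string


# A module takes (query, source wiki, results count) and returns suggestions.
Module = Callable[[str, str, int], List[Suggestion]]


def quote_stripper(query: str, wiki: str, result_count: int) -> List[Suggestion]:
    """Example module: suggests the unquoted query when results are poor."""
    if result_count >= 3 or '"' not in query:
        return []  # nothing to suggest
    stripped = query.replace('"', "").strip()
    return [Suggestion(stripped, wiki, f"Results for {stripped} (quotes removed)")]


def fill_slots(modules: List[Module], query: str, wiki: str,
               result_count: int, max_slots: int = 5) -> List[Suggestion]:
    """Work through modules in order until they run out or the slots are full."""
    slots: List[Suggestion] = []
    for module in modules:
        if len(slots) >= max_slots:
            break
        slots.extend(module(query, wiki, result_count))
    return slots[:max_slots]


def run_all(slots: List[Suggestion], search: Callable[[Suggestion], int]) -> List[int]:
    """Issue every queued query at once; result counts come back in slot order."""
    if not slots:
        return []
    with ThreadPoolExecutor(max_workers=len(slots)) as pool:
        return list(pool.map(search, slots))


if __name__ == "__main__":
    queued = fill_slots([quote_stripper], '"los lobos locos"', "enwiki", 1)
    print(queued)
```

The framework only sees the `Module` signature, so it needs no "if TextCat then X" branches; each module applies its own results count threshold.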

Exceptions and Notes
We decided that interwiki results (i.e., showing results from same-language Wiktionary, Wikiquote, Wikisource, Wikivoyage, etc.) are independent from the other query-modifying modules we're considering. Those results should be shown if available, but whether or not they are available doesn't affect anything else.

Cirrus-internal modifications (i.e., ?-stripping and language analysis) don't affect and aren't affected by the modular framework. It's just good to keep in mind that they exist.

We've decided that Did You Mean (DYM) is not an exception, and should be refactored to behave like other potential modules. We've also generalized the DYM suggestion text and Language ID/TextCat "Showing Results from..." above.

For off-wiki searchers using the API, we've mocked up potential combined JSON results below.

Advanced Options
One idea we touched on but didn't fully address was combining quote stripping and language detection: we might want to strip quotes and send the query to a wiki in another language if both of those options fail independently. One way to approach this would be to have a special module that just calls the other two modules in turn and combines their results. Another option would be to have a mechanism for explicitly composing two or more modules, taking the output of one and using it as input to the other. That's not something to consider for the initial version, but an idea to keep in mind.

One idea that just occurred to me, and which is incorporated above but is also worth explicitly mentioning, is that a module could have multiple suggestions. A language-aware quote stripper could suggest the stripped query on the current wiki, but also on another wiki if the query seems to be in another language. A wrong-keyboard detector could suggest the modified query on the current wiki, and on the wiki corresponding to the presumed keyboard (e.g., from enwiki, a query detected as "Latin Cyrillic" could be converted to proper Cyrillic and run on both enwiki and ruwiki). This seems easy to include in the first pass, even if all modules return only one suggestion to start.

Depending on how complex and expensive the module internals are, we might want to cache results on a blackboard. Such a mechanism allows the modules to know about each other, without the framework having to know about them. For example, quote stripping is probably so fast and easy it's okay to do it more than once in different modules. Language identification with TextCat is lightweight, but much more intensive than quote stripping. The language ID module could detect the language and write the results to the blackboard. The quote stripper could check the blackboard and see that a language ID has occurred, and make two suggestions, one for the current wiki, and one for the wiki of the detected language. Similarly, with a composed language-aware quote stripper that calls TextCat a second time on the same query, TextCat could check the blackboard and not re-run the exact same query again. This is probably overkill in general, and certainly too much for an initial implementation, so it's just an idea to keep around in case we need it.
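
A blackboard of this kind could be as small as a keyed cache shared by the modules for the lifetime of one request. The sketch below is an assumption about its shape, not a design we settled on; the class and method names are hypothetical, and the language detector is a stand-in function rather than TextCat.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class Blackboard:
    """Hypothetical per-request scratch space shared by modules."""

    def __init__(self) -> None:
        self._notes: Dict[Tuple[str, str], Any] = {}

    def get_or_compute(self, kind: str, query: str,
                       compute: Callable[[str], Any]) -> Any:
        """Return the cached note for (kind, query), computing it only once."""
        key = (kind, query)
        if key not in self._notes:
            self._notes[key] = compute(query)
        return self._notes[key]

    def peek(self, kind: str, query: str) -> Optional[Any]:
        """Read a note another module may have left, without computing anything."""
        return self._notes.get((kind, query))


if __name__ == "__main__":
    calls = []

    def fake_detect(query: str) -> str:
        # Stand-in for a real language identifier; always answers "es".
        calls.append(query)
        return "es"

    board = Blackboard()
    board.get_or_compute("lang", "los lobos locos", fake_detect)
    board.get_or_compute("lang", "los lobos locos", fake_detect)  # cached
    print(len(calls), board.peek("lang", "los lobos locos"))
```

A quote stripper would use `peek` to see whether a language ID has already occurred, while a composed module would use `get_or_compute` to avoid re-running detection on the same query.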

Open Questions

 * Results selection: As mentioned above, we haven't really carefully worked out a good method for selecting results when our five-or-so simultaneous queries all return results. We're going to consult with some UX folks on how best to display results and whether multiple sets of potentially overlapping results are a good idea (it kind of doesn't sound that way when it's phrased like that).
 * Confidence: In our discussion the idea of confidence came up, such as having each module give some confidence score to its suggested query, which would allow us to order them based on that confidence, rather than on a fixed order. TextCat has eluded confidence measures so far, and it's unfortunately hard to imagine assigning a well-founded confidence after stripping quotes from a query. We could have simple categories ("high, medium, low" or maybe "bold, in vain, desperate") that could sort results, especially when a module has multiple suggestions. For example, quote stripping might be "medium" while quote stripping + language ID is "low", even though both come from the same module.
 * API: We've mocked up some possible API results below, but it's a very early draft and needs more thought and potential updates.
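
To make the confidence-category idea from the list above concrete, here is a minimal sketch, assuming a three-level scale and hypothetical names. It relies on Python's `sorted` being stable, so suggestions with the same category keep their original module order.

```python
from enum import IntEnum
from typing import List, Tuple


class Confidence(IntEnum):
    """Assumed coarse categories; a lower value sorts first."""
    HIGH = 0
    MEDIUM = 1
    LOW = 2


# Each entry is (confidence, description of the suggested query).
Ranked = Tuple[Confidence, str]


def order_by_confidence(suggestions: List[Ranked]) -> List[Ranked]:
    """Stable sort: confidence first, original module order as tie-breaker."""
    return sorted(suggestions, key=lambda pair: pair[0])


if __name__ == "__main__":
    queue = [
        (Confidence.LOW, "quote stripping + language ID"),
        (Confidence.MEDIUM, "quote stripping"),
    ]
    for confidence, label in order_by_confidence(queue):
        print(confidence.name, label)
```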

A Worked Example
Suppose we get a query, "los lobos locos", which gets 1 result on enwiki. We run our first module, language detection with TextCat, and it determines the query is Spanish and queues up "los lobos locos" to search on eswiki. The second module, the quote stripper, suggests los lobos locos on both enwiki and eswiki. Wrong keyboard detection (having lost its mind) suggests "дщы дщищы дщсщы" on enwiki and ruwiki. That's five suggestions, so we stop processing modules.

Scenario 1
Suppose:
 * "los lobos locos" on eswiki returns 0 results.
 * los lobos locos on enwiki returns 1 result.
 * los lobos locos on eswiki returns 2 results.
 * "дщы дщищы дщсщы" on enwiki returns 0 results.
 * "дщы дщищы дщсщы" on ruwiki returns 2 results.
Since none are "successful", we just show the original 1 result from enwiki, along with the earliest "unsuccessful" result: the 1 result from los lobos locos on enwiki.

Scenario 2
Suppose (the only difference from above is the last line):
 * "los lobos locos" on eswiki returns 0 results.
 * los lobos locos on enwiki returns 1 result.
 * los lobos locos on eswiki returns 2 results.
 * "дщы дщищы дщсщы" on enwiki returns 0 results.
 * "дщы дщищы дщсщы" on ruwiki returns 5 results.
Since the final query (in defiance of all likelihood in the real world) is "successful", we show the original 1 result from enwiki, along with the earliest "successful" result: the 5 results from "дщы дщищы дщсщы" on ruwiki.
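
The selection rule used in the two scenarios can be written down as a short Python sketch. The function name and the tuple layout are hypothetical, "successful" is taken to mean 3+ results, and the fallback (the earliest modified query that returned anything at all) is one reading of Scenario 1, where the 0-result eswiki query is passed over.

```python
from typing import List, Optional, Tuple

# (modified query, wiki, number of results), in module order.
Outcome = Tuple[str, str, int]


def select_extra_results(outcomes: List[Outcome],
                         success_threshold: int = 3) -> Optional[Outcome]:
    """Pick which modified query to show alongside the original results."""
    # First choice: the earliest "successful" modified query.
    for outcome in outcomes:
        if outcome[2] >= success_threshold:
            return outcome
    # Fallback (assumption): the earliest modified query with any results.
    for outcome in outcomes:
        if outcome[2] > 0:
            return outcome
    return None


if __name__ == "__main__":
    scenario_1 = [
        ('"los lobos locos"', "eswiki", 0),
        ("los lobos locos", "enwiki", 1),
        ("los lobos locos", "eswiki", 2),
        ('"дщы дщищы дщсщы"', "enwiki", 0),
        ('"дщы дщищы дщсщы"', "ruwiki", 2),
    ]
    # Scenario 2 differs only in the last line.
    scenario_2 = scenario_1[:-1] + [('"дщы дщищы дщсщы"', "ruwiki", 5)]
    print(select_extra_results(scenario_1))
    print(select_extra_results(scenario_2))
```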

Draft Proposal
Below is the updated draft proposal, for further discussion of both generalities and specifics. Important elements include:
 * Order with respect to default search and to each other. Options below are roughly sorted into groups that happen at the same time. Exact sorting is a point of discussion.
 * Initial eligibility criteria: “automatic” always happens; “no previous successful results” is always assumed (see below) except for “automatic” actions; the number of main search results or results from previous options is probably the most common criterion.
 * Marginal cost estimate: start with very rough low/medium/high estimates of the marginal cost of the various options, if activated. The marginal cost of determining initial criteria is presumed to be low.
 * “Success” criteria: here defined as giving good enough results so as to stop processing and trying other alternatives—so while question mark stripping is probably always going to be successful in terms of removing question mark characters, its success criterion is “none” because it will never stop processing. Success criteria could include the number of results, the “quality” of results, and maybe the length of the query (short, one-word queries seem like they could be a different class than very long and/or multi-word queries).
 * Results shown: One way to cut down on UI complexity is to only show the “best” set of results from extra search options, so if stripping quotes gives 1 result, wrong keyboard gives 2 results, and language identification gives 200 results, only the final 200 results would be added to the original main search results (which are likely fewer than 3).

API Proposal
TBD.