Wikimedia Discovery/So Many Search Options

December 2016 — See TJones_(WMF)/Notes for other projects.

Background
We’d always prefer providing some sort of relevant search results to users rather than giving no results. And if we can’t give results, helpful suggestions and links are a better alternative than no results at all.

Right now we have, or have in the works, a large enough number of search modifications, extensions, and alternatives that we need to think clearly and carefully about how to order them and how to let them interact.

Current or Near Term Options
A brief summary of the features follows. ''* The point at which search results are “poor” enough to trigger an alternative approach is not defined for all future projects. Common criteria are fewer than 3 results, or no results.''
 * Question mark stripping: ? characters are removed unless they are escaped with a backslash (\?), because most people use them when asking questions, not as one-character wildcards. This happens before searching.
 * ASCII/ICU folding, stemming, case folding, etc.: This happens right before search and is done by Elasticsearch as part of the language analysis step. It is worth mentioning explicitly because we could theoretically use components of this type outside Elasticsearch at some point. Currently, characters are mapped to other characters (upper case is converted to lower case, and some apostrophe-like marks are converted to apostrophes), and words are reduced to their approximate roots (run, running, ran, and runs all become run).
 * Inter-wiki / Cross-project searching: On Wikipedias, provide one result from each sister project in the same language, if projects and results are available.
 * Did You Mean (DYM) spelling suggestions: If the search terms look unlikely and similar terms look more likely, provide a clickable link to the query with those changes made. If the original query gave zero results, go ahead and run the suggested search.
 * Quote stripping: If a query has quotes and does poorly,* try the query again without the quotes.
 * Language detection / identification (TextCat / cross-language searching): If a query has fewer than 3 results, do language detection on it. If the language detected is not the “host” language (the language of the current wiki), try to get results from the corresponding project in that language, if it exists, and show any results.
 * Wrong keyboard detection: If a query does poorly,* use the same technique as language detection to detect when a user has typed a query in one language (e.g., Russian) while using the keyboard of another language (e.g., English). This can be run concurrently with language detection, or separately. If a non-host language is detected, convert the query to the correct keyboard and run it again.

Stopping Criteria
Ideally, we would run every option, see which one gives the best results, and show those, but realistically that is probably too expensive. So it makes sense to order the options carefully and thoughtfully, and to consider stopping criteria. Potential stopping criteria include:
 * a certain amount of time has gone by or CPU has been used
 * a certain number of options have been tried (they don’t all have the same initial criteria, so aren’t all eligible to run on every query, and different options could be weighted based on the cost of running them, too)
 * an option achieves “success” (e.g., returns a certain number of results).
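As a rough sketch of how these three stopping criteria could combine, the loop below tries options in order and stops on whichever criterion is hit first. Everything here is an assumption for illustration: the Option class, the run_fallbacks function, the default budgets, and the idea of expressing "number of options tried" as a cost-weighted budget are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Option:
    name: str
    run: Callable[[str], List[str]]  # takes a query, returns a list of results
    cost: float = 1.0                # relative weight; 1.0 means "one option"


def run_fallbacks(
    query: str,
    options: List[Option],
    time_budget_s: float = 0.5,   # placeholder value
    cost_budget: float = 3.0,     # placeholder value
    success_threshold: int = 3,   # "success" = at least this many results
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[Tuple[str, List[str]]], str]:
    start = clock()
    spent = 0.0
    tried: List[Tuple[str, List[str]]] = []
    for opt in options:
        # Criterion 1: a certain amount of time has gone by.
        if clock() - start >= time_budget_s:
            return tried, "time budget exhausted"
        # Criterion 2: a (cost-weighted) number of options have been tried.
        if spent + opt.cost > cost_budget:
            return tried, "cost budget exhausted"
        results = opt.run(query)
        spent += opt.cost
        tried.append((opt.name, results))
        # Criterion 3: an option achieves "success".
        if len(results) >= success_threshold:
            return tried, "success"
    return tried, "no options left"


# Stub options standing in for real searches.
demo_options = [
    Option("quote stripping", lambda q: ["r1"]),
    Option("language identification", lambda q: ["r1", "r2", "r3"]),
    Option("wrong keyboard", lambda q: []),
]
demo_tried, demo_reason = run_fallbacks('"some query"', demo_options)
print(demo_reason, [name for name, _ in demo_tried])
```

With the stubs above, the loop stops after the second option because it returns three results; the third option is never run.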

Initial Draft Proposal
Based on all this, I’m going to make a draft proposal for further discussion of both generalities and specifics. Important elements include: ''These are numbered at random as they came to me, and not in any logical order. Sorry.''
 * Order with respect to default search and to each other. Options below are roughly sorted into groups that happen at the same time. Exact sorting is a point of discussion.
 * Initial eligibility criteria: “automatic” always happens; “no previous successful results” is always assumed (see below) except for “automatic” actions; the number of main search results or results from previous options is probably the most common criterion.
 * Marginal cost estimate: start with very rough low/medium/high estimates of the marginal cost of the various options, if activated. The marginal cost of determining initial criteria is presumed to be low.
 * “Success” criteria: here defined as giving good enough results so as to stop processing and trying other alternatives. So while question mark stripping is probably always going to be successful in terms of removing question mark characters, its success criterion is “none” because it will never stop processing. Success criteria could include the number of results, the “quality” of results, and maybe the length of the query (short, one-word queries seem like they could be a different class than very long and/or multi-word queries).
 * Results shown: One way to cut down on UI complexity is to only show the “best” set of results from extra search options. So if stripping quotes gives 1 result, wrong keyboard gives 2 results, and language identification gives 200 results, only the 200 results from language identification would be added to the original main search results (which are likely fewer than 3).
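To make the elements above concrete, here is one possible way to record them per option and to pick the “best” set of results. It is a sketch for discussion only: the SearchOption fields, the helper names, and the cost and success values in the examples are placeholders, not measurements or decisions. Only the “fewer than 3 results” threshold and the 1/2/200 example come from the text above, and “best” is assumed here to mean simply “most results”.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SearchOption:
    name: str
    automatic: bool                     # "automatic" options always run
    max_main_results: Optional[int]     # eligible only below this count; None = no count criterion
    cost: str                           # rough marginal cost: "low", "medium", or "high"
    success_min_results: Optional[int]  # None = success criterion "none" (never stops processing)


def is_eligible(opt: SearchOption, main_result_count: int, prior_success: bool) -> bool:
    if opt.automatic:
        return True
    # "No previous successful results" is assumed for every non-automatic option.
    if prior_success:
        return False
    return opt.max_main_results is None or main_result_count < opt.max_main_results


def is_success(opt: SearchOption, result_count: int) -> bool:
    return opt.success_min_results is not None and result_count >= opt.success_min_results


def best_results(results_by_option: Dict[str, List[str]]) -> Optional[Tuple[str, List[str]]]:
    # Assumption: the "best" set is the one with the most results.
    if not results_by_option:
        return None
    return max(results_by_option.items(), key=lambda item: len(item[1]))


# Placeholder entries; cost and success values are illustrative only.
question_marks = SearchOption("question mark stripping", True, None, "low", None)
lang_id = SearchOption("language identification", False, 3, "high", 3)

shown = best_results({
    "quote stripping": ["r"] * 1,
    "wrong keyboard": ["r"] * 2,
    "language identification": ["r"] * 200,
})
print(shown[0], len(shown[1]))  # language identification 200
```

A richer version could replace the result count in best_results with a quality score, or vary success_min_results by query length, as suggested under “Success” criteria.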