User:TJones (WMF)/Notes/Implementation Design and Parameter Optimization for Wrong Keyboard Detection and Suggestion

November 2018–January 2019 (intermittently) — See TJones_(WMF)/Notes for other projects. See also T138958, T213931, and T213936.

Background
Way back in 2016 I looked at “wrong keyboard” detection as a 10% time project. The idea is that sometimes people who work in languages with different writing systems will accidentally use the wrong keyboard when typing. So, someone could type ,jutvcrfz hfgcjlbz on Russian Wikipedia because they tried to type богемская рапсодия (“bohemian rapsody”) but their keyboard was still set to English instead of Russian.

Since TextCat—which we already use for language detection—is based on n-grams, it is easy enough to transliterate existing models based on a keyboard mapping, giving us decent models to detect English typed on a Russian Cyrillic keyboard (en_cyr), or Russian typed on an American English keyboard (ru_lat).*"* N.B.: There is an issue with choosing which keyboard to use for our mapping. Not all Latin-alphabet keyboards are the same, and not all Cyrillic keyboards are the same. Depending on the word being transliterated, though, it may not make much difference, since keyboards in the same character set are sometimes fairly similar—though Russian and Serbian keyboards, for example, are very different.""It does not seem that a brute-force approach, like generating 15+ mappings—for, say, French, Spanish, German, US English, and UK English Latin keyboards to/from Russian, Ukrainian, and Serbian Cyrillic keyboards—is worth the effort or computational expense."

A DWIM Digression
In addition to detecting wrong keyboard for Russian/American keyboards, there is a request/ticket to do the same for Hebrew/American keyboards.

In the meantime, Hebrew Wikipedia implemented a gadget called DWIM (“DWIM” is old computer slang for “do what I mean”), which was later adapted and copied to Russian Wikipedia as an optional gadget.** These provide wrong-keyboard suggestions in the upper left (Hebrew) or upper right (Russian) search box on every page.***"** As a meta-digression: I found a slight problem with the Russian gadget, which did not properly handle ALL CAPS wrong-keyboard input. (The short version is that some Russian Cyrillic characters map to punctuation on the American English keyboard, and while → “g” and   → “д”,   does not give “;”.) The problem has since been fixed!""*** For reasons I can’t figure out, these extra wrong-keyboard suggestions can’t (yet?) be provided in the main search box on the Special:Search page. I was able to get them to show up, but only in addition to rather than instead of the results that are already there. There’s some complexity with OOUI—which I’ve never used, unfortunately—and I can’t figure it out. I’ve got feelers out to find a solution, though, so maybe it will be available for suggestions on the main search page, too."The basic mechanism is to see how many results there are from a query, and if there are fewer than 10, do the wrong-keyboard conversion and tack on any results from the converted query.

This brings up the possibility of doing a second-chance search that doesn’t rely on detection. We could just do the wrong-keyboard transformation and get the results and see what happens. The wrong-keyboard results are always “plausible” in that they are what you’d get if you typed the same keys with the other keyboard activated. Because the vowels generally don’t line up between the Russian and English keyboards, it’s hard to find real words that are ambiguous with respect to the wrong-keyboard transformation;**** however, short strings, especially those short enough to be acronyms, could possibly map to something matchable across the transformation."**** There are a few examples, though: Genby is both the wrong-keyboard version of Путин (“Putin”) and a rare surname. Фишер is both the Russian transliteration of the name “Fischer” and the wrong-keyboard version of the rare name Abith."

Enter TextCat
In my original investigation, I used TextCat because I already had models for English and Russian, and converting them to wrong-keyboard variants was easy enough since it’s a one-to-one mapping. (This time I made fresh conversions using the mapping from the Russian DWIM gadget, which was slightly different than what I originally used in my first look at wrong-keyboard mappings. I assume readers of Russian Wikipedia have a better idea of the best mapping between the most common keyboards!)

I pulled a sample of 10,000 queries from Russian Wikipedia, re-ran the queries and the wrong-keyboard–converted form of the queries against Russian Wikipedia to get results counts, generated some other textual stats (Latin or Cyrillic characters, both, or neither; how many Latin or Cyrillic consonants in a row; etc.), and started annotating.

For the purposes of wrong keyboard detection, I tagged anything in the Latin alphabet as “English” and anything in Cyrillic as “Russian”. The Russian tag is probably closer to truth than the English tag, but the Spanish, German, Portuguese, and French that I saw is all much, much closer to English than it is to the majority of the wrong-keyboard Russian, which looks like gibberish in the Latin alphabet.

I reviewed all the Latin character queries, and the wrong-keyboard–converted Cyrillic queries (i.e., in Latin). There are a few non-Latin, non-Cyrillic queries; a fair number of acronym-like queries; some that are hard to categorize because they look like ID numbers or measurements; and even a few mixed-script queries that could use partial conversion. And of course lots and lots of Russian Cyrillic, plenty of English (and other) Latin, a fair number of “Latin Russian” and a few “Cyrillic English”… and a small surprise!

An Encoding Excursus
While annotating the data for TextCat optimization, I ran across a few queries that look like this: РњРѕСЃРєРІР°.

Ahh, that takes me back to the pre-Unicode days of the Worldwide Web, when you’d stumble upon a Russian-language website and it would look like total gibberish, whether you spoke Russian or not. The problem here is that we have an encoding mixup between Unicode UTF-8 and Windows-1251. РњРѕСЃРєРІР° is what you get when you mix up the encoding for Москва (“Moscow”).

This encoding mixup generates very distinctive text—generally with every other letter being either Cyrillic Р or С—but mostly Р. Unfortunately, since UTF-8 Cyrillic characters map to two Windows-1251 characters, I couldn’t just map the Russian n-gram model. So I grabbed some Russian Wikipedia text, encoded it “incorrectly”, and built a new TextCat model with it (ru_win1251), and added it to the mix.

It’s not exactly the wrong keyboard, but it’s close enough and we’re here and we can probably fix it.

Optimizing TextCat
I created two corpora to optimize the TextCat config against:


 * All of the queries that could be categorized, excluding IDs, numbers, other queries with no Cyrillic or Latin characters,, and apparent gibberish. (9,628 queries)


 * Only the queries that got zero results (a possible filter to detection). (206 queries).

I used my TextCat optimization tool to explore and optimize the parameter space for each corpus. It was very easy for both corpora to get above 90% accuracy, and the larger corpus had 99.8% categorization F0 accuracy; because it is ~80% Russian, it pushed toward keeping precision on Russian queries high and not getting too many false positives for everything else. The smaller corpus got 96.2% F0 accuracy; with ~66% Latin Russian, it leaned toward higher recall on non-Russian queries.

After constraining the parameter space, I did a meta-optimization to simultaneously optimize both corpora to get a single set of best parameters. The smaller zero-results corpus stayed at 96.2% while the full corpus slightly decreased to 99.7%, meaning the final set of parameters was very close to optimal for both cases. (And, given that the difference between production in general and my probably-not-perfect annotations is likely more than 0.1%, it’s really no difference at all.)

Options considered include:


 * Model size—a mid-sized model (6K out of a possible 10K) was best.


 * Minimum length—input needs to be at least 3 characters to be considered


 * Language boosts—it turns out giving an 18% boost to Russian and Windows-1251 Russian is best.
 * Languages to consider—it was sometimes iffy whether Cyrillic English should be allowed in the mix, but in the end it was. The whole list to consider is Russian, English, Cyrillic English, Latin Russian, and Windows-1251 Russian. (Recall that anything correctly in Cyrillic is “Russian” and anything correctly in Latin is “English” for detection purposes.)
 * “Gibberish detection” was set of 80% of the score of pure gibberish.
 * Any language that was more than 2% worse than the best option was ignored.
 * Output was limited to one language—so if the second place language was less than 2% different the result would be no decision (“too ambiguous”).

What to Do and How to Do It?
The next question to answer is how to actually implement wrong keyboard/encoding detection. We could test everything for wrong-keyboardness, or only those with few results. The DWIM gadgets tack on additional results after the as-written results if there aren’t too many of them. The “Did you mean?” (DYM) suggestions show only a suggestion if there any results, but provide the suggested results if the current query gets zero results. TextCat, when it detects a query in a language other than that of the current wiki, presents additional separate results.

Thinking about the UI, it seems that the simplest and cleanest approach is the one we use now for DYM suggestions—if the current query has no results, show the results for the suggestion and a link to get the original (zero) results, otherwise show the suggestion as a link.

For example, searching for einsein (with zero results) on English Wikipedia gives…

"Showing results for einstein. Search instead for einsein."

...while searching for alpert einstein (with fewer than 50 results) gives a suggestion…

"Did you mean: albert einstein"

So now we have a number of options for when to transform a query or suggestion. Getting results counts for queries (or transformed queries) is expensive. Doing detection is relatively cheap, and the actual transformation is very cheap.


 * An aggressive approach would be to do detection on every query and look for transformed results from anything that passes detection. This would have higher costs and be more likely to get false hits, but would have higher recall.
 * A lazy approach would be similar to the DWIM gadget—just run the transformation for anything with fewer than, say, 3 results, and show the transformed results. This is likely to get more false positives on short queries, and on queries that partially transform into punctuation (which is mostly stripped at search time). In general, this is a higher recall, lower precision approach. It would also be a bit more expensive to run.
 * A cautious approach would be to only do detection on queries that get fewer than, say, 3 results, and only do the transformation on queries that pass detection. This would probably be the cheapest alternative, and would probably be higher precision.
 * A clever approach would be to use the new second-try search scoring mechanisms to predict the likelihood of something good happening based on the patterns I’ve seen, and use some pre- and/or post-filtering to improve the quality of the suggestions.

I am currently in favor of a cleverly aggressive approach, using heuristics to predict success and filter unlikely candidates before (i.e., “bail early”) or after transformation to improve the precision of the suggestions.

Some of my previous heuristics allow us to bail early and not bother to do detection based on properties of the string and converted string (conversion is cheap).

Don’t make suggestions for queries…


 * that are shorter than four characters (bail early)


 * that are in all caps (bail early)


 * with individual words in both character sets† (bail early)


 * with any remaining characters in the original character set after conversion (bail early)


 * without at least four target characters in a row after conversion (bail early)


 * with another language as a close contender for second place†† (built into TextCat parameters)

"† I originally discarded any queries in mixed Latin and Cyrillic characters like Уoutube (with a Cyrillic У), but I’ve seen good examples now with both character sets, though not in the same word.""†† My original heuristic done by eye suggested bailing if the second-place contender’s score was within 20% of the best score, but the optimization process whittled that down to only 2%, presumably because the other optimizations improved the accuracy of the primary detection."Additional heuristics related to second-try scoring mechanism:


 * query contains four or more of the same letter (non-punctuation, non-space) in a row (bail early)


 * query contains two or more periods in a row (decrease confidence)


 * transformed query contains six or more Latin consonants in a row (decrease confidence)


 * query contains five or more Latin consonants in a row (increase confidence)


 * query is Latin, but starts with a comma or semicolon (increase confidence)

Next Steps

 * Commit the en_cyr, ru_lat, and ru_win1251 TextCat models (to the PHP repo and Perl repo) (T213931—DONE)


 * Deploy the updated version of TextCat so it’s available when we need it. (T213936)


 * Implement second-try scoring heuristics, detection, and suggestions in David’s framework. (T138958)