User:TJones (WMF)/Notes/Typing on the Wrong Keyboard—Russian and English

From MediaWiki.org

June 2016 — See TJones_(WMF)/Notes for other projects.

Highlights

Looking for mis-keyboarded queries in the "right" character set (i.e., Latin on English Wikipedia or Cyrillic on Russian Wikipedia) can explain some gibberish queries and give some improvement in results, but it's very expensive because there are so many candidate queries.

Looking for mis-keyboarded queries in the "wrong" character set (i.e., Cyrillic on English Wikipedia or Latin on Russian Wikipedia) can explain a lot of gibberish queries and give better results, especially on Russian Wikipedia, where possibly more than 1% of queries are accidentally typed on the wrong keyboard!

Limiting the scope to only zero-result queries or perhaps poorly performing (fewer than three results) queries could be computationally less expensive and much more effective!

Background

A while back Max commented on the fact that people who use multiple keyboards (e.g., English and Russian keyboards) sometimes forget to switch to the right one before they start typing, so it's possible to type "in English" on a Russian keyboard, or vice versa. If it's something we could detect, we could transliterate the query back into the correct character set (via keyboard mappings rather than the usual phonetic mappings used in transliteration).

I decided to spend my 10% project seeing what I could do about it.

Mapping characters

The character mapping I used came from just typing characters on my computer using a standard U.S. QWERTY keyboard layout and the default Russian keyboard layout as provided in OS X. The mappings are below:

фисвуапршолдьтщзйкыегмцчня
abcdefghijklmnopqrstuvwxyz

ФИСВУАПРШОЛДЬТЩЗЙКЫЕГМЦЧНЯ
ABCDEFGHIJKLMNOPQRSTUVWXYZ

]["№%:,.;хъёХЪЁжэЖЭбюБЮ
`~@#$%^&*[]\{}|;':",.<>
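The rows above can be turned directly into a character translation table. The following is a minimal sketch, not the code used for this project; the names `CYRILLIC`, `LATIN`, `to_latin`, and `to_cyrillic` are made up for illustration. Because each character is translated exactly once, it does not matter that some punctuation marks appear on both sides of the mapping.

```python
# Keyboard-position mapping copied from the rows above
# (Russian layout characters, then the U.S. QWERTY characters on the same keys).
CYRILLIC = (
    "фисвуапршолдьтщзйкыегмцчня"
    "ФИСВУАПРШОЛДЬТЩЗЙКЫЕГМЦЧНЯ"
    "][\"№%:,.;хъёХЪЁжэЖЭбюБЮ"
)
LATIN = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "`~@#$%^&*[]\\{}|;':\",.<>"
)

# One table per direction; characters not in the mapping (spaces, digits) pass through.
CYRILLIC_TO_LATIN = str.maketrans(CYRILLIC, LATIN)
LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def to_latin(query: str) -> str:
    """Re-key a query typed on the Russian layout as if typed on QWERTY."""
    return query.translate(CYRILLIC_TO_LATIN)


def to_cyrillic(query: str) -> str:
    """Re-key a query typed on QWERTY as if typed on the Russian layout."""
    return query.translate(LATIN_TO_CYRILLIC)
```

For example, `to_latin("ищвупф")` gives `bodega` and `to_cyrillic("zgjybz")` gives `япония`.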

Identifying candidates

For identification, I used the mapping above to convert the English query-based language model for TextCat into a "Cyrillic English" language model. Fortunately, consonants and vowels generally don't align between the keyboard mappings, so words typed on the wrong keyboard tend to look like gibberish.
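Converting a language model this way amounts to re-keying every n-gram in it. The sketch below assumes a model can be represented as a list of (n-gram, count) pairs, which is a simplification and may not match the actual TextCat file format; `convert_model` is a hypothetical helper name.

```python
# Keyboard mapping from the "Mapping characters" section.
LATIN = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "`~@#$%^&*[]\\{}|;':\",.<>"
)
CYRILLIC = (
    "фисвуапршолдьтщзйкыегмцчня"
    "ФИСВУАПРШОЛДЬТЩЗЙКЫЕГМЦЧНЯ"
    "][\"№%:,.;хъёХЪЁжэЖЭбюБЮ"
)
LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)


def convert_model(model):
    """Re-key each n-gram of an English model into "Cyrillic English".

    `model` is assumed to be a list of (ngram, count) pairs; counts and
    ordering are left untouched, only the characters change.
    """
    return [(ngram.translate(LATIN_TO_CYRILLIC), count) for ngram, count in model]
```

With this, an English n-gram such as `th` becomes `ер`, which is not a particularly common sequence in Russian; that mismatch is what lets the two models be told apart.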

I'd recently gathered 100K samples of queries for looking at the issue of query-final question marks, so I took the English (enwiki) sample and looked for queries that consisted of nothing but Cyrillic characters and punctuation. There aren't a lot—only 412 (i.e., 0.412%). I excluded queries with a mix of Cyrillic and Latin characters (especially queries like Уoutube, with a Cyrillic У at the beginning) because that seems to indicate that the user had control over what keyboard they were using.
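The selection step can be sketched roughly as follows. This is an illustration rather than the command-line filtering actually used; it approximates "Cyrillic" with the basic Cyrillic block (U+0400 to U+04FF) and treats anything that is not a letter, including digits, as allowed, which is an assumption.

```python
import re

# Approximation: only the basic Cyrillic block, U+0400 to U+04FF.
CYRILLIC_LETTER = re.compile(r"[\u0400-\u04FF]")


def is_cyrillic_only(query: str) -> bool:
    """True if the query has at least one letter and every letter is Cyrillic.

    Mixed-script queries such as "Уoutube" are rejected, since mixing
    suggests the user was in control of the keyboard.
    """
    letters = [ch for ch in query if ch.isalpha()]
    return bool(letters) and all(CYRILLIC_LETTER.match(ch) for ch in letters)
```

Run over a sample of queries, `[q for q in queries if is_cyrillic_only(q)]` would give the candidate set.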

I then let TextCat attempt to categorize the queries matching the criteria above, but realized that I needed to adjust some parameters. When I first ran the identification and mapped the Cyrillic queries to the English keyboard layout, there was a lot of junk. I noticed that generally when "Cyrillic English" had barely edged out Russian, the results were bad. By default, TextCat considered any language within 5% of the best match to be plausible. I changed that to 20%, and discarded any query for which Russian was scored as a plausible second place to Cyrillic English.
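The acceptance rule described above can be written down as a small function. This is a sketch under assumptions: it takes per-language costs where lower is better (in the style of TextCat's n-gram distance), and the language labels and function names are invented for the example.

```python
def plausible_languages(costs, ratio=1.20):
    """Return languages whose cost is within `ratio` of the best, best first.

    `costs` maps language name -> cost, where lower is assumed to be better.
    A ratio of 1.05 corresponds to the default 5% window, 1.20 to the 20% window.
    """
    best = min(costs.values())
    ranked = sorted(costs.items(), key=lambda item: item[1])
    return [lang for lang, cost in ranked if cost <= best * ratio]


def is_wrong_keyboard(costs, target="cyrillic-english", rival="russian", ratio=1.20):
    """Accept only if `target` wins and `rival` is not a plausible runner-up."""
    ranked = plausible_languages(costs, ratio)
    return ranked[0] == target and rival not in ranked
```

Widening the window from 5% to 20% makes the rule stricter, not looser: more queries end up with Russian as a plausible second place and get discarded.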

I also noticed that extremely short queries (2 or 3 characters) that scored better for Cyrillic English were generally no good, either. So I filtered out any result that didn't have at least four Latin characters (A-Z, case insensitive) in a row.

There were also a lot of junky queries made up entirely of capital letters. Some were probably acronyms (neither FBI nor ФБР is particularly pronounceable as a word in Latin or Cyrillic), so I also discarded any queries that were in all capital letters.

I also filtered out any queries that had any Cyrillic characters left (using the Unicode character class `\p{Cyrillic}`) after the conversion from Cyrillic to English.
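Taken together, the three post-conversion filters might look like the sketch below. It is an illustration, not the original tooling: Python's built-in `re` module has no `\p{Cyrillic}` class, so the basic Cyrillic block (U+0400 to U+04FF) is used as a stand-in, and the all-caps check is applied to the converted text, which is an assumption about where that check was made.

```python
import re

FOUR_LATIN_IN_A_ROW = re.compile(r"[A-Za-z]{4}")
# Stand-in for \p{Cyrillic}: the basic Cyrillic block only.
ANY_CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def passes_filters(converted: str) -> bool:
    """Apply the post-conversion filters to a query already mapped to Latin."""
    if not FOUR_LATIN_IN_A_ROW.search(converted):
        return False  # too short, or no run of four Latin letters
    if converted.isupper():
        return False  # all capitals: likely an acronym or junk
    if ANY_CYRILLIC.search(converted):
        return False  # something did not convert cleanly
    return True
```

For instance, `bodega` and `drom.ru` pass, while `abc`, `FBIX`, and a string with a leftover Cyrillic letter do not.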

Preliminary results

Cyrillic English on English Wikipedia

The resulting set, while small, was very good! There were only 12 potential Cyrillic English queries out of 100K on enwiki (0.012%), but all seemed to make good sense in the Latin script. All 12 are provided below, along with their original Cyrillic forms, and a phonetic transliteration of the Cyrillic to show how they are often gibberish in Cyrillic. (The names are either of people with wiki pages or, in the case of sarter, a reasonably common surname, so there is no PII here.)

  • Andrew Kehoe / Фтвкуц Лурщу / Ftvkuc Lurshhu
  • bodega / ищвупф / ishhvupf
  • crab alaskian / скфи фдфылшфт / skfi fdfylshft
  • cracidae / скфсшвфу / skfsshvfu
  • drom.ru / вкщьюкг / vkshh'jukg
  • edwin fischer / увцшт ашысрук / uvcsht ashysruk
  • german cuisine / пукьфт сгшышту / puk'ft sgshyshtu
  • hizaki / ршяфлш / rshjaflsh
  • list of chuck norris / дшые ща сргсл тщккшы / dshye shha srgsl tshhkkshy
  • oxford / щчащкв / shhchashhkv
  • sarter / ыфкеук / yfkeuk
  • technical task / еусртшсфд ефыл / eusrtshsfd efyl

I had considered using a dictionary to identify good mappings, but that wouldn't catch many names (Kehoe, sarter) or typos (alaskian for Alaskan), or URLs.

Cyrillic English on Russian Wikipedia

That worked so well that I used the same process on a 100K sample from Russian Wikipedia (ruwiki). Of course, there were many more queries that met the initial criteria (all Cyrillic characters)—in fact, most of them did: 79,877 (79.877%).

I applied the same process of language identification (TextCat using query-based language models for Russian and Cyrillic English, allowing for a second place identification within 20% of the best option) and filtering (discarding queries with any remaining Cyrillic characters, queries where Russian was identified as a plausible second language, queries in all caps, and queries without at least 4 Latin characters in a row after conversion).

Only 229 queries (0.229%) met all the criteria. I evaluated them all quickly, and 141 (~62%) seemed like good results, 34 (~15%) seemed like plausible results, 43 (~19%) seemed like poor results, and 11 (~5%) seemed like junk in either character set.

The junk consisted mostly of repeated letters (neither Ыыыыыыы nor Sssssss seems to be a great query, though the former seems to be used in Russian memes to indicate laughter and the latter was a horror movie in the 70s).

Some of the good results are very good, on par with the enwiki results above:

  • arch enemy / фкср утуьн / fksr utu'n
  • big data / ишп вфеф / ishp vfef
  • eagles / уфпдуы / ufpduy
  • gun n roses / пгт т кщыуы / pgt t kshhyuy
  • linux / дштгч / dshtgch
  • metallica / ьуефддшсф / 'uefddshsf
  • silk road / ышдл кщфв / yshdl kshhfv
  • tesseract / еуыыукфсе / euyyukfse
  • visual key / мшыгфд лун / mshygfd lun

Of course, the cost of finding these on ruwiki is much higher, since we had to sort through almost 80K candidates (instead of about 400 candidates) to get 175 good or plausible results (instead of 12).

Latin Russian on Russian Wikipedia

Since everything worked so well, I decided to flip the script and look for Russian typed on an English keyboard.

I converted the TextCat Russian query-based language model to Latin characters. I extracted ruwiki queries that were all Latin characters and punctuation (16,566, or 16.566%!) and used the same parameters for language detection (within 20% is good enough for second place) and filters (no Latin characters remaining after transliteration, no queries where English was a plausible second place language, no queries in all caps, and at least 4 Cyrillic letters in a row). There were 1,473 transliterated results (1.47% of all queries!).

I took a random sample of 50 and investigated them. 48 (96%) seemed like good candidates, and 2 (4%) did not. Some examples are below, showing the original query in Latin, the Cyrillic keyboard-based transliteration, and a translation of the Cyrillic/Russian.

  • zgjybz / япония / "Japan"
  • uhepbz / грузия / "Georgia"
  • bhdby ije / ирвин шоу / "Irwin Shaw"
  • ktlybrb ehfkf / ледники урала / "the glaciers of the Urals"
  • vfhctkm / марсель / "Marseilles"
  • fdnjh / автор / "author"
  • vthndfz here / мертвая руку / "dead hand"
  • fhbcnjntkm / аристотель / "Aristotle"
  • 'qatktdf ,fiyz / эйфелева башня / "Eiffel Tower"
  • gthdjt egjvbyfybt j utjhubtdcrjq ktynjxrt / первое упоминание о георгиевской ленточке / "the first mention of St. George's Ribbon"

If the numbers hold, that's around 95% precision and a chance to improve approximately 1.4% of all queries on ruwiki!
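The arithmetic behind that estimate is simple enough to spell out. The sketch below just multiplies the sampled precision by the share of queries that were transliterated; the function name is made up, and with a sample of only 50 the precision figure is of course rough.

```python
def estimate_improved_share(total_queries, transliterated, sample_size, sample_good):
    """Estimate precision and the share of all queries that could be improved."""
    precision = sample_good / sample_size      # fraction of sampled results judged good
    share = transliterated / total_queries     # fraction of all queries transliterated
    return precision, precision * share


# Figures from the ruwiki Latin Russian experiment above.
precision, improved = estimate_improved_share(100_000, 1_473, 50, 48)
print(f"{precision:.0%} precision, {improved:.2%} of all queries")
```

With the numbers above this works out to 96% precision and about 1.41% of all queries.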

Latin Russian on English Wikipedia

I also looked at potential Latin Russian on English Wikipedia. There were a lot of queries made up of only Latin characters and punctuation (93,510 or 93.510%). I ran them through TextCat and filtered them as above, leaving 237 possible transliterations. I sampled 50; none were particularly good, and only 1 or 2 were plausible.

Caveats and future directions

Since this was a 10% project and just a proof of concept, I didn't consider Ukrainian vs Russian keyboard differences (we do get Ukrainian queries on enwiki, for example, but fewer), and I didn't pay too much attention to other languages that use the Latin alphabet (I did see one instance of carro / сфккщ / sfkkshh, which looks like Spanish carro, "cart, car, train car"). Also, there's nothing close to production-ready code—almost everything was done on the command line!

It's not clear whether we could reliably distinguish, say, Cyrillic Spanish from Cyrillic English or a Ukrainian keyboard from a Russian keyboard. Unfortunately, there's a combinatorial explosion of possibilities. But the most bang for the buck would come from trying the most obvious options for the given scenario. For example, we get more Russian than Ukrainian queries on enwiki, and English is more likely than Spanish on enwiki, so we might only consider a Russian keyboard and Cyrillic English on enwiki.

Also, my original query pool was all queries, and this is generally something we might want to do only for queries that get no results or poor results (fewer than three). That might increase the percentage of queries we consider for transliteration that give good results, while decreasing the overall number of queries transliterated.

In general, this seems like a plausible way to get results for some number of queries that seem to be in the right character set for a given wiki, but are in fact in another language, and this could be generalized to other keyboard mappings (e.g., Arabic, Greek, Hebrew, etc.).

This seems especially promising for Latin accidentally typed into Russian Wikipedia!