User:TJones (WMF)/Notes/Review of Language Identification in Production, with a Special Focus on Stupid Identification Tricks

Background
Back in August 2017 we discovered "the Scourge of Commas" on Chinese Wikipedia: most punctuation characters and many other characters were being indexed as a comma. This was annoying on Chinese-language wikis because essentially any punctuation character could match any other, and a lot of generally useless tokens were being indexed—about 16% of all tokens. A side effect of this was that almost any punctuation symbol that was searched on Chinese Wikipedia would get lots of results.

Generally, punctuation is ignored by our language analysis, so only exact matches on titles and redirects get results (e.g., on English Wikipedia there's a redirect from "—" to the "Em dash" section of the "Dash" article). The cumulative effect of this was that searching for certain punctuation on several Wikipedias would get only one result (an exact redirect title match), get identified as Chinese (an apparent statistical fluke), and then get lots of results on Chinese Wikipedia. T172653 fixed the "lots of results on Chinese Wikipedia" part, but it is still possible for the rest of the chain to occur.

Before looking into this more, I'd only been able to find one example that still causes problems. Searching for "..." on all of the non-Chinese Wikipedias with language detection enabled (German, English, Spanish, French, Japanese, Dutch, Portuguese, and Russian) returns one result (a redirect to the article on the ellipsis) and one result from Chinese Wikipedia. The bad news is that "..." is being detected as Chinese. The good news is that the one result is the Chinese article on the ellipsis, so at least it isn't weird and mind-bogglingly irrelevant—just weird.

So! The point is to take a look at a sample of queries that are eligible for language detection (i.e., "poorly performing" queries that get fewer than three results) and see how often they give nonsensical results like hyphens did before and ellipses do now, and see if there are any simple filters we can put in place to prevent the silly results. Extrapolating wildly from the one example, filtering queries that are made up entirely of punctuation might do the trick, but I need to see if there are other examples of problem queries, and also get a sense of how frequent the problems are overall.
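As a rough illustration of the "entirely punctuation" filter floated above, here is a minimal Python sketch. The function name `is_punctuation_only` is hypothetical, and treating Unicode symbol categories the same as punctuation is an assumption of this sketch, not something the production system is known to do.

```python
import unicodedata


def is_punctuation_only(query: str) -> bool:
    """Return True if the query contains only punctuation, symbols, or whitespace."""
    visible = [ch for ch in query if not ch.isspace()]
    if not visible:
        # Empty or whitespace-only: there is nothing to identify.
        return True
    # Unicode general categories starting with "P" are punctuation;
    # those starting with "S" are symbols (an assumption: treat both alike).
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in visible)


for q in ["...", "\u2014", "sha512", "C++"]:
    print(repr(q), is_punctuation_only(q))
```

A filter like this would run before language identification, so a query such as "..." would never reach the detector at all.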

There are several things that have to fall into line, so for now I'm going to only try to focus on the cases where everything goes wrong. For example, English Wikipedia queries might be incorrectly identified as, say, Chinese, but it doesn't matter if either the query gets a lot of results on English Wikipedia (so language identification doesn't fire) or it gets no results on Chinese Wikipedia (so no results are shown)—in those cases, users see nothing weird, even though one step in the process could have gone weirdly.

Also, since I'm having to dig kind of deeply into the language identification results, I'm going to do a survey of how the system is doing overall—how many queries are getting "Did You Mean" (DYM) suggestions, what languages they are in, and whether they get any results in the identified language, etc.

Side Quest: Data Extraction
I started off with my usual data extraction: for each of the nine Wikipedias with language identification enabled, pull a random sample of five to ten thousand queries that meet the following requirements (encoded in HiveQL):


 * the sampled queries got fewer than three results when searched (that's our target group of poorly performing queries that are eligible for language identification)
 * the sample is across a week, to account for any cyclical effects (weekend queries may be different from weekday queries)
 * sampled queries are limited to one query per IP per day (to reduce bots that slip through other filters, and so power users aren't over represented)
 * we exclude any IPs with more than 30 queries in a day (to reduce the number of bots and other atypical users)
 * we require the session to include a near-match query, which runs before a full-text query and originates in the search box in the upper right (or left) corner of the page (again, to filter in favor of human searchers).
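The sampling requirements above can be sketched in Python roughly as follows. This is not the original HiveQL. The row layout (`ip`, `day`, `query`, `hits`, `has_near_match`) and the function name are hypothetical, and the near-match requirement is modeled as a per-row flag for simplicity.

```python
import random
from collections import defaultdict


def sample_queries(log_rows, max_hits=2, max_per_ip_per_day=30, seed=0):
    """Pick at most one eligible query per IP per day.

    log_rows: iterable of dicts with the hypothetical keys
    'ip', 'day', 'query', 'hits', and 'has_near_match'.
    """
    by_ip_day = defaultdict(list)
    for row in log_rows:
        by_ip_day[(row["ip"], row["day"])].append(row)

    rng = random.Random(seed)
    sample = []
    for rows in by_ip_day.values():
        if len(rows) > max_per_ip_per_day:
            continue  # likely a bot or otherwise atypical user
        eligible = [
            r for r in rows
            if r["hits"] <= max_hits and r["has_near_match"]
        ]
        if eligible:
            sample.append(rng.choice(eligible))  # one query per IP per day
    return sample
```

Running this over a week of rows (seven distinct `day` values) would cover the weekday and weekend cycle mentioned in the list.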

While looking through the queries I found several that got more or fewer results when re-run than they did when stored in the logs (about a month ago); being off by one or two results was no surprise. But I found a few that were off by hundreds or thousands of results.

Long story short, our randomization process was sometimes picking up a near-match variant of a query and thus dropping quotes around a phrase.† For lots of other tasks we've done, this hasn't mattered much. For random queries, a different query from a user is still a random query. For the purposes of the content of the query, losing the quotes doesn't matter. But in this case, it did matter, because I was re-running the queries to get result counts in the original wiki and the wiki selected by language identification, and the numbers were way off.

† Near match tries several ways to get a title match for your query, including exact match (Albert Einstein), ignoring case (albert einstein), ignoring diacritics (Àłḇėŗṱ Ēîñšţęïń), and dropping one outer set of quotes ('Albert Einstein'). If you search for albert einstein all of these transformations give you back your original query, so there often aren't any actual variants. If you search for Albert Einstein you get an exact match right away, and no other variants have to be tried. If you search for 'àŁḇĖŗṰ ĒîÑšŢęÏń' you are a trouble-maker, but it works!

Fortunately, Erik was able to pull together a Jupyter notebook that did more reliable sampling. It also doesn't seem to hang the way Hive sometimes does. (With Hive, I usually break up my one-week sample into seven one-day samples and combine them. On the larger wikis, even one-day samples sometimes get stuck at 99% during the reduce phase, and I have to abort and pick a different day offset by 7 days—e.g., replacing a failed Wednesday sample with a different Wednesday sample from seven days earlier or later.)
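To make the near-match variants described above concrete, here is a small Python sketch that generates them for a query. It is an approximation for illustration only: the function names are hypothetical, lowercasing stands in for case-insensitive matching, and Unicode NFD decomposition does not fold every letter that a production analyzer would (for example, ł has no decomposition).

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_outer_quotes(text: str) -> str:
    """Drop one outer set of double quotes, if present."""
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def near_match_variants(query: str) -> list:
    """Return the distinct variants, in the order they would be tried."""
    candidates = [
        query,                     # exact match
        query.lower(),             # ignoring case (approximated by lowercasing)
        strip_diacritics(query),   # ignoring diacritics
        drop_outer_quotes(query),  # dropping one outer set of quotes
    ]
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(near_match_variants("albert einstein"))
print(near_match_variants('"Albert Einstein"'))
```

The second call shows why sampling a variant instead of the original query matters here: the last variant has lost its quotes, and a phrase search and a bag-of-words search can return very different result counts.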

With my new Jupyter notebook–based query, I was able to extract 5,000 poorly performing queries (i.e., < 3 results) for each of the German, English, Spanish, French, Japanese, Dutch, Portuguese, Russian, and Chinese Wikipedias.

Troublesome Tabs and Negative Numbers
I dropped one query each from the French, Portuguese, and Russian samples that ended in a tab followed by a -1 (so they have 4999 rather than 5K queries in them). Somewhere along the way in my ad hoc analysis pipeline the combination of a tab and a number caused problems for parsing the data, and it was easier to drop them since it was only one query out of 5,000, and the common format looks very bot-like anyway.

Also, there were some repeated queries, which were deduplicated. So if the total number of queries adds up to 5,000 (or 4,999) then that's all of them. If it adds up to anything else (4947–4987) then that's the deduplicated list. Generally, 0.25%–1% of queries were duplicates (1.06% was the largest amount).

Now, back to our regularly scheduled programming...

Gathering Stats
For each query, I extracted the number of results it got from the logs, re-ran it to see how many results it gets now (to double check that the query extraction is working as advertised), ran language identification with the config of the source wiki, and if a different language was identified,‡ I ran the query on the Wikipedia of the identified language and recorded the number of results it got there.

‡ There are three and a half basic relevant outcomes of the language identification: 1) the text is identified as being in the language of the 'home wiki' so we do nothing, e.g., identified as French while searching French Wikipedia; 2) the text is identified as being in another language and thus eligible for cross-language searching; and 3) the text is not identified as being in any language because it is too ambiguous or 3.5) too short.

I also recorded whether a DYM suggestion was shown, and how many results that DYM suggestion gets, because...
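The "three and a half" outcomes described above can be sketched as a small classifier. Everything specific here is an assumption for illustration: the function name, the lower-is-better cost scores, and the default values for `min_length`, `results_ratio`, and `max_returned` are hypothetical and are not taken from the production configuration.

```python
def classify_outcome(query, ranked, home_lang,
                     min_length=3, results_ratio=1.05, max_returned=2):
    """Classify a language identification result.

    ranked: non-empty list of (language, cost) pairs sorted by ascending
    cost, where a lower cost means a better match (an assumption).
    Returns (outcome, language), with language None when nothing is chosen.
    """
    if len(query) < min_length:
        return ("too_short", None)

    best_lang, best_cost = ranked[0]
    # Candidates whose cost is within a small ratio of the best one.
    close = [lang for lang, cost in ranked if cost <= best_cost * results_ratio]
    if len(close) > max_returned:
        return ("too_ambiguous", None)

    if best_lang == home_lang:
        return ("home", best_lang)   # do nothing
    return ("other", best_lang)      # eligible for cross-language search


print(classify_outcome("bonjour le monde", [("fr", 100), ("en", 150)], "en"))
```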

Blocking Suggestions
There's another wrinkle to the language identification process. If a query gets zero results and a "Did You Mean" (DYM) suggestion is available, we automatically search for the suggestion and show its results. If the DYM suggestion has fewer than three results, then we will try the cross-language search on the original query and show those results, if any. So, I recorded whether there was a DYM suggestion for each query, and the number of results the DYM suggestion got to be able to determine whether the cross-language results were actually shown to the user.
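The interaction between DYM suggestions and cross-language results described above can be written out as a decision function. This is a sketch of the logic as stated in this section, not the production code. The function and parameter names are hypothetical, and it assumes suggestions are only searched automatically for zero-result queries.

```python
def cross_language_shown(orig_hits, dym_hits, detected_lang, home_lang,
                         cross_hits, eligibility_threshold=3,
                         dym_block_threshold=3):
    """Return True if cross-language results would be shown to the user.

    dym_hits is None when no "Did You Mean" suggestion was available.
    """
    if orig_hits >= eligibility_threshold:
        return False  # not poorly performing; language detection never runs
    if detected_lang is None or detected_lang == home_lang:
        return False  # nothing identified, or identified as the home language
    if (orig_hits == 0 and dym_hits is not None
            and dym_hits >= dym_block_threshold):
        return False  # the suggestion's results block cross-language search
    return cross_hits > 0  # only shown if the other wiki has any results


# Zero results, a suggestion with two results, and a detected language:
print(cross_language_shown(0, 2, "pt", "en", cross_hits=10))
```

Keeping `dym_block_threshold` separate from `eligibility_threshold` makes it easy to experiment with loosening only the blocking criterion.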

Since DYM suggestions are based on the general word statistics and can vary by shard, you aren't guaranteed to get the same DYM suggestion every time you search for a particular query, as the frequency rank for rarer words is not as consistent across shards.

For example, the search onibus espaciais ("space shuttle" in Portuguese) on English Wikipedia gets zero results. At the time of this writing, on some shards, it gets a suggestion of onions espacial, which is searched instead, and gets two results; on other shards, it gets a suggestion of onions especiais, which gets only one result. Since both are fewer than three, and the original query is recognized as Portuguese, cross-language results for onibus espaciais are shown, too. (If you are lucky, as in this case, you might manage to also get sister-search results, pushing our UI to the limit.)

The query kleine Schraub-Ösen on English Wikipedia also gets different suggestions from different shards: kleine schnaus ösel gets zero results, triggering cross-language results from German Wikipedia; kleine schramm östen gets three results, which blocks cross-language searching.

In summary, a few important things are going on here:


 * Because the DYM suggestions vary across shards (because the term frequency stats vary across shards), the suggestion behavior is not entirely predictable, and the suggestions and counts I have recorded may vary in production, but that should average out across the whole sample.
 * DYM suggestions can block cross-language searching, so in some cases the language identification may have worked, and cross-language results may exist, but they aren't shown.

Data Integrity Check
At this point, since I had the number of results at the time of the original search for each query, plus the current number of results for each query, I decided to look for any big discrepancies.

I found a few, but digging deeper revealed WikiGnomes hard at work. A number of articles were written, and references to them sprinkled throughout other articles. A few edits were made to templates.

The biggest change was from 0 results to 48 results for セルジ・サンペル (Sergi Samper) on Japanese Wikipedia. When I went back to check the results yet again, he got almost 2400 results. Samper is a Spanish soccer player who left Spain and joined a team in Japan at the beginning of March, i.e., between the time of the original query and the time when I first re-checked it. Though the Japanese version of his name was identified as Japanese and wasn't really relevant to language identification, it still provided a good check on our data extraction process, and demonstrates what a difference of a month can make on-wiki!

A Language Identification Primer/Refresher
To understand what's going on with the data, it helps to understand a bit about how TextCat works and how it is configured for language identification. The underlying mechanism of TextCat is to compare frequency-ranked lists of 1- to 5-grams extracted from sample texts against similar lists generated for a much larger known sample of the language to be identified.
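The rank-comparison idea described above can be sketched in a few lines of Python. This is a simplified version of the classic n-gram "out-of-place" measure, written for illustration only. It is not the production TextCat code, and the function names, the underscore padding, and the toy training texts are all assumptions of the sketch.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=10000):
    """Map each of the top_k most frequent 1- to max_n-grams to its rank."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ordered)}


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a max penalty."""
    max_penalty = len(model_profile)
    total = 0
    for gram, rank in sample_profile.items():
        if gram in model_profile:
            total += abs(rank - model_profile[gram])
        else:
            total += max_penalty
    return total


def identify(text, models):
    """Return the language whose model has the lowest out-of-place cost."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))


# Toy models built from single sentences; real models use far more text.
models = {
    "en": ngram_profile("the cat sat on the mat"),
    "de": ngram_profile("die katze sitzt auf der matte"),
}
print(identify("the cat sat on the mat", models))
```

With models this small the output is only meaningful for text very close to the training sentences; the point is the mechanism, where a lower total rank displacement means a better match.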

Data Selection: I originally pulled samples of queries from various Wikipedias to use for configuration and optimization. I intentionally excluded gibberish, ambiguous names,§ mixed-language queries, and "non-language" queries (like acronyms or ID numbers/strings).

§ Some names very clearly come from a particular distinctive ethnolinguistic source. Giovanni Boccaccio is very Italian, for example. Often the script gives it away: any name in Armenian script is probably sufficiently 'Armenian' to be counted as such, even Ջովաննի Բոկաչչո ('Giovanni Boccaccio'). Multi-lingual names (like Maria) or mixed-source names (like Alberto Fujimori, or the 'statistically likely' yet uncommon Mohammed Lee) were dropped. See the TextCat Demo and try the 'Names are Weird' demo for more—select it from the drop down in the upper right corner.

Model Size: The original TextCat models were made up of only 400 n-grams, in part to save on computation—it was implemented in the mid-90s—and in part because it was intended to be run on larger texts, like emails, where that level of detail is probably enough. Our TextCat models are built on 10,000 n-grams, but we made the number we actually use configurable. The larger models give us more resolving power on shorter strings (like queries so often are).

Query-Based Models: We extracted models for many languages based on the text from their respective Wikipedias, but we also built models based on actual queries that users performed on many Wikipedias. Queries are likely to use less formal language, use less punctuation, and have other features differing from the more formal text of Wikipedias. We did see an improvement in language detection accuracy when using the query-based models on actual queries.

Language Selection: Enabling every language for which we have an n-gram model runs a fair bit slower, and, more importantly, does not give the best results. Scots, for example, is fairly similar to English, but fairly rare among English Wikipedia queries; enabling it would lead to a lot of false positive identifications of Scots. Even a language that does occur on a given wiki might do more harm than good; for example, Spanish and Portuguese will sometimes be mistaken for each other. If we have a lot of Spanish queries but very few Portuguese queries, enabling Portuguese could generate more mistakes (Spanish identified as Portuguese) than correct answers (Portuguese identified as Portuguese). So, for each Wikipedia where we enabled language identification, we selected a list of languages based on what was in the query sample from that wiki, and sometimes excluded languages that hurt more than they helped. (A lot of the time I spent on TextCat improvements was aimed at doing a better job handling ambiguous cases to be able to reinstate useful languages we had to originally omit because of false positives.)

Boosting: Language detection isn't done in a vacuum, and we do have some a priori notion of what is more likely. Given some narrow ambiguity between, say, French and English—where some words are identical, like adorable, which is the same in English, French, Spanish, and Catalan!—we'd like it to break toward English on English Wikipedia, and toward French on French Wikipedia, all other things being equal. So, we give a boost to some languages on each Wikipedia where we do language detection. We settled on boosting two languages: the home language of the wiki, and a second language, which on every non-English wiki turned out to be English (on English Wikipedia, the second language to get a boost is Chinese). They get a 14% boost in their score to reflect the prevalence of the languages and overcome any close ambiguity versus other languages.
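To illustrate how a 14% boost can tip a close call, here is a minimal Python sketch. It assumes scores are costs where lower is better and that a boost scales a language's cost down by the boost fraction. That mechanism, the function name, and the example costs are assumptions made for illustration, not a description of the production implementation.

```python
def apply_boost(costs, boosted_langs, boost=0.14):
    """Scale down the cost of boosted languages (lower cost = better match)."""
    return {
        lang: cost * (1 - boost) if lang in boosted_langs else cost
        for lang, cost in costs.items()
    }


def best_language(costs):
    """Return the language with the lowest cost."""
    return min(costs, key=costs.get)


# Hypothetical costs for an ambiguous query on English Wikipedia,
# where English (home) and Chinese (second language) are boosted.
raw = {"en": 1000, "fr": 950}
boosted = apply_boost(raw, {"en", "zh"})

print(best_language(raw))      # without the boost, French narrowly wins
print(best_language(boosted))  # with the boost, the tie breaks toward English
```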

Languages Enabled for Detection, by Wiki
Below is Table 1, listing the languages enabled for each wiki. The first language is the home language of the wiki; the first and second languages are bolded to indicate that they also receive a boost in scoring.

—Table 1—

Some Additional Stats
Since I was digging through all this data, I decided to take the opportunity to generate a snapshot (or maybe a dozen snapshots from different angles) of how TextCat and language identification are performing. I did not look carefully at all ~45K language identification results, but I did gather some relatively interesting statistics.

Language Identification by ID Type
Too hot, too cold, too hard, too soft... just right: Table 2 below shows how many queries for each wiki were labeled "too short" (<2%), how many were "too ambiguous" (4–40%), how many were in the home language of the wiki (34–79%), and how many were left to potentially be shown as cross-language search results (12–28%).

—Table 2—

Language Identification by ID Language
By language: Table 3 below shows the breakdown by language for each sample, including the home language (dark green) and all other enabled languages (light green) for each wiki. This table includes the "home" and "remaining" items from Table 2 above. Suggestions for possible additional languages to enable are shown in light blue (see more below).

—Table 3—

For those who don't know all the language codes by heart—slackers!—here's the list (ordered by code, as in Table 3 above): Afrikaans (af), Arabic (ar), Bengali (bn), Breton (br), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Persian (fa), Finnish (fi), French (fr), Irish (ga), Hebrew (he), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id), Icelandic (is), Italian (it), Japanese (ja), Georgian (ka), Korean (ko), Latin (la), Latvian (lv), Burmese (my), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Russian (ru), Swedish (sv), Telugu (te), Thai (th), Tagalog (tl), Ukrainian (uk), Urdu (ur), Vietnamese (vi), and Chinese (zh).

Note that most languages are identified (not necessarily 100% correctly, of course) in a ~5K sample. Burmese (Myanmar) and Telugu are enabled for one wiki each, but no queries are identified as such. Hebrew (3/6), Hindi (1/2), Armenian (1/2), and Japanese (3/6) are identified in half of the wikis for which they are enabled. Arabic, Greek, Korean, and Thai are enabled on multiple wikis, but each has one wiki where no queries are identified as such.

Some of the less ambiguous scripts (like Hebrew) were enabled based on the presence of a very small number of examples in the original data because they are very likely to be correctly identified, even if they are unlikely to occur. I now think we should have been more aggressive in enabling these less ambiguous scripts. Even if they are possibly incorrect (e.g., you can have Yiddish in Hebrew rather than Latin script), the wiki of the "most obvious" language is often the best other wiki to search. For some languages and scripts, like Arabic, Hindi/Devanagari, and Russian/Cyrillic, it might make sense to enable them on every wiki as the most likely language for the script, given that there were no other likely candidates in their original training samples.

Language Identification Actually Shown
Below is a breakdown of cross-language results that were actually shown, further broken down by the number of results the original query got (0/1/2). The number of queries blocked by DYM suggestions with ≥ 3 results is also shown.

For example, on dewiki, 53.92% of queries with cross-language results (275 out of 510) actually had those results shown; 157 showed cross-language results on a query that had 0 results, and 59 each for both 1-result and 2-result original queries. 46.08% (235 out of 510) were blocked by suggestions.

—Table 4—

For comparison, here's a breakdown of the total number of results the original (~5K, but deduplicated) queries got for each wiki. Roughly: about 75% get 0 results, 15% get 1 result, and 10% get 2 results.

—Table 5—

It's interesting to note that zero-result original queries were shown more cross-language results (222 on average, see Table 4), but at a lower percentage rate: 222/3698 = 6%, vs 82/780 = 10.5% or 60/492 = 12.2%. There's an interesting pattern here: as the number of results goes up, the raw number of queries with cross-language results goes down, but the percentage with cross-language results goes up. This makes sense; assuming the language detection is generally correct, then a query in, say, Russian that gets more results on English Wikipedia is also more likely to get at least some results on Russian Wikipedia.
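The rates quoted above are easy to recompute. The short Python sketch below uses only the average counts given in the text (shown counts of 222, 82, and 60 against totals of 3698, 780, and 492); the variable and function names are just for illustration.

```python
# Average counts per wiki, keyed by the number of results the original query got.
shown = {0: 222, 1: 82, 2: 60}       # queries with cross-language results shown
totals = {0: 3698, 1: 780, 2: 492}   # all sampled queries with that many results


def shown_rate(hits: int) -> float:
    """Percentage of queries with `hits` results that showed cross-language results."""
    return 100.0 * shown[hits] / totals[hits]


for hits in sorted(shown):
    print(f"{hits} results: {shown[hits]}/{totals[hits]} = {shown_rate(hits):.1f}%")
```

The raw counts fall as the number of results rises, while the rate climbs, which is the pattern described in the paragraph above.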

Below is Table 6, similar to Table 3 above, except only queries with cross-language results that were actually shown to users (or, "probably shown," modulo sharding variability; see above) are included. The last row shows the percentage shown, after ignoring home wiki results (e.g., it wouldn't make sense to show additional results for queries identified as German while on German Wikipedia, so those don't really count as "not being shown"). Columns with no results do not have a percentage, vs. columns where no results were shown, which have 0.0%.

The two main reasons for results not being shown are:


 * No results were found in the language detected—e.g., we identified a query as Hungarian, but when we searched Hungarian Wikipedia, we found nothing. This could either be because the language identification was wrong, or because the query is very much Hungarian, but not a very good query and so doesn't match anything—which is common for long queries that are snippets of cut-n-pasted text, for example.
 * A DYM suggestion was made for the query, and the suggestion got three or more results, blocking the display of the cross-language results (more details on that coming up).

—Table 6—

Some observations:


 * Several languages had more than 70% of their cross-language results shown: Arabic, Greek, Persian, Armenian, Korean, Russian, and Chinese. The average was about 37%.
 * Note that these are generally languages with different writing systems from the wikis with language detection enabled, which makes it easier to do detection reasonably well.
 * The most queries with results shown were in English, Arabic, German, Russian, Chinese, and French—all world languages.
 * Breton (66K), Irish (50K), Icelandic (48K), and Urdu (144K) had no results shown, despite having some queries identified as being such. These are all small Wikipedias with ~50K–150K articles, so coverage of possible queries is far from complete.

Language Identification Blocked by Suggestions
Tables 7 & 8 below summarize the number of queries with cross-language results that could have been shown, but which were blocked by a DYM suggestion having three or more results.

Table 7 shows the number and percentage of those shown vs blocked, by wiki. (This is a repeat of data shown in Table 4 above, for convenience, without the 0/1/2 results breakdown.)

—Table 7—

Table 8 shows the language identification breakdown of the blocked DYM suggestions.

—Table 8—

Some observations:


 * The average percentage blocked by language identified is 11%, though it is as high as 67% for Finnish (though that was only 4/6 queries).
 * English has only 22.5% blocked by DYM suggestions, but that was by far the bulk of all the suggestion-blocked queries with 1669 out of 1945. This makes sense, as English text is very common on non-English Wikipedias, and DYM suggestions are more likely to be able to find something that gives a few results.

Block Less, Show More
Since a fair number of cross-language results get blocked by DYM suggestions, I decided to look and see what would happen if we were a little looser with our blocking criteria. My thought is that since a suggestion is not exactly what the user was searching for, why not be a little more lenient about giving them a different kind of help? The current "fewer than three" cut off was chosen more or less arbitrarily, after all.

Table 9 below shows how many additional cross-language queries would show results if we changed the threshold from < 3 to < 4, then to < 5, then to < 6 (i.e., 5 or fewer results at the end).

Since the absolute numbers are fairly low, I reviewed the additional results manually for accuracy of language detection; those ratings are also included in Table 9.

For each threshold, the first column (< #) shows the number of queries that would have had some cross-language results shown. The second column (good/ok) shows the number of queries I rated as "good" or "okay" in terms of the language detection. The third column (%Δ) shows the percentage increase of available cross-language results that would be shown. The next to last row (avg) shows the averages of the rows above, and the final row (%corr) shows the percentage of new language IDs that are good (i.e., SUM("good/ok")/SUM("< #")).
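The two derived figures described above can be written out as small Python helpers. The counts in the example are hypothetical placeholders and are not taken from Table 9. The denominator used for %Δ (all queries with cross-language results available on that wiki) is also an assumption of this sketch.

```python
def pct_delta(newly_shown: int, available: int) -> float:
    """Newly shown queries as a percentage of queries with cross-language results."""
    return 100.0 * newly_shown / available


def pct_correct(rows) -> float:
    """%corr = SUM(good/ok) / SUM(< #), as a percentage.

    rows: iterable of (newly_shown, good_or_ok) pairs, one per wiki.
    """
    rows = list(rows)
    total_new = sum(new for new, _ in rows)
    total_good = sum(good for _, good in rows)
    return 100.0 * total_good / total_new if total_new else 0.0


# Hypothetical counts for three wikis at one threshold (illustration only).
example_rows = [(5, 5), (4, 3), (0, 0)]
print(f"%corr = {pct_correct(example_rows):.1f}")

# 510 is the number of dewiki queries with cross-language results cited earlier;
# the 5 newly shown queries are a made-up value.
print(f"%delta = {pct_delta(5, 510):.2f}")
```

Summing before dividing, rather than averaging per-wiki percentages, keeps wikis with very few new results from dominating the quality figure.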

—Table 9—

Observations:


 * Upping the limit from < 3 to < 4 gives a small increase of 0.9% on average, so from about 63% to about 64% of results would be shown on average, but almost all of the new results are good.
 * Going further to < 5 adds another small increment of mostly good results, while < 6 might be a step too far as quality drops below 80% and the increase in results shown is very small (0.3% on average).
 * Note the difference between Japanese (fewest new results shown—with none above < 4) and Dutch (most new results shown—~1% at each step). Not sure why, but it is interesting to observe the variability.

Summary of TextCat/Lang ID Stats
Though looking at this data and generating these stats wasn't the original intent of this task, it was too good of an opportunity not to take.

Some thoughts:


 * I'm happy with the performance of the "too short" category (Table 2). There was a high proportion of junk there, but we aren't filtering too much. Japanese is losing more queries to that filter (but still <2%) because of its writing system—it's easier to pack more meaning into two whole syllables rather than just two alphabetic letters.
 * The "too ambiguous" category (also Table 2) looks okay. There's a pretty good correlation between lower ambiguity and higher home-language identification, which I hypothesize is attributable to lower language-diversity in the queries.
 * The proportion of queries available for cross-language results is pretty consistently near the median of 22.5% (i.e., ±5%) except for English (12.5%) and Japanese (14.9%), so there is a reasonable opportunity to give users good "second-chance" results.
 * It's good to see that most enabled languages are being used (Table 3 and Table 6), even if in low volume.
 * It does make sense to enable some "less ambiguous" languages everywhere, just because the potential for harm is very low, and there is some small upside possibility.
 * The 0/1/2 breakdowns (Table 4 and Table 5) and the correlation between the number of results for the original query and the likelihood of having cross-language results do at least hint that the number of results for the original query is some kind of indicator of search quality.
 * Table 7, Table 8, and Table 9 definitely point at loosening the criteria for allowing DYM suggestion results to block cross-language results being a valid approach.

Language Identification Errors
Ahh, the main event!

Here we are looking at potential "bad" language identification. I put "bad" in quotes because I'm only evaluating whether the language identification is correct, not whether the results shown were "good", which can be really hard to tell. For example, the sha part of sha512 is an acronym so any language identification is arguably wrong (though sh is very distinctively English among world languages); nonetheless, there's a perfectly fine English Wikipedia article on "Secure Hash Algorithms", so maybe the user got what they wanted. Similarly, if, say, Spanish is incorrectly identified as English on Russian Wikipedia, a decent result might be okay, because Russians are about 5x more likely to speak English than Spanish, according to Wikipedia.

Rather than try to sort out questions of user satisfaction, my goal is to get a sense of how often language identification goes wrong, with a moderately loose definition of "wrong", and look for any patterns that suggest reasonable methods of prevention or mitigation.

Table 10 shows the raw numbers for my review of language identification, and we'll be digging into some of them a little more deeply following the table. Note that I only reviewed queries that were actually shown to a user to keep the level of effort tractable. The percent is the percentage out of all queries shown to the user that fell into the "bad" language ID bucket (cf. Table 7).

—Table 10—

Thinking back to the Language Identification Primer/Refresher above, note that the boost given to English increases the chance that random Latin text will be identified as English. Overall, this is a net win for identification because, a priori, English is the most likely language for a "foreign" query to be in, by at least a 5-to-1 margin—up to 30-to-1 or more!—over the next most frequently seen language using the Latin alphabet. (See the TextCat optimization write ups for the various wikis for more details.)

So, considering only "bad" identification with results shown to users, we have:


 * German Wikipedia (dewiki) had 3 non-language Latin-alphabet strings identified as English, 1 clearly incorrect identification of Spanish as English, 1 of Italian as English, and 1 website name identified as Vietnamese.


 * English Wikipedia (enwiki) also had 1 website name identified as Vietnamese.


 * Spanish Wikipedia (eswiki) had 12 non-language Latin-alphabet strings identified as English, and 1 instance of romanized Japanese (rōmaji) identified as English.‖

‖ I don't even know what to think about transliterated/romanized text. In this case, it's not really Japanese anymore, but it isn't really English—though transliteration is often done through the lens of a particular language. For example, Russian Щедрин can be transliterated as Shchedrin (English), Ščedrin (Czech), Schtschedrin (German), Chtchedrine (French), Szczedrin (Polish), Sxedrín (Catalan), Sjtjedrin (Danish), Scsedrin (Hungarian), Sjtsjedrin (Dutch), Șcedrin (Romanian), or Štšedrin (Finnish)... but is Shchedrin really 'English' or Chtchedrine really 'French'?


 * French Wikipedia (frwiki) had 8 non-language Latin-alphabet strings identified as English.
 * Italian Wikipedia (itwiki) had 3 non-language Latin-alphabet strings (including one website) identified as English, and 1 Danish, 1 German, & 3 French queries identified as English.
 * Note that French and Danish are not enabled for Italian Wikipedia, so a boosted English ID is understandable.


 * Japanese Wikipedia (jawiki) had 52 non-language Latin-alphabet strings identified as English, 6 mixed-script strings identified as English, and 2 rōmaji strings identified as English.
 * Note that German and English are the only Latin-script languages enabled for Japanese, and with the boost English gets, that makes it pretty much the catch-all for miscellaneous Latin-script queries.
 * For mixed-script queries, Latin seems to win over CJK languages, probably because of the larger character set for CJK languages. One random Latin letter and one random Chinese character is likely to match the distribution of English better than that of Chinese.


 * Dutch Wikipedia (nlwiki) had the most diverse set of problem queries: 6 non-language Latin-alphabet strings were identified as English and 1 as Danish; 1 Spanish query was identified as English, 1 English query as Latin, and 1 English query as Polish; 1 name (sudan plus numbers) was identified as Croatian and 1 name (lego plus numbers) as Polish; and 1 website name was identified as Vietnamese.


 * Portuguese Wikipedia (ptwiki) had 23 non-language Latin-alphabet strings identified as English, 3 website names identified as English, and 1 German, 1 Spanish & 3 Italian queries identified as English.
 * Note that Spanish, Italian, and German are not enabled for Portuguese Wikipedia, so a boosted English ID is understandable.


 * Russian Wikipedia (ruwiki) had 38 non-language Latin-alphabet strings identified as English, and a number of incorrect identifications for other languages as English: 1 Spanish, 1 Swedish, 1 Dutch, 3 Polish, 1 Latin, and 1 Vietnamese, and 14 cases of reasonably certain (8) or likely (6) wrong-keyboard queries identified as English.
 * The merely likely wrong-keyboard queries are short—3 or 4 letters—and most 3- or 4-letter sequences exist on English Wikipedia, so they get results. The longer ones are often broken up by punctuation (as the apostrophe, comma, and semicolon on the American keyboard map to Cyrillic letters on the Russian keyboard) so only 3- or 4-letter sequences need to match on English Wikipedia. The longest wrong-keyboard query that gets results on English Wikipedia is hjccbz (россия, "Russia") because there is a wrong-keyboard redirect from Hjccbz to Russia!
 * Note that Swedish, Dutch, Polish, Latin, and Vietnamese are not enabled for Russian Wikipedia, so a boosted English ID is understandable.
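
The wrong-keyboard effect described for Russian Wikipedia can be illustrated with a small key-for-key mapping. This is only a sketch: it assumes the standard US QWERTY and Russian ЙЦУКЕН layouts, handles lowercase only, and the function name is made up for illustration; it is not the stalled implementation mentioned elsewhere in these notes.

```python
# US QWERTY keys and the Cyrillic letters that sit on the same physical keys
# of the standard Russian (ЙЦУКЕН) layout. Lowercase only, for brevity.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

QWERTY_TO_CYRILLIC = str.maketrans(LATIN_KEYS, CYRILLIC_KEYS)


def unmangle_wrong_keyboard(query):
    """Re-type a Latin-keyboard query as if the Russian layout had been active."""
    return query.lower().translate(QWERTY_TO_CYRILLIC)


if __name__ == "__main__":
    print(unmangle_wrong_keyboard("hjccbz"))  # the example from the text
    print(unmangle_wrong_keyboard("lj;lm"))   # a query "broken up" by a semicolon
```

Note how the semicolon in the second query is really a letter (ж) on the Russian layout, which is why longer wrong-keyboard queries tend to get split into short Latin fragments.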

Some observations:


 * No queries that are primarily punctuation showed up in the samples.


 * The most common "problem" is that non-language Latin-alphabet strings are being identified as English, which makes sense given the 14% boost English gets as the #2 language on the other eight Wikipedias.
 * English Wikipedia performance is really good, probably because it doesn't suffer from the boosted English problem—anything identified as English on English Wikipedia gets no cross-language results.


 * Languages that are not enabled on a given wiki can't be recognized, of course. Some are not enabled because they didn't occur in the original sample (and we didn't enable anything not in the original sample partly to save on CPU cycles, which may not be necessary, given the size of the list English eventually ended up with). Others are not enabled because they caused too many false positives; we wouldn't want to re-enable those unless their bad behavior could be further mitigated.
 * Some incorrect identifications are unavoidable¶—Gmail likes to tag many of my emails as Haitian Creole for no apparent reason. I don't consider a small number of obviously incorrect language identifications to be a real problem because—out of thousands of identifications—it's always going to happen.
 * Names and transliterations (like the rōmaji examples) are another unavoidable source of potential errors, and of course a sufficiently famous name could get results on any large wiki.

"¶ Ages ago I generated the following text which is, at the trigram level, statistically English: Carapes the ditl isch prentele whic che fiene Unincip-ikedfuls Que pland trial laing expror, no the thent acards, wal of of Eng Evis, forigh Worics on ousunt heard In youle not to linet med, mants of sen gic spers of at nam at mands wouremay. Many online translators, including Google and Bing, use n-gram language detection, at least for unknown words, and recognize this as English. This is a hard task we're working on here!"


 * The website xnxx seems to often be tagged as Vietnamese when Vietnamese is enabled, otherwise English. (It recently got an article on English Wikipedia, and has historically been one of the most common elements of poorly-performing or zero-result queries. I think it comes up most often from misplaced queries—i.e., someone meant to search on Google, or go directly to the site—especially when it occurs with other, more specific terms.)


 * The wrong-keyboard problem on Russian Wikipedia is definitely something we could improve; the previous implementation effort stalled, but we'll get back to it.

Punctuation Power: Looking for Trouble
Since the original source of the concern about "weird" cross-language results was driven by all-punctuation queries, I decided to go specifically in search of such queries.

Data
I started with the same criteria as before, but loosened the threshold intended to exclude "heavy" searchers, from 30 to 1000 queries in a day (keeping the limit of one query per searcher—so it might let in more bots, but only one query per bot). I added an additional filter that required the query be made up entirely of certain characters. I expanded the search to include all Unicode punctuation (across many languages), symbols, separators (including white space), and "other numbers" (i.e., Unicode patterns,  ,  , and  ; see Unicode Regular Expressions for more). I originally included all numbers, but there were lots of queries made up mostly of 0-9 and a few spaces or hyphens.
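
For illustration, a filter of this kind could be sketched in Python with the standard `unicodedata` module. The exact patterns used for the sample are not reproduced here; the sketch assumes the Unicode general categories P* (punctuation), S* (symbols), Z* (separators), and No ("other numbers"), and the function name is hypothetical.

```python
import unicodedata


def is_punct_symbol_only(query):
    """True if every character is punctuation, a symbol, a separator,
    or an 'other number' (assumed categories: P*, S*, Z*, No)."""
    if not query:
        return False
    for ch in query:
        cat = unicodedata.category(ch)
        if not (cat[0] in "PSZ" or cat == "No"):
            return False
    return True


if __name__ == "__main__":
    for q in ["...", "^_^", " \ufffc", "\u00bd", "a...", "123"]:
        print(repr(q), is_punct_symbol_only(q))
```

Plain digits (category Nd) are deliberately rejected, which matches the decision above not to include all numbers.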

There are a lot of emoji in these queries. I don't expect them to have much cross-language impact, but I'll run them and see what happens. I'm surprised how many of the long emoji strings get repeated. It doesn't seem like it could be accidental. My two best hypotheses at the moment are a shared link somewhere that issues the query (but why?), and someone with a changing IP address reloading a page (maybe from switching wifi networks/hotspots—I'm not sure if, say, having your phone connected to campus-wide wifi at a university would give you a changing IP address as you moved around campus; probably depends in part on the implementation of the network).

I intentionally removed duplicate queries from the start with the punctuation data, but looked at the duplicates. As mentioned above, emoji duplicates are common (especially for one- or two-character strings). Another commonly repeated query on some wikis is space plus U+FFFC, the "object replacement character"; I don't know what that's about.

The most repeated query was definitely ^_^ which occurred 36 times out of 193 punctuation/symbol queries on Japanese Wikipedia. In general, sequences of various punctuation characters are common, usually just one or two different characters repeated from a few to a couple hundred times (e.g., 248 +'s—again, why?).

Brief Stats Overview
Table 11 below summarizes the duplicates (total vs unique) and the number of too short/too ambiguous/remaining queries as categorized by the language identification, and the number shown to users (i.e., because there were actually some cross-wiki results after identification).

Compared to Table 2, we can see that there are many more punctuation/symbol poorly-performing queries that are too short compared to general poorly-performing queries (>37% vs <1%) and similarly for ones that are too ambiguous (~54% vs ~24%). The home-language numbers are much lower (2% vs 54%, with most of the 2% average coming from the Japanese outlier of 11.6%). The "remaining" numbers are considerably lower (~7% vs ~21%).

The number actually shown to users is very small—none in most cases, and only one each for Spanish Wikipedia ( with cross-wiki results from Chinese Wikipedia), Japanese Wikipedia (  with results from English Wikipedia), and Russian Wikipedia (  with results from Chinese Wikipedia).

—Table 11—

Table 12 gives a truncated breakdown of the languages detected by TextCat on the punctuation corpus. Chinese is far and away the most common result (123 out of 184 total), with English, Japanese, and Thai trailing far behind.

—Table 12—

Digging Into the Models
I expected languages with writing systems with many more characters in them (Chinese, Japanese, Korean) to rank punctuation more highly within their models. In English, for example, the majority of sentences end in a period, but most also have multiple instances of the most frequent letters in them, and often have multiple occurrences of "the" (in the, their, there, them, they, then, these, etc., etc.), so you wouldn't expect period to be in the top 20 n-grams. In Korean, there are more than 11,000 syllable characters, so while some are very common, most of them don't occur in most sentences, so a period unigram could be ranked very highly in the Korean model.

But Chinese and Japanese usually use fullwidth punctuation characters (like . and ，), so the "plain" period and comma would be much less common (but still used in non-CJK texts, or in names, like Mr. Smith).

And of course the boosted languages (i.e., the home language, plus Chinese on English Wikipedia and English on all the others) are more likely to be detected for punctuation/symbol-only queries.
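
To make the effect of a boost concrete, here is a toy sketch of rank-based ("out-of-place") scoring in the style of classic n-gram language identifiers. Everything in it is illustrative: the ranks are invented, the function names are hypothetical, and applying the boost as a multiplicative discount on the score is an assumption about the mechanism, not a description of the production code. Only the 14% figure comes from these notes.

```python
def out_of_place_score(query_ngrams, model_ranks, max_penalty):
    """Sum of rank differences between the query profile and a language model.
    N-grams missing from the model cost max_penalty. Lower is better."""
    score = 0
    for query_rank, gram in enumerate(query_ngrams, start=1):
        model_rank = model_ranks.get(gram)
        if model_rank is None:
            score += max_penalty
        else:
            score += abs(model_rank - query_rank)
    return score


def identify(query_ngrams, models, boosted=(), boost=0.14, max_penalty=10000):
    """Return (best_language, all_scores); boosted languages get a discount."""
    scores = {}
    for lang, ranks in models.items():
        score = out_of_place_score(query_ngrams, ranks, max_penalty)
        if lang in boosted:
            score *= 1 - boost  # assumed form of the boost
        scores[lang] = score
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Invented ranks: "." sits slightly higher in the toy "zh" model.
    toy_models = {"zh": {".": 20}, "en": {".": 22}}
    print(identify(["."], toy_models))                  # no boost
    print(identify(["."], toy_models, boosted={"en"}))  # English boosted
```

With so little signal in a punctuation-only query, a small difference in where a single n-gram ranks, or a modest boost, is enough to flip the result.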

That explains some of the unevenness of the distribution in identification—home language identification, English identification, more Chinese identification for English Wikipedia—but not the over-representation of Chinese almost everywhere and the dominance of Chinese on English Wikipedia.

I looked specifically at period, dash, and comma, plus their fullwidth counterparts, in Chinese, English, Japanese, and Thai—based on Table 12—plus Korean, French, German, and Russian for further comparison.

I extracted the rank of all n-grams made up entirely of those characters. The results are presented in Table 13.

Note that rank 1 would be the most common n-gram (which is a space for every model, except wrongly-encoded Russian). The top n-grams for English are " ", e, a, t, n, i, o, r, s, h, l, d, "e ", c, and u... which is pretty close to what you'd expect based on "etaoin shrdlu".

Also note that an empty cell in the table indicates that the n-gram in question was not among the 10,000 most frequent n-grams for the model.
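
The rank lookup described above can be sketched as follows. This is a simplified stand-in, not the actual model-building code: it counts raw character n-grams of length 1 to 5 with no tokenization or padding, breaks frequency ties alphabetically, and the names are hypothetical. The 10,000-entry cutoff mirrors the model size mentioned above.

```python
from collections import Counter

TARGET_CHARS = set(".-,")  # add the fullwidth counterparts as needed


def build_ranked_model(text, max_n=5, size=10000):
    """Map each of the `size` most frequent n-grams to its rank (1 = most common)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return {gram: rank for rank, (gram, _) in enumerate(ranked, start=1)}


def punctuation_ranks(model, chars=TARGET_CHARS):
    """Keep only n-grams made up entirely of the target characters."""
    return {gram: rank for gram, rank in model.items() if set(gram) <= chars}


if __name__ == "__main__":
    model = build_ranked_model("aaaa..")
    print(punctuation_ranks(model))
```

An n-gram absent from the returned dictionary corresponds to an empty cell in the table.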

It is very surprising that sequences of 1 to 5 periods or dashes are in the top ~50 n-grams for the Chinese query-based model!

Looking at the text columns (based on Wikipedia text), we see a more expected pattern: the most-used period (. or . ) is in the top 10 for CJK scripts, and the top 50 for alphabetic scripts. Dash is in the top 250 across the board (again, for the Wiki-text models).

For query text—except for Chinese!—the punctuation ranks much lower, since people tend to use less punctuation in queries, which tend to use less formal language and are often not full sentences.

—Table 13—

Digging Into the Training Data
I was able to dig up a copy of the data used to generate the query models. This data was gathered from raw logs, back in the day, when we didn't have the ability to filter likely bots and power users, limit ourselves to one query per user, or even specify a sample size (instead we got all the queries in a particular time box).

I did lightly sanitize the data as best I could at the time (e.g., queries with only characters from other scripts were excluded), but I didn't think to exclude excess punctuation. Looking now I see 60 queries that consist of one Chinese character, followed by 52 periods, then another with 53, etc, up to 111 periods. Other, less obviously ridiculous queries still contain many periods.

I ran a quick count of period-only n-grams across the samples for the languages above. I counted non-overlapping instances of 1, 2, 3, 4, or 5 periods. So, a sequence of 12 periods would count as 12 single periods, 6 double periods, 4 triples, 3 quadruples, and 2 quintuples (with 2 left over, uncounted).

Since the data samples were time-boxed, they are of wildly varying sizes (2.4M to 283M characters), so I normalized the counts against the number of characters in the sample. (Arguably, the percentages for n periods in a row should be divided by n, since there are 1/n as many n-grams in the full sample, but it just makes the small numbers smaller, and the point is to compare the relative values between samples, so a constant multiplier doesn't really matter.)
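
The counting scheme above is easy to get subtly wrong, so here is a minimal sketch of it. The function names are made up for illustration; the logic follows the description: each run of periods contributes the whole number of non-overlapping n-period chunks that fit in it, and counts are then divided by the total number of characters in the sample.

```python
import re

PERIOD_RUN = re.compile(r"\.+")


def count_period_ngrams(text, max_n=5):
    """Count non-overlapping runs of 1..max_n periods within each run of periods."""
    counts = {n: 0 for n in range(1, max_n + 1)}
    for run in PERIOD_RUN.finditer(text):
        length = len(run.group())
        for n in counts:
            counts[n] += length // n  # leftovers shorter than n are not counted
    return counts


def normalize(counts, text):
    """Express counts as a fraction of all characters in the sample."""
    total = len(text)
    return {n: c / total for n, c in counts.items()} if total else {}


if __name__ == "__main__":
    print(count_period_ngrams("." * 12))
```

For a run of 12 periods this gives 12, 6, 4, 3, and 2 for lengths 1 through 5, matching the worked example in the text.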

—Table 14—

So, single periods are more common in English, French, and German than in Chinese. Thai and Russian are comparable to Chinese.

When we get to double periods, they are more than 10x more common in Chinese than French, and more than 25x for English and German!

For the triple periods, Chinese has 20-25x more than English (the numbers shown here are too small to have enough significant digits to make the exact proportions obvious).

Quadruple and quintuple periods are 50x and 70x more common in Chinese than Thai, which is the only other contender for so many periods in a row.

Dashing (and Other Symbolic) Tales
I didn't do as detailed an analysis, but dashes (-) have a similar distribution to periods in the Chinese data.

I did a quick scan for other punctuation in the query-based models (i.e., any n-gram that was two or more punctuation characters, as defined by ), and no others rank in the top 1000 n-grams.

What to Do About It?
Arguably, the excess of periods is in fact more characteristic of Chinese queries than of others. Maybe there is even a distinction between Chinese/Mandarin (zh) and Cantonese (zh_yue) to be made based on period use—though intuitively I wouldn't want to rely on it!

However, it makes sense to retrain the Chinese query-based model after re-filtering the training data to somehow minimize the impact of punctuation—such as dropping queries with too much punctuation, trimming query initial or query-final repeated punctuation, or collapsing more than n instances of punctuation down to n (e.g., for n = 3).
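
Two of these clean-up options can be sketched as below. This is only one possible shape for such a filter: treating Unicode categories P* and S* as "punctuation", the default n = 3, and the 50% ratio threshold are all assumptions made for illustration, and the function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_punct(ch):
    """Assumed definition: Unicode punctuation (P*) or symbol (S*)."""
    return unicodedata.category(ch)[0] in "PS"


def collapse_repeats(query, n=3):
    """Collapse runs of more than n identical punctuation characters down to n."""
    out = []
    for ch, group in groupby(query):
        run = len(list(group))
        out.append(ch * (min(run, n) if is_punct(ch) else run))
    return "".join(out)


def too_much_punctuation(query, max_ratio=0.5):
    """Flag queries where punctuation makes up more than max_ratio of the characters."""
    if not query:
        return True
    return sum(is_punct(ch) for ch in query) / len(query) > max_ratio


if __name__ == "__main__":
    bad = "\u4e2d" + "." * 52  # one Chinese character followed by 52 periods
    print(too_much_punctuation(bad), collapse_repeats(bad))
```

Collapsing keeps an ellipsis-like "..." in the training data while removing the long runs that dominate the n-gram counts; dropping is the blunter alternative.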

Summary

 * Current language identification performance looks good; not too many queries are being filtered as too short or too ambiguous, and almost all of the languages enabled get used in a 5K sample.
 * Language identification errors that have results shown to users are very rare on English Wikipedia (which is by far the largest by volume), but more common elsewhere. The most common "bad" result is a non-language string of Latin characters being identified as English (because it's boosted) and then results are found (because English Wikipedia has so many non-word things in it). This isn't terrible, just not always very helpful.


 * Poorly-performing all-punctuation/symbol queries are uncommon, and most don't get cross-language results, though we dug up a small number of examples that do.
 * The Chinese query-based model is nonetheless ridiculously skewed in favor of multiple periods and multiple dashes, and should be fixed.


 * It makes sense to enable some "less ambiguous" languages everywhere, just because the potential for harm is very low, and there is some small upside possibility.
 * Loosening the criteria for allowing DYM suggestion results to block cross-language results seems like a good idea.


 * Our query sampling process is now much improved, though working that out did cause a delay.

Possible Next Steps
Below are some possible next steps to improve the overall functioning of language identification on the various Wikipedias, along with priority ratings (high/medium/low).


 * (High) Retrain the Chinese query-based model after re-filtering the training data to somehow minimize the impact of punctuation. (See T219911)


 * (Medium-High) Working on wrong-keyboard detection for Russian would be a good thing. We'll get around to it. (See T138958)


 * (Medium) Consider loosening the restriction on showing cross-language results based on the number of results a DYM suggestion gets for a zero-result query, from < 3 to < 5. (See T219912)
 * Consider looking into whether we should set the range for all cross-language results in a more data-driven way.
 * Depending on the implementation, it may be easier to change both to < 5 than having one as < 3 and the other as < 5.


 * (Medium) Consider enabling more of the unambiguous (or less ambiguous) scripts across all nine wikis. Bengali (bn), Greek (el), Hebrew (he), Armenian (hy), Georgian (ka), Korean (ko), Burmese (my), Telugu (te), and Thai (th) are generally unambiguous (Hebrew script could, for example, actually be Yiddish or Judeo-Spanish, but it is unlikely). (See T219915)
 * Arabic (ar), Hindi (hi), Russian (ru) do not have particularly unambiguous scripts, but those three are the overwhelming best guesses for the rare queries that show up in their respective scripts (if no other languages using those scripts are in the sample—which in this case they are not). Japanese (ja) is an unusual case because hiragana and katakana are unambiguously Japanese, while the kanji are borrowed Chinese characters. Chinese might fall into one of these categories, but it is already enabled for all nine wikis, as is Arabic. Russian is enabled everywhere but jawiki. Other relatively unambiguous language/script pairs for which we have models include Tamil (query-based), and Gujarati & Oriya (wiki-text–based).

[Note: I'm not creating tickets for the low-priority tasks yet.]


 * (Low) Consider taking a very large sample from the nine wikis with language identification enabled and looking for identifiable languages that occur infrequently (in a relatively low-effort way, perhaps by running language identification with all/most/many models enabled and looking for very clear examples of particular languages), and then enabling a larger selection of languages for some wikis.
 * This could introduce more errors, but they could be minimized by testing against the original training sets used to optimize the current parameters and enabled-language sets, to make sure that no large-scale errors were introduced (e.g., identifying many English queries as Scots). Other rarer errors could be accepted as a sacrifice on the altar of recall.
 * Alternatively, we could look to enable the largest set of languages that doesn't "cause problems"—by regression testing on the original TextCat training data and/or by running it against a larger sample like we did for this one, and looking at per-language hit rate (e.g., if we considered enabling Danish but no Danish results would be shown to users out of a sample of 20K (or 50K) poorly-performing queries, then maybe we don't really need Danish on that wiki.)


 * (Low) Add a punctuation/symbol-only filter to restrict what is eligible for language identification. However, given the low rate of occurrence of actual results being shown for punctuation-only queries, and the likely impact of retraining the Chinese query-based model, this won't have a big impact.


 * (Low) Consider enabling English cross-language searching as a fallback everywhere. This is more for second-chance searching in general than for language identification. Lots of queries unexpectedly get results from English Wikipedia, so why not leverage that? Might lead to more false positive results on English Wikipedia, though.