User:TJones (WMF)/Notes/Language Detection Evaluation

September 2015 — See TJones_(WMF)/Notes for other projects.

Introduction
A non-trivial number of unsuccessful queries (i.e., zero results returned) to enwiki are in another language. Searching for foreign words on enwiki is not necessarily a bad search strategy, especially for proper nouns, since there are many redirects from the source language names to their English equivalents, such as Росси́я to Russia, ภาษาไทย to Thai language, or Magyar Állami Operaház to Hungarian State Opera House.

One way to improve results for unsuccessful foreign-language searches on enwiki is to detect the language in question and issue a search on the corresponding wiki. Language detection in this context is particularly challenging for a number of reasons, which include:
 * query length: language detection works best on longer texts; detectors have been trained on tweets from Twitter, but search queries are often much shorter than 140 characters.
 * non-language queries: we see all sorts of non-language queries, including gibberish (hhhkhghhhhhhhhhbbbb), acronyms, DOIs, emoji, names (what language is Nokia or Maria in?), numbers, URLs, etc.
 * mixed language queries: some queries are largely in one language but include a few words in another language, but others mix names and languages such that there's no right answer. For example, the query The Keytones Götz Alsmann Perfidia appears to be looking for a performance by the band The Keytones (which is English, even if "keytone" isn't an English word) featuring Götz Alsmann (a German gentleman with a Germanic name that has stereotypical German orthography) of a cover of the song Perfidia (Spanish for "perfidy"). There are techniques for segmenting multi-lingual texts, but they are beyond the scope of the current investigation.
 * transliteration: a number of queries appear to be searching for transliterations of words in another language, or perhaps users are using the wrong multilingual keyboard layout. A couple of examples:
 * zeimpekika, which online machine translation services can map back to Greek ζεϊμπέκικα. A query on Greek Wikipedia gives a result for Ζεϊμπέκικο, which links to an enwiki article for Zeibekiko, a kind of dance. I'm not sure if there's a distinction between zeimpekika and zeibekiko music/dance, and I don't know what language to consider zeimpekika to be in.
 * Bollywood and Tamil movie titles and Japanese video game titles can be transliterated into the Latin alphabet, and can appear in queries with other English words (video game, film, book, box art, etc.). It isn't clear what language these queries are in.
 * typos and incomplete queries: add to all of these the presence of typos, in any language or character set, as well as queries that end mid-word.

Data sources
Building on the previous work in which I hand categorized 1047 zero-results queries from 2015-07-29, I created a new corpus of zero-result queries categorized by language.

I dropped from the previous corpus the "title_1" AND "title 2" queries, since those are no longer being submitted to enwiki, and re-categorized the remaining queries primarily by language. (Some foreign language queries were previously categorized as "music" or "movies" since they were titles of such. For the purposes of language detection, I recategorized queries like "O Menino Quadradinho" from "movies" to "Portuguese" because even though it's a movie title, it is made up of Portuguese words, in the same way that "The Matrix" is a movie title, but also an English phrase.)

I added a further 529 queries sampled randomly from the day of 2015-08-24, to increase my sample to 1452 queries.

Manual language identification and categorization
Beyond the languages I'm relatively familiar with, there were many in my sample that I was not familiar with. Some can be reliably identified by character set (Tamil and Kannada). For others I used Wiktionary, web search engines (good for names!), and online machine translation services. Translation is a more demanding task than language identification, and can be verified in part by the quality of the translation; a bad translation doesn't necessarily indicate an incorrect identification, but a good translation is reasonably indicative of proper language identification. When translation and dictionary methods failed, I used a web search engine to find larger snippets of text that feature the words in our queries, and then used these larger texts for better language identification via the online machine translation services.

Generally these methods were successful, though there is still one term, in Cyrillic, that does not appear anywhere on the web (other than here, now!): Алматйфй. It is vaguely similar to the name of the Kazakh city, Алма-Ата (also Алматы), but it is not clearly a match (so, typing йфй instead of ата is not an obvious mistake to make on a Cyrillic keyboard, for example). This was the only non-Latin query that I couldn't categorize. Max says the last consonants are nonsense. So, it isn't clear if it is even likely to be Russian, Kazakh, or something else, other than it's obviously Cyrillic.

In addition to the various languages, categories in my sample included DOIs, abbreviations, transliterations, emoji, gibberish, names, searches for resumes on LinkedIn and related sites, numbers (including a few letters, for example in an ISBN number), terms that appear to be OCR errors, species names, chemical symbols, URLs, online usernames (which have different features than typical real world names), and 69 queries I just couldn't figure out.

Query corpus characteristics
Before looking at language detection, it's interesting to review the characteristics of this corpus. All of the corpus info is available here.


 * 1452 queries that got zero results on enwiki.
 * Only 775 (53.4%) are tagged as being in some language; the rest are non-language (including names).

The most commonly identified (manually, by me) languages among the queries are below:


 * %lang is the percentage of queries in *some* language that are in this language
 * %total is the percentage of all queries (i.e., %lang * .534) that are in this language
 * Other languages identified, in order of volume, are Turkish, Indonesian, Persian, Swahili, Korean, Bengali, Bulgarian, Hindi, Italian, Norwegian, Croatian, Dutch, Estonian, Finnish, Greek, Hmong, Japanese, Kannada, Latin, Polish, Serbian, Somali, Swedish, Tamil, Thai, Uzbek.

Token Counts
David asked about the number of tokens per query. Below is a breakdown of the number of tokens in queries that are considered "language", up to ten tokens. The token count breakdown for all queries, non-language queries, and language queries beyond 10 tokens is available in the full corpus info report, linked above.

Loading indexes in labs
If we're going to load additional wiki indexes in labs, these seem like the ones to choose.

Stupid "language" "detection"
Based on these numbers we could conceivably get decent results (certainly better than what we get now) from doing simple character set identification: redirecting Chinese, Arabic, and Cyrillic to the Chinese, Arabic, and Russian wikis (and similarly for Japanese, Korean, Hindi, Tamil, Kannada and other writing systems), and redirecting Latin alphabet queries to one or more of Spanish, Portuguese, and French. This could possibly be combined with some simple additional dumb "distinctive" character recognition, such as sending queries with umlauts or ß to the German wiki, dot-less i's (ı) to Turkish, Persian-specific characters to Persian, etc. This would have systematic errors (many Persian queries would go to Arabic) but would be simple and cheap to implement.

A more sophisticated but still dumb version could look at distribution of letters in the query and match based on relative entropy. A more inclusive option would be to search multiple wikis for a given character set (e.g., Arabic and Persian for anything in the Arabic character set, since there are Persian words with no Persian-specific characters in them; similarly for Cyrillic and Russian, Bulgarian, etc.)
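
The character-set idea above could be sketched roughly as follows. This is only an illustration, not anything that was implemented for this evaluation: the function name `guess_wiki`, the two lookup tables, and the choice of which characters count as "distinctive" are all assumptions made for the example.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# (a rough proxy for the writing system) to a wiki language code.
SCRIPT_TO_WIKI = {
    "CJK": "zh", "ARABIC": "ar", "CYRILLIC": "ru", "HANGUL": "ko",
    "HIRAGANA": "ja", "KATAKANA": "ja", "DEVANAGARI": "hi",
    "TAMIL": "ta", "KANNADA": "kn", "THAI": "th", "GREEK": "el",
}

# Hypothetical "distinctive" characters, checked before the script lookup.
DISTINCTIVE_CHARS = {
    "ß": "de", "ä": "de", "ö": "de", "ü": "de",   # umlauts and ß -> German
    "ı": "tr",                                    # dotless i -> Turkish
    "پ": "fa", "چ": "fa", "ژ": "fa", "گ": "fa",   # letters Persian adds to Arabic
}

def guess_wiki(query):
    """Return a wiki code guessed from characters alone, or None."""
    text = unicodedata.normalize("NFC", query).lower()
    for ch in text:
        if ch in DISTINCTIVE_CHARS:
            return DISTINCTIVE_CHARS[ch]
    # Count letters by the first word of their Unicode name.
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in text if ch.isalpha()
    )
    if not scripts:
        return None          # digits, punctuation, emoji, ...
    dominant_script, _ = scripts.most_common(1)[0]
    # Latin (and anything unmapped) falls through to None.
    return SCRIPT_TO_WIKI.get(dominant_script)

print(guess_wiki("Россия"), guess_wiki("Götz Alsmann"), guess_wiki("Nokia"))
```

As expected for something this dumb, it has the systematic errors described above: Persian words without Persian-specific letters map to Arabic, and Japanese queries written only in kanji map to Chinese.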

Caveats

 * Long Chinese queries (such as the .xyz queries) failed with "Regular expression is too complex" errors, regardless of the wiki searched!

Language detection evaluation method
The queries, including language and non-language queries, are run through the language detector, and various statistics are gathered on the results, including:

 * percentage of correct identifications available at different thresholds
 * total languages reported by threshold
 * language identification hits (true positives) and misses (false positives), by threshold
 * recall, precision, F1, F2, and F0.5 measures, by threshold, overall and by language
 * most frequent incorrect identification for each language, by threshold
 * most frequent identification for non-language categories, by threshold

I believe F0.5, which favors precision over recall, is the best measure for this task, but stats are available for F1 and F2 as well.
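
For reference, here is a minimal sketch of how the F-measures relate to hit and miss counts, using the standard F-beta definition. The function name `f_beta` is made up for the example; the counts in the usage lines are the English figures when only the top candidate language is considered (264 hits, 3 misses, 599 English queries).

```python
def f_beta(hits, misses, total, beta):
    """Standard F-beta from hits (true positives), misses (false
    positives), and total (number of actual instances of the language)."""
    precision = hits / (hits + misses) if hits + misses else 0.0
    recall = hits / total if total else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # beta < 1 weights precision more heavily; beta > 1 weights recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)

for beta in (0.5, 1, 2):
    print(f"F{beta}: {100 * f_beta(264, 3, 599, beta):.1f}%")
```

With those counts this gives 79.2%, 61.0%, and 49.6% for F0.5, F1, and F2.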

Invocation and return
The ElasticSearch language detection plugin (based on Cybozu) was set up by Erik and is available to you in your local MediaWiki Vagrant instance (if you are up-to-date) from the command line thusly: curl -XPOST localhost:9200/_langdetect -d" " Be sure to escape quotes and leading at-signs in your query. There are possibly other characters that need escaping, but those were the ones I encountered.

Results are JSON snippets. Here are the results for the misspelling "soburbia" (instead of "suburbia"):

{"profile":"/langdetect/short-text/","languages":[{"language":"ro","probability":0.7142841533904064},{"language":"id","probability":0.14285819281811468},{"language":"hr","probability":0.14285622069158416}]} (Yes, that's Romanian, Indonesian, and Croatian—language identification is hard.)

Probabilities sum to no more than 1.0 (and usually less, presumably there's some internal 0.0000014 probability that the query is actually Martian—or it's a rounding error).

The maximum number of candidate languages offered in this sample was four, and the minimum was zero. (Kannada, emoji, and numbers, for example, sometimes were not labeled as being any language.)

Chinese language codes
The ElasticSearch language detection plugin returns two values for Chinese, zh-cn and zh-tw, for simplified and traditional Chinese character sets. I consider both of these to be just zh (Chinese) and used the higher probability, since I did not try to distinguish them myself. Fortunately, the Chinese Wikipedia doesn't make this distinction either.
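
A rough sketch of reading such a response and folding the two Chinese codes together might look like this. It assumes only the JSON shape shown in the "soburbia" example; the function name `parse_langdetect` is hypothetical, and the second sample response is invented purely to illustrate the merge.

```python
import json

def parse_langdetect(response_text):
    """Return [(language, probability), ...] sorted by descending
    probability, with zh-cn and zh-tw merged into zh (keeping the
    higher of the two probabilities)."""
    data = json.loads(response_text)
    merged = {}
    for entry in data.get("languages", []):
        code = entry["language"]
        if code in ("zh-cn", "zh-tw"):
            code = "zh"
        merged[code] = max(entry["probability"], merged.get(code, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# The "soburbia" response quoted above.
soburbia = ('{"profile":"/langdetect/short-text/","languages":['
            '{"language":"ro","probability":0.7142841533904064},'
            '{"language":"id","probability":0.14285819281811468},'
            '{"language":"hr","probability":0.14285622069158416}]}')
print(parse_langdetect(soburbia))

# Invented response, only to show the zh-cn / zh-tw merge.
made_up = ('{"languages":[{"language":"zh-tw","probability":0.57},'
           '{"language":"zh-cn","probability":0.43}]}')
print(parse_langdetect(made_up))
```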

"Thresholds"
There are two obvious methods of thresholding the results from the ElasticSearch language detection plugin: number of languages returned, and probability assigned. In the first case, we may want to take the most probable language even if it isn't very likely, just to try something (perhaps emphasizing recall if lots of primary results are low probability). In the second case, we only take those that have a sufficient probability (emphasizing precision).

I've allowed multiple identification values, for example, using the first two or three suggested languages, or all languages that score above a given probability level. This increases recall at the expense of precision, since when multiple options are considered, at least one of them must be incorrect.

It is possible to mix these two threshold criteria (e.g., minimum score of 0.25, but only the best one), but I didn't do that for this first test. Note that for thresholds above 0.5, only one language can score high enough, so it does effectively limit results to one language.

For limits based on number of candidates considered, it is theoretically possible to get non-deterministic results, depending on how identically scoring languages are ordered by the ElasticSearch language detection plugin. However, there were no identical scores (though differences were as little as 10e-6).
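
The two kinds of threshold can be sketched as below. The function names are hypothetical, candidates are assumed to be (language, probability) pairs already sorted by descending probability, and treating the probability cutoff as inclusive is an assumption of this sketch.

```python
def select_top_n(candidates, n):
    """Threshold by count: keep at most the n most probable languages."""
    return [lang for lang, _ in candidates[:n]]

def select_by_probability(candidates, min_probability):
    """Threshold by probability: keep every language scoring at or
    above the cutoff (inclusive comparison is an assumption)."""
    return [lang for lang, prob in candidates if prob >= min_probability]

# Candidates for the "soburbia" example, rounded.
candidates = [("ro", 0.714), ("id", 0.143), ("hr", 0.143)]

print(select_top_n(candidates, 1))               # most probable only
print(select_top_n(candidates, 4))               # "take everything"
print(select_by_probability(candidates, 0.50))   # at most one can pass
print(select_by_probability(candidates, 0.95))   # nothing passes here
```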

Results
These results do not include non-language queries. Following the principle of "garbage in, garbage out", I didn't think it right to count errors on non-language queries against the language detector (but see notes on performance of non-language categories below).

Thresholds labeled 1-4 indicate the maximum number of languages to consider. Thresholds labeled 0.00-0.95 are probability-based. Note that 4 and 0.00 should be the same, since they both mean "take everything!".

Recall, precision, F-scores, etc.
Below are the results tables for overall ("TOTAL") performance, and for English, Spanish, Chinese, Portuguese, Arabic and French (the languages with at least 10 manually identified instances in our queries).

In general, overall results are mixed. English has very high precision but weak recall, Spanish is better than average all around, and Chinese and Arabic are very good, though sample sizes are low. Portuguese is okay, and French is terrible, though again sample sizes are low.

Considering a third or fourth language isn't worth it, even if you want to emphasize recall. There are no probability scores in certain ranges, 0.90-0.95, 0.6-0.7, 0.3-0.4, or below 0.1, which is explained by the fact that most if not all of the scores are very close (<0.01) to some multiple of 1/7.
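
A quick check of the multiples-of-1/7 observation: the sketch below lists which multiples of 1/7 fall within 0.01 of each of the empty score ranges. The names `SEVENTHS`, `EMPTY_RANGES`, and `sevenths_near` are made up for the example.

```python
# Multiples of 1/7: about 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0
SEVENTHS = [k / 7 for k in range(1, 8)]

# The score ranges in which no probabilities were observed.
EMPTY_RANGES = [(0.90, 0.95), (0.60, 0.70), (0.30, 0.40), (0.00, 0.10)]

def sevenths_near(low, high, tolerance=0.01):
    """Multiples of 1/7 within `tolerance` of the range [low, high]."""
    return [m for m in SEVENTHS if low - tolerance <= m <= high + tolerance]

for low, high in EMPTY_RANGES:
    print(f"{low:.2f}-{high:.2f}: {sevenths_near(low, high)}")
```

Every range comes back empty, which is also why adjacent threshold rows in the probability tables (0.95 and 0.90, 0.70 and 0.60, 0.40 and 0.30, 0.10 and 0.00) are identical.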

Emphasizing precision, the best approach seems to be a probability threshold of 0.95 (which, in practice, seems to be the same as 0.99), though this still gets less than 60% precision.

By number of languages
The results here are limited to languages with 10+ examples. For other languages, see the full report.

thresh  f0.5    f1      f2      recall  prec    total  hits  misses

TOTAL (775)
1       49.3%   49.3%   49.3%   49.3%   49.3%   775    382   393
2       45.4%   49.2%   53.6%   57.0%   43.2%   775    442   581
3       44.2%   48.5%   53.7%   57.8%   41.8%   775    448   624
4       44.1%   48.4%   53.6%   57.8%   41.6%   775    448   629

English (599)
1       79.2%   61.0%   49.6%   44.1%   98.9%   599    264   3
2       84.4%   69.0%   58.4%   52.9%   99.1%   599    317   3
3       84.8%   69.7%   59.2%   53.8%   99.1%   599    322   3
4       84.8%   69.7%   59.2%   53.8%   99.1%   599    322   3

Spanish (43)
1       55.1%   59.2%   63.9%   67.4%   52.7%   43     29    26
2       47.8%   55.2%   65.3%   74.4%   43.8%   43     32    41
3       45.1%   52.9%   64.0%   74.4%   41.0%   43     32    46
4       45.1%   52.9%   64.0%   74.4%   41.0%   43     32    46

Chinese (20)
1       100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
2       100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
3       100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
4       100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0

Portuguese (19)
1       47.2%   52.2%   58.3%   63.2%   44.4%   19     12    15
2       46.8%   53.1%   61.3%   68.4%   43.3%   19     13    17
3       49.0%   56.0%   65.4%   73.7%   45.2%   19     14    17
4       49.0%   56.0%   65.4%   73.7%   45.2%   19     14    17

Arabic (10)
1       87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
2       87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
3       87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
4       87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1

French (10)
1       6.6%    9.4%    16.0%   30.0%   5.6%    10     3     51
2       7.8%    11.4%   21.2%   50.0%   6.4%    10     5     73
3       7.4%    10.9%   20.5%   50.0%   6.1%    10     5     77
4       7.4%    10.9%   20.5%   50.0%   6.1%    10     5     77

By probability
The results here are limited to languages with 10+ examples. For other languages, see the full report.

thresh  f0.5    f1      f2      recall  prec    total  hits  misses

TOTAL (775)
0.95    52.7%   46.6%   41.8%   39.1%   57.7%   775    303   222
0.90    52.7%   46.6%   41.8%   39.1%   57.7%   775    303   222
0.80    52.3%   48.9%   45.9%   44.1%   54.9%   775    342   281
0.70    50.5%   48.9%   47.3%   46.3%   51.7%   775    359   335
0.60    50.5%   48.9%   47.3%   46.3%   51.7%   775    359   335
0.50    49.6%   49.2%   48.8%   48.5%   49.9%   775    376   378
0.40    48.6%   49.5%   50.5%   51.1%   48.1%   775    396   428
0.30    48.6%   49.5%   50.5%   51.1%   48.1%   775    396   428
0.20    47.5%   50.0%   52.7%   54.7%   46.0%   775    424   497
0.10    44.1%   48.4%   53.6%   57.8%   41.6%   775    448   629
0.00    44.1%   48.4%   53.6%   57.8%   41.6%   775    448   629

English (599)
0.95    70.9%   49.8%   38.3%   33.2%   99.0%   599    199   2
0.90    70.9%   49.8%   38.3%   33.2%   99.0%   599    199   2
0.80    75.4%   55.6%   44.1%   38.7%   98.7%   599    232   3
0.70    77.2%   58.2%   46.7%   41.2%   98.8%   599    247   3
0.60    77.2%   58.2%   46.7%   41.2%   98.8%   599    247   3
0.50    78.7%   60.3%   48.9%   43.4%   98.9%   599    260   3
0.40    80.5%   62.9%   51.6%   46.1%   98.9%   599    276   3
0.30    80.5%   62.9%   51.6%   46.1%   98.9%   599    276   3
0.20    82.7%   66.4%   55.4%   49.9%   99.0%   599    299   3
0.10    84.8%   69.7%   59.2%   53.8%   99.1%   599    322   3
0.00    84.8%   69.7%   59.2%   53.8%   99.1%   599    322   3

Spanish (43)
0.95    68.1%   65.0%   62.2%   60.5%   70.3%   43     26    11
0.90    68.1%   65.0%   62.2%   60.5%   70.3%   43     26    11
0.80    65.2%   64.3%   63.4%   62.8%   65.9%   43     27    14
0.70    56.5%   58.7%   61.1%   62.8%   55.1%   43     27    22
0.60    56.5%   58.7%   61.1%   62.8%   55.1%   43     27    22
0.50    55.8%   58.9%   62.5%   65.1%   53.8%   43     28    24
0.40    53.8%   58.8%   64.9%   69.8%   50.8%   43     30    29
0.30    53.8%   58.8%   64.9%   69.8%   50.8%   43     30    29
0.20    52.8%   59.3%   67.5%   74.4%   49.2%   43     32    33
0.10    45.1%   52.9%   64.0%   74.4%   41.0%   43     32    46
0.00    45.1%   52.9%   64.0%   74.4%   41.0%   43     32    46

Chinese (20)
0.95    95.2%   88.9%   83.3%   80.0%   100.0%  20     16    0
0.90    95.2%   88.9%   83.3%   80.0%   100.0%  20     16    0
0.80    96.6%   91.9%   87.6%   85.0%   100.0%  20     17    0
0.70    96.6%   91.9%   87.6%   85.0%   100.0%  20     17    0
0.60    96.6%   91.9%   87.6%   85.0%   100.0%  20     17    0
0.50    99.0%   97.4%   96.0%   95.0%   100.0%  20     19    0
0.40    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
0.30    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
0.20    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
0.10    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
0.00    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0

Portuguese (19)
0.95    46.0%   44.4%   43.0%   42.1%   47.1%   19     8     9
0.90    46.0%   44.4%   43.0%   42.1%   47.1%   19     8     9
0.80    51.4%   53.7%   56.1%   57.9%   50.0%   19     11    11
0.70    46.2%   50.0%   54.5%   57.9%   44.0%   19     11    14
0.60    46.2%   50.0%   54.5%   57.9%   44.0%   19     11    14
0.50    48.8%   53.3%   58.8%   63.2%   46.2%   19     12    14
0.40    47.2%   52.2%   58.3%   63.2%   44.4%   19     12    15
0.30    47.2%   52.2%   58.3%   63.2%   44.4%   19     12    15
0.20    50.4%   57.1%   66.0%   73.7%   46.7%   19     14    16
0.10    49.0%   56.0%   65.4%   73.7%   45.2%   19     14    17
0.00    49.0%   56.0%   65.4%   73.7%   45.2%   19     14    17

Arabic (10)
0.95    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.90    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.80    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.70    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.60    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.50    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.40    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.30    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.20    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.10    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.00    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1

French (10)
0.95    7.0%    9.3%    13.7%   20.0%   6.1%    10     2     31
0.90    7.0%    9.3%    13.7%   20.0%   6.1%    10     2     31
0.80    7.9%    10.9%   17.6%   30.0%   6.7%    10     3     42
0.70    7.4%    10.3%   17.0%   30.0%   6.2%    10     3     45
0.60    7.4%    10.3%   17.0%   30.0%   6.2%    10     3     45
0.50    6.6%    9.4%    16.0%   30.0%   5.6%    10     3     51
0.40    6.4%    9.1%    15.6%   30.0%   5.4%    10     3     53
0.30    6.4%    9.1%    15.6%   30.0%   5.4%    10     3     53
0.20    7.3%    10.5%   18.9%   40.0%   6.1%    10     4     62
0.10    7.4%    10.9%   20.5%   50.0%   6.1%    10     5     77
0.00    7.4%    10.9%   20.5%   50.0%   6.1%    10     5     77

Pretty pictures
The plot below shows recall vs. precision at the 0.95 probability threshold for the languages with 10+ examples in our sample.

Error bars are calculated in both dimensions using the Wilson Score Interval.
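
For readers unfamiliar with it, here is a minimal sketch of the Wilson score interval for a proportion such as recall (hits out of actual instances) or precision (hits out of hits plus misses). The function name is hypothetical, and z = 1.96 (roughly a 95% interval) is an assumption; the confidence level used for the plot is not stated here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z=1.96 (about 95% confidence) is an assumed default."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    z2 = z * z
    denominator = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denominator
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# French recall at the 0.95 threshold: 2 hits out of 10 French queries.
low, high = wilson_interval(2, 10)
print(f"recall 20.0%, interval {low:.1%} to {high:.1%}")
```

The interval is wide and asymmetric for small samples like this one, which is the main reason to prefer it over a naive plus-or-minus error bar here.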

The curves plotted are lines of constant F0.5 score, included because F0.5 values are not intuitive. The blue curve is F0.5 = 0.5; the other curves increase by 0.1 up and to the right, and decrease by 0.1 in the other direction.



False positives
Recall and precision numbers are 0 (or perhaps NaN) for all languages that have no examples in our data. However, that does not capture the rate of false positives we get, especially Romanian (the ElasticSearch language detection plugin seems to really, really like Romanian, even though we don't have any Romanian examples).

The table below shows the number of actual items for each language, and the number identified (with 0.95 probability, maximizing precision) for each language. Note that not all identifications are correct, but it's clear that we're missing almost 400 English identifications, and we have 41 Romanian identifications that are incorrect.

For more details on performance at other probability thresholds (or by language) see the full reports linked to above.

Actual              0.95
599  English        201  English
43   Spanish        41   Romanian
20   Chinese        37   Spanish
19   Portuguese     33   French
10   Arabic         31   Italian
10   French         30   Tagalog
9    Tagalog        21   German
8    German         17   Portuguese
6    Malay          16   Chinese
5    Russian        12   Indonesian
5    Turkish        10   Dutch
4    Indonesian     8    Arabic
4    Persian        7    Albanian
4    Swahili        7    Norwegian
3    Korean         7    Estonian
2    Bengali        6    Turkish
2    Bulgarian      5    Danish
2    Hindi          4    Persian
2    Italian        3    Russian
2    Norwegian      3    Polish
1    Croatian       3    Lithuanian
1    Dutch          3    Korean
1    Estonian       3    Finnish
1    Finnish        2    Macedonian
1    Greek          2    Bengali
1    Hmong          2    Swedish
1    Japanese       2    Hindi
1    Kannada        1    Greek
1    Latin          1    Japanese
1    Polish         1    Czech
1    Serbian        1    Croatian
1    Somali         1    Tamil
1    Swedish        1    Thai
1    Tamil          1    Bulgarian
1    Thai           1    Ukrainian
1    Uzbek          1    Hungarian

Pretty pictures
The image below plots reported vs actual instances of each language at the 0.95 probability threshold. The graph axes are swapped from the more natural arrangement so that more language names are readable. The blue line indicates parity between actual and reported. Languages above the line are under-reported, those below and to the right are over-reported. Spanish (slightly under-reported) and English (massively under-reported) are not shown to keep the scale of the graph reasonable.



Most frequent incorrect identification by language
The most often incorrectly reported identifications by language (at the 0.95 probability threshold) are:


 * English (599): Romanian (39), French (30), Italian (25), Tagalog (18), German (13), and Dutch (10), with many others.
 * Spanish (43): German (2), Romanian (2), English (1), Italian (1), Lithuanian (1), Portuguese (1), Tagalog (1)
 * Portuguese (19): Spanish (3), Italian (1)
 * Arabic (10): Persian (1)
 * French (10): English (1), Estonian (1), German (1), Italian (1), Tagalog (1)
 * Tagalog (9): Italian (1)
 * Malay (6): Indonesian (5)
 * Russian (5): Macedonian (1), Ukrainian (1)
 * Turkish (5): Estonian (1), Tagalog (1)

For more details on other languages and on performance at other probability thresholds (or by language) see the full reports linked to above.

Most frequent identifications for non-languages
These aren't exactly wrong, but they would represent additional searches that might not be effective (though some of them might be, since a name that identifies as German may be someone who is in the German wiki, but not the English wiki).

The most often reported identifications by category (at the 0.95 probability threshold) are:


 * Name (361): English (34), Italian (25), German (23), French (20), Tagalog (20), Romanian (17), Indonesian (13), Albanian (7), and many others.
 * ?? (69): Indonesian (8), Tagalog (7), English (6), Albanian (2), Dutch (2), Italian (2), Polish (2), Spanish (2), and others.
 * URL (67): English (23), Italian (5), French (3), Portuguese (3), Tagalog (3), Chinese (2), Dutch (2), Polish (2), and others.
 * Junk (46): Albanian (4), Dutch (3), French (3), English (2), Indonesian (2), Polish (2), Portuguese (2), and others.
 * DOI (33): French (8), Croatian (2), English (2), German (1), Polish (1)
 * User (16): Spanish (2), Croatian (1), English (1), German (1), Italian (1), Lithuanian (1), Romanian (1), Tagalog (1), Turkish (1)
 * Species (13): Italian (3), Romanian (2), English (1), Portuguese (1), Tagalog (1)
 * Number (12): Chinese (2), Portuguese (1)

For more details on other categories and on performance at other probability thresholds (or by language) see the full reports linked to above.

Alternative approaches
In addition to the simple thresholds of number of candidate languages or probability, we could try to improve performance at the expense of complexity (and effort).

One option would be to determine by-language thresholds. So, English, Chinese and Arabic might have thresholds of 0.0 (all of them!), while Romanian gets a threshold of 1.1 (none of them!). With more data for some of the less well-represented languages, which is most of them, we should be able to boost both recall and precision. We'd also have to determine a method for dealing with cases where, for example, Romanian scores 0.714... and English scores 0.285..., of which there are plenty. The simplest approach might be to take the first acceptable candidate (i.e., skip over Romanian, then take English). This approach is essentially doing very simple machine learning on the output of the ElasticSearch language detection plugin, but could improve performance.
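
The "first acceptable candidate" idea could be sketched as follows. The threshold values for English, Chinese, Arabic, and Romanian are the illustrative ones from the paragraph above; the default threshold for all other languages and the function name are assumptions made for the example.

```python
# Per-language minimum probabilities. 0.0 accepts everything;
# anything above 1.0 can never be met, so that language is never chosen.
LANGUAGE_THRESHOLDS = {"en": 0.0, "zh": 0.0, "ar": 0.0, "ro": 1.1}

# Hypothetical fallback for languages without their own threshold.
DEFAULT_THRESHOLD = 0.95

def first_acceptable(candidates, thresholds=LANGUAGE_THRESHOLDS,
                     default=DEFAULT_THRESHOLD):
    """Walk (language, probability) candidates in descending order of
    probability and return the first one that meets its own threshold."""
    for lang, prob in candidates:
        if prob >= thresholds.get(lang, default):
            return lang
    return None

# Romanian scores highest but is skipped; English is taken instead.
print(first_acceptable([("ro", 0.714), ("en", 0.285)]))
# Neither candidate clears its threshold, so nothing is returned.
print(first_acceptable([("ro", 0.857), ("it", 0.143)]))
```

In practice the thresholds would be fit per language against labeled data, for example by picking the value that maximizes F0.5 for that language.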

More complex approaches are possible, but it isn't clear they are worth the effort at this point.

ElasticSearch language detection plugin, with spaces
David noticed that the language detector plugin seems to use spaces as word boundaries, and gets better results on at least some queries with an extra space at the beginning and end of the query. This was easy enough to test, so I re-ran the numbers.

The short version is that it definitely works. There's a general improvement to recall, precision, and F-score of a few percentage points. Thresholding by probability is still better than thresholding by number of languages (and so only a summary for the former is presented below). There were fewer over-eager assignments of Romanian, Indonesian, Tagalog and French, but still plenty of each.

The full report by language is here.

The full report by probability threshold is here.

Recall and precision by probability threshold
thresh  f0.5    f1      f2      recall  prec    total  hits  misses

TOTAL (775)
0.95    54.4%   47.4%   41.9%   39.0%   60.4%   775    302   198
0.90    54.4%   47.4%   41.9%   39.0%   60.4%   775    302   198
0.80    53.7%   49.7%   46.3%   44.3%   56.7%   775    343   262
0.70    53.0%   51.0%   49.2%   48.0%   54.5%   775    372   311
0.60    53.0%   51.0%   49.2%   48.0%   54.5%   775    372   311
0.50    52.0%   51.5%   51.1%   50.8%   52.3%   775    394   360
0.40    50.7%   51.7%   52.7%   53.4%   50.1%   775    414   412
0.30    50.7%   51.7%   52.7%   53.4%   50.1%   775    414   412
0.20    48.6%   51.4%   54.5%   56.8%   47.0%   775    440   497
0.10    45.0%   49.8%   55.8%   60.6%   42.3%   775    470   642
0.00    45.0%   49.8%   55.8%   60.6%   42.3%   775    470   642

English (599)
0.95    71.8%   50.9%   39.4%   34.2%   99.0%   599    205   2
0.90    71.8%   50.9%   39.4%   34.2%   99.0%   599    205   2
0.80    75.5%   55.8%   44.3%   38.9%   98.7%   599    233   3
0.70    78.5%   60.0%   48.6%   43.1%   98.9%   599    258   3
0.60    78.5%   60.0%   48.6%   43.1%   98.9%   599    258   3
0.50    80.4%   62.7%   51.4%   45.9%   98.9%   599    275   3
0.40    82.1%   65.3%   54.3%   48.7%   99.0%   599    292   3
0.30    82.1%   65.3%   54.3%   48.7%   99.0%   599    292   3
0.20    84.2%   68.7%   58.0%   52.6%   99.1%   599    315   3
0.10    86.4%   72.5%   62.4%   57.1%   99.1%   599    342   3
0.00    86.4%   72.5%   62.4%   57.1%   99.1%   599    342   3

Spanish (43)
0.95    62.8%   61.0%   59.2%   58.1%   64.1%   43     25    14
0.90    62.8%   61.0%   59.2%   58.1%   64.1%   43     25    14
0.80    64.9%   66.7%   68.5%   69.8%   63.8%   43     30    17
0.70    59.8%   63.2%   67.0%   69.8%   57.7%   43     30    22
0.60    59.8%   63.2%   67.0%   69.8%   57.7%   43     30    22
0.50    59.9%   64.6%   70.2%   74.4%   57.1%   43     32    24
0.40    58.3%   64.1%   71.1%   76.7%   55.0%   43     33    27
0.30    58.3%   64.1%   71.1%   76.7%   55.0%   43     33    27
0.20    52.4%   59.5%   68.8%   76.7%   48.5%   43     33    35
0.10    49.0%   57.1%   68.5%   79.1%   44.7%   43     34    42
0.00    49.0%   57.1%   68.5%   79.1%   44.7%   43     34    42

Chinese (20)
0.95    90.3%   78.8%   69.9%   65.0%   100.0%  20     13    0
0.90    90.3%   78.8%   69.9%   65.0%   100.0%  20     13    0
0.80    95.2%   88.9%   83.3%   80.0%   100.0%  20     16    0
0.70    97.8%   94.7%   91.8%   90.0%   100.0%  20     18    0
0.60    97.8%   94.7%   91.8%   90.0%   100.0%  20     18    0
0.50    99.0%   97.4%   96.0%   95.0%   100.0%  20     19    0
0.40    99.0%   97.4%   96.0%   95.0%   100.0%  20     19    0
0.30    99.0%   97.4%   96.0%   95.0%   100.0%  20     19    0
0.20    100.0%  100.0%  100.0%  100.0%  100.0%  20     20    0
0.10    92.6%   95.2%   98.0%   100.0%  90.9%   20     20    2
0.00    92.6%   95.2%   98.0%   100.0%  90.9%   20     20    2

Portuguese (19)
0.95    44.0%   43.2%   42.6%   42.1%   44.4%   19     8     10
0.90    44.0%   43.2%   42.6%   42.1%   44.4%   19     8     10
0.80    45.5%   46.2%   46.9%   47.4%   45.0%   19     9     11
0.70    46.7%   48.8%   51.0%   52.6%   45.5%   19     10    12
0.60    46.7%   48.8%   51.0%   52.6%   45.5%   19     10    12
0.50    52.2%   55.8%   60.0%   63.2%   50.0%   19     12    12
0.40    47.2%   52.2%   58.3%   63.2%   44.4%   19     12    15
0.30    47.2%   52.2%   58.3%   63.2%   44.4%   19     12    15
0.20    46.4%   53.8%   64.2%   73.7%   42.4%   19     14    19
0.10    41.9%   50.0%   61.9%   73.7%   37.8%   19     14    23
0.00    41.9%   50.0%   61.9%   73.7%   37.8%   19     14    23

Arabic (10)
0.95    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.90    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.80    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.70    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.60    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.50    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.40    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.30    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.20    95.2%   88.9%   83.3%   80.0%   100.0%  10     8     0
0.10    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1
0.00    87.0%   84.2%   81.6%   80.0%   88.9%   10     8     1

French (10)
0.95    13.6%   17.1%   23.1%   30.0%   12.0%   10     3     22
0.90    13.6%   17.1%   23.1%   30.0%   12.0%   10     3     22
0.80    10.0%   13.3%   20.0%   30.0%   8.6%    10     3     32
0.70    8.6%    11.8%   18.5%   30.0%   7.3%    10     3     38
0.60    8.6%    11.8%   18.5%   30.0%   7.3%    10     3     38
0.50    7.0%    9.8%    16.5%   30.0%   5.9%    10     3     48
0.40    8.1%    11.6%   20.2%   40.0%   6.8%    10     4     55
0.30    8.1%    11.6%   20.2%   40.0%   6.8%    10     4     55
0.20    7.0%    10.1%   18.3%   40.0%   5.8%    10     4     65
0.10    7.1%    10.5%   20.0%   50.0%   5.9%    10     5     80
0.00    7.1%    10.5%   20.0%   50.0%   5.9%    10     5     80

Always "English" detector
The current behavior of the system is to treat the queries as if they were English, so I thought I'd go ahead and score that for reference.

thresh  f0.5    f1      f2      recall  prec    total  hits  misses

TOTAL (775)
.       77.4%   77.4%   77.4%   77.4%   77.4%   775    600   175

English (599)
.       81.1%   87.3%   94.6%   100.2%  77.4%   599    600   175

This makes it clear that F-score is not necessarily the only relevant measure for search purposes (though doing better than defaulting to English would be good for a language detector), since what we really want to do is send some of these queries off to other wikis to hopefully get some results.

ElasticSearch language detection plugin, with thresholds by language (and spaces)
It's pretty clear from the performance of the ElasticSearch language detection plugin that we need to tell it, "Stop trying to make Romanian happen. It's not going to happen." That is, if we just ignored every returned value of Romanian, performance would be better. To a lesser degree, we could tell the plugin to emphasize some languages (English and Chinese) and downplay others (Romanian, French, Italian, Indonesian, German—see the Reported vs Actual Instances of a Language graph above).

We can do this by setting a minimum threshold for each language, and ignoring any instances that score below that threshold. For English, we set the minimum at 0, for example, because it's right so much more often than it's wrong. For Romanian, we set the threshold at 1.01 (i.e., higher than possible) to make sure it never gets assigned.
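A minimal sketch of that filter in Python, assuming the detector hands back a single (language, score) pair per query. The names `MIN_SCORE` and `accept`, and the choice to ignore any language missing from the table, are illustrative assumptions and not the plugin's API.

```python
# Hypothetical per-language minimum scores. A value of 1.01 is higher than
# any possible score, so that language is never assigned.
MIN_SCORE = {"en": 0.00, "ro": 1.01}


def accept(lang, score, min_score=MIN_SCORE, default=1.01):
    """Return lang if its score clears that language's minimum, else None.

    Languages without an entry fall back to `default` (assumed here to be
    "always ignore").
    """
    return lang if score >= min_score.get(lang, default) else None


print(accept("en", 0.12))  # English is accepted at any score
print(accept("ro", 0.99))  # Romanian is rejected even at high confidence
```

A result of None would mean "no language detected", which on enwiki is the same as the current behavior.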

Because adding leading and trailing spaces slightly increased the performance of the ElasticSearch language detection plugin, we are using that data.

I think precision is more important than recall, so I optimized based on F0.5 for each language. If recall is more important, you could optimize on F2 (which I do, below, just to see what happens).

As a reminder, here's the overall performance table for the ElasticSearch language detection plugin, with spaces:

thresh    f0.5      f1      f2  recall    prec  total  hits  misses
TOTAL (775)
0.95     54.4%   47.4%   41.9%   39.0%   60.4%    775   302     198
0.90     54.4%   47.4%   41.9%   39.0%   60.4%    775   302     198
0.80     53.7%   49.7%   46.3%   44.3%   56.7%    775   343     262
0.70     53.0%   51.0%   49.2%   48.0%   54.5%    775   372     311
0.60     53.0%   51.0%   49.2%   48.0%   54.5%    775   372     311
0.50     52.0%   51.5%   51.1%   50.8%   52.3%    775   394     360
0.40     50.7%   51.7%   52.7%   53.4%   50.1%    775   414     412
0.30     50.7%   51.7%   52.7%   53.4%   50.1%    775   414     412
0.20     48.6%   51.4%   54.5%   56.8%   47.0%    775   440     497
0.10     45.0%   49.8%   55.8%   60.6%   42.3%    775   470     642
0.00     45.0%   49.8%   55.8%   60.6%   42.3%    775   470     642

For reference, the maximum F0.5 (54.4%) is at thresholds 0.95 and 0.90, and the maximum F2 (55.8%) is at thresholds 0.10 and 0.00.

CAVEAT
This is actually a terrible way to do this!

All of these thresholds except English are based on too little data (as few as 2 data points!), and we're going to evaluate performance on our training set (major no-no!), but this is the best we can do with the limited data that we have, without investing in annotating a lot more data. So take everything with a grain of salt because it's going to be overly optimistic, but still indicative of the trend of the result of such changes.

Optimizing for F0.5
Using the full report for queries with spaces, I've chosen the threshold for each language that maximizes its F0.5 score.

Since the overall best F0.5, F1, and F2 scores were all in the general vicinity of 50%, for any language whose best F0.5 score was below 50%, I set the threshold to 1.01 (i.e., always ignore). I originally tested a cutoff of 30%, but 50% gave better results. This does mean, for example, that we'll never detect German or French, which are not super rare, but not super common, either.
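The selection rule can be sketched as follows, using the Portuguese, Arabic, and French counts from the per-language tables earlier on this page. The helper names are hypothetical, precision and recall are assumed to be hits / (hits + misses) and hits / total, and ties are broken toward the higher threshold. The tie-breaking rule is an assumption, though it agrees with the 0.95 chosen for Arabic.

```python
def f_beta(hits, misses, total, beta):
    """F-beta from count columns (assumed: P = hits/(hits+misses), R = hits/total)."""
    assigned = hits + misses
    precision = hits / assigned if assigned else 0.0
    recall = hits / total if total else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(rows, total, beta=0.5, cutoff=0.5, never=1.01):
    """Pick the threshold with the best F-beta for one language.

    rows maps threshold -> (hits, misses). If even the best score is below
    `cutoff`, return `never` so the language is always ignored. Ties go to
    the higher threshold (an assumption, not stated in the text).
    """
    best_score, best_thresh = max(
        (f_beta(hits, misses, total, beta), thresh)
        for thresh, (hits, misses) in rows.items()
    )
    return best_thresh if best_score >= cutoff else never


# (hits, misses) per threshold; duplicate rows from the tables are omitted.
portuguese = {0.95: (8, 10), 0.80: (9, 11), 0.70: (10, 12), 0.50: (12, 12),
              0.40: (12, 15), 0.20: (14, 19), 0.10: (14, 23)}
arabic = {0.95: (8, 0), 0.20: (8, 0), 0.10: (8, 1)}
french = {0.95: (3, 22), 0.80: (3, 32), 0.50: (3, 48), 0.10: (5, 80)}

for name, rows, total in [("pt", portuguese, 19), ("ar", arabic, 10), ("fr", french, 10)]:
    print(name, pick_threshold(rows, total))
```

Passing `beta=2.0` instead gives the recall-leaning variant of the same procedure.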

Note that to be used in production, these thresholds would have to be optimized on a per-wiki basis (in addition to being optimized on a larger sample). One assumes that French is common on frwiki, German on dewiki, and Romanian on rowiki.

The thresholds are below:

0.00  en    0.95  bn    0.95  th    1.01  id    1.01  ro
0.20  zh    0.95  cs    0.95  tr    1.01  it    1.01  sq
0.50  pt    0.95  el    1.01  da    1.01  lt    1.01  sv
0.80  es    0.95  fi    1.01  de    1.01  lv    1.01  tl
0.80  fa    0.95  hi    1.01  et    1.01  mk    1.01  uk
0.80  ko    0.95  ja    1.01  fr    1.01  nl    1.01  vi
0.95  ar    0.95  ru    1.01  hr    1.01  no
0.95  bg    0.95  ta    1.01  hu    1.01  pl

Results for F0.5
The overall results are shown below.

thresh    f0.5      f1      f2  recall    prec  total  hits  misses
TOTAL (775)
0.95     69.5%   51.6%   41.1%   36.1%   90.3%    775   280      30
0.90     69.5%   51.6%   41.1%   36.1%   90.3%    775   280      30
0.80     72.8%   56.5%   46.2%   41.2%   90.1%    775   319      35
0.70     75.2%   59.9%   49.8%   44.8%   90.6%    775   347      36
0.60     75.2%   59.9%   49.8%   44.8%   90.6%    775   347      36
0.50     76.9%   62.3%   52.4%   47.4%   91.1%    775   367      36
0.40     78.2%   64.3%   54.5%   49.5%   91.4%    775   384      36
0.30     78.2%   64.3%   54.5%   49.5%   91.4%    775   384      36
0.20     80.0%   66.9%   57.6%   52.6%   91.9%    775   408      36
0.10     81.8%   69.8%   60.9%   56.1%   92.4%    775   435      36
0.00     81.8%   69.8%   60.9%   56.1%   92.4%    775   435      36

Compared with the best F0.5 row before tuning, F0.5 is more than 25 points higher, recall increased by 17 points (because we can be more aggressive about accepting language identification), and precision increased by 32 points (because we're more accurate)!

The optimal threshold is now 0.00 because everything "bad" that happens at lower scores has been disallowed. (In the more general case, especially if we didn't set a threshold for lower frequency languages, the optimal threshold might not be 0.00.)

As noted above, these are overly optimistic improvements because the threshold selection is overfitted to the data, but clearly there is something to be gained by such tuning.

The full report is here. Highlights: The number of languages assigned to non-language text—names, URLs, DOIs, etc.—has gone down dramatically, too. Also note that a non-language query string being tagged as "English" on enwiki is a sort of freebie, in that even if it is wrong, the result is the same as if it had been assigned no language: we wouldn't do anything.

Optimizing for F2
Just to see what happens if we turn the dial in the other direction, I optimized the language thresholds for F2 score and re-ran the experiment. I used the same 50% minimum cutoff as above. The final thresholds are below. The thresholds that differ from the F0.5 thresholds are pt, es, de, id, pl, and tl.

0.00  en    0.95  ar    0.95  pl    1.01  fr    1.01  no
0.20  pt    0.95  bg    0.95  ru    1.01  hr    1.01  ro
0.20  zh    0.95  bn    0.95  ta    1.01  hu    1.01  sq
0.40  es    0.95  cs    0.95  th    1.01  it    1.01  sv
0.80  de    0.95  el    0.95  tl    1.01  lt    1.01  uk
0.80  fa    0.95  fi    0.95  tr    1.01  lv    1.01  vi
0.80  id    0.95  hi    1.01  da    1.01  mk
0.80  ko    0.95  ja    1.01  et    1.01  nl

Results for F2
The overall results are shown below.

thresh    f0.5      f1      f2  recall    prec  total  hits  misses
TOTAL (775)
0.95     65.7%   51.7%   42.7%   38.2%   80.2%    775   296      73
0.90     65.7%   51.7%   42.7%   38.2%   80.2%    775   296      73
0.80     68.6%   56.4%   47.9%   43.5%   80.2%    775   337      83
0.70     70.4%   59.4%   51.4%   47.1%   80.4%    775   365      89
0.60     70.4%   59.4%   51.4%   47.1%   80.4%    775   365      89
0.50     72.0%   61.8%   54.1%   49.9%   81.0%    775   387      91
0.40     72.8%   63.4%   56.2%   52.3%   80.7%    775   405      97
0.30     72.8%   63.4%   56.2%   52.3%   80.7%    775   405      97
0.20     74.2%   66.0%   59.3%   55.6%   81.0%    775   431     101
0.10     76.1%   68.7%   62.6%   59.1%   81.9%    775   458     101
0.00     76.1%   68.7%   62.6%   59.1%   81.9%    775   458     101

Compared with the best F2 row before tuning, F2 score is about 7 points higher, recall is slightly lower (by 1.5 points), but precision is almost 40 points higher!

The gains are not quite as dramatic when optimizing for F2 (recall), but that's to be expected when the tool we are using can only remove false positives.

The full report is here. Highlights: as above, fewer false positives all around!

Conclusion
Even though all of these results are overly optimistic because of overfitting the small data set, it's clear that reasonable (and potentially extraordinary) gains can be made by customizing the thresholds for accepting language identification by language—and per data set (i.e., per wiki).

This technique should be applicable to other language identification algorithms, too, as a general method of reining in overeager language identifiers.

ElasticSearch Plugin—Limiting Languages & Retraining
David dug into the Cybozu code and config, and figured out how to limit the languages available, and how to retrain models.

Based on the TextCat results, David suggested limiting the original language models to en, es, zh-cn, zh-tw, pt, ar, ru, fa, ko, bn, bg, hi, el, ta, and th.

For reference, the baseline results from the ES Plugin (with spaces) are below. (The evaluation set is the manually tagged enwiki sample.)

                f0.5  recall    prec  total  hits  misses
TOTAL          54.4%   39.0%   60.4%    775   302     198

Limiting languages improved performance significantly (full report here):

                f0.5  recall    prec  total  hits  misses
TOTAL          75.6%   64.5%   79.0%    775   500     133
English        88.0%   69.4%   94.3%    599   416      25
Spanish        38.7%   79.1%   34.3%     43    34      65
Chinese        83.3%   70.0%   87.5%     20    14       2
Portuguese     26.1%   57.9%   22.9%     19    11      37
Arabic         95.2%   80.0%  100.0%     10     8       0
Russian        88.2%   60.0%  100.0%      5     3       0
Persian        75.0%   75.0%   75.0%      4     3       1
Korean         90.9%   66.7%  100.0%      3     2       0
Bengali       100.0%  100.0%  100.0%      2     2       0
Bulgarian      45.5%  100.0%   40.0%      2     2       3
Hindi         100.0%  100.0%  100.0%      2     2       0
Greek         100.0%  100.0%  100.0%      1     1       0
Tamil         100.0%  100.0%  100.0%      1     1       0
Thai          100.0%  100.0%  100.0%      1     1       0

David also retrained the models using the (admittedly messy) query data I used for training the TextCat models, and the results improved again! (Full report here.)

                f0.5  recall    prec  total  hits  misses
TOTAL          81.8%   75.4%   83.5%    775   584     115
English        89.4%   83.5%   91.1%    599   500      49
Spanish        48.8%   65.1%   45.9%     43    28      33
Chinese        89.3%   75.0%   93.8%     20    15       1
Portuguese     36.6%   73.7%   32.6%     19    14      29
Arabic         92.1%   70.0%  100.0%     10     7       0
Russian       100.0%  100.0%  100.0%      5     5       0
Persian        71.4%  100.0%   66.7%      4     4       2
Korean        100.0%  100.0%  100.0%      3     3       0
Bengali       100.0%  100.0%  100.0%      2     2       0
Bulgarian      50.0%   50.0%   50.0%      2     1       1
Hindi         100.0%  100.0%  100.0%      2     2       0
Greek         100.0%  100.0%  100.0%      1     1       0
Tamil         100.0%  100.0%  100.0%      1     1       0
Thai          100.0%  100.0%  100.0%      1     1       0

Just to be sure, I re-ran the retrained models on data without spaces, and all metrics are roughly 1.5 percentage points worse.

                f0.5  recall    prec  total  hits  misses
TOTAL          80.4%   74.1%   82.1%    775   574     125

Conclusion
David suggested that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild these models and test them again.

Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!