User:TJones (WMF)/Notes/Balanced Language Identification Evaluation Set for Queries

Building the Corpus
The goal of this task was to create a balanced language identification evaluation set for queries for the top 21 wikis by query volume. It would have been the top 20, but I accidentally grabbed the top 20 after English, so we get 21. The purpose of a hand-selected balanced query set is to be able to test the accuracy of language identification where all languages are competing equally (by volume) and all queries are decent exemplars of the language in question.

The 21 languages are: Arabic, Chinese, Czech, Dutch, English, French, German, Hebrew, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, Ukrainian, and Vietnamese.

I extracted a few days’ worth of full-text queries from all wikis (19,273,806 queries total). For each of the 21 languages, I randomly selected several hundred queries and whittled them down to 200, removing queries that were composed primarily of names of people, places, and products, text in the wrong language, bad misspellings, or numbers or acronyms, or that appeared bot-like (i.e., part of a very large number of very similar queries), and so on. Names made up of normal words were kept; e.g., “The Revenant” and “Bridge of Spies” are names of movies, but they are made up of non-name words. Longer queries were allowed a small bit of text from any of the unacceptable categories.
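The random selection step can be sketched as follows. This is a minimal illustration, not the script actually used: the function name `sample_candidates`, the `(language, query)` input shape, and the default of 500 candidates are all assumptions. The whittling down to 200 queries per language was done by hand and is not automated here.

```python
import random
from collections import defaultdict

def sample_candidates(queries, per_language=500, seed=0):
    """Randomly pick up to `per_language` candidate queries per language.

    `queries` is an iterable of (language, query_text) pairs (a hypothetical
    input shape). The result still needs manual review to remove names,
    wrong-language text, bot-like queries, and so on.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_language = defaultdict(list)
    for language, text in queries:
        by_language[language].append(text)
    return {
        language: rng.sample(texts, min(per_language, len(texts)))
        for language, texts in by_language.items()
    }
```

Sampling well over 200 candidates per language leaves room for the manual filtering to discard a large share of them.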

I did not filter out queries that would obviously be hard for language identification, such as very short, unaccented queries in the Latin script, like Portuguese os (“the”), Swedish ur (“from”), English the, and French rue (“street”). The longest queries are hundreds of characters.

TextCat Evaluation
I tested TextCat against the balanced corpus of 200 queries in each of 21 languages (4,200 queries total) in two ways:
 * against the known list of 21 languages
 * against the full list of 59 languages for which language models have been built on query data.

Note that some of the full set of 59 models are known to be pretty poor (Igbo has way too much English in the training data, for example) and part of the purpose of this set is to let us better evaluate these models.

In each case, I tested language models in increments of 500 ngrams up to 10,000 ngrams. Previous work on a sample derived from enwiki queries showed an optimal model size of 3,000 ngrams (on messy data that was also heavily unbalanced, i.e., mostly English). In this case, surprisingly, the best results came from the maximum 10,000-ngram models! However, the improvement probably isn’t enough to warrant the extra cost in speed and memory of using the 10K models: compared to the 3,000-ngram models, it is no more than 4 percentage points of F0.5 (88.3% vs. 86.5% for the 21 known languages, and 78.5% vs. 74.7% for all 59).
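To make "model size" concrete, here is a minimal sketch of a rank-ordered n-gram classifier in the style of Cavnar and Trenkle, the approach TextCat is based on. It is not the PHP TextCat code: the function names, the underscore padding, the tie-breaking rule, and the toy training text are assumptions for illustration. The model size is the number of top-ranked ngrams kept in each language profile.

```python
from collections import Counter

def ngram_profile(text, max_size, max_n=5):
    """Return {ngram: rank} for the `max_size` most frequent character ngrams."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically to keep ranks deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:max_size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(query_profile, model_profile, max_size):
    """Sum of rank differences; ngrams missing from the model cost `max_size`."""
    return sum(
        abs(rank - model_profile[gram]) if gram in model_profile else max_size
        for gram, rank in query_profile.items()
    )

def identify(query, models, max_size):
    """Return the language whose profile is closest to the query's profile."""
    query_profile = ngram_profile(query, max_size)
    return min(
        models,
        key=lambda lang: out_of_place(query_profile, models[lang], max_size),
    )

# Toy training text for illustration only; real models are built from query data.
training = {
    "en": "the street and the house the bridge of spies",
    "fr": "la rue et la maison le pont des espions",
}
MODEL_SIZE = 3000
models = {lang: ngram_profile(text, MODEL_SIZE) for lang, text in training.items()}
print(identify("the bridge", models, MODEL_SIZE))
```

A larger model keeps rarer ngrams, which can help separate closely related languages but costs more memory and comparison time per query.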

Looking at Model Sizes
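F-scores against the known 21 languages (for TOTAL, F0.5, F1, and F2 are identical; the per-language values match the F1 column of the 3,000-ngram report below):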

ngrams	1000	2000	3000	4000	5000	6000	7000	8000	9000	10000
TOTAL	84.0%	85.6%	86.5%	87.1%	87.2%	87.5%	87.8%	87.9%	88.2%	88.3%
Arabic	92.6%	92.1%	92.5%	92.7%	93.0%	93.9%	93.9%	93.9%	93.7%	93.7%
Chinese	81.5%	85.5%	86.9%	87.7%	87.1%	87.9%	89.0%	89.0%	89.0%	89.4%
Czech	89.9%	91.1%	92.9%	91.9%	91.8%	92.6%	93.2%	93.0%	93.8%	93.8%
Dutch	72.8%	75.6%	78.0%	78.3%	78.2%	79.4%	79.6%	80.2%	80.9%	81.1%
English	77.6%	83.7%	86.6%	87.3%	86.2%	84.9%	85.4%	85.3%	86.2%	86.8%
French	85.7%	88.2%	89.3%	88.9%	89.0%	90.1%	88.7%	88.9%	89.6%	88.8%
German	75.7%	77.6%	80.5%	79.3%	79.5%	80.2%	80.8%	81.7%	82.6%	82.8%
Hebrew	99.3%	100.0%	100.0%	99.8%	99.8%	99.8%	100.0%	100.0%	100.0%	99.8%
Indonesian	80.5%	83.1%	83.4%	83.9%	85.4%	86.1%	86.3%	86.7%	86.7%	86.1%
Italian	74.8%	74.6%	73.3%	74.6%	76.3%	77.3%	78.2%	78.2%	78.6%	78.8%
Japanese	79.9%	83.9%	85.4%	87.2%	86.2%	86.7%	87.9%	87.9%	88.5%	88.8%
Korean	99.2%	99.5%	99.7%	99.7%	99.5%	99.7%	99.7%	99.7%	99.7%	99.7%
Persian	91.9%	91.7%	92.0%	92.0%	92.5%	93.6%	93.5%	93.1%	93.0%	93.0%
Polish	90.0%	91.7%	93.3%	93.6%	93.3%	94.3%	94.1%	94.8%	95.3%	96.0%
Portuguese	73.4%	73.1%	74.6%	76.1%	77.1%	75.8%	78.4%	78.4%	79.7%	79.5%
Russian	85.5%	84.8%	85.0%	85.0%	84.4%	84.4%	83.9%	83.9%	83.9%	84.5%
Spanish	72.5%	74.2%	73.4%	78.1%	78.7%	77.4%	78.1%	77.9%	78.0%	78.4%
Swedish	68.9%	72.3%	76.6%	77.5%	78.0%	78.0%	78.9%	79.1%	80.1%	80.4%
Turkish	89.6%	92.4%	92.1%	93.4%	93.2%	93.6%	93.6%	93.9%	93.6%	93.1%
Ukrainian	82.9%	81.0%	82.1%	82.4%	81.6%	81.3%	80.8%	80.8%	81.2%	81.7%
Vietnamese	97.5%	98.5%	99.0%	99.3%	99.0%	98.8%	98.8%	98.3%	98.0%	97.8%

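F-scores against all 59 available languages (for TOTAL, F0.5, F1, and F2 are identical; the per-language values match the F1 column of the 3,000-ngram report below):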

ngrams	1000	2000	3000	4000	5000	6000	7000	8000	9000	10000
TOTAL	69.8%	73.3%	74.7%	76.0%	76.6%	76.9%	77.4%	77.5%	78.1%	78.5%
Arabic	92.2%	91.5%	91.9%	92.6%	93.6%	93.8%	94.1%	93.9%	93.6%	93.9%
Chinese	52.7%	55.8%	58.0%	61.1%	60.2%	60.6%	62.2%	64.3%	65.9%	67.2%
Czech	83.9%	87.5%	88.6%	87.6%	88.3%	87.7%	87.4%	87.6%	87.9%	87.7%
Dutch	67.3%	71.1%	76.1%	75.7%	76.6%	77.6%	78.5%	79.1%	80.0%	79.6%
English	66.7%	72.3%	73.2%	74.5%	74.2%	74.2%	72.8%	72.2%	72.8%	74.2%
French	84.6%	87.6%	86.7%	87.2%	87.7%	87.3%	87.3%	87.7%	88.0%	87.9%
German	74.5%	76.3%	80.0%	79.7%	81.5%	81.6%	80.9%	82.2%	83.6%	83.9%
Hebrew	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%
Indonesian	42.1%	50.7%	54.1%	58.7%	62.6%	66.0%	68.7%	68.7%	69.7%	71.2%
Italian	71.9%	74.6%	74.4%	75.1%	76.9%	78.4%	79.5%	78.8%	78.5%	78.5%
Japanese	80.4%	82.9%	85.2%	86.2%	85.9%	86.4%	87.6%	87.6%	88.2%	88.2%
Korean	99.5%	99.5%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%
Persian	85.3%	86.6%	87.0%	87.6%	87.6%	88.4%	88.0%	87.4%	86.8%	86.8%
Polish	92.1%	93.1%	93.8%	93.9%	93.8%	94.1%	94.1%	95.4%	95.6%	96.4%
Portuguese	70.2%	70.4%	72.2%	74.0%	74.5%	74.7%	76.6%	77.1%	78.2%	78.2%
Russian	72.8%	77.0%	76.8%	79.6%	77.5%	77.2%	77.8%	78.4%	78.7%	79.1%
Spanish	66.7%	70.4%	72.0%	74.6%	75.5%	75.8%	76.5%	75.7%	78.4%	78.7%
Swedish	55.2%	59.2%	62.0%	62.8%	65.3%	65.9%	65.2%	66.3%	66.9%	68.5%
Turkish	85.8%	88.8%	89.8%	90.7%	91.5%	91.0%	91.3%	91.2%	90.2%	90.5%
Ukrainian	78.9%	79.9%	80.2%	80.8%	79.7%	78.5%	78.5%	77.8%	77.9%	77.9%
Vietnamese	97.5%	99.0%	99.0%	99.2%	98.7%	98.5%	98.5%	98.5%	98.5%	98.5%

Performance is noticeably worse when additional “spoiler” languages are available to be selected: at 3,000 ngrams, for example, the TOTAL score drops from 86.5% to 74.7%.

Looking at Languages with a 3,000-ngram model
Since we are using 3,000-ngram models for our current A/B tests, we’ll evaluate those models by language.

21 Known Languages
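Here is the detailed accuracy report by language when using the set of 21 known languages, with 3,000-ngram models. For each language, the misses column counts false positives (queries in other languages wrongly identified as that language), which is what the precision values reflect; in the TOTAL row it counts all misidentified queries: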

language        f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL           86.5%   86.5%   86.5%   86.5%   86.5%   4200   3635  565
Arabic          90.8%   92.5%   94.3%   95.5%   89.7%   200    191   22
Chinese         83.8%   86.9%   90.2%   92.5%   81.9%   200    185   41
Czech           91.9%   92.9%   93.8%   94.5%   91.3%   200    189   18
Dutch           81.6%   78.0%   74.6%   72.5%   84.3%   200    145   27
English         90.4%   86.6%   83.2%   81.0%   93.1%   200    162   12
French          86.9%   89.3%   91.8%   93.5%   85.4%   200    187   32
German          81.1%   80.5%   79.9%   79.5%   81.5%   200    159   36
Hebrew          100.0%  100.0%  100.0%  100.0%  100.0%  200    200   0
Indonesian      80.9%   83.4%   86.1%   88.0%   79.3%   200    176   46
Italian         72.4%   73.3%   74.3%   75.0%   71.8%   200    150   59
Japanese        91.0%   85.4%   80.5%   77.5%   95.1%   200    155   8
Korean          99.9%   99.7%   99.6%   99.5%   100.0%  200    199   0
Persian         93.6%   92.0%   90.5%   89.5%   94.7%   200    179   10
Polish          92.6%   93.3%   94.0%   94.5%   92.2%   200    189   16
Portuguese      75.8%   74.6%   73.3%   72.5%   76.7%   200    145   44
Russian         81.3%   85.0%   89.1%   92.0%   79.0%   200    184   49
Spanish         70.6%   73.4%   76.4%   78.5%   68.9%   200    157   71
Swedish         79.0%   76.6%   74.4%   73.0%   80.7%   200    146   35
Turkish         91.3%   92.1%   92.9%   93.5%   90.8%   200    187   19
Ukrainian       86.6%   82.1%   78.0%   75.5%   89.9%   200    151   17
Vietnamese      98.7%   99.0%   99.3%   99.5%   98.5%   200    199   3
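The columns of the report are related by the standard precision, recall, and F-beta definitions. The sketch below recomputes the Arabic row from its counts, assuming that the per-language misses column counts false positives (which is what the reported precision values imply). The function names are illustrative and not taken from the evaluation script.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision (F0.5); beta > 1 favors recall (F2).
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0

def report_row(total, hits, false_positives):
    """Recompute one report row from its raw counts."""
    recall = hits / total if total else 0.0
    claimed = hits + false_positives  # everything identified as this language
    precision = hits / claimed if claimed else 0.0
    return {
        "f0.5": f_beta(precision, recall, 0.5),
        "f1": f_beta(precision, recall, 1.0),
        "f2": f_beta(precision, recall, 2.0),
        "recall": recall,
        "prec": precision,
    }

# Arabic, 21 known languages, 3,000-ngram models: total 200, hits 191, misses 22.
arabic = report_row(total=200, hits=191, false_positives=22)
print({name: f"{value:.1%}" for name, value in arabic.items()})
```

In the TOTAL row every query is assigned to exactly one of the languages, so the total number of false positives equals the total number of missed queries. Precision and recall are therefore equal, which is why all five percentages in that row are the same.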

The poorest performers in recall, per the report above, are Dutch (72.5%), Portuguese (72.5%), Swedish (73.0%), Italian (75.0%), Ukrainian (75.5%), Japanese (77.5%), and Spanish (78.5%).

The poorest performers in precision are Spanish (68.9%), Italian (71.8%), Portuguese (76.7%), and Russian (79.0%).

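Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group. Each row gives the actual language of the queries, followed by the languages they were misidentified as, with counts.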

Most common identification errors:

Arabic      Persian (9)
Persian     Arabic (21)

Chinese     Japanese (8)
Japanese    Chinese (41)

Dutch       German (17)
German      Dutch (16)

French      Italian (4)
Italian     Spanish (12)    Indonesian (11) Portuguese (11)
Portuguese  Spanish (37)    Italian (11)
Spanish     Portuguese (21) Italian (11)

Russian     Ukrainian (16)
Ukrainian   Russian (49)

Czech       Polish (4)
English     Dutch (5)       French (5)      German (5)      Spanish (5)
Indonesian  Italian (6)
Korean      Turkish (1)
Polish      Indonesian (3)
Swedish     Indonesian (15)
Turkish     Indonesian (3)  Swedish (3)
Vietnamese  Italian (1)

So, confusion among Arabic/Persian, Chinese/Japanese, Dutch/German, French/Italian/Portuguese/Spanish, and Russian/Ukrainian is not too surprising.

Indonesian seems to be the most obvious outlier here, incorrectly claiming a fair number of Italian and Swedish queries.
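The error lists above can be tallied from per-query results. This is a minimal sketch, assuming the evaluation produces one (actual language, identified language) pair per query. The function name and input shape are hypothetical. It applies the same reporting rule as the lists: every error with a count of at least 10, plus the highest-count error or errors for each language.

```python
from collections import Counter, defaultdict

def common_errors(results, min_count=10):
    """Summarize misidentifications per actual language.

    `results` is an iterable of (actual, identified) pairs. For each actual
    language with at least one error, keep every confusion with count >=
    `min_count`, plus the most frequent confusion(s), ties included.
    """
    confusions = defaultdict(Counter)
    for actual, identified in results:
        if actual != identified:
            confusions[actual][identified] += 1
    report = {}
    for actual, counter in confusions.items():
        ranked = counter.most_common()  # highest count first
        top_count = ranked[0][1]
        report[actual] = [
            (identified, count)
            for identified, count in ranked
            if count >= min_count or count == top_count
        ]
    return report
```

Grouping the resulting rows by language or script family, as in the lists above, is a manual step.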

59 Available Language Models
Keep in mind that some of these are known to be a bit dodgy.

Here is the detailed accuracy report by language when using the full set of 59 languages, with 3,000-ngram models:

language        f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL           74.7%   74.7%   74.7%   74.7%   74.7%   4200   3138  1062
Arabic          91.0%   91.9%   92.9%   93.5%   90.3%   200    187   20
Chinese         67.5%   58.0%   50.9%   47.0%   75.8%   200    94    30
Czech           94.8%   88.6%   83.2%   80.0%   99.4%   200    160   1
Dutch           82.9%   76.1%   70.4%   67.0%   88.2%   200    134   18
English         85.0%   73.2%   64.3%   59.5%   95.2%   200    119   6
French          88.0%   86.7%   85.4%   84.5%   88.9%   200    169   21
German          84.1%   80.0%   76.3%   74.0%   87.1%   200    148   22
Hebrew          100.0%  100.0%  100.0%  100.0%  100.0%  200    200   0
Indonesian      66.1%   54.1%   45.8%   41.5%   77.6%   200    83    24
Italian         80.5%   74.4%   69.1%   66.0%   85.2%   200    132   23
Japanese        91.8%   85.2%   79.4%   76.0%   96.8%   200    152   5
Korean          99.9%   99.7%   99.6%   99.5%   100.0%  200    199   0
Persian         91.7%   87.0%   82.6%   80.0%   95.2%   200    160   8
Polish          95.6%   93.8%   92.1%   91.0%   96.8%   200    182   6
Portuguese      76.9%   72.2%   68.0%   65.5%   80.4%   200    131   32
Russian         80.7%   76.8%   73.2%   71.0%   83.5%   200    142   28
Spanish         74.6%   72.0%   69.5%   68.0%   76.4%   200    136   42
Swedish         71.7%   62.0%   54.5%   50.5%   80.2%   200    101   25
Turkish         92.5%   89.8%   87.2%   85.5%   94.5%   200    171   10
Ukrainian       87.9%   80.2%   73.8%   70.0%   94.0%   200    140   9
Vietnamese      99.0%   99.0%   99.0%   99.0%   99.0%   200    198   2
Albanian        0.0%    0.0%    0.0%    0.0%    0.0%    0      0     12
Azerbaijani     0.0%    0.0%    0.0%    0.0%    0.0%    0      0     11
Basque          0.0%    0.0%    0.0%    0.0%    0.0%    0      0     25
Bosnian         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     16
Bulgarian       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     50
Cantonese       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     107
Catalan         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     39
Croatian        0.0%    0.0%    0.0%    0.0%    0.0%    0      0     7
Danish          0.0%    0.0%    0.0%    0.0%    0.0%    0      0     26
Estonian        0.0%    0.0%    0.0%    0.0%    0.0%    0      0     7
Finnish         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     25
Hungarian       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     7
Igbo            0.0%    0.0%    0.0%    0.0%    0.0%    0      0     37
Kazakh          0.0%    0.0%    0.0%    0.0%    0.0%    0      0     8
Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0      0     65
Latvian         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     10
Lithuanian      0.0%    0.0%    0.0%    0.0%    0.0%    0      0     8
Macedonian      0.0%    0.0%    0.0%    0.0%    0.0%    0      0     21
Malay           0.0%    0.0%    0.0%    0.0%    0.0%    0      0     85
Malayalam       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     1
Mongolian       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     2
Norwegian       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     28
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%    0      0     29
Serbian         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     2
Serbo-Croatian  0.0%    0.0%    0.0%    0.0%    0.0%    0      0     15
Slovak          0.0%    0.0%    0.0%    0.0%    0.0%    0      0     27
Slovenian       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     14
Tagalog         0.0%    0.0%    0.0%    0.0%    0.0%    0      0     19
Tamil           0.0%    0.0%    0.0%    0.0%    0.0%    0      0     2
Urdu            0.0%    0.0%    0.0%    0.0%    0.0%    0      0     25

The poorest performers in recall are Indonesian (41.5%), Chinese (47.0%), Swedish (50.5%), English (59.5%), Portuguese (65.5%), Italian (66.0%), Dutch (67.0%), and Spanish (68.0%).

The poorest performers in precision are Chinese (75.8%), Spanish (76.4%), and Indonesian (77.6%).

The poorest performers in terms of false positives among the languages not in the balanced query set are Cantonese (107), Malay (85), Latin (65), Bulgarian (50), Catalan (39), and Igbo (37).

Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group.

Most common identification errors:

Arabic      Persian (8)
Persian     Arabic (20)     Urdu (20)

Chinese     Cantonese (94)
Japanese    Chinese (30)    Cantonese (13)

Dutch       German (12)
German      Dutch (11)

French      Catalan (7)
Italian     Latin (10)
Portuguese  Spanish (24)    Latin (17)
Spanish     Portuguese (18) Catalan (13)

Russian     Bulgarian (28)  Macedonian (15)
Ukrainian   Russian (28)    Bulgarian (22)

Czech       Slovak (20)

Indonesian  Malay (75)

Swedish     Norwegian (17)  Danish (11)

English     Igbo (32)

Korean      Azerbaijani (1)
Polish      Latin (3)       Serbo-Croatian (3)
Turkish     Azerbaijani (7)
Vietnamese  Italian (1)     Latin (1)

As before, confusion among Arabic/Persian/Urdu, Chinese/Japanese/Cantonese, Dutch/German, French/Italian/Portuguese/Spanish/Catalan/Latin, and Russian/Ukrainian/Bulgarian/Macedonian is not too surprising. Neither is confusion among Czech/Slovak, Indonesian/Malay, or Swedish/Norwegian/Danish.

English/Igbo would be a surprise, but we already know there’s a lot of English in the Igbo training data.

Conclusions
For the 21 languages we should be able to release these query-based models and include them with the PHP version of TextCat used for our A/B tests.

Indonesian needs the most work, since it is performing poorly in unexpected ways (i.e., with Swedish and Italian).

The other language/script families that perform poorly may also benefit from additional work to improve the quality of their training data.

For the full list of 59 languages, Igbo sticks out as the worst performing. As expected, language/script families are generally more easily confused.

Next Steps
To Do:
 * Release the rest of the 21 languages in the balanced query set, because they seem to be working reasonably well on reasonably clean and balanced data.

To Consider:
 * Try to improve the training data for Indonesian, and re-assess against this test set.
 * Try to improve the training data for the various language/script families, and re-assess against this test set.
 * Release improved models.


 * Add to the balanced test set additional languages, based on query volume, the uniqueness of the language-script mapping (e.g., Thai, Armenian), by language family, or some other criteria of desirability. Assess performance on this set.
 * Determine which models need improvement, and release the acceptable models.