User:TJones (WMF)/Notes/TextCat Optimization for frwiki eswiki itwiki and dewiki

April 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T132466)

Summary of Results
Using the default 3K models, the best options for each wiki are presented below:

frwiki
 * languages: French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean
 * lang codes: fr, en, ar, ru, zh, th, el, hy, he, ko
 * relevant poor-performing queries: 29%
 * f0.5: 89.0%

eswiki
 * languages: Spanish, English, Russian, Chinese, Arabic, Japanese
 * lang codes: es, en, ru, zh, ar, ja
 * relevant poor-performing queries: 47%
 * f0.5: 95.8%

itwiki
 * languages: Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean
 * lang codes: it, en, ru, ar, zh, ja, el, ko
 * relevant poor-performing queries: 29%
 * f0.5: 92.2%

dewiki
 * languages: German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
 * lang codes: de, en, zh, el, ru, ar, hi, th, ko, ja
 * relevant poor-performing queries: 35%
 * f0.5: 88.2%

Background
I’ve previously looked at optimizing the set of languages to use with TextCat for language detection on enwiki, and we have an A/B test in the works.

The next step is to do a similar analysis for other big wikis, based on query volume. The next four wikis are Italian, German, Spanish, and French Wikipedias. Due to technical difficulties and personal preferences, I will be looking at French, Spanish, Italian, and then German.

I’ve also done some preliminary work—corpus creation and initial filtering—on the next four candidate wikis: Russian, Japanese, Portuguese, and Indonesian. This was useful for defining and streamlining the process, especially with non-Latin Russian and Japanese.

Query Corpus Creation
The first step for each wiki is to extract a recent corpus of relevant queries, and then do some initial filtering to remove some of the dreck and other undesirable queries in the corpora.

Random sampling
Select 10K random queries from a recent one-week period (generally in March 2016) that meet the following criteria:
 * Query came from the search box on .wikipedia.org
 * No more than one query from any given IP for any given day
 * No more than 30 queries per day from that IP
 * Only the _content index was searched (except for wikis that search multiple indexes by default)
 * Query had < 3 results

On one occasion, the Hive query got stuck in the reduce stage. The only workaround I found was to extract the queries from a different week-long time period.

The Hive query for frwiki is available as an example.

Initial filtering
There is a lot of junk in our query logs, and some of it is relatively easy to identify with reasonable accuracy. The process is as follows:
 * Extract queries that have the same 1-to-10 character sequence at least three times in a row and manually review them. These are mostly junk, but there are some good ones. Put the good ones back.
 * Extract queries that are nothing but consonants and spaces. There are more than you would think, and it works on non-Latin languages/wikis (like ru and ja), too! These are all junk, but they are reviewed anyway.
 * Extract queries that have four Latin characters in a row that are not in [aeiouhy]. Again, there are more than you would think, and most are junk. This works in non-Latin languages, too. It works less well in German, but still found a lot of junk.
 * Review remaining queries. Sort and review:
    * Remove most queries with www, http, @, .com, .org, .net, .mobi, .biz, .xxx, .co.uk, and common TLDs for the language under review.
    * Remove queries that aren’t words: mostly numbers, things that look like serial numbers, ID numbers, phone numbers, addresses, etc.
    * Remove queries that are mostly or completely emoji.
    * Note any obvious “other” languages in use in the sample. Different scripts are really obvious because they are grouped when sorted. Other languages using the same script are hit or miss.
    * Incidentally remove any unwanted queries as they go by: proper names, chemical names, any other obvious gibberish not caught by the gibberish filters above.
 * Sort, uniq, and randomly order remaining queries.

This cuts down the query pool by 5-25% (median of 8 languages so far: about 11%).
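
The three gibberish heuristics in the list above could be approximated with regular expressions. This is only a sketch, not the filter actually used: the function name, the pattern names, and the choice to treat "consonant" as any ASCII letter other than a, e, i, o, u, y are all assumptions made for illustration. Anything it flags is a candidate for manual review, not an automatic deletion.

```python
import re

# Heuristic 1: the same 1-to-10 character sequence at least three times in a row.
REPEATED_SEQUENCE = re.compile(r"(.{1,10})\1{2}")

# Heuristic 2: nothing but consonants and spaces.
# Assumption: "consonant" = ASCII letter other than a, e, i, o, u, y.
ONLY_CONSONANTS = re.compile(r"(?i)[bcdfghjklmnpqrstvwxz ]+")

# Heuristic 3: four Latin letters in a row, none of them in [aeiouhy].
CONSONANT_RUN = re.compile(r"(?i)[bcdfgjklmnpqrstvwxz]{4}")


def junk_candidate_reason(query):
    """Return the name of the first heuristic the query trips, or None."""
    text = query.strip()
    if REPEATED_SEQUENCE.search(text):
        return "repeated_sequence"
    if text and ONLY_CONSONANTS.fullmatch(text):
        return "only_consonants"
    if CONSONANT_RUN.search(text):
        return "consonant_run"
    return None


if __name__ == "__main__":
    for q in ["abcabcabc", "sdfg hjkl", "Angstschweiß", "encyclopédie", "schwarz"]:
        print(q, "->", junk_candidate_reason(q))
```

Leaving h out of the third pattern keeps common clusters such as "sch" from counting toward a run, but a German compound like "Angstschweiß" still trips it, which is one way to see why that heuristic is noisier on dewiki.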

Language annotation and further filtering
Once we’ve removed the really obvious junk, it’s time to manually review queries to create a corpus.
 * Take the first 1000 queries from the filtered and randomized sample.
 * Run current language identification on it, using the language of the wiki, English (which is everywhere), and any other languages noted during initial filtering. This is far from perfect, but when it works decently, it’s helpful and reduces context switching. For example, most of the non-junk queries identified from frwiki as French are in fact French.
 * Skim language ID results and see if anything is obviously terrible (e.g., most of the “German” queries are obviously French) or obviously missing (oops, there’s a query in Armenian) and run again if necessary.
 * Review and manually tag the queries, removing queries that are proper names (people, places, language names, companies, products, fictional characters, etc., etc.), acronyms, more gibberish (there’s always more gibberish), scientific terminology and other words that are extremely ambiguous and not specific to any one language, and anything that’s unidentifiable.
 * Queries with typos are often left in, even though they make automatic identification hard(er).
 * Longer queries that include a few “undesirable” words are kept. (e.g., “Le declin du système éducatif haïtien. Quelles en sont les causes fondamentales?” would be kept since it is mostly French, but “Haïti” would not because it’s a name.)
 * Proper names that are made up of common nouns are kept. (e.g., names of movies, like “Seeking a Friend for the End of the World”, are often phrases made up of normal words, and are kept. Similarly country names made up of normal words are kept: "Costa Rica", "Puerto Rico", "Côte d'Ivoire", etc.)

For French, this cut the query pool by about two thirds, leaving only one third (three hundred and something) of the queries. So the process was repeated on the next 1000 queries from the filtered and randomized sample. For Spanish, just less than half of queries were eliminated, so I stopped with 520. The goal is > 500 queries.

For frwiki, the result is a corpus of 682 queries (from 2000 reviewed, after ~15% were previously removed).

Thus, for French, only ~30% of the queries that meet the criteria for possible language detection (< 3 results) are actually in an identifiable language. In production, much of the other 70% would also often be labelled as being in a particular language, but those results (on names, acronyms, gibberish, etc.) are unpredictable and any results from another wiki may or may not be helpful. Hence the need for A/B testing after this analysis is done.
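
The arithmetic behind that estimate can be written out as a small sketch. The function name and parameters are hypothetical; the inputs are the frwiki figures quoted above (~15% removed by initial filtering, 682 queries kept out of 2000 reviewed).

```python
def identifiable_share(removed_by_initial_filter, reviewed, kept):
    """Estimate the fraction of sampled queries that are in an identifiable language.

    removed_by_initial_filter: fraction of the sample dropped as obvious junk
    reviewed: number of filtered queries that were manually reviewed
    kept: number of reviewed queries that ended up in the tagged corpus
    """
    survived_filtering = 1.0 - removed_by_initial_filter
    return survived_filtering * (kept / reviewed)


if __name__ == "__main__":
    share = identifiable_share(0.15, reviewed=2000, kept=682)
    print(f"{share:.0%}")
```

For frwiki this comes out just under 30%. The same calculation applies to the other wikis using their own filtering and review counts.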

Corpus size
While 682 (the size of the French query corpus) is not a huge sample, it’s enough to get a sense of what languages are commonly present among these poor-performing queries, and to optimize the choice of which languages to detect. The 95% margin of error for a proportion is largest when the proportion is 50%. For a sample of size 500, that’s 4.38%. For a smaller proportion, the margin of error is smaller in absolute terms, but larger relative to the proportion (e.g., 0.87% for a proportion of 1% out of 500).
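
The figures above are consistent with the usual normal-approximation formula for the margin of error of a proportion, z * sqrt(p(1 - p)/n) with z = 1.96 for 95%. That this is the exact formula behind the quoted numbers is an assumption, but it reproduces them. A minimal sketch (the function name is illustrative):

```python
import math


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p observed in a sample of size n."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    print(f"p=50%, n=500: {margin_of_error(0.50, 500):.2%}")
    print(f"p=1%,  n=500: {margin_of_error(0.01, 500):.2%}")
    print(f"p=50%, n=682: {margin_of_error(0.50, 682):.2%}")
```

Since p(1 - p) peaks at p = 0.5, the 50% case is the worst case for any given sample size.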

Overall, though, that’s good enough for us to say things like, “Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.”—which is enough for us to optimize the languages to be used for language detection for accuracy and run-time performance.

French Results
About 15% of the original 10K corpus was removed in the initial filtering. A 2,000-query random sample was taken, and about 66% of those queries were discarded, leaving a 682-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.

Other languages searched on frwiki
Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, about 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.

Below are the results for frwiki, with raw counts, percentage, and 95% margin of error.

In order, those are French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Corsican, Thai, Swahili, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton.

We don’t have query-trained language models for all of the languages represented here, such as Corsican, Swahili, Breton, Icelandic, Latin, or Hungarian. Since these each represent very small slices of our corpus (1-2 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,517 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Hebrew, and Korean queries.

Analysis and Optimization
Using all of the language models available, the performance report (for the 3000-ngram models* we use in enwiki) is below.

* ''I also ran tests on other model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. 3000 is still the best model size.''

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          83.0%   83.1%   83.1%   83.1%   83.0%     681     566     116
French         95.5%   91.7%   88.1%   85.9%   98.3%     468     402       7
English        80.6%   75.5%   70.9%   68.2%   84.5%      88      60      11
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%      66      66       0
Portuguese     62.5%   66.7%   71.4%   75.0%   60.0%      12       9       6
German         44.3%   50.0%   57.4%   63.6%   41.2%      11       7      10
Spanish        15.4%   20.8%   32.1%   50.0%   13.2%      10       5      33
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       5       5       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       4       4       0
Dutch          21.3%   28.6%   43.5%   66.7%   18.2%       3       2       9
Corsican        0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Italian         6.1%    9.1%   17.9%   50.0%    5.0%       2       1      19
Polish         29.4%   40.0%   62.5%  100.0%   25.0%       2       2       6
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Breton          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Hungarian       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Icelandic       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swahili         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         7.7%   11.8%   25.0%  100.0%    6.2%       1       1      15
Thai          100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
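
For readers who want to check the report, the per-language percentages can be recomputed from the three count columns. The sketch below assumes that, on a language row, "total" is the number of queries actually in that language, "hits" is the number correctly identified, and "misses" is the number of other queries wrongly tagged as that language (false positives). That reading is an inference from the numbers, not something the report states, and the function name is illustrative.

```python
def report_scores(total, hits, misses):
    """Recompute recall, precision, and F-scores from the report's count columns.

    Assumes `misses` counts false positives for the language (see lead-in).
    """
    recall = hits / total if total else 0.0
    tagged = hits + misses
    precision = hits / tagged if tagged else 0.0

    def f_beta(beta):
        # F-beta weights recall beta times as much as precision;
        # beta = 0.5 therefore favors precision.
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "f0.5": f_beta(0.5),
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
        "recall": recall,
        "prec": precision,
    }


if __name__ == "__main__":
    # French row of the report: total=468, hits=402, misses=7
    for name, value in report_scores(468, 402, 7).items():
        print(f"{name}: {value:.1%}")
```

Run on the French row, this gives back the f0.5, f1, f2, recall, and precision values shown in the report, which supports the false-positive reading of the last column.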

Spanish, Dutch, Italian, Polish, and Swedish do very poorly. They have too few actual instances that they can get correct, which are heavily outweighed by the false positives they do get.

Portuguese and German are not great, either. I reran the analysis without Portuguese and German, and it was better. I added them each back into the mix separately and in both cases the results were worse.

As noted above, Greek, Hebrew, and Korean are present in the larger sample, and from earlier work on the balanced query sets, our models for these languages are very high accuracy.

So, I dropped Portuguese and German, added Greek, Hebrew, and Korean, and re-ran the performance report with the 3000-ngram models (to check the performance and double-check that Greek, Hebrew, and Korean aren’t causing problems). The results are below:

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          89.0%   89.1%   89.1%   89.1%   89.0%     681     607      75
French         94.8%   95.1%   95.5%   95.7%   94.5%     468     448      26
English        67.0%   74.9%   84.9%   93.2%   62.6%      88      82      49
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%      66      66       0
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%      12       0       0
German          0.0%    0.0%    0.0%    0.0%    0.0%      11       0       0
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%      10       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       5       5       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       4       4       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Corsican        0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Breton          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Hungarian       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Icelandic       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swahili         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Thai          100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0

Recall went up and precision went down for French and English, but overall performance improved. Queries in unrepresented languages were all identified as either French or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

frwiki: Best Options
The optimal settings for frwiki, based on these experiments, would be to use models for French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean (fr, en, ar, ru, zh, th, el, hy, he, ko), using the default 3000-ngram models.

Spanish Results
About 10% of the original 10K corpus was removed in the initial filtering. A 1,000-query random sample was taken, and 48% of those queries were discarded, leaving a 520-query corpus. Thus only about 47% of poor-performing queries are in an identifiable language.

Other languages searched on eswiki
Based on the sample of 520 poor-performing queries on eswiki that are in some language, about 90% are in Spanish, 4-8% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for eswiki, with raw counts, percentage, and 95% margin of error. In order, those are Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, Guarani*, French, German, Catalan.

*  Mbaé’chepa!

We don’t have query-trained language models for all of the languages represented here, such as Latin, Guarani, and Catalan. Since these each represent very small slices of our corpus (1-3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 9,003 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Arabic and Japanese queries, and one each for Cherokee and Aramaic (for which we do not have models).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:

model size      3000    3500    6000    7000    8000    9000   10000
TOTAL          83.2%   84.4%   84.8%   85.0%   85.4%   85.9%   86.3%
Spanish        91.6%   92.2%   92.3%   92.6%   92.8%   93.2%   93.4%
English        73.0%   75.0%   75.8%   73.8%   75.0%   76.2%   76.2%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
Catalan         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
French          0.0%    0.0%   33.3%   20.0%   22.2%   22.2%   22.2%
German         22.2%   22.2%   25.0%   22.2%   20.0%   20.0%   20.0%
Italian         8.0%    9.1%    9.1%   10.0%   11.8%   11.8%   11.1%
Portuguese      4.7%    4.8%    4.8%    5.3%    5.0%    5.3%    5.7%

Performance details for the 3K models are here (details for larger models are similar in terms of which language models perform the most poorly):

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          83.2%   83.2%   83.2%   83.2%   83.2%     519     432      87
Spanish        96.3%   91.6%   87.3%   84.7%   99.8%     476     403       1
English        73.7%   73.0%   72.3%   71.9%   74.2%      32      23       8
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Catalan         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       7
German         15.2%   22.2%   41.7%  100.0%   12.5%       1       1       7
Italian         5.2%    8.0%   17.9%  100.0%    4.2%       1       1      23
Portuguese      3.0%    4.7%   10.9%  100.0%    2.4%       1       1      41

Italian, Portuguese, French, and German all do very poorly, with too many false positives.

As noted above, Arabic and Japanese are present in the larger sample, and as our models for these languages are high accuracy, I’ve included them.

The final language set is Spanish, English, Russian, Chinese, Arabic, and Japanese. As above, 3K is not the optimal model size—my current unsupported hypothesis is that 3K isn’t the best here because there are really only two languages in contention. The 3K results are shown below along with the best performing model sizes:

model size      1500    2000    2500    3000    9000   10000
TOTAL          96.5%   96.1%   96.0%   95.8%   96.0%   96.1%
Spanish        98.7%   98.5%   98.4%   98.2%   98.3%   98.4%
English        78.9%   76.3%   75.3%   76.5%   76.9%   77.9%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
Catalan         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
French          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
German          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Italian         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

However, the accuracy is very high, and the differences are not huge, so it makes sense to stick with the default 3K models for now, but I'll continue to keep an eye out for significant performance improvements with other model sizes when working with other corpora.

The detailed report for the 3K model is here*:

[ * I inadvertently forgot to include "Guarani" as a known language, so Spanish totals were off by one (519 instead of 520). Since we don't have a Guarani language detector, it is of course incorrect, slightly lowering the overall score, but not really changing the final recommendations. The report below is corrected, those above are not.]

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          95.6%   95.6%   95.6%   95.6%   95.6%     520     497      23
Spanish        98.8%   98.2%   97.6%   97.3%   99.1%     476     463       4
English        66.8%   75.6%   87.1%   96.9%   62.0%      32      31      19
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Catalan         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
German          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Guarani         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

Recall went up and precision went down for Spanish and English, but overall performance improved. Queries in unrepresented languages were all identified as either Spanish or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

eswiki: Best Options
Non-optimal settings for eswiki (while being consistent with other wikis using 3K models), based on these experiments, would be to use models for Spanish, English, Russian, Chinese, Arabic, Japanese (es, en, ru, zh, ar, ja), using the default 3000-ngram models.

Italian Results
About 15% of the original 10K corpus was removed in the initial filtering. A 1,600-query random sample was taken, and 65% of those queries were discarded, leaving a 550-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.

Other languages searched on itwiki
Based on the sample of 550 poor-performing queries on itwiki that are in some language, about 75% are in Italian, 20% are in English, and fewer than 1% each are in a handful of other languages.

Below are the results for itwiki, with raw counts, percentage, and 95% margin of error. In order, those are Italian, English, Spanish, German, Latin, French, Russian, Romanian, Portuguese, Arabic, Chinese, Polish, Czech.

We don’t have query-trained language models for all of the languages represented here, such as Latin and Romanian. Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,533 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Greek and Japanese queries, and one each for Korean and Bengali, and one for Punjabi (for which we do not have a model).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:

model size      3000    3500    4000    4500   10000
TOTAL          83.5%   84.0%   84.0%   83.8%   84.0%
Italian        93.1%   93.4%   93.4%   93.8%   93.8%
English        80.9%   82.1%   80.4%   77.8%   77.4%
Spanish        33.3%   28.6%   22.9%   27.8%   32.4%
German         46.2%   46.2%   42.9%   40.0%   41.4%
French         20.7%   22.2%   32.0%   32.0%   33.3%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%
Portuguese     19.0%   20.0%   30.0%   31.6%   27.3%
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%  100.0%
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%
Czech          25.0%   25.0%    0.0%    0.0%    0.0%
Polish         50.0%   50.0%   66.7%   66.7%   66.7%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          83.5%   83.5%   83.5%   83.5%   83.5%     550     459      91
Italian        96.2%   93.1%   90.2%   88.4%   98.3%     404     357       6
English        89.4%   80.9%   73.8%   69.7%   96.2%     109      76       3
Spanish        25.0%   33.3%   50.0%   75.0%   21.4%       8       6      22
German         34.9%   46.2%   68.2%  100.0%   30.0%       6       6      14
French         14.4%   20.7%   36.6%   75.0%   12.0%       4       3      22
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese     13.3%   19.0%   33.3%   66.7%   11.1%       3       2      16
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech          17.2%   25.0%   45.5%  100.0%   14.3%       1       1       6
Polish         38.5%   50.0%   71.4%  100.0%   33.3%       1       1       2

Spanish, German, French, and Portuguese all do very poorly, with too many false positives. Czech and Polish aren’t terrible in terms of raw false positives, but aren’t great, either.

As noted above, Greek, Japanese, and Korean are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them. I did not include Bengali because it hasn't been well tested at this point.

The final language set is Italian, English, Russian, Arabic, Chinese, Japanese, Greek, and Korean. As above, 3K is not the optimal model size, but it is within 0.2%. The 3K results are shown below along with the best performing model sizes:

model size      3000    3500    4000    4500   10000
TOTAL          92.2%   92.4%   92.2%   91.8%   92.2%
Italian        96.7%   96.9%   96.6%   96.4%   96.6%
English        87.3%   87.8%   87.8%   86.8%   87.7%
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%
German          0.0%    0.0%    0.0%    0.0%    0.0%
French          0.0%    0.0%    0.0%    0.0%    0.0%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%  100.0%
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%
Czech           0.0%    0.0%    0.0%    0.0%    0.0%
Polish          0.0%    0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.

The detailed report for the 3K model is here:

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          92.2%   92.2%   92.2%   92.2%   92.2%     550     507      43
Italian        95.4%   96.7%   98.1%   99.0%   94.6%     404     400      23
English        84.9%   87.3%   89.9%   91.7%   83.3%     109     100      20
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
German          0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

Recall went up and precision went down for Italian and English, but overall performance improved. Queries in unrepresented languages were all identified as either Italian or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

itwiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for itwiki, based on these experiments, would be to use models for Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean (it, en, ru, ar, zh, ja, el, ko), using the default 3000-ngram models.

German Results
About 6% of the original 10K corpus was removed in the initial filtering. A 1,400-query random sample was taken, and ~63% of those queries were discarded, leaving a 520-query corpus. Thus only about 35% of poor-performing queries are in an identifiable language.

Other languages searched on dewiki
Based on the sample of 520 poor-performing queries on dewiki that are in some language, about 70% are in German, about 25% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for dewiki, with raw counts, percentage, and 95% margin of error. In order, those are German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Turkish, Swedish, Dutch.

We don’t have query-trained language models for all of the languages represented here, in particular Latin. Since it represents a very small slice of our corpus (8 queries), we aren’t going to worry about it, and accept that it will not be detected correctly.

Looking at the larger corpus of 9,439 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese queries, and one Odia query (for which we do not have a model).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:

model size      2500    3000    3500    4000    6000    7000    8000    9000   10000
TOTAL          74.6%   74.4%   74.6%   75.3%   75.7%   76.3%   76.9%   77.0%   77.6%
German         88.9%   88.9%   89.2%   89.5%   90.0%   90.5%   90.9%   90.9%   91.2%
English        73.8%   74.5%   74.1%   75.7%   74.9%   74.0%   74.0%   73.8%   74.1%
Italian        25.0%   40.0%   40.0%   38.5%   37.0%   42.9%   46.2%   48.0%   51.9%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Spanish        50.0%   48.3%   50.0%   46.2%   43.5%   45.5%   47.6%   47.6%   50.0%
French         34.8%   36.4%   34.8%   33.3%   32.0%   32.0%   33.3%   32.0%   32.0%
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Dutch           6.9%    5.9%    6.2%    6.2%    7.1%    7.4%    7.1%    6.9%    7.1%
Polish         25.0%   25.0%   22.2%   25.0%   28.6%   28.6%   22.2%   22.2%   25.0%
Swedish         5.0%    0.0%    0.0%    0.0%    0.0%    0.0%    5.7%    5.9%    5.9%
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%   22.2%   22.2%
Vietnamese     40.0%   50.0%   50.0%   50.0%   66.7%  100.0%  100.0%  100.0%  100.0%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          74.5%   74.4%   74.3%   74.2%   74.5%     520     386     132
German         94.5%   88.9%   83.9%   80.8%   98.6%     360     291       4
English        85.6%   74.5%   66.0%   61.3%   95.0%     124      76       4
Italian        32.9%   40.0%   51.0%   62.5%   29.4%       8       5      12
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish        38.0%   48.3%   66.0%   87.5%   33.3%       8       7      14
French         27.4%   36.4%   54.1%   80.0%   23.5%       5       4      13
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch           3.8%    5.9%   13.5%  100.0%    3.0%       1       1      32
Polish         17.2%   25.0%   45.5%  100.0%   14.3%       1       1       6
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0      38
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       7
Vietnamese     38.5%   50.0%   71.4%  100.0%   33.3%       1       1       2

Dutch and Swedish do very poorly, with too many false positives. Italian, Spanish, French, Polish, Turkish, and Vietnamese aren’t terrible in terms of raw false positives, but aren’t great, either.

As noted above, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.

The final language set is German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese. As above, 3K is not the optimal model size, but it is within half a percent. The 3K results are shown below along with the best performing model sizes:

                3000    4000    4500    5000    9000   10000
TOTAL          88.2%   88.7%   88.7%   88.7%   88.9%   88.7%
German         94.8%   95.3%   95.2%   95.4%   95.5%   95.2%
English        81.9%   82.8%   83.2%   82.7%   83.3%   83.2%
Italian         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
French          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Polish          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.

The detailed report for the 3K model is here:

                f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          88.2%   88.2%   88.1%   88.1%   88.2%     520     458      61
German         93.9%   94.8%   95.8%   96.4%   93.3%     360     347      25
English        77.9%   81.9%   86.3%   89.5%   75.5%     124     111      36
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
                f0.5      f1      f2  recall    prec   total    hits  misses

Recall went up and precision went down for both German (recall 80.8% to 96.4%, precision 98.6% to 93.3%) and English (recall 61.3% to 89.5%, precision 95.0% to 75.5%), but overall performance improved (overall F0.5 from 74.5% to 88.2%). Queries in unrepresented languages were all identified as either German or English (decreasing precision for both), but the now-unused models are no longer generating lots of false positives and bringing down overall precision.

Observations
Interestingly, even though Chinese was enabled, neither of the two Chinese queries was tagged as such. Generally, Chinese is a relatively high-accuracy identifier. In this case, both queries include a random string of numbers and Latin letters, and one includes ".html". They also include a number of less common Chinese characters. As a result, the less common Chinese characters get the same score from the Chinese and German language detectors (the maximum penalty for an "unknown" character), and the individual Latin letters score well in German, while the known Chinese characters score less well in Chinese. The Chinese model includes not only individual characters, but also bigrams and larger n-grams, so there aren't even 3,000 singleton Chinese characters in the model, which is why less common characters are unknown to it.

The Chinese model did a better job on the other Chinese examples in the larger un-tagged dewiki sample.
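
The effect described above can be illustrated with a toy version of rank-based "out-of-place" n-gram scoring, the general approach TextCat is built on. Everything here is hypothetical: the tiny models, their ranks, the example query, and the function names are invented for illustration and are not taken from the real German or Chinese models, and only unigrams are used to keep it short. Lower scores are better.

```python
from collections import Counter

MAX_PENALTY = 3000  # assumed: the model size doubles as the "unknown n-gram" penalty


def profile(text, max_n=1):
    """Rank the n-grams of a text by frequency (rank 0 = most frequent)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Ties are broken by code point so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(text_profile, model, max_penalty):
    """Sum of rank differences; n-grams missing from the model cost max_penalty."""
    return sum(
        abs(rank - model[gram]) if gram in model else max_penalty
        for gram, rank in text_profile.items()
    )


# Hypothetical Latin-script model: frequent letters and "." get low ranks.
LATIN_MODEL = {ch: rank for rank, ch in enumerate("enisratdhulcgmobwfkz.")}

# Hypothetical Chinese model: a few very common characters at the top,
# two less common ones known but ranked deep, everything else unknown.
CJK_MODEL = {"的": 0, "一": 1, "是": 2, "中": 3, "文": 4, "基": 2500, "維": 2700}

# Two known-but-deep characters, one character unknown to both models,
# and a Latin tail like the ".html" seen in the real queries.
QUERY = "維基龘.html"

if __name__ == "__main__":
    p = profile(QUERY)
    print("Latin-script model:", out_of_place(p, LATIN_MODEL, MAX_PENALTY))
    print("Chinese model:     ", out_of_place(p, CJK_MODEL, MAX_PENALTY))
```

In this toy setup the rare character costs the maximum penalty under both models, the Latin characters are cheap under the Latin-script model but cost the maximum under the Chinese one, and the deeply ranked Chinese characters are nearly as expensive as unknown ones. The Latin-script model therefore ends up with the lower (better) total even though the query is recognizably Chinese.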

dewiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for dewiki, based on these experiments, would be to use models for German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese (de, en, zh, el, ru, ar, hi, th, ko, ja), using the default 3000-ngram models.

Next Up

 * One or more of Russian, Japanese, Portuguese, and Indonesian, if we continue. (See T121541)

Other thoughts

 * It would be easy to build high-accuracy identifiers for languages that have unique character sets—or at least character sets that are effectively unique in practice. For example, Yiddish can be written with the Hebrew alphabet, but on most wikis, we'd expect most Hebrew characters to actually be Hebrew (and identifying Yiddish as Hebrew is better than identifying it as anything else other than Yiddish). Similarly, Persian and Arabic writing share many characters, but on frwiki, for example, we only see Arabic. Cherokee is rare, but examples have shown up in our samples. Korean, Armenian, Hebrew, Greek, Georgian, Thai, and others could be used on most wikis because they are low risk, if the run-time cost of enabling them is not too high.
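
As a rough illustration of how simple such identifiers could be, here is a sketch that maps a query to a language when most of its letters belong to a single script. It relies on the script name being the first word of the Unicode character name, which holds for the scripts listed. The mapping, the threshold, and the function name are assumptions for illustration, not existing TextCat functionality, and the run-time cost is not addressed.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to the language we would guess on most wikis.
SCRIPT_TO_LANG = {
    "HANGUL": "ko",
    "ARMENIAN": "hy",
    "HEBREW": "he",    # Yiddish would also land here, as discussed above
    "GREEK": "el",
    "GEORGIAN": "ka",
    "THAI": "th",
    "CHEROKEE": "chr",
}


def detect_by_script(query, threshold=0.5):
    """Return a language code if more than `threshold` of the letters
    in `query` belong to one mapped script, otherwise None."""
    counts = Counter()
    letters = 0
    for ch in query:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        letters += 1
        name = unicodedata.name(ch, "")
        script = name.split(" ", 1)[0]
        if script in SCRIPT_TO_LANG:
            counts[script] += 1
    if letters == 0 or not counts:
        return None
    script, n = counts.most_common(1)[0]
    return SCRIPT_TO_LANG[script] if n / letters > threshold else None


if __name__ == "__main__":
    for q in ["Ελλάδα", "שלום", "서울 2016", "Berlin"]:
        print(repr(q), "->", detect_by_script(q))
```

Queries with no mapped script (or no letters at all) return None, so a check like this could sit in front of the n-gram models and only short-circuit the clear-cut cases.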