User:TJones (WMF)/Notes/TextCat Optimization for frwiki eswiki itwiki and dewiki

April 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T132466)

TextCat Optimization for frwiki, eswiki, itwiki, and dewiki

Summary of Results

Using the default 3K models, the best options for each wiki are presented below:

frwiki

languages: French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean
lang codes: fr, en, ar, ru, zh, hy, th, el, he, ko
relevant poor-performing queries: 29%
f_0.5: 89.0%

eswiki

languages: Spanish, English, Russian, Chinese, Arabic, Japanese
lang codes: es, en, ru, zh, ar, ja
relevant poor-performing queries: 47%
f_0.5: 95.6%

itwiki

languages: Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean
lang codes: it, en, ru, ar, zh, ja, el, ko
relevant poor-performing queries: 29%
f_0.5: 92.2%

dewiki

languages: German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
lang codes: de, en, zh, el, ru, ar, hi, th, ko, ja
relevant poor-performing queries: 35%
f_0.5: 88.2%

Background

I’ve previously looked at optimizing the set of languages to use with TextCat for language detection on enwiki, and we have an A/B test in the works.

The next step is to do a similar analysis for other big wikis, based on query volume. The next four wikis are the Italian, German, Spanish, and French Wikipedias. Due to technical difficulties and personal preferences, I will be looking at them in a different order: French, Spanish, Italian, and then German.

I’ve also done some preliminary work—corpus creation and initial filtering—on the next four candidate wikis: Russian, Japanese, Portuguese, and Indonesian. This was useful for defining and streamlining the process, especially with non-Latin Russian and Japanese.

Query Corpus Creation

The first step for each wiki is to extract a recent corpus of relevant queries, and then do some initial filtering to remove some of the dreck and other undesirable queries in the corpora.

Random sampling

Select 10K random queries from a recent one-week period (generally in March 2016) that meet the following criteria:

Query came from the search box on <wiki>.wikipedia.org
Exclude any IP that made more than 30 queries per day
Include not more than one query from any given IP for any given day
Only the <wiki>_content index was searched (except for wikis that search multiple indexes by default)
Query had < 3 results

On one occasion, the Hive query got stuck in the reduce stage. The only work-around I found was to use a different week-long time period from which to extract the queries.

The Hive query for frwiki is available as an example.
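
The sampling criteria above can also be expressed outside of Hive. The following is a minimal Python sketch, not the Hive query itself. The record fields (`ip`, `day`, `query`, `results`) and the function name `sample_queries` are hypothetical, and the sketch assumes the input has already been restricted to search-box queries against the content index. It also assumes that an IP exceeding the daily limit on any day is dropped entirely; the criteria could equally be read as dropping it for that day only.

```python
import random
from collections import defaultdict

def sample_queries(records, sample_size=10_000, max_per_ip_day=30,
                   max_results=3, seed=0):
    """Sketch of the sampling criteria; field names are assumptions."""
    rng = random.Random(seed)

    # Group log records by (IP, day).
    by_ip_day = defaultdict(list)
    for rec in records:
        by_ip_day[(rec["ip"], rec["day"])].append(rec)

    # IPs with more than max_per_ip_day queries on some day are excluded.
    busy_ips = {ip for (ip, _day), recs in by_ip_day.items()
                if len(recs) > max_per_ip_day}

    candidates = []
    for (ip, _day), recs in by_ip_day.items():
        if ip in busy_ips:
            continue
        # Keep only poor-performing queries (fewer than max_results results).
        poor = [r for r in recs if r["results"] < max_results]
        if poor:
            # At most one query per IP per day.
            candidates.append(rng.choice(poor))

    rng.shuffle(candidates)
    return [r["query"] for r in candidates[:sample_size]]
```

Fixing the seed makes a sample reproducible, which is convenient when the same week of logs has to be re-extracted.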

Initial filtering

There is a lot of junk in our query logs, and some of it is fairly easy to identify with reasonable accuracy. The process is as follows:

Extract queries that have the same 1-to-10 character sequence at least three times in a row and manually review—these are mostly junk, but there are some good ones. Put the good ones back.
Extract queries that are nothing but consonants and spaces—there are more than you would think, and it works on non-Latin languages/wikis (like ru and ja), too! These are all junk, but they are reviewed anyway.
Extract queries that have four Latin characters in a row that are not in [aeiouhy]. Again, more than you would think, most are junk. Works in non-Latin languages, too. Works less well in German, but still found a lot of junk.
Review remaining queries. Sort and review:
- Remove most queries with www, http, @, .com, .org, .net, .mobi, .biz, .xxx, .co.uk, and common TLDs for the language under review.
- Remove queries that aren’t words—mostly numbers, things that look like serial numbers, ID numbers, phone numbers, addresses, etc.
- Remove queries that are mostly or completely emoji.
- Note any obvious “other” languages in use in the sample. Different scripts are really obvious because they are grouped when sorted. Other languages using the same script are hit or miss.
Incidentally remove any unwanted queries as they go by: proper names, chemical names, any other obvious gibberish not caught by the gibberish filters above.
Sort, uniq, and randomly order remaining queries.

This cuts down the query pool by 5-25% (median of 8 languages so far: about 11%).
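
The three automatic gibberish heuristics described above can be approximated with regular expressions. This is a sketch under stated assumptions rather than the filters actually used: the names are hypothetical, and the exact consonant set for the "consonants and spaces" check is not given in these notes, so the sketch treats `y` as a vowel there. Flagged queries are candidates for manual review, not automatic deletion.

```python
import re

# Same 1-to-10 character sequence at least three times in a row.
REPEATED_RUN = re.compile(r"(.{1,10})\1{2,}")

# Nothing but Latin consonants and spaces (assumption: y counts as a vowel).
CONSONANTS_ONLY = re.compile(r"[bcdfghjklmnpqrstvwxz ]+", re.IGNORECASE)

# Four Latin letters in a row that are not in [aeiouhy].
FOUR_CONSONANTS = re.compile(r"[bcdfgjklmnpqrstvwxz]{4}", re.IGNORECASE)

def junk_flags(query):
    """Return the names of the heuristics that flag this query for review."""
    flags = []
    if REPEATED_RUN.search(query):
        flags.append("repeated_run")
    if query.strip() and CONSONANTS_ONLY.fullmatch(query):
        flags.append("consonants_only")
    if FOUR_CONSONANTS.search(query):
        flags.append("four_consonants")
    return flags

for q in ["hahahaha", "sdfg jkl", "Angstschweiss", "paris"]:
    print(q, junk_flags(q))
```

The German example shows why the third heuristic works less well on dewiki: a legitimate compound such as "Angstschweiss" contains the run "ngst" and is flagged, so it has to be put back by hand.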

Language annotation and further filtering

Once we’ve removed the really obvious junk, it’s time to manually review queries to create a corpus.

Take the first 1000 queries from the filtered and randomized sample.
Run current language identification on it, using the language of the wiki, English (which is everywhere), and any other languages noted during initial filtering. This is far from perfect, but when it works decently, it’s helpful and reduces context switching. For example, most of the non-junk queries identified from frwiki as French are in fact French.
Skim language ID results and see if anything is obviously terrible (e.g., most of the “German” queries are obviously French) or obviously missing (oops, there’s a query in Armenian) and run again if necessary.
Review and manually tag the queries, removing queries that are proper names (people, places, language names, companies, products, fictional characters, etc., etc.), acronyms, more gibberish (there’s always more gibberish), scientific terminology and other words that are extremely ambiguous and not specific to any one language, and anything that’s unidentifiable.
Queries with typos are often left in, even though they make automatic identification hard(er).
Longer queries that include a few “undesirable” words are kept. (e.g., “Le declin du système éducatif haïtien. Quelles en sont les causes fondamentales?” would be kept since it is mostly French, but “Haïti” would not because it’s a name.)
Proper names that are made up of common nouns are kept. (e.g., names of movies, like “Seeking a Friend for the End of the World”, are often phrases made up of normal words, and are kept. Similarly country names made up of normal words are kept: "Costa Rica", "Puerto Rico", "Côte d'Ivoire", etc.)

For French, this cut the query pool by about two thirds, leaving only one third (three hundred and something) of the queries. So the process was repeated on the next 1000 queries from the filtered and randomized sample. For Spanish, just less than half of queries were eliminated, so I stopped with 520. The goal is > 500 queries.

For frwiki, the result is a corpus of 682 queries (from 2000 reviewed, after ~15% were previously removed).

Thus, for French, only ~30% of the queries that meet the criteria for possible language detection (< 3 results) are actually in an identifiable language. In production, much of the other 70% would also often be labelled as being in a particular language, but those results (on names, acronyms, gibberish, etc.) are unpredictable and any results from another wiki may or may not be helpful. Hence the need for A/B testing after this analysis is done.

Corpus size

While 682 (the size of the French query corpus) is not a huge sample, it’s enough to get a sense of what languages are commonly present among these poor-performing queries, and to optimize the choice of which languages to detect. At the 95% confidence level, the margin of error on a proportion is largest when the proportion is 50%. For a sample of size 500, that maximum is 4.38%. For a smaller proportion, the margin of error is smaller in absolute terms but larger relative to the proportion (e.g., 0.87% for a proportion of 1% out of 500).

Overall, though, that’s good enough for us to say things like, “Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.”—which is enough for us to optimize the languages to be used for language detection for accuracy and run-time performance.
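
The margin-of-error figures quoted above are consistent with the usual normal-approximation formula for a proportion, z * sqrt(p * (1 - p) / n) with z = 1.96 at the 95% level. The sketch below assumes that formula (the notes do not state which calculator was used); the function name is illustrative.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p in a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 50%) and a small proportion (p = 1%) for a sample of 500.
print(f"{margin_of_error(0.50, 500):.2%}")
print(f"{margin_of_error(0.01, 500):.2%}")

# French share of the frwiki corpus: 468 of 682 queries.
print(f"{margin_of_error(468 / 682, 682):.2%}")
```

The same function can be used to recompute the +/- column of the per-wiki language tables from the raw counts.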

French Results

About 15% of the original 10K corpus was removed in the initial filtering. A 2,000-query random sample was taken, and about 66% of those queries were discarded, leaving a 682-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.
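
The 29% figure combines the two filtering stages: the share of queries that survive initial filtering, multiplied by the share of reviewed queries kept in the corpus. A small sketch of that arithmetic follows; the function name is illustrative, and it assumes the reviewed sample is representative of the filtered pool and that everything removed in initial filtering counts as unidentifiable.

```python
def identifiable_share(removed_in_filtering, reviewed, kept):
    """Estimated fraction of poor-performing queries in an identifiable language."""
    return (1 - removed_in_filtering) * kept / reviewed

# frwiki: ~15% removed in initial filtering, 682 of 2,000 reviewed queries kept.
print(f"{identifiable_share(0.15, 2000, 682):.0%}")  # prints 29%
```

The same calculation applies to the other wikis in these notes, using each wiki's own filtering rate, reviewed sample size, and corpus size.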

Other languages searched on frwiki

Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, about 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.

Below are the results for frwiki, with raw counts, percentage, and 95% margin of error.

count	lg	%	+/-
468	fr	68.62%	3.48%
89	en	13.05%	2.53%
66	ar	9.68%	2.22%
12	pt	1.76%	0.99%
11	de	1.61%	0.95%
10	es	1.47%	0.90%
5	ru	0.73%	0.64%
4	zh	0.59%	0.57%
3	nl	0.44%	0.50%
2	pl	0.29%	0.41%
2	it	0.29%	0.41%
2	co	0.29%	0.41%
1	th	0.15%	0.29%
1	sw	0.15%	0.29%
1	sv	0.15%	0.29%
1	la	0.15%	0.29%
1	is	0.15%	0.29%
1	hy	0.15%	0.29%
1	hu	0.15%	0.29%
1	br	0.15%	0.29%

In order, those are French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Corsican, Thai, Swahili, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton.

We don’t have query-trained language models for all of the languages represented here, such as Corsican, Swahili, Breton, Icelandic, Latin, or Hungarian. Since these each represent very small slices of our corpus (1-2 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,517 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Hebrew, and Korean queries.

Analysis and Optimization

Using all of the language models available, the performance report (for the 3000-ngram models* we use in enwiki) is below.

* I also ran tests on other model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. 3000 is still the best model size.

                f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     83.0%   83.1%   83.1%   83.1%   83.0%  681     566     116
     French     95.5%   91.7%   88.1%   85.9%   98.3%  468     402     7
    English     80.6%   75.5%   70.9%   68.2%   84.5%  88      60      11
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese     62.5%   66.7%   71.4%   75.0%   60.0%  12      9       6
     German     44.3%   50.0%   57.4%   63.6%   41.2%  11      7       10
    Spanish     15.4%   20.8%   32.1%   50.0%   13.2%  10      5       33
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch     21.3%   28.6%   43.5%   66.7%   18.2%  3       2       9
   Corsican      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian      6.1%    9.1%   17.9%   50.0%    5.0%  2       1       19
     Polish     29.4%   40.0%   62.5%  100.0%   25.0%  2       2       6
   Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      7.7%   11.8%   25.0%  100.0%    6.2%  1       1       15
       Thai    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
               f0.5    f1      f2      recall  prec    total   hits    misses

Spanish, Dutch, Italian, Polish, and Swedish do very poorly. Each has too few true instances in the corpus to get right, and those few hits are heavily outweighed by the false positives its model generates.

Portuguese and German are not great, either. I reran the analysis without Portuguese and German, and it was better. I added them each back into the mix separately and in both cases the results were worse.

As noted above, Greek, Hebrew, and Korean are present in the larger sample, and from earlier work on the balanced query sets, our models for these languages are very high accuracy.

So, I dropped Portuguese and German, added Greek, Hebrew, and Korean, and re-ran the performance report with the 3000-ngram models (to check the performance and double-check that Greek, Hebrew, and Korean aren’t causing problems). The results are below:

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     89.0%   89.1%   89.1%   89.1%   89.0%  681     607     75
     French     94.8%   95.1%   95.5%   95.7%   94.5%  468     448     26
    English     67.0%   74.9%   84.9%   93.2%   62.6%  88      82      49
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  12      0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  11      0       0
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  10      0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Corsican      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
   Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
               f0.5    f1      f2      recall  prec    total   hits    misses

Recall went up and precision went down for French and English, but overall performance improved. Queries in unrepresented languages were all identified as either French or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
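
The columns of the performance reports are not defined in these notes, but the published numbers are consistent with the standard definitions of recall, precision, and F-beta. The sketch below assumes (inferred from the figures, not stated in the notes) that `hits` are queries correctly assigned to a language and `misses` are queries wrongly assigned to it, that is, false positives. The function name `report_row` is illustrative.

```python
def report_row(total, hits, misses, betas=(0.5, 1.0, 2.0)):
    """Recompute recall, precision, and F-beta scores for one report row.

    total  -- queries in the corpus tagged with this language
    hits   -- queries correctly identified as this language
    misses -- queries wrongly identified as this language (assumed meaning)
    """
    recall = hits / total if total else 0.0
    precision = hits / (hits + misses) if hits + misses else 0.0
    row = {"recall": recall, "prec": precision}
    for beta in betas:
        b2 = beta * beta
        denom = b2 * precision + recall
        # F-beta: beta < 1 weights precision more heavily, beta > 1 favors recall.
        row[f"f{beta:g}"] = (1 + b2) * precision * recall / denom if denom else 0.0
    return row

# French row of the first frwiki report: total 468, hits 402, misses 7.
for name, value in report_row(468, 402, 7).items():
    print(f"{name}: {value:.1%}")
```

Because f0.5 weights precision more heavily than recall, a model that produces many false positives for a rare language is penalized heavily, which matches the reasoning used here for dropping such models.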

frwiki: Best Options

The optimal settings for frwiki, based on these experiments, would be to use models for French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean (fr, en, ar, ru, zh, hy, th, el, he, ko), using the default 3000-ngram models.

Spanish Results

About 10% of the original 10K corpus was removed in the initial filtering. A 1,000-query random sample was taken, and 48% of those queries were discarded, leaving a 520-query corpus. Thus only about 47% of poor-performing queries are in an identifiable language.

Other languages searched on eswiki

Based on the sample of 520 poor-performing queries on eswiki that are in some language, about 90% are in Spanish, 4-8% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for eswiki, with raw counts, percentage, and 95% margin of error.

count	lg	%	+/-
476	es	91.54%	2.39%
32	en	6.15%	2.07%
3	la	0.58%	0.65%
2	ru	0.38%	0.53%
1	zh	0.19%	0.38%
1	pt	0.19%	0.38%
1	it	0.19%	0.38%
1	gn	0.19%	0.38%
1	fr	0.19%	0.38%
1	de	0.19%	0.38%
1	ca	0.19%	0.38%

In order, those are Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, Guarani*, French, German, Catalan.

* Mbaé’chepa!

We don’t have query-trained language models for all of the languages represented here, such as Latin, Guarani, and Catalan. Since these each represent very small slices of our corpus (1-3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 9,003 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Arabic and Japanese queries, and one each for Cherokee and Aramaic (for which we do not have models).

Analysis and Optimization

Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:

 model size    3000    3500    6000    7000    8000    9000    10000
      TOTAL    83.2%   84.4%   84.8%   85.0%   85.4%   85.9%   86.3%
    Spanish    91.6%   92.2%   92.3%   92.6%   92.8%   93.2%   93.4%
    English    73.0%   75.0%   75.8%   73.8%   75.0%   76.2%   76.2%
      Latin    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
    Catalan    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
     French    0.0%    0.0%    33.3%   20.0%   22.2%   22.2%   22.2%
     German    22.2%   22.2%   25.0%   22.2%   20.0%   20.0%   20.0%
    Italian    8.0%    9.1%    9.1%    10.0%   11.8%   11.8%   11.1%
 Portuguese    4.7%    4.8%    4.8%    5.3%    5.0%    5.3%    5.7%

Performance details for the 3K models are here (details for larger models are similar in terms of which language models perform the most poorly):

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     83.2%   83.2%   83.2%   83.2%   83.2%  519     432     87
    Spanish     96.3%   91.6%   87.3%   84.7%   99.8%  476     403     1
    English     73.7%   73.0%   72.3%   71.9%   74.2%  32      23      8
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       7
     German     15.2%   22.2%   41.7%  100.0%   12.5%  1       1       7
    Italian      5.2%    8.0%   17.9%  100.0%    4.2%  1       1       23
 Portuguese      3.0%    4.7%   10.9%  100.0%    2.4%  1       1       41
               f0.5    f1      f2      recall  prec    total   hits    misses

Italian, Portuguese, French, and German all do very poorly, with too many false positives.

As noted above, Arabic and Japanese are present in the larger sample, and as our models for these languages are high accuracy, I’ve included them.

The final language set is Spanish, English, Russian, Chinese, Arabic, and Japanese. As above, 3K is not the optimal model size—my current unsupported hypothesis is that 3K isn’t the best here because there are really only two languages in contention. The 3K results are shown below along with the best performing model sizes:

 model size    1500    2000    2500    3000    9000    10000
      TOTAL    96.5%   96.1%   96.0%   95.8%   96.0%   96.1%
    Spanish    98.7%   98.5%   98.4%   98.2%   98.3%   98.4%
    English    78.9%   76.3%   75.3%   76.5%   76.9%   77.9%
      Latin    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
    Catalan    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
     French    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
     German    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Italian    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Portuguese    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

However, the accuracy is very high, and the differences are not huge, so it makes sense to stick with the default 3K models for now, but I'll continue to keep an eye out for significant performance improvements with other model sizes when working with other corpora.

The detailed report for the 3K model is here*:

[* I inadvertently forgot to include "Guarani" as a known language, so the eswiki totals were off by one (519 instead of 520). Since we don't have a Guarani language detector, the Guarani query is of course identified incorrectly, slightly lowering the overall score, but not really changing the final recommendations. The report below is corrected; those above are not.]

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    95.6%   95.6%   95.6%   95.6%   95.6%  520     497     23
    Spanish    98.8%   98.2%   97.6%   97.3%   99.1%  476     463     4
    English    66.8%   75.6%   87.1%   96.9%   62.0%  32      31      19
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian   100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese   100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

Recall went up and precision went down for Spanish and English, but overall performance improved. Queries in unrepresented languages were all identified as either Spanish or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

eswiki: Best Options

Non-optimal settings for eswiki (while being consistent with other wikis using 3K models), based on these experiments, would be to use models for Spanish, English, Russian, Chinese, Arabic, Japanese (es, en, ru, zh, ar, ja), using the default 3000-ngram models.

Italian Results

About 15% of the original 10K corpus was removed in the initial filtering. A 1,600-query random sample was taken, and 65% of those queries were discarded, leaving a 550-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.

Other languages searched on itwiki

Based on the sample of 550 poor-performing queries on itwiki that are in some language, about 75% are in Italian, 20% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for itwiki, with raw counts, percentage, and 95% margin of error.

count	lg	%	+/-
404	it	73.45%	3.69%
109	en	19.82%	3.33%
8	es	1.45%	1.00%
6	de	1.09%	0.87%
4	la	0.73%	0.71%
4	fr	0.73%	0.71%
3	ru	0.55%	0.62%
3	ro	0.55%	0.62%
3	pt	0.55%	0.62%
3	ar	0.55%	0.62%
1	zh	0.18%	0.36%
1	pl	0.18%	0.36%
1	cs	0.18%	0.36%

In order, those are Italian, English, Spanish, German, Latin, French, Russian, Romanian, Portuguese, Arabic, Chinese, Polish, Czech.

We don’t have query-trained language models for all of the languages represented here, such as Latin and Romanian. Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,533 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek and Japanese queries, and one each for Korean and Bengali, and one for Punjabi (for which we do not have a model).

Analysis and Optimization

Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:

 model size    3000    3500    4000    4500    10000
      TOTAL    83.5%   84.0%   84.0%   83.8%   84.0%
    Italian    93.1%   93.4%   93.4%   93.8%   93.8%
    English    80.9%   82.1%   80.4%   77.8%   77.4%
    Spanish    33.3%   28.6%   22.9%   27.8%   32.4%
     German    46.2%   46.2%   42.9%   40.0%   41.4%
     French    20.7%   22.2%   32.0%   32.0%   33.3%
      Latin    0.0%    0.0%    0.0%    0.0%    0.0%
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%
 Portuguese    19.0%   20.0%   30.0%   31.6%   27.3%
   Romanian    0.0%    0.0%    0.0%    0.0%    0.0%
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%
      Czech    25.0%   25.0%   0.0%    0.0%    0.0%
     Polish    50.0%   50.0%   66.7%   66.7%   66.7%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     83.5%   83.5%   83.5%   83.5%   83.5%  550     459     91
    Italian     96.2%   93.1%   90.2%   88.4%   98.3%  404     357     6
    English     89.4%   80.9%   73.8%   69.7%   96.2%  109     76      3
    Spanish     25.0%   33.3%   50.0%   75.0%   21.4%  8       6       22
     German     34.9%   46.2%   68.2%  100.0%   30.0%  6       6       14
     French     14.4%   20.7%   36.6%   75.0%   12.0%  4       3       22
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese     13.3%   19.0%   33.3%   66.7%   11.1%  3       2       16
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech     17.2%   25.0%   45.5%  100.0%   14.3%  1       1       6
     Polish     38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
               f0.5    f1      f2      recall  prec    total   hits    misses

Spanish, German, French, and Portuguese all do very poorly, with too many false positives. Czech and Polish aren’t terrible in terms of raw false positives, but aren’t great, either.

As noted above, Greek, Japanese, and Korean are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them. I did not include Bengali because it hasn't been well tested at this point.

The final language set is Italian, English, Russian, Arabic, Chinese, Japanese, Greek, and Korean. As above, 3K is not the optimal model size, but it is within 0.2%. The 3K results are shown below along with the best performing model sizes:

 model size    3000    3500    4000    4500    10000
      TOTAL    92.2%   92.4%   92.2%   91.8%   92.2%
    Italian    96.7%   96.9%   96.6%   96.4%   96.6%
    English    87.3%   87.8%   87.8%   86.8%   87.7%
    Spanish    0.0%    0.0%    0.0%    0.0%    0.0%
     German    0.0%    0.0%    0.0%    0.0%    0.0%
     French    0.0%    0.0%    0.0%    0.0%    0.0%
      Latin    0.0%    0.0%    0.0%    0.0%    0.0%
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%
 Portuguese    0.0%    0.0%    0.0%    0.0%    0.0%
   Romanian    0.0%    0.0%    0.0%    0.0%    0.0%
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%
      Czech    0.0%    0.0%    0.0%    0.0%    0.0%
     Polish    0.0%    0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.

The detailed report for the 3K model is here:

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     92.2%   92.2%   92.2%   92.2%   92.2%  550     507     43
    Italian     95.4%   96.7%   98.1%   99.0%   94.6%  404     400     23
    English     84.9%   87.3%   89.9%   91.7%   83.3%  109     100     20
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               f0.5    f1      f2      recall  prec    total   hits    misses

Recall went up and precision went down for Italian and English, but overall performance improved. Queries in unrepresented languages were all identified as either Italian or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

itwiki: Best Options

The barely sub-optimal settings (though consistent with others using 3K models) for itwiki, based on these experiments, would be to use models for Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean (it, en, ru, ar, zh, ja, el, ko), using the default 3000-ngram models.

German Results

About 6% of the original 10K corpus was removed in the initial filtering. A 1,400-query random sample was taken, and ~63% of those queries were discarded, leaving a 520-query corpus. Thus only about 35% of poor-performing queries are in an identifiable language.

Other languages searched on dewiki

Based on the sample of 520 poor-performing queries on dewiki that are in some language, about 70% are in German, about 25% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for dewiki, with raw counts, percentage, and 95% margin of error.

count	lg	%	+/-
360	de	69.23%	3.97%
123	en	23.65%	3.65%
8	la	1.54%	1.06%
8	it	1.54%	1.06%
8	es	1.54%	1.06%
5	fr	0.96%	0.84%
2	zh	0.38%	0.53%
2	pl	0.38%	0.53%
1	vi	0.19%	0.38%
1	tr	0.19%	0.38%
1	sv	0.19%	0.38%
1	nl	0.19%	0.38%

In order, those are German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Turkish, Swedish, Dutch.

We don’t have query-trained language models for all of the languages represented here, in particular Latin. Since it represents a very small slice of our corpus (8 queries), we aren’t going to worry about it, and accept that it will not be detected correctly.

Looking at the larger corpus of 9,439 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese queries, and one Odia query (for which we do not have a model).

Analysis and Optimization

Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:

 model size    2500    3000    3500    4000    6000    7000    8000    9000    10000
      TOTAL    74.6%   74.4%   74.6%   75.3%   75.7%   76.3%   76.9%   77.0%   77.6%
     German    88.9%   88.9%   89.2%   89.5%   90.0%   90.5%   90.9%   90.9%   91.2%
    English    73.8%   74.5%   74.1%   75.7%   74.9%   74.0%   74.0%   73.8%   74.1%
    Italian    25.0%   40.0%   40.0%   38.5%   37.0%   42.9%   46.2%   48.0%   51.9%
      Latin    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
    Spanish    50.0%   48.3%   50.0%   46.2%   43.5%   45.5%   47.6%   47.6%   50.0%
     French    34.8%   36.4%   34.8%   33.3%   32.0%   32.0%   33.3%   32.0%   32.0%
    Chinese    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
      Dutch    6.9%    5.9%    6.2%    6.2%    7.1%    7.4%    7.1%    6.9%    7.1%
     Polish    25.0%   25.0%   22.2%   25.0%   28.6%   28.6%   22.2%   22.2%   25.0%
    Swedish    5.0%    0.0%    0.0%    0.0%    0.0%    0.0%    5.7%    5.9%    5.9%
    Turkish    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    22.2%   22.2%
 Vietnamese    40.0%   50.0%   50.0%   50.0%   66.7%   100.0%  100.0%  100.0%  100.0%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     74.5%   74.4%   74.3%   74.2%   74.5%  520     386     132
     German     94.5%   88.9%   83.9%   80.8%   98.6%  360     291     4
    English     85.6%   74.5%   66.0%   61.3%   95.0%  124     76      4
    Italian     32.9%   40.0%   51.0%   62.5%   29.4%  8       5       12
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish     38.0%   48.3%   66.0%   87.5%   33.3%  8       7       14
     French     27.4%   36.4%   54.1%   80.0%   23.5%  5       4       13
    Chinese      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch      3.8%    5.9%   13.5%  100.0%    3.0%  1       1       32
     Polish     17.2%   25.0%   45.5%  100.0%   14.3%  1       1       6
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       38
    Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       7
 Vietnamese     38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
               f0.5    f1      f2      recall  prec    total   hits    misses

Dutch and Swedish do very poorly, with too many false positives. Italian, Spanish, French, Polish, Turkish, and Vietnamese aren’t terrible in terms of raw false positives, but aren’t great, either.
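
The columns of the detailed report can be reproduced from the raw counts if "misses" is read as false positives (queries wrongly assigned to that language). That reading is an inference from the numbers, not something stated by the reporting tool. Under it, precision = hits / (hits + misses), recall = hits / total, and F-beta combines the two. A minimal Python sketch, with a hypothetical function name:

```python
def f_scores(total, hits, misses):
    """Return (precision, recall, {beta: F_beta}) from report-style counts.

    Assumption: `misses` counts false positives, `total` is the number of
    queries actually in the language, and `hits` counts true positives.
    """
    precision = hits / (hits + misses) if hits + misses else 0.0
    recall = hits / total if total else 0.0
    scores = {}
    for beta in (0.5, 1.0, 2.0):
        denom = beta**2 * precision + recall
        scores[beta] = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, scores

# German and Dutch rows from the 3K report above: (total, hits, misses).
for name, row in [("German", (360, 291, 4)), ("Dutch", (1, 1, 32))]:
    precision, recall, scores = f_scores(*row)
    print(name, f"prec={precision:.1%} recall={recall:.1%}",
          " ".join(f"f{b:g}={s:.1%}" for b, s in scores.items()))
```

The Dutch row shows why a model can hurt overall results: with one real Dutch query and 32 false positives, recall is 100% but precision is about 3%, so every F-score for Dutch stays low.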

As noted above, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.

The final language set is German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese. As above, 3K is not the optimal model size, but it is within 0.7 percentage points of the best result (88.2% vs. 88.9% at 9K). The 3K results are shown below along with the best performing model sizes:

               3000   4000    4500    5000    9000    10000
      TOTAL    88.2%  88.7%   88.7%   88.7%   88.9%   88.7%
     German    94.8%  95.3%   95.2%   95.4%   95.5%   95.2%
    English    81.9%  82.8%   83.2%   82.7%   83.3%   83.2%
    Italian    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
      Latin    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
    Spanish    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
     French    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
    Chinese    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
      Dutch    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
     Polish    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
    Swedish    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
    Turkish    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%
 Vietnamese    0.0%   0.0%    0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.

The detailed report for the 3K model is here:

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     88.2%   88.2%   88.1%   88.1%   88.2%  520     458     61
     German     93.9%   94.8%   95.8%   96.4%   93.3%  360     347     25
    English     77.9%   81.9%   86.3%   89.5%   75.5%  124     111     36
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               f0.5    f1      f2      recall  prec    total   hits    misses

Recall went up and precision went down for German and English, but overall performance improved. Queries in unrepresented languages were all identified as either German or English (decreasing precision for both), but the now-unused models are no longer generating lots of false positives and bringing down precision overall.

Observations

Interestingly, even though Chinese was enabled, neither of the two queries in Chinese was tagged as such. Generally, Chinese is a relatively high-accuracy identifier. In this case, the queries include a random string of numbers and Latin letters, and one includes ".html". They also include a number of less common Chinese characters. As a result, the less common Chinese characters get the same score from the Chinese and German language detectors (the maximum penalty for an "unknown" character), and the individual Latin letters score well in German, while the known Chinese characters score less well in Chinese. The Chinese model includes not only individual characters, but also bigrams and larger n-grams, so there aren't even 3,000 singleton Chinese characters in the model.
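
To illustrate the tie described above, here is a toy Python sketch of rank-order ("out-of-place") n-gram scoring in the style of the classic TextCat algorithm. The tiny training strings, the penalty value, and all function names are assumptions made for illustration; the production implementation and its models differ in details such as tokenization and padding.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield every character n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_model(training_text, size):
    """Map the `size` most frequent n-grams to their frequency rank."""
    ranked = [g for g, _ in Counter(ngrams(training_text)).most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place_cost(query, model, max_penalty):
    """Sum of rank differences; n-grams missing from the model cost max_penalty."""
    query_ranked = [g for g, _ in Counter(ngrams(query)).most_common()]
    cost = 0
    for query_rank, gram in enumerate(query_ranked):
        if gram in model:
            cost += abs(query_rank - model[gram])
        else:
            cost += max_penalty
    return cost

# Toy stand-ins for the German and Chinese models (lower cost = better match).
german = build_model("der die das und ist nicht html", size=3000)
chinese = build_model("中文维基百科中国人", size=3000)

rare = "龘靐"  # characters that appear in neither toy model
print(out_of_place_cost(rare, german, 3000), out_of_place_cost(rare, chinese, 3000))
print(out_of_place_cost("html", german, 3000), out_of_place_cost("html", chinese, 3000))
```

In this sketch the rare characters cost the same under both models, so they cannot separate the two languages, while the Latin letters are cheap only under the German model. That is the same mechanism that tipped the two mixed queries toward German.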

The Chinese model did a better job on the other Chinese examples in the larger un-tagged dewiki sample.

dewiki: Best Options

The barely sub-optimal settings (though consistent with others using 3K models) for dewiki, based on these experiments, would be to use models for German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese (de, en, zh, el, ru, ar, hi, th, ko, ja), using the default 3000-ngram models.

Next Up

English re-do, Russian, Japanese, Portuguese, and Indonesian (See T138315)
Others, if we continue. (See T121541)

Other thoughts

It would be easy to build high-accuracy identifiers for languages that have unique character sets—or at least character sets that are effectively unique in practice. For example, Yiddish can be written with the Hebrew alphabet, but on most wikis, we'd expect most Hebrew characters to actually be Hebrew (and identifying Yiddish as Hebrew is better than identifying it as anything else other than Yiddish). Similarly, Persian and Arabic writing share many characters, but on frwiki, for example, we only see Arabic. Cherokee is rare, but examples have shown up in our samples. Korean, Armenian, Hebrew, Greek, Georgian, Thai, and others could be used on most wikis because they are low risk, if the run-time cost of enabling them is not too high.
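
As a rough illustration of the idea, the Python sketch below guesses a language purely from the Unicode script of a query's letters, using the standard `unicodedata` module. The script-to-language mapping, the majority threshold, and the function name are assumptions for illustration, not an existing TextCat feature.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to the language we would be willing to guess for that script.
SCRIPT_TO_LANG = {
    "HANGUL": "ko", "ARMENIAN": "hy", "HEBREW": "he", "GREEK": "el",
    "GEORGIAN": "ka", "THAI": "th", "CHEROKEE": "chr",
}

def detect_by_script(query):
    """Return a language code if most letters belong to one mapped script."""
    letters = [ch for ch in query if ch.isalpha()]
    votes = Counter(
        SCRIPT_TO_LANG.get(unicodedata.name(ch, "").split(" ")[0])
        for ch in letters
    )
    votes.pop(None, None)  # drop letters from unmapped scripts (e.g. Latin)
    if not votes:
        return None
    lang, count = votes.most_common(1)[0]
    return lang if count > len(letters) / 2 else None

for query in ["שלום", "Ελλάδα", "한국어 wiki", "Wikipedia"]:
    print(query, detect_by_script(query))
```

A detector like this would label any Hebrew-script query as Hebrew, including Yiddish, which matches the trade-off described above.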