User:TJones (WMF)/Notes/Favoring Recall in Language Identification

May 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T134431)

TL;DR
I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.

Introduction
Unfortunately language detection generally becomes more difficult as strings become shorter—ambiguity increases (see “a” in Wiktionary for an extreme example) and language-level statistics on letters and letter combinations are less reliable because of the small sample size available in only a few words. Thus deciding what to do about the tradeoff between recall and precision has a sizable effect on the language identification results when working with queries.

In my various analyses related to Language Identification I have been favoring precision (F0.5) over recall (F2—note that F1 is balanced between the two). My thought was that it is better to give fewer correct answers than to make lots of “silly” mistakes. However, the contrary philosophy, favoring recall, is also reasonable: i.e., provide enough answers and there’s a better chance you’ll provide something that’s useful.

In the context of the performance of language detection on the annotated query sets I’ve been using, overall recall and precision are tightly coupled when only one language is allowed per query. This is because any false positive for one language is a false negative for another language. Recall and precision for individual languages can vary wildly (and they do!), but overall recall and precision are the same, except for occasional rounding errors and very rare cases where no language is detected (resulting in a penalty to recall but not precision). Similarly, F0.5, F1, and F2 are approximately the same at the overall level because a weighted average of two nearly identical values still has to be between the two values.
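To make the last point concrete, here is a minimal sketch of the usual F-beta formula (the weighted harmonic mean of precision and recall). The function name `f_beta` is my own for illustration; the input values are the overall precision (84.6%) and recall (85.0%) reported for the French "Ignore Home Language" option in the results below.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision (F0.5), beta > 1 favors recall (F2),
    and beta == 1 weights the two equally (F1).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall precision and recall from a one-language-per-query run.
precision, recall = 0.846, 0.850

# F0.5, F1, and F2 computed from the same pair of values.
scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}

for beta, value in scores.items():
    print(f"F{beta:g}: {value:.1%}")
```

Because each score is a weighted mean of two nearly identical numbers, all three land between 84.6% and 85.0%, whatever the weighting.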

Approaches
I’m investigating two main approaches for improving recall. The first, suggested by David, is to ignore the language of the wiki. The second is to provide more than one language as a result for the language detected.

Ignoring the Language of the Wiki
Even though we are generally looking at doing language detection only on poor-performing queries (e.g., those that get fewer than three results), most of the queries that are in a language are in the language of the wiki we are looking at. That is, for example, most of the poor-performing queries on French Wikipedia are still in French, and it’s the same for the other wikis I’ve looked at.

In order to improve precision, I’ve always included the language of the wiki I’m looking at among the language options for a given wiki. David pointed out that we obviously aren’t going to get many results (fewer than three in fact) in the language of the wiki we are on, so we might as well ignore the language of the wiki we are on and look for results elsewhere—that is, worry less about precision (i.e., about avoiding “silly” results), and focus more on recall (i.e., offering some sort of result, because no result is not very helpful).

Of course, if about 70% of poor-performing queries on French Wikipedia are in French, then most of our language detection guesses will be wrong if French is not among the possible answers. However, we have a better chance of finding something relevant on another wiki if we are looking.

Returning Multiple Languages
Another approach to improve recall is to give more answers. At a ridiculous extreme, we could return every known language as an answer for every query. The right answer would be in there, but it wouldn’t be very helpful. However, returning two or three languages (i.e., including the language detector’s second or third choice option if provided) is more manageable.

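TextCat, for example, returns the best-matching language, plus any alternatives that are, by default, within 5% of the best match. Returning two results improves the chances that at least one of them is right, though it does not literally double them, since the second choice is usually less likely to be correct than the first.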
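As a rough illustration of the "within 5% of the best match" rule, the sketch below picks candidates from a dictionary of per-language scores. This is not TextCat's actual code or API: the function name, the `ratio` and `max_results` parameters, and the sample scores are all hypothetical, and I am assuming the scores are distances where lower is better.

```python
def candidates_within_ratio(scores, ratio=1.05, max_results=None):
    """Return languages whose score is within `ratio` of the best score.

    Assumption: `scores` maps language code -> distance (lower is better).
    """
    if not scores:
        return []
    # Best (lowest) score first.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best = ranked[0][1]
    picked = [lang for lang, cost in ranked if cost <= best * ratio]
    if max_results is not None:
        picked = picked[:max_results]
    return picked


# Made-up scores for one query: English is within 5% of French, Spanish is not.
example_scores = {"fr": 1000, "en": 1040, "es": 1100}

print(candidates_within_ratio(example_scores))
print(candidates_within_ratio(example_scores, max_results=1))
```

Capping `max_results` at 1 reproduces the single-answer behavior, while leaving it unset allows the second (or third) choice through whenever it is close enough to the best one.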

Returning multiple languages is terrible for precision, of course. If we give two languages as results for every query, then many more may be “silly”, and at least half of them will be wrong—and so precision would max out at 50%. In practice, not every query will get multiple results (despite the difficulties overall, some queries are actually pretty easy to get right), so precision higher than 50% is possible.

Another option is to tune individual languages based on position in the returned list. On the German Wikipedia data, for example, detecting Polish may be 85% accurate when Polish is the first option, but only 3% accurate when it is the second option, in which case it makes sense to only accept Polish as an answer when it is the first option returned. English, on the other hand, may be 85% accurate in the first position, but still 80% accurate in the third position, in which case it makes sense to consider English results even in the third position. This is of course more complex to optimize for and to implement than just returning a fixed number of results.
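The per-language position idea can be sketched as a simple filter over the ranked list. The names below (`apply_position_thresholds`, `max_position`) and the threshold values are illustrative assumptions, not the configuration used in the analysis.

```python
def apply_position_thresholds(ranked_langs, max_position):
    """Keep a language only if it appears at or above its allowed position.

    ranked_langs: detector output, best match first.
    max_position: language code -> deepest 1-based position at which that
                  language is still accepted; languages not listed are
                  never accepted.
    """
    return [
        lang
        for position, lang in enumerate(ranked_langs, start=1)
        if position <= max_position.get(lang, 0)
    ]


# Hypothetical thresholds: Polish only as first choice, English down to third.
thresholds = {"de": 1, "pl": 1, "en": 3}

print(apply_position_thresholds(["pl", "en"], thresholds))
print(apply_position_thresholds(["de", "pl", "en"], thresholds))
```

In the second call Polish is dropped because it is only the second choice, while English survives in third place.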

Note that with multiple lang-ID results per query, recall and precision are no longer tightly coupled. It’s possible for any given query to get no answer (false negative), one right answer (true positive), or one wrong answer (false positive and false negative), or two wrong answers (two false positives and one false negative), or one right and one wrong (true positive and false positive).
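The bookkeeping for these cases can be sketched as follows, assuming each query has exactly one correct language. The function names and the sample data are made up for illustration.

```python
def score_query(actual, predicted):
    """Return (true positives, false positives, false negatives) for one query.

    actual: the single correct language for the query.
    predicted: the (possibly empty) list of languages the detector returned.
    """
    predicted = set(predicted)
    tp = 1 if actual in predicted else 0
    fp = len(predicted) - tp   # every returned language that is not the right one
    fn = 1 - tp                # the right language was not returned
    return tp, fp, fn


def precision_recall(pairs):
    """Aggregate precision and recall over (actual, predicted) pairs."""
    tp = fp = fn = 0
    for actual, predicted in pairs:
        q_tp, q_fp, q_fn = score_query(actual, predicted)
        tp, fp, fn = tp + q_tp, fp + q_fp, fn + q_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Three made-up queries; two of them get an extra, wrong language.
sample = [("en", ["en", "fr"]), ("pl", ["pl"]), ("ru", ["ru", "uk"])]
print(precision_recall(sample))
```

In the sample every query includes its correct language, so recall is perfect, but the two extra languages count as false positives and pull precision down.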

Combining Both Approaches
Of course, it is possible to combine both approaches—ignoring the language of the wiki and returning multiple results.

The combination also allows yet another permutation: include but ignore the language of the wiki while allowing multiple results.

For example, on the German Wikipedia, as in the example above, we might have decided that Polish is only allowed as the first result, while English is allowed as the first or second result.

If we include German among the languages being detected, we have two possibilities when both German and Polish are detected: either Polish is first, or German is first. In this case, we might say that if the query looks more Polish than German, we will treat it as Polish. But if it looks more German than Polish, we will ignore the Polish result, even though we are ignoring all German results, too. On the other hand, we’ll consider English whether it comes first or second to German.

Another possible outcome is that the matching on German is so good that no other languages are considered reasonable alternatives (i.e., none scores within 5% of German), resulting in no language detection for a given query.
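The German Wikipedia example above can be sketched like this. The home language takes part in detection and occupies a position in the ranked list, but is never reported. The function name and threshold values are hypothetical, chosen to match the example (Polish first only, English first or second).

```python
def detect_but_ignore_home(ranked_langs, home, max_position):
    """Drop the home language, but let it hold its place in the ranking.

    Positions are counted in the original list (home language included),
    so a language ranked behind the home language is judged at that lower
    position.
    """
    return [
        lang
        for position, lang in enumerate(ranked_langs, start=1)
        if lang != home and position <= max_position.get(lang, 0)
    ]


# Hypothetical per-language thresholds for German Wikipedia.
thresholds = {"pl": 1, "en": 2}

print(detect_but_ignore_home(["pl", "de"], "de", thresholds))  # Polish beats German
print(detect_but_ignore_home(["de", "pl"], "de", thresholds))  # German beats Polish
print(detect_but_ignore_home(["de", "en"], "de", thresholds))  # English second to German
print(detect_but_ignore_home(["de"], "de", thresholds))        # only German matched
```

The second and fourth calls return an empty list, which corresponds to the "no language detection" outcome described above.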

Tangent on User Interface
I think the way the language detection information is used and presented affects and is affected by whether we favor recall or precision. If we have high-precision results and only one language detected, it may make more sense to just provide the cross-wiki results right on the wiki page.

On the other hand, if we are maximizing recall and providing, say, up to three language detection results per query, it might make more sense to only provide a link that says something like, “Would you like to see results on Spanish Wikipedia, French Wikipedia, or Portuguese Wikipedia?”

“Silly” language detection results may be more tolerable to users when they only result in an extra link or two on the results page for poor-performing queries.

Some users may find “silly” results confusing in all cases, and some may never mind them. We’ll definitely need to consult the user community and try out various options as A/B tests before coming to a final decision on how best to show results.

Overview of Options
In the analysis that follows, I’m going to consider 7 (!) options for each wiki (though it may turn out that there isn’t much difference among them):

1) Ignore Home Language: Ignore the language of the wiki we are on when doing language detection to increase the chances of finding something on another wiki. We will only allow one result per query. We will note but not consider as errors misidentification on queries in the language of the wiki. (e.g., on German Wikipedia, a query in German identified as English isn’t counted as an error, though we will note how often it happens.)

2) Allow Multiple Lang-ID Results: Even though with multiple results, all but one of them must be wrong, the chances of getting that one correct result increase when we allow more results. Precision will take a hit, but we’ll pay more attention to recall using F2. This includes the language of the wiki.

3) Allow Multiple Lang-ID Results, with per-Language Thresholds: Allow multiple lang-ID results per query, but limit whether languages are considered based on their position within the results. (e.g., Polish counts if it is the best result, but not the second best, while English counts in either place.) This includes the language of the wiki.

4) Allow Multiple Lang-ID Results, Ignoring Home Language: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection. We will note but not consider as errors misidentification on queries in the language of the wiki.

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection, and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language: Allow multiple lang-ID results per query, including the language of the wiki. However, ignore results in the language of the wiki for the purposes of calculating recall and precision, which may result in “no result” or in other languages being pushed down a position in the ranked results. We will note but not consider as errors misidentification on queries in the language of the wiki.

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, including the language of the wiki—however, ignore results in the language of the wiki for the purposes of calculating recall and precision—and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.

Apples & Oranges
In the analysis summaries, the results are divided into those that include the “home” language of the wiki, and those that don’t. These can’t necessarily be compared directly to each other. The home language is always the largest category of queries, and removing it takes away the best source of correct language identification, and can drastically change the interactions among the remaining languages, especially if the second most common language is not as dominant over the remainder as the home language is over all the others. Another factor, for these samples, is that the non-home language sample size can be smaller than I’d like (esp. for Spanish, which is ridiculously small!). The samples were taken to reach a target of 500+ annotated queries, but not with any minimum non-home language sample size.

Coverage vs Recall
Another way of looking at the problem is as one of “coverage” rather than recall. In this sense, coverage indicates the number of queries that return some language that can be used for cross-wiki searching, even if it isn’t the correct one. This could be called a desperate attempt to return anything at all. In the options above, (1), (4), and (5) have at least 99% coverage: some result that is not the home wiki language is returned for almost all queries. If coverage rather than recall is important, we can choose the option from among (1), (4), and (5) with the best F2 score.
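Coverage in this sense is easy to compute from the detector output alone, since it ignores whether the returned language is correct. A minimal sketch, with a hypothetical function name and made-up results:

```python
def coverage(results_per_query, home):
    """Fraction of queries with at least one non-home language returned.

    results_per_query: one list of detected language codes per query.
    home: language code of the wiki the query was issued on.
    """
    if not results_per_query:
        return 0.0
    covered = sum(
        1
        for langs in results_per_query
        if any(lang != home for lang in langs)
    )
    return covered / len(results_per_query)


# Four made-up queries on French Wikipedia: two offer a usable other wiki.
detected = [["en"], ["fr"], ["fr", "es"], []]
print(coverage(detected, home="fr"))
```

Note that no annotated "correct" language is needed here, which is what separates coverage from recall.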

“Extra” Language Models
For the present analysis I’m not including the “extra” language models that I included in the precision-favoring analysis. These were high-accuracy models for queries that were found in the larger sample for each wiki, but not in the hand-coded sample used for optimization. They could and should be included (after a quick check that they don’t cause unexpected problems) if any recommendations are taken from this analysis.

Results
Compare to precision-favoring results.

French
0) Precision-Favoring Results—F2: 89.1%

1) Ignore Home Language—F2: 84.9%

The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         84.7%   84.8%   84.9%   85.0%   84.6%   213    181   33
English       91.0%   88.8%   86.6%   85.2%   92.6%   88     75    6
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese    70.3%   72.0%   73.8%   75.0%   69.2%   12     9     4
German        52.6%   62.5%   76.9%   90.9%   47.6%   11     10    11
Spanish       54.1%   61.5%   71.4%   80.0%   50.0%   10     8     8
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Corsican      0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        38.5%   50.0%   71.4%   100.0%  33.3%   2      2     4
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (221), Spanish (100), German (74), Portuguese (68), Polish (5)

2) Allow Multiple Lang-ID Results—F2: 89.1%

The best language set, with a threshold of 1 language, is French, English, Arabic, Russian, Chinese, Thai, Armenian. (fr, en, ar, ru, zh, th, hy)

This is the same result as (0), because the optimal threshold is 1.

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         89.0%   89.1%   89.1%   89.1%   89.0%   681    607   75
French        94.8%   95.1%   95.5%   95.7%   94.5%   468    448   26
English       67.0%   74.9%   84.9%   93.2%   62.6%   88     82    49
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese    0.0%    0.0%    0.0%    0.0%    0.0%    12     0     0
German        0.0%    0.0%    0.0%    0.0%    0.0%    11     0     0
Spanish       0.0%    0.0%    0.0%    0.0%    0.0%    10     0     0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Corsican      0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (20)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.7%

The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         4       79.3%   84.2%   89.7%   93.8%   76.3%   681    639   198
French        4       94.1%   94.7%   95.3%   95.7%   93.7%   468    448   30
English       3       61.4%   70.1%   81.8%   92.0%   56.6%   88     81    62
Arabic        1       100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese    2       37.2%   47.8%   67.1%   91.7%   32.4%   12     11    23
German        1       43.7%   52.9%   67.2%   81.8%   39.1%   11     9     14
Spanish       2       20.0%   28.6%   50.0%   100.0%  16.7%   10     10    50
Russian       1       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       1       93.8%   85.7%   78.9%   75.0%   100.0%  4      3     0
Dutch         1       18.2%   25.0%   40.0%   66.7%   15.4%   3      2     11
Corsican      -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        1       23.8%   33.3%   55.6%   100.0%  20.0%   2      2     8
Armenian      1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (50), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 84.9%

The best language set, with a threshold of 1 language, is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)

Note that this is the same as (1) above since the threshold was just one language. Allowing 3 languages offered the same F2 score (to one decimal place), with moderately higher recall and a good deal lower precision (the unbalanced trade-off being the nature of the weighted F2 measure).

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL (213)   1       84.7%   84.8%   84.9%   85.0%   84.6%   213    181   33
              2       72.1%   77.7%   84.2%   89.2%   68.8%   213    190   86
              3       69.6%   76.5%   84.9%   91.5%   65.7%   213    195   102

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 88.1%

The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy). Thresholds are shown in the table below.

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         3       80.2%   84.0%   88.1%   91.1%   77.9%   213    194   55
English       3       84.7%   88.4%   92.5%   95.5%   82.4%   88     84    18
Arabic        1       100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese    2       59.8%   68.7%   80.9%   91.7%   55.0%   12     11    9
German        1       52.6%   62.5%   76.9%   90.9%   47.6%   11     10    11
Spanish       2       49.0%   60.6%   79.4%   100.0%  43.5%   10     10    13
Russian       1       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       1       100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch         1       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Corsican      -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        1       38.5%   50.0%   71.4%   100.0%  33.3%   2      2     4
Armenian      1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (357), Spanish (184), Portuguese (161), German (74), Polish (5)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 83.7%

The best language set, with a threshold of 2 languages, is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, th, hy)

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         70.6%   76.6%   83.7%   89.2%   67.1%   213    190   93
English       86.5%   88.5%   90.6%   92.0%   85.3%   88     81    14
Arabic        98.8%   99.2%   99.7%   100.0%  98.5%   66     66    1
Portuguese    62.5%   71.0%   82.1%   91.7%   57.9%   12     11    8
German        26.1%   36.1%   58.5%   100.0%  22.0%   11     11    39
Spanish       51.0%   62.5%   80.6%   100.0%  45.5%   10     10    12
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Corsican      0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        0.0%    0.0%    0.0%    0.0%    0.0%    0      0     19

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (45), Spanish (38), German (21), Portuguese (18)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 87.8%

The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         4       81.3%   84.4%   87.8%   90.1%   79.3%   213    192   50
English       4       86.9%   89.1%   91.5%   93.2%   85.4%   88     82    14
Arabic        1       100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese    2       65.5%   73.3%   83.3%   91.7%   61.1%   12     11    7
German        1       57.0%   64.3%   73.8%   81.8%   52.9%   11     9     8
Spanish       2       51.0%   62.5%   80.6%   100.0%  45.5%   10     10    12
Russian       1       100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese       1       93.8%   85.7%   78.9%   75.0%   100.0%  4      3     0
Dutch         1       32.3%   40.0%   52.6%   66.7%   28.6%   3      2     5
Corsican      -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian       -       0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish        1       38.5%   50.0%   71.4%   100.0%  33.3%   2      2     4
Armenian      1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton        -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic     -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin         -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai          1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (56), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)

Summary
Configurations that include reporting French (n = 681 samples), by F2:

Configurations that ignore French (n = 213), by F2:

Spanish
0) Precision-Favoring Results—F2: 95.6%

1) Ignore Home Language—F2: 79.5%

The best language set is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         79.5%   79.5%   79.5%   79.5%   79.5%   44     35    9
English       86.1%   89.9%   93.9%   96.9%   83.8%   32     31    6
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Guarani       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese    29.4%   40.0%   62.5%   100.0%  25.0%   1      1     3

The 476 queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (440), English (36)

2) Allow Multiple Lang-ID Results—F2: 96.1%

The best language set, with a threshold of 2 languages, is Spanish, English, Russian, Chinese. (es, en, ru, zh)

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         91.3%   93.7%   96.1%   97.9%   89.8%   520    509   58
Spanish       97.8%   98.4%   99.1%   99.6%   97.3%   476    474   13
English       47.1%   58.7%   78.0%   100.0%  41.6%   32     32    45
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Guarani       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese    0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 96.9%

The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         2       94.8%   95.8%   96.9%   97.7%   94.1%   520    508   32
Spanish       2       97.8%   98.4%   99.1%   99.6%   97.3%   476    474   13
English       1       66.8%   75.6%   87.1%   96.9%   62.0%   32     31    19
Latin         -       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian       1       100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese       1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        1       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German        1       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Guarani       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian       1       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese    1       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (13).

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.5%

The best language set, with a threshold of 1 language, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

This is the same as (1).

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 79.5%

The best language set, with a threshold of 1 for every result, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

This is also the same as (1).

              thresh  f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         1       79.5%   79.5%   79.5%   79.5%   79.5%   44     35    9
English       1       86.1%   89.9%   93.9%   96.9%   83.8%   32     31    6
Latin         -       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian       1       100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese       1       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German        -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Guarani       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian       -       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese    1       29.4%   40.0%   62.5%   100.0%  25.0%   1      1     3

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.3%

The best language set, with a threshold of 1 language, is Spanish, English, Russian, Chinese, Portuguese. (es, en, ru, zh, pt)

              f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL         77.3%   77.3%   77.3%   77.3%   77.3%   44     34    10
English       85.2%   88.2%   91.5%   93.8%   83.3%   32     30    6
Latin         0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German        0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Guarani       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese    38.5%   50.0%   71.4%   100.0%  33.3%   1      1     2
Spanish       0.0%    0.0%    0.0%    0.0%    0.0%    0      0     2

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (46), English (9)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.3%

The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              2   82.5%   81.4%   80.3%   79.5%   83.3%      44      35       7
English            2   85.1%   90.1%   95.8%  100.0%   82.1%      32      32       7
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian            1  100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Catalan            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Chinese            1  100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
German             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Guarani            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Italian            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Portuguese         -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)

Summary
Configurations that include reporting Spanish (n=520), by F2:

Configurations that ignore Spanish (n=44), by F2:

[Note that this sample is very small, probably too small to draw any strong conclusions from.]

Italian
0) Precision-Favoring Results—F2: 92.2%

1) Ignore Home Language—F2: 79.5%

The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese. (en, es, ru, ro, pt, ar, zh)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          79.5%   79.5%   79.5%   79.5%   79.5%     146     116      30
English        89.1%   90.1%   91.1%   91.7%   88.5%     109     100      13
Spanish        51.5%   60.9%   74.5%   87.5%   46.7%       8       7       8
German          0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese     21.3%   28.6%   43.5%   66.7%   18.2%       3       2       9
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (196), English (114), Portuguese (94)
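
The columns in these tables are related in a simple way, and the small Python sketch below spells that out. It assumes, based on the numbers in the tables themselves, that "total" is the number of queries actually in a language, "hits" is the number correctly tagged, and "misses" is the number of queries wrongly tagged as that language (false positives). The function name f_beta is my own and not part of any tool used here.

```python
def f_beta(total, hits, misses, beta):
    """F-beta score from table counts.

    Assumed meanings (inferred from the tables, not from any tool's docs):
      total  - queries that really are in this language
      hits   - queries correctly tagged as this language
      misses - queries wrongly tagged as this language (false positives)
    """
    recall = hits / total if total else 0.0
    tagged = hits + misses
    precision = hits / tagged if tagged else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # beta > 1 weights recall more heavily; beta < 1 weights precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# English row from the Italian "Ignore Home Language" table:
# total=109, hits=100, misses=13
for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(109, 100, 13, beta):.1%}")
```

With those counts, recall is 100/109 and precision is 100/113, and the three scores round to the 89.1%, 90.1%, and 91.1% shown in the English row.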

2) Allow Multiple Lang-ID Results—F2: 92.2%

The best language set, with a threshold of 1 language, is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh)

This is the same result as (0), because the optimal threshold is 1.

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          92.2%   92.2%   92.2%   92.2%   92.2%     550     507      43
Italian        95.4%   96.7%   98.1%   99.0%   94.6%     404     400      23
English        84.9%   87.3%   89.9%   91.7%   83.3%     109     100      20
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
German          0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (27)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 92.2%

The best language set is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh). Thresholds are shown in the table below.

The same F2 score can be had with the thresh for English set to 1, which is then the same as (2) and (0).

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              2   88.6%   90.3%   92.2%   93.5%   87.4%     550     514      74
Italian            1   95.4%   96.7%   98.1%   99.0%   94.6%     404     400      23
English            2   72.2%   80.1%   90.1%   98.2%   67.7%     109     107      51
Spanish            -    0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
German             -    0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic             1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese         -    0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Romanian           -    0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian            1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese            1  100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech              -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (4).

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 80.5%

The best language set, with a threshold of 2 languages, is English, Spanish, Russian, Romanian, Arabic, Chinese. (en, es, ru, ro, ar, zh)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          74.0%   77.1%   80.5%   82.9%   72.0%     146     121      47
English        87.0%   90.6%   94.5%   97.2%   84.8%     109     106      19
Spanish        26.3%   36.4%   58.8%  100.0%   22.2%       8       8      28
German          0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (364), English (245)

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 82.3%

The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese. (en, es, ru, ro, pt, ar, zh). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              3   78.8%   80.5%   82.3%   83.6%   77.7%     146     122      35
English            3   87.6%   91.0%   94.6%   97.2%   85.5%     109     106      18
Spanish            1   51.5%   60.9%   74.5%   87.5%   46.7%       8       7       8
German             1    0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic             1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese         1   21.3%   28.6%   43.5%   66.7%   18.2%       3       2       9
Romanian           -    0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian            1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese            1  100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech              1    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             1    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (222), Spanish (196), Portuguese (94)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.1%

The best language set, with a threshold of 4 languages, is Italian, English, Russian, Arabic, Chinese, Spanish, Portuguese. (it, en, ru, ar, zh, es, pt)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          60.4%   67.8%   77.1%   84.9%   56.4%     146     124      96
English        88.8%   91.8%   95.0%   97.2%   86.9%     109     106      16
Spanish        29.4%   40.0%   62.5%  100.0%   25.0%       8       8      24
German          0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese     14.0%   20.7%   39.5%  100.0%   11.5%       3       3      23
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       0       0      33

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (69), Portuguese (45), English (21)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 82.5%

The best language set is Italian, English, Spanish, Russian, Portuguese, Arabic, Chinese. (it, en, es, ru, pt, ar, zh). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              4   79.6%   81.1%   82.5%   83.6%   78.7%     146     122      33
English            4   88.8%   91.8%   95.0%   97.2%   86.9%     109     106      16
Spanish            1   62.5%   66.7%   71.4%   75.0%   60.0%       8       6       4
German             -    0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Arabic             1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Portuguese         2   22.4%   31.6%   53.6%  100.0%   18.8%       3       3      13
Romanian           -    0.0%    0.0%    0.0%    0.0%    0.0%       3       0       0
Russian            1  100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Chinese            1  100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Czech              1    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             1    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Portuguese (32), English (21), Spanish (20)

Summary
Configurations that include reporting Italian (n=550), by F2:

Configurations that ignore Italian (n=146), by F2:

German
0) Precision-Favoring Results—F2: 88.1%

1) Ignore Home Language—F2: 78.1%

The best language set is English, Italian, Spanish, Chinese. (en, it, es, zh)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          78.1%   78.1%   78.1%   78.1%   78.1%     160     125      35
English        92.1%   91.4%   90.8%   90.3%   92.6%     124     112       9
Italian        39.5%   48.0%   61.2%   75.0%   35.3%       8       6      11
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish        36.5%   46.7%   64.8%   87.5%   31.8%       8       7      15
French          0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (278), Italian (72), Spanish (10)

2) Allow Multiple Lang-ID Results—F2: 88.3%

The best language set, with a threshold of 1 language, is German, English, Chinese. (de, en, zh)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          88.3%   88.3%   88.3%   88.3%   88.3%     520     459      61
German         94.0%   95.0%   96.0%   96.7%   93.3%     360     348      25
English        77.9%   81.9%   86.3%   89.5%   75.5%     124     111      36
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (12)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.1%

The best language set is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              4   78.1%   83.2%   89.1%   93.5%   75.0%     520     486     162
German             4   87.5%   91.4%   95.6%   98.6%   85.1%     360     355      62
English            3   69.6%   77.4%   87.1%   95.2%   65.2%     124     118      63
Italian            1   26.0%   33.3%   46.3%   62.5%   22.7%       8       5      17
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish            1   32.4%   42.4%   61.4%   87.5%   28.0%       8       7      18
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese            1    0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch              -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese         1   38.5%   50.0%   71.4%  100.0%   33.3%       1       1       2

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.1%

The best language set, with a threshold of 3 languages, is English, Spanish, Chinese. (en, es, zh)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          71.8%   75.3%   79.1%   81.9%   69.7%     160     131      57
English        89.2%   92.4%   95.9%   98.4%   87.1%     124     122      18
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish        18.2%   25.9%   44.9%   87.5%   15.2%       8       7      39
French          0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (356), Spanish (114)

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 83.5%

The best language set is English, Italian, Spanish, Chinese. (en, it, es, zh). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              4   77.8%   80.6%   83.5%   85.6%   76.1%     160     137      43
English            3   89.7%   92.8%   96.1%   98.4%   87.8%     124     122      17
Italian            1   39.5%   48.0%   61.2%   75.0%   35.3%       8       6      11
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish            1   36.5%   46.7%   64.8%   87.5%   31.8%       8       7      15
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese            4  100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Dutch              -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese         -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (345), Italian (72), Spanish (10)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 74.7%

The best language set, with a threshold of 2 languages, is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi)

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          58.1%   65.3%   74.7%   82.5%   54.1%     160     132     112
English        91.8%   92.4%   93.1%   93.5%   91.3%     124     116      11
Italian        27.0%   37.2%   59.7%  100.0%   22.9%       8       8      27
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish        29.2%   38.9%   58.3%   87.5%   25.0%       8       7      21
French          0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese     29.4%   40.0%   62.5%  100.0%   25.0%       1       1       3
German          0.0%    0.0%    0.0%    0.0%    0.0%       0       0      50

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (45), Italian (14), Spanish (6), Vietnamese (1)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.8%

The best language set is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi). Thresholds are shown in the table below.

language      thresh    f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL              3   77.6%   79.2%   80.8%   81.9%   76.6%     160     131      40
English            3   90.5%   92.2%   93.9%   95.2%   89.4%     124     118      14
Italian            1   34.7%   41.7%   52.1%   62.5%   31.2%       8       5      11
Latin              -    0.0%    0.0%    0.0%    0.0%    0.0%       8       0       0
Spanish            1   39.8%   50.0%   67.3%   87.5%   35.0%       8       7      13
French             -    0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
Chinese            1    0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Dutch              -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Polish             -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Swedish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Turkish            -    0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Vietnamese         1   38.5%   50.0%   71.4%  100.0%   33.3%       1       1       2

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)

Summary
Configurations that include reporting German (n=520), by F2:

Configurations that ignore German (n=160), by F2:

Discussion
Among configurations that include reporting the home language (0, 2, 3), we see the expected increase in F2 score over the baseline (0), from allowing multiple languages to be reported (2), or optimizing how far down the list to consider each language independently (3). Sometimes there is no difference between the options, but when there is, the order is always the same, and the improvement is minor (< 1.5%).

Option (2) usually results in either maintaining the threshold of one language, or increasing it a bit to 2 or 3.

Option (3) usually results in increasing the threshold for more strongly represented languages (the home language and the second most common), while keeping the others at 1. Thus, for the less frequent languages, it’s best to only accept that answer if it’s the best guess, while for the more common languages, a less confident guess is still a good one.
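
One way to picture a per-language threshold is as the lowest rank at which a language is still accepted from TextCat's ranked list of guesses. The sketch below only illustrates that reading. The function name, the input format, and the example rankings are hypothetical; the threshold values are patterned on the Italian table for option (3).

```python
def accept_languages(ranked_guesses, thresholds):
    """Keep each guessed language only if its rank is within its own threshold.

    ranked_guesses: language codes, best guess first (hypothetical input format).
    thresholds: maps a language code to the deepest rank (1 = best guess only)
                at which that language is still accepted; languages that are
                missing from the map are never accepted.
    """
    accepted = []
    for rank, lang in enumerate(ranked_guesses, start=1):
        limit = thresholds.get(lang)
        if limit is not None and rank <= limit:
            accepted.append(lang)
    return accepted


# Thresholds patterned on the Italian option (3) table:
# English may be a second guess, everything else must be the top guess.
thresholds = {"it": 1, "en": 2, "ru": 1, "ar": 1, "zh": 1}

print(accept_languages(["it", "en", "es"], thresholds))  # ['it', 'en']
print(accept_languages(["en", "ru", "it"], thresholds))  # ['en']
```

In the second call, Russian and Italian are dropped because each is only accepted as the top guess, which is the behavior described above for the less frequent languages.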

Among configurations that ignore the home language (1, 4, 5, 6, 7), the two that allow per-language thresholds (5 and 7) are consistently the best, which makes sense, as they can be more finely tuned (or perhaps overfitted!). Among the others, (6) is consistently the worst. There’s a consistent partial ordering: (5) and (7) are at least as good as (4), which is at least as good as (1), which is better than (6).

The overall span in F2 scores for this group isn’t huge, but is sometimes moderate (3–9%). F2 scores for these configs are consistently worse than those for the configs that include the home language (0, 2, 3, above), but comparing the two groups is apples and oranges (see Notes above).

Conclusions
For raw F2 score, including the home language gives the highest score, but those scores can’t be directly compared to scores that ignore the home language. Allowing for multiple results is the best way to increase overall F2 score. Per-language threshold tweaking is often slightly better, but may not be worth the complexity.

In terms of coverage (see Notes above), we get nearly full coverage (some non-home language alternative is offered for every query) with options (1), (4), and (5). Option (4), allowing multiple language results and ignoring the home language, is the best middle ground. It gives more accurate results than (1), while being less complex than (5).

I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.
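
As a rough illustration of that recommendation, the sketch below keeps up to two results from a ranked list of guesses and optionally drops the wiki's home language. It is hypothetical throughout: suggest_languages, its parameters, and the example ranking are invented here. Removing the home language before counting results is only one reading of "ignoring", closer to option (4) than to option (6).

```python
def suggest_languages(ranked_guesses, home, max_results=2, ignore_home=False):
    """Return up to max_results language codes from a best-first ranking.

    If ignore_home is True, the wiki's home language is removed before the
    cut-off is applied, so it never takes up one of the result slots.
    (Assumption: this mirrors "ignoring the home language"; the real
    detection setup may differ.)
    """
    candidates = [
        lang for lang in ranked_guesses
        if not (ignore_home and lang == home)
    ]
    return candidates[:max_results]


ranking = ["it", "en", "es"]  # hypothetical ranked guesses for one query

# Favor accuracy: the home language may be reported.
print(suggest_languages(ranking, home="it"))                    # ['it', 'en']

# Favor coverage: always offer non-home alternatives.
print(suggest_languages(ranking, home="it", ignore_home=True))  # ['en', 'es']
```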