User:TJones (WMF)/Notes/TextCat Improvements

November 2016 – January 2017 — See TJones_(WMF)/Notes for other projects.

Intro
Related to work on TextCat and Confidence, I'm trying several approaches to improving TextCat performance. Phab parent task: T140289

I expect to write up basic results on new features as I add them, but I may come back to them as I add more features that allow them to interact. As an example, adding the ability to use language models from multiple directories can give immediate improvements when there are distinctive languages to be considered (e.g., Burmese, which has its own writing system) but not in others (e.g., Catalan, which is too similar to Spanish to include on eswiki without also making improvements that give a boost to the host language).

Inspired by Mikhail's always excellent linking in his reports and the recent Why We Read Wikipedia presentation (which shows that common motivations for reading Wikipedia include boredom & randomness and intrinsic learning), I've sometimes included more than the usual number of links in the sections below, for the reader's amusement and edification.

Optimization Framework
(November 2016 — Phab task: T149314)

I've slapped together a wrapper around my existing language identification tool (i.e., the Perl version of TextCat) and my identification results evaluation tool; it can perform a grid search optimization (of F0.5) over multiple dimensions of hyper-parameters, which can be numeric or nominal.
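For readers who don't have the definition of F0.5 at their fingertips, here is a minimal sketch of the standard F-beta formula computed from raw counts. The function name and the count-based interface are illustrative assumptions; they are not taken from the evaluation tool, which may aggregate per-language results differently.

```python
def f_beta(true_pos: int, false_pos: int, false_neg: int, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more heavily than recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta at 0.5, a false positive costs more than a missed identification, which fits a setting where showing results from the wrong wiki is worse than showing none.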

Since any problem in computer science can be solved by introducing another level of indirection, I expanded the wrapper to include the ability to run itself over several different corpora, note the optimal config and F0.5 score for each corpus, and then choose the configuration that gives the best collective results, as measured by the square "error" from the individual optimal configurations for each corpus, subject to a hard constraint that no corpus drops from its baseline performance (i.e., on the current config) by more than a fixed amount (0.5% F0.5 for now). Evaluation can be done on all varying parameters at once, or on a "relevant" subset, each combination of which is represented by the optimal configuration per corpus, found by allowing all other parameters to vary.

Using the multi-corpora optimization allows us to find a generic set of hyper-parameters that apply across all corpora/wikis, which discourages overfitting to any particular corpus.
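The multi-corpus selection described above can be sketched roughly as follows. This is an illustration only, not the wrapper's actual code: the function name, the dictionary layout, and the configuration labels in the usage note are all hypothetical.

```python
def pick_config(scores, baseline, max_drop=0.5):
    """Choose the config with the lowest square error that respects the hard limit.

    scores:   {config: {corpus: F0.5 in percent}}  (hypothetical layout)
    baseline: {corpus: F0.5 in percent on the current config}
    """
    corpora = list(baseline)
    # Best score each corpus reaches under any configuration in this run.
    optimal = {c: max(per[c] for per in scores.values()) for c in corpora}
    best_cfg, best_err = None, None
    for cfg, per in scores.items():
        # Hard constraint: no corpus may fall more than max_drop below its baseline.
        if any(baseline[c] - per[c] > max_drop + 1e-9 for c in corpora):
            continue
        err = sum((optimal[c] - per[c]) ** 2 for c in corpora)
        if best_err is None or err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err
```

For example, with made-up scores for two corpora, a config that is best for one corpus but drops the other by a full point is rejected outright, and the remaining configs are ranked by how far they sit from each corpus's own optimum.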

It's spiffy.

Scoring
Three metrics I use, which tend to line up, are: (1) "Cumulative Improvement", which is the total improvement across all corpora; (2) "Square Error (vs Optimal)", which squares the difference between the best performance for this corpus with the relevant parameters and the best performance of this corpus in the entire test run; and (3) a general preference for no corpus to do worse than the baseline (i.e., delta >= 0 for all corpora), with a hard limit on not dropping by more than 0.5% from the baseline.

I have a fourth, experimental score, called "X Score", which changes without notice, but currently computes the diff between (a) the square of the diff between current baseline and 100% and (b) the square of the diff between the current best and 100%, for each corpus, and sums them. It favors moving significantly closer to 100% than the current baseline.

I generally rank results by square error because it penalizes big drops away from the best possible score for a corpus. When these metrics disagree strongly, life is "interesting".
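As a concrete check on these definitions, the sketch below recomputes the scores from the (best, baseline, optimal) figures in the 9K, results ratio 1.06 score report further down this page. The function names are mine, and the X Score function follows only the description above, which is explicitly subject to change.

```python
# (best, baseline, optimal) F0.5 percentages copied from the 9K / RR 1.06 score report.
REPORT = {
    "dewiki": (91.1, 88.2, 91.5),
    "enwiki": (86.5, 83.0, 87.0),
    "eswiki": (97.0, 95.6, 97.3),
    "frwiki": (92.8, 89.0, 93.3),
    "itwiki": (95.0, 92.2, 95.4),
    "jawiki": (96.1, 95.1, 96.8),
    "nlwiki": (88.5, 82.3, 88.5),
    "ptwiki": (97.6, 96.9, 97.7),
    "ruwiki": (93.1, 92.4, 93.8),
}

def cumulative_improvement(report):
    """Sum of (best - baseline) over all corpora."""
    return sum(best - base for best, base, _ in report.values())

def square_error(report):
    """Sum of squared gaps between each corpus's best and its optimal score."""
    return sum((optim - best) ** 2 for best, _, optim in report.values())

def within_hard_limit(report, max_drop=0.5):
    """True if no corpus drops more than max_drop below its baseline."""
    return all(base - best <= max_drop + 1e-9 for best, base, _ in report.values())

def x_score(report):
    """Per corpus: (baseline's distance from 100%)^2 minus (best's distance from 100%)^2."""
    return sum((100 - base) ** 2 - (100 - best) ** 2 for best, base, _ in report.values())
```

Rounded to the precision used in the report, `square_error(REPORT)` comes out at 1.90 and `cumulative_improvement(REPORT)` at 23.0, matching the reported values.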

Framework Update, Jan 2017
Grid search works well, but eventually becomes unwieldy as the number of dimensions and number of possible values in each dimension increases.

By the time I got to Bucketing and Bonuses (T149322), the overall improvements were good enough to consider re-introducing previously excluded languages, e.g., French for enwiki. Decent accuracy improves recall in a way that nothing else can (if we don't consider French, we can't ever get the ones that actually are French).

However, for n languages, there are 2^n possible subsets of languages to consider. Some are silly (e.g., having all languages but dropping the host language), but way too many are plausible, especially among the long tail of uncommonly seen languages, which can add a little to recall with a few true positives, or hurt recall and precision with too many false positives.

I did some experiments with generic hill climbing and coordinate descent using a very ugly (not at all smooth) function that was relatively cheap to evaluate (unlike language identification). Coordinate descent gave better results and was much faster (because it required few iterations to reach its optimum).

Briefly, coordinate descent starts at a random point in the param space, and then optimizes along one dimension at a time, in random order, moving to the new optimum point before continuing in another dimension. All dimensions are iterated over some number of times, and if no improvement occurs after a full iteration, the next iteration does a random restart.
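A minimal sketch of that procedure over a discrete parameter space is below. It is an illustration under my own assumptions (the function name, the dict-of-lists space, and a score to be maximized), not the optimizer's actual implementation.

```python
import random

def coordinate_descent(space, score, iterations=5, seed=None):
    """Maximize score(point) over a discrete space given as {name: [values]}."""
    rng = random.Random(seed)
    names = list(space)

    def random_point():
        return {name: rng.choice(space[name]) for name in names}

    current = random_point()
    current_score = score(current)
    best, best_score = dict(current), current_score
    for _ in range(iterations):
        improved = False
        # Visit every dimension once per iteration, in random order.
        for name in rng.sample(names, len(names)):
            for value in space[name]:
                candidate = {**current, name: value}
                candidate_score = score(candidate)
                if candidate_score > current_score:
                    # Move to the better point before trying the next value or dimension.
                    current, current_score = candidate, candidate_score
                    improved = True
        if current_score > best_score:
            best, best_score = dict(current), current_score
        if not improved:
            # A full pass without improvement: restart from a random point.
            current = random_point()
            current_score = score(current)
    return best, best_score
```

Each iteration costs roughly the sum of the dimension sizes in evaluations, rather than their product as in grid search, which is where the speed-up comes from.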

The process is not deterministic, and often can't take advantage of caching (because its coverage of the param space is very sparse) or parallelization (a technical limitation). I tested it on parameter spaces that had already been exhaustively explored via grid search, and the optimum wasn't always found, but the result was usually close. A given run seems to max out within 3 to 5 iterations. A heuristic I've used is that if the best result shows up in 3 out of 10 runs, it is probably the best. Sub-optimal results are still close, which hints at how hard it is worth trying.

I added a new parameter specification, for a "shrinking list". For example: (en,zh,es,ar) translates to the explicit list "en,zh,es,ar", "en,zh,es", "en,zh", "en". I used it for the bonus bucket specification.
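The expansion is simple enough to sketch directly. The function name and the exact parsing rules are assumptions for illustration; only the input and output shown in the example above come from the actual tool.

```python
def shrinking_list(spec: str) -> list[str]:
    """Expand "(en,zh,es,ar)" into successively shorter prefixes of the list."""
    items = [part.strip() for part in spec.strip("()").split(",") if part.strip()]
    # Drop one item from the end at each step, down to a single item.
    return [",".join(items[:n]) for n in range(len(items), 0, -1)]
```

A list of n items therefore contributes n values to the dimension, instead of the 2^n subsets an unrestricted set would.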

I added another new parameter specification, for a "hill climber", which is really more of a directed coordinate descent within a set. Each item in the set is removed in turn and the performance evaluated; if it's better, it is left out, otherwise it is put back. The forward/backward process is iterated some specified number of times each time the dimension is evaluated (either in grid search or coordinate descent). Only one "hill climber" can be used at a time (because the code needs to be refactored, alas), but it can be used with grid search on all other dimensions, or with coordinate descent.
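The removal step can be sketched as a greedy elimination over the set. This is a simplified illustration with hypothetical names; it models only the "remove, evaluate, keep out or put back" loop described above and does not attempt to reproduce the forward half of the forward/backward process.

```python
def hill_climb_set(items, score, passes=2):
    """Try dropping each item in turn; keep the drop only if the score improves."""
    kept = list(items)
    best = score(kept)
    for _ in range(passes):
        for item in list(kept):
            trial = [x for x in kept if x != item]
            trial_score = score(trial)
            if trial_score > best:
                kept, best = trial, trial_score   # better without it: leave it out
            # otherwise the item stays in (it is "put back")
    return kept, best
```

Each pass evaluates at most one configuration per item, so the cost grows linearly with the number of languages rather than with the number of subsets.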

I also added an "easy grid search" that explores all points in the space, but only evaluates the ones that are cached. This makes it possible to review all previously computed points in a parameter space (e.g., by running coordinate descent a few times) to find the current global optimum.
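A rough sketch of the idea follows, assuming a cache keyed by a tuple of (parameter, value) pairs in sorted parameter order. That key format and the function name are my own inventions for the example, not the tool's.

```python
from itertools import product

def easy_grid_search(space, cache):
    """Walk every grid point, but only look at points whose score is already cached."""
    names = sorted(space)
    best = None
    for values in product(*(space[name] for name in names)):
        key = tuple(zip(names, values))
        if key in cache and (best is None or cache[key] > best[1]):
            best = (dict(key), cache[key])
    return best  # None if nothing in this space has been cached yet
```

Nothing new is scored here, so the result is only as good as the points earlier runs happened to visit.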

Coordinate descent gives a phenomenal speed-up. One scenario with the enwiki corpus had > 2.4e13 possible parameter combinations. Using coordinate descent, it was optimized in < 1000 steps, many of which were even faster because they were duplicates and therefore already cached.

Meta-optimization across corpora cannot use coordinate descent (again, the code needs to be refactored, alas), so I've been manually jumping between individual corpus optimization (e.g., for language set) and meta-optimization on smaller parameter sub-spaces.

Baseline Performance
Below is the F0.5 baseline performance of the nine wiki corpora I'm working with, based on the intended production settings. (It turns out that production isn't configured properly to take ambiguity detection—i.e., maximum returned languages and results ratio—into account, but T153105 addresses that.)

Initial Findings
(November 2016 — Phab task: T149316)

I reviewed the previous language model optimization for TextCat for English, French, Italian, Spanish, German, Portuguese, Russian, Japanese, and Dutch Wikipedias, looking at languages for which there were no query-based models available, but for which there are Wiki-text-based models available. The same issues are at work in these cases: very distinctive languages are easy (e.g., those with a unique writing system), and closely related languages are hard (e.g., Catalan and Spanish, Afrikaans and Dutch). Another issue is the set of languages and number of instances of each language represented in the sample. In some cases there are many languages that are not covered, in others, very few; in all cases, there weren't very many instances of these previously not covered languages.

Summary of Results
We can get a tenth of a percentage point or two increase in F0.5 for some wikis by adding previously not covered languages. For many wikis, there weren't many languages to add, and when there were, often they were too similar to others to be of help. There are a number of languages for which we could add easier-to-build Wiki-text models, to expand coverage for certain wikis and earn a few more tenths of a percentage point of F0.5.

Details

 * English: I re-optimized from the full set of relevant languages, and arrived at a similar set of optimal languages. Adding Slovak (sk) improved recall for Slovak, but it was offset by other errors. Likely that adding Wiki-text models for Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am) could make improvements at the margins.
 * French: Adding Hungarian gets the one Hungarian example in the sample, increasing F0.5 by 0.1%. Other languages (Breton, Icelandic, Latin) have too many false positives.
 * German: Latin was the only language not already accounted for, and adding it didn't help.
 * Spanish: Catalan was the only language not accounted for, and adding it didn't help.
 * Italian: Latin and Romanian were the only languages not accounted for, and adding them didn't help.
 * Portuguese: Tagalog and Latin were the only languages not accounted for, and adding them didn't help.
 * Russian: Adding Finnish (fi) improved recall for Finnish, but it was offset by other errors. Likely that adding Wiki-text models for Azerbaijani (az) or other languages could make improvements at the margins.
 * Japanese: Kazakh was the only language not accounted for, and no model is available for it.
 * Dutch: Adding Burmese gets the one Burmese example in the sample, increasing F0.5 by 0.2%. Adding Finnish and Croatian (fi, hr) improved recall for Finnish and Croatian, but it was offset by other errors.

Current Recommendations

 * I can add Burmese, Oriya, Malayalam, and Kannada to the list of "easy" language models to be added when present (along with Greek, Korean, Hebrew, Japanese, Thai, Telugu, Georgian, and when there are no examples of "competing" languages using the same script, Arabic, Russian, and Hindi).
 * We could include the models for these "easy" languages with very distinctive writing systems in general (and not only when they are present in a largish sample) because identifying them is "easy" even if they are very low frequency. The main concern is the computational cost of including more models, though I believe that the current implementation of the PHP version of TextCat in production loads all available models anyway.
 * There are a couple of languages present in the samples above that I should build Wiki-text models for and re-assess: Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am), particularly Khmer and Amharic since they have distinctive writing systems.
 * I expect that other features in this list will interact with this feature and that, as precision generally improves, it will be possible to include additional models, including Wiki-text-based models.

Maximum Returned Languages and Results Ratio + Model Size
(November 2016 — Phab task: T149321)

Background
The current settings in production for TextCat for the maximum returned languages and results ratio are inherited from the original TextCat implementation. We were getting better results from TextCat than other language identifiers using those defaults, so we didn't originally mess with them.

However, the original TextCat implementation was built primarily to work with longer texts and smaller n-gram models, so it makes sense that there is some improvement to be had here.

I started out intending to look at variations in maximum returned languages and results ratio, but remembered that in my previous optimizations, which were based on languages to be considered, model size also turned out to be interesting, so I decided to add it into the mix here.

For reference:
 * The results ratio is set to 1.05 by default, which means that any language model that has a cost less than 1.05 times the lowest cost (i.e., within 5% of the lowest cost) is reported as an alternative. So, TextCat can and will report that a particular string looks like it is Spanish, but the second best guess is Portuguese, and the third best guess is maybe Italian.
 * The maximum returned languages is set to 5 by default, which is the maximum number of languages that can be returned by TextCat. If more than that many languages are within the results ratio, then TextCat can't make up its mind, and returns "unknown" as the detected language.
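The interaction between the two settings can be sketched as below. This follows the wording of the two bullets above rather than the TextCat source, and the function name and the cost dictionary are illustrative assumptions.

```python
def pick_languages(costs, results_ratio=1.05, max_returned=5):
    """costs: {language: cost}, lower is better. Returns ranked guesses or ["unknown"]."""
    if not costs:
        return ["unknown"]
    ranked = sorted(costs, key=costs.get)
    lowest = costs[ranked[0]]
    # The best match is always a candidate; others qualify if their cost is
    # less than results_ratio times the lowest cost.
    candidates = [ranked[0]] + [
        lang for lang in ranked[1:] if costs[lang] < results_ratio * lowest
    ]
    # Too many near-ties: give up rather than guess.
    if len(candidates) > max_returned:
        return ["unknown"]
    return candidates
```

With made-up costs of 1000 for Spanish, 1040 for Portuguese, 1049 for Italian, and 1200 for French, the defaults return Spanish, Portuguese, and Italian in that order; lowering the maximum to 1 with the same ratio returns "unknown" instead.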

The model size is the number of 1- to 5-grams from the model training corpus that are retained, in frequency order. So a model size of 1000 means that the 1000 most frequent n-grams for each language are compared to the 1000 most frequent n-grams from the text to be identified, and the best match wins (modulo the settings for results ratio and maximum returned languages).
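To make the role of model size concrete, here is a small sketch of a rank-based n-gram comparison in the general style of TextCat. The underscore padding, the tie-breaking, and the function names are simplifying assumptions of mine; the unknown n-gram penalty is left as a parameter, and elsewhere on this page it is described as being equal to the model size.

```python
from collections import Counter

def ngram_ranks(text, model_size, max_n=5):
    """Rank the model_size most frequent 1- to max_n-grams (rank 0 = most frequent)."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"   # word-boundary marker (a simplifying assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    top = sorted(counts, key=lambda gram: (-counts[gram], gram))[:model_size]
    return {gram: rank for rank, gram in enumerate(top)}

def out_of_place_cost(query_ranks, model_ranks, unknown_penalty):
    """Sum of rank differences; n-grams missing from the model cost unknown_penalty."""
    return sum(
        abs(rank - model_ranks[gram]) if gram in model_ranks else unknown_penalty
        for gram, rank in query_ranks.items()
    )
```

The language whose model gives the lowest cost is the best match, so a larger model both recognizes more n-grams and, if the penalty tracks model size, punishes the unrecognized ones more heavily.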

The original TextCat implementation used a model size of 400 n-grams; it was intended to be used on longer texts—where actual statistical tendencies of a language can emerge—at a time when Moore's Law had not been chugging along for an additional couple of decades. We've been using a model size of 3,000 n-grams. The PHP implementation in production and my updated Perl implementation have 5,000 n-grams available, though I have kept my own 10,000 n-gram versions of the query-based models around for development and testing.

Data
I have the hand-tagged corpora of 500+ poorly-performing queries from nine Wikipedias that I had previously used. The nine codes/languages are: de/German, en/English, es/Spanish, fr/French, it/Italian, ja/Japanese, nl/Dutch, pt/Portuguese, and ru/Russian.

Initial Config
I set up the optimizer to consider maximum returned languages thresholds from 1 to 10 (current default is 5), and results ratio values from 1.00 to 1.10 (in increments of 0.01; current default is 1.05). I also set up model size to vary from 1K to 10K (in increments of 1K; current default is 3K).
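For a sense of scale, the grid described above can be enumerated directly. The variable names are illustrative; only the ranges come from the text.

```python
from itertools import product

max_returned_languages = range(1, 11)                          # 1 to 10
results_ratios = [round(1 + i / 100, 2) for i in range(11)]    # 1.00 to 1.10
model_sizes = range(1000, 10001, 1000)                         # 1K to 10K

# Every (MRL, RR, model size) combination to evaluate per corpus.
grid = list(product(max_returned_languages, results_ratios, model_sizes))
```

That is 10 x 11 x 10 = 1,100 configurations per corpus, which is still comfortable for an exhaustive search.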

For each corpus, I kept the current list of languages to be considered, as previously optimized. The languages to consider are a subset of available/relevant languages, which makes for a more difficult optimization problem. There are 2^n possible subsets, which is not as amenable to an exhaustive grid search. I have some ideas for simple hill-climbing that should be O(n) rather than O(2^n) in the number of languages, but for now the languages are held constant per corpus.

I did mess around briefly with the types of models used (query-based vs Wiki-text-based), and the benefit of the query-based models is still clear: they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% of F0.5.

Maximum Returned Languages
In all my tests, maximum returned languages (MRL) always heavily optimized towards 1; i.e., if there's any ambiguity about what language it is (based on the results ratio), give up!

Below are the stats for MRL while allowing results ratio (RR) and model size to vary. So for MRL == 1, the value reported is with the best possible values for RR and model size, which may be different for MRL == 2. The Square Error ("SqErr" below) is the sum of the squares of the differences between the best F0.5 score for each corpus at that MRL value, and the optimal F0.5 value for each corpus across all config settings. Since it is 0 for MRL == 1, the best F0.5 values for every corpus all have MRL == 1.

MRL       1       2       3       4       5       6       7       8       9      10
SqErr  0.00   61.01   62.06   66.70   75.41   77.65   77.65   77.65   78.60   78.60

So, from here on out, I'm setting MRL to 1 for all other tests, to reduce the dimensionality for analysis.

Proportional Limits
I originally had an idea to explore proportional settings for maximum returned languages (since some wikis, particularly enwiki, have many more languages being considered than others). I ran a few quick tests on that before abandoning the idea entirely, as the generally optimal MRL always tends toward 1, so it doesn't matter.

Since we don't need proportional limits, no code updates are needed to TextCat itself.

Up to 10K models
Holding the MRL at 1 and allowing the results ratio and model size to vary, we get the following table of the square error. The best (lowest) value for each model size is marked with an asterisk.

RR \ Model     1000      2000      3000      4000      5000      6000      7000      8000      9000     10000
1.00         279.77    172.26    104.28    110.21    116.72    118.48    111.17     99.92    100.41    101.28
1.01         132.81     82.36     54.24     52.27     58.81     66.02     59.59     58.10     54.38     46.67
1.02          87.87     54.67     30.72     29.82     33.51     32.45     31.99     36.02     35.46     30.32
1.03          81.87*    40.64     30.24     20.20     16.89     17.04     17.40     16.99     14.85     16.80
1.04         117.74     39.40*    26.86*    14.41*    13.28      9.59     11.02      9.04      7.44      6.69
1.05         201.07     56.32     33.13     18.50     10.05*     7.40*     4.29      4.16      4.98      4.20
1.06         366.81     86.91     42.82     24.62     11.36      8.14      3.90*     2.26*     1.90*     2.27*
1.07         624.00    155.12     60.81     30.73     15.26      9.10      8.39      3.60      2.94      3.01
1.08         976.09    281.64    109.30     55.88     35.49     18.25      9.69      7.95      6.81      5.04
1.09        1535.30    414.63    172.67     79.29     47.09     29.61     16.22     13.21      8.44      5.61
1.10        2420.88    604.92    247.52    128.37     76.51     49.24     28.96     22.68     14.03     10.65

The optimal results ratio by model size increases as the model size increases. My guess is that this is because as the model size increases, the penalty for unknown n-grams increases (the penalty is equal to the model size). This pushes up the scores for poorly-matched query/model pairs (i.e., when the query has many unknown n-grams compared to a given model), allowing for a bigger results ratio window. But I'm not sure.

The overall optimal value is for 9K models with a results ratio of 1.06. The score report for that setting is below.

best   baseln  delta   optim   corpus
91.1%  88.2%     2.9%  91.5%   dewiki
86.5%  83.0%     3.5%  87.0%   enwiki
97.0%  95.6%     1.4%  97.3%   eswiki
92.8%  89.0%     3.8%  93.3%   frwiki
95.0%  92.2%     2.8%  95.4%   itwiki
96.1%  95.1%     1.0%  96.8%   jawiki
88.5%  82.3%     6.2%  88.5%   nlwiki
97.6%  96.9%     0.7%  97.7%   ptwiki
93.1%  92.4%     0.7%  93.8%   ruwiki
Square Error (Best vs Optimal): 1.90
Cumulative Improvement (Best vs Baseline): 23.0%
Avg Improvement:  2.6%
Max Improvement:  6.2%
Min Improvement:  0.7%

 * best is the best F0.5 score for that language within this bucket (in this case, only one config is in the bucket).
 * baseln is the baseline F0.5 score, i.e., the F0.5 score obtained in the original optimization, based generally on language selection.
 * delta is the increase from baseln to best.
 * optim is the best F0.5 obtained in this optimization, including all possible configurations.
 * Square Error is as above, and is the basis for selecting the best model.
 * Improvement is the sum of all deltas, with mean, max, and min also shown.

3K models
We are currently using 3K models. If we limit ourselves to only 3K models then we get the following square error values, by results ratio:

Model Size \ RR   1.00    1.01    1.02    1.03    1.04    1.05    1.06    1.07    1.08    1.09    1.10
3000             39.17   10.87    2.55    2.25    3.55    5.08   12.35   22.68   53.07   98.56  150.73

Interestingly, the best value is 1.03, rather than 1.04 for 3K models above. This is because the "optimal" F0.5 values among all the configs we are considering are different, so the square error is lower, too, as shown in the score report below.

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.7%  83.0%     1.7%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.8%  89.0%     2.8%  92.1%   frwiki
93.9%  92.2%     1.7%  94.2%   itwiki
96.4%  95.1%     1.3%  96.8%   jawiki
84.6%  82.3%     2.3%  85.1%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.25
Cumulative Improvement (Best vs Baseline): 13.0%
Avg Improvement:  1.4%
Max Improvement:  2.8%
Min Improvement: -0.3%

Notice that Russian takes a small hit, with F0.5 worsening by 0.3%, which is within the allowed range for a decrease for any one corpus (up to 0.5%).

5K Models
The current implementation already has the information needed to run up to 5K models, so I optimized with that constraint as well. (The 1K models are so horrible that I have omitted them.)

RR \ Model Size   2000     3000     4000     5000
1.00            128.90    70.18    75.73    81.84
1.01             51.44    29.42    29.25    34.65
1.02             30.33    13.22    12.36    17.53
1.03             20.80    13.36     7.08     7.73
1.04             19.70    12.30     4.63     4.56
1.05             32.80    15.91     7.40     2.93
1.06             60.33    24.14    10.20     3.14
1.07            117.64    39.15    16.91     6.60
1.08            227.76    80.02    36.38    20.61
1.09            347.43   136.09    54.47    28.93
1.10            525.12   200.90    98.23    53.23

The 5K models are best, with 1.05 as the optimal results ratio.

best   baseln  delta   optim   corpus
91.1%  88.2%     2.9%  91.4%   dewiki
85.7%  83.0%     2.7%  86.8%   enwiki
96.6%  95.6%     1.0%  97.2%   eswiki
92.2%  89.0%     3.2%  92.3%   frwiki
94.4%  92.2%     2.2%  94.9%   itwiki
96.4%  95.1%     1.3%  96.8%   jawiki
86.6%  82.3%     4.3%  86.6%   nlwiki
97.5%  96.9%     0.6%  97.7%   ptwiki
92.5%  92.4%     0.1%  93.4%   ruwiki
Square Error (Best vs Optimal): 2.93
Cumulative Improvement (Best vs Baseline): 18.3%
Avg Improvement:  2.0%
Max Improvement:  4.3%
Min Improvement:  0.1%

Unknown n-gram Penalty
At one point I accidentally ran with configured model sizes from 1K to 10K, but only using the 5K models. The result was that the only difference between 5K and larger model "sizes" was the unknown n-gram penalty. The optimal "model size" came out to be 6K, meaning that a 1K/20% extra penalty improved performance.

Summary & Current Recommendations

 * The maximum returned languages should be set to 1—we should tolerate no ambiguity!


 * Bigger models will give a noticeable improvement in F0.5 score.
 * Assuming there's no serious performance impact, upgrading to 9K models could give the best results. I'm not sure that I have original data for the wiki-text-based models to be able to generate models larger than 5K for them, so that could be a wrinkle.
 * With the current potential for 5K models, we can get much of the improvement from the 9K models (2.0% mean vs 2.6%)
 * With the current 3K models, we can still get a 1.4% mean improvement.


 * The results ratio should be chosen based on the model size selected.


 * The benefit of the query-based models is still clear, and they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% F0.5.


 * Exploring an additional n-gram penalty may yield further improvements.
 * Since we don't get any benefit from proportional limits on the maximum returned languages, no code updates are needed to TextCat itself.


 * We still need to see how these features interact with other potential features being considered.

Minimum Input Length
(November 2016 — Phab task: T149318)

Background
Very short strings, especially in the Latin alphabet, are hard to identify—partly because there isn't much to work with, and partly because of genuine ambiguity. As an extreme example, a is listed under 94 languages in the English Wiktionary.

There is also a particular problem on Wikipedias where language identification is used. Language identification is only run on "poorly performing" queries—i.e., those with fewer than 3 results. Punctuation marks don't really get the "full search experience", and often return only one or two results, making them eligible for language identification. The result is fairly random, depending on the language models available and which has the highest n-gram rank for the punctuation (plus the spaces put on either side of it). Since punctuation marks have entries in most languages, whatever random language is identified as a match will also likely return results.

For example, on English Wikipedia, this results in a semicolon getting results from the Korean Wikipedia, a double quote (") getting results from Hebrew Wikipedia, and a circumflex (^) getting results from Japanese Wikipedia. I don't think most users see these results, because if you use the search box in the upper corner, you go directly to the page with the matching title/redirect.

While limiting language identification based on length seems like a good idea, there are potential pitfalls. There are fairly unambiguous characters, like Japanese or Hebrew. (Japanese katakana could be used to write Okinawan, Ainu, or Palauan, and the Hebrew alphabet could be used to write Yiddish, Judaeo-Spanish, or Judeo-Arabic, but there's still an obvious best guess for individual characters, unlike individual letters in the Latin alphabet.) Some non-alphabetic writing systems, like Japanese and especially Chinese, can pack a lot of information into one character. When the language has short words, like Chinese, one character can actually be a word, and can carry enough information to be a reasonable search.

So, the goal is to balance eliminating the shortest queries, while seeing what effect it has on recall (though precision could improve as ambiguous words are no longer allowed to get incorrect results). The number of very short queries varies by corpus/wiki, depending in large part on the languages present in the corpus, and so some are completely unaffected by minimum lengths.

Initial Results
I examined configs with a minimum input length (MIL) from 1 to 8. Since we currently don't include any empty strings in our input, a minimum input length of 1 is equivalent to the current situation.
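The check itself is tiny. The sketch below assumes length is measured in characters after trimming surrounding whitespace, which is my assumption rather than something stated here; the function name is also hypothetical.

```python
def should_identify(query: str, min_input_length: int) -> bool:
    """Run language identification only if the query has at least min_input_length characters."""
    return len(query.strip()) >= min_input_length
```

With a minimum of 2, a lone semicolon is skipped; with a minimum of 3, a two-character query such as 東京 is skipped as well, which is the kind of trade-off the results below try to measure.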

3K Models
Limiting options to the 3K models currently in use, with an optimal results ratio (RR) of 1.03, the squared error (vs optimal) doesn't change for minimum input length (MIL) of 1 or 2. 3 is only slightly worse, with 4 only slightly worse than that, with a much bigger jump at MIL == 5.

RR \ MIL   1       2       3       4       5       6       7       8
1.03       2.58    2.58    2.78    3.44    8.68   26.49   92.13  256.97

With a minimum input length of 1 or 2, we get a slight hit to Russian over current production baseline, but improvement overall:

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.7%  83.0%     1.7%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.8%  89.0%     2.8%  92.1%   frwiki
93.9%  92.2%     1.7%  94.2%   itwiki
96.4%  95.1%     1.3%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.58
Cumulative Improvement (Best vs Baseline): 13.0%
Mean (Min - Max): 1.4% (-0.3% – 2.8%)

At minimum input length of 3, there's only a small decrease in overall performance from MIL == 2, with a slight increase for Japanese (presumably caused by short difficult Chinese queries being excluded):

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.6%  83.0%     1.6%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.7%  89.0%     2.7%  92.1%   frwiki
93.8%  92.2%     1.6%  94.2%   itwiki
96.5%  95.1%     1.4%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.78
Cumulative Improvement (Best vs Baseline): 12.8%
Mean (Min - Max): 1.4% (-0.3% – 2.7%)

At minimum input length of 4, performance overall is slightly worse, with English getting the worst of it, compared to MIL == 3.

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.3%  83.0%     1.3%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.6%  89.0%     2.6%  92.1%   frwiki
93.8%  92.2%     1.6%  94.2%   itwiki
96.5%  95.1%     1.4%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 3.44
Cumulative Improvement (Best vs Baseline): 12.4%
Mean (Min - Max): 1.4% (-0.3% – 2.6%)

5K Models
Considering models from 3K to 5K (currently available in production, though we use 3K models), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see the same pattern: minimum input length (MIL) of 1 and 2 are the same, 3 is a bit worse, 4 a bit worse than that, with a bigger jump at 5.

The optimal results ratio (RR) depends on the minimum input length, and varies from 1.05 to 1.06.

RR \ MIL   1       2       3       4       5       6       7       8
1.05       3.09    3.09    3.88    4.51   10.49   28.65   93.98  259.56
1.06       3.48    3.48    3.53    3.70    7.26   22.92   85.00  247.02

With MIL at 1 (RR == 1.05), this is the baseline. MIL of 2 is the same:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.7%  85.7%     0.0%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.2%  92.2%     0.0%  92.3%   frwiki
94.4%  94.4%     0.0%  95.0%   itwiki
96.4%  96.4%     0.0%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.09
Cumulative Improvement (Best vs Baseline):  0.0%
Mean (Min - Max): 0.0% (0.0% – 0.0%)

With MIL at 3 (and RR == 1.06), we get improvements for several corpora, with a net improvement overall, but an unacceptable dip for Russian:

best   baseln  delta   optim   corpus
91.3%  91.1%     0.2%  91.4%   dewiki
86.6%  85.7%     0.9%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.1%  92.2%?   -0.1%  92.3%   frwiki
94.8%  94.4%     0.4%  95.0%   itwiki
96.7%  96.4%     0.3%  96.9%   jawiki
86.2%  86.6%?   -0.4%  87.0%   nlwiki
97.7%  97.5%     0.2%  97.7%   ptwiki
91.8%  92.5%!   -0.7%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.53
Cumulative Improvement (Best vs Baseline):  0.8%
Mean (Min - Max): 0.1% (-0.7% – 0.9%)
[Unacceptable Performance Decrease for ruwiki]

With MIL at 3 and RR at 1.05, there is only a minimal improvement for Japanese, and only minimal dips for any other wikis:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.4%  85.7%?   -0.3%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.2%  92.2%     0.0%  92.3%   frwiki
94.3%  94.4%?   -0.1%  95.0%   itwiki
96.5%  96.4%     0.1%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.88
Cumulative Improvement (Best vs Baseline): -0.3%
Mean (Min - Max): -0.0% (-0.3% – 0.1%)

The performance of MIL == 4 is similar. At RR == 1.06, overall performance is up a bit, but Russian takes too big a hit:

best   baseln  delta   optim   corpus
91.3%  91.1%     0.2%  91.4%   dewiki
86.4%  85.7%     0.7%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.0%  92.2%?   -0.2%  92.3%   frwiki
94.8%  94.4%     0.4%  95.0%   itwiki
96.7%  96.4%     0.3%  96.9%   jawiki
86.2%  86.6%?   -0.4%  87.0%   nlwiki
97.7%  97.5%     0.2%  97.7%   ptwiki
91.8%  92.5%!   -0.7%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.70
Cumulative Improvement (Best vs Baseline):  0.5%
Mean (Min - Max): 0.1% (-0.7% – 0.7%)
[Unacceptable Performance Decrease for ruwiki]

At MIL == 4 and RR == 1.05, there isn't any improvement other than a tiny bump for Japanese, though English is on the edge of acceptability.

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.2%  85.7%?   -0.5%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.1%  92.2%?   -0.1%  92.3%   frwiki
94.3%  94.4%?   -0.1%  95.0%   itwiki
96.5%  96.4%     0.1%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 4.51
Cumulative Improvement (Best vs Baseline): -0.6%
Mean (Min - Max): -0.1% (-0.5% – 0.1%)

N.B., these are all using 5K models with RR == 1.05 as a baseline, rather than the current production baseline. Got to keep improving, right?

Up to 10K models
Considering models from 3K to 10K (max available for query-based models without retraining), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see the same pattern: minimum input length (MIL) of 1 and 2 are the same, 3 is a bit worse, 4 a bit worse than that, with a bigger jump at 5.

The optimal model size is 9K, and the optimal results ratio (RR) is 1.06.

RR \ MIL        1       2       3       4       5       6       7       8
1.06         2.65    2.65    2.77    3.03    7.33   23.75   87.70  250.03

At MIL == 1 (or 2), performance is generally better than the 5K baseline, though Japanese takes a small hit:

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.5%   dewiki
86.5%   85.7%     0.8%   87.0%   enwiki
97.0%   96.6%     0.4%   97.3%   eswiki
92.8%   92.2%     0.6%   93.3%   frwiki
95.0%   94.4%     0.6%   95.4%   itwiki
96.1%   96.4%?   -0.3%   96.9%   jawiki
88.5%   86.6%     1.9%   89.0%   nlwiki
97.6%   97.5%     0.1%   97.7%   ptwiki
93.1%   92.5%     0.6%   93.8%   ruwiki
Square Error (Best vs Optimal): 2.30
Cumulative Improvement (Best vs Baseline): 4.7%
Mean (Min - Max): 0.5% (-0.3% – 1.9%)

At MIL == 3, Japanese does a bit better (or, at least, a bit less badly), while overall the improvement over the 5K baseline (with no MIL) is less.

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.5%   dewiki
86.3%   85.7%     0.6%   87.0%   enwiki
97.0%   96.6%     0.4%   97.3%   eswiki
92.7%   92.2%     0.5%   93.3%   frwiki
94.9%   94.4%     0.5%   95.4%   itwiki
96.3%   96.4%?   -0.1%   96.9%   jawiki
88.4%   86.6%     1.8%   89.0%   nlwiki
97.6%   97.5%     0.1%   97.7%   ptwiki
93.1%   92.5%     0.6%   93.8%   ruwiki
Square Error (Best vs Optimal): 2.57
Cumulative Improvement (Best vs Baseline): 4.4%
Mean (Min - Max): 0.5% (-0.1% – 1.8%)

At MIL == 4, no wikis do worse, though not as many do better, and the total improvement is slightly less:

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.5%   dewiki
86.0%   85.7%     0.3%   87.0%   enwiki
97.0%   96.6%     0.4%   97.3%   eswiki
92.7%   92.2%     0.5%   93.3%   frwiki
94.9%   94.4%     0.5%   95.4%   itwiki
96.5%   96.4%     0.1%   96.9%   jawiki
88.4%   86.6%     1.8%   89.0%   nlwiki
97.6%   97.5%     0.1%   97.7%   ptwiki
93.1%   92.5%     0.6%   93.8%   ruwiki
Square Error (Best vs Optimal): 2.88
Cumulative Improvement (Best vs Baseline): 4.3%
Mean (Min - Max): 0.5% (0.0% – 1.8%)

Per Language Analysis
Looking at the optimization report (for models up to 10K) for each language, we can see how the setting of a minimum input length affects the corpus of queries from the corresponding wiki.

German, Spanish, Italian, Portuguese, Russian: Most or all queries are >5 characters, so the optimization is based mostly or entirely on model size, maximum returned languages, and results ratio.

English, French: The shortest queries are 2 characters and all non-Latin (and so generally easier in this context), so MIL ≥ 3 affects recall.

Japanese: The shortest queries are Chinese, and hard to detect properly. Ignoring them improves precision, so MIL > 2 is a good thing.

Dutch: The optimal MIL is actually 8! There are few queries < 7 characters long, but of the nine queries that are 7 characters, five are English and German (both of which are easily confused with Dutch, esp. with shorter strings), and only one is Dutch, so dropping them all improves precision enough to make a difference!

Summary & Current Recommendations

 * None of the 3K, 5K, or 9K models are any worse if we set the minimum input length (MIL) to 2 (vs the de facto current value of 1). Setting MIL to 3 or 4 seems reasonable, depending on the details of the optimization.
 * We've only seen real-world problems with one-character queries, so setting MIL to 2 or 3 seems like the most likely option, without incurring too great a cost.


 * We still need to see how this feature interacts with other potential features being considered.
 * That said, compared to the current production baseline, improvements in other features give us some wiggle room, since this feature can only decrease recall (though it improves precision).
 * The Perl and PHP versions of TextCat will need to be updated to include this option.
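Since the recommendations above trade a little recall for precision, it may help to recall how F0.5 weighs the two. The sketch below uses the standard F-beta definition; the precision and recall values in the example are invented purely for illustration and are not taken from any of the corpora.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented numbers: a filter that costs 2 points of recall
# but gains 2 points of precision.
before = f_beta(0.90, 0.80)
after = f_beta(0.92, 0.78)
print(f"{before:.4f} -> {after:.4f}")
```

With beta at 0.5, an even swap of recall for precision like this one raises the score, which is why a feature that "can only decrease recall" can still be a net win.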

Max Proportion of Max Score
(December 2016 — Phab task: T149320)

Background
The current configuration of TextCat chooses the best fitting language from among the languages considered, even when "best" really isn't very good. One brake on bad results is determining that a result is "too ambiguous" because there are too many languages that score similarly (see Maximum Returned Languages and Results Ratio, above).

However, depending on the configuration of TextCat and the number of languages (and exact languages) being considered, strings that are not in any of the relevant writing systems can score well enough in one language to beat out all the others.

TextCat scores are best viewed as costs, with the smallest cost (that is, the smallest deviation from a particular language model) being the best. There's no upper limit on the cost; it varies with the number of distinct n-grams present, which is itself correlated with but not strictly dependent on length. For example, "eeeeeeeeeeeeeeeeeeeee" has fewer distinct n-grams than "abcdefg", even though it is longer, because e, ee, eee, eeee, and eeeee all get repeated a lot, while all the n-grams in "abcdefg" are distinct.
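The distinct n-gram counts in that example are easy to check. This is a minimal sketch that assumes character n-grams of length 1 to 5 with no word-boundary padding; an actual TextCat implementation may mark word boundaries, which would change the exact counts but not the comparison.

```python
def distinct_ngrams(text, max_n=5):
    """Set of distinct character n-grams of length 1..max_n (no padding)."""
    return {
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    }


long_repetitive = "e" * 21   # a long run of a single letter
short_varied = "abcdefg"

print(len(distinct_ngrams(long_repetitive)))  # 5: e, ee, eee, eeee, eeeee
print(len(distinct_ngrams(short_varied)))     # 25: 7 + 6 + 5 + 4 + 3
```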

Another example: there was a small amount of Arabic text in the training data for the French query-based language model, and so if a relatively short Arabic text is compared against, say, French, English, German, Italian, and Spanish, it is possible for the Arabic text to score well enough compared to French to beat out the others, though none of the scores are very good.

Similarly, a string of emoji (e.g., 😠😩😲😞😵😰😒😍😤😜😝😋😘—and, yeah, we regularly get queries like this) might score a tie across all available languages. If there are few enough languages, the result can be insufficiently ambiguous to reject, and one of the languages is declared the "winner", even though its score is objectively poor.

To prevent these kinds of poor identification results, we can compute the maximum possible cost, which is what would be scored if all n-grams were "unknown", i.e., not in the language model. We compare the actual cost/score for a string for any language to that theoretical maximum and disallow any that are too close to that max.
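As a rough sketch of the idea, assuming a simplified rank-distance cost: each distinct n-gram in the input costs the difference between its rank in the input and its rank in the model, or a fixed penalty if the model doesn't contain it. The function names, the toy model, and the penalty value are all hypothetical, and the real TextCat scoring differs in its details.

```python
from collections import Counter


def ranked_ngrams(text, max_n=5):
    """Distinct n-grams of the input, most frequent first (ties alphabetical)."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def cost_and_max(text, model_ranks, unknown_penalty):
    """Return (actual cost, cost if every n-gram were unknown)."""
    grams = ranked_ngrams(text)
    cost = 0
    for rank, gram in enumerate(grams):
        if gram in model_ranks:
            # Capped so a known n-gram never costs more than an unknown one.
            cost += min(abs(rank - model_ranks[gram]), unknown_penalty)
        else:
            cost += unknown_penalty
    return cost, len(grams) * unknown_penalty


def passes_mpms(text, model_ranks, unknown_penalty, mpms):
    """True if the cost is no more than mpms times the maximum possible cost."""
    cost, max_cost = cost_and_max(text, model_ranks, unknown_penalty)
    return max_cost > 0 and cost / max_cost <= mpms


toy_model = {"a": 0, "b": 1, "ab": 2}   # n-gram -> rank; a made-up model
emoji = "\U0001F620\U0001F629"          # two emoji, unknown to the model

print(cost_and_max("ab", toy_model, 3))       # (2, 9)
print(cost_and_max(emoji, toy_model, 3))      # (9, 9): everything is unknown
print(passes_mpms(emoji, toy_model, 3, 0.85))
```

In this sketch a threshold of 1.0 lets everything through, so it amounts to no filtering at all.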

How close is too close? 95%, 90%, 70%, 50%? That's what we're here to find out!

Implementation Hiccup
I had a slight problem with implementation that became apparent when looking at the junk queries. I originally filtered the candidates above the max proportion of max score as early as possible. It turns out that this prevented a small number of junk queries from being marked unidentifiable for being too ambiguous.

As an example, suppose the max returned languages (MRL) is set to 1 and the results ratio (RR) is set to 1.05. That is, there can only be one language that scores within 5% of the best score (namely, the best score). If the best candidate scored 100, nothing else can score 105 or less, otherwise the result is "too ambiguous". In this case, say the max proportion of max score (MPMS) works out to be 103, and there was a second candidate that scored 104. Before taking MPMS into account, the score of 104 would be too close to the score of 100, and the result would be "too ambiguous". With MPMS in play, the 104 candidate is removed, and the result is no longer too ambiguous, though it is still bad.

It makes sense that if a score is both so close to the worst possible score that it is gibberish and so close to the best actual score that it is ambiguous, both conditions should be taken into account. Putting the MPMS check after the "too ambiguous" check (using max returned languages and results ratio) solves this problem.
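The ordering problem can be reproduced with the numbers from the example above (costs of 100 and 104, RR == 1.05, MRL == 1, and an MPMS cutoff that works out to 103). This is only a sketch of the two orderings; the function names and the language labels are placeholders.

```python
def too_ambiguous(scores, rr, mrl):
    """True if more than mrl candidates score within rr of the best cost."""
    best = min(scores.values())
    close = [lang for lang, cost in scores.items() if cost <= best * rr]
    return len(close) > mrl


def identify(scores, rr, mrl, max_allowed, mpms_first):
    """Return the winning language, or None if the input is rejected.

    max_allowed is the cost cutoff implied by MPMS (103 in the example).
    """
    if mpms_first:
        # Original ordering: drop near-max candidates before anything else.
        scores = {lang: c for lang, c in scores.items() if c <= max_allowed}
    if not scores or too_ambiguous(scores, rr, mrl):
        return None
    winner = min(scores, key=scores.get)
    # Revised ordering: the MPMS check only applies to the surviving winner.
    return winner if scores[winner] <= max_allowed else None


scores = {"lang_a": 100, "lang_b": 104}   # placeholder language labels

print(identify(scores, 1.05, 1, 103, mpms_first=True))   # lang_a
print(identify(scores, 1.05, 1, 103, mpms_first=False))  # None
```

Filtering first hides the 104 candidate from the ambiguity check, so the bad result slips through. Checking ambiguity first rejects the query.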

Despite the intuition that both ambiguity and MPMS should be considered, the overall performance with MPMS coming after the MRL/RR ambiguity check is slightly worse, and dips into unacceptability for some corpora for the 5K models (see below). I was hoping that MPMS would improve precision without affecting recall too much.

My original plan did not include considering non-language queries in assessing performance, so names, gibberish, and punctuation were ignored. Since then, a few interesting corner cases have come to light, and so now I'm assessing junk queries separately.

I don't think there's any reasonable approach to names; a name can be of one nationality, and have the distinctive characteristics of the national language, and belong to someone who is not from that country and doesn't have a wiki page on the corresponding wiki. For example, there are Italian-American actors with clearly Italian names, who only have pages on enwiki, and not itwiki. So, the wiki of the language of the country a name looks like it is from is a reasonable guess, but you can't really say whether it's right or wrong. With names of mixed ethnolinguistic origin, all bets are off!

Initial Results
I examined configs with a max proportion of max score (MPMS) from 0.1 to 1.0, in increments of 0.1. Since the optimal value for various configs was often around 0.9, and everything below 0.7 was horrible, I switched to looking at MPMS in increments of 0.05 from 0.6 to 1.0.

Based on earlier results, I set the maximum returned languages to 1, let the results ratio vary from 1.01 to 1.10, and looked at model sizes from 3000 to 10000 (in 1000 increments). The minimum input length I let alternate between 3 and 4; based on earlier results, 3 will do better, but 4 may be preferable for filtering out more junk, so I ran both to compare.

3K Models
Limiting options to the 3K models currently in use, the optimal results ratio (RR) skews a bit lower to 1.02. The optimal max proportion of max score (MPMS) is 0.9 (based on square error vs optimal among configs considered). Minimum input length (MIL) of 3 is better than 4 as expected, but not by a lot.

MIL \ MPMS      0.6     0.65      0.7     0.75      0.8     0.85      0.9     0.95        1
3           2751.65  1135.03   329.19    75.16     9.19     2.64     2.32     2.54     2.90
4           2783.82  1149.28   337.47    79.64    11.46     3.95     3.23     3.79     4.19

With min input length of 3 and MPMS of 0.9, the corpora for all wikis do better than current production baseline.

best    baseln   delta   optim   corpus
90.5%   88.2%     2.3%   90.5%   dewiki
84.7%   83.0%     1.7%   85.4%   enwiki
97.0%   95.6%     1.4%   97.0%   eswiki
91.4%   89.0%     2.4%   92.0%   frwiki
93.7%   92.2%     1.5%   94.1%   itwiki
96.4%   95.1%     1.3%   96.9%   jawiki
84.2%   82.3%     1.9%   85.0%   nlwiki
97.1%   96.9%     0.2%   97.4%   ptwiki
93.1%   92.4%     0.7%   93.2%   ruwiki

Square Error (Best vs Optimal): 2.00
Cumulative Improvement (Best vs Baseline): 13.4%
Mean (Min - Max): 1.5% (0.2% – 2.4%)

With min input length of 4 and MPMS of 0.9, there's a small decrease in overall performance compared to MIL == 3, but the corpora for all wikis do better than current production baseline, with English and Dutch improving slightly less than before:

best    baseln   delta   optim   corpus
90.5%   88.2%     2.3%   90.5%   dewiki
84.3%   83.0%     1.3%   85.4%   enwiki
97.0%   95.6%     1.4%   97.0%   eswiki
91.4%   89.0%     2.4%   92.0%   frwiki
93.7%   92.2%     1.5%   94.1%   itwiki
96.4%   95.1%     1.3%   96.9%   jawiki
84.1%   82.3%     1.8%   85.0%   nlwiki
97.1%   96.9%     0.2%   97.4%   ptwiki
93.1%   92.4%     0.7%   93.2%   ruwiki

Square Error (Best vs Optimal): 2.89
Cumulative Improvement (Best vs Baseline): 12.9%
Mean (Min - Max): 1.4% (0.2% – 2.4%)

5K Models
Considering models from 3K to 5K (currently available in production, though we use 3K models), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see a familiar pattern: larger models do better, and larger models prefer a larger results ratio (RR). Also, even the 3K models prefer a higher RR when compared to the larger models, presumably because the optimal F0.5 score for some languages is higher, and the trade-offs in the square error change.

At the optimal model size of 5K, the optimal max proportion of max score (MPMS) is 0.85, for both MIL of 3 or 4, though they differ on the optimal results ratio (RR), with MIL == 3 doing better at RR == 1.05, and MIL == 4 doing better at RR == 1.06, though the results are very close.

MIL == 3
MPMS \ RR     1.01    1.02    1.03    1.04    1.05    1.06    1.07    1.08    1.09    1.10
0.7         160.17  140.36  129.66  126.86  124.07  123.07  126.63  140.91  150.86  176.72
0.75         51.60   34.69   24.42   21.74   20.06   19.15   22.68   36.78   45.40   69.97
0.8          37.25   20.27    8.74    5.60    4.08    3.35    7.11   20.35   29.00   52.76
0.85         34.50   17.05    6.77    4.15    3.00    3.27    6.80   21.28   29.77   54.04
0.9          33.69   17.09    7.05    4.34    3.08    3.30    6.85   21.33   29.82   54.04
0.95         34.14   16.95    7.33    4.58    3.15    3.30    6.85   21.33   29.82   54.04
1            34.54   17.35    7.33    4.58    3.15    3.30    6.85   21.33   29.82   54.04

MIL == 4
MPMS \ RR     1.01    1.02    1.03    1.04    1.05    1.06    1.07    1.08    1.09    1.10
0.7         168.32  149.51  137.67  134.46  131.46  130.32  134.19  148.05  158.75  185.29
0.75         55.45   38.38   26.96   23.87   21.61   20.45   24.01   38.08   47.60   72.52
0.8          40.65   22.76   10.85    7.20    5.47    4.01    7.78   21.32   30.56   55.00
0.85         36.57   19.61    8.34    5.21    3.85    3.50    7.14   21.83   30.91   55.43
0.9          36.81   19.18    8.58    5.45    4.00    3.58    7.19   21.88   30.96   55.43
0.95         36.66   19.74    8.82    5.69    4.09    3.53    7.19   21.88   30.96   55.43
1            36.85   19.93    8.82    5.69    4.09    3.53    7.19   21.88   30.96   55.43

With MPMS == 0.85, MIL == 3, and RR == 1.05, there is no overall improvement over previous 5K models, with English and Italian taking a hit (within the acceptable 0.5% limit) and Japanese getting a benefit. Performance is still better than current production baseline (the earlier 5K performance is used as a baseline here).

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.4%   dewiki
85.4%   85.7%?   -0.3%   86.6%   enwiki
96.6%   96.6%     0.0%   97.0%   eswiki
92.2%   92.2%     0.0%   92.3%   frwiki
94.3%   94.4%?   -0.1%   94.8%   itwiki
96.8%   96.4%     0.4%   96.9%   jawiki
86.6%   86.6%     0.0%   86.6%   nlwiki
97.5%   97.5%     0.0%   97.7%   ptwiki
92.5%   92.5%     0.0%   93.5%   ruwiki

Square Error (Best vs Optimal): 3.00
Cumulative Improvement (Best vs Baseline): 0.0%
Mean (Min - Max): -0.0% (-0.3% – 0.4%)

With MPMS == 0.85, MIL == 4, and RR == 1.06, there is a slight overall improvement over previous 5K models, though French, Dutch, and Russian take a hit. Russian performance may be too poor, as it is even 0.6% below current production.

best    baseln   delta   optim   corpus
91.3%   91.1%     0.2%   91.4%   dewiki
86.3%   85.7%     0.6%   86.6%   enwiki
96.6%   96.6%     0.0%   97.0%   eswiki
92.0%   92.2%?   -0.2%   92.3%   frwiki
94.8%   94.4%     0.4%   94.8%   itwiki
96.8%   96.4%     0.4%   96.9%   jawiki
86.1%   86.6%?   -0.5%   86.6%   nlwiki
97.7%   97.5%     0.2%   97.7%   ptwiki
91.8%   92.5%!   -0.7%   93.5%   ruwiki

Square Error (Best vs Optimal): 3.50
Cumulative Improvement (Best vs Baseline): 0.4%
Mean (Min - Max): 0.0% (-0.7% – 0.6%)
[Unacceptable Performance Decrease for ruwiki]

N.B., these are all using 5K models with RR == 1.05 as a baseline, rather than the current production baseline. Got to keep improving, right?

Up to 10K models
Considering models from 3K to 10K (max available for query-based models without retraining), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see a similar pattern: MIL of 3 is slightly better than MIL of 4. The optimal MPMS though, is 1 (meaning, no filtering based on score vs max possible score), though the square error scores are very similar for MPMS >= 0.8.

The optimal model size is 9K, and the optimal results ratio (RR) is 1.06, as in earlier tests with up to 10K models, and the optimal MPMS value is 0.9 for both MIL == 3 and 4, though MPMS values as low as 0.8 score very similarly.

MIL \ MPMS      0.6     0.65      0.7     0.75      0.8     0.85      0.9     0.95        1
3            655.85   199.38    32.71     4.81     2.22     2.10     1.90     2.10     2.10
4            682.33   200.12    34.69     4.66     2.59     2.47     2.23     2.39     2.39

For MIL == 3 (vs earlier 5K baselines):

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.6%   dewiki
86.3%   85.7%     0.6%   86.8%   enwiki
97.0%   96.6%     0.4%   97.3%   eswiki
92.7%   92.2%     0.5%   93.2%   frwiki
94.9%   94.4%     0.5%   95.4%   itwiki
96.5%   96.4%     0.1%   96.9%   jawiki
88.4%   86.6%     1.8%   88.4%   nlwiki
97.6%   97.5%     0.1%   97.7%   ptwiki
93.1%   92.5%     0.6%   93.9%   ruwiki

Square Error (Best vs Optimal): 1.90
Cumulative Improvement (Best vs Baseline): 4.6%
Mean (Min - Max): 0.5% (0.0% – 1.8%)

For MIL == 4 (vs earlier 5K baselines), similar but slightly worse:

best    baseln   delta   optim   corpus
91.1%   91.1%     0.0%   91.6%   dewiki
86.0%   85.7%     0.3%   86.8%   enwiki
97.0%   96.6%     0.4%   97.3%   eswiki
92.7%   92.2%     0.5%   93.2%   frwiki
94.9%   94.4%     0.5%   95.4%   itwiki
96.6%   96.4%     0.2%   96.9%   jawiki
88.3%   86.6%     1.7%   88.4%   nlwiki
97.6%   97.5%     0.1%   97.7%   ptwiki
93.1%   92.5%     0.6%   93.9%   ruwiki

Square Error (Best vs Optimal): 2.23
Cumulative Improvement (Best vs Baseline): 4.3%
Mean (Min - Max): 0.5% (0.0% – 1.7%)

Per Language Analysis
Looking at the optimization report (for models up to 10K) for each language, we can see how the setting of a max proportion of max score affects the corpus of queries from the corresponding wiki.

Looking at the top 10% of configs for each language/corpus, MPMS scores from 0.6 or 0.7 to 1.0 all do reasonably well, indicating that it doesn't have much negative effect on "real language" queries in most corpora. Japanese, however, prefers MPMS >= 0.85.

If I allow the MPMS to optimize independently for each language (setting MIL to 3), the optimal model size is still 9K and the results ratio is still 1.06, but the overall performance is slightly better.

best    baseln   delta   optim   corpus
91.2%   91.1%     0.1%   91.2%   dewiki
86.3%   85.7%     0.6%   86.3%   enwiki
97.0%   96.6%     0.4%   97.0%   eswiki
92.7%   92.2%     0.5%   92.7%   frwiki
94.9%   94.4%     0.5%   94.9%   itwiki
96.6%   96.4%     0.2%   96.6%   jawiki
88.4%   86.6%     1.8%   88.4%   nlwiki
97.6%   97.5%     0.1%   97.6%   ptwiki
93.2%   92.5%     0.7%   93.2%   ruwiki

Square Error (Best vs Optimal): 0.00
Cumulative Improvement (Best vs Baseline): 4.9%
Mean (Min - Max): 0.5% (0.1% – 1.8%)

The optimal MPMS value by corpus, for 9K models and RR == 1.06 and MIL == 3:

MPMS        corpus
0.60-0.75   dewiki
0.90-1.00   enwiki
0.70-1.00   eswiki
0.70-1.00   frwiki
0.70-1.00   itwiki
0.80        jawiki
0.70-1.00   nlwiki
0.70-1.00   ptwiki
0.60-0.75   ruwiki

Junk Queries
Of course, the real purpose of this feature is to exclude junk queries, which are purposefully not included in the test corpora. So, I took a set of 731 junk queries I'd extracted for other purposes from English Wikipedia query logs, and tested the various parameters on it.

I investigated the full range of options: max proportion of max score (MPMS) from 0.1 to 0.5 in increments of 0.1, and from 0.6 to 1.0 in increments of 0.05. I set the maximum returned languages to 1, let the results ratio vary from 1.01 to 1.10, and looked at model sizes from 3000 to 10000 (in 1000 increments). The minimum input length I let alternate between 3 and 4. For languages to consider, I used the list I've been using for English.

Example junk queries include:
 * /1/3/44;4444zDzDzDzDdfaSsqwwsssssdaßaa
 * 4wwwww
 * 555555555555
 * Uuuyvhtredhress🔥🔥🔥👆✋✋✌💨💚💚💜💙💜
 * Rhfhddcxhxgxdtgcgh
 * Cxxza
 * a dv qAq
 * a dv qAq
 * a dv qAq

170 out of 2240 configs got the optimal performance on the junk set, namely no junk strings matching any language.
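The 2240 figure follows directly from the ranges described above. A quick sketch of the grid (the variable names are mine, and maximum returned languages is left out because it is fixed at 1):

```python
from itertools import product

# MPMS: 0.1 to 0.5 in steps of 0.1, then 0.6 to 1.0 in steps of 0.05.
mpms_values = [i / 10 for i in range(1, 6)] + [i / 100 for i in range(60, 101, 5)]
rr_values = [i / 100 for i in range(101, 111)]   # 1.01 to 1.10
model_sizes = list(range(3000, 10001, 1000))     # 3K to 10K
mil_values = [3, 4]

configs = list(product(mpms_values, rr_values, model_sizes, mil_values))
print(len(mpms_values), len(rr_values), len(model_sizes), len(mil_values))
print(len(configs))  # 14 * 10 * 8 * 2 = 2240
```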

Of those, MPMS was, unsurprisingly, most likely to be 0.1, with some 0.2 values. Smaller model sets were more likely to be included than larger by 4:1 (3K vs 8K+). The larger MIL value of 4 was more likely than 3 (!), and RR had no effect.

Junk vs 3K Models
Looking at the 3K model settings above (RR == 1.02, MIL at 3 or 4) and letting MPMS range from 0.6 to 1.0 in increments of 0.05, we get:

Junk Identified as some Language (out of 731)
MPMS \ MIL     3      4
0.6           71     70
0.65         154    153
0.7          242    241
0.75         335    333
0.8          418    416
0.85         468    466
0.9          471    469
0.95         471    469
1.0          471    469

MPMS == 1 has no effect, so we see that with MIL (minimum input length) set to 3, 471 out of 731 junk queries are identified as some language, so 260/731 aren't identified as any language without MPMS. Increasing MIL to 4 knocks out 2 more.

MPMS values down to 0.9 have no effect on filtering additional junk. Dropping MPMS to 0.85 or below filters out additional junk, but not very much at 0.85.

Recalling the 3K MIL/MPMS results, values down to 0.8 are perhaps plausible:

MIL \ MPMS      0.6     0.65      0.7     0.75      0.8     0.85      0.9     0.95        1
3           2751.65  1135.03   329.19    75.16     9.19     2.64     2.32     2.54     2.90
4           2783.82  1149.28   337.47    79.64    11.46     3.95     3.23     3.79     4.19

Junk vs up to 5K Models
Given that fewer junk queries get through with smaller models, we will only consider 5K models, as preferred by the language corpora above, with one eval for MIL == 3 and RR == 1.05, and one for MIL == 4 and RR == 1.06:

Junk Identified as some Language (out of 731)
MPMS \ MIL+RR   3 + 1.05   4 + 1.06
0.6                  105         97
0.65                 175        157
0.7                  253        222
0.75                 322        283
0.8                  359        314
0.85                 366        319
0.9                  367        320
0.95                 367        320
1.0                  367        320

With these settings, fewer junk queries are getting through, though MPMS >= 0.9 still has no effect, and 0.85 and 0.80 have little effect. The big difference here between the two columns is RR, since changing MIL seems to consistently only filter out 2 additional junk queries.

Recalling the 5K MIL/MPMS/RR results, values down to 0.8 are plausible, but not below:

MIL == 3
MPMS \ RR     1.05    1.06
0.7         124.07  123.07
0.75         20.06   19.15
0.8           4.08    3.35
0.85          3.00    3.27
0.9           3.08    3.30
0.95          3.15    3.30
1             3.15    3.30

MIL == 4
MPMS \ RR     1.05    1.06
0.7         131.46  130.32
0.75         21.61   20.45
0.8           5.47    4.01
0.85          3.85    3.50
0.9           4.00    3.58
0.95          4.09    3.53
1             4.09    3.53

Junk vs up to 10K Models
Looking at the previously optimized 9K models, with RR = 1.06:

Junk Identified as some Language (out of 731)
MPMS \ MIL     3      4
0.6          185    183
0.65         270    268
0.7          348    346
0.75         406    404
0.8          423    421
0.85         425    423
0.9          426    424
0.95         426    424
1.0          426    424

As before, MPMS >= 0.9 has no effect, and the effect from 0.8 to 0.85 is minimal.

Recalling the 9K MIL/MPMS results, values down to 0.75 are plausible, but not below:

MIL \ MPMS      0.6     0.65      0.7     0.75      0.8     0.85      0.9     0.95        1
3            655.85   199.38    32.71     4.81     2.22     2.10     1.90     2.10     2.10
4            682.33   200.12    34.69     4.66     2.59     2.47     2.23     2.39     2.39

"Proportion of Max Score" as a Score
Since max proportion of max score (MPMS) seeks to normalize TextCat's cost score against the worst possible score, it seems reasonable to try to use proportion of max score (PMS) as a general normalized score, with lower scores being better. (Though it's easy to flip them so higher scores are better by subtracting the score from 1, since all PMS scores are between 0 and 1.) Very low scores are extremely unlikely, since they would require matching the model n-gram distributions very closely, which is mathematically impossible for short query strings.

Using the optimally performing settings from above—model size at 9000, maximum returned languages set to 1, and results ratio at 1.06, and arbitrarily choosing minimum input length to be 3—I ran a report on the PMS scores for each language/corpus.

I looked at non-cumulative precision after dividing the 0-1 score range into 50, 20, 10, or 4 buckets. I would expect precision for sufficiently heavily-populated buckets to decrease as the score increased. For the fine-grained buckets (50 and 20 buckets), there was not a very clear progression. For the coarser-grained buckets (10 and 4 buckets), there was a reasonable trend for most corpora.

Interestingly, looking at 20 buckets (i.e., 0.05 increments), showed that most queries ended up in the 0.20 to 0.50 buckets (i.e., 0.15 < score ≤ 0.50), with peaks at 0.30, 0.35, or 0.40. For Japanese, most queries were in the 0.4 to 0.85 buckets (i.e., 0.35 < score ≤ 0.85), with a peak at 0.70, further demonstrating that the alphabetic languages are dissimilar from the non-alphabetic languages.

Actual junk queries (all from enwiki) tend to score in the 0.50 to 0.75 buckets (i.e., 0.45 < score ≤ 0.75), with a peak at 0.65, showing some separation between enwiki junk and alphabetic languages.
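The bucketing used above (each bucket named by its upper edge, so the 0.20 bucket holds 0.15 < score ≤ 0.20) and the per-bucket, non-cumulative precision can be sketched like this. The function names are hypothetical and the sample results are made up to show the shape of the output.

```python
import math
from collections import defaultdict


def bucket_label(score, n_buckets):
    """Upper edge of the bucket holding score, e.g. 0.17 -> 0.2 for 20 buckets."""
    # Round first so a score sitting exactly on an edge stays in that bucket
    # despite floating-point noise; a score of 0 goes in the first bucket.
    index = max(1, math.ceil(round(score * n_buckets, 9)))
    return index / n_buckets


def precision_by_bucket(results, n_buckets):
    """results: iterable of (pms_score, is_correct) pairs."""
    tally = defaultdict(lambda: [0, 0])   # bucket label -> [correct, total]
    for score, is_correct in results:
        cell = tally[bucket_label(score, n_buckets)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {label: c / t for label, (c, t) in sorted(tally.items())}


# Made-up identification results: (PMS score, was the language right?)
sample = [(0.31, True), (0.33, False), (0.62, False)]
print(precision_by_bucket(sample, 20))
```

Fine-grained bucketing leaves many buckets with only a handful of queries, which is one plausible reason the 50- and 20-bucket views show no clear trend.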

Summary & Current Recommendations

 * Setting a max proportion of max score has a non-negative effect on F0.5, at least with larger model sizes (i.e., 9K). This is mostly from not allowing wrong-character-set queries to be assigned a value for language identification.


 * The effect on actual junk queries is fairly minimal unless MPMS is rather aggressive (0.75 or below). Despite some corner cases (most addressed by minimum input length), a lot of the worst junk queries will get no results on any wiki, so improperly identifying them as a language is not a huge problem.


 * We probably could get some F0.5 improvements by setting MPMS by language, but I'm generally trying to avoid doing so for all of the non–language-specific features, both to keep things simple, and to avoid over-training on relatively small individual corpora. However, the one most obvious outlier is the Japanese Wikipedia corpus, which has many more Japanese and Chinese queries than any other corpus. The fact that those languages don't use a fairly constrained alphabet for writing could be relevant. Perhaps two settings, for alphabets and non-alphabets would be warranted. On the other hand, the total fluctuation is always less than 1.5%, and the Japanese corpus already has one of the highest baselines (> 95% F0.5).


 * I have some hope that a separately defined unknown-ngram penalty could help separate out junk queries.


 * As a score, MPMS isn't great. It roughly correlates with quality if broken into a small number of buckets, and junk queries tend to score higher/worse than non-junk queries.


 * We still need to see how this feature interacts with other potential features being considered.

New Baselines
There is enough going on here that it's getting hard to keep track of the level of performance improvements over the baseline. Since the currently deployed language models are 5K models (though we only use them as 3K models), I'm going to re-compute a new "baseline so far" for both 3K and up-to-5K models and the best options up to this point.

In order to decrease the size of the parameter space I need to investigate, and based on the experiments so far, I'm going to set Max Returned Languages to 1 (which seems to be the best). I'm also going to set Min Input Length to 3 and Max Proportion of Max Score to 0.85, even though those can hurt recall, because they are good at filtering some junk. This leaves only model size (3K to 5K; in 1K steps) and Results Ratio (1.01 to 1.10; in 0.01 steps) as variables to optimize for the new baselines.

Updated Report
I've added a couple of new columns and re-named one column in the report, so here's the new legend:
 * best is the best F0.5 score for that language within this bucket (in this case, only one config is in the bucket).
 * baseln is the baseline F0.5 score, i.e., the F0.5 score obtained in the original optimization, based generally on language selection.
 * baseΔ (formerly delta) is the increase from baseln to best.
 * %Δmax is the percentage that baseΔ is of the maximum possible improvement (i.e., reaching 100% F0.5). If the baseline is 90% and best is 92%, then %Δmax is 20%, because the max theoretical improvement would be 10%, and the actual baseΔ is 2%; 2%/10% = 20%.
 * optim is the best F0.5 obtained in this optimization, including all possible configurations.
 * optΔ is the decrease from optim to best, i.e., how much we are giving up by having a shared value instead of tuning by corpus.
 * Square Error is as above, and is the basis for selecting the best model.
 * Cumulative Improvement is the sum of all deltas, with mean, max, and min also shown.
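The derived columns in the legend can be computed as in the sketch below. Two readings here are inferences rather than anything stated outright: optΔ is taken as best minus optim, and Square Error as the sum of squared optΔ values (in percentage points) across corpora. Both are consistent with the numbers in the 3K report in this section, whose rows are used as the sample data. The function names are mine.

```python
def report_row(best, baseln, optim):
    """Derived columns for one corpus; inputs are F0.5 percentages."""
    base_delta = best - baseln
    # Share of the remaining headroom (up to 100%) that was gained.
    headroom = 100.0 - baseln
    pct_of_max = 100.0 * base_delta / headroom if headroom > 0 else 0.0
    opt_delta = best - optim
    return base_delta, pct_of_max, opt_delta


def square_error(rows):
    """Sum of squared gaps between the shared config and per-corpus optima."""
    return sum((best - optim) ** 2 for best, _, optim in rows)


def cumulative_improvement(rows):
    """Sum of baseΔ over all corpora."""
    return sum(best - baseln for best, baseln, _ in rows)


# (best, baseln, optim) per corpus, from the 3K Models report in this section.
rows_3k = [
    (90.5, 88.2, 90.5), (84.5, 83.0, 85.4), (97.0, 95.6, 97.0),
    (91.4, 89.0, 92.0), (93.7, 92.2, 94.1), (96.4, 95.1, 96.4),
    (84.2, 82.3, 85.0), (97.1, 96.9, 97.4), (93.1, 92.4, 93.1),
]

print(report_row(92.0, 90.0, 93.0))               # the legend's example: %Δmax of 20
print(round(square_error(rows_3k), 2))            # 2.06
print(round(cumulative_improvement(rows_3k), 1))  # 13.2
```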

3K Models
Only considering 3K models, the optimal results ratio is 1.02:

best    baseln   baseΔ   %Δmax   optim    optΔ   corpus
90.5%   88.2%     2.3%   19.5%   90.5%    0.0%   dewiki
84.5%   83.0%     1.5%    8.8%   85.4%   -0.9%   enwiki
97.0%   95.6%     1.4%   31.8%   97.0%    0.0%   eswiki
91.4%   89.0%     2.4%   21.8%   92.0%   -0.6%   frwiki
93.7%   92.2%     1.5%   19.2%   94.1%   -0.4%   itwiki
96.4%   95.1%     1.3%   26.5%   96.4%    0.0%   jawiki
84.2%   82.3%     1.9%   10.7%   85.0%   -0.8%   nlwiki
97.1%   96.9%     0.2%    6.5%   97.4%   -0.3%   ptwiki
93.1%   92.4%     0.7%    9.2%   93.1%    0.0%   ruwiki

Square Error (Best vs Optimal): 2.06
Cumulative Improvement (Best vs Baseline): 13.2%
Mean (Min - Max): 1.5% (0.2% – 2.4%)

All corpora do at least a bit better than the current production 3K baseline, with an average improvement of 1.5% in F0.5. Many of the gains are a substantial proportion of the max possible gain (i.e., %Δmax ≥ ~20%). English and Dutch still lag behind the others in overall performance.

Up-to-5K Models
Considering models up to 5K, the optimal model size is 5K, with a results ratio of 1.05:

best    baseln   baseΔ   %Δmax   optim    optΔ   corpus
91.1%   88.2%     2.9%   24.6%   91.4%   -0.3%   dewiki
85.4%   83.0%     2.4%   14.1%   86.6%   -1.2%   enwiki
96.6%   95.6%     1.0%   22.7%   97.0%   -0.4%   eswiki
92.2%   89.0%     3.2%   29.1%   92.3%   -0.1%   frwiki
94.3%   92.2%     2.1%   26.9%   94.8%   -0.5%   itwiki
96.8%   95.1%     1.7%   34.7%   96.9%   -0.1%   jawiki
86.6%   82.3%     4.3%   24.3%   86.6%    0.0%   nlwiki
97.5%   96.9%     0.6%   19.4%   97.7%   -0.2%   ptwiki
92.5%   92.4%     0.1%    1.3%   93.4%   -0.9%   ruwiki

Square Error (Best vs Optimal): 2.81
Cumulative Improvement (Best vs Baseline): 18.3%
Mean (Min - Max): 2.0% (0.1% – 4.3%)

All corpora do at least a bit better than the current production 3K baseline, though for Russian it's a very minor gain. Overall the average increase is better (2% increase in F0.5 rather than 1.5%), though Russian and Spanish do not do as well as with the 3K models. However, Spanish is one of the best-performing corpora, and Russian is still good, while the worst performing (English and Dutch) have bigger gains with 5K models.

Junk Queries
On the junk corpus (from English Wikipedia), the current intended production settings identify 697 / 731 (95.3%) of junk queries as being in a language.

With the 3K optimized settings above (Results Ratio == 1.02, Max Returned Languages == 1, Min Input Length == 3, Max Proportion of Max Score == 0.85), 468 / 731 (64.0%) of junk queries are tagged as being in a language.

With the up-to-5K optimized settings above (Model Size == 5K, Results Ratio == 1.05, Max Returned Languages == 1, Min Input Length == 3, Max Proportion of Max Score == 0.85), 366 / 731 (50.1%) of junk queries are tagged as being in a language.

New Baselines
I'll be using these baselines going forward, until such time as we need another new baseline. I can only hope that doesn't become necessary! We'll see.

Bucketing and Bonuses + Language Selection
AKA, a priori weighting.

(January 2017 - Phab task: T149322)

Background
The Phab ticket has several options, but I decided to try the simplest and most obvious: weighting the "host" language of the wiki by a constant bonus. Since this was a potential solution to the problem of excluding languages that generate too many false positives (e.g. French from enwiki), it also opened up the prospect of optimizing over language choice.

I manually re-optimized some enwiki language choices, and, for example, re-including French improved recall on queries in French without major effects on overall precision, which bumped F0.5 overall. I automated the process with modified hill-climbing for languages and coordinate descent in general. (See Framework Update, Jan 2017 above.)

Based on earlier experiments, I set minimum input length to 3 and max proportion of max score to 0.85. These settings decrease F0.5 a bit but block poor behavior in corner cases, and so shouldn't be optimized. Many languages were previously excluded for lack of a query-based model, so I enabled multiple language directories and allowed Wiki-text fallbacks. I tested allowing maximum returned languages to be something other than 1, but it never helps (down with ambiguity!). I allowed the results ratio and model size to vary. Model size was on the usual 3K to 10K scale, in 1K increments, and results ratio was originally 1.01 to 1.10, but sometimes it optimized to the max, so I increased the upper limit to 1.15.

For bucketing and bonuses, I let the bonus weight vary, and I decided to allow additional languages to be added, in order of how common they are in the corpus, to the set that receives the bonus. I considered all languages that had more than 10 examples in the corpus.

I originally let the bonus range from 0 to 0.05 ("5% bonus"), but it often optimized to the max value, so I increased the range to 0.10 and then to 0.20.
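To make the idea of a constant bonus concrete, here is a small sketch in Python. It assumes that TextCat scores are costs (lower is better) and that a bonus of 0.10 reduces the cost of each favored language by 10%. Both points are assumptions made for illustration, and the function name and the toy scores are hypothetical, not taken from TextCat.

```python
def apply_bonus(costs, boosted_langs, bonus):
    """Return a copy of costs with a proportional bonus for the boosted languages.

    costs: {language code: cost}, where a lower cost is a better match (assumed).
    boosted_langs: languages that get the bonus, e.g. the wiki's host language.
    bonus: e.g. 0.10 for a "10% bonus" (assumed to scale the cost by 1 - bonus).
    """
    return {
        lang: cost * (1.0 - bonus) if lang in boosted_langs else cost
        for lang, cost in costs.items()
    }


# Toy scores for a query on enwiki where French narrowly beats English.
costs = {"en": 1000.0, "fr": 960.0}
boosted = apply_bonus(costs, boosted_langs={"en"}, bonus=0.10)
print(min(costs, key=costs.get))      # best match before the bonus
print(min(boosted, key=boosted.get))  # best match after the bonus
```

With these toy numbers, the bonus is just large enough to flip a near-tie toward the host language, which is the effect that makes it possible to re-enable languages prone to false positives.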

Optimizing Languages
Unfortunately, some early design decisions made optimizing across multiple corpora incompatible with coordinate descent without a major refactor, so I optimized languages per corpus independently.

I optimized each corpus with coordinate descent over the full range of parameter settings to find the optimal language set. (This took about 5 minutes per language.)

Holding the language sets constant I meta-optimized over the full range of other parameters. (This took about 165 minutes total.)

I then re-optimized each corpus with the full range of languages, but all other parameters at their meta-optimal values. (This took < 1 minute per language.)
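For readers unfamiliar with coordinate descent, the per-corpus steps above can be sketched roughly as follows. This is a generic illustration in Python, not the wrapper's actual code: the scoring function is a toy stand-in for "run TextCat on the corpus and compute F0.5", and all names and the grid values are assumptions.

```python
def coordinate_descent(score, grid, start, max_rounds=5):
    """Maximize score(config) by tuning one parameter at a time.

    score: function mapping a config dict to a number (higher is better).
    grid: {parameter name: list of allowed values}.
    start: initial config dict with one value per parameter.
    """
    config = dict(start)
    best = score(config)
    for _ in range(max_rounds):
        improved = False
        for name, values in grid.items():
            # Sweep this parameter while holding all the others fixed.
            for value in values:
                candidate = {**config, name: value}
                candidate_score = score(candidate)
                if candidate_score > best:
                    config, best, improved = candidate, candidate_score, True
        if not improved:
            break  # a full pass changed nothing, so we are at a local optimum
    return config, best


def toy_score(cfg):
    """Toy objective with a single peak; a real run would evaluate F0.5."""
    return (95.0
            - 100 * abs(cfg["bonus"] - 0.10)
            - 50 * abs(cfg["results_ratio"] - 1.04)
            - abs(cfg["model_size"] - 5000) / 1000)


grid = {
    "bonus": [0.00, 0.05, 0.10, 0.15, 0.20],
    "results_ratio": [round(1.0 + i / 100, 2) for i in range(1, 16)],
    "model_size": list(range(3000, 11000, 1000)),
}
start = {"bonus": 0.00, "results_ratio": 1.01, "model_size": 3000}
print(coordinate_descent(toy_score, grid, start))
```

Each pass costs the sum of the grid sizes rather than their product, which is why it is so much cheaper than a full grid search. The price is that it can stop at a local optimum, consistent with the variation between runs noted for some corpora.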

Corpora with the least stable sets of languages (i.e., those where it was harder to get the optimal value using coordinate descent, and where there was some variation in the possible final sets between coordinate descent runs), namely English, Portuguese, and Russian, re-optimized to slightly different optimal language sets, with differences in the long tail of languages.

Since the difference was at most a couple tenths of a percent, I decided to stick with the language sets chosen for each corpus individually.

The optimal sets of languages for each corpus are below. In a number of cases, languages which are active still get no positive identifications. If a language has no effect at all, then it is possible that it is present more or less at random. However, it is also the case that having such languages prevents false negatives.
 * German: German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese (de, en, la, it, es, fr, zh, pl, vi)
 * English: English, Chinese, Spanish, Arabic, German, Persian, French, Indonesian, Polish, Russian, Vietnamese, Italian, Japanese, Portuguese, Czech, Bengali, Croatian, Hebrew, Norwegian, Afrikaans, Icelandic, Tagalog, Thai, Hungarian, Irish, Korean, Ukrainian, Urdu (en, zh, es, ar, de, fa, fr, id, pl, ru, vi, it, ja, pt, cs, bn, hr, he, no, af, is, tl, th, hu, ga, ko, uk, ur)
 * Spanish: Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, French, German (es, en, la, ru, zh, pt, it, fr, de)
 * French: French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Thai, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton (fr, en, ar, pt, de, es, ru, zh, nl, pl, it, th, sv, la, is, hy, hu, br)
 * Italian: Italian, English, German, Russian, Arabic, Chinese, Polish (it, en, de, ru, ar, zh, pl)
 * Japanese: Japanese, English, Chinese, Korean, German (ja, en, zh, ko, de)
 * Dutch: Dutch, English, French, German, Spanish, Latin, Chinese, Polish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech (nl, en, fr, de, es, la, zh, pl, ar, vi, pt, my, ko, hr, da, cs)
 * Portuguese: Portuguese, English, Tagalog, Russian, French (pt, en, tl, ru, fr)
 * Russian: Russian, English, Ukrainian, German, Georgian, Armenian, Latvian, Japanese, Finnish, Spanish, Arabic (ru, en, uk, de, ka, hy, lv, ja, fi, es, ar)

Bucketing and Bonus Results
For each corpus, I considered adding bonuses to not only the host language, but also other more commonly represented languages. So, for example, in the enwiki corpus, queries in Chinese (the second most common after English) are about six times more common than queries in Japanese, so it makes sense to favor Chinese over Japanese. To keep things simple, I used the "shrinking list" parameter type, so that the allowed options would include the first, first and second, first second and third, etc, but not first and fifth, for example.
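As an illustration of what the "shrinking list" parameter type allows, the sketch below enumerates the permitted options for a list of languages ordered from most to least common. It is written in Python with a hypothetical function name. The enwiki ordering of English then Chinese comes from the text, and the third entry is only a placeholder.

```python
def shrinking_list_options(ranked_langs):
    """Return every non-empty prefix of a list ordered from most to least common."""
    return [ranked_langs[:n] for n in range(1, len(ranked_langs) + 1)]


# "xx" stands in for whichever language is third most common.
for option in shrinking_list_options(["en", "zh", "xx"]):
    print(option)
# ['en']
# ['en', 'zh']
# ['en', 'zh', 'xx']
```

Restricting the choices to prefixes keeps the search linear in the number of languages, instead of exponential as it would be for arbitrary subsets.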

While per-language bonuses and bonuses on arbitrary sets of languages could give better results, that's a lot of complexity, and subject to overfitting. I was hoping to find a generally applicable principle (top n most common languages, languages with more than n occurrences, or languages that make up at least proportion p of the corpus).

Fortunately, the optimization consistently chose the first two languages, and only rarely the third (usually when it made no difference). For all but enwiki, the #2 is English (on enwiki, English is #1, and Chinese is #2), so I'm not quite sure whether the right generalization is the host language and the most common non-host language, or the host language and English, but they are the same thing in practice.

Below are the results of optimization in each of our three standard use cases: 3K, up to 5K, and up to 10K models.

3K Results
(vs 3K baseline)

The optimal bonus is 0.10 (i.e., 10%), and the optimal results ratio is 1.04:

best   baseln  baseΔ  %Δmax  optim   optΔ  corpus
92.7%   90.5%   2.2%  23.2%  93.6%  -0.9%  dewiki
93.0%   84.5%   8.5%  54.8%  93.3%  -0.3%  enwiki
97.0%   97.0%   0.0%   0.0%  97.5%  -0.5%  eswiki
93.8%   91.4%   2.4%  27.9%  93.8%   0.0%  frwiki
94.9%   93.7%   1.2%  19.0%  95.0%  -0.1%  itwiki
96.7%   96.4%   0.3%   8.3%  97.4%  -0.7%  jawiki
89.4%   84.2%   5.2%  32.9%  90.1%  -0.7%  nlwiki
97.7%   97.1%   0.6%  20.7%  97.7%   0.0%  ptwiki
94.9%   93.1%   1.8%  26.1%  95.8%  -0.9%  ruwiki

Square Error (Best vs Optimal): 2.95
Cumulative Improvement (Best vs Baseline): 22.2%
Mean (Min - Max): 2.5% (0.0% – 8.5%)

With the additional languages and the bonus for the host language and the second-most common language, the improvement is huge, with everything but Dutch jumping above 90%!

up to 5K Results
(vs 5K baseline)

The optimal bonus is 0.11 (i.e., 11%), the optimal results ratio is 1.05, and the optimal model size is 5K.

best   baseln  baseΔ  %Δmax  optim   optΔ  corpus
93.5%   91.1%   2.4%  27.0%  94.3%  -0.8%  dewiki
93.5%   85.7%   7.8%  54.5%  93.7%  -0.2%  enwiki
97.1%   96.6%   0.5%  14.7%  97.5%  -0.4%  eswiki
94.4%   92.2%   2.2%  28.2%  94.8%  -0.4%  frwiki
95.5%   94.4%   1.1%  19.6%  95.7%  -0.2%  itwiki
97.4%   96.4%   1.0%  27.8%  97.8%  -0.4%  jawiki
90.6%   86.6%   4.0%  29.9%  91.0%  -0.4%  nlwiki
97.7%   97.5%   0.2%   8.0%  97.9%  -0.2%  ptwiki
95.7%   92.5%   3.2%  42.7%  95.9%  -0.2%  ruwiki

Square Error (Best vs Optimal): 1.44
Cumulative Improvement (Best vs Baseline): 22.4%
Mean (Min - Max): 2.5% (0.2% – 7.8%)

Every corpus is above 90% F0.5! The gains for some corpora are nominally smaller than for the 3K models because the 5K baseline is better, though the overall gain is slightly higher (22.4% vs 22.2%).

up to 10K Results
(vs 5K baseline)

The optimal bonus is 0.14 (i.e., 14%), the optimal results ratio is 1.06, and the optimal model size is 9K.

In general, larger models end up with a larger results ratio and larger bonus.

best   baseln  baseΔ  %Δmax  optim   optΔ  corpus
93.6%   91.1%   2.5%  28.1%  94.3%  -0.7%  dewiki
93.9%   85.7%   8.2%  57.3%  94.3%  -0.4%  enwiki
97.7%   96.6%   1.1%  32.4%  98.1%  -0.4%  eswiki
95.1%   92.2%   2.9%  37.2%  95.7%  -0.6%  frwiki
95.9%   94.4%   1.5%  26.8%  96.2%  -0.3%  itwiki
97.5%   96.4%   1.1%  30.6%  98.0%  -0.5%  jawiki
91.9%   86.6%   5.3%  39.6%  92.4%  -0.5%  nlwiki
97.7%   97.5%   0.2%   8.0%  97.9%  -0.2%  ptwiki
95.3%   92.5%   2.8%  37.3%  96.4%  -1.1%  ruwiki

Square Error (Best vs Optimal): 3.01
Cumulative Improvement (Best vs Baseline): 25.6%
Mean (Min - Max): 2.8% (0.2% – 8.2%)

Every corpus except Dutch, which is still at 91.9%, is above 93%.

5K vs 9K Results
Below is a comparison of the F0.5 scores for the optimized configs for 5K ("up to 5K") and 9K ("up to 10K") models. 9K is generally better, except for the ruwiki corpus.

5K     9K         Δ  corpus
93.5%  93.6%   0.1%  dewiki
93.5%  93.9%   0.4%  enwiki
97.1%  97.7%   0.6%  eswiki
94.4%  95.1%   0.7%  frwiki
95.5%  95.9%   0.4%  itwiki
97.4%  97.5%   0.1%  jawiki
90.6%  91.9%   1.3%  nlwiki
97.7%  97.7%   0.0%  ptwiki
95.7%  95.3%  -0.4%  ruwiki

Cumulative Improvement (Best vs Baseline): 3.2%
Mean (Min - Max): 0.4% (-0.4% – 1.3%)

Until now, the "up to 10K" results (usually specifically with 9K models) have generally been significantly better than the 3K or 5K models, which suggested that we'd need to upgrade to ~10K models. These results are close enough that this may no longer be the case if the cost of 10K models is too high (though that doesn't seem to be the case).

Junk Check
Using the optimized values for enwiki above, I ran the configs against my corpus of 731 junk queries collected from enwiki.

vs 3K        (bonus == 0.10; model size == 3000; results ratio == 1.04): 472/731 = 64.6%
vs up to 5K  (bonus == 0.11; model size == 5000; results ratio == 1.05): 408/731 = 55.8%
vs up to 10K (bonus == 0.14; model size == 9000; results ratio == 1.06): 406/731 = 55.5%

Larger models seem to be better at filtering junk.

Additional Languages
When optimizing languages for the various corpora, I added additional languages (usually in non-Latin character sets) that appeared in the larger corpus that my annotated corpus came from—because they are usually easy to find and identification for those languages is high precision.

I added them all back and tested performance using the previously optimized values, generally expecting no change. That was the case for all except the itwiki corpus, in which adding Japanese caused a slight decrease in accuracy for Chinese, so I omitted it. The final language lists are below.
 * German: Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese (el, ru, ar, hi, th, ko, ja)
 * English: Greek, Telugu, Georgian (el, te, ka)
 * Spanish: Arabic, Japanese (ar, ja)
 * French: Greek, Hebrew, Korean (el, he, ko)
 * Italian: Greek, Korean (el, ko)
 * Japanese: Arabic, Hebrew (ar, he)
 * Dutch: Greek, Hebrew, Japanese, Russian (el, he, ja, ru)
 * Portuguese: Hebrew, Arabic, Chinese, Korean, Greek (he, ar, zh, ko, el)
 * Russian: Hebrew, Chinese (he, zh)

Final Language Sets

 * German: German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese (de, en, la, it, es, fr, zh, pl, vi, el, ru, ar, hi, th, ko, ja)
 * English: English, Chinese, Spanish, Arabic, German, Persian, French, Indonesian, Polish, Russian, Vietnamese, Italian, Japanese, Portuguese, Czech, Bengali, Croatian, Hebrew, Norwegian, Afrikaans, Icelandic, Tagalog, Thai, Hungarian, Irish, Korean, Ukrainian, Urdu, Greek, Telugu, Georgian (en, zh, es, ar, de, fa, fr, id, pl, ru, vi, it, ja, pt, cs, bn, hr, he, no, af, is, tl, th, hu, ga, ko, uk, ur, el, te, ka)
 * Spanish: Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, French, German, Arabic, Japanese (es, en, la, ru, zh, pt, it, fr, de, ar, ja)
 * French: French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Thai, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton, Greek, Hebrew, Korean (fr, en, ar, pt, de, es, ru, zh, nl, pl, it, th, sv, la, is, hy, hu, br, el, he, ko)
 * Italian: Italian, English, German, Russian, Arabic, Chinese, Polish, Greek, Korean (it, en, de, ru, ar, zh, pl, el, ko)
 * Japanese: Japanese, English, Chinese, Korean, German, Arabic, Hebrew (ja, en, zh, ko, de, ar, he)
 * Dutch: Dutch, English, French, German, Spanish, Latin, Chinese, Polish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Greek, Hebrew, Japanese, Russian (nl, en, fr, de, es, la, zh, pl, ar, vi, pt, my, ko, hr, da, cs, el, he, ja, ru)
 * Portuguese: Portuguese, English, Tagalog, Russian, French, Hebrew, Arabic, Chinese, Korean, Greek (pt, en, tl, ru, fr, he, ar, zh, ko, el)
 * Russian: Russian, English, Ukrainian, German, Georgian, Armenian, Latvian, Japanese, Finnish, Spanish, Arabic, Hebrew, Chinese (ru, en, uk, de, ka, hy, lv, ja, fi, es, ar, he, zh)

Summary and Recommendations

 * Most of the languages that got turned off for too many false positives in the early days can now be turned back on, so we should!
 * Enabling the basic bonus feature, and enabling it for the top 2 languages for each wiki, further improves performance. We should do that!
 * We still need to see how this feature interacts with other potential features being considered—but there's only one left: the unknown n-gram penalty.

Unknown n-gram Penalty
(January 2017 — Phab task: T151230)

Background
During one of my earlier experiments, I tried running a 6K model size against model files that only have 5K n-grams. In terms of scoring it makes no difference, except for the penalty associated with unknown n-grams (i.e., those found in the text to be identified that are not found in the model we are trying to match). The unknown n-gram penalty is the model size. In this case, the "6K" model, which only differed from the 5K model in the unknown n-gram penalty, was one of the better performing options. So, it seemed like a good idea to investigate this more systematically.

I considered both positive and negative unknown n-gram penalties. Positive values indicate that we should be more punitive for unknown n-grams (which could be unknown characters—like Δ compared to English—or unknown character sequences—like "sch" in Spanish)—so any uncharacteristic n-gram is evidence of a poor match. Negative values indicate that we should be less punitive and require more evidence than a few n-grams, since, for example, even though this sentence has a Δ in it, it's clearly English: "baseΔ (formerly delta) is the increase from baseln to best."

Experiments
I started with unknown n-gram penalty values ranging from -500 to 2500 in increments of 500, and later -2000 to 10000 in increments of 1000. It's hard to be sure what the right range or resolution is here. I also let results ratio range from 1.00 to 1.15, model size from 3K to 10K, and the top-language bonus range from 0 to 0.20. Maximum returned languages was held at 1, minimum input length at 3, and max proportion of max score at 0.85.

I started experimenting with the enwiki corpus, which I optimized independently with coordinate descent. The improvement over that achieved by "bucketing and bonuses" was very small, typically 0.1%, though I did find one set of params with an improvement of 0.3%.

I similarly optimized each of the other corpora individually. Improvements ranged from 0 to 0.3%. Optimal and near-optimal unknown n-gram penalty values generally ranged from -1000 to 3000, with outliers as high as 10000 (though in that one case, jawiki, all the values performed similarly well, so the penalty turned out not to be doing much at all). The bonus scores were mostly between 0.14 and 0.20, the model sizes mostly 7K to 10K, and the results ratios mostly from 1.04 to 1.09.

With those more limited ranges, I set up a grid search for meta-optimization across all the corpora. As usual, it is possible to optimize the unknown n-gram penalty for each corpus, but a more general value is much more appealing: working with all nine corpora at once is less prone to overfitting, and it's less work in the future to add new languages.
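The selection step of the meta-optimization, as described in the Optimization Framework section, can be sketched as follows: among the configs tried in the grid search, pick the one with the smallest squared "error" from each corpus's own optimum, and reject any config that drops a corpus more than a fixed amount below its baseline. This is an illustrative sketch in Python, not the wrapper's code. The function name, the data layout, and the toy scores (with deliberately generic corpus and config names) are all made up.

```python
def pick_collective_config(scores, baseline, max_drop=0.5):
    """Choose the config with the least squared error from per-corpus optima.

    scores: {config label: {corpus: F0.5 in %}} for every config in the grid.
    baseline: {corpus: F0.5 in %} on the current config.
    max_drop: largest allowed decrease from baseline for any single corpus.
    """
    corpora = list(baseline)
    # Each corpus's individual optimum over all configs tried.
    optimum = {c: max(per[c] for per in scores.values()) for c in corpora}
    best_cfg, best_err = None, float("inf")
    for cfg, per in scores.items():
        # Hard constraint: no corpus may fall too far below its baseline.
        if any(baseline[c] - per[c] > max_drop for c in corpora):
            continue
        err = sum((optimum[c] - per[c]) ** 2 for c in corpora)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err


# Toy numbers, invented purely to exercise the function.
toy_scores = {
    "config_a": {"corpus_1": 94.4, "corpus_2": 93.0},
    "config_b": {"corpus_1": 94.0, "corpus_2": 94.4},
    "config_c": {"corpus_1": 93.9, "corpus_2": 94.6},
}
toy_baseline = {"corpus_1": 93.9, "corpus_2": 93.6}
print(pick_collective_config(toy_scores, toy_baseline))
```

In the toy data, config_a is the best for corpus_1 but is rejected because corpus_2 falls 0.6 below its baseline, and config_b wins on squared error over config_c.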

Results
Individual Optimization: Below is a table showing the F0.5 optimized under "buckets and bonuses" (B&B) and then individually optimized with unknown n-gram penalties (UnP).

    B&B    UnP indiv  Δ
de  94.3%  94.6%      0.3%
en  94.3%  94.4%      0.1%
es  98.1%  98.1%      0.0%
fr  95.7%  95.9%      0.2%
it  96.2%  96.2%      0.0%
ja  98.0%  98.1%      0.1%
nl  92.4%  92.5%      0.1%
pt  97.9%  98.0%      0.1%
ru  96.4%  96.6%      0.2%

Meta-optimization: Below is the result of the meta-optimization across all 9 corpora. Due to rounding, the mean improvement shows as "0.0%" when it is actually close to (but less than) 0.05%, which isn't really anything to get excited about. Requiring all corpora to use the same value results in three corpora doing worse than the "bucketing and bonus" baseline.

Optimal config: boosted bonus == 0.13; unknown n-gram penalty == 1000; model size == 10000; results ratio == 1.06

best   baseln  baseΔ  %Δmax  optim   optΔ  corpus
93.4%  93.6%?  -0.2%  -3.1%  94.4%  -1.0%  dewiki
93.6%  93.9%?  -0.3%  -4.9%  94.4%  -0.8%  enwiki
98.0%  97.7%    0.3%  13.0%  98.1%  -0.1%  eswiki
95.3%  95.1%    0.2%   4.1%  95.9%  -0.6%  frwiki
95.9%  95.9%    0.0%   0.0%  96.2%  -0.3%  itwiki
97.6%  97.5%    0.1%   4.0%  98.0%  -0.4%  jawiki
91.9%  91.9%    0.0%   0.0%  92.5%  -0.6%  nlwiki
97.5%  97.7%?  -0.2%  -8.7%  98.0%  -0.5%  ptwiki
95.8%  95.3%    0.5%  10.6%  96.6%  -0.8%  ruwiki

Square Error (Best vs Optimal): 3.51
Cumulative Improvement (Best vs Baseline): 0.4%
Mean (Min - Max): 0.0% (-0.3% – 0.5%)

Proportional Penalty
As I was writing this up, I realized that there are (at least) two reasonable ways of generalizing the cost function to an unknown n-gram.

The current method gives an unknown n-gram a penalty of the maximum possible matching score plus one. For example, if we are using a 500 n-gram model, then if the most common n-gram in a sample is the 500th n-gram in the model, the cost of that mismatch is 500 - 1 == 499, one less than the model size. We can generalize this to "always give the max cost for an unknown n-gram" (rounded up to the model size). If the most common n-gram is unknown, the penalty is 500; if the 500th most common n-gram is unknown, the penalty is 500.

On the other hand, if we have a 500 n-gram model, and an n-gram is the 499th most common in the sample but 500th in the model, the cost is 500 - 499 == 1. If it was 501st, a more reasonable penalty would be just 2. We can generalize this to "assume the unknown n-gram is tied in rank with the last n-gram in the model". If the most common n-gram is unknown, the penalty is 499; if the 500th most common n-gram is unknown, the penalty is 0.
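The two ways of costing an unknown n-gram described above can be put side by side in a short sketch. It is written in Python and assumes the usual rank-difference ("out of place") scoring implied by the examples, with 1-based ranks. The function and parameter names are hypothetical and this is not TextCat's code.

```python
def out_of_place_cost(sample_ngrams, model_ranks, model_size, proportional=False):
    """Sum of rank differences between a sample and a language model.

    sample_ngrams: the sample's n-grams, ordered from most to least common.
    model_ranks: {n-gram: 1-based rank in the model}.
    model_size: number of n-grams in the model (e.g. 500).
    proportional: if False, an unknown n-gram always costs model_size (the
        current "max cost" method); if True, it is treated as tied in rank
        with the last n-gram in the model.
    """
    total = 0
    for sample_rank, ngram in enumerate(sample_ngrams, start=1):
        if ngram in model_ranks:
            total += abs(model_ranks[ngram] - sample_rank)
        elif proportional:
            total += abs(model_size - sample_rank)
        else:
            total += model_size
    return total


# Toy 3-n-gram model; "zz" is unknown to it.
model_ranks = {"th": 1, "he": 2, "in": 3}
sample = ["he", "th", "zz"]
print(out_of_place_cost(sample, model_ranks, model_size=3))                     # 5
print(out_of_place_cost(sample, model_ranks, model_size=3, proportional=True))  # 2
```

With a 500 n-gram model, an unknown n-gram at sample rank 1 costs 500 under the first scheme and 499 under the second, while one at sample rank 500 costs 500 and 0 respectively, matching the figures in the text.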

Sounds good, but it didn't do much. Using 5 iterations of coordinate descent on each of the nine corpora individually, there was no marked improvement. The best options found (all close to the "buckets and bonuses" optima) included one or two variants with the proportional penalty, but more without. The numbers were similar for the top 5% of results.

Overall, it just didn't matter.

Summary & Recommendations

 * A per-wiki unknown n-gram penalty makes a small difference in overall F0.5 score for most corpora.
 * The unknown n-gram penalty can't be optimized across all the corpora; they each want their own value.
 * The "proportional penalty" isn't terribly useful, either.
 * It's not worth pursuing, but it's worth keeping in our bag of tricks should it be useful in the future.
 * It's only a few lines of code, so I'm not going to clutter up TextCat with more stuff we aren't going to use.