User:TJones (WMF)/Notes/TextCat Improvements

November 2016 — See TJones_(WMF)/Notes for other projects.

Intro
Related to work on TextCat and Confidence, I'm trying several approaches to improving TextCat performance. Phab parent task: T140289

I expect to write up basic results on new features as I add them, but I may come back to them as I add more features that allow them to interact. As an example, adding the ability to use language models from multiple directories can give immediate improvements when there are distinctive languages to be considered (e.g., Burmese, which has its own writing system) but not in others (e.g., Catalan, which is too similar to Spanish to include on eswiki without also making improvements that give a boost to the host language).

Inspired by Mikhail's always excellent linking in his reports and by the recent Why We Read Wikipedia presentation (which shows that common motivations for reading Wikipedia include boredom & randomness and intrinsic learning), I've sometimes included more than the usual number of links in the sections below, for the reader's amusement and edification.

Optimization Framework
(November 2016 — Phab task: T149314)

I've slapped together a wrapper around my existing language identification tool (i.e., the Perl version of TextCat) and my identification results evaluation tool; it can perform a grid search optimization (of F0.5) over multiple dimensions of hyper-parameters, which can be numeric or nominal.

Since any problem in computer science can be solved by introducing another level of indirection, I expanded the wrapper to include the ability to run itself over several different corpora, note the optimal config and F0.5 score for each corpus, and then choose the configuration that gives the best collective results. "Best" is measured by the square "error" from the individual optimal configurations for each corpus, along with a hard constraint that no corpus decrease from its baseline performance (i.e., on the current config) by more than a fixed amount (0.5% F0.5 for now). Evaluation can be done on all varying parameters at once, or on a "relevant" subset, each combination of which is represented by the optimal configuration per corpus, found by allowing all other parameters to vary.

Using the multi-corpora optimization allows us to find a generic set of hyper-parameters that apply across all corpora/wikis, which discourages overfitting to any particular corpus.
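The selection step can be sketched roughly as follows. This is only an illustration, not the actual wrapper: the names (`grid`, `choose_config`, `MAX_DROP`), the `score` callback, and the toy numbers are all assumptions made up for the example.

```python
from itertools import product

MAX_DROP = 0.5  # hard limit: no corpus may drop more than 0.5 points of F0.5 below baseline


def grid(params):
    """Yield every combination of hyper-parameter values as a dict."""
    names = list(params)
    for values in product(*(params[n] for n in names)):
        yield dict(zip(names, values))


def choose_config(configs, score, corpora, baseline):
    """Pick the config with the lowest square error vs. each corpus's own optimum.

    score(config, corpus) is a hypothetical callback returning F0.5 in percent.
    """
    configs = list(configs)
    results = [{c: score(cfg, c) for c in corpora} for cfg in configs]
    optimum = {c: max(r[c] for r in results) for c in corpora}
    best_cfg, best_err = None, None
    for cfg, r in zip(configs, results):
        if any(baseline[c] - r[c] > MAX_DROP for c in corpora):
            continue  # violates the hard constraint on per-corpus drops
        err = sum((optimum[c] - r[c]) ** 2 for c in corpora)
        if best_err is None or err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err


# Toy scores (made up) for two corpora over a 2 x 2 grid.
TOY = {
    (1.03, 3000): {"xwiki": 90.0, "ywiki": 80.0},
    (1.03, 5000): {"xwiki": 91.0, "ywiki": 82.0},
    (1.05, 3000): {"xwiki": 93.0, "ywiki": 78.0},  # best for xwiki, but ywiki drops too far
    (1.05, 5000): {"xwiki": 92.0, "ywiki": 83.0},
}


def toy_score(cfg, corpus):
    return TOY[(cfg["results_ratio"], cfg["model_size"])][corpus]


params = {"results_ratio": [1.03, 1.05], "model_size": [3000, 5000]}
baseline = {"xwiki": 90.0, "ywiki": 80.0}
best_cfg, best_err = choose_config(grid(params), toy_score, ["xwiki", "ywiki"], baseline)
```

In the toy data, the configuration that is best for xwiki alone is rejected because ywiki falls 2 points below its baseline; the winner is the one closest to both per-corpus optima.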

It's spiffy.

Scoring
I use three metrics, which tend to line up: (1) "Cumulative Improvement", which is the total improvement across all corpora; (2) "Square Error (vs Optimal)", which squares the difference between the best performance for this corpus with the relevant parameters and the best performance of this corpus in the entire test run; and (3) a general preference for no corpus to do worse than the baseline (i.e., delta >= 0 for all corpora), with a hard limit on not dropping by more than 0.5% from the baseline.

I have a fourth, experimental score, called "X Score", which changes without notice, but currently computes the diff between (a) the square of the diff between current baseline and 100% and (b) the square of the diff between the current best and 100%, for each corpus, and sums them. It favors moving significantly closer to 100% than the current baseline.

I generally rank results by square error because it penalizes big drops away from the best possible score for a corpus. When these metrics disagree strongly, life is "interesting".
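As a rough illustration of the "X Score" as currently described (the real tool's arithmetic may differ, and the function name and numbers here are made up), the same 2-point gain counts for more when the corpus starts farther from 100%:

```python
def x_score(baseline, best):
    """Sum over corpora of (100 - baseline)^2 - (100 - best)^2, scores in percent."""
    return sum((100 - baseline[c]) ** 2 - (100 - best[c]) ** 2 for c in baseline)


baseline = {"xwiki": 80.0, "ywiki": 95.0}
gain_on_weak = {"xwiki": 82.0, "ywiki": 95.0}    # +2 points on the weaker corpus
gain_on_strong = {"xwiki": 80.0, "ywiki": 97.0}  # +2 points on the stronger corpus

weak = x_score(baseline, gain_on_weak)      # 400 - 324 = 76
strong = x_score(baseline, gain_on_strong)  # 25 - 9 = 16
```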

Initial Findings
(November 2016 — Phab task: T149316)

I reviewed the previous language model optimization for TextCat for English, French, Italian, Spanish, German, Portuguese, Russian, Japanese, and Dutch Wikipedias, looking at languages for which there were no query-based models available, but for which there are Wiki-text-based models available. The same issues are at work in these cases: very distinctive languages are easy (e.g., those with a unique writing system), and closely related languages are hard (e.g., Catalan and Spanish, Afrikaans and Dutch). Another issue is the set of languages and number of instances of each language represented in the sample. In some cases there are many languages that are not covered, in others, very few; in all cases, there weren't very many instances of these previously not covered languages.

Summary of Results
We can get a tenth of a percentage point or two increase in F0.5 for some wikis by adding previously not covered languages. For many wikis, there weren't many languages to add, and when there were, often they were too similar to others to be of help. There are a number of languages that we could add easier-to-build Wiki-text models for to expand coverage for certain wikis and earn a few more tenths of F0.5.

Details

 * English: I re-optimized from the full set of relevant languages, and arrived at a similar set of optimal languages. Adding Slovak (sk) improved recall for Slovak, but it was offset by other errors. It is likely that adding Wiki-text models for Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am) could make improvements at the margins.
 * French: Adding Hungarian gets the one Hungarian example in the sample, increasing F0.5 by 0.1%. Other languages (Breton, Icelandic, Latin) have too many false positives.
 * German: Latin was the only language not already accounted for, and adding it didn't help.
 * Spanish: Catalan was the only language not accounted for, and adding it didn't help.
 * Italian: Latin and Romanian were the only languages not accounted for, and adding them didn't help.
 * Portuguese: Tagalog and Latin were the only languages not accounted for, and adding them didn't help.
 * Russian: Adding Finnish (fi) improved recall for Finnish, but it was offset by other errors. It is likely that adding Wiki-text models for Azerbaijani (az) or other languages could make improvements at the margins.
 * Japanese: Kazakh was the only language not accounted for, and no model is available for it.
 * Dutch: Adding Burmese gets the one Burmese example in the sample, increasing F0.5 by 0.2%. Adding Finnish and Croatian (fi, hr) improved recall for Finnish and Croatian, but it was offset by other errors.

Current Recommendations

 * I can add Burmese, Oriya, Malayalam, and Kannada to the list of "easy" language models to be added when present (along with Greek, Korean, Hebrew, Japanese, Thai, Telugu, Georgian, and when there are no examples of "competing" languages using the same script, Arabic, Russian, and Hindi).
 * We could include the models for these "easy" languages with very distinctive writing systems in general (and not only when they are present in a largish sample) because identifying them is "easy" even if they are very low frequency. The main concern is the computational cost of including more models, though I believe that the current implementation of the PHP version of TextCat in production loads all available models anyway.
 * There are a couple of languages present in the samples above that I should build Wiki-text models for and re-assess: Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am), particularly Khmer and Amharic since they have distinctive writing systems.
 * I expect that as other features in this list interact with this feature, and as precision generally improves, it will be possible to include additional models, including Wiki-text-based models.

Maximum Returned Languages and Results Ratio + Model Size
(November 2016 — Phab task: T149321)

Background
The current settings in production for TextCat for the maximum returned languages and results ratio are inherited from the original TextCat implementation. We were getting better results from TextCat than other language identifiers using those defaults, so we didn't originally mess with them.

However, the original TextCat implementation was built primarily to work with longer texts and smaller n-gram models, so it makes sense that there is some improvement to be had here.

For reference: I started out intending to look at variations in maximum returned languages and results ratio, but remembered that in my previous optimizations, which were based on languages to be considered, model size also turned out to be interesting, so I decided to add it into the mix here.
 * The results ratio is set to 1.05 by default, which means that any language model that has a cost less than 1.05 times the lowest cost (i.e., within 5% of the lowest cost) is reported as an alternative. So, TextCat can and will report that a particular string looks like it is Spanish, but the second best guess is Portuguese, and the third best guess is maybe Italian.
 * The maximum returned languages is set to 5 by default, which is the maximum number of languages that can be returned by TextCat. If more languages than that are within the results ratio, then TextCat can’t make up its mind, and returns “unknown” as the detected language.
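The way these two settings interact can be sketched as below. This is a simplified illustration, not the production code: `pick_languages` and the cost values are hypothetical, and an empty list stands in for "unknown".

```python
def pick_languages(costs, results_ratio=1.05, max_returned=5):
    """Return candidate languages, best first, or [] for "unknown".

    costs maps a language code to its model cost (lower is better).
    """
    ranked = sorted(costs, key=costs.get)
    best = ranked[0]
    cutoff = results_ratio * costs[best]
    # The best match is always a candidate; others must be under the cutoff.
    candidates = [best] + [lang for lang in ranked[1:] if costs[lang] < cutoff]
    if len(candidates) > max_returned:
        return []  # too many near-ties: give up
    return candidates


costs = {"es": 1000, "pt": 1030, "it": 1045, "fr": 1200}  # made-up costs

defaults = pick_languages(costs)                    # es, pt, it are within 5%
strict = pick_languages(costs, max_returned=1)      # three candidates > 1, so unknown
narrow = pick_languages(costs, results_ratio=1.02)  # only es is within 2%
```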

The model size is the number of 1- to 5-grams from the model training corpus that are retained, in frequency order. So a model size of 1000 means that the 1000 most frequent n-grams for each language are compared to the 1000 most frequent n-grams from the text to be identified, and the best match wins (modulo the settings for results ratio and maximum returned languages).
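A minimal sketch of how model size enters the comparison, assuming the classic out-of-place rank measure with the unknown n-gram penalty set to the model size. The tokenization, the `_` word-boundary padding, the tie-breaking, and the tiny training strings are assumptions of the sketch, not details of the Perl or PHP implementations.

```python
from collections import Counter


def ngram_ranks(text, model_size, max_n=5):
    """Map the model_size most frequent 1- to max_n-grams to their rank (0 = most frequent)."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ordered[:model_size])}


def cost(text_ranks, model_ranks, unknown_penalty):
    """Sum of rank differences; n-grams missing from the model cost unknown_penalty."""
    total = 0
    for gram, rank in text_ranks.items():
        if gram in model_ranks:
            total += abs(rank - model_ranks[gram])
        else:
            total += unknown_penalty
    return total


size = 1000
models = {
    "en": ngram_ranks("the quick brown fox jumps over the lazy dog and the cat", size),
    "es": ngram_ranks("el zorro salta sobre el perro y el gato duerme", size),
}
query = ngram_ranks("the dog and the fox", size)
costs = {lang: cost(query, model, unknown_penalty=size) for lang, model in models.items()}
best = min(costs, key=costs.get)
```

Real models are trained on far more text than these one-line samples; the point is only that the lowest cost wins and that every unknown n-gram adds a penalty tied to the model size.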

The original TextCat implementation used a model size of 400 n-grams; it was intended to be used on longer texts—where actual statistical tendencies of a language can emerge—at a time when Moore's Law had not been chugging along for an additional couple of decades. We've been using a model size of 3,000 n-grams. The PHP implementation in production and my updated Perl implementation have 5,000 n-grams available, though I have kept my own 10,000 n-gram versions of the query-based models around for development and testing.

Data
I have the hand-tagged corpora of 500+ poorly-performing queries from nine Wikipedias that I had previously used. The nine codes/languages are: de/German, en/English, es/Spanish, fr/French, it/Italian, ja/Japanese, nl/Dutch, pt/Portuguese, and ru/Russian.

Initial Config
I set up the optimizer to consider maximum returned languages thresholds from 1 to 10 (current default is 5), and results ratio values from 1.00 to 1.10 (in increments of 0.01; current default is 1.05). I also set up model size to vary from 1K to 10K (in increments of 1K; current default is 3K).

For each corpus, I kept the current list of languages to be considered, as previously optimized. The languages to consider are a subset of available/relevant languages, which makes for a more difficult optimization problem. There are 2^n possible subsets, which is not as amenable to an exhaustive grid search. I have some ideas for simple hill-climbing that should be O(n) rather than O(2^n) in the number of languages, but for now they are held constant per corpus.
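One possible shape for a linear-time hill climb is a single greedy pass that toggles each language once and keeps the change only when the score improves. This is a guess at what such a search could look like, not the approach actually planned; `evaluate` is a hypothetical callback returning F0.5 for a set of languages, and the gains are invented.

```python
def hill_climb(candidates, start, evaluate):
    """Single greedy pass: toggle each candidate language once, keep improvements."""
    current = set(start)
    best = evaluate(current)
    for lang in candidates:
        trial = current ^ {lang}  # add the language if absent, drop it if present
        score = evaluate(trial)
        if score > best:
            current, best = trial, score
    return current, best


calls = []


def toy_evaluate(langs):
    """Made-up scoring: "my" helps a little, "ca" hurts (confusable with "es")."""
    calls.append(frozenset(langs))
    gains = {"en": 5.0, "es": 3.0, "my": 0.5, "ca": -1.0}
    return 80.0 + sum(gains[lang] for lang in langs)


chosen, score = hill_climb(["en", "es", "my", "ca"], {"en", "es"}, toy_evaluate)
```

A single pass needs n + 1 evaluations for n candidate languages, but it can miss combinations that only help together, which an exhaustive search over 2^n subsets would find.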

I did mess around briefly with the types of models used (query-based vs Wiki-text-based). The benefit of the query-based models is still clear: they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% of F0.5.

Maximum Returned Languages
In all my tests, maximum returned languages (MRL) always heavily optimized towards 1; i.e., if there's any ambiguity about what language (based on the results ratio), give up!

Below are the stats for MRL while allowing results ratio (RR) and model size to vary. So for MRL == 1, the value reported is with the best possible values for RR and model size, which may be different for MRL == 2. The Square Error ("SqErr" below) is the sum of the squares of the differences between the best F0.5 score for each corpus at that MRL value, and the optimal F0.5 value for each corpus across all config settings. Since it is 0 at MRL == 1, the best F0.5 value for every corpus has MRL == 1.

MRL         1       2       3       4       5       6       7       8       9      10
SqErr    0.00   61.01   62.06   66.70   75.41   77.65   77.65   77.65   78.60   78.60

So, from here on out, I'm setting MRL to 1 for all other tests, to reduce the dimensionality for analysis.

Proportional Limits
I originally had an idea to explore proportional settings for maximum returned languages (since some wikis, particularly enwiki, have many more languages being considered than others). I ran a few quick tests on that before abandoning the idea entirely, as the generally optimal MRL always tends toward 1, so it doesn't matter.

Since we don't need proportional limits, no code updates are needed to TextCat itself.

Up to 10K models
Holding the MRL at 1 and allowing the results ratio and model size to vary, we get the following table of the square error. The best (lowest) value for each model size is marked with an asterisk.

RR \ Model     1000      2000      3000      4000      5000      6000      7000      8000      9000     10000
1.00         279.77    172.26    104.28    110.21    116.72    118.48    111.17     99.92    100.41    101.28
1.01         132.81     82.36     54.24     52.27     58.81     66.02     59.59     58.10     54.38     46.67
1.02          87.87     54.67     30.72     29.82     33.51     32.45     31.99     36.02     35.46     30.32
1.03          81.87*    40.64     30.24     20.20     16.89     17.04     17.40     16.99     14.85     16.80
1.04         117.74     39.40*    26.86*    14.41*    13.28      9.59     11.02      9.04      7.44      6.69
1.05         201.07     56.32     33.13     18.50     10.05*     7.40*     4.29      4.16      4.98      4.20
1.06         366.81     86.91     42.82     24.62     11.36      8.14      3.90*     2.26*     1.90*     2.27*
1.07         624.00    155.12     60.81     30.73     15.26      9.10      8.39      3.60      2.94      3.01
1.08         976.09    281.64    109.30     55.88     35.49     18.25      9.69      7.95      6.81      5.04
1.09        1535.30    414.63    172.67     79.29     47.09     29.61     16.22     13.21      8.44      5.61
1.10        2420.88    604.92    247.52    128.37     76.51     49.24     28.96     22.68     14.03     10.65

The optimal results ratio increases as the model size increases. My guess is that this is because as the model size increases, the penalty for unknown n-grams increases (the penalty is equal to the model size). This pushes up the scores for poorly-matched query/model pairs (i.e., when the query has many unknown n-grams compared to a given model), allowing for a bigger results ratio window. But I'm not sure.

The overall optimal value is for 9K models with a results ratio of 1.06. The score report for that setting is below.

best   baseln  delta   optim   corpus
91.1%  88.2%     2.9%  91.5%   dewiki
86.5%  83.0%     3.5%  87.0%   enwiki
97.0%  95.6%     1.4%  97.3%   eswiki
92.8%  89.0%     3.8%  93.3%   frwiki
95.0%  92.2%     2.8%  95.4%   itwiki
96.1%  95.1%     1.0%  96.8%   jawiki
88.5%  82.3%     6.2%  88.5%   nlwiki
97.6%  96.9%     0.7%  97.7%   ptwiki
93.1%  92.4%     0.7%  93.8%   ruwiki
Square Error (Best vs Optimal): 1.90
Cumulative Improvement (Best vs Baseline): 23.0%
Avg Improvement:  2.6%
Max Improvement:  6.2%
Min Improvement:  0.7%

 * best is the best F0.5 score for that language within this bucket (in this case, only one config is in the bucket).
 * baseln is the baseline F0.5 score, i.e., the F0.5 score obtained in the original optimization, based generally on language selection.
 * delta is the increase from baseln to best.
 * optim is the best F0.5 obtained in this optimization, including all possible configurations.
 * Square Error is as above, and is the basis for selecting the best model.
 * Improvement is the sum of all deltas, with mean, max, and min also shown.
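The summary lines follow directly from the per-corpus columns. As a check, this short sketch recomputes them from the rounded figures in the 9K, RR == 1.06 report (the `summarize` helper is made up for the example; the real tool presumably works from unrounded scores):

```python
# corpus: (best, baseln, optim), F0.5 in percent, copied from the 9K / RR 1.06 report
report_9k = {
    "dewiki": (91.1, 88.2, 91.5),
    "enwiki": (86.5, 83.0, 87.0),
    "eswiki": (97.0, 95.6, 97.3),
    "frwiki": (92.8, 89.0, 93.3),
    "itwiki": (95.0, 92.2, 95.4),
    "jawiki": (96.1, 95.1, 96.8),
    "nlwiki": (88.5, 82.3, 88.5),
    "ptwiki": (97.6, 96.9, 97.7),
    "ruwiki": (93.1, 92.4, 93.8),
}


def summarize(report):
    """Recompute the summary lines of a score report."""
    deltas = [best - base for best, base, _ in report.values()]
    sq_err = sum((opt - best) ** 2 for best, _, opt in report.values())
    return {
        "square_error": round(sq_err, 2),
        "cumulative": round(sum(deltas), 1),
        "mean": round(sum(deltas) / len(deltas), 1),
        "max": round(max(deltas), 1),
        "min": round(min(deltas), 1),
    }


summary = summarize(report_9k)
```

The result matches the report: square error 1.90, cumulative improvement 23.0, mean 2.6, max 6.2, min 0.7.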

3K models
We are currently using 3K models. If we limit ourselves to only 3K models, then we get the following square error values, by results ratio:

Model Size \ RR    1.00    1.01    1.02    1.03    1.04    1.05    1.06    1.07    1.08    1.09    1.10
3000              39.17   10.87    2.55    2.25    3.55    5.08   12.35   22.68   53.07   98.56  150.73

Interestingly, the best value is 1.03, rather than the 1.04 found for 3K models above. This is because the "optimal" F0.5 values among all the configs we are considering are different, so the square error is lower, too, as shown in the score report below.

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.7%  83.0%     1.7%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.8%  89.0%     2.8%  92.1%   frwiki
93.9%  92.2%     1.7%  94.2%   itwiki
96.4%  95.1%     1.3%  96.8%   jawiki
84.6%  82.3%     2.3%  85.1%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.25
Cumulative Improvement (Best vs Baseline): 13.0%
Avg Improvement:  1.4%
Max Improvement:  2.8%
Min Improvement: -0.3%

Notice that Russian takes a small hit, a 0.3% decrease in F0.5, which is within the allowed range for a decrease for any one corpus (up to 0.5%).

5K Models
The current implementation already has the information needed to run up to 5K models, so I optimized with that constraint as well. (The 1K models are so horrible that I have omitted them.)

RR \ Model Size     2000     3000     4000     5000
1.00              128.90    70.18    75.73    81.84
1.01               51.44    29.42    29.25    34.65
1.02               30.33    13.22    12.36    17.53
1.03               20.80    13.36     7.08     7.73
1.04               19.70    12.30     4.63     4.56
1.05               32.80    15.91     7.40     2.93
1.06               60.33    24.14    10.20     3.14
1.07              117.64    39.15    16.91     6.60
1.08              227.76    80.02    36.38    20.61
1.09              347.43   136.09    54.47    28.93
1.10              525.12   200.90    98.23    53.23

The 5K models are best, with 1.05 as the optimal results ratio.

best   baseln  delta   optim   corpus
91.1%  88.2%     2.9%  91.4%   dewiki
85.7%  83.0%     2.7%  86.8%   enwiki
96.6%  95.6%     1.0%  97.2%   eswiki
92.2%  89.0%     3.2%  92.3%   frwiki
94.4%  92.2%     2.2%  94.9%   itwiki
96.4%  95.1%     1.3%  96.8%   jawiki
86.6%  82.3%     4.3%  86.6%   nlwiki
97.5%  96.9%     0.6%  97.7%   ptwiki
92.5%  92.4%     0.1%  93.4%   ruwiki
Square Error (Best vs Optimal): 2.93
Cumulative Improvement (Best vs Baseline): 18.3%
Avg Improvement:  2.0%
Max Improvement:  4.3%
Min Improvement:  0.1%

Unknown n-gram Penalty
At one point I accidentally ran with configured model sizes from 1K to 10K, but only using the 5K models. The result was that the only difference between 5K and larger model "sizes" was the unknown n-gram penalty. The optimal "model size" came out to be 6K, meaning that an extra penalty of 1K (20%) improved performance.

Summary & Current Recommendations

 * The maximum returned languages should be set to 1—we should tolerate no ambiguity!


 * Bigger models will give a noticeable improvement in F0.5 score.
 * Assuming there's no serious performance impact, upgrading to 9K models could give the best results. I'm not sure that I have original data for the wiki-text-based models to be able to generate models larger than 5K for them, so that could be a wrinkle.
 * With the current potential for 5K models, we can get much of the improvement from the 9K models (2.0% mean vs 2.6%).
 * With the current 3K models, we can still get a 1.4% mean improvement.


 * The results ratio should be chosen based on the model size selected.


 * The benefit of the query-based models is still clear, and they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% F0.5.


 * Exploring an additional n-gram penalty may yield further improvements.
 * Since we don't get any benefit from proportional limits on the maximum returned languages, no code updates are needed to TextCat itself.


 * We still need to see how these features interact with other potential features being considered.

Minimum Input Length
(November 2016 — Phab task: T149318)

Background
Very short strings, especially in the Latin alphabet, are hard to identify, partly because there isn't much to work with, and partly because of genuine ambiguity. As an extreme example, the letter "a" is listed under 94 languages in the English Wiktionary.

There is also a particular problem on Wikipedias where language identification is used. Language identification is only run on "poorly performing" queries—i.e., those with fewer than 3 results. Punctuation marks don't really get the "full search experience", and often return only one or two results, making them eligible for language identification. The result is fairly random, depending on the language models available and which has the highest n-gram rank for the punctuation (plus the spaces put on either side of it). Since punctuation marks have entries in most languages, whatever random language is identified as a match will also likely return results.

For example, on English Wikipedia, this results in a semicolon (;) getting results from the Korean Wikipedia, a double quote (") getting results from Hebrew Wikipedia, and a circumflex (^) getting results from Japanese Wikipedia. I don't think most users see these results, because if you use the search box in the upper corner, you go directly to the page with the matching title/redirect.

While limiting language identification based on length seems like a good idea, there are potential pitfalls. There are fairly unambiguous characters, like Japanese or Hebrew. (Japanese katakana could be used to write Okinawan, Ainu, or Palauan, and the Hebrew alphabet could be used to write Yiddish, Judaeo-Spanish, or Judeo-Arabic, but there's still an obvious best guess for individual characters, unlike individual letters in the Latin alphabet.) Some non-alphabetic writing systems, like Japanese and especially Chinese, can pack a lot of information into one character. When the language has short words, like Chinese, one character can actually be a word, and can carry enough information to be a reasonable search.

So, the goal is to balance eliminating the shortest queries, while seeing what effect it has on recall (though precision could improve as ambiguous words are no longer allowed to get incorrect results). The number of very short queries varies by corpus/wiki, depending in large part on the languages present in the corpus, and so some are completely unaffected by minimum lengths.
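To make the trade-off concrete, here is the standard F-beta formula with beta = 0.5 (which weights precision more heavily than recall), applied to invented counts. How the evaluation tool tallies true and false positives is not described here, so the bookkeeping below is an assumption.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts: before any length limit, precision and recall are both 0.9.
before = f_beta(tp=90, fp=10, fn=10)

# Suppose a length limit skips five short queries: one had been identified
# correctly (tp - 1, fn + 1) and four had been given the wrong language (fp - 4).
after = f_beta(tp=89, fp=6, fn=11)
```

With these numbers recall drops slightly, precision rises, and F0.5 goes up; with a different mix of short queries the result could just as easily go the other way.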

Initial Results
I examined configs with a minimum input length (MIL) from 1 to 8. Since we currently don't include any empty strings in our input, a minimum input length of 1 is equivalent to the current situation.
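The check itself is small. A sketch, assuming length is counted in characters (Unicode code points) after trimming surrounding whitespace; the function name and default are hypothetical:

```python
def should_identify(query, min_input_length=3):
    """Skip language identification for queries shorter than the minimum length."""
    return len(query.strip()) >= min_input_length


semicolon = should_identify(";")                      # False: one character
tokyo_mil3 = should_identify("東京")                   # False: two characters, MIL 3
tokyo_mil2 = should_identify("東京", min_input_length=2)  # True
```

The two-character example shows the cost of a higher limit: a short query in a script that packs a lot into each character is skipped at MIL == 3 but kept at MIL == 2.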

3K Models
Limiting options to the 3K models currently in use, with an optimal results ratio (RR) of 1.03, the squared error (vs optimal) doesn't change for minimum input length (MIL) of 1 or 2. 3 is only slightly worse, with 4 only slightly worse than that, with a much bigger jump at MIL == 5.

RR \ MIL       1       2       3       4       5       6       7       8
1.03        2.58    2.58    2.78    3.44    8.68   26.49   92.13  256.97

With a minimum input length of 1 or 2, we get a slight hit to Russian over current production baseline, but improvement overall:

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.7%  83.0%     1.7%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.8%  89.0%     2.8%  92.1%   frwiki
93.9%  92.2%     1.7%  94.2%   itwiki
96.4%  95.1%     1.3%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.58
Cumulative Improvement (Best vs Baseline): 13.0%
Mean (Min - Max): 1.4% (-0.3% – 2.8%)

At minimum input length of 3, there's only a small decrease in overall performance from MIL == 2, with a slight increase for Japanese (presumably caused by short difficult Chinese queries being excluded):

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.6%  83.0%     1.6%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.7%  89.0%     2.7%  92.1%   frwiki
93.8%  92.2%     1.6%  94.2%   itwiki
96.5%  95.1%     1.4%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 2.78
Cumulative Improvement (Best vs Baseline): 12.8%
Mean (Min - Max): 1.4% (-0.3% – 2.7%)

At minimum input length of 4, performance overall is slightly worse, with English getting the worst of it, compared to MIL == 3.

best   baseln  delta   optim   corpus
90.2%  88.2%     2.0%  90.5%   dewiki
84.3%  83.0%     1.3%  85.4%   enwiki
96.8%  95.6%     1.2%  97.0%   eswiki
91.6%  89.0%     2.6%  92.1%   frwiki
93.8%  92.2%     1.6%  94.2%   itwiki
96.5%  95.1%     1.4%  96.9%   jawiki
84.6%  82.3%     2.3%  85.3%   nlwiki
97.2%  96.9%     0.3%  97.4%   ptwiki
92.1%  92.4%?   -0.3%  93.1%   ruwiki
Square Error (Best vs Optimal): 3.44
Cumulative Improvement (Best vs Baseline): 12.4%
Mean (Min - Max): 1.4% (-0.3% – 2.6%)

5K Models
Considering models from 3K to 5K (currently available in production, though we use 3K models), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see the same pattern: minimum input length (MIL) of 1 and 2 are the same, 3 is a bit worse, 4 a bit worse than that, with a bigger jump at 5.

The optimal results ratio (RR) depends on the minimum input length, and varies from 1.05 to 1.06.

RR \ MIL       1       2       3       4       5       6       7       8
1.05        3.09    3.09    3.88    4.51   10.49   28.65   93.98  259.56
1.06        3.48    3.48    3.53    3.70    7.26   22.92   85.00  247.02

With MIL at 1 (RR == 1.05), this is the baseline. MIL of 2 is the same:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.7%  85.7%     0.0%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.2%  92.2%     0.0%  92.3%   frwiki
94.4%  94.4%     0.0%  95.0%   itwiki
96.4%  96.4%     0.0%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.09
Cumulative Improvement (Best vs Baseline):  0.0%
Mean (Min - Max): 0.0% (0.0% – 0.0%)

With MIL at 3 (and RR == 1.06), we get improvements for several corpora, with a net improvement overall, but an unacceptable dip for Russian:

best   baseln  delta   optim   corpus
91.3%  91.1%     0.2%  91.4%   dewiki
86.6%  85.7%     0.9%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.1%  92.2%?   -0.1%  92.3%   frwiki
94.8%  94.4%     0.4%  95.0%   itwiki
96.7%  96.4%     0.3%  96.9%   jawiki
86.2%  86.6%?   -0.4%  87.0%   nlwiki
97.7%  97.5%     0.2%  97.7%   ptwiki
91.8%  92.5%!   -0.7%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.53
Cumulative Improvement (Best vs Baseline):  0.8%
Mean (Min - Max): 0.1% (-0.7% – 0.9%)
[Unacceptable Performance Decrease for ruwiki]

With MIL at 3 and RR at 1.05, there is only a minimal improvement for Japanese, but only minimal dips for any other wikis:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.4%  85.7%?   -0.3%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.2%  92.2%     0.0%  92.3%   frwiki
94.3%  94.4%?   -0.1%  95.0%   itwiki
96.5%  96.4%     0.1%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.88
Cumulative Improvement (Best vs Baseline): -0.3%
Mean (Min - Max): -0.0% (-0.3% – 0.1%)

The performance of MIL == 4 is similar. At RR == 1.06, overall performance is up a bit, but Russian takes too big a hit:

best   baseln  delta   optim   corpus
91.3%  91.1%     0.2%  91.4%   dewiki
86.4%  85.7%     0.7%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.0%  92.2%?   -0.2%  92.3%   frwiki
94.8%  94.4%     0.4%  95.0%   itwiki
96.7%  96.4%     0.3%  96.9%   jawiki
86.2%  86.6%?   -0.4%  87.0%   nlwiki
97.7%  97.5%     0.2%  97.7%   ptwiki
91.8%  92.5%!   -0.7%  93.4%   ruwiki
Square Error (Best vs Optimal): 3.70
Cumulative Improvement (Best vs Baseline):  0.5%
Mean (Min - Max): 0.1% (-0.7% – 0.7%)
[Unacceptable Performance Decrease for ruwiki]

At MIL == 4 and RR == 1.05 (the setting whose square error of 4.51 appears in the RR == 1.05 row of the table above), there isn't any improvement other than a tiny bump for Japanese, though English is on the edge of acceptability.

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.4%   dewiki
85.2%  85.7%?   -0.5%  86.8%   enwiki
96.6%  96.6%     0.0%  97.0%   eswiki
92.1%  92.2%?   -0.1%  92.3%   frwiki
94.3%  94.4%?   -0.1%  95.0%   itwiki
96.5%  96.4%     0.1%  96.9%   jawiki
86.6%  86.6%     0.0%  87.0%   nlwiki
97.5%  97.5%     0.0%  97.7%   ptwiki
92.5%  92.5%     0.0%  93.4%   ruwiki
Square Error (Best vs Optimal): 4.51
Cumulative Improvement (Best vs Baseline): -0.6%
Mean (Min - Max): -0.1% (-0.5% – 0.1%)

N.B., these are all using 5K models with RR == 1.05 as a baseline, rather than the current production baseline. Got to keep improving, right?

Up to 10K models
Considering models from 3K to 10K (max available for query-based models without retraining), and using as a baseline the performance of 5K models found earlier (when looking at Maximum Returned Languages and Results Ratio, above), we see the same pattern: minimum input length (MIL) of 1 and 2 are the same, 3 is a bit worse, 4 a bit worse than that, with a bigger jump at 5.

The optimal model size is 9K, and the optimal results ratio (RR) is 1.06.

RR \ MIL       1       2       3       4       5       6       7       8
1.06        2.65    2.65    2.77    3.03    7.33   23.75   87.70  250.03

At MIL == 1 (or 2), performance is generally better than the 5K baseline, though Japanese takes a small hit:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.5%   dewiki
86.5%  85.7%     0.8%  87.0%   enwiki
97.0%  96.6%     0.4%  97.3%   eswiki
92.8%  92.2%     0.6%  93.3%   frwiki
95.0%  94.4%     0.6%  95.4%   itwiki
96.1%  96.4%?   -0.3%  96.9%   jawiki
88.5%  86.6%     1.9%  89.0%   nlwiki
97.6%  97.5%     0.1%  97.7%   ptwiki
93.1%  92.5%     0.6%  93.8%   ruwiki
Square Error (Best vs Optimal): 2.30
Cumulative Improvement (Best vs Baseline):  4.7%
Mean (Min - Max): 0.5% (-0.3% – 1.9%)

At MIL == 3, Japanese does a bit better (or, at least, a bit less badly), while overall the improvement over the 5K baseline (with no MIL) is less.

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.5%   dewiki
86.3%  85.7%     0.6%  87.0%   enwiki
97.0%  96.6%     0.4%  97.3%   eswiki
92.7%  92.2%     0.5%  93.3%   frwiki
94.9%  94.4%     0.5%  95.4%   itwiki
96.3%  96.4%?   -0.1%  96.9%   jawiki
88.4%  86.6%     1.8%  89.0%   nlwiki
97.6%  97.5%     0.1%  97.7%   ptwiki
93.1%  92.5%     0.6%  93.8%   ruwiki
Square Error (Best vs Optimal): 2.57
Cumulative Improvement (Best vs Baseline):  4.4%
Mean (Min - Max): 0.5% (-0.1% – 1.8%)

At MIL == 4, no wikis do worse, though not as many do better, and the total improvement is slightly less:

best   baseln  delta   optim   corpus
91.1%  91.1%     0.0%  91.5%   dewiki
86.0%  85.7%     0.3%  87.0%   enwiki
97.0%  96.6%     0.4%  97.3%   eswiki
92.7%  92.2%     0.5%  93.3%   frwiki
94.9%  94.4%     0.5%  95.4%   itwiki
96.5%  96.4%     0.1%  96.9%   jawiki
88.4%  86.6%     1.8%  89.0%   nlwiki
97.6%  97.5%     0.1%  97.7%   ptwiki
93.1%  92.5%     0.6%  93.8%   ruwiki
Square Error (Best vs Optimal): 2.88
Cumulative Improvement (Best vs Baseline):  4.3%
Mean (Min - Max): 0.5% (0.0% – 1.8%)

Per Language Analysis
Looking at the optimization report (for models up to 10K) for each language, we can see how the setting of a minimum input length affects the corpus of queries from the corresponding wiki.

German, Spanish, Italian, Portuguese, Russian: Most or all queries are >5 characters, so the optimization is based mostly or entirely on model size, maximum returned languages, and results ratio.

English, French: The shortest queries are 2 characters and all non-Latin (and so generally easier in this context), so MIL ≥ 3 affects recall.

Japanese: The shortest queries are Chinese, and hard to detect properly. Ignoring them improves precision, so MIL > 2 is a good thing.

Dutch: The optimal MIL is actually 8! There are few queries < 7 characters long, but of the nine queries that are 7 characters, five are English and German (both of which are easily confused with Dutch, esp. with shorter strings), and only one is Dutch, so dropping them all improves precision enough to make a difference!

Summary & Current Recommendations

 * None of the 3K, 5K, or 9K models are any worse if we set the minimum input length (MIL) to 2 (vs the de facto current value of 1). Setting MIL to 3 or 4 seems reasonable, depending on the details of the optimization.
 * We've only seen real-world problems with one-character queries, so setting MIL to 2 or 3 seems like the most likely option, without incurring too great a cost.


 * We still need to see how this feature interacts with other potential features being considered.
 * That said, compared to the current production baseline, improvements in other features give us some wiggle room, since this feature can only decrease recall (though it improves precision).
 * The Perl and PHP versions of TextCat will need to be updated to include this option.