User:TJones (WMF)/Notes/TextCat Improvements

November 2016 — See TJones_(WMF)/Notes for other projects.

Intro
Related to work on TextCat and Confidence, I'm trying several approaches to improving TextCat performance. Phab parent task: T140289

I expect to write up basic results on new features as I add them, but I may come back to them as I add more features that allow them to interact. As an example, adding the ability to use language models from multiple directories can give immediate improvements when there are distinctive languages to be considered (e.g., Burmese, which has its own writing system) but not in others (e.g., Catalan, which is too similar to Spanish to include on eswiki without also making improvements that give a boost to the host language).

Inspired by Mikhail's always excellent linking in his reports and the recent Why We Read Wikipedia presentation, which shows that common motivations for reading Wikipedia include boredom & randomness and intrinsic learning, I've sometimes included more than the usual number of links in the sections below, for the reader's amusement and edification.

Optimization Framework
(November 2016 — Phab task: T149314)

I've slapped together a wrapper around my existing language identification tool (i.e., the Perl version of TextCat) and my identification results evaluation tool; it can perform a grid search optimization (of F0.5) over multiple dimensions of hyper-parameters, which can be numeric or nominal.
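As a rough illustration of what such a wrapper does, here is a minimal Python sketch of an exhaustive grid search that maximizes F0.5. The actual wrapper sits around the Perl tools; the function names and the `evaluate` callable here are hypothetical stand-ins, not the real interface.

```python
from itertools import product


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_score, best_config).

    `grid` maps a parameter name to a list of candidate values, which may be
    numeric or nominal. `evaluate` is a hypothetical callable that runs the
    identifier plus the evaluation tool for one config and returns its F0.5.
    """
    names = list(grid)
    best_score, best_config = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config
```

Because the search is exhaustive, the cost is the product of the number of values along each dimension, which is why it suits a handful of small parameter ranges better than large combinatorial spaces.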

Since any problem in computer science can be solved by introducing another level of indirection, I expanded the wrapper to include the ability to run itself over several different corpora, note the optimal config and F0.5 score for each corpus, and then choose the configuration that gives the best collective results. "Best" is measured by the square "error" from the individual optimal configurations for each corpus, along with a hard constraint that no corpus decrease from its baseline performance (i.e., on the current config) by more than a fixed amount (0.5% F0.5 for now). Evaluation can be done on all varying parameters at once, or on a "relevant" subset, each combination of which is represented by the optimal configuration per corpus, found by allowing all other parameters to vary.

Using the multi-corpora optimization allows us to find a generic set of hyper-parameters that apply across all corpora/wikis, which discourages overfitting to any particular corpus.
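The selection step can be sketched as follows, assuming F0.5 scores are held as percentages in a nested dict. The function name, argument names, and data layout are illustrative assumptions, not the wrapper's actual structure.

```python
def pick_shared_config(scores, baseline, max_drop=0.5):
    """Pick the one config that is collectively closest to each corpus's optimum.

    scores:   {config: {corpus: F0.5 in percent}} (config must be hashable)
    baseline: {corpus: F0.5 in percent on the current config}
    max_drop: hard constraint, in percentage points of F0.5

    Returns (config, square_error), or None if every config violates the constraint.
    """
    corpora = list(baseline)
    # Best score any config achieves on each corpus, taken independently.
    optimal = {c: max(per[c] for per in scores.values()) for c in corpora}
    best = None
    for config, per_corpus in scores.items():
        # Hard constraint: no corpus may fall too far below its baseline.
        if any(baseline[c] - per_corpus[c] > max_drop for c in corpora):
            continue
        sq_err = sum((optimal[c] - per_corpus[c]) ** 2 for c in corpora)
        if best is None or sq_err < best[1]:
            best = (config, sq_err)
    return best
```

With toy scores, a config that is excellent on one corpus but drops another well below baseline is rejected outright, and the remaining configs are ranked by their summed squared distance from the per-corpus optima.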

It's spiffy.

Initial Findings
(November 2016 — Phab task: T149316)

I reviewed the previous language model optimization for TextCat for English, French, Italian, Spanish, German, Portuguese, Russian, Japanese, and Dutch Wikipedias, looking at languages for which there were no query-based models available, but for which there are Wiki-text-based models available. The same issues are at work in these cases: very distinctive languages are easy (e.g., those with a unique writing system), and closely related languages are hard (e.g., Catalan and Spanish, Afrikaans and Dutch). Another issue is the set of languages and the number of instances of each language represented in the sample. In some cases there are many languages that are not covered, in others, very few; in all cases, there weren't very many instances of these previously not covered languages.

Summary of Results
We can get an increase of a tenth of a percentage point or two in F0.5 for some wikis by adding previously not covered languages. For many wikis, there weren't many languages to add, and when there were, often they were too similar to others to be of help. There are a number of languages for which we could add easier-to-build Wiki-text models, to expand coverage for certain wikis and earn a few more tenths of a percentage point of F0.5.

Details

 * English: I re-optimized from the full set of relevant languages, and arrived at a similar set of optimal languages. Adding Slovak (sk) improved recall for Slovak, but it was offset by other errors. It is likely that adding Wiki-text models for Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am) could make improvements at the margins.
 * French: Adding Hungarian gets the one Hungarian example in the sample, increasing F0.5 by 0.1%. Other languages (Breton, Icelandic, Latin) have too many false positives.
 * German: Latin was the only language not already accounted for, and adding it didn't help.
 * Spanish: Catalan was the only language not accounted for, and adding it didn't help.
 * Italian: Latin and Romanian were the only languages not accounted for, and adding them didn't help.
 * Portuguese: Tagalog and Latin were the only languages not accounted for, and adding them didn't help.
 * Russian: Adding Finnish (fi) improved recall for Finnish, but it was offset by other errors. It is likely that adding Wiki-text models for Azerbaijani (az) or other languages could make improvements at the margins.
 * Japanese: Kazakh was the only language not accounted for, and no model is available for it.
 * Dutch: Adding Burmese gets the one Burmese example in the sample, increasing F0.5 by 0.2%. Adding Finnish and Croatian (fi, hr) improved recall for Finnish and Croatian, but it was offset by other errors.

Current Recommendations

 * I can add Burmese, Oriya, Malayalam, and Kannada to the list of "easy" language models to be added when present (along with Greek, Korean, Hebrew, Japanese, Thai, Telugu, Georgian, and when there are no examples of "competing" languages using the same script, Arabic, Russian, and Hindi).
 * We could include the models for these "easy" languages with very distinctive writing systems in general (and not only when they are present in a largish sample) because identifying them is "easy" even if they are very low frequency. The main concern is the computational cost of including more models, though I believe that the current implementation of the PHP version of TextCat in production loads all available models anyway.
 * There are a couple of languages present in the samples above that I should build Wiki-text models for and re-assess: Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am), particularly Khmer and Amharic since they have distinctive writing systems.
 * I expect that other features in this list will interact with this feature, and that as precision generally improves, it will be possible to include additional models, including Wiki-text-based models.

Background
The current settings in production for TextCat for the maximum returned languages and results ratio are inherited from the original TextCat implementation. We were getting better results from TextCat than other language identifiers using those defaults, so we didn't originally mess with them.

However, the original TextCat implementation was built primarily to work with longer texts and smaller n-gram models, so it makes sense that there is some improvement to be had here.

For reference: I started out intending to look at variations in maximum returned languages and results ratio, but remembered that in my previous optimizations, which were based on languages to be considered, model size also turned out to be interesting, so I decided to add it into the mix here.
 * The results ratio is set to 1.05 by default, which means that any language model that has a cost less than 1.05 times the lowest cost (i.e., within 5% of the lowest cost) is reported as an alternative. So, TextCat can and will report that a particular string looks like it is Spanish, but the second best guess is Portuguese, and the third best guess is maybe Italian.
 * The maximum returned languages is set to 5 by default, which is the maximum number of languages that can be returned by TextCat. If more than that number of languages are within the window defined by the results ratio, then TextCat can’t make up its mind, and returns “unknown” as the detected language.
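The two settings above combine into a simple decision rule, sketched here in Python. This is an illustration of the behavior as described, not the production PHP or the Perl code; the function name and the `costs` dict are hypothetical, and always keeping the best-scoring language (even at a ratio of 1.00) is an assumption.

```python
def pick_languages(costs, results_ratio=1.05, max_returned=5):
    """Return candidate languages, best first, or [] for "unknown".

    costs: {language: cost}, where lower cost means a better match.
    """
    if not costs:
        return []
    ranked = sorted(costs.items(), key=lambda kv: kv[1])
    best_lang, best_cost = ranked[0]
    cutoff = best_cost * results_ratio
    # Assumption: the best match is always kept; others must beat the cutoff.
    candidates = [best_lang] + [lang for lang, cost in ranked[1:] if cost < cutoff]
    if len(candidates) > max_returned:
        return []  # too many close competitors: report "unknown"
    return candidates
```

For example, with costs of 1000 for Spanish, 1040 for Portuguese, 1049 for Italian, and 1200 for French, the defaults return Spanish, Portuguese, and Italian, while the same costs with `max_returned=1` give "unknown".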

The model size is the number of 1- to 5-grams from the model training corpus that are retained, in frequency order. So a model size of 1000 means that the 1000 most frequent n-grams for each language are compared to the 1000 most frequent n-grams from the text to be identified, and the best match wins (modulo the settings for results ratio and maximum returned languages).
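To make the role of model size concrete, here is a minimal sketch of rank-ordered n-gram profiles and the "out-of-place" cost commonly used in this family of n-gram identifiers. The tokenization, the underscore padding, the tie-breaking, and the function names are assumptions for illustration and may differ from the Perl and PHP implementations.

```python
from collections import Counter


def ngram_ranks(text, model_size, max_n=5):
    """Return {ngram: rank} for the `model_size` most frequent 1- to max_n-grams."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # assumed word-boundary marker
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency, ties broken alphabetically so results are stable.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ordered[:model_size])}


def out_of_place_cost(text_ranks, model_ranks, unknown_penalty):
    """Sum of rank differences; n-grams missing from the model cost `unknown_penalty`."""
    cost = 0
    for gram, rank in text_ranks.items():
        if gram in model_ranks:
            cost += abs(rank - model_ranks[gram])
        else:
            cost += unknown_penalty
    return cost
```

In this sketch the unknown n-gram penalty is passed in separately; setting it equal to the model size gives the behavior where a larger model also means a larger penalty for n-grams the model does not contain.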

The original TextCat implementation used a model size of 400 n-grams; it was intended to be used on longer texts—where actual statistical tendencies of a language can emerge—at a time when Moore's Law had not been chugging along for an additional couple of decades. We've been using a model size of 3,000 n-grams. The PHP implementation in production and my updated Perl implementation have 5,000 n-grams available, though I have kept my own 10,000 n-gram versions of the query-based models around for development and testing.

Data
I have the hand-tagged corpora of 500+ poorly-performing queries from nine Wikipedias that I had previously used. The nine codes/languages are: de/German, en/English, es/Spanish, fr/French, it/Italian, ja/Japanese, nl/Dutch, pt/Portuguese, and ru/Russian.

Initial Config
I set up the optimizer to consider maximum returned languages thresholds from 1 to 10 (current default is 5), and results ratio values from 1.00 to 1.10 (in increments of 0.01; current default is 1.05). I also set up model size to vary from 1K to 10K (in increments of 1K; current default is 3K).
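For a sense of scale, the ranges above can be enumerated directly. This is only a counting sketch with made-up variable names, not the optimizer's config format.

```python
from itertools import product

max_returned = range(1, 11)                                    # 1 to 10
results_ratios = [round(1.00 + 0.01 * i, 2) for i in range(11)]  # 1.00 to 1.10
model_sizes = range(1000, 10001, 1000)                         # 1K to 10K

configs = list(product(max_returned, results_ratios, model_sizes))
print(len(configs))  # 10 * 11 * 10 = 1100 configurations per corpus
```

Across nine corpora that comes to 9,900 evaluation runs, which is still comfortably within reach of an exhaustive grid search.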

For each corpus, I kept the current list of languages to be considered, as previously optimized. The languages to consider are a subset of available/relevant languages, which makes for a more difficult optimization problem. There are 2^n possible subsets of n languages, which is not as amenable to an exhaustive grid search. I have some ideas for simple hill-climbing that should be O(n) rather than O(2^n) in the number of languages, but for now the languages are held constant per corpus.

I did mess around briefly with the types of models used (query-based vs Wiki-text-based), and the benefit of the query-based models is still clear: they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% of F0.5.

Maximum Returned Languages
In all my tests, maximum returned languages (MRL) always heavily optimized towards 1; i.e., if there's any ambiguity about what language it is (based on the results ratio), give up!

Below are the stats for MRL while allowing results ratio (RR) and model size to vary. So for MRL == 1, the value reported is with the best possible values for RR and model size, which may be different for MRL == 2. The Square Error ("SqErr" below) is the sum of the squares of the differences between the best F0.5 score for each corpus at that MRL value, and the optimal F0.5 value for each corpus across all config settings. Since it is 0 for MRL == 1, the best F0.5 value for every corpus is obtained with MRL == 1.

MRL        1       2       3       4       5       6       7       8       9      10
SqErr   0.00   61.01   62.06   66.70   75.41   77.65   77.65   77.65   78.60   78.60

So, from here on out, I'm setting MRL to 1 for all other tests, to reduce the dimensionality for analysis.

Up to 10K models
Holding the MRL at 1 and allowing the results ratio and model size to vary, we get the following table of the square error. The best (lowest) value for each model size is marked with an asterisk.

RR \ Model    1000     2000     3000     4000     5000     6000     7000     8000     9000    10000
1.00        279.77   172.26   104.28   110.21   116.72   118.48   111.17    99.92   100.41   101.28
1.01        132.81    82.36    54.24    52.27    58.81    66.02    59.59    58.10    54.38    46.67
1.02         87.87    54.67    30.72    29.82    33.51    32.45    31.99    36.02    35.46    30.32
1.03         81.87*   40.64    30.24    20.20    16.89    17.04    17.40    16.99    14.85    16.80
1.04        117.74    39.40*   26.86*   14.41*   13.28     9.59    11.02     9.04     7.44     6.69
1.05        201.07    56.32    33.13    18.50    10.05*    7.40*    4.29     4.16     4.98     4.20
1.06        366.81    86.91    42.82    24.62    11.36     8.14     3.90*    2.26*    1.90*    2.27*
1.07        624.00   155.12    60.81    30.73    15.26     9.10     8.39     3.60     2.94     3.01
1.08        976.09   281.64   109.30    55.88    35.49    18.25     9.69     7.95     6.81     5.04
1.09       1535.30   414.63   172.67    79.29    47.09    29.61    16.22    13.21     8.44     5.61
1.10       2420.88   604.92   247.52   128.37    76.51    49.24    28.96    22.68    14.03    10.65

The optimal results ratio by model size increases as the model size increases. My guess is that this is because as the model size increases, the penalty for unknown n-grams increases (it is the model size). This pushes up the scores for poorly-matched query/model pairs (i.e., when the query has many unknown n-grams compared to a given model), allowing for a bigger results ratio window. But I'm not sure.

The overall optimal value is for 9K models with a results ratio of 1.06. The score report for that setting is below.

best    baseln  delta   optim   corpus
91.1%   88.2%   2.9%    91.5%   dewiki
86.5%   83.0%   3.5%    87.0%   enwiki
97.0%   95.6%   1.4%    97.3%   eswiki
92.8%   89.0%   3.8%    93.3%   frwiki
95.0%   92.2%   2.8%    95.4%   itwiki
96.1%   95.1%   1.0%    96.8%   jawiki
88.5%   82.3%   6.2%    88.5%   nlwiki
97.6%   96.9%   0.7%    97.7%   ptwiki
93.1%   92.4%   0.7%    93.8%   ruwiki

Square Error (Best vs Optimal): 1.90
Cumulative Improvement (Best vs Baseline): 23.0%
Avg Improvement: 2.6%
Max Improvement: 6.2%
Min Improvement: 0.7%

 * best is the best F0.5 score for that language within this bucket (in this case, only one config is in the bucket).
 * baseln is the baseline F0.5 score, i.e., the F0.5 score obtained in the original optimization, based generally on language selection.
 * delta is the increase from baseln to best.
 * optim is the best F0.5 obtained in this optimization, including all possible configurations.
 * Square Error is as above, and is the basis for selecting the best model.
 * Improvement is the sum of all deltas, with mean, max, and min also shown.
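As a sanity check on these definitions, the summary figures can be recomputed from the per-corpus rows of the 9K, results ratio 1.06 report above. The sketch below assumes the differences are taken in percentage points of F0.5; the variable names are mine.

```python
# (best, baseline, optimal) F0.5 percentages per corpus, copied from the report above.
report = {
    "dewiki": (91.1, 88.2, 91.5),
    "enwiki": (86.5, 83.0, 87.0),
    "eswiki": (97.0, 95.6, 97.3),
    "frwiki": (92.8, 89.0, 93.3),
    "itwiki": (95.0, 92.2, 95.4),
    "jawiki": (96.1, 95.1, 96.8),
    "nlwiki": (88.5, 82.3, 88.5),
    "ptwiki": (97.6, 96.9, 97.7),
    "ruwiki": (93.1, 92.4, 93.8),
}

# Square error: squared gap between this config and each corpus's own optimum.
square_error = sum((optim - best) ** 2 for best, _base, optim in report.values())
# Improvement: gain of this config over the baseline, per corpus.
deltas = [best - base for best, base, _optim in report.values()]

summary = {
    "square_error": round(square_error, 2),
    "cumulative": round(sum(deltas), 1),
    "mean": round(sum(deltas) / len(deltas), 1),
    "max": round(max(deltas), 1),
    "min": round(min(deltas), 1),
}
print(summary)
```

The recomputed values agree with the reported ones (1.90, 23.0%, 2.6%, 6.2%, 0.7%), which confirms that the square error is the sum over corpora of the squared difference between optim and best.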

3K models
We are currently using 3K models. If we limit ourselves to only 3K models then we get the following square error values, by results ratio:

Model Size \ RR    1.00    1.01    1.02    1.03    1.04    1.05    1.06    1.07    1.08    1.09    1.10
3000              39.17   10.87    2.55    2.25    3.55    5.08   12.35   22.68   53.07   98.56  150.73

Interestingly, the best results ratio is 1.03, rather than the 1.04 found for 3K models above. This is because the "optimal" F0.5 values among all the configs we are considering are different, so the square error is lower, too, as shown in the score report below.

[table]

Notice that Russian takes a small hit, a 0.3% worsening in F0.5 performance, which is within the allowed range for a decrease for any one corpus (up to 0.5%).

5K Models
The current implementation already has the information needed to run up to 5K models, so I optimized with that constraint as well. (The 1K models are so horrible that I have omitted them.)

RR \ Model Size    2000     3000     4000     5000
1.00             128.90    70.18    75.73    81.84
1.01              51.44    29.42    29.25    34.65
1.02              30.33    13.22    12.36    17.53
1.03              20.80    13.36     7.08     7.73
1.04              19.70    12.30     4.63     4.56
1.05              32.80    15.91     7.40     2.93
1.06              60.33    24.14    10.20     3.14
1.07             117.64    39.15    16.91     6.60
1.08             227.76    80.02    36.38    20.61
1.09             347.43   136.09    54.47    28.93
1.10             525.12   200.90    98.23    53.23

The 5K models are best, with 1.05 as the optimal results ratio.

best    baseln  delta   optim   corpus
91.1%   88.2%   2.9%    91.4%   dewiki
85.7%   83.0%   2.7%    86.8%   enwiki
96.6%   95.6%   1.0%    97.2%   eswiki
92.2%   89.0%   3.2%    92.3%   frwiki
94.4%   92.2%   2.2%    94.9%   itwiki
96.4%   95.1%   1.3%    96.8%   jawiki
86.6%   82.3%   4.3%    86.6%   nlwiki
97.5%   96.9%   0.6%    97.7%   ptwiki
92.5%   92.4%   0.1%    93.4%   ruwiki

Square Error (Best vs Optimal): 2.93
Cumulative Improvement (Best vs Baseline): 18.3%
Avg Improvement: 2.0%
Max Improvement: 4.3%
Min Improvement: 0.1%

Unknown n-gram Penalty
At one point I accidentally ran the optimizer with configured model sizes from 1K to 10K, but using only the 5K models. The result was that the only difference between 5K and larger model "sizes" was the unknown n-gram penalty. The optimal "model size" came out to be 6K, meaning that an extra penalty of 1K (20%) improved performance.

Summary & Current Recommendations

 * The maximum returned languages should be set to 1—we should tolerate no ambiguity!


 * Bigger models will give a noticeable improvement in F0.5 score.
 * Assuming there's no serious performance impact, upgrading to 9K models could give the best results. I'm not sure that I have original data for the wiki-text-based models to be able to generate models larger than 5K for them, so that could be a wrinkle.
 * With the current potential for 5K models, we can get much of the improvement from the 9K models (2.0% mean vs 2.6%).
 * With the current 3K models, we can still get a 1.4% mean improvement.


 * The results ratio should be chosen based on the model size selected.


 * The benefit of the query-based models is still clear, and they consistently outperform Wiki-text-based models on the query corpora, generally by 2-3% F0.5.


 * Exploring an additional n-gram penalty may yield further improvements.


 * We still need to see how these features interact with other potential features being considered.

[I was originally going to explore a

... More to come ...