User:TJones (WMF)/Notes/TextCat Improvements

November 2016 — See TJones_(WMF)/Notes for other projects.

Intro
Related to work on TextCat and Confidence, I'm trying several approaches to improving TextCat performance. Phab parent task: T140289

I expect to write up basic results on new features as I add them, but I may come back to them as I add more features that allow them to interact. As an example, adding the ability to use language models from multiple directories can give immediate improvements when there are distinctive languages to be considered (e.g., Burmese, which has its own writing system) but not in others (e.g., Catalan, which is too similar to Spanish to include on eswiki without also making improvements that give a boost to the host language).

Inspired by Mikhail's always excellent linking in his reports and the recent Why We Read Wikipedia presentation (which shows that common motivations for reading Wikipedia include boredom & randomness and intrinsic learning), I've sometimes included more than the usual number of links in the sections below, for the reader's amusement and edification.

Optimization Framework
(November 2016 — Phab task: T149314)

I've slapped together a wrapper around my existing language identification tool (i.e., the Perl version of TextCat) and my identification results evaluation tool; it can perform a grid search optimization (of F0.5) over multiple dimensions of hyperparameters, which can be numeric or nominal.
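For reference, F0.5 is the F-measure with β = 0.5, which weights precision more heavily than recall. A minimal sketch of the computation (the function name is mine, not from the evaluation tool):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 0.5, a precision-heavy result scores well even with modest recall:
print(f_beta(0.9, 0.6))  # ≈ 0.818
```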

Since any problem in computer science can be solved by introducing another level of indirection, I expanded the wrapper to include the ability to run itself over several different corpora, note the optimal config and F0.5 score for each corpus, and then choose the configuration that gives the best collective results, as measured by the squared "error" from the individual optimal configurations for each corpus, along with a hard constraint that no corpus decrease from its baseline performance (i.e., on the current config) by more than a fixed amount (0.5% F0.5 for now). Evaluation can be done on all varying parameters at once, or on a "relevant" subset, each combination of which is represented by the optimal configuration per corpus, found by allowing all other parameters to vary.

It's spiffy.
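The cross-corpus selection can be sketched roughly like this (a simplified Python illustration of the logic described above, not the actual Perl wrapper; all names are mine):

```python
# Hypothetical sketch of the cross-corpus config selection:
# scores[config][corpus] holds the F0.5 for that config on that corpus.
def best_shared_config(scores, baselines, max_drop=0.005):
    # Per-corpus optimum: the best F0.5 any config achieves on that corpus.
    corpora = baselines.keys()
    optimum = {c: max(s[c] for s in scores.values()) for c in corpora}

    best, best_err = None, None
    for config, per_corpus in scores.items():
        # Hard constraint: no corpus may fall > max_drop below its baseline.
        if any(per_corpus[c] < baselines[c] - max_drop for c in corpora):
            continue
        # Squared "error" from each corpus's individual optimum.
        err = sum((optimum[c] - per_corpus[c]) ** 2 for c in corpora)
        if best_err is None or err < best_err:
            best, best_err = config, err
    return best
```

Given a grid of configs scored per corpus, this picks the surviving config closest, in squared error, to each corpus's individual optimum.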

Initial Findings
(November 2016 — Phab task: T149316)

I reviewed the previous language model optimization for TextCat for the English, French, Italian, Spanish, German, Portuguese, Russian, Japanese, and Dutch Wikipedias, looking at languages for which there were no query-based models available, but for which there are wiki-text-based models available. The same issues are at work in these cases: very distinctive languages are easy (e.g., those with a unique writing system), and closely related languages are hard (e.g., Catalan and Spanish, Afrikaans and Dutch). Another issue is the set of languages and number of instances of each language represented in the sample. In some cases there are many languages that are not covered, in others very few; in all cases, there weren't very many instances of these previously uncovered languages.

Summary of Results
We can get a tenth of a percentage point or two of increase in F0.5 for some wikis by adding previously uncovered languages. For many wikis, there weren't many languages to add, and when there were, they were often too similar to others to be of help. There are a number of languages for which we could add easier-to-build wiki-text models to expand coverage for certain wikis and earn a few more tenths of a point of F0.5.

Details

 * English: I re-optimized from the full set of relevant languages, and arrived at a similar set of optimal languages. Adding Slovak (sk) improved recall for Slovak, but it was offset by other errors. It's likely that adding wiki-text models for Azerbaijani, Swahili, Hausa, Khmer, and Amharic (az, sw, ha, km, am) could make improvements at the margins.
 * French: Adding Hungarian gets the one Hungarian example in the sample, increasing F0.5 by 0.1%. Other languages (Breton, Icelandic, Latin) have too many false positives.
 * German: Latin was the only language not already accounted for, and adding it didn't help.
 * Spanish: Catalan was the only language not accounted for, and adding it didn't help.
 * Italian: Latin and Romanian were the only languages not accounted for, and adding them didn't help.
 * Portuguese: Tagalog and Latin were the only languages not accounted for, and adding them didn't help.
 * Russian: Adding Finnish (fi) improved recall for Finnish, but it was offset by other errors. It's likely that adding wiki-text models for Azerbaijani (az) or other languages could make improvements at the margins.
 * Japanese: Kazakh was the only language not accounted for, and no model is available for it.
 * Dutch: Adding Burmese gets the one Burmese example in the sample, increasing F0.5 by 0.2%. Adding Finnish and Croatian (fi, hr) improved recall for Finnish and Croatian, but it was offset by other errors.

Current Recommendations

 * I can add Burmese, Oriya, Malayalam, and Kannada to the list of "easy" language models to be added when present (along with Greek, Korean, Hebrew, Japanese, Thai, Telugu, Georgian, and when there are no examples of "competing" languages using the same script, Arabic, Russian, and Hindi).
 * We could include the models for these "easy" languages with very distinctive writing systems in general (and not only when they are present in a largish sample) because identifying them is "easy" even if they are very low frequency. The main concern is the computational cost of including more models, though I believe that the current implementation of the PHP version of TextCat in production loads all available models anyway.
 * There are a couple of languages present in the samples above that I should build Wiki-text models for and re-assess: Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am), particularly Khmer and Amharic since they have distinctive writing systems.
 * I expect that as other features in this list interact with this feature, and as precision generally improves, it will be possible to include additional models, including wiki-text-based models.

Background
The current settings in production for TextCat for the maximum returned languages and results ratio are inherited from the original TextCat implementation. We were getting better results from TextCat than other language identifiers using those defaults, so we didn't originally mess with them.

However, the original TextCat implementation was built primarily to work with longer texts and smaller n-gram models, so it makes sense that there is some improvement to be had here.

For reference: I started out intending to look at variations in maximum returned languages and results ratio, but remembered that in my previous optimizations, which were based on languages to be considered, model size also turned out to be interesting, so I decided to add it into the mix here.
 * The results ratio is set at 1.05 by default, which means that any language model with a cost less than 1.05 times the lowest cost (i.e., within 5% of the lowest cost) is reported as an alternative. So, TextCat can and will report that a particular string looks like it is Spanish, but that the second best guess is Portuguese, and the third best guess is maybe Italian.
 * The maximum returned languages is set to 5 by default; it is the maximum number of languages that can be returned by TextCat. If more languages than the maximum returned languages fall within the results ratio, then TextCat can’t make up its mind, and returns “unknown” as the detected language.
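The interaction of these two settings can be sketched as follows (an illustrative Python sketch of the reporting logic just described, not the production PHP code; costs are assumed positive, lower is better):

```python
def report_languages(costs, results_ratio=1.05, max_returned=5):
    """costs maps language code -> model cost (lower is better)."""
    lowest = min(costs.values())
    # Every language whose cost is less than results_ratio times the
    # lowest cost is reported as a candidate, best (cheapest) first.
    candidates = [lang for lang in sorted(costs, key=costs.get)
                  if costs[lang] < lowest * results_ratio]
    # Too many near-ties: TextCat can't make up its mind.
    if len(candidates) > max_returned:
        return ["unknown"]
    return candidates
```

For example, with costs of 100 for Spanish, 103 for Portuguese, 104 for Italian, and 140 for English, the threshold is 105 and the report is Spanish, then Portuguese, then Italian; if six or more languages tie near the bottom, the result is "unknown".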

The model size is the number of 1- to 5-grams from the model training corpus that are retained, in frequency order. So a model size of 1000 means that the 1000 most frequent n-grams for each language are compared to the 1000 most frequent n-grams from the text to be identified, and the best match wins (modulo the settings for results ratio and maximum returned languages).
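TextCat descends from Cavnar & Trenkle's n-gram-based text categorization, so the comparison is roughly an "out-of-place" rank distance between frequency-ordered n-gram profiles. A simplified Python sketch of the idea (not the production Perl/PHP code; mapping spaces to "_" is a common convention in TextCat-style profiles):

```python
from collections import Counter

def ngram_profile(text, max_n=5, model_size=1000):
    """Ranked list of the most frequent 1- to max_n-grams in `text`."""
    counts = Counter()
    padded = "_" + text.replace(" ", "_") + "_"  # word-boundary markers
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(model_size)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between the two profiles, with a maximum
    penalty for n-grams missing from the language model."""
    ranks = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - ranks[gram]) if gram in ranks else penalty
               for r, gram in enumerate(doc_profile))
```

The language model with the lowest distance wins; the results ratio and maximum returned languages settings then decide what actually gets reported.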

The original TextCat implementation used a model size of 400 n-grams; it was intended to be used on longer texts—where actual statistical tendencies of a language can emerge—at a time when Moore's Law had not been chugging along for an additional couple of decades. We've been using a model size of 3,000 n-grams. The PHP implementation in production and my updated Perl implementation have 5,000 n-grams available, though I have kept my own 10,000 n-gram versions of the query-based models around for development and testing.

Data
I have the hand-tagged corpora of 500+ poorly-performing queries from nine Wikipedias that I had previously used. The nine codes/languages are: de/German, en/English, es/Spanish, fr/French, it/Italian, ja/Japanese, nl/Dutch, pt/Portuguese, and ru/Russian.

Initial Config
I set up the optimizer to consider maximum returned languages thresholds from 1 to 10 (current default is 5), and results ratio values from 1.00 to 1.10 (in increments of 0.01; current default is 1.05). I also set up model size to vary from 1K to 10K (in increments of 1K; current default is 3K).

For each corpus, I kept the current list of languages to be considered, as previously optimized. The languages to consider are a subset of the available/relevant languages, which makes for a more difficult optimization problem: there are 2^n possible subsets, which is not amenable to an exhaustive grid search. I have some ideas for simple hill-climbing that should be O(n) rather than O(2^n) in the number of languages, but for now the language lists are held constant per corpus.
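One possible shape for such a hill-climb, sketched in Python (purely illustrative; `evaluate` stands in for a full optimizer run that returns F0.5 for a given language subset on a corpus):

```python
def hill_climb_languages(candidates, evaluate, start=()):
    """Greedily toggle one language at a time, keeping any change that
    improves evaluate(subset); one pass costs O(n) evaluations instead
    of the O(2^n) of an exhaustive search over subsets."""
    current = set(start)
    best = evaluate(current)
    for lang in candidates:
        trial = current ^ {lang}  # add if absent, drop if present
        score = evaluate(trial)
        if score > best:
            current, best = trial, score
    return current, best
```

This only finds a local optimum, of course, but repeated passes (or restarts from different seed subsets) are still far cheaper than the exhaustive search.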

I did briefly mess around with the types of models used (query-based vs. wiki-text-based). The benefit of the query-based models is still clear: they consistently outperform wiki-text-based models on the query corpora, generally by 2-3% F0.5.

... More to come ...