User:TJones (WMF)/Notes/TextCat Improvements

November 2016 — See TJones_(WMF)/Notes for other projects.

Intro
Related to work on TextCat and Confidence, I'm trying several approaches to improving TextCat performance. Phab parent task: T140289

I expect to write up basic results on new features as I add them, but I may come back to them as I add more features that allow them to interact. As an example, adding the ability to use language models from multiple directories can give immediate improvements when there are distinctive languages to be considered (e.g., Burmese, which has it's own writing system) but not in others (e.g., Catalan, which is too similar to Spanish to include on eswiki without also making improvements that give a boost to the host language).

Optimization Framework
(November 2016 — Phab task: T149314)

I've slapped together a wrapper around my existing language identification tool (i.e., the Perl version of TextCat) and my identification results evaluation tool; it can perform a grid search optimization (of F0.5) over multiple dimensions of hyperparameters, which can be numeric or nominal.

Since any problem in computer science can be solved by introducing another level of indirection, I expanded the wrapper to include the ability to run itself over several different corpora, note the optimal config and F0.5 score for each corpus, and then choose the configuration that gives the best collective results—as measured by the square "error" from the individual optimal configurations for each corpus, along with a hard constraint that no corpus decrease from its baseline performance (i.e, on the current config) by more than a fixed amount (0.5% F0.5 for now). Evaluation can be done on all varying parameters at once, or on a "relevant" subset, each combination of which is represented by the optimal configuration per corpus, found by allowing all other parameters to vary.

It's spiffy.

Initial Findings
(November 2016 — Phab task: T149316)

I reviewed the previous language model optimization for TextCat or English, French, Italian, Spanish, German, Portuguese, Russian, Japanese, and Dutch Wikipedias, looking at languages for which there were Query-based models available, but for which there are Wikitext-based models available. The same issues are at work in these cases: very distinctive languages are easy (e.g., those with a unique writing system), and closely related languages are hard (e.g., Catalan and Spanish, Afrikaans and Dutch). Another issue is the set of languages and number if instances of each language represented in the sample. In some cases there are many languages that are not covered, in others, very few; in all cases, there weren't very many instances of these previously not covered languages.

Summary of Results
We can get a tenth of a percentage point or two increase in F0.5 for some wikis by adding previously not covered languages. For many wikis, there weren't many languages to add, and when there were, often they were too similar to others to be of help. There are a number of languages that we could add easier-to-build Wiki-text models for to expand coverage for certain wikis and earn a few more tenths of F0.5.

Details

 * English: I re-optimized form the full set of relevant languages, and arrived at a similar set of optimal languages. Adding Slovak (sk) improved recall for Slovak, but it was offset by other errors. Likely that adding Wiki-text models for Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am) could make improvements at the margins.
 * French: Adding Hungarian gets the one Hungarian example in the sample, increasing F0.5 by 0.1%. Other languages (Breton, Icelandic, Latin) have too many false positives.
 * German: Latin was the only language not already accounted for, and adding it didn't help.
 * Spanish: Catalan was the only language not accounted for, and adding it didn't help.
 * Italian: Latin and Romanian were the only languages not accounted for, and adding them didn't help.
 * Portuguese: Tagalog and Latin were the only languages not accounted for, and adding them didn't help.
 * Russian: Adding Finnish (fi) improved recall for Finnish, but it was offset by other errors. Likely that adding Wiki-text models for Azerbaijani (az) or other languages could make improvements at the margins.
 * Japanese: Kazakh was the only language not accounted for, and no model is available for it.
 * Dutch: Adding Burmese gets the one Burmese example in the sample, increasing F0.5 by 0.2%. Adding Finnish and Croatian (fi, hr) improved recall for Finnish and Croatian, but it was offset by other errors.

Current Recommendations

 * I can add Burmese, Oriya, Malayalam, and Kannada to the list of "easy" language models to be added when present (along with Greek, Korean, Hebrew, Japanese, Thai, Telugu, Georgian, and when there are no examples of "competing" languages using the same script, Arabic, Russian, and Hindi).
 * We could include the models for these "easy" languages with very distinctive writing systems in general (and not only when they are present in a largish sample) because identifying them is "easy" even if they are very low frequency. The main concern is the computational cost of including more models, though I believe that the current implementation of the PHP version of TextCat in production loads all available models anyway.
 * There are a couple of languages present in the samples above that I should build Wiki-text models for and re-assess: Azerbaijani, Swahili, Hausa, Khmer, Amharic (az, sw, ha, km, am), particularly Khmer and Amharic since they have distinctive writing systems.
 * I expect that as other features in this list will interact with this feature, and as precision generally improves, it will be possible to include additional models, including Wiki-text-based models.