User:TJones (WMF)/Notes/TextCat Re-optimization for enwiki

June 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T138315)

Background & Highlights
I’m posting the results for optimizing TextCat for enwiki separately from the others in the same Phab ticket because this is a re-evaluation of English using different criteria to extract a sample. The good news is that while the selection criteria were fairly different and the specifics of the long tail differ, the sample extracted has a fairly similar distribution of languages represented, the optimized set of languages for identification is compatible, and the previous set of languages performs quite well on the current sample. See “Comparison to Earlier Analysis” below for more details.

See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpus was created.

Summary of Results
Using the default 3K models, the best options for enwiki are presented below:

enwiki
 * languages: English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian; possibly Bulgarian, Tamil, and Portuguese
 * lang codes: en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka; possibly bg, ta, pt
 * relevant poor-performing queries: 31%
 * f0.5: 83.0%

English Results
About 13% of the original 10K corpus was removed in the initial filtering. A 2,000-query random sample was taken, and 64% of those queries were discarded, leaving a 721-query corpus. Thus only about 31% of poorly performing queries are in an identifiable language.
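As a quick sanity check on that 31% figure, the arithmetic can be reproduced from the counts above:

```python
# Sanity check on the ~31% figure, using the counts reported in the text.
sampled = 2000               # random sample drawn after initial filtering
identifiable = 721           # queries remaining after ~64% were discarded
kept_after_filtering = 0.87  # ~13% of the original 10K was removed first

frac_identifiable = identifiable / sampled          # ~36% of the sample
overall = kept_after_filtering * frac_identifiable  # ~31% overall
print(f"{overall:.0%}")  # 31%
```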

Other languages searched on enwiki
Based on the sample of 721 poor-performing queries on enwiki that are in some identifiable language, about 70% are in English; 3-5% each are in Chinese, Spanish, Arabic, and German; and a large number of other languages each account for less than 1-2%.

The sample was tallied by language, with raw counts, percentages, and 95% margins of error. In decreasing order of frequency, the languages found are English, Chinese, Spanish, Arabic, German, Persian, French, Vietnamese, Russian, Polish, Indonesian, Italian, Portuguese, Japanese, Czech, Swedish, Norwegian, Malay, Croatian, Hebrew, Bengali, Turkish, Tagalog, Thai, Dutch, Latin, Icelandic, Azerbaijani, Afrikaans, Urdu, Ukrainian, Swahili, Slovak, Kinyarwanda, Korean, Khmer, Hungarian, Hausa, Irish, and Amharic.
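The 95% margin of error used here is the simple normal-approximation calculation for a sample proportion. A minimal sketch, using the English count (500 of the 721 identifiable queries, taken from the performance details below) as an example:

```python
import math

def margin_of_error(count, n, z=1.96):
    """Simple normal-approximation 95% margin of error for a proportion."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

# English: 500 of the 721 identifiable queries
moe = margin_of_error(500, 721)
print(f"{500/721:.1%} +/- {moe:.1%}")  # 69.3% +/- 3.4%
```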

We don’t have query-trained language models for many of the languages in the long tail. Since these each represent very small slices of our corpus (<= 3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,727 queries remaining after the initial filtering, and focusing on queries in other writing systems, there are also small numbers of Greek, Telugu, Georgian, and Hindi queries, as well as Malayalam, Amharic, and Khmer queries (for which we do not have models).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and in increments of 1,000 from there up to 10,000. Results for the 3K model, and for the model sizes that did as well or better, are here:

model size     3000    3500
TOTAL         74.2%   74.4%
English       84.2%   84.3%
Chinese       93.3%   93.3%
Spanish       61.3%   61.3%
Arabic       100.0%  100.0%
German        59.0%   60.0%
Persian       95.7%   95.7%
French        34.0%   34.0%
Indonesian    48.0%   46.2%
Polish        63.2%   70.0%
Russian       92.3%   92.3%
Vietnamese   100.0%  100.0%
Italian       20.5%   20.5%
Japanese      90.9%   90.9%
Portuguese    53.3%   53.3%
Czech         36.4%   36.4%
Bengali      100.0%  100.0%
Croatian       0.0%    0.0%
Hebrew       100.0%  100.0%
Malay          0.0%    0.0%
Norwegian      0.0%    0.0%
Swedish       16.7%   16.7%
Afrikaans      0.0%    0.0%
Azerbaijani    0.0%    0.0%
Dutch         16.7%   18.2%
Icelandic      0.0%    0.0%
Latin          0.0%    0.0%
Tagalog        0.0%    0.0%
Thai         100.0%  100.0%
Turkish       33.3%   33.3%
Amharic        0.0%    0.0%
Hausa          0.0%    0.0%
Hungarian      0.0%    0.0%
Irish          0.0%    0.0%
Khmer          0.0%    0.0%
Kinyarwanda    0.0%    0.0%
Korean       100.0%  100.0%
Slovak         0.0%    0.0%
Swahili        0.0%    0.0%
Ukrainian     66.7%   50.0%
Urdu           0.0%    0.0%
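For reference, TextCat follows the Cavnar & Trenkle rank-order n-gram method, and "model size" above is the number of top-ranked n-grams retained per language model. Below is a much-simplified sketch of the idea, not the production implementation; the tokenization, padding, and penalty details are assumptions:

```python
from collections import Counter

def ngram_profile(text, max_n=5, model_size=3000):
    """Rank-ordered character n-gram profile (Cavnar & Trenkle style).
    Words are padded with '_' and 1- to 5-grams are counted; only the
    model_size most frequent n-grams are kept, mapped to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(model_size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(query_profile, lang_profile, model_size=3000):
    """Sum of rank differences between profiles; n-grams missing from
    the language model get the maximum penalty (model_size)."""
    return sum(
        abs(rank - lang_profile.get(gram, model_size))
        for gram, rank in query_profile.items()
    )

# Identification picks the language whose model has the smallest distance,
# e.g.: best = min(models, key=lambda lg: out_of_place(profile, models[lg]))
```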

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

             f0.5     f1      f2  recall    prec   total   hits  misses
TOTAL       74.5%  74.2%   73.9%   73.6%   74.7%     721    531     180
English     92.7%  84.2%   77.1%   73.0%   99.5%     500    365       2
Chinese     97.2%  93.3%   89.7%   87.5%  100.0%      32     28       0
Spanish     56.9%  61.3%   66.4%   70.4%   54.3%      27     19      16
Arabic     100.0% 100.0%  100.0%  100.0%  100.0%      25     25       0
German      51.4%  59.0%   69.2%   78.3%   47.4%      23     18      20
Persian     93.2%  95.7%   98.2%  100.0%   91.7%      11     11       1
French      24.7%  34.0%   54.2%   90.0%   20.9%      10      9      34
Indonesian  38.0%  48.0%   65.2%   85.7%   33.3%       7      6      12
Polish      54.5%  63.2%   75.0%   85.7%   50.0%       7      6       6
Russian     96.8%  92.3%   88.2%   85.7%  100.0%       7      6       0
Vietnamese 100.0% 100.0%  100.0%  100.0%  100.0%       7      7       0
Italian     14.5%  20.5%   35.1%   66.7%   12.1%       6      4      29
Japanese    86.2%  90.9%   96.2%  100.0%   83.3%       5      5       1
Portuguese  44.4%  53.3%   66.7%   80.0%   40.0%       5      4       6
Czech       31.2%  36.4%   43.5%   50.0%   28.6%       4      2       5
Bengali    100.0% 100.0%  100.0%  100.0%  100.0%       3      3       0
Croatian     0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Hebrew     100.0% 100.0%  100.0%  100.0%  100.0%       3      3       0
Malay        0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Norwegian    0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Swedish     11.5%  16.7%   30.3%   66.7%    9.5%       3      2      19
Afrikaans    0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Azerbaijani  0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Dutch       11.1%  16.7%   33.3%  100.0%    9.1%       2      2      20
Icelandic    0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Latin        0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Tagalog      0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Thai       100.0% 100.0%  100.0%  100.0%  100.0%       2      2       0
Turkish     23.8%  33.3%   55.6%  100.0%   20.0%       2      2       8
Amharic      0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Hausa        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Hungarian    0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Irish        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Khmer        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Kinyarwanda  0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Korean     100.0% 100.0%  100.0%  100.0%  100.0%       1      1       0
Slovak       0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Swahili      0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Ukrainian   55.6%  66.7%   83.3%  100.0%   50.0%       1      1       1
Urdu         0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
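Every score in these tables can be reconstructed from the three raw counts: precision is hits / (hits + misses), recall is hits / total, and the F-scores combine the two. A minimal sketch, using the English row under the 3K model as a check:

```python
def f_beta(hits, misses, total, beta):
    """F-beta score from the raw counts used in the tables:
    precision = hits / (hits + misses), recall = hits / total."""
    precision = hits / (hits + misses)
    recall = hits / total
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# English under the 3K model: total=500, hits=365, misses=2
print(f"{f_beta(365, 2, 500, 0.5):.1%}")  # f0.5 -> 92.7% in the table
print(f"{f_beta(365, 2, 500, 1.0):.1%}")  # f1   -> 84.2% in the table
print(f"{f_beta(365, 2, 500, 2.0):.1%}")  # f2   -> 77.1% in the table
```

F0.5 weights precision more heavily than recall, which is why it is the headline metric: a wrong language guess (a false positive) is worse for the searcher than no guess at all.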

French, German, Italian, Swedish, and Dutch all do very poorly, with too many false positives. Turkish isn't terrible in terms of raw false positives, but it isn't great, either. Once French and Italian are eliminated, Portuguese does very poorly, too.

As noted above, Greek, Telugu, Georgian, and Hindi are present in the larger sample, and since our models for these languages are very accurate, I've included them.

The final language set is English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian. With this language set, 3K is the optimal model size.

The detailed report for the 3K model is here:

             f0.5     f1      f2  recall    prec   total   hits  misses
TOTAL       83.0%  82.8%   82.6%   82.5%   83.1%     721    595     121
English     92.4%  92.6%   92.9%   93.0%   92.3%     500    465      39
Chinese     97.2%  93.3%   89.7%   87.5%  100.0%      32     28       0
Spanish     47.5%  58.1%   74.9%   92.6%   42.4%      27     25      34
Arabic     100.0% 100.0%  100.0%  100.0%  100.0%      25     25       0
German       0.0%   0.0%    0.0%    0.0%    0.0%      23      0       0
Persian     93.2%  95.7%   98.2%  100.0%   91.7%      11     11       1
French       0.0%   0.0%    0.0%    0.0%    0.0%      10      0       0
Indonesian  21.6%  30.0%   49.2%   85.7%   18.2%       7      6      27
Polish      35.4%  46.7%   68.6%  100.0%   30.4%       7      7      16
Russian     96.8%  92.3%   88.2%   85.7%  100.0%       7      6       0
Vietnamese  81.4%  87.5%   94.6%  100.0%   77.8%       7      7       2
Italian      0.0%   0.0%    0.0%    0.0%    0.0%       6      0       0
Japanese    86.2%  90.9%   96.2%  100.0%   83.3%       5      5       1
Portuguese   0.0%   0.0%    0.0%    0.0%    0.0%       5      0       0
Czech        0.0%   0.0%    0.0%    0.0%    0.0%       4      0       0
Bengali    100.0% 100.0%  100.0%  100.0%  100.0%       3      3       0
Croatian     0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Hebrew     100.0% 100.0%  100.0%  100.0%  100.0%       3      3       0
Malay        0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Norwegian    0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Swedish      0.0%   0.0%    0.0%    0.0%    0.0%       3      0       0
Afrikaans    0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Azerbaijani  0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Dutch        0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Icelandic    0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Latin        0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Tagalog      0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Thai       100.0% 100.0%  100.0%  100.0%  100.0%       2      2       0
Turkish      0.0%   0.0%    0.0%    0.0%    0.0%       2      0       0
Amharic      0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Hausa        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Hungarian    0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Irish        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Khmer        0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Kinyarwanda  0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Korean     100.0% 100.0%  100.0%  100.0%  100.0%       1      1       0
Slovak       0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Swahili      0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0
Ukrainian   55.6%  66.7%   83.3%  100.0%   50.0%       1      1       1
Urdu         0.0%   0.0%    0.0%    0.0%    0.0%       1      0       0

Recall went up and precision went down for English, Spanish, Indonesian, Polish, Vietnamese, and others, but overall performance improved. Queries in unrepresented languages were most often identified as English, Spanish, or Indonesian (decreasing precision for all three), but the now-unused models are no longer generating lots of false positives and dragging down overall precision.

Comparison to Earlier Analysis
Previously, we’ve been using a very different data source for optimizing TextCat’s languages for enwiki. In my original analysis for enwiki I used a 1K query set gathered for a general review of enwiki usage. It was sampled from a single day, included API requests (which made up about 2/3 of the queries), and had none of the simple anti-bot precautions we use now (e.g., only queries from the search box, excluding users with more than 30 queries/day, taking only one query from any IP per day, etc.). It was also limited to queries that got zero results, rather than the current criterion of fewer than three results (i.e., “poorly performing”). It also had significantly fewer “junk” queries, which I hypothesize is due to the inclusion of API queries, but that’s just a guess.

The proportions of queries in different languages for the previous and current samples are below. Given the differences in the sources, significant differences would not be surprising, but only English, Arabic, and German have non-overlapping 95% confidence intervals (using the Wilson score interval, which “has good properties even for a small number of trials and/or an extreme probability”, i.e., it won’t give negative numbers, instead of the simple margin-of-error calculations I have been using, as in the table above). The Arabic 95% intervals miss by less than 0.01%, and all languages overlap in their 99% confidence intervals. The long tails are noisy and differ, but given the limited sample sizes, that’s to be expected.
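The Wilson score interval mentioned above is straightforward to compute. A minimal sketch, using the English proportion (500 of 721) as an example:

```python
import math

def wilson_interval(count, n, z=1.96):
    """Wilson score interval for a sample proportion; unlike the simple
    margin-of-error calculation, its bounds never leave [0, 1]."""
    p = count / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# English in the current sample: 500 of 721 queries
lo, hi = wilson_interval(500, 721)

# Even for a single-query language (1 of 721), the lower bound stays >= 0:
assert wilson_interval(1, 721)[0] >= 0
```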

Based on the current sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Chinese, English, Georgian, Greek, Hebrew, Hindi, Indonesian, Japanese, Korean, Persian, Polish, Russian, Spanish, Telugu, Thai, Ukrainian, and Vietnamese, with F0.5 of 83.0%.

Based on the previous sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Bulgarian, Chinese, English, Greek, Hindi, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Tamil, and Thai, with a slightly higher F0.5 of 83.1%.

The difference is the addition of Georgian, Hebrew, Indonesian, Polish, Telugu, Ukrainian, and Vietnamese, and the removal of Bulgarian, Portuguese, and Tamil. Why these changes?

The previous sample had no Hebrew, Ukrainian, or Vietnamese, and the newer sample had no Bulgarian or Tamil. Georgian and Telugu were added because they are present in the much larger 100K unreviewed sample, and cause no false recall problems when added.

That leaves Portuguese (removed), and Indonesian and Polish (added). Interestingly, there’s a pattern between each language’s share of the sample and the direction of the change: the percentage of Portuguese queries decreased, while the percentages of Indonesian and Polish queries increased. My hypothesis is that having more queries to potentially get correct (especially more than just one) can offset the generally more stable number of false positives among the better-represented languages.

For Indonesian and Portuguese, the effect is quite small. Removing Indonesian doesn’t change the overall score for the evaluation set (the errors just shift around, and I prefer using more languages to fewer); adding in Portuguese decreases F0.5 by 0.4%. Removing Polish has a small effect, decreasing F0.5 by 0.2%.

These minor differences probably represent some overfitting to these particular samples.

Running the current sample with the optimized list from the previous sample gives an F0.5 score of 81.4%, further indicating that we’re probably overfitting a bit, and that it doesn’t matter too much.

enwiki: Best Options
The optimal settings for enwiki, based on these experiments, would be to use models for English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian (en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka), using the default 3000-ngram models.

Based on information from earlier experiments, including Bulgarian, Tamil, and even Portuguese (bg, ta, pt) would not be amiss.

So far, English Wikipedia has the most diverse collection of languages represented in its queries. If the cost of running so many models (19 or 22 models!) is too high, it would be least damaging to drop Ukrainian, Hindi, Greek, Telugu, Georgian, Bulgarian, Tamil, Portuguese, Korean, and Thai.