User:TJones (WMF)/Notes/TextCat Re-optimization for enwiki

From mediawiki.org

June 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T138315)

Background & Highlights

I’m posting the results for optimizing TextCat for enwiki separately from the others in the same Phab ticket because this is a re-evaluation of English using different criteria to extract a sample. The good news is that while the selection criteria were fairly different and the specifics of the long tail differ, the sample extracted has a fairly similar distribution of languages represented, the optimized set of languages for identification is compatible, and the previous set of languages performs quite well on the current sample. See “Comparison to Earlier Analysis” below for more details.

See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpus was created.

Summary of Results

Using the default 3K models, the best options for enwiki are presented below:

enwiki

  • languages: English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian; possibly Bulgarian, Tamil, and Portuguese
  • lang codes: en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka; possibly bg, ta, pt
  • relevant poor-performing queries: 31%
  • f0.5: 83.0%

English Results

About 13% of the original 10K corpus was removed in the initial filtering. A 2000-query random sample was taken, and 64% of those queries were discarded, leaving a 721-query corpus. Thus only about 31% of low-performing queries are in an identifiable language.
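The 31% figure follows from combining the two filtering steps. A minimal sketch of the arithmetic, using only the numbers quoted above (the variable names are illustrative, not from the original analysis):

```python
# Figures quoted in the paragraph above.
removed_by_filtering = 0.13   # "about 13%" of the 10K corpus dropped up front
sample_size = 2000            # random sample taken after filtering
kept_after_review = 721       # queries left in an identifiable language

# Share of the reviewed sample that was kept (roughly 36%, i.e. ~64% discarded).
kept_fraction = kept_after_review / sample_size

# Share of all poor-performing queries that are in an identifiable language.
identifiable = (1 - removed_by_filtering) * kept_fraction

print(f"kept from sample: {kept_fraction:.1%}")
print(f"identifiable overall: {identifiable:.1%}")
```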

Other languages searched on enwiki

Based on the sample of 721 poor-performing queries on enwiki that are in some language, about 70% are in English, 3-5% each are in Chinese, Spanish, Arabic, and German, and fewer than 2% each are in a large number of other languages.

Below are the results for enwiki, with raw counts, percentage, and 95% margin of error.

count lg % +/-
500 en 69.35% 3.37%
32 zh 4.44% 1.50%
27 es 3.74% 1.39%
25 ar 3.47% 1.34%
23 de 3.19% 1.28%
11 fa 1.53% 0.89%
10 fr 1.39% 0.85%
7 vi 0.97% 0.72%
7 ru 0.97% 0.72%
7 pl 0.97% 0.72%
7 id 0.97% 0.72%
6 it 0.83% 0.66%
5 pt 0.69% 0.61%
5 ja 0.69% 0.61%
4 cs 0.55% 0.54%
3 sv 0.42% 0.47%
3 no 0.42% 0.47%
3 ms 0.42% 0.47%
3 hr 0.42% 0.47%
3 he 0.42% 0.47%
3 bn 0.42% 0.47%
2 tr 0.28% 0.38%
2 tl 0.28% 0.38%
2 th 0.28% 0.38%
2 nl 0.28% 0.38%
2 la 0.28% 0.38%
2 is 0.28% 0.38%
2 az 0.28% 0.38%
2 af 0.28% 0.38%
1 ur 0.14% 0.27%
1 uk 0.14% 0.27%
1 sw 0.14% 0.27%
1 sk 0.14% 0.27%
1 rw 0.14% 0.27%
1 ko 0.14% 0.27%
1 km 0.14% 0.27%
1 hu 0.14% 0.27%
1 ha 0.14% 0.27%
1 ga 0.14% 0.27%
1 am 0.14% 0.27%

In order, those are English, Chinese, Spanish, Arabic, German, Persian, French, Vietnamese, Russian, Polish, Indonesian, Italian, Portuguese, Japanese, Czech, Swedish, Norwegian, Malay, Croatian, Hebrew, Bengali, Turkish, Tagalog, Thai, Dutch, Latin, Icelandic, Azerbaijani, Afrikaans, Urdu, Ukrainian, Swahili, Slovak, Kinyarwanda, Korean, Khmer, Hungarian, Hausa, Irish, and Amharic.
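The +/- column in the table above is consistent with the simple normal-approximation margin of error for a proportion. A minimal sketch, assuming z = 1.96 for the 95% level (the function name is illustrative):

```python
import math

def margin_of_error(count, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

n = 721
for lang, count in [("en", 500), ("zh", 32), ("am", 1)]:
    p = count / n
    moe = margin_of_error(count, n)
    print(f"{count:4d} {lang} {p:.2%} +/- {moe:.2%}")
```

For languages with only one or two queries, the margin is larger than the percentage itself, so the lower bound of the interval falls below zero. That is a known weakness of this simple calculation at small counts.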

We don’t have query-trained language models for many of the languages in the long tail. Since these each represent very small slices of our corpus (<= 3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,727 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Telugu, Georgian, and Hindi queries, and Malayalam, Amharic, and Khmer (for which we do not have models).

Analysis and Optimization

Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results (F1 scores) for the 3K models, and for the models that did as well or better, are here:

   model size     3000    3500    
        TOTAL     74.2%   74.4%
      English     84.2%   84.3%
      Chinese     93.3%   93.3%
      Spanish     61.3%   61.3%
       Arabic    100.0%  100.0%
       German     59.0%   60.0%
      Persian     95.7%   95.7%
       French     34.0%   34.0%
   Indonesian     48.0%   46.2%
       Polish     63.2%   70.0%
      Russian     92.3%   92.3%
   Vietnamese    100.0%  100.0%
      Italian     20.5%   20.5%
     Japanese     90.9%   90.9%
   Portuguese     53.3%   53.3%
        Czech     36.4%   36.4%
      Bengali    100.0%  100.0%
     Croatian      0.0%    0.0%
       Hebrew    100.0%  100.0%
        Malay      0.0%    0.0%
    Norwegian      0.0%    0.0%
      Swedish     16.7%   16.7%
    Afrikaans      0.0%    0.0%
  Azerbaijani      0.0%    0.0%
        Dutch     16.7%   18.2%
    Icelandic      0.0%    0.0%
        Latin      0.0%    0.0%
      Tagalog      0.0%    0.0%
         Thai    100.0%  100.0%
      Turkish     33.3%   33.3%
      Amharic      0.0%    0.0%
        Hausa      0.0%    0.0%
    Hungarian      0.0%    0.0%
        Irish      0.0%    0.0%
        Khmer      0.0%    0.0%
  Kinyarwanda      0.0%    0.0%
       Korean    100.0%  100.0%
       Slovak      0.0%    0.0%
      Swahili      0.0%    0.0%
    Ukrainian     66.7%   50.0%
         Urdu      0.0%    0.0%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

                 f0.5    f1      f2      recall  prec    total   hits    misses
        TOTAL     74.5%   74.2%   73.9%   73.6%   74.7%  721     531     180
      English     92.7%   84.2%   77.1%   73.0%   99.5%  500     365     2
      Chinese     97.2%   93.3%   89.7%   87.5%  100.0%  32      28      0
      Spanish     56.9%   61.3%   66.4%   70.4%   54.3%  27      19      16
       Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  25      25      0
       German     51.4%   59.0%   69.2%   78.3%   47.4%  23      18      20
      Persian     93.2%   95.7%   98.2%  100.0%   91.7%  11      11      1
       French     24.7%   34.0%   54.2%   90.0%   20.9%  10      9       34
   Indonesian     38.0%   48.0%   65.2%   85.7%   33.3%  7       6       12
       Polish     54.5%   63.2%   75.0%   85.7%   50.0%  7       6       6
      Russian     96.8%   92.3%   88.2%   85.7%  100.0%  7       6       0
   Vietnamese    100.0%  100.0%  100.0%  100.0%  100.0%  7       7       0
      Italian     14.5%   20.5%   35.1%   66.7%   12.1%  6       4       29
     Japanese     86.2%   90.9%   96.2%  100.0%   83.3%  5       5       1
   Portuguese     44.4%   53.3%   66.7%   80.0%   40.0%  5       4       6
        Czech     31.2%   36.4%   43.5%   50.0%   28.6%  4       2       5
      Bengali    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
     Croatian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
       Hebrew    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
        Malay      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Norwegian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
      Swedish     11.5%   16.7%   30.3%   66.7%    9.5%  3       2       19
    Afrikaans      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
  Azerbaijani      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
        Dutch     11.1%   16.7%   33.3%  100.0%    9.1%  2       2       20
    Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
        Latin      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Tagalog      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
         Thai    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
      Turkish     23.8%   33.3%   55.6%  100.0%   20.0%  2       2       8
      Amharic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
        Hausa      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
        Irish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
        Khmer      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Kinyarwanda      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Korean    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
       Slovak      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Ukrainian     55.6%   66.7%   83.3%  100.0%   50.0%  1       1       1
         Urdu      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
                 f0.5    f1      f2      recall  prec    total   hits    misses
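The columns in the table above can be reproduced from the raw counts. A minimal sketch, assuming that "hits" are correctly identified queries and "misses" are queries wrongly assigned to that language (false positives); that reading is inferred from the numbers, and the function name is illustrative:

```python
def scores(total, hits, misses, beta=0.5):
    """Return (recall, precision, F-beta) from one row's raw counts.

    total:  queries actually in the language
    hits:   queries correctly identified as the language
    misses: queries wrongly identified as the language (assumed meaning)
    """
    recall = hits / total if total else 0.0
    precision = hits / (hits + misses) if hits + misses else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return recall, precision, f_beta

# English row: total=500, hits=365, misses=2
for beta in (0.5, 1, 2):
    recall, precision, f = scores(500, 365, 2, beta)
    print(f"F{beta}: {f:.1%}  recall: {recall:.1%}  prec: {precision:.1%}")
```

F0.5 weights precision more heavily than recall, so models that generate many false positives are penalized more than models that miss some queries.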

French, German, Italian, Swedish, and Dutch all do very poorly, with too many false positives. Turkish isn’t terrible in terms of raw false positives, but it isn’t great, either. Once French and Italian are eliminated, Portuguese does very poorly, too.

As noted above, Greek, Telugu, and Georgian are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.

The final language set is English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian. With this language set, 3K is the optimal model size.

The detailed report for the 3K model is here:

                f0.5    f1      f2      recall  prec    total   hits    misses
       TOTAL     83.0%   82.8%   82.6%   82.5%   83.1%  721     595     121
     English     92.4%   92.6%   92.9%   93.0%   92.3%  500     465     39
     Chinese     97.2%   93.3%   89.7%   87.5%  100.0%  32      28      0
     Spanish     47.5%   58.1%   74.9%   92.6%   42.4%  27      25      34
      Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  25      25      0
      German      0.0%    0.0%    0.0%    0.0%    0.0%  23      0       0
     Persian     93.2%   95.7%   98.2%  100.0%   91.7%  11      11      1
      French      0.0%    0.0%    0.0%    0.0%    0.0%  10      0       0
  Indonesian     21.6%   30.0%   49.2%   85.7%   18.2%  7       6       27
      Polish     35.4%   46.7%   68.6%  100.0%   30.4%  7       7       16
     Russian     96.8%   92.3%   88.2%   85.7%  100.0%  7       6       0
  Vietnamese     81.4%   87.5%   94.6%  100.0%   77.8%  7       7       2
     Italian      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
    Japanese     86.2%   90.9%   96.2%  100.0%   83.3%  5       5       1
  Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
       Czech      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Bengali    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Croatian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
      Hebrew    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
       Malay      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Norwegian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
     Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Afrikaans      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
 Azerbaijani      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
       Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
   Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
       Latin      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Tagalog      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
        Thai    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
     Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Amharic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Hausa      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
   Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Irish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Khmer      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Kinyarwanda      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Korean    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Slovak      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
   Ukrainian     55.6%   66.7%   83.3%  100.0%   50.0%  1       1       1
        Urdu      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
                 f0.5    f1      f2      recall  prec    total   hits    misses

Recall went up and precision went down for English, Spanish, Indonesian, Polish, Vietnamese, and others, but overall performance improved. Queries in unrepresented languages were most often identified as English, Spanish, or Indonesian (decreasing precision for all three), but the now-unused models are no longer generating lots of false positives and bringing down precision overall.

Comparison to Earlier Analysis

Previously, we’ve been using a very different data source for optimizing languages for TextCat for enwiki. In my original analysis for enwiki I used a 1K query set gathered for a general review of enwiki usage. It was sampled from a single day, included API requests (which made up about 2/3 of the queries), and had none of the simple anti-bot precautions we use now (e.g., queries from the search box only, excluding users with more than 30 queries/day, only one query from any IP/day, etc.). It was also limited to queries that got zero results, rather than the current criterion of fewer than three results (i.e., “poorly performing”). It also had significantly fewer “junk” queries, which I hypothesize is due to the inclusion of API queries, but that’s just a guess.

The proportions of queries in different languages for the previous and current samples are below. Given the differences in the sources, significant differences would not be surprising, but only English, Arabic, and German have non-overlapping 95% confidence intervals. These intervals use the Wilson Score Interval, which “has good properties even for a small number of trials and/or an extreme probability” (i.e., it won’t give negative numbers), instead of the simple margin of error calculations I have been using, as in the table above. The Arabic 95% intervals miss by less than 0.01%, and all languages overlap in their 99% confidence intervals.

previous   current   lang
77.32%     69.35%    English
2.58%      4.44%     Chinese
5.54%      3.74%     Spanish
1.29%      3.47%     Arabic
1.03%      3.19%     German
0.52%      1.53%     Persian
1.29%      1.39%     French
0.52%      0.97%     Indonesian
0.13%      0.97%     Polish
0.64%      0.97%     Russian
-          0.97%     Vietnamese
0.26%      0.83%     Italian
0.13%      0.69%     Japanese
2.45%      0.69%     Portuguese
-          0.55%     Czech
0.26%      0.42%     Bengali
0.13%      0.42%     Croatian
-          0.42%     Hebrew
0.77%      0.42%     Malay
0.26%      0.42%     Norwegian
0.13%      0.42%     Swedish
-          0.28%     Afrikaans
-          0.28%     Azerbaijani
0.13%      0.28%     Dutch
-          0.28%     Icelandic
0.13%      0.28%     Latin
1.16%      0.28%     Tagalog
0.13%      0.28%     Thai
0.64%      0.28%     Turkish
-          0.14%     Amharic
-          0.14%     Hausa
-          0.14%     Hungarian
-          0.14%     Irish
-          0.14%     Khmer
-          0.14%     Kinyarwanda
0.39%      0.14%     Korean
-          0.14%     Slovak
0.52%      0.14%     Swahili
-          0.14%     Ukrainian
-          0.14%     Urdu
0.26%      -         Bulgarian
0.13%      -         Estonian
0.13%      -         Finnish
0.13%      -         Greek
0.26%      -         Hindi
0.13%      -         Hmong
0.13%      -         Kannada
0.13%      -         Serbian
0.13%      -         Somali
0.13%      -         Tamil
0.13%      -         Uzbek
776        721       sample size
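The Wilson score interval comparison described above can be sketched as follows. This is an illustration, not the original calculation: it assumes z = 1.96 for the 95% level, and the previous-sample counts (600 English, 10 Arabic) are back-computed here from the percentages and the 776 sample size rather than taken from the earlier report.

```python
import math

def wilson_interval(count, n, z=1.96):
    """Wilson score interval (low, high) for a proportion count/n."""
    p = count / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# (language, previous count out of 776, current count out of 721)
# Previous counts are derived from the percentages above (an assumption).
for lang, prev, cur in [("English", 600, 500), ("Arabic", 10, 25)]:
    prev_lo, prev_hi = wilson_interval(prev, 776)
    cur_lo, cur_hi = wilson_interval(cur, 721)
    overlap = prev_lo <= cur_hi and cur_lo <= prev_hi
    print(f"{lang}: previous {prev_lo:.2%} to {prev_hi:.2%}, "
          f"current {cur_lo:.2%} to {cur_hi:.2%}, overlap: {overlap}")
```

Unlike the simple margin of error, the lower bound here stays above zero even for a language with a single query in the sample.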

The long tails are noisy and differ, but given the limited sample sizes, that’s to be expected.

Based on the current sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Chinese, English, Georgian, Greek, Hebrew, Hindi, Indonesian, Japanese, Korean, Persian, Polish, Russian, Spanish, Telugu, Thai, Ukrainian, and Vietnamese, with F0.5 of 83.0%.

Based on the previous sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Bulgarian, Chinese, English, Greek, Hindi, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Tamil, and Thai, with a slightly higher F0.5 of 83.1%.

The difference is the addition of Georgian, Hebrew, Indonesian, Polish, Telugu, Ukrainian, and Vietnamese, and the removal of Bulgarian, Portuguese, and Tamil. Why these changes?

The previous sample had no Hebrew, Ukrainian, or Vietnamese, and the newer sample had no Bulgarian or Tamil. Georgian and Telugu were added because they are present in the much larger 100K unreviewed sample, and cause no false recall problems when added.

That leaves Portuguese (removed), and Indonesian and Polish (added). Interestingly, there’s a pattern in the percentage of queries in the sample and the direction of change: the percentage of Portuguese queries decreased, and the percentage of Indonesian and Polish queries increased. My hypothesis is that having more queries (especially more than just one) to potentially get correct can offset the generally more stable number of false positives among more well-represented languages.

For Indonesian and Portuguese, the effect is quite small. Removing Indonesian doesn’t change the overall score for the evaluation set (the errors just shift around, and I prefer using more languages to fewer); adding in Portuguese decreases F0.5 by 0.4%. Removing Polish also has a small effect, decreasing F0.5 by 0.2%.

These minor differences probably represent some overfitting to these particular samples.

Running the current sample with the optimized list from the previous sample gives an F0.5 score of 81.4%, further indicating that we’re probably overfitting a bit, and that it doesn’t matter too much.

enwiki: Best Options

The optimal settings for enwiki, based on these experiments, would be to use models for English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian (en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka), using the default 3000-ngram models.

Based on information from earlier experiments, including Bulgarian, Tamil, and even Portuguese (bg, ta, pt) would not be amiss.

So far, English Wikipedia has the most diverse collection of languages represented in its queries. If the cost of running so many models (19 or 22 models!) is too high, it would be least damaging to drop Ukrainian, Hindi, Greek, Telugu, Georgian, Bulgarian, Tamil, Portuguese, Korean, and Thai.