User:TJones (WMF)/Notes/TextCat Optimization for plwiki arwiki zhwiki and nlwiki

September 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T142140)

Summary of Results
Using the default 3K models, the best options for each wiki are presented below:

nlwiki
 * languages: Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, Russian
 * lang codes: nl, en, zh, ar, ko, el, he, ja, ru
 * relevant poor-performing queries: 36%
 * f0.5: 82.3%

Background
See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpora were created.

Dutch Results
About 16.8% of the original 10K corpus was removed in the initial filtering. A 1200-query random sample was taken, and 57.1% of those queries were discarded, leaving a 515-query corpus. Thus only about 35.7% of low-performing queries are in an identifiable language.
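The 35.7% figure is the product of the two retention rates. A quick check of that arithmetic, using only the numbers quoted above (the variable names are illustrative, not from the original analysis):

```python
# Share of the original corpus that survived the initial filtering.
kept_after_filtering = 1 - 0.168

# Share of the 1,200-query random sample that was kept (515 queries).
kept_in_sample = 515 / 1200

# Estimated share of all poor-performing queries in an identifiable language.
identifiable = kept_after_filtering * kept_in_sample
print(f"{identifiable:.1%}")  # 35.7%
```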

Other languages searched on nlwiki
Based on the sample of 515 poor-performing queries on nlwiki that are in some language, about 63% are in Dutch, 25% in English, 2-3% each in French and German, and less than 2% each in a handful of other languages.

Below are the results for nlwiki, with raw counts, percentage, and 95% margin of error. In order, those are Dutch, English, French, German, Spanish, Italian, Latin, Chinese, Turkish, Polish, Finnish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Afrikaans.
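As a rough illustration of how a percentage and a 95% margin of error can be derived from raw counts, here is a minimal sketch. It assumes the usual normal approximation for a proportion; the report does not say which method was actually used, and the function name is hypothetical. The counts are the per-language totals for the 515-query sample given in the performance tables later in this report.

```python
import math

def margin_of_error(count, n, z=1.96):
    """95% margin of error for a proportion (normal approximation, an assumption)."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

sample_size = 515
# Per-language totals for the four most common languages in the sample.
counts = {"Dutch": 326, "English": 128, "French": 16, "German": 11}

for lang, count in counts.items():
    share = count / sample_size
    moe = margin_of_error(count, sample_size)
    print(f"{lang:8} {count:4d} {share:6.1%} ±{moe:.1%}")
```

The normal approximation is least reliable for the languages with only a handful of queries, so margins for those rows should be read with extra caution.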

We don’t have query-trained language models for all of the languages represented here, such as Afrikaans, Danish, Finnish, Croatian, Latin, and Burmese (af, da, fi, hr, la, my). Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,323 queries that remained after the initial filtering, and focusing on queries in other writing systems, there are also a small number of Russian, Hebrew, Greek, and Japanese queries, plus some in Amharic (for which we do not have a model).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:

                 3000    3500    4000    4500    5000    6000    9000
 TOTAL          74.5%   74.5%   75.5%   76.1%   77.0%   78.3%   78.7%
 Dutch          88.2%   88.6%   89.2%   88.8%   89.3%   89.9%   89.9%
 English        71.3%   69.8%   71.0%   73.6%   75.1%   75.9%   78.2%
 French         57.1%   58.3%   59.6%   66.7%   64.0%   71.1%   71.1%
 German         32.1%   32.7%   34.0%   32.7%   34.6%   39.2%   37.7%
 Spanish        46.2%   44.4%   44.4%   46.2%   46.2%   46.2%   50.0%
 Italian        20.7%   21.4%   14.8%   14.8%   15.4%   13.8%   13.3%
 Latin           0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Chinese        80.0%   80.0%   80.0%   80.0%   80.0%  100.0%  100.0%
 Arabic        100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
 Finnish         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Polish         50.0%   50.0%   66.7%   66.7%   66.7%   66.7%   66.7%
 Turkish        57.1%   50.0%   50.0%   40.0%   44.4%   44.4%   44.4%
 Afrikaans       0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Burmese         0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Croatian        0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Czech          22.2%   25.0%   28.6%   40.0%   50.0%   50.0%   50.0%
 Danish          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
 Korean        100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
 Portuguese     25.0%   20.0%   22.2%   25.0%   22.2%   28.6%   28.6%
 Vietnamese     40.0%   40.0%   40.0%   40.0%   50.0%   66.7%   66.7%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

                 f0.5      f1      f2  recall    prec   total    hits  misses
 TOTAL          74.6%   74.5%   74.4%   74.4%   74.7%     515     383     130
 Dutch          93.7%   88.2%   83.3%   80.4%   97.8%     326     262       6
 English        80.2%   71.3%   64.2%   60.2%   87.5%     128      77      11
 French         47.3%   57.1%   72.2%   87.5%   42.4%      16      14      19
 German         23.6%   32.1%   50.6%   81.8%   20.0%      11       9      36
 Spanish        34.9%   46.2%   68.2%  100.0%   30.0%       6       6      14
 Italian        14.9%   20.7%   34.1%   60.0%   12.5%       5       3      21
 Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
 Chinese        90.9%   80.0%   71.4%   66.7%  100.0%       3       2       0
 Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
 Finnish         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
 Polish         38.5%   50.0%   71.4%  100.0%   33.3%       2       2       4
 Turkish        45.5%   57.1%   76.9%  100.0%   40.0%       2       2       3
 Afrikaans       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Burmese         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Croatian        0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Czech          15.2%   22.2%   41.7%  100.0%   12.5%       1       1       7
 Danish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Korean        100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
 Portuguese     17.2%   25.0%   45.5%  100.0%   14.3%       1       1       6
 Vietnamese     29.4%   40.0%   62.5%  100.0%   25.0%       1       1       3
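The percentage columns can be recomputed from the three count columns. The sketch below assumes that "misses" counts queries wrongly assigned to a language (false positives). That reading is inferred from the numbers rather than stated in the report, but it reproduces the reported precision. The function names are illustrative.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def row_scores(total, hits, misses):
    """Scores for one table row; 'misses' is assumed to mean false positives."""
    recall = hits / total if total else 0.0
    assigned = hits + misses  # every query labeled with this language
    precision = hits / assigned if assigned else 0.0
    return {
        "f0.5": f_beta(precision, recall, 0.5),
        "f1": f_beta(precision, recall, 1.0),
        "f2": f_beta(precision, recall, 2.0),
        "recall": recall,
        "prec": precision,
    }

# Dutch row of the 3K table: total=326, hits=262, misses=6.
dutch = row_scores(total=326, hits=262, misses=6)
print({name: f"{value:.1%}" for name, value in dutch.items()})
```

For the Dutch row this yields 93.7%, 88.2%, 83.3%, 80.4%, and 97.8%, matching the table. F0.5 weights precision more heavily than recall, so it penalizes models that generate many false positives.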

French, German, Spanish, and Italian all do very poorly, with too many false positives (precision ranges from 12.5% to 42.4%). (When Spanish and Italian are disabled, French does even worse.) Polish, Turkish, Czech, Portuguese, and Vietnamese aren’t terrible in terms of raw false positives, but aren’t great, either.

As noted above, Greek, Hebrew, Japanese, and Russian are present in the larger sample, and since our models for these languages are highly accurate, I’ve included them.

The final language set is Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian. As above, 3K is not the optimal model size, but it is within 1.5 percentage points of the best. The 3K results are shown below along with the best-performing model sizes:

                 3000    3500    6000    7000    9000
 TOTAL          82.3%   82.5%   82.9%   83.3%   83.7%
 Dutch          92.1%   92.4%   92.8%   92.9%   92.7%
 English        76.0%   76.2%   76.5%   77.6%   79.1%
 French          0.0%    0.0%    0.0%    0.0%    0.0%
 German          0.0%    0.0%    0.0%    0.0%    0.0%
 Spanish         0.0%    0.0%    0.0%    0.0%    0.0%
 Italian         0.0%    0.0%    0.0%    0.0%    0.0%
 Latin           0.0%    0.0%    0.0%    0.0%    0.0%
 Chinese       100.0%  100.0%  100.0%  100.0%  100.0%
 Arabic         80.0%   80.0%   80.0%   80.0%   80.0%
 Finnish         0.0%    0.0%    0.0%    0.0%    0.0%
 Polish          0.0%    0.0%    0.0%    0.0%    0.0%
 Turkish         0.0%    0.0%    0.0%    0.0%    0.0%
 Afrikaans       0.0%    0.0%    0.0%    0.0%    0.0%
 Burmese         0.0%    0.0%    0.0%    0.0%    0.0%
 Croatian        0.0%    0.0%    0.0%    0.0%    0.0%
 Czech           0.0%    0.0%    0.0%    0.0%    0.0%
 Danish          0.0%    0.0%    0.0%    0.0%    0.0%
 Korean        100.0%  100.0%  100.0%  100.0%  100.0%
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%
 Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are <2%, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
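To make the size of that gap concrete, the snippet below compares the TOTAL scores reported for the final language set at each model size. The dictionary simply restates those reported figures; the variable names are illustrative.

```python
# TOTAL scores (in percent) for the final language set, keyed by model size.
total_score = {3000: 82.3, 3500: 82.5, 6000: 82.9, 7000: 83.3, 9000: 83.7}

best_size = max(total_score, key=total_score.get)
gap = total_score[best_size] - total_score[3000]
print(f"best size: {best_size}, gap to 3K: {gap:.1f} points")
# best size: 9000, gap to 3K: 1.4 points
```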

The detailed report for the 3K model is here:

                 f0.5      f1      f2  recall    prec   total    hits  misses
 TOTAL          82.3%   82.3%   82.3%   82.3%   82.3%     515     424      91
 Dutch          92.4%   92.1%   91.9%   91.7%   92.6%     326     299      24
 English        68.5%   76.0%   85.4%   93.0%   64.3%     128     119      66
 French          0.0%    0.0%    0.0%    0.0%    0.0%      16       0       0
 German          0.0%    0.0%    0.0%    0.0%    0.0%      11       0       0
 Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       6       0       0
 Italian         0.0%    0.0%    0.0%    0.0%    0.0%       5       0       0
 Latin           0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
 Chinese       100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
 Arabic         71.4%   80.0%   90.9%  100.0%   66.7%       2       2       1
 Finnish         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
 Polish          0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
 Turkish         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
 Afrikaans       0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Burmese         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Croatian        0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Czech           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Danish          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Korean        100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
 Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

Recall went up and precision went down for Dutch and English, but overall performance improved. Queries in unrepresented languages were almost all identified as either Dutch or English (decreasing precision for both), but the now-unused models are no longer generating lots of false positives and bringing down precision overall. (The one query in Burmese was identified as Arabic, probably because it received the same score, the maximum “unknown” score, for every language, and Arabic is alphabetically first among the contenders.)
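The suspected tie-break for the Burmese query can be illustrated with a toy version of rank-order n-gram scoring. This is a simplified sketch, not the actual TextCat implementation. The toy training strings, the per-n-gram penalty value, the lack of word-boundary padding, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter

MAX_PENALTY = 3000  # assumed cost for an n-gram the model has never seen

def ngrams(text, max_n=5):
    """All character n-grams of length 1..max_n (no boundary padding, for simplicity)."""
    return [text[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]

def profile(text, size=MAX_PENALTY):
    """Map each of the `size` most frequent n-grams to its frequency rank."""
    counts = Counter(ngrams(text))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def distance(query_profile, model):
    """Sum of rank differences; unknown n-grams get the maximum penalty."""
    return sum(abs(rank - model[gram]) if gram in model else MAX_PENALTY
               for gram, rank in query_profile.items())

def identify(query, models):
    """Lowest distance wins; ties go to the alphabetically first language code."""
    query_profile = profile(query)
    scores = {lang: distance(query_profile, m) for lang, m in models.items()}
    return min(sorted(scores), key=scores.get)

# Toy models built from single phrases (hypothetical training data).
models = {
    "nl": profile("de kat zit op de mat"),
    "en": profile("the cat sat on the mat"),
    "ar": profile("القط على الحصيرة"),
}

print(identify("မြန်မာ", models))  # ar
```

None of the Burmese n-grams appear in any model, so every language receives the same maximum score and the alphabetically first code, ar, wins the tie.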

nlwiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for nlwiki, based on these experiments, would be to use models for Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian (nl, en, zh, ar, ko, el, he, ja, ru), using the default 3000-ngram models.