User:TJones (WMF)/Notes/TextCat Optimization for ptwiki ruwiki and jawiki

July 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T138315)

Summary of Results
Using the default 3K models, the best options for each wiki are presented below:

ptwiki
 * languages: Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, Greek
 * lang codes: pt, en, ru, he, ar, zh, ko, el
 * relevant poor-performing queries: 46%
 * f0.5: 96.9%

ruwiki
 * languages: Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, Chinese
 * lang codes: ru, en, uk, ka, hy, ja, ar, he, zh
 * relevant poor-performing queries: 30.5%
 * f0.5: 92.4%

Background
See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpora were created.

Portuguese Results
About 12% of the original 10K corpus was removed in the initial filtering. A 1000-query random sample was taken, and 48% of those queries were discarded, leaving a 524-query corpus. Thus only about 46% of low-performing queries are in an identifiable language.
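The 46% figure is the product of the two retention rates: the share of queries that survived the initial filtering, times the share of the sample that was kept. A minimal sketch of that arithmetic follows; the function name is illustrative, and the ruwiki inputs are taken from the Russian Results section of this page.

```python
def identifiable_share(removed_by_filter: float, sample_size: int, kept: int) -> float:
    """Estimate the share of the original corpus that is in an identifiable language."""
    survived_filter = 1.0 - removed_by_filter   # fraction left after initial filtering
    kept_in_sample = kept / sample_size         # fraction of the sample not discarded
    return survived_filter * kept_in_sample


ptwiki = identifiable_share(0.12, 1000, 524)
ruwiki = identifiable_share(0.107, 1500, 512)
print(f"ptwiki: {ptwiki:.1%}")
print(f"ruwiki: {ruwiki:.1%}")
```

This assumes the random sample is representative of the filtered corpus, so the sample's keep rate can be applied to all remaining queries.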

Other languages searched on ptwiki
Based on the sample of 524 poor-performing queries on ptwiki that are in some language, about 94% are in Portuguese (490 queries), just under 5% in English (25 queries), and fewer than 1% each are in a handful of other languages.

Below are the results for ptwiki, with raw counts, percentage, and 95% margin of error. In order, those are Portuguese, English, Spanish, Tagalog, Russian, Dutch, Latin, and French.
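The write-up does not say which method was used for the 95% margin of error. As one plausible way to recompute it, the sketch below uses the Wilson score interval, which behaves better than the plain normal approximation for the very small counts seen here. The choice of interval is an assumption, and the counts are the per-language totals from the 3K detail report later in this section.

```python
import math


def wilson_interval(count: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = count / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Per-language totals in the 524-query ptwiki sample.
counts = {"Portuguese": 490, "English": 25, "Spanish": 4, "Tagalog": 1}
for language, count in counts.items():
    low, high = wilson_interval(count, 524)
    print(f"{language:<12}{count:>5}{count / 524:>8.1%}   [{low:.1%}, {high:.1%}]")
```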

We don’t have query-trained language models for two of the languages represented here, Tagalog and Latin. Since each represents a very small slice of our corpus (1 query each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,797 queries remaining after the initial filtering, and focusing on queries in other writing systems, there are also a small number of Hebrew, Arabic, Chinese, Korean, and Greek queries, as well as Burmese queries (for which we do not have a model).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and for some of the model sizes that did better, are shown below:

model size      3000    5000    6000    9000   10000
TOTAL          86.8%   87.4%   88.0%   88.2%   88.7%
Portuguese     93.2%   93.6%   93.9%   94.2%   94.5%
English        78.4%   80.0%   81.6%   76.6%   76.6%
Spanish        13.1%   13.6%   13.8%   14.3%   15.1%
Dutch          28.6%   25.0%   25.0%   28.6%   33.3%
French         28.6%   33.3%   40.0%   33.3%   28.6%
Latin           0.0%    0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%  100.0%
Tagalog         0.0%    0.0%    0.0%    0.0%    0.0%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          86.8%   86.8%   86.8%   86.8%   86.8%     524     455      69
Portuguese     97.2%   93.2%   89.6%   87.3%  100.0%     490     428       0
English        77.5%   78.4%   79.4%   80.0%   76.9%      25      20       6
Spanish         8.6%   13.1%   27.4%  100.0%    7.0%       4       4      53
Dutch          20.0%   28.6%   50.0%  100.0%   16.7%       1       1       5
French         20.0%   28.6%   50.0%  100.0%   16.7%       1       1       5
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Tagalog         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
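The percentage columns in the detail report can be recomputed from the three count columns. The sketch below reads "misses" as the number of queries wrongly assigned to a language (false positives). That reading is an inference that happens to reproduce the reported numbers, not something the report states, and the function name is illustrative.

```python
def scores(total: int, hits: int, misses: int, betas=(0.5, 1.0, 2.0)) -> dict:
    """Recall, precision, and F-beta scores from one row of counts.

    total  = queries actually in the language
    hits   = of those, the queries identified correctly
    misses = queries from other languages assigned to this one (assumed meaning)
    """
    recall = hits / total if total else 0.0
    predicted = hits + misses
    precision = hits / predicted if predicted else 0.0
    result = {"recall": recall, "prec": precision}
    for beta in betas:
        b2 = beta * beta
        denom = b2 * precision + recall
        # F-beta weights recall beta times as heavily as precision.
        result[f"f{beta:g}"] = (1 + b2) * precision * recall / denom if denom else 0.0
    return result


spanish = scores(total=4, hits=4, misses=53)
print({name: f"{value:.1%}" for name, value in spanish.items()})
```

Run on the Spanish row of the all-models report (4 total, 4 hits, 53 misses), this gives 100.0% recall, 7.0% precision, and f0.5, f1, and f2 of 8.6%, 13.1%, and 27.4%, matching the table. Because f0.5 weights precision more heavily than recall, a model with many false positives is penalized hard.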

Spanish does very poorly, with way too many false positives. Dutch and French aren’t terrible in terms of raw false positives, but aren’t great, either.

As noted above, Hebrew, Arabic, Chinese, Korean, and Greek are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.

The final language set is Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, and Greek. With these languages, 3K is the optimal model size. The 3K results are shown below along with other top-performing model sizes:

model size      2500    3000    9000   10000
TOTAL          96.9%   96.9%   96.9%   96.9%
Portuguese     98.9%   98.8%   98.7%   98.7%
English        79.4%   80.6%   82.0%   81.4%
Spanish         0.0%    0.0%    0.0%    0.0%
Dutch           0.0%    0.0%    0.0%    0.0%
French          0.0%    0.0%    0.0%    0.0%
Latin           0.0%    0.0%    0.0%    0.0%
Russian       100.0%  100.0%  100.0%  100.0%
Tagalog         0.0%    0.0%    0.0%    0.0%

The detailed report for the 3K model is here:

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          96.9%   96.9%   96.9%   96.9%   96.9%     524     508      16
Portuguese     99.0%   98.8%   98.5%   98.4%   99.2%     490     482       4
English        72.3%   80.6%   91.2%  100.0%   67.6%      25      25      12
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Dutch           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Latin           0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Russian       100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Tagalog         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

Recall went up and precision went down for Portuguese and English, but overall performance improved. Queries in unrepresented languages were all identified as English, except for Spanish queries, which were identified as Portuguese (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

ptwiki: Best Options
The optimal settings for ptwiki, based on these experiments, would be to use models for Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, Greek (pt, en, ru, he, ar, zh, ko, el), using the default 3000-ngram models.

Russian Results
About 10.7% of the original 10K corpus was removed in the initial filtering. A 1500-query random sample was taken, and 65.8% of those queries were discarded, leaving a 512-query corpus. Thus only about 30.5% of low-performing queries are in an identifiable language.

Other languages searched on ruwiki
Based on the sample of 512 poor-performing queries on ruwiki that are in some language, about 77% are in Russian, >10% in English, <5% in Ukrainian, and fewer than 1% each are in a handful of other languages.

Below are the results for ruwiki, with raw counts, percentage, and 95% margin of error. In order, those are Russian, English, Ukrainian, Kazakh, German, Georgian, Uzbek, Kirghiz, Armenian, Romanian, Latvian, Japanese, Italian, French, Finnish, Spanish, Azerbaijani, and Arabic.

We don’t have query-trained language models for all of the languages represented here, such as Azerbaijani, Finnish, Kazakh, Kirghiz, Latvian, Romanian, and Uzbek. Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,931 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Hebrew and Chinese queries.

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and for some of the model sizes that did better, are shown below:

model size      3000    4500    5000    7000
TOTAL          88.5%   89.5%   90.0%   91.2%
Russian        96.7%   96.8%   97.2%   97.6%
English        76.4%   80.0%   78.9%   82.1%
Ukrainian      67.7%   68.9%   72.4%   78.0%
German         40.0%   53.3%   50.0%   53.3%
Kazakh          0.0%    0.0%    0.0%    0.0%
Georgian      100.0%  100.0%  100.0%  100.0%
Armenian      100.0%  100.0%  100.0%  100.0%
Kirghiz         0.0%    0.0%    0.0%    0.0%
Uzbek           0.0%    0.0%    0.0%    0.0%
Arabic        100.0%  100.0%  100.0%  100.0%
Azerbaijani     0.0%    0.0%    0.0%    0.0%
Finnish         0.0%    0.0%    0.0%    0.0%
French          0.0%    0.0%   33.3%   40.0%
Italian        20.0%   18.2%   20.0%   22.2%
Japanese      100.0%  100.0%  100.0%  100.0%
Latvian         0.0%    0.0%    0.0%    0.0%
Romanian        0.0%    0.0%    0.0%    0.0%
Spanish         0.0%    0.0%    0.0%    0.0%

Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          88.5%   88.5%   88.5%   88.5%   88.5%     512     453      59
Russian        97.1%   96.7%   96.2%   95.9%   97.4%     394     378      10
English        87.9%   76.4%   67.5%   62.7%   97.7%      67      42       1
Ukrainian      60.7%   67.7%   76.6%   84.0%   56.8%      25      21      16
German         29.4%   40.0%   62.5%  100.0%   25.0%       4       4      12
Kazakh          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Georgian      100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Kirghiz         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Uzbek           0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Azerbaijani     0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Finnish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       5
Italian        13.5%   20.0%   38.5%  100.0%   11.1%       1       1       8
Japanese      100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Latvian         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       7

French, Spanish, Italian, and German all do very poorly, with too many false positives.

As noted above, Hebrew and Chinese are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.

The final language set is Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, and Chinese. Here 3K is not the optimal model size, but its overall score is within 1.5 percentage points of the best result shown (93.8% at 7000). The 3K results are shown below along with the best-performing model sizes:

model size      3000    4500    5000    7000
TOTAL          92.4%   92.6%   93.2%   93.8%
Russian        96.7%   96.8%   97.2%   97.6%
English        91.2%   91.2%   91.2%   91.2%
Ukrainian      67.7%   68.9%   72.4%   78.0%
German          0.0%    0.0%    0.0%    0.0%
Kazakh          0.0%    0.0%    0.0%    0.0%
Georgian      100.0%  100.0%  100.0%  100.0%
Armenian      100.0%  100.0%  100.0%  100.0%
Kirghiz         0.0%    0.0%    0.0%    0.0%
Uzbek           0.0%    0.0%    0.0%    0.0%
Arabic        100.0%  100.0%  100.0%  100.0%
Azerbaijani     0.0%    0.0%    0.0%    0.0%
Finnish         0.0%    0.0%    0.0%    0.0%
French          0.0%    0.0%    0.0%    0.0%
Italian         0.0%    0.0%    0.0%    0.0%
Japanese      100.0%  100.0%  100.0%  100.0%
Latvian         0.0%    0.0%    0.0%    0.0%
Romanian        0.0%    0.0%    0.0%    0.0%
Spanish         0.0%    0.0%    0.0%    0.0%

The accuracy is very high, and the differences are reasonably small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
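The cost of staying with the default can be read directly off the TOTAL row of the final-language-set comparison. A small sketch, with the values copied from that row and the variable names chosen for illustration:

```python
# TOTAL scores (percent) for the final ruwiki language set, by model size.
totals = {3000: 92.4, 4500: 92.6, 5000: 93.2, 7000: 93.8}
default_size = 3000

best_size = max(totals, key=totals.get)
gap = round(totals[best_size] - totals[default_size], 1)
print(f"best size: {best_size}, gap to {default_size}: {gap} points")
```

The same comparison could be rerun whenever the models or corpora change, to check whether the gap has grown enough to justify a non-default size.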

The detailed report for the 3K model is here:

language        f0.5      f1      f2  recall    prec   total    hits  misses
TOTAL          92.4%   92.4%   92.4%   92.4%   92.4%     512     473      39
Russian        97.1%   96.7%   96.2%   95.9%   97.4%     394     378      10
English        86.6%   91.2%   96.3%  100.0%   83.8%      67      67      13
Ukrainian      60.7%   67.7%   76.6%   84.0%   56.8%      25      21      16
German          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Kazakh          0.0%    0.0%    0.0%    0.0%    0.0%       4       0       0
Georgian      100.0%  100.0%  100.0%  100.0%  100.0%       3       3       0
Armenian      100.0%  100.0%  100.0%  100.0%  100.0%       2       2       0
Kirghiz         0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Uzbek           0.0%    0.0%    0.0%    0.0%    0.0%       2       0       0
Arabic        100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Azerbaijani     0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Finnish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
French          0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Italian         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Japanese      100.0%  100.0%  100.0%  100.0%  100.0%       1       1       0
Latvian         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Romanian        0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0
Spanish         0.0%    0.0%    0.0%    0.0%    0.0%       1       0       0

Recall went way up and precision went down for English, but overall performance improved. Queries in unrepresented languages were all identified as English (decreasing precision), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

ruwiki: Best Options
The slightly sub-optimal settings (though consistent with others using 3K models) for ruwiki, based on these experiments, would be to use models for Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, Chinese (ru, en, uk, ka, hy, ja, ar, he, zh), using the default 3000-ngram models.

Notes on Latin Russian, Cyrillic English, etc.
Since I recently did some work on typing on the wrong keyboard in Russian and English, I enabled the models for Latin Russian and Cyrillic English for the first 1000 random samples I looked at. I did not include the additional filters mentioned in my previous write up, since I only use the models at that stage to roughly group queries for manual review.

Of the 21 (2.1%) identified as Cyrillic English (i.e., English typed on a Russian or other Cyrillic keyboard),
 * 6 were Cyrillic English (including 2 very short acronyms)
 * 1 was mixed (Cyrillic/Latin), but it converted to something plausible
 * 8 were Russian/Cyrillic (including names, acronyms, typos)
 * 3 more were very short (2-3 letters)
 * 3 were junk

Of the 16 (1.6%) identified as Latin Russian (i.e., Russian typed on an American English or other Latin keyboard),
 * 13 were Latin Russian/Cyrillic (including names)
 * 1 was a name in Cyrillic
 * 2 were apparent junk (1 of which was also mixed Cyrillic/Latin)

In passing, while working on the queries, I also noticed:
 * several Russian queries transliterated into Latin, sometimes identified as Polish, sometimes mixed with English
 * a few Latin queries (including names) transliterated into Russian
 * at least one each of Georgian and Armenian transliterated into Latin
 * a couple of cases of Devanagari transliterated into Cyrillic

Sounds like there is a decent-sized chunk of queries to improve by identifying and transliterating queries. Phonetic keyboards or transliterated queries will be harder, since they at least look like language even in the wrong character set (i.e., there are enough vowels in reasonable places).
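To make the wrong-keyboard case concrete, here is a minimal sketch of the key-for-key conversion between a US QWERTY layout and the standard Russian ЙЦУКЕН layout. The function names are illustrative, only lowercase letter keys and a few punctuation keys are covered, and this is not necessarily the mapping used in the earlier wrong-keyboard work.

```python
# Keys in the same physical positions on US QWERTY and Russian ЙЦУКЕН layouts.
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"

LATIN_TO_CYRILLIC = str.maketrans(QWERTY, JCUKEN)
CYRILLIC_TO_LATIN = str.maketrans(JCUKEN, QWERTY)


def latin_to_cyrillic(text: str) -> str:
    """Recover Russian typed with the keyboard left in the Latin layout."""
    return text.lower().translate(LATIN_TO_CYRILLIC)


def cyrillic_to_latin(text: str) -> str:
    """Recover English typed with the keyboard left in the Cyrillic layout."""
    return text.lower().translate(CYRILLIC_TO_LATIN)


print(latin_to_cyrillic("ghbdtn"))  # Latin Russian
print(cyrillic_to_latin("цшлш"))    # Cyrillic English
```

Here "ghbdtn" converts to "привет" and "цшлш" converts to "wiki". A one-to-one key mapping like this handles only the wrong-layout case; phonetic keyboards and transliterated queries have no fixed key correspondence and would need a different approach.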