User:TJones (WMF)/Notes/TextCat Optimization for frwiki eswiki itwiki and dewiki

April 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T132466)

Background
I’ve previously looked at optimizing the set of languages to use with TextCat for language detection on enwiki, and we have an A/B test in the works.

The next step is to do a similar analysis for other big wikis, based on query volume. The next four wikis are Italian, German, Spanish, and French Wikipedias. Due to technical difficulties and personal preferences, I will be looking at French, Spanish, Italian, and then German.

I’ve also done some preliminary work—corpus creation and initial filtering—on the next four candidate wikis: Russian, Japanese, Portuguese, and Indonesian. This was useful for defining and streamlining the process, especially with non-Latin Russian and Japanese.

Query Corpus Creation
The first step for each wiki is to extract a recent corpus of relevant queries, and then do some initial filtering to remove some of the dreck and other undesirable queries in the corpora.

Random sampling
Select 10K random queries from a recent one-week period (generally in March 2016) that meet the following criteria:
 * Query came from the search box on .wikipedia.org
 * No more than one query from any given IP for any given day
 * No more than 30 queries per day from that IP
 * Only the _content index was searched (except for wikis that search multiple indexes by default)
 * Query had < 3 results

The Hive query for frwiki is available as an example. On one occasion, the Hive query got stuck in the reduce stage. The only work-around I found was to use a different week-long time period from which to extract the queries.
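As a rough illustration of the selection criteria (not the Hive query itself), the per-IP limits and the result-count cutoff could be sketched in Python as below. The record fields (`ip`, `day`, `query`, `hits`) and the function name are hypothetical, the search-box and index criteria are assumed to have been applied already, and the sketch assumes that an IP exceeding the 30-per-day cap is excluded for that day rather than truncated.

```python
import random
from collections import defaultdict


def sample_queries(records, sample_size=10_000, max_per_ip_day=30,
                   max_hits=3, seed=0):
    """Pick a random sample of poorly performing queries.

    `records` is an iterable of dicts with the hypothetical keys
    'ip', 'day', 'query', and 'hits' (number of results returned).
    """
    rng = random.Random(seed)

    # Group every query by (IP, day) so daily volume per IP can be checked.
    by_ip_day = defaultdict(list)
    for rec in records:
        by_ip_day[(rec["ip"], rec["day"])].append(rec)

    candidates = []
    for group in by_ip_day.values():
        # Assumption: IPs over the daily cap (likely bots) are dropped entirely.
        if len(group) > max_per_ip_day:
            continue
        # Keep only queries with fewer than `max_hits` results.
        poor = [rec for rec in group if rec["hits"] < max_hits]
        if poor:
            # At most one query per IP per day.
            candidates.append(rng.choice(poor))

    rng.shuffle(candidates)
    return [rec["query"] for rec in candidates[:sample_size]]
```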

Initial filtering
There is a lot of junk in our query logs, and some of it is fairly easy to identify with reasonable accuracy. The process is as follows:
 * Extract queries that have the same 1-to-10 character sequence at least three times in a row and manually review them. These are mostly junk, but there are some good ones. Put the good ones back.
 * Extract queries that are nothing but consonants and spaces. There are more than you would think, and it works on non-Latin languages/wikis (like ru and ja), too! These are all junk, but they are reviewed anyway.
 * Extract queries that have four Latin characters in a row that are not in [aeiouhy]. Again, there are more than you would think, and most are junk. This works in non-Latin languages, too. It works less well in German, but still found a lot of junk.
 * Review remaining queries. Sort and review:
   * Remove most queries with www, http, @, .com, .org, .net, .mobi, .biz, .xxx, .co.uk, and common TLDs for the language under review.
   * Remove queries that aren’t words: mostly numbers, things that look like serial numbers, ID numbers, phone numbers, addresses, etc.
   * Remove queries that are mostly or completely emoji.
   * Note any obvious “other” languages in use in the sample. Different scripts are really obvious because they are grouped when sorted. Other languages using the same script are hit or miss.
   * Incidentally remove any unwanted queries as they go by: proper names, chemical names, any other obvious gibberish not caught by the gibberish filters above.
 * Sort, uniq, and randomly order remaining queries.

This cuts down the query pool by 5-25% (median over the 8 languages so far: about 11%).

Language annotation and further filtering
Once we’ve removed the really obvious junk, it’s time to manually review queries to create a corpus.
 * Take the first 1000 queries from the filtered and randomized sample.
 * Run current language identification on it, using the language of the wiki, English (which is everywhere), and any other languages noted during initial filtering. This is far from perfect, but when it works decently, it’s helpful and reduces context switching. For example, most of the non-junk queries identified from frwiki as French are in fact French.
 * Skim language ID results and see if anything is obviously terrible (e.g., most of the “German” queries are obviously French) or obviously missing (oops, there’s a query in Armenian) and run again if necessary.
 * Review and manually tag the queries, removing queries that are proper names (people, places, language names, companies, products, fictional characters, etc., etc.), acronyms, more gibberish (there’s always more gibberish), scientific terminology and other words that are extremely ambiguous and not specific to any one language, and anything that’s unidentifiable.
 * Queries with typos are often left in, even though they make automatic identification hard(er).
 * Longer queries that include a few “undesirable” words are kept. (e.g., “Le declin du système éducatif haïtien. Quelles en sont les causes fondamentales?” would be kept since it is mostly French, but “Haïti” would not because it’s a name.)
 * Proper names that are made up of common nouns are kept. (e.g., names of movies, like “Seeking a Friend for the End of the World”, are often phrases made up of normal words, and are kept. Similarly country names made up of normal words are kept: "Costa Rica", "Puerto Rico", "Côte d'Ivoire", etc.)

For French, this cut the query pool by about two thirds, leaving only one third (three hundred and something) of the queries. So the process was repeated on the next 1000 queries from the filtered and randomized sample. For Spanish, just less than half of queries were eliminated, so I stopped with 520. The goal is > 500 queries.

For frwiki, the result is a corpus of 682 queries (from 2000 reviewed, after ~15% were previously removed).

Thus, for French, only ~30% of the queries that meet the criteria for possible language detection (< 3 results) are actually in an identifiable language. In production, much of the other 70% would also often be labelled as being in a particular language, but those results (on names, acronyms, gibberish, etc.) are unpredictable and any results from another wiki may or may not be helpful. Hence the need for A/B testing after this analysis is done.

Corpus size
While 682 (the size of the French query corpus) is not a huge sample, it’s enough to get a sense of what languages are commonly present among these poor-performing queries, and to optimize the choice of which languages to detect. The 95% margin of error on a proportion is largest when the proportion is 50%. For a sample of size 500, that maximum is 4.38%. For a smaller proportion, the margin of error is smaller in absolute terms, but larger relative to the proportion (e.g., 0.87% for a proportion of 1% out of 500).
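The figures above are consistent with the usual normal-approximation formula for the margin of error on a proportion, z·sqrt(p(1−p)/n) with z = 1.96 for 95% confidence. A minimal Python sketch follows; the function name is illustrative, and the choice of the normal approximation is an assumption, since the calculator actually used is not specified here.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p observed
    in a sample of size n (z = 1.96 corresponds to 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case (p = 50%) for a sample of 500: about 4.38%
print(f"{margin_of_error(0.50, 500):.2%}")
# A proportion of 1% out of 500: about 0.87%
print(f"{margin_of_error(0.01, 500):.2%}")
# Worst case for the 682-query French corpus
print(f"{margin_of_error(0.50, 682):.2%}")
```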

Overall, though, that’s good enough for us to say things like, “Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.”—which is enough for us to optimize the languages to be used for language detection for accuracy and run-time performance.

French Results
About 15% of the original 10K corpus was removed in the initial filtering. A 2,000-query random sample was taken, and about 66% of those queries were discarded, leaving a 682-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.
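The 29% figure combines the two reduction steps. A small Python sketch of the arithmetic, using the counts reported in this document (10K sampled, 8,517 left after initial filtering, 2,000 reviewed, 682 kept) and assuming the reviewed sample is representative of the filtered pool; the function name is illustrative:

```python
def identifiable_fraction(sampled, after_filtering, reviewed, kept):
    """Estimated share of the original sample that is in an identifiable language."""
    survived_filtering = after_filtering / sampled   # share left after junk removal
    survived_review = kept / reviewed                # share kept after manual tagging
    return survived_filtering * survived_review


fr = identifiable_fraction(10_000, 8_517, 2_000, 682)
print(f"frwiki: {fr:.0%}")  # about 29%
```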

Other languages searched on frwiki
Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, about 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.

Below are the results for frwiki, with raw counts, percentage, and 95% margin of error.

In order, those are French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Corsican, Thai, Swahili, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton.

We don’t have query-trained language models for all of the languages represented here, such as Corsican, Swahili, Breton, Icelandic, Latin, or Hungarian. Since these each represent very small slices of our corpus (1-2 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,517 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Hebrew, and Korean queries.

Analysis and Optimization
Using all of the language models available, the performance report (for the 3000-ngram models* we use in enwiki) is below.

* ''I also ran tests on other model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. 3000 is still the best model size.''

language    f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL       83.0%   83.1%   83.1%   83.1%   83.0%   681    566   116
French      95.5%   91.7%   88.1%   85.9%   98.3%   468    402   7
English     80.6%   75.5%   70.9%   68.2%   84.5%   88     60    11
Arabic      100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese  62.5%   66.7%   71.4%   75.0%   60.0%   12     9     6
German      44.3%   50.0%   57.4%   63.6%   41.2%   11     7     10
Spanish     15.4%   20.8%   32.1%   50.0%   13.2%   10     5     33
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch       21.3%   28.6%   43.5%   66.7%   18.2%   3      2     9
Corsican    0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian     6.1%    9.1%    17.9%   50.0%   5.0%    2      1     19
Polish      29.4%   40.0%   62.5%   100.0%  25.0%   2      2     6
Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton      0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian   0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic   0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish     7.7%    11.8%   25.0%   100.0%  6.2%    1      1     15
Thai        100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

Spanish, Dutch, Italian, Polish, and Swedish do very poorly. They have too few actual instances that they can get correct, which are heavily outweighed by the false positives they do get.
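To make the relationship between the report columns concrete, here is a minimal Python sketch that recomputes the scores from the raw counts. It assumes that "misses" counts false positives (queries wrongly labelled as that language), so that recall = hits / total and precision = hits / (hits + misses); this reading is inferred from the numbers in the report rather than stated by it, and the function name is illustrative.

```python
def scores(total, hits, misses):
    """Recompute recall, precision, and F-scores from report counts.

    Assumption: `misses` is the number of false positives for the language.
    """
    recall = hits / total if total else 0.0
    precision = hits / (hits + misses) if hits + misses else 0.0

    def f_beta(beta):
        denom = beta ** 2 * precision + recall
        return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    return {"f0.5": f_beta(0.5), "f1": f_beta(1), "f2": f_beta(2),
            "recall": recall, "prec": precision}


# (total, hits, misses) taken from three rows of the frwiki report
rows = {"French": (468, 402, 7), "Spanish": (10, 5, 33), "Swedish": (1, 1, 15)}
for lang, (total, hits, misses) in rows.items():
    s = scores(total, hits, misses)
    print(lang, {k: f"{v:.1%}" for k, v in s.items()}, "hits - misses:", hits - misses)
```

On this reading, Spanish contributes 5 correct identifications against 33 false positives, which is why removing its model helps the overall score.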

Portuguese and German are not great, either. I reran the analysis without Portuguese and German, and it was better. I added them each back into the mix separately and in both cases the results were worse.

As noted above, Greek, Hebrew, and Korean are present in the larger sample, and from earlier work on the balanced query sets, our models for these languages are very high accuracy.

So, I dropped Portuguese and German, added Greek, Hebrew, and Korean, and re-ran the performance report with the 3000-ngram models (to check the performance and double-check that Greek, Hebrew, and Korean aren’t causing problems). The results are below:

language    f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL       89.0%   89.1%   89.1%   89.1%   89.0%   681    607   75
French      94.8%   95.1%   95.5%   95.7%   94.5%   468    448   26
English     67.0%   74.9%   84.9%   93.2%   62.6%   88     82    49
Arabic      100.0%  100.0%  100.0%  100.0%  100.0%  66     66    0
Portuguese  0.0%    0.0%    0.0%    0.0%    0.0%    12     0     0
German      0.0%    0.0%    0.0%    0.0%    0.0%    11     0     0
Spanish     0.0%    0.0%    0.0%    0.0%    0.0%    10     0     0
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  5      5     0
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  4      4     0
Dutch       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Corsican    0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Italian     0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Polish      0.0%    0.0%    0.0%    0.0%    0.0%    2      0     0
Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
Breton      0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Hungarian   0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Icelandic   0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swahili     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Swedish     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Thai        100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0

Recall went up and precision went down for French and English, but overall performance improved. Queries in unrepresented languages were all identified as either French or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

frwiki: Best Options
The optimal settings for frwiki, based on these experiments, would be to use models for French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, and Korean (fr, en, ar, ru, zh, hy, th, el, he, ko), using the default 3000-ngram models.

Spanish Results
About 10% of the original 10K corpus was removed in the initial filtering. A 1,000-query random sample was taken, and 48% of those queries were discarded, leaving a 520-query corpus. Thus only about 47% of poor-performing queries are in an identifiable language.

Other languages searched on eswiki
Based on the sample of 520 poor-performing queries on eswiki that are in some language, about 90% are in Spanish, 4-8% are in English, and fewer than 2% each are in a handful of other languages.

Below are the results for eswiki, with raw counts, percentage, and 95% margin of error. In order, those are Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, Guarani*, French, German, Catalan.

*  Mbaé’chepa!

We don’t have query-trained language models for all of the languages represented here, such as Latin, Guarani, and Catalan. Since these each represent very small slices of our corpus (1-3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 9,003 remaining queries after the Initial filtering, focusing on queries in other writing systems, there are also a small number of Arabic and Japanese queries, and one each for Cherokee and Aramaic (for which we do not have models).

Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:

model size  3000    3500    6000    7000    8000    9000    10000
TOTAL       83.2%   84.4%   84.8%   85.0%   85.4%   85.9%   86.3%
Spanish     91.6%   92.2%   92.3%   92.6%   92.8%   93.2%   93.4%
English     73.0%   75.0%   75.8%   73.8%   75.0%   76.2%   76.2%
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
Catalan     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
French      0.0%    0.0%    33.3%   20.0%   22.2%   22.2%   22.2%
German      22.2%   22.2%   25.0%   22.2%   20.0%   20.0%   20.0%
Italian     8.0%    9.1%    9.1%    10.0%   11.8%   11.8%   11.1%
Portuguese  4.7%    4.8%    4.8%    5.3%    5.0%    5.3%    5.7%

Performance details for the 3K models are here (details for larger models are similar in terms of which language models perform the most poorly):

language    f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL       83.2%   83.2%   83.2%   83.2%   83.2%   519    432   87
Spanish     96.3%   91.6%   87.3%   84.7%   99.8%   476    403   1
English     73.7%   73.0%   72.3%   71.9%   74.2%   32     23    8
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French      0.0%    0.0%    0.0%    0.0%    0.0%    1      0     7
German      15.2%   22.2%   41.7%   100.0%  12.5%   1      1     7
Italian     5.2%    8.0%    17.9%   100.0%  4.2%    1      1     23
Portuguese  3.0%    4.7%    10.9%   100.0%  2.4%    1      1     41

Italian, Portuguese, French, and German all do very poorly, with too many false positives.

As noted above, Arabic and Japanese are present in the larger sample, and as our models for these languages are high accuracy, I’ve included them.

The final language set is Spanish, English, Russian, Chinese, Arabic, and Japanese. As above, 3K is not the optimal model size—my current unsupported hypothesis is that 3K isn’t the best here because there are really only two languages in contention. The 3K results are shown below along with the best performing model sizes:

model size  1500    2000    2500    3000    9000    10000
TOTAL       96.5%   96.1%   96.0%   95.8%   96.0%   96.1%
Spanish     98.7%   98.5%   98.4%   98.2%   98.3%   98.4%
English     78.9%   76.3%   75.3%   76.5%   76.9%   77.9%
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
Catalan     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  100.0%
French      0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
German      0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Italian     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Portuguese  0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

However, the accuracy is very high, and the differences are not huge, so it makes sense to stick with the default 3K models for now, but I'll continue to keep an eye out for significant performance improvements with other model sizes when working with other corpora.

The detailed report for the 3K model is here:

language    f0.5    f1      f2      recall  prec    total  hits  misses
TOTAL       95.8%   95.8%   95.8%   95.8%   95.8%   519    497   22
Spanish     98.8%   98.2%   97.6%   97.3%   99.1%   476    463   4
English     68.0%   76.5%   87.6%   96.9%   63.3%   32     31    18
Latin       0.0%    0.0%    0.0%    0.0%    0.0%    3      0     0
Russian     100.0%  100.0%  100.0%  100.0%  100.0%  2      2     0
Catalan     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Chinese     100.0%  100.0%  100.0%  100.0%  100.0%  1      1     0
French      0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
German      0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Italian     0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0
Portuguese  0.0%    0.0%    0.0%    0.0%    0.0%    1      0     0

Recall went up and precision went down for Spanish and English, but overall performance improved. Queries in unrepresented languages were all identified as either Spanish or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.

eswiki: Best Options
The recommended settings for eswiki, based on these experiments, would be to use models for Spanish, English, Russian, Chinese, Arabic, and Japanese (es, en, ru, zh, ar, ja), using the default 3000-ngram models. This is not quite optimal (the 1500-ngram models scored slightly higher), but it is consistent with other wikis using 3K models.

Next Up

 * Italian
 * German

Other thoughts

 * It would be easy to build high-accuracy identifiers for languages that have unique character sets—or at least character sets that are effectively unique in practice. For example, Yiddish can be written with the Hebrew alphabet, but on most wikis, we'd expect most Hebrew characters to actually be Hebrew (and identifying Yiddish as Hebrew is better than identifying it as anything else other than Yiddish). Similarly, Persian and Arabic writing share many characters, but on frwiki, for example, we only see Arabic. Cherokee is rare, but examples have shown up in our samples. Korean, Armenian, Hebrew, Greek, Georgian, Thai, and others could be used on most wikis because they are low risk, if the run-time cost of enabling them is not too high.
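
As a rough illustration of the idea, a script-based identifier could look at the Unicode names of the letters in a query. The Python sketch below is hypothetical and is not part of TextCat: the mapping table, the function name, and the 50% threshold are all assumptions chosen for the example. As discussed above, it would label Yiddish text as Hebrew.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to the language code to report for that script.
SCRIPT_TO_LANG = {
    "HEBREW": "he", "GREEK": "el", "HANGUL": "ko", "ARMENIAN": "hy",
    "THAI": "th", "GEORGIAN": "ka", "CHEROKEE": "chr",
}


def identify_by_script(query, threshold=0.5):
    """Return a language code if most letters belong to one mapped script."""
    letters = [ch for ch in query if ch.isalpha()]
    if not letters:
        return None
    # The first word of the Unicode name identifies the script for these cases,
    # e.g. "HEBREW LETTER ALEF" or "HANGUL SYLLABLE GA".
    counts = Counter(unicodedata.name(ch, "").split(" ")[0] for ch in letters)
    script, count = counts.most_common(1)[0]
    if count / len(letters) >= threshold:
        return SCRIPT_TO_LANG.get(script)  # None for unmapped scripts such as Latin
    return None


for q in ["שלום", "καλημέρα", "안녕하세요", "bonjour"]:
    print(q, "->", identify_by_script(q))
```

A lookup like this only inspects character names, so its run-time cost should be small compared with scoring a query against n-gram models, though that has not been measured here.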