User:TJones (WMF)/Notes/Language Detection with TextCat

December 2015 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T118287)

Background
As previously noted, the default configuration ES Plugin doesn't do a great job of language detection of queries. At Oliver's suggestion, I was looking into a paper by Řehůřek and Kolkus ("Language Identification on the Web: Extending the Dictionary Method", 2009). Řehůřek and Kolkus compare their technique against an n-gram method (TextCat). On small phrases (30 characters or less), I felt the n-gram method they used generally out-performed their method (typically similar or higher precision, though often with lower recall—see Table 2 in the paper).

I was familiar with TextCat, so I looked it up online, and it's available under GNU GPL, so I thought I'd give that a try.

TextCat
TextCat is based on a paper by Cavnar and Trenkle ("N-Gram-Based Text Categorization", 1994). The basic idea is to sort n-grams by frequency, then compare the rank order of the n-grams against the profile for a given language. One point is added for every position the rank orders disagree for each n-gram, and low score wins.
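
As a rough illustration of the rank-order comparison described above, here is a minimal Python sketch (not the Perl code from TextCat itself). The function names, the `\w+` tokenization, the "_" word-boundary padding, and the n-gram lengths of 1 to 5 are assumptions made for the example.

```python
import re
from collections import Counter


def word_ngrams(text, max_n=5):
    """Yield character n-grams of length 1..max_n, with '_' marking word edges (assumed convention)."""
    for word in re.findall(r"\w+", text.lower()):
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def ranked_profile(text, size=400):
    """Return up to `size` n-grams, most frequent first (ties broken alphabetically)."""
    counts = Counter(word_ngrams(text))
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ordered[:size]


def out_of_place(sample_profile, model_profile, unknown_penalty):
    """Sum of rank differences; n-grams missing from the model cost `unknown_penalty`."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    total = 0
    for rank, gram in enumerate(sample_profile):
        if gram in model_rank:
            total += abs(rank - model_rank[gram])
        else:
            total += unknown_penalty
    return total


def identify(text, models, unknown_penalty=400):
    """Return (score, language) pairs, lowest (best) score first."""
    sample = ranked_profile(text)
    return sorted(
        (out_of_place(sample, model, unknown_penalty), lang)
        for lang, model in models.items()
    )
```

Here `models` would be a dict mapping a language code to a profile built with `ranked_profile` from training text for that language; the toy profiles used to exercise this sketch are not meaningful language models.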

Unfortunately, the most current version of the original TextCat by Gertjan van Noord (in Perl) is pretty out of date (there are other implementations in other languages available, but I wanted to stick with the original if possible).

The provided language models for TextCat are non-Unicode, and there are even models for the same language in different encodings (e.g., Arabic iso8859_6 or windows1256). Also, as I discovered later, the language models are all limited to 400 n-grams.

Upgrades to TextCat and the Language Models
Since language detection on short strings is generally difficult and the existing models were non-Unicode, I decided to retrain the models on actual wiki query data. In addition to using Unicode input, models built on query data may have significantly different distributions of n-grams. For example, there may be a significantly different proportion of punctuation, diacritics, question words (leading to different proportions of "wh"s in English or "q"s in Spanish, for example), verbs (affecting counts for conjugation endings in some languages), or different number of inflected forms in general. (I didn't try to empirically verify these ideas independently, but they are the motivation for the use of query data for re-training.)

I modified TextCat in several ways:


 * updated it to handle Unicode characters
 * changed the default maximum number of languages to be 100 (instead of 10) so it always gives a result
 * modified the output to include scores (in case we want to limit based on the score)
 * pre-loaded all language models so that when processing line by line it is many times faster (a known deficiency mentioned in the comments of the original)
 * put in an alphabetic sub-sort after frequency sorting of n-grams (as noted in the comments of the original, not having this is faster, but without it, results are not unique, and can vary from run to run on the same input!!)
 * removed the benchmark timers (after re-shuffling some parts of the code, they weren't in a convenient location anymore, so I just took them out).
 * I did not update most of the very old Perl idioms in the original. The modified version will be available on GitHub.
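
The alphabetic sub-sort mentioned in the list above can be illustrated with a small Python sketch (an analogy, not the actual Perl change). In Python, the tie order of `Counter.most_common()` follows first-seen order, so it plays the role of the "depends on incidental ordering" behavior here; the helper name `rank_ngrams` is made up for the example.

```python
from collections import Counter


def rank_ngrams(counts):
    """Rank n-grams by descending count, breaking ties alphabetically."""
    return sorted(counts, key=lambda gram: (-counts[gram], gram))


# Two inputs with identical counts but a different first-seen order.
first = Counter(["ab", "ba", "ab", "zz", "ba"])
second = Counter(["ba", "zz", "ab", "ba", "ab"])

# Frequency-only ordering: ties fall back to incidental (insertion) order.
freq_only_first = [gram for gram, _ in first.most_common()]
freq_only_second = [gram for gram, _ in second.most_common()]

# Frequency plus alphabetic sub-sort: the ranking is fully determined by the counts.
stable_first = rank_ngrams(first)
stable_second = rank_ngrams(second)
```

With the sub-sort, the same counts always produce the same rank order, which is what makes the scores reproducible from run to run.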

I also changed the way TextCat deals with the number of n-grams in a model and the number of n-grams in the sample. This requires a bit more explanation. The language models that come with TextCat have 400 n-grams (the 400 most frequent for each language), and by default TextCat considers the 400 most frequent n-grams from the sample to be identified. There is an option to use fewer n-grams from the sample (for speed, presumably), but the entire language model would still be used. There is a penalty for an unknown n-gram, which is the same as the number of n-grams used in the sample. Confusing, no?

As an example, if you have language models with 400 n-grams, but you choose to only look at the 20 most frequent n-grams in your sample (a silly thing to do), then any unknown n-gram would be given a penalty of 20 (penalties are based on difference in rank order). In this case, that's crazy, because a known n-gram in 50th place in the language model (i.e., with a penalty of at least 30) counts against a language more than an unknown n-gram (penalty of 20). In practice, I assume 300-500 sample n-grams would be used, and the penalty for an unknown n-gram would be more similar to that of a low frequency n-gram.

This makes sense when dealing with reasonably large texts, where the top most frequent n-grams really do the work of identifying a language, because they are repeated often. In really short samples (like most queries), the final decision may be made more on the basis of which language a string is least dissimilar to, rather than which it is most similar to, simply because it's too short to exhibit characteristic patterns. For example, in English, e is the most common letter, and is roughly 1.4 times as common as t, 1.6 times as common as a, and 1.7 times as common as o. You won't reliably get those proportions in a ten to twenty character string made up of English words.

As a result, it makes sense that very large language models, with thousands of n-grams, could be better at discriminating between languages, especially on very short strings. So we can see that while "ish_" (i.e., "ish" at the end of a word), is not super common in English (n-gram #1,014), it is even less common in Swedish (n-gram #4,100). In a long text, this wouldn't matter, because the preponderance of words ending in e, s, t, or y, or starting with s or t, or containing an, in, on, or er, or the relative proportions of single letters, or some other emergent feature would carry the day. But that's not going to happen when the string you are assessing is just "zebrafish".

I also modified TextCat to limit the size of the language model being used rather than using the whole model available (i.e., the model may have 5,000 n-grams in it, but we only want to look at the first 3,000). This means we can use the same n-gram file to test language models of various sizes without having to regenerate the models.

I made the penalty the size of the model we're using (i.e., if we look at 3,000 English n-grams, then any unknown n-gram gets treated as if it were #3000, regardless of how many n-grams we look at in the sample). The number of n-grams looked at in the sample is still configurable, but I set it to 1,000, which is effectively "all of them" for most query strings.
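
To make the two penalty schemes concrete, here is a small Python sketch with a synthetic model. The function `rank_distance`, its parameters, and the made-up n-gram names (`g0`, `g1`, ...) are assumptions for illustration only; ranks are zero-based, so `g49` is the n-gram in 50th place.

```python
def rank_distance(sample, model, model_size, sample_size=1000, unknown_penalty=None):
    """Rank-order distance using only the first `model_size` model n-grams.

    If `unknown_penalty` is None, unknown n-grams cost the size of the model
    actually used (the modified behavior); passing `sample_size` instead
    mimics the original behavior.
    """
    model_rank = {gram: rank for rank, gram in enumerate(model[:model_size])}
    if unknown_penalty is None:
        unknown_penalty = len(model_rank)
    total = 0
    for rank, gram in enumerate(sample[:sample_size]):
        if gram in model_rank:
            total += abs(rank - model_rank[gram])
        else:
            total += unknown_penalty
    return total


# Synthetic model: "g0" is the most frequent n-gram, "g1" the next, and so on.
model = [f"g{i}" for i in range(5000)]
head = [f"g{i}" for i in range(19)]      # 19 n-grams that match the model exactly
with_known = head + ["g49"]              # 20th sample n-gram is 50th in the model
with_unknown = head + ["zz"]             # 20th sample n-gram is not in the model

# Original scheme: 400-n-gram model, 20 sample n-grams, penalty = sample size.
old_known = rank_distance(with_known, model, 400, sample_size=20, unknown_penalty=20)
old_unknown = rank_distance(with_unknown, model, 400, sample_size=20, unknown_penalty=20)

# Modified scheme: first 3,000 model n-grams, penalty = model size used.
new_known = rank_distance(with_known, model, 3000)
new_unknown = rank_distance(with_unknown, model, 3000)
```

Under the original scheme the known n-gram costs 30 while the unknown one costs only 20; under the modified scheme the unknown n-gram costs 3,000, so it can no longer beat a known one.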

Using the entire available model (or even larger models) isn't necessarily a good idea. At some point random noise will begin to creep in, and with very low frequency counts, alphabetizing the n-grams may have as much of an effect as the actual frequency (i.e., an n-gram may be tied for 15,683rd place, but may show up in 16,592nd place because there are a thousand n-grams with a count of 1). Also, larger models (with more n-grams) are more coarse when built on smaller training data sets, further exaggerating the differences between models built on larger vs. smaller corpora.

Query Data Collection
I started with 46,559,669 queries extracted from a week's worth of query logs (11/10/2015 through 11/16/2015). I collated the queries by wiki (with the various wikis acting as an initial stand-in for the corresponding language). There were 59 query sets with at least 10,000 raw queries (up to 18M+, for English): Albanian, Arabic, Armenian, Azerbaijani, Basque, Bengali, Bosnian, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Igbo, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Mongolian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.

There's plenty of messiness in the queries, so I filtered queries according to a number of criteria:
 * deduplication: I deduped the queries. Even though the same query could come from multiple sources, the most commonly repeated queries in general come from bots. Others are driven by current events, and don't reflect more general language or query patterns. Deduping reduces their ability to skew the language model stats.
 * repetitive junk: A decent filter for junk (with very high precision) is to remove queries with the same character or two-letter sequence repeated at least four times in a row, or the same 3-6 character sequence at least three times in a row. I skimmed the queries being removed, and for some character sets my non-Unicode tool (grep) did some things not quite right, and so I adjusted accordingly. But as a general heuristic, this is a good way of reducing noise.
 * inappropriate character set: For each language, I also filtered out queries that were entirely in an inappropriate character set. For example, a query with no Latin characters is not going to be in English. This is obviously much more precise for some languages (Thai, Greek), and since all query sets seem to have a fair number of English queries, it was also fairly effective for languages that don't use the Latin alphabet even if their writing system isn't unique to the language (Cyrillic, Arabic). I also filtered queries with characters, since there were bits of HTML and XML in some queries.
 * bad key words: I took a look at highest frequency tokens across all queries and found a number of terms that were high-precision markers for "bad" queries, including insource, category, Cookbook, prefix, www, etc. These were all filtered out, too.

After filtering, a lot of queries were removed. English was down to ~14M. Telugu had the fewest left, losing almost 70% down to ~3,300 queries. The largest loss (percentage and total) was Italian, which lost about 89% (5.8M) queries. The largest factor in Italian is probably deduplication, since searches on itwiki are repeated on a number of other Italian wikis.

The data was still messy, but filtering should have improved the signal strength of the main language of the wiki, while preserving the idiosyncrasies of actual queries (vs., say, wiki text in that language).

Variants Tested
My primary variables were (a) language model size, (b) whether to use the sample n-gram count or language model size (i.e., n-gram count) as the unknown n-gram penalty, and (c) sample n-gram count. As noted above, using the language model size as the penalty (b) performed much better, and the sample n-gram count (c) seemed best when it was "all of them" (in practice, for queries, that's 1,000).

I tested model sizes with 100 to 2,000 n-grams (in increments of 100) and 2,000 to 5,000 n-grams (in increments of 500). In my experiments, 3,000 to 3,500 n-grams generally performed the best.

When I reviewed the results, there were clearly some detectors that performed very poorly. I was less concerned with recall (every right answer is a happy answer) and more concerned with precision. Some low-precision models are the result of poor training data (the Igbo wiki, for example, gets a lot of queries in English), others are apparently just hard, esp. on small strings (like French). I removed language models with poor precision, in the hopes that, for example, English queries identified as French or Igbo would be correctly identified once French and Igbo were removed as options. Removing options that had very low precision resulted in improved performance.

A number of languages were dropped because there were no examples in the evaluation set, meaning they could only be wrong (and many were). Others, like French, Tagalog, and German, were dropped even though they could theoretically help, because they got so many misses (false positives). The final list of languages used included: English, Spanish, Chinese, Portuguese, Arabic, Russian, Persian, Korean, Bengali, Bulgarian, Hindi, Greek, Japanese, Tamil, and Thai. The language models for Hebrew, Armenian, Georgian, and Telugu were also used, but didn't detect anything (i.e., they weren't problematic, so they weren't removed).

Some of these are high accuracy because their writing systems are very distinctive: Armenian, Bengali, Chinese (esp. when not trying to distinguish Cantonese), Georgian, Greek, Hebrew, Hindi (in this set of languages), Korean, Tamil, Telugu, and Thai. Bulgarian and Portuguese (potentially confused with Russian and Spanish, respectively) actually didn't do particularly well, but their negatives were on a fairly small scale.

Best Options
The best performing set up for enwiki then is: language models with 3,000 n-grams, built on the filtered query set, setting the unknown n-gram penalty to the language model size, and limiting the languages to those that are very high precision or very useful to enwiki: English, Spanish, Chinese, Portuguese, Arabic, Russian, Persian, Korean, Bengali, Bulgarian, Hindi, Greek, Japanese, Tamil, and Thai.

The Numbers
The baseline performance I am trying to beat is the ES Plugin (with spaces). A summary of F0.5 performance of the ES Plugin overall and for the most common languages in enwiki queries is provided below.

ES Plugin Baseline
Language      F0.5  recall    prec  total  hits  misses
TOTAL        54.4%   39.0%   60.4%    775   302     198
English      71.8%   34.2%   99.0%    599   205       2
Spanish      62.8%   58.1%   64.1%     43    25      14
Chinese      90.3%   65.0%  100.0%     20    13       0
Portuguese   44.0%   42.1%   44.4%     19     8      10
Arabic       95.2%   80.0%  100.0%     10     8       0
French       13.6%   30.0%   12.0%     10     3      22
Tagalog      31.0%   77.8%   26.9%      9     7      19
German       36.8%   62.5%   33.3%      8     5      10
Russian      88.2%   60.0%  100.0%      5     3       0
Persian      75.0%   75.0%   75.0%      4     3       1

The results for TextCat, showing sub-scores for languages with > 0% (plus French, Tagalog, and German, for nostalgia's sake):

TextCat, limited to certain languages
Language      F0.5  recall    prec  total  hits  misses
TOTAL        83.1%   83.2%   83.1%    775   645     131
English      90.5%   93.3%   89.9%    599   559      63
Spanish      51.4%   74.4%   47.8%     43    32      35
Chinese      85.5%   65.0%   92.9%     20    13       1
Portuguese   37.4%   73.7%   33.3%     19    14      28
Arabic       87.0%   80.0%   88.9%     10     8       1
French        0.0%    0.0%    0.0%     10     0       0
Tagalog       0.0%    0.0%    0.0%      9     0       0
German        0.0%    0.0%    0.0%      8     0       0
Russian      95.2%   80.0%  100.0%      5     4       0
Persian      83.3%  100.0%   80.0%      4     4       1
Korean       90.9%   66.7%  100.0%      3     2       0
Bengali     100.0%  100.0%  100.0%      2     2       0
Bulgarian    55.6%  100.0%   50.0%      2     2       2
Hindi       100.0%  100.0%  100.0%      2     2       0
Greek       100.0%  100.0%  100.0%      1     1       0
Tamil       100.0%  100.0%  100.0%      1     1       0
Thai        100.0%  100.0%  100.0%      1     1       0

These results are comparable to (actually slightly better than, for F0.5) using per-language thresholds with the ES Plugin (which was optimized on the evaluation set and thus very much overfitted and brittle), with much, much better overall recall (83.2% vs 36.1%) and marginally worse precision (83.1% vs 90.3%).
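
For reference, the recall, precision, and F0.5 figures reported above can be reproduced from the raw counts with a few lines of Python. This sketch assumes that "hits" are correct identifications, "misses" are false positives, and "total" is the number of evaluation queries actually in that language; the function name is made up for the example.

```python
def f_beta_scores(hits, misses, total, beta=0.5):
    """Return (f_beta, recall, precision) from hit, false-positive, and total counts."""
    precision = hits / (hits + misses) if hits + misses else 0.0
    recall = hits / total if total else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0, recall, precision
    beta_sq = beta ** 2
    f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return f_beta, recall, precision


# English with the ES Plugin baseline: 205 hits, 2 misses, 599 English queries.
f_half, recall, precision = f_beta_scores(205, 2, 599)
```

With beta = 0.5, precision is weighted more heavily than recall, which matches the stated preference for precision. For the English baseline row this gives roughly 71.8% F0.5, 34.2% recall, and 99.0% precision.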

Other Training Options Explored

 * I did initially and very optimistically build language models on the raw query strings for each language. The results were not better than the ES Plugin, hence the filtering.


 * I tried to reduce the noise in the training data. I chose English and Spanish because they are the most important languages for queries on enwiki. I manually reviewed 5,699 enwiki queries and reduced them to 1,554 English queries (so much junk!!), and similarly reduced 4,101 eswiki queries to 2,497 Spanish queries. I built models on these queries and used them with models for other languages built on the larger query sets above. They performed noticeably worse, probably because of the very small corpus size. It might be possible to improve the performance of lower-performing language models using this method, but it's a lot of work to build up sizable corpora.


 * I extracted text from thousands of Wiki articles for Arabic, German, English, French, Portuguese, Tagalog, and Chinese, the languages with the most examples in my test corpus for enwiki. I extracted 2.6MB of training data for each language, though it was obviously messy and included bits of text in other languages. I built language models on these samples, and used them in conjunction with the high-performing models for other wikis built on query data. The results were not as good as with the original models built on query data, regardless of how I mixed and matched them. So, query data does seem to have patterns that differ from regular text, at least Wikipedia article text. (Interestingly, these models were best at the max language model size of 5,000 n-grams, so I tested model sizes in increments of 500 up to 10,000 n-grams. Performance did in fact max out around 5,000.)


 * I looked at using the internal dissimilarity score (i.e., smaller is better) from TextCat as a threshold, but it didn't help.

Next Steps
There are a number of options we could explore from here, and these have been converted into Phabricator tickets for the Discovery Team.


 * 1) Stas has already started working on converting TextCat to PHP for use in Cirrus Search (available on GitHub), and he and Erik have been brainstorming on ways of making it more efficient, too. That needs some testing (e.g., Unicode compatibility) and comparison to the Perl version (i.e., same results on test queries). Phabricator: T121538
 * 2) Do a better assessment of the new language models to decide which ones are really not good (e.g., probably Igbo) and which ones are just not appropriate for enwiki (e.g., hopefully French and German). The obvious approach is to create a "fair" evaluation test set with equal numbers of examples for each language, and evaluating performance on that set. Phabricator: T121539
 * 3) Use the training data created here for training models for the ES Plugin / Cybozu. Perhaps its difficulties with queries are partly due to inaccurate general language models. This could also include looking at the internals and seeing if there is any benefit to changing the model size or other internal configuration, including optionally disabling "unhelpful" models (I'm looking at you, Romanian). Phabricator: T121540
 * 4) Create properly weighted evaluation sets for other wikis (in order by query volume) and determine the best mix of languages to use for each of them. Each evaluation set would be a set of ~1,000 zero-results queries from the given wiki, manually tagged by language. It takes half a day to do if you are familiar with the main language of the wiki, and evaluation on a given set of language models takes a couple of hours at most. (depends on 2 to make sure we aren't wasting time on a main language that does not perform well) Phabricator: T121541
 * 5) Do an A/B test on enwiki (or A/B/C test vs the ES Plugin) using the best config determined here. (A/B test depends on 1; A/B/C test could benefit from 3) Phabricator: T121542
 * 6) Do A/B tests on other wikis (depends on 4) Phabricator: T121543
 * 7) Create larger manually "curated" training sets for languages with really crappy training data (e.g., Igbo) that's contaminated with English and other junk. (could depend on and be gated by the results of 8; could be tested via re-test of data in 2) Phabricator: T121544
 * 8) See if Wikipedia-based language models for languages with crappy training data do better. (could obviate the need for 7 in some cases; could be tested via re-test of data in 2) Phabricator: T121545
 * 9) Experiment with equalizing training set sizes, since very small training sets may make for less accurate language models. That is, extract a lot more data for particular wikis with smaller training sets so their language models are more fine-grained. These languages ended up with less than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need it more than others—languages with distinctive character sets do well already. (could link up with 7, 8, and/or 2) Phabricator: T121546
 * 10) Improve training data via application of language models to the training data. For example, use all available high-precision language models except French on the French training data. Group the results by language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those that are not French. This should be much faster (though less exhaustive) than option 7, because most of the best-scoring queries will in fact not be French. Review can stop, say, when the incidence of non-French queries is less than half. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (depends on 2 for a reasonable evaluation set) Phabricator: T121547

Query Counts
The table below shows the (usually) two-letter language code, the name of the language, the scripts used by the language, the original number of queries from the week's sample, the number of queries left after filtering, and the diff between queries and filtered queries. It is sorted in descending order by filtered queries.