User:TJones (WMF)/Notes/Language Detection with TextCat

December 2015 — See TJones_(WMF)/Notes for other projects.

Background
As previously noted, the ES Plugin doesn't do a great job of language detection. At Oliver's suggestion, I was looking into a paper by Řehůřek and Kolkus ("Language Identification on the Web: Extending the Dictionary Method", 2009). Řehůřek and Kolkus compare their technique against an ngram method (TextCat). On small phrases (30 characters or less), I felt the ngram method they used generally outperformed their method (typically similar or higher precision, though often with lower recall—see Table 2 in the paper).

I was familiar with TextCat, so I looked it up online; it's available under the GNU GPL, so I thought I'd give it a try.

TextCat
Unfortunately, the most current version of the original [http://odur.let.rug.nl/vannoord/TextCat/ TextCat] (in Perl) is pretty out of date (there are other implementations in other languages available, but I wanted to stick with the original if possible).

The provided language models for TextCat are non-Unicode, and there are even models for the same language in different encodings (e.g., Arabic in iso8859_6 or windows1256). Also, as I discovered later, the language models are all limited to 400 ngrams.

Upgrades to TextCat and the Language Models
Since language detection on short strings is generally difficult and the existing models were non-Unicode, I decided to retrain the models on actual wiki query data. In addition to using Unicode input, models built on query data may have significantly different distributions of ngrams. For example, there may be a significantly different proportion of punctuation, diacritics, question words (leading to different proportions of "wh"s in English or "q"s in Spanish, for example), or verbs (affecting counts for conjugation endings in some languages), or a different number of inflected forms in general. (I didn't try to empirically verify these ideas independently, but they are the motivation for using query data for re-training.)

I modified TextCat in several ways:


 * updated it to handle Unicode characters
 * changed the default maximum number of languages reported from 10 to 100, so it always gives a result
 * modified the output to include scores (in case we want to limit based on the score)
 * pre-loaded all language models so that when processing line by line it is many times faster (a known deficiency mentioned in the comments of the original)
 * put in an alphabetic sub-sort after the frequency sorting of ngrams (as noted in the comments of the original, not having this is faster, but without it, results are not unique, and can vary from run to run on the same input!!); the sub-sort appears in the model-building sketch after this list
 * removed the benchmark timers (after re-shuffling some parts of the code, they were no longer in a convenient location, so I just took them out)
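
To make the model-building step concrete (including the alphabetic sub-sort above), here is a minimal sketch in Python (the real implementation is in Perl; the function names and defaults here are mine). It assumes TextCat-style character ngrams of length 1 to 5, with "_" marking word boundaries, as in the "ish_" example further below.

 from collections import Counter

 def ngram_counts(text, max_n=5):
     """Count character ngrams of length 1 to max_n for each word.
     Words are padded with '_' so that word-final ngrams like 'ish_'
     are distinct from word-internal 'ish'."""
     counts = Counter()
     for word in text.split():
         padded = "_" + word + "_"
         for n in range(1, max_n + 1):
             for i in range(len(padded) - n + 1):
                 counts[padded[i:i + n]] += 1
     return counts

 def build_model(training_text, model_size=5000):
     """Rank ngrams by descending frequency, breaking ties alphabetically
     (the sub-sort that makes results deterministic), and keep the top
     model_size entries."""
     counts = ngram_counts(training_text)
     return sorted(counts, key=lambda g: (-counts[g], g))[:model_size]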

I also changed the way TextCat deals with the number of ngrams in a model and the number of ngrams in the sample. This requires a bit more explanation. The language models that come with TextCat have 400 ngrams (the 400 most frequent for each language), and by default TextCat considers the 400 most frequent ngrams from the sample to be identified. There is an option to use fewer ngrams from the sample (for speed, presumably), but the entire language model would still be used. There is a penalty for an unknown ngram, which is the same as the number of ngrams used in the sample. Confusing, no?

As an example, if you have language models with 400 ngrams, but you choose to only look at the 20 most frequent ngrams in your sample (a silly thing to do), then any unknown ngram would be given the same penalty as if it were the 20th ngram. In this case, that's crazy, because a known ngram in 30th place counts against a language more than an unknown ngram (scored as if it were in 20th place). In practice, I assume 300-500 sample ngrams would be used, and the penalty for an unknown ngram would be similar to that of a low frequency ngram.

This makes sense when dealing with reasonably large texts, where the most frequent ngrams really do the work of identifying a language, because they are repeated often. In really short samples (like short queries), the final decision may be made more on the basis of which language a string is least dissimilar to, rather than which it is most similar to, simply because the string is too short to exhibit characteristic patterns. For example, in English, e is the most common letter, and is roughly 1.4 times as common as t, 1.6 times as common as a, and 1.7 times as common as o. You won't reliably get those proportions in a ten to twenty character string made up of English words.

As a result, it makes sense that very large language models, with thousands of ngrams, could be better at discriminating between languages, especially on very short strings. So we can see that while "ish_" (i.e., "ish" at the end of a word) is not super common in English (ngram #1014), it is even less common in Swedish (ngram #4100). In a long text, this wouldn't matter, because the preponderance of words ending in e, s, t, or y, or starting with s or t, or containing an, in, on, or er, or the relative proportions of single letters or some other emergent feature would carry the day. But that's not going to happen when the string you are assessing is just "zebrafish".

So I modified TextCat to limit the size of the language model being used rather than using the whole model available (i.e., the model may have 5,000 ngrams in it, but we only want to look at the first 3,000). This means we can use the same model file to test language models of various sizes without having to regenerate the models.

I made the penalty the size of the model we're using (i.e., if we look at 3,000 English ngrams, then any unknown ngram gets treated as if it were #3000, regardless of how many ngrams we look at in the sample). The number of ngrams looked at in the sample is still configurable, but I set it to 1,000, which is effectively "all of them" for most query strings.
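
Putting these changes together, the scoring works roughly like the Python sketch below (reusing ngram_counts from the earlier sketch; again, names and defaults are illustrative, not the actual Perl interface): the classic out-of-place measure over ranked ngram lists, with the model truncated to model_size entries and unknown ngrams penalized at model_size.

 def score(sample_text, model, model_size=3000, sample_size=1000):
     """Out-of-place distance between a sample and one language model;
     lower means less dissimilar. An unknown ngram costs model_size,
     the same as an ngram at the bottom of the truncated model."""
     rank = {g: i for i, g in enumerate(model[:model_size])}
     counts = ngram_counts(sample_text)
     sample = sorted(counts, key=lambda g: (-counts[g], g))[:sample_size]
     return sum(abs(rank[g] - i) if g in rank else model_size
                for i, g in enumerate(sample))

 def detect(sample_text, models):
     """Rank candidate languages from least to most dissimilar,
     returning (score, language) pairs."""
     return sorted((score(sample_text, m), lang)
                   for lang, m in models.items())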

Using the entire available model (or using the largest model possible) isn't necessarily a good idea. At some point random noise will begin to creep in, and with very low frequency counts, alphabetizing the ngrams may have as much of an effect as the actual frequency (i.e., an ngram may be tied for 15,683rd place, but may show up in 16,592nd place because there are a thousand ngrams with a count of 1). Also, larger models (with more ngrams) are more coarse when built on smaller training data sets, further exaggerating the differences between models built on larger vs. smaller corpora.

Query Data Collection
I started with 46,559,669 queries extracted from a week's worth of query logs (11/10/2015 through 11/16/2015). I collated the queries by wiki (with the various wikis in a language working as an initial stand-in for the corresponding language). There were 59 query sets with at least 10,000 raw queries (up to 18M+ for English): Albanian, Arabic, Armenian, Azerbaijani, Basque, Bengali, Bosnian, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Igbo, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Mongolian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese.

There's plenty of messiness in the queries, so I filtered queries according to a number of criteria:

- deduplication: I deduped the queries. Even though the same query could come from multiple sources, the most commonly repeated queries in general come from bots. Others are driven by current events, and don't reflect more general language or query patterns. Deduping reduces their ability to skew the language model stats.

- repetitive junk: A decent filter for junk (with very high precision) is to remove queries with the same character or two-letter sequence repeated at least four times in a row, or the same 3-6 character sequence repeated at least three times in a row. I skimmed the queries being removed, and for some character sets my non-Unicode tool (grep) did some things not quite right, so I adjusted accordingly. But as a general heuristic, this is a good way of reducing noise. (A sketch of these filters follows after this list.)

- inappropriate character set: For each language, I also filtered out queries that were entirely in an inappropriate character set. For example, a query with no Latin characters is not going to be in English. This is obviously much more precise for some languages (Thai, Greek), and since all query sets seem to have a fair number of English queries, it was also fairly effective for languages that don't use the Latin alphabet even if their writing system isn't unique to the language (Cyrillic, Arabic). I also filtered queries with angle bracket characters (< and >), since there were bits of HTML and XML in some queries.

- bad key words: I took a look at the highest-frequency tokens across all queries and found a number of terms that were high-precision markers for "bad" queries, including insource, category, Cookbook, prefix, www, etc. These were all filtered out, too.
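
For illustration, the repetition and character-set filters might look something like the Python sketch below. The regexes are approximations of the heuristics described above (the originals were grep-based and tuned by skimming the output), and the English-specific check is just one example of the per-language character-set tests.

 import re
 import unicodedata

 # Repetitive junk: a 1-2 character sequence repeated at least four times
 # in a row, or a 3-6 character sequence repeated at least three times.
 REPEATS = [re.compile(r'(.{1,2})\1{3}'),
            re.compile(r'(.{3,6})\1{2}')]

 def is_repetitive_junk(query):
     return any(p.search(query) for p in REPEATS)

 def has_latin(query):
     """Character-set check for English: require at least one Latin letter."""
     return any('LATIN' in unicodedata.name(c, '') for c in query)

 def keep_for_english(query):
     # Drop repetitive junk, queries with no Latin characters at all, and
     # queries with angle brackets (bits of HTML/XML).
     return (not is_repetitive_junk(query) and has_latin(query)
             and '<' not in query and '>' not in query)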

After filtering, a lot of queries were removed. English was down to ~14M. Telugu had the fewest queries left, losing almost 70%, down to 3300. The largest loss (by percentage and total) was Italian, which lost about 89% (5.8M queries). The largest factor here is probably deduplication, since searches on itwiki are repeated on a number of other Italian wikis.

The data was still messy, but filtering should have improved the signal strength of the main language of the wiki, while preserving the idiosyncrasies of actual queries (vs., say, wiki text in that language).

Variants Tested
My primary variables were (a) language model size, (b) whether to use the sample ngram count or the language model ngram count as the unknown ngram penalty, and (c) sample ngram count. As noted above, using the language model size as the penalty (b) performed much better, and the sample ngram count (c) seemed best when it was "all of them" (in practice, for queries, that's 1000).

I tested model sizes with 100 to 2000 ngrams (in increments of 100) and 2000 to 5000 ngrams (in increments of 500). In my experiments, 3000 to 3500 ngrams generally performed the best.

When I reviewed the results, there were clearly some detectors that performed very poorly. I was less concerned with recall (every right answer is a happy answer) and more concerned with precision. Some low precision models are the result of poor training data (the Igbo wiki, for example, gets a lot of queries in English), others are apparently just hard, esp. on small strings (like French). I removed language models with poor precision, in the hopes that, for example, English queries identified as French would be correctly identified once French was removed as an option. Removing options that had very low precision (and in some cases, no positive examples in the evaluation set, so they could only be wrong) resulted in improved performance.

A number of languages were dropped because there were no examples in the evaluation set, meaning they could only be wrong (and many were). Others, like French, Tagalog, and German, were dropped even though they could theoretically help, because they got so many misses (false positives). The final list of languages used included English, Spanish, Chinese, Portuguese, Arabic, Russian, Persian, Korean, Bengali, Bulgarian, Hindi, Greek, Japanese, Tamil, and Thai. The language models for Hebrew, Armenian, Georgian, and Telugu were also used, but didn't detect anything (i.e., they weren't problematic, so they weren't removed). Some of these are highly accurate because their writing systems are very distinctive: Armenian, Bengali, Chinese (esp. when not trying to distinguish Cantonese), Georgian, Greek, Hebrew, Hindi (in this set of languages), Korean, Tamil, Telugu, and Thai. Bulgarian and Portuguese (potentially confused with Russian and Spanish, respectively) actually didn't do particularly well, but their negatives were on a fairly small scale.

The best-performing setup for enwiki, then, is: language models with 3000 ngrams, setting the unknown ngram penalty to the language model size, and limiting the languages to those that are very high precision or very useful for enwiki (listed in the paragraph above).

The Numbers
The baseline performance I am trying to beat is the ES Plugin (with spaces). A summary of F0.5 performance of the ES Plugin overall and for the most common languages in enwiki queries is provided below.
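
For reference, F0.5 is the weighted F-measure with beta = 0.5, which weights precision more heavily than recall (matching the precision-first goal discussed above). A quick sketch, with a sanity check against the baseline TOTAL row below:

 def f_beta(precision, recall, beta=0.5):
     """Weighted harmonic mean of precision and recall;
     beta < 1 favors precision."""
     b2 = beta * beta
     return (1 + b2) * precision * recall / (b2 * precision + recall)

 # f_beta(0.604, 0.390) ≈ 0.544, i.e., the 54.4% in the TOTAL row below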

 ES Plugin Baseline
             f0.5   recall   prec    total  hits  misses
 TOTAL       54.4%   39.0%   60.4%   775    302   198
 English     71.8%   34.2%   99.0%   599    205     2
 Spanish     62.8%   58.1%   64.1%    43     25    14
 Chinese     90.3%   65.0%  100.0%    20     13     0
 Portuguese  44.0%   42.1%   44.4%    19      8    10
 Arabic      95.2%   80.0%  100.0%    10      8     0
 French      13.6%   30.0%   12.0%    10      3    22
 Tagalog     31.0%   77.8%   26.9%     9      7    19
 German      36.8%   62.5%   33.3%     8      5    10
 Russian     88.2%   60.0%  100.0%     5      3     0
 Persian     75.0%   75.0%   75.0%     4      3     1

 TextCat, limited to certain languages
             f0.5   recall   prec    total  hits  misses
 TOTAL       83.1%   83.2%   83.1%   775    645   131
 English     90.5%   93.3%   89.9%   599    559    63
 Spanish     51.4%   74.4%   47.8%    43     32    35
 Chinese     85.5%   65.0%   92.9%    20     13     1
 Portuguese  37.4%   73.7%   33.3%    19     14    28
 Arabic      87.0%   80.0%   88.9%    10      8     1
 French       0.0%    0.0%    0.0%    10      0     0
 Tagalog      0.0%    0.0%    0.0%     9      0     0
 German       0.0%    0.0%    0.0%     8      0     0
 Russian     95.2%   80.0%  100.0%     5      4     0
 Persian     83.3%  100.0%   80.0%     4      4     1
 Korean      90.9%   66.7%  100.0%     3      2     0
 Bengali    100.0%  100.0%  100.0%     2      2     0
 Bulgarian   55.6%  100.0%   50.0%     2      2     2
 Hindi      100.0%  100.0%  100.0%     2      2     0
 Greek      100.0%  100.0%  100.0%     1      1     0
 Tamil      100.0%  100.0%  100.0%     1      1     0
 Thai       100.0%  100.0%  100.0%     1      1     0

These results are comparable to (actually slightly better than, by F0.5) using per-language thresholds with the ES Plugin (which were optimized on the evaluation set and thus very much overfitted and brittle), with much, much better overall recall (83.2% vs 36.1%) and marginally worse precision (83.1% vs 90.3%).

Other Training Options Explored

 * I did initially and very optimistically build language models on the raw query strings for each language. The results were not better than the ES Plugin, hence the filtering.


 * I tried to reduce the noise in the training data. I chose English and Spanish because they are the most important languages for queries on enwiki. I manually reviewed 5699 enwiki queries and reduced them to 1554 English queries (so much junk!!), and similarly reduced 4101 eswiki queries to 2497 Spanish queries. I built models on these queries and used them with models for other languages built on the larger query sets above. They performed noticeably worse, probably because of the very small corpus size. It might be possible to improve the performance of lower-performing language models using this method, but it's a lot of work to build up sizable corpora.


 * I extracted text from thousands of Wiki articles for Arabic, German, English, French, Portuguese, Tagalog, and Chinese—the languages with the most examples in my test corpus for enwiki. I had 2.6MB of training data for each language, though it was obviously messy and included bits of text in other languages. I built language models on these samples, and used them in conjunction with the high-performing models for other wikis built on query data. The results were not as good as with the original models built on query data, regardless of how I mixed and matched them. So, query data does seem to have patterns that differ from regular text, at least Wikipedia article text. (Interestingly, these models were best at 5000 ngrams, so I tested model sizes in increments of 500 up to 10,000 ngrams. Performance maxed out around 5000.)


 * I looked at using the internal dissimilarity score (i.e., smaller is better) from TextCat as a threshold, but it didn't help; a sketch of the idea follows below.
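
Concretely, the thresholding idea was roughly the following, building on the detect sketch above (the cutoff parameter is hypothetical, and no value of it helped):

 def detect_with_threshold(sample_text, models, cutoff):
     """Return the best-scoring language only if its dissimilarity
     score is at or below the cutoff; otherwise return no answer."""
     best_score, best_lang = detect(sample_text, models)[0]
     return best_lang if best_score <= cutoff else None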

Next Steps
There are a number of options we could explore from here, and some of these will be converted into Phabricator tickets for the Discovery Team.


 * 1) Stas has already started working on converting TextCat to PHP for use in Cirrus Search, and he and Erik have been brainstorming on ways of making it more efficient, too. That needs some testing (e.g., Unicode compatibility) and comparison to the Perl version (i.e., same results on test queries).
 * 2) Do a better assessment of the new language models to decide which ones are really not good (e.g., probably Igbo) and which ones are just not appropriate for enwiki (e.g., hopefully French and German). The obvious approach is to create a "fair" evaluation test set with equal numbers of examples for each language, and to evaluate performance on that set.
 * 3) Use the training data created here for training models for the ES Plugin / Cybozu. Perhaps its difficulties with queries are partly due to inaccurate general language models. This could also include looking at the internals and seeing if there is any benefit to changing the model size or other internal configuration, including optionally disabling "unhelpful" models (I'm looking at you, Romanian).
 * 4) Create weighted evaluation sets for other wikis (in order by query volume or by wiki size) and determine the best mix of languages to use for each of them. (depends on 3 to make sure we aren't wasting time on a main language that will never perform well)
 * 5) Do an A/B test (or A/B/C test vs the ES Plugin) on enwiki using the best config determined here. (A/B test depends on 1; A/B/C test could benefit from 3)
 * 6) Do A/B tests on other wikis (depends on 4)
 * 7) Create larger manually "curated" test sets for languages with really crappy training data (e.g., Igbo) that's contaminated by English and other data. (could depend on and be gated by the results of 8; could be tested via re-test of data in 2)
 * 8) See if Wikipedia-based language models for languages with crappy training data do well. (could obviate the need for 7 in some cases; could be tested via re-test of data in 2)
 * 9) Experiment with equalizing training set sizes, since very small training sets may make for less accurate language models. (could link up with 7, 8, and/or 2)