User:TJones (WMF)/Notes/TextCat and Confidence

July 2016 — See TJones_(WMF)/Notes for other projects. See also T140289.

Introduction
Mikhail has written up and should soon release his report on our initial TextCat A/B tests; the results look good, and language identification and cross-wiki searching definitely improve the results (in terms of results shown and results clicked) for otherwise poorly performing queries (those that get fewer than 3 results).

Mikhail's report also suggests looking at some measure of confidence for the language identification to see if that has any effect on the quality (in terms of number of results, but more importantly clicks) of the crosswiki (also "interwiki") results. This sounds like a good idea, but TextCat doesn't make it super easy to do. I have some ideas, though, and I would love some suggestions from anyone else who has any ideas.

TextCat Scoring
Technically, TextCat generates costs rather than scores, and the lowest cost is the best. The costs are not absolute, but depend on the number of n-grams (1- to 5-character sequences) in the text being identified, which is related to, but not exactly proportional to, the length of that text. The cost for each n-gram is its rank in the frequency-sorted list in the model. For all the models we have (even Chinese!) the least costly n-gram is a single space, because there are more of those than anything else. As you go down the list in English, the most common single letters are next: e, a, i, o, n, r, t, s. More words end in e, s, or n than have an f, y, or b anywhere! All but two of the n-grams in "the" (t, h, e, _t, th, he, e_, _th, the, he_, and _the, but not the_ and _the_, where _ marks a word boundary) are more common than z in any word anywhere!
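To make the n-gram inventory concrete, here is a minimal sketch of how the 1- to 5-character n-grams of a short text could be enumerated. The function name `ngrams` and the use of `_` as a word-boundary marker are assumptions for illustration (the underscore simply follows the notation used for "the" above); the real TextCat tokenization may differ in details.

```python
def ngrams(text, max_n=5):
    """Return every 1- to max_n-character n-gram of the text.

    Assumption: words are separated and padded with '_' to mark word
    boundaries, following the notation used for "the" in this section.
    """
    padded = "_" + "_".join(text.split()) + "_"
    return [
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


# The n-grams of "the", leaving out the bare boundary marker itself.
the_grams = [g for g in ngrams("the") if g != "_"]
print(len(the_grams), the_grams)
```

Leaving out the bare boundary marker, this gives the 13 n-grams discussed above: the eleven common ones plus the_ and _the_.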

The overall cost of a text compared to the model is the total of the costs of all the individual n-grams. Unknown n-grams (which could be an uncommon sequence of letters the language uses, or something from a completely different character set) get the maximum cost for the model; i.e., if the model has 3,000 n-grams, the penalty for an unknown n-gram is 3,000. The model with the lowest cost wins.
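The cost calculation as described here can be sketched in a few lines. This is only an illustration of the description in this section, not TextCat's actual code: the name `text_cost`, the toy models, and the 1-based ranks are all assumptions.

```python
def text_cost(text_ngrams, model_ranks):
    """Sum the model rank of each n-gram; unknown n-grams cost the model size."""
    unknown_penalty = len(model_ranks)  # e.g. 3,000 for a 3,000 n-gram model
    return sum(model_ranks.get(g, unknown_penalty) for g in text_ngrams)


# Hypothetical six-entry models (n-gram -> rank); real models are far larger.
toy_models = {
    "en": {"_": 1, "e": 2, "t": 3, "h": 4, "th": 5, "he": 6},
    "es": {"_": 1, "e": 2, "a": 3, "o": 4, "s": 5, "el": 6},
}

sample_ngrams = ["t", "h", "e", "th", "he"]
costs = {lang: text_cost(sample_ngrams, ranks) for lang, ranks in toy_models.items()}
best_lang = min(costs, key=costs.get)  # the lowest cost wins
print(costs, best_lang)
```

With these toy models the English cost is 20 and the Spanish cost is 26 (four unknown n-grams at a penalty of 6 each, plus 2 for e), so English wins.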

TextCat Internal Quality Control
Internally, TextCat has two parameters that are related to the quality of the language identification it has done, and which together can result in TextCat failing to give a result because no answer was good enough.

The first parameter is the results ratio. By default it is set at 1.05, which means that any language model with a cost less than 1.05 times the lowest cost (i.e., within 5% of the lowest cost) is reported as an alternative. So, TextCat can and will report that a particular string looks like it is Spanish, but that the second best guess is Portuguese, and the third best guess is maybe Italian.

The second parameter is the maximum returned languages. By default this is set to 5, which is the maximum number of languages that can be returned by TextCat. If more than that many languages fall within the results ratio, then TextCat can’t make up its mind, and returns “unknown” as the detected language.

Of course, the effect of these parameters also depends on the set of languages being used, and their similarity to one another. Spanish, Portuguese, Italian, and French are more likely to score similarly to each other than are Spanish, Arabic, Chinese, and Russian. Also, if only four languages are being considered for identification, it’s not possible to get more than five suggested potential languages, so “unknown” will never be the result.
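The interaction of the two parameters can be sketched roughly as follows. The function name `pick_languages`, the example costs, and the treatment of a cost exactly on the 5% boundary are assumptions; only the default values 1.05 and 5 come from the description above.

```python
def pick_languages(costs, results_ratio=1.05, max_returned=5):
    """Return candidate languages, best first, or [] for "unknown".

    costs maps language code -> total cost (lower is better).
    """
    ranked = sorted(costs.items(), key=lambda item: item[1])
    best_cost = ranked[0][1]
    # Keep every language whose cost is within the results ratio of the best.
    candidates = [lang for lang, cost in ranked if cost <= best_cost * results_ratio]
    # Too many near-ties: no answer is good enough.
    if len(candidates) > max_returned:
        return []
    return candidates


example_costs = {"es": 1000, "pt": 1030, "it": 1049, "fr": 1200}
print(pick_languages(example_costs))
print(pick_languages(example_costs, max_returned=2))
```

With the defaults, the first call reports Spanish, Portuguese, and Italian in that order; with a maximum of two returned languages the same three near-ties produce an empty list, standing in for “unknown”.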

Possible Confidence Measures
So far, I have come up with three ideas for generating a confidence score based on the scores TextCat provides:

Number of results returned: For given settings of results ratio and maximum returned languages, we could rate TextCat’s results. With the defaults of 1.05 and 5, only one language suggested means that no other model scored within 5% of the best. Two suggestions means only one other language got close, etc.

Ratio of first to second result scores: Rather than look at how many languages were suggested, we could look at the distance between first and second place, and ignore the rest. If second place is only 0.1% away from first, that’s not as good as if it were 4.9% away. Also, we don’t have to limit ourselves to the 5% cut-off determined by the results ratio. We could know the difference between second place being 8% away from first, or 45% away (neither of which would normally be reported by TextCat, since neither falls within the results ratio).

Ratio of score to maximum “unknown” score: We could (pre)compute a maximum score for a text of a given length and compare the best score to that theoretical worst score. This actually comes up when the text in question doesn’t share a character set with a given language model. If you run text in the Latin alphabet against models built on Arabic, Chinese, and Russian, they will all get the theoretical worst score for a string of that length.
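The three candidate measures could be computed from the per-language costs along these lines. This is a sketch under stated assumptions: the function name, the dictionary keys, and the example numbers are made up for illustration, and the theoretical worst score is taken to be the number of n-grams times the model size (every n-gram unknown), assuming all models have the same size.

```python
def confidence_measures(costs, num_ngrams, model_size, results_ratio=1.05):
    """Compute the three candidate confidence measures from language costs.

    costs: language code -> total cost (lower is better)
    num_ngrams: number of n-grams in the text being identified
    model_size: number of n-grams per model (assumed equal for all models)
    """
    ranked = sorted(costs.values())
    best = ranked[0]
    # 1. Number of results returned: how many models fall within the ratio.
    num_results = sum(1 for cost in ranked if cost <= best * results_ratio)
    # 2. Ratio of second to first: 1.03 means second place is 3% away.
    second_to_first = ranked[1] / best if len(ranked) > 1 and best > 0 else None
    # 3. Ratio of best cost to the worst possible cost for this text.
    worst = num_ngrams * model_size
    return {
        "num_results": num_results,
        "second_to_first": second_to_first,
        "best_to_worst": best / worst,
    }


measures = confidence_measures(
    {"es": 1000, "pt": 1030, "ru": 3000}, num_ngrams=10, model_size=300
)
print(measures)
```

In this made-up example the Russian model scores exactly the theoretical worst (10 n-grams at a penalty of 300 each), two languages fall within 5% of the best, and second place is 3% away from first.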

Potential Problems
Each of these methods of generating confidence scores has potential problems.

Related languages: As noted above, related languages (especially with partially similar spelling systems) will often score more similarly than unrelated languages with very different spelling conventions. So while ã, õ, and ç very clearly distinguish words in Portuguese from those in Spanish, not every word has those letters, and in the informal writing environment of search queries, not all speakers bother to type them with their diacritics (i.e., we see a, o, and c instead). Strategically placed typos can make already similar words in related languages even more similar. I noticed in particular that queries on ptwiki that had typos ended up being one letter off from a plausible word in both Portuguese and Spanish. Since I was on ptwiki, I assumed these kinds of queries were Portuguese, but TextCat would not.

Typos in general: In the context of poorly-performing queries, typos are very common. These can really screw up the n-gram statistics. The models we use are based on real query data, so they include typos and other data characteristic of informal writing in queries, but most typos won’t be among the top 3,000 n-grams for a given language.

The number of languages being considered: Based on the data analysis for each Wikipedia, a number of languages are selected to be considered as possible results for language identification on that wiki. The list considers both what languages are found in a sample of poorly performing queries from that wiki, and how those models interact with each other. So, if there are fifty times as many examples of Spanish as Portuguese queries on a given wiki, it might make sense to disable Portuguese because it gets many false positives on Spanish queries, and there aren’t that many true positives to make it worthwhile.

As a result of this analysis, different Wikipedias can have very different numbers of languages suggested for them, which can make it more or less likely that there is another language that scores nearly as well as the best scoring language.

Correct identification does not equal good results: A general problem is that even if we identify the language of the query, there’s no guarantee that sending it to the right wiki will give any result. If you search for why can't textcat tell what language this is? on ruwiki, and TextCat correctly and confidently identifies it as English, sending it to enwiki still isn’t going to give you any results.

Non-linearity of n-gram count: This only applies to comparing to the maximum "unknown" score. I haven't looked carefully at the numbers yet, but it's possible that non-proportionality of the number of n-grams for strings of different lengths could heavily penalize shorter or longer strings, especially when considering a single typo: in a shorter string (< 5 characters), that would affect a significant portion of all n-grams; in a longer string (> 100 characters), it might be just noise.
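A rough way to see the size of this effect is to count how many n-grams a single wrong character touches in strings of different lengths. The sketch below assumes a single-word string padded with one boundary marker on each side and a typo in the middle of the word; the function names are made up for illustration.

```python
def ngram_spans(length, max_n=5):
    """(start, end) index pairs of all 1- to max_n-grams in a string of this length."""
    return [(i, i + n) for n in range(1, max_n + 1) for i in range(length - n + 1)]


def typo_share(num_chars, max_n=5):
    """Return (total n-grams, share of n-grams containing one mid-word typo)."""
    padded_len = num_chars + 2          # one boundary marker on each side
    typo_pos = 1 + num_chars // 2       # index of a character near the middle
    spans = ngram_spans(padded_len, max_n)
    hit = sum(1 for start, end in spans if start <= typo_pos < end)
    return len(spans), hit / len(spans)


for size in (4, 10, 100):
    total, share = typo_share(size)
    print(size, total, round(share, 3))
```

Under these assumptions a 4-character string has 20 n-grams and a single typo touches 11 of them (55%), while a 100-character string has 500 n-grams and the same typo touches only 15 (3%).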

Empirical Solutions
One possible solution is to try out a given metric against data from a particular wiki, and empirically determine thresholds that give better results; hopefully all the potential problems above will come out in the wash.

For example, perhaps on enwiki, having no second place language within 8% of the best is a very reliable result, while on ruwiki, anything within 20% is a bad sign, etc.

Depending on what we find, we could configure results based on a threshold per wiki, or a threshold per wiki per language, or other more complex arrangements—though I think we have a preference for simplicity unless we get significant accuracy gains.

We’d have to consider the quality benefits and complexity costs of any such solution, but we can try various permutations in vitro before deciding the best way to proceed in vivo.
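As a sketch of what empirically determining a threshold might look like for one wiki, the code below picks the cut-off on the second-to-first cost ratio that best separates good results from bad ones. Everything here is hypothetical: the function name, the candidate thresholds, and especially the sample data, which is invented purely to show the mechanics and is not from the A/B test.

```python
def pick_threshold(samples, candidates):
    """Pick the candidate threshold with the highest accuracy.

    samples: list of (second_to_first_ratio, was_good) pairs, where was_good
    could be, for example, whether the crosswiki results got a click.
    The rule being tuned is: trust the identification if ratio >= threshold.
    """
    def accuracy(threshold):
        correct = sum(
            1 for ratio, was_good in samples if (ratio >= threshold) == was_good
        )
        return correct / len(samples)

    # max() keeps the first (here: lowest) threshold when accuracies tie.
    return max(candidates, key=accuracy)


# Invented sample for one hypothetical wiki.
toy_samples = [
    (1.01, False), (1.03, False), (1.06, True),
    (1.09, True), (1.02, True), (1.20, True),
]
best_threshold = pick_threshold(toy_samples, candidates=[1.00, 1.05, 1.08])
print(best_threshold)
```

On this toy sample the 1.05 cut-off classifies five of the six queries correctly, against four of six for the other two candidates. A per-wiki-per-language variant would run the same search separately for each identified language, at the cost of needing more data per cell.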

Next Steps
If Mikhail can get me the query data from the TextCat A/B test, I can re-run TextCat on the queries and generate confidence scores using any of the metrics above (or others if anyone has suggestions!) and give them back to Mikhail to analyze in terms of zero results rate and user clicks.

Try the Demo!
Anyone can try out TextCat and see how it responds to various combinations of text to analyze and languages to consider with the TextCat demo.

More Ideas
This is a brain dump of some more random ideas I had that may prove useful, and are at least worth thinking about/testing:
 * ✓ Give a boost to the "host" language. If we're on enwiki, English gets a boost and maybe that'll keep French from getting too many false positives. On jawiki, Japanese gets a boost, and maybe Chinese doesn't get too many false positives. (DONE)
 * More generally, it may be possible to give boosts to several languages based on prior probabilities of being present, but that may be too complicated to configure precisely on the limited training data. One simplified possibility is to use the F0.5 score from the initial evaluation somehow.
 * Another generalization: have 2+ tiers of languages, and lower tiers (less likely in general) have to score better to be accepted. "host" language vs rest would be 2 tiers. Host vs commonly seen (i.e., > 0.5%) vs rest (including unseen languages) would be 3 tiers. Generally unambiguous scripts (Korean, Hebrew) could be raised a tier (or "half a tier") since they are unlikely to be wrong—though we still have to deal with mixed-script (and generally mixed-language) strings.
 * For languages with larger character sets, esp. Chinese, it may make sense to build a different kind of model that includes more distinct single characters, rather than n-grams. This could probably make distinguishing CJK languages from each other easier, but could make it harder to distinguish Mandarin from Cantonese, for example.
 * ✓ Add the ability to use Wikitext-based models with query-text-based models. It shouldn't be that hard to distinguish, say, မြန်မာအက္ခရာ from English even if the မြန်မာအက္ခရာ model is only based on Wikitext. That would reduce the number of errors by a little bit, and every little bit helps. (DONE)
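One simple way to picture the "host" language boost from the first idea above is to discount the cost of the boosted language before comparing. This is a sketch of the idea only: the function name, the 10% discount, and the multiplicative form are assumptions, and the boost that was actually implemented may work differently.

```python
def apply_boost(costs, boosted_langs, boost=0.10):
    """Reduce the cost of boosted languages by the given fraction.

    costs maps language code -> total cost (lower is better).
    """
    return {
        lang: cost * (1 - boost) if lang in boosted_langs else cost
        for lang, cost in costs.items()
    }


# Hypothetical costs on enwiki: French narrowly beats English without a boost.
raw_costs = {"en": 1040, "fr": 1000}
boosted_costs = apply_boost(raw_costs, boosted_langs={"en"})
print(min(raw_costs, key=raw_costs.get), min(boosted_costs, key=boosted_costs.get))
```

The tiered idea could be expressed the same way by giving each tier its own discount, with lower tiers getting a smaller one (or none).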