TextCat

TextCat is a language detection library based on n-gram text categorization used by the CirrusSearch extension for MediaWiki, which provides on-wiki search.

On Wikipedias where TextCat is deployed, a search that gets very few results will be sent to TextCat for language identification. If a likely language for the query can be determined, then the relevant Wikipedia is also searched, and any results are also displayed.

It is currently deployed to the Dutch, English, French, German, Italian, Japanese, Portuguese, Russian, and Spanish Wikipedias.

As an example, searching English Wikipedia for málvísindi (‘linguistics’ in Icelandic) gets no results, so results from the Icelandic Wikipedia are shown.

How TextCat works
TextCat attempts to determine the language of a search string by comparing it to a model for each language. These models are basically a ranked list of n-grams (1- to 5-letter sequences found in the text) that are the most common in a particular language.

For example, as you can see in the language model for French that TextCat uses, the character "é" appears higher in the ranking (currently line 41) than say the model for English, where that character appears much lower in the list (currently line 3781).

Language identification is generally harder on shorter strings of text, and search queries are often less than a dozen characters long. Some words are inherently ambiguous (for example the words "liaison" or "magistrate" in both English and French). Special characters and diacritics can be fairly distinctive (like German "ß" or Vietnamese "ế"), but some visitors don't include special characters when searching for a particular phrase or word, while others do.

Because of the differences in the language people use in queries, where possible, our language models are built on samples of actual queries. For other languages, we use articles from the relevant Wikipedia as samples of general text in the language.

Limitations of TextCat in on-wiki search
There are several limits we have to place on TextCat for reasons of efficiency, complexity, or accuracy. And there are some things that it has a bit of trouble with.

We limit the number of languages considered by TextCat on a given wiki for a couple of different reasons, and so the exact list differs by wiki (see  for the current list per wiki).


 * It’s much faster to consider fewer languages, so we only consider languages that are relatively common in queries on a particular wiki. So, Japanese Wikipedia does not consider Spanish when doing language detection because, historically, there haven’t been that many queries in Spanish there.
 * Some languages are more likely to cause confusion, but also don’t occur that frequently, so we omit them from consideration. For example, Dutch and Afrikaans are closely related, and it may be the case that queries in those languages are frequently confused. If there are twenty times as many Dutch queries as Afrikaans queries, but half the Dutch queries are incorrectly identified as Afrikaans, we would drop Afrikaans, because missing ten Afrikaans queries is better than getting 100 Dutch queries wrong.

We only consider at most the one “best” answer returned from TextCat.


 * Searching additional Wikis is computationally expensive, especially if we searched a lot of them. Also, for lower-quality language matches, the results are less likely to be meaningful—for example, there may be only one instance of a word on a given wiki, which is found in a title of a reference work.
 * From a user interface perspective, showing results from multiple wikis is complicated. The search results page can already be crowded, and we don’t have any UI/UX experts on the search team.
 * Generally, when the results from TextCat are ambiguous—meaning that two or more languages scored very similarly—the “best” answer is much more likely to be wrong. Since we are trying to provide useful additional results to the searcher, omitting these results improves accuracy.

Similar queries may get different results from TextCat. You have to draw the line somewhere, and there will always be related pairs of words that land on opposite sides of the line.


 * TextCat has a bias in favor of the language of the wiki you are on, and in favor of English. Because queries on a given wiki are most likely to be in the language of that wiki, we give that language a boost. For example, adorable is a word in English, French, and Spanish. On the Dutch Wikipedia, it might be identified as Spanish (over French and English), while on the English Wikipedia, it would be too ambiguous, because English gets a boost. (Note that this is just a hypothetical example; there are too many results for adorable on Dutch Wikipedia for TextCat to be invoked.)
 * It turns out to be good to not only give the language of the wiki a boost, but also the second most common language seen in queries. For every other wiki, this happens to be English. (On English Wikipedia, the second most common is Chinese.)
 * Because capitalization matters a little—Folklore is ever so slightly more likely to be German than English if it is capitalized in the middle of a sentence, for example—there will always be cases where the difference between different capitalizations happens to cross the threshold for ambiguous. So Folklore might just barely be recognized as German, while folkore is too ambiguous between German and English and ignored.

Chinese and Korean are harder for TextCat than you might expect. If a query is nothing but Chinese characters, it’s probably Chinese, right? Similarly if it’s all Korean characters. However, TextCat’s n-gram model means that for writing systems with a very large number of individual Unicode characters (Korean has eleven thousand, Chinese has tens of thousands), not all of them are in the model. And since people often search Wikipedia for “interesting” things, rarer characters are not at all unlikely to occur in queries from time to time. For large pieces of text, the more common characters are almost certain to occur, but in short queries, they may not. Throw in a few Latin characters in a query, and the result may suddenly become too ambiguous.

Rationale
People sometimes search using words that are not in the language of the wiki they are searching. Sometimes it works (e.g., Луковичная глава, Unión de Radioaficionados Españoles, or 汉语 on English Wikipedia), and sometimes it doesn't (e.g., force électromotrice on English Wikipedia)—but would if we could redirect the query to the right wiki in the same language as the query (e.g., force électromotrice on French Wikipedia). In order to do that we need to be able to detect the language of the query.

Origins and development
The original version of TextCat is a Perl library developed by Gertjan van Noord, based on a 1994 paper by Cavnar and Trenkle. The original TextCat is relatively lightweight and reasonably accurate, compared to other language identification libraries available when the search team first looked into using language identification.

The Wikimedia Foundation maintains a PHP port of this library available as a Composer package. It is used by the CirrusSearch extension for MediaWiki.

The PHP port has several new features that take advantage of a couple of decades of improvements to computer hardware to use much larger and more accurate language models, use Unicode, and to otherwise improve the ability of TextCat’s n-gram models to distinguish between languages while remaining relatively lightweight.

There is also an updated Perl version maintained by Trey Jones which also has all of the new features of the updated PHP version.

Training Data
To understand what makes a language look (or not look) like a particular language, training data was developed based upon historical query strings. These query strings were run against TextCat and used to build up the model for a given language. These corpora of text, sanitized from bots and errant searches, helped to 'teach' TextCat what n-grams commonly appear in a language. Using query data for training, rather than general text like Wikipedia article text, also gives more positive results in testing and improves the accuracy of the language detection for queries.

The PHP port of TextCat includes models built on query data (for use with queries), and models built on general Wikipedia article text, which may be more useful for generic language detection.

Maintenance
You can find tasks related to TextCat in Phabricator.

Updating the library
In order to update the deployment library once a change has been merged into the library repository:


 * 1) Tag the library with the new version and push the tag
 * 2) Check on wikimedia/textcat that the tag is updated
 * 3) Update   in
 * 4) Test on non-production install that after   everything runs smoothly
 * 5) Check out   repo
 * 6) Edit   and put new version of   there
 * 7) Run
 * 8) Make patch of the changes and put it to review on Gerrit.