TextCat

TextCat is a language detection library based on n-gram text categorization. The original version is a Perl library developed by Gertjan van Noord. The Wikimedia Foundation maintains a PHP port of this library, available as a Composer package.

Rationale
Language detection in search, particularly on short strings of text, is hard. Different languages often share words that are written identically (for example, "liaison" or "reservoir" in both English and French). Queries are often only a few characters long. Some include special characters such as the umlaut "ö", but not every visitor types them: some leave special characters out when searching for a particular phrase or word, others include them.

One way to detect the intended language of a search string, and potentially route the visitor to the wiki in the appropriate language, is to use a model of each language. Each model is essentially a ranked list of the characters and character sequences (n-grams) that are common in a particular language, derived from a set of training data.

For example, in the French language model that TextCat uses, the character "é" ranks higher than it does in the English model, where it appears much lower in the list.
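The ranking idea can be illustrated with a minimal sketch of rank-based n-gram comparison (the "out-of-place" measure commonly used in n-gram text categorization). This is not TextCat's actual code or API: the function names, the `top` cutoff, the fixed penalty for unseen n-grams, and the two tiny training samples are all assumptions made for the example.

```python
from collections import Counter


def ngrams(text, max_n=3):
    """Yield character n-grams of length 1..max_n, with '_' marking word edges."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_model(text, max_n=3, top=400):
    """Map each of the `top` most frequent n-grams to its rank (0 = most common)."""
    counts = Counter(ngrams(text, max_n))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ordered)}


def out_of_place(query_model, language_model):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(language_model)  # assumed penalty, chosen for this sketch
    return sum(
        abs(rank - language_model[gram]) if gram in language_model else penalty
        for gram, rank in query_model.items()
    )


def detect(query, models):
    """Return (best_language, scores); the lowest distance wins."""
    query_model = build_model(query)
    scores = {lang: out_of_place(query_model, m) for lang, m in models.items()}
    return min(scores, key=scores.get), scores


# Toy training text, invented for this example; real models need far more data.
SAMPLES = {
    "fr": "le café près de l'école élémentaire est fermé cet été",
    "en": "the coffee shop near the elementary school is closed this summer",
}
MODELS = {lang: build_model(text) for lang, text in SAMPLES.items()}
```

Calling `detect("école fermée", MODELS)` returns the language whose model is closest to the query, along with the per-language distances. With samples this small, "é" appears in the French toy model and is absent from the English one, which is an exaggerated version of the rank difference described above.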

Training Data
To understand what makes a query look (or not look) like a particular language, training data is built from past query strings. These query strings are run through TextCat and the results are compared with what the existing language model expects. This corpus of text, anonymized and with bot traffic and errant searches removed, helps 'teach' TextCat which characters commonly appear in a language. These slight changes yielded better results in testing and improve the accuracy of the language detection.
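Comparing detected languages against expected ones can be sketched as a small evaluation loop. The `evaluate` helper, the stand-in `toy_detector`, and the four labeled queries below are hypothetical and exist only for illustration; a real evaluation would call the actual detector on a much larger, sanitized query sample.

```python
from collections import Counter


def evaluate(detector, labeled_queries):
    """Run `detector` over (query, expected_language) pairs.

    Returns the overall accuracy and a Counter of (expected, detected) misses.
    """
    confusions = Counter()
    correct = 0
    for query, expected in labeled_queries:
        detected = detector(query)
        if detected == expected:
            correct += 1
        else:
            confusions[(expected, detected)] += 1
    accuracy = correct / len(labeled_queries) if labeled_queries else 0.0
    return accuracy, confusions


def toy_detector(query):
    """Stand-in for a real detector: guesses French whenever it sees 'é'."""
    return "fr" if "é" in query.lower() else "en"


# Hypothetical labeled queries, invented for this example.
LABELED = [
    ("café", "fr"),
    ("reservoir", "en"),
    ("liaison", "fr"),  # spelled the same in both languages, so the toy detector misses it
    ("élève", "fr"),
]

accuracy, confusions = evaluate(toy_detector, LABELED)
print(accuracy)    # 0.75
print(confusions)  # Counter({('fr', 'en'): 1})
```

The confusion counts show which language pairs are mistaken for each other, which is the kind of signal that can guide adjustments to the training data.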

Wikimedia Search
For search on wikis using TextCat, this means that the search function can detect the intended language of a query and present results in that language. While not in production on any wikis at the moment, TextCat is a step toward a search that detects the desired language with a high level of accuracy and therefore produces better results for search queries.

You can find tasks related to TextCat in Phabricator.

Updating the library
To update the library in deployment once a change has been merged into the library repository:
 * 1) Tag the library with the new version and push the tag
 * 2) Check on https://packagist.org/packages/wikimedia/textcat that the tag is updated
 * 3) Update   in
 * 4) Test on a non-production install that after   everything runs smoothly.
 * 5) Check out   repo.
 * 6) Edit   and put new version of   there.
 * 7) Run.
 * 8) Make a patch of the changes and submit it for review on Gerrit.
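The Packagist check in step 2 can also be scripted. The sketch below is an assumption-laden illustration, not part of the documented procedure: it assumes Packagist's Composer v2 metadata endpoint (`https://repo.packagist.org/p2/<vendor>/<package>.json`) and its `packages` / `version` layout, and the helper names and the version numbers in the usage note are hypothetical.

```python
import json
import urllib.request

PACKAGE = "wikimedia/textcat"
# Assumed endpoint: Packagist's Composer v2 metadata URL for this package.
METADATA_URL = f"https://repo.packagist.org/p2/{PACKAGE}.json"


def published_versions(metadata, package=PACKAGE):
    """Extract version strings from an already-parsed metadata document."""
    entries = metadata.get("packages", {}).get(package, [])
    return [entry["version"] for entry in entries if "version" in entry]


def tag_is_published(tag, metadata, package=PACKAGE):
    """True if `tag` (with or without a leading 'v') is among the versions."""
    wanted = tag.lstrip("v")
    return any(v.lstrip("v") == wanted for v in published_versions(metadata, package))


def fetch_metadata(url=METADATA_URL):
    """Download and parse the metadata; requires network access."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

For example, `tag_is_published("v9.9.9", fetch_metadata())` would report whether a (hypothetical) tag `v9.9.9` is visible on Packagist yet. Keeping the parsing separate from the download makes the check easy to exercise without network access.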