TextCat

TextCat is a language detection library based on n-gram text categorization. The original version is a Perl library developed by Gertjan van Noord. The Wikimedia Foundation maintains a PHP port of this library, available as a Composer package.
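
For anyone who wants to try the PHP port, it can be pulled into a project with Composer in the usual way, using the package name shown on Packagist:

  composer require wikimedia/textcat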

Rationale

People sometimes search using words that are not in the language of the wiki they are searching. Sometimes this works (e.g., Луковичная глава, Unión de Radioaficionados Españoles, or 汉语 on English Wikipedia), and sometimes it doesn't (e.g., force électromotrice on English Wikipedia), even though the same query would succeed on a wiki in the query's own language (e.g., force électromotrice on French Wikipedia). To redirect the query to the right wiki, we first need to detect the language of the query.

Language detection is generally harder on shorter strings of text, and search queries are often less than a dozen characters long. Some words are inherently ambiguous: "liaison" and "magistrate", for example, are words in both English and French. Special characters and diacritics can be fairly distinctive (like German "ß" or Vietnamese "ế"), but some visitors omit them when searching for a particular word or phrase, while others include them.

TextCat determines the language of a search string by comparing it to a model for each language. Each model is essentially a ranked list of the n-grams most common in a particular language, built from a set of training data. The query is reduced to a ranked n-gram list of its own, and the language whose model ranking is closest to the query's ranking (the smallest "out-of-place" distance) is the best guess.

For example, in the language model for French that TextCat uses, the character "é" appears high in the ranking (currently line 41), while in the model for English the same character appears much lower in the list (currently line 3781).
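
To make the comparison concrete, here is a minimal classification sketch using the PHP port, following the constructor and classify() usage shown in the port's README. The model directory path, the return format, and the unnamespaced class name are assumptions that may vary by version:

  <?php
  require_once 'vendor/autoload.php';

  // Point the classifier at a directory of language models; LM-query is
  // assumed here to be the query-trained model set shipped with the port.
  $cat = new TextCat( 'vendor/wikimedia/textcat/LM-query' );

  // classify() scores the string against each model's n-gram ranking;
  // the best-matching languages are assumed to come first.
  $result = $cat->classify( 'первым экспериментом' );
  print key( $result ); // expected: "ru"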

Another way of looking at how TextCat improves things is to take a Russian-language query (первым экспериментом) typed into an enwiki search box. With the existing search capability, no results are found. With TextCat added, the search still finds no English results, but because the query is detected as Russian, results can also be returned from Russian-language content.

Training Data

To understand what makes a string of text look (or not look) like a particular language, training data is built from past query strings. These corpora, sanitized of bot traffic and errant searches, are used to 'teach' TextCat which n-grams commonly appear in a given language by building up that language's model. Using query data for training, rather than general text such as Wikipedia articles, gave better results in testing and improves the accuracy of language detection for queries.

The PHP port of TextCat includes models built on query data (for use with queries), and models built on general Wikipedia article text, which may be more useful for generic language detection.
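
The port also ships command-line tools alongside the classifier class. A minimal sketch, assuming the felis.php (model generation) and catus.php (classification) scripts described in the port's README, where the input is a directory of per-language training text and the output is a directory of generated n-gram files:

  # Build language models from a directory of training texts.
  php felis.php path/to/training-texts path/to/generated-models

  # Classify a string against the query-trained models.
  echo "первым экспериментом" | php catus.php -d LM-query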

Wikimedia Search

For search on wikis using TextCat, this means the search function can detect the language of a query and present results in that language. While not in production on any wikis at the moment, TextCat is a step toward a search that can detect the desired language with a high level of accuracy and therefore produce better results for search queries.

You can find tasks related to TextCat in Phabricator.

Updating the library

To update the deployed library once a change has been merged into the library repository (a sketch of the corresponding commands follows the list):

  1. Tag the library with the new version and push the tag.
  2. Check on https://packagist.org/packages/wikimedia/textcat that the new tag has been picked up.
  3. Update composer.json in extension/CirrusSearch.
  4. Test on a non-production install that everything runs smoothly after composer update --no-dev.
  5. Check out the mediawiki/vendor repo.
  6. Edit composer.json and put the new version of wikimedia/textcat there.
  7. Run composer update --no-dev.
  8. Create a patch with the changes and submit it for review on Gerrit.
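
As referenced above, here is a sketch of the commands behind steps 1, 6, and 7, assuming a hypothetical version number 9.9.9 (substitute the real one) and the usual Composer version-constraint syntax:

  # Step 1: tag the new version in the library repository and push the tag.
  git tag 9.9.9
  git push origin 9.9.9

  # Step 6: in the mediawiki/vendor checkout, require the new version in
  # composer.json (the exact constraint style may differ):
  #   "wikimedia/textcat": "9.9.9"

  # Step 7: update the installed packages without development dependencies.
  composer update --no-dev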
