Universal Language Selector/Technical Design

This is the technical design document for the Universal Language Selector. It explains the structure of jQuery.uls and how the MediaWiki Universal Language Selector extension uses it.

Design principles

 * The jQuery plugin is generic and reusable, even though it is developed together with the MediaWiki integration.
 * Language search works in a language-independent way: users can search for languages in any script.
 * Language selection adapts to various use cases.
 * All language-related preferences should be in one place, abstracting the implementation differences between anonymous and logged-in users.

Core component: $.fn.uls


The jQuery extension function $.fn.uls is used to add ULS to a trigger. A trigger is an HTML element: it can be a link, button, image or, in theory, anything. Clicking that element shows ULS.

When ULS is bound to a trigger, an instance of ULS is associated with that trigger through the data-uls HTML attribute, so the instance is defined on the trigger once the binding is done.

As an example of integrating ULS with an element: clicking the element opens ULS, and the onSelect option is used to show the autonym of the selected language as the text of the element #pageLanguage.
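
A minimal sketch of such an integration could look like the following. It is illustrative only: the .uls-trigger selector, the autonyms lookup table and the wrapper function bindLanguageSelector are assumptions made for this example, and the callback is assumed to receive the selected language code. jQuery is passed in as a parameter only to keep the snippet self-contained.

```javascript
// Hypothetical autonym table for this sketch; ULS ships its own language data.
const autonyms = { fi: 'suomi', fr: 'français', de: 'Deutsch' };

// Binds ULS to every element matching the (assumed) .uls-trigger selector.
// `$` is the jQuery object with the jquery.uls plugin loaded.
function bindLanguageSelector($) {
  $('.uls-trigger').uls({
    // Assumed to be called with the code of the selected language.
    onSelect: function (language) {
      // Fall back to the bare code when no autonym is known.
      $('#pageLanguage').text(autonyms[language] || language);
    }
  });
}
```

On a page that loads jQuery and jquery.uls, bindLanguageSelector( jQuery ) would be called once the document is ready.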

Options

 * menu: The HTML template for the ULS popup. ULS has a default HTML template, but it can be customized with this option.
 * onSelect: Callback function that is called when a language is selected, that is, when Enter is pressed in the language selector or the user clicks on a language. Use this to act on the language selection.
 * searchAPI: Language search API. This should be an API URL that takes a parameter named search with a string value. It should return JSON with language codes as keys and language names as values. All the key-value pairs should be in an object under the key languagesearch. A sample API: http://translatewiki.net/w/api.php?action=languagesearch&search=finish. Default value is null.
 * languages: Languages to be used for ULS. The default is all languages supported by ULS. To work with a subset of languages, pass that list here. The value should be an object with language codes as keys and the language names to be displayed in the list as values.
 * quickList: Array of language codes, or a function that returns such an array. Use it to provide a list of the languages the user is most likely to choose, based for example on GeoIP suggestions or previously used languages. Default value is null.
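
To make the response shape expected from searchAPI concrete, the sketch below turns such a response into a list of matches. The sample payload and the function name parseLanguageSearch are made up for illustration; only the languagesearch wrapper key and the mapping from language code to language name come from the description above.

```javascript
// Hypothetical response body; real results depend on the API and the query.
const sampleResponse = {
  languagesearch: { fi: 'finnish' }
};

// Converts a searchAPI response into an array of { code, name } pairs.
// Returns an empty array when the languagesearch key is missing.
function parseLanguageSearch(response) {
  const results = (response && response.languagesearch) || {};
  return Object.keys(results).map(function (code) {
    return { code: code, name: results[code] };
  });
}

console.log(parseLanguageSearch(sampleResponse));
```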

Language filter: $.fn.languagefilter
Language filter implements the search feature.

Region filter: $.fn.regionselector
Region selector handles updating the language list when regions are selected and updating the active region when the language list is scrolled.

Language category display: $.fn.lcd
Used for showing the search results. Languages are grouped by world regions, then subregions and finally by scripts. Languages are displayed in columns and the columns are sorted alphabetically.

MediaWiki Universal Language Selector extension
The MediaWiki Universal Language Selector extension (ULS) is an implementation of jQuery.uls integrated with MediaWiki. It enhances jQuery.uls with additional features such as the search API, language change integration, preferences integration and limiting the languages to those supported by MediaWiki. Currently ULS only does MediaWiki interface language selection and webfonts support. In the future it will be extended to support content language selection (interwiki links and page translation) and language selection as form input.

Language Search API
The translations of language names into other languages are based on CLDR. We use Extension:CLDR to get the language name translations. This is a big data set: if we support 1000 languages, we have 1000 language names translated into each of the 1000 languages, resulting in a matrix of 1000 × 1000 entries.

A cross-language search in this matrix is triggered by every key press in the search window, so it has to be really fast. For that we use a special indexing algorithm.

Indexing language names
The language name matrix is too big to search directly. To make the search very fast, we index the language names. Language names are placed in a lot of buckets, and each bucket is indexed with a key: the Unicode code point of the starting letter mod 1000. This is based on the following facts and assumptions:


 * People do not make typos in the first letter of a search.
 * If the first letter is in, say, Tamil, the rest of the characters in the search query will be in Tamil.
 * If the first letter is in Tamil, we need to search only in Tamil. That means that if we have 1000 languages, all translated into Tamil, in the worst case we have to do 1000 comparisons.
 * Most of the languages are written in Latin script, and if we used the Unicode code point of the starting letter mod 1000, those buckets would be dense and overpopulated. So for code points less than 1000, the index is the code point itself. That puts all translations starting with 'a' in a single bucket and names starting with 'b' in another bucket.
 * The Unicode range of a script typically spans fewer than 1000 code points, so the names in one script fall into one or two buckets. All language names in Malayalam script are in the bucket with key 3, since the code points lie between 3330 and 3455. For Tamil it is bucket 2 or 3, since the code points lie between 2946 and 3066. Note that these example keys are the code point divided by 1000 and rounded down; the remainder 3330 % 1000 would be 330. We use 1000 because we do not want too many buckets.
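
The bucketing rule can be sketched as follows. The text describes the key as "mod 1000", but the worked examples (Malayalam in bucket 3, Tamil in bucket 2 or 3) correspond to integer division by 1000, so this sketch assumes integer division for code points of 1000 and above. The function names bucketKey and buildIndex are hypothetical, and this is not the code of data/LanguageNameIndexer.php.

```javascript
const BUCKET_SIZE = 1000;

// Returns the bucket key for a non-empty language name, based on the
// code point of its first character.
function bucketKey(name) {
  const codePoint = name.codePointAt(0);
  if (codePoint < BUCKET_SIZE) {
    // Latin and other low code points: one bucket per starting letter.
    return codePoint;
  }
  // ASSUMPTION: integer division, which matches the examples in the text.
  return Math.floor(codePoint / BUCKET_SIZE);
}

// Groups [languageCode, translatedName] pairs into a Map of buckets.
function buildIndex(entries) {
  const index = new Map();
  for (const [code, name] of entries) {
    const key = bucketKey(name);
    if (!index.has(key)) {
      index.set(key, []);
    }
    index.get(key).push({ code: code, name: name });
  }
  return index;
}

console.log(bucketKey('a'));                              // 97
console.log(bucketKey('\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD')); // Tamil name, first letter U+0BA4 (2980): 2
console.log(bucketKey('\u0D2E\u0D32'));                   // Malayalam text, first letter U+0D2E (3374): 3
```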

The index is version controlled and needs to be updated only rarely (for example, when a new CLDR version is released). A PHP script, data/LanguageNameIndexer.php, generates it. The output is a serialized index file, data/langnames.ser (around 899,726 bytes with CLDR version 21).

Searching for a language
The following steps are involved in searching:


 * Load the serialized language name index (only once).
 * Find the Unicode code point of the first letter of the query string and calculate the bucket key (index) from it. It is the code point mod 1000 if the code point is greater than 1000, or the code point itself if it is less than 1000.
 * Locate the bucket and iterate through each of its entries.
 * Compare each entry with the query using the Levenshtein distance algorithm with a maximum distance of one. That means the search accepts all strings that are exactly equal or differ by one typo.
 * Return the language code and translated name of each entry that passed.

API : LanguageNameSearch::search( $searchKey, $typos = 0 )
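
The search steps above can be sketched in JavaScript as follows. This is not the extension's PHP code: searchLanguages, levenshtein, latinKey and the sample index are hypothetical names and data for illustration. The index is assumed to be already loaded as a Map from bucket key to entries, and the key function is passed in so that the sketch does not depend on how bucket keys are computed.

```javascript
// Levenshtein distance over Unicode code points (Array.from splits a
// string by code point, not by UTF-16 code unit).
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  let previous = [];
  for (let j = 0; j <= t.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= s.length; i++) {
    const current = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      current.push(Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution
      ));
    }
    previous = current;
  }
  return previous[t.length];
}

// Looks up the bucket for the query and keeps the entries whose name is
// within maxTypos edits of the query. Returns { languageCode: name }.
function searchLanguages(index, keyOf, query, maxTypos) {
  const bucket = index.get(keyOf(query)) || [];
  const matches = {};
  for (const entry of bucket) {
    if (levenshtein(entry.name, query) <= maxTypos) {
      matches[entry.code] = entry.name;
    }
  }
  return matches;
}

// Sample data: a key function that is enough for Latin-only names, and a
// tiny pre-built index with one bucket for names starting with 'f' (102).
const latinKey = (text) => text.codePointAt(0);
const sampleIndex = new Map([
  [102, [{ code: 'fi', name: 'finnish' }, { code: 'fr', name: 'french' }]]
]);

console.log(searchLanguages(sampleIndex, latinKey, 'finish', 1)); // { fi: 'finnish' }
```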

Typo correction
The language name comparison is based on Levenshtein distance. Basically, it tells how many typos one word has compared to another word. We allow one typo in the search query, so a search for 'finish' (a typo) matches 'finnish'.

PHP's native Levenshtein distance implementation does not support multibyte characters, but we need to search in all the languages of the world that can be written in Unicode. So we wrote a custom Levenshtein function that supports multibyte Unicode characters.
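
The effect of comparing bytes instead of characters can be illustrated with a small JavaScript sketch. It is not the extension's PHP function: editDistance, byBytes and byCodePoints are hypothetical names, and byte-wise comparison is emulated here by encoding the strings as UTF-8 with TextEncoder. The two sample words differ in a single Tamil vowel sign, which is one typo for the user but two differing bytes in UTF-8.

```javascript
// Levenshtein distance between two arrays of comparable items.
function editDistance(s, t) {
  let previous = [];
  for (let j = 0; j <= t.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= s.length; i++) {
    const current = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[t.length];
}

const utf8 = new TextEncoder();

// Distance counted in UTF-8 bytes (emulates a byte-oriented implementation).
function byBytes(a, b) {
  return editDistance(Array.from(utf8.encode(a)), Array.from(utf8.encode(b)));
}

// Distance counted in Unicode code points (what the search needs).
function byCodePoints(a, b) {
  return editDistance(Array.from(a), Array.from(b));
}

const correct = '\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD'; // "Tamil" in Tamil script
const typo = '\u0BA4\u0BAE\u0BC0\u0BB4\u0BCD';    // vowel sign U+0BBF replaced by U+0BC0

console.log(byBytes(correct, typo));      // 2
console.log(byCodePoints(correct, typo)); // 1
```

With a maximum distance of one, the byte-wise comparison would reject this single-character typo, while the code point comparison accepts it.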

Naming convention
All jQuery plugin components that do not depend on MediaWiki are named with the prefix jquery.uls, for example jquery.uls.core.js and jquery.uls.css.

MediaWiki-specific customizations and extensions of jquery.uls are in files with the prefix ext.uls.

Coding guidelines
jQuery.uls follows jQuery core coding guidelines.

MediaWiki-specific parts (ext.uls.*) follow the MediaWiki coding conventions, which are similar to the above but with fewer exceptions for whitespace.