Universal Language Selector/Technical Design

From mediawiki.org
Universal Language Selector with geoip based language suggestion for a user from India

This is the technical design document of Universal Language Selector. It explains both the structure of jQuery.uls and how it is used by the MediaWiki Universal Language Selector extension.

Design principles

  • The jQuery plugin is generic and reusable, even though it is developed together with the MediaWiki integration.
  • Language search works in a language-independent way: users can search for languages in any script.
  • Language selection adapts to a variety of use cases.
  • All language-related preferences should be in one place, abstracting away the implementation differences between anonymous and logged-in users.

jQuery plugin architecture

Core component: $.fn.uls

ULS trigger in personal toolbar

The jQuery extension function $.fn.uls attaches ULS to a trigger. A trigger is an HTML element: it can be a link, a button, an image or, theoretically, anything. Clicking that element shows ULS.

When ULS is bound to a trigger, an instance of ULS is associated with that trigger using the data-uls HTML attribute.

$.fn.uls is defined in jquery.uls.core.js.

An example showing the integration of ULS with an element is given below. Clicking the element opens ULS. The example uses the onSelect option to show the autonym of the selected language as the text of the element #pageLanguage.

// Bind the ULS to jQuery element with id uls-trigger
$( '#uls-trigger' ).uls( { 
	onSelect: function( language ) {
		// Do something with the selected language
		var languageName = $.uls.data.autonym( language ); // get the language autonym
		$( '#pageLanguage' ).text( languageName );
	}
} );

Options

menu
This should be the HTML template for the ULS popup. ULS has a default HTML template, which can be customized with this option.
onSelect
Callback function to be called when a language is selected. It is called when Enter is pressed in the language selector or when the user clicks a language. Use it to act on the selected language.
searchAPI
Language search API. This should be an API URL that takes a parameter named search with a string value. It should return JSON in which each key is a language code and each value is a language name, with all key-value pairs wrapped in an object under the key languagesearch. A sample API: http://translatewiki.net/w/api.php?action=languagesearch&search=finish. The default value is null.
languages
Languages to be used for ULS. The default is all languages supported by ULS ($.uls.data.autonyms()). To work with a subset of languages, pass that list here. The value should be an object with the language code as key and the language name to be displayed in the list as value.
quickList
An array of language codes, or a function that returns such an array. Use it to provide a list of the languages the user is most likely to choose, based for example on geoip suggestions or previously used languages. The default value is null.
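
As a rough illustration of how these options fit together, the sketch below builds an options object with a restricted languages map and a quickList given as a function. The helper resolveQuickList and the sample language names are assumptions made for this example; they are not part of jquery.uls.

```javascript
// Hypothetical helper: quickList may be an array, a function returning an
// array, or null (the documented default).
function resolveQuickList( quickList ) {
	if ( typeof quickList === 'function' ) {
		return quickList();
	}
	return Array.isArray( quickList ) ? quickList : [];
}

// Options in the shape described above. The language names are sample data.
var options = {
	// Restrict ULS to a subset: language code -> name shown in the list
	languages: {
		fi: 'suomi',
		hi: 'हिन्दी',
		ml: 'മലയാളം',
		ta: 'தமிழ்'
	},
	// Suggest likely languages, for example from geoip or earlier choices
	quickList: function () {
		return [ 'hi', 'ta' ];
	},
	searchAPI: null,
	onSelect: function ( language ) {
		return options.languages[ language ];
	}
};

// In a page with jQuery and jquery.uls loaded, this object would be passed
// as $( '#uls-trigger' ).uls( options );
console.log( resolveQuickList( options.quickList ) );
```

Accepting either an array or a function for quickList lets the caller defer the decision until the selector is actually opened.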

Language filter: $.fn.languagefilter

The language filter is responsible for text-based search of language names. A language can be searched by its autonym, its name written in the current UI language, its script name, or its ISO 639-2 language code. The filter uses the ULS data module to get all the information needed for the search. All matching results are passed to the result target.

The result target is a jQuery object. In ULS it is the lcd component (see below), but it can be any element in the page, for example a div, ul or ol. languagefilter calls the append() method on that object and does not know about the rendering mechanism of the target. That is, the language filter knows how to search; rendering the results is not its responsibility.

If searchAPI is given as an option, the language filter uses that API to get results. Since the API call is asynchronous, the filter calls the append method on the target again once the results arrive. Note that the append method is called once for each matching result.

Optionally, the language filter takes two result handlers: a success handler, called when at least one result is found, and a no-result handler, called when no results are found. These callbacks can be passed to the language filter as options. In ULS, the core component sets them.
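
The division of labour described above can be sketched without jQuery. In this minimal example the function filterLanguages, the sample data and the prefix-matching rule are all assumptions for illustration; the real filter reads its data from the ULS data module and may match differently.

```javascript
// Sample data; the real filter reads this from the ULS data module.
var sampleLanguages = {
	fi: { autonym: 'suomi', name: 'Finnish', script: 'Latn' },
	ml: { autonym: 'മലയാളം', name: 'Malayalam', script: 'Mlym' },
	ta: { autonym: 'தமிழ்', name: 'Tamil', script: 'Taml' }
};

// Hypothetical filter: matches the query as a prefix of the code, autonym,
// name in the UI language or script, then hands each match to target.append().
function filterLanguages( query, target, handlers ) {
	var q = query.toLowerCase(),
		matches = 0;

	Object.keys( sampleLanguages ).forEach( function ( code ) {
		var lang = sampleLanguages[ code ],
			fields = [ code, lang.autonym, lang.name, lang.script ],
			hit = fields.some( function ( field ) {
				return field.toLowerCase().indexOf( q ) === 0;
			} );

		if ( hit ) {
			target.append( code );
			matches++;
		}
	} );

	if ( matches > 0 ) {
		handlers.success( matches );
	} else {
		handlers.noresults( query );
	}
	return matches;
}

// Any object with an append() method can act as the result target.
var collected = [];
var target = { append: function ( code ) { collected.push( code ); } };
var log = [];

filterLanguages( 'ta', target, {
	success: function ( n ) { log.push( 'found ' + n ); },
	noresults: function ( q ) { log.push( 'nothing for ' + q ); }
} );
```

Because the filter only ever calls append() on the target, the same search logic can feed a plain list, the lcd component, or a test double like the one above.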

Region filter: $.fn.regionselector

The region selector updates the language list when a region is selected and updates the active region when the language list is scrolled. Like the language filter, it works on the data provided by uls.data. Since the number of regions is finite, the region filter results are cached.

Region filters are bound to each region, not to the full map. The trigger does not have to be a map at all; even a link will do. Filtering happens based on the data-region or data-regiongroup attribute. In ULS, each of the map sections is a region filter. The links to regions shown on the no-results page are region filters as well.

Like the language filter, the region filter calls the append method on its target, which in ULS is the lcd component (see below).

Region filters expose a next() method that loads the next region. It is triggered when the lcd component is scrolled to its end.
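
A minimal sketch of a cached region filter with a next() method follows. The constructor RegionFilter, the region codes, the region order and the sample language lists are assumptions for this example and do not reproduce the plugin's code.

```javascript
// Sample data; real region data comes from the ULS language database.
var regionOrder = [ 'AM', 'EU', 'AS' ];
var languagesByRegion = {
	AM: [ 'en', 'es', 'pt' ],
	EU: [ 'de', 'fi', 'fr' ],
	AS: [ 'hi', 'ml', 'ta' ]
};

// Hypothetical region filter bound to one region and one result target.
function RegionFilter( region, target ) {
	this.region = region;
	this.target = target;
}

// Results per region do not change, so they are computed once and reused.
RegionFilter.cache = {};

RegionFilter.prototype.show = function () {
	var region = this.region,
		target = this.target;

	if ( !RegionFilter.cache[ region ] ) {
		RegionFilter.cache[ region ] = languagesByRegion[ region ].slice().sort();
	}
	RegionFilter.cache[ region ].forEach( function ( code ) {
		target.append( code, region );
	} );
};

// Shows the following region and returns its filter, or null at the end.
RegionFilter.prototype.next = function () {
	var nextFilter,
		index = regionOrder.indexOf( this.region ) + 1;

	if ( index >= regionOrder.length ) {
		return null;
	}
	nextFilter = new RegionFilter( regionOrder[ index ], this.target );
	nextFilter.show();
	return nextFilter;
};

var shown = [];
var listTarget = {
	append: function ( code, region ) { shown.push( region + ':' + code ); }
};

var americas = new RegionFilter( 'AM', listTarget );
americas.show();
var europe = americas.next(); // what a scroll-end handler would call
```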

Language category display: $.fn.lcd

This component is used for showing the search results. Languages are grouped by world region, then by subregion and finally by script. Languages are displayed in columns, and the columns are sorted alphabetically.

This component receives the language results from regionfilter or languagefilter. It arranges the results in columns, groups them by script and places them in the appropriate regions.

The handlers for click events on the languages are attached to the language links in this component.

If no results are found, this component shows the quick list of languages as suggestions, together with a help text for searching.
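
The grouping and column layout can be illustrated with a small sketch. The functions groupResults and toColumns and the sample records are hypothetical, and subregions are left out to keep the example short.

```javascript
// Sample results; the real component receives them through append().
var results = [
	{ code: 'ta', autonym: 'தமிழ்', region: 'AS', script: 'Taml' },
	{ code: 'fi', autonym: 'suomi', region: 'EU', script: 'Latn' },
	{ code: 'ml', autonym: 'മലയാളം', region: 'AS', script: 'Mlym' },
	{ code: 'de', autonym: 'Deutsch', region: 'EU', script: 'Latn' }
];

// Group languages by region, then by script, and sort each group by autonym.
function groupResults( languages ) {
	var grouped = {};

	languages.forEach( function ( lang ) {
		grouped[ lang.region ] = grouped[ lang.region ] || {};
		grouped[ lang.region ][ lang.script ] = grouped[ lang.region ][ lang.script ] || [];
		grouped[ lang.region ][ lang.script ].push( lang );
	} );

	Object.keys( grouped ).forEach( function ( region ) {
		Object.keys( grouped[ region ] ).forEach( function ( script ) {
			grouped[ region ][ script ] = grouped[ region ][ script ]
				.sort( function ( a, b ) {
					return a.autonym.localeCompare( b.autonym );
				} )
				.map( function ( lang ) {
					return lang.code;
				} );
		} );
	} );
	return grouped;
}

// Split a flat list into columns of at most `size` items.
function toColumns( items, size ) {
	var i,
		columns = [];

	for ( i = 0; i < items.length; i += size ) {
		columns.push( items.slice( i, i + size ) );
	}
	return columns;
}

var grouped = groupResults( results );
var columns = toColumns( grouped.EU.Latn, 1 );
```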

Language database

ULS comes with an extensive repository of language information. It contains:

  • Language codes for all languages supported in CLDR.
  • The autonym of each language.
  • The regions in which each language is used. The regions are Worldwide, America, Europe, Africa, Middle East, Asia and Pacific.
  • The territories in which each language is present. This uses CLDR supplemental data to indicate the weight of that presence as well.
  • Script information for each language.

The data is generated by a PHP script that downloads the data from CLDR and converts it to a JavaScript-based format for ULS.

Around this data there are a number of utility methods for getting languages by script, region and so on, and for getting the autonym, script, territory and so on of a language. The module can also sort languages by autonym.

Module: $.uls.data, defined in jquery.uls.data.js and jquery.uls.data.utils.js
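
To make the kind of utility concrete, here is a small sketch over a hand-written sample table. The record layout (script, regions, autonym) and the function names are assumptions for this example and should not be read as the $.uls.data API.

```javascript
// Sample entries in an assumed shape: code -> [ script, regions, autonym ].
var languageData = {
	de: [ 'Latn', [ 'EU' ], 'Deutsch' ],
	fi: [ 'Latn', [ 'EU' ], 'suomi' ],
	hi: [ 'Deva', [ 'AS' ], 'हिन्दी' ],
	pt: [ 'Latn', [ 'EU', 'AM', 'AF' ], 'português' ]
};

// Hypothetical accessors for single fields.
function scriptOf( code ) {
	return languageData[ code ][ 0 ];
}

function autonymOf( code ) {
	return languageData[ code ][ 2 ];
}

// All language codes used in the given region.
function languagesInRegion( region ) {
	return Object.keys( languageData ).filter( function ( code ) {
		return languageData[ code ][ 1 ].indexOf( region ) !== -1;
	} );
}

// All language codes written in the given script.
function languagesInScript( script ) {
	return Object.keys( languageData ).filter( function ( code ) {
		return scriptOf( code ) === script;
	} );
}

// Returns a copy of the codes sorted by the autonym of each language.
function sortByAutonym( codes ) {
	return codes.slice().sort( function ( a, b ) {
		return autonymOf( a ).localeCompare( autonymOf( b ) );
	} );
}

console.log( sortByAutonym( languagesInScript( 'Latn' ) ) );
```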

MediaWiki Universal Language Selector extension

The MediaWiki Universal Language Selector extension (ULS) is an implementation of jQuery.uls integrated with MediaWiki. It enhances jQuery.uls with additional features such as a search API, language change integration, preferences integration, and limiting the languages to those supported by MediaWiki. Currently ULS only handles MediaWiki interface language selection and webfonts support. In the future it will be extended to support content language selection (interwiki links and page translation) and language selection as form input.

Language settings

This is a generic language settings jQuery plugin. It provides a popup that lists the settings items alongside a content pane where the specific language settings modules can render their content. Language settings has a module system in which specific settings modules (for example, input tool settings or display settings) can register themselves.

From each registered module, the language settings plugin expects the following:

  • A module ID. This is the key for the module.
  • A module name. This string is used for listing the setting.
  • A module description. This short description of the module is also used for listing the module in the language settings popup.
  • A render method. Language settings calls the module's render method, and the module is responsible for rendering the content of the settings pane.
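
The contract above can be sketched as a small registry. The names registerModule, listModules and showModule are hypothetical, and a plain object stands in for the content pane; the actual plugin works with DOM elements.

```javascript
// Hypothetical registry: module ID -> module object.
var languageSettingsModules = {};

// Register a module after checking that it provides the four expected parts.
function registerModule( module ) {
	[ 'id', 'name', 'description', 'render' ].forEach( function ( key ) {
		if ( !module[ key ] ) {
			throw new Error( 'Language settings module is missing: ' + key );
		}
	} );
	languageSettingsModules[ module.id ] = module;
}

// Build the entries for the settings list in the popup.
function listModules() {
	return Object.keys( languageSettingsModules ).map( function ( id ) {
		var module = languageSettingsModules[ id ];
		return module.name + ': ' + module.description;
	} );
}

// Ask one module to fill the content pane. The settings plugin does not
// know what the module renders; it only calls render().
function showModule( id, pane ) {
	languageSettingsModules[ id ].render( pane );
	return pane;
}

registerModule( {
	id: 'display',
	name: 'Display',
	description: 'Interface language and fonts',
	render: function ( pane ) {
		pane.content = 'Display settings go here';
	}
} );

console.log( listModules() );
```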


Display settings

ULS language settings

This settings module is for selecting the UI language and the webfonts for a language. It uses jquery.webfonts for working with webfonts and the ULS preferences system (see below) for persisting the preferences.

ULS is triggered from this module to get additional languages for the UI. See the ... button on the screen.

Input method settings

This language settings module integrates jquery.ime with webfonts and provides input method preferences for ULS.

Preference system

Language settings are available to both anonymous and logged-in users. For logged-in users, the preferences are saved in the back-end preferences system. For anonymous users, preferences are saved in browser local storage where possible. The ULS preference system abstracts these different ways of retrieving and storing preferences. The preference instance is a singleton, so updates made to the preferences by any component on the screen are always in sync.

Module: mw.uls.preferences, defined in ext.uls.preferences.js
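
The abstraction can be pictured as one shared object in front of two interchangeable stores. The sketch below is an assumption-laden illustration: createStore, getPreferences and the in-memory stores are invented for this example and are not the mw.uls.preferences API.

```javascript
// Hypothetical in-memory stand-ins for the two storage back ends.
function createStore() {
	var data = {};
	return {
		load: function () { return data; },
		save: function ( values ) { data = values; }
	};
}

var userOptionsStore = createStore(); // stands in for back-end preferences
var localStore = createStore(); // stands in for browser local storage

var preferencesInstance = null;

// Returns the single shared preferences object, creating it on first use.
// Callers never see which store is behind it.
function getPreferences( isLoggedIn ) {
	var store, values;

	if ( preferencesInstance ) {
		return preferencesInstance;
	}
	store = isLoggedIn ? userOptionsStore : localStore;
	values = Object.assign( {}, store.load() );

	preferencesInstance = {
		get: function ( key ) { return values[ key ]; },
		set: function ( key, value ) { values[ key ] = value; },
		save: function () { store.save( Object.assign( {}, values ) ); }
	};
	return preferencesInstance;
}

// Two components asking for preferences get the same instance, so a change
// made by one is immediately visible to the other.
var fromDisplay = getPreferences( false );
var fromInput = getPreferences( false );
fromDisplay.set( 'language', 'ml' );
fromDisplay.save();
```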

Language Search API

The translations of language names into other languages are based on CLDR. We use Extension:CLDR for getting the language name translations. This is a big data set. If we have 1000 languages to support, we will have 1000 language names, each translated into the other 1000 languages, resulting in a matrix of 1000 × 1000 entries.

A cross-language search in this matrix is triggered by every key press in the search window, so it has to be really fast. For that we use a special indexing algorithm.

Indexing language names

The language name matrix is too big to search directly. To make the search fast, we index the language names. Language names are placed in a large number of buckets, each keyed by the Unicode code point of the name's first letter mod 1000. This is based on the following facts and assumptions:

  • People do not make typos in the first letter of a search.
  • If the first letter is in, say, Tamil, the rest of the characters in the search query will be in Tamil too.
  • If the first letter is in Tamil, we need to search only among the Tamil names. That means that if we have 1000 languages, all translated into Tamil, we have to do 1000 comparisons in the worst case.
  • Most languages are written in Latin script, and if we use the code point of the first letter mod 1000, those buckets will be dense and overpopulated. So for code points less than 1000, the index is the code point itself. That puts all translations starting with 'a' in one bucket and all names starting with 'b' in another.
  • The Unicode range of a script need not span 1000 code points. Remember, our search keys are code point % 1000. That is, all language names in Malayalam script will be in a bucket with key 3, since the Unicode range spans 3330 to 3455. For Tamil it is bucket 2 or 3, since the code points span 2946 to 3066. We use 1000 for the mod because we don't want too many buckets.

The index is version controlled and needs to be updated only rarely (for example, when a new CLDR version is released). A PHP script, data/LanguageNameIndexer.php, does this. The output is a serialized index file, data/langnames.ser (around 899,726 bytes for CLDR version 21).
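
The bucketing rule can be sketched in JavaScript as follows. This is an illustration of the rule as worded in this section, not a port of LanguageNameIndexer.php; bucketKey, buildIndex and the sample names are assumptions. Taken literally, the remainder sends a Malayalam name starting with U+0D2E (3374) to bucket 374, whereas the Malayalam and Tamil bucket numbers quoted above match the thousands digit (integer division by 1000); the sketch follows the "mod 1000" wording.

```javascript
// Bucket key as worded above: code points below 1000 are used directly,
// larger ones are reduced modulo 1000. The text does not say what happens
// at exactly 1000; this sketch treats it like the larger code points.
function bucketKey( name ) {
	var codePoint = name.codePointAt( 0 );
	return codePoint < 1000 ? codePoint : codePoint % 1000;
}

// Build the index: bucket key -> list of { name, code } entries.
// `translations` maps a language code to the names it is known by.
function buildIndex( translations ) {
	var index = {};

	Object.keys( translations ).forEach( function ( code ) {
		translations[ code ].forEach( function ( name ) {
			var lower = name.toLowerCase(),
				key = bucketKey( lower );

			index[ key ] = index[ key ] || [];
			index[ key ].push( { name: lower, code: code } );
		} );
	} );
	return index;
}

// Sample data: each language with its name in English, as an autonym and
// in Malayalam.
var index = buildIndex( {
	fi: [ 'Finnish', 'suomi', 'ഫിന്നിഷ്' ],
	fr: [ 'French', 'français', 'ഫ്രഞ്ച്' ],
	ta: [ 'Tamil', 'தமிழ்', 'തമിഴ്' ]
} );

// All names starting with 'f' share one bucket.
console.log( index[ bucketKey( 'f' ) ].length );
```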

Searching for a language

The following steps are involved in searching:

  • Load the serialized language name index, only once.
  • Find the Unicode code point of the first letter of the query string and calculate the bucket key (index) from it. It is the code point mod 1000 if the code point is greater than 1000, or the code point itself if it is less than 1000.
  • Locate the bucket and iterate through each of its entries.
  • Compare each entry with the query using the Levenshtein distance algorithm with a maximum distance of one. This means the search passes all strings that are exactly equal or differ by one typo.
  • Return the language code and translated name of each entry that passes.

API : LanguageNameSearch::search( $searchKey, $typos = 0 )

Typo correction

The language name comparison is based on Levenshtein distance. Basically, it tells how many typos one word has compared to another. We allow one typo in the search query, so that a search for 'finish' matches 'finnish'.

PHP's native Levenshtein distance implementation does not support multi-byte characters, but in our case we need to search in all the languages of the world that are defined in Unicode. So we wrote a custom Levenshtein function that supports multi-byte Unicode characters.
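
The search steps and the typo tolerance described above can be sketched together. This is a JavaScript illustration under stated assumptions, not the PHP implementation: searchLanguages, the sample index and its shape are invented for the example, and Array.from is used to split strings into code points, which plays the role of the multi-byte-aware comparison.

```javascript
// Levenshtein distance over code points rather than bytes or UTF-16 units.
function levenshtein( a, b ) {
	var s = Array.from( a ),
		t = Array.from( b ),
		previous = [],
		current, i, j;

	for ( j = 0; j <= t.length; j++ ) {
		previous[ j ] = j;
	}
	for ( i = 1; i <= s.length; i++ ) {
		current = [ i ];
		for ( j = 1; j <= t.length; j++ ) {
			current[ j ] = Math.min(
				previous[ j ] + 1, // deletion
				current[ j - 1 ] + 1, // insertion
				previous[ j - 1 ] + ( s[ i - 1 ] === t[ j - 1 ] ? 0 : 1 ) // substitution
			);
		}
		previous = current;
	}
	return previous[ t.length ];
}

// Bucket key following the "mod 1000" wording of the indexing section.
function bucketKey( text ) {
	var codePoint = text.codePointAt( 0 );
	return codePoint < 1000 ? codePoint : codePoint % 1000;
}

// Sample index in an assumed shape: bucket key -> list of [ name, code ].
var sampleIndex = {};
[
	[ 'finnish', 'fi' ], [ 'french', 'fr' ], [ 'faroese', 'fo' ],
	[ 'தமிழ்', 'ta' ], [ 'തമിഴ്', 'ta' ]
].forEach( function ( entry ) {
	var key = bucketKey( entry[ 0 ] );
	sampleIndex[ key ] = sampleIndex[ key ] || [];
	sampleIndex[ key ].push( entry );
} );

// Look only inside the bucket of the first letter and keep the entries that
// are within `typos` edits of the query.
function searchLanguages( query, typos ) {
	var q = query.toLowerCase(),
		bucket = sampleIndex[ bucketKey( q ) ] || [],
		found = {};

	bucket.forEach( function ( entry ) {
		if ( levenshtein( q, entry[ 0 ] ) <= typos ) {
			found[ entry[ 1 ] ] = entry[ 0 ];
		}
	} );
	return found;
}

console.log( searchLanguages( 'finish', 1 ) );
```

Only the entries in one bucket are compared, so the cost of a key press depends on the size of that bucket rather than on the whole name matrix.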

Naming convention

All jQuery plugin components that do not depend on MediaWiki are named with the prefix jquery.uls, for example jquery.uls.core.js and jquery.uls.css.

MediaWiki-specific customizations and extensions of jquery.uls are in files with the prefix ext.uls.


Coding guidelines

jQuery.uls follows jQuery core coding guidelines.

MediaWiki-specific parts (ext.uls.*) follow the MediaWiki coding conventions, which are similar to the above but with fewer exceptions for whitespace.