Internationalisation
From MediaWiki.org
This page gives a technical description of MediaWiki's internationalization (I18N) system.
Code Structure
First, there is the Language object in Language.php. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.).
The object is constructed from two sources: subclassed versions of itself (classes) and Message files (messages).
There is also the MessageCache class, which handles input of text via the MediaWiki namespace, and the wfMsg*() functions in GlobalFunctions.php, which hold a large amount of message retrieval code.
General use
You load a language object by calling the Language::factory() function. This function loads the class file for the object (taking into account fallback languages by using the fallback language's class but overriding the language code) and returns that object. Nothing else happens at this point.
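The class selection described above can be sketched roughly as follows. This is a minimal illustration, not the real Language::factory(): the DemoLanguage classes, the demoLanguageFactory() function and the fallback table are all hypothetical names invented for this example.

```php
<?php
// Hypothetical base class standing in for Language.
class DemoLanguage {
    public $code = 'en';
}

// Hypothetical subclass that would carry German-specific behavior.
class DemoLanguageDe extends DemoLanguage {
}

/**
 * Pick the most specific class available for $code, trying the
 * fallback language's class next, then the base class. The returned
 * object always reports the requested code, not the fallback's.
 */
function demoLanguageFactory( string $code, array $fallbacks ): DemoLanguage {
    $candidates = [ $code ];
    if ( isset( $fallbacks[$code] ) ) {
        $candidates[] = $fallbacks[$code];
    }

    $class = 'DemoLanguage';
    foreach ( $candidates as $candidate ) {
        $name = 'DemoLanguage' . ucfirst( $candidate );
        if ( class_exists( $name ) ) {
            $class = $name;
            break;
        }
    }

    $lang = new $class();
    $lang->code = $code; // override the language code on the borrowed class
    return $lang;
}

// 'bar' has no class of its own here, so the German class is used.
$lang = demoLanguageFactory( 'bar', [ 'bar' => 'de' ] );
echo get_class( $lang ), ' ', $lang->code, "\n";
```

Note that nothing is loaded from a messages file at this stage; the sketch only chooses a class and sets the code.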
When a message or other localised item is requested, a lazy-load initializer is called. Now the real work starts. Take first the scenario in which the language is not cached. The system loads the Messages file by:
require( $filename );
$cache = compact( self::$mLocalisationKeys );
...where self::$mLocalisationKeys is the list of variable names that may be set in the localization file. This lets you use things like:
$fallback = false;
$rtl = false;
...and easily siphon them into arrays.
Then, we load the $fallback language (if not set, English) to fill in the gaps in the messages. There is specialized behavior for certain keys, as they can be mergeable maps, lists or alias lists.
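The load-and-merge step can be sketched as below. This is an illustration under assumptions rather than the actual loader: the key list, the two demo files written to the temporary directory, and the function names loadMessagesFile() and mergeWithFallback() are hypothetical, and only the 'messages' key is treated as a mergeable map.

```php
<?php
// Hypothetical list of variable names that a messages file may set.
$localisationKeys = [ 'fallback', 'rtl', 'messages' ];

/**
 * Run one messages file and collect the known variables it set.
 */
function loadMessagesFile( string $filename, array $keys ): array {
    require $filename; // sets $fallback, $rtl, $messages, ... in this scope
    // Keep only the names the file actually defined, so compact()
    // never sees an undefined variable.
    $defined = array_intersect( $keys, array_keys( get_defined_vars() ) );
    return compact( $defined );
}

/**
 * Fill the gaps in $cache from $fallbackCache.
 */
function mergeWithFallback( array $cache, array $fallbackCache ): array {
    // Keys the language does not set at all come from the fallback.
    $merged = $cache + $fallbackCache;
    // 'messages' is a mergeable map: fill gaps per message key.
    if ( isset( $cache['messages'], $fallbackCache['messages'] ) ) {
        $merged['messages'] = $cache['messages'] + $fallbackCache['messages'];
    }
    return $merged;
}

// Two tiny demo files (hypothetical content).
$dir = sys_get_temp_dir();
$enFile = "$dir/MessagesDemoEn.php";
$frFile = "$dir/MessagesDemoFr.php";
file_put_contents( $enFile,
    '<?php $rtl = false; $messages = [ "ok" => "OK", "cancel" => "Cancel" ];' );
file_put_contents( $frFile,
    '<?php $fallback = "en"; $messages = [ "cancel" => "Annuler" ];' );

$fr = loadMessagesFile( $frFile, $localisationKeys );
$en = loadMessagesFile( $enFile, $localisationKeys );
$merged = mergeWithFallback( $fr, $en );
print_r( $merged['messages'] );

unlink( $enFile );
unlink( $frFile );
```

The array union operator keeps the left-hand value when both sides define a key, which is why the language's own entries win over the fallback's.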
Caching
MediaWiki has lots of caching mechanisms built in, which make the code somewhat more difficult to understand. Before doing any loading, MediaWiki will check the following places to see if we can be lazy:
- $mLocalisationCache[$code] - just a variable where it may have been stashed.
- serialized/$code.ser - compiled serialized language file.
- Memcached version of file (with expiration checking).
Expiration checking consists of ensuring that every dependency has a filemtime() that matches the one bundled with the cached copy. Similar checking could be implemented for serialized versions, as it seems that they are not updated until manually recompiled. However, the manual recompilation model will probably be removed shortly, as it's inconvenient for site administrators. Caching is much more versatile, especially when you add dependency checking. The only problem is that you need a data store which is both fast to read and writable by the web server. Such a store is not always available.
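The dependency check can be sketched as follows. This is a minimal illustration assuming the cached entry is a plain array that carries its data together with the modification times of the files it was built from; wrapWithDeps() and isExpired() are hypothetical names, not MediaWiki functions.

```php
<?php
/**
 * Bundle data with the modification times of its source files.
 */
function wrapWithDeps( array $data, array $files ): array {
    $deps = [];
    foreach ( $files as $file ) {
        $deps[$file] = filemtime( $file );
    }
    return [ 'data' => $data, 'deps' => $deps ];
}

/**
 * An entry is expired if any dependency is missing or its
 * modification time no longer matches the bundled one.
 */
function isExpired( array $entry ): bool {
    foreach ( $entry['deps'] as $file => $mtime ) {
        clearstatcache( true, $file ); // avoid a stale stat result
        if ( !file_exists( $file ) || filemtime( $file ) !== $mtime ) {
            return true;
        }
    }
    return false;
}

// Usage: a temporary file stands in for a messages file.
$file = tempnam( sys_get_temp_dir(), 'msg' );
$entry = wrapWithDeps( [ 'messages' => [ 'ok' => 'OK' ] ], [ $file ] );

$fresh = isExpired( $entry );   // checked right after caching
touch( $file, time() + 10 );    // simulate an edit to the source file
$stale = isExpired( $entry );   // checked after the edit

var_dump( $fresh, $stale );
unlink( $file );
```

The same entry could be stored in any of the places listed above; the check itself does not depend on where the entry is kept.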
Behavior
Things that are localizable:
- Weekdays (and abbrev)
- Months (and abbrev)
- Bookstores
- Skin names
- Math names
- Date preferences
- Date format
- Default date format
- Date preference migration map
- Default user option overrides
- Language names
- Timezones
- Character encoding conversion via iconv
- UpperLowerCase first (needs casemaps for some)
- UpperLowerCase
- Uppercase words
- Uppercase word breaks
- Case folding
- Strip punctuation for MySQL search
- Get first character
- Alternate encoding
- Recoding for edit (and then recode input)
- RTL
- Direction mark character depending on RTL
- Arrow depending on RTL
- Languages where italics cannot be used
- Number formatting (commafy, transform digits, transform separators)
- Truncate (multibyte)
- Grammar conversions for inflected languages
- Plural transformations
- Formatting expiry times
- Segmenting for diffs (Chinese)
- Convert to variants of language
- Language specific user preference options
- Link trails, e.g.: [[foo]]bar
- Language code (RFC 3066)
Neat functionality:
- I18N sprintfDate
- Roman numeral formatting
Parameter substitution
MediaWiki supports plurals, which makes for a nicer-looking product. For example:
'undelete_short' => 'Undelete {{PLURAL:$1|one edit|$1 edits}}',
Language-specific implementations of PLURAL: are found in files such as LanguageFr.php (for French, code fr) or LanguageCs.php (for Czech, code cs).
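As a rough sketch of how a per-language plural rule feeds into message expansion, consider the following. The functions pluralIndex() and expandPlural() are hypothetical, the rules are simplified illustrations rather than copies of the language classes, and the expansion handles only PLURAL on the $1 parameter.

```php
<?php
/**
 * Choose which plural form to use for a count.
 * Simplified, assumed rules for illustration only.
 */
function pluralIndex( string $lang, int $count ): int {
    switch ( $lang ) {
        case 'fr':
            // Assumed rule: 0 and 1 take the singular form.
            return $count <= 1 ? 0 : 1;
        case 'cs':
            // Assumed rule: 1, then 2 to 4, then everything else.
            if ( $count === 1 ) {
                return 0;
            }
            return ( $count >= 2 && $count <= 4 ) ? 1 : 2;
        default:
            return $count === 1 ? 0 : 1;
    }
}

/**
 * Expand {{PLURAL:$1|...}} and then substitute $1 itself.
 */
function expandPlural( string $message, string $lang, int $count ): string {
    $expanded = preg_replace_callback(
        '/\{\{PLURAL:\$1\|([^{}]*)\}\}/',
        static function ( array $m ) use ( $lang, $count ): string {
            $forms = explode( '|', $m[1] );
            // If the message supplies fewer forms than the rule
            // expects, reuse the last form.
            $i = min( pluralIndex( $lang, $count ), count( $forms ) - 1 );
            return $forms[$i];
        },
        $message
    );
    return str_replace( '$1', (string)$count, $expanded );
}

$msg = 'Undelete {{PLURAL:$1|one edit|$1 edits}}';
echo expandPlural( $msg, 'en', 1 ), "\n";
echo expandPlural( $msg, 'en', 3 ), "\n";
```

A language with three forms would simply list three alternatives between the pipes, and the rule for that language picks among them.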
Grammatical transformations for agglutinative languages are also available. Finnish is an example: there it was an absolute necessity to make language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since only a few words needed to be translated, such as the site name, we didn't need to include it.
MediaWiki has grammatical transformation functions for over 20 languages. Some of these are just dictionaries for Wikimedia site names, but others have proper algorithms.
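The dictionary style of grammatical transformation can be sketched as a simple lookup. The table layout and the convertGrammar() function below are hypothetical; the two Finnish forms are the ones quoted above, labelled here with their case names (elative and illative).

```php
<?php
// Hypothetical table: language => case => word => inflected form.
$grammarForms = [
    'fi' => [
        'elative'  => [ 'Wikipedia' => 'Wikipediasta' ],
        'illative' => [ 'Wikipedia' => 'Wikipediaan' ],
    ],
];

/**
 * Look up the inflected form of a word, falling back to the
 * unchanged word when no entry exists.
 */
function convertGrammar( array $forms, string $lang, string $word, string $case ): string {
    return $forms[$lang][$case][$word] ?? $word;
}

echo 'Tietoja ', convertGrammar( $grammarForms, 'fi', 'Wikipedia', 'elative' ), "\n";
echo 'Voit tallentaa tiedoston ',
    convertGrammar( $grammarForms, 'fi', 'Wikipedia', 'illative' ), "\n";
```

An algorithmic implementation would replace the table lookup with suffix rules, which is where the long list of exceptions mentioned above would come in.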
Even before MediaWiki had arbitrary grammatical transformation, it had a nominative/genitive distinction for month names. This distinction is necessary if you wish to substitute month names into sentences.
The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg*() functions, including the multitasking wfMsgExt(), with lots of ways to slip up and let through unescaped user input. There may be work done to clean this up at some stage in the future.
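One way to make escaping hard to forget is to escape every parameter at the point of substitution. The sketch below illustrates that idea only; msgEscapedParams() is a hypothetical helper, not one of the wfMsg*() functions, and it assumes the message text itself is trusted HTML.

```php
<?php
/**
 * Substitute $1, $2, ... into a message, HTML-escaping each
 * parameter so that user input cannot inject markup.
 */
function msgEscapedParams( string $template, string ...$params ): string {
    $replacements = [];
    foreach ( $params as $i => $param ) {
        $replacements[ '$' . ( $i + 1 ) ] =
            htmlspecialchars( $param, ENT_QUOTES, 'UTF-8' );
    }
    // strtr() with an array prefers the longest key and never
    // rescans replaced text, so $10 is not mistaken for $1 and
    // a parameter containing "$2" is left alone.
    return strtr( $template, $replacements );
}

$userInput = '<script>alert(1)</script>';
echo msgEscapedParams( 'No page titled "$1" was found.', $userInput ), "\n";
```

Callers that genuinely need to pass markup as a parameter would need a separate, clearly named path, so that the unescaped case is the exception rather than the default.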
Avoid message reuse
Translators encourage avoiding message reuse. Although two concepts can be expressed with the same word in English, this doesn't mean they can be expressed with the same word in every language. "OK" is a good example: in English this is used for a generic button label, but in some languages they prefer to use a button label related to the operation which will be performed by the button.
An easy way to duplicate messages across all languages would reduce the programmer's need to reuse messages. Preferably you would have a reference rather than a full copy, to reduce maintenance.
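A reference mechanism of this kind might look like the sketch below. It is purely hypothetical: resolveMessage(), the reference table and the message keys are invented for illustration and do not describe an existing MediaWiki feature.

```php
<?php
/**
 * Return the text for $key. A language's own message always wins;
 * otherwise a reference to another key is followed. The $seen guard
 * stops reference cycles.
 */
function resolveMessage( string $key, array $messages, array $references ): string {
    $seen = [];
    while ( !isset( $messages[$key] )
        && isset( $references[$key] )
        && !isset( $seen[$key] )
    ) {
        $seen[$key] = true;
        $key = $references[$key];
    }
    // "<key>" is just a placeholder for a missing message.
    return $messages[$key] ?? "<$key>";
}

// Hypothetical: the delete button borrows the generic "ok" text by default.
$references = [ 'delete-submit' => 'ok' ];

$english = [ 'ok' => 'OK' ];
$german  = [ 'ok' => 'OK', 'delete-submit' => 'Löschen' ];

echo resolveMessage( 'delete-submit', $english, $references ), "\n";
echo resolveMessage( 'delete-submit', $german, $references ), "\n";
```

With this design the programmer gets a distinct key per use, while languages that are happy with the shared text maintain only one string.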
Keeping messages centralized and in sync
English messages are very rarely out of sync with the code. Experience has shown that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.

