Localisation

This page gives a technical description of MediaWiki's internationalization (I18N) system.

Code Structure
First, you have a Language object in Language.php. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)

The object is constructed from two sources: subclassed versions of itself (classes) and Message files (messages).

There's also the MessageCache class, which handles input of text via the MediaWiki namespace. There are also the wfMsg* functions in GlobalFunctions.php, which holds a large amount of message retrieval code.

General use
You load a language object by calling the  function. This function loads the class file for the object (taking into account fallback languages by using the fallback language's object but overloading the language key) and returns that object. Nothing else happens.

When a message or other item is requested, a lazy-load initializer is called. Now the real work starts. We first take the scenario where the language is not cached. The system loads the Messages file by:

...where  is the name of variables that could be used in the localization file. This lets you use things like:

...and easily siphon them into arrays.

Then, we load the  language (if not set, English) to fill in the gaps in the messages. There is specialized behavior for certain keys, as they can be mergeable maps, lists or alias lists.
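The gap-filling step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not MediaWiki's PHP code; the function name `merge_with_fallback`, the key names and the sample data are all hypothetical, and only the "mergeable map" case is shown.

```python
def merge_with_fallback(own, fallback, mergeable_map_keys=frozenset()):
    """Return a new dict: `own` settings with gaps filled from `fallback`."""
    merged = dict(fallback)  # start from the fallback language
    for key, value in own.items():
        if key in mergeable_map_keys and isinstance(value, dict):
            # Mergeable maps are combined entry by entry; own entries win.
            combined = dict(fallback.get(key, {}))
            combined.update(value)
            merged[key] = combined
        else:
            # Plain keys are simply overridden by the language's own value.
            merged[key] = value
    return merged


# Hypothetical sample data: German lacks a "search" message and an "rtl" flag.
english = {"messages": {"search": "Search", "cancel": "Cancel"}, "rtl": False}
german = {"messages": {"cancel": "Abbrechen"}}

merged = merge_with_fallback(german, english, mergeable_map_keys={"messages"})
print(merged["messages"])
print(merged["rtl"])
```

Without the mergeable-key handling, the German "messages" map would replace the English one wholesale and the "search" message would be lost, which is why such keys need specialized behavior.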

Caching
MediaWiki has lots of caching mechanisms built in, which make the code somewhat more difficult to understand. Before doing any loading, MediaWiki will check the following places to see if we can be lazy:


 * 1) $mLocalisationCache[$code] - just a variable where it may have been stashed.
 * 2) serialized/$code.ser - compiled serialized language file.
 * 3) Memcached version of file (with expiration checking).

Expiration checking consists of ensuring that every dependency has a filemtime that matches the one bundled with the cached copy. Similar checking could be implemented for serialized versions, as it seems that they are not updated until manually recompiled. However, the manual recompilation model will probably be removed shortly, as it's inconvenient for site administrators. Caching is much more versatile, especially when you add dependency checking. The only problem is that you need to have a data store which is both fast to read and writable by the web server. Such a store is not always available.
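The dependency check described above can be sketched as follows. This is an illustration in Python, not MediaWiki's PHP implementation; the function names and the layout of the cache entry (a "data" part plus a "deps" map of path to modification time) are assumptions made for the example.

```python
import os
import tempfile


def snapshot_dependencies(paths):
    """Record the current modification time of every dependency file."""
    return {path: os.path.getmtime(path) for path in paths}


def is_cache_fresh(entry):
    """A cached entry is fresh only if every recorded mtime still matches."""
    for path, recorded_mtime in entry["deps"].items():
        try:
            if os.path.getmtime(path) != recorded_mtime:
                return False
        except OSError:
            # A missing or unreadable dependency also invalidates the entry.
            return False
    return True


# Demonstration with a temporary file standing in for a message file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    source_path = handle.name

entry = {"data": {"ok": "OK"}, "deps": snapshot_dependencies([source_path])}
print(is_cache_fresh(entry))   # fresh right after caching

os.utime(source_path, (1, 1))  # simulate an edit by changing the mtime
print(is_cache_fresh(entry))   # stale after the dependency changed
os.remove(source_path)
```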

Behavior
Things that are localizable:


 * Weekdays (and abbrev)
 * Months (and abbrev)
 * Bookstores
 * Skin names
 * Math names
 * Date preferences
 * Date format
 * Default date format
 * Date preference migration map
 * Default user option overrides
 * Language names
 * Timezones
 * Character encoding conversion via iconv
 * UpperLowerCase first (needs casemaps for some)
 * UpperLowerCase
 * Uppercase words
 * Uppercase word breaks
 * Case folding
 * Strip punctuation for MySQL search
 * Get first character
 * Alternate encoding
 * Recoding for edit (and then recode input)
 * RTL
 * Direction mark character depending on RTL
 * Arrow depending on RTL
 * Languages where italics cannot be used
 * Number formatting (commafy, transform digits, transform separators)
 * Truncate (multibyte)
 * Grammar conversions for inflected languages
 * Plural transformations
 * Formatting expiry times
 * Segmenting for diffs (Chinese)
 * Convert to variants of language
 * Language specific user preference options
 * Link trails, e.g.: foobar
 * Language code (RFC 3066)

Neat functionality:


 * I18N sprintfDate
 * Roman numeral formatting
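As an illustration of the last item, Roman numeral formatting is commonly done with a greedy table of values. The sketch below is in Python and is not taken from MediaWiki's code; the name `to_roman` and the supported range of 1 to 3999 are assumptions of this example.

```python
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Format an integer from 1 to 3999 as a Roman numeral."""
    if not 1 <= number <= 3999:
        raise ValueError("number out of range for this sketch")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        # Take as many of the largest remaining symbol as fit.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


print(to_roman(1994))
```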

Parameter substitution
Some messages take parameters. They are represented by $1, $2, $3, … in the (static) message texts, and replaced at run time. Typical parameter values are numbers ("Delete 3 versions?"), or user names ("Page last edited by $1"), page names, links, and so on, or sometimes other messages. They can be of arbitrary complexity.
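A minimal sketch of this substitution, in Python for illustration only (it is not the code behind the wfMsg* functions, and the name `substitute_params` is hypothetical):

```python
import re


def substitute_params(text, params):
    """Replace $1, $2, ... in `text` with the matching entries of `params`."""
    def replace(match):
        index = int(match.group(1)) - 1  # $1 refers to params[0]
        if 0 <= index < len(params):
            return str(params[index])
        return match.group(0)  # leave unknown placeholders untouched

    # A single regex pass: "$10" is never mistaken for "$1" followed by "0",
    # and "$2" appearing inside a substituted value is not expanded again.
    return re.sub(r"\$(\d+)", replace, text)


print(substitute_params("Page last edited by $1", ["Alice"]))
print(substitute_params("Delete $1 versions?", [3]))
```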

Switches in messages …
Parameter values at times influence the exact wording, or grammatical variations, in messages. Rather than resorting to ugly constructs like "$1 (sub)page(s) of his/her userpage", we make switches depending on values known at run time. The (static) message text then supplies each of the possible choices in a list, preceded by the name of the switch and a reference to the value making a difference. This very much resembles the way parser functions are called in MediaWiki. Several types of switches are available.

… on numbers via PLURAL
MediaWiki supports plurals, which makes for a nicer-looking product. For example:

Language-specific implementations of PLURAL: are found in pages such as LanguageFr.php (for French, code ) or LanguageCs.php (for Czech, code  ).
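To show why the rules differ per language, here is a simplified sketch in Python. It is not copied from LanguageFr.php or LanguageCs.php: the rule functions are simplified assumptions (English: singular only for 1; French: singular for 0 and 1; Czech: one form for 1, another for 2 to 4, a third for everything else), and the sketch assumes the number has already been substituted into the switch.

```python
import re


def plural_index_en(count):
    return 0 if count == 1 else 1


def plural_index_fr(count):
    return 0 if count <= 1 else 1


def plural_index_cs(count):
    if count == 1:
        return 0
    if 2 <= count <= 4:
        return 1
    return 2


# Hypothetical registry mapping language codes to rule functions.
PLURAL_RULES = {"en": plural_index_en, "fr": plural_index_fr, "cs": plural_index_cs}

PLURAL_RE = re.compile(r"\{\{PLURAL:(\d+)\|([^{}]*)\}\}")


def expand_plural(text, lang):
    """Pick the right form out of every {{PLURAL:n|form|form|...}} switch."""
    rule = PLURAL_RULES[lang]

    def replace(match):
        count = int(match.group(1))
        forms = match.group(2).split("|")
        # If a translation supplies fewer forms, reuse the last one.
        return forms[min(rule(count), len(forms) - 1)]

    return PLURAL_RE.sub(replace, text)


print(expand_plural("0 {{PLURAL:0|page|pages}}", "en"))
print(expand_plural("0 {{PLURAL:0|page|pages}}", "fr"))
print(expand_plural("3 {{PLURAL:3|stránka|stránky|stránek}}", "cs"))
```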

… on use context inside sentences via GRAMMAR
Grammatical transformations for agglutinative languages are also available. For Finnish, for example, they were an absolute necessity in order to make language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since only a few words needed to be translated, such as the site name, we didn't need to include it.

MediaWiki has grammatical transformation functions for over 20 languages. Some of these are just dictionaries for Wikimedia site names, but others have simple algorithms which will fail for all but the most common cases.

Even before MediaWiki had arbitrary grammatical transformation, it had a nominative/genitive distinction for month names. This distinction is necessary if you wish to substitute month names into sentences.

Filtering special characters in parameters and messages
The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg* functions, including the multitasking wfMsgExt, with lots of ways to slip up and let through unescaped user input. There may be work done to clean this up at some stage in the future.
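The basic idea of safe substitution is to escape every parameter value before it is placed into HTML output. The sketch below shows this in Python with the standard `html.escape` function; it is an illustration of the principle, not a description of what any wfMsg* function does, and the name `render_escaped` is hypothetical. It assumes the message text itself is trusted and only the parameters come from users.

```python
import html
import re


def render_escaped(text, params):
    """Substitute $1, $2, ... with HTML-escaped parameter values."""
    def replace(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(params):
            # Escape <, >, & and quotes so user input cannot inject markup.
            return html.escape(str(params[index]), quote=True)
        return match.group(0)

    return re.sub(r"\$(\d+)", replace, text)


print(render_escaped("Page last edited by $1", ["<script>alert(1)</script>"]))
```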

Message sources
Messages are obtained from these sources:
 * The MediaWiki namespace. It allows wikis to adopt, or override, all of their messages, when standard messages do not fit or are not desired.
   * MediaWiki:message-name is the default message,
   * MediaWiki:message-name/language-code is the message to be used when a user has selected a language other than the wiki's default language.
 * From message files.
   * MediaWiki itself, and a few extensions, use a file per language, called, where zxx is the language code for the language.
   * Most extensions use a combined message file holding all messages in all languages, usually named after the extension, and having an  ending.
   * Very few extensions use another, individual way.
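How these sources could be combined is sketched below in Python. This is a simplified illustration, not MediaWiki's actual lookup: the function name, the in-memory dictionaries standing in for wiki pages and message files, and the exact order (wiki page first, then the message file for the requested language, then English) are assumptions of this example.

```python
def resolve_message(name, lang, wiki_lang, wiki_pages, message_files, fallback_lang="en"):
    """Look up a message text, preferring on-wiki overrides over message files."""
    # The plain page serves the wiki's default language; other languages
    # use the subpage with the language code appended.
    if lang == wiki_lang:
        page_title = f"MediaWiki:{name}"
    else:
        page_title = f"MediaWiki:{name}/{lang}"
    if page_title in wiki_pages:
        return wiki_pages[page_title]

    # Otherwise fall back to the message files, requested language first.
    for code in (lang, fallback_lang):
        text = message_files.get(code, {}).get(name)
        if text is not None:
            return text
    return None  # message not defined anywhere


# Hypothetical sample data.
wiki_pages = {"MediaWiki:mainpage": "Start"}
message_files = {
    "en": {"mainpage": "Main Page", "search": "Search"},
    "de": {"mainpage": "Hauptseite"},
}

print(resolve_message("mainpage", "en", "en", wiki_pages, message_files))
print(resolve_message("mainpage", "de", "en", wiki_pages, message_files))
print(resolve_message("search", "de", "en", wiki_pages, message_files))
```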

Internationalization hints
Translators ask developers to consider the following hints, which make their work easier and more efficient. Even if only adding or editing messages in English, one should be aware of the needs of all languages. Each message is translated into more than 300 languages, and this should be done in the best possible way.

There are two main places where you can find assistance from experienced and knowledgeable people regarding i18n. Please do ask them.
 * translatewiki.net, ask on their support page
 * the #mediawiki-i18n IRC channel on http://freenode.org.

Avoid message reuse
Translators encourage developers to avoid reusing messages. Although two concepts can be expressed with the same word in English, this doesn't mean they can be expressed with the same word in every language. "OK" is a good example: in English this is used for a generic button label, but in some languages they prefer to use a button label related to the operation which will be performed by the button.

An easy way to duplicate messages across all languages would reduce the programmer's need to reuse messages. Preferably you would have a reference rather than a full copy, to reduce maintenance.

Avoid patchwork messages
Languages have varying word orders, and complex grammatical and syntactic rules. Messages put together from lots of pieces of text, possibly with some indirection, are very hard, if not impossible, to translate. It is better to make each message a complete sentence, with a full stop at the end. Several sentences can usually be combined into a text block much more easily, if needed.

Be aware of PLURAL use on all numbers
When a number has to be inserted into a message text, be aware that some languages will have to use PLURAL on it even if it is always larger than 1. The reason is that PLURAL in languages other than English can make very different and complex distinctions, comparable to English 1st, 2nd, 3rd, 4th, … 11th, 12th, 13th, … 21st, 22nd, 23rd, … etc.

Do not try to supply three different messages for cases like 0, 1, more items counted. Rather let one message take them all, and leave it to translators and PLURAL to properly treat possible differences of presenting them in their respective languages.

Separate times from dates in sentences
Some languages have to insert something between a date and a time which grammatically depends on other words in a sentence. Thus they will not be able to use date/time combined. Others may find the combination convenient, thus it is usually the best choice to supply three parameter values (date/time, date, time) in such cases.

Users have grammatical genders
When a message talks about a user, or relates to a user, or addresses a user directly, the user name should be passed to the message as a parameter. Thus languages having to, or wanting to, use proper gender dependent grammar, can do so. This should be done even when the user name is not intended to appear in the message, such as in "inform the user on his/her talk page", which is better made "inform the user on {{GENDER:$1|his|her|their}} talk page" in English as well.
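The effect of such a gender switch can be sketched as follows. The example is in Python and is only an illustration: the function name, the `gender_of` dictionary standing in for stored user preferences, and the choice to fall back to the first form when no third form is given are assumptions of this sketch. It also assumes the user name has already been substituted for $1.

```python
import re

GENDER_RE = re.compile(r"\{\{GENDER:([^|{}]*)\|([^{}]*)\}\}")


def expand_gender(text, gender_of):
    """Resolve {{GENDER:user|male|female|unknown}} switches in a message."""
    def replace(match):
        user = match.group(1)
        forms = match.group(2).split("|")
        gender = gender_of.get(user, "unknown")
        if gender == "male":
            return forms[0]
        if gender == "female" and len(forms) > 1:
            return forms[1]
        # Unknown gender: use the third form if present, else the first.
        return forms[2] if len(forms) > 2 else forms[0]

    return GENDER_RE.sub(replace, text)


# Hypothetical user preferences.
gender_of = {"Alice": "female", "Bob": "male"}

message = "inform the user on {{GENDER:Alice|his|her|their}} talk page"
print(expand_gender(message, gender_of))
```

Note that the user name itself never appears in the output here; it is passed only so that the switch can be resolved.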

Avoid in messages
has several disadvantages. It can be anything (acronym, word, short phrase, etc.) and, depending on language, may need  on each occurrence. No matter what, very likely in most wiki languages, each message having  will need review for each new wiki installed. When there is not a general GRAMMAR program for a language, as is almost always the case, sysops will have to add or amend PHP code so as to get  for   working. This requires both more skills, and more understanding, than otherwise. It is more convenient to have generic references like "this wiki". This does not keep installations from altering these messages to use, but at least they don't have to, and they can postpone message adaptation until the wiki is already running and used.

Have message elements before and after input fields
While most modern Western European languages, including English, allow efficient use of prompting in the form "item colon space input-field", that is not so for many other languages, unless sacrificing good grammatical taste or politeness, or resorting to excessively complicated and lengthy wording. Even in English, you often want to use "Distance: ___ feet" rather than "Distance (in feet): ___". Leaving  aside, just think of each and any input field following the "Distance: ___ feet" pattern, and give it two messages, even if the second one is most often empty in English.

An alternate solution might be allowing the placement of input fields via $i parameters.

Messages are usually longer than you think!
Skimming foreign language message files, you find messages almost never shorter than Chinese ones, rarely shorter than English ones, and most usually much longer than English ones.

Especially in forms, in front of input fields, English messages tend to be terse and short. That is often not kept in translations. Especially genuinely untechnical third world languages, vernacular, medieval, or ancient languages require multiple words or even complete sentences to explain foreign, or technical, prompts. E.g. "TSV file:" may have to be translated as: "Please type a name here which denotes a collection of computer data that is comprised of a sequentially organized series of typewritten lines which themselves are organized as a series of informational fields each, where said fields of information are fenced, and the fences between them are single signs of the kind that slips a typewriter carriage forward to the next predefined position each. Here we go: _____ (thank you)". This is admittedly an extreme example, but you get the idea. Imagine this sentence in a column in a form where each word occupies a line of its own, and the input field is vertically centered in the next column. :-(

Avoid using very close, similar, or identical words to denote different things, or concepts
For example, pages may have older revisions (of a specific date, time, and edit), comprising past versions of said page. The words "revision" and "version" can be used interchangeably. A problem arises when versioned pages are revised, and the revision, i.e. the process of revising them, is being mentioned, too. This may not pose a serious problem when the two meanings of "revision" have different translations. Do not rely on that, however. It is better to avoid the use of "revision" in the sense of "version" altogether, so as to avoid it being misinterpreted.

Basic words may have unforeseen connotations, or not exist at all
There are some words that are hard to translate because of their very specific use in MediaWiki. Some may not be translated at all. For example, "namespace" and "apartment" translate the same in Kölsch. There is no word "user" relating to "to use something" in several languages. Sticking to Kölsch, they say "corroborator and participant" in one word since any reference to "use" would too strongly imply "abuse" as well. "Wiki farm" is translated as "stable full of wikis", since a single crop farm would be a contradiction in terms in the language, and not understood, etc.

Expect untranslated words
It is not uncommon that computerese English is not translated and taken as loanwords, or foreign words. In the latter case, technically correct translations mark them as belonging to another language, usually with appropriate HTML markup, such as  …. Thus make sure that your message output handler passes it along unmolested, even if you do not need it both in English, and in your language.

Do not expect symbols and punctuation to survive translation
Languages written from right to left (as opposed to English) usually swap the arrow symbols presented with "next" and "previous" links, and their placement relative to a message text may, or may not, be inverted as well. An ellipsis may be translated to "etc." or to words. Question marks, exclamation marks, and colons may appear at places other than the end of sentences, or not at all, or twice. As a consequence, always include all of those in your messages, and never insert them programmatically.

Use full stops
Do terminate normal sentences with full stops. This is often the only indicator for a translator to know that they are not headlines or list items, which may need to be translated differently.

Keeping messages centralized and in sync
English messages are very rarely out of sync with the code. Experience has shown that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.

Message documentation
There is a pseudo-language, having the code  (message documentation). It is one of the ISO 639 codes reserved for private use. There we do not keep translations of each message, but collect English sentences about each message: telling us where it is used, giving hints about how to translate it, enumerating and describing its parameters, linking to related messages, etc. In translatewiki.net, these hints are shown to translators when they hit the "" button for messages.

Programmers are encouraged to contribute to message documentation.