Localisation

This page gives a technical description of MediaWiki's internationalisation and localisation (I18N) system.

Translatewiki.net
Translatewiki.net supports in-wiki translation of the complete interface. If you would like to have nothing to do with all the technicalities of editing files, subversion, creating patches, this is the place for you.

Subversion and Bugzilla
Only few languages are maintained by translators (Hebrew, some Chinese languages), who commit directly to MediaWiki svn repository. All new efforts should go through previously introduced translatewiki.net

I18n mailing list
You can subscribe i18n list, but at the moment it is very low traffic (almost none).

Code Structure
First, you have a Language object in Language.php. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)

The object is constructed from two sources: subclassed versions of itself (classes) and Message files (messages).

There's also the MessageCache class, which handles input of text via the MediaWiki namespace. And there's the wfMsg* functions in GlobalFunctions.php. We have large amounts of message retrieval code in GlobalFunctions.php.

Language objects
There are two ways to get a language object. You can use the globals $wgLang and $wgContLang for user interface and content language respectively. For an arbitrary language you can construct an object by using, by replacing   with the code of the language. The list of codes is in languages/Names.php.

Language objects are needed for doing language specific functions, most often to do number, time and date formatting, but also to construct lists and other things. There are multiple layers of caching and merging with fallback languages, but the details are irrelevant in normal use.

Using messages
MediaWiki uses a central repository of messages which are referenced by keys in the code. This is different from example Gettext, which just extracts the translatable strings from the source files. Key-based system makes some things easier, like refining the original texts and tracking changes to messages. The drawback is of course that the list of used messages and the list of source texts for those keys can get out of sync. In practise this isn't a big problem, sometimes extra messages which are not used anymore still stay up for translation.

The message system in MediaWiki is quite complex, a bit too complex. One of the reasons for this is that MediaWiki is a web application. Messages can go trough all kinds of processing. The four major ones covering almost all cases are:
 * 1) as-is, no processing at all
 * 2) light wiki-parsing, parserfunction references starting with   are replaced with their results
 * 3) full wiki-parsing

Case 1. is for processing, not really for user visible messages. Light wiki-parsing should always be combined with html-escaping.

Recommended ways
Longer messages that are not used hundreds of times on a page:
 * OutputPage::addWikiMsg
 * OutputPage::wrapWikiMsg
 * wfMsgExt ... parse, parseinline

OutputPage methods parse messages and add them directly to the output buffer. can be used when a message should not be added to the output buffer. removes enclosing html tags from the parsed result, usually, but can generate invalid code for example if there is no root tag in parsed result, for example. Usage examples:

Other messages with light wiki-parsing can use  and   with the. wfMsgExt must always be used if the message has parts that depend on linguistic information, like NaN undefineds. Do not use wfMsg, wfMsgHtml for those kind of messages! They seem to work but are broken.

Some messages have mixed escaping and parsing. Most commonly when using raw links in messages that should not be escaped. and  with   can do that. Note that there cannot be any linguistic dependent variables in those messages! Be especially wary of using, it only escapes the message, not parameters. This has caused at least one XSS in MediaWiki. If you do not need the  functionality, use some parsing function or non-parsing function with.

Short list of functions to avoid:
 * (don't use unless you really want unescaped parameters)
 * (breaks up linguistic functions, as does wfMsg)
 * OutputPage::parse and parseInline, addWikiText (if you know the message, use  or  )

Remember that almost all Xml::-functions escape everything fed into them, so avoid double-escaping and parsed text with those.

Adding new messages

 * 1) Decide a name for the message. Try to follow global or local conventions for naming.
 * 2) Make sure that you are using suitable handling for the message (parsing, {{-replacement, escaping for html etc)
 * 3) Add it to languages/messages/MessageEn.php (core) or your extensions i18n file under 'en'.
 * 4) Take a pause and consider the wording of the message. Is it as clear as possible? Can it be understood wrong? Ask comments from other developers if possible.
 * 5) Consider adding documentation to MessagesQqq.php or your extensions i18n file under 'qqq'. Read more about #message documentation.
 * 6) If you added message to code, add the message key also to maintenance/language/messages.inc (also add the section if you created a new one). This file will define the order and formatting of messages in all message files.

Removing existing messages

 * 1) Remove it from MessagesEn.php. Don't bother with other languages - updates from translatewiki.net will handle those automatically.
 * 2) Remove it from maintenance/language/messages.inc

Changing existing messages

 * 1) Consider updating the message documentation (see )
 * 2) Change the message key if old translations are not suitable for the new meaning. This also includes changes in message handling (parsing, escaping). If in doubt, ask in #mediawiki-i18n or in the Support page at translatewiki.net.

Caching
MediaWiki has lots of caching mechanisms built in, which make the code somewhat more difficult to understand. Since 1.16 there is a new caching system, which caches messages either in cdb-files or in the database. Customised messages are cached in the filesystem and in memcached (or alternative), depending on the configuration.

What can be localized

 * Weekdays (and abbrev)
 * Months (and abbrev)
 * Bookstores
 * Skin names
 * Math names
 * Date preferences
 * Date format
 * Default date format
 * Date preference migration map
 * Default user option overrides
 * Language names
 * Timezones
 * Character encoding conversion via iconv
 * UpperLowerCase first (needs casemaps for some)
 * UpperLowerCase
 * Uppercase words
 * Uppercase word breaks
 * Case folding
 * Strip punctuation for MySQL search
 * Get first character
 * Alternate encoding
 * Recoding for edit (and then recode input)
 * RTL
 * Direction mark character depending on RTL
 * Arrow depending on RTL
 * Languages where italics cannot be used
 * Number formatting (commafy, transform digits, transform separators)
 * Truncate (multibyte)
 * Grammar conversions for inflected languages
 * Plural transformations
 * Formatting expiry times
 * Segmenting for diffs (Chinese)
 * Convert to variants of language
 * Language specific user preference options
 * Link trails, e.g.: foobar
 * Language code (RFC 3066)

Neat functionality:


 * I18N sprintfDate
 * Roman numeral formatting

Message parameters
Some messages take parameters. They are represented by $1, $2, $3, … in the (static) message texts, and replaced at run time. Typical parameter values are numbers ("Delete 3 versions?"), or user names ("Page last edited by $1"), page names, links, and so on, or sometimes other messages. They can be of arbitrary complexity.

Switches in messages…
Parameters values at times influence the exact wording, or grammatical variations in messages. Not resorting to ugly constructs like "$1 (sub)page(s) of his/her userpage", we make switches depending on values known at run time. The (static) message text then supplies each of the possible choices in a list, preceded by the name of the switch, and a reference to the value making a difference. This very much resembles the way, parser functions are called in MediaWiki. Several types of switches are available. These only work if you do full parsing or {{-transformation for the messages.

…on numbers via PLURAL
MediaWiki supports plurals, which makes for a nicer-looking product. For example:

Language-specific implementations of PLURAL: are found in pages such as LanguageFr.php (for French; uses singular for 0, code ) or LanguageCs.php (for Czech, Polish and some other Slavic languages, code  ).

…on user names via GENDER
If you refer to an user in a message, pass the user name as parameter to the message and add a mention in the message documentation that gender is supported.

…on use context inside sentences via GRAMMAR
Grammatical transformations for agglutinative languages is also available. For example for Finnish, where it was an absolute necessity to make language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since only a few words needed to be translated, such as the site name, we didn't need to include it.

MediaWiki has grammatical transformation functions for over 20 languages. Some of these are just dictionaries for Wikimedia site names, but others have simple algorithms which will fail for all but the most common cases.

Even before MediaWiki had arbitrary grammatical transformation, it had a nominative/genitive distinction for month names. This distinction is necessary if you wish to substitute month names into sentences.

Filtering special characters in parameters and messages
The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg* functions, including the multitasking wfMsgExt, with lots of ways to slip up and let through unescaped user input. There may be work done to clean this up at some stage in the future.

Message sources
Messages are obtained from these sources:
 * The MediaWiki name space. It allows wikis to adopt, or override, all of their messages, when standard messages do not fit or are not desired.
 * MediaWiki:message-name is the default message,
 * MediaWiki:message-name/language-code is the message to be used when a user has selected a language other then the wikis default language.
 * From message files.
 * Mediawiki itself, and few extensions, use a file per language, called, where zxx is the language code for the language.
 * Most extensions use a combined message file holding all messages in all languages, usually named after the extension, and having an  ending.
 * Very few extensions are using another, individual way.

Internationalization hints
Translators ask to consider some hints so as to make their work easier and more efficient. Even if only adding or editing messages in English, one should be aware of the needs of all languages. Messages are translated to more than 300 languages each, which should be done in the best possible way.

There a two main places, where you can find assistance of experienced and knowledgeable people regarding I18n: Please do ask them.
 * translatewiki.net, ask on their support page
 * the #mediawiki-i18n irc channel on http://freenode.org.

Avoid message reuse
The translators encourage reuse avoidance. Although two concepts can be expressed with the same word in English, this doesn't mean they can be expressed with the same word in every language. "OK" is a good example: in English this is used for a generic button label, but in some languages they prefer to use a button label related to the operation which will be performed by the button. If you are adding multiple identical messages, please add message documentation to describe the differences in their contexts. Don't worry too much about the extra work for translators. Translation memory helps a lot in these while keeping the flexibility to have different translations if needed.

Avoid patchwork messages
Languages have varying word orders, and complex grammatical and syntactic rules. Messages put together from lots of pieces of text, possibly with some indirection, are very hard, if not impossible, to translate. Better make messages complete sentences each, with a full stop at the end. Several sentences can usually much more easily be combined into a text block, if needed.

Be aware of PLURAL use on all numbers
When a number has to be inserted into a message text, be aware that, some languages will have to use PLURAL on it even if always larger than 1. The reason is that PLURAL in languages other than English can make very different and complex distinctions, comparable to English 1st, 2nd, 3rd, 4th, … 11th, 12th, 13th, … 21st, 22nd, 23rd, … etc.

Do not try to supply three different messages for cases like 0, 1, more items counted. Rather let one message take them all, and leave it to translators and PLURAL to properly treat possible differences of presenting them in their respective languages.

Always include the number as a parameter if possible. Always add NaN undefineds syntax to the source messages if possible, even if it makes no sense in English. The syntax is guide for translators.

Separate times from dates in sentences
Some languages have to insert something between a date and a time which grammatically depends on other words in a sentence. Thus they will not be able to use date/time combined. Others may find the combination convenient, thus it is usually the best choice to supply three parameter values (date/time, date, time) in such cases.

Users have grammatical genders
When a message talks about a user, or relates to a user, or addresses a user directly, the user name should be passed to the message as a parameter. Thus languages having to, or wanting to, use proper gender dependent grammar, can do so. This should be done even when the user name is not intended to appear in the message, such as in "inform the user on his/her talk page", which is better made "inform the user on { {GENDER:$1|his|her|their}} talk page" in English as well.

Avoid in messages
has several disadvantages. It can be anything (acronym, word, short phrase, etc.) and, depending on language, may need  on each occurrence. No matter what, very likely in most wiki languages, each message having  will need review for each new wiki installed. When there is not a general GRAMMAR program for a language, as almost always, sysops will have to add or amend php code so as to get  for   working. This requires both more skills, and more understanding, than otherwise. It is more convenient to have generic references like "this wiki". This does not keep installations from altering these messages to use, but at least they don't have to, and they can postpone message adaption until the wiki is already running and used.

Avoid references to screen layout and positions
What is rendered where depends on skins. Most often screen layouts of languages written from left to right are mirrored compared to those used for languages written from right to left, but not always, and for some languages and wikis, not entirely. Handheld devices, narrow windows, and so on show blocks underneath each other, that appear side to side on large displays. Since user selected and user written javascript gadgets can, and do, hide parts, or move things around in unpredictable ways, there is no reliable way of knowing the actual screen layout. Acoustic screen readers, and other auxiliary devices do not even have a concept of layout. So, you cannot refer to layout posisitons.

Have message elements before and after input fields

 * This rule has not become de facto standard in MediaWiki development

While most modern Western European languages, including English, allow efficient use of prompting in the form "item colon space input-field" that is not so for many other languages, unless sacrificing good grammatical taste or politeness, or resorting to excessively complicated and lengthy wording. Even in English, you often want to use "Distance: ___ feet" rather than "Distance (in feet): ___". Leaving  aside, just think of each and any input field following the "Distance: ___ feet" pattern, and give it two messages, even if the 2nd one is most often empty in English.

An alternate solution might be allowing the placement of input fields via $i parameters.

Messages are usually longer than you think!
Skimming foreign language message files, you find messages almost never shorter than Chinese ones, rarely shorter than English ones, and most usually much longer than English ones.

Especially in forms, in front of input fields, English messages tend to be terse, and short. That is often not kept in translations. Especially genuinely untechnical third world languages, vernacular, medieval, or ancient languages require multiple words or even complete sentences to explain foreign, or technical, prompts. E.g. "TSV file:" may have to be translatd as: "Please type a name here which denotes a collection of computer data that is comprised of a sequentially organized series of typewritten lines which themselves are organized as a series of informational fields each, where said fields of information are fenced, and the fences between them are single signs of the kind that slips a typewriter carriage forward to the next predefined position each. Here we go: _____ (thank you)" — admittedly an extreme example, but you got the trait. Imagine this sentence in a column in a form where each word occupies a line of its own, and the input field is vertically centered in the next column. :-(

Avoid using very close, similar, or identical words to denote different things, or concepts
For example, pages may have older revisions (of a specific date, time, and edit), comprising past versions of said page. The words revision, and version can be used interchangeably. A problem arises, when versioned pages are revised, and the revision, i.e. the process of revising them, is being mentioned, too. This may not pose a serious problem when the two synonyms of "revision" have different translations. Do not rely on that, however. Better is to avoid the use of "revision" aka "version" altogether, then, so as to avoid it being misinterpreted.

Basic words may have unforeseen connotations, or not exist at all
There are some words that are hard to translate because of their very specific use in MediaWiki. Some may not be translated at all. For example "namespace", and "appartment", translate the same in Kölsch. There is no word "user" relating to "to use something" in several languages. Sticking to Kölsch, they say "corroborator and participant" in one word since any reference to "use" would too strongly imply "abuse" as well. "Wiki farm" is translated as "stable full of wikis", since a single crop farm would be a contradiction in terms in the language, and not understood, etc.

Expect untranslated words

 * This rule has not become de facto standard in MediaWiki development

It is not uncommon that computerese English is not translated and taken as loanwords, or foreign words. In the latter case, technically correct translations mark them as belonging to another language, usually with apropriate html markup, such as  …. Thus make sure that, your message output handler passes it along unmolested, even if you do not need it in English, or in your language.

Symbols, colons, brackets, etc. are parts of messages
Many symbols are translated, too. Some scrips have other kinds of brackets than the Latin script has. A colon may not be appropriate after a label or input prompt in some languages. Having those symbols included in messages helps to better and less anglo-centric translations, and by the way reduces code clutter.

Do not expect symbols and punctuation to survive translation
Languages written from right to left (as opposed to English) usually swap arrow symbols being presented with "next" and "previous" links, and their placement relative to a message text may, or may not, be inverted as well. Ellipsis may be translated to "etc." or to words. Question marks, exclamation marks, colons do appear at other places than at the end of sentences, or not at all, or twice. As a consequence, always include all of those in your messages, never insert them programmatically.

Use full stops
Do terminate normal sentences with full stops. This is often the only indicator for a translator to know that they are not headlines or list items, which may need to be translated differently.

Wikicode of links
Link anchors can be put into messages in several technical ways: The latter is often hard or impossible to handle for translators, avoid patchwork messages here, too. Make sure that, " " does not contain spaces.
 * 1) via wikitext: …   …
 * 2) via wikitext: …   …
 * 3) the anchor text is a message in the MediaWiki name space. Avoid it!

Use meaningful link anchors
Care for your wording. Link anchors play an important role in search engine assessment of pages, both the linking ones, and the ones linked to. Make sure that, the anchor describes the target page well. Do avoid commonplace and generic words! For example, "Click here" is an absolute nogo, since target pages never are about "click here". Do not put that in sentences around links either, because "here" was not the place to click. Use precise words telling what a user will get to when following the link, such as "You can upload a file if you wish."

Avoid jargon and slang
Avoid developer and power user jargon in messages. Try to use as simple language as possible.

One sentence per line
Try to have one sentence or similar block in one line. This helps to compare the messages in different languages, and may be used as an hint for segmentation and alignment in translation memories.

Keeping messages centralized and in sync
English messages are very rarely out of sync with the code. Experience has shown that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.

Message documentation
There is a pseudo-language, having the code  (message documentation). It is one of the ISO 639 codes reserved for private use. There we do not keep translations of each message, but collect English sentences about each message; telling us where it is used, giving hints about how to translate is, enumerate and describe its parameters, link to related messages, etc.. In translatewiki.net, these hints are shown to translators when they hit the "" button for messages.

Programmers are encouraged to contribute to message documentation. Useful information include:
 * 1) message handling (parsing, escaping, plain text)
 * 2) type of parameters with example values
 * 3) where the message is used (pages, locations in the user interface)
 * 4) how the message is used where it is used (a page title, button text, etc..)
 * 5) what other messages are used together with this message, or which other messages this message refers to
 * 6) anything else which could be understood when the message is seen on the context, but not when the message is displayed alone (which is the case when it is being translated)
 * 7) if the message has special properties, like is a page name, or if it should not be a direct translation

License
Any edits made to the language must be licensed under the terms of the GNU General Public License (and GFDL?) to be included in the MediaWiki software.

History
With MediaWiki 1.3.0 a new system has been set up for localizing MediaWiki. Instead of editing the language file and asking developers to apply the change, users can now edit the interface strings directly from their wikis. This is the system in use as of August 2005. People can find the message they want to translate in Special:Allmessages and then edit the relevant string in the MediaWiki: namespace. Once edited, these changes are live. There is no more need to request an update, and wait for developers to check and update the file.

The system is great for Wikipedia projects; however a side effect is that the MediaWiki language files shipped with the software are no longer quite up-to-date, and it is harder for developers to keep the files on meta in sync with the real language files.

As the default language files do not provide enough translated material, we face two problems:
 * 1) New Wikimedia projects created in a language which have not been updated for a long time, need a total re-translation of the interface.
 * 2) Other users of MediaWiki [not Wikimedia projects] are left with untranslated interfaces. This is especially unfortunate for the smaller languages which don't have many translators.

This is not such a big issue anymore, because translatewiki.net is advertised prominently and used by almost all translations. Local translations still do happen sometimes.

'''This section is missing about the changes in the i18n system related to extensions. The format was standardized and messages are automatically loaded.'''