Localisation


This page gives a technical description of MediaWiki's internationalisation and localisation (I18N) system.

History

With MediaWiki 1.3.0 a new system has been set up for localizing MediaWiki. Instead of editing the language file and asking developers to apply the change, users can now edit the interface strings directly from their wikis. This is the system in use as of August 2005. People can find the message they want to translate in Special:Allmessages and then edit the relevant string in the MediaWiki: namespace. Once edited, these changes are live. There is no more need to request an update, and wait for developers to check and update the file.

The system is great for Wikipedia projects; however a side effect is that the MediaWiki language files shipped with the software are no longer quite up-to-date, and it is harder for developers to keep the files on meta in sync with the real language files.

As the default language files do not provide enough translated material, we face two problems:

  1. New Wikimedia projects created in a language whose files have not been updated for a long time need a complete re-translation of the interface.
  2. Other users of MediaWiki (outside the Wikimedia projects) are left with untranslated interfaces. This is especially unfortunate for the smaller languages which don't have many translators.

This is not such a big issue anymore, because translatewiki.net is advertised prominently and used for almost all translations. Local translations still do happen sometimes.

translatewiki.net

translatewiki.net supports in-wiki translation of the complete interface. If you would rather have nothing to do with the technicalities of editing files, Subversion, and creating patches, this is the place for you. Even if you can work with that process, you should consider trading some personal efficiency for the benefit of the group. Please visit translatewiki.net, create an account and request translator privileges.

Subversion and Bugzilla

Only a few languages (Hebrew, some Chinese languages) are maintained by translators who commit directly to the MediaWiki svn repository. All new efforts should go through translatewiki.net.

Subscribe to the i18n mailing list

You can subscribe to the i18n mailing list, but at the moment it has very low traffic (almost none).

http://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n

Code Structure

First, you have a Language object in Language.php. This object contains all the localisable message strings, as well as other important language-specific settings and custom behavior (uppercasing, lowercasing, printing dates, formatting numbers, etc.)

The object is constructed from two sources: subclassed versions of itself (classes) and Message files (messages).

There's also the MessageCache class, which handles input of text via the MediaWiki namespace. And there's the wfMsg*() functions in GlobalFunctions.php. We have large amounts of message retrieval code in GlobalFunctions.php.

General use (for developers)

Language objects

There are two ways to get a language object. You can use the globals $wgLang and $wgContLang for the user interface language and the content language respectively. For an arbitrary language you can construct an object with Language::factory( 'en' ), replacing en with the code of the language. The list of codes is in languages/Names.php.

Language objects are needed for language-specific operations, most often number, time and date formatting, but also for constructing lists and other things. There are multiple layers of caching and merging with fallback languages, but the details are irrelevant in normal use.
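
To give a feel for what language-specific number formatting involves, here is a minimal sketch in plain PHP. It does not use the Language class; formatNumber() and the separator table are hypothetical stand-ins invented for this illustration.

```php
<?php
// Hypothetical table: decimal separator and thousands separator per language code.
$separators = array(
    'en' => array( '.', ',' ),
    'de' => array( ',', '.' ),
);

// Hypothetical helper, not part of MediaWiki.
function formatNumber( $number, $lang, array $separators ) {
    // Unknown languages use the English separators in this sketch.
    $pair = isset( $separators[$lang] ) ? $separators[$lang] : $separators['en'];
    list( $decimalPoint, $thousandsSep ) = $pair;
    // Show two decimals only when the number is not a whole number.
    $decimals = ( floor( $number ) == $number ) ? 0 : 2;
    return number_format( $number, $decimals, $decimalPoint, $thousandsSep );
}

echo formatNumber( 1234567, 'en', $separators ), "\n";   // 1,234,567
echo formatNumber( 1234567.5, 'de', $separators ), "\n"; // 1.234.567,50
```

In real code the language object does this work, for example $wgLang->formatNum( $count ) as used in the examples on this page.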

Using messages

MediaWiki uses a central repository of messages which are referenced by keys in the code. This is different from, for example, Gettext, which just extracts the translatable strings from the source files. A key-based system makes some things easier, like refining the original texts and tracking changes to messages. The drawback is of course that the list of used messages and the list of source texts for those keys can get out of sync. In practice this isn't a big problem; at worst, some messages which are not used anymore stay up for translation.

The message system in MediaWiki is quite complex, a bit too complex. One of the reasons for this is that MediaWiki is a web application. Messages can go through all kinds of processing. The major ones, covering almost all cases, are:

  1. as-is, no processing at all
  2. light wiki-parsing, parserfunction references starting with {{ are replaced with their results
  3. full wiki-parsing

Case 1 is for further processing, not really for user-visible messages. Light wiki-parsing should always be combined with HTML escaping.

Recommended ways

Longer messages that are not used hundreds of times on a page:

  • OutputPage::addWikiMsg
  • OutputPage::wrapWikiMsg
  • wfMsgExt ... parse, parseinline

The OutputPage methods parse messages and add them directly to the output buffer. wfMsgExt can be used when a message should not be added to the output buffer. parseinline removes the enclosing HTML tags from the parsed result, usually <p>..</p>, but it can generate invalid code if the parsed result has no single root tag, for example <p>..</p><p>..</p>. Usage examples:

 $wgOut->addWikiMsg( 'foobar', $wgLang->formatNum( count($items) ) );
 $wgOut->wrapWikiMsg( "<div class=\"baz\">\n$1\n</div>", array( 'foobar', $wgUser->getName() ) ); # double quotes, so that \n is a real newline
 $text = wfMsgExt( 'foobar', 'parse', $wgLang->date( $ts ) );

Other messages with light wiki-parsing can use wfMsg and wfMsgExt with the parsemag option. wfMsgExt must always be used if the message has parts that depend on linguistic information, like {{PLURAL:$1}}. Do not use wfMsg or wfMsgHtml for those kinds of messages! They seem to work but are broken.

 $out = Xml::submitButton( wfMsg( 'foobar' ) ); # no linguistic information
 $out = Xml::label( wfMsgExt( 'foobar', 'parsemag', $wgLang->formatNum( $count ) ) ); # uses plural on $count


Some messages have mixed escaping and parsing, most commonly when raw links that should not be escaped are used in messages. wfMsgHtml and wfMsgExt with replaceafter can do that. Note that there cannot be any linguistically dependent variables in those messages! Be especially wary of using wfMsgHtml: it only escapes the message, not the parameters. This has caused at least one XSS vulnerability in MediaWiki. If you do not need the replaceafter functionality, use a parsing function, or a non-parsing function with htmlspecialchars().
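
The danger of escaping only the message text can be shown without MediaWiki at all. The following is a minimal sketch in plain PHP, in which the message text and the user name are made-up examples and strtr() merely stands in for parameter substitution.

```php
<?php
$message  = 'Page last edited by $1';       // hypothetical message text
$userName = '<script>alert(1)</script>';    // untrusted input

// Unsafe: only the message text is escaped, the parameter goes in raw.
$unsafe = strtr( htmlspecialchars( $message ), array( '$1' => $userName ) );

// Safer: the parameter is escaped as well before it is substituted.
$safe = strtr(
    htmlspecialchars( $message ),
    array( '$1' => htmlspecialchars( $userName ) )
);

echo $unsafe, "\n"; // still contains a live <script> tag
echo $safe, "\n";   // Page last edited by &lt;script&gt;alert(1)&lt;/script&gt;
```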

Short list of functions to avoid:

  • wfMsgHtml (don't use unless you really want unescaped parameters)
  • wfMsgWikiHtml (breaks up linguistic functions, as does wfMsg)
  • OutputPage::parse and parseInline, addWikiText (if you know the message, use addWikiMsg or wrapWikiMsg)

Remember that almost all Xml:: functions escape everything fed into them, so avoid double-escaping, and do not pass parsed text to them.

Caching

MediaWiki has lots of caching mechanisms built in, which make the code somewhat more difficult to understand. Since 1.16 there is a new caching system, which caches messages either in cdb-files or in the database. Customised messages are cached in the filesystem and in memcached (or alternative), depending on the configuration.

Behavior

Things that are localizable:

  • Weekdays (and abbrev)
  • Months (and abbrev)
  • Bookstores
  • Skin names
  • Math names
  • Date preferences
  • Date format
  • Default date format
  • Date preference migration map
  • Default user option overrides
  • Language names
  • Timezones
  • Character encoding conversion via iconv
  • UpperLowerCase first (needs casemaps for some)
  • UpperLowerCase
  • Uppercase words
  • Uppercase word breaks
  • Case folding
  • Strip punctuation for MySQL search
  • Get first character
  • Alternate encoding
  • Recoding for edit (and then recode input)
  • RTL
  • Direction mark character depending on RTL
  • Arrow depending on RTL
  • Languages where italics cannot be used
  • Number formatting (commafy, transform digits, transform separators)
  • Truncate (multibyte)
  • Grammar conversions for inflected languages
  • Plural transformations
  • Formatting expiry times
  • Segmenting for diffs (Chinese)
  • Convert to variants of language
  • Language specific user preference options
  • Link trails, e.g.: [[foo]]bar
  • Language code (RFC 3066)

Neat functionality:

  • I18N sprintfDate
  • Roman numeral formatting

Parameter substitution

Some messages take parameters. They are represented by $1, $2, $3, … in the (static) message texts, and replaced at run time. Typical parameter values are numbers ("Delete 3 versions?"), or user names ("Page last edited by $1"), page names, links, and so on, or sometimes other messages. They can be of arbitrary complexity.
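
As an illustration of the substitution step only, here is a minimal sketch in plain PHP. replaceParams() is a hypothetical helper written for this page, not a MediaWiki function, and it does no escaping or parsing.

```php
<?php
// Hypothetical helper: replace $1, $2, ... with the given values, in order.
function replaceParams( $text, array $params ) {
    $map = array();
    foreach ( array_values( $params ) as $index => $value ) {
        $map['$' . ( $index + 1 )] = (string)$value;
    }
    // strtr() tries the longest keys first, so $10 is never mistaken for $1.
    return strtr( $text, $map );
}

echo replaceParams( 'Delete $1 versions of $2?', array( 3, 'Main Page' ) ), "\n";
// Delete 3 versions of Main Page?
```

Replacing all placeholders in one pass also means that a parameter value which itself contains "$2" is not substituted a second time.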

Switches in messages …

Parameter values at times influence the exact wording, or grammatical variations, in messages. Rather than resorting to ugly constructs like "$1 (sub)page(s) of his/her userpage", we make switches depending on values known at run time. The (static) message text then supplies each of the possible choices in a list, preceded by the name of the switch and a reference to the value making a difference. This very much resembles the way parser functions are called in MediaWiki. Several types of switches are available.

… on numbers via PLURAL

MediaWiki supports plurals, which makes for a nicer-looking product. For example:

'undelete_short' => 'Undelete {{PLURAL:$1|one edit|$1 edits}}',

Language-specific implementations of PLURAL: are found in files such as LanguageFr.php (for French, code fr) or LanguageCs.php (for Czech, code cs).
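
The idea behind such language-specific implementations can be sketched in plain PHP as follows. This is a simplified illustration, not the code from the files named above: pluralIndex() and plural() are hypothetical names, and the rules shown are the commonly described ones for English, French and Czech.

```php
<?php
// Hypothetical rule table: returns the index of the plural form to use.
function pluralIndex( $lang, $count ) {
    switch ( $lang ) {
        case 'fr':
            // French uses the singular for 0 and 1.
            return ( $count <= 1 ) ? 0 : 1;
        case 'cs':
            // Czech distinguishes one, a few (2 to 4), and everything else.
            if ( $count == 1 ) {
                return 0;
            }
            return ( $count >= 2 && $count <= 4 ) ? 1 : 2;
        default:
            // English-like rule: singular only for exactly 1.
            return ( $count == 1 ) ? 0 : 1;
    }
}

// Pick a form; if fewer forms were supplied than needed, use the last one.
function plural( $lang, $count, array $forms ) {
    $index = min( pluralIndex( $lang, $count ), count( $forms ) - 1 );
    return $forms[$index];
}

$count = 3;
echo 'Undelete ', plural( 'en', $count, array( 'one edit', "$count edits" ) ), "\n";
// Undelete 3 edits
```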

… on user names via GENDER

to be added

… on use context inside sentences via GRAMMAR

Grammatical transformations for agglutinative languages are also available. This was an absolute necessity for Finnish, for example, in order to make the language files site-independent, i.e. to remove the Wikipedia references. In Finnish, "about Wikipedia" becomes "Tietoja Wikipediasta" and "you can upload it to Wikipedia" becomes "Voit tallentaa tiedoston Wikipediaan". Suffixes are added depending on how the word is used, plus minor modifications to the base. There is a long list of exceptions, but since only a few words needed to be translated, such as the site name, we didn't need to include it.

MediaWiki has grammatical transformation functions for over 20 languages. Some of these are just dictionaries for Wikimedia site names, but others have simple algorithms which will fail for all but the most common cases.

Even before MediaWiki had arbitrary grammatical transformation, it had a nominative/genitive distinction for month names. This distinction is necessary if you wish to substitute month names into sentences.
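
To illustrate the kind of "simple algorithm" meant above, here is a deliberately naive sketch in plain PHP that reproduces the two Finnish examples. convertGrammarFi() and the case names passed to it are assumptions of this sketch, and, as the text warns, such a rule fails for all but the most common cases (it ignores vowel harmony and every exception).

```php
<?php
// Hypothetical, heavily simplified transformation for words ending in a vowel.
function convertGrammarFi( $word, $case ) {
    switch ( $case ) {
        case 'elative':
            // "out of / about": append -sta.
            return $word . 'sta';
        case 'illative':
            // "into": lengthen the final vowel and append -n.
            return $word . substr( $word, -1 ) . 'n';
        default:
            return $word;
    }
}

echo 'Tietoja ', convertGrammarFi( 'Wikipedia', 'elative' ), "\n";
// Tietoja Wikipediasta
echo 'Voit tallentaa tiedoston ', convertGrammarFi( 'Wikipedia', 'illative' ), "\n";
// Voit tallentaa tiedoston Wikipediaan
```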

Filtering special characters in parameters and messages

The other (much simpler) issue with parameter substitution is HTML escaping. Despite being much simpler, MediaWiki does a pretty poor job of it. We have a plethora of poorly-named wfMsg*() functions, including the multitasking wfMsgExt(), with lots of ways to slip up and let through unescaped user input. There may be work done to clean this up at some stage in the future.

Message sources

Messages are obtained from these sources:

  • The MediaWiki namespace. It allows wikis to adapt, or override, all of their messages, when standard messages do not fit or are not desired.
    • MediaWiki:message-name is the default message,
    • MediaWiki:message-name/language-code is the message to be used when a user has selected a language other than the wiki's default language.
  • Message files.
    • MediaWiki itself, and a few extensions, use one file per language, called MessagesZxx.php, where zxx is the language code for the language.
    • Most extensions use a combined message file holding all messages in all languages, usually named after the extension, and having an .i18n.php ending.
    • Very few extensions are using another, individual way.
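
The lookup order implied by this list can be sketched in plain PHP as follows. This is only an illustration under stated assumptions: lookupMessage() and both data arrays are hypothetical, the fallback to English and the placeholder for missing messages are assumptions of the sketch, and the real code adds caching and longer fallback chains.

```php
<?php
// Hypothetical data standing in for the MediaWiki namespace and the message files.
$wikiPages = array(
    'MediaWiki:Mainpage'    => 'Welcome',
    'MediaWiki:Mainpage/de' => 'Willkommen',
);
$messageFiles = array(
    'en' => array( 'mainpage' => 'Main Page', 'search' => 'Search' ),
    'de' => array( 'mainpage' => 'Hauptseite' ),
);

function lookupMessage( $key, $userLang, $siteLang, array $wikiPages, array $messageFiles ) {
    // 1. A page in the MediaWiki namespace overrides the message files.
    $title = 'MediaWiki:' . ucfirst( $key );
    if ( $userLang !== $siteLang ) {
        $title .= '/' . $userLang;
    }
    if ( isset( $wikiPages[$title] ) ) {
        return $wikiPages[$title];
    }
    // 2. The message file of the requested language, then English (assumed fallback).
    foreach ( array( $userLang, 'en' ) as $lang ) {
        if ( isset( $messageFiles[$lang][$key] ) ) {
            return $messageFiles[$lang][$key];
        }
    }
    // 3. Nothing found: return a visible placeholder (an assumption of this sketch).
    return '<' . $key . '>';
}

echo lookupMessage( 'mainpage', 'en', 'en', $wikiPages, $messageFiles ), "\n"; // Welcome
echo lookupMessage( 'mainpage', 'de', 'en', $wikiPages, $messageFiles ), "\n"; // Willkommen
echo lookupMessage( 'search', 'de', 'en', $wikiPages, $messageFiles ), "\n";   // Search
```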

Internationalization hints

Translators ask developers to consider some hints so as to make their work easier and more efficient. Even if only adding or editing messages in English, one should be aware of the needs of all languages. Messages are translated to more than 300 languages each, which should be done in the best possible way.

There are two main places where you can find the assistance of experienced and knowledgeable people regarding i18n:

Please do ask them.

Avoid message reuse

Translators encourage avoiding message reuse. Although two concepts can be expressed with the same word in English, this doesn't mean they can be expressed with the same word in every language. "OK" is a good example: in English this is used for a generic button label, but in some languages they prefer to use a button label related to the operation which will be performed by the button.

An easy way to duplicate messages across all languages would reduce the programmer's need to reuse messages. Preferably you would have a reference rather than a full copy, to reduce maintenance.

Avoid patchwork messages

Languages have varying word orders, and complex grammatical and syntactic rules. Messages put together from lots of pieces of text, possibly with some indirection, are very hard, if not impossible, to translate. It is better to make each message a complete sentence, with a full stop at the end. Several sentences can usually be combined into a text block much more easily, if needed.

Be aware of PLURAL use on all numbers

When a number has to be inserted into a message text, be aware that some languages will have to use PLURAL on it even if it is always larger than 1. The reason is that PLURAL in languages other than English can make very different and complex distinctions, comparable to English 1st, 2nd, 3rd, 4th, … 11th, 12th, 13th, … 21st, 22nd, 23rd, … etc.

Do not try to supply three different messages for cases like 0, 1, more items counted. Rather let one message take them all, and leave it to translators and PLURAL to properly treat possible differences of presenting them in their respective languages.

Separate times from dates in sentences

Some languages have to insert something between a date and a time which grammatically depends on other words in a sentence. Thus they will not be able to use date/time combined. Others may find the combination convenient, thus it is usually the best choice to supply three parameter values (date/time, date, time) in such cases.
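
A minimal sketch of supplying all three values, in plain PHP. The message texts, the timestamp and the formats are made-up examples, and strtr() stands in for parameter substitution; in MediaWiki the language object would format the values.

```php
<?php
$timestamp = 1124539200; // 2005-08-20 12:00:00 UTC, an arbitrary example

$date = gmdate( 'j F Y', $timestamp ); // 20 August 2005
$time = gmdate( 'H:i', $timestamp );   // 12:00

// Always pass all three values; each translation picks what it needs.
$params = array(
    '$1' => $time . ', ' . $date, // combined date and time
    '$2' => $date,
    '$3' => $time,
);

// One language may be happy with the combined form ...
$combined = 'This page was last modified on $1.';
// ... another needs its own words between date and time.
$separate = 'This page was last modified on $2, at $3.';

echo strtr( $combined, $params ), "\n";
echo strtr( $separate, $params ), "\n";
```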

Users have grammatical genders

When a message talks about a user, or relates to a user, or addresses a user directly, the user name should be passed to the message as a parameter. Thus languages having to, or wanting to, use proper gender dependent grammar, can do so. This should be done even when the user name is not intended to appear in the message, such as in "inform the user on his/her talk page", which is better made "inform the user on {{GENDER:$1|his|her|their}} talk page" in English as well.
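
How such a switch could be expanded is sketched below in plain PHP. expandGender() is a hypothetical helper, not the MediaWiki implementation; the form order (male, female, unknown) follows the example above, and falling back to the first form when one is missing is an assumption of this sketch.

```php
<?php
// Hypothetical helper: expand {{GENDER:$1|male form|female form|unknown form}}.
function expandGender( $text, $gender ) {
    $order = array( 'male' => 0, 'female' => 1, 'unknown' => 2 );
    $index = isset( $order[$gender] ) ? $order[$gender] : 2;

    return preg_replace_callback(
        '/\{\{GENDER:[^|}]*\|([^}]*)\}\}/',
        function ( $match ) use ( $index ) {
            $forms = explode( '|', $match[1] );
            // Assumption: use the first form when the wanted one is missing.
            return isset( $forms[$index] ) ? $forms[$index] : $forms[0];
        },
        $text
    );
}

$message = 'Inform the user on {{GENDER:$1|his|her|their}} talk page.';
echo expandGender( $message, 'female' ), "\n";  // Inform the user on her talk page.
echo expandGender( $message, 'unknown' ), "\n"; // Inform the user on their talk page.
```

Note that the user name in $1 is only used to look up the gender; it never appears in the output of this message.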

Avoid {{SITENAME}} in messages

{{SITENAME}} has several disadvantages. It can be anything (acronym, word, short phrase, etc.) and, depending on language, may need {{GRAMMAR}} on each occurrence. No matter what, very likely in most wiki languages, each message having {{SITENAME}} will need review for each new wiki installed. When there is no general GRAMMAR implementation for a language, which is almost always the case, sysops will have to add or amend PHP code so as to get {{GRAMMAR}} for {{SITENAME}} working. This requires both more skills, and more understanding, than otherwise. It is more convenient to have generic references like "this wiki". This does not keep installations from altering these messages to use {{SITENAME}}, but at least they don't have to, and they can postpone message adaptation until the wiki is already running and used.

Avoid references to screen layout and positions

What is rendered where depends on skins. Most often screen layouts of languages written from left to right are mirrored compared to those used for languages written from right to left, but not always, and for some languages and wikis, not entirely. Handheld devices, narrow windows, and so on show blocks underneath each other that appear side by side on large displays. Since user-selected and user-written JavaScript gadgets can, and do, hide parts or move things around in unpredictable ways, there is no reliable way of knowing the actual screen layout. Acoustic screen readers and other auxiliary devices do not even have a concept of layout. So, you cannot refer to layout positions.

Have message elements before and after input fields

While most modern Western European languages, including English, allow efficient use of prompting in the form "item colon space input-field", that is not so for many other languages, unless sacrificing good grammatical taste or politeness, or resorting to excessively complicated and lengthy wording. Even in English, you often want to use "Distance: ___ feet" rather than "Distance (in feet): ___". Leaving <textarea> aside, think of every input field as following the "Distance: ___ feet" pattern, and give it two messages, even if the second one is most often empty in English.
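
The "two messages per input field" pattern could look like the following sketch in plain PHP. labelledInput() and the message keys are hypothetical names invented for this illustration.

```php
<?php
// Hypothetical message pair for one input field; the second may be empty.
$messages = array(
    'distance-before' => 'Distance:',
    'distance-after'  => 'feet',
);

// Hypothetical helper: wrap an input field in a leading and a trailing message.
function labelledInput( $name, $before, $after ) {
    $html = htmlspecialchars( $before )
        . ' <input name="' . htmlspecialchars( $name ) . '" /> '
        . htmlspecialchars( $after );
    // Trim so that an empty message leaves no stray space.
    return trim( $html );
}

echo labelledInput( 'distance', $messages['distance-before'], $messages['distance-after'] ), "\n";
// Distance: <input name="distance" /> feet
```

A translation that needs all of its wording after the field can then leave the first message empty instead of the second.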

An alternate solution might be allowing the placement of input fields via $i parameters.

Messages are usually longer than you think!

Skimming foreign language message files, you find messages almost never shorter than Chinese ones, rarely shorter than English ones, and most usually much longer than English ones.

Especially in forms, in front of input fields, English messages tend to be terse and short. That is often not kept in translations. Especially genuinely untechnical third world languages, and vernacular, medieval, or ancient languages, require multiple words or even complete sentences to explain foreign or technical prompts. For example, "TSV file:" may have to be translated as: "Please type a name here which denotes a collection of computer data that is comprised of a sequentially organized series of typewritten lines which themselves are organized as a series of informational fields each, where said fields of information are fenced, and the fences between them are single signs of the kind that slips a typewriter carriage forward to the next predefined position each. Here we go: _____ (thank you)". This is admittedly an extreme example, but you get the idea. Imagine this sentence in a column in a form where each word occupies a line of its own, and the input field is vertically centered in the next column. :-(

Avoid using very close, similar, or identical words to denote different things, or concepts

For example, pages may have older revisions (of a specific date, time, and edit), comprising past versions of said page. The words revision and version can be used interchangeably. A problem arises when versioned pages are revised, and the revision, i.e. the process of revising them, is being mentioned, too. This may not pose a serious problem when the two meanings of "revision" have different translations. Do not rely on that, however. It is better to avoid the use of "revision" aka "version" altogether, then, so as to avoid it being misinterpreted.

Basic words may have unforeseen connotations, or not exist at all

There are some words that are hard to translate because of their very specific use in MediaWiki. Some may not be translated at all. For example, "namespace" and "apartment" translate the same in Kölsch. There is no word "user" relating to "to use something" in several languages. Sticking to Kölsch, they say "corroborator and participant" in one word since any reference to "use" would too strongly imply "abuse" as well. "Wiki farm" is translated as "stable full of wikis", since a single crop farm would be a contradiction in terms in the language, and not understood, etc.

Expect untranslated words

It is not uncommon that computerese English is not translated and taken as loanwords, or foreign words. In the latter case, technically correct translations mark them as belonging to another language, usually with appropriate HTML markup, such as <span lang="en"></span>. Thus make sure that your message output handler passes such markup along unmolested, even if you do not need it in English, or in your language.

Symbols, colons, brackets, etc. are parts of messages

Many symbols are translated, too. Some scripts have other kinds of brackets than the Latin script has. A colon may not be appropriate after a label or input prompt in some languages. Having those symbols included in messages leads to better and less anglo-centric translations, and incidentally reduces code clutter.

Do not expect symbols and punctuation to survive translation

Languages written from right to left (as opposed to English) usually swap arrow symbols being presented with "next" and "previous" links, and their placement relative to a message text may, or may not, be inverted as well. Ellipsis may be translated to "etc." or to words. Question marks, exclamation marks, colons do appear at other places than at the end of sentences, or not at all, or twice. As a consequence, always include all of those in your messages, never insert them programmatically.

Use full stops

Do terminate normal sentences with full stops. This is often the only indicator for a translator to know that they are not headlines or list items, which may need to be translated differently.

Link anchors

Link anchors can be put into messages in several technical ways:

  1. via wikitext: … [[a wiki page|anchor]] …
  2. via wikitext: … [some-url anchor] …
  3. the anchor text is a message in the MediaWiki name space. Avoid it!

The latter is often hard or impossible for translators to handle; avoid patchwork messages here, too. Make sure that "some-url" does not contain spaces.

Care for your wording. Link anchors play an important role in search engine assessment of pages, both the linking ones, and the ones linked to. Make sure that the anchor describes the target page well. Do avoid commonplace and generic words! For example, "Click here" is an absolute no-go, since target pages are never about "click here". Do not put that in sentences around links either, because "here" is not the place to click. Use precise words telling what a user will get to when following the link, such as "You can upload a file if you wish."

Keeping messages centralized and in sync

English messages are very rarely out of sync with the code. Experience has shown that it's convenient to have all the English messages in the same place. Revising the English text can be done without reference to the code, just like translation can. Programmers sometimes make very poor choices for the default text.

Message documentation

There is a pseudo-language with the code qqq (message documentation). It is one of the ISO 639 codes reserved for private use. There we do not keep translations of each message, but collect English sentences about each message: telling us where it is used, giving hints about how to translate it, enumerating and describing its parameters, linking to related messages, etc. In translatewiki.net, these hints are shown to translators when they hit the "Edit" button for messages.
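
As a rough illustration, a combined extension message file with documentation might look like the sketch below. The "myextension" keys and all texts are hypothetical, and the final check is just one possible way to spot messages that lack documentation.

```php
<?php
// Sketch of a combined message file; keys and texts are made-up examples.
$messages = array();

$messages['en'] = array(
    'myextension-desc'  => 'Adds an example special page.',
    'myextension-count' => 'Found {{PLURAL:$1|one page|$1 pages}}.',
);

// Message documentation: English hints for translators, not translations.
$messages['qqq'] = array(
    'myextension-desc'  => 'Short description of the extension.',
    'myextension-count' => 'Shown above the result list. $1 is the number of pages found.',
);

// List the English messages that have no documentation yet.
$undocumented = array_keys( array_diff_key( $messages['en'], $messages['qqq'] ) );
echo count( $undocumented ), " undocumented message(s)\n";
```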

Programmers are encouraged to contribute to message documentation.

License

Any edits made to the language must be licensed under the terms of the GNU General Public License (and GFDL?) to be included in the MediaWiki software.

See also