Language Converter

From mediawiki.org

Language Converter (LC) is a MediaWiki module that automatically converts the source text of a page to a different writing system or orthography of the same language on the fly. It is used extensively on Chinese Wikipedia to present pages localized for six regional locales, and on Serbian Wikipedia to convert between Cyrillic and Latin scripts, for instance.

Background

Language Converter aims to solve the problem of digraphia, where a single language is written using multiple writing systems or orthographies. It was initially created for Chinese Wikipedia, and support was later expanded to other languages. In the implementations for some languages, it performs simple character substitution. In contrast, the implementation for Mandarin Chinese is the most complex and feature-rich, as it has to address problems unique to the Chinese writing system. Therefore, the background is presented below using Mandarin Chinese as an example.

Wikipedias were created on the basis of spoken languages, and due to this convention, Chinese Wikipedia was created as a multi-locale wiki. However, Chinese is a complex language:

  • Chinese has script-internal synchronic digraphia. Its writing system has two variants—Traditional Chinese uses the original and complicated Chinese characters, mainly used in Taiwan, Hong Kong and Macau; Simplified Chinese uses a simplified version of Chinese characters instead, and is used prominently in Mainland China, Singapore and Malaysia.
  • Chinese is a pluricentric language. During its evolution, different regions have started to standardize written Chinese in different ways. For example, people in Simplified Chinese-using regions are not formally taught to write or read in the traditional version, and vice versa.

To make Chinese Wikipedia accessible to Chinese speakers all over the world, a conversion engine is needed; otherwise, articles on the same topics would have to be manually duplicated.

Character mapping: from 1-1 to n-n

An initial approach is to write a 1-1 mapping of these characters. However, during the simplification process, some characters were dropped or merged into one of the characters, most of the time with a simpler form. For example, 幹 (to do) and 乾 (dry) were merged into 干 whose meaning depends on the context (e.g. the word it forms, or order of characters to determine if it is a verb or an adjective). Therefore, this is at least a 1-n relationship.
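
The merge can be made concrete with a small sketch. The table below holds only the two Traditional-to-Simplified pairs from the example above (it is illustrative, not the real conversion table); inverting it shows why the Simplified-to-Traditional direction cannot be a plain 1-1 lookup.

```python
from collections import defaultdict

# Illustrative Traditional -> Simplified pairs taken from the example above.
TRAD_TO_SIMP = {"幹": "干", "乾": "干"}


def invert(table):
    """Group source characters by the character they are converted to."""
    reverse = defaultdict(list)
    for source, target in table.items():
        reverse[target].append(source)
    return dict(reverse)


SIMP_TO_TRAD = invert(TRAD_TO_SIMP)

# One Simplified character now has several Traditional candidates,
# so choosing between them needs context.
print(SIMP_TO_TRAD)  # {'干': ['幹', '乾']}
```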

However, it should be noted that all the dropped characters also have their own special meanings, sometimes close, sometimes vastly different. In fact, depending on the context, the character 乾 (the aforementioned "dry") has another, philosophical meaning from ancient Chinese literature, describing Heaven or a creative force. In this case, it is preserved as-is, unconverted. An example of this is 乾隆, the name of a Chinese emperor. Furthermore, throughout the history of simplification, some of the characters that were supposed to be merged remained unchanged in the names of celebrities because of their distinctiveness. In some cases, a bespoke simplified version was created specifically to maintain such distinction.

Regional divergence in terminology: n words for n meanings

By now, it is clear that we need an n-n relationship between single characters. Unfortunately, this is still not the full picture. Since many industries (e.g. Information Technology) have developed separately in these regions over the past few decades, a considerable number of foreign words have been translated differently, resulting in separate orthographies for different regions. For example, computer RAM is translated as 内存 (lit. internal / inside + to store / to exist) in Mainland China. If we mapped it character by character to Traditional Chinese, it would be 內存 (notice the difference between 内 and 內 as the first character). However, it is actually called 記憶體 in Taiwan. Using 內存 in Taiwan might cause confusion due to the pluricentric nature of Chinese.

It seems that a glossary of these terms should be maintained. Worse still, the mappings of these words are sometimes n-to-n relationships as well. To illustrate, table tennis is translated as 乒乓球 in all regions, and billiards is translated as 撞球 in Taiwan and 台球 in Mainland China, Hong Kong and Macau. But here's the catch: there is another word, 桌球, which literally translates to "table ball". It refers to billiards in Hong Kong and Mainland China, but to table tennis in Taiwan! In the source text, the meaning of 桌球 can only be inferred from the context, so its conversion rule cannot fit into a single glossary list.

Comparison of translations across Chinese-speaking regions
Term | Taiwan | Mainland China | Hong Kong & Macau
Table tennis | 乒乓球, 桌球 | 乒乓球 | 乒乓球
Billiards | 撞球 | 桌球, 台球 | 桌球, 台球

Special challenge: Chinese word boundaries and "over-conversion" trap

You might think we’ve covered the major quirks of this natural language. Unfortunately we've overlooked the elephant in the room—Chinese doesn't have explicit word separators when written, therefore the conversion system might incorrectly regard characters across word boundaries as a single word!

One infamous example is 海内存知己 (lit. within the sea exist friends; fig. [if] you have a friend who knows your heart...), which is supposed to be segmented as 海内/存/知己 (lit. within the sea / exist / friends). It is a line of a famous ancient Chinese poem, thus should be converted word by word. However, as we discussed previously, 内存 together stands for RAM in Mainland China, which is converted to 記憶體 in technical context for the Taiwan variant. As the substitution is performed greedily, it would become 海/記憶體/知己 (lit. sea / RAM / friends), which is totally nonsensical! This is known as over-conversion (过度转换).
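
A minimal sketch of this failure, assuming the converter applies the longest matching rule at each position without any notion of word boundaries. The function name and the two-entry table are hypothetical and only mirror the example above.

```python
def convert(text, rules):
    """Greedy substitution: at each position, apply the longest matching rule."""
    max_len = max((len(key) for key in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical zh-cn -> zh-tw table: one character rule, one word rule.
TO_TW = {"内": "內", "内存": "記憶體"}

# The word rule fires across the boundary between 海内 and 存.
print(convert("海内存知己", TO_TW))  # 海記憶體知己
```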

Ideal conversion framework

To conclude, the technical requirements of our conversion system are as follows:

  1. It needs to include a basic character-to-character conversion table.
  2. For exceptions, it should be able to perform word-level conversion.
  3. The glossary tables for word-level conversion should be separated by the topic or context of the page, and applied either manually or intelligently.
  4. It should have a word segmentation engine (for writing systems like Chinese) to prevent over-conversion.

Language Converter is a partial implementation of these requirements. It allows the original wikitext to be stored in a mixture of all possible language variants with conversion syntax, and to be served according to the user's preferred variant.

Variant selection

Language Converter is implemented and enabled for a specific set of languages. As of Jan 2026, it is as follows (see the source code for up-to-date info):

Language name & code | Supported variants | Versions
Balinese (ban) | Balinese (ban-bali); Latin (ban-latn) | 1.36
Crimean Tatar (crh) | Latin (crh-latn); Cyrillic (crh-cyrl) |
English (en) (available only when $wgUsePigLatinVariant is enabled, for testing purposes) | Pig Latin (en-x-piglatin) | 1.30 (Gerrit change 72053)
Gan (gan) | Simplified (gan-hans); Traditional (gan-hant) |
Inuktitut (iu) | Latin (ike-latn); Syllabics (ike-cans) | 1.18
Kazakh (kk) (removed due to lack of use and poor implementation quality) | Cyrillic (kk-cyrl); Latin (kk-latn); Arabic (kk-arab) | 1.41 (Gerrit change 972472)
Kurdish (ku) | Latin (ku-latn); Arabic (ku-arab) | 1.11 (r23067)
Serbo-Croatian (sh) | Cyrillic (sh-cyrl); Latin (sh-latn) | 1.40
Tachelhit (shi) | Tifinagh (shi-tfng); Latin (shi-latn) | 1.19
Serbian (sr) | Cyrillic (sr-ec); Latin (sr-el) |
Tajik (tg) | Cyrillic (tg-cyrl); Latin (tg-latn) |
Talysh (tly) | Cyrillic (tly-cyrl); Latin (tly-latn) | 1.36
Uzbek (uz) | Cyrillic (uz-cyrl); Latin (uz-latn) | 1.20
Wu (wuu) | Simplified (wuu-hans); Traditional (wuu-hant) | 1.41
Standard Moroccan Tamazight (zgh) | Tifinagh (zgh-tfng); Latin (zgh-latn) | 1.42
Chinese (zh) | Simplified (zh-hans): Simplified, Chinese (zh-cn); Simplified, Malaysia (zh-my); Simplified, Singapore (zh-sg). Traditional (zh-hant): Traditional, Taiwan (zh-tw); Traditional, Hong Kong (zh-hk); Traditional, Macau (zh-mo) |

(Note: the generic Simplified and Traditional variants are not aware of regional word variations. For example, zh-hant requires the Traditional character set, but words and phrases from either Hong Kong or Taiwan are allowed.)

The base generic language code is inherently included in the supported variants and is regarded as a mixture of all possible variants. For example, the zh variant represents a mixture of all zh-* variants.

Note that the language code must be set to the generic code in the left column of the table to enable Language Converter. Setting language to the code on the right (e.g. zh-cn instead of zh) will not take effect.

Support for some additional languages is under discussion, e.g. Cantonese (phab:T59106). In the meantime, Cantonese Wikipedia uses a local gadget for conversion rather than Language Converter.

Language code tags for scripts should follow the ISO 15924 standard. However, for historical reasons, Serbian is an exception, which uses sr-ec instead of sr-cyrl, and sr-el instead of sr-latn. This is discussed in phab:T117845.

Automatic conversion

Language Converter automatically renders the original wikitext, which is a mixture of all possible variants, in the target variant that the user requests, on the fly. During this process, glossary tables are taken into consideration.

Technically, there are four layers of glossary tables: the first one is built into the MediaWiki codebase; the second one is defined on MediaWiki:Conversiontable and is configured per wiki; the third one acts as glossary tables on wiki for each topic, referred to as common conversion group; and the last one refers to inline glossary markups in each article. The higher layer takes precedence over the lower layer.
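
One way to picture the layering is as dictionaries merged in order. This sketch assumes that the layers listed later (those closer to the article) are the "higher" ones; the function name and all entries are placeholders, not the contents of any real table.

```python
def merge_layers(*layers):
    """Merge glossary layers given from lowest to highest precedence."""
    merged = {}
    for layer in layers:
        # A later (higher) layer overwrites earlier entries with the same key.
        merged.update(layer)
    return merged


# Placeholder zh-cn -> zh-tw entries for each layer.
builtin = {"内": "內"}               # built into the MediaWiki codebase
sitewide = {}                        # MediaWiki:Conversiontable
group_it = {"内存": "記憶體"}         # common conversion group for IT
inline = {"内存": "內存"}             # inline markup in one article

print(merge_layers(builtin, sitewide, group_it))
# {'内': '內', '内存': '記憶體'}
print(merge_layers(builtin, sitewide, group_it, inline))
# {'内': '內', '内存': '內存'}
```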

Common conversion group

Common conversion group is not a functionality native to Language Converter, but rather a series of templates and Lua modules that inject fixed sets of glossary rules into the article. An RFC was proposed to merge it into LC itself, alongside improved functionality.

On Chinese Wikipedia, these groups are implemented on the subpages of zh:Template:CGroup and zh:Module:CGroup. For example, zh:Module:CGroup/IT contains the glossaries for articles related to information technology. These rules are brought into article scope by the zh:Template:NoteTA template, whose usage is:

{{NoteTA|G1=IT|G2=<other group>|...}}

Chinese word segmentation problem

Currently, Language Converter is not able to detect Chinese word boundaries. As a result, the problem of over-conversion is mitigated by extensive dogfooding from editors. For instance, the problem of 海内存知己 described above is solved by adding a longer, more specific rule, 海内存知己<=>海內存知己, with higher precedence.

Another method to solve this problem is to use an empty -{}- as a means to split Chinese words. Splitting the phrase above into 海内-{}-存知己 will effectively disable the rule for 内存.
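
Both workarounds can be sketched as follows, assuming longest-match substitution and treating an empty -{}- as a hard boundary. The helper names and the small rule table are hypothetical and do not reflect how MediaWiki implements this internally.

```python
def convert(text, rules):
    """Apply the longest matching rule at each position (no word segmentation)."""
    max_len = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in rules:
                out.append(rules[text[i:i + size]])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_with_splits(text, rules, marker="-{}-"):
    """Convert each piece between empty markers separately, dropping the markers."""
    return "".join(convert(piece, rules) for piece in text.split(marker))


RULES = {"内": "內", "内存": "記憶體"}  # hypothetical zh-cn -> zh-tw rules

# Workaround 1: a longer rule covering the whole phrase wins over 内存.
longer = {**RULES, "海内存知己": "海內存知己"}
print(convert("海内存知己", longer))  # 海內存知己

# Workaround 2: the empty marker keeps 内 and 存 in separate pieces.
print(convert_with_splits("海内-{}-存知己", RULES))  # 海內存知己
```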

Inline glossary markup

The general rule of inline glossary markups is:

-{ <flag1>; <flag2>; ... | <locale1>: <text1>; <locale2>: <text2>; ...}-

flag
    A single character or a list of semicolon-separated variants, representing the working mode of the rule;
locale
    One of the supported variants;
text
    The term in the specified variant.

Note that whitespace before and after colons and semicolons is automatically ignored, and the semicolon after the last text can be omitted.

Common flags include:

A
    Add glossary rule, and output its content at insertion location in human readable format;
H
    Same as A, but no output;
D
    Same as A, but no effect (a.k.a. dry run);
-
    Remove previously added glossary rule;
T
    Forcibly overwrite the page title according to the rule.

As these flags indicate, a glossary rule only takes effect for text afterwards, so they are sensitive to insertion location.

A rule for RAM in Chinese would be described as:

-{A|zh-cn: 内存; zh-tw: 記憶體;}-

Rules defined like this are two-way conversions. So if a user prefers zh-cn, all occurrences of 記憶體 in the source text will be rendered as 内存, and vice versa for zh-tw.

For zh-hk, since it is not included in the rule, the text remains unchanged.
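
The syntax and the two-way behaviour described above can be sketched with a small parser. This is a simplification: it assumes that the texts contain no `|`, `;` or `:` characters, and the function names are hypothetical.

```python
import re

MARKUP = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def parse_rule(markup):
    """Split '-{ flags | locale: text; ... }-' into (flags, {locale: text})."""
    match = MARKUP.fullmatch(markup.strip())
    if match is None:
        raise ValueError("not a conversion markup")
    body = match.group(1)
    flags_part, bar, rules_part = body.partition("|")
    if not bar:
        # No flag section at all: the whole body is the locale list.
        flags_part, rules_part = "", body
    flags = [flag.strip() for flag in flags_part.split(";") if flag.strip()]
    mapping = {}
    for item in rules_part.split(";"):
        if not item.strip():
            continue  # the semicolon after the last text is optional
        locale, colon, text = item.partition(":")
        if not colon:
            raise ValueError("expected '<locale>: <text>'")
        mapping[locale.strip()] = text.strip()
    return flags, mapping


def table_for(mapping, target):
    """Two-way rule: map every other variant's text to the target's text."""
    if target not in mapping:
        return {}  # variant not in the rule: text stays unchanged
    return {text: mapping[target]
            for locale, text in mapping.items() if locale != target}


flags, mapping = parse_rule("-{A|zh-cn: 内存; zh-tw: 記憶體;}-")
print(flags, mapping)
print(table_for(mapping, "zh-cn"))
print(table_for(mapping, "zh-hk"))
```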

Disable automatic conversion

Wrapping text in -{}- completely disables conversion, e.g. -{内存}-. It also has the side effect of treating the text within as a single token, which is another method of manually splitting words in Chinese context.

__NOCONTENTCONVERT__ or __NOCC__ disables conversion for the entire page body, while __NOTITLECONVERT__ or __NOTC__ disables conversion for the page title completely.

Manual conversion syntax

Manual conversion is mainly used to temporarily override conversion from glossary rules. The syntax of manual conversion is:

-{ <locale1>: <text1>; <locale2>: <text2>; ...}-

which specifies <text1> for <locale1>, <text2> for <locale2>, etc.

Note that the specified result will not go through a second automatic conversion pass, and is thus preserved as-is. As with glossary rules, whitespace before and after colons and semicolons is automatically ignored, and the semicolon after the last text can be omitted.

For instance, if written manually, the rule for RAM in Chinese would be described as:

-{zh-cn: 内存; zh-tw: 記憶體;}-

[Figure: Fallback chain of MediaWiki locales]

For an unspecified variant (e.g. zh-hk), its output will be determined using MediaWiki's fallback chain.
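
The lookup can be sketched as below. The fallback order in `FALLBACKS` is an assumption made purely for illustration; the real chains are defined in MediaWiki itself and may differ, and `pick_text` is a hypothetical helper.

```python
# Assumed, illustrative fallback order; not copied from MediaWiki.
FALLBACKS = {
    "zh-hk": ["zh-hant", "zh-tw"],
    "zh-sg": ["zh-hans", "zh-cn"],
}


def pick_text(mapping, variant, fallbacks=FALLBACKS):
    """Return the text for the variant, or for the first fallback that has one."""
    for candidate in [variant, *fallbacks.get(variant, [])]:
        if candidate in mapping:
            return mapping[candidate]
    return None  # nothing suitable was specified


manual = {"zh-cn": "内存", "zh-tw": "記憶體"}
print(pick_text(manual, "zh-tw"))  # 記憶體 (specified directly)
print(pick_text(manual, "zh-hk"))  # 記憶體 (via the assumed fallback to zh-tw)
```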

Scope of Language Converter

Language Converter only handles the page body. Examples of content not processed by LC include:

  • HTML attribute values except those of alt and title
  • most extension tags emitting monospace content (e.g. <pre>, <code>, <syntaxhighlight>)
  • <nowiki>
  • interface text
  • category names
  • DOM content added by JavaScript at frontend
  • Lua constants (with the exception of text added to the page)

However, LC can be manually turned on for <pre> and <code> with a hack, namely putting an -{}- anywhere inside the element:

<pre>-{}-
LC will work on me!
</pre>

Text marked as being in another language by the HTML lang attribute will also not be handled by LC. This is common and required for Japanese text on Chinese Wikipedia, as both languages share a large portion of Chinese characters (or, in Unicode terms, CJK Unified Ideographs). In such scenarios, the language conversion markup -{}- will be preserved as is.

Currently it is not possible to call Language Converter from Lua. See phab:T49725 for more details. The zh:Module:WikitextLC module is a wrapper for inserting LC syntax into the output, and the zh:Module:ZhConversion module is a pure-Lua implementation of the LC Chinese converter.

Configuration & integration

On LC-enabled wikis, users can adjust their variant preference in Special:Preferences. It can also be overridden by the variant URL parameter. For anonymous users, if unspecified in the URL, their variant preference is detected from the browser languages (specifically the Accept-Language header), falling back to the generic one if detection fails.
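
The selection order described above can be sketched as follows. This is a simplified illustration rather than MediaWiki's actual logic: the function names are hypothetical, and the Accept-Language handling only matches exact tags.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        weighted.append((quality, tag))
    # sorted() is stable, so equal weights keep their order in the header.
    return [tag for _, tag in sorted(weighted, key=lambda item: -item[0])]


def choose_variant(supported, generic, url_variant=None,
                   preference=None, accept_language=""):
    """URL parameter first, then user preference, then browser languages."""
    if url_variant in supported:
        return url_variant
    if preference in supported:
        return preference
    for tag in parse_accept_language(accept_language):
        if tag in supported:
            return tag
    return generic  # detection failed


ZH = {"zh-hans", "zh-hant", "zh-cn", "zh-my", "zh-sg", "zh-tw", "zh-hk", "zh-mo"}
print(choose_variant(ZH, "zh", url_variant="zh-cn", preference="zh-hk"))  # zh-cn
print(choose_variant(ZH, "zh", accept_language="en-US,en;q=0.9,zh-TW;q=0.8"))  # zh-tw
print(choose_variant(ZH, "zh", accept_language="fr"))  # zh
```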

$wgVariantArticlePath configures direct article paths with variant preference included, in addition to the variant URL parameter. For example, it is configured so pages on Chinese Wikipedia can be accessed using https://zh.wikipedia.org/<variant>/<article>.

With the help of LC, MediaWiki is capable of automatically redirecting nonexistent titles to existing ones in another writing system, using built-in conversion tables. For instance, provided that page 干 exists, any request to the traditional version 幹 will be automatically redirected to it. This behavior can be overridden by creating a page under the name 幹.
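
The redirect behaviour can be sketched as a title lookup. The names are hypothetical and the one-entry table only mirrors the example above; an exact title match is checked first, which is why creating a page named 幹 overrides the redirect.

```python
# Illustrative table: other written forms of a title, per the example above.
OTHER_FORMS = {"幹": ["干"]}


def resolve_title(requested, existing_pages, other_forms=OTHER_FORMS):
    """Return the page to serve for a requested title, or None if none exists."""
    if requested in existing_pages:
        return requested  # an exact match always wins
    for candidate in other_forms.get(requested, []):
        if candidate in existing_pages:
            return candidate  # redirect to the form that exists
    return None


print(resolve_title("幹", {"干"}))        # 干
print(resolve_title("幹", {"干", "幹"}))  # 幹
```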

See also
