Multilingual MediaWiki

''These are development specifications, not documentation. This feature does not exist yet.''

Rationale
Support for multiple languages in MediaWiki is a component milestone of Wikidata and OmegaWiki development. From the standpoint of Wikidata, this is needed because MediaWiki page titles play an integral role in the Wikidata model: they can be used as keys to access resources in a set of tables, and a history of transactions related to these tables. Because page titles currently have no internal awareness of the language they represent, it is not possible to have content under the same titles in different languages without resorting to hacks (such as appending the language to the title string).

From the standpoint of a regular MediaWiki user, the current situation is that the only way to run a multilingual site is to create separate databases for each language. MediaWiki does not provide any facilities to do so. Furthermore, the administrator also has to:
 * configure the entry points for the different wikis on the web server, and set them up to use the same code base (or use completely separate installations)
 * configure the wikis to use the same account database
 * set up a shared upload repository
 * set up interlanguage links
 * manage user blocking and other site policies across multiple languages

Again, MediaWiki does not provide facilities that would make any of this significantly easier. MediaWiki also does not support:
 * getting an index of all pages across languages
 * getting a list of recent changes across all or certain languages
 * maintaining a single watchlist across multiple languages
 * indeed, operating any special page across a set of languages.

Due to the setup and maintenance costs involved, of the many hundreds of sites using MediaWiki, only a small number support multiple languages, and usually only a few languages at that. Language communities cannot evolve naturally on a MediaWiki installation; they usually have to go through procedural steps to convince the administrator that a new language has to be "set up" -- if this is possible at all.

Beyond that, there are wikis where a split into separate databases is entirely undesirable, because the wikis are inherently multilingual and centralized, and cross-language interaction on single pages is desired. Examples of this are Meta-Wiki (for votes), Wikimedia Commons (for file description pages), and Wikiversity beta (for discussions).

Support for multiple content languages in a single MediaWiki installation and database will address these concerns and others.

Caveats
While there may be technical reasons to split databases, such as easier decentralization and exports and easier localization (e.g. sort order, timestamps), and possibly (depending on query efficiency) better scalability, there are no reasons on the level of application logic to do so. This is because everything that can be modeled using multiple databases can be modeled using a single one.

However, since managing multiple languages in a single database theoretically makes it easier to add certain features, such as language filters, great attention has to be given to the question of how such filters and other features might affect community interaction in a wiki.

Admin choices
The site administrator has to make the following choice in LocalSettings.php:
 * Support all languages (including very minor and constructed languages)
 * Support all languages, except for specific ones (blacklist). It may be desirable to provide certain preset groups for this, such as constructed languages.
 * Support only a certain set of languages (whitelist)
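
The all/blacklist/whitelist choice above could be evaluated roughly as follows. This is a minimal sketch, not part of the specification: the class `LanguagePolicy`, its fields `mode` and `codes`, and the method `allows` are hypothetical names, since the specification does not name the actual LocalSettings.php variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguagePolicy:
    """Hypothetical representation of the administrator's choice."""

    mode: str = "all"                # "all", "blacklist", or "whitelist"
    codes: frozenset = frozenset()   # language codes the mode refers to

    def allows(self, code: str) -> bool:
        """Return True if content in language `code` may be created."""
        if self.mode == "all":
            return True
        if self.mode == "blacklist":
            return code not in self.codes
        if self.mode == "whitelist":
            return code in self.codes
        raise ValueError(f"unknown mode: {self.mode!r}")
```

A preset group such as "constructed languages" would then simply be a predefined set of codes passed as `codes`.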

In addition, the administrator can choose which, if any, language should be used by default for viewing content. This option, $wgDefaultLanguage, could be set to a language code, or to 'auto,<fallback language code>', meaning that the browser's language preferences are evaluated. If none of the detected languages is supported by the wiki, the fallback code is used.
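
The evaluation of the 'auto' setting could look like the following sketch. It assumes that the browser's preferences arrive as an HTTP Accept-Language header and that regional variants such as "de-AT" are reduced to their primary code; both points, like the function names, are assumptions and not part of the specification.

```python
from __future__ import annotations


def parse_accept_language(header: str) -> list[str]:
    """Return primary language codes from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        code, _, params = part.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Assumption: keep only the primary subtag, e.g. "de-AT" -> "de".
        primary = code.strip().lower().split("-")[0]
        if primary and primary != "*" and quality > 0:
            weighted.append((-quality, position, primary))
    ordered = []
    for _, _, primary in sorted(weighted):
        if primary not in ordered:
            ordered.append(primary)
    return ordered


def resolve_default_language(setting: str, accept_language: str,
                             supported: set[str]) -> str:
    """Resolve a $wgDefaultLanguage-style setting to a single language code."""
    if not setting.startswith("auto,"):
        return setting  # a fixed language code
    fallback = setting.split(",", 1)[1].strip()
    for code in parse_accept_language(accept_language):
        if code in supported:
            return code
    return fallback
```

For example, with the setting 'auto,en' and a wiki that supports only English and German, a browser that prefers Austrian German would be shown German content, while a browser that only requests French would get the English fallback.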

Backend
MediaWiki needs to come with information about languages. For this, the following two tables are added:

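Table LANGUAGE

+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| language_id    | int(10)      |      | PRI | NULL    | auto_increment |
| english_name   | varchar(255) |      |     |         |                |
| native_name    | varchar(255) |      |     |         |                |
| iso639_2       | varchar(10)  |      |     |         |                |
| iso639_3       | varchar(10)  |      |     |         |                |
| wikimedia_key  | varchar(10)  |      |     |         |                |
| dialect_of_lid | int(10)      |      |     | 0       |                |
| is_enabled     | tinyint(1)   |      |     | 0       |                |
+----------------+--------------+------+-----+---------+----------------+

Table LANGUAGE_GROUPS

+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| language_id | int(10)      |      | PRI | 0       |       |
| group_name  | varchar(255) |      | PRI | ''      |       |
+-------------+--------------+------+-----+---------+-------+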

The fields are fairly self-explanatory. The ISO keys refer to the ISO 639-3 and ISO 639-2 codes. The "Wikimedia key" is the code, if any, under which this language is known in the Wikimedia projects, e.g. "en" for English. Because the languages need to be loaded into memory on each pageview if no caching is available, there should be an index on is_enabled.

The groups allow us to build certain language groups, such as all constructed languages, all languages with Latin scripts, and so forth. The user can select at setup time which languages their installation should support, or change the is_enabled flags manually later. The most common choice will probably be "all Wikimedia project languages", which can be derived from the wikimedia_key (also used for other purposes) being non-empty.
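
The following sketch illustrates how the is_enabled flag could be derived from the wikimedia_key and from language groups. It uses an in-memory SQLite database as a stand-in for the MediaWiki database, so the column types are SQLite approximations of the types listed above; the function and index names are hypothetical.

```python
import sqlite3


def make_language_db() -> sqlite3.Connection:
    """Create in-memory approximations of the LANGUAGE and LANGUAGE_GROUPS tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE language (
            language_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            english_name   TEXT NOT NULL DEFAULT '',
            native_name    TEXT NOT NULL DEFAULT '',
            iso639_2       TEXT NOT NULL DEFAULT '',
            iso639_3       TEXT NOT NULL DEFAULT '',
            wikimedia_key  TEXT NOT NULL DEFAULT '',
            dialect_of_lid INTEGER NOT NULL DEFAULT 0,
            is_enabled     INTEGER NOT NULL DEFAULT 0
        );
        -- Enabled languages are loaded on each pageview, hence the index.
        CREATE INDEX language_enabled ON language (is_enabled);
        CREATE TABLE language_groups (
            language_id INTEGER NOT NULL DEFAULT 0,
            group_name  TEXT NOT NULL DEFAULT '',
            PRIMARY KEY (language_id, group_name)
        );
    """)
    return db


def enable_wikimedia_languages(db: sqlite3.Connection) -> None:
    """Enable "all Wikimedia project languages": those with a non-empty key."""
    db.execute("UPDATE language SET is_enabled = 1 WHERE wikimedia_key <> ''")


def enable_group(db: sqlite3.Connection, group_name: str) -> None:
    """Enable every language that belongs to the given group."""
    db.execute(
        "UPDATE language SET is_enabled = 1 WHERE language_id IN "
        "(SELECT language_id FROM language_groups WHERE group_name = ?)",
        (group_name,),
    )


def enabled_languages(db: sqlite3.Connection) -> list:
    """Return the English names of all enabled languages, sorted."""
    rows = db.execute(
        "SELECT english_name FROM language WHERE is_enabled = 1 "
        "ORDER BY english_name"
    )
    return [row[0] for row in rows]
```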

We also want to know what users can or want to do with these languages:

Table USER_LANGUAGES

+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| user_id     | int(10)     |      | PRI | 0       |       |
| language_id | int(10)     |      | PRI | 0       |       |
| attribute   | varchar(15) |      | PRI |         |       |
| level       | int(10)     |      |     | 0       |       |
+-------------+-------------+------+-----+---------+-------+

Attribute can be something like 'read', 'translate', 'communicate', 'see_ui'. Level is a numeric preference or proficiency. For now, only 'communicate' and 'see_ui' will be used.

For the "create this page in another language" feature, we need a set of default languages that are likely to be known by a speaker of a certain language. This is very fallible, and should really be linked to a locale rather than a language, but it will do for now, especially as it can be customized in the user's language preferences.

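Table LANGUAGE_DEFAULTS

+---------------------+---------+------+-----+---------+-------+
| Field               | Type    | Null | Key | Default | Extra |
+---------------------+---------+------+-----+---------+-------+
| language_id         | int(10) |      | PRI | 0       |       |
| default_language_id | int(10) |      | PRI | 0       |       |
+---------------------+---------+------+-----+---------+-------+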

Note that this is a many-to-many relationship: a language usually has multiple default languages, and any language can be part of the set of default languages for any other.

In order to connect content in different languages, we need another table:

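Table LANGUAGELINKS

+---------+---------+------+-----+---------+-------+
| Field   | Type    | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| set_id  | int(10) |      | PRI | 0       |       |
| page_id | int(10) |      |     | 0       |       |
+---------+---------+------+-----+---------+-------+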

The functionality of language links is explained below. Besides this, the following tables need to have LANGUAGE_ID keys and lookup indexes that include the language ID:
 * PAGE
 * PAGELINKS
 * TEMPLATELINKS
 * CATEGORYLINKS
 * RECENTCHANGES
 * possibly QUERYCACHE

To avoid unnecessarily complicating matters in wikis which do not use multiple languages, the existing indexes should continue to exist; however, UNIQUE or PRIMARY keys need to be modified to include the language (it can be 0 for multilanguage wikis).

Logic and frontend
Content languages in MediaWiki should essentially act like meta-namespaces that exist hierarchically above all regular namespaces. Accordingly, these should be part of the page title, so any URL in a multilingual wiki would be of the form:

http://mywiki.example.org/index.php?title=en:Main_Page
http://mywiki.example.org/index.php?title=de:Talk:Hauptseite
http://mywiki.example.org/index.php?title=mult:Babel

Note that regular namespace names, at present, cannot be automatically localized, though using the new namespace manager features in MediaWiki 1.6, synonyms could be gradually created by the site manager as language communities emerge (see below).

The prefixes would be identical to the current Wikimedia key, if any. If no Wikimedia key exists, the ISO 639-3 three-letter code would be used, prefixed with "iso_" to make it unique.
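
As a minimal sketch, the prefix rule could be expressed as follows. The function name is hypothetical; the two arguments correspond to the wikimedia_key and iso639_3 fields of the LANGUAGE table.

```python
def language_prefix(wikimedia_key: str, iso639_3: str) -> str:
    """Return the title prefix for a language (without the trailing colon)."""
    if wikimedia_key:
        return wikimedia_key
    # No Wikimedia key: fall back to the ISO 639-3 code, made unique by "iso_".
    return "iso_" + iso639_3
```

English, with the Wikimedia key "en", would thus use the prefix "en:", while a language that has only the ISO 639-3 code "qya" would use "iso_qya:".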

The prefix "mult:" stands for pages which support multiple languages, such as votes or certain templates. These would have the language code 0.

Note that a monolingual wiki would continue to act exactly as it does now, and use no prefixes whatsoever.

Using the new title rendering code that is part of the namespace changes in MediaWiki 1.6, it will be possible to show the language code as part of the rendered page title, but to style it separately (e.g. smaller font size).

The linking behavior within a language meta-namespace should be similar to a namespace with the "prefix" option set to its own name, i.e., all unprefixed links should point to pages in the same language. So, if you created a link to Portada from a page in Catalan, it would point to the Catalan Main Page; only if you linked to es:Portada would you be referring to the Spanish version.
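
The resolution of link targets could be sketched as follows. The function name and the idea of passing the set of known language prefixes explicitly are assumptions made for illustration; case handling and regular namespace parsing are deliberately left out.

```python
from __future__ import annotations


def resolve_link(target: str, source_language: str,
                 known_prefixes: set[str]) -> tuple[str, str]:
    """Return (language prefix, remaining title) for a link target.

    Unprefixed links stay within the language of the linking page.
    """
    prefix, separator, rest = target.partition(":")
    if separator and prefix in known_prefixes:
        return prefix, rest
    # Not a language prefix (possibly a regular namespace such as "Talk:").
    return source_language, target
```

With the known prefixes "ca", "es" and "mult", a link to "Portada" on a Catalan page resolves to ("ca", "Portada"), a link to "es:Portada" resolves to ("es", "Portada"), and a link to "Talk:Portada" stays in Catalan because "Talk" is not a language prefix.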

Language preferences
One major new feature of multilingual MediaWiki should be the ability to set language preferences without creating an account (if cookies are enabled). For this purpose, Special:Preferences needs to support a subset of preferences that anonymous users can set (this could include some other user interface options). On the language level, it would include the current user interface language selector, showing only those languages for which there are, in fact, interface translations. In addition, there would be a form element like the following:



Note that the user interface deliberately does not make use of dropdown boxes, as the number of supported languages can range in the thousands (hence the link to the list of languages). The ideal UI would be an AJAX-based autocompletion interface with a repeated form, but since the necessary libraries will be implemented as part of the Wikidata UI layer, it does not make sense to be fancy at this point in time.

Interlanguage links
MediaWiki currently supports "interlanguage links", links to a page in other languages that are displayed in the sidebar (in the MonoBook skin). However, these links relate to separate wiki databases. In order to distinguish the same feature within multilingual MediaWiki, we use the term "innerlanguage links".

One major deficit of the way interlanguage links are currently implemented is that it is necessary for each page to maintain a list of all languages to which it is connected. If, for example, you have a page in 10 languages, each of these pages needs to have a list of interlanguage links to 9 other languages. Proposals have been made to reform this using a central database. This is a complex problem, requiring central versioning and possibly single login to be solved.

It is much easier to solve within a single database. The LANGUAGELINKS table above works differently from the current interlanguage link system. Instead of having separate lists of links for each page, there are sets of pages which are connected. This is done through a new multilingual namespace called "Set:". Any page can be linked to a page in the "Set:" namespace, which contains the language links (one per line). This association will be managed from the edit screen. It would also be desirable for the contents of a page in the Set: namespace to be editable from the same view as the contents of a regular page, that is, to have two textareas, one for editing language links and one for editing text.

Below the primary textarea, there would be two new links: "Add links to other languages" and "Join with existing set of language links". The first would allow the user to create a new page in the Set: namespace using a separate textarea (the title of the "Set:" page should be pre-filled with the title of the current page by default, if it doesn't exist, or made unique if it does), the second one would load the contents of an existing page in the Set: namespace into a separate textarea. Since, many times, a page will have the same title in many languages (e.g. the page title "Microsoft"), it would be a nice convenience feature to check if a page in the "Set:" namespace with the current page title exists, and to offer linking to it.

Innerlanguage links are rendered the same way as interlanguage links. All innerlanguage links are shown, regardless of language preferences. Experience has shown that it is desirable to make multilingual activity visible in this manner; it also makes caching easier.
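
The set-based model can be illustrated with a small in-memory stand-in for the LANGUAGELINKS table. The class and method names are hypothetical, and the sketch assumes that a page belongs to at most one set, which the specification implies but does not state outright.

```python
from collections import defaultdict


class LanguageLinks:
    """In-memory stand-in for the LANGUAGELINKS table (set_id, page_id)."""

    def __init__(self):
        self._set_of_page = {}                  # page_id -> set_id
        self._pages_in_set = defaultdict(set)   # set_id -> {page_id, ...}

    def join(self, set_id: int, page_id: int) -> None:
        """Attach a page to a set, detaching it from any previous set."""
        old = self._set_of_page.get(page_id)
        if old is not None:
            self._pages_in_set[old].discard(page_id)
        self._set_of_page[page_id] = set_id
        self._pages_in_set[set_id].add(page_id)

    def linked_pages(self, page_id: int) -> list:
        """Return all other pages in the same set, i.e. the innerlanguage links."""
        set_id = self._set_of_page.get(page_id)
        if set_id is None:
            return []
        return sorted(self._pages_in_set[set_id] - {page_id})
```

Each page stores only its membership in one set, so a page that exists in 10 languages needs 10 rows instead of 10 lists of 9 links each.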

Namespace-specific behavior
Some MediaWiki namespaces have certain functionality associated with them. This functionality is affected in a multilingual wiki.

Templates
Unless you explicitly refer to a multilingual template or one in another language, templates would be looked for in the same language namespace as the page where they are used.

MediaWiki namespace
The MediaWiki namespace currently supports multilingual content using a somewhat hackish subpage syntax to disambiguate languages. It should be ported to use the new language code system. Ideally, this could also be used to deprecate $wgForceUIMsgAsContentMsg - if a multilingual page exists in the MediaWiki namespace for a message, it is used for all languages. (Some messages could be multilingual by default.) For example, it would be possible to either create a multilingual portal as the frontpage (by creating mult:MediaWiki:Mainpage), or separate portals for each language.
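
A possible lookup order for interface messages is sketched below. The function name and the representation of existing pages as a set of full titles are assumptions; the precedence of the mult: page follows the rule described above.

```python
from __future__ import annotations


def resolve_message_page(message: str, ui_language: str,
                         existing_pages: set[str]) -> str | None:
    """Return the title of the page that supplies a message, or None.

    A multilingual page, if present, is used for all languages.
    """
    multilingual = f"mult:MediaWiki:{message}"
    if multilingual in existing_pages:
        return multilingual
    localized = f"{ui_language}:MediaWiki:{message}"
    if localized in existing_pages:
        return localized
    return None  # the caller would fall back to the built-in default message
```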

File descriptions
File descriptions can be multilingual, but the page title is identical across all languages (the filename).

Links to files (displayed as an image or not) will point to the description page in the language of the linking page. If no description exists in that language, viewing the description page will show the languages for which descriptions are available. As a nice-to-have feature, the user language preferences could be evaluated to provide an automatic fallback, if possible; e.g., if the user speaks English, and an English language description is available, but a description in the language of the current context is not, the English description is shown.
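
The fallback could be sketched as follows. The function name is hypothetical, and `user_languages` is assumed to be ordered by the user's preference.

```python
from __future__ import annotations


def pick_description(context_language: str, user_languages: list[str],
                     available: set[str]) -> str | None:
    """Choose the language of the file description to display.

    Returns None if neither the context language nor any of the user's
    languages has a description; the page would then list `available`.
    """
    if context_language in available:
        return context_language
    for code in user_languages:
        if code in available:
            return code
    return None
```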

An additional useful feature would be special treatment of descriptions in the mult: (multilingual) language namespace. Those descriptions could be shown above every description in a specific language. This would make it possible to have licensing information easily shared across languages, for example, and to port the existing description pages on wikis like Commons to the new system. However, in the long run, we will definitely want to move some of the image metadata into true, structured Wikidata.

Categories
It is possible to add innerlanguage links to categories. However, doing so does not mean that the translated category name (e.g. "en:Horse" => "de:Pferd") will inherit the category hierarchy of the original. Instead, category hierarchies can evolve separately in separate languages. In a multilingual repository like Wikimedia Commons, this means that separate file description pages in separate languages have separate categories.

In the long run, we want to complement the category system with meaning tags which relate directly to an element in the OmegaWiki thesaurus structure, allowing for a multilingual concept structure that is identical across languages and that can be automatically rendered into languages where expressions (words, phrases) for a concept are available (see a first mock-up of meaning tagging). This is a more desirable approach than perfecting wiki-local category schemas, as it promotes the use of OmegaWiki as a single, global, structured, omnilingual conceptual database of the world.

User interface language default
When no user interface language is set, the UI language should be identical to the content language, that is, when viewing pages in English, the UI language should be English; in German, the UI language should be German, and so on. This ensures that each language community can customize the interface messages according to their needs, and emulates the current behavior of multilingual Wikimedia projects with split databases.

Regular links
In a number of places where relations between pages are evaluated (e.g. "What links here"), the language codes will have to be shown alongside the page title. Otherwise, the behavior of regular links is not affected.

Language filtering




At least three special pages, Special:Contributions, Special:Allpages and Special:Recentchanges, should offer the ability to filter pages by language. It would be useful if this ability were gradually added to other pages as well.

However, language filtering should be disabled by default. There should be a special, multilingual system message, MediaWiki:Language communities, which would contain a comma-separated list of language codes (but be blank by default). These language codes would identify self-organizing communities within a wiki.

For example, if a new wiki is started in English, and people start adding content in German, these additions should initially be visible to everyone. Other users can then try to determine whether they are legitimate additions to the wiki. If a true community seems to be forming, the language can be added to the list of language communities. Only then can it be filtered.

Once the first language community exists, the default filter would still be "All languages". The individual language communities would become available as single-language filters. Only if the user has set their language preferences would a new option appear (and become the default filter): "Languages I speak and new communities". "New communities" in this context would refer to languages which are not yet part of the list of language communities.
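
The filter rules described in this section could be sketched as follows. All function names are hypothetical; the option labels are the ones used above, and the input of `parse_communities` is the text of the MediaWiki:Language communities message.

```python
from __future__ import annotations

ALL = "All languages"
PERSONAL = "Languages I speak and new communities"


def parse_communities(message_text: str) -> list[str]:
    """Parse the comma-separated list of language codes (blank by default)."""
    return [code.strip() for code in message_text.split(",") if code.strip()]


def filter_options(communities: list[str],
                   user_languages: list[str]) -> tuple[list[str], str | None]:
    """Return (available filter options, default option)."""
    if not communities:
        return [], None  # filtering is disabled until a community exists
    options = [ALL] + list(communities)
    default = ALL
    if user_languages:
        options.append(PERSONAL)
        default = PERSONAL
    return options, default


def passes_personal_filter(page_language: str, communities: list[str],
                           user_languages: list[str]) -> bool:
    """Apply the "Languages I speak and new communities" filter to one page."""
    speaks = page_language in user_languages
    new_community = page_language not in communities
    return speaks or new_community
```

Note that content in a language that is not yet a recognized community always passes the personal filter, which keeps new additions visible to everyone.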

The mechanism of identifying language communities, and generally showing all languages by default, hopefully ensures that vandalism and spamming will only be hidden from the view of all users once a legitimate community exists to deal with it. It also promotes interaction between the existing community and newly forming language communities, allowing for guidance and advice in formulating initial policies and setting up pages.

Language proficiency
The language proficiency, if known, should be shown in two places:
 * user pages
 * user list, administrator list (it would be ideal if these could be filtered).

It serves a similar purpose as the current "Babel" templates (see commons:User:Eloquence as an example - the boxes at the bottom are language proficiency templates), but can be reliably accessed by MediaWiki itself.

Go button
The Go button would behave as it currently does; however, it would search for pages in the language of the currently viewed page, unless a language prefix is provided. As a nice-to-have feature, if language communities are configured (see above), these should be available as a dropdown in the Go/Search toolbox.

"Create a page in this language"


One unique new feature that should be part of the first implementation of multilingual MediaWiki is the ability to easily create a version of a page in another language. At the bottom of each page, a menu is shown which allows the user to select a language, enter a title (prefilled by default with the current page title), and create the page.

It is important that the word "Translate" is avoided in the user interface in this context, as the linked page is not necessarily a direct translation (and perhaps most frequently will not be).

The languages shown in the dropdown selection depend on three factors:
 * Languages where a corresponding page already exists are not listed in the selection.
 * If the user is anonymous or has not set their language preferences yet, the languages come from LANGUAGE_DEFAULTS.
 * Otherwise, they come from the USER_LANGUAGES table.
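
The three factors above can be combined as in the following sketch. The function name is hypothetical; `defaults` stands for the LANGUAGE_DEFAULTS entries of the current page's language, and `user_languages` for the user's entries in USER_LANGUAGES (empty for anonymous users without preferences).

```python
from __future__ import annotations


def creatable_languages(existing: set[str], user_languages: list[str],
                        defaults: list[str]) -> list[str]:
    """Return the languages to offer in the "create a page" dropdown."""
    # Users without language preferences get the defaults for the
    # current page's language.
    candidates = user_languages if user_languages else defaults
    # Languages that already have a corresponding page are not offered.
    return [code for code in candidates if code not in existing]
```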

A link with the title "Customize languages" or similar should open the language preferences dialog.

It would be helpful if the Set: page associated with the originating page were updated, or a new one created, in order to store the language link relationship.

Future ideas
Moved to the discussion page.