Multilingual MediaWiki

''These are development specifications, not documentation. This feature does not exist yet. For details on setting up a family of wikis in different languages connected by interlanguage links, see Manual:Wiki family''

Rationale
Support for multiple languages in MediaWiki is a component milestone of Wikidata and OmegaWiki development. From the standpoint of Wikidata, this is needed because MediaWiki page titles play an integral role in the Wikidata model: they can be used as keys to access resources in a set of tables, and a history of transactions related to these tables. Because page titles currently have no internal awareness of the language they represent, it is not possible to have content under the same titles in different languages without resorting to hacks (such as appending the language to the title string).

From the standpoint of a regular MediaWiki user, the current situation is that the only way to run a multilingual site is to create separate databases for each language. MediaWiki does not provide any facilities to do so. Furthermore, the administrator also has to:
 * configure the entry points for the different wikis on the web server, and set them up to use the same code base (or use completely separate installations)
 * configure the wikis to use the same account database
 * set up a shared upload repository
 * set up interlanguage links
 * manage user blocking and other site policies across multiple languages

Again, MediaWiki does not provide facilities that would make any of this significantly easier. MediaWiki also does not support:
 * Getting an index of all pages across languages
 * Getting a list of recent changes across all or certain languages
 * Maintaining a single watchlist across multiple languages
 * Indeed, operating any special page across a set of languages.

Due to the setup and maintenance costs involved, only a small number of the many hundreds of sites using MediaWiki support multiple languages, and those that do usually support only a few. Language communities cannot evolve naturally on a MediaWiki site; they usually have to go through procedural steps to convince the administrator that a new language has to be "set up" (if this is possible at all).

Beyond that, there are wikis where a split into separate databases is entirely undesirable, because the wikis are inherently multilingual and centralized, and cross-language interaction on single pages is desired. Examples of this are Meta-Wiki (for votes) and Wikimedia Commons (for file description pages) and Wikiversity beta (for discussions).

Support for multiple content languages in a single MediaWiki installation and database will address these concerns and others.

Caveats
While there may be technical reasons to split databases, such as easier decentralization and exports and easier localization (e.g. sort order, timestamps), and possibly (depending on query efficiency) better scalability, there are no reasons on the level of application logic to do so. This is because everything that can be modeled using multiple databases can be modeled using a single one.

However, since managing multiple languages in a single database makes it, theoretically, easier to add certain features, such as language filters, great attention has to be paid to the question of how such filters and other features might affect community interaction in a wiki.

Admin choices
The site administrator has to make the following choice in LocalSettings.php:
 * Support all languages (including very minor and constructed languages)
 * Support all languages, except for specific ones (blacklist)
 * It may be desirable to provide certain preset groups, such as constructed languages.
 * Support only a certain set of languages (whitelist)
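
The three policies above can be sketched as a small filter over the known language codes. This is only an illustration: the names `LanguagePolicy` and `enabled_languages`, the mode strings and the sample codes are hypothetical and are not defined by MediaWiki or by this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguagePolicy:
    """Hypothetical stand-in for the administrator's choice in LocalSettings.php."""
    mode: str = "all"                 # "all", "blacklist" or "whitelist"
    codes: frozenset = frozenset()    # codes the blacklist or whitelist refers to


def enabled_languages(known, policy):
    """Return the subset of known language codes that the wiki supports."""
    known = set(known)
    if policy.mode == "all":
        return known
    if policy.mode == "blacklist":
        return known - policy.codes
    if policy.mode == "whitelist":
        return known & policy.codes
    raise ValueError(f"unknown policy mode: {policy.mode!r}")


# Sample data: a preset group (constructed languages) used as a blacklist.
KNOWN = {"en", "de", "ca", "eo", "tlh"}
CONSTRUCTED = frozenset({"eo", "tlh"})
print(sorted(enabled_languages(KNOWN, LanguagePolicy("blacklist", CONSTRUCTED))))
```

A preset group is just a named set of codes here, so the same group can serve as a blacklist on one wiki and as a whitelist on another.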

In addition, the administrator can choose which language, if any, should be used by default for viewing content. This option, $wgDefaultLanguage, could be set to a language code, or to 'auto,<fallback language code>', meaning that the browser's language preferences are evaluated. If none of the detected languages is supported by the wiki, the fallback code is used.
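
A minimal sketch of how such a setting might be resolved, assuming the browser's preferences arrive as an HTTP Accept-Language header. The function names are hypothetical, and matching a regional tag such as "de-AT" against its primary subtag "de" is an assumption of this sketch, not something the specification requires.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def resolve_default_language(setting, accept_language, supported):
    """Resolve a $wgDefaultLanguage-style value to a single language code."""
    if not setting.startswith("auto,"):
        return setting                      # a fixed language code
    fallback = setting.split(",", 1)[1].strip()
    for tag in parse_accept_language(accept_language):
        # Try the full tag first, then its primary subtag ("de-at" -> "de").
        for candidate in (tag, tag.split("-")[0]):
            if candidate in supported:
                return candidate
    return fallback


print(resolve_default_language("auto,en", "fr;q=0.9, de-AT;q=0.8", {"en", "de"}))
```

In this example French is preferred but unsupported, so the next preference (German) is chosen before the fallback is considered.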

Backend
MediaWiki needs to come with information about languages. For this, the following two tables are added:


 * Note that we don't really need two columns for ISO 639-2 and ISO 639-3 (and either or both should default to NULL). Only one column is needed for BCP 47, and this is the preferred code that should be used. Several other codes exist, but they are aliases of a preferred BCP 47 code.
 * In that case, you'll have an "alias_of_lid" column and a single "bcp47" column for the language code (which may be an ISO 639-1, ISO 639-2 or ISO 639-3 code, a legacy aliased ISO 639-2 code like "jw" and "he", or a bibliographic code distinct from the technical code in ISO 639-2, but which may also be more precise, with scripts, variants and dialects).
 * The "wikimedia_key" should be set only if the code used on the site is different from the preferred BCP 47 code. This key should still be a language code with its own entry in this table, pointing to the preferred BCP 47 code via the "alias_of_lid". In that case, the "wikimedia_key" should be renamed to "local_lid" and would be of type int(10) like all other language ids.
 * The concept of "dialect of" is also extremely fuzzy and collides with the concept of language groups, explained below. It is probably not needed in this table.
 * Conversely, there will be cases of "ambiguous" language codes that need to be split into several more precise language codes. This case has always occurred within BCP 47 and has been more precisely defined since ISO 639-3:
 * we have a formal definition of language families (some of them have a code in ISO 639-1/2 and ISO 639-3, but their content in terms of individual languages is still not very precise: this should be modeled like the language groups discussed below, except that, as described below, these groups can't have a language code as they should)
 * we have a very precise definition of "individual" languages (those that have a 3-letter code in ISO 639-3 and type "I"): they can't be dialects of each other.
 * we have a very precise definition of "macrolanguages" (some of them preexisted in ISO 639-1/2 and now they also have a code in ISO 639-3 and type "M"): they can't be dialects of each other, but each contains a very precise list of individual languages. One of them, "zh", is very widely used and is still the preferred BCP 47 language code; this is possible because in BCP 47 it is treated as implicitly being a synonym of "cmn" if it is not followed by legacy qualifiers.
 * Language names are also localized in multiple languages; they are standard multilingual resources. For this reason they should not even be stored in this table (not even the English or the native name).
 * In summary, this table should contain the following columns only:
 * CREATE TABLE language AS
 * ( language_id        INT         NOT NULL WITH AUTOINCREMENT
 * , language_code      VARCHAR(16) NOT NULL
 * , local_lid          INT         NULL
 * , resource_id        INT         NULL
 * , bcp47_alias_of_lid INT         NULL
 * , iso639_scope       CHAR(1)     NULL -- 'M' for macro, 'I' for individual, 'F' for family, 'S' for special
 * , iso639_macro_lid   INT         NULL
 * , CONSTRAINT PRIMARY KEY INDEX ON (language_id)
 * , CONSTRAINT UNIQUE     INDEX ON (language_code)
 * , SECONDARY             INDEX ON (iso639_macro_lid)
 * )
 * ALTER TABLE language ADD
 * ( CONSTRAINT FOREIGN KEY      ON (local_lid)   REFERENCES language(language_id)
 * , CONSTRAINT FOREIGN KEY      ON (resource_id) REFERENCES resource(resource_id)
 * )
 * Note that the column is_enabled is not necessary: a language "is enabled" when its "local_lid" is set to a non-null value (the value of the language_id column either in the same row, or in another row where the preferred BCP 47 code is replaced by another non-BCP 47 code).
 * Language names are stored separately as a localizable text resource, where at least one localized name must exist for the language in order to determine its resource id. Note that the resource_id, which must be created first, could also be used as the value of the language_id above, but the two are really unrelated, as it is impossible to insert a new resource id before knowing at least one language for it.
 * For this reason, the language table above contains a resource_id that may be set later (the only condition being that at least one code is known for creating the language entry), and language names will then be stored in:
 * CREATE TABLE resource AS
 * ( resource_id  INT          NOT NULL WITH AUTOINCREMENT
 * , language_id  INT          NOT NULL
 * , variant_tag  VARCHAR(255) NULL
 * , text_value   VARCHAR(255) NOT NULL -- Unicode needed on this column
 * , CONSTRAINT PRIMARY KEY INDEX ON (resource_id, language_id, variant_tag)
 * )
 * ALTER TABLE resource ADD
 * ( CONSTRAINT FOREIGN KEY      ON (language_id) REFERENCES language(language_id)
 * , CONSTRAINT FOREIGN KEY      ON (resource_id) REFERENCES resource(resource_id)
 * )
 * where language_id is the language actually used in the localized resource text (text_value), and variant_tag is an optional variant tag used to qualify standards or sources (such as the IANA registry, tag="IANA", or CLDR, tag="CLDR"), the preferred name displayed on a project like the Wikimedia projects (tag="WM"), or indistinct accepted synonyms (tag="SYN:1", tag="SYN:2", etc.). The variant tag may also contain some version info or date, but it should have a hierarchical prefix structure (exactly like namespaces) permitting the management of synonyms and updates from various normative, bibliographic, academic, or community sources.
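
The two tables proposed in the notes above can be tried out with Python's built-in sqlite3 module. This is a rough sketch rather than the schema MediaWiki would ship: the SQLite column types, the empty-string default for variant_tag (chosen so that the three-column primary key stays unambiguous), the extra reference on bcp47_alias_of_lid, the helper function names and the sample rows are all assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE language (
    language_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    language_code      TEXT NOT NULL UNIQUE,
    local_lid          INTEGER REFERENCES language(language_id),
    resource_id        INTEGER,
    bcp47_alias_of_lid INTEGER REFERENCES language(language_id),
    iso639_scope       TEXT CHECK (iso639_scope IN ('M', 'I', 'F', 'S')),
    iso639_macro_lid   INTEGER REFERENCES language(language_id)
);
CREATE INDEX language_macro ON language (iso639_macro_lid);
CREATE TABLE resource (
    resource_id INTEGER NOT NULL,
    language_id INTEGER NOT NULL REFERENCES language(language_id),
    variant_tag TEXT NOT NULL DEFAULT '',
    text_value  TEXT NOT NULL,
    PRIMARY KEY (resource_id, language_id, variant_tag)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def add_language(code, scope=None, macro=None, enabled=True):
    """Insert a language row and return its language_id."""
    cur = conn.execute(
        "INSERT INTO language (language_code, iso639_scope, iso639_macro_lid) "
        "VALUES (?, ?, ?)",
        (code, scope, macro),
    )
    lid = cur.lastrowid
    if enabled:
        # "Enabled" is expressed by a non-NULL local_lid; here it points at the row itself.
        conn.execute("UPDATE language SET local_lid = ? WHERE language_id = ?", (lid, lid))
    return lid


def enabled_codes():
    """Codes of all languages whose local_lid is set."""
    rows = conn.execute(
        "SELECT language_code FROM language WHERE local_lid IS NOT NULL "
        "ORDER BY language_code"
    )
    return [code for (code,) in rows]


def language_name(code, shown_in):
    """Name of language `code`, written in language `shown_in`, or None."""
    row = conn.execute(
        "SELECT r.text_value FROM language AS l "
        "JOIN resource AS r ON r.resource_id = l.resource_id "
        "JOIN language AS s ON s.language_id = r.language_id "
        "WHERE l.language_code = ? AND s.language_code = ? "
        "ORDER BY r.variant_tag LIMIT 1",
        (code, shown_in),
    ).fetchone()
    return row[0] if row else None


# Sample rows: a macrolanguage, one of its individual languages, and English.
zh = add_language("zh", scope="M")
cmn = add_language("cmn", scope="I", macro=zh, enabled=False)
en = add_language("en", scope="I")
conn.execute("UPDATE language SET resource_id = 1 WHERE language_id = ?", (zh,))
conn.execute("INSERT INTO resource VALUES (1, ?, '', 'Chinese')", (en,))

print(enabled_codes(), language_name("zh", "en"))
```

The language row is created first and its resource_id is filled in afterwards, which mirrors the ordering constraint described in the notes: a name can only be stored once at least one language exists to write it in.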

The fields are fairly self-explanatory. The ISO keys refer to the ISO 639-3 and ISO 639-2 codes. The "Wikimedia key" is the code, if any, under which this language is known in the Wikimedia projects, e.g. "en" for English. Because the languages need to be loaded into memory on each pageview if no caching is available, there should be an index on is_enabled.

The groups allow us to build certain language groups, such as all constructed languages, all languages with Latin scripts, and so forth. The user can select at setup time which languages their installation should support, or change the is_enabled flags manually later. The most common choice will probably be "all Wikimedia project languages", which can be derived from the wikimedia_key (also used for other purposes) being non-empty.


 * Note that the concept of language groups is extremely fuzzy. It is not even needed to build a multilingual system. Languages may be grouped without this table using the existing MediaWiki categories, possibly using different but parallel hierarchies.

We also want to know what users can or want to do with these languages:

Attribute can be something like 'read', 'translate', 'communicate', 'see_ui'. Level is a numeric preference or proficiency. For now, only 'communicate' and 'see_ui' will be used.

For the "create this page in another language" feature, we need a set of default languages that are likely to be known by a speaker of a certain language. This is very fallible, and should really be linked to a locale rather than a language, but it will do for now, especially as it can be customized in the user's language preferences.

Note that this is a many-to-many relationship: a language usually has multiple default languages, and any language can be part of the set of default languages for any other.


 * Note also that the term "default" in the name of the second column is not well chosen. It should preferably be "fallback" instead.
 * This table should probably include an additional priority column (the index suggested at the end of this note calls it default_priority), placed second in the primary key. This makes it possible to correctly sort the list of default language ids for a given language id according to best match or proximity. The default language column then no longer needs to be part of the primary key. If, however, the two-column primary key is kept unchanged, a secondary index on (language_id, default_priority) needs to be added.
 * Note also that this table should only be used as a default order of fallback for a given language, because users have their own preferences about their secondary languages. So the list of fallback language ids for a given language id should be indexed by a user id (and a default user id = 0 could be used for the default list of fallback languages when users are not identified or have not set preferences for their fallback languages). In that case, you don't need two columns for the languages in the other table named "USERLANGUAGES": just use one, and design this second per-user table simply as (user_id, language_priority, language_id).
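
One way to read the notes above is sketched here with sqlite3: a per-language default table ordered by a priority column, and a per-user table of the form (user_id, language_priority, language_id) that takes precedence when it has entries. The table names, column types, numeric language ids and the `fallback_languages` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language_fallbacks (
    language_id          INTEGER NOT NULL,
    default_priority     INTEGER NOT NULL DEFAULT 0,  -- lower value = better match
    fallback_language_id INTEGER NOT NULL,
    PRIMARY KEY (language_id, default_priority, fallback_language_id)
);
CREATE TABLE user_languages (
    user_id           INTEGER NOT NULL,
    language_priority INTEGER NOT NULL,
    language_id       INTEGER NOT NULL,
    PRIMARY KEY (user_id, language_priority)
);
""")

CA, ES, EN, FR = 1, 2, 3, 4   # made-up language ids for the example

conn.executemany("INSERT INTO language_fallbacks VALUES (?, ?, ?)",
                 [(CA, 1, ES), (CA, 2, FR), (CA, 3, EN)])
conn.executemany("INSERT INTO user_languages VALUES (?, ?, ?)",
                 [(42, 1, CA), (42, 2, EN)])


def fallback_languages(language_id, user_id=None):
    """Ordered fallback language ids for a page language and, optionally, a user."""
    if user_id is not None:
        rows = conn.execute(
            "SELECT language_id FROM user_languages "
            "WHERE user_id = ? AND language_id != ? ORDER BY language_priority",
            (user_id, language_id),
        ).fetchall()
        if rows:
            return [lid for (lid,) in rows]
    # Anonymous users, or users without preferences, get the per-language defaults.
    rows = conn.execute(
        "SELECT fallback_language_id FROM language_fallbacks "
        "WHERE language_id = ? ORDER BY default_priority",
        (language_id,),
    )
    return [lid for (lid,) in rows]


print(fallback_languages(CA), fallback_languages(CA, user_id=42))
```

Because the priority column sits directly after the leading key column, the ordered lookup can be served from the primary key index alone.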

In order to connect content in different languages, we need another table:

The functionality of language links is explained below. Besides this, the following tables need to have LANGUAGE_ID keys and lookup indexes that include the language ID:
 * PAGE
 * PAGELINKS
 * TEMPLATELINKS
 * CATEGORYLINKS
 * RECENTCHANGES
 * possibly QUERYCACHE

To avoid unnecessarily complicating matters in wikis which do not use multiple languages, the existing indexes should continue to exist; however, UNIQUE or PRIMARY keys need to be modified to include the language (it can be 0 for multilanguage wikis).
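
The effect of adding the language to a unique key can be demonstrated with a cut-down page table in sqlite3. The column name page_language_id, the index name and the `create_page` helper are assumptions for illustration; the real PAGE table has many more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page (
        page_id          INTEGER PRIMARY KEY,
        page_language_id INTEGER NOT NULL DEFAULT 0,
        page_namespace   INTEGER NOT NULL,
        page_title       TEXT NOT NULL,
        UNIQUE (page_language_id, page_namespace, page_title)
    )""")
# The language-less lookup index stays available as a plain (non-unique) index.
conn.execute("CREATE INDEX page_name_title ON page (page_namespace, page_title)")


def create_page(language_id, namespace, title):
    """Insert a page; return False if that title already exists in that language."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO page (page_language_id, page_namespace, page_title) "
                "VALUES (?, ?, ?)",
                (language_id, namespace, title),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = create_page(1, 0, "Microsoft")      # language 1
second = create_page(2, 0, "Microsoft")     # same title, language 2
duplicate = create_page(1, 0, "Microsoft")  # same title and language again
print(first, second, duplicate)
```

The same title can now exist once per language, while a second page with the same language, namespace and title is still rejected.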

Logic and frontend
Content languages in MediaWiki should essentially act like meta-namespaces that exist hierarchically above all regular namespaces. Accordingly, these should be part of the page title, so any URL in a multilingual wiki would be of the form:

 http://mywiki.example.org/index.php?title=en:Main_Page
 http://mywiki.example.org/index.php?title=de:Talk:Hauptseite
 http://mywiki.example.org/index.php?title=mult:Babel

Note that regular namespace names, at present, cannot be automatically localized, though using the new namespace manager features in MediaWiki 1.6, synonyms could be gradually created by the site manager as language communities emerge (see below).

The prefix would be identical to the current Wikimedia key, if any. If no Wikimedia key exists, the ISO 639-3 three-letter code would be used, prefixed with "iso_" to make it unique.

The prefix "mult:" stands for pages which support multiple languages, such as votes or certain templates. These would have the language code 0.

Note that a monolingual wiki would continue to act exactly as it does now, and use no prefixes whatsoever.
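
The prefix rules described in this section can be sketched as two small functions. The names `ParsedTitle`, `split_language_prefix` and `language_prefix` are hypothetical, and treating an empty set of supported languages as "monolingual wiki" is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ParsedTitle(NamedTuple):
    language: Optional[str]   # None when the title carries no language prefix
    rest: str                 # namespace and page name, left untouched


MULTILINGUAL = "mult"


def language_prefix(wikimedia_key, iso639_3):
    """Wikimedia key if there is one, otherwise 'iso_' plus the ISO 639-3 code."""
    return wikimedia_key if wikimedia_key else "iso_" + iso639_3


def split_language_prefix(title, supported):
    """Split 'de:Talk:Hauptseite' into its language prefix and the remainder."""
    if not supported:
        # Monolingual wiki: titles behave exactly as they do today.
        return ParsedTitle(None, title)
    prefix, sep, rest = title.partition(":")
    if sep and rest and (prefix == MULTILINGUAL or prefix in supported):
        return ParsedTitle(prefix, rest)
    return ParsedTitle(None, title)


SUPPORTED = {"en", "de", language_prefix(None, "tlh")}
for example in ("en:Main_Page", "de:Talk:Hauptseite", "mult:Babel", "Talk:Foo"):
    print(example, "->", split_language_prefix(example, SUPPORTED))
```

Only the first segment is inspected, so regular namespace prefixes such as "Talk:" pass through unchanged in the remainder.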

Using the new title rendering code that is part of the namespace changes in MediaWiki 1.6, it will be possible to show the language code as part of the rendered page title, but to style it separately (e.g. smaller font size).

The linking behavior within a language meta-namespace should be similar to a namespace with the "prefix" option set to its own name, i.e., all unprefixed links should point to pages in the same language. So, if you created a link to Portada from a page in Catalan, it would point to the Catalan Main Page; only if you linked to es:Portada would you be referring to the Spanish version.
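
A small sketch of this resolution rule, assuming language prefixes such as "ca" and "es" and the "mult" prefix for multilingual pages. The function name `resolve_link` is hypothetical.

```python
def resolve_link(target, source_language, supported):
    """Return (language, title) for a link written on a page in source_language."""
    prefix, sep, rest = target.partition(":")
    if sep and rest and (prefix == "mult" or prefix in supported):
        return prefix, rest               # explicit language prefix wins
    return source_language, target        # unprefixed: stay in the same language


SUPPORTED = {"ca", "es"}
print(resolve_link("Portada", "ca", SUPPORTED))      # link on a Catalan page
print(resolve_link("es:Portada", "ca", SUPPORTED))   # explicit Spanish target
```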

Language preferences
One major new feature of multilingual MediaWiki should be the ability to set language preferences without creating an account (if cookies are enabled). For this purpose, Special:Preferences needs to support a subset of preferences that anonymous users can set (this could include some other user interface options). On the language level, it would include the current user interface language selector, showing only those languages for which there are, in fact, interface translations. In addition, there would be a form element like the following:



Note that the user interface deliberately does not make use of dropdown boxes, as the number of supported languages can range in the thousands (hence the link to the list of languages). The ideal UI would be an AJAX-based autocompletion interface with a repeated form, but since the necessary libraries will be implemented as part of the Wikidata UI layer, it does not make sense to be fancy at this point in time.

Interlanguage links
MediaWiki currently supports "interlanguage links", links to a page in other languages that are displayed in the sidebar (in the MonoBook skin). However, these links relate to separate wiki databases. In order to distinguish the same feature within multilingual MediaWiki, we use the term "innerlanguage links".

One major deficit of the way interlanguage links are currently implemented is that it is necessary for each page to maintain a list of all languages to which it is connected. If, for example, you have a page in 10 languages, each of these pages needs to have a list of interlanguage links to 9 other languages. Proposals have been made to reform this using a central database. This is a complex problem, requiring central versioning and possibly single login to be solved.

It is much easier to solve within a single database. The LANGUAGELINKS table above works differently from the current interlanguage link system. Instead of having separate lists of links for each page, there are sets of pages which are connected. This is done through a new multilingual namespace called "Set:". Any page can be linked to a page in the "Set:" namespace, which contains the language links (one per line). This association will be managed from the edit screen. It would also be desirable for the contents of a page in the Set: namespace to be editable from the same view as the contents of a regular page, that is, to have two textareas, one for editing language links and one for editing text.
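
The difference between per-page link lists and connected sets can be illustrated with an in-memory model. The class `LanguageLinkSets` and its method names are hypothetical; in the proposal the membership would live in the LANGUAGELINKS table and the "Set:" pages.

```python
from collections import defaultdict


class LanguageLinkSets:
    """Pages are members of one set; links are derived from shared membership."""

    def __init__(self):
        self._members = defaultdict(set)   # set title -> {(language, title)}
        self._set_of = {}                  # (language, title) -> set title

    def join(self, set_title, language, title):
        """Attach a page to a set, detaching it from any previous set."""
        page = (language, title)
        previous = self._set_of.get(page)
        if previous is not None:
            self._members[previous].discard(page)
        self._members[set_title].add(page)
        self._set_of[page] = set_title

    def links_for(self, language, title):
        """All other pages in the same set, i.e. the innerlanguage links to render."""
        page = (language, title)
        set_title = self._set_of.get(page)
        if set_title is None:
            return []
        return sorted(self._members[set_title] - {page})


links = LanguageLinkSets()
links.join("Set:Horse", "en", "Horse")
links.join("Set:Horse", "de", "Pferd")
links.join("Set:Horse", "fr", "Cheval")
print(links.links_for("en", "Horse"))
```

With this model a page in 10 languages needs 10 membership entries rather than 10 lists of 9 links each, and adding an eleventh language touches only one entry.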

Below the primary textarea, there would be two new links: "Add links to other languages" and "Join with existing set of language links". The first would allow the user to create a new page in the Set: namespace using a separate textarea (the title of the "Set:" page should be pre-filled with the title of the current page by default, if it doesn't exist, or made unique if it does), the second one would load the contents of an existing page in the Set: namespace into a separate textarea. Since, many times, a page will have the same title in many languages (e.g. the page title "Microsoft"), it would be a nice convenience feature to check if a page in the "Set:" namespace with the current page title exists, and to offer linking to it.

Innerlanguage links are rendered the same way as interlanguage links. All innerlanguage links are shown, regardless of language preferences. Experience has shown that it is desirable to make multilingual activity visible in this manner; it also makes caching easier.

Namespace-specific behavior
Some MediaWiki namespaces have certain functionality associated with them. This functionality is affected in a multilingual wiki.

Templates
Unless you explicitly refer to a multilingual template or one in another language, templates would be looked for in the same language namespace as the page where they are used.

MediaWiki namespace
The MediaWiki namespace currently supports multilingual content using a somewhat hackish subpage syntax to disambiguate languages. It should be ported to use the new language code system. Ideally, this could also be used to deprecate $wgForceUIMsgAsContentMsg - if a multilingual page exists in the MediaWiki namespace for a message, it is used for all languages. (Some messages could be multilingual by default.) For example, it would be possible to either create a multilingual portal as the frontpage (by creating mult:MediaWiki:Mainpage), or separate portals for each language.

File descriptions
File descriptions can be multilingual, but the page title is identical across all languages (the filename).

Links to files (displayed as an image or not) will point to the description page in the language of the linking page. If no description exists in that language, viewing the description page will show the languages for which descriptions are available. As a nice-to-have feature, the user's language preferences could be evaluated to provide an automatic fallback, if possible; e.g., if the user speaks English, and an English language description is available, but a description in the language of the current context is not, the English description is shown.
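
The lookup order described above might look as follows. The function name `pick_description` and the convention of returning None (meaning "show the list of available languages") are assumptions of this sketch.

```python
def pick_description(available, context_language, user_languages=()):
    """Choose which file description to show, or None to list what is available.

    available        -- languages in which a description page exists
    context_language -- language of the page that linked to the file
    user_languages   -- the reader's languages in order of preference (optional)
    """
    if context_language in available:
        return context_language
    for language in user_languages:
        if language in available:
            return language       # automatic fallback to a language the user speaks
    return None


print(pick_description({"en", "fr"}, "de", ["en"]))
```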

An additional useful feature would be special treatment of descriptions in the mult: (multilingual) language namespace. Those descriptions could be shown above every description in a specific language. This would make it possible to have licensing information easily shared across languages, for example, and to port the existing description pages on wikis like Commons to the new system. However, in the long run, we will definitely want to move some of the image metadata into true, structured Wikidata.

Categories
It is possible to add innerlanguage links to categories; however, doing so does not mean that the translated category name (e.g. "en:Horse" => "de:Pferd") will inherit the category hierarchy of the original. Instead, category hierarchies can evolve separately in separate languages. In a multilingual repository like Wikimedia Commons, this means that separate file description pages in separate languages have separate categories.

In the long run, we want to complement the category system with meaning tags which relate directly to an element in the OmegaWiki thesaurus structure, allowing for a multilingual concept structure that is identical across languages and that can be automatically rendered into languages where expressions (words, phrases) for a concept are available (see a first mock-up of meaning tagging). This is a more desirable approach than perfecting wiki-local category schemas, as it promotes the use of OmegaWiki as a single, global, structured, omnilingual conceptual database of the world.

User interface language default
When no user interface language is set, the UI language should be identical to the content language, that is, when viewing pages in English, the UI language should be English; in German, the UI language should be German, and so on. This ensures that each language community can customize the interface messages according to their needs, and emulates the current behavior of multilingual Wikimedia projects with split databases.

Regular links
In a number of places where relations between pages are evaluated (e.g. "What links here"), the language codes will have to be shown alongside the page title. Otherwise, the behavior of regular links is not affected.

Language filtering




At least three special pages, Special:Contributions, Special:Allpages and Special:Recentchanges, should offer the ability to filter pages by language. It would be useful if this ability would gradually be added to other pages as well.

However, language filtering should be disabled by default. There should be a special, multilingual system message, MediaWiki:Language communities, which would contain a comma-separated list of language codes (but be blank by default). These language codes would identify self-organizing communities within a wiki.

For example, if a new wiki is started in English, and people start adding content in German, these additions should initially be visible to everyone. Other users can then try to determine whether they are legitimate additions to the wiki. If a true community seems to be forming, the language can be added to the list of language communities. Only then can it be filtered.

Once the first language community exists, the default filter would still be "All languages". The individual language communities would become available as single-language filters. Only if the user has set their language preferences would a new option appear (and become the default filter): "Languages I speak and new communities". "New communities" in this context would refer to languages which are not yet part of the list of language communities.
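
The filter rules in this section can be condensed into a short sketch. The function names are hypothetical, and representing "filtering disabled" as an empty option list is an assumption.

```python
ALL = "All languages"
MINE = "Languages I speak and new communities"


def parse_communities(message_text):
    """Parse the comma-separated MediaWiki:Language communities message."""
    return {code.strip() for code in message_text.split(",") if code.strip()}


def filter_options(communities, user_languages=()):
    """Return (options, default); no options means filtering is disabled."""
    if not communities:
        return [], None
    options = [ALL] + sorted(communities)
    default = ALL
    if user_languages:
        options.insert(1, MINE)
        default = MINE
    return options, default


def is_visible(page_language, selected, communities, user_languages=()):
    """Decide whether a page in page_language passes the selected filter."""
    if selected is None or selected == ALL:
        return True
    if selected == MINE:
        # Languages without a recognised community stay visible to everyone.
        return page_language in user_languages or page_language not in communities
    return page_language == selected


communities = parse_communities("en, de")
print(filter_options(communities, ["de"]))
print(is_visible("ja", MINE, communities, ["de"]))
```

Content in a language that has not yet been listed as a community (here "ja") remains visible under the personalised filter, which is what keeps new additions under the eyes of the existing community.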

The mechanism of identifying language communities, and generally showing all languages by default, hopefully ensures that vandalism and spamming will only be hidden from the view of all users once a legitimate community exists to deal with it. It also promotes interaction between the existing community and newly forming language communities, allowing for guidance and advice in formulating initial policies and setting up pages.

Language proficiency
The language proficiency, if known, should be shown in two places:
 * user pages
 * user list, administrator list (it would be ideal if these could be filtered).

It serves a similar purpose as the current "Babel" templates (see commons:User:Eloquence as an example - the boxes at the bottom are language proficiency templates), but can be reliably accessed by MediaWiki itself.

Go button
The Go button would behave as it currently does; however, it would search for pages in the language of the currently viewed page, unless a language prefix is provided. As a nice-to-have feature, if language communities are configured (see above), these should be available as a dropdown in the Go/Search toolbox.

"Create a page in this language"


One unique new feature that should be part of the first implementation of multilingual MediaWiki is the ability to easily create a version of a page in another language. At the bottom of each page, a menu is shown which allows the user to select a language, enter a title (prefilled by default with the current page title), and create the page.

It is important that the word "Translate" is avoided in the user interface in this context, as the linked page is not necessarily a direct translation (and perhaps most frequently will not be).

The languages shown in the dropdown selection depend on three factors:
 * Languages where a corresponding page already exists are not listed in the selection.
 * If the user is anonymous or has not set their language preferences yet, the languages come from LANGUAGE_DEFAULTS.
 * Otherwise, they come from the USER_LANGUAGES table.
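
The three factors above can be combined as in the following sketch. The function name, the dictionary standing in for LANGUAGE_DEFAULTS, and the choice to key the defaults by the language of the current page are assumptions; the specification does not say which language the defaults are looked up by.

```python
def creatable_languages(page_language, existing, user_languages, language_defaults):
    """Languages to offer in the menu for creating the page in another language.

    existing          -- languages in which a corresponding page already exists
    user_languages    -- the user's ordered preferences (empty if none are set)
    language_defaults -- mapping: language -> ordered list of default languages
    """
    if user_languages:
        candidates = user_languages
    else:
        candidates = language_defaults.get(page_language, [])
    taken = set(existing) | {page_language}
    return [language for language in candidates if language not in taken]


# Sample stand-in for the LANGUAGE_DEFAULTS table.
LANGUAGE_DEFAULTS = {"ca": ["es", "fr", "en"]}
print(creatable_languages("ca", {"es"}, [], LANGUAGE_DEFAULTS))
```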

A link with the title "Customize languages" or similar should open the language preferences dialog.

It would be helpful if a language Set: associated with the origin page were updated, or a new one created, in order to store the language link relationship.

Future ideas

 * See Talk:Multilingual MediaWiki