This is a draft document of internationalization requirements for the VisualEditor.
Detailed requirements for right-to-left (RTL) and bidirectional text are documented separately at VisualEditor/Bidirectional text requirements.
Existing standards (HTML, CSS, Unicode, etc.) already provide good support for internationalization, and so do their implementations in modern versions of common browsers. This is certainly true for Gecko and WebKit (with different bugs in each); Opera and IE are possibly more problematic.
As much as possible, existing standards and functionality must be reused and not reimplemented. In particular, this applies to the following features:
- Caret appearance and movement
  - The caret (a.k.a. cursor) has a familiar vertical-bar shape in English, but it may look different in other languages. In bidirectional environments it may have an arrow pointing in the writing direction, and in Chinese / Japanese / Korean environments it may take various shapes and behaviors according to the user's chosen input method. It's best to leave this to the OS and the browser.
- Word boundary calculation
  - Different languages and scripts define word boundaries very differently. Put simply, boundaries are not necessarily spaces and punctuation marks. Browsers and operating systems are supposed to know this and act accordingly. It must not be reimplemented without need.
- Keyboard behavior
  - Combinations with Ctrl and Alt keys may produce characters rather than actions in some keyboard layouts. In some operating systems (notably Mac), combinations with arrows affect directionality.
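The keyboard point can be illustrated with a minimal TypeScript sketch. It assumes that key presses arrive as objects shaped like the DOM KeyboardEvent; KeyLike and isEditorShortcut are hypothetical names used only for illustration, not VisualEditor APIs. The sketch declines to treat a key press as a command whenever AltGr may be involved, on the assumption that some platforms report AltGr as Ctrl+Alt and that such combinations type characters on many layouts.

```typescript
// Minimal subset of the DOM KeyboardEvent fields that this sketch needs.
// KeyLike is a hypothetical type, introduced here for illustration only.
interface KeyLike {
  key: string;       // value reported by the browser, e.g. "b" or "@"
  ctrlKey: boolean;
  altKey: boolean;
  metaKey: boolean;
  altGraph: boolean; // e.g. obtained from event.getModifierState("AltGraph")
}

// Returns true only when the key press can safely be handled as an editor
// command. Anything that might be character input is left to the browser.
function isEditorShortcut(e: KeyLike): boolean {
  // AltGr combinations type characters on many keyboard layouts.
  if (e.altGraph) {
    return false;
  }
  // Assumption: Ctrl+Alt together may be AltGr in disguise, so it is
  // treated as possible character input rather than as a command.
  if (e.ctrlKey && e.altKey) {
    return false;
  }
  // Plain Ctrl or Meta (Cmd) combinations are treated as commands.
  return e.ctrlKey || e.metaKey;
}
```

The conservative default is deliberate: a missed shortcut is an inconvenience, while a swallowed character makes a keyboard layout unusable.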
Fine-grained specification of content language: the lang attribute
It is becoming increasingly important to use the HTML lang attribute wherever possible. The WebFonts extension uses it to apply the correct font. Firefox uses it to apply the correct punctuation style and, starting at version 16 (or so), will use it to apply the correct spelling checker to a textarea. Mobile devices may use it to show the appropriate on-screen keyboard. Machine translation software may use it to produce correct translations.
In Wikipedia and in other projects, text is written in a mix of languages so often that this should be assumed to be the norm. Even in the simplest cases, an encyclopedic article written in English includes names of people and places in foreign scripts, quotations from foreign literary works, titles of foreign-language books, etc.
The lang attribute can be applied to almost any HTML element - the generic <span> and <div>, and also <b>, <p>, <a>, <textarea>, <img>, <pre>, etc.
lang - current situation
Currently MediaWiki automatically applies the lang attribute to the <html> element.
To mark elements in other languages, the attribute can be applied manually. This is done haphazardly, either with raw HTML tags or with templates such as w:en:Template:Lang and its many derivatives, such as w:en:Template:Lang-sl, w:en:Template:Nihongo, s:en:Template:GHGheb etc. Needless to say, every project implements these separately and differently, if at all.
lang - how it should be
It must be possible to specify that the language of the article's content is different from the language of the wiki. This must be somewhere in the page properties. getPageLanguage() can be a starting point, but its actual functionality is very limited. The Page Translation feature in the Translate extension applies language metadata to the pages it creates and this functionality could be moved to the core.
It must also be possible to specify the language of parts of the page using a visual dialog that can also be used for defining fonts and other properties. Obviously, reusing generic HTML abilities would be a good idea.
The language of the currently selected text must be displayed somewhere. Since language is not as visually obvious as font weight, it may be displayed on some kind of status bar; LibreOffice and MS Word display it on a status bar at the bottom. The fact that the language of a part of a page is different can also be indicated by presenting it with a differently colored font or background.
Naturally, the lang attribute value is supposed to be useful not only for display but also for editing, for example for choosing the right font while the text is being edited and for spell checking.
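The lookup that a status bar would need can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: LangNode stands in for a node of the edited document, pageLanguage is the hypothetical per-page property discussed above, and none of these names exist in MediaWiki or VisualEditor. The language of a selection is taken from the nearest ancestor that declares lang, then from the page, then from the wiki.

```typescript
// Hypothetical stand-in for a node in the edited document tree.
interface LangNode {
  lang?: string;     // value of the lang attribute, if the node declares one
  parent?: LangNode; // undefined at the root
}

// Hypothetical context: the per-page language is the proposed page
// property; the wiki language is the existing site-wide default.
interface LanguageContext {
  pageLanguage?: string;
  wikiLanguage: string;
}

// Returns the language that applies to the given node.
// Simplification: an empty lang value is treated as "not declared".
function effectiveLanguage(node: LangNode | undefined, ctx: LanguageContext): string {
  for (let n = node; n !== undefined; n = n.parent) {
    if (n.lang !== undefined && n.lang !== "") {
      return n.lang;
    }
  }
  // No element-level declaration: fall back to page, then wiki.
  return ctx.pageLanguage ?? ctx.wikiLanguage;
}
```

The same function could feed the status bar, the font choice and the spell checker, so that all three agree on the language of the text under the caret.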
Relationship with directionality
The HTML standard specifically says that text direction must not be deduced from the value of the lang attribute, so browsers don't do it. It's unfortunate, but that's how it is. The value of lang can, however, be used server-side to apply the correct direction to an element.
By default, elements with different directionality should be directionally isolated, either using the <bdi> element or the "unicode-bidi: isolate" CSS declaration.
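A minimal TypeScript sketch of both ideas, deriving the direction from the language code and isolating the result, might look like this. The function names and the short list of right-to-left languages are assumptions made for illustration; a real implementation would take direction from the wiki's language data and would also honor script subtags (for example, a language written in a non-default script), which this sketch ignores.

```typescript
// Illustrative and deliberately incomplete list of primary language
// subtags that are usually written right-to-left.
const RTL_PRIMARY_SUBTAGS: readonly string[] = ["ar", "fa", "he", "ur", "yi"];

// Derives a direction from a language code such as "he" or "ar-EG".
// Only the primary subtag is inspected; script subtags are ignored.
function directionFor(lang: string): "ltr" | "rtl" {
  const [primary = ""] = lang.toLowerCase().split("-");
  return RTL_PRIMARY_SUBTAGS.indexOf(primary) !== -1 ? "rtl" : "ltr";
}

// Escapes the characters that are unsafe in HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wraps text in a <bdi> element carrying both lang and dir, so that the
// fragment is tagged with its language and isolated from surrounding text.
function wrapInLanguage(text: string, lang: string): string {
  const dir = directionFor(lang);
  return `<bdi lang="${escapeHtml(lang)}" dir="${dir}">${escapeHtml(text)}</bdi>`;
}
```

For example, wrapInLanguage("שלום", "he") yields a <bdi> element with lang="he" and dir="rtl", which is the markup an editor dialog could emit when the user marks a selection as Hebrew.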
Fonts - current situation
Currently, there are three ways to use fonts in the MediaWiki world:
- For server-side rendering of SVG images, fonts can be installed on the server. For an example, see Bug 16284 - Support Linux Libertine Font for SVG rendering
- The WebFonts extension can apply fonts according to the lang attribute and the CSS font-family declaration. It also allows changing the display font of the rendered page. The Universal Language Selector may change how this currently works.
- MathJax is an externally developed product. It has its own mechanism to deliver fonts for rendering of mathematical symbols.
Fonts - how it should be
VisualEditor should, of course, support fonts for all languages. It should not, however, go too far and allow the user to choose any font. Wikimedia projects are, after all, rather conservative in their content. Readability and language diversity are essential goals for the projects, but design diversity is not. VisualEditor must, therefore, be flexible enough to allow the user to choose the language of a content element (lang attribute) and the content style (heading, body text, citation, poem, formula, source code example etc.). Creation of new styles, if needed, can be allowed, too. However, explicit and direct application of font names to the page is not required and should possibly even be prevented.
The functionality of the WebFonts extension should probably be merged into the core. Instead of connecting fonts with languages using this extension, the Language* classes should be aware of the fonts that are relevant for them.
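As a rough TypeScript sketch of language-aware font selection, the lookup below picks a font stack from the language and the content style, so that the user never names a font directly. Everything in it is hypothetical: the table layout, the function name fontStack, and in particular the font names, which are placeholders rather than fonts shipped by WebFonts or known to the Language* classes.

```typescript
// A few of the content styles mentioned above; the set is illustrative.
type ContentStyle = "body" | "heading" | "code";

// Hypothetical per-language font data. All font names are placeholders.
const FONTS_BY_LANGUAGE: Record<string, Partial<Record<ContentStyle, string[]>>> = {
  he: { body: ["ExampleHebrewSerif"], heading: ["ExampleHebrewSans"] },
  ja: { body: ["ExampleMincho"] },
};

// Generic CSS font families used when a language has no specific entry.
const GENERIC_FALLBACK: Record<ContentStyle, string> = {
  body: "serif",
  heading: "sans-serif",
  code: "monospace",
};

// Returns the font-family list for a language and content style.
// Only the primary language subtag is used for the lookup.
function fontStack(lang: string, style: ContentStyle): string[] {
  const [primary = ""] = lang.toLowerCase().split("-");
  const fonts = FONTS_BY_LANGUAGE[primary]?.[style] ?? [];
  // Always end with a generic family so that text stays readable.
  return fonts.concat(GENERIC_FALLBACK[style]);
}
```

Keying the table by language and style, rather than exposing font names in the editor, matches the requirement that users choose a language and a content style and nothing more.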
Though math formulas are not exactly a linguistic element, their editing and font usage may be related to how fonts are applied to languages, so this should be kept in mind.
- The browser comparison above is not a precise measurement. The author of this document rarely uses Opera and IE, and when he does, he constantly runs into problems with fonts, keyboards and directionality.
- The raw HTML tags used to mark language manually are probably mostly <span>, but it would be useful to check more precisely.
- Directional isolation by default is another thing that should be built into the HTML standard, but currently isn't. A bug report in W3C's Bugzilla: 18490: make any element with an explicit lang or dir attribute bidi-isolated by default.