VisualEditor/Design/Bidirectional text requirements

UNFINISHED

This is a very, very rough draft. Any comment is welcome.

Preamble
Since MediaWiki in general and Wikimedia projects in particular are used in right-to-left languages, such as Arabic, Pashto, Persian and Hebrew, the projected visual editor must support writing and displaying them easily.

"Easily" in this context means that any user must be able to write mixed right-to-left and left-to-right text as easily as they would with pen and paper, and see the results while editing.

Although this is true for any text in right-to-left languages, proper support for mixing right-to-left and left-to-right is of particular importance for Wikimedia projects, which are heavily multilingual and frequently cite foreign names.

This document tries to be functional (describing what the software should do) rather than technical (describing how it should be implemented). Some implementation suggestions may creep into the document nevertheless.

The direction property
It must be possible to apply the direction property at several levels:
 * the level of a content page
 * a block-level element (usually a paragraph, but possibly also a table cell, a template etc.)
 * an inline element (effectively <span dir="...">, but probably set using a template).

Page-level
The page-level direction property is inherited from the direction of the content language of the wiki, but it must be possible to override it. This is not currently implemented in any structured way in MediaWiki and is usually addressed by manually applying direction markup to the whole page. (See bug 28970.) Having this is particularly useful for current Wikimedia sites like Meta, Commons, Outreach, Strategy etc., for "embassy" pages in various projects, and for other purposes.

Paragraph-level
The paragraph-level direction property is inherited from the direction of the page, but it must be possible to override it.
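The inheritance chain described so far can be sketched in a few lines. This is only an illustration, not MediaWiki code: the function name `resolve_direction` and its parameters are hypothetical, and directions are assumed to be the plain strings "ltr" and "rtl".

```python
from typing import Optional


def resolve_direction(wiki_default: str,
                      page_override: Optional[str] = None,
                      paragraph_override: Optional[str] = None) -> str:
    """Return the effective direction ("ltr" or "rtl") of a paragraph.

    Hypothetical helper: each level inherits from the level above it
    unless an explicit override is given.
    """
    # The page inherits from the wiki content language unless overridden.
    page_direction = page_override or wiki_default
    # The paragraph inherits from the page unless overridden.
    return paragraph_override or page_direction


# An Arabic "embassy" page on a left-to-right wiki.
print(resolve_direction("ltr", page_override="rtl"))
# The same page with one paragraph explicitly set to left-to-right.
print(resolve_direction("ltr", page_override="rtl", paragraph_override="ltr"))
```

The same pattern would extend one more level down for inline elements, which inherit from their paragraph.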

Inline
A common example of an inline element where a direction setting would be useful is the English spelling of a name (Yahoo!, 7 Up) or a code snippet in an Arabic Wikipedia article, or a Hebrew name ( וואללה! ) in an English article. Most likely this would be an HTML <span>, but it can also be <bdi> (in HTML5) or another element. It is likely that it would be applied using a template.
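To see why such names need an explicit direction, it helps to look at the Unicode bidirectional class of each character. The sketch below uses Python's standard `unicodedata` module; the helper name `bidi_classes` is made up for this illustration. The exclamation mark is neutral (class ON) and the digit is weak (class EN), so without markup they may be placed according to the surrounding paragraph rather than with the name they belong to.

```python
import unicodedata


def bidi_classes(text):
    """Pair every character with its Unicode bidirectional class.

    Hypothetical helper for illustration only.
    """
    return [(char, unicodedata.bidirectional(char)) for char in text]


# Latin letters are strong left-to-right (L); "!" is neutral (ON).
print(bidi_classes("Yahoo!"))
# The digit is a weak European number (EN); the space is whitespace (WS).
print(bidi_classes("7 Up"))
# Hebrew letters are strong right-to-left (R); the "!" is still neutral.
print(bidi_classes("\u05d5\u05d5\u05d0\u05dc\u05dc\u05d4!"))
```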

Direction setting
The usual keyboard shortcut in Windows and Linux for setting direction is Ctrl-Shift. Left Ctrl-Shift sets the direction to LTR and right Ctrl-Shift sets the direction to RTL.

In rich text editors such as LibreOffice and Microsoft Word it sets the direction of the current paragraph or of all the selected paragraphs. In plain text editors such as Notepad and browser textareas it sets the direction of the whole text, as there's no robust way to set the direction of a single paragraph in plain text (RLE/PDF is not quite robust).

Notable exceptions:
 * Firefox: Ctrl-Shift does nothing. Ctrl-Shift-X, on either side of the keyboard, switches the current direction (bidi.browser.ui must be set to true for this to work). This seems to be because the DOM doesn't define precisely what to do with Ctrl-Shift by itself. There's a very old request to change it to Ctrl-Shift (Mozilla Bug 98160 - Replace Accel+Shift+X (Switches text direction) with more intuitive keyboard shortcuts: Ctrl+Left/Right Shift on Windows).
 * Notepad++: Ctrl-Alt-L, Ctrl-Alt-R.
 * The commonly used rich text editor in GMail, which is actually a complete HTML iframe, doesn't provide any way to change paragraph direction with the keyboard. It can be changed using a toolbar button (which only appears if the interface language is RTL).

(XXX: No info about Mac.)

Since Ctrl-Shift is the common shortcut, the MediaWiki visual editor should use it in a fashion similar to LibreOffice and Word: to set the direction of the current paragraph or of all the selected paragraphs.
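A minimal sketch of that behaviour follows. It is not based on any existing editor code: the `Paragraph` class, the `apply_ctrl_shift` function and the "left"/"right" values for the side of the keyboard are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    """Hypothetical paragraph model: text plus a direction property."""
    text: str
    direction: str  # "ltr" or "rtl"


def apply_ctrl_shift(paragraphs: List[Paragraph],
                     first: int, last: int, side: str) -> None:
    """Set the direction of paragraphs first..last (inclusive).

    side is "left" for left Ctrl-Shift (LTR) or "right" for right
    Ctrl-Shift (RTL). With no selection, first and last are both the
    index of the paragraph that contains the cursor.
    """
    direction = {"left": "ltr", "right": "rtl"}[side]
    for paragraph in paragraphs[first:last + 1]:
        paragraph.direction = direction


document = [Paragraph("one", "ltr"), Paragraph("two", "ltr"), Paragraph("three", "ltr")]
# Right Ctrl-Shift with the second and third paragraphs selected.
apply_ctrl_shift(document, 1, 2, "right")
print([p.direction for p in document])
```

Paragraphs outside the selection keep whatever direction they had, which matches the rich text editors mentioned above.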

Important question: Should Ctrl-Shift be used to set the direction of a selected inline element? Intuition says that it should only work for paragraphs, but maybe user testing will show that users like it. (And maybe Ctrl-Shift-X should be used for that?)

Movement
It must be possible to use the right and left arrow keys intuitively. While the cursor is in words in the main language of a paragraph, the cursor should move in the direction of the arrow. The hardest decision to make is how the cursor should move in text of the other direction, for example in a Hebrew word within an English paragraph. Different programs handle this differently.

(Note: In Firefox this could be changed by setting bidi.controlstextmode in about:config, but this was disabled.)

Text selection always works in the logical order, because visual selection will cause non-contiguous text to be selected.

The movement order should probably be logical, since it is consistent with the selection order and more common (and probably easier to implement, too).
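The difference between the two orders can be made concrete with a deliberately simplified model. The sketch below is not the Unicode Bidirectional Algorithm: it assumes a left-to-right paragraph and only reverses runs of strong right-to-left characters (classes R and AL). The names `visual_order` and `is_contiguous` are hypothetical.

```python
import unicodedata
from typing import List


def visual_order(text: str) -> List[int]:
    """Return logical indexes in left-to-right display order.

    Simplified model for an LTR paragraph: each run of strong RTL
    characters is reversed, everything else stays in place.
    """
    order: List[int] = []
    rtl_run: List[int] = []
    for index, char in enumerate(text):
        if unicodedata.bidirectional(char) in ("R", "AL"):
            rtl_run.append(index)
        else:
            order.extend(reversed(rtl_run))
            rtl_run = []
            order.append(index)
    order.extend(reversed(rtl_run))
    return order


def is_contiguous(indexes: List[int]) -> bool:
    """True if the indexes form one unbroken range of the stored text."""
    ordered = sorted(indexes)
    return ordered == list(range(ordered[0], ordered[-1] + 1))


# Logical text: "abc", space, three Hebrew letters (alef, bet, gimel), space, "def".
text = "abc \u05d0\u05d1\u05d2 def"
order = visual_order(text)
print(order)

# Three characters that are adjacent on screen (display positions 2 to 4).
visual_selection = order[2:5]
print(visual_selection, is_contiguous(visual_selection))
```

In this example the three characters that sit next to each other on screen are the logical characters 2, 3 and 6, so a visual selection would have to skip over characters 4 and 5 in the stored text. A logical selection never has this problem, at the cost of a highlight that can look split on screen.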

Another consideration is Ctrl-Arrow, which makes the cursor jump from word to word. It should be obvious that its direction must be identical to that of the regular movement arrows, but in Chrome it's implemented incorrectly (Chromium Issue 10741: Ctrl + Right/Left combination moves cursor to opposite direction between words in RTL text box).

(Also Ctrl-Shift-Arrow, Ctrl-Del etc.)

Usage of control characters
Several transparent control characters are used for controlling directionality:
 * RLM / LRM - transparent characters with strong directionality. They can be thought of as invisible Hebrew and Latin letters, respectively. They are used for isolating inline pieces of text with ambiguous directionality, for example a Latin letter from a following number in a right-to-left paragraph. Equivalent to the HTML entities &rlm; and &lrm;.
 * RLE / LRE - start an embedded inline block of right-to-left or left-to-right text that runs until a PDF character. Similar to <span dir="rtl"> / <span dir="ltr">.
 * RLO / LRO - start a section in which characters are forced into one direction regardless of the Unicode bidirectional algorithm. That is, under LRO Hebrew letters will appear left-to-right. Equivalent to the HTML <bdo> tag.
 * PDF - pop directional formatting. Ends a block of text started by RLE, LRE, RLO or LRO.
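For reference, the characters in this list can be written out with their code points. The sketch below uses Python's standard `unicodedata` module to print the official name and bidirectional class of each one; the constant names and the `embed` helper are only illustrative.

```python
import unicodedata

# Code points of the directional control characters listed above.
LRM = "\u200e"  # left-to-right mark
RLM = "\u200f"  # right-to-left mark
LRE = "\u202a"  # left-to-right embedding
RLE = "\u202b"  # right-to-left embedding
PDF = "\u202c"  # pop directional formatting
LRO = "\u202d"  # left-to-right override
RLO = "\u202e"  # right-to-left override


def embed(text: str, direction: str) -> str:
    """Wrap text in RLE...PDF or LRE...PDF (hypothetical helper)."""
    opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF


for char in (LRM, RLM, LRE, RLE, PDF, LRO, RLO):
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}  "
          f"class {unicodedata.bidirectional(char)}")
```

The output shows that RLM and LRM have the same strong classes (R and L) as ordinary Hebrew and Latin letters, while the embedding, override and pop characters each have a formatting class of their own.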

There's no practical use for RLE, LRE, RLO, LRO and PDF in text that can be marked up with HTML.

RLM and LRM are currently frequently used in MediaWiki, as there is no other common way to isolate pieces of text with ambiguous directionality. They are often inserted as a template (such as w:he:Template:כ). Implementation of the HTML5 <bdi> tag may diminish the need for them, but currently they are needed.

Other considerations

 * Offline and PDF export: These features shouldn't be directly affected by this if parsing and rendering stay the same, but they must be tested anyway.
 * Web fonts.

Current bidirectional features in MediaWiki

 * Directionality support
 * Existing and resolved bugs related to right-to-left text

Related standards

 * Unicode Bidirectional Algorithm
 * Language information and text direction - Existing HTML4 standard.
 * Additional Requirements for Bidi in HTML - Prospective bidirectional features in HTML5.
 * Open Document Format for Office Applications (OpenDocument) v1.1 - a.k.a. ODF and ODT, used in LibreOffice, OpenOffice and other applications.
 * Bidi-related about:config settings in Mozilla