Requests for comment/Scoped language converter

The current language converter contains a rule-based engine. Rules are applied from the point of definition on. See Writing systems/Syntax for more information.

This has a number of disadvantages. Rules defined in a template can unexpectedly leak into the body of an article. Changes to a template require the entire page to be re-rendered from the point of inclusion on. Finally, it exposes unnecessary complexity to the author/editor of an article, who would like to edit language converter rules on a global basis (in a page properties dialog, for example) without having to worry about exactly *where* in the article the rules are defined.
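The leak described above can be illustrated with a minimal sketch. It assumes a simplified model in which a page is a flat stream of text and rule-definition segments; the segment tuples and the function name `convert_point_of_definition` are hypothetical and are not the converter's real data structures. Rules are applied by plain substring replacement, which is cruder than the real engine.

```python
from typing import List, Tuple

# Hypothetical page model (an assumption for illustration only):
#   ("rule", source, target)  defines a conversion rule at this point
#   ("text", content)         is ordinary page text
Segment = Tuple[str, ...]


def convert_point_of_definition(segments: List[Segment]) -> str:
    """Apply each rule only to text that follows its definition."""
    rules = {}  # source term -> target term, in definition order
    out = []
    for seg in segments:
        if seg[0] == "rule":
            _, source, target = seg
            rules[source] = target
        else:
            text = seg[1]
            # Naive substring replacement; the real engine is more careful.
            for source, target in rules.items():
                text = text.replace(source, target)
            out.append(text)
    return "".join(out)


page = [
    ("text", "colour before. "),
    ("rule", "colour", "color"),  # e.g. defined inside an included template
    ("text", "colour after."),
]
print(convert_point_of_definition(page))  # colour before. color after.
```

Only the text after the rule is converted, so moving or editing the template that carries the rule changes the rendering of everything that follows it.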

There are some advantages to the system, which should be preserved. First, by allowing rules to leak out of templates it is possible to write templates which contain a number of useful rules. All articles in a given topic area can include that template at the top of the page.

The current plan for variant support in Parsoid is outlined in bug 41716.

It is proposed to add "glossaries" as a new first-class construct to encapsulate language converter rules. Each glossary will be stored in a machine-readable format (for example, JSON) and be easily editable. A page can use multiple glossaries, and glossary information is stored in page properties. This basic plan seems to have consensus support.
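As a rough illustration of what a machine-readable glossary might look like, the sketch below loads and checks a JSON glossary. The schema (a `name` plus a list of `rules`, each mapping variant codes to strings) is purely an assumption; this RFC does not fix a format. The rule content reuses the placeholder words from the wikitext example later on this page.

```python
import json
from typing import Any, Dict

# Hypothetical glossary format; the field names are assumptions.
GLOSSARY_JSON = """
{
  "name": "Science",
  "rules": [
    {"zh-cn": "something", "zh-tw": "else"}
  ]
}
"""


def load_glossary(raw: str) -> Dict[str, Any]:
    """Parse a glossary and check the assumed structure."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("name"), str):
        raise ValueError("glossary needs a string 'name'")
    rules = data.get("rules")
    if not isinstance(rules, list):
        raise ValueError("glossary needs a list of 'rules'")
    for rule in rules:
        # Each rule maps a variant code to the text used in that variant.
        if not isinstance(rule, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in rule.items()
        ):
            raise ValueError("each rule must map variant codes to strings")
    return data


glossary = load_glossary(GLOSSARY_JSON)
print(glossary["name"], len(glossary["rules"]))
```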

There are a number of unresolved design issues, which are the topic of this RFC:
 * What types of rules should be allowed in a glossary? How are conflicts between rules resolved?
 * What is the scope of a rule? Do glossaries for a given page affect templates included in that page? Can glossaries be included in a template?  Do they affect pages which include the template?
 * Do we need wikitext syntax for glossary-related properties? If so, what should that syntax be?

The new "glossaries" specified by this proposal will replace the current point-of-definition rule system. Storing rules in glossaries would provide better encapsulation and more efficient rendering. In order to provide a migration path, we propose to first add new wikitext rule syntax. Existing rules using the old point-of-definition semantics can then be incrementally migrated to the new syntax (with the aid of bots or other tools in most cases). The old rule syntax could then be deprecated and removed. It is proposed to add the new syntax at the same time bug 52661 is fixed.

In addition to rule syntax, we will also need a means to associate glossaries with pages. The eventual goal is to store this information as a page property, and not as invisible markup in the page. (Categories and other page properties will eventually be migrated out of the markup as well.)

However, the timing of the page property store is uncertain. We will also present wikitext syntax for glossaries, in case it proves to be convenient to store this in the page markup during initial implementation and/or transition to the page property store.

= Proposal =
We introduce a new namespace, Glossary. Most language converter rules reside in glossaries, although there is still per-page markup to add/override/remove rules. The page markup maintains the basic syntax of the language converter unchanged. The following existing flags are also unchanged:
 * no flag : a one-off conversion that does not modify the rule table
 * - : disable conversion
 * D : describe conversion
 * T : override language conversion in title

The T flag might be deprecated if  can be made to work after bug 52661 is fixed.

We probably also need to differentiate "unidirectional" from "bidirectional" rules.

We also propose the following features, although these are more tentative:
 * The "Science" glossary is to be found in the page "Glossary:Science". Eventually there will be special-purpose editing tools for glossaries. (In the initial implementation, the glossary page may contain rules in the wikitext format.)
 * The order of the rules matters, but there is no other priority system. If multiple rules apply to the same text, the last-specified one is applied. Because rule order matters, the set of glossaries associated with a page is also ordered. The last glossary takes precedence over earlier glossaries.
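The ordering rule can be sketched as follows. This is a minimal illustration, assuming a rule is just a (source, target) pair and that "apply to the same text" means "have the same source term"; both are simplifications. It also assumes per-page rules are applied after all glossaries, which is one reading of "override" above. The name `build_rule_table` is hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (source term, target term); a simplification


def build_rule_table(
    glossaries: List[List[Rule]], page_rules: List[Rule]
) -> Dict[str, str]:
    """Merge ordered glossaries, then page rules; the last definition wins."""
    table: Dict[str, str] = {}
    for glossary in glossaries:  # later glossaries override earlier ones
        for source, target in glossary:
            table[source] = target
    for source, target in page_rules:  # assumed to override all glossaries
        table[source] = target
    return table


science = [("term", "A"), ("other", "B")]
physics = [("term", "C")]
table = build_rule_table([science, physics], [("other", "D")])
print(table)  # {'term': 'C', 'other': 'D'}
```

Swapping the glossary order to `[physics, science]` would make `term` resolve to `A` instead, which is why the set of glossaries on a page has to be stored as an ordered list.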

Scope
There are two proposals for how glossaries interact with templates. Only one of these proposals should be necessary.

Template scope
In this proposal, rules defined inside a template apply to the template but do not leak out into surrounding content. Similarly, rules defined in a page do *not* apply to templates included on the page.

Global scope
In this proposal, rules defined on a page apply to the page and to all content included in the page.
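The difference between the two proposals can be sketched with a toy model. Everything here is hypothetical: the `Node` class, the `render` function, and plain substring replacement stand in for the real parser. Under global scope the sketch also assumes that a template's own rules still apply inside the template and override inherited ones, and that they do not leak out; the proposal text does not settle either point.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Node:
    """A page or an included template (hypothetical model)."""
    rules: Dict[str, str] = field(default_factory=dict)
    children: List[Union[str, "Node"]] = field(default_factory=list)


def apply_rules(text: str, rules: Dict[str, str]) -> str:
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def render(node: Node, scope: str,
           inherited: Optional[Dict[str, str]] = None) -> str:
    """Render with scope="template" or scope="global"."""
    if scope == "global":
        # Rules from the including page flow down into included content.
        rules = {**(inherited or {}), **node.rules}
    elif scope == "template":
        # Each page or template sees only its own rules.
        rules = dict(node.rules)
    else:
        raise ValueError("scope must be 'template' or 'global'")
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(apply_rules(child, rules))
        else:
            parts.append(render(child, scope, rules))
    return "".join(parts)


template = Node(rules={"lift": "elevator"}, children=["lift and lorry; "])
page = Node(rules={"lorry": "truck"},
            children=["lift and lorry; ", template, "lift and lorry."])

print(render(page, "template"))
# lift and truck; elevator and lorry; lift and truck.
print(render(page, "global"))
# lift and truck; elevator and truck; lift and truck.
```

In both modes the template's `lift` rule stays inside the template; the only difference is whether the page's `lorry` rule reaches the template body.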

= Discussion =
The proposed RFC allows an incremental transition to more easily-editable language converter rule semantics. It allows editors to begin to migrate content prior to the implementation of Parsoid/VE support. It may also save effort by the VE team -- "old style" rules can be treated as uneditable content (as VE currently treats other constructs it does not support) and a clean interface to edit page- or category-level language converter rules can be implemented from the start.

Note that the proposed rule semantics are worth discussing even if it is decided to implement scoped language converter rules without adding corresponding wikitext syntax.

Implementation
I propose to implement the proposed semantics concurrently with fixing bug 52661 on the PHP side and implementing a corresponding parser in Parsoid. This will help keep the parsing and semantics consistent.

Migration
A migration tool will probably be built on Parsoid. As a given page is parsed, it is straightforward to keep track of whether new rules added can be safely hoisted to page-level scope. All rules that can be safely hoisted can be automatically converted.
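One possible hoisting check is sketched below. It is an assumption, not the planned tool: the page is modelled as a flat stream of `("text", content)` and `("rule", source, target)` segments, and a rule counts as safe to hoist when no text before its definition contains its source term and the same source is never given a different target elsewhere on the page. Interactions between rules (for example, one rule's target containing another rule's source) are ignored. The name `hoistable_rules` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[str, ...]  # ("text", content) or ("rule", source, target)


def hoistable_rules(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Return rules that would convert the same text at page-level scope."""
    # Collect every target ever assigned to each source term.
    targets: Dict[str, Set[str]] = {}
    for seg in segments:
        if seg[0] == "rule":
            targets.setdefault(seg[1], set()).add(seg[2])

    earlier_text: List[str] = []
    safe: List[Tuple[str, str]] = []
    for seg in segments:
        if seg[0] == "text":
            earlier_text.append(seg[1])
            continue
        _, source, target = seg
        used_earlier = any(source in text for text in earlier_text)
        conflicting = len(targets[source]) > 1
        if not used_earlier and not conflicting:
            safe.append((source, target))
    return safe


page = [
    ("text", "A grey lorry. "),
    ("rule", "lift", "elevator"),
    ("rule", "lorry", "truck"),
    ("text", "A lift and a lorry."),
]
print(hoistable_rules(page))  # [('lift', 'elevator')]
```

The `lorry` rule is rejected because hoisting it would also convert the earlier "A grey lorry.", which the point-of-definition semantics leave untouched. A real tool would likely also re-render the page and compare the output, since this check is only a conservative first pass.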

Template-based rules must be converted to category-based rules in the "category" proposal. There are a relatively small number of templates which define rules; it should be feasible to convert these semi-automatically or even manually. Again, the parser can be used to verify that the page text does not change when the rules are converted.

NoteTA
The zhwiki has a gadget installed named 'NoteTA'. See zh:Module:NoteTA and zh:MediaWiki:Gadget-noteTA.js. This is used to display the current set of word conversions for a page.

Currently the noteTA dialog groups rules according to "section 1: title rule; section 2: page rules; section 3+: rules transcluded from other templates"; it uses a template like: in order to allow it to easily pull out the different categories of rules (see https://zh.wikipedia.org/w/index.php?title=User:Cscott/lct&diff=28211999&oldid=28210325 ). The template emits meta information about rule sets, like: which it then scrapes out of the HTML in order to build its dialog. The retrieved rules are embedded in wikitext markup using the -{D|...}- construct, and then sent to the Parser API to yield a human-readable description of the rule.

Under the "category" proposal, this should probably be rewritten as something like

-{P|zh-cn:something;zh-tw:else;}-

With the template being something like:

Page-scope rules can't be generated by a template (unless we use the "global" proposal), and so the template can't directly emit the meta-information used by the noteTA module. Perhaps a better language-converter-specific API can be added to list the rules (and their scopes) used by a page more directly. Alternatively, a variant on the D flag could be implemented that dumps all active rules used as easily-extractable <meta> tags. Or else the Parsoid DOM can be parsed directly for rules.

= See also =
 * bug 41716, describing future Parsoid support for language conversion
 * Parsoid/MediaWiki_DOM_spec, draft Parsoid spec for language converter syntax.
 * Requests for comment/Page and category based language variant conversion, more detailed write-up of the ideas from the draft Parsoid spec. Non-leaking page / category-global conversion rules using page property storage. No additional rule syntax in page content.
 * Parsoid/Language_conversion, more discussion.
 * An older version of this RFC

Language policy
The question if a new language edition may be created is what the language policy is about. The language policy does not allow for the creation of a new Wikipedia based on orthography. Consequently, it is not ok to even consider splitting zh.wp or sr.wp. GerardM (talk) 05:23, 22 May 2014 (UTC)
 * But are Cantonese and Mandarin really just split by questions of "orthography"? What about Hindi and Urdu?  These questions are never quite so simple as they first appear. cscott (talk) 16:00, 18 June 2014 (UTC)