Requests for comment/Scoped language converter/v1

Proposal Version 1
'''NOTE: the below draft requires revision to work with bug 41716's current plan for variant support in Parsoid. Unless first-class page properties land soon, we will still need wikitext syntax for "glossaries" and "page- (and template-) specific rules which override glossaries"; the below proposal can be revised to fit those needs.'''

The current language converter contains a rule-based engine. Rules are applied from the point of definition on. See Writing systems/Syntax for more information.
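
To make the point-of-definition behaviour concrete, here is a minimal sketch in Python. It is not the converter's actual implementation: the segment list, the `convert` and `render_point_of_definition` names, and the naive string replacement are all assumptions made for illustration.

```python
def convert(text, rules, variant):
    """Replace each rule's other-variant terms with the term for `variant`.

    Naive stand-in for the real converter (assumption: plain substring replace).
    """
    for rule in rules:
        target = rule.get(variant)
        if target is None:
            continue
        for other_variant, term in rule.items():
            if other_variant != variant:
                text = text.replace(term, target)
    return text


def render_point_of_definition(segments, variant):
    """Walk segments in order; a rule affects only the text that follows it."""
    active = []
    out = []
    for kind, value in segments:
        if kind == "rule":
            active.append(value)
        else:
            out.append(convert(value, active, variant))
    return "".join(out)


segments = [
    ("text", "football before. "),
    ("rule", {"en-gb": "football", "en-us": "soccer"}),
    ("text", "football after."),
]
print(render_point_of_definition(segments, "en-us"))
# football before. soccer after.
```

Only the text after the rule is converted. If the rule segment came from an expanded template, everything following the inclusion would be affected in the same way.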

This has a number of disadvantages. Rules defined in a template can unexpectedly leak into the body of an article. Changes to a template require the entire page to be re-rendered from the point of inclusion on. Finally, it exposes unnecessary complexity to the author/editor of an article, who would like to edit language converter rules on a global basis (in a page properties dialog, for example) without having to worry about exactly *where* in the article the rules are defined.

There are some advantages to the system, which should be preserved. First, by allowing rules to leak out of templates it is possible to write templates which contain a number of useful rules. All articles in a given topic area can include that template at the top of the page.

This RFC proposes to add syntax for scoped rules to the language converter. These rules would provide better encapsulation and more efficient rendering. In order to provide a migration path, we propose to first add new syntax. Existing rules using the old point-of-definition semantics can then be incrementally migrated to the new syntax (with the aid of bots or other tools in most cases). The old syntax could then be deprecated and removed.

It is proposed to add the new syntax at the same time bug 52661 is fixed.

= Proposals =
There are two proposals for new language converter syntax/semantics. Both proposals maintain the basic syntax of the language converter unchanged. The following existing flags are also unchanged:
 * no flag : a one-off conversion that does not modify the rule table
 * R : disable conversion
 * D : describe conversion
 * T : override language conversion in title

The T flag might be deprecated if  can be made to work after bug 52661 is fixed.

The two proposals differ on how to support the existing use case where common groups of rules are defined in templates in order to be used by the including page. Only one of these proposals should be necessary.

"Category" Proposal
This proposal adds "category rules" to handle common groups of rules. Pages inherit the category rules defined by all categories they are members of. The new flags are:


 * P : page-scoped rule. Regardless of where in the document this rule occurs, it applies to all text in the document.  It does *not* apply to text included from templates.  If the document is included (as a template or page transclusion) the rule applies to the included text but does not leak out to the including document.
 * C : category rule. If this rule is included in a document in the Category namespace, then this rule will apply to all pages in the category.  It does *not* apply to the text on the category page itself. (Similarly, rules with the P flag will affect the category page but will not affect pages in the category.)
 * p (lowercase P): scoped rule removal. The given rule is disabled (regardless of whether it was a P or C rule) in the same scope as a P rule.  This allows the page author to disable a few rules which would otherwise be active due to the page's category, for example.  (p rules override any P rules in the same scope.)

''Possibly P rules should also apply to text included from templates, although this complicates template caching. If they do apply, then the p rules allow the template to disable rules which might be inherited from the including scope.''

The p rule might be renamed to p-.
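
The way the P, C and p flags combine for a page's own text can be sketched as follows. This is only one reading of the flag descriptions above: the dictionary layout, `make_rule` and `effective_rules` are hypothetical names, and the example page (in the category, opting out of one rule) is invented for illustration.

```python
def make_rule(mapping):
    """Hypothetical helper: freeze a variant->term mapping so it can live in a set."""
    return frozenset(mapping.items())


FOOTBALL = make_rule({"en-gb": "football", "en-us": "soccer"})
PITCH = make_rule({"en-gb": "pitch", "en-us": "field"})


def effective_rules(page, categories):
    """Rules active for the page's own text under the "category" proposal."""
    inherited = set()
    for name in page["categories"]:
        # C rules apply to member pages, not to the category page itself.
        inherited |= set(categories[name]["C"])
    active = inherited | set(page["P"])
    # p rules disable matching P or C rules in the page scope.
    return active - set(page["p"])


categories = {"Category:Premier League": {"C": [FOOTBALL, PITCH]}}
page = {"categories": ["Category:Premier League"], "P": [], "p": [FOOTBALL]}

print(effective_rules(page, categories) == {PITCH})  # True
```

Note that nothing here depends on where in the page a rule appears, which is the property that lets the rules be edited from a page-level dialog.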

"Global" Proposal
This proposal adds "global rules" which deliberately leak from templates into the entire page scope. The new flags are:
 * G : global rule. Regardless of whether this comes from a (possibly nested) inclusion or where in the document this rule occurs, it applies to all text on the page.
 * C : child rule. This rule applies to the current page and any pages it includes, but not to any pages which include it.
 * L : local rule. This rule applies to the current page, but *not* to any included pages or pages which include it.
 * l (lowercase L): local rule removal. The given rule is disabled (regardless of whether it was a G or L rule) in the same scope as an L rule. (l rules override any L rules in the same scope.)

One of C or L might be removed as unnecessary.
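
A rough model of how the G, C, L and l flags could resolve over an inclusion tree is sketched below. Several points are assumptions rather than part of the proposal: the `pages` dictionary layout, the function names, that C rules flow down transitively, that l removes a matching rule whatever its origin, and that the inclusion graph is acyclic. Placing the l rule on the article is also just for illustration.

```python
def make_rule(mapping):
    """Hypothetical helper: freeze a variant->term mapping so it can live in a set."""
    return frozenset(mapping.items())


FOOTBALL = make_rule({"en-gb": "football", "en-us": "soccer"})
PITCH = make_rule({"en-gb": "pitch", "en-us": "field"})


def collect_global(pages, root):
    """All G rules found anywhere in the inclusion tree rooted at `root`."""
    found = set(pages[root].get("G", []))
    for child in pages[root].get("includes", []):
        found |= collect_global(pages, child)
    return found


def rules_for_text(pages, path):
    """Rules active for the own text of the last page in `path`.

    `path` is the inclusion chain from the top-level page down to that page.
    """
    active = collect_global(pages, path[0])      # G: whole page
    for name in path:                            # C: page and what it includes
        active |= set(pages[name].get("C", []))
    own = pages[path[-1]]
    active |= set(own.get("L", []))              # L: this page only
    return active - set(own.get("l", []))        # l: removal in local scope


pages = {
    "Game 39": {"includes": ["Template:Premier League"], "l": [FOOTBALL]},
    "Template:Premier League": {"G": [FOOTBALL, PITCH]},
}

print(rules_for_text(pages, ["Game 39"]) == {PITCH})  # True
```

In this model the article's own text loses the football rule, while text coming from the template still sees both G rules, because the l rule is local to the article.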

= Examples =
In our examples, we are writing pages about the en:Premier League. We use  to define useful rules for the (fictional) en-gb and en-us variants, as well as to provide an appropriate info box and categories.

"Category" Proposal
Template:Premier League:

-{p|en-gb:football; en-us:soccer;}-

Category:Premier League:

-{C|en-gb:football; en-us:soccer;}- -{C|en-gb:pitch; en-us:field;}-

The main article for this category is Premier League.

Game 39 (article):

The top football league in England, the Premier League is currently played on a double round robin basis... ...It needed the support only of the -{R|Football}- Association (FA)...

"Global" Proposal
Template:Premier League:

-{G|en-gb:football; en-us:soccer;}- -{G|en-gb:pitch; en-us:field;}-

-{l|en-gb:football; en-us:soccer;}-

Game 39 (article):

The top football league in England, the Premier League is currently played on a double round robin basis... ...It needed the support only of the -{R|Football}- Association (FA)...

= Discussion =
The proposed RFC allows an incremental transition to more easily-editable language converter rule semantics. It allows editors to begin to migrate content prior to the implementation of Parsoid/VE support. It may also save effort by the VE team -- "old style" rules can be treated as uneditable content (as VE currently treats other constructs it does not support) and a clean interface to edit page- or category-level language converter rules can be implemented from the start.

Note that the proposed rule semantics are worth discussing even if it is decided to implement scoped language converter rules without adding corresponding wikitext syntax.

Implementation
I propose to implement the proposed semantics concurrent with fixing bug 52661 on the PHP side and implementing a corresponding parser in Parsoid. This will help keep the parsing and semantics consistent.

Migration
A migration tool will probably be built on Parsoid. As a given page is parsed, it is straightforward to keep track of whether new rules added can be safely hoisted to page-level scope. All rules that can be safely hoisted can be automatically converted.
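
One way such a hoisting check could work is to render the page twice, once with the old point-of-definition semantics and once with the candidate rule active from the start, and compare the output for every variant. The sketch below assumes a simplified segment list and naive string replacement; `convert`, `render` and `can_hoist` are hypothetical names, not Parsoid APIs.

```python
def convert(text, rules, variant):
    """Naive stand-in for the converter: plain substring replacement."""
    for rule in rules:
        target = rule.get(variant)
        if target is None:
            continue
        for other_variant, term in rule.items():
            if other_variant != variant:
                text = text.replace(term, target)
    return text


def render(segments, variant, hoisted=()):
    """Point-of-definition rendering, except that rules whose segment index
    is in `hoisted` are active from the start of the page."""
    active = [value for i, (kind, value) in enumerate(segments)
              if kind == "rule" and i in hoisted]
    out = []
    for i, (kind, value) in enumerate(segments):
        if kind == "rule":
            if i not in hoisted:
                active.append(value)
        else:
            out.append(convert(value, active, variant))
    return "".join(out)


def can_hoist(segments, index, variants):
    """True if hoisting the rule at `index` leaves every variant's output unchanged."""
    return all(render(segments, v) == render(segments, v, hoisted={index})
               for v in variants)


FOOTBALL = {"en-gb": "football", "en-us": "soccer"}
VARIANTS = ["en-gb", "en-us"]

safe = [("rule", FOOTBALL), ("text", "football match")]
unsafe = [("text", "football before. "), ("rule", FOOTBALL), ("text", "football after.")]

print(can_hoist(safe, 0, VARIANTS))    # True
print(can_hoist(unsafe, 1, VARIANTS))  # False
```

In the second page the rule cannot be hoisted automatically, because text before the definition would start being converted. Such cases would need manual review.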

Template-based rules must be converted to category-based rules in the "category" proposal. There are a relatively small number of templates which define rules; it should be feasible to convert these semi-automatically or even manually. Again, the parser can be used to verify that the page text does not change when the rules are converted.

NoteTA
The zhwiki has a gadget installed named 'NoteTA'. See zh:Module:NoteTA and zh:MediaWiki:Gadget-noteTA.js. This is used to display the current set of word conversions for a page.

Currently the noteTA dialog groups rules according to "section 1: title rule; section 2: page rules; section 3+: rules transcluded from other templates"; it uses a template like: in order to allow it to easily pull out the different categories of rules (see https://zh.wikipedia.org/w/index.php?title=User:Cscott/lct&diff=28211999&oldid=28210325 ). The template emits meta information about rule sets, like: which it then scrapes out of the HTML in order to build its dialog. The retrieved rules are embedded in wikitext markup using the -{D|...}- construct, and then sent to the Parser API to yield a human-readable description of the rule.

Under the "category" proposal, this should probably be rewritten as something like

-{P|zh-cn:something;zh-tw:else;}-

With the template being something like:

Page-scope rules can't be generated by a template (unless we use the "global" proposal), and so the template can't directly emit the meta-information used by the noteTA module. Perhaps a better language-converter specific API can be added to list the rules (and their scopes) used by a page more directly. Alternatively, a variant on the D flag could be implemented that dumps all active rules used as easily-extractable <meta> tags. Or else the Parsoid DOM can be parsed directly for rules.

= See also =
 * bug 41716, describing future Parsoid support for language conversion
 * Parsoid/MediaWiki_DOM_spec, draft Parsoid spec for language converter syntax.
 * Requests for comment/Page and category based language variant conversion, more detailed write-up of the ideas from the draft Parsoid spec. Non-leaking page / category-global conversion rules using page property storage. No additional rule syntax in page content.
 * Parsoid/Language_conversion, more discussion.