Parsoid/MediaWiki DOM spec/Language conversion blocks

Status: provisional / strawman. See bug 41716. Also see Writing_systems/Syntax.

Alternative 1
Basically as described in bug 41716#c37. Render the default variant according to the fallback chain for output-producing rules.

 foo-{bar baz}- quux 

 foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux 

 foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux 

Alternative 1b
Same basic idea as Alternative 1, but using more-specific attributes. Information that is redundant with the element's content is not stored in data-mw, which helps to make WTS more predictable.

 foo-{bar baz}- quux 

 foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux 

 foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux 

For unsupported conversion blocks we use :  -{T|zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- 

Alternative 2
This is the alternative currently implemented in Parsoid.

This option leaves the 'content' portion of the span empty, so that post-processing (or a JS switcher) can swap in the correct content.

The attribute is named  since it affects the read-only rendering of the page, and   attributes are supposed to be ignored for rendering and only needed for editing.

Top-level fields in the JSON are:,  ,  ,  ,  , and. If the wikitext "show" flag is not present or implicit, the DOM markup will use the  element. If "show" is present or implicit, the DOM markup will use  if contents are inlineable, or   otherwise.

 foo-{bar baz}- quux 

 foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

 foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

 foo-{H|WEBLOG=>zh-cn:blog;WEBLOG=>zh-hk:WEBJOURNAL}- quux

 a-{b c d}-e

Alternative 3
This option puts all the alternatives into the DOM, more smoothly handling nested markup. This uses  on the inner spans.

 foo-{bar baz}- quux

 foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

 foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

Alternative 3b
Like Alternative 3, this option puts all the alternatives into the DOM to smoothly handle nested markup. This variant uses the standard HTML5 lang attribute whenever possible, including setting it to the empty string (signifying "language unknown") where language conversion is disabled.

By making the nested content visible in the DOM, this more easily allows direct editing in the style described by T17161.

 foo-{bar baz}- quux

 foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

 foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

General language conversion plan
The wikitext language variant converter interface documented in Writing_systems/Syntax exposes two classes of operations:
 * 1) selecting content in place by variant, and
 * 2) dynamically modifying conversion rules, which then apply from that point in the page onward.

In-place content selection is used not only for regular translation pairs, but also for constructs like. Content is mostly well-nested, so we can represent it as an element. The exceptions found by grepping (see the regexp used below) are constructs like. These partly date from the time when the language converter could not be used inside attributes. We can probably fix them automatically by moving the variant block inside the attribute.

In general, we will render content-producing variant code based on the wiki's default variant and the fallback chain. Regular content conversion will only happen as a post-processing step on the saved Parsoid HTML.
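
As an illustration of picking content along a fallback chain, here is a minimal Python sketch. The function name pick_variant_text, the shape of the fallbacks mapping, and the sample chains are assumptions made up for this example; the real chains live in MediaWiki's language classes and are not reproduced here. The sketch also accepts a bare string as a chain, since getVariantFallbacks is noted elsewhere on this page to return either a string or an array.

```python
from typing import Dict, List, Optional, Union

# A chain may be a single code or a list of codes (hypothetical shape).
Chain = Union[str, List[str]]


def pick_variant_text(
    texts: Dict[str, str],
    variant: str,
    fallbacks: Dict[str, Chain],
) -> Optional[str]:
    """Return the text for `variant`, walking its fallback chain if needed.

    `texts` maps variant codes to content, as in -{zh-tw:foo;zh-cn:bar}-.
    Returns None when neither the variant nor any fallback has content.
    """
    chain = fallbacks.get(variant, [])
    if isinstance(chain, str):
        # Normalize the single-string form to a one-element list.
        chain = [chain]
    for code in [variant, *chain]:
        if code in texts:
            return texts[code]
    return None


# Illustrative data only; not the actual zh fallback configuration.
texts = {"zh-cn": "blog", "zh-tw": "WEBLOG"}
fallbacks = {"zh-hk": ["zh-hant", "zh-tw"], "zh-sg": "zh-cn"}
print(pick_variant_text(texts, "zh-hk", fallbacks))
```

With this sample data, zh-hk has no text of its own, so the lookup falls through zh-hant to zh-tw.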

Dynamic modification of rules does not seem to be needed in general. Page-global and per-category rules can replace template-based definitions. Until that is implemented, however, we need to represent existing add / remove rules inline. For constructs that also produce content, like  we can both render the content and record the rule modification in data-mw. Pure modifications (H flag) can be represented as meta tags.
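
The split between pure rule modifications and content-producing blocks can be sketched roughly as follows. This is a Python sketch, not Parsoid's implementation: split_flags and dom_element_for are hypothetical names, and the flag part is split on the first | only, so a flagless block whose content contains a | would be misparsed. Mapping the H and - flags to meta follows the suggestion on this page that those two can be metas while the others are spans.

```python
from typing import List, Tuple

# Flags treated as pure rule modifications (no rendered content).
PURE_RULE_FLAGS = {"H", "-"}


def split_flags(block: str) -> Tuple[List[str], str]:
    """Split '-{flags|body}-' into (flags, body); flags may be empty."""
    if not (block.startswith("-{") and block.endswith("}-")):
        raise ValueError("not a language conversion block: %r" % block)
    inner = block[2:-2]
    head, sep, body = inner.partition("|")
    if not sep:
        # No '|' at all: the whole block is content, e.g. -{bar baz}-.
        return [], inner
    flags = [f.strip() for f in head.split(";") if f.strip()]
    return flags, body


def dom_element_for(block: str) -> str:
    """Return 'meta' for pure rule modifications, 'span' otherwise."""
    flags, _body = split_flags(block)
    return "meta" if PURE_RULE_FLAGS.intersection(flags) else "span"


print(dom_element_for("-{H|WEBLOG=>zh-cn:blog;WEBLOG=>zh-hk:WEBJOURNAL}-"))
print(dom_element_for("-{A|XXX}-"))
```

The sketch ignores the choice between inline and block-level elements for content-producing blocks.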

Rule format for separately stored page-global rules
-{H|..}- and -{-|..}- can be represented as metas, others as spans. Block-level content seems to be rare.


 * {"*": "XXX"} for rules migrated from -{A|XXX}-
 * {"zh-cn": "tom hanks", "zh-hk": "SOUP HANS", "zh-tw": "TOM HANKS"}
 * {"zh-cn": {"HUGEBLOCK":"macro"}, "zh-hk": {"BLOCKHUGE":"big"}} for migrated -{H|HUGEBLOCK=>zh-cn:macro;BLOCKHUGE=>zh-hk:big;}-
 * {"zh-cn": {"HUGEBLOCK":"macro"}, "zh-hk": {"HUGEBLOCK":"big"}} for migrated -{H|HUGEBLOCK=>zh-cn:macro;HUGEBLOCK=>zh-hk:big;}-
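
The migration from inline rule text to the stored forms listed above could look roughly like this. It is a Python sketch under simplifying assumptions: rules_to_stored is a hypothetical name, the rule text is split naively on ; (so text containing a semicolon would break it), and a body without any variant prefix is stored under "*" as in the -{A|XXX}- case.

```python
from typing import Dict, Union

StoredRule = Dict[str, Union[str, Dict[str, str]]]


def rules_to_stored(body: str) -> StoredRule:
    """Convert the rule text of a conversion block to the stored format."""
    body = body.strip()
    if ":" not in body:
        # No variant prefix at all, as in -{A|XXX}-.
        return {"*": body}
    stored: StoredRule = {}
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate the trailing ';'
        source = None
        if "=>" in item:
            # Unidirectional rule: SOURCE=>variant:target
            source, item = item.split("=>", 1)
            source = source.strip()
        variant, sep, text = item.partition(":")
        if not sep:
            raise ValueError("missing variant prefix in %r" % item)
        variant, text = variant.strip(), text.strip()
        if source is None:
            stored[variant] = text
        else:
            entry = stored.setdefault(variant, {})
            if not isinstance(entry, dict):
                raise ValueError("mixed rule kinds for %r" % variant)
            entry[source] = text
    return stored


print(rules_to_stored("HUGEBLOCK=>zh-cn:macro;BLOCKHUGE=>zh-hk:big;"))
```

Mixing a direct rule and a unidirectional rule for the same variant is rejected here only because the stored format above does not show how that case would be represented.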

For consumers of this format:


 * If a rule value is a string, it is a direct translation rule
 * If a rule value is an object, it contains one or more unidirectional nested rules
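
A consumer following these two cases might flatten a stored rule like this. The Python sketch below is only an illustration: expand_rule and the tuple layout are hypothetical, and the "*" key is passed through unchanged as if it were a variant code.

```python
from typing import Dict, List, Optional, Tuple, Union

StoredRule = Dict[str, Union[str, Dict[str, str]]]
# (kind, source text or None, variant code, target text)
FlatRule = Tuple[str, Optional[str], str, str]


def expand_rule(rule: StoredRule) -> List[FlatRule]:
    """Flatten one stored rule into a list of simple tuples."""
    flat: List[FlatRule] = []
    for variant, value in rule.items():
        if isinstance(value, str):
            # String value: direct translation rule.
            flat.append(("direct", None, variant, value))
        elif isinstance(value, dict):
            # Object value: one or more unidirectional nested rules.
            for source, target in value.items():
                flat.append(("unidirectional", source, variant, target))
        else:
            raise TypeError("unexpected rule value for %r" % variant)
    return flat


for entry in expand_rule({"zh-cn": {"HUGEBLOCK": "macro"}, "zh-hk": {"BLOCKHUGE": "big"}}):
    print(entry)
```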

Other considerations

 * $wgDefaultLanguageVariant and its fallback chain are not currently exposed in the API. (Search for variantfallbacks in LanguageZh.php; the chain is retrievable from . Note that getVariantFallbacks can return a string or an array depending on the input. It would seem to make more sense for getVariantFallbacks to do the array_diff itself, but it does not currently do so.) We'll need both to pick the right content to render for -{zh-tw:foo;zh-cn:bar}-.
 * See also 52700.


 * What to store in  when it contains some other structures?
 * HTML, but the content needs to be properly nested. Run  on a zhwiki dump to find potentially problematic language conversion blocks, then check nesting for them. Problematic cases seem to be
 * different start tags per variant that really only differ in an attribute (title for example). Conversion pairs are now also supported in attributes, so try to fix wikitext to convert attribute only.

Result of : Total revisions: 2234532 Total matches: 773 Ratio: 0.034593373467016804%

Naive use of  will be misleading, as almost all of the -{ }- markup comes from templates.

 <div title="-{foo}-">
 * Need a way to mark up in-place variant conversions in attributes. Idea that might also be useful for transclusion-affected attributes: