Parsoid/Language conversion/Preprocessor fixups

The Problem
LanguageConverter markup was not well integrated with the wikitext parser (or with subsequent additions to wikitext), resulting in a number of unusual corner cases when pages contain -{ ... }- markup. These caused difficulties for editors on Wikipedias in multiple writing systems; we'd like to better support their scripts. The T54661 task tracks a number of these corner cases, which have been steadily fixed without much issue.

The remaining bug is T146304: the wikitext preprocessor doesn't understand language converter markup, so it splits template arguments incorrectly when language converter markup is present. For instance, when a template argument consists of a -{ ... }- construct that itself contains a | character, the preprocessor splits at that | and interprets the text as two arguments to the template, instead of one argument using language converter markup.
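The splitting problem can be illustrated with a toy model. The sketch below is not the MediaWiki preprocessor; it ignores nested templates and links, and the template body `echo|-{R|foo}-` is a hypothetical example chosen for illustration. It only contrasts splitting on every `|` with splitting that treats `-{ ... }-` as a unit.

```python
def split_args_naive(body: str) -> list[str]:
    """Split a template body on every '|', with no awareness of -{ ... }-."""
    return body.split("|")


def split_args_lc_aware(body: str) -> list[str]:
    """Split on '|' only when not inside a -{ ... }- construct (toy model)."""
    args, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == "-{":
            depth += 1                 # entering a language converter construct
            current.append(pair)
            i += 2
        elif pair == "}-" and depth > 0:
            depth -= 1                 # leaving the construct
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            args.append("".join(current))   # top-level separator
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    args.append("".join(current))
    return args


# Hypothetical body of a template call such as {{echo|-{R|foo}-}}
body = "echo|-{R|foo}-"
print(split_args_naive(body))     # ['echo', '-{R', 'foo}-']
print(split_args_lc_aware(body))  # ['echo', '-{R|foo}-']
```

The first result shows the construct torn into two arguments; the second keeps it as one.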

The most straightforward fix makes the preprocessor recognize -{ ... }- constructs in all wikitext, since preprocessor operation does not depend on page language (although actual "conversion" of -{ ... }- constructs at a later stage still occurs only on pages in certain languages). This is mostly safe, but some existing markup, such as a chemical name in a template on en:Glutathione, breaks under the new preprocessor rules. The -{ sequence in this chemical name begins a new "language converter" construct, and although we have preprocessor rules to handle "unclosed constructs" and precedence, they kick in only when the template is finally closed with the }} sequence. All the template arguments between the -{ and the }} are still swallowed up as apparent arguments to the language converter construct, and thus the template appears broken.

The fix is simple: -{ sequences need to be <nowiki>'ed, either by wrapping <nowiki> around an entire phrase or argument, or simply by separating the - and the { with an empty <nowiki/> tag.
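A minimal sketch of the "separate the two characters" variant of this fix follows. The function name is hypothetical, and the use of an empty `<nowiki/>` tag as the separator is an assumption about the exact replacement text. Occurrences followed by a second `{` are left alone, on the assumption that normal preprocessor precedence already handles them.

```python
import re

# '-{' not followed by another '{' (so '-{{' and '-{{{' are skipped).
LC_OPEN = re.compile(r"-\{(?!\{)")


def escape_lc_open(wikitext: str) -> str:
    """Insert an empty <nowiki/> between '-' and '{' (hypothetical helper)."""
    return LC_OPEN.sub("-<nowiki/>{", wikitext)


print(escape_lc_open("2-amino-4-{[x]carbamoyl}butanoic acid"))
print(escape_lc_open("see -{{foo}} here"))
```

A blanket rewrite like this is only a starting point; each hit should still be reviewed by hand, since some matches are intentional language converter markup.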

Occasionally the -{ sequence appears in URLs, as in en:Alan Turing; in that case one of the characters needs to be URL-encoded. We recommend encoding the - character as %2D.
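For URLs, the same idea can be sketched as a plain string replacement. The helper name and the example URL below are hypothetical; only the `%2D` encoding of the hyphen comes from the recommendation above.

```python
def encode_lc_open_in_url(url: str) -> str:
    """Percent-encode the '-' of every '-{' pair in a URL (hypothetical helper)."""
    return url.replace("-{", "%2D{")


# Hypothetical URL containing the problematic sequence.
print(encode_lc_open_in_url("http://example.org/archive/item-{42}.html"))
```

Other hyphens in the URL are left untouched, so the link stays readable and still resolves to the same address.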

False positives
The preprocessor handles constructs like -{{...}} without a problem, using a "righthand precedence" rule: the rightmost opening construct (in this case {{, which begins to the right of -{) is matched.

"Broken" constructs are also usually fine: lonely opening sequences get inserted as literals in the output. Problems occur only when a broken construct is inside a template {{...}}, wikilink [[...]], or template argument {{{...}}}. In this case the "broken" construct causes the preprocessor to ignore the close token for the surrounding context (since it is still "trying to close" the broken construct).

The grep results below don't eliminate cases "outside a template" or "outside a wikilink", because to do so would require reimplementing the wikitext preprocessor. That causes some false positives, which are probably safest to fix up anyway. But some wikis have a large number of matches due to (for example) inclusion in an active editor's wiki signature. These are probably safe to ignore, since wiki signatures generally don't occur inside links or templates.

How do we tell how widespread this is?
We initially used the 2017-03-20 dumps of all wiki projects with more than 1,000,000 articles, plus one further wiki where we had stumbled across some problematic markup. In a second pass we used all wikis from the 2017-05-01 dump.

Command used: a fork of a dump-search tool which allows line-by-line searches and the equivalent of a pipeline of chained filters. The command prints all lines which contain the -{ opening sequence, excluding the variants (such as -{{ and -{{{) for which the normal preprocessor precedence yields the expected (i.e., non-language-converter) result.

We then use a script to post-process the results into wikitext. The relevant pages on mediawiki are checked out locally. You can then have the script overwrite the appropriate files in the checkout, and commit and push the changes to update the copy on-wiki.
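The include-then-exclude search can be approximated in a few lines. This is a sketch under stated assumptions, not the command that was actually run: the function name is hypothetical, it works on already-extracted lines of wikitext rather than on a dump file, and the exclusion list contains only `-{{` (which also covers `-{{{`).

```python
def matching_lines(lines, include="-{", exclude=("-{{",)):
    """Yield (line_number, line) for lines containing `include`
    but none of the `exclude` patterns (hypothetical helper)."""
    for number, line in enumerate(lines, start=1):
        if include in line and not any(pattern in line for pattern in exclude):
            yield number, line.rstrip("\n")


sample = [
    "| IUPACName = 2-amino-4-{x}butanoic acid",  # candidate for a fixup
    "text -{{foo}} more text",                   # handled by normal precedence
    "plain line",
]
for number, line in matching_lines(sample):
    print(number, line)
```

Like the original search, this works line by line and cannot tell whether a match sits inside a template or a wikilink, so it reports false positives.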

As of 2017-05-01
See the breakdown by wiki (including all non-private non-closed wikis) at /20170501.

As of 2017-03-20
This was the initial grep done on the 2017-03-20 wiki dump, on only the >1,000,000 article wikis.
 * The number of pages may not be correct (+/- 10%)

Edit Rules

 * See: .../2017-03-20 list: edits (the edit process log page)