Parsoid/Language conversion/Preprocessor fixups

The Problem
LanguageConverter markup was not well integrated with the wikitext parser (or with subsequent additions to wikitext), resulting in a number of unusual corner cases when pages contain `-{` … `}-` markup. These caused difficulties for editors on Wikipedias in multiple writing systems; we'd like to better support their scripts. The T54661 task tracks a number of these corner cases, which have been steadily fixed without much issue.

The remaining bug is T146304: the wikitext preprocessor doesn't understand language converter markup, so it splits template arguments incorrectly when language converter markup is present. For instance:
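The original example on this page did not survive extraction; the following is a hypothetical reconstruction (the template name `quote`, the conversion flag `A`, and the Chinese text are illustrative only) of a converter construct used as a template argument:

```wikitext
{{quote|-{A|zh-hans:计算机;zh-hant:電腦}-}}
```

The `|` after the conversion flag is what the preprocessor mistakes for a template argument separator.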

This is interpreted as two arguments to the template, split at the `|` inside the converter markup, instead of as a single argument using language converter markup.

The most straightforward fix makes the preprocessor recognize `-{` … `}-` constructs in all wikitext, since preprocessor operation does not depend on the page language (although actual "conversion" of `-{` … `}-` constructs at a later stage still occurs only on pages in certain languages). This is mostly safe, but there exists markup such as the following, from en:Glutathione:
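For reference, a typical converter construct that the preprocessor would now treat as an atomic unit looks like this (the variant text is illustrative):

```wikitext
-{zh-hans:计算机;zh-hant:電腦}-
```

On a page in a converted language this renders as 计算机 or 電腦 depending on the reader's selected variant.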

This breaks under the new preprocessor rules. The `-{` sequence in this chemical name begins a new "language converter" construct, and although we have preprocessor rules to handle "unclosed constructs" and precedence, they kick in only when the template is finally closed with the `}}` sequence. All the template arguments between the `-{` and the `}}` are still swallowed up as apparent arguments to the language converter construct, and thus the template appears broken.

The fix is simple: `-{` sequences need to be `<nowiki>`'ed, either by wrapping `<nowiki>…</nowiki>` around an entire phrase or argument, or simply by separating the `-` and the `{` with a self-closing tag, like so: `-<nowiki/>{`
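Schematically (the parameter name and the elided middle of the chemical name here are placeholders, not the actual en:Glutathione wikitext):

```wikitext
<!-- Breaks: "4-{" begins a language converter construct: -->
| IUPACName = 2-amino-4-{[…]carbamoyl}butanoic acid

<!-- Fixed: the self-closing nowiki separates the "-" from the "{": -->
| IUPACName = 2-amino-4-<nowiki/>{[…]carbamoyl}butanoic acid
```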

Occasionally the `-{` sequence appears in URLs, where `<nowiki>` cannot be used; in that case one of the characters needs to be URL-encoded instead, such as in en:Alan Turing. I recommend encoding the `-` character as %2D, like so:
 * Alan Turing RKBExplorer
 * Alan Turing RKBExplorer
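For illustration (the host and path below are made up; the actual RKBExplorer URLs in the article differ):

```wikitext
<!-- Breaks: the URL contains a literal "-{": -->
[http://example.org/id/-{abc123} Alan Turing RKBExplorer]

<!-- Fixed: "-" percent-encoded as %2D, so "-{" no longer appears: -->
[http://example.org/id/%2D{abc123} Alan Turing RKBExplorer]
```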

How do we tell how widespread this is?
I used the 2017-03-20 dumps of all wiki projects with >1,000,000 articles, plus one additional wiki where I had stumbled across some problematic markup.

Command used:

```shell
for wiki in $(cat ~/DumpGrepper/wikis.txt) ; do
  bzcat ~/DumpGrepper/$wiki-20170320-pages-articles.xml.bz2 | \
    node ./index.js --line -e -- '-[{]' '!' '<!--+[{]' '-[{][{]' | \
    tee results/$wiki-results.txt
done
```

This uses a fork of DumpGrepper which allows line-by-line searches and the equivalent of a `grep | grep -v` pipeline. The command above says: print all lines which contain `-{` but not `<!--{` (or `<!---{`, etc.) or `-{{`, since the normal preprocessor precedence yields the expected (i.e., non-language-converter) result for those constructs. I then use a script to post-process the results into wikitext.

The relevant pages on mediawiki.org are checked out using the Git-Mediawiki remote helper:

```shell
git clone -c remote.origin.categories='Parsoid' \
  -c remote.origin.mwLogin=[your mediawiki username] \
  -c remote.origin.mwPassword=[your mediawiki password] \
  mediawiki::https://www.mediawiki.org/w
```

You can then have the post-processing script overwrite the appropriate files in the checkout, then `git add`, `git commit`, and `git push` to update the copy on-wiki.
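As a minimal sketch of the filtering logic described above (the sample input lines are invented), the same effect can be had with a plain `grep | grep -v` pipeline:

```shell
# Keep lines that contain "-{", then drop lines where that match is part of
# "<!--{" (an HTML comment opener) or "-{{" (a hyphen followed by a template
# call): preprocessor precedence already yields the expected result there.
printf '%s\n' \
  'x -{zh-hans:a;zh-hant:b}- y' \
  'x <!--{ comment' \
  'x -{{template}} y' \
  | grep -F -- '-{' \
  | grep -vE -- '<!--+\{|-\{\{'
# Only the first sample line survives the filter.
```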