Parsoid/Language conversion/Preprocessor fixups

The Problem
LanguageConverter markup was not well integrated with the wikitext parser (or with subsequent additions to wikitext), resulting in a number of unusual corner cases when pages contain -{ markup. These cause difficulties for editors on Wikipedias written in multiple writing systems; we'd like to better support their scripts. The T54661 task tracks a number of these corner cases, which have been steadily fixed without much trouble.

The remaining bug is T146304: the wikitext preprocessor doesn't understand language converter markup, so it splits template arguments incorrectly when language converter markup is present. For instance:

This is interpreted as two arguments to the  template:   and , instead of one argument using language converter markup.
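The mis-split can be illustrated with a minimal Python sketch. This is not the actual preprocessor: the function name `split_args`, the `echo` template, and the `-{R|foo}-` input are hypothetical, and nested templates and links are ignored. It only contrasts a split that is unaware of language converter markup with one that treats -{ ... }- as a nested construct.

```python
def split_args(body, lc_aware):
    """Split a template body on top-level '|' characters.

    When lc_aware is True, pipes inside -{ ... }- are not treated
    as argument separators (simplified; no templates or links).
    """
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if lc_aware and two == "-{":
            depth += 1                 # entering a language converter construct
            current.append(two)
            i += 2
        elif lc_aware and two == "}-" and depth > 0:
            depth -= 1                 # leaving the construct
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))   # top-level pipe: new part
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


body = "echo|-{R|foo}-"   # hypothetical template body (name, then arguments)
print(split_args(body, lc_aware=False))  # ['echo', '-{R', 'foo}-']
print(split_args(body, lc_aware=True))   # ['echo', '-{R|foo}-']
```

The first element of each result is the template name; the unaware split yields two arguments after it, the aware split yields one.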

The most straightforward fix makes the preprocessor recognize -{ ... }- constructs in all wikitext, since preprocessor operation does not depend on page language (although actual "conversion" of -{ ... }- constructs at a later stage still occurs only on pages in certain languages). This is mostly safe, but there exists markup such as the following, from en:Glutathione:

This breaks under the new preprocessor rules. The -{ sequence in this chemical name begins a new "language converter" construct, and although we have preprocessor rules to handle "unclosed constructs" and precedence, they kick in only when the template is finally closed with the }} sequence. All the template arguments between the -{ and the }} are still swallowed up as apparent arguments to the language converter construct, and thus the template appears broken.

The fix is simple: -{ sequences need to be <nowiki>'ed, either by wrapping <nowiki> around an entire phrase or argument, or simply by separating the "-" and the "{" like so:

Occasionally the -{ sequence appears in URLs, and in that case one of the characters needs to be URL-encoded, such as in en:Alan Turing: I recommend encoding the "-" character as %2D, like so: Alan Turing
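The two fixes described above could be applied mechanically along these lines. This is a rough sketch under stated assumptions, not the script actually used: the names `escape_lc_open`, `URL_RE` and `LC_OPEN_RE` are hypothetical, the URL pattern is deliberately crude, `-<nowiki/>{` is assumed as the way of separating the two characters, and "-{{" is assumed to need no change because it opens a template.

```python
import re

# Crude URL matcher (assumption): runs until whitespace, ']', '|', '<' or '>'.
URL_RE = re.compile(r"https?://[^\s\]|<>]+")
# "-{" not followed by another "{" (assumption: "-{{" is left alone).
LC_OPEN_RE = re.compile(r"-\{(?!\{)")


def escape_lc_open(wikitext):
    """Neutralise stray "-{" sequences in a piece of wikitext."""
    def fix_url(match):
        # Inside URLs, percent-encode the hyphen instead of using nowiki.
        return LC_OPEN_RE.sub("%2D{", match.group(0))

    text = URL_RE.sub(fix_url, wikitext)
    # Everywhere else, separate "-" and "{" with an empty nowiki tag.
    return LC_OPEN_RE.sub("-<nowiki/>{", text)


print(escape_lc_open("N-{2-amino}glycine"))
print(escape_lc_open("[http://example.org/a-{b}.html Some link]"))
```

Intentional language converter markup would also be rewritten by this sketch, so any real run needs a manual review step.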

How do we tell how widespread this is?
Used 2017-03-20 dumps of all wiki projects with >1,000,000 articles, plus  since I stumbled across some problematic markup there.

Command used: Using a fork of  which allows line-by-line searches and the equivalent of a   pipeline. The command above says: print all lines which contain -{ but not   (or  ) or , since the normal preprocessor precedence yields the expected (i.e., non-language converter) result for those constructs. I then use a  script to post-process the results into wikitext. The relevant pages on mediawiki are checked out using : You can then have the  script overwrite the appropriate   files in the checkout, ,  , and   to update the copy on-wiki.
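The line-by-line filter can be approximated in plain Python. This sketch does not reproduce the command actually used: the function name `grep_pipeline` and both patterns are assumptions, with the include pattern set to "-{" and the exclude pattern set to "-{{" purely for illustration.

```python
import re


def grep_pipeline(lines, include=r"-\{", exclude=r"-\{\{"):
    """Yield (line_number, line) for lines matching include but not exclude.

    Both default patterns are illustrative assumptions.
    """
    inc = re.compile(include)
    exc = re.compile(exclude)
    for number, line in enumerate(lines, 1):
        if inc.search(line) and not exc.search(line):
            yield number, line.rstrip("\n")


sample = [
    "plain text\n",
    "N-{2-amino}glycine\n",   # hypothetical chemical-style name
    "x-{{tpl}}\n",            # "-{{" opens a template, so it is skipped
]
for number, line in grep_pipeline(sample):
    print(number, line)
```

Because the exclusion is applied per line, a line containing both a stray "-{" and a "-{{" would be skipped; a per-match filter would be stricter.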

Edit Rules

 * Pages listed as of April 10, 2017. 13 wikis (incl. zhwiki; empty list), ~1500? pages total.

Edit Rules as applied:
 * Edits done in the listed pages:
   * Change -{ into -<nowiki/>{
     * In chemical names (mostly IUPAC names and similar; possibly 75% of all affected pages)
     * In species descriptions (example: Oloo, G.W. (1975) Sugarcane. 1.- {Aulacaspis} spp. and other scales; note: closing hyphen not present)
     * In module documentation pages ( Module:.../doc )
   * In URLs: change -{ into %2D{ (see the Alan Turing example)
   * Removed when it is a typo (for example  in wikitable pipe code)
 * Pending
   * When in a static archive or log page (mostly before 2010).
 * Listed pages not edited:
   * No edit when in ns Module (Lua code)
   * No edit when (possibly) intentionally used for LanguageConverter: en:Bulgarian language has "-{ost/est}" (See note R)
   * No edit when inside a TeX source code description (example: en:File:Homotopy lifting property.svg). (See note R)
 * Note R
   * These rules might need refinement, to select the true positives for editing.