Parsoid/Language conversion/Preprocessor fixups

From mediawiki.org

The Problem

LanguageConverter markup was not well-integrated with the wikitext parser (or with subsequent additions to wikitext), resulting in a number of unusual corner cases when pages contain -{…}- markup. These caused difficulties for editors on Wikipedias in multiple writing systems; we'd like to better support their scripts. The T54661 task tracks a number of these corner cases, which have been steadily fixed without much issue.

The remaining bug is phab:T146304: the wikitext preprocessor doesn't understand language converter markup, so it splits template arguments incorrectly when language converter markup is present. For instance:

{{1x|-{R|foo}-}}

This is interpreted as two arguments to the 1x template: -{R and foo}-, instead of one argument using language converter markup.
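As a rough illustration of the difference, the following Python sketch splits a template body on | in two ways: naively, and while treating -{…}- as a nested construct. The function split_args is hypothetical and far simpler than the real preprocessor (it ignores nested templates, links, and comments); it only shows why the argument count changes.

```python
from typing import List


def split_args(body: str, lc_aware: bool) -> List[str]:
    """Split a template body on top-level '|' characters.

    Hypothetical helper for illustration only. When lc_aware is True,
    pipes inside -{ ... }- are not treated as argument separators.
    """
    parts = [""]
    depth = 0  # nesting depth of -{ ... }- constructs
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if lc_aware and pair == "-{":
            depth += 1
            parts[-1] += pair
            i += 2
        elif lc_aware and pair == "}-" and depth > 0:
            depth -= 1
            parts[-1] += pair
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("")
            i += 1
        else:
            parts[-1] += body[i]
            i += 1
    return parts


# Current behaviour: the pipe inside -{...}- splits the argument.
print(split_args("1x|-{R|foo}-", lc_aware=False))
# Intended behaviour: the language converter construct stays in one argument.
print(split_args("1x|-{R|foo}-", lc_aware=True))
```

The first element of each result is the template name; the remaining elements are the arguments.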

The most straightforward fix makes the preprocessor recognize -{…}- constructs in all wikitext, since preprocessor operation is not dependent on page language (although actual "conversion" of -{…}- constructs at a later stage still only occurs on pages in certain languages). This is mostly safe, but there exists markup such as the following, from en:Glutathione:

{{Chembox
…
| IUPACName=(2''S'')-2-Amino-4-{[(1''R'')-1-[(carboxymethyl)carbamoyl]-2-sulfanylethyl]carbamoyl}butanoic acid
…
}}

This breaks under the new preprocessor rules. The -{ sequence in this chemical name begins a new "language converter" construct, and although we have preprocessor rules to handle "unclosed constructs" and precedence, they kick in only when the template is finally closed with the }} sequence. All the template arguments between the -{ and }} are still swallowed up as apparent arguments to the language converter construct, and thus the template appears broken.
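To make the "unclosed construct" concrete, here is a small Python sketch that reports every -{ in a string that has no matching }-. The function unclosed_lc_opens is a hypothetical helper, not part of MediaWiki; it assumes -{{ is left to the template rules and it knows nothing about the surrounding template, so it only flags candidates.

```python
from typing import List


def unclosed_lc_opens(text: str) -> List[int]:
    """Return the offsets of -{ sequences that are never closed by }-.

    Illustrative only: -{{ is skipped, and no other wikitext
    constructs are taken into account.
    """
    stack = []  # offsets of -{ sequences still waiting for a }-
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "-{" and text[i + 2:i + 3] != "{":
            stack.append(i)
            i += 2
        elif pair == "}-" and stack:
            stack.pop()
            i += 2
        else:
            i += 1
    return stack


name = ("(2''S'')-2-Amino-4-{[(1''R'')-1-[(carboxymethyl)carbamoyl]"
        "-2-sulfanylethyl]carbamoyl}butanoic acid")
# The chemical name opens a construct with -{ but never closes it with }-.
print(unclosed_lc_opens(name))
# A well-formed construct leaves nothing open.
print(unclosed_lc_opens("-{R|foo}-"))
```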

How to fix

The fix is simple: -{ sequences need to be escaped with <nowiki>, either by wrapping <nowiki>…</nowiki> around an entire phrase or argument, or simply by separating the - and the { like so: -<nowiki/>{
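For bulk fix-ups, the second form of the fix can be sketched in Python as a single substitution. The function escape_lc_open is a hypothetical helper; it assumes that -{{ should be left alone (the preprocessor already resolves that case in favour of the template) and that the input contains no intentional language converter markup, which it would otherwise break.

```python
import re

# Match "-{" unless it is immediately followed by another "{".
LC_OPEN = re.compile(r"-\{(?!\{)")


def escape_lc_open(wikitext: str) -> str:
    """Insert <nowiki/> between '-' and '{' so no -{ construct is opened.

    Illustrative only: apply it to text that is known not to contain
    genuine -{...}- markup.
    """
    return LC_OPEN.sub("-<nowiki/>{", wikitext)


print(escape_lc_open("2-Amino-4-{[(1''R'')"))
print(escape_lc_open("-{{Foo}}"))
```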

Occasionally the -{ sequence appears in URLs, and in that case one of the characters needs to be URL encoded, such as in en:Alan Turing:

 * [http://www.rkbexplorer.com/explorer/#display=person-{http://dblp.rkbexplorer.com/id/people-a27f18ebafc0d76ddb05173ce7b9873d-e0b388b7c1e0985b1371d73ee1fae8b5} Alan Turing] RKBExplorer

We recommend encoding the - character as %2D, like so:

* [http://www.rkbexplorer.com/explorer/#display=person%2D{http://dblp.rkbexplorer.com/id/people-a27f18ebafc0d76ddb05173ce7b9873d-e0b388b7c1e0985b1371d73ee1fae8b5} Alan Turing] RKBExplorer

More recommendations here.
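The URL variant of the fix amounts to a plain string replacement. The Python sketch below applies it to the URL from the en:Alan Turing example; encode_lc_open_in_url is a hypothetical helper name, and it assumes the string passed in is only the URL, not the surrounding wikitext.

```python
def encode_lc_open_in_url(url: str) -> str:
    """Percent-encode the '-' of every -{ sequence in a URL.

    Illustrative only: other hyphens in the URL are left untouched.
    """
    return url.replace("-{", "%2D{")


url = ("http://www.rkbexplorer.com/explorer/#display=person-"
       "{http://dblp.rkbexplorer.com/id/people-"
       "a27f18ebafc0d76ddb05173ce7b9873d-e0b388b7c1e0985b1371d73ee1fae8b5}")
print(encode_lc_open_in_url(url))
```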

False positives

The preprocessor handles constructs like -{{Foo}} without a problem, using a "righthand precedence" rule. The rightmost opening construct (in this case {{, which is to the right of -{) is matched.

"Broken" constructs are also usually fine: Lonely -{, like {{ or [[, get inserted as literals in the output. Problems only occur when a broken construct is inside a template {{...}}, wikilink [[...]], or template argument {{{...}}}. In this case the "broken" construct causes the preprocessor to ignore the close token for the surrounding context (since it is still "trying to close" the broken construct).

The grep results below don't eliminate cases "outside a template" or "outside a wikilink", because to do so would require reimplementing the wikitext preprocessor. That causes some false positives, which are probably safest to fix up anyway. But some wikis have a large number of matches due to (for example) inclusion in an active editor's wiki signature. These are probably safe to ignore, since wiki signatures generally don't occur inside links or templates.

How do we tell how widespread this is?

We initially used dumps of all wiki projects with >1,000,000 articles generated on 2017-03-20, plus mediawikiwiki since we stumbled across some problematic markup there. In a second pass we used all wikis from the 2017-05-01 dump. The wikitech wiki was downloaded from https://dumps.wikimedia.org/other/wikitech/dumps/

Command used:

 for wiki in $(cat ~/DumpGrepper/wikis.txt) ; do
   bzcat ~/DumpGrepper/$wiki-20170501-pages-articles.xml.bz2 | \
     node ./index.js --line -e -- '-[{]' '!' '<!--+[{]' '-[{][{]' | \
     tee results/$wiki-results.txt
 done
 zcat ~/DumpGrepper/labswiki-20170511.xml.gz | \
   node ./index.js --line -e -- '-[{]' '!' '<!--+[{]' '-[{][{]' | \
   tee results/labswiki-results.txt

This uses a fork of dumpgrepper which allows line-by-line searches and the equivalent of a grep -v pipeline. The command above prints all lines which contain -{ but not <!--{ (or <!---{) or -{{, since the normal preprocessor precedence yields the expected (i.e., non-language converter) result for those constructs. We then use a munge.js script to post-process the results into wikitext. The relevant pages on mediawiki.org are checked out using git-mediawiki:

git clone -c remote.origin.categories='Parsoid' -c remote.origin.mwLogin=[your mediawiki username] -c remote.origin.mwPassword=[your mediawiki password] mediawiki::https://www.mediawiki.org/w

You can then have the munge.js script overwrite the appropriate .mw files in the checkout, git add, git commit, and git push to update the copy on-wiki.
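The line filter described above can be approximated in a few lines of Python, which may help when checking a single page without a full dump. This is a sketch of the same include/exclude logic, not the dumpgrepper fork itself; matching_lines and the pattern names are hypothetical, and the regular expressions are copied from the command above.

```python
import re
from typing import Iterable, Iterator

# A line is reported if it contains -{ ...
INCLUDE = re.compile(r"-[{]")
# ... unless it also matches one of the patterns after the '!' marker.
EXCLUDE = [re.compile(r"<!--+[{]"), re.compile(r"-[{][{]")]


def matching_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines that contain -{ and match none of the excluded patterns."""
    for line in lines:
        if INCLUDE.search(line) and not any(p.search(line) for p in EXCLUDE):
            yield line


sample = [
    "| IUPACName=2-Amino-4-{[(1''R'')",  # reported
    "<!--{ commented out }-->",          # excluded: comment opener
    "-{{Foo}}",                          # excluded: resolved as a template
    "plain text",                        # no -{ at all
]
for line in matching_lines(sample):
    print(line)
```

As with grep -v, a line is dropped if any excluded pattern appears anywhere on it, even when the same line also holds a separate problematic -{ sequence.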

Articles which need to be fixed, by project

As of 2017-06-20

Breakdown: /20170620. As of June 2017, pages containing the syntax at fault are broken (in some cases, you may need to refresh to notice it).

As of 2017-06-01

Breakdown: /20170601. As of June 2017, pages containing the syntax at fault are broken (in some cases, you may need to refresh to notice it).

As of 2017-05-01

See the breakdown by wiki (including all non-private non-closed wikis) at /20170501. As of June 2017, pages containing the syntax at fault are broken (in some cases, you may need to refresh to notice it).

As of 2017-03-20

This was the initial grep done on the 2017-03-20 wiki dump, on only the >1,000,000 article wikis. See /20170320.

Edit Rules

See: .../Edit logbook (the edit process log page) for more details about what to fix and how

Notes

These numbers are not unmanageably large; we can probably fix them up by hand or with limited bot assistance.

Other options:

  • Make the preprocessor page-language aware, so -{…}- markup is only processed if the page language has variants. (But this complicates the preprocessor specification and rules for properly escaping wikitext.)
  • Add a backwards-compatibility rule specifically targeting -{…} (note no final dash) in template arguments. This is a narrow rule but (a) still results in a complicated preprocessor specification, and (b) might not actually address all the cases found above.