Parsing/Replacing Tidy

Tidy has had a large number of bugs filed against it and is based on HTML4 semantics. In addition, the effect of running Tidy on the output of the MediaWiki parser's main pass is poorly specified. The Parsing team is therefore working on using the HTML5 parsing algorithm instead to clean up bad HTML in wikitext, similar to the way browsers deal with tag soup. This is the approach taken by Parsoid, and using the same approach in MediaWiki will help provide consistent output from the two parsers.

This effort is tracked in Phabricator as T89331. The change will not happen during 2016, so that communities have several months to check their pages for errors.

We have two prototype implementations that we are testing: a PHP implementation called Balancer, and a Java service called Html5Depurate.

What this means for editors
Some templates currently rely on behaviour specific to Tidy which will not be retained. These templates will have to be updated. In initial testing (see below), we found issues such as:


 * Templates which generate plainly broken output (such as mismatched start and end tags), which Html5Depurate/Balancer cleans up in a different way to Tidy. These templates should be fixed.
 * Some templates generate unnecessary line breaks, and these line breaks cause MediaWiki's paragraph formatting (doBlockLevels) to output broken HTML. Tidy then cleans up the broken HTML differently from Html5Depurate/Balancer. Editors may have to work around this MediaWiki bug.
 * Tidy rearranges the whitespace in the HTML, mostly in a misguided attempt at pretty-printing. It assumes that these changes have no effect on the final layout, but the CSS white-space property can make them visible. This behaviour is usually harmless, but some templates rely on it, especially navboxes which use "white-space: nowrap" to prevent breaking in the middle of a list item. Some templates will need to be fixed.

To assist editors in migrating wikitext to the new rules, we will be deploying an extension called ParserMigration before the end of 2016. If you enable ParserMigration in your preferences, a link is added to the toolbox of all articles, which can show the current (Tidy) and expected (Balancer) output side-by-side, and can preview article text changes in the same side-by-side view.

We currently have no fixed schedule for migrating the default article HTML to Html5Depurate/Balancer. We will do this when we are satisfied that the impact on readers will be minor and tolerable. However, we do not want this to drag on indefinitely, so once the ParserMigration extension is available, it would be ideal if editors prioritized the template fixes.

Visual diff testing
We found that a common impact of Html5Depurate was to cause changes to the HTML which either don't affect the visual layout at all, or cause only minor vertical whitespace changes. In the belief that minor vertical whitespace changes would be tolerable, we wrote an image differ called UprightDiff which is able to identify vertical motion within an image, and to discount such motion for the purposes of automated testing.
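The idea of discounting vertical motion can be illustrated with a toy sketch. This is not UprightDiff's algorithm, which works on rendered images; here each page is assumed to be reduced to a hypothetical list of row summaries (one string per horizontal band, with an empty string for a blank band), and the function name `residual_changes` is invented for illustration.

```python
from difflib import SequenceMatcher


def residual_changes(before, after, blank=""):
    """Return rows that differ once pure vertical shifts are discounted.

    `before` and `after` are lists of row summaries, ordered top to bottom.
    """
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    residual = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same rows in the same order, possibly at a new vertical offset.
            continue
        # Inserted or removed blank rows are the vertical motion itself,
        # so only non-blank rows count as real differences.
        residual += [row for row in before[i1:i2] if row != blank]
        residual += [row for row in after[j1:j2] if row != blank]
    return residual


tidy = ["heading", "para 1", "para 2", "footer"]
shifted = ["heading", "", "para 1", "para 2", "footer"]
changed = ["heading", "para 1", "para two", "footer"]

print(residual_changes(tidy, shifted))  # []
print(residual_changes(tidy, changed))  # ['para 2', 'para two']
```

A page whose content merely moves down scores as unchanged, while a page whose content changes is still reported.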

We exported a subset of about 64,000 articles from various Wikimedia projects, rendered them with Tidy and with Html5Depurate, and then used UprightDiff to analyse the results. Current results can be seen at http://mw-expt-tests.wmflabs.org/.

See Uprightdiff numeric scoring for more details about how we assign a test score to each tested page.

Test result notes
Here is a classification of diffs into different categories along with example titles, detailed description where useful, and proposed resolutions (to be filled out in all cases).

Self-closing tags
Tidy strips self-closing forms of tags that are not void elements, but an HTML5 parser ignores the trailing slash and treats each one as an ordinary start tag.

Regular expression searches do not always return reliable results. See Phabricator bug for details. Sample searches to find these on your wiki:
 * To find  paste this into the search box:
 * To find  paste this into the search box:
 * To find  paste this into the search box:
 * To find  paste this into the search box:
 * To find  paste this into the search box:     (Note that this will find only "empty" divs.)
 * To find  paste this into the search box:     (Note that this will find only "empty" spans.)
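For text that has been exported from a wiki, a similar check can be run offline. The following is a minimal sketch, not one of the on-wiki searches: the function name, the regular expression and the `IGNORED` set are assumptions made for illustration. Void elements are skipped because a trailing slash on them is harmless, and a few wikitext extension tags are skipped because they are not HTML at all.

```python
import re

# Void elements in the HTML standard; "<br/>" and friends are fine.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Illustrative (assumed, incomplete) set of extension tags to skip.
IGNORED = {"ref", "references", "nowiki"}

# Matches tags such as <div/> or <span class="x" />; group 1 is the name.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def find_self_closing(wikitext):
    """Return names of self-closing tags an HTML5 parser would leave open."""
    names = (m.group(1).lower() for m in SELF_CLOSING.finditer(wikitext))
    return [n for n in names if n not in VOID_ELEMENTS and n not in IGNORED]


sample = '<div/>a<br/>b<span class="x" />c<ref name="n"/>'
print(find_self_closing(sample))  # ['div', 'span']
```

Like the on-wiki regular expression searches, this is a heuristic: it does not understand nowiki sections, comments, or attribute values that contain angle brackets.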



Trailing whitespace migration from inline tags
Tidy moves trailing whitespace from inside inline tags to just outside them, but this is broken Tidy behaviour. An HTML5 parser will not do this.
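To make the difference concrete, here is a rough approximation of the whitespace migration described above. It is a sketch under assumptions, not Tidy's implementation: the set of inline tags in `INLINE_TAGS` is an illustrative guess, and the function name is hypothetical.

```python
import re

# Assumed, illustrative set of inline tags; the real list is longer.
INLINE_TAGS = ("b", "i", "span")

# Whitespace immediately before an inline end tag, then the end tag itself.
TRAILING_WS = re.compile(
    r"(\s+)(</(?:%s)\s*>)" % "|".join(INLINE_TAGS), re.IGNORECASE
)


def migrate_trailing_whitespace(html):
    """Move whitespace from just inside an inline end tag to just after it."""
    return TRAILING_WS.sub(r"\2\1", html)


print(migrate_trailing_whitespace("<b>foo </b>bar"))  # <b>foo</b> bar
```

Comparing a page's HTML before and after this transformation gives a rough idea of where output that depends on the old behaviour may render differently.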

Wikitext markup errors
Examples: unclosed tables; nested tables in fosterable position;  instead of , etc. Tidy and Html5Depurate fix these up differently. There is nothing to change in Html5Depurate. The obvious fix is to correct the affected templates and pages.
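A crude way to spot unclosed or stray tags in a fragment of HTML is to track open tags on a stack. The sketch below uses Python's standard `html.parser` module purely as a tokenizer; the class and function names are hypothetical. It is not an HTML5 parser: it ignores optional end tags (for example on list items and table cells) and treats a self-closing tag as opened and immediately closed, so it will both over-report and under-report.

```python
from html.parser import HTMLParser

VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Collect unclosed start tags and stray end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of currently open elements
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.stack:
            self.problems.append(f"stray </{tag}>")
            return
        # Everything opened after the matching start tag was never closed.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"unclosed <{open_tag}>")


def check_balance(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    leftovers = [f"unclosed <{t}>" for t in reversed(checker.stack)]
    return checker.problems + leftovers


print(check_balance("<table><tr><td>x</td></tr>"))  # ['unclosed <table>']
print(check_balance("<div><b>x</div>"))             # ['unclosed <b>']
```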

P-tags wrapping newlines
Paragraph (p) tags that wrap newlines are present in the Html5Depurate output but are stripped by Tidy. This causes minor differences in whitespace margins and seems to be the source of a lot of noise in the visual diff output.

NOTE for editors: Html5Depurate is taking care of this automatically right now. Eventually we might remove this compatibility pass, but this won't be an issue in the initial Tidy removal rollout.
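As an illustration of what such a clean-up step might look like, the sketch below drops paragraph elements that contain only whitespace. It assumes that this is all the compatibility pass needs to do; it is not taken from Html5Depurate, and the function name is hypothetical.

```python
import re

# A <p> element (with optional attributes) containing only whitespace.
# The \b keeps the pattern from matching other tags such as <pre>.
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p>", re.IGNORECASE)


def strip_empty_paragraphs(html):
    """Remove paragraph elements whose content is whitespace only."""
    return EMPTY_P.sub("", html)


print(strip_empty_paragraphs("<div><p>\n</p><p>text</p></div>"))
# <div><p>text</p></div>
```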

List wrapping diffs because of inter-element whitespace (possibly a concern in other rendering scenarios?)
In the Tidy version, there seem to be whitespace characters between one list item and the next. In the Html5Depurate version, in some cases, they are not present. This seems to cause rendering differences in the wrapping of lists.
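When comparing two renderings of the same page, it can help to check whether they differ only in whitespace between elements. A minimal sketch, with hypothetical function names, assuming both outputs are available as strings:

```python
import re

# Whitespace that sits between the end of one tag and the start of the next.
INTER_ELEMENT_WS = re.compile(r">\s+<")


def strip_inter_element_whitespace(html):
    return INTER_ELEMENT_WS.sub("><", html)


def differs_only_in_inter_element_whitespace(a, b):
    """True if a and b differ, but only in whitespace between tags."""
    return a != b and (
        strip_inter_element_whitespace(a) == strip_inter_element_whitespace(b)
    )


with_ws = "<ul><li>a</li>\n<li>b</li></ul>"
without_ws = "<ul><li>a</li><li>b</li></ul>"
print(differs_only_in_inter_element_whitespace(with_ws, without_ws))  # True
```

A result of True does not mean the difference is invisible: as noted above, the missing whitespace can change how a list wraps.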