Parsing/Replacing Tidy

Tidy has had a large number of bugs filed against it and is based on HTML4 semantics. In addition, the effect of running Tidy on the output of the MW Parser main pass is poorly specified. The Parsing team is therefore working on using the HTML5 parsing algorithm to clean up bad HTML in wikitext instead, similar to the way browsers deal with tag soup. This is the approach taken by Parsoid, and using the same approach in MediaWiki will help provide consistent output from the two parsers.

This effort is tracked in Phabricator in T89331.

The prototype solution is a Java service called Html5Depurate.

Visual diff testing
We found that a common impact of Html5Depurate was to cause changes to the HTML which either don't affect the visual layout at all, or cause only minor vertical whitespace changes. In the belief that minor vertical whitespace changes would be tolerable, we wrote an image differ called UprightDiff which is able to identify vertical motion within an image, and to discount such motion for the purposes of automated testing.
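The idea of discounting vertical motion can be illustrated with a minimal sketch. This is not UprightDiff's actual algorithm; it is a simplified stand-in that assumes an image is a list of pixel rows, treats all-background rows as vertical whitespace, and uses Python's difflib to align the remaining rows. The function name and the `background` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def count_non_motion_rows(before, after, background=0):
    """Count rows that still differ once vertical whitespace is discounted.

    `before` and `after` are images given as lists of rows, each row a
    sequence of pixel values. Rows made up only of `background` pixels are
    treated as vertical whitespace and ignored, so content that merely
    moved up or down does not count as a difference.
    """
    def content_rows(image):
        # Keep only rows that contain at least one non-background pixel.
        return [tuple(row) for row in image
                if any(pixel != background for pixel in row)]

    a = content_rows(before)
    b = content_rows(after)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Inserted, deleted, or replaced content rows are real changes.
            changed += max(i2 - i1, j2 - j1)
    return changed


# Same content, extra blank rows: only vertical motion.
before = [(0, 0), (1, 1), (0, 0), (2, 2)]
shifted = [(0, 0), (0, 0), (1, 1), (0, 0), (0, 0), (2, 2)]
# One content row actually differs.
altered = [(0, 0), (1, 1), (0, 0), (3, 3)]

print(count_non_motion_rows(before, shifted))
print(count_non_motion_rows(before, altered))
```

In this sketch a page whose only change is added or removed blank rows scores zero, while a page with altered content does not.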

We exported a subset of about 64000 articles from various Wikimedia projects, and rendered them with Tidy and with Html5Depurate, then used UprightDiff to analyse the result. Current results can be seen at http://mw-expt-tests.wmflabs.org/.

Test result notes
Here is a classification of diffs into different categories along with example titles, detailed description where useful, and proposed resolutions (to be filled out in all cases).

Self-closing tags
Tidy strips self-closing tags, but an HTML5 parser treats them as ordinary opening tags.
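The difference can be sketched with a toy, regex-based model. It is only an illustration of the two behaviours described above, not Tidy or a real HTML5 parser; the function names and the example tag are hypothetical. The one fact it relies on is that the HTML Standard ignores the trailing slash on non-void elements, so only void elements such as br or img are effectively self-closing.

```python
import re

# Void elements in the HTML Standard; the trailing slash is harmless on these.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Matches a self-closed tag such as <b/> or <span class="x" />.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*?)\s*/>")


def _rewrite(html, replace_non_void):
    def repl(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # leave void elements untouched
        return replace_non_void(name, attrs)
    return SELF_CLOSED.sub(repl, html)


def html5_style(html):
    """Model an HTML5 parser: the slash is ignored, leaving a start tag."""
    return _rewrite(html, lambda name, attrs: f"<{name}{attrs}>")


def tidy_style(html):
    """Model the Tidy behaviour described above: the tag is dropped."""
    return _rewrite(html, lambda name, attrs: "")


print(html5_style("one <b/> two"))
print(tidy_style("one <b/> two"))
```

Under the HTML5 model the element is left open, so the following text ends up inside it, which is one way the two renderings can diverge visually.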

Trailing whitespace migration from inline tags
Tidy migrates trailing whitespace from inside inline tags to just outside the closing tag, but this is broken Tidy behavior. An HTML5 parser will not do this.
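A rough model of this migration, for comparison against unmodified HTML5 output, might look like the following. It is a sketch only: the list of inline tags is an assumption chosen for illustration, and the function is a hypothetical helper rather than anything taken from Tidy.

```python
import re

# Assumed set of inline tags, for illustration only.
INLINE_TAGS = ("i", "b", "em", "strong", "span")

# Whitespace immediately before the closing tag of an inline element.
TRAILING_WS = re.compile(
    r"([ \t\n]+)(</(?:%s)>)" % "|".join(INLINE_TAGS),
    re.IGNORECASE,
)


def migrate_trailing_whitespace(html):
    """Move trailing whitespace from inside inline tags to outside them."""
    while True:
        # Swap the whitespace and the closing tag; repeat so that the
        # whitespace can also move out of enclosing inline tags.
        moved = TRAILING_WS.sub(r"\2\1", html)
        if moved == html:
            return html
        html = moved


print(migrate_trailing_whitespace("<i>foo </i>bar"))
print(migrate_trailing_whitespace("<b><i>foo </i></b>bar"))
```

An HTML5 parser would leave both inputs unchanged, keeping the space inside the inline element, where it picks up that element's styling.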

Wikitext markup errors
Examples: unclosed tables; nested tables in fosterable position; ... instead of .., etc. Tidy and Html5Depurate fix these up differently. Nothing needs to change in Html5Depurate; the obvious fix is to correct the affected templates and pages.

P-tags wrapping newlines
P-tags wrapping newlines appear in the Html5Depurate output but are stripped by Tidy. They cause minor whitespace margin diffs and seem to be the source of a lot of noise in the visual diff output.
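To gauge how much of the noise comes from this category, one option is to count whitespace-only paragraphs in the rendered HTML. The helper below is a hypothetical sketch using a regular expression, which is adequate for a rough count but is not a substitute for real HTML parsing.

```python
import re

# A <p> element, with optional attributes, whose content is only whitespace.
WHITESPACE_ONLY_P = re.compile(r"<p(?:\s[^<>]*)?>\s*</p>", re.IGNORECASE)


def count_whitespace_only_paragraphs(html):
    """Return the number of <p> elements that wrap nothing but whitespace."""
    return len(WHITESPACE_ONLY_P.findall(html))


sample = '<p>\n</p><p>text</p><p class="x"> </p>'
print(count_whitespace_only_paragraphs(sample))
```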