Parsing/Replacing Tidy

Tidy has had a large number of bugs filed against it and is also based on HTML4 semantics. Additionally, since the effect of running Tidy on MW Parser main pass output is poorly specified, the Parsing team is working on instead using the HTML 5 parsing algorithm to clean up bad HTML in wikitext, similar to the way browsers deal with tag soup. This is the approach taken by Parsoid, and using the same approach in MediaWiki will help provide consistent output in the two parsers.

This effort is tracked in Phabricator in T89331.

The prototype solution is a Java service called Html5Depurate.

Visual diff testing
We found that a common impact of Html5Depurate was to cause changes to the HTML which either don't affect the visual layout at all, or cause only minor vertical whitespace changes. In the belief that minor vertical whitespace changes would be tolerable, we wrote an image differ called UprightDiff which is able to identify vertical motion within an image, and to discount such motion for the purposes of automated testing.

We exported a subset of about 64000 articles from various Wikimedia projects, and rendered them with Tidy and with Html5Depurate, then used UprightDiff to analyse the result. Current results can be seen at http://mw-expt-tests.wmflabs.org/.