Parsing/Replacing Tidy

Please see the FAQ for more focused discussion of why we are replacing html4-tidy with RemexHtml.

Tidy has had a large number of bugs filed against it and the current binary deployed on the Wikimedia cluster is based on HTML4 semantics. Additionally, since the effect of running Tidy on MW Parser main pass output is poorly specified, the Parsing team is working on instead using the HTML 5 parsing algorithm to clean up bad HTML in wikitext, similar to the way browsers deal with tag soup. This is the approach taken by Parsoid, and using the same approach in MediaWiki will help provide consistent output in the two parsers.

This effort is tracked in Phabricator in T89331. The change was not made during 2016, so that communities had several months to check their pages for errors.

After testing three different Tidy-replacement implementations (one in Java, and two in PHP), we have settled on a version based on RemexHTML, a PHP-only HTML5 parsing library.

See also the simplified instructions for editors.

What this means for editors
Some templates currently rely on behaviour specific to Tidy which will not be retained. These templates will have to be updated. In initial testing (see below), we found issues such as:


 * Templates which generate plainly broken output (such as mismatched start and end tags), which RemexHTML cleans up in a different way to Tidy. These templates should be fixed.
 * Some templates generate unnecessary line breaks, and these line breaks cause MediaWiki's paragraph formatting (doBlockLevels) to output broken HTML. Tidy then cleans up the broken HTML differently from RemexHTML. Editors may have to work around this MediaWiki bug.
 * Tidy rearranges the whitespace in the HTML, mostly in a misguided attempt at pretty-printing. It assumes that these changes have no effect on the final layout, but the CSS white-space property can make them visible. This behaviour is usually harmless, but some templates rely on it, especially navboxes which use "white-space: nowrap" to prevent breaking in the middle of a list item. Some of these templates will need to be fixed.

To identify some of the pages that need fixing, we are working to add new categories to the Linter extension for scenarios 2. and 3. in the Things to fix section below. In addition, to assist editors in migrating wikitext to the new rules, we deployed an extension called ParserMigration. If you enable ParserMigration in your preferences (under "Editing > General options > Enable parser migration tool"), a link called "Edit with migration tool" is added to the toolbox of all articles. It shows the current (Tidy) and expected (RemexHTML) output side-by-side, and can preview article text changes in the same side-by-side view. You can then make the required changes to the wikitext and compare how the page renders with Tidy and with RemexHTML via the side-by-side preview.

We currently have no fixed schedule for turning off Tidy and replacing it with the RemexHTML-based version. We will do this when we are satisfied that the impact on readers will be minor and tolerable. However, we would not like to drag this on indefinitely, so it would be ideal if editors prioritized the template fixes.

Things to fix
Based on visual diff testing, there are three main categories of markup that need fixing:
 * 1) Self-closing tags: these are no longer treated as empty tags by an HTML5 parser. Both the PHP parser and Parsoid currently have backward-compatibility code to handle them, but it is still worth fixing them. The cleanup effort is going well so far. This dashboard tracks progress.
 * 2) Broken wikitext markup: Tidy and an HTML5 parser might fix up mismatched tags or missing end tags differently in some scenarios.
 * 3) Pages/templates relying on Tidy's whitespace mangling behavior: Tidy does three kinds of whitespace mangling that an HTML5 parser does not do. An HTML5 parser leaves whitespace alone because CSS rules can be affected by it; pages that use such rules are exactly the ones that will be impacted. The three kinds are:
    * Tidy migrates trailing whitespace out of inline tags to outside the tag.
    * Tidy adds whitespace characters between a closing tag and an opening tag, for example newlines between list items or table cells.
    * Tidy strips whitespace between a tag and its contents, for example whitespace inside a list item or a table cell.

It is possible to provide tool support for scenarios 1 and 2 above.

For 1, we have been adding maintenance categories to pages using self-closing tags, and editors have been using these categories and other tools to fix them.

For 2, there are the various Project CheckWiki tools. Besides that, the parsing team will deploy the Linter extension to help identify and fix some of these scenarios.

However, we don't know how to provide tooling to detect / list pages that rely on Tidy's whitespace mangling behavior. See below for more information about this.

Effect of whitespace changes
The net effect of Tidy's whitespace mangling is that whitespace in wikitext / HTML is moved, added, or removed around tags. In most cases, this does not make a difference to rendering. However, some pages and templates will be affected. If a page sets the CSS white-space property, or styles a list with CSS that is sensitive to whitespace, it will be affected, because none of the Tidy replacements mangle whitespace in this way.

See T155634 (affects a template), T74416#2384571 (affects a template that has been fixed). There are some other templates and pages that have been identified below. We expect navbox templates and similar templates might need fixing to not rely on whitespace mangling for correct rendering.

Detailed information about what to fix

 * The FAQ page may contain more up to date information.

Here is a classification of diffs into different categories along with example titles, detailed description where useful, and proposed resolutions.

Self-closing tags
Tidy strips self-closing tags, but an HTML5 parser treats them as ordinary opening tags. This dashboard tracks progress of editors fixing these.

Sample searches to find these on your wiki: For a long list of many more searches that have yielded results on the English Wikipedia, expand the box below. Note that regular expression searches do not always return complete results. See Phabricator bug for details.
 * To find  paste this into the search box:
 * To find  paste this into the search box:
 * To find  paste this into the search box:
 * To find  paste this into the search box:     (Note that this will find only "empty" divs; structures such as   are also wrong.)
 * To find  paste this into the search box:     (Note that this will find only "empty" spans; structures such as   are also wrong.)

[Note: some of these searches find tags that are not self-closed tags, but are broken in some other way and should be fixed.]
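
For editors who work on a local dump or a saved copy of a page rather than the on-wiki search box, the same kind of check can be sketched offline. The following is a minimal illustration in Python, not one of the searches referred to above: the function name and the default tag list (div and span, the two tags mentioned in the notes above) are assumptions made for this example.

```python
import re

def find_self_closed_tags(wikitext, tags=("div", "span")):
    """Return (tag_name, matched_text) pairs for self-closed uses of `tags`.

    The default tag list is an assumption for illustration; extend it to
    cover whichever non-void HTML tags you want to check.
    """
    names = "|".join(re.escape(t) for t in tags)
    # A tag name from the list, optional attributes, then "/>".
    pattern = re.compile(r"<(%s)\b[^<>]*/>" % names, re.IGNORECASE)
    return [(m.group(1).lower(), m.group(0)) for m in pattern.finditer(wikitext)]


sample = 'a <div/> b <span class="x" /> c <br/> <ref name="n"/>'
for name, text in find_self_closed_tags(sample):
    print(name, text)
```

Restricting the check to an explicit tag list keeps legitimate self-closed forms such as br or the ref extension tag out of the results. Like the searches above, a regular expression of this kind can miss cases or report false positives, so each hit should be reviewed by hand.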

The following tools can help detect and fix such issues:
 * Such errors are reported by Check Wikipedia as part of error #2 (example for cswiki), in the main namespace only. Ask the Check Wikipedia project how to configure this error for a given wiki.
 * Such errors can be detected and fixed by WPCleaner. See how to configure WPCleaner for a given wiki.

Wikitext markup errors
Examples: unclosed tables, nested tables in fosterable position, etc. These are fixed up differently by Tidy and HTML5depurate. There is nothing to do in HTML5depurate. The obvious fix here is to fix up the affected templates and pages.
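
As a rough aid for spotting mismatched or unclosed HTML tags in a template's output, a simple stack-based check can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions: the class and function names are made up for this example, it works on HTML rather than wikitext table syntax, and it is stricter than HTML itself (it reports tags whose end tag is optional, such as p or li, as unclosed). It does not reproduce how Tidy, HTML5depurate or RemexHTML repair markup.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.problems = []    # human-readable problem reports

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)

def check_balance(html):
    """Return a list of problems; an empty list means the tags nest cleanly."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + ["unclosed <%s>" % t for t in checker.open_tags]


print(check_balance("<table><tr><td>x</td></tr>"))
```

Markup flagged by a check like this is where different clean-up tools are most likely to disagree, so it is a reasonable place to start reviewing a template.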

Trailing whitespace migration from inline tags
Tidy migrates trailing whitespace out of inline tags to outside the tag, but this is broken Tidy behavior. An HTML5 parser will not do this.
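
To make the behavior concrete, here is a toy Python emulation of the migration described above. It is a sketch for illustration only and is not Tidy's actual algorithm: the function name and the default list of inline tags (b, i, span) are assumptions chosen for this example.

```python
import re

def migrate_trailing_whitespace(html, tags=("b", "i", "span")):
    """Move whitespace found just before a closing inline tag to just after it.

    The tag list is a hypothetical default, not Tidy's real list.
    """
    names = "|".join(re.escape(t) for t in tags)
    # Group 1: the trailing whitespace; group 2: the closing tag.
    pattern = re.compile(r"([ \t\n]+)(</(?:%s)>)" % names, re.IGNORECASE)
    return pattern.sub(r"\2\1", html)


print(migrate_trailing_whitespace("<b>foo </b>bar"))
```

With Tidy-like behavior, the space ends up outside the b element; an HTML5 parser leaves it inside. Where the inline element carries its own styling (for example white-space rules or a background), that difference can become visible.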

List wrapping diffs because of inter-element whitespace (possibly a concern in other rendering scenarios?)
In the Tidy version, there seem to be whitespace characters between the end of one list item and the start of the next. In the HTML5depurate version, in some cases, they are not present. This seems to cause rendering differences in the wrapping of lists.

Visual diff testing
We found that a common impact of RemexHTML was to cause changes to the HTML which either don't affect the visual layout at all, or cause only minor vertical whitespace changes. In the belief that minor vertical whitespace changes would be tolerable, we wrote an image differ called UprightDiff which is able to identify vertical motion within an image, and to discount such motion for the purposes of automated testing.

We exported a subset of about 64000 articles from various Wikimedia projects, and rendered them with Tidy and with RemexHTML, then used UprightDiff to analyse the result. Current results can be seen at http://mw-expt-tests.wmflabs.org/.

See Uprightdiff numeric scoring for more details about how we assign a test score to each tested page.

P-tags wrapping newlines
P-tags wrapping newlines are present in the HTML5depurate output but are stripped by Tidy. They cause minor whitespace margin diffs and seem to be a source of a lot of noise in the visual diff output.

NOTE for editors: HTML5depurate is taking care of this automatically right now. Eventually we might remove this compatibility pass, but this won't be an issue in the initial Tidy removal rollout.