Parsing/Replacing Tidy/FAQ

What is Tidy?
Tidy is a library currently used by MediaWiki to fix some HTML errors found in wiki pages. Badly formed markup is common on wiki pages when editors use HTML tags in templates and on the page itself (for example, an opening HTML tag without its matching closing tag). In some cases, MediaWiki can generate erroneous HTML by itself.
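To make this kind of error concrete, here is a minimal sketch that reports opening tags that are never closed. It uses Python's standard html.parser module, which only tokenizes markup; it does not implement the HTML5 parsing algorithm, and it is not how Tidy or MediaWiki work. The names UnclosedTagFinder and find_unclosed are hypothetical, and the sketch also flags tags such as p or li whose end tags HTML allows to be omitted.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (the valid self-closed tags of HTML5).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
}


class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: tracks opening tags that have not been closed yet."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Void elements have no closing tag, so they are never "unclosed".
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag with the same name, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def find_unclosed(markup):
    """Return the names of opening tags in markup that have no closing tag."""
    finder = UnclosedTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.open_tags


print(find_unclosed("<div><small>note</div>"))  # ['small']
```

A real HTML5 parser goes further than detection: it follows the specification's rules to decide where the missing closing tag is implied.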

Tidy fixes these markup errors, but also does other “cleanup” on its own that is not required for correctness. For example, it removes empty elements and adds whitespace between HTML tags, which can sometimes change rendering. Since Tidy is based on HTML4 semantics and the web has moved to HTML5, it also makes some incorrect changes to HTML to 'fix' things that used to not work; for example, Tidy will unexpectedly move a bullet list out of a table caption even though that's allowed.

Why are you replacing it? And with what?
Tidy's technology is from the 1990s, when browsers weren’t standardized. Tidy’s behavior is loosely based on HTML4 semantics but matches no modern browser. After spending years without active maintenance, Tidy has now been revived as “tidy-html5” with very different behavior. The older Tidy is no longer being packaged. As noted earlier, Tidy does HTML cleanup unrelated to fixing errors. Together, all these issues have led to lots of bugs filed against it on Phabricator, and a replacement has been asked for since at least 2013.

HTML5 is the standard today, and the parsing algorithm for HTML5 is clearly specified, which has led to compatible implementations across browsers and other libraries. This algorithm also clearly specifies how broken markup should be fixed up. In this new technological landscape, Tidy should really be replaced with an HTML5 parser that fixes up the broken markup and generates valid, well-formed HTML markup in the standard way.

However, Wikimedia wikis have a huge corpus of pages whose markup relies on Tidy’s fixups. An immediate, straightforward replacement of Tidy with a third-party HTML5-based tool is not feasible, since an HTML5-based tool would repair some markup differently, and this can break how pages look.

So, we are replacing Tidy with our own tool based on the HTML5 specification, but which also adds a few Tidy-compatibility workarounds to minimize the impact of replacing Tidy. We have two separate implementations that we are testing: a PHP implementation called Balancer, and a Java service called Html5Depurate. Both can also be used in the future to enable new core MediaWiki features, such as more-robust section editing, balanced template support, and more efficient page updates after templates have been edited.

For those who are wondering, note that switching to tidy-html5 would not spare us from fixing markup errors, since some of the required cleanup comes from the change from HTML4 to HTML5 semantics. There are also change-management reasons for preferring an in-house tool, including the ability to enable the other features mentioned above.

If you're interested in other technical details, please see https://phabricator.wikimedia.org/T89331 or Replacing Tidy.

Which tests have you performed so far?
To identify the impact of replacing Tidy with an HTML5-based tool, we have used a testing strategy (with a tool called “VisualDiff”) that compares, pixel by pixel, an image of MediaWiki's rendered output with Tidy enabled against an image of the rendered output with its replacement. Early on, we found that a common difference was minor vertical whitespace changes. In the belief that these would either not be noticeable or would be tolerable, we wrote a tool called "UprightDiff" which is able to identify vertical motion within an image and to discount such motion for the purposes of automated testing. This also let us assign a numeric score to differences and readily identify the most egregious differences.
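As a rough illustration of what a pixel-by-pixel comparison with a numeric score can look like, the sketch below computes the fraction of pixels that differ between two renderings. It is a naive comparison written for this FAQ, not the algorithm used by VisualDiff or UprightDiff; in particular, it does not discount vertical motion. The function name and the representation of images as lists of rows are assumptions made for the example.

```python
def changed_pixel_fraction(before, after):
    """Return the fraction of pixel positions whose values differ.

    Each image is a list of rows, and each row is a list of pixel values.
    If the images differ in size, positions that exist in only one image
    count as changed.
    """
    height = max(len(before), len(after))
    width = max((len(row) for row in before + after), default=0)
    total = height * width
    if total == 0:
        return 0.0

    changed = 0
    for y in range(height):
        row_a = before[y] if y < len(before) else []
        row_b = after[y] if y < len(after) else []
        for x in range(width):
            a = row_a[x] if x < len(row_a) else None
            b = row_b[x] if x < len(row_b) else None
            if a != b:
                changed += 1
    return changed / total


with_tidy = [[0, 0], [0, 0]]
with_replacement = [[0, 1], [0, 0]]
print(changed_pixel_fraction(with_tidy, with_replacement))  # 0.25
```

With a score like this, a small vertical shift of a whole paragraph would register as a large difference, which is the problem that discounting vertical motion is meant to address.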

We exported a subset of about 64,000 articles (some from the recent changes stream, and the rest selected randomly) from various Wikimedia wikis (40 wikis from Wikipedia, Wikisource, Wiktionary, and Wikivoyage), rendered them with Tidy and with Html5Depurate, and then used "UprightDiff" to analyse the results. This takes a lot of CPU cycles, memory, and disk space, and one round of testing takes 2 days to complete. This limits the size of the testing corpus, but we believe 64K is a sizeable sample for figuring out the kinds of fixes necessary.

To minimize the differences and reduce the number of fixes that editors would need to make, we added some additional Tidy-compatibility fixups. After all these fixes were in place and we repeated our tests, we found that 93% of pages had no changes in rendering, and 98% of pages had pixel differences of less than 1%.

Based on these tests, we identified several classes of markup errors that will render differently between the two. For one class of markup errors (self-closing tags that aren’t valid in HTML5), we added a maintenance category that editors have already been using to fix up templates and pages. The other classes of markup errors, however, are not easy to detect automatically at this time, and editors' assistance is necessary to identify and fix them.

What will happen? When will these changes happen?
As noted earlier, we cannot yet do a drop-in replacement of Tidy with an HTML5-based tool. We have added a maintenance category for one class of markup errors, which will help editors fix them. To help editors identify and fix other markup errors, we also built a ParserMigration extension that lets them compare output in production.

By the end of December 2016, we plan to deploy the ParserMigration extension to the production cluster. We hope to assist editors as required with fixing up pages so that Tidy can actually be replaced in 2017. We will go ahead with the switch when we are satisfied that the impact on editors and readers will be minor and tolerable. However, we would also not like to drag this on indefinitely, so once the tool is available, it would be ideal if editors prioritized the template fixes, assuming this hasn't already happened by then.

Separately, the Tidy-compatibility fixups (mentioned in the previous section) are meant to be in place till we replace Tidy. After that, we will start replacing these fixups gradually while relying on similar testing and tooling support.

What will editors need to do?

 * We anticipate that certain templates will need to be updated. Test results from visual diffing have some examples of templates that need to be fixed. For one class of markup errors, we expect that the Linter extension (which surfaces information from Parsoid) will greatly help in identifying templates that might be the source of markup errors. We don’t yet know how to automatically determine all affected templates, but expect the list of templates and suggestions to become more complete as we learn about problems that were not caught before. Community Liaisons will reach out to template experts, regulars at technical village pumps, etc. to make sure they become aware of the necessary changes and of resources that may help them.
 * As mentioned above, to assist editors in migrating wikitext to the new rules, we will be deploying an extension called "ParserMigration" before the end of 2016. If you enable "ParserMigration" in your preferences, a link will be added to the toolbox of all articles, which can show the current (Tidy) and expected (Balancer) output side-by-side, and can preview article text changes in the same side-by-side view, so you can report any problems in the articles you'll test.
 * We expect to deploy the Linter extension by the end of the year as well; it will surface information from Parsoid that identifies broken wikitext usage so that other tools may intervene and fix it.
 * We are thinking about other easy, sustainable ways in which we may support "gnomes" and all the Wikimedians willing to help with this effort.

Other FAQs
Q: What happens to tags that are used in self-closed form?

A: As noted in T134423, the only valid self-closed HTML tags are: area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr. All other self-closing HTML tags should be fixed (and are already being fixed by editors at this time) to prevent unexpected rendering effects (e.g. a self-closed <b/> will be treated as an opening <b> and result in more text being bolded than intended). Non-HTML self-closing tags, such as extension tags, are not affected by this change. One tag is a special case: although it is an HTML tag, MediaWiki treats it like an extension tag, so it remains unaffected.
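For editors who want a feel for what to look for, the following sketch scans a piece of markup for HTML tags that are written in self-closed form but are not in the list of valid self-closed tags above. It is only an illustration: the regular expression is a simplification, the set HTML_TAGS_TO_CHECK is a hypothetical sample rather than a complete list of HTML tags, and this is not how the maintenance category or the Linter extension detect these tags.

```python
import re

# The valid self-closed HTML tags listed in the answer above.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
}

# Hypothetical sample of HTML tag names to check. Non-HTML tags are left
# out on purpose because they are not affected by the change.
HTML_TAGS_TO_CHECK = {
    "b", "i", "div", "span", "small", "p", "center", "font",
    "table", "tr", "td", "li",
}

# Matches tags written as <name ... />. Simplified: it does not handle
# every attribute syntax.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def find_invalid_self_closed(markup):
    """Return names of HTML tags that are self-closed but not valid in that form."""
    found = []
    for match in SELF_CLOSED.finditer(markup):
        name = match.group(1).lower()
        if name in HTML_TAGS_TO_CHECK and name not in VOID_ELEMENTS:
            found.append(name)
    return found


sample = '<b/>bold <br /> <references /> <div class="x" />'
print(find_invalid_self_closed(sample))  # ['b', 'div']
```

In the sample, br is reported as fine because it is a valid self-closed tag, and references is skipped because it is not an HTML tag.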

Q: What were the results of tests on languages other than English, or on sister projects?

A: Nothing that Tidy or Html5Depurate/Balancer does is specific to Wikipedia or English. This project is primarily about a change from HTML4 to HTML5 semantics and getting rid of some Tidy cleanups of HTML. These changes affect all projects and languages equally, except where some projects and languages tend to have more markup errors or use more self-closed tags than others.

Q: What other changes are editors likely to see, after this replacement?

A: This replacement will primarily affect readers, who may notice that a page doesn’t look right (e.g. more text in bold or small than intended). However, if anything, this might make the rendering seen in VisualEditor match the rendering seen outside it much more closely than before, since Parsoid’s output has been HTML5-compliant since the beginning and we are now moving the read output to HTML5. We do not expect any impact on VisualEditor edits, but we will promptly address any bugs reported with respect to dirty diffs. In addition, we do not plan to display any error messages or warnings on pages whose markup errors have not been fixed.

Q: How does the replacement relate to other projects you are working on?

A: By enabling the move to HTML5 semantics, this is one of the steps in evolving the markup in our corpus to keep up with web standards. We also expect to leverage this tool to support well-balanced template output. Separately, but relatedly, this will also make the output of the PHP parser (used for reads) and the output of Parsoid (used for edits in VE and Content Translation) more consistent, since Parsoid already uses HTML5 semantics. One of our goals is to make the two outputs fully consistent with each other and to use one parser for both reads and edits.

--The Parsing Team

Volunteers available to support this effort
''Community Liaisons invite interested Wikimedians to please add their name in the sections below and support their community engagement efforts. Thank you. As with similar past initiatives, signing up is optional.''
 * I am available to test with the ParserMigration tool.
 * I am available to fix templates.
 * I am available to study and discuss fixes to templates.
 * I am available to spread the word among my community.