I can add a reference to HTML5 tidy to the page.
But here are some examples showing how Tidy (even the HTML5 version) differs from how an HTML5 parser (like a browser) interprets HTML. This is based on the Ubuntu-packaged tidy that I installed via apt. tidy -v says: HTML Tidy for HTML5 for Linux version 5.2.0.
Given <p>a<span>b</p><p>c</span>d</p>, tidy-html5 generates <p>a<span>b</span></p>\n<p><span>c</span>d</p> whereas an HTML5 tree builder generates <p>a<span>b</span></p><p>cd</p>.
Given <ul><li> a </li><li> b </li></ul>, tidy-html5 generates <ul>\n<li>a</li>\n<li>b</li>\n</ul>.
So, like tidy-html4, it does 3 things: (a) it fixes up span tags differently from what the HTML5 tree builder is supposed to do, (b) it adds line breaks by pretty-printing the HTML, and (c) it trims whitespace inside tags. In HTML5, whitespace can be significant and can be styled by CSS, so you cannot arbitrarily manipulate it without impacting rendering. In fact, precisely because of this tidy-html4 behavior, we are now having to introduce a Tidy compatibility mode in wikitext.
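To make points (b) and (c) concrete, here is a minimal sketch that lists the text runs in the second example before and after Tidy. It uses Python's standard html.parser, which is only a tokenizer and not a spec-compliant HTML5 tree builder, and it assumes the \n in the Tidy output above stands for a real newline. The names TextCollector and text_runs are made up for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Records every text run exactly as it appears between tags."""

    def __init__(self):
        super().__init__()
        self.runs = []

    def handle_data(self, data):
        # Keep the raw text, including any leading or trailing whitespace.
        self.runs.append(data)


def text_runs(markup):
    """Return the list of text runs found in the given markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.runs


# Input and Tidy output taken from the <ul> example above.
original = "<ul><li> a </li><li> b </li></ul>"
tidied = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"

print(text_runs(original))  # [' a ', ' b ']
print(text_runs(tidied))    # ['\n', 'a', '\n', 'b', '\n']
```

The spaces around "a" and "b" are gone in the tidied markup, and three newline-only text runs have appeared that were not in the source. Either change can affect rendering once CSS such as white-space: pre is applied.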
We want a library that is compliant with the HTML5 tree builder spec. This gives us options as our platform evolves. We can replace one HTML5 library with another. We can switch development languages, porting from node.js to PHP, or from PHP to Rust, C, or Java, and know that the input source will continue to parse identically. With a custom HTML parser like Tidy, we are less able to do that. In fact, we had the choice of 3 or 4 different HTML5 parser libraries when we were looking for one.
In any case, there are other very strong reasons why we are going with an HTML5 parser library (while homegrown, it is spec-compliant and can be replaced with something else down the line, if we so choose). As we evolve wikitext, we want to be able to build a DOM from the raw HTML and manipulate it. Tidy doesn't give us that right now.
But thanks for flagging that there is a gap in our documentation. We'll update the page accordingly.