Topic on Talk:Parsing/Replacing Tidy

HTML5 tidy is a thing

5 comments • 16:44, 8 March 2018

Summary by SSastry (WMF)

But not the thing we are interested in.

Zearin (talk | contribs)

I just found this page about replacing tidy. My immediate response was “…Seriously?!”

Once, it was true that tidy had been unmaintained during the earlier spread of HTML5 on the web.

That hasn’t been true for a long time! Tidy has been updated with HTML5 semantics for years. Since late 2015, in fact.

Don’t believe me? See for yourself!

Tidy’s official homepage: http://www.html-tidy.org
Tidy’s official GitHub repository: https://github.com/htacg/tidy-html5

Tidy is once again a living, actively maintained software project. Let’s use it!

If, at any point, it doesn’t meet Wikipedia’s needs, open a GitHub issue. The community maintaining it is alive and well.

14:27, 8 March 2018

SSastry (WMF) (talk | contribs)

We are / have been aware of html5 tidy. See https://phabricator.wikimedia.org/T89331#1681258 and https://lists.wikimedia.org/pipermail/wikitech-l/2015-August/082845.html

At this point, we have a viable solution that does what we need. It is based on the HTML5 spec, and we have added some HTML4-Tidy compatibility passes to reduce page breakage. Eventually, we'll get rid of these compatibility passes by fixing pages instead.

In any case, the real problem with replacing HTML4-Tidy is fixing all the pages that will break with a migration to HTML5 semantics. That process is ongoing now, and we'll effect a complete switch from Tidy to Remex by mid-2018.

15:33, 8 March 2018

Zearin (talk | contribs)

If you are aware of html5 tidy, that awareness should not be confined to discussion on phabricator’s page. This article’s opening statements are outdated by over 3 years, and the rest of the article continues as if that outdated state of the software were still true today.

The appearance from the outside is that either Wikimedia users are grossly out of touch and unaware of widespread use of HTML5 tidy, or that they don’t care about misleading the reader as long as it is in pursuit of their custom software solution.

At the very least, acknowledge the fact that tidy is **not** outdated software. And ideally, explain why your custom software solution, whose premise for adoption was that tidy was outdated, remains the best choice for the Wikimedia ecosystem.

15:42, 8 March 2018

SSastry (WMF) (talk | contribs)

I can add a reference to HTML5 tidy to the page.

But here are some examples that show how Tidy (even the HTML5 version) differs from how an HTML5 parser (like a browser) interprets HTML. This is based on the Ubuntu-packaged tidy that I installed via apt. tidy -v says: HTML Tidy for HTML5 for Linux version 5.2.0.

Given abcd, tidy-html5 generates ab\ncd whereas an HTML5 tree builder generates abcd.

Given <ul><li> a </li><li> b </li></ul>, it generates <ul>\n<li>a</li>\n<li>b</li>\n</ul>.

So, like tidy-html4, it does three things: (a) it fixes up span tags incorrectly, compared with what the HTML5 tree builder is supposed to do, (b) it adds line breaks by pretty-printing the HTML, and (c) it trims whitespace inside tags. In HTML5, whitespace can be significant and can be styled by CSS, so you cannot arbitrarily manipulate it without affecting rendering. In fact, precisely because of this tidy-html4 behavior, we now have to introduce a Tidy compatibility mode in wikitext.
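
The whitespace difference in the <ul> example can be checked with a minimal Python sketch. It assumes Python's standard-library html.parser, which is only a tokenizer and not an HTML5 tree builder, and TextCollector and text_nodes are hypothetical helper names. It simply lists the text chunks found in the input and in the Tidy output quoted above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every run of text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of character data between two tags.
        self.chunks.append(data)


def text_nodes(markup):
    """Return the text chunks of `markup` in document order."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


# Input and Tidy output, as quoted in the comment above.
original = "<ul><li> a </li><li> b </li></ul>"
tidied = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"

print(text_nodes(original))  # [' a ', ' b ']
print(text_nodes(tidied))    # ['\n', 'a', '\n', 'b', '\n']
```

The two lists differ in both ways described: the spaces around a and b are gone, and newline-only text chunks appear between the tags. Under CSS such as white-space: pre, those differences would be visible in rendering.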

We want code / library that is compliant with the html5 tree builder spec. This gives us options as our platform evolves. We can replace one html5 library with another. We can switch development languages, port from node.js to PHP or PHP to Rust or to C or Java and know that the input source will continue to parse identically. With a custom html parser like tidy, we are less able to do that. In fact, we had the choice of 3 or 4 different html5 parser libraries when we were looking for one.

In any case, there are other very strong reasons why we are going with an HTML5 parser library (while homegrown, it is spec-compliant and can be replaced with something else down the line, if we so choose). As we evolve wikitext, we want to be able to build a DOM from the raw HTML and manipulate it. Tidy doesn't give us that right now.

But, thanks for flagging that there is a gap in our documentation. We'll update the page suitably.

Edited 16:29, 8 March 2018

SSastry (WMF) (talk | contribs)

Also, a colleague pointed out that we do reference tidy-html5 in our FAQ and discuss why we are using RemexHtml instead. See https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Why_are_you_replacing_it?_And_with_what?

16:44, 8 March 2018