Parsing/Notes/HTML5 Compliance

Sources of non-compliance
There are at least 3 sources of non-compliance in MediaWiki output. These affect the output of the PHP parser as well as Parsoid.
 * 1) Use of obsolete tags from HTML4 (ex: big, font)
 * 2) Use of obsolete tag attributes (ex: bgcolor)
 * 3) Violation of content model constraints (ex:   ..     or ..    )
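To get a feel for how common sources 1 and 2 are in a given chunk of output, a small scanner is enough. The following is a minimal sketch using Python's standard `html.parser`; the tag and attribute sets are illustrative subsets picked for this example (not the full lists from the HTML spec), and `find_obsolete_markup` is a hypothetical helper name, not something that exists in MediaWiki or the linter extension.

```python
from html.parser import HTMLParser

# Illustrative subsets only. The HTML spec's list of obsolete features is
# longer, and attribute obsolescence is really defined per element.
OBSOLETE_TAGS = {"big", "font", "center", "strike", "tt"}
OBSOLETE_ATTRS = {"bgcolor", "align", "valign", "cellpadding", "cellspacing"}


class ObsoleteMarkupFinder(HTMLParser):
    """Collects obsolete tags and attributes seen in start tags."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag in OBSOLETE_TAGS:
            self.findings.append(("tag", tag, None))
        for name, _value in attrs:
            if name in OBSOLETE_ATTRS:
                self.findings.append(("attr", tag, name))


def find_obsolete_markup(html):
    """Return a list of (kind, tag, attribute-or-None) tuples."""
    finder = ObsoleteMarkupFinder()
    finder.feed(html)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    sample = '<big>x</big><table><tr><td bgcolor="red">y</td></tr></table>'
    for finding in find_obsolete_markup(sample):
        print(finding)
```

Source 3 cannot be detected this way, since it depends on how elements nest rather than on which names appear.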

Fixing non-compliance
If we want to fix 1, one option would be to use the linter extension to deprecate the use of these tags and fix the editors so that they no longer provide buttons for these tags in their interfaces. For example, both WikiEditor and VisualEditor emit such tags. An alternative option is to treat these tags specially in wikitext and emit tags with classes or inline styles in their place in the sanitizer (see, for example, T154067).

If we want to fix 2, similar options exist: use the linter extension to deprecate use of these attributes and, where necessary, rewrite them to equivalent HTML5 attributes in the sanitizer. See T68413 for some related discussion.
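As a rough illustration of the sanitizer-side rewriting option for 1 and 2, the sketch below maps an obsolete tag or attribute onto an inline style. The mappings, the `rewrite_start_tag` name, and the dict-based attribute representation are all assumptions made for this example; they are not taken from T154067, T68413, or the actual sanitizer code.

```python
# Hypothetical mappings, chosen for illustration only.
TAG_REWRITES = {
    "big": ("span", "font-size: larger"),
    "center": ("div", "text-align: center"),
}
ATTR_TO_CSS = {"bgcolor": "background-color"}


def rewrite_start_tag(tag, attrs):
    """Return (new_tag, new_attrs) with obsolete markup turned into inline styles.

    `attrs` is a dict of attribute name to value. A real sanitizer would
    also have to validate values before copying them into CSS; that step
    is omitted here.
    """
    rewritten = []   # declarations derived from obsolete markup
    existing = ""    # the author's own style attribute, if any
    kept = {}
    new_tag = tag

    if tag in TAG_REWRITES:
        new_tag, css = TAG_REWRITES[tag]
        rewritten.append(css)

    for name, value in attrs.items():
        if name in ATTR_TO_CSS:
            rewritten.append(f"{ATTR_TO_CSS[name]}: {value}")
        elif name == "style":
            existing = value.strip().rstrip(";")
        else:
            kept[name] = value

    # Put the author's style last so that it still wins over the rewritten
    # declarations, as it would over the old presentational attributes.
    styles = rewritten + ([existing] if existing else [])
    if styles:
        kept["style"] = "; ".join(styles)
    return new_tag, kept


if __name__ == "__main__":
    print(rewrite_start_tag("big", {}))
    print(rewrite_start_tag("td", {"bgcolor": "red", "style": "color: blue;"}))
```

The matching end tag would need the same tag-name mapping, which is one reason this is easier to do on a token stream or DOM than on strings.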

The situation with 3 is a bit more complicated. Tidy does a better job of compliance with 3 than Parsoid or any of the proposed Tidy replacements (HTML5Depurate or RemexHTML). But Tidy is HTML4 compliant and does too much, so emulating it is not the solution.

The non-compliance in Parsoid, etc. exists because the HTML5 tree builder has to be more lenient in what it expects, and so the  operation does not guarantee that content model constraints will be enforced. The HTML5 tree builder algorithm used in parsing input strings is deliberately designed this way because of the vast number of non-compliant documents out there. So we cannot rely on the tree builder to fix up content model constraints.

If we wanted to ensure compliant output, we would have to either rely on a post-processor to fix up the output (more feasible) or never generate non-compliant output in the first place (less feasible). With Parsoid, this post-processing pass is further complicated by the fact that it has to fix up DSR offsets as well as any other private round-tripping information (much less serious going forward as we remove more and more of it) so that selective serialization continues to function properly.
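The detection half of such a post-processor can be sketched in a few lines. The example below is a deliberately simplified check, assuming well-formed, already-serialized output: it only looks at the direct parent, and the two element sets are small illustrative subsets of the HTML5 content categories. `ContentModelChecker` and `find_violations` are hypothetical names. It uses Python's `html.parser` purely as a tokenizer, not as an HTML5 tree builder, and it does not attempt the repair step or any DSR bookkeeping.

```python
from html.parser import HTMLParser

# Illustrative subsets of the HTML5 content categories.
PHRASING_ONLY_PARENTS = {"span", "small", "b", "i", "em", "strong"}
FLOW_ONLY_CHILDREN = {"div", "p", "ul", "ol", "table", "blockquote"}
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class ContentModelChecker(HTMLParser):
    """Records (parent, child) pairs where a flow-only element sits
    directly inside an element that only allows phrasing content."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return  # void elements never stay open
        if (tag in FLOW_ONLY_CHILDREN and self.stack
                and self.stack[-1] in PHRASING_ONLY_PARENTS):
            self.violations.append((self.stack[-1], tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to and including the matching open element.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_violations(html):
    checker = ContentModelChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


if __name__ == "__main__":
    print(find_violations("<span><div>x</div></span>"))
    print(find_violations("<div><span>x</span><br><p>y</p></div>"))
```

Finding the violation is the easy part. Deciding how to restructure the tree, and keeping offsets consistent afterwards, is where the real cost of the post-processing pass lies.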

Separately, Parsoid has non-standard uses of tags that won't directly validate with an HTML5 validator. However, we should verify that Parsoid's uses are compliant with the RDFa extensions to the HTML5 syntax.

Other HTML5 spec issues
Beyond this, we might want to consider other fixes to our output. For example, element ids generated in MediaWiki are HTML4 ids and have more constraints on them compared to HTML5. We could migrate to generating HTML5 ids instead but this is an involved task as well. T152540 has more details.
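To make the difference in constraints concrete, the sketch below checks a string against the two id syntaxes as we understand them from the specs: HTML4 requires a letter followed by letters, digits, hyphens, underscores, colons, or periods, while HTML5 only requires a non-empty value with no ASCII whitespace. The function names are hypothetical, and this says nothing about how MediaWiki actually encodes ids; T152540 is the place for that.

```python
import re

# HTML4 ID tokens: a letter, then letters, digits, "-", "_", ":", ".".
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

# HTML5 "ASCII whitespace": tab, line feed, form feed, carriage return, space.
ASCII_WHITESPACE = "\t\n\f\r "


def is_valid_html4_id(value):
    return HTML4_ID.fullmatch(value) is not None


def is_valid_html5_id(value):
    return len(value) > 0 and not any(ch in ASCII_WHITESPACE for ch in value)


if __name__ == "__main__":
    for candidate in ["Section_1", "1st", "Überschrift", "two words"]:
        print(repr(candidate),
              is_valid_html4_id(candidate),
              is_valid_html5_id(candidate))
```

Every valid HTML4 id is also a valid HTML5 id, so the hard part of a migration is not validity but keeping existing links to the old ids working.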

The sanitizer code in the core parser as well as Parsoid reflects HTML5 semantics only partially. At some point, it should be updated to adopt HTML5 semantics more fully (while accounting for HTML4 tags and attributes that are still in use, as indicated earlier). T145002 is the task for that.

Related discussions elsewhere

 * enwiki VP (Policies) RfC: Should deprecated/invalid/unsupported HTML tags be discouraged?
 * enwiki VP (Technical) discussion: Time to knock out obsolete HTML tags
 * T68413 has some relevant discussion about invalid attributes and about dropping or migrating them.

Pros / cons of shooting for compliance
TO BE COMPLETED.

This section is to collect arguments for / against shooting for compliance.

Compliance is a binary state. Even so, we can still discuss which parts of the HTML5 spec we want to comply with: for example, which subsets of 1 - 3 in the first section we want to shoot for, and what the pros and cons of each are.

The WHATWG FAQ says "Validity (more often referred to as document conformance in the WHATWG) is a quality assurance tool to help authors avoid mistakes. We don't make things non-conforming (invalid) for the sake of it, we use conformance as a guide for developers to help them avoid bad practices or mistakes (like typos)."

As such, having conformance as a goal allows us to use standard output validation to check the quality of the MediaWiki code base.