Parsing/Notes/Section Wrapping

TODO:
 * 1) Better terminology for the two different section notions: MediaWiki notion used by the section-edit interface, and the notion that will be represented by   tags in output
 * 2) Link up to other pages (wiki, phab, elsewhere) to make this discussion self-contained
 * 3) Any other language cleanup / tightening to clarify the problem, options, and solution space

Adding wrappers to HTML output
MediaWiki currently has a notion of a section which is used for section editing that is commonly used by editors to edit a small fragment of a page and avoid edit conflicts. This identification is regexp-based. Separately, T114072 is a proposal to add  tags to MediaWiki output. This page is a discussion about what the considerations and constraints are around adding these section tags.

Consistency requirements and why they matter
tags in output will be used by clients to either selectively load / display content of a page or to provide editing interfaces for just that part of the page. Consequently, it is important that there be some consistency in the two differing notions of sections. Here are some reasons for why this consistency is desirable. So, we need to identify a strategy based on product (editing, reading, tooling) needs:
 * 1) You want reading and editing tools to agree upon what a section is -- at least if you want editors to click on a section edit link and have the same section open up in the editor.
 * 2) You want all editing tools (visual or source) to have a fairly agreed upon notion of what a section is -- or at the very least have an agreed-upon understanding of where the inconsistencies are and why they arise.
 * 3) If one tool is going to get a list of sections from the API and another tool might process the HTML and extract sections based on examining   tags, you want some agreements on whether the two are identical and if not, how the two differ and why they differ.
 * 1) what kind of consistency requirements we are going to provide
 * 2) where the inconsistencies are going to be and why they arise
 * 3) provide a plan for what the normative / canonical / authoritative source of sections are and deprecate and remove the alternative version

Some consistency options:
There are three possibilities here: It seems that guarantee 1 is the simplest to provide right now. Given unbalanced html tags and/or tags around multiple partial sections, 2. is harder to guarantee since it is essentially a requirement that content of every MediaWiki-section has well-balanced output. But, this isn't true right now. For example, see this section-edit form in the wikitext editor. There is an unmatched  tag there in the middle of the section. Given this, it is not possible to add a  wrapper around the contents of this MediaWiki section.
 * 1) If HTML output has a   tag, there is a corresponding MediaWiki section that can be edited.
 * 2) If there is a MediaWiki section that can be edited, the HTML output for that wikitext fragment has a   tag around it.
 * 3) Both of the above are true which guarantees a 1-1 mapping between the two notions.

The expectation is that we want to gradually move MediaWiki output towards this guarantee, but we aren't there right now and don't want to block section wrapping on getting there first.

First step: A section wrapping solution that provides consistency guarantee 1
So, we could instead focus on ensuring that any -wrapping solution provides guarantee 1. With this guarantee, the set of sections that are wrapped in  tags are going to be a strict subset of the set of sections that MediaWiki knows about. This is not necessarily a problem. This just means that VE section editing or mobile notion of sections will have support for a restricted set of MediaWiki sections on pages. So, on pages with block tags around MediaWiki sections, there will be degraded functionality for reading and editing in certain clients. This doesn't introduce broken or inconsistent support for sections.

So, it seems we can provide a section wrapping solution in Parsoid for now that provides guarantee 1 above. This can be easily computed on a DOM and there is currently a library that Gabriel developed previously and which is being used by MCS for its apps. We just need to audit it to ensure that its output provides guarantee 1 above.

However, we are not going to provide this section wrapping functionality in the PHP parser right now because we cannot provide the guarantee we identified above without DOM-based processing. With RemexHTML, it is more of a possibility, but that is not something we are considering right now.

Alternative proposal (based on existing solution used by MobileContentService)
The section wrapping code in https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js was designed to satisfy the following requirements: Both of these requirements are based on product needs. The first is used to implement section loading, and in many cases to identify the lead section of a page. The second requirement aims to conform to user expectations as closely as possible, while also respecting the structural integrity of the page.
 * A page is a sequence of sections. Content before the heading is wrapped in a section as well.
 * Sections are as fine-grained and as close to MediaWiki formatting and edit behavior as possible. Headings wrapped in s or tables are picked up, and introduce a section wrapper around the overall wrapper element.

This proposal conflicts with the consistency proposal above. This needs to be resolved.

Long term: Supporting use cases for adding wrappers around multiple (partial) sections
So, why are editors adding div (or other block-tag) wrappers around partial sections or multiple sections?

I need more input on this, but it looks like this is possibly to highlight that portion of the page for action / attention. This is likely an use-case on user pages and other non-article pages like wikipedia or talk namespaces, for example where there might be a call to action or a notice or something that spans multiple sections.

Solution 1: Stronger consistency guarantee for the Article namespace only
Given this, one way to get to stronger consistency guarantees (2 and 3) is by restricting section wrapping to only the article namespace. In that namespace, we could provide the stronger guarantee by breaking some badly nested sections by independently DOM-balancing output for every section on its own. This will forcibly close unclosed div (and other "block") tags and discard stray closing tags. If we want to go down this route, we need to get a sense of usage of bad nesting in the article namespace and amenability of editors to fixing their pages that have this behaviour. The Linter extension can help precisely identify these set of pages.

Note however that this breakage will only be in Parsoid output (and hence clients that use Parsoid output - mobile, Visual Editor). But, since we are progressing towards adopting Parsoid output as the de facto output for MediaWiki, this is not a concern per se.

Solution 2: Provide other solutions for the use cases that require multi-section styling
This is a bit more longer term strategy compared to solution 1, but doing so let us provide uniform section editing / wrapping / reading semantics across the board for all namespaces and all clients. Some nascent ideas for this include relying on the templates and the upcoming template styles solution, or introduce new wikitext syntax for styling individual sections (see below).

Solution 3: Make it possible to style sections "properly"
For example (strawman alert!), one option would be to support "class" and "style" attributes on sections in the same way we do for table cells: which would generate: The  attribute could also be moved to the   tag. You would need to lint for all section titles currently containing a  character, and   the.

Another option would let the user write  tags manually in wikitext, which would then be detected and suppress normal   generation. The above example would then be written as: This is perhaps more "compatible" (no existing headings need to be 'ed, but rather encourages the use (advertent or inadvertent) of explicit  s which don't line up with headings: It's probably best to avoid candy machine interfaces of this sort which make it easy to generate bad results.