Parsing/Notes/Section Wrapping

TODO:
 * 1) Better terminology for the two different section notions: the MediaWiki notion used by the section-edit interface, and the notion that will be represented by <section> tags in output
 * 2) Link up to other pages (wiki, phab, elsewhere) to make this discussion self-contained
 * 3) Any other language cleanup / tightening to clarify the problem, options, and solution space

Adding wrappers to HTML output
MediaWiki currently has a notion of a section that is used for section editing: editors commonly use it to edit a small fragment of a page and avoid edit conflicts. This identification is regexp-based. Separately, T114072 is a proposal to add <section> tags to MediaWiki output. This page discusses the considerations and constraints around adding these section tags.

Flat or nested tags
Sections are semantically nested; that is, a lower-level heading (for example === in wikitext or <h3> in HTML) is logically "inside" a previous higher-level (== / <h2>) section. It is logical to represent these as nested <section> tags, but for completeness briefly consider the alternative: in a "flat" representation the document consists of a flat list of <section> tags, all of which are siblings in the DOM tree. This is similar to how sections are internally represented in PHP core after the input wikitext is split by a heading regexp, and has a direct relationship to the section numbers used by the PHP APIs. A variant is used by the current MobileContentService output: the returned JSON lists sections as a flat array, but each element of the array contains a numeric field giving the depth it would have in a nested tree.
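The relationship between the flat-with-depth variant and the nested representation can be sketched as follows. This is only an illustration: the Section type, the nest function, and the (level, title) input shape are hypothetical and are not the MobileContentService format or PHP core's internal structure. Numbering starts at 1 here on the assumption that 0 is reserved for content before the first heading.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    number: int   # position in the flat list, starting at 1
    level: int    # heading depth, e.g. 2 for == and 3 for ===
    title: str
    children: List["Section"] = field(default_factory=list)


def nest(flat: List[Tuple[int, str]]) -> List[Section]:
    """Turn (level, title) pairs in document order into a nested forest."""
    roots: List[Section] = []
    stack: List[Section] = []  # chain of currently open ancestors
    for number, (level, title) in enumerate(flat, start=1):
        node = Section(number, level, title)
        # A heading closes every open section at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


if __name__ == "__main__":
    tree = nest([(2, "History"), (3, "Early"), (3, "Late"), (2, "Usage")])
    for root in tree:
        print(root.number, root.title, [c.title for c in root.children])
```

Because the flat number is kept on every node, a client can move between the two views without renumbering.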

In the remainder of this document we will assume nested <section> tags.

Consistency requirements and why they matter
<section> tags in output will be used by clients either to selectively load / display content of a page or to provide editing interfaces for just that part of the page. Consequently, it is important that the two differing notions of sections be consistent. Here are some reasons why this consistency is desirable. Note that a particular section definition and numbering are currently baked into the HTML output of PHP MediaWiki, in the form of section edit links. These section numbers are present in cached HTML, and so changing them is relatively expensive.
 * 1) You want reading and editing tools to agree upon what a section is -- at least if you want editors to click on a section edit link and have the same section open up in the editor, and if you save a section you want to be sure that mediawiki replaces the correct section of wikitext. MediaWiki does this on the basis of section numbers so a mismatched section number or different section bounds can lead to page corruption on save.
 * 2) You want all editing tools (visual or source) to have a fairly agreed upon notion of what a section is -- or at the very least have an agreed-upon understanding of where the inconsistencies are and why they arise.
 * 3) If one tool is going to get a list of sections from the API and another tool might process the HTML and extract sections based on examining <section> tags, you want some agreement on whether the two are identical and, if not, how the two differ and why they differ.

So, we need to identify a strategy based on product (editing, reading, tooling) needs:
 * 1) what kind of consistency requirements we are going to provide
 * 2) where the inconsistencies are going to be and why they arise
 * 3) what the normative / canonical / authoritative source of sections is, with a plan to deprecate and remove the alternative version

Some consistency options
There are three possibilities here:
 * 1) If HTML output has a <section> tag, there is a corresponding MediaWiki section that can be edited.
 * 2) If there is a MediaWiki section that can be edited, the HTML output for that wikitext fragment has a <section> tag around it.
 * 3) Both of the above are true, which guarantees a 1-1 mapping between the two notions.

It seems that guarantee 1 is the simplest to provide right now (but see the discussion of pseudo-sections below). Given unbalanced HTML tags and/or tags around multiple partial sections, guarantee 2 is harder to provide, since it essentially requires that the content of every MediaWiki section has well-balanced output. But this isn't true right now. For example, see this section-edit form in the wikitext editor. There is an unmatched tag there in the middle of the section. Given this, it is not possible to add a <section> wrapper around the contents of this MediaWiki section. (Note that MediaWiki's section handling refers to "DOM", but this means the tree-like structure emitted by the preprocessor, not a proper HTML5 DOM.)

The expectation is that we want to gradually move MediaWiki output towards guarantee 3 (using our new tidy replacement to fix up unbalanced tags in sections), but we aren't there right now and don't want to block section wrapping on getting there first.

Plan of Record: Implementation proposal
Some of the difficulties discussed earlier are due to a conflation of what a section is in rendered output with whether it is editable. One way out of this is to separate these notions explicitly so that not all display sections need to be editable (but all editable sections should have a display component to them). This can be done by adding an attribute that indicates whether or not a section is editable via the section edit interface. Non-editable display sections arise when <div> and other such tags wrap multiple or partial sections, and also when sections are generated by templates, for example, as in this revision. In both these scenarios, the individual sections cannot be edited directly.

For reading and display purposes, mobile clients might still want to show them in the Table of Contents and collapse individual sections. This should be done with caution, however, since non-editable sections by definition do not correspond to a consistently-defined wikitext region. If non-editable pseudo-sections are allowed (see below), then they must be suppressed in the ToC. In all cases, collapsing non-editable sections is likely to expose the fact that the <section> does not match the wikitext section, resulting in more or less content folded than the user expects. It ought to be fine to use all sections for incremental loading.

We will focus on ensuring that any <section>-wrapping solution provides guarantee 1 only for editable sections. With this guarantee, the set of sections that are wrapped in <section> tags and marked as editable is going to be a strict subset of the set of sections that MediaWiki knows about. This is not necessarily a problem. It just means that VE section editing or the mobile notion of sections will support a restricted set of MediaWiki sections on pages. So, on pages with block tags around MediaWiki sections, there will be degraded functionality for reading and editing in certain clients. This doesn't introduce broken or inconsistent support for sections.

In order to provide this guarantee for editable sections, we are going to identify sections where Parsoid wrapping matches MediaWiki's notion of what that section is. Using Parsoid's DOM-to-wikitext mapping, it is relatively straightforward to verify this. But, because Parsoid will potentially omit some <section> tags that do not map directly to PHP sections, there is not a 1-to-1 correspondence of section numbering. That is, the "section number 2" baked into PHP's HTML edit links won't correspond to the 2nd <section> tag in depth-first preorder emitted by Parsoid -- in fact, Parsoid might not emit any <section> tag at all for a particular PHP section. In order to provide compatibility, we will add an attribute to every <section> tag we emit identifying which PHP section number it corresponds to. This attribute implicitly identifies editable sections. So, any <section> without this attribute should be considered uneditable via the section edit interface. (In practice uneditable sections are marked explicitly rather than by omitting the attribute, so that Parsoid's sections are still distinguishable from other internal uses of <section>, such as by extensions.)
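The matching step can be sketched roughly as below. This is a simplified illustration, not Parsoid's implementation: it assumes heading offsets and wrapper source ranges are already known (in Parsoid these would come from its DOM-to-wikitext mapping), it assumes a PHP section runs from its heading up to the next heading of the same or a higher level, and the names php_section_ranges, assign_section_ids, and the UNEDITABLE value of -1 are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (start, end) character offsets in the wikitext

UNEDITABLE = -1  # hypothetical marker for wrappers with no matching PHP section


def php_section_ranges(headings: List[Tuple[int, int]],
                       text_length: int) -> Dict[int, Range]:
    """Map PHP-style section numbers to wikitext ranges.

    headings holds (offset, level) pairs in document order. Section 0 is
    the lead; section N starts at the Nth heading and ends at the next
    heading whose level is the same or higher (numerically <=).
    """
    first = headings[0][0] if headings else text_length
    ranges: Dict[int, Range] = {0: (0, first)}
    for i, (start, level) in enumerate(headings):
        end = text_length
        for later_start, later_level in headings[i + 1:]:
            if later_level <= level:
                end = later_start
                break
        ranges[i + 1] = (start, end)
    return ranges


def assign_section_ids(wrapper_ranges: List[Range],
                       php_ranges: Dict[int, Range]) -> List[int]:
    """Give each wrapper the PHP section number whose range it matches exactly."""
    by_range = {rng: number for number, rng in php_ranges.items()}
    return [by_range.get(rng, UNEDITABLE) for rng in wrapper_ranges]


if __name__ == "__main__":
    php = php_section_ranges([(10, 2), (40, 2)], text_length=90)
    # The last wrapper ends early, so it only covers a prefix of section 2.
    print(assign_section_ids([(0, 10), (10, 40), (40, 70)], php))
```

Requiring an exact range match is the conservative choice: a wrapper that covers only part of a PHP section is never handed a section number that a save could act on.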

By convention we always align the start of each <section> with its wikitext heading, and any violation of guarantee 2 causes the end of the <section> to diverge from the end of the wikitext section. Other alternatives are possible (you could grow both boundaries as needed; you could always align the end and let the start mismatch where necessary).

Note that PHP uses "section 0" to refer to the lead section. This is consistent with the product need for MCS.

Support in the PHP Parser
We are not going to provide this section wrapping functionality in the PHP parser right now because we cannot provide the guarantee we identified above without DOM-based processing. With RemexHTML the current plan is that PHP sections will (eventually) be balanced by default (see below for description of impacts on styling), which will allow us to then provide consistency guarantee 3 (at least for non-template-affected sections).

Section wrapping algorithm currently used by MobileContentService
The section wrapping code in https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js was designed to satisfy the following requirements:
 * A page is a sequence of sections (which can include other nested sections). Content before the first heading is wrapped in a section as well.
 * Sections are as fine-grained and as close to MediaWiki formatting and edit behavior as possible. Headings wrapped in <div>s or tables are picked up, and introduce a section wrapper around the overall wrapper element.

Both of these requirements are based on product needs. The first is used to implement section loading, and in many cases to identify the lead section of a page. The second requirement aims to conform to user expectations as closely as possible, while also respecting the structural integrity of the page.

This proposal conflicts with the consistency proposal above. Using these <section>s with PHP APIs can easily introduce corruption (wrong wikitext bounds replaced by new content) or confusion (wrong bounds displayed to reader). They are also not interoperable with cached PHP HTML output, and the DOM walk can introduce sections which include a large amount of content before and after a heading and thus don't match users' expectations for section bounds, especially when there is an unintentional unclosed tag in the wikitext.

Pseudo-sections
It would be desirable to ensure that the number of heading tags is exactly equal to the number of <section> tags in the output (plus one for the lead content in "section 0"). Issues with wikitext alignment can cause the production of uneditable sections, but can we avoid splitting <section> tags and creating more sections than there are headings? (We will call these "extra" sections "pseudo-sections".)

When multiple sections are generated by templates, we can use our existing techniques to grow the template-affected region as needed to avoid splitting sections. But explicit HTML markup which conflicts with section nesting causes other problems. Given the wikitext: It seems to be impossible to create a  tag created for   which contains both   and , and the lead   containing   and   seems like it would necessarily have to include the sections for   and   as well. An alternative would be to break the  after   and either orphan   or invent a new pseudo-section for   which does not correspond to any heading in the output. Similarly, we might want to break the lead section at the, orphaning   or putting it in a pseudo-section. If the  is split, the pseudo-sections should not be used for folding and they should be suppressed from the Table of Contents. Note that in a nested section representation orphans can end up in a parent section. Using  to indicate pseudo-sections, the above wikitext would generate:

Note that uneditable sections only represent a prefix of a wikitext section, and pseudo-sections may contain suffixes of multiple wikitext sections, as in the example above. Only pseudo-sections need to be suppressed from a table of contents, since uneditable sections still contain some prefix of the wikitext section. If extracting or collapsing section contents, only editable sections will have boundaries consistent with the wikitext author's intent; collapsing an uneditable section will leave a suffix of the section visible, and collapsing a pseudo-section will hide parts of multiple sections.
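A client could apply these rules along the following lines. This is a sketch under stated assumptions rather than an existing client: the attribute name data-mw-section-id and the values -1 (uneditable) and -2 (pseudo-section) are assumed for illustration and are not given in the text above, and collect_sections, toc_titles, and editable_ids are hypothetical helpers. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

SECTION_ATTR = "data-mw-section-id"  # assumed attribute name
UNEDITABLE, PSEUDO = -1, -2          # assumed marker values
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionCollector(HTMLParser):
    """Collect <section> wrappers in document (preorder) order."""

    def __init__(self):
        super().__init__()
        self.sections = []        # one dict per <section> seen
        self._open = []           # stack of currently open sections
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            raw = dict(attrs).get(SECTION_ATTR)
            # A wrapper without the attribute is treated as uneditable.
            sec = {"id": int(raw) if raw is not None else UNEDITABLE,
                   "title": None}
            self.sections.append(sec)
            self._open.append(sec)
        elif tag in HEADINGS and self._open and self._open[-1]["title"] is None:
            # The first heading inside a section supplies its title.
            self._in_heading = True
            self._open[-1]["title"] = ""

    def handle_endtag(self, tag):
        if tag == "section" and self._open:
            self._open.pop()
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and self._open:
            self._open[-1]["title"] += data


def collect_sections(html):
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


def toc_titles(sections):
    """Titles for a table of contents; pseudo-sections are suppressed."""
    return [s["title"] for s in sections if s["id"] != PSEUDO and s["title"]]


def editable_ids(sections):
    """Section numbers that are safe to use for extraction or collapsing."""
    return [s["id"] for s in sections if s["id"] >= 0]
```

Uneditable sections stay in the table of contents because they still begin at a real heading; only the collapse and extract paths need to restrict themselves to editable ids.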

Long-term, closing open HTML tags at section boundaries (and growing template-affected regions where necessary) should avoid the need for uneditable and pseudo-sections.

Long term: Supporting use cases for adding wrappers around multiple (partial) sections
So, why are editors adding div (or other block-tag) wrappers around partial sections or multiple sections?

Usually it appears to be in order to style a section, for example to highlight that portion of the page for action / attention. This is likely a use-case on user pages and other non-article pages like wikipedia or talk namespaces, for example where there might be a call to action or a notice or something that spans multiple sections.

Solution 1: Stronger consistency guarantee for the Article namespace only
Given this, one way to get to stronger consistency guarantees (2 and 3) is by restricting section wrapping to only the article namespace. In that namespace, we could provide the stronger guarantee by breaking some badly nested sections by independently DOM-balancing output for every section on its own. This will forcibly close unclosed <div> (and other "block") tags and discard stray closing tags. If we want to go down this route, we need to get a sense of usage of bad nesting in the article namespace and amenability of editors to fixing their pages that have this behaviour. The Linter extension can help precisely identify this set of pages. Other namespaces will have only a single <section> tag around the entire page.
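The effect of balancing each section independently can be illustrated with a deliberately simplified sketch. This is not how Parsoid or RemexHTML balance markup (a real implementation follows the HTML5 tree-building rules); it is a regex-and-stack approximation that assumes no ">" characters inside attribute values, and balance_fragment is a hypothetical name.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
VOID = {"br", "hr", "img", "input", "link", "meta", "wbr"}  # never need closing


def balance_fragment(html: str) -> str:
    """Close unclosed tags and drop stray closing tags in one section's HTML."""
    out = []
    open_tags = []  # names of currently open elements, innermost last
    pos = 0
    for match in TAG.finditer(html):
        out.append(html[pos:match.start()])  # text between tags is kept as-is
        pos = match.end()
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in VOID:
            if not closing:
                out.append(match.group(0))
        elif not closing:
            open_tags.append(name)
            out.append(match.group(0))
        elif name in open_tags:
            # Close anything opened after the matching tag, then the tag itself.
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise this is a stray closing tag and is discarded.
    out.append(html[pos:])
    # Forcibly close whatever is still open at the section boundary.
    out.extend(f"</{name}>" for name in reversed(open_tags))
    return "".join(out)


if __name__ == "__main__":
    print(balance_fragment("<div><p>notice"))
    print(balance_fragment("rest of notice</div>"))
```

Running it on the two halves of a <div> that spans a section boundary shows the breakage described above: the first half is closed early and the second half loses its closing tag, so styling that was meant to span both sections applies only to the first.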

Note however that this breakage will only be in Parsoid output (and hence clients that use Parsoid output - mobile, Visual Editor). But, since we are progressing towards adopting Parsoid output as the de facto output for MediaWiki, this is not a concern per se.

This would also break section editing and navigation for non-article pages.

Solution 2: Make it possible to style sections "properly"
For example (strawman alert!), one option would be to support "class" and "style" attributes on sections in the same way we do for table cells: which would generate: The  attribute could also be moved to the   tag. You would need to lint for all section titles currently containing a  character, and   the.

Another option would let the user write <section> tags manually in wikitext, which would then be detected and suppress normal <section> generation. The above example would then be written as: This is perhaps more "compatible" (no existing headings need to be escaped), but rather encourages the use (advertent or inadvertent) of explicit <section>s which don't line up with headings: It's probably best to avoid candy machine interfaces of this sort which make it easy to generate bad results.

Plan of record: Provide other solutions for the use cases that require multi-section styling
The template styles system has just been rolled out on wikipedia. The long-term plan is to deprecate and eventually remove support for inline styles in wikitext, replacing it with template styles. Styled sections would thus be generated from a template, either inside the template body or as an argument (for example a heredoc argument). Section balancing will then be treated in a similar manner as template balancing, although we'd need to balance both sections and templates for similar reasons. The DOM tree will include non-section elements between <section>s; that is, it won't be the case that all <section> tags will be direct children of either a <section> or <body> tag. This may interact poorly with the "list of sections" assumption made by MobileContentService.

NOTE: consider a template which contains two tags. If we need to emit a wrapper around the template contents, our choices are currently or. We may wish to consider emitting a as a wrapper as well, in order to allow the "parent of a is always either or " property of MCS to hold.