Parsing/Notes/Section Wrapping

Adding <section> wrappers to HTML output

MediaWiki currently has a notion of a section that is used for section editing: editors commonly use it to edit a small fragment of a page and to avoid edit conflicts. These sections are identified with a regular expression. Separately, T114072 proposes adding <section> tags to MediaWiki output. This page discusses the considerations and constraints around adding these section tags.

Flat or nested tags

Sections are semantically nested; that is, a heading which is === in wikitext or <h3> in HTML is logically "inside" the preceding ==/<h2> section. It is logical to represent these as nested <section> tags, but for completeness briefly consider the alternative: in a "flat" representation the document consists of a flat list of <section> tags, all of which are siblings in the DOM tree. This is similar to how sections are internally represented in PHP core after the input wikitext is split by a heading regexp, and has a direct relationship to the section numbers used by the PHP APIs. A variant is used by the current MobileContentService output: the returned JSON lists sections as a flat array, but each element of the array contains a numeric field giving the depth it would have in a nested tree.
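
The relationship between the two representations can be sketched in a few lines of Python. This is only an illustration: the Section class, the nest function and the (index, depth) input format are hypothetical, and are not the actual MobileContentService JSON schema. Depth 0 is assumed to mark the lead section.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the nested tree (hypothetical structure)."""
    index: int
    depth: int
    children: list[Section] = field(default_factory=list)


def nest(flat: list[tuple[int, int]]) -> list[Section]:
    """Turn (index, depth) pairs, given in document order, into a tree."""
    roots: list[Section] = []
    stack: list[Section] = []  # currently open sections, outermost first
    for index, depth in flat:
        node = Section(index, depth)
        if depth == 0:
            # Assumption: depth 0 is the lead section, a sibling of the
            # top-level sections and never a parent.
            roots.append(node)
            stack.clear()
            continue
        # A heading closes every open section at the same or a deeper level.
        while stack and stack[-1].depth >= depth:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


# Flat list for lead content followed by the headings
# =1=, ==1.1==, ===1.1.1===, ===1.1.2===, =2=
flat = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 3), (5, 1)]
tree = nest(flat)
```

The flat index survives as a field on each node, so a client holding the nested form can still address sections by number.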

In the remainder of this document we will assume nested <section> tags.

Consistency requirements and why they matter

<section> tags in output will be used by clients either to selectively load or display content of a page, or to provide editing interfaces for just that part of the page. Consequently, it is important that the two notions of a section (the one used for section editing and the one expressed by <section> tags) be consistent with each other. Here are some reasons why this consistency is desirable.

  1. You want reading and editing tools to agree upon what a section is, at least if you want editors to click on a section edit link and have the same section open up in the editor. If you save a section, you also want to be sure that MediaWiki replaces the correct section of wikitext. MediaWiki does this on the basis of section numbers, so a mismatched section number or different section bounds can lead to page corruption on save.
  2. You want all editing tools (visual or source) to have a fairly agreed upon notion of what a section is -- or at the very least have an agreed-upon understanding of where the inconsistencies are and why they arise.
  3. If one tool is going to get a list of sections from the API and another tool might process the HTML and extract sections by examining <section> tags, you want some agreement on whether the two are identical and, if not, how and why they differ.
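
To make the corruption risk in point 1 concrete, here is a deliberately simplified Python sketch of number-based section replacement. It is not MediaWiki's implementation: the HEADING pattern, split_sections and replace_section are hypothetical names, and the sketch ignores templates, comments, <nowiki> and every other complication. It assumes that a numbered section runs up to the next heading of the same or a higher rank, so that it includes its subsections.

```python
import re

# Simplified heading pattern: a whole line of the form "== Title ==".
HEADING = re.compile(r"^(={1,6})[^\n]*?\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Return (level, start, end) spans; entry 0 is the lead section."""
    starts = [(0, 0)] + [
        (len(m.group(1)), m.start()) for m in HEADING.finditer(wikitext)
    ]
    spans = []
    for i, (level, start) in enumerate(starts):
        end = len(wikitext)
        for next_level, next_start in starts[i + 1:]:
            # The lead section ends at the first heading; any other section
            # ends at the next heading of the same or a higher rank.
            if i == 0 or next_level <= level:
                end = next_start
                break
        spans.append((level, start, end))
    return spans


def replace_section(wikitext, number, new_text):
    """Replace section `number` (0 = lead) with `new_text`."""
    _, start, end = split_sections(wikitext)[number]
    return wikitext[:start] + new_text + wikitext[end:]


page = "a\n=1=\nb\n==1.1==\nc\n=2=\nd\n"
saved = replace_section(page, 2, "==1.1==\nX\n")
```

Because the save is addressed purely by number, a client that believes ==1.1== is section 1 rather than section 2 would overwrite all of =1=, including b and the nested subsection.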

Note that a particular section definition and numbering are currently baked into the HTML output of MediaWiki's PHP parser, in the form of section edit links. These section numbers are present in cached HTML, and so changing them is relatively expensive.

So, we need to identify a strategy based on product (editing, reading, tooling) needs:

  1. what kind of consistency guarantees we are going to provide
  2. where the inconsistencies are going to be and why they arise
  3. which source of sections is normative / canonical / authoritative, with a plan to deprecate and remove the alternative version

Some consistency options

There are three possibilities here:

  1. If HTML output has a <section> tag, there is a corresponding MediaWiki section that can be edited.
  2. If there is a MediaWiki section that can be edited, the HTML output for that wikitext fragment has a <section> tag around it.
  3. Both of the above are true, which guarantees a 1-1 mapping between the two notions.

It seems that guarantee 1 is the simplest to provide right now (but see the discussion of pseudo-sections below). Given unbalanced HTML tags and/or <div> tags around multiple partial sections, guarantee 2 is harder to provide, since it essentially requires the content of every MediaWiki section to produce well-balanced output. That is not true right now. For example, see this section-edit form in the wikitext editor: there is an unmatched </div> tag in the middle of the section, so it is not possible to add a <section> wrapper around the contents of this MediaWiki section. (Note that MediaWiki's section handling refers to a "DOM", but this means the tree-like structure emitted by the preprocessor, not a proper HTML5 DOM.)
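
The "well-balanced output" requirement can be illustrated with a strict balance check built on Python's standard html.parser module. This is a sketch under stated assumptions, not what MediaWiki or Parsoid does: BalanceChecker and is_balanced are hypothetical names, and the check ignores HTML5's implied end tags (for example, a <p> that is closed implicitly by the next <p> is reported as unbalanced).

```python
from html.parser import HTMLParser

# HTML void elements never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BalanceChecker(HTMLParser):
    """Tracks open tags and records closing tags that match nothing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.stray_closers = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # covers self-closing forms such as <br/>
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.stray_closers.append(tag)


def is_balanced(fragment):
    """True if every tag opened in `fragment` is closed in it, in order."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return not checker.open_tags and not checker.stray_closers
```

A section whose rendered fragment fails this kind of check, such as one containing a stray </div>, cannot be wrapped in a <section> without changing where the surrounding <div> ends.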

The expectation is that we want to gradually move MediaWiki output towards guarantee 3 (using our new tidy replacement to fix up unbalanced tags in sections), but we aren't there right now and don't want to block section wrapping on getting there first.

Plan of Record: Implementation proposal

Some of the difficulties discussed earlier are due to a conflation of what a section is in rendered output with whether it is editable. One way out of this is to separate these notions explicitly so that not all display sections need to be editable (but all editable sections should have a display component to them). This can be done by adding an attribute that indicates whether a section is editable via the section edit interface or not. This scenario shows up when we have <div> and other such tags wrapping multiple or partial sections. This also is the case when sections are generated by templates, for example, as in this revision. In both these scenarios, the individual sections cannot be edited directly.

For reading and display purposes, mobile clients might still want to show them in Table of Contents and collapse individual sections. This should be done with caution, however, since non-editable sections by definition do not correspond to a consistently-defined wikitext region. If non-editable pseudo-sections are allowed (see below), then they must be suppressed in the ToC. In all cases, collapsing non-editable sections is likely to expose the fact that the <section> does not match the wikitext section, resulting in more or less content folded than the user expects. It ought to be fine to use all sections for incremental loading.

We will focus on ensuring that any <section>-wrapping solution provides guarantee 1 only for editable sections. With this guarantee, the set of sections that are wrapped in <section> tags and marked as editable is going to be a subset of the set of sections that MediaWiki knows about. This is not necessarily a problem. It just means that VE section editing or the mobile notion of sections will support a restricted set of MediaWiki sections on pages. So, on pages with block tags around MediaWiki sections, there will be degraded functionality for reading and editing in certain clients. This doesn't introduce broken or inconsistent support for sections.

In order to provide this guarantee for editable sections, we are going to identify sections where Parsoid wrapping matches MediaWiki's notion of what that section is. Using Parsoid's DOM-to-wikitext mapping, it is relatively straightforward to verify this. But, because Parsoid will potentially omit some <section> tags that do not map directly to PHP sections, there is not a 1-to-1 correspondence of section numbering. That is, the "section number 2" baked into PHP's HTML edit links won't correspond to the 2nd <section> tag in depth-first preorder emitted by Parsoid -- in fact, Parsoid might not emit any <section> tag at all for a particular PHP section. In order to provide compatibility, we will add a data-mw-section-id attribute to every <section> tag we emit, identifying which PHP section number it corresponds to. This attribute identifies editable sections: any <section> that does not carry a PHP section number should be considered uneditable via the section edit interface. (In practice we mark uneditable sections with data-mw-section-id="-1" rather than omitting the attribute, so that Parsoid's sections remain distinguishable from other uses of <section>, such as those by the ProofreadPage extension.)
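
A client could recover the mapping between PHP section numbers and <section> positions along the following lines. This is a minimal sketch using Python's standard html.parser; SectionCollector and editable_sections are hypothetical names, and the sample markup is a stripped-down version of the template example further down this page.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collects data-mw-section-id values in depth-first preorder."""

    def __init__(self):
        super().__init__()
        self.section_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return
        value = dict(attrs).get("data-mw-section-id")
        # A <section> without the attribute is not one of Parsoid's
        # section wrappers (e.g. it may come from an extension); skip it.
        if value is not None:
            self.section_ids.append(int(value))


def editable_sections(html):
    """Map PHP section number -> preorder position of its <section>."""
    collector = SectionCollector()
    collector.feed(html)
    collector.close()
    return {
        section_id: position
        for position, section_id in enumerate(collector.section_ids)
        if section_id >= 0  # negative ids mark uneditable sections
    }


html = """
<section data-mw-section-id="1"><h1>1</h1>
  <section data-mw-section-id="-1"><h2>1.1</h2></section>
</section>
<section data-mw-section-id="-1"><h1>2</h1>
  <section data-mw-section-id="4"><h2>2.1</h2></section>
</section>
"""
mapping = editable_sections(html)
```

Here PHP section 4 is the fourth <section> in preorder only by coincidence of the sample; the attribute, not the position, is what a client should use when calling the section edit interface.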

By convention we always align the start of the <section> tree with the wikitext heading, and any violation of guarantee 2 causes the end of the <section> to diverge from the end of the wikitext section. Other alternatives are possible (you could grow both boundaries as needed; you could always align the end and let the start mismatch where necessary).

Note that PHP uses "section 0" to refer to the lead section. This is consistent with the product need for Mobile Content Service (MCS).

Examples

Example 1, wikitext:

a
=1=
b
==1.1==
c
===1.1.1===
d
===1.1.2===
e
=2=
f

Example 1, HTML:

<section data-mw-section-id="0"><p>a</p></section>
<section data-mw-section-id="1">
    <h1>1</h1>
    <p>b</p>
    <section data-mw-section-id="2">
        <h2>1.1</h2>
        <p>c</p>
        <section data-mw-section-id="3">
            <h3>1.1.1</h3>
            <p>d</p>
        </section>
        <section data-mw-section-id="4">
            <h3>1.1.2</h3>
            <p>e</p>
        </section>
    </section>
</section>
<section data-mw-section-id="5">
    <h1>2</h1>
    <p>f</p>
</section>

Example 2, wikitext:

=1=
b
{{1x|1=
==1.1==
c
=2=
d
}}
==2.1==
e

Example 2, HTML:

<section data-mw-section-id="1" about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":["=1=\nb\n", --data-mw from <h2> below-- ,"\n==2.1==\ne\n"]}'>
    <h1 id="1">1</h1>
    <p>b</p>
    <section data-mw-section-id="-1">
        <h2 about="#mwt1" typeof="mw:Transclusion" id="1.1" data-mw='..'>1.1</h2>
        <p about="#mwt1">c</p>
    </section>
</section>
<section data-mw-section-id="-1" about="#mwt1">
    <h1 about="#mwt1" id="2">2</h1>
    <p about="#mwt1">d</p>
    <section data-mw-section-id="4">
        <h2 id="2.1">2.1</h2>
        <p>e</p>
    </section>
</section>

Notes:

  • The template output is slightly simplified for ease of understanding
  • Since the template is generating sections that are nested in different subtrees, with section-wrapping, the original template's content is no longer contiguous children of a common parent. To preserve template continuity semantics, in this scenario, we add an additional template wrapping layer at the section level. This guarantees that clients that analyze the page with section tags get well-formed DOM structures for template-affected content. But, if clients (like VisualEditor) decide to strip the section wrappers, the original template wrapping layer is exposed which guarantees the pre-section-wrapping template-continuity semantics.

Pseudo-sections

It would be desirable to ensure that the number of <h_> heading tags is exactly equal to the number of <section> tags in the output (plus one for the lead content in "section 0"). Issues with wikitext alignment can cause the production of uneditable sections, but can we avoid splitting <section> tags and creating more sections than there are headings? (We will call these "extra" sections "pseudo-sections".)

When multiple sections are generated by templates, we can use our existing techniques to grow the template-affected region as needed to avoid splitting sections. But explicit HTML markup which conflicts with section nesting causes other problems. Given the wikitext:

a
<div>
b
= 1 =
c
= 2 =
d
</div>
e
= 3 =

It seems to be impossible to create a <section> tag for = 2 = which contains both d and e, and the lead <section> containing a and b seems like it would necessarily have to include the sections for = 1 = and = 2 = as well. An alternative would be to break the <section> after d and either orphan e or invent a new pseudo-section for e which does not correspond to any heading in the output. Similarly, we might want to break the lead section at the <div>, orphaning b or putting it in a pseudo-section. If the <section> is split, the pseudo-sections should not be used for folding and they should be suppressed from the Table of Contents. Note that in a nested section representation orphans can end up in a parent section. Using data-mw-section-id="-2" to indicate pseudo-sections, the above wikitext would generate:

<section data-mw-section-id="-1"><!-- lead section, uneditable -->
 <p>a</p>
</section>
<section data-mw-section-id="-2"><!-- pseudo section -->
 <div>
  <p>b</p>
  <section data-mw-section-id="1"><h1>1</h1><!-- editable section -->
    <p>c</p>
  </section>
  <section data-mw-section-id="-1"><h1>2</h1><!-- uneditable -->
   <p>d</p>
  </section>
 </div>
 <p>e</p>
</section>
<section data-mw-section-id="3"><h1>3</h1><!-- editable section -->
 ...
</section>

Note that uneditable sections only represent a prefix of a wikitext section, and pseudo-sections may contain suffixes of multiple wikitext sections, as in the example above. Only pseudo-sections need to be suppressed from a table of contents, since uneditable sections still contain some prefix of the wikitext section. If extracting or collapsing section contents, only editable sections will have boundaries consistent with the wikitext author's intent; collapsing an uneditable section will leave a suffix of the section visible, and collapsing a pseudo-section will hide parts of multiple sections.
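
The rules in the paragraph above can be summarized in a short Python sketch. The function names and the (section id, heading text) input format are hypothetical; only the id conventions (non-negative for editable, -1 for uneditable, -2 for pseudo-sections) come from this page.

```python
EDITABLE, UNEDITABLE, PSEUDO = "editable", "uneditable", "pseudo"


def classify(section_id):
    """Classify a data-mw-section-id value using the conventions above."""
    if section_id >= 0:
        return EDITABLE
    if section_id == -1:
        return UNEDITABLE
    if section_id == -2:
        return PSEUDO
    raise ValueError(f"unexpected section id: {section_id}")


def toc_entries(sections):
    """Headings to list in a table of contents: everything but pseudo-sections.

    `sections` is a list of (section_id, heading_text_or_None) pairs in
    document order; sections without a heading contribute no entry.
    """
    return [heading for section_id, heading in sections
            if heading is not None and classify(section_id) != PSEUDO]


def safely_collapsible(sections):
    """Headings whose <section> bounds match the wikitext section."""
    return [heading for section_id, heading in sections
            if heading is not None and classify(section_id) == EDITABLE]


# The sections of the <div> example above, in document order.
sections = [(-1, None), (-2, None), (1, "1"), (-1, "2"), (3, "3")]
toc = toc_entries(sections)
foldable = safely_collapsible(sections)
```

For the example, all three headings appear in the table of contents, but only sections 1 and 3 have bounds that a client can fold without hiding too much or too little.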

Long-term, closing open HTML tags at section boundaries (and growing template-affected regions where necessary) should avoid the need for uneditable and pseudo-sections.

Support in the PHP Parser

We are not going to provide this section wrapping functionality in the PHP parser right now because we cannot provide the guarantee we identified above without DOM-based processing. With RemexHTML the current plan is that PHP sections will (eventually) be balanced by default (see below for description of impacts on styling), which will allow us to then provide consistency guarantee 3 (at least for non-template-affected sections).

Section wrapping algorithm currently used by MobileContentService

The section wrapping code in https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js was designed to satisfy the following requirements:

  • A page is a sequence of sections (which can include other nested sections). Content before the heading is wrapped in a section as well.
  • Sections are as fine-grained and as close to MediaWiki formatting and edit behavior as possible. Headings wrapped in <div>s or tables are picked up, and introduce a section wrapper around the overall wrapper element.

Both of these requirements are based on product needs. The first is used to implement section loading, and in many cases to identify the lead section of a page. The second requirement aims to conform to user expectations as closely as possible, while also respecting the structural integrity of the page.

This proposal conflicts with the consistency proposal above. Using these <section>s with PHP APIs can easily introduce corruption (wrong wikitext bounds replaced by new content) or confusion (wrong bounds displayed to reader). They are also not interoperable with cached PHP HTML output, and the DOM walk can introduce sections which include a large amount of content before and after a heading and thus don't match users' expectations for section bounds, especially when there is an unintentional unclosed tag in the wikitext.

Long term: Supporting use cases for adding <div> wrappers around multiple (partial) sections

So, why are editors adding div (or other block-tag) wrappers around partial sections or multiple sections?

Usually it appears to be in order to style a section, for example to highlight that portion of the page for action or attention. This use case is likely on user pages and other non-article pages, such as those in the Wikipedia (project) or talk namespaces, where a call to action or a notice might span multiple sections.

Solution 1: Stronger consistency guarantee for the Article namespace only

Given this, one way to get to stronger consistency guarantees (2 and 3) is by restricting section wrapping to only the article namespace. In that namespace, we could provide the stronger guarantee by breaking some badly nested sections: the output for every section would be DOM-balanced independently. This will forcibly close unclosed <div> (and other "block") tags and discard stray closing tags. If we want to go down this route, we need to get a sense of how common bad nesting is in the article namespace and how amenable editors are to fixing pages that have this behaviour. The Linter extension can help precisely identify this set of pages. Other namespaces will have only a single <section> tag around the entire page.
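
Per-section balancing of this kind can be sketched with Python's standard html.parser. The sketch below is an assumption-laden illustration, not Parsoid's or RemexHtml's algorithm: Balancer and balance are hypothetical names, and a real implementation would follow the HTML5 tree-building rules rather than this simple stack discipline.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Balancer(HTMLParser):
    """Re-emits a fragment, closing unclosed tags and dropping stray closers."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: discard it
        # Close anything still open inside the element being closed.
        while True:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance(fragment):
    """Return `fragment` with every element closed inside the fragment."""
    balancer = Balancer()
    balancer.feed(fragment)
    balancer.close()
    while balancer.open_tags:  # forcibly close whatever is still open
        balancer.out.append(f"</{balancer.open_tags.pop()}>")
    return "".join(balancer.out)
```

Applied to each section's output separately, this kind of pass makes every section wrappable, at the cost of changing how pages that rely on cross-section <div>s render.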

Note however that this breakage will only be in Parsoid output (and hence clients that use Parsoid output - mobile, Visual Editor). But, since we are progressing towards adopting Parsoid output as the de facto output for MediaWiki, this is not a concern per se.

This would also break section editing and navigation for non-article pages.

Solution 2: Make it possible to style sections "properly"

For example (strawman alert!), one option would be to support "class" and "style" attributes on sections in the same way we do for table cells:

== class="alert" style="bgcolor:red" | Section Title ==

which would generate:

<section class="alert" style="bgcolor:red">
    <h2>Section Title</h2>
    ...
</section>
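
Since the syntax above is itself a strawman, the following Python sketch is doubly hypothetical: parse_heading and both regular expressions are invented for illustration and do not correspond to any existing parser code. It shows how such a heading might be split into attributes and title, and why existing headings that contain a | character would be affected.

```python
import re

# "== ... ==" on a single line; the captured marker must repeat at the end.
HEADING = re.compile(r"^(={1,6})(.*?)\1[ \t]*$")
# name="value" pairs, as in the table-cell attribute syntax.
ATTR = re.compile(r'([A-Za-z][\w-]*)="([^"]*)"')


def parse_heading(line):
    """Return (level, attributes, title), or None if `line` is no heading."""
    match = HEADING.match(line)
    if match is None:
        return None
    level = len(match.group(1))
    body = match.group(2)
    attrs = {}
    if "|" in body:
        # Everything before the first pipe is treated as attribute text.
        attr_text, _, body = body.partition("|")
        attrs = dict(ATTR.findall(attr_text))
    return level, attrs, body.strip()


styled = parse_heading('== class="alert" style="bgcolor:red" | Section Title ==')
legacy = parse_heading("== Cats | Dogs ==")
```

The second call shows the compatibility problem: an existing heading "Cats | Dogs" would silently lose everything before the pipe unless the | is wrapped in <nowiki>.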

The id attribute could also be moved to the <section> tag. You would need to lint for all section titles currently containing a | character, and <nowiki> the |. Another option would let the user write <section> tags manually in wikitext, which would then be detected and suppress normal <section> generation. The above example would then be written as:

<section class="alert" style="bgcolor:red">
== Section Title ==
...
</section>
== Next section ==

This is perhaps more "compatible" (no existing headings need to be <nowiki>'ed), but it rather encourages the use (advertent or inadvertent) of explicit <section>s which don't line up with headings:

<section class="a">
Text here!  What section is this in?
== Section title ==
More text.
</section>
More text, what section is this in?
<section class="b">
== One Section ==
== Another section ==
</section>

It's probably best to avoid candy machine interfaces of this sort which make it easy to generate bad results.

Plan of record: Provide other solutions for the use cases that require multi-section styling

The TemplateStyles system has just been rolled out on Wikipedia. The long-term plan is to deprecate and eventually remove support for inline styles in wikitext, replacing it with template styles. Styled sections would thus be generated from a template, either inside the template body or as an argument (for example a heredoc argument). Section balancing will then be treated in a similar manner as template balancing, although we'd need to balance both sections and templates for similar reasons. The DOM tree will include non-section elements between <section>s; that is, it won't be the case that all <section> tags will be direct children of either a <section> or <body> tag. This may interact poorly with the "list of sections" assumption made by MobileContentService.

NOTE: consider a template which contains two <section> tags. If we need to emit a wrapper around the template contents, our choices are currently <div> or <span>. We may wish to consider emitting a <section> as a wrapper as well, in order to allow the "parent of a <section> is always either <section> or <body>" property of MCS to hold.
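
Whether a given document satisfies this property can be checked mechanically. The sketch below uses Python's standard html.parser; SectionParentChecker and section_parent_violations are hypothetical names, and the input is assumed to be well-balanced body content.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SectionParentChecker(HTMLParser):
    """Records the parent tag of every <section> not under <section>/<body>."""

    def __init__(self):
        super().__init__()
        self.stack = ["body"]  # the fragment is treated as <body> content
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag == "section" and self.stack[-1] not in ("section", "body"):
            self.violations.append(self.stack[-1])
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Assumes balanced input: only pop when the closer matches.
        if len(self.stack) > 1 and self.stack[-1] == tag:
            self.stack.pop()


def section_parent_violations(html):
    """Return the parent tag names that break the MCS parent property."""
    checker = SectionParentChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


nested = ('<section><h1>1</h1><section><h2>1.1</h2></section></section>'
          '<section><h1>2</h1></section>')
wrapped = ('<section><div><p>b</p>'
           '<section><h1>1</h1><p>c</p></section></div></section>')
```

The first fragment satisfies the property; the second, modelled on the pseudo-section example earlier on this page, has a <section> whose parent is a <div>.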