Parsing/Notes/Wikitext 2.0/Strawman Spec

Here is a somewhat tentative strawman spec for how we might implement the ideas in Parsing/Notes/Wikitext 2.0. It is based on prior ideas in User:SSastry (WMF)/Notes/Wikitext.

Note to discerning readers: This is less a spec than an outline of an implementation, although a spec can be extracted from it. So yes, the "Spec" part of the title is probably misleading.

Goals

Improve ability to reason about wikitext
Reduce edge cases by bounding range of wikitext errors
Improve editability in VE
Improve performance
Introduce new processing semantics with minimal disruptions to existing markup (preserve existing syntax, preserve existing rendering as far as possible)

Parsing outline

Broad ideas:

Separate wikitext-native parsing from HTML / DOM manipulation code.
Remove the notion of preprocessing from the core parser. The existing preprocessing code is moved into a string-preprocessing-templates extension to support the current corpus of templates.
HTML tags are not tokenized at all while processing wikitext constructs. They would only be tokenized just before the HTML5 tree builder stage if we are feeding HTML5 tokens to the HTML5 tree builder (vs. passing it a string to parse).
The top-level page is parsed with placeholders for transclusions and extensions, which are processed independently and inserted into their slots. This will force some restructuring in some cases to ensure that the output doesn't violate HTML5 semantics.
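The placeholder idea in the last point can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the regex handles simple, non-nested {{..}} only, and the names extract_slots, fill_slots and the SLOT marker format are made up for this sketch and are not part of any existing parser.

```python
import re

# Assumption: only simple, non-nested {{...}} transclusions are matched here.
TRANSCLUSION = re.compile(r"\{\{([^{}]*)\}\}")


def extract_slots(wikitext):
    """Replace each transclusion with a numbered placeholder.

    Returns the placeholder-bearing text and the list of transclusion sources.
    """
    slots = []

    def stash(match):
        slots.append(match.group(1))
        return "\x00SLOT%d\x00" % (len(slots) - 1)

    return TRANSCLUSION.sub(stash, wikitext), slots


def fill_slots(text, rendered):
    """Insert independently rendered fragments back into their slots."""
    return re.sub(r"\x00SLOT(\d+)\x00", lambda m: rendered[int(m.group(1))], text)
```

The top-level text can then be parsed without knowing what the transclusions expand to; whatever restructuring is needed to keep the result valid HTML5 would happen after fill_slots, and is not shown here.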

Steps:

Top-level page is tokenized to identify wikitext-native constructs.
- headings: /^= .. =$/
- paragraph separators: /\n\n+/
- tables: /^{| ... \n ... \n|}/ (alternatively, table start, end, rows, cells, headings, captions could be tokenized as is done in Parsoid right now, but conceptually the same thing)
- list markers: /^[*#:;]/
- indent-pres: /^ /
- links: [[..]] and [..]
- quotes: '' and '''
- transclusions: {{..}}
- non-html extension tags: <*include*>, <nowiki>, <ref>, and other installed extensions
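As a rough illustration of the line-anchored part of this list, here is a minimal sketch that classifies each line by the construct that starts it. The pattern names and the classify_lines function are hypothetical, inline constructs (links, quotes, transclusions, extension tags) are left out, and HTML tags are deliberately not recognized, in keeping with the broad ideas above.

```python
import re

# Line-anchored patterns adapted from the list above; the names are illustrative.
LINE_PATTERNS = [
    ("heading", re.compile(r"^(=+)\s*(.*?)\s*\1$")),
    ("table_start", re.compile(r"^\{\|")),
    ("table_end", re.compile(r"^\|\}")),
    ("list_item", re.compile(r"^[*#:;]+")),
    ("indent_pre", re.compile(r"^ ")),
]


def classify_lines(wikitext):
    """Return a (kind, line) pair for every line of the input."""
    tokens = []
    for line in wikitext.split("\n"):
        if line == "":
            kind = "blank"  # a blank line marks a paragraph separator
        else:
            kind = "text"   # default: plain text, including any HTML tags
            for name, pattern in LINE_PATTERNS:
                if pattern.match(line):
                    kind = name
                    break
        tokens.append((kind, line))
    return tokens
```

A line such as plain <div>x</div> comes back as ordinary text, since HTML tags are not tokenized at this stage.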

Process the wikitext tokens to generate an AST with 2 classes of nodes.
- those that are parsed to an independent DOM fragment.
- those that are not parsed to an independent DOM fragment on their own.
Paragraphs, indent-pres, links, quotes are not processed to DOM. The wikitext markup is replaced with the equivalent HTML markup in the AST.
- Explanation for those who are wondering why paragraphs are excluded here: A paragraph's content model in HTML5 is phrasing content, so paragraphs, <div>s, tables, etc. cannot be nested inside a paragraph. But the reason we are excluding them here is that we aren't tokenizing HTML tags and paragraphs are implicit in wikitext markup. For example, consider the wikitext a\n<div>foo\n\nbar</div>\nb. The expected HTML output from this is <p>a</p><div><p>foo</p><p>bar</p></div><p>b</p> or <p>a</p><div>foo\n\nbar</div><p>b</p>, and not <p>a</p><div>foo</div><p>bar\nb</p>, which is what we would get if we processed the wikitext paragraphs to DOM. An alternative would be to tokenize HTML tags and identify well-nested <div> and other flow content, but then we would have to deal with bad nesting, etc., which effectively requires us to embed the HTML5 parsing rules into the tokenization and AST building. I am not interested in that. The intent here is to keep the wikitext-native parsing separate from HTML / DOM manipulation code.
- Possible enhancement 1: Consider link content to be a nested document and balance it independently (Parsoid does this right now already).
- Possible enhancement 2: Once quotes are processed to generate I/B nestings, process their content as
Headings, top-level sections, table rows, tables, list items, transclusions, and extension tags are parsed to DOM (recursively, bottom-up) and each AST node is replaced with a DOM fragment (which would in turn have nested DOM fragments).
- Special cases:
  1. For table nodes, fostered content is easily identifiable without heroic analyses and can be handled appropriately according to whatever semantics we want.
    - error markup to get editors to fix it up.
    - silently discard fostered content.
    - silently display fostered content but prevent direct visual editing and flag the error in VE.
    - error markup during preview.
    - ... other possibilities? ...
  2. Templates can specify one of multiple types for their output. These types are enforced in the composition step below.
    - Plain text
    - Attribute pairs
    - DOM forest
    - more specialized DOM types (ex: inline, block, table, etc. as in T114445)
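To make the typed-output idea in special case 2 concrete, here is a small sketch of declared output types and a position check. Everything in it is an assumption for illustration: the OutputType names mirror the list above, but the ALLOWED policy table, the position names, and check_position are invented here, and the more specialized DOM types are omitted.

```python
from enum import Enum


class OutputType(Enum):
    PLAIN_TEXT = "plain text"
    ATTRIBUTE_PAIRS = "attribute pairs"
    DOM_FOREST = "DOM forest"


# Hypothetical policy: which declared output types each position accepts.
ALLOWED = {
    "attribute": {OutputType.ATTRIBUTE_PAIRS},
    "content": {OutputType.PLAIN_TEXT, OutputType.DOM_FOREST},
}


def check_position(position, declared):
    """Return True if a template with this declared type may be used at `position`."""
    return declared in ALLOWED[position]
```

What happens when the check fails (error markup, silent discard, and so on) is a policy decision for the composition step and is left open here.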
This transforms the AST with symbolic wikitext nodes into an AST where every node is a DOM fragment.
Rely on composition rules to transform / collapse these DOM fragments into a DOM tree. This composition spec needs to be developed. These notes about composition constraints are a beginning.
- Simplest "do nothing" rule (which is wikitext 1.0 by default): All the DOM fragments are flattened to an HTML string and the HTML is reparsed into the final DOM. This relies on the HTML5 tree builder algorithm to fix things up.
- Some possible composition rules:
  - If a template is found in an attribute position (need to figure out representation / mechanics of this), our options are:
    - silently or loudly discard uses of templates whose output type is not attribute pairs
    - as in the fostering scenario, other options are: error markup always, error markup during preview, or discarding all non-attribute strings in the output (current wikitext 1.0 behavior)
  - All content nested within links will have links stripped
  - If a p-tag is nested in another p-tag, transform DOM in a well-defined manner (easy to work out the details, skipping here for now).
  - Others based on perusing the HTML5 tree building spec and common-sense semantic notions of nesting.
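Two of the rules above, the "do nothing" flattening and the stripping of links nested within links, can be sketched on a toy fragment type. This is a minimal sketch, not a proposal for the actual representation: Node, strip_nested_links and flatten are hypothetical names, attributes are ignored, and "stripping" is interpreted here as unwrapping the inner link while keeping its content.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # each child is a Node or a str


def strip_nested_links(node, inside_link=False):
    """Return a copy of `node` in which <a> elements nested inside an <a> are unwrapped."""
    inside = inside_link or node.tag == "a"
    out = []
    for child in node.children:
        if isinstance(child, str):
            out.append(child)
            continue
        fixed = strip_nested_links(child, inside)
        if child.tag == "a" and inside:
            out.extend(fixed.children)  # drop the inner <a>, keep its content
        else:
            out.append(fixed)
    return Node(node.tag, out)


def flatten(node):
    """The "do nothing" rule: serialize a fragment to an HTML string for reparsing."""
    inner = "".join(
        escape(c, quote=False) if isinstance(c, str) else flatten(c)
        for c in node.children
    )
    return "<%s>%s</%s>" % (node.tag, inner, node.tag)
```

With a well-defined rule like strip_nested_links applied first, the flattened string no longer depends on the HTML5 tree builder to repair the nested link.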
Run the pass that adds template wrapping markup, plus other DOM passes (these will be a subset of Parsoid's DOM passes, as necessary).
Sanitize (last step before serialization)
XML serialize.

Technical Blockers

Use of parser functions at the top-level (vs. in templates / extensions)
- not sure how to tackle this yet
Templates that heavily rely on preprocessing:
- Preprocessing-based templates can be processed with a specialized extension that yields fully processed markup which is then parsed using the above algorithm.
- In the long run, templates might be processed recursively using the same algorithm, but for now, for templates that rely on preprocessing, this is not feasible.
- So, this is only a blocker in that the extension needs to be written.
Multi-template authoring of DOM structures (tables, 2-column layout, etc.) where each individual template wouldn't generate a well-formed DOM structure.
- One way of handling it is using heredoc style syntax that enables these to be written using an atomic syntactical construct. Ex: T114432
Deprecate and remove all parser hooks that are specific to the internals of the existing PHP parser
- Doesn't look too bad according to Parser_Hooks_Stats
Come up with an extension spec that extension authors can use to implement the functionality currently used by various extensions
- This is a matter of working out the details
Actually reviewing the sanity of this proposal, working out the kinks, prototyping this, and evaluating this.

Social Blockers

I am just doing a first cut laundry list here. These are all genuine and valid concerns. We need to think through these and understand real impacts. My gut sense is that the changes are not going to be drastic. A lot of wikitext will continue to behave as before, but there might be some subtle changes (for the good, I think). But, template authors are probably the ones that might have the biggest concerns here.

Potential concerns about breakage of pages
Potential concerns about templates that need updating
Potential concerns about changes to editing workflow
Potential concerns about changes to well-understood wikitext semantics.
How does this affect bots?

Advantages

Parsing/Notes/Wikitext 2.0#Implications of this model elaborates on some possible benefits. They are repeated here in concise form for completeness.

Improved reasoning abilities: clear semantics and improved reasoning about wikitext markup.
- Top-level document is decoupled (in good ways) from transclusions and extensions. So, '' some wikitext here '' will always parse as italic even if there are transclusions inside some wikitext here.
- Uniform semantics for templates, extensions, gadgets, whatever.
Bounding effects of markup errors:
- Markup errors are confined to, and fixed within, the smallest possible scopes.
Potential for improved edit tooling:
- Structural semantics of the input and output document is an enabler for improved editing tools.
- You can enable micro-editing at the sub-section level without edit conflicts as long as the nodes being edited are different. You can imagine locking / warning / timeout / real time editing features built on top of this.
Potential for improved parsing performance:
- Templates can be parsed independent of the top-level document and concurrently.
- As long as templates don't use features that depend on volatile output (time of day, etc.), their output can be cached.
- Both of these have tremendous performance implications for template edits. If you fix a typo in a heavily used template, you parse the template output once, cache it, and (with some caveats), in most pages that use the template, you can do a drop-in replacement of the new HTML string in place of the old string and be done.
- Other incremental parsing abilities on edits. If a typo is fixed on a page, there is absolutely no reason to reparse the entire page. It is conceivable to come up with solutions / algorithms to fix the typo in the HTML output.
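The caching point above can be sketched as follows. This is a minimal sketch under stated assumptions: TemplateCache and its methods are hypothetical, and the render callable is assumed to report whether the output is volatile (depends on time of day, etc.), which is something the real system would have to determine.

```python
class TemplateCache:
    """Caches rendered template HTML keyed by (name, args); illustrative only."""

    def __init__(self, render):
        # Assumed signature: render(name, args) -> (html, is_volatile)
        self._render = render
        self._store = {}
        self.render_calls = 0

    def get(self, name, args=()):
        key = (name, tuple(args))
        if key in self._store:
            return self._store[key]
        html, is_volatile = self._render(name, args)
        self.render_calls += 1
        if not is_volatile:
            self._store[key] = html  # volatile output is never cached
        return html

    def invalidate(self, name):
        """Drop every cached rendering of `name`, e.g. after the template is edited."""
        for key in [k for k in self._store if k[0] == name]:
            del self._store[key]
```

After a template edit, invalidate followed by one fresh get per distinct argument list yields the new HTML string that pages using the template could pick up as a drop-in replacement, subject to the caveats mentioned above.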