Parser 2011/Stage 2: Informal grammar

...

Loose structure assembly
Our primary document structure nests along the lines of template expansions, parser functions, tag hooks and links, all of which have firmly defined start, end and content.

However, many constructs are built on looser structure. The first formal parsing steps will give us tokens or nodes for pieces of these looser structures, such as tables and HTML elements, which then need to be assembled into higher-level structures.

Separate nesting levels
While not popular with purists, this is a fact of life in existing wikitext. :) It is not at all uncommon to find structures like this:


 * some page
 * start template
 * row template
 * || blah
 * end template
 * |}
 * end template
 * |}

If producing HTML web output, we need to produce output that looks more like:


 * some page
 * <table>
 * <tr>
 * <td> blah
 * </table>
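
A minimal sketch of this kind of assembly, assuming a hypothetical flat token stream of `(kind, text)` pairs with the kinds `table_start`, `row`, `cell`, `table_end` and `text`. The token names and the policy of keeping stray table tokens as literal text are assumptions made for illustration, not the behavior of any existing parser:

```python
def in_current_table(stack, name):
    """Return True if `name` is open inside the innermost open table."""
    for element in reversed(stack):
        if element == name:
            return True
        if element == "table":
            return False
    return False


def assemble_table(tokens):
    """Assemble flat (kind, text) tokens into HTML with implicit opens/closes."""
    out = []
    stack = []  # open elements, innermost last

    def close_through(name):
        # Close open elements up to and including the innermost `name`.
        while stack:
            top = stack.pop()
            out.append(f"</{top}>")
            if top == name:
                break

    for kind, text in tokens:
        in_table = in_current_table(stack, "table")
        if kind == "table_start":
            stack.append("table")
            out.append("<table>")
        elif kind == "row" and in_table:
            if in_current_table(stack, "tr"):
                close_through("tr")
            stack.append("tr")
            out.append("<tr>")
        elif kind == "cell" and in_table:
            if in_current_table(stack, "td"):
                close_through("td")
            if not in_current_table(stack, "tr"):
                stack.append("tr")  # implicit row for a cell without a row marker
                out.append("<tr>")
            stack.append("td")
            out.append("<td>" + text)
        elif kind == "table_end" and in_table:
            close_through("table")
        else:
            out.append(text)  # plain text, or a stray table token kept as text
    while stack:  # implicit closes at end of input
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

With the tokens `[("text", "some page"), ("table_start", "{|"), ("cell", "blah")]`, which lack both a row marker and a table end, this returns `some page<table><tr><td>blah</td></tr></table>`. Because the tokens could come from any template nesting level, the assembly only depends on their order in the stream.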

Similarly, sequences of HTML-like or table tags may be missing clearly ordered close tags, and may need to be disabled or have implicit closes added. Even if all the pieces are there, their nesting may not match the brace structure:


 * some page
 * bla bla
 * bla bla

Here, we have to "pull through" those #if function blocks.

Sequences of adjacent list item lines similarly may need to be reassembled into a properly-nesting structure of lists and list items when producing HTML or similar output -- they too may come from a combination of different template/function nesting levels.

We may however be able to limit some of these sorts of structures for editing purposes...?

Fix-ups in different stages of parsing
Fix-ups can be distributed across the different stages of parser processing:
 * Grammar
 * Parser actions associated with each grammar rule
 * Parse tree walkers, which try to modify the parse tree or even serialize and re-parse some sections, for example after a template expansion.

Fix-ups in the grammar
Adding some fix-ups directly in the grammar can simplify further processing by reducing the number and severity of fix-ups needed in later stages. Examples of grammar fix-ups are implied (optional) end tags for inline structures at block boundaries, or the creation of end-tag tokens for stray or improperly nested end tags.
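
As an illustration of an implied end tag, here is a small hand-written recursive-descent sketch for a single block (one line) containing italic spans. The rule `italic := "''" text ("''" | end-of-block)` and the node names are assumptions chosen for this example, not an excerpt of an actual grammar:

```python
def parse_block(line):
    """Parse one block. Grammar sketch:
         block  := (italic | text)*
         italic := "''" text ("''" | end-of-block)
    """
    nodes = []
    pos = 0
    while pos < len(line):
        if line.startswith("''", pos):
            end = line.find("''", pos + 2)
            if end == -1:
                # The end tag is optional: the block boundary closes the span.
                nodes.append(("i", line[pos + 2:]))
                pos = len(line)
            else:
                nodes.append(("i", line[pos + 2:end]))
                pos = end + 2
        else:
            nxt = line.find("''", pos)
            if nxt == -1:
                nxt = len(line)
            nodes.append(("text", line[pos:nxt]))
            pos = nxt
    return nodes
```

Both `parse_block("a ''b'' c")` and the unclosed `parse_block("a ''b")` yield a well-formed `("i", "b")` node, so later stages never see a dangling italic start.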

Parser actions
Parser actions are run while processing a matching grammar rule, and are generally responsible for building the returned tree structure. In some parsers, actions can additionally abort a match, which asks the parser to try alternative parsings from other rules. The structure-transformation role overlaps with that of parse tree walkers, but actions can be more efficient as their scope is more targeted.
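
A minimal sketch of an action that can abort a match, assuming a toy line-based parser with an ordered list of `(pattern, action)` rules. The `REJECT` sentinel, the rule set and the "empty title" check are all hypothetical and only serve to show the control flow:

```python
import re

REJECT = object()  # sentinel an action returns to abort its match


def heading_action(match):
    # Build a heading node, but reject matches with an empty title.
    level, title = len(match.group(1)), match.group(2)
    if not title:
        return REJECT
    return ("heading", level, title)


def text_action(match):
    return ("text", match.group(0))


# Rules are tried in order; a later rule is an alternative parsing.
RULES = [
    (re.compile(r"(=+)\s*(.*?)\s*=+"), heading_action),
    (re.compile(r".*"), text_action),
]


def parse_line(line):
    for pattern, action in RULES:
        match = pattern.fullmatch(line)
        if match is None:
            continue
        result = action(match)
        if result is not REJECT:
            return result
    raise ValueError("no rule matched")
```

Here `parse_line("== Title ==")` gives `("heading", 2, "Title")`, while `parse_line("====")` matches the heading pattern, is rejected by its action, and falls through to `("text", "====")`.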

An example of a slightly more elaborate parser action is the current list processing code: list items with differing combinations of bullet prefixes are wrapped in properly nested list structures using a stack. Expanding the template in the following example is, however, harder to integrate with a parser action:


 * some page
 * * First bullet
 * start template
 * # Second bullet
 * * Third bullet
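
The stack-based wrapping of list items can be sketched as follows. This is an illustrative reconstruction rather than the current list processing code: the node layout (dicts with `type`, `items`, `text`, `children`) is an assumption, only `*` and `#` prefixes are handled, and every input line is assumed to be already expanded, so the template problem shown above is not addressed:

```python
LIST_TYPES = {"*": "ul", "#": "ol"}


def nest_list_items(lines):
    """Wrap prefixed list lines in nested list nodes using a stack."""
    root = []   # top-level list nodes
    stack = []  # (prefix_char, list_node) for each currently open list
    for line in lines:
        prefix = line[:len(line) - len(line.lstrip("*#"))]
        text = line[len(prefix):].strip()
        # Keep the open lists that agree with this line's prefix.
        common = 0
        while (common < min(len(stack), len(prefix))
               and stack[common][0] == prefix[common]):
            common += 1
        del stack[common:]
        # Open new lists for the remaining prefix characters.
        for char in prefix[common:]:
            new_list = {"type": LIST_TYPES[char], "items": []}
            if stack:
                parent_items = stack[-1][1]["items"]
                if not parent_items:
                    # A deeper list with no parent item gets an empty one.
                    parent_items.append({"text": "", "children": []})
                parent_items[-1]["children"].append(new_list)
            else:
                root.append(new_list)
            stack.append((char, new_list))
        if stack:  # lines without a list prefix just end all open lists
            stack[-1][1]["items"].append({"text": text, "children": []})
    return root
```

For `["* a", "** b", "*# c", "* d"]` this yields one top-level `ul` with items `a` and `d`, where `a` carries a nested `ul` (item `b`) and a nested `ol` (item `c`). If the middle line only becomes visible after a template is expanded, the action has already committed to a structure, which is the difficulty described above.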

Parse tree walkers
Parse tree walkers visit nodes on a completely built parse tree, and expand or generally reorder the tree. A possible task would be template expansions. Some operations might require a re-serialization and re-parsing of parts of the tree. The unbalanced-template example is hard to parse without a re-serialization of the subtree containing the templates, as each template in isolation is not complete:


 * some page
 * start template
 * row template
 * || blah
 * end template
 * |}
 * end template
 * |}
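
A rough sketch of the expand, re-serialize and re-parse cycle for such templates. The template names and bodies in `TEMPLATES`, the regular-expression "parsers" and all function names are hypothetical stand-ins chosen to keep the example short:

```python
import re

# Hypothetical template bodies; each is unbalanced when parsed on its own.
TEMPLATES = {
    "table start": "{|\n",
    "row": "|-\n| blah\n",
    "table end": "|}\n",
}


def first_pass(wikitext):
    """Split wikitext into ("template", name) and ("text", content) nodes."""
    nodes = []
    for part in re.split(r"(\{\{[^{}]*\}\})", wikitext):
        if part.startswith("{{") and part.endswith("}}"):
            nodes.append(("template", part[2:-2].strip()))
        elif part:
            nodes.append(("text", part))
    return nodes


def expand_and_serialize(nodes):
    """Walker step: expand templates and serialize the subtree to wikitext."""
    return "".join(
        TEMPLATES[value] if kind == "template" else value
        for kind, value in nodes
    )


def find_tables(wikitext):
    """Re-parse step: find complete {| ... |} tables in expanded wikitext."""
    return re.findall(r"^\{\|.*?^\|\}", wikitext, flags=re.M | re.S)
```

Running `find_tables` on any single template body finds nothing, whereas `find_tables(expand_and_serialize(first_pass("some page\n{{table start}}{{row}}{{table end}}")))` finds one complete table. The table only exists in the re-serialized text of the whole subtree.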

Templates containing a nowiki or pre tag change the parsing rules entirely for the wrapped content, so a re-parse is hard to avoid: structure recognized in the earlier parse may no longer be valid.

Ideally, the scope of this re-serialization would be limited to a subtree. A new restriction on templates to open or close only block-level constructs within a containing block structure could provide a reduced scope, and would still allow the table start/end template construct listed above.

This idea would however fail in a case similar to this:


 * some page
 * start template
 * <nowiki>
 * Bla
 * second template
 * </nowiki>
 * Other content
 * </div>
 * Other content
 * </div>

The earlier parse did not see the first nowiki and thus matched the inner div tag. The start template is not in the same scope as Other content. Reliable detection of such a change of parsing rules and thus nesting could be used to widen the scope of re-serialization to fix these cases.

Non-nested templates are hard to support in parsing and visual editing. Consequently, the consensus clearly favors late template expansion and properly nested templates. Satisfactory replacements for common uses of non-nested templates still need to be developed, however, and the transition will take time in any case. In the meantime, support for 'legacy' non-nested templates is still needed.

Re-serialization might also collide with the flat addressing scheme of the operational transform techniques outlined in Visual Editor design. Re-parsing sections can reassign tokens and thus change the element-dependent linear addresses described there. Basing the addressing on offsets in the serialized wikitext could avoid this problem, as those offsets should remain invariant under re-parsing of that wikitext.
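
A toy illustration of the difference between token-based and offset-based addresses. The two regular expressions stand in for two different parses of the same wikitext; they and the helper names are assumptions for this sketch and do not reflect the addressing scheme described in Visual Editor design:

```python
import re


def tokens_with_offsets(wikitext, pattern):
    """Tokenize and keep each token's (start, end) offset in the source."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, wikitext)]


def token_index_at(tokens, offset):
    """Return the index of the token covering a source offset, if any."""
    for index, (start, end, _) in enumerate(tokens):
        if start <= offset < end:
            return index
    return None


text = "a ''b'' c"
coarse = tokens_with_offsets(text, r"\S+")          # one parse
fine = tokens_with_offsets(text, r"''|[^'\s]+")     # a re-parse, finer tokens
```

The character `b` sits at offset 4 in `text` under either parse, but `token_index_at(coarse, 4)` is 1 while `token_index_at(fine, 4)` is 2: the token-based address shifts after the re-parse, the text offset does not.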