Parser 2011/Stage 2: Informal grammar

...

Loose structure assembly
Our primary document structure nests along the lines of template expansions, parser functions, tag hooks and links, all of which have fairly firmly defined starts, ends and contents.

However, a lot of the document is built on looser structure; the first formal parsing steps will give us tokens or nodes for pieces of looser structures such as tables and HTML elements, which then need to be assembled into higher-level structures.

Separate nesting levels
While not popular with purists, this is a fact of life in existing wikitext. :) It is not at all uncommon to find structures like this:


  some page
  start template
  row template
  || blah
  end template
  |}
  end template
  |}

When producing HTML web output, we need to generate something that looks more like:


  some page
  <table>
  <tr>
  <td> blah
  </table>

Similarly, sequences of HTML-like or table tags may be missing properly ordered close tags, and may need to be disabled or have implicit closes added. Even if all the pieces are there, their nesting may not match the brace structure:


  some page
  bla bla
  bla bla

Here, we have to "pull through" those #if function blocks.

Sequences of adjacent list item lines may similarly need to be reassembled into a properly nesting structure of lists and list items when producing HTML or similar output -- they too may come from a combination of different template/function nesting levels.

We may however be able to limit some of these sorts of structures for editing purposes...?
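As a hedged sketch of this kind of loose-structure assembly: the following uses a stack and a small containment table to add implicit closes while building a tree from a flat token stream. The token shapes and the containment table are illustrative, not MediaWiki's actual internals.

```python
# Sketch: assemble a flat stream of loose table tokens into a properly
# nested tree, adding implicit closes where the nesting does not match.
# Token shapes and the containment table are illustrative only.

CAN_CONTAIN = {
    "table": {"tr"},
    "tr": {"td"},
    "td": set(),
}

def assemble(tokens):
    """tokens: ("open", name), ("close", name) or ("text", content)."""
    root = {"name": "root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            # implicitly close elements that cannot contain the new one
            while (len(stack) > 1 and
                   value not in CAN_CONTAIN.get(stack[-1]["name"], {value})):
                stack.pop()
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:  # "close"
            if any(n["name"] == value for n in stack[1:]):
                # implicitly close everything inside the matching element
                while stack[-1]["name"] != value:
                    stack.pop()
                stack.pop()
            # a stray close with no matching open is silently dropped
    return root
```

Fed the tokens corresponding to the table example above (open table, open row, open cell, text, close table), this produces the properly nested table/tr/td structure shown in the HTML version.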

Fix-ups in different stages of parsing
Fix-ups can be distributed across the different stages of parser processing:
 * Grammar
 * Parser actions associated with each grammar rule
 * Parse tree walkers, which try to modify the parse tree or even serialize and re-parse some sections, for example after a template expansion.

Fix-ups in the grammar
Adding some fix-ups directly in the grammar can simplify further processing by reducing the number and severity of fix-ups needed in later stages. Examples for a grammar fix-up would be implied (optional) end-tags for inline structures on block boundaries, or the creation of end-tag tokens for stray or improperly nested end tags.
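As a hedged illustration of both fix-ups, here is a sketch expressed as a pass over a token stream rather than inside the grammar itself; the token shapes and the inline tag set are assumptions for illustration:

```python
# Sketch: two token-stream fix-ups -- implied end tags for open inline
# elements at block boundaries, and demotion of stray end tags to plain
# text. Token shapes and the inline tag set are illustrative only.

def fix_inline(tokens, inline_tags=("i", "b")):
    out, open_inline = [], []
    for tok in tokens:
        kind, name = tok
        if kind == "open" and name in inline_tags:
            open_inline.append(name)
            out.append(tok)
        elif kind == "close" and name in inline_tags:
            if name in open_inline:
                # implicitly close more recently opened inline tags first
                while open_inline[-1] != name:
                    out.append(("close", open_inline.pop()))
                open_inline.pop()
                out.append(tok)
            else:
                # stray end tag: turn it into a literal text token
                out.append(("text", "</%s>" % name))
        elif kind == "block-boundary":
            # implied end tags for inline structures at block boundaries
            while open_inline:
                out.append(("close", open_inline.pop()))
            out.append(tok)
        else:
            out.append(tok)
    return out
```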

Parser actions
Parser actions are run while processing a matching grammar rule, and are generally responsible for building the returned tree structure. In some parsers, actions can additionally abort a match, which causes the parser to try alternative parses from other rules. This structure-transformation role overlaps with that of parse tree walkers, but can be more efficient as the scope is more targeted.

An example of a slightly more elaborate parser action would be the current list processing code. List items with differing combinations of bullet prefixes are wrapped in properly nested list structures using a stack. Expansion of the template in this example is however harder to integrate with a parser action:


  some page
  * First bullet
  start template
  # Second bullet
  * Third bullet
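A hedged sketch of such stack-based list assembly, using illustrative bullet-prefix input rather than the parser's real token types (and, as a simplification, emitting nested lists as siblings of their parent item, which the real code handles more carefully):

```python
# Sketch: wrap list-item lines in properly nested lists using a stack of
# currently open list kinds ('*' -> <ul>, '#' -> <ol>).

def build_lists(lines):
    """lines: (bullet_prefix, text) pairs, e.g. ("*#", "item")."""
    html, stack = [], []
    for prefix, text in lines:
        # how many leading levels this line shares with the open lists
        common = 0
        while (common < len(stack) and common < len(prefix)
               and stack[common] == prefix[common]):
            common += 1
        # close lists below the shared depth, then open the new ones
        while len(stack) > common:
            html.append("</ol>" if stack.pop() == "#" else "</ul>")
        for kind in prefix[common:]:
            html.append("<ol>" if kind == "#" else "<ul>")
            stack.append(kind)
        html.append("<li>%s</li>" % text)
    while stack:
        html.append("</ol>" if stack.pop() == "#" else "</ul>")
    return "".join(html)
```

The template in the example above is exactly what makes this hard to do in a parser action: the "# Second bullet" line only exists after expansion, so an action attached to the list rule sees an incomplete sequence unless expansion has already happened.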

Parse tree visitors
Parse tree visitors walk a fully built parse tree and expand or otherwise reorder it. A typical task would be template expansion. Modifications with effects outside a cleanly nested subtree (e.g. textual template expansion) can require re-serialization and re-parsing of parts of the tree.

The unbalanced-template example can be processed by re-serializing and then re-parsing the subtree containing the templates, as each template in isolation is not complete:


  some page
  start template
  row template
  || blah
  end template
  |}
  end template
  |}

Ideally, the scope of this re-serialization would be limited to a subtree, but this cannot be guaranteed with current textual templates. A new restriction on templates to open or close only block-level constructs within a containing block structure could provide a reduced scope, and would still allow the table start/end template construct listed above.
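A hedged sketch of this re-serialize-and-re-parse step; `parse`, `serialize` and `expand` stand in for the real pipeline stages and are assumptions for illustration:

```python
# Sketch: textually expand the templates inside the nearest enclosing
# subtree, then re-parse the combined wikitext so unbalanced pieces from
# several templates can merge into one well-formed structure.
# `parse`, `serialize` and `expand` stand in for real pipeline stages.

def expand_subtree(node, parse, serialize, expand):
    wikitext = "".join(
        expand(child) if child["type"] == "template" else serialize(child)
        for child in node["children"]
    )
    node["children"] = parse(wikitext)
    return node
```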

This idea would however still fail in a case similar to this:


  some page
  start template
  <nowiki>
  Bla
  second template
  </nowiki>
  Other content
  </div>
  Other content
  </div>

The earlier parse did not see the first nowiki tag and thus matched the inner div end tag, so the start template is not in the same scope as the other content. A reliable way to detect such a change of parsing rules, and thus of nesting, could widen the scope of re-serialization enough to fix these cases, but such detection remains hypothetical.

Re-parsing of parts of a page could interfere with the flat addressing scheme of operational transform techniques outlined in Visual Editor design. The new parse might add or remove DOM elements, which are currently all counted as 1 in the linear addressing scheme. Basing the addressing on offsets in serialized wikitext could avoid this problem as those would remain invariant to the parse of this wikitext.
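A hedged sketch of the contrast between the two addressing schemes, with illustrative node data:

```python
# Sketch: nodes annotated with offsets into the wikitext they were parsed
# from. A re-parse may split one node into several, shifting linear node
# indices, but the source offsets remain valid anchors. Data illustrative.

before = [
    {"tag": "p",     "start": 0,  "end": 10},
    {"tag": "table", "start": 10, "end": 40},
]
# after re-parsing, the region at offsets 10-40 yields two nodes, so a
# linear address like "node 2" now points at different content...
after = [
    {"tag": "p",     "start": 0,  "end": 10},
    {"tag": "table", "start": 10, "end": 30},
    {"tag": "p",     "start": 30, "end": 40},
]

def node_at_offset(nodes, offset):
    # ...while an address based on a wikitext offset still resolves to
    # the same region in both trees
    return next(n for n in nodes if n["start"] <= offset < n["end"])
```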

Summary
Unbalanced textual templates are clearly hard to support in parsing and visual editing. Consequently, the consensus clearly favors properly nested templates. Satisfactory replacements for common uses of non-nested templates still need to be developed, however, and the transition will need time in any case. In the meantime, support for 'legacy' non-nested templates is still needed.

The consensus at the parser summit also favors performing template expansion as late in the pipeline as possible. This is no problem for balanced templates, but makes supporting traditional unbalanced templates very challenging.

Traditional pipeline
The traditional pipeline used in the current MediaWiki parser and most (all?) compilers for languages with textual preprocessors first performs a shallow parse followed by preprocessor expansion. In most cases, this is then followed by a full parse on the resulting textual representation or an expanded and patched-up token stream. The same approach is also used by the Sweble wiki parser.

This order of things avoids the thorny issues associated with late (textual) template expansion. The main disadvantage of this approach might be the difficulty of tracking the origin of parts of the resulting source for editing purposes. IDEs for languages like C++ face the same issues, and have found solutions to this problem.

UI challenges with origin interleaving at the character level
IDEs operate on the source level, where UI interaction can be more readily associated with bits of the visible source or function definitions. Unfortunately, a WYSIWYG editor operates on rendered constructs, which might be constructed from a mix of multiple origins. In the worst case, the interleaving of sources dives down to the character level, which makes the creation of a sensible UI rather challenging. Reverting to an IDE-like editing experience for heavily interleaved sections might be the only option available in some cases.

Most templates in actual use interleave at the text or section level, with origin typically aligning with structure (e.g. infoboxes). In these cases, origin tracking could be sufficient to construct a workable UI for template-generated content.
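A hedged sketch of such origin tracking at the expansion step; all names and data shapes are assumptions for illustration:

```python
# Sketch: every expanded piece of text records which template (or the
# page itself) it came from, so a rendered character position can be
# mapped back to its origin. All names are illustrative.

def expand_with_origin(pieces):
    """pieces: (text, origin) pairs in document order."""
    rendered, origins, pos = [], [], 0
    for text, origin in pieces:
        rendered.append(text)
        origins.append((pos, pos + len(text), origin))
        pos += len(text)
    return "".join(rendered), origins

def origin_at(origins, pos):
    # map a position in the rendered text back to its origin record
    return next(o for start, end, o in origins if start <= pos < end)
```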

Related work on fix-ups

 * The current MediaWiki parser implicitly closes inline tags (see e.g. Parser::doBlockLevels)
 * HTML Tidy, which is commonly used to clean up MediaWiki output, implements further heuristics in src/parser.c (see ParseInline and ParseBlock)
 * The Sweble parser implements fix-ups both in grammar productions and AST visitors. See in particular this pre-print for an overview.
 * Late template expansion somewhat overlaps with incremental compilation used in some IDEs. TODO: research!