Parser 2011/Stage 2: Informal grammar

The current Parser, in combination with the HTML tidy post-processor, goes to great lengths to produce valid and reasonable output even when structures in the wiki source are improperly nested. Fix-up steps are deeply embedded in the parser (e.g., implicitly closed inline tags in doBlockLevels) and in the HTML tidy code. Generally, the tightly bound structures listed in Wikitext parser/Stage 1: Formal grammar are less likely to require fix-ups than larger-scale structures like tables and HTML elements.

Besides problematic wiki source, a different kind of fix-up is needed if textual templates are expanded late in the parsing process. The purely textual and thus often unbalanced structures in templates can change the correct parse of large parts of the document. Without a complete re-parse of the affected parts or of the full document, elaborate transformations on an AST can be required to restore a correct parse. Restricting templates to balanced structures would avoid most of this problem, but is a large departure from the current textual model and unlikely to happen overnight. In the meantime, a general parser needs to handle unbalanced templates as well as balanced ones.

The current MediaWiki parser avoids these problems by using the traditional strategy of an early textual template expansion using a preprocessor before performing multiple parsing passes on the resulting source.
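As a rough illustration of early textual expansion, the following sketch replaces template calls with their text before any structural parsing takes place. The template names, their bodies and the `expand_templates` helper are hypothetical, and the sketch ignores arguments, parser functions and nowiki sections; the actual MediaWiki preprocessor is considerably more involved.

```python
import re

# Hypothetical template store; names and bodies are made up for illustration.
TEMPLATES = {
    "table start": "{|",
    "table row": "|-\n| blah",
    "table end": "|}",
}

# Matches a simple {{name}} call without arguments or nested braces.
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)\}\}")


def expand_templates(source, templates, max_depth=10):
    """Textually replace {{name}} calls until nothing changes or depth runs out."""
    for _ in range(max_depth):
        expanded = TEMPLATE_CALL.sub(
            # Unknown templates are left in place unchanged.
            lambda m: templates.get(m.group(1).strip(), m.group(0)),
            source,
        )
        if expanded == source:
            break
        source = expanded
    return source


page = "{{table start}}\n{{table row}}\n{{table end}}"
expanded = expand_templates(page, TEMPLATES)
print(expanded)
```

Each template on its own is unbalanced, but the expanded text forms a complete table that a later parsing pass can handle without knowing where the pieces came from.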

The relative merits of early vs. late template expansion are not very clear at this stage. In the following, some relevant issues for both late template expansion fix-ups and regular best-effort fix-ups are explored.

Unbalanced nesting levels in templates
While not popular with purists, this is a fact of life in existing wikitext. :) It is not at all uncommon to find structures like this:


 * some page
 * start template
 * row template
 * || blah
 * end template
 * |}
 * end template
 * |}

If producing HTML web output, we need to produce output that looks more like:


 * some page
 * <table>
 * <tr>
 * <td> blah
 * </table>

Nesting may not match the brace structure

 * some page
 * bla bla
 * bla bla

Here, we have to "pull through" those #if function blocks.

Sequences of adjacent list item lines similarly may need to be reassembled into a properly-nesting structure of lists and list items when producing HTML or similar output -- they too may come from a combination of different template/function nesting levels.

Fix-ups in different stages of parsing
Fix-ups can be distributed across the different stages of parser processing:
 * Grammar
 * Parser actions associated with each grammar rule
 * Parse tree walkers, which try to modify the parse tree or even serialize and re-parse some sections, for example after a template expansion.

Fix-ups in the grammar
Adding some fix-ups directly in the grammar can simplify further processing by reducing the number and severity of fix-ups needed in later stages. Examples of grammar fix-ups are implied (optional) end tags for inline structures on block boundaries, or the creation of end-tag tokens for stray or improperly nested end tags.

Parser actions
Parser actions are run while processing a matching grammar rule, and are generally responsible for building the returned tree structure. In some parsers, actions can additionally abort a match, which makes the parser try alternative parses from other rules. The structure transformation role overlaps with that of parse tree walkers, but actions can be more efficient as their scope is more targeted.

An example of a slightly more elaborate parser action is the current list processing code. List items with differing combinations of bullet prefixes are wrapped in nested list structures using a stack. Expansion of the template in this example is, however, harder to integrate with a parser action:


 * some page
 * * First bullet
 * start template
 * # Second bullet
 * * Third bullet
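The stack-based wrapping of list items can be sketched as follows. This is a simplified illustration and not the actual MediaWiki list code: it handles only `*` and `#` prefixes, and it assumes that all lines are already available as text, which is exactly what late template expansion does not guarantee. The function name `nest_list_lines` is made up for this example.

```python
LIST_TAGS = {"*": "ul", "#": "ol"}


def nest_list_lines(lines):
    """Wrap lines with prefixes such as '*', '#', '*#' in nested HTML lists."""
    out = []
    stack = []  # markers of the currently open lists, outermost first
    for line in lines:
        text = line.lstrip("*#")
        prefix = line[: len(line) - len(text)]
        # Count how many open lists this line's prefix shares with the stack.
        common = 0
        while (common < min(len(stack), len(prefix))
               and stack[common] == prefix[common]):
            common += 1
        # Close lists (and their open items) that no longer apply.
        while len(stack) > common:
            out.append("</li></%s>" % LIST_TAGS[stack.pop()])
        if not prefix:
            out.append(text)  # not a list line, pass through
            continue
        if len(prefix) == common:
            out.append("</li>")  # same list continues: close previous item
        for i, marker in enumerate(prefix[common:]):
            if i > 0:
                out.append("<li>")  # keep nested lists inside an item
            out.append("<%s>" % LIST_TAGS[marker])
            stack.append(marker)
        out.append("<li>" + text.strip())
    while stack:
        out.append("</li></%s>" % LIST_TAGS[stack.pop()])
    return "".join(out)


html = nest_list_lines(["* First bullet", "# Second bullet", "* Third bullet"])
print(html)
```

With all three lines visible at once the stack logic is straightforward. If the middle line only appears after a template has been expanded, the action processing the first and third lines has no way of knowing that the surrounding list must be split.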

Parse tree visitors
Parse tree visitors walk a completely built parse tree, and expand or generally reorder the tree. A possible task would be template expansions.

Late template expansion fix-ups in parse tree visitors
Late textual template expansion can affect other parts of the tree, which appears hard to fix using operations on an AST alone. A possible alternative is a scoped re-serialization of the AST back to wiki source followed by a re-parse with document model-related parameters set according to its location in the full AST.

The scope to serialize should be relatively simple to determine in cases such as this:


 * some page
 * start template
 * row template
 * || blah
 * end template
 * |}
 * end template
 * |}

Unfortunately, not all templates match up to pairs within a small containing scope as in this example. In general, any part of the document after a template expansion might need to be re-parsed, and the precise determination of a minimal scope appears to be difficult.

Traditional pipeline
The traditional pipeline used in the current MediaWiki parser and most (all?) compilers for languages with textual preprocessors first performs a shallow parse followed by preprocessor expansion. In most cases, this is then followed by a full parse on the resulting textual representation or an expanded and patched-up token stream. Another example of this approach is the Sweble wiki parser.

This order of things avoids the thorny issues associated with late (textual) template expansion. Disadvantages of this approach include the difficulty of tracking the origin of parts of the resulting source for editing purposes, and the potential performance impact of re-parsing full documents after template expansion.

IDEs for languages like C++ face the same issues, and seem to have found solutions to these problems.

Interaction of textual templates and parser functions with the Visual Editor
Visual WYSIWYG editors guide users in their interaction with elements of a page. Modifications are mapped to modifications of the corresponding constructs at the DOM or even source level. This model, however, relies on a direct mapping between visible elements of a page and DOM or syntactical structures.

Textual templates and parser functions can break this direct correspondence. In extreme cases, visible elements of a page might be the result of parsing a concatenation of individual characters from many sources. Constructing a UI for this in terms of the visible page elements appears to be a daunting mission at best.

Tracking the origin of parts of the page source
IDEs for programmers operate on the source level, where UI interaction can be more readily associated with bits of the visible source or function definitions. Tracking of source locations despite preprocessor expansions is needed for functions such as 'show definition'. An example of a solution is to build a map of expanded (absolute) addresses to ranges within files or templates.
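Such a map from expanded offsets to source ranges could look roughly like the sketch below. The `Segment` and `OriginMap` names, the page and template labels and the offsets are assumptions made for this example; they do not describe an existing MediaWiki or IDE data structure.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Segment:
    start: int          # absolute offset in the expanded text
    length: int         # number of characters covered
    origin: str         # page or template the text came from
    origin_offset: int  # offset of the text within that origin


class OriginMap:
    """Maps offsets in expanded text back to (origin, offset) pairs."""

    def __init__(self):
        self.segments = []
        self.starts = []  # segment start offsets, kept sorted for bisect
        self.length = 0

    def append(self, text, origin, origin_offset=0):
        if not text:
            return  # empty expansions contribute no addresses
        self.segments.append(
            Segment(self.length, len(text), origin, origin_offset))
        self.starts.append(self.length)
        self.length += len(text)

    def lookup(self, offset):
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        # Last segment whose start is <= offset.
        seg = self.segments[bisect.bisect_right(self.starts, offset) - 1]
        return seg.origin, seg.origin_offset + (offset - seg.start)


# Hypothetical page source: "Intro\n{{Start}}\n| blah\n", where Start = "{|"
origins = OriginMap()
origins.append("Intro\n", "Some page", 0)
origins.append("{|", "Template:Start", 0)
origins.append("\n| blah\n", "Some page", 15)
print(origins.lookup(7))   # position inside the template text
print(origins.lookup(10))  # position in the page text after the call
```

The lookup is a binary search over segment starts, so the cost of resolving a position stays logarithmic in the number of expansions.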

Most templates in actual use interleave at the text or section level, with their origin typically aligning with structure (e.g. infoboxes). In these cases, origin tracking could be sufficient to construct a workable UI for template-generated content.

Related work on fix-ups

 * The current MediaWiki parser implicitly closes inline tags (see e.g. Parser::doBlockLevels and many line-based regexps)
 * HTML tidy, which is commonly used to clean up MediaWiki output, implements further heuristics in src/parser.c (see ParseInline and ParseBlock)
 * The HTML 5 spec standardizes fix-up algorithms in chapter 8. Different insertion modes depending on the element nesting determine the handling of formatting elements, the recovery from mis-nested, overlapping or disallowed elements, and other details. The spec differs slightly from the HTML tidy behavior, which has not been updated in recent years. Overall this appears to be the best and most systematic treatment of the problem, and promises good compatibility with browser and past tidy behavior.
 * The Sweble parser implements fix-ups both in grammar productions and AST visitors. See in particular this pre-print for an overview.
 * Late template expansion somewhat overlaps with the incremental compilation used in some IDEs. This article on LTU has some nice pointers into this space. Actual incremental parsing appears to be rare and restricted to generally slow compilers (an early Scala IDE was mentioned). More common appears to be highly optimized parsing of the full file in a background thread, often with a simplified grammar that establishes nestings and types only. In particular, the C# compiler in Visual Studio is mentioned as an example of this principle.

Preliminary summary
Unbalanced textual templates are clearly hard to support in parsing and visual editing. Consequently, the consensus in DataSummitParsers clearly favors properly nested templates. Satisfactory replacements for common uses of non-nested templates however still need to be developed, and the transition needs time in any case. In the meantime, support for 'legacy' non-nested templates is still needed.

The consensus at the parser summit also favors template expansions as late in the pipeline as possible (DataSummitParsers line 97). This is no problem for balanced templates, but makes the support of traditional unbalanced templates very challenging.

A rough sketch of the relative merits might be (please add your points!):

Late template expansion
 * + Avoids re-parsing the full document and reconstructing the DOM, potentially faster
 * + Handles expansion of balanced constructs without problems
 * - Needs complicated and error-prone fix-ups and affected subtree identification for unbalanced and textual constructs (templates)
 * (-) Needs fix-ups to handle DOM-level nesting constraints (inline vs. block-level etc.) (maybe not relevant if a visitor is used for regular fix-ups)

Early template expansion
 * + Simple and proven design, no elaborate AST juggling
 * + Handles expansion of balanced constructs without problems
 * - Needs to re-parse the full document after template expansion (but the first parse can be very shallow and thus fast)

Implicit close of inline elements in line-based constructs
Tidy currently expands unclosed html-based inline tags beyond the scope of line-based constructs by re-opening them explicitly. In the following example, the second bullet and even any following text will thus appear in red:
 * This is red
 * Also red



This differs from the more natural implicit close of the span within the first list item. Without tidy, Firefox renders only the first bullet in red.

A specification of these kinds of fix-ups should at least be consistent across element types, even if that requires a departure from inconsistent application-defined behavior and thus produces minor differences in output.
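A consistent implicit close at the end of a line-based construct could be sketched as below. This illustrates the per-item closing behavior described above as more natural, not what tidy does. The set of inline elements, the `close_inline_tags` name and the `style` attribute in the usage example are assumptions for this sketch.

```python
import re

# Matches opening and closing tags; group 1 is "/" for closing tags.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# Assumed set of inline elements that are closed implicitly.
INLINE = {"span", "b", "i", "em", "strong", "small", "big",
          "u", "s", "font", "code"}


def close_inline_tags(item):
    """Append end tags for inline elements left open within one item."""
    open_tags = []
    for match in TAG.finditer(item):
        closing, name = match.group(1), match.group(2).lower()
        if name not in INLINE:
            continue
        if not closing:
            open_tags.append(name)
        elif name in open_tags:
            # Pop up to and including the matching open tag. Elements opened
            # in between are treated as closed too; this sketch does not
            # repair such mis-nesting.
            while open_tags.pop() != name:
                pass
        # Stray end tags without a matching open tag are ignored here.
    return item + "".join("</%s>" % name for name in reversed(open_tags))


items = ['<span style="color:red">This is red', "Also red"]
closed = [close_inline_tags(item) for item in items]
print(closed)
```

Because every inline element in the set is handled by the same rule, the result does not depend on the element type, which is the consistency asked for above.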