Parsoid/Token stream transformations

From mediawiki.org

(Longer term) objectives[edit]

  • Parallelism: Allow the tokenizer, token transformations, and tree builder to execute in parallel on separate cores. Idea: single pass over a token stream with minimal buffering.
  • Concurrent IO: Support overlapping and batching of IO from transformations (template fetching etc).
  • Generality and modularity: Make it easy to plug in transformations for new features, input sources etc.
  • Backwards compatibility: Provide support for extension APIs through wrappers.

Some of these objectives are not easy to achieve in the short term, but considering them in the architecture should help to move towards them.

Overview and current status[edit]

Most wiki-specific functionality is implemented in token stream transformations, which are dispatched using a registration mechanism by token type. A token transform can perform the following actions on each token:

  • token deletion: aborts further processing for this token
  • token expansion: registered handlers for each of the returned tokens are called
  • token modification: If the token type is unchanged, pass the token to the next transformation for this token type. If the type was changed, call handlers for the new type.

Transformations are grouped in three phases:

  1. In-order per input source (so not necessarily globally in-order including templates etc):
    • Input token adaptations depending on the input format
    • Parser hooks / extensions. In-order is needed to collect input tokens, but the actual execution can overlap with template expansions. Setting the phase on the result tokens to one past sanitation can be used to selectively disable further processing per-token.
  2. Out-of-order: template expansion, link / image / media / category handling, section linking etc. These all operate only on a single input token (which can hold further tokens in properties though), so order is immaterial. This allows the overlapping of IO and potentially a parallel execution of computationally expensive transformations.
  3. Globally in-order after all expansions:
    • MediaWiki quote to italic/bold conversion, Cite handling, listItem to list conversion
    • Last: Output sanitation. Enforce tag / attribute whitelists and sanitize attribute values after all hooks are handled

To support rendering of unbalanced templates with wiki syntax (e.g., the table start / row / table end combinations) it is necessary to expand templates as early as possible. This should result in support for the common uses of unbalanced templates, but not for everything the current textual expansion can handle. The hope is that the use of the unsupported part is sufficiently rare in practice (?). If you are aware of templates with broken-up structural wiki syntax (not html tags) other than tables and lists, then please leave a note! Things like nested templates are not problematic as long as those can be expanded by concatenating tokens of parts or performing some very limited flattening of tokens back to plain text. The template tricks listed in meta:Help:Advanced_templates should also be doable at this level. The current implementation should now provide much of this, but is still mostly untested and will still have rough edges.

Minimal buffering using the TokenAccumulator class[edit]

The asynchronous second transformation phase allows the overlapping of IO, for example for template fetches, extension processing or link resolution. The needed buffering between asynchronous processing points and as-early-as-possible emission of result tokens is handled by the TokenAccumulator class.

 frame
	\
  accumulator1    ----  accumulator2		  ----  accumulator3
  parent: frame		parent: accumulator1		parent: accumulator2
  frame: frame		frame: frame			frame: frame
  outstanding: 2	outstanding: 2			outstanding: 1

Template expansion[edit]

The template expansion implementation is based on a phase-2 handler for template and templatearg tags. After processing and potentially template expansion on attributes (template title, argument names and values), the template source is fetched and parsed using a partial (and independent) pipeline:

    | wikitext                                                             | template wikitext from template 
    V                                                                      V
PEG wiki/HTML tokenizer   (or other tokenizers / SAX-like parsers)     PEG wiki/HTML tokenizer
    | Chunks of tokens                                                     | Chunks of tokens
    V                                                                      V
Token stream transformation: synchronous in-order per input            Token stream transformation: synchronous in-order per input
    | Chunks of tokens                                                     | Chunks of tokens
    V                                                                      V
Token stream transformation: asynchronous out-of-order per input       Token stream transformation: asynchronous out-of-order per input
    | Chunks of tokens                                                     | Chunks of tokens
    +<---------------------------------------------------------------------+
    |                                 Template expansion
    V
Token stream transformation: synchronous and globally in-order
    | Chunks of tokens                                                 
    V
HTML5 tree builder 
    | HTML 5 DOM tree
    V
DOM Postprocessors 
    | HTML5 DOM tree
    V
(X)HTML serialization
    |
    +----------> Browser
    V
Visual Editor

Limits on template expansion depth and loop detection[edit]

  • MediaWiki limits the expansion depth to 40 by default, as xdebug limits the stack depth to 100 (see DefaultSettings)
  • Browsers seem to support stacks 500+ deep though, so tail call optimization for callback chains is not urgent: [1][2][3][4]
  • Loop detection: don't expand parent titles in children (implemented as a linked list to allow sharing between async expansions)