Parsoid/Token stream transformations

Overview and current status
The idea is to implement most wiki-specific parser functionality in token stream transformations, which are dispatched using a registration mechanism by token type. A token transform can perform the following actions on each token:


 * token deletion: aborts further processing for this token
 * token expansion: registered handlers for each of the returned tokens are called
 * token modification: If the token type is unchanged, pass the token to the next transformation for this token type. If the type was changed, call handlers for the new type.

The order of handlers can be specified using a simple prepend/append api for now. Syntax-specific transformations on a token can register for early processing, so that later transformations on a token can operate on a normalized version of the token. MediaWiki's special quote handling for italic/bold for example is implemented in a core extension that registers handlers for 'tag quote', 'newline' and the special 'eof' token. Lists and a simple version of the Cite extension are similarly implemented. A general emulation of parser hook behavior on top of the token stream should be quite straightforward. Either collected tokens between tags or plain text based on source positions noted in tokens are available.

The token transform dispatcher class is prepared for asynchronous processing of tokens, which is already used in a synchronous fashion for the back-reference behavior of the italic/bold extension. This ability to overlap operations on multiple tokens will be important for template expansions. Doing template expansions on the token level makes it possible to render unbalanced templates like the table start / row / end combinations for viewing, while encapsulating those if the output is destined for the visual editor. Template expansion is currently WIP.

Objectives

 * Parallelism: Allow the tokenizer, token transformations, and tree builder to execute in parallel on separate cores. Idea: single pass over a token stream with minimal buffering.
 * Concurrent IO: Support overlapping and batching of IO from transformations (template fetching etc).
 * Generality and modularity: Make it easy to plug in transformations for new features, input sources etc.
 * Backwards compatibility: Provide support for extension APIs through wrappers.

Design ideas for the next steps
Random ideas and notes, very much WIP. Please edit, add or comment!

Ordering of transformations in phases
The current implementation provides control over the order of handlers with simple prepend/append operations. A move to numeric priorities could enable a phased structure, in which each token passes through multiple phases in which only transformations starting from this phase are applied to it. This is especially interesting when a token is converted to a different token type (or multiple tokens of a different type), as the processing then needs to restart on the new tokens. Marking a phase on the token should help to make sure that transformations are only applied once per token in general.

Candidate phases:
 * 1) Input format specific conversions. Examples: MediaWiki list and quote handling
 * 2) Evaluation order independent transformations: template expansion, link/image expansion, section linking etc
 * 3) * Last: Output sanitation. Enforce tag / attribute whitelists
 * 4) Evaluation order sensitive, but safe transformations: TOC extraction, Cite link numbering

Support for unbalanced templates with wiki syntax (e.g., the table start / row / table end combinations) complicates this scheme. List handling for example will need to be split in one part running before template expansions, and another for building the list after everything is expanded. In any case, a token-based template expansion will not be able to support everything the current textual expansion can handle. The hope is that the use of the unsupported part is sufficiently rare in practice.

Template frames: argument dict, barrier synchronization for async expansions

 * arg expansion for templates: arg key passed in callback
 * arg-in-arg: substitute from parent frame
 * template-in-arg: new accumulators for plain tokens; new accumulator + frame for each template with parent set to accumulator, parent->addOutstanding to increment outstanding counter
 * flags to toggle inclusion vs. direct page view mode and arg/template expansion

Generic callbacks for async expansions: parent.returnTokens( tokens, reference, notYetDone=false )
returns null if done, new parent (same ref) otherwise
 * templateframe.returnTokens( tokens, this.parentref ) // ref for title: null
 * accumulator.returnTokens( tokens, this.parentref )
 * Passed to children for later callback: reference, parent

Rough sketch of an async completion / callback hierarchy for a template: root frame (empty args) template frame title accumulator1 (parent = templateframe, parentref = null) template frame (parent = accumulator1, parentref = null) accumulator2 (parent = accumulator1, parentref = firstparent) arg1 accumulator1 (parent = templateframe, parentref = null) template frame (parent = accumulator1, parentref = null) accumulator2 (parent = accumulator1, parentref = firstparent) ..argn

Accumulators with generic returnTokens support
Accumulators should return token chunks in-order but as soon as possible to the parent.

frame \ accumulator1      accumulator2		    accumulator3 parent: frame			parent: accumulator1		parent: accumulator2 frame: frame			frame: frame				frame: frame outstanding: 2		outstanding: 2				outstanding: 1

Accumulator todo:
 * change order to expansion node first, followed by plain tokens
 * now need way to access predecessor token, for example for quote handling (to determine if quotes are preceded by a space or word)
 * cannot edit predecessor if already written out, but can read written-out predecessor. Sufficient for quote handling, but sufficient in general?

Limits on expansion depth and loop detection

 * MediaWiki limits the expansion to 40 by default, as xdebug limits the stack depth to 100 (see DefaultSettings)
 * Browsers seem to support stacks 500+ deep though, so tail call optimization for callback chains is not urgent:
 * Loop detection: don't expand parent titles in children