Parsoid/Token stream transformations

Overview and current status
The basic idea is to implement most wiki-specific parser functionality in token stream transformations, which are dispatched using a registration mechanism by token type. A token transform can perform the following actions on each token:


 * token deletion: aborts further processing for this token
 * token expansion: registered handlers for each of the returned tokens are called
 * token modification: If the token type is unchanged, pass the token to the next transformation for this token type. If the type was changed, call handlers for the new type.

The order of handlers can currently be specified using a simple prepend/append API. (But this should probably be changed to a grouping of transformations in phases.) Syntax-specific transformations on a token can register for early processing, so that later transformations can operate on a normalized version of the token. MediaWiki's special quote handling for italic/bold, for example, is implemented in a core extension that registers handlers for 'tag quote', 'newline' and the special 'eof' token. Lists and a simple version of the Cite extension are implemented similarly. A general emulation of parser hook behavior on top of the token stream should be quite straightforward: a hook can be given either the tokens collected between its tags or the plain text recovered from the source positions noted in tokens.
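As a rough illustration of the registration mechanism and the three token actions described above, a dispatcher could look something like the following sketch. All names here (TokenTransformDispatcher, appendHandler, prependHandler, the token shapes) are assumptions made for the example and are not the actual Parsoid API. A handler returns null to delete a token, an array to expand it, or a single token to modify it.

```javascript
// Hypothetical sketch; class, method and token names are illustrative only.
class TokenTransformDispatcher {
  constructor() {
    this.handlers = new Map(); // token type -> ordered array of handlers
  }

  chainFor(type) {
    if (!this.handlers.has(type)) this.handlers.set(type, []);
    return this.handlers.get(type);
  }

  // Register a handler at the end of the chain for a token type.
  appendHandler(type, handler) { this.chainFor(type).push(handler); }

  // Register a handler at the front of the chain (early processing).
  prependHandler(type, handler) { this.chainFor(type).unshift(handler); }

  // Run one token through its handlers and return the resulting tokens.
  transformToken(token) {
    const chain = this.handlers.get(token.type) || [];
    for (const handler of chain) {
      const result = handler(token);
      if (result === null) return []; // deletion: stop processing
      if (Array.isArray(result)) {
        // expansion: dispatch each returned token on its own
        return result.flatMap((t) => this.transformToken(t));
      }
      if (result.type !== token.type) {
        // modification with a type change: call handlers for the new type
        return this.transformToken(result);
      }
      token = result; // same type: continue with the next handler
    }
    return [token];
  }

  transformTokens(tokens) {
    return tokens.flatMap((t) => this.transformToken(t));
  }
}

const dispatcher = new TokenTransformDispatcher();
dispatcher.appendHandler('comment', () => null);
dispatcher.appendHandler('listItem', (t) => [
  { type: 'tag', name: 'li' },
  { type: 'text', value: t.value },
]);
dispatcher.appendHandler('quote', () => ({ type: 'tag', name: 'i' }));
dispatcher.appendHandler('text', (t) => ({ ...t, value: t.value + '!' }));
// Prepended, so it runs before the handler appended above.
dispatcher.prependHandler('text', (t) => ({ ...t, value: t.value.trim() }));

const output = dispatcher.transformTokens([
  { type: 'comment', value: 'hidden' },
  { type: 'listItem', value: ' item ' },
  { type: 'quote' },
]);
console.log(output);
```

Note that in this simple form a handler that expands a token into tokens of the same type would be called again on its own output. Avoiding this is one motivation for marking a phase on each token.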

The token transform dispatcher class is prepared for asynchronous processing of tokens, which is already used in a synchronous fashion for the back-reference behavior of the italic/bold extension. This ability to overlap operations on multiple tokens will be important for template expansions. Doing template expansions on the token level makes it possible to render unbalanced templates like the table start / row / end combinations for viewing, while encapsulating those if the output is destined for the visual editor. Template expansion is currently work in progress, and seems to be the hardest nut to crack. Full compatibility with the current text-based preprocessor will not be possible on a token stream, but the hope is that all commonly used parts of the template system can be supported eventually.

(Long term) objectives

 * Parallelism: Allow the tokenizer, token transformations, and tree builder to execute in parallel on separate cores. Idea: single pass over a token stream with minimal buffering.
 * Concurrent IO: Support overlapping and batching of IO from transformations (template fetching etc).
 * Generality and modularity: Make it easy to plug in transformations for new features, input sources etc.
 * Backwards compatibility: Provide support for extension APIs through wrappers.

These objectives are clearly not achievable in the short term, but considering them in the architecture could help to move towards them.

Design ideas for the next steps
Random ideas and notes, very much work in progress. Please edit, add or comment!

Grouping of transformations in phases
The current implementation provides control over the order of handlers with simple prepend/append operations. A move to numeric priorities could enable a phased structure, in which each token passes through multiple phases in which only transformations starting from this phase are applied to it. This is especially interesting when a token is converted to a different token type (or multiple tokens of a different type), as the processing then needs to restart on the new tokens. Marking a phase on the token should help to make sure that transformations are only applied once per token in general.

Rough sketch of phases:
 * 1) In-order per input source (so not necessarily globally in-order including templates etc):
    * Input token adaptations depending on the input format
    * Parser hooks / extensions. In-order is needed to collect input tokens, but the actual execution can overlap with template expansions. Setting the phase on the result tokens to one past sanitation can be used to selectively disable further processing per-token.
 * 2) Out-of-order: template expansion, link / image / media / category handling, section linking etc. These all operate only on a single input token (which can hold further tokens in properties though), so order is immaterial. This allows the overlapping of IO and potentially a parallel execution of computationally expensive transformations.
 * 3) Globally in-order after all expansions:
    * MediaWiki quote to italic/bold conversion, TOC extraction, Cite link numbering, table validation, listItem to list conversion
    * Last: Output sanitation. Enforce tag / attribute whitelists after all hooks are handled
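The numeric-priority idea could be sketched roughly as follows. This is only one possible reading of the notes above: handlers carry a numeric rank, each token remembers the rank of the last handler applied to it, and processing of converted or expanded tokens restarts only with handlers of a higher rank. The names PhasedDispatcher, addHandler and the rank property are hypothetical, and ranks are assumed to be distinct for handlers that should both run on the same token.

```javascript
// Hypothetical sketch; none of these names are part of Parsoid.
const rankOf = (token) => (token.rank === undefined ? -Infinity : token.rank);

class PhasedDispatcher {
  constructor() {
    this.handlers = new Map(); // token type -> handlers sorted by rank
  }

  addHandler(type, rank, fn) {
    const chain = this.handlers.get(type) || [];
    chain.push({ rank, fn });
    chain.sort((a, b) => a.rank - b.rank);
    this.handlers.set(type, chain);
  }

  transformToken(token) {
    const chain = this.handlers.get(token.type) || [];
    for (const { rank, fn } of chain) {
      // Skip handlers at or below the rank already marked on the token.
      if (rank <= rankOf(token)) continue;
      const result = fn(token);
      if (result === null) return [];
      const results = Array.isArray(result) ? result : [result];
      // Mark results so this handler (and earlier ones) never sees them again.
      // A handler may set a higher rank itself to skip later phases.
      for (const t of results) t.rank = Math.max(rankOf(t), rank);
      const sameType = results.length === 1 && results[0].type === token.type;
      if (!sameType) {
        // Restart on the new tokens, beginning after the recorded rank.
        return results.flatMap((t) => this.transformToken(t));
      }
      token = results[0];
    }
    return [token];
  }
}

const SPLIT = 10, UPPER = 20, SANITIZE = 30;
const pd = new PhasedDispatcher();
pd.addHandler('text', SANITIZE, (t) => ({ ...t, sanitized: true }));
pd.addHandler('text', SPLIT, (t) =>
  t.value.split(' ').map((w) => ({ type: 'text', value: w })));
pd.addHandler('text', UPPER, (t) => ({ ...t, value: t.value.toUpperCase() }));
// A parser-hook-like handler: its result is ranked one past sanitation,
// which disables all further processing for that token.
pd.addHandler('ext', SPLIT, (t) =>
  ({ type: 'text', value: t.value, rank: SANITIZE + 1 }));

const phased = [
  { type: 'text', value: 'a b' },
  { type: 'ext', value: 'raw' },
].flatMap((t) => pd.transformToken(t));
console.log(phased);
```

Here the split handler returns tokens of its own input type without being re-applied to them, and the hook output bypasses the sanitation handler because of its rank.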

To support rendering of unbalanced templates with wiki syntax (e.g., the table start / row / table end combinations) it is necessary to expand templates as early as possible. This should result in support for the common uses of unbalanced templates, but not for everything the current textual expansion can handle. The hope is that the use of the unsupported part is sufficiently rare in practice (?). If you are aware of templates with broken-up structural wiki syntax (not html tags) other than tables and lists, then please leave a note! Are there headings where start and end of the heading line is in different templates? Or links where the opening bracket is in one template, and the closing bracket in another? Things like nested templates are not problematic as long as those can be expanded by concatenating tokens of parts or performing some very limited flattening of tokens back to plain text. The template tricks listed in meta:Help:Advanced_templates should also be doable at this level. Brion has already implemented this for a parse tree. This needs to be ported to work with token streams.

Generic callbacks for async expansions: parent.returnTokens( tokens, reference, notYetDone=false )
returns null if done, new parent (same ref) otherwise
 * templateframe.returnTokens( tokens, this.parentref ) // ref for title: null
 * accumulator.returnTokens( tokens, this.parentref )
 * Passed to children for later callback: reference, parent

Rough sketch of an async completion / callback hierarchy for a template:

 root frame (empty args)
   template frame
     title
       accumulator1 (parent = templateframe, parentref = null)
         template frame (parent = accumulator1, parentref = null)
         accumulator2 (parent = accumulator1, parentref = firstparent)
     arg1
       accumulator1 (parent = templateframe, parentref = null)
         template frame (parent = accumulator1, parentref = null)
         accumulator2 (parent = accumulator1, parentref = firstparent)
     ..argn

Accumulators with generic returnTokens support
Accumulators buffer fully processed tokens between asynchronous expansion points. They thus provide a kind of linked barrier with buffering, that returns its contents in ordered chunks to its parent as soon as possible.

 frame
   \
    accumulator1          accumulator2            accumulator3
    parent: frame         parent: accumulator1    parent: accumulator2
    frame: frame          frame: frame            frame: frame
    outstanding: 2        outstanding: 2          outstanding: 1

Example execution:
 * 1) accumulator 1 is done, calls this.parent.returnTokens(tokens, this.reference, true) and decrements outstanding (now 1). Setting the notYetDone flag to true signals the frame that further returnTokens calls will follow.
 * 2) accumulator 3 is done too, calls accumulator2.returnTokens(tokens, this.reference, false) (it is done since outstanding was 1). Accumulator2 simply saves the returned tokens to a returnBuffer and decrements outstanding to 1. If its outstanding count had already been 1, it would instead call returnTokens on its own parent and return that parent to accumulator3.
 * 3) accumulator 2 finishes and (as outstanding is 1 by now) calls this.parent.returnTokens(this.returnBuffer.concat(tokens), this.reference, false)

Another example execution:
 * 1) accumulator 2 finishes first, calls accumulator1.returnTokens(tokens, this.reference, true) (notYetDone as outstanding was > 1) and decrements outstanding. accumulator 1 does not decrement outstanding, as notYetDone was set, and adds the tokens to this.returnBuffer. It returns null to accumulator2, so accumulator2's parent keeps pointing to accumulator1.
 * 2) accumulator3 finishes next and calls accumulator2.returnTokens(tokens, this.reference, false). As notYetDone is false, accumulator2 decrements its outstanding count, which now reaches zero, and calls accumulator1.returnTokens(tokens, this.reference, false) with the passed-in tokens. accumulator1 appends these tokens to this.returnBuffer and now decrements outstanding, as notYetDone was false.
 * 3) accumulator1 finishes last and calls its parent (the frame).returnTokens(tokens.concat(this.returnBuffer), this.reference, false), with notYetDone false as outstanding was 1.
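The second example execution can be traced with a small sketch like the one below. It is a simplified model under several assumptions: the Accumulator and CollectingFrame classes, the finish() method and the selfDone flag are invented for illustration, an accumulator's own tokens are always placed before buffered tokens from its successors, and the optimization of returning a new parent to the caller is left out (returnTokens always returns null here).

```javascript
// Hypothetical sketch of the returnTokens protocol; not the Parsoid classes.
class Accumulator {
  constructor(parent, reference, outstanding) {
    this.parent = parent;
    this.reference = reference;
    this.outstanding = outstanding; // own completion + pending successors
    this.returnBuffer = [];
    this.selfDone = false;
  }

  // Called when this accumulator's own tokens are fully processed.
  finish(tokens) {
    const notYetDone = this.outstanding > 1;
    this.outstanding--;
    this.selfDone = true;
    const out = tokens.concat(this.returnBuffer);
    this.returnBuffer = [];
    this.parent.returnTokens(out, this.reference, notYetDone);
  }

  // Called by a successor that hands its tokens back to us.
  returnTokens(tokens, reference, notYetDone = false) {
    if (!notYetDone) this.outstanding--;
    if (this.selfDone) {
      // Our own tokens were already passed on: forward immediately.
      this.parent.returnTokens(tokens, this.reference, this.outstanding > 0);
    } else {
      // Still working on our own tokens: buffer in order of arrival.
      this.returnBuffer.push(...tokens);
    }
    return null; // simplification: the caller's parent never changes
  }
}

// Stand-in for a frame: records the chunks it receives.
class CollectingFrame {
  constructor() {
    this.chunks = [];
    this.done = false;
  }
  returnTokens(tokens, reference, notYetDone = false) {
    this.chunks.push(tokens);
    this.done = !notYetDone;
    return null;
  }
}

const frame = new CollectingFrame();
const acc1 = new Accumulator(frame, 'arg1', 2);
const acc2 = new Accumulator(acc1, 'arg1', 2);
const acc3 = new Accumulator(acc2, 'arg1', 1);

acc2.finish(['b']); // buffered in acc1, notYetDone = true
acc3.finish(['c']); // forwarded through acc2 into acc1's buffer
acc1.finish(['a']); // acc1 emits its tokens followed by the buffer
console.log(frame.chunks, frame.done);
```

With this order of completion the frame receives a single ordered chunk. If accumulator1 finished first instead, it would pass its own tokens on early with notYetDone set and forward the later chunks as they arrive.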

Accumulator todo:
 * change order to expansion node first, followed by plain (fully processed) tokens
 * now need way to access predecessor token, for example for quote handling (to determine if quotes are preceded by a space or word)
 * cannot edit predecessor if already written out, but can read written-out predecessor. Sufficient for quote handling, but sufficient in general?

Template frames
Template frames provide an argument dict and a general barrier synchronization point in the returnTokens framework for async argument and title expansion. The reference to the containing frame thus needs to be passed down the expansion hierarchy, with an empty top-level frame provided for the root page.
 * The reference passed into returnTokens is used to identify the argument the returned tokens belong to, with null signifying the template title
 * Need to flatten the template name and argument names from (hopefully) text tokens to plain text before expanding the template
 * Need flags to toggle inclusion vs. direct page view mode and arg/template expansion

Expansion details:
 * arg-in-arg: substitute from parent frame
 * template-in-arg: new accumulators for plain tokens; new accumulator + frame for each template with parent set to accumulator, increment outstanding counter for each async child
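A template frame acting as a barrier for title and argument expansion might look roughly like this. The sketch assumes that argument references are already plain-text keys, that text tokens have the shape { type: 'text', value }, and that the frame is told up front how many arguments to wait for. TemplateFrame, flattenToText and onReady are hypothetical names.

```javascript
// Hypothetical sketch; token shape and all names are assumptions.

// Flatten a list of text tokens to plain text; anything else is an error.
function flattenToText(tokens) {
  return tokens.map((t) => {
    if (t.type !== 'text') {
      throw new Error('Cannot flatten token of type ' + t.type);
    }
    return t.value;
  }).join('');
}

class TemplateFrame {
  constructor(parentFrame, argCount, onReady) {
    this.parentFrame = parentFrame; // containing frame (root: empty args)
    this.args = new Map();          // argument dict: key -> tokens
    this.titleTokens = [];
    this.outstanding = argCount + 1; // all arguments plus the title
    this.onReady = onReady;
  }

  // reference === null identifies the title, anything else an argument.
  returnTokens(tokens, reference, notYetDone = false) {
    if (reference === null) {
      this.titleTokens.push(...tokens);
    } else {
      if (!this.args.has(reference)) this.args.set(reference, []);
      this.args.get(reference).push(...tokens);
    }
    if (!notYetDone) {
      this.outstanding--;
      if (this.outstanding === 0) {
        // Barrier reached: the title and all arguments are expanded.
        this.onReady(flattenToText(this.titleTokens).trim(), this.args);
      }
    }
    return null;
  }
}

const rootFrame = { args: new Map() }; // empty top-level frame
let ready = null;
const templateFrame = new TemplateFrame(rootFrame, 1, (title, args) => {
  ready = { title, args };
});

templateFrame.returnTokens([{ type: 'text', value: 'Info' }], null, true);
templateFrame.returnTokens([{ type: 'text', value: 'box ' }], null);
templateFrame.returnTokens([{ type: 'text', value: 'v' }], '1');
console.log(ready.title, ready.args.get('1'));
```

The callback fires only after the title and every argument have reported completion, so the actual template fetch can start from a fully flattened name.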

Limits on expansion depth and loop detection

 * MediaWiki limits the expansion depth to 40 by default, as xdebug limits the stack depth to 100 (see DefaultSettings)
 * Browsers seem to support stacks 500+ deep though, so tail call optimization for callback chains is not urgent.
 * Loop detection: don't expand parent titles in children
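A minimal sketch of both checks, assuming each expansion frame keeps its title and a link to its parent frame. ExpansionFrame, newChild and MAX_EXPANSION_DEPTH are illustrative names, and the limit of 40 simply mirrors the MediaWiki default mentioned above.

```javascript
// Hypothetical sketch; names are illustrative only.
const MAX_EXPANSION_DEPTH = 40; // mirrors the MediaWiki default noted above

class ExpansionFrame {
  constructor(title, parent = null) {
    this.title = title;
    this.parent = parent;
    this.depth = parent ? parent.depth + 1 : 0;
  }

  // True if the title is already being expanded in this frame or an ancestor.
  isBeingExpanded(title) {
    for (let f = this; f !== null; f = f.parent) {
      if (f.title === title) return true;
    }
    return false;
  }

  // Create a child frame for a template, refusing loops and deep nesting.
  newChild(title) {
    if (this.isBeingExpanded(title)) {
      throw new Error('Template loop detected: ' + title);
    }
    if (this.depth + 1 > MAX_EXPANSION_DEPTH) {
      throw new Error('Expansion depth limit exceeded');
    }
    return new ExpansionFrame(title, this);
  }
}

const root = new ExpansionFrame('Main Page');
const frameA = root.newChild('Template:A');
const frameB = frameA.newChild('Template:B');

let loopError = null;
try {
  frameB.newChild('Template:A'); // A is already a parent of B
} catch (e) {
  loopError = e.message;
}
console.log(loopError);
```

Walking the parent chain costs time proportional to the depth, which stays small as long as the depth limit is enforced.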