Parsing/Compatibility serialization

An attempt at specifying the Html5Depurate p-wrapping algorithm.

Suppose you receive a balanced SAX-like event stream consisting of "start tag", "end tag", "character" and "comment" events.

"Normal" processing constructs a DOM from this event stream as follows:


 * Start tag : Insert an element for the token, with the parent being the current element in the stack of open elements. Push the newly created element onto the stack of open elements.
 * End tag : Pop the current element from the stack of open elements.
 * Character : Insert a text node for the token, with the parent being the current element in the stack of open elements.
 * Comment : Insert a comment node for the token, with the parent being the current element in the stack of open elements.
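
The four rules above can be sketched in Python as follows. This is only an illustrative sketch: the `Node` class, the `build_dom` function, the synthetic `#root` node and the `(kind, value)` tuple encoding of events are all assumptions made for the example and are not part of Html5Depurate. Attributes are left out for brevity.

```python
class Node:
    """Minimal DOM node: an element, a text node or a comment node."""

    def __init__(self, kind, name="", data=""):
        self.kind = kind        # "element", "text" or "comment"
        self.name = name        # tag name (elements only)
        self.data = data        # character data (text and comments only)
        self.parent = None
        self.children = []

    def append(self, child):
        """Insert child as the last child of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def build_dom(events):
    """Build a tree from balanced (kind, value) events; returns the root."""
    root = Node("element", name="#root")   # hypothetical container node
    stack = [root]                         # the stack of open elements
    for kind, value in events:
        current = stack[-1]
        if kind == "start":
            stack.append(current.append(Node("element", name=value)))
        elif kind == "end":
            stack.pop()
        elif kind == "text":
            current.append(Node("text", data=value))
        elif kind == "comment":
            current.append(Node("comment", data=value))
        else:
            raise ValueError("unknown event kind: %r" % (kind,))
    return root
```

For example, the events `("start", "body")`, `("text", "hi")`, `("end", "body")` yield a root whose only child is a `body` element containing one text node.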

In what follows, an "inline element" is an element in the HTML namespace with the tag name being one of: "a", "abbr", "acronym", "applet", "b", "basefont", "bdo", "big", "br", "button", "cite", "code", "dfn", "em", "font", "i", "iframe", "img", "input", "kbd", "label", "legend", "map", "object", "param", "q", "rb", "rbc", "rp", "rt", "rtc", "ruby", "s", "samp", "select", "small", "span", "strike", "strong", "sub", "sup", "textarea", "tt", "u", "var".

A "non-inline element" is any other element.
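
As a sketch, the inline/non-inline distinction reduces to a set membership test. The names `INLINE_ELEMENTS`, `HTML_NS` and `is_inline` are hypothetical, and the example assumes tag names have already been lowercased by the parser.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"

# The tag names listed above, copied verbatim.
INLINE_ELEMENTS = frozenset(
    "a abbr acronym applet b basefont bdo big br button cite code dfn em "
    "font i iframe img input kbd label legend map object param q rb rbc rp "
    "rt rtc ruby s samp select small span strike strong sub sup textarea tt "
    "u var".split()
)


def is_inline(tag_name, namespace=HTML_NS):
    """True for an HTML-namespace element whose tag name is in the list."""
    return namespace == HTML_NS and tag_name in INLINE_ELEMENTS
```

Under this definition `is_inline("span")` is true, while `is_inline("div")` is false, as is `is_inline("a", "http://www.w3.org/2000/svg")`, so both of those are non-inline elements.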

The compatibility algorithm proceeds as follows:


 * If the current node is a <body> or <blockquote>
   * If a character token is encountered
     * Insert an mw:p-wrap element, push it onto the stack of open elements
     * Process the token as normal
   * If an inline start tag is encountered
     * Insert an mw:p-wrap element, push it onto the stack of open elements
     * Process the token as normal
 * If the current node is an mw:p-wrap
   * If an end tag is encountered
     * Close the mw:p-wrap
     * Process the token as normal
   * If a non-inline start tag is encountered
     * Close the mw:p-wrap
     * Process the token as normal
 * If the current node is anything else, but there is an mw:p-wrap in the stack
   * If a non-inline start tag is encountered
     * Consider the stack range between the current node and the mw:p-wrap
     * For each element in this stack range, not including the mw:p-wrap, clone the node
       * The intended parent of the clone of the node which is a child of the mw:p-wrap is the parent of the mw:p-wrap
       * The intended parent of each other clone is the cloned version of its original parent
     * Close all the elements in the stack range, including the mw:p-wrap
     * Process the token as normal
 * If the current node is anything else, and there is no mw:p-wrap in the stack, but there is a <body> or <blockquote> in the stack
   * If a character token is encountered
     * Consider the stack elements from the current node back to the <body> or <blockquote>
     * If any of these stack elements are non-inline elements, process the token as normal and then abort these steps.
     * Insert a new mw:p-wrap element as a child of the <body> or <blockquote>
     * Take the element in the stack of open elements which is immediately under the <body> or <blockquote>. Reparent this element so that its new parent is the mw:p-wrap.
     * Insert the mw:p-wrap into the stack of open elements, under the <body> or <blockquote>, corresponding to its new position in the DOM.
     * Process the token as normal
 * Any case not handled above is processed as normal.
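
The following Python sketch shows one possible reading of the steps above. It is not the Html5Depurate code, and several points are assumptions of the example rather than statements of the text: an inline start tag is taken as the second trigger in the <body>/<blockquote> case, the clones are pushed onto the stack in place of the closed elements, the "<body> or <blockquote>" in the last case is the one nearest the current node and is itself excluded from the non-inline check, and a new mw:p-wrap takes the DOM position of the element it adopts. All names are hypothetical, events are `(kind, value)` tuples, and attributes are omitted (a fuller clone would copy them).

```python
INLINE = frozenset(
    "a abbr acronym applet b basefont bdo big br button cite code dfn em "
    "font i iframe img input kbd label legend map object param q rb rbc rp "
    "rt rtc ruby s samp select small span strike strong sub sup textarea tt "
    "u var".split()
)
P_WRAP = "mw:p-wrap"
WRAP_PARENTS = ("body", "blockquote")


class Node:
    def __init__(self, kind, name="", data=""):
        self.kind, self.name, self.data = kind, name, data
        self.children, self.parent = [], None

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def process_normally(stack, kind, value):
    """Apply the "normal" processing rules to one event."""
    if kind == "start":
        stack.append(stack[-1].append(Node("element", name=value)))
    elif kind == "end":
        stack.pop()
    else:  # "text" or "comment"
        stack[-1].append(Node(kind, data=value))


def last_index(stack, names):
    """Index of the nearest open element named in names, or -1."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i].name in names:
            return i
    return -1


def build_compat_dom(events):
    root = Node("element", name="#root")   # hypothetical container node
    stack = [root]
    for kind, value in events:
        current = stack[-1]
        inline_start = kind == "start" and value in INLINE
        block_start = kind == "start" and value not in INLINE
        wrap_at = last_index(stack, (P_WRAP,))
        if current.name in WRAP_PARENTS:
            if kind == "text" or inline_start:
                stack.append(current.append(Node("element", name=P_WRAP)))
        elif current.name == P_WRAP:
            if kind == "end" or block_start:
                stack.pop()                 # close the mw:p-wrap
        elif wrap_at >= 0:
            if block_start:
                # Clone the open elements above the mw:p-wrap; the clone
                # chain is rooted at the parent of the mw:p-wrap.
                parent = stack[wrap_at].parent
                clones = []
                for el in stack[wrap_at + 1:]:
                    parent = parent.append(Node("element", name=el.name))
                    clones.append(parent)
                # Close the originals and the mw:p-wrap, open the clones.
                stack[wrap_at:] = clones
        else:
            host_at = last_index(stack, WRAP_PARENTS)
            if (kind == "text" and host_at >= 0
                    and all(el.name in INLINE for el in stack[host_at + 1:])):
                host, child = stack[host_at], stack[host_at + 1]
                wrap = Node("element", name=P_WRAP)
                wrap.parent = host
                # The wrapper replaces the child in the DOM and adopts it.
                host.children[host.children.index(child)] = wrap
                wrap.append(child)
                stack.insert(host_at + 1, wrap)
        process_normally(stack, kind, value)
    return root
```

Under these assumptions, the event stream for `<body>a<b>x<div>y</div>z</b></body>` produces a <body> with two mw:p-wrap children: the first holds the text "a" and the original <b>, the second holds the cloned <b>, which in turn contains the <div> and the text "z".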

To serialize the DOM thus constructed:

 * If an <li>, <p> or <tr> has no attributes, and it either has no children, or all its children are either text nodes which consist only of tab, LF, FF or space, or comment nodes, add a "class" attribute with the value "mw-empty-elt".
 * An mw:p-wrap element is serialized as follows:
   * If it has no children, or if all its children are either text nodes which consist only of tab, LF, FF or space, or comment nodes, its serialization is the serialization of its children.
   * Otherwise, its serialization is "<p>" followed by the serialization of its children, followed by "</p>".
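
A minimal Python sketch of these serialization rules follows. The `Node` class and the `is_blank` and `serialize` functions are hypothetical names for the example. The handling of ordinary elements, attributes, escaping and void elements is a simplified assumption and not a full HTML serializer. Only the mw-empty-elt and mw:p-wrap branches correspond to the rules above.

```python
from html import escape

BLANK_CHARS = set("\t\n\f ")          # tab, LF, FF and space
EMPTY_MARKED = ("li", "p", "tr")
# Standard HTML void elements, which take no end tag.
VOID = frozenset("area base br col embed hr img input link meta param "
                 "source track wbr".split())


class Node:
    def __init__(self, kind, name="", data="", attrs=None, children=None):
        self.kind, self.name, self.data = kind, name, data
        self.attrs = dict(attrs or {})
        self.children = list(children or [])


def is_blank(node):
    """True if node has no children, or only comments and blank text."""
    return all(
        c.kind == "comment"
        or (c.kind == "text" and set(c.data) <= BLANK_CHARS)
        for c in node.children
    )


def serialize(node):
    if node.kind == "text":
        return escape(node.data, quote=False)
    if node.kind == "comment":
        return "<!--" + node.data + "-->"
    inner = "".join(serialize(c) for c in node.children)
    if node.name == "mw:p-wrap":
        # Blank wrappers vanish; all others become a real paragraph.
        return inner if is_blank(node) else "<p>" + inner + "</p>"
    attrs = dict(node.attrs)
    if node.name in EMPTY_MARKED and not attrs and is_blank(node):
        attrs["class"] = "mw-empty-elt"
    attr_text = "".join(
        ' {}="{}"'.format(key, escape(val)) for key, val in attrs.items()
    )
    if node.name in VOID:
        return "<{}{}>".format(node.name, attr_text)
    return "<{0}{1}>{2}</{0}>".format(node.name, attr_text, inner)
```

With this sketch, an mw:p-wrap holding the text "a < b" serializes to `<p>a &lt; b</p>`, an mw:p-wrap holding only whitespace and a comment serializes to just those children, and an attribute-free empty <li> serializes to `<li class="mw-empty-elt"></li>`.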