User:GWicke/oldUserPage

Gabriel Wicke, wicke ät wikidev.net, en:User:Gwicke.

I am working on a new parser prototype for MediaWiki.

General links:
 * Current activity in SVN
 * Etherpad index
 * MediaWiki_roadmap/1.18/Revision_report, patch triage, high priority bugs @20%
 * JavaScript performance, MVL

Parser

 * Future/Parser plan and in particular informal grammar and fix-ups
 * PEG.js documentation
 * Visual editor/Software design and Todo
 * WikiDom docs, talk and example document
 * Parser tests in bugzilla: have, need

HTML 5 parsers
HTML 5 parsing widely overlaps with the needs of a wiki parser, and covers many areas of informal grammar and fix-ups. The main differences are in the tokenizer (different syntax) and in the handling of non-matching elements (ignore vs. show as plain text).

The plan is to convert the PEG parser into a tokenizer (which handles most wiki-specific issues) and use any HTML5-compliant parser as a back-end that builds the (DOM) tree from 'token soup'. If we can get away with an unmodified HTML parser, this will allow reuse of both the specification and existing implementations for the back-end.

HTML5 parsers include (apart from those in browsers):
 * Java: http://about.validator.nu/htmlparser/, especially startTag in src/nu/validator/htmlparser/impl/TreeBuilder.java
 * Based on Google Web Toolkit, can compile to JavaScript (Debian build script) and C++ (C++ used in Gecko; live JavaScript version);
 * PHP, Python and Ruby ports: http://code.google.com/p/html5lib/
 * Dom.js, a cleaner JS parser + DOM implementation sponsored by Mozilla; only works on SpiderMonkey (not on node.js) due to its use of proxies, const and other advanced JS features
 * Node html5 parser library: https://github.com/aredridel/html5, reportedly slower than the Mozilla one

Tokenizer interfaces
html5:
 this.emit('token', tok);
 {type: 'Characters', data: c}
 {type: 'StartTag', name: 'li', data: [{nodeName: 'attr1', nodeValue: 'attrvalue1'}]};
 * uses events module for dispatch to parser; supports streaming sources (EventEmitter)

dom.js:
 insertToken(TEXT, s);
 insertToken(COMMENT, s);
 insertToken(TAG, tagname, 'an1', 'av');
 insertToken(ENDTAG, tagname);
 -> parser(..)
 * direct call to current parser, by insertion mode

validator.nu htmlparser:

 TreeBuilder.startTag(tagName, key, value, selfClosing);
 TreeBuilder.endTag(tagName);
 TreeBuilder.comment(commentstring);

A common emit function is passed into the tokenizer constructor. Its attribute argument is a list of key-value pairs that preserves order and duplicate attributes for round-tripping if possible. Its TYPE argument is one of (incomplete list) TAG, ENDTAG, TEXT, COMMENT, SELFCLOSINGTAG. Source positions would be an interesting addition to enable some degree of reconciliation.
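
A minimal sketch of how such a common emit function could be wired to the html5-style event interface shown above. Everything here is an assumption for illustration: the signature emit(type, value, attribs), the helper names tokenizeBold, toHtml5Token and makeEmitter, and the 'EndTag' and 'Comment' token shapes (only 'Characters' and 'StartTag' appear in these notes; the other two follow html5lib naming). The toy tokenizer only understands ''' as a bold toggle and stands in for the PEG-based tokenizer.

```javascript
const { EventEmitter } = require('events');

// Assumed signature: emit(type, value, attribs)
//   type    - 'TAG', 'ENDTAG', 'SELFCLOSINGTAG', 'TEXT' or 'COMMENT'
//   value   - the tag name, or the string content for TEXT and COMMENT
//   attribs - ordered array of [key, value] pairs; duplicates are kept

// Toy stand-in for the wiki tokenizer: each ''' toggles bold on or off.
function tokenizeBold(text, emit) {
  let open = false;
  text.split("'''").forEach((part, i) => {
    if (i > 0) {
      if (open) emit('ENDTAG', 'b'); else emit('TAG', 'b', []);
      open = !open;
    }
    if (part !== '') emit('TEXT', part);
  });
}

// Convert a common token into the html5-style object shape.
// The self-closing flag is dropped in this sketch.
function toHtml5Token(type, value, attribs = []) {
  switch (type) {
    case 'TEXT':
      return { type: 'Characters', data: value };
    case 'COMMENT':
      return { type: 'Comment', data: value };
    case 'ENDTAG':
      return { type: 'EndTag', name: value };
    case 'TAG':
    case 'SELFCLOSINGTAG':
      return {
        type: 'StartTag',
        name: value,
        data: attribs.map(([k, v]) => ({ nodeName: k, nodeValue: v })),
      };
    default:
      throw new Error('Unknown token type: ' + type);
  }
}

// Build an emit function that forwards tokens as 'token' events.
function makeEmitter(target) {
  return (type, value, attribs) =>
    target.emit('token', toHtml5Token(type, value, attribs));
}

// Usage: collect the tokens a parser back-end would receive.
const sink = new EventEmitter();
const seenTokens = [];
sink.on('token', (tok) => seenTokens.push(tok));
tokenizeBold("a '''b''' c", makeEmitter(sink));
// seenTokens now holds: Characters, StartTag b, Characters, EndTag b, Characters
```

An adapter for dom.js or the validator.nu TreeBuilder would replace only makeEmitter; the tokenizer itself would stay unaware of the back-end.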

Wiki-specific parser work

 * List of alternative parsers, notes from Berlin Hackathon 2011
 * Markup spec/ANTLR/draft and Wikitext-l discussion
 * Kiwi grammar
 * Python parser by Mozilla.org
 * Sweble: in particular sweble-wikitext/swc-parser-lazy/src/main/java/org/sweble/wikitext/lazy/postprocessor and sweble-wikitext/swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser (grammar) (and damn those deep hierarchies!!)
 * Hook output handling in current parser:

Differences between Tidy and HTML5 parser behavior
Generally, HTML5 parsers perform only very limited correction of invalid nesting according to the content model. Content in locations where neither inline nor block-level content is legal (for example between 'table' and 'tr' tags) is generally adopted by elements further up in the tree according to a 'foster parent' algorithm. Block-level ('flow' in HTML5 lingo) content where only inline ('phrasing') content is allowed is not corrected at all. Browsers manage to display these unspecified nestings with mostly acceptable results.
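
The foster-parenting idea can be illustrated with a heavily simplified tree builder. This is a sketch, not the HTML5 algorithm: the token format ([type, value] pairs), the function names and the TABLE_CONTEXT set are assumptions, end tags are assumed to be well matched, whitespace-only text is not special-cased, and no implied tbody is created. Text that arrives directly inside a table structure is inserted into the table's parent, just before the table.

```javascript
// Simplified set of elements that must not contain text directly.
const TABLE_CONTEXT = new Set(['table', 'tbody', 'thead', 'tfoot', 'tr']);

// Index of the innermost open 'table' on the stack, or -1 if there is none.
function lastTableIndex(stack) {
  for (let i = stack.length - 1; i > 0; i--) {
    if (stack[i].name === 'table') return i;
  }
  return -1;
}

function buildTreeWithFostering(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root]; // stack of open elements
  for (const [type, value] of tokens) {
    const current = stack[stack.length - 1];
    if (type === 'TAG') {
      const node = { name: value, children: [] };
      current.children.push(node);
      stack.push(node);
    } else if (type === 'ENDTAG') {
      // Simplification: end tags are assumed to match the current element.
      if (stack.length > 1) stack.pop();
    } else if (type === 'TEXT') {
      const textNode = { name: '#text', value };
      const t = lastTableIndex(stack);
      if (TABLE_CONTEXT.has(current.name) && t !== -1) {
        // Foster parenting: insert before the table, in the table's parent.
        const fosterParent = stack[t - 1];
        const pos = fosterParent.children.indexOf(stack[t]);
        fosterParent.children.splice(pos, 0, textNode);
      } else {
        current.children.push(textNode);
      }
    }
  }
  return root;
}

// 'stray' sits between <table> and <tr>, so it ends up before the table.
const fostered = buildTreeWithFostering([
  ['TAG', 'table'], ['TEXT', 'stray'],
  ['TAG', 'tr'], ['TAG', 'td'], ['TEXT', 'cell'],
  ['ENDTAG', 'td'], ['ENDTAG', 'tr'], ['ENDTAG', 'table'],
]);
// fostered.children: the text node 'stray', followed by the table
```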

Tidy, on the other hand, tries harder to correct content-model violations, with sometimes surprising and not very localized effects if other invalid content (especially content with missing end tags) precedes mis-nested content.

Examples:
 * Block elements in headings are moved after the heading, or joined up with unclosed block elements before the heading

Formatting elements

 * a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.
 * scope limited by applet elements, buttons, object elements, marquees, table cells, and table captions
 * formatting restored when entering other elements: search for 'Reconstruct the active formatting elements' in the HTML5 parsing spec
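
A rough sketch of how the three points above interact, assuming a [type, value] token format and hypothetical helper names. It is not the spec algorithm: end tags of formatting elements are handled naively (no adoption agency algorithm), 'td' and 'th' stand in for table cells, and the marker set simply follows the list above. Formatting elements that are closed implicitly by a block end tag stay on the active list and are reopened before the next text; a scope marker stops formatting from leaking out of a table cell.

```javascript
const FORMATTING = new Set(['a', 'b', 'big', 'code', 'em', 'font', 'i', 'nobr',
  's', 'small', 'strike', 'strong', 'tt', 'u']);
const SCOPE_MARKERS = new Set(['applet', 'button', 'object', 'marquee',
  'td', 'th', 'caption']);
const MARKER = { marker: true };

function buildFormattedTree(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root]; // stack of open elements
  const active = [];    // active formatting elements: {name, node} or MARKER

  const open = (name) => {
    const node = { name, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
    return node;
  };

  // Reopen formatting elements that were closed implicitly.
  const reconstruct = () => {
    let start = active.length;
    while (start > 0 && active[start - 1] !== MARKER &&
           !stack.includes(active[start - 1].node)) {
      start--;
    }
    for (let i = start; i < active.length; i++) {
      active[i].node = open(active[i].name);
    }
  };

  // Close the innermost open element called `name`; false if none is open.
  const closeElement = (name) => {
    for (let i = stack.length - 1; i > 0; i--) {
      if (stack[i].name === name) {
        stack.length = i;
        return true;
      }
    }
    return false;
  };

  for (const [type, value] of tokens) {
    if (type === 'TEXT') {
      reconstruct();
      stack[stack.length - 1].children.push({ name: '#text', value });
    } else if (type === 'TAG' && FORMATTING.has(value)) {
      reconstruct();
      active.push({ name: value, node: open(value) });
    } else if (type === 'TAG') {
      open(value);
      if (SCOPE_MARKERS.has(value)) active.push(MARKER);
    } else if (type === 'ENDTAG' && FORMATTING.has(value)) {
      // Naive handling: drop the innermost matching entry in the current scope.
      for (let i = active.length - 1; i >= 0 && active[i] !== MARKER; i--) {
        if (active[i].name === value) {
          const pos = stack.indexOf(active[i].node);
          if (pos !== -1) stack.length = pos;
          active.splice(i, 1);
          break;
        }
      }
    } else if (type === 'ENDTAG') {
      if (closeElement(value) && SCOPE_MARKERS.has(value)) {
        // Clear the active list back to, and including, the last marker.
        while (active.length > 0 && active.pop() !== MARKER) { /* keep popping */ }
      }
    }
  }
  return root;
}

function serializeTree(node) {
  if (node.name === '#text') return node.value;
  const inner = node.children.map(serializeTree).join('');
  return node.name === '#root' ? inner : `<${node.name}>${inner}</${node.name}>`;
}

// Unclosed <b> carries over into the next paragraph:
const carried = serializeTree(buildFormattedTree([
  ['TAG', 'p'], ['TAG', 'b'], ['TEXT', 'one'], ['ENDTAG', 'p'],
  ['TAG', 'p'], ['TEXT', 'two'], ['ENDTAG', 'p'],
]));
// carried === '<p><b>one</b></p><p><b>two</b></p>'

// ...but not out of a table cell, because of the scope marker:
const scoped = serializeTree(buildFormattedTree([
  ['TAG', 'td'], ['TAG', 'b'], ['TEXT', 'x'], ['ENDTAG', 'td'],
  ['TAG', 'td'], ['TEXT', 'y'], ['ENDTAG', 'td'],
]));
// scoped === '<td><b>x</b></td><td>y</td>'
```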

Editor-related bits from the HTML spec

 * UndoManager and DOM transaction interface WIP.
 * WhatWG Web-Apps standard WIP

Test cases
The parser is being written against the MediaWiki parser test suite (currently a little more than 660 test cases). Next up will be round-trip testing and running through dumps.

Additional interesting pages:
 * en:Template:WPBannerMeta/core - Template and parser function torture test