User:GWicke/oldUserPage

Gabriel Wicke, gwicke at wikimedia.org. See also en:User:Gwicke.

I am a software developer working for the Wikimedia Foundation. My current project is a new parser prototype for MediaWiki in support of the Visual editor project. Before joining the foundation in October 2011, I was a volunteer contributor with a very active period between 2003 and 2005. During this time, I designed and implemented the (first version of the) Squid cache layer and the MonoBook skin, added the capability to develop JavaScript and CSS on the wiki (user and common scripts/styles), and tweaked the parser to emit slightly less broken output. That last effort did not have too much success, which is why I then added HTML Tidy as a post-processor to clean up the mess before output.

The new parser should fix a big part of that problem by design. The challenge now is to support the existing content of a few encyclopedias including esoteric template trickery, parser functions and extensions ;)

General links:
 * Current activity in SVN
 * Etherpad index
 * MediaWiki_roadmap/1.18/Revision_report, patch triage, high priority bugs, parser bugs, @20%
 * JavaScript performance, MVL

Parser

 * Future/Parser plan and in particular informal grammar and fix-ups
 * PEG.js documentation
 * Visual editor/Software design, Todo and VEplan
 * WikiDom docs, talk and example document
 * Parser tests in bugzilla: have, need

HTML 5 parsers
HTML 5 parsing widely overlaps with the needs of a wiki parser, and covers many areas of informal grammar and fix-ups. The main differences are in the tokenizer part (different syntax), and in the handling of non-matching elements (ignore vs. show as plain text).

The plan is to convert the PEG parser into a tokenizer (which handles most wiki-specific issues) and use any HTML5-compliant parser as a back-end that builds the (DOM) tree from 'token soup'. If we can get away with an unmodified HTML parser, then this will allow reuse of specification and implementations for the back-end.

HTML5 parsers include (apart from those in browsers):
 * Java: http://about.validator.nu/htmlparser/, especially startTag in src/nu/validator/htmlparser/impl/TreeBuilder.java
 * Based on Google Web Toolkit, can compile to Javascript (Debian build script) and C++ (C++ used in Gecko; Live Javascript version);
 * PHP, Python and Ruby ports: http://code.google.com/p/html5lib/
 * Dom.js, a cleaner JS parser + DOM implementation sponsored by Mozilla; only works on SpiderMonkey (not on node.js) due to use of proxies, const and other advanced JS features
 * Node html5 parser library: https://github.com/aredridel/html5, reportedly slower than the Mozilla one

Tokenizer interfaces
html5:
 this.emit('token', tok);
 {type: 'Characters', data: c}
 {type: 'StartTag', name: 'li', data: [{nodeName: 'attr1', nodeValue: 'attrvalue1'}]}
 * uses events module for dispatch to parser; supports streaming sources (EventEmitter)
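
The event-based dispatch can be sketched with Node's standard events module. This is a toy under stated assumptions, not the html5 library's actual tokenizer: the ToyTokenizer class, its write/end methods, the regular expression and the 'EndTag' type name are all made up for illustration; only the 'token' event and the 'Characters' / 'StartTag' token shapes come from the notes above.

```javascript
const { EventEmitter } = require('events');

// Hypothetical toy tokenizer: recognizes only attribute-less tags and text,
// and emits each token as a 'token' event, as in the html5 interface above.
class ToyTokenizer extends EventEmitter {
  // Feed one chunk of source text. A real tokenizer would also buffer
  // constructs that are split across chunk boundaries.
  write(chunk) {
    const re = /<(\/?)([a-z][a-z0-9]*)>|([^<]+)/gi;
    let m;
    while ((m = re.exec(chunk)) !== null) {
      if (m[3] !== undefined) {
        this.emit('token', { type: 'Characters', data: m[3] });
      } else if (m[1] === '/') {
        // 'EndTag' is an assumed type name, not taken from the notes.
        this.emit('token', { type: 'EndTag', name: m[2] });
      } else {
        this.emit('token', { type: 'StartTag', name: m[2], data: [] });
      }
    }
  }

  // Signal the end of the input stream.
  end() {
    this.emit('end');
  }
}

// The parser side simply subscribes; here it only collects the tokens.
const tokens = [];
const tokenizer = new ToyTokenizer();
tokenizer.on('token', (tok) => tokens.push(tok));

// Chunks can arrive one at a time, which is what makes streaming possible.
tokenizer.write('<li>one</li>');
tokenizer.write('<li>two</li>');
tokenizer.end();

console.log(tokens.map((t) => t.type).join(' '));
```

The tokenizer never references the parser directly, so several consumers (a tree builder, a logger, a round-trip checker) could listen to the same stream.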

dom.js:
 insertToken(TEXT, s);
 insertToken(COMMENT, s);
 insertToken(TAG, tagname, 'an1', 'av');
 insertToken(ENDTAG, tagname);
 -> parser(..)
 * direct call to current parser, by insertion mode

validator.nu htmlparser:

 TreeBuilder.startTag(tagName, key,value, selfClosing);
 TreeBuilder.endTag(tagName);
 TreeBuilder.comment(commentstring);

A common emit function is passed into the tokenizer constructor. Its list of attribute key-value pairs preserves order and duplicate attributes, for round-tripping where possible. TYPE is one of (incomplete list) TAG, ENDTAG, TEXT, COMMENT, SELFCLOSINGTAG. Source positions would be an interesting addition to enable some degree of reconciliation.
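
A minimal sketch of what such a common emit function might look like, to make the idea concrete. The signature emit(type, value, attribs), the makeTreeBuilder helper and the StubTokenizer class are assumptions for illustration only; the tree builder is deliberately naive and is not an HTML5 tree construction algorithm.

```javascript
// Token types from the (incomplete) list above.
const TAG = 'TAG', ENDTAG = 'ENDTAG', TEXT = 'TEXT',
      COMMENT = 'COMMENT', SELFCLOSINGTAG = 'SELFCLOSINGTAG';

// Hypothetical back-end: builds a plain object tree using a stack of open
// elements. A real back-end would be an HTML5 tree builder.
function makeTreeBuilder() {
  const root = { name: '#document', attribs: [], children: [] };
  const stack = [root];

  // Assumed signature: value is a tag name or text content; attribs is an
  // array of [key, value] pairs, so order and duplicates are preserved.
  function emit(type, value, attribs = []) {
    const parent = stack[stack.length - 1];
    switch (type) {
      case TAG: {
        const el = { name: value, attribs, children: [] };
        parent.children.push(el);
        stack.push(el);
        break;
      }
      case SELFCLOSINGTAG:
        parent.children.push({ name: value, attribs, children: [] });
        break;
      case ENDTAG:
        // Naive: pop only if the innermost open element matches.
        if (stack.length > 1 && parent.name === value) stack.pop();
        break;
      case TEXT:
        parent.children.push({ text: value });
        break;
      case COMMENT:
        parent.children.push({ comment: value });
        break;
      default:
        throw new Error('Unknown token type: ' + type);
    }
  }

  return { emit, root };
}

// Hypothetical tokenizer that receives the emit function in its constructor
// and replays a fixed token sequence instead of parsing wikitext.
class StubTokenizer {
  constructor(emit) {
    this.emit = emit;
  }
  run() {
    this.emit(TAG, 'li', [['class', 'a'], ['class', 'b']]);
    this.emit(TEXT, 'item');
    this.emit(ENDTAG, 'li');
  }
}

const builder = makeTreeBuilder();
new StubTokenizer(builder.emit).run();

// The duplicate 'class' attribute survives, in source order.
console.log(JSON.stringify(builder.root.children[0].attribs));
```

Using an array of pairs rather than an object for attributes is the point of the sketch: an object keyed by attribute name would silently drop the duplicate.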

Wiki-specific parser work

 * List of alternative parsers, notes from Berlin Hackathon 2011
 * Markup spec/ANTLR/draft and Wikitext-l discussion
 * Kiwi grammar
 * Python parser by Mozilla.org
 * Sweble: in particular sweble-wikitext/swc-parser-lazy/src/main/java/org/sweble/wikitext/lazy/postprocessor and sweble-wikitext/swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser (grammar) (and damn those deep hierarchies!!)
 * Hook output handling in current parser: 8997

Differences between Tidy and HTML5 parser behavior
Generally, HTML5 parsers only perform very limited correction of invalid nesting according to the content model. Content in locations where neither inline nor block-level content is legal (for example between 'table' and 'tr' tags) is generally adopted by elements further up in the tree according to a 'foster parent' algorithm. Block-level ('flow' in HTML5 lingo) content where only inline ('phrasing') is allowed is not corrected at all. Browsers manage to display these unspecified nestings with mostly acceptable results.

Tidy on the other hand tries harder to correct content-model violations, with sometimes surprising and not very localized effects if other invalid content (especially with missing end tags) precedes mis-nested content.

Examples:
 * Block elements in headings are moved after the heading, or joined up with unclosed block elements before the heading

Formatting elements

 * a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.
 * scope limited by applet elements, buttons, object elements, marquees, table cells, and table captions
 * formatting restored when entering other elements: search for 'Reconstruct the active formatting elements' in the HTML5 parsing spec
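
The three points above can be sketched roughly as follows. This is a simplified illustration of the idea, not the algorithm as specified: elements are plain objects, the limiting elements are represented by a single MARKER value, and the helper names pushFormatting and reconstruct are made up here.

```javascript
// Formatting elements, as listed above.
const FORMATTING = new Set(['a', 'b', 'big', 'code', 'em', 'font', 'i',
  'nobr', 's', 'small', 'strike', 'strong', 'tt', 'u']);

// Scope marker, pushed when entering a limiting element (e.g. a table cell).
const MARKER = Symbol('marker');

// Open a formatting element: it goes on the stack of open elements and on
// the list of active formatting elements.
function pushFormatting(active, openElements, name) {
  if (!FORMATTING.has(name)) throw new Error(name + ' is not a formatting element');
  const el = { name };
  openElements.push(el);
  active.push(el);
  return el;
}

// Re-open every active formatting element that comes after the last marker
// (or the last entry that is still open). Returns the re-opened tag names.
function reconstruct(active, openElements) {
  let i = active.length - 1;
  while (i >= 0 && active[i] !== MARKER && !openElements.includes(active[i])) {
    i--;
  }
  const reopened = [];
  for (i += 1; i < active.length; i++) {
    const clone = { name: active[i].name };
    openElements.push(clone);
    active[i] = clone; // the list entry now refers to the new element
    reopened.push(clone.name);
  }
  return reopened;
}

// Roughly <p><b><em>x</p><p>y: the end tag closes p together with b and em,
// but both stay on the active list and are restored in the next paragraph.
const openElements = [{ name: 'html' }, { name: 'body' }, { name: 'p' }];
const active = [];
pushFormatting(active, openElements, 'b');
pushFormatting(active, openElements, 'em');
openElements.length = 2;           // </p> pops p, b and em
openElements.push({ name: 'p' });  // <p>
const reopened = reconstruct(active, openElements);
console.log(reopened);

// A marker limits the scope: the outer b is not restored inside the cell.
const cellOpen = [{ name: 'td' }];
const cellActive = [{ name: 'b' }, MARKER];
pushFormatting(cellActive, cellOpen, 'em');
cellOpen.pop();                    // em is closed again without an end tag
const cellReopened = reconstruct(cellActive, cellOpen);
console.log(cellReopened);
```

Details such as removing entries when an end tag is seen, or limiting how many identical entries may accumulate, are left out of this sketch.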

Editor-related bits from the HTML spec

 * UndoManager and DOM transaction interface WIP.
 * WhatWG Web-Apps standard WIP

Other UI stuff

 * Visual diff in localwiki using a daisydiff Java diff service to produce diff annotations. Contact: Philip and Mike on #localwiki.

Test cases
The parser is being written against the MediaWiki parser test suite (currently a little more than 660 test cases). Next up will be round-trip testing and running through dumps. See also Future/Parser test cases.

Potential future server-side DOM processing stuff

 * TAL: Spec, http://code.google.com/p/jstal/; Very similar: Genshi
 * HTML5 WebWorkers and zero-copy binary structure passing
 * jsdom supports the execution of inline onload JavaScript in attributes
 * namespace scoping similar to TAL limits possible interactions, which is a good thing for DOM subtree caching and general sanity