User:OrenBochman/ParserNG

The subpages are a weekend-project attempt at an ANTLR-based parser spec. I am still unfamiliar with the existing parser, so the work is mostly based on information in the specs listed below.

Challenges of Parsing MediaWiki Syntax
Based on:
 * 1) The set of all input is not fixed.
 * 2) External references:
 * 3) * templates
 * 4) * transclusion
 * 5) * extensions
 * 6) Command order mismatch.
 * 7) * Output is a single file; input can be a recursive set of files.
 * 8) * Templates and extensions require out-of-order processing.
 * 9) The lexer is context-sensitive.
 * 10) Need to look forward, and sometimes backwards too:
 * 11) * backwards (possibly to the end of file) to determine the meaning of curly-brace constructs;
 * 12) * the same goes for includeonly, noinclude, comments, and nowiki.
 * 13) The language is big, and the statements (magic words) can be changed externally.
 * 14) Some language statements are very similar:
 * 15) * { can mean several things ({{template}}, {{{parameter}}}, {| table);
 * 16) * ' in 'x' can mean several things: a literal apostrophe, or part of italic/bold markup.
 * 17) Whitespace adds some complexity:
 * 18) * TOC placement;
 * 19) * indentation matters;
 * 20) * single vs. multiple newlines matter too.
 * 21) Optional case sensitivity in a literal's first letter, but not in commands.
 * 22) Error recovery is important.
 * 23) Good error reporting is not.
 * 24) Poor documentation:
 * 25) * no well-defined language definition or manual;
 * 26) * hacked on for ages by non-language-designers.
 * 27) The translator should be fast and modular.
 * 28) * However, the current parser is very slow;
 * 29) * it would be hard to be slower;
 * 30) * extensive caching compensates for the slowness in many situations;
 * 31) * modularity and simplicity are more important.
 * 32) Content has comments and markup that can occur anywhere in the input and must appear at the proper locations in the output.
 * 33) Multiple syntaxes for the same features:
 * 34) * tables;
 * 35) * headers, bold, and italics can be wiki- or HTML-based.
 * 36) * Output need not be human-editable.
 * 37) Input size can be massive, e.g. Wikibooks, which:
 * 38) * imposes limits on the number of passes;
 * 39) * imposes limits on the viability of memoization.
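The apostrophe ambiguity (item 16) can be sketched in code. The following is an illustrative simplification, assuming isolated apostrophe runs; the class name `QuoteRuns` is mine, and the real parser's quote handling additionally rebalances bold/italic across a whole line:

```java
// Hypothetical sketch: how an isolated run of n apostrophes is interpreted.
public class QuoteRuns {

    /** Classify a run of n consecutive apostrophes. */
    public static String classify(int n) {
        switch (n) {
            case 0:  return "";
            case 1:  return "literal:'";                 // plain text
            case 2:  return "italic-toggle";             // ''
            case 3:  return "bold-toggle";               // '''
            case 4:  return "literal:' + bold-toggle";   // one apostrophe is demoted to text
            case 5:  return "bold-italic-toggle";        // '''''
            default:                                     // n > 5: surplus apostrophes are text
                StringBuilder extra = new StringBuilder("literal:");
                for (int i = 0; i < n - 5; i++) extra.append('\'');
                return extra + " + bold-italic-toggle";
        }
    }
}
```

For example, a run of four apostrophes is read as one literal apostrophe followed by a bold toggle, which is why local lookahead alone cannot settle the meaning of a single `'`.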

Open Questions

 * 1) What are, and what should be, the parser's:
 * 2) * error handling capabilities;
 * 3) * error recovery capabilities?
 * 4) Is a major move to simplify the language being considered?
 * 5) * Reducing construct ambiguity.
 * 6) * Reducing context dependency.
 * 7) ** Links, images, etc. in
 * 8) * Simpler is not necessarily weaker.
 * 9) How does/should the extension mechanism interact with the parser?
 * 10) * Protect the parser from extensions' bugs.
 * 11) * Provide services to extensions.
 * 12) * Separate the implementations.
 * 13) Is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
 * 14) How important are semantics when parsing MediaWiki content, as opposed to parsing just the syntax?
 * 15) * Templates seem important.
 * 16) Could the parser's complexity be reduced if it had access to semantic metadata?

Specs & Tools

 * Preprocessor
 * Markup Spec
 * Alternative_parsers
 * Parser Testing script + Test Cases
 * Mediawiki\maintenance\tests
 * Parser Playground gadget

Java Based Parsers

 * http://code.google.com/p/gwtwiki/
 * http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome
 * http://sweble.org/wiki/Sweble_Wikitext_Parser

Todo

 * 1) Finish the dumpHtmlHarness class.
 * 2) Add more options.
 * 3) Benchmarking.
 * 4) log4j output.
 * 5) Implement the extension tag loading mechanism.
 * 6) Implement the (localised) magic word loading mechanism.
 * 7) Input filter support.
 * 8) Support different parser implementations via dependency injection.
 * 9) Write a JUnit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt.
 * 10) Write a JUnit test which runs on real page content.
 * 11) Get the lot into Hudson.
 * 12) Fix one of the above parsers.
 * 13) Test the ANTLR version.
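The JUnit work in item 9 could start from a minimal reader for the test-case file. The sketch below assumes only the `!! test` / `!! input` / `!! result` / `!! end` sections; the real parserTests.txt also carries `!! options`, `!! article` sections, and `#` comments, and the class name `ParserTestReader` is mine:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a reader for MediaWiki's parserTests.txt format.
 *  Only !! test / !! input / !! result / !! end are handled; other
 *  section kinds are collected but ignored. */
public class ParserTestReader {

    public static final class Case {
        public final String name, input, expected;
        Case(String name, String input, String expected) {
            this.name = name; this.input = input; this.expected = expected;
        }
    }

    public static List<Case> read(String text) {
        List<Case> cases = new ArrayList<>();
        String name = null, input = null, result = null;
        String section = null;
        StringBuilder buf = null;
        for (String line : text.split("\n", -1)) {
            if (line.startsWith("!!")) {
                // close the section we were collecting, if any
                if (section != null && buf != null) {
                    if (section.equals("test"))   name = buf.toString().trim();
                    if (section.equals("input"))  input = buf.toString();
                    if (section.equals("result")) result = buf.toString();
                }
                String kw = line.substring(2).trim();
                if (kw.equals("end")) {
                    if (name != null && input != null && result != null)
                        cases.add(new Case(name, input, result));
                    name = input = result = null;
                    section = null;
                    buf = null;
                } else {
                    section = kw;
                    buf = new StringBuilder();
                }
            } else if (buf != null) {
                buf.append(line).append('\n');
            }
        }
        return cases;
    }
}
```

A JUnit runner would then feed each `Case.input` to the parser under test and diff the output against `Case.expected`.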