User:OrenBochman/ParserNG

The subpages are a weekend-project attempt at an ANTLR-based parser spec. I am still unfamiliar with the parser; the work is mostly based on information in the other specs listed below.

My Specs
WikiTable

Parsing Options
Goal: specify the parser in ANTLR
 * would provide documentation
 * would be more efficient and robust.
 * would simplify other parsing efforts
 * can produce different language targets (PHP, JavaScript, Java, C++, Python) for use by many tools
 * can be used to migrate or translate to a better format.
 * can be extended

Challenges of Parsing MediaWiki Syntax
Based on: and
 * 1) The set of all inputs is not fixed.
 * 2) External references:
 * 3) *templates
 * 4) *transclusion
 * 5) *extensions
 * 6) Command order mismatch
 * 7) * output is a single file; input can be a recursive set of files.
 * 8) * templates require out-of-order processing, and so do extensions.
 * 9) is the lexer context-sensitive?
 * 10) Need to look ahead, and sometimes backwards too.
 * 11) backwards to determine the meaning of a curly-brace construct (till end of file).
 * 12) the same goes for includeonly, noinclude, comments and nowiki.
 * 13) The language is big, and its statements (magic words) can be changed externally.
 * 14) some language statements are very similar
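 * 15) * [ in a bracketed construct can mean several things (internal link, external link, audio, picture, video, etc.).
 * 16) * { can mean several things (for example a template call, a template parameter or a table start).
 * 17) * ' in a run of apostrophes can mean several things: italic, bold, or a literal apostrophe before or after either.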
 * 18) White space adds some complexity.
 * 19) * TOC placement
 * 20) * indentation does matter
 * 21) * single vs. multiple newlines matter too.
 * 22) Optional case sensitivity in the first letter of literals, but not in commands.
 * 23) Error recovery is important
 * 24) Good reporting is not
 * 25) Poor documentation.
 * 26) * The language is not well-defined and is sparsely documented;
 * 27) * It was hacked on for ages, likely by people who were not language designers?
 * 28) * The only definition is in the working code of the above hacks.
 * 29) The Translator should be fast and modular.
 * 30) * However the current parser is very slow.
 * 31) * it would be hard to be slower
 * 32) * extensive caching compensates for slowness in many situations
 * 33) * modularity and simplicity are more important.
 * 34) content has comments and markup that can occur anywhere in the input and need to be emitted at the proper locations in the output.
 * 35) multiple syntaxes for the same features:
 * 36) * tables
 * 37) * headers, bold and italic can be wiki- or HTML-based
 * 38) * output need not be human editable
 * 39) input size can be massive, e.g. Wikibooks.
 * 40) * imposes limits on the number of passes.
 * 41) * imposes limits on the viability of memoization.
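
To make the apostrophe ambiguity from the list above concrete, here is a minimal Java sketch of the first, purely length-based step of classifying a run of apostrophes. The class and method names are hypothetical, and the mapping is an assumption based on the usual wikitext convention (two apostrophes for italic, three for bold, five for both); a real parser would still need a second pass over each line to rebalance unmatched runs, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any existing parser.
class ApostropheRuns {

    // Classify a run of apostrophes by its length only.
    // Assumption: 2 = italic, 3 = bold, 5 = bold + italic; surplus
    // apostrophes are treated as literal text in front of the markup.
    static String classify(int runLength) {
        if (runLength < 1) {
            throw new IllegalArgumentException("run length must be >= 1");
        }
        switch (runLength) {
            case 1: return "LITERAL";
            case 2: return "ITALIC";
            case 3: return "BOLD";
            case 4: return "LITERAL+BOLD";
            case 5: return "BOLD_ITALIC";
            default: return "LITERAL+BOLD_ITALIC";
        }
    }

    // Split one line into apostrophe runs and plain text pieces.
    static List<String> scan(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("'+|[^']+").matcher(line);
        while (m.find()) {
            String piece = m.group();
            if (piece.charAt(0) == '\'') {
                out.add(classify(piece.length()));
            } else {
                out.add("TEXT(" + piece + ")");
            }
        }
        return out;
    }
}
```

Even this small step shows why the lexer cannot decide locally: whether a three-apostrophe run really opens bold depends on what else appears later on the same line.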

Open Questions

 * 1) what are, and what should be, the parser's:
 * 2) * error handling.
 * 3) * error recovery capability.
 * 4) Is a major move to simplify the language being considered?
 * 5) *reducing construct ambiguity.
 * 6) *reducing context dependency.
 * 7) **Links, images etc in
 * 8) *simple is not necessarily weaker.
 * 9) how does/should the extension mechanism interact with the parser?
 * 10) * protect the parser from extensions' bugs.
 * 11) * give extensions services.
 * 12) * separate implementation.
 * 13) is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
 * 14) what is the importance of semantics when parsing MediaWiki content, as opposed to parsing just the syntax?
 * 15) templates seem important
 * 16) can the parser's complexity be reduced if it had access to semantic metadata?
 * 17) scoping rules (templates, variables, references)
 * 18) * are the required variables defined already?
 * 19) * when does a definition expire?
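
One possible answer to the extension questions above (protecting the parser from extension bugs while keeping the implementation separate) is to route every extension call through a small host that catches failures. The Java sketch below is only an illustration under that assumption; `ExtensionHost`, `TagExtension` and the error markup are made-up names, not an existing MediaWiki or ANTLR API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical host that isolates the parser from extension failures.
class ExtensionHost {

    // Hypothetical contract an extension tag would implement.
    interface TagExtension {
        String render(String body) throws Exception;
    }

    private final Map<String, TagExtension> byTag = new HashMap<>();

    void register(String tag, TagExtension ext) {
        byTag.put(tag.toLowerCase(Locale.ROOT), ext);
    }

    // Render a tag; an unknown tag or a failing extension never
    // propagates an exception into the parser itself.
    String render(String tag, String body) {
        TagExtension ext = byTag.get(tag.toLowerCase(Locale.ROOT));
        if (ext == null) {
            // Unknown tag: emit the source text, escaped.
            return escape("<" + tag + ">" + body + "</" + tag + ">");
        }
        try {
            String html = ext.render(body);
            return html != null ? html : "";
        } catch (Exception e) {
            // Extension bug: report it inline and keep parsing.
            return "<span class=\"error\">" + escape(tag + ": " + e.getMessage()) + "</span>";
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

This only guards against exceptions; an extension that loops forever or exhausts memory would need a stronger boundary, such as a separate thread with a timeout.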

Enhancements

 * 1) dynamic scoping of template args
 * let the called template see named variables defined in its parent's call
 * as above, but with name munging such as super.argname
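
The dynamic scoping idea above can be sketched as a chain of frames, one per template call, where a lookup falls back to the caller's frame. This is a hedged Java illustration only; `TemplateFrame`, `with` and `lookup` are hypothetical names, and the sketch combines both variants (plain fallback and the `super.` prefix) in one method for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical frame for one template call; parent is the calling frame.
class TemplateFrame {
    private final TemplateFrame parent;
    private final Map<String, String> args = new HashMap<>();

    TemplateFrame(TemplateFrame parent) {
        this.parent = parent;
    }

    // Define a named argument in this frame (returns this for chaining).
    TemplateFrame with(String name, String value) {
        args.put(name, value);
        return this;
    }

    // Resolve a name: "super.x" skips this frame and starts at the caller;
    // otherwise search this frame first, then each enclosing caller.
    String lookup(String name) {
        if (name.startsWith("super.")) {
            return parent == null ? null : parent.lookup(name.substring("super.".length()));
        }
        for (TemplateFrame f = this; f != null; f = f.parent) {
            String value = f.args.get(name);
            if (value != null) {
                return value;
            }
        }
        return null; // undefined everywhere in the call chain
    }
}
```

For example, a frame created with `new TemplateFrame(page).with("title", "Inner")` resolves `title` locally, while `super.title` reaches the value defined by the calling page.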


 * 1) parser functions which evaluate
 * (mathematical) expressions within variables.
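
As an illustration of what such an expression-evaluating parser function involves, here is a small recursive-descent evaluator in Java. It is a sketch under my own assumptions: only numbers, the four basic operators, unary minus and parentheses are supported, and `ExprEval` is a hypothetical name. Variable substitution is assumed to have happened before the string reaches `eval`.

```java
// Hypothetical evaluator; one method per precedence level.
class ExprEval {
    private final String src;
    private int pos;

    private ExprEval(String src) {
        this.src = src;
    }

    static double eval(String expression) {
        ExprEval p = new ExprEval(expression);
        double value = p.sum();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return value;
    }

    // sum := product (('+' | '-') product)*
    private double sum() {
        double v = product();
        while (true) {
            if (accept('+')) v += product();
            else if (accept('-')) v -= product();
            else return v;
        }
    }

    // product := unary (('*' | '/') unary)*
    private double product() {
        double v = unary();
        while (true) {
            if (accept('*')) v *= unary();
            else if (accept('/')) v /= unary();
            else return v;
        }
    }

    // unary := '-' unary | '(' sum ')' | number
    private double unary() {
        if (accept('-')) return -unary();
        if (accept('(')) {
            double v = sum();
            if (!accept(')')) throw new IllegalArgumentException("Missing ')' at " + pos);
            return v;
        }
        return number();
    }

    private double number() {
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Number expected at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }
}
```

Operator precedence falls out of the call structure (`sum` calls `product`, which calls `unary`), which is the same layering an ANTLR grammar would express with nested rules.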

Specs

 * Preprocessor
 * Markup Spec
 * Alternative_parsers
 * Parser Testing script + Test Cases

Missing specs:
 * Language conversion
 * Sanitization
 * Operator precedence
 * Error recovery
 * Parser hooks for the extension mechanism

Tools

 * Mediawiki\maintenance\tests
 * Parser Playground gadget

ANTLR

 * How to remove global backtracking from your grammar
 * look ahead analysis

Java Based Parsers

 * http://code.google.com/p/gwtwiki/
 * http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome
 * http://sweble.org/wiki/Sweble_Wikitext_Parser

Todo

 * 1) finish the dumpHtmlHarness class.
 * 2) add more options.
 * 3) benchmarking.
 * 4) log4j output.
 * 5) implement extension tag loading mechanism.
 * 6) implement magic word (localised) loading mechanism.
 * 7) input filter support.
 * 8) different parser implementations via dependency injection
 * 9) write a junit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt
 * 10) write a junit test which runs real page content.
 * 11) get the lot into Hudson.
 * 12) fix one of the above parsers
 * test the ANTLR version.
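
Items 8 and 9 above could start from something like the following Java sketch: a reader for the test file plus a small interface so that any parser implementation can be plugged in. This is not the finished harness. `ParserTestReader` and `WikiParser` are hypothetical names, and the file layout (`!! test`, `!! input`, `!! result`, `!! end` section markers) is my assumption about parserTests.txt and should be checked against the actual file. A JUnit test would simply assert that `failures(...)` returns an empty list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical reader for "!!"-delimited parser test files.
class ParserTestReader {

    // Hypothetical seam for injecting different parser implementations.
    interface WikiParser {
        String toHtml(String wikitext);
    }

    // Each test case maps a section name ("test", "input", "result", ...)
    // to the text of that section. Lines outside a test case are ignored.
    static List<Map<String, String>> read(String fileContent) {
        List<Map<String, String>> cases = new ArrayList<>();
        Map<String, String> current = null;
        String section = null;
        List<String> body = new ArrayList<>();
        for (String line : fileContent.split("\r?\n", -1)) {
            if (line.startsWith("!!")) {
                String name = line.substring(2).trim().toLowerCase(Locale.ROOT);
                if (current != null && section != null) {
                    current.put(section, String.join("\n", body));
                }
                body.clear();
                if (name.equals("test")) {
                    current = new LinkedHashMap<>();
                    section = "test";
                } else if (name.equals("end")) {
                    if (current != null) cases.add(current);
                    current = null;
                    section = null;
                } else if (current != null) {
                    section = name;
                }
            } else if (current != null && section != null) {
                body.add(line);
            }
        }
        return cases;
    }

    // Run every case through the given parser; return names of failing tests.
    static List<String> failures(String fileContent, WikiParser parser) {
        List<String> failed = new ArrayList<>();
        for (Map<String, String> c : read(fileContent)) {
            String expected = c.getOrDefault("result", "").trim();
            String actual = parser.toHtml(c.getOrDefault("input", "")).trim();
            if (!expected.equals(actual)) {
                failed.add(c.get("test"));
            }
        }
        return failed;
    }
}
```

Because the parser is passed in as a `WikiParser`, the same test run can be pointed at the ANTLR version or at any of the Java-based parsers listed earlier, once each is wrapped in that interface.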