User:OrenBochman/ParserNG

The subpages are a weekend-project attempt at an ANTLR-based parser spec. I am still unfamiliar with the parser; the work is mostly based on information in the other specs listed below.

Specs
Before creating a full parser solution, it seems safer to produce partial parsers in ANTLR.

These are called specs because, while they should document all the language options, they do little more than build a parse tree.

The specs are planned as parts of a full parser chain.

Building fully functional output would necessitate:
 * 1) a mechanism to access parser functions, magic words, extensions, templates and transliteration tables.
 * 2) parser specs
 * 3) integration with the parser specs
 * 4) tree grammars that construct a DOM
 * 5) a StringTemplate file that constructs the output.

An analysis of the input, on the other hand, only needs to produce a basic parse tree.

My Current Specs

 * WikiTable
 * Preprocessor
 * Translator
 * Awk Antlr
 * Sanitizer Antlr Scrubber

Parsing Options
Goal: specify the parser in ANTLR
 * would provide documentation
 * would be more efficient and robust.
 * would simplify other parsing efforts
 * can produce different language targets (PHP, JavaScript, Java, C++, Python) for use by many tools
 * can be used to migrate or translate content to a better format.
 * can be extended

Challenges of Parsing MediaWiki Syntax
Based on: and
 * 1) The set of all inputs is not fixed.
 * 2) External references:
 * 3) *templates
 * 4) *transclusion
 * 5) *extensions
 * 6) Command order mismatch
 * 7) * output is a single file; input can be a recursive set of files.
 * 8) * templates require out-of-order processing, and so do extensions.
 * 9) is the lexer context sensitive?
 * 10) Need to look forward, and sometimes backwards too.
 * 11) backwards to determine curly construct meaning. (till end of file)
 * 12) the same goes for include-only, no-include, comments and no wiki.
 * 13) The language is big, and its statements (magic words) can be changed externally.
 * 14) some language statements are very similar
 * 15) * [ in .|... can mean several things. (internal link, external link, audio, picture, video etc)
 * 16) * { in can mean several things.
 * 17) * ' in 'x' can mean several things: ' +  or  + '
 * 18) White space adds some complexity.
 * 19) * TOC placement
 * 20) * indentation does matter
 * 21) * single vs. multiple newlines matter too.
 * 22) Optional case sensitivity in the first letter of literals, but not in commands.
 * 23) Error recovery is important
 * 24) Good reporting is not
 * 25) Poor documentation.
 * 26) * The language is not well-defined and is sparsely documented;
 * 27) * It has been hacked on for ages, likely by non-language designers?
 * 28) * The only definition is in the working code of the above hacks.
 * 29) The Translator should be fast and modular.
 * 30) * However the current parser is very slow.
 * 31) * it would be hard to be slower
 * 32) * extensive caching compensates for slowness in many situations
 * 33) * modularity and simplicity are more important.
 * 34) content has comments and markup that can occur anywhere in the input and need to go out into the output at proper locations.
 * 35) multiple syntax for features:
 * 36) * tables
 * 37) * headers, bold italic can be wiki or html based
 * 38) * output need not be human editable
 * 39) input size - can be massive, e.g. wikibooks.
 * 40) * imposes limits on # of passes.
 * 41) * imposes limits on viability of memoization.
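
The apostrophe ambiguity listed above can be made concrete with a small sketch. It assumes the usual reading of apostrophe runs in wikitext (two for italic, three for bold, five for bold italic, four as one literal apostrophe plus bold, more than five as literal apostrophes plus bold italic). The function names are hypothetical, and the later balancing pass, where an unmatched bold may be reinterpreted as an apostrophe plus italic, is deliberately left out.

```python
import re


def classify_apostrophe_run(run):
    """Split a run of apostrophes into (literal_text, markup_or_None).

    Assumed rules: 2 = italic, 3 = bold, 4 = "'" + bold,
    5 = bold italic, more than 5 = extra literal "'" + bold italic.
    """
    n = len(run)
    if n == 1:
        return ("'", None)
    if n == 2:
        return ("", "italic")
    if n == 3:
        return ("", "bold")
    if n == 4:
        return ("'", "bold")
    return ("'" * (n - 5), "bold_italic")


def tokenize_line(line):
    """Turn one line into ("text", s) and ("toggle", kind) tokens."""
    tokens = []
    # The capture group keeps the apostrophe runs in the split result.
    for piece in re.split(r"('+)", line):
        if not piece:
            continue
        if piece[0] == "'":
            literal, markup = classify_apostrophe_run(piece)
            if literal:
                tokens.append(("text", literal))
            if markup:
                tokens.append(("toggle", markup))
        else:
            tokens.append(("text", piece))
    return tokens
```

Even this first pass cannot be expressed as a fixed token per character, which is one reason a context-free lexer struggles with the markup.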

Open Questions

 * 1) what are and what should be the parser's
 * 2) * error handling.
 * 3) * error recovery capability.
 * 4) Is a major move to simplify the language being considered?
 * 5) *reducing construct ambiguity.
 * 6) *reducing context dependency.
 * 7) **Links, images etc in
 * 8) *simple is not necessarily weaker.
 * 9) how does/should the extension mechanism interact with the parser?
 * 10) * protect the parser from extension's bugs.
 * 11) * give extensions services.
 * 12) * separate implementation.
 * 13) is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
 * 14) what is the importance of semantics in parsing MediaWiki content, as opposed to parsing just the syntax?
 * 15) templates seem important
 * 16) can the parser's complexity be reduced if it had access to semantic metadata?
 * 17) scoping rules (templates, variables, references)
 * 18) * are the required variables defined already?
 * 19) * when does a definition expire?

Enhancements

 * 1) dynamic scoping of template args
 * let the called template see named variables defined in its parent's call
 * as above but with name munging like super.argname


 * 1) parser functions which evaluate
 * (mathematical) expressions within variables.
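
The dynamic scoping idea above can be sketched as a chain of call frames. This is only an illustration: the `Frame` class and its `lookup` method are hypothetical names, not part of any existing parser, and the `super.` prefix follows the name-munging suggestion in the list.

```python
class Frame:
    """One template call: its named arguments plus a link to the caller."""

    def __init__(self, args, parent=None):
        self.args = dict(args)
        self.parent = parent

    def lookup(self, name):
        """Resolve an argument name, falling back to the calling frames.

        A name of the form "super.x" skips this frame and starts the
        search in the parent, so a shadowed value stays reachable.
        """
        if name.startswith("super."):
            if self.parent is None:
                return None
            return self.parent.lookup(name[len("super."):])
        frame = self
        while frame is not None:
            if name in frame.args:
                return frame.args[name]
            frame = frame.parent
        return None


# A page-level call, and a template called from inside it.
outer = Frame({"lang": "en", "title": "Outer"})
inner = Frame({"title": "Inner"}, parent=outer)
```

Here `inner.lookup("lang")` finds the value defined in the parent's call, while `inner.lookup("super.title")` reaches past the local `title`.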

Specs

 * Preprocessor
 * Markup Spec
 * Alternative_parsers
 * Parser Testing script + Test Cases

Missing specs:
 * Language conversion
 * Sanitization
 * Operator precedence
 * Error recovery
 * Parser hooks for the extension mechanism

Tools

 * Mediawiki\maintenance\tests
 * Parser Playground gadget

Antlr

 * How to remove global backtracking from your grammar
 * look ahead analysis
 * (...)? optional subrule
 * (...)=> syntactic predicate
 * {...}? hoisting disambiguating semantic predicate
 * {...}?=> gated semantic predicate
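
To show what a syntactic predicate buys, here is a hand-written sketch of the same idea outside ANTLR: try an alternative speculatively and commit only if the whole construct matches, otherwise fall back to plain text. The function names and the simplified link rule (a link must close with `]]` on the same line) are assumptions made for illustration, not the real grammar.

```python
def match_link(text, pos):
    """Speculatively match [[target]] at pos; return the end index or None."""
    if not text.startswith("[[", pos):
        return None
    end = text.find("]]", pos + 2)
    if end == -1 or "\n" in text[pos + 2:end]:
        return None
    return end + 2


def parse(text):
    """Return a flat list of ("text", s) and ("link", target) nodes."""
    nodes = []
    buf = []
    pos = 0
    while pos < len(text):
        end = match_link(text, pos)  # plays the role of (link)=>
        if end is not None:
            if buf:
                nodes.append(("text", "".join(buf)))
                buf = []
            nodes.append(("link", text[pos + 2:end - 2]))
            pos = end
        else:
            # Predicate failed: consume one character as ordinary text.
            buf.append(text[pos])
            pos += 1
    if buf:
        nodes.append(("text", "".join(buf)))
    return nodes
```

The cost is visible here too: every failed speculation rescans input, which is why removing global backtracking in favour of targeted predicates matters for large pages.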

Java Based Parsers

 * http://code.google.com/p/gwtwiki/
 * http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome
 * http://sweble.org/wiki/Sweble_Wikitext_Parser

Todo

 * 1) finish the dumpHtmlHarness class.
 * 2) add more options.
 * 3) benchmarking.
 * 4) log4j output.
 * 5) implement extension tag loading mechanism.
 * 6) implement magic word (localised) loading mechanism.
 * 7) input filter support.
 * 8) different parser implementations via dependency injection
 * 9) write a JUnit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt
 * 10) write a JUnit test which runs real page content.
 * 11) get the lot into Hudson.
 * 12) fix one of the above parsers
 * test the ANTLR version.
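
As a starting point for the parserTests.txt item above, a minimal reader for the test file might look like the following sketch. It assumes the file marks each case with `!! test`, `!! input`, `!! result` and `!! end` lines, and it simply ignores anything outside a test case (comments, article and hook definitions). The function name is hypothetical, and section names may differ between MediaWiki versions.

```python
def read_parser_tests(lines):
    """Collect test cases as dicts mapping section name to section text."""
    tests = []
    current = None   # sections of the test case being read, if any
    section = None   # name of the section currently being filled
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!!"):
            marker = line[2:].strip().lower()
            if marker == "test":
                current = {"test": []}
                section = "test"
            elif marker == "end":
                if current is not None:
                    tests.append(
                        {name: "\n".join(body) for name, body in current.items()}
                    )
                current = None
                section = None
            elif current is not None:
                # Any other marker inside a test opens a new section.
                section = marker
                current[section] = []
        elif current is not None and section is not None:
            current[section].append(line)
    return tests
```

Each returned dict can then drive one test: feed `input` to the parser under test and compare its output with `result`.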