User:OrenBochman/ParserNG/antlr

From mediawiki.org

The preprocessor is tricky business to code.

  • It has recursive rules a complicated (for parsing) format.
  • double curly expressions can have different semantics
  • triple curlies too
  • single curlies also on occasion.

next follow some advanced Antlr specific features illustrated by things they could help to fix.

Recursive Lexer / Parser Rules[edit]

When creating a recursive expresions definition, the lowest nesting level is the lowest precedence of the operator matched in that rule.

Asociativity[edit]

expr: mult ('+' mult)* ; // left-associative via (...)*
mult: pow ('*' pow)* ;
pow : atom ('^' pow)? ; // right-associative via tail recursion
atom: ID| INT ;

In The PreProcessor[edit]

thus:


grammar:PreProcessor;

members
{
String myName;
boolean isTemplate(String name){...}
String getTemplate(String name, List<String> args){...}
boolean isTemplateArg(String name){...}
String getTemplateArg(String name){...}
boolean isMagicWord(String name){...}
String getMagicWord(String name, List<String> args){...}
boolean isParserFunction(String name){...}
String getParserFunction(String name, List<String> args){...}
}
wikiExpr: curlyExpr;
curlyExpr: parserFunction| magicWord | templateVar | template ;
angleExpr: ext | nowiki | noInclude | includeOnly | comment | pre ;
template: {isTemplate(input.LT(2).text)}?=>'{{' NAME TEMPLATE_ARGS | curlyExpr | /* empty */ '}}'; // right-associative via tail recursion
templateVar: {isTemplate(input.LT(2).text)}?=>'{{{' NAME | /* empty */ '}}}'; // right-associative via tail recursion
parserFunction : {isParserFunction(input.LT(2).text)}?=>'{{' NAME PF_ARRGS '}}'; //curly atom
magicWord : {isMagicWord(input.LT(2).text)}?=>'{{' NAME MW_ARGS '}}'; //curly atom
//lexer rules
PF_ARRGS: ... ;
MW_ARGS: ... ;
TEMPLATE_ARGS: ... ;

Semantic and Syntactic Predicates[edit]

two advanced ideas are

syntactic predicates[edit]

parts of a template look like:

parts: 
 title=LITERAL (PIPE (args=LITERAL)*)*;

semantic predicates[edit]

Using Semantics To Simplify Syntax Parsing[edit]

does {{LITERAL| ... }} reffer to a parser function, magic word or a template ?

since the parser should be privy to a symbol table of parser_functions, magic_words, templates, variables, etc they can be used to simplify (read: speed up parsing using a simple semantic check (isMagicWord) instead of a deep syntax scan.) in Antlr speak lookahead is said to be analogous to checking all options of the maze, using predicates is called sending in a trained monkiess to stand in the required junction points.

IncludeOnly and NoInclude semantics[edit]

  • IncludeOnly means include the following text only within the scope of the tansculded page (expanded page).
  • NoInclude means include the following text only within the scope of the template page (expanded page).
//todo prototype and paste here :-)

Look Ahead[edit]

This is an example of using look ahead in the parser. There is also lookahead in the lexer. But it does not return string only ints (of chars).

Implementing some parser functions[edit]

if switch etc are the where parsers shine. the parser rules below use sysntactic predicates:

stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;

/** An ID whose text is "#if" */
keyIF : {input.LT(1).getText().equals("#if")}? ID ;
/** An ID whose text is "#switch" */
keyCALL : {input.LT(1).getText().equals("#switch")}? ID ;

Implementing some Magic Words[edit]

Magic words can be trickier to take care of since. one idea is to keep them in a data structure like a hash or a trie.

@members{

  public boolean isMagic(String token){

     return (magic.isKey(token))); // a hash of magic words token (and thier functor objects)
  }


}




keyCALL : {isMagic(input.LT(1).getText())}? ID ;