User:OrenBochman/ParserNG/Transliterator Antlr

From mediawiki.org

Transliterator Filter ANTLR

"To make ANTLR generate lexers that behave like the UNIX utility sed (copy standard in to standard out except as specified by the replace patterns), use a filter rule that does the input to output copying:" (ANTLR docs [1])

class cfgSed extends Lexer;
options {
  k=2;
  filter=IGNORE;
  charVocabulary = '\3'..'\177';
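}

{
  // If a dictionary is needed, declare it here as a lexer class member;
  // Java declarations are not allowed inside the options section.
  // loadDictionary() is assumed to be supplied by the user.
  java.util.Map<String,String> dictionary = loadDictionary();
}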

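// Example of Unicode-to-Unicode conversion.
// '\u000X' and '\u000Y' are placeholders for the first and last character of
// the source range; charVocabulary must be widened to include that range.
protected
ALPHA1 : '\u000X'..'\u000Y';

KENJI  : src:ALPHA1
        { System.out.print(dictionary.get(src.getText())); } // filter output
        ;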

protected
IGNORE
  :  ( "\r\n" | '\r' | '\n' )
     {newline(); System.out.println("");}
  |  c:. {System.out.print(c);}
  ;

Based on [2].

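The idea is to use a dictionary (map) or a conversion function to replace each character of the detected character set with its transliteration, while the filter rule copies all other input through unchanged.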
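The two replacement strategies can be sketched in plain Java, outside ANTLR. This is only an illustration under assumptions: the class name `TransliterationSketch`, its method names, the small Cyrillic dictionary and the Hiragana-to-Katakana offset are all hypothetical examples and not part of the original grammar. Characters are written as `\u` escapes so the file compiles regardless of source encoding.

```java
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class TransliterationSketch {

    /** Dictionary approach: replace mapped code points, copy the rest unchanged. */
    public static String withDictionary(String input, Map<Integer, String> dictionary) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String replacement = dictionary.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp); // same role as the IGNORE filter rule
            }
        });
        return out.toString();
    }

    /** Function approach: compute the target code point instead of looking it up. */
    public static String withFunction(String input, IntUnaryOperator convert) {
        StringBuilder out = new StringBuilder();
        input.codePoints().map(convert).forEach(out::appendCodePoint);
        return out.toString();
    }

    /** Hiragana U+3041..U+3096 lies exactly 0x60 below the matching Katakana. */
    public static int hiraganaToKatakana(int cp) {
        return (cp >= 0x3041 && cp <= 0x3096) ? cp + 0x60 : cp;
    }

    public static void main(String[] args) {
        // Hypothetical three-entry dictionary: zhe -> "zh", a -> "a", be -> "b"
        Map<Integer, String> cyrillic =
            Map.of(0x0436, "zh", 0x0430, "a", 0x0431, "b");
        System.out.println(withDictionary("\u0436\u0430\u0431\u0430!", cyrillic));
        System.out.println(
            withFunction("\u304B\u306A", TransliterationSketch::hiraganaToKatakana));
    }
}
```

The dictionary variant can map one character to several, which an offset cannot; the function variant needs no table at all when the two scripts are laid out in parallel.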

Usage

This filter can be:

  • Integrated into the lexer (one scan would be fastest).
  • Run as a separate step (modular and easily configurable, but slower because the input is scanned twice).
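The separate-step option could look roughly like the following Java sketch: a pass that copies a `Reader` to a `Writer` and substitutes mapped characters, after which the output is handed to whatever lexer comes next. The class `TransliterationPass` and its methods are hypothetical names chosen for illustration. The sketch works one `char` at a time, so it assumes all mapped characters are in the Basic Multilingual Plane.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;

public class TransliterationPass {

    /** Copies in to out, replacing every character that has a dictionary entry. */
    public static void run(Reader in, Writer out, Map<Character, String> dictionary)
            throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            String replacement = dictionary.get((char) c);
            if (replacement != null) {
                out.write(replacement);
            } else {
                out.write(c); // pass everything else through unchanged
            }
        }
        out.flush();
    }

    /** Convenience wrapper for in-memory text. */
    public static String apply(String text, Map<Character, String> dictionary) {
        StringWriter out = new StringWriter();
        try {
            run(new StringReader(text), out, dictionary);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: Greek alpha and beta to Latin
        Map<Character, String> greek = Map.of('\u03B1', "a", '\u03B2', "b");
        String preprocessed = apply("\u03B1\u03B2 = 1", greek);
        // The result would then be wrapped in a StringReader for the real lexer.
        System.out.println(preprocessed);
    }
}
```

Keeping the pass behind a `Reader`/`Writer` pair means the dictionary can be swapped without regenerating the lexer, which is the configurability the second bullet refers to.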

Issues

  • Works best when transliteration is a fixed code-point offset or a call to an outside module.
  • A dictionary combined with lookahead provides the most transliteration power.
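As a rough illustration of why a dictionary plus lookahead is the more powerful combination, the Java sketch below tries the longest dictionary key first, so a digraph such as "sh" wins over its single letters. The class `LookaheadTransliterator`, the parameter `maxKeyLength` (playing a role similar to `k` in the lexer options) and the Latin-to-Cyrillic entries are assumptions made for the example only.

```java
import java.util.Map;

public class LookaheadTransliterator {

    /**
     * Longest-match replacement: at each position, look ahead up to
     * maxKeyLength characters and use the longest key found in the dictionary.
     */
    public static String transliterate(String input,
                                       Map<String, String> dictionary,
                                       int maxKeyLength) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            int longest = Math.min(maxKeyLength, input.length() - i);
            for (int len = longest; len >= 1; len--) {
                String replacement = dictionary.get(input.substring(i, i + len));
                if (replacement != null) {
                    out.append(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(i)); // no entry: copy the character as-is
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical entries: "sh" -> sha, "s" -> es, "h" -> ha, "a" -> a
        Map<String, String> latinToCyrillic = Map.of(
            "sh", "\u0448", "s", "\u0441", "h", "\u0445", "a", "\u0430");
        System.out.println(transliterate("shah", latinToCyrillic, 2));
    }
}
```

With `maxKeyLength` set to 1 the same dictionary would translate "s" and "h" separately, which shows what the extra character of lookahead buys.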