User:OrenBochman/ParserNG/Preprocessor

shortcut: WP:PF

Quick Preprocessor

Here is a wikitext preprocessor written in ANTLR. It is aimed at parsing for search rather than parsing for rendering. It supports recursive templates, math expressions inside templates, some parser functions, and some magic words.

based on

grammar br1;

be	:	tExpr*;

fragment	//mathematical IDentifier adds pi and e constants
mID	: ID|PI_CONST |E_CONST	;

fragment	//template expression
tExpr	: ID
//      | FLOAT
//	| INT
	| var
	| '{{'  ID+  ('|' (statement | tExpr) )* '}}'
	| '{{'  pf_core (rValue|var) ('|' (statement | tExpr) )* '}}'
	| '{{'  pf_ext_math mathExpr ('|' (mathExpr) )* '}}'
//	| '{{'  pf_ext (expr)* '}}'
	;

fragment
pf_ext_math //parser function extensions that evaluate math expressions
	:	 '#ifeq' |'#ifexpr' |'#if' |'#expr' ; 
//math expression

fragment	//math expression used by extension parser functions
mathExpr
	: '(' mathAtom (op mathExpr)* ')' (op mathExpr)*   // (4 + ... ) round 3
	|  mathAtom (op mathExpr)*	// 4+ ....
	|  
	;


fragment	
mathAtom: FLOAT
	| INT
	| mID 
	| var;
	
fragment
op	:	EQ|NOT|MUL|DIV|MOD|PLUS|MINUS|ROUND|DIFF|NE|GE|LE|AND|OR|POW;

fragment
pf_core	: PF_CORE ;		//core parser functions

fragment
pf_ext	: PF_EXT;		//extension parser functions

fragment
var	:'{{{' ID  '|'? '}}}'; //var definition OR var existence

fragment
statement: ID EQ rValue ;     //statement like ID=rValue

fragment
rValue	: FLOAT| INT | ID+;  // rValue is the assigned value; ID+ should really be a String

fragment
E_CONST	:	'e';		// constant e in math expressions
fragment
PI_CONST:	'p' 'i';       // constant PI in math expressions

//mathematical operators
EQ	:	'=';
NOT	:	'n' 'o' 't';
MUL	:	'*';
DIV	:	'/';
MOD	:	'm' 'o' 'd';
PLUS	:	'+';
MINUS	:	'-';
ROUND	:	'r' 'o' 'u' 'n' 'd';
DIFF	:	'<' '>';
NE	:	'!' '=';
GE	:	'>' '=';
LE	:	'<' '=';
AND	:	'a' 'n' 'd';
OR	:	'o' 'r';
POW	:	'^';


PF_CORE :  ID ':' ;		//core parser function
PF_EXT 	:  '#' ID ;		//extension parser function


ID  :	('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;


INT :	'0'..'9'+
    ;

FLOAT
    :   ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    |   '.' ('0'..'9')+ EXPONENT?
    |   ('0'..'9')+ EXPONENT
    ;

COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

CHAR:  '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
    ;

fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;
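As a rough cross-check of the lexer rules above, here is a minimal Python sketch that tokenizes only the math-expression part (what appears inside #expr or #ifexpr). It is not generated from the grammar: the names tokenize_expr, WORD_OPS and CONSTANTS are made up for illustration, and word operators and the pi/e constants are recognised by reclassifying identifiers, which is one possible way around the ID/keyword overlap.

```python
import re

# Word-style operators and constants from the grammar (NOT, MOD, ROUND, AND, OR, pi, e).
WORD_OPS = {"not", "mod", "round", "and", "or"}
CONSTANTS = {"pi", "e"}

# Alternatives are tried in order, so FLOAT must come before INT
# and two-character operators before one-character ones.
TOKEN_RE = re.compile(r"""
    (?P<FLOAT>\d+\.\d*(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)
  | (?P<INT>\d+)
  | (?P<OP><>|!=|>=|<=|[=*/+\-^<>])
  | (?P<PAREN>[()])
  | (?P<ID>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize_expr(text):
    """Split a math expression into (kind, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, value = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WS":
            continue  # same effect as sending WS to the hidden channel
        if kind == "ID" and value in WORD_OPS:
            kind = "OP"
        elif kind == "ID" and value in CONSTANTS:
            kind = "CONST"
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    print(tokenize_expr("( pi * 4 ^ 2 ) round 3"))
```

Reclassifying after the match keeps identifiers such as "order" or "pie" intact, which a literal 'o' 'r' lexer rule would not guarantee.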

test input

/* basic */
 aa {{bb}} cc

/* parametrised */
{{payoff matrix | UL = 5 | UR = 7 | DL = 2 | DR = 9 | Name = Example usage }}  

/* nested  */
{{dd | {{ee| ff}} }} 
{{gg | {{hh | jj = 2}} | jj}}

/* core parser functions */
{{uc: Heavens to BETSY! }}
{{lc: Heavens to BETSY! }}
{{NS: 1 }}
{{fullurl: pagename }}

/* ext parser functions */
{{#ifeq: yes | yes | Hooray...! | Darn...! }}
{{#ifeq: yes | no | Hooray...! | Darn...! }}
{{#if: {{{param|}}} | Hooray...! | Darn...! }}
{{#expr: ( pi * 4 ^ 2 ) round 3 }}
{{#ifexpr: 1.23E+3 mod 2 | Odd | Even }}
{{#iferror: 1 / 0 }} 
{{#ifexist: pageTitle |Hooray | Darn }}
{{#switch: comparison string
 | case1 = result1
 | case2 
 | case3 
 | case4 = result2
 | case5 = result3
 | case6 
 | case7 = result4
 | #default = default result
}}

__NOTOC__
__FORCETOC__
__TOC__
__NOEDITSECTION__
__NEWSECTIONLINK__
__NONEWSECTIONLINK__
__NOGALLERY__
__HIDDENCAT__
__INDEX__
__NOINDEX__

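For comparison with the tested input above, here is a small hand-written Python sketch of the recursive part: it turns wikitext into a tree of text and template nodes and lists the template names, which is roughly what a search indexer would want. Everything in it (Template, parse_wikitext, template_names) is a hypothetical name, not part of the grammar. It assumes anything before the first ':' in a template head is a parser function name, and it keeps {{{variables}}} as raw text without looking inside them.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    args: list = field(default_factory=list)  # each arg is a list of str/Template nodes
    is_function: bool = False                 # True for {{name: ...}} parser functions


def _make_template(parts):
    """Build a Template from the '|'-separated parts of one {{...}} call."""
    head = parts[0]
    first = head[0] if head and isinstance(head[0], str) else ""
    name, colon, rest = first.partition(":")
    if colon:  # e.g. {{uc: text}} or {{#if: cond | a | b}}
        first_arg = ([rest] if rest else []) + head[1:]
        return Template(name.strip(), [first_arg] + parts[1:], True)
    return Template(first.strip(), parts[1:])


def _parse(text, pos, inside):
    """Parse from pos; inside a template, split on '|' and stop at '}}'."""
    parts, nodes, start = [], [], pos
    while pos < len(text):
        if text.startswith("{{{", pos):        # variable: leave it in the raw text
            end = text.find("}}}", pos)
            if end < 0:
                raise ValueError("unclosed variable")
            pos = end + 3
        elif text.startswith("{{", pos):       # nested template: recurse
            if pos > start:
                nodes.append(text[start:pos])
            inner, pos = _parse(text, pos + 2, True)
            nodes.append(_make_template(inner))
            start = pos
        elif inside and text.startswith("}}", pos):
            if pos > start:
                nodes.append(text[start:pos])
            parts.append(nodes)
            return parts, pos + 2
        elif inside and text[pos] == "|":
            if pos > start:
                nodes.append(text[start:pos])
            parts.append(nodes)
            nodes, start = [], pos + 1
            pos += 1
        else:
            pos += 1
    if inside:
        raise ValueError("unclosed template")
    if pos > start:
        nodes.append(text[start:pos])
    return [nodes], pos


def parse_wikitext(text):
    """Return the top-level node list for a piece of wikitext."""
    parts, _ = _parse(text, 0, False)
    return parts[0]


def template_names(nodes):
    """Yield template and parser function names depth-first."""
    for node in nodes:
        if isinstance(node, Template):
            yield node.name
            for arg in node.args:
                yield from template_names(arg)


if __name__ == "__main__":
    print(list(template_names(parse_wikitext("{{gg | {{hh | jj = 2}} | jj}}"))))
```

Unlike the grammar, this works on general strings rather than identifiers, so inputs such as "Heavens to BETSY!" pass through unchanged. A page transcluded by its full title (for example a User: page) would be misread as a parser function under the ':' assumption.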

untested input

//////////////////////////////////////////////////untested ///////////////////////////////////////////////////////////////

/* magic words - behavior switches*/
__NOTOC__
__FORCETOC__
__TOC__
__NOEDITSECTION__
__NEWSECTIONLINK__
__NONEWSECTIONLINK__
__NOGALLERY__
__HIDDENCAT__
__INDEX__
__NOINDEX__

/* magic words -  page variables */
{{DISPLAYTITLE:title}} 
{{DEFAULTSORT:sortkey}} 

{{FULLPAGENAME}} {{FULLPAGENAMEE}}
{{PAGENAME}}
{{BASEPAGENAME}} 
{{SUBPAGENAME}} 
{{SUBJECTPAGENAME}} 
{{TALKPAGENAME}} 
{{NAMESPACE}}  {{NAMESPACEE}} 
{{SUBJECTSPACE}} {{ARTICLESPACE}} 
{{TALKSPACE}}

/*Variables: Date and time*/
{{SITENAME}} {{SERVER}} {{SERVERNAME}} {{SCRIPTPATH}} {{CURRENTVERSION}} {{REVISIONID}}
{{REVISIONDAY}} {{REVISIONDAY2}} {{REVISIONMONTH}} {{REVISIONYEAR}} {{REVISIONTIMESTAMP}} {{REVISIONUSER}} 
{{CURRENTYEAR}} {{CURRENTMONTH}} {{CURRENTMONTHNAME}} {{CURRENTMONTHABBREV}} {{CURRENTDAY}} {{CURRENTDAY2}} {{CURRENTDOW}} {{CURRENTDAYNAME}} {{CURRENTTIME}} {{CURRENTHOUR}} {{CURRENTWEEK}} {{CURRENTTIMESTAMP}} {{LOCALYEAR}}
/*Variables: Stats */
{{NUMBEROFPAGES}} {{NUMBEROFARTICLES}} {{NUMBEROFFILES}} {{NUMBEROFEDITS}} 
{{NUMBEROFVIEWS}} {{NUMBEROFUSERS}} {{NUMBEROFADMINS}} {{NUMBEROFACTIVEUSERS}}
/*Variables: Stats no commas */
{{NUMBEROFPAGES:R}} {{NUMBEROFARTICLES:R}} {{NUMBEROFFILES:R}} {{NUMBEROFEDITS:R}} 
{{NUMBEROFVIEWS:R}} {{NUMBEROFUSERS:R}} {{NUMBEROFADMINS:R}} {{NUMBEROFACTIVEUSERS:R}}


/* Parser Functions: Metadata */ 
{{PAGESIZE:page name}}
{{PROTECTIONLEVEL:action}} 
{{PAGESINCATEGORY:categoryname}} 
{{NUMBERINGROUP:groupname}}

/* Parser Functions: Formatting*/ 
{{lc:string}} 
{{lcfirst:string}}
{{uc:string}} 
{{ucfirst:string}} 
{{formatnum: 3333}}
{{#formatdate:date|format}} 
{{padleft:xyz|5}}
{{padright:xyz|5}} 
{{plural:1|is|are}} 
{{plural:2|is|are}} 
{{#time:format string|date/time object}}
{{gender: male|masculine|female|neutral}}
{{gender: female|masculine|female|neutral}}
{{#tag: tagname|content|parameter1=value1|parameter2=value2}}

untested on

Behavior switches

done now
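For a search-oriented preprocessor, behavior switches mostly need to be recognised and taken out of the indexed text. A minimal Python sketch of that idea follows. The switch list is just the one from the test input on this page (not a complete registry), and extract_switches is a made-up helper name.

```python
import re

# Switches listed in the test input above; a real wiki may define more.
KNOWN_SWITCHES = {
    "NOTOC", "FORCETOC", "TOC", "NOEDITSECTION", "NEWSECTIONLINK",
    "NONEWSECTIONLINK", "NOGALLERY", "HIDDENCAT", "INDEX", "NOINDEX",
}

SWITCH_RE = re.compile(r"__([A-Z]+)__")


def extract_switches(text):
    """Return (switches found, text with the known switches removed)."""
    found = [m.group(1) for m in SWITCH_RE.finditer(text) if m.group(1) in KNOWN_SWITCHES]
    # Unknown __WORDS__ are left in place so ordinary text is not damaged.
    stripped = SWITCH_RE.sub(
        lambda m: "" if m.group(1) in KNOWN_SWITCHES else m.group(0), text
    )
    return found, stripped


if __name__ == "__main__":
    print(extract_switches("__NOTOC__ Hello __FOO__ world __TOC__"))
```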

Variables

For documentation, refer to the Variables section of the MediaWiki page.

  • {{FULLPAGENAME}} (page title including namespace)
  • {{PAGENAME}} (page title excluding namespace)
  • {{BASEPAGENAME}} (page title excluding current subpage and namespace - effectively the parent page without the namespace.)
  • {{SUBPAGENAME}} (subpage part of title)
  • {{SUBJECTPAGENAME}} (associated non-talk page)
  • {{TALKPAGENAME}} (associated talk page)
  • {{NAMESPACE}} (namespace of current page)
  • {{SUBJECTSPACE}}, {{ARTICLESPACE}} (associated non-talk namespace)
  • {{TALKSPACE}} (associated talk namespace)
  • {{FULLPAGENAMEE}}, {{NAMESPACEE}} etc. (equivalents encoded for use in MediaWiki URLs)

The above can all take a parameter, to operate on a page other than the current page.
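To make the page-name variables concrete, here is a hedged Python sketch that derives a few of them from a full title passed in as a parameter. The namespace set is a hypothetical stand-in for the wiki's real configuration, subpages are assumed to be enabled, and the encoded variant only approximates MediaWiki's URL encoding (spaces become underscores, then percent-encoding with ':' and '/' left alone). The talk and subject variants are left out because they need the wiki's namespace pairing.

```python
from urllib.parse import quote

# Hypothetical namespace list for illustration; a real wiki defines its own.
KNOWN_NAMESPACES = {"User", "Help", "Template", "Talk"}


def page_variables(full_title):
    """Compute some page-name variables for the given full page title."""
    prefix, colon, rest = full_title.partition(":")
    if colon and prefix in KNOWN_NAMESPACES:
        namespace, page = prefix, rest
    else:
        namespace, page = "", full_title   # main namespace has an empty name
    # rpartition returns ('', '', page) when there is no '/', so sub == page then.
    base, slash, sub = page.rpartition("/")
    return {
        "FULLPAGENAME": full_title,
        "NAMESPACE": namespace,
        "PAGENAME": page,
        "BASEPAGENAME": base if slash else page,
        "SUBPAGENAME": sub,
        "FULLPAGENAMEE": quote(full_title.replace(" ", "_"), safe="/:"),
    }


if __name__ == "__main__":
    for key, value in page_variables("User:OrenBochman/ParserNG/Preprocessor").items():
        print(key, "=", value)
```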

  • {{SITENAME}} (MediaWiki)
  • {{SERVER}} (//www.mediawiki.org)
  • {{SERVERNAME}} (www.mediawiki.org)
  • {{SCRIPTPATH}} (/w)
  • {{CURRENTVERSION}} (current MediaWiki version)
  • {{REVISIONID}} (latest revision to current page)
  • {{REVISIONDAY}}, {{REVISIONDAY2}}, {{REVISIONMONTH}}, {{REVISIONYEAR}}, {{REVISIONTIMESTAMP}}, {{REVISIONUSER}} (date, time, editor at last edit)
  • {{CURRENTYEAR}}, {{CURRENTMONTH}}, {{CURRENTMONTHNAME}}, {{CURRENTMONTHABBREV}}, {{CURRENTDAY}}, {{CURRENTDAY2}}, {{CURRENTDOW}}, {{CURRENTDAYNAME}}, {{CURRENTTIME}}, {{CURRENTHOUR}}, {{CURRENTWEEK}}, {{CURRENTTIMESTAMP}} (current date/time variables)
  • {{LOCALYEAR}} etc. (as above, based on site's local time)
  • {{NUMBEROFPAGES}}, {{NUMBEROFARTICLES}}, {{NUMBEROFFILES}}, {{NUMBEROFEDITS}}, {{NUMBEROFVIEWS}}, {{NUMBEROFUSERS}}, {{NUMBEROFADMINS}}, {{NUMBEROFACTIVEUSERS}} (statistics on English Wikipedia; add :R to return numbers without commas)

Parser functions

These are documented at the main documentation page unless otherwise stated.

Metadata

  • {{PAGESIZE:page name}} (size of page in bytes)
  • {{PROTECTIONLEVEL:action}} (protection level for given action on the current page)
  • {{PAGESINCATEGORY:categoryname}} (number of pages in the given category)
  • {{NUMBERINGROUP:groupname}} (number of users in a specific group)

Add |R to return numbers without commas.

Formatting

  • {{lc:string}} (convert to lower case)
  • {{lcfirst:string}} (convert first character to lower case)
  • {{uc:string}} (convert to upper case)
  • {{ucfirst:string}} (convert first character to upper case)
  • {{formatnum:unformatted num}} (format a number with comma separators; add |R to unformat a number)
  • {{#formatdate:date|format}} (formats a date according to user preferences; a default can be given as an optional case-sensitive second parameter for users without date preference; can convert a date from an existing format to any of dmy, mdy, ymd or ISO 8601 formats, with the user's preference overriding the specified format)
  • {{padleft:xyz|stringlength}}, {{padright:xyz|stringlength}} (pad with zeros to the right or left; an alternative padding string can be given as a third parameter; the alternative padding string may be truncated if its length does not evenly divide the required number of characters)
  • {{plural:n|is|are}} (produces alternative text according to whether n is greater than 1)
  • {{#time:format string|date/time object}} (for date/time formatting; also #timel for local time. Covered at the extension documentation page.)
  • {{gender:username|masculine|female|neutral}} (produces alternative text according to the gender specified by the given user in his/her preferences)
  • {{#tag:tagname|content|parameter1=value1|parameter2=value2}} (equivalent to an HTML tag or pair of tags; can be used for nesting references)
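The simpler formatting functions are easy to evaluate directly once a template call has been split into a name and arguments. The Python sketch below covers only lc, lcfirst, uc, ucfirst, padleft and padright, following the descriptions in the list above. The dispatcher call_function is a hypothetical helper, and trimming whitespace around arguments is an assumption of this sketch.

```python
def lc(s):
    return s.lower()


def uc(s):
    return s.upper()


def lcfirst(s):
    return s[:1].lower() + s[1:]


def ucfirst(s):
    return s[:1].upper() + s[1:]


def _padding(s, length, pad):
    """Build the filler needed to bring s up to length, truncating the last repeat."""
    needed = length - len(s)
    if needed <= 0 or not pad:
        return ""
    repeats = needed // len(pad) + 1
    return (pad * repeats)[:needed]


def padleft(s, length, pad="0"):
    return _padding(s, length, pad) + s


def padright(s, length, pad="0"):
    return s + _padding(s, length, pad)


FUNCTIONS = {
    "lc": lc, "lcfirst": lcfirst, "uc": uc, "ucfirst": ucfirst,
    "padleft": padleft, "padright": padright,
}


def call_function(name, *raw_args):
    """Apply a formatting function to string arguments taken from wikitext."""
    args = [a.strip() for a in raw_args]
    if name in ("padleft", "padright"):
        args[1] = int(args[1])   # the length arrives as text
    return FUNCTIONS[name](*args)


if __name__ == "__main__":
    print(call_function("uc", " Heavens to BETSY! "))
    print(call_function("padleft", "xyz", "5"))
```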

Paths

  • {{localurl:page name}}, {{localurl:page name|query string}} (relative path to the title)
  • {{fullurl:page name}}, {{fullurl:page name|query_string}} (absolute path to the title, without a protocol prefix)
  • {{canonicalurl:page name}}, {{canonicalurl:page name|query_string}} (absolute path to the title, with a protocol prefix)
  • {{filepath:file name}} (absolute URL to a media file)
  • {{urlencode:string}} (input encoded for use in URLs)
  • {{anchorencode:string}} (input encoded for use in URL section anchors)
  • {{ns:n}} (name for the namespace with index n; use {{nse:}} for the equivalent encoded for MediaWiki URLs)
  • {{#rel2abs: path }} (converts a relative file path to absolute; see the extension documentation)
  • {{#titleparts: pagename | number of segments to return | first segment to return }} (splits title into parts; see the extension documentation)

Conditional expressions

caveats:

  1. it is based on identifiers, not general strings
  2. tested input is limited
  3. it should support <code><onlyinclude>, <noinclude>, <includeonly></code> type tags
  4. it should support comments <!-- --> and <nowiki>...</nowiki>
  5. math support can be improved to respect operator precedence
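On the last caveat, one way to get precedence is precedence climbing instead of the flat mathExpr rule. The Python sketch below is an illustration only: the precedence levels are my reading of the ParserFunctions #expr ordering and should be treated as an assumption, all binary operators are treated as left-associative, and evaluate and BINARY are made-up names.

```python
import math
import operator
import re

# Binary operators: token -> (precedence, function). Higher binds tighter.
BINARY = {
    "or": (1, lambda a, b: float(bool(a) or bool(b))),
    "and": (2, lambda a, b: float(bool(a) and bool(b))),
    "=": (3, lambda a, b: float(a == b)),
    "!=": (3, lambda a, b: float(a != b)),
    "<>": (3, lambda a, b: float(a != b)),
    "<": (3, lambda a, b: float(a < b)),
    ">": (3, lambda a, b: float(a > b)),
    "<=": (3, lambda a, b: float(a <= b)),
    ">=": (3, lambda a, b: float(a >= b)),
    "round": (4, lambda a, b: round(a, int(b))),
    "+": (5, operator.add),
    "-": (5, operator.sub),
    "*": (6, operator.mul),
    "/": (6, operator.truediv),
    "mod": (6, math.fmod),
    "^": (7, operator.pow),
}
CONSTANTS = {"pi": math.pi, "e": math.e}
TOKEN_RE = re.compile(
    r"\s*(\d+\.?\d*(?:[eE][+-]?\d+)?|\.\d+|<>|!=|<=|>=|[a-z]+|[-+*/^=<>()])"
)


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    """Evaluate a math expression with operator precedence."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def atom():
        tok = advance()
        if tok == "(":
            value = expr(1)
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if tok == "-":
            return -atom()
        if tok == "not":
            return float(not atom())
        if tok in CONSTANTS:
            return CONSTANTS[tok]
        return float(tok)   # raises ValueError for unknown words

    def expr(min_prec):
        left = atom()
        while peek() in BINARY and BINARY[peek()][0] >= min_prec:
            prec, fn = BINARY[advance()]
            right = expr(prec + 1)   # prec + 1 makes the operator left-associative
            left = fn(left, right)
        return left

    result = expr(1)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result


if __name__ == "__main__":
    print(evaluate("( pi * 4 ^ 2 ) round 3"))
    print(evaluate("1 + 2 * 3"))
```

The same table-driven idea could be moved into the grammar by splitting mathExpr into one rule per precedence level, at the cost of a longer grammar.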