User:OrenBochman/ParserNG/Preprocessor

This is a wikitext preprocessor written in ANTLR. It is aimed at parsing for search rather than parsing for rendering. It supports recursive templates and math expressions in templates, some parser functions, and some magic words.
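As a rough illustration of what recursive template expansion involves, here is a minimal Python sketch. It is not the ANTLR grammar itself: the `TEMPLATES` table, the `expand` function, and the choice to drop unknown templates are all assumptions made for the example.

```python
import re

# Hypothetical template store; on a real wiki these would be page texts.
TEMPLATES = {
    "greet": "Hello, {{{1}}}!",
    "twice": "{{greet|{{{1}}}}} {{greet|{{{1}}}}}",
}

PARAM = re.compile(r"\{\{\{(\w+)\}\}\}")   # {{{1}}} style parameter
CALL = re.compile(r"\{\{([^{}]*)\}\}")     # innermost {{name|arg|...}} call


def expand(text, depth=0, max_depth=20):
    """Expand template calls innermost-first until nothing changes."""
    if depth > max_depth:
        raise RecursionError("template expansion too deep")

    def replace_call(match):
        name, *args = [part.strip() for part in match.group(1).split("|")]
        body = TEMPLATES.get(name)
        if body is None:
            return ""  # assumption: unknown templates are dropped for search
        bound = {str(i): arg for i, arg in enumerate(args, start=1)}
        # Bind positional parameters first, then expand calls inside the body.
        body = PARAM.sub(lambda p: bound.get(p.group(1), ""), body)
        return expand(body, depth + 1, max_depth)

    previous = None
    while previous != text:
        previous = text
        text = CALL.sub(replace_call, text)
    return text


print(expand("{{twice|x}}"))
```

The depth limit is there only to stop a template that calls itself from recursing forever.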

based on
 * http://en.wikipedia.org/wiki/Help:Calculation

test input
/* basic */ aa cc

/* parametrised */

/* nested */

/* core parser functions */ HEAVENS TO BETSY! heavens to betsy!

/* ext parser functions */

//////////////////////////////////////////////////untested ///////////////////////////////////////////////////////////////

/* magic words behavior switches*/ –  –   –  –

/* variables */

=untested on=

Behavior switches

 *  __NOTOC__  (can appear anywhere in the wikitext; suppresses the table of contents)
 *  __FORCETOC__  (can appear anywhere in the wikitext; makes a table of contents appear in its normal position above the first header)
 *  __TOC__  (places a table of contents at the word's position)


 *  __NOEDITSECTION__  (hides the section "edit" links beside all headings on the page) (use tags to hide the edit link for one header only)
 *  __NEWSECTIONLINK__  (adds a "+" link for adding a new section on a non-"Talk" page)
 *  __NONEWSECTIONLINK__  (removes the "+" link on "Talk" pages)


 *  __NOGALLERY__  (on a category page, replaces thumbnails with normal links)
 *  __HIDDENCAT__  (on a category page, makes it a hidden category)
 *  __INDEX__  (tells search engines to index the page)
 *  __NOINDEX__  (tells search engines not to index the page)


 *  {{DISPLAYTITLE:title}}  (changes the displayed form of the page title)
 *  {{DEFAULTSORT:sortkey}}  (sets a default category sort key)
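For search indexing, behavior switches carry no text of their own, so one plausible treatment is to record them and strip them from the output. The sketch below assumes that every switch is an upper-case word between double underscores; the function name `strip_switches` is hypothetical.

```python
import re

# Assumption: a behavior switch is an upper-case word wrapped in double underscores.
SWITCH = re.compile(r"__([A-Z]+)__")


def strip_switches(wikitext):
    """Return (text without switches, set of switch names that were found)."""
    found = set(SWITCH.findall(wikitext))
    cleaned = SWITCH.sub("", wikitext)
    return cleaned, found


text, switches = strip_switches("__NOINDEX__Some text __HIDDENCAT__here")
print(text)
print(sorted(switches))
```

Keeping the set of names allows a later stage to act on them, for example skipping a page marked `__NOINDEX__`.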

Variables
For documentation, refer to the Variables section of the MediaWiki page.
 *  {{FULLPAGENAME}}  (page title including namespace)
 *  {{PAGENAME}}  (page title excluding namespace)
 *  {{BASEPAGENAME}}  (page title excluding current subpage and namespace; effectively the parent page without the namespace)
 *  {{SUBPAGENAME}}  (subpage part of title)
 *  {{SUBJECTPAGENAME}}  (associated non-talk page)
 *  {{TALKPAGENAME}}  (associated talk page)
 *  {{NAMESPACE}}  (namespace of current page)
 *  {{SUBJECTSPACE}}, {{ARTICLESPACE}}  (associated non-talk namespace)
 *  {{TALKSPACE}}  (associated talk namespace)
 *  {{FULLPAGENAMEE}}, {{NAMESPACEE}}  etc. (equivalents encoded for use in MediaWiki URLs)

The above can all take a parameter, to operate on a page other than the current page.
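The page-name variables can be derived from a full title by splitting on the namespace colon and the last slash. The sketch below is an approximation under stated assumptions: `KNOWN_NAMESPACES` is a hypothetical subset, subpages are treated as enabled everywhere, and the dictionary keys follow MediaWiki's variable names.

```python
# Hypothetical subset of namespaces; a real wiki defines its own list.
KNOWN_NAMESPACES = {"User", "Help", "Template", "Talk"}


def page_variables(full_title):
    """Split a full page title into the page-name variables listed above."""
    prefix, colon, rest = full_title.partition(":")
    if colon and prefix in KNOWN_NAMESPACES:
        namespace, pagename = prefix, rest
    else:
        namespace, pagename = "", full_title  # main namespace
    if "/" in pagename:
        base, _, sub = pagename.rpartition("/")
    else:
        base = sub = pagename  # no subpage: both fall back to the page name
    return {
        "FULLPAGENAME": full_title,
        "NAMESPACE": namespace,
        "PAGENAME": pagename,
        "BASEPAGENAME": base,
        "SUBPAGENAME": sub,
    }


for key, value in page_variables("User:OrenBochman/ParserNG/Preprocessor").items():
    print(key, "=", value)
```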


 *  </tt>
 *  </tt>
 * <tt> </tt>
 * <tt> </tt>
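 * <tt>{{CURRENTVERSION}}</tt> (current MediaWiki version)
 * <tt>{{REVISIONID}}</tt> (latest revision to current page)
 * <tt>{{REVISIONDAY}}, {{REVISIONDAY2}}, {{REVISIONMONTH}}, {{REVISIONYEAR}}, {{REVISIONTIMESTAMP}}, {{REVISIONUSER}}</tt> (date, time, editor at last edit)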


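 * <tt>{{CURRENTYEAR}}, {{CURRENTMONTH}}, {{CURRENTMONTHNAME}}, {{CURRENTMONTHABBREV}}, {{CURRENTDAY}}, {{CURRENTDAY2}}, {{CURRENTDOW}}, {{CURRENTDAYNAME}}, {{CURRENTTIME}}, {{CURRENTHOUR}}, {{CURRENTWEEK}}, {{CURRENTTIMESTAMP}}</tt> (current date/time variables)
 * <tt>{{LOCALYEAR}}</tt> etc. (as above, based on site's local time)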


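 * <tt>{{NUMBEROFPAGES}}, {{NUMBEROFARTICLES}}, {{NUMBEROFFILES}}, {{NUMBEROFEDITS}}, {{NUMBEROFVIEWS}}, {{NUMBEROFUSERS}}, {{NUMBEROFADMINS}}, {{NUMBEROFACTIVEUSERS}}</tt> (statistics on English Wikipedia; add <tt>:R</tt> to return numbers without commas)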

Parser functions
These are documented at the main documentation page unless otherwise stated.

Metadata

 * <tt>{{PAGESIZE:pagename}}</tt> (size of page in bytes)
 * <tt>{{PROTECTIONLEVEL:action}}</tt> (protection level for given action on the current page)
 * <tt>{{PAGESINCATEGORY:categoryname}}</tt> (number of pages in the given category)
 * <tt>{{NUMBERINGROUP:groupname}}</tt> (number of users in a specific group)

Add <tt>|R</tt> to return numbers without commas.

Formatting

 * <tt>{{lc:string}}</tt> (convert to lower case)
 * <tt>{{lcfirst:string}}</tt> (convert first character to lower case)
 * <tt>{{uc:string}}</tt> (convert to upper case)
 * <tt>{{ucfirst:string}}</tt> (convert first character to upper case)
 * <tt>{{formatnum:number}}</tt> (format a number with comma separators; add <tt>|R</tt> to unformat a number)
 * <tt>{{#dateformat:date|format}}</tt> (formats a date according to user preferences; a default can be given as an optional case-sensitive second parameter for users without date preference; can convert a date from an existing format to any of <tt>dmy</tt>, <tt>mdy</tt>, <tt>ymd</tt> or <tt>ISO 8601</tt> formats, with the user's preference overriding the specified format)
 * <tt>{{padleft:xyz|stringlength}}</tt>, <tt>{{padright:xyz|stringlength}}</tt> (pad with zeros to the left or right; an alternative padding string can be given as a third parameter; the alternative padding string may be truncated if its length does not evenly divide the required number of characters)
 * <tt>{{plural:n|is|are}}</tt> (produces alternative text according to whether n is greater than 1)
 * <tt>{{#time:format string|date/time object}}</tt> (for date/time formatting; also <tt>#timel</tt> for local time. Covered at the extension documentation page.)
 * <tt>{{gender:username|masculine text|feminine text|neutral text}}</tt> (produces alternative text according to the gender specified by the given user in his/her preferences)
 * <tt>{{#tag:tagname|content|parameter=value}}</tt> (equivalent to an HTML tag or pair of tags; can be used for nesting references)
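The case and padding functions are simple enough to sketch directly. The Python below is an illustration under assumptions, not MediaWiki's implementation: the function names mirror the parser functions, and the padding string is assumed to be repeated and then cut to the number of characters still needed, which is how the truncation described above would arise.

```python
def lc(text):
    return text.lower()


def uc(text):
    return text.upper()


def lcfirst(text):
    return text[:1].lower() + text[1:]


def ucfirst(text):
    return text[:1].upper() + text[1:]


def _filler(value, length, pad):
    """Repeat `pad` and cut it to the number of characters still needed."""
    missing = length - len(value)
    if missing <= 0 or not pad:
        return ""
    return (pad * missing)[:missing]


def padleft(value, length, pad="0"):
    return _filler(value, length, pad) + value


def padright(value, length, pad="0"):
    return value + _filler(value, length, pad)


print(uc("heavens to betsy!"))
print(padleft("xyz", 5))
print(padright("xyz", 6, "ab"))
```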

Paths

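 * <tt>{{localurl:page name}}</tt>, <tt>{{localurl:page name|query string}}</tt> (relative path to the title)
 * <tt>{{fullurl:page name}}</tt>, <tt>{{fullurl:page name|query string}}</tt> (absolute path to the title, without a protocol prefix)
 * <tt>{{canonicalurl:page name}}</tt>, <tt>{{canonicalurl:page name|query string}}</tt> (absolute path to the title, with a protocol prefix)
 * <tt>{{filepath:file name}}</tt> (absolute URL to a media file)
 * <tt>{{urlencode:string}}</tt> (input encoded for use in URLs)
 * <tt>{{anchorencode:string}}</tt> (input encoded for use in URL section anchors)
 * <tt>{{ns:n}}</tt> (name for the namespace with index n; use <tt>{{nse:n}}</tt> for the equivalent encoded for MediaWiki URLs)
 * <tt>{{#rel2abs:path}}</tt> (converts a relative file path to absolute; see the extension documentation)
 * <tt>{{#titleparts:pagename|number of segments|first segment}}</tt> (splits title into parts; see the extension documentation)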

Conditional expressions
These are covered at the extension documentation page. Some parameters are optional.
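 * <tt>{{#expr:expression}}</tt> (evaluates the given expression; see Help:Calculation)
 * <tt>{{#if:test string|value if non-empty|value if empty}}</tt> (selects one of two values based on whether the test string is empty)
 * <tt>{{#ifeq:string 1|string 2|value if equal|value if unequal}}</tt> (selects one of two values based on whether the test strings are equal; numerically if applicable)
 * <tt>{{#iferror:test string|value if error|value if correct}}</tt> (selects value based on whether the test string generates a parser error)
 * <tt>{{#ifexpr:expression|value if true|value if false}}</tt> (selects value based on evaluation of expression)
 * <tt>{{#ifexist:page title|value if exists|value if doesn't exist}}</tt> (selects value depending on whether a page title exists)
 * <tt>{{#switch:test string|case1=value for case 1|default}}</tt> (provides alternatives based on the value of the test string)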
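The string-based conditionals above reduce to small pure functions once their arguments have been expanded. The following Python sketch shows one possible reading of the descriptions; the function names are hypothetical, and the numeric comparison is an assumption based on the "numerically if applicable" wording.

```python
def parser_if(test, then="", otherwise=""):
    """Pick `then` when the test string is non-empty after trimming."""
    return then if test.strip() else otherwise


def parser_ifeq(left, right, then="", otherwise=""):
    """Compare numerically when both sides parse as numbers, else as strings."""
    left, right = left.strip(), right.strip()
    try:
        equal = float(left) == float(right)
    except ValueError:
        equal = left == right
    return then if equal else otherwise


def parser_switch(test, cases, default=""):
    """Return the value for the matching case, or the default."""
    return cases.get(test.strip(), default)


print(parser_if("  ", "non-empty", "empty"))
print(parser_ifeq("01", "1", "same", "different"))
print(parser_switch("b", {"a": "first", "b": "second"}, "other"))
```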

Caveats:

 * 1) It is based on identifiers, not general strings.
 * 2) Tested input is limited.
 * 3) It should support type tags.
 * 4) It should support comments and &lt;nowiki&gt;.
 * 5) Math support can be improved to consider operator precedence.
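For the precedence caveat, one common approach is precedence climbing, sketched below in Python rather than as an ANTLR grammar. It is an illustration only: it handles just the four basic operators, unary minus, and parentheses, and the names `tokenize` and `evaluate` are made up for the example.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")
# operator -> (precedence, function); higher precedence binds tighter
BINARY = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}


def tokenize(expression):
    expression = expression.rstrip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression):
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_atom():
        token = advance()
        if token == "(":
            value = parse_expr(1)
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token == "-":  # unary minus
            return -parse_atom()
        if token is None or token in BINARY or token == ")":
            raise ValueError("operand expected")
        return float(token)

    def parse_expr(min_prec):
        left = parse_atom()
        # Consume operators that bind at least as tightly as min_prec.
        while peek() in BINARY and BINARY[peek()][0] >= min_prec:
            prec, func = BINARY[advance()]
            right = parse_expr(prec + 1)  # prec + 1 gives left associativity
            left = func(left, right)
        return left

    result = parse_expr(1)
    if peek() is not None:
        raise ValueError("unexpected trailing input")
    return result


print(evaluate("1 + 2 * 3"))    # 7.0
print(evaluate("(1 + 2) * 3"))  # 9.0
```

In a grammar the same effect is usually obtained by layering one rule per precedence level, with the lowest-precedence operators in the outermost rule.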