Extension:ParserFunctions/String functions

This extension defines an additional set of parser functions that operate on strings.

Note: These functions are currently not installed on the Wikimedia projects.

Functions
This module defines these functions: len, pos, rpos, sub, pad, replace, explode, urlencode, and urldecode.

Note: Some parameters of these functions can be limited through global settings to ensure the functions operate in O(n) time complexity, and are therefore safe against DoS attacks. See section Limits below.

#len:
The #len function returns the length of the given string. The syntax is: <tt>{{#len:string}}</tt>

The return value is always the number of characters in the string. If no string is specified, the return value is zero.

Note: Trailing spaces are not counted. Example: <tt>{{#len:Icecream }}</tt> returns 8.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#len:Žmržlina}}</tt> returns 8.
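
The two notes above can be modelled in a few lines. This is a minimal Python sketch of the described behavior, not the extension's PHP code; the name `wiki_len` is hypothetical, and the whitespace trimming is assumed to come from the parser before the function runs.

```python
def wiki_len(value: str = "") -> int:
    """Model of #len: count characters, ignoring surrounding whitespace."""
    # The parser is assumed to trim the argument first, which is why
    # trailing spaces do not contribute to the result.
    return len(value.strip())


# Characters are counted, not bytes: "Žmržlina" has 8 characters
# but occupies 10 bytes in UTF-8.
print(wiki_len("Icecream "), wiki_len("Žmržlina"), len("Žmržlina".encode("utf-8")))
```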

#pos:
The #pos function returns the position of a given needle within the string. The syntax is: <tt>{{#pos:string|needle|offset}}</tt>

The offset parameter, if specified, sets the position at which the search begins.

If the needle is found, the return value is the zero-based position of its first occurrence within the string. If the needle is not found, the function returns an empty string.

Note: This function is case sensitive.

Note: The maximum allowed length of the needle is limited by the $wgStringFunctionsLimitSearch global setting.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#pos:Žmržlina|lina}}</tt> returns 4.
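
As a rough illustration of the return values described above, the following Python sketch models #pos. The name `wiki_pos` is hypothetical, and the result is returned as a string so that "not found" can be the empty string.

```python
def wiki_pos(haystack: str, needle: str, offset: int = 0) -> str:
    """Model of #pos: zero-based first position, or "" when not found."""
    index = haystack.find(needle, offset)  # case-sensitive, counts characters
    return "" if index < 0 else str(index)


print(wiki_pos("Žmržlina", "lina"))    # position counted in characters
print(wiki_pos("Icecream", "e", 3))    # search starts at offset 3
print(repr(wiki_pos("Icecream", "x"))) # empty string when not found
```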

#rpos:
The #rpos function returns the last position of a given needle within the string. The syntax is: <tt>{{#rpos:string|needle}}</tt>

If the needle is found, the return value is the zero-based position of its last occurrence within the string. If the needle is not found, the function returns -1.

Tip: When using this to search for the last delimiter, add 1 to the result to retrieve the position just after the last delimiter. This also works when the delimiter is not found, because -1 + 1 is zero, which is the beginning of the given value.

Note: This function is case sensitive.

Note: The maximum allowed length of the needle is limited by the $wgStringFunctionsLimitSearch global setting.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#rpos:Žmržlina|lina}}</tt> returns 4.

Note: When this extension is running on PHP 4, the needle can only be a single character. If a string is used as the needle, only the first character of that string will be used. The needle may be a string of more than one character as of PHP 5.0.0.
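
The tip about adding 1 to the result can be sketched as follows. This is a Python model of the described behavior with hypothetical names (`wiki_rpos`, `after_last_delimiter`); it assumes a single-character delimiter, since adding 1 only skips one character.

```python
def wiki_rpos(haystack: str, needle: str) -> int:
    """Model of #rpos: zero-based last position, or -1 when not found."""
    return haystack.rfind(needle)


def after_last_delimiter(value: str, delimiter: str) -> str:
    """Return the part of value that follows the last delimiter."""
    # When the delimiter is absent, -1 + 1 == 0, so the whole value is kept.
    start = wiki_rpos(value, delimiter) + 1
    return value[start:]


print(after_last_delimiter("Help:Contents/Editing", "/"))
print(after_last_delimiter("Editing", "/"))
```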

#sub:
The #sub function returns a substring from the given string. The syntax is: <tt>{{#sub:string|start|length}}</tt>

The start parameter, if positive (or zero), specifies a zero-based index of the first character to be returned. Example: <tt>{{#sub:Icecream|3}}</tt> returns <tt>cream</tt>.

If the start parameter is negative, it specifies how many characters from the end should be returned. Example: <tt>{{#sub:Icecream|-3}}</tt> returns <tt>eam</tt>.

The length parameter, if present and positive, specifies the maximum length of the returned string. Example: <tt>{{#sub:Icecream|3|3}}</tt> returns <tt>cre</tt>.

If the length parameter is negative, it specifies how many characters will be omitted from the end of the string. Example: <tt>{{#sub:Icecream|3|-3}}</tt> returns <tt>cr</tt>.

Note: If the length parameter is zero, it is not used for truncation at all. Example: <tt>{{#sub:Icecream|3|0}}</tt> returns <tt>cream</tt>.

Note: If start denotes a position beyond the truncation from the end by a negative length parameter, an empty string will be returned. Example: <tt>{{#sub:Icecream|3|-6}}</tt> returns an empty string.

The return value is a substring from the string, or an empty string.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#sub:Žmržlina|3}}</tt> returns <tt>žlina</tt>.
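
The rules for start and length above map closely onto Python slicing, which makes them easy to check. The sketch below is a model of the described behavior under that assumption, not the extension's code; `wiki_sub` is a hypothetical name.

```python
def wiki_sub(value: str, start: int = 0, length: int = 0) -> str:
    """Model of #sub using the start/length rules described above."""
    # A non-negative start is a zero-based index; a negative start
    # counts characters from the end of the string.
    tail = value[start:]
    # A zero length means "no truncation". A positive length caps the
    # result; a negative length drops that many characters from the end.
    return tail if length == 0 else tail[:length]


for args in [(3,), (-3,), (3, 3), (3, -3), (3, 0), (3, -6)]:
    print(args, repr(wiki_sub("Icecream", *args)))
```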

#pad:
The #pad function returns the given string extended to a given width. The syntax is: <tt>{{#pad:string|length|padstring|direction}}</tt>

The length parameter specifies the desired length of the returned string.

The padstring parameter, if specified, gives a pattern used to fill the missing space. It may be a single character, which will be used as many times as necessary, or a string, which will be concatenated as many times as necessary and then trimmed to the required length. Example: <tt>{{#pad:Ice|10|Xx}}</tt> returns <tt>XxXxXxXIce</tt>.

If the padstring is not specified, spaces are used for padding.

The direction parameter, if specified, can be one of these values:
 * <tt>left</tt> - the padding will be on the left side of the string. Example: <tt>{{#pad:Ice|5|x|left}}</tt> returns <tt>xxIce</tt>.
 * <tt>right</tt> - the padding will be on the right side of the string. Example: <tt>{{#pad:Ice|5|x|right}}</tt> returns <tt>Icexx</tt>.
 * <tt>center</tt> - the string will be centered in the returned string. Example: <tt>{{#pad:Ice|5|x|center}}</tt> returns <tt>xIcex</tt>.

If the direction is not specified, the padding will be on the left side of the string.

The return value is the given string extended to length characters, using the padstring to fill the missing part(s). If the given string is already longer than length, it is neither extended nor truncated.

Note: The maximum allowed value for the length is limited by the $wgStringFunctionsLimitPad global setting.

Note: This function is NOT safe with UTF-8 multibyte characters (i.e. non-English characters), because the width is measured in bytes. Example: <tt>{{#pad:Žmržlina|12|x}}</tt> returns <tt>xxŽmržlina</tt> (instead of <tt>xxxxŽmržlina</tt>).
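
To make the padding rules and the multibyte caveat concrete, here is a Python model of the described behavior. The names `wiki_pad` and `utf8_bytes` and the `measure` parameter are hypothetical; `measure` only exists to contrast character counting with the byte counting that would explain the last note. For center padding the sketch puts any odd leftover character on the right, which is an assumption.

```python
def utf8_bytes(text: str) -> int:
    """Width of text measured in UTF-8 bytes rather than characters."""
    return len(text.encode("utf-8"))


def wiki_pad(value, length, padstring=" ", direction="left", measure=len):
    """Model of #pad: extend value to the requested width."""
    missing = length - measure(value)
    if missing <= 0 or not padstring:
        return value  # already wide enough: neither extended nor truncated

    def fill(count):
        # Repeat the pattern as often as needed, then trim to size.
        return (padstring * count)[:count]

    if direction == "right":
        return value + fill(missing)
    if direction == "center":
        left = missing // 2
        return fill(left) + value + fill(missing - left)
    return fill(missing) + value  # default: pad on the left


print(wiki_pad("Ice", 10, "Xx"))
print(wiki_pad("Žmržlina", 12, "x"))                      # counting characters
print(wiki_pad("Žmržlina", 12, "x", measure=utf8_bytes))  # counting bytes
```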

#replace:
The #replace function returns the given string with all occurrences of a needle replaced with product. The syntax is: <tt>{{#replace:string|needle|product}}</tt>

If the needle is not specified, or is empty, a single space will be searched for.

If the product is not specified, or is empty, all occurrences of the needle will be removed from the string.

Note: This function is case sensitive.

Note: If the product is a bare space, an empty string is used instead. This is a side effect of the MediaWiki parser, which trims whitespace from parameters. To use a space as the product, put it inside nowiki tags. Example: <tt>{{#replace:My_little_home_page|_|<nowiki> </nowiki>}}</tt> returns <tt>My little home page</tt>.

Note: The maximum allowed length of the needle is limited by the $wgStringFunctionsLimitSearch global setting.

Note: The maximum allowed length of the product is limited by the $wgStringFunctionsLimitReplace global setting.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#replace:Žmržlina|ž|z}}</tt> returns <tt>Žmrzlina</tt>.
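
The defaults for needle and product can be modelled as below. This is a Python sketch of the described behavior with a hypothetical name, `wiki_replace`; it receives its arguments as the function would see them after the parser has trimmed them, so passing a literal space here corresponds to protecting the space with nowiki tags in wikitext.

```python
def wiki_replace(value: str, needle: str = "", product: str = "") -> str:
    """Model of #replace: replace every occurrence of needle with product."""
    if needle == "":
        needle = " "  # an empty or missing needle searches for a single space
    # An empty product simply removes every occurrence of the needle.
    return value.replace(needle, product)  # case-sensitive


print(wiki_replace("My_little_home_page", "_", " "))
print(wiki_replace("Žmržlina", "ž", "z"))
print(wiki_replace("a b c"))
```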

#explode:
The #explode function splits the given string into pieces and then returns one of the pieces. The syntax is: <tt>{{#explode:string|delimiter|position}}</tt>

The delimiter parameter specifies a string to be used to divide the string into pieces. This delimiter string is then not part of any piece, and when two delimiter strings are next to each other, they create an empty piece between them. If this parameter is not specified, a single space is used.

The position parameter specifies which piece is to be returned. Pieces are counted from 0. If this parameter is not specified, the first piece is used (piece with number 0).

The return value is the position-th piece. This may be an empty string. If there are fewer pieces than the position specifies, an empty string is returned.

Note: This function is case sensitive.

Note: The maximum allowed length of the delimiter is limited by the $wgStringFunctionsLimitSearch global setting.

Note: This function is safe with UTF-8 multibyte characters. Example: <tt>{{#explode:Žmržlina|l}}</tt> returns <tt>Žmrž</tt>.
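
The piece-counting rules above, including the empty piece between adjacent delimiters, can be sketched like this. It is a Python model with a hypothetical name, `wiki_explode`; negative positions are not described in this section, so the sketch assumes they yield an empty string.

```python
def wiki_explode(value: str, delimiter: str = " ", position: int = 0) -> str:
    """Model of #explode: split value and return the piece at position."""
    pieces = value.split(delimiter or " ")  # missing delimiter means a space
    if 0 <= position < len(pieces):
        return pieces[position]
    return ""  # fewer pieces than requested (or, by assumption, negative)


print(wiki_explode("one two three", position=1))
print(repr(wiki_explode("a,,b", ",", 1)))  # empty piece between two delimiters
print(repr(wiki_explode("a,b", ",", 5)))   # position past the last piece
```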

#urlencode: and #urldecode:
These two functions operate in tandem: #urlencode converts a string into a URL-safe syntax, and #urldecode converts such a string back. The syntax is: <tt>{{#urlencode:string}}</tt> and <tt>{{#urldecode:string}}</tt>

Note: These functions work by directly exposing PHP's urlencode and urldecode functions.
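
For readers who want to experiment outside PHP, the round trip can be approximated with Python's standard library. This is only an analogue: `urllib.parse.quote_plus` and `unquote_plus` encode spaces as `+` like PHP's urlencode, but the two differ on a few characters (for example, Python leaves `~` unencoded), so results are not guaranteed to match the extension's output exactly.

```python
from urllib.parse import quote_plus, unquote_plus

original = "Ice cream & žlina"

encoded = quote_plus(original)  # spaces become "+", other unsafe bytes "%XX"
decoded = unquote_plus(encoded)

print(encoded)
print(decoded == original)
```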

Pipes
To use a pipe ( "|" ) in the parameters of these functions, use the &#124; sequence, which is translated to a regular pipe before these functions are executed, but only after the control pipes are established by the parser. This way the pipe is treated as part of the string, not as wiki syntax.
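
The order of operations matters here: parameters are separated first, and the character reference is decoded afterwards. The Python sketch below models only that ordering; `split_parameters` is a hypothetical helper, and the real parser is considerably more involved.

```python
import html


def split_parameters(raw: str) -> list:
    """Split on control pipes first, then decode character references."""
    parts = raw.split("|")                        # control pipes delimit parameters
    return [html.unescape(part) for part in parts]  # &#124; becomes "|" afterwards


# Two parameters: the string "a|b|c" and the needle "|".
print(split_parameters("a&#124;b&#124;c|&#124;"))
```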

Spaces
To use a space in the parameters of these functions (at the beginning or end of a parameter), you might need to put the space inside <nowiki></nowiki> tags. This prevents the parser from trimming the space off before calling the functions.

Limits
This module defines these global settings: <tt>$wgStringFunctionsLimitSearch</tt>, <tt>$wgStringFunctionsLimitReplace</tt>, <tt>$wgStringFunctionsLimitPad</tt>.

They are used to limit some parameters of some functions to ensure the functions operate in O(n) time complexity, and are therefore safe against DoS attacks.

$wgStringFunctionsLimitSearch
This setting is used by the #pos:, #rpos:, #replace:, and #explode: functions. All these functions search for a substring in a larger string while they operate, which can run in O(n*m) time and therefore makes the software more vulnerable to DoS attacks. By setting this value to a specific small number, the time complexity is decreased to O(n).

This setting limits the maximum allowed length of the string being searched for.

The default value is 30 multibyte characters.

$wgStringFunctionsLimitReplace
This setting is used by the #replace: function. This function replaces all occurrences of one string with another, which can be used to quickly generate very large amounts of data and can therefore make the software more vulnerable to DoS attacks. This setting limits the maximum allowed length of the replacing string.

The default value is 30 multibyte characters.

$wgStringFunctionsLimitPad
This setting is used by the #pad: function. This function creates a string of the specified length, which can be used to quickly generate very large amounts of data and can therefore make the software more vulnerable to DoS attacks. This setting limits the maximum allowed length of the resulting padded string.

The default value is 100 multibyte characters.
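
The three limits and their defaults can be summarized as a simple length check. The Python sketch below is a model only: `LIMITS` and `violated_limits` are hypothetical names, and this section does not say what the extension does when a limit is exceeded, so the sketch merely reports which settings a call would violate.

```python
# Default values as documented above, measured in characters.
LIMITS = {
    "$wgStringFunctionsLimitSearch": 30,
    "$wgStringFunctionsLimitReplace": 30,
    "$wgStringFunctionsLimitPad": 100,
}


def violated_limits(needle="", product="", pad_length=0, limits=LIMITS):
    """Return the names of the settings that the given parameters exceed."""
    violated = []
    if len(needle) > limits["$wgStringFunctionsLimitSearch"]:
        violated.append("$wgStringFunctionsLimitSearch")
    if len(product) > limits["$wgStringFunctionsLimitReplace"]:
        violated.append("$wgStringFunctionsLimitReplace")
    if pad_length > limits["$wgStringFunctionsLimitPad"]:
        violated.append("$wgStringFunctionsLimitPad")
    return violated


print(violated_limits(needle="x" * 31))
print(violated_limits(needle="x", product="y" * 30, pad_length=100))
```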

Requirements
This extension requires MediaWiki 1.6+ and the PHP mbstring extension.

If you do not have mbstring installed, you can do without it by deleting all occurrences of "mb_" (without quotes) in the code, which replaces all mbstring functions (mb_strlen, mb_substr, mb_strpos, mb_strrpos) with their non-mbstring equivalents. After this change the extension will still work, but no longer correctly with UTF-8 multibyte characters (e.g. non-English letters).

Instructions
Do the following to install these functions as an extension to MediaWiki.
 * 1) Copy the source code below to <tt>extensions/StringFunctions/StringFunctions.php</tt>.
 * 2) Add the following to <tt>LocalSettings.php</tt> (near the bottom) in the root of your MediaWiki installation: require_once ("$IP/extensions/StringFunctions/StringFunctions.php");

Compatibility with MediaWiki 1.6
All StringFunctions will work on MediaWiki 1.6, but their syntax is without the # character. If you want to use the # character, find this section of <tt>/extensions/StringFunctions/StringFunctions.php</tt>:
 $wgParser->setFunctionHook ( 'len',      array ( &$wgExtStringFunctions, 'runLen'       ) );
 $wgParser->setFunctionHook ( 'pos',      array ( &$wgExtStringFunctions, 'runPos'       ) );
 $wgParser->setFunctionHook ( 'rpos',     array ( &$wgExtStringFunctions, 'runRPos'      ) );
 $wgParser->setFunctionHook ( 'sub',      array ( &$wgExtStringFunctions, 'runSub'       ) );
 $wgParser->setFunctionHook ( 'pad',      array ( &$wgExtStringFunctions, 'runPad'       ) );
 $wgParser->setFunctionHook ( 'replace',  array ( &$wgExtStringFunctions, 'runReplace'   ) );
 $wgParser->setFunctionHook ( 'explode',  array ( &$wgExtStringFunctions, 'runExplode'   ) );
 $wgParser->setFunctionHook ( 'urlencode', array ( &$wgExtStringFunctions, 'runUrlEncode' ) );
 $wgParser->setFunctionHook ( 'urldecode', array ( &$wgExtStringFunctions, 'runUrlDecode' ) );

Replace the above with this:
 $wgParser->setFunctionHook ( '#len',      array ( &$wgExtStringFunctions, 'runLen'       ) );
 $wgParser->setFunctionHook ( '#pos',      array ( &$wgExtStringFunctions, 'runPos'       ) );
 $wgParser->setFunctionHook ( '#rpos',     array ( &$wgExtStringFunctions, 'runRPos'      ) );
 $wgParser->setFunctionHook ( '#sub',      array ( &$wgExtStringFunctions, 'runSub'       ) );
 $wgParser->setFunctionHook ( '#pad',      array ( &$wgExtStringFunctions, 'runPad'       ) );
 $wgParser->setFunctionHook ( '#replace',  array ( &$wgExtStringFunctions, 'runReplace'   ) );
 $wgParser->setFunctionHook ( '#explode',  array ( &$wgExtStringFunctions, 'runExplode'   ) );
 $wgParser->setFunctionHook ( '#urlencode', array ( &$wgExtStringFunctions, 'runUrlEncode' ) );
 $wgParser->setFunctionHook ( '#urldecode', array ( &$wgExtStringFunctions, 'runUrlDecode' ) );

Code
This code has been tested on MediaWiki 1.6.8 and above.

History:
 * Jan 30, 2007 -- v1.9 -- Added limits to #pos, #pad, #replace, #explode.
 * Oct 30, 2006 -- v1.8 -- Fixed for MediaWiki 1.8.
 * Oct 26, 2006 -- v1.6 -- Fixed spaces in #rpos and #replace.
 * Oct 1, 2006 -- v1.5 -- Added #rpos, #pad, #replace, #explode.
 * May 18, 2006 -- v1.2 -- Renamed #toURL and #fromURL to match MediaWiki's {{urlencode:}} function.
 * May 18, 2006 -- v1.1 -- Added #pos.
 * May 15, 2006 -- v1.0 -- First stable release.

Applications

 * Constructing a suitable sortkey from a page name, in a category tag in the transcluded part of a template.