Extension:RegexParserFunctions

The RegexParserFunctions extension adds a parser function called 'regex' (or 'regexp') which is used to perform regular expression pattern matching and replacement.

The extension is no longer maintained by its original author; it is now maintained by me (User:VitaliyFilippov) as part of the Mediawiki4Intranet project. The null-byte security issue found by User:Pastakhov is fixed in our version, and the extension is compatible with the latest versions of MediaWiki.


 * Git repository: https://github.com/mediawiki4intranet/RegexParserFunctions
 * New homepage: http://wiki.4intra.net/RegexParserFunctions
 * Old homepage: http://jimbojw.com/wiki/index.php?title=RegexParserFunctions_Extension
 * Licensing: RegexParserFunctions is released under The MIT License.

Installation

 * 1) Clone the repository from https://github.com/mediawiki4intranet/RegexParserFunctions into the $IP/extensions/ subdirectory.
 * Note: $IP is your MediaWiki installation directory.
 * 2) Enable the extension by adding this line to your LocalSettings.php:

Usage
Once the extension is installed, editors of your wiki can evaluate regular expressions in one of two ways: simple match or replacement.


 * Simple match:, evaluates to the matching portion of &lt;string&gt; (the behavior is the same as if the missing parameter was  );
 * Replacement:, evaluates to &lt;string&gt; with &lt;regex&gt; replaced by &lt;replacement&gt; globally.

For example, say you're trying to grab the last portion of a Title which is using '/' delimiting subpage notation. For that, you could use:
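The pattern logic behind this use case can be sketched outside the wiki. The following Python snippet is only an analogy: it does not show the parser function's wikitext syntax, and the function names, the sample title and the pattern `[^/]+$` are assumptions chosen for illustration. It shows a simple match that keeps the last '/'-delimited segment, and a global replacement.

```python
import re


def last_subpage(title: str) -> str:
    """Return the text after the final '/' of a title (simple match)."""
    # [^/]+$ matches the longest run of non-slash characters at the end.
    match = re.search(r"[^/]+$", title)
    return match.group(0) if match else ""


def replace_slashes(title: str, replacement: str) -> str:
    """Replace every '/' in the title (global replacement)."""
    return re.sub(r"/", replacement, title)


if __name__ == "__main__":
    sample = "Help/Extensions/RegexParserFunctions"  # hypothetical title
    print(last_subpage(sample))
    print(replace_slashes(sample, " > "))
```

A title without any '/' is returned unchanged by the simple match, and a title ending in '/' yields an empty result because nothing follows the last delimiter.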

The supported regular expressions (in the second parameter) are the same as those implemented in the standard library of the local installation of PHP (used to run MediaWiki).


 * See the PHP documentation of PCRE patterns for details of the supported features and syntax, and for the differences from Perl and POSIX regular expressions (they also differ from the simpler regular expressions built into the Lua extension for MediaWiki).

Note that the regular expression pattern must be enclosed in a pair of delimiters; if the delimiters are missing, an error is currently returned. Within the regular expression itself, the delimiter has to be escaped (with ) to match it literally. The delimiter must be one of,   or   (however, this last character is not easily usable with the MediaWiki syntax, as it would require MediaWiki escaping within a nowiki tag).

The delimited pattern is then optionally followed by one or more letters defining matching options. Currently the supported options are:
 * for case-insensitive matching (without the  option, the recognized case pairs are those from the US-ASCII subset);
 * for multiline matching (the  and   metacharacters in patterns will match only once, respectively at start and end of the whole );
 * for matching newline controls with the pattern  or with character classes;
 * for recognizing non-ASCII characters that are encoded as multibyte sequences with UTF-8 as a single character for  and character classes (this option supported by PHP 4.1+ for Unix and Linux platforms).
 * This option disables some PCRE features that are not compatible with Perl regular expressions. Notably, it will block the PHP regexp extensions that can activate the PHP evaluation of embedded expressions and the conversion of their result into the source, or parameters (see the security notes below).
 * This option modifies the  option to support simple (1-to-1) case pairs defined in the Unicode Character Database (but it does not use the locale-dependent or complex casing rules; it ignores Unicode canonical equivalences, where some letters in a given letter case exist only in decomposed form, but the other letter case exists both in precomposed and decomposed forms which are canonically equivalent).
 * Unicode characters encoded as combining sequences are handled as multiple separate characters: if the Latin letter 'é' is encoded in NFD form (as 3 UTF-8 bytes) in the input string, the pattern  in the regexp will match only the base letter 'e', but not the two UTF-8 bytes after it that encode the combining acute accent; but if the same Latin letter is encoded in NFC form (as 2 UTF-8 bytes) in the input string, the same pattern will match it.
 * There's currently no regexp option for normalizing the input string or pattern into a standard Unicode form (such as NFD in order to match all simple case pairs), and renormalizing the output (for example in NFC form).
 * This option enables the syntax  for specifying between the braces a valid Unicode scalar value in hexadecimal.
 * This option also changes the interpretation of patterns like ..  (in hexadecimal) or  ..  (in octal), which are changed to cover the UTF-8 encoded range of U+0080..U+00FF instead of the range of individual bytes 0x80..0xFF — for example,   will match the non-breaking space U+00A0 (encoded as two bytes in UTF-8).
 * Since PHP 4.3.5, the validity of the, and parameters with strict UTF-8 encoding rules is checked: the acceptable character set is the whole set of Unicode code points that have a defined scalar value (i.e. U+0000..U+D7FF or U+E000..U+10FFFF). It is no longer possible to search, match or replace isolated or unpaired surrogates (which have no valid UTF-8 encoding), other invalid bytes (such as \xC0, \xC1 or \xFF), or byte sequences that are invalid in UTF-8; matching Unicode non-characters (such as U+FFFF) remains possible. Since PHP 5.3.4, non-standard legacy UTF-8 sequences encoded with 5 or 6 bytes are also no longer accepted; these were only defined in an early, now obsolete RFC (which extended the set of encoded legacy code points up to U-7FFFFFFF).
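The byte counts and validity rules in the list above can be checked with a short Python sketch. It uses Python's own `re` and `unicodedata` modules, not PHP's PCRE, so it only illustrates the Unicode facts (NFC versus NFD length, a single-character match on a decomposed letter, and strict UTF-8 validation). The helper names are assumptions made for this example.

```python
import re
import unicodedata

E_ACUTE = "\u00e9"  # Latin small letter e with acute


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def first_char(text: str) -> str:
    """Return what a single '.' matches at the start of the text."""
    # Python's re matches one code point per '.', not one grapheme.
    return re.match(r".", text).group(0)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode under strict UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


nfc = unicodedata.normalize("NFC", E_ACUTE)  # precomposed: 1 code point
nfd = unicodedata.normalize("NFD", E_ACUTE)  # 'e' + combining acute accent

if __name__ == "__main__":
    print(utf8_len(nfc), utf8_len(nfd))
    print(first_char(nfd) == "e")             # only the base letter matches
    print("\u00a0".encode("utf-8"))           # non-breaking space in UTF-8
    print(is_valid_utf8(b"\xc0"), is_valid_utf8(b"\xff"))
    print(is_valid_utf8(b"\xed\xa0\x80"))     # encoded surrogate U+D800
    print(is_valid_utf8("\uffff".encode("utf-8")))  # non-character U+FFFF
```

Because no regexp option normalizes the input, a caller that needs consistent matching would have to normalize the string before passing it in, as `unicodedata.normalize` does here.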

Security
Some old versions of this MediaWiki extension did not block the expansion of internal PHP variables and did not restrict the set of options:
 * If some specially crafted was used without the  option, this could open a security hole by allowing arbitrary PHP code to be executed.

For this reason, the PHP-specific regexp  option (obsolete in PHP since version 5.5.0) is not accepted by this MediaWiki extension, and the PCRE engine now also forbids activating it within the pattern itself (with PHP-specific subpattern extensions that are incompatible with Perl, notably PHP-specific internal options).

Websites that use RegexParserFunctions

 * http://mtg.wikia.com
 * http://de.memory-alpha.org
 * http://potbs.wikia.com