MediaWiki extensions manual

Release status: stable

Implementation: Parser function
Description: Adds a {{#regex}} parser function for evaluating regular expressions.
Author(s): Jim R. Wilson (Jimbojw), Vitaliy Filippov
Latest version: 0.1
MediaWiki: 1.6-1.22 and higher
License: The MIT License
Download: https://github.com/mediawiki4intranet/RegexParserFunctions


The RegexParserFunctions extension adds a parser function called 'regex' (or 'regexp') which is used to perform regular expression pattern matching and replacement.

The extension is unmaintained by the original author, but is now maintained by me (User:VitaliyFilippov) as part of the Mediawiki4Intranet project. The security issue found by User:Pastakhov (the one involving the null byte) is fixed in our version, and the extension is compatible with the latest versions of MediaWiki.

RegexParserFunctions is released under The MIT License.

Installation

  1. Clone the repository from https://github.com/mediawiki4intranet/RegexParserFunctions into the $IP/extensions/ subdirectory.
    Note: $IP is your MediaWiki installation directory.
  2. Enable the extension by adding this line to your LocalSettings.php:

Usage

Once installed, editors of your wiki can evaluate regular expressions in one of two ways: simple match, and replacement.

  • Simple match: {{#regex: <string> | <regex>}}, evaluates to the matching portion of <string> (the behavior is the same as if the missing <replacement> parameter was $0);
  • Replacement: {{#regex: <string> | <regex> | <replacement>}}, evaluates to <string> with <regex> replaced by <replacement> globally.
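
The two modes can be approximated outside the wiki for experimentation. The following is a minimal Python sketch, not the extension's code: the function name regex_pf is hypothetical, Python's re module stands in for PHP's PCRE (the two dialects differ in details), and returning an empty string when nothing matches is an assumption.

```python
import re


def regex_pf(string, pattern, replacement=None):
    """Rough stand-in for {{#regex: string | pattern | replacement}}.

    Hypothetical helper: `pattern` is a bare Python regex, without the
    delimiters and option letters that the parser function expects.
    """
    if replacement is None:
        # Simple match: return the first matching portion of the string.
        # Assumption: an empty string is returned when nothing matches.
        found = re.search(pattern, string)
        return found.group(0) if found else ""
    # Replacement: substitute every match (global replacement).
    return re.sub(pattern, replacement, string)


match_result = regex_pf("release-1.22", r"\d+\.\d+")
replace_result = regex_pf("a-b-c", r"-", "+")
```

Note that Python writes group references in the replacement as \1 or \g<0>, whereas the parser function uses the PHP form ($0, $1), so replacements containing references are not directly portable between the two.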

For example, say you're trying to grab the last portion of a Title which is using '/' delimiting subpage notation. For that, you could use:
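
One possible call, written here as an untested assumption that follows the syntax described in this section rather than an example taken from the extension, is {{#regex: Manual/Extensions/Regex | %[^/]+$% }}, which would be expected to evaluate to the last path segment. The same pattern can be tried in Python, whose re module behaves like PCRE for an expression this simple; the title value is made up for illustration.

```python
import re

# Hypothetical page title using '/' as the subpage separator.
title = "Manual/Extensions/Regex"

# [^/]+$ matches the run of non-slash characters at the end of the string.
found = re.search(r"[^/]+$", title)
last_part = found.group(0) if found else ""
print(last_part)
```

Using % as the delimiter in the wikitext form avoids having to escape the / characters inside the pattern.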


The supported regular expressions (in the second parameter) are the same as those implemented in the standard library of the local installation of PHP (used to run MediaWiki).

See the PHP documentation of PCRE patterns for details of the supported features and syntax and the differences with Perl and POSIX regular expressions (they are also different from the simpler regular expressions built in the Lua extension for MediaWiki).

Note that the regular expression pattern must be enclosed in a pair of delimiters; if the delimiters are missing, an error is currently returned. To match the delimiter character literally within the regular expression itself, escape it with \. The delimiter must be one of /, % or | (however, this last character is not easily usable with the MediaWiki syntax, as it would require MediaWiki escaping within a nowiki tag).
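
To make the delimiter rule concrete, here is a small Python sketch of how a delimited expression could be split into its pattern body and trailing option letters. It is an illustration only: the helper name split_delimited and its error messages are hypothetical, and the extension's actual parsing may differ.

```python
DELIMITERS = "/%|"


def split_delimited(expr):
    """Split '/body/opts' into (body, opts). Hypothetical helper."""
    expr = expr.strip()
    if len(expr) < 2 or expr[0] not in DELIMITERS:
        raise ValueError("pattern must start with one of / % |")
    delimiter = expr[0]
    # Only option letters may follow the closing delimiter, so the last
    # occurrence of the delimiter is taken as the closing one.
    end = expr.rfind(delimiter)
    if end == 0:
        raise ValueError("closing delimiter is missing")
    # An odd number of backslashes before it means it is escaped, i.e. it
    # belongs to the pattern body and the closing delimiter is absent.
    before = expr[:end]
    backslashes = len(before) - len(before.rstrip("\\"))
    if backslashes % 2 == 1:
        raise ValueError("closing delimiter is missing")
    # Escaped delimiters inside the body are kept as written.
    return expr[1:end], expr[end + 1:]


body, options = split_delimited(r"/a\/b/i")
```

Here body keeps the escaped delimiter (a\/b) and options holds the single letter i; an expression without delimiters, such as abc, raises an error.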

The delimited pattern is then optionally followed by one or more letters defining matching options. Currently the supported options are:

  • i for case-insensitive matching (without the u option, the recognized case pairs are those from the US-ASCII subset);
  • m for multiline matching (the ^ and $ metacharacters in patterns will match at the start and end of each line of <string>; without this option they match only once, respectively at the start and end of the whole <string>);
  • s for letting the pattern . also match newline characters (without this option, . matches any character except a newline);
  • u for recognizing non-ASCII characters that are encoded as multibyte sequences with UTF-8 as a single character for . and character classes (this option is supported by PHP 4.1+ on Unix and Linux platforms).
    This option disables some PCRE features that are not compatible with Perl regular expressions. Notably, it will block the PHP regexp extensions that can activate the PHP evaluation of embedded expressions and the conversion of their result into the source <string>, <regexp> or <replacement> parameters (see the security notes below).
    This option modifies the i option to support the simple (1-to-1) case pairs defined in the Unicode Character Database. It does not use locale-dependent or complex casing rules, and it ignores Unicode canonical equivalences (cases where a letter in one letter case exists only in decomposed form, while its counterpart in the other letter case exists in both precomposed and decomposed forms that are canonically equivalent).
    Unicode characters encoded as combining sequences are handled as multiple separate characters: if the Latin letter 'é' is encoded in NFD form (as 3 UTF-8 bytes) in the input string, the pattern . in the regexp will match only the base letter 'e', but not the two UTF-8 bytes after it that encode the combining acute accent; but if the same Latin letter is encoded in NFC form (as 2 UTF-8 bytes) in the input string, the same pattern will match it.
    There's currently no regexp option for normalizing the input string or pattern into a standard Unicode form (such as NFD in order to match all simple case pairs), and renormalizing the output (for example in NFC form).
    This option enables the syntax \x{...} for specifying between the braces a valid Unicode scalar value in hexadecimal.
    This option also changes the interpretation of patterns like \x80..\xFF (in hexadecimal) or \200..\377 (in octal), which are changed to cover the UTF-8 encoded range of U+0080..U+00FF instead of the range of individual bytes 0x80..0xFF — for example, \xA0 will match the non-breaking space U+00A0 (encoded as two bytes in UTF-8).
    Since PHP 4.3.5, the <string>, <regexp> and <replacement> parameters are checked for validity against strict UTF-8 encoding rules. The acceptable character set is the whole set of Unicode code points that have a defined scalar value (i.e. U+0000..U+D7FF or U+E000..U+10FFFF). It is no longer possible to search, match or replace isolated or unpaired surrogates (which have no valid UTF-8 encoding), other invalid bytes (such as \xC0, \xC1 or \xFF), or byte sequences that are invalid in UTF-8; matching Unicode non-characters (such as U+FFFF) remains possible. Since PHP 5.3.4, the non-standard legacy UTF-8 sequences of 5 or 6 bytes, which were defined only in an early, now obsolete RFC (and extended the set of encoded legacy code points up to U-7FFFFFFF), are no longer accepted either.
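
The effect of these options can be explored with a short Python sketch. This is an analogy under stated assumptions, not PCRE itself: re.MULTILINE and re.DOTALL stand in for the m and s options, matching on str objects stands in for the u option, and matching on UTF-8 encoded bytes stands in for omitting it.

```python
import re
import unicodedata

text = "first line\nsecond line"

# m-like behaviour: with re.MULTILINE, ^ also anchors after each newline.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# s-like behaviour: with re.DOTALL, . also matches the newline.
dot_default = re.match(r".*", text).group(0)
dot_dotall = re.match(r".*", text, re.DOTALL).group(0)

# The same letter in precomposed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "\u00e9")
nfd = unicodedata.normalize("NFD", "\u00e9")
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))

# Character-oriented matching (comparable to using the u option).
dot_on_nfc = re.match(r".", nfc).group(0)
dot_on_nfd = re.match(r".", nfd).group(0)

# Byte-oriented matching (comparable to omitting the u option).
dot_on_bytes = re.match(rb".", nfc.encode("utf-8")).group(0)

# Case-insensitive matching of a non-ASCII letter, on text and on bytes.
upper, lower = "\u00c9", "\u00e9"
ci_text = re.fullmatch(upper, lower, re.IGNORECASE) is not None
ci_bytes = re.fullmatch(upper.encode("utf-8"), lower.encode("utf-8"),
                        re.IGNORECASE) is not None
```

In this sketch the precomposed letter occupies 2 bytes and the decomposed one 3, a single . consumes the whole precomposed letter but only the base letter e of the decomposed one, and on bytes it consumes only the first byte. The case-insensitive comparison succeeds on text and fails on bytes, which mirrors the ASCII-only case pairs described for the i option without u.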

Security

Some old versions of this MediaWiki extension did not block the expansion of internal PHP variables and did not restrict the set of options:

If some specially crafted <regexp> was used without the u option, this could open a security hole by allowing arbitrary PHP code to be executed.

For this reason the PHP-specific regexp e option (deprecated in PHP since version 5.5.0) is not accepted by this MediaWiki extension, and the PCRE engine now also forbids activating it within the pattern itself (with PHP-specific subpattern extensions incompatible with Perl, notably with PHP-specific internal options).
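
A restriction of this kind can be sketched as an allow-list check. The Python below is a hypothetical illustration: the source does not show how the extension performs its check, the name check_regex_input is invented, and rejecting null bytes outright is only one assumed way of handling the null-byte issue mentioned earlier on this page.

```python
# Option letters documented as supported; anything else (notably 'e')
# is refused.
ALLOWED_OPTIONS = frozenset("imsu")


def check_regex_input(body, options):
    """Return True if the pattern body and option letters look acceptable.

    Hypothetical helper, not the extension's actual validation.
    """
    # Assumption: refuse null bytes anywhere in the expression.
    if "\0" in body or "\0" in options:
        return False
    # Every option letter must be on the allow-list.
    return all(letter in ALLOWED_OPTIONS for letter in options)


accepted = check_regex_input("[^/]+$", "iu")
rejected = check_regex_input("(.*)", "e")
```

An allow-list is used rather than a list of forbidden letters so that any option not explicitly documented is refused by default.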

Websites that use RegexParserFunctions

See also
