Help talk:Bad title

From mediawiki.org

Regex

@Shirayuki Hi - I was wondering what the purpose of formatting the regex as a chunk of PHP code is:

  1. The regex is totally unusable in that form - even from within PHP - because self::legalChars() is specific to the internals of TitleParser.php.
  2. Changing the bad characters regex from [^ %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z\~\x80-\x{10FFFF}+] to self::legalChars() only serves to obscure what the bad title characters are, and a link to Manual:$wgLegalTitleChars at the bottom of the page is unnecessarily awkward. The list hasn't changed since version 1.8 (with the change in 1.39 being a cosmetic code change with no effect on the string value), so I think we can consider it stable. Version 1.41 also removed the option to modify it.
  3. Anyone unfamiliar with PHP is going to have a bad time trying to use it, and even those who are familiar with it are going to have to copy and paste it in segments, which is annoying.
  4. As far as I can tell, the original reason it was formatted like that is because it was a direct copy-paste (of a very old version), but if it's no longer a direct representation of what's in TitleParser.php, then why are we formatting it this way?
  5. $rxTc is just a local variable specific to the function getTitleInvalidRegex() in TitleParser.php, and is not meaningful in any way outside of that function.

Theknightwho (talk) 05:00, 25 May 2025 (UTC)

@Theknightwho: Thank you for your detailed feedback. I agree with some of your points and would like to share my thoughts as follows:
  • 1–2. I agree that using self::legalChars() makes the regex opaque to readers. Replacing it with the actual regex character class (as you've suggested) makes the intent clearer and helps users understand which characters are disallowed.
  • 3. That said, even the "raw" regex as you wrote it is still in PHP-specific syntax, so it's not directly reusable in languages like Python or JavaScript. Similarly, a "bare" regex wouldn't be directly usable in PHP either without proper escaping and wrapping.
  • 4–5. Providing an example that assigns the regex to a variable and uses preg_match() to validate a title could still be helpful. It would make the pattern more approachable for practical use.
Shirayuki (talk) 06:12, 25 May 2025 (UTC)
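
To illustrate the kind of "assign the pattern to a variable and validate a title" example discussed above, here is a minimal sketch in Python rather than PHP. It covers only the bad-characters class quoted in point 2, not the full pattern built in TitleParser.php. The names ILLEGAL_CHAR and find_illegal_chars are hypothetical, and the translation of \x{10FFFF} to \U0010FFFF is an assumption about how the class maps onto Python's re syntax.

```python
import re
from typing import List

# Character class quoted in the discussion, rewritten for Python's `re` module.
# Assumptions: Python has no \x{...} escape, so the upper bound U+10FFFF is
# written as \U0010FFFF, and the redundant backslash before "~" is dropped.
ILLEGAL_CHAR = re.compile(
    r"""[^ %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\U0010FFFF+]"""
)


def find_illegal_chars(title: str) -> List[str]:
    """Return every character in `title` that falls outside the legal set."""
    return ILLEGAL_CHAR.findall(title)


if __name__ == "__main__":
    for candidate in ["Main Page", "Foo[bar]", "A<B>#{x|y}"]:
        bad = find_illegal_chars(candidate)
        print(repr(candidate), "->", bad if bad else "no illegal characters")
```

An empty result only means that no character is outside the legal set; a title can still be rejected for other reasons that this sketch does not check.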
@Shirayuki On point 3, I used PCRE2 syntax, which is indeed used in PHP, but it's also one of the most popular regex engines currently in use. That being said, it wouldn't be too difficult to give alternative formats for other popular engines (perhaps using collapsible boxes to avoid swamping the page), as the differences are really minor, since this isn't a very complex pattern. It's easily templatable, to avoid sync issues. The only real difference is that they all seem to use different syntaxes for Unicode hex sequences and properties, but that's not hard to deal with. Theknightwho (talk) 07:06, 25 May 2025 (UTC)
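
As a rough sketch of the templating idea, the class body could be kept in one place with only the code point escapes swapped per engine. Everything named here (CLASS_TEMPLATE, ESCAPES, render) is hypothetical, and the engine table is an assumption covering just three engines. The redundant backslash before "~" is dropped so that the same body also works in JavaScript's Unicode mode.

```python
import re

# Shared body of the bad-characters class; {lo} and {hi} stand in for the
# two code points whose escape syntax differs between engines.
CLASS_TEMPLATE = r"""[^ %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~{lo}-{hi}+]"""

# Hypothetical per-engine escapes for U+0080 and U+10FFFF.
ESCAPES = {
    "pcre2": (r"\x80", r"\x{10FFFF}"),       # assumes UTF mode (the `u` modifier in PHP)
    "python": (r"\x80", r"\U0010FFFF"),      # str patterns in the `re` module
    "javascript": (r"\x80", r"\u{10FFFF}"),  # assumes the `u` flag
}


def render(engine: str) -> str:
    """Fill the template with the escapes for the named engine."""
    lo, hi = ESCAPES[engine]
    return CLASS_TEMPLATE.replace("{lo}", lo).replace("{hi}", hi)


if __name__ == "__main__":
    for name in ESCAPES:
        print(f"{name:<11}{render(name)}")
    # Only the Python rendering can be exercised from here.
    print(re.compile(render("python")).findall("a|b"))
```

Only the Python variant is actually compiled in this sketch; the PCRE2 and JavaScript strings would still need checking in their own engines.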
On a tangentially-related note, this also made me realise that Pygments (so, by extension, SyntaxHighlight) doesn't have support for any regex formats, which feels like an oversight. I might see if I can knock something together for it that resembles what Regex101 does. Theknightwho (talk) 07:16, 25 May 2025 (UTC)