Manual:PAGENAMEE encoding

MediaWiki page name encoding is a complicated topic. The MediaWiki magic words PAGENAME and PAGENAMEE and the urlencode parser function have distinct implementations, each with its own peculiarities.

A MediaWiki page name can have a leading space but not a trailing space. The ASCII characters that are not allowed in MediaWiki page names are the three kinds of brackets (angle, square, and curly), the pound sign, the underscore, the vertical bar, and all control characters (including tabs and newlines).


 * <tt># &lt; &gt; [ ] _ { | }</tt>


 * Note that the underscore is not really disallowed: it is treated exactly like a space in MediaWiki page names, so "A_B" and "A B" refer to the same page name. Pages are created, searched, and displayed (with their title) using spaces, never underscores.

This article shall refer to these as the "not-allowed pagename characters". For clarity, other 7-bit ASCII characters are given in the URL-style %hh form (a percent sign followed by two hexadecimal digits), known as percent-encoding.
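As an illustration of the list above, the following Python sketch reports which not-allowed pagename characters occur in a candidate title. The function name and the choice of Python are assumptions made for this example; this is not MediaWiki's own validation code.

```python
# The nine not-allowed ASCII characters listed above.
# The underscore is included even though MediaWiki treats it as a space.
NOT_ALLOWED = set("#<>[]_{|}")


def not_allowed_chars(title):
    """Return the not-allowed pagename characters found in title, in order."""
    found = []
    for ch in title:
        # ASCII control characters: 0x00-0x1F and 0x7F (DEL)
        is_control = ord(ch) < 0x20 or ord(ch) == 0x7F
        if ch in NOT_ALLOWED or is_control:
            found.append(ch)
    return found
```

An empty result means the title contains none of the characters discussed in this section; it does not prove that MediaWiki would accept the title.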

PAGENAME
Some allowed characters returned by PAGENAME are HTML-style encoded:


 * " (double quote, %22) is converted to &#34; (34 is the decimal value of hexadecimal 22); in standard HTML/XML style it could also be converted to &quot;.
 * & (ampersand, %26) is converted to &#38; (38 is the decimal value of hexadecimal 26); in standard HTML/XML style it could also be converted to &amp;.
 * ' (single quote, %27) is converted to &#39; (39 is the decimal value of hexadecimal 27); in standard HTML/XML style it could also be converted to &apos;.


 * This HTML/XML encoding is standard, even though the standard requires escaping the single and double quotes only in a few cases. The standard would also require re-encoding the less-than and greater-than signs, but these two characters are forbidden in MediaWiki pagenames because of the syntax of the MediaWiki code used to compose pages.

The same HTML encoding is also used by the related page-name magic words.

We will refer to these as the "three special pagename characters".
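The mapping described above can be sketched in Python. This is a minimal illustration that assumes the numeric entities follow from the decimal values 34, 38, and 39 given above; the function name is hypothetical and this is not MediaWiki's implementation.

```python
# Numeric character references for the three special pagename characters,
# built from the decimal values given in the list above.
HTML_NUMERIC = {
    '"': "&#34;",
    "&": "&#38;",
    "'": "&#39;",
}


def pagename_html_encode(title):
    """Replace the three special pagename characters with numeric entities."""
    return "".join(HTML_NUMERIC.get(ch, ch) for ch in title)
```

For example, under these assumptions the title `Tom & Jerry` becomes `Tom &#38; Jerry`, which a browser renders as the original title.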

PAGENAMEE
PAGENAMEE converts spaces to underscores and percent-encodes a set of characters:


 * It converts the 11 following ASCII characters (all allowed in pagenames):
 * <tt>"</tt> <tt>%</tt> <tt>&</tt> <tt>'</tt> <tt>+</tt> <tt>=</tt> <tt>?</tt> <tt>\</tt> <tt>^</tt> <tt>`</tt> <tt>~</tt>
 * to %22 %25 %26 %27 %2B %3D %3F %5C %5E %60 %7E (using the hexadecimal representation of the ASCII encoding).
 * It also converts every non-ASCII (Unicode) character to %nn triplets, with nn in hexadecimal, one for each octet of the UTF-8 sequence encoding the character's Unicode code point. The first triplet is between %C2 and %F4, followed by one to three triplets between %80 and %BF. In the worst case a single Unicode character generates 12 characters; most Latin, Cyrillic, and Greek characters are encoded on 6 characters, while sinograms and Korean Hangul need 9.
 * It converts the ASCII space (<tt> </tt>) into an underscore.
 * It does not convert ASCII alphanumerics and the 13 following ASCII punctuations and symbols (all allowed in pagenames):
 * <tt>!</tt> <tt>$</tt> <tt>(</tt> <tt>)</tt> <tt>*</tt> <tt>,</tt> <tt>-</tt> <tt>.</tt> <tt>/</tt> <tt>:</tt> <tt>;</tt> <tt>@</tt> <tt>_</tt>

The same encoding is also used by the related magic words whose names end in E (for example FULLPAGENAMEE).

When preparing a pagename for embedding in the "searchpart" of a URL (see RFC1738 and/or RFC3986), it might have to be percent-encoded and have all space characters converted to %20 or to a plus sign (+); we will call this "searchpart-encoded".
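The rules listed above for PAGENAMEE can be approximated with Python's standard library. This is a sketch that follows the lists in this section only; the function name and the safe set are taken from this article, not from MediaWiki's source, and the actual behaviour may differ between MediaWiki versions.

```python
from urllib.parse import quote

# The 13 ASCII punctuation characters that the list above says are left as-is.
UNCONVERTED = "!$()*,-./:;@_"


def pagenamee_encode(title):
    """Sketch of the PAGENAMEE encoding as described in this section."""
    # Step 1: spaces become underscores.
    with_underscores = title.replace(" ", "_")
    # Step 2: percent-encode everything except alphanumerics and UNCONVERTED;
    # non-ASCII characters are encoded octet by octet from their UTF-8 form.
    encoded = quote(with_underscores, safe=UNCONVERTED)
    # Step 3: recent Python versions never quote "~", but the list above
    # says it is converted, so it is handled explicitly here.
    return encoded.replace("~", "%7E")
```

For example, `pagenamee_encode("A = B")` gives `A_%3D_B`, the same string used later in this article for the redirected URL.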


 * This avoids the problematic HTML coding of the three special pagename characters by encoding, for instance, ampersand (<tt>&</tt>) as %26.

If no MediaWiki string manipulation extension is installed, then PAGENAMEE might only be useful for constructing a URL back into one's own wiki, to other wikis, or to other sites whose pages use the same names and use underscores. There is no standard here: the encoding presented above was defined by MediaWiki for its own local use. Do not assume that other sites perform the same conversion; most of them use plain UTF-8 in their own local URLs when they need to represent non-ASCII characters, and standard URL-encoding for the "unsafe" ASCII characters.

urlencode
The urlencode function (in its current version, which has used the "QUERY" style by default since MediaWiki 1.17) percent-encodes many more characters than PAGENAMEE.


 * It can convert any valid input string from its native UTF-8 encoding.
 * This function will also convert the 9 characters that are forbidden in pagenames and listed at the top of this page.
 * It converts the "three special characters" differently from PAGENAME, using %nn hexadecimal triplets instead of HTML entities.
 * It preserves the distinction between space and underscore (a distinction lost only in MediaWiki pagenames).
 * The result conforms to the RFC 1738 URL encoding standard, using only "safe" characters and the two characters % (followed by two hexadecimal digits) and + (to encode spaces).
 * This result is bijectively reversible, but MediaWiki does not natively provide a urldecode function to do it.
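The properties listed above can be tried out with a Python approximation of the QUERY style. This sketch assumes the style behaves like a classic form encoder (space to plus, everything but alphanumerics and a few safe characters to %nn); the function names are hypothetical, and the decoder is shown only because the encoding is reversible, not because MediaWiki offers one.

```python
from urllib.parse import quote_plus, unquote_plus


def urlencode_query(text):
    """Approximate the QUERY style: spaces become "+", the rest becomes %nn."""
    # Recent Python versions leave "~" unquoted, so it is encoded
    # explicitly here to keep the output within the documented set.
    return quote_plus(text, safe="").replace("~", "%7E")


def urldecode_query(encoded):
    """Reverse urlencode_query (MediaWiki has no native equivalent)."""
    return unquote_plus(encoded)
```

Because a literal plus sign is encoded as %2B and a literal percent sign as %25, decoding the result always returns the original string, and a space and an underscore stay distinct (`A B` becomes `A+B`, while `A_B` is unchanged).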

It can also be used to let editors of the wiki source work with the multilingual characters they are accustomed to rather than with the more opaque percent-encoded characters. When considering urlencode to construct an external link URL, especially within a template, there are two design styles that might be appropriate. Which one to choose is a matter of the trade-off between generality and ease of use.


 * For maximum generality, there is no simple combination of PAGENAME and other default wiki magic words that provides a general solution and handles names that include all possible characters in pagenames. The not-allowed pagename characters and the three special pagename characters both present issues. If a desired name uses any of those characters, then the actual pagename would have to be different. The most general design would be a template with two parameters: a URL-style searchpart-encoded parameter for the URL link and an HTML-style parameter for the link label. The URL-style parameter would be added to a search or lookup URL and the HTML-style parameter would be used to label the link. For instance, a template called OrgName that looks up an organization by name would, for an unusual 10-character organization name, be called with both forms of that name. Variations on this might use %20 instead of + for a space in the URL-style parameter.
 * Another (but unrelated) escaping used in HTML or XML is to use &amp;gt; instead of &amp;#62; for the greater-than character in the HTML-style parameter, or just the plain characters when they work. To be rigorous, one might argue that having two mandatory arguments is the best style for long-term stability in case the page is moved or translated to some other wiki where the naming style of pages is different, such as where a different alphabet is used for naming pages.
 * The urlencode parser function can be used to create a template that might be easy to use but not perfectly general. The urlencode function (in the QUERY style) converts almost all characters (including percent and plus) to %nn hexadecimal sequences, except alphanumerics and three of the RFC 1738 URL "safe" characters: - _ . (dash, underscore, period). It converts a blank to a plus sign, and it also encodes all non-ASCII characters as %nn hexadecimal sequences.
 * The technique of embedding into a template a code fragment that passes PAGENAME to urlencode, in order to create an external link URL, can be useful (i.e. treating simple pagenames as data). A pagename with any of the "three special" pagename characters (which are returned HTML-encoded by PAGENAME and similar functions, whose result is intended for display on an HTML page) might be a problem. For example, for a pagename with an ampersand, the HTML-style ampersand returned by PAGENAME would itself be converted by urlencode into the URL QUERY style, instead of the plain ampersand. A template can also take the name from a parameter (here called userparam) that is optional but, when explicitly supplied, would have to be searchpart-encoded.
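The ampersand problem described in the last item can be illustrated in Python. This sketch assumes that the HTML-encoded form produced by PAGENAME reaches a QUERY-style encoder unchanged, and it uses Python's `quote_plus` as a stand-in for that encoder; the variable names are illustrative only.

```python
from urllib.parse import quote_plus

# The title as an editor sees it, and the HTML-encoded form
# that PAGENAME is described as returning for it.
plain_title = "Tom & Jerry"
html_encoded_title = "Tom &#38; Jerry"

# Encoding the plain title gives the searchpart one would expect.
intended = quote_plus(plain_title)            # "Tom+%26+Jerry"

# Encoding the HTML-encoded form percent-encodes the entity itself,
# so the target site receives "&#38;" as literal text.
double_encoded = quote_plus(html_encoded_title)  # "Tom+%26%2338%3B+Jerry"
```

The two results differ, which is why the two-parameter design in the first item is the more general one.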

Note that there is no MediaWiki parser function that can decode the HTML encoding performed by PAGENAME. Likewise, there is no function to decode the special encoding performed by PAGENAMEE or found in URL paths to wiki pages. To compare the result of PAGENAME with a specified static name containing one of the three special characters that has not been HTML-encoded (for example, the value of a parameter given to a transcluded template in a wiki page), you can first convert that value to the same special encoding performed by PAGENAME by passing it as the parameter of the PAGENAME parser function.
 * You can do the same to compare pagenames according to the value of PAGENAMEE, or you can use the urlencode function with the supplementary style parameter WIKI.

Web browser URL and wiki web server HTTP interface
The URL you type or paste into your web browser's address box is similar to, but not exactly the same as, PAGENAMEE.
 * To type a pagename as a URL in your web browser so that it goes directly to the page, the following two characters must be URL-style encoded while being typed: % ? as %25 %3F. A typical example is a pagename that ends in a question mark, for which wiki editors will often create a redirect without the question mark so that it works anyway. If you type a space in the middle of a URL, your browser will convert it to %20 before sending it to any sort of web server. The same applies to the double-quote character ("), which is converted to %22. Depending on your browser, it may also encode some of the other "unsafe" characters. See RFC1738 for details, but note that this behavior is browser-dependent. Compared to browsers that support only http, browsers that support other schemes such as ftp tend to convert more of these characters.
 * How a URL with percent-encoding is displayed in a web browser's address box depends on whether the wiki web server has used URL redirection. The characters of the PAGENAMEE character set will be converted only if they are adjacent to a space. For instance, if you type a URL ending in A_=_B or A=B into your web browser, it will send that URL directly and you will get to the wiki page if it exists. If you enter a URL ending in A = B (with spaces around the equals sign), then your web browser encodes the spaces as %20 and thus sends A%20=%20B to the wiki web server. The wiki web server then converts the string to A_%3D_B and sends that back to the web browser via URL redirection. This is why, on a slow Internet link, you might see the spaces in a pagename change first to %20 and then to an underscore: your browser does the first conversion and the wiki web server does the second. You can try to see the real URL by copying it from the browser and pasting it as text into a simple text editor, but even this technique may produce browser-dependent results.
 * While not specific to the wiki web server, for wide characters the browser performs a partial urldecode action on the real URL. This urldecoding is essential for the usability of wide characters in URLs. As an example, for an otherwise simple URL ending in a UTF-8 string percent-encoded as %E6%9D%B1%E4%BA%AC, your browser will usually urldecode that part and display it as 東京 (Unicode U+6771 U+4EAC), which are the two Kanji characters for Tokyo. This can apply to both 7-bit and wide characters but is browser-dependent. For instance, if you visit a pagename through a URL in which several of its characters are percent-encoded, you may find that your web browser displays a URL that has urldecoded none, some, or all of the percent-encoded characters, and that a cut-and-paste of the browser URL into simple text includes none, some, or all of this urldecoding. How much of this urldecoding occurs during cut-and-paste is browser-dependent.
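The two browser-side conversions described above, encoding before sending and decoding for display, can be sketched in Python. The function names are hypothetical, and the assumption that a browser leaves the equals sign and underscore alone while encoding spaces is a simplification of behaviour that, as noted above, varies between browsers.

```python
from urllib.parse import quote, unquote


def browser_send_form(typed_tail):
    """Sketch of what a browser might send for the tail of a typed URL."""
    # Assumption: "=" and "_" are left alone; spaces become %20 and
    # non-ASCII characters become UTF-8 %nn triplets.
    return quote(typed_tail, safe="=_")


def address_box_form(sent_tail):
    """Sketch of a browser that urldecodes everything for display."""
    return unquote(sent_tail)
```

Under these assumptions, typing `A = B` sends `A%20=%20B`, and a tail of `%E6%9D%B1%E4%BA%AC` is displayed as the two Kanji characters for Tokyo.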