User:DanielRenfro/Character Encoding


A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets, or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks, or the storage of text in computers.

Unicode

Strictly speaking, Unicode is not a character encoding. It is a conceptual encoding (a character set) which pairs characters with numbers, rather than mapping them to octets (bytes). For example, the Cyrillic capital letter Zhe (Ж) is paired with the number 1046. This number, called a code point, can be represented any number of ways; in Unicode it is written as "U+0416" (a capital U followed by its hexadecimal representation). Unicode also records attributes of characters, such as "3 is a digit" or "É is an uppercase letter whose lowercase equivalent is é."

Sometimes a character on the screen may be represented by more than one Unicode code point. For example, most people would consider à to be a single character, but in Unicode it can be composed of two code points: U+0061 (a) combined with the grave accent U+0300 (`). Unicode offers a number of these "combining characters" that are intended to follow (and be combined with) a base character. This can cause confusion when a regular expression expects a single character or a single byte.
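A quick sketch of these ideas in PHP (assuming PHP 7.2+ with the mbstring extension, for mb_ord() and the \u{...} escape):

<?php
$zhe = "\u{0416}";                       // Cyrillic capital Zhe (Ж), code point U+0416
var_dump( mb_ord( $zhe, 'UTF-8' ) );     // int(1046) -- the code point as a plain number
var_dump( strlen( $zhe ) );              // int(2)    -- but two octets when stored as UTF-8

$precomposed = "\u{00E0}";               // à as a single code point
$combining   = "a\u{0300}";              // 'a' followed by the combining grave accent
var_dump( $precomposed === $combining ); // bool(false) -- same glyph, different code points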

There are a number of ways to store a code point, such as:

  • UCS-2
    all characters are encoded using two bytes
  • UCS-4
    all characters are encoded using four bytes
  • UTF-16
    most characters encoded with two bytes, but some with four
  • UTF-8
    characters encoded with one to four bytes (the original design allowed up to six)

What is important is to know which encoding your program uses and how to convert to it from other encodings (from ASCII or Latin-1 to UTF-16, for example).
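For example, a minimal conversion sketch in PHP, using the iconv and mbstring extensions:

<?php
$latin1 = "\xE9"; // 'é' encoded as Latin-1 (ISO-8859-1), a single byte

$utf8  = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );
$utf16 = mb_convert_encoding( $latin1, 'UTF-16', 'ISO-8859-1' );

var_dump( bin2hex( $utf8 ) );  // string(4) "c3a9" -- two bytes in UTF-8
var_dump( bin2hex( $utf16 ) ); // string(4) "00e9" -- two bytes in UTF-16 (big-endian)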

Common character encodings

Translation


HTTP

The HTTP Content-Type response header should declare the correct encoding, for example:

Content-Type: text/html; charset=ISO-8859-1
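From PHP, for example, this header can be set with a one-liner (a sketch; it must run before any output is sent):

<?php
header( 'Content-Type: text/html; charset=utf-8' );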

HTML

Character encodings in HTML can be troublesome when you do not specify the correct encoding for your document. You can declare the encoding in the HTTP response header, or in the HTML document itself, like so:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the HTML meta tag and the HTTP header don't match, most browsers will ignore the meta tag in favor of the header. But this raises the question: which one is correct?

PHP

Problem

PHP thinks one character is equal to one byte.

This might have been true on the late-'90s web, but nowadays this assumption leads to some non-obvious problems with internationalization. For example, the strlen() function erroneously returns 27 instead of 20 when calculating the length of the following string (due to all the accented characters):

Iñtërnâtiônàlizætiøn
12345678901234567890

PHP actually "sees" something more like this:

IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n
123456789012345678901234567

One good thing is that PHP doesn't try to convert your strings to some other encoding; it just operates on whatever bytes you give it. Even though the native string functions don't understand that a character can span more than one byte, they won't mangle the bytes either. The iconv extension is enabled by default, but is only partially useful. Alternatively, you can compile PHP with the mbstring extension, but if you want your software to be portable this can be a problem.
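With mbstring available, the difference looks like this (a sketch; the source file itself must be saved as UTF-8):

<?php
$s = 'Iñtërnâtiônàlizætiøn';

var_dump( strlen( $s ) );             // int(27) -- counts bytes
var_dump( mb_strlen( $s, 'UTF-8' ) ); // int(20) -- counts characters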

Solution

Hand the problem off to the browser and let it handle it. Web browsers have excellent support for many different character sets, the most important being UTF-8. All you have to do is tell the browser "everything is UTF-8" and your problems are (partially) solved. UTF-8 is a good choice because it is Unicode, it is a standard, and it is backwards compatible with ASCII. Send the HTTP header together with the matching HTML meta tag and let the browser handle the rest.
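Putting the two together, a minimal sketch of a page that declares UTF-8 in both places:

<?php header( 'Content-Type: text/html; charset=utf-8' ); ?>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Iñtërnâtiônàlizætiøn</title>
</head>
<body><p>Iñtërnâtiônàlizætiøn</p></body>
</html>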

Most of this section is taken from the good tutorial at http://www.phpwact.org/php/i18n/charsets.

MediaWiki

Code

All PHP files in the MediaWiki software suite must be encoded as UTF-8, without a byte order mark. Otherwise bad things happen: any BOM before the opening <?php tag is sent to the browser as literal output, which among other things breaks later header() calls.

Wikitext

MediaWiki does not make any assumptions about what characters are coming in from the user.

Some global configs dealing with encoding:

$wgEditEncoding
For some languages, it may be necessary to explicitly specify which characters make it into the edit box raw, or are converted in some way or another. Note that if $wgOutputEncoding is different from $wgInputEncoding, this text will be further converted to $wgOutputEncoding.
$wgLegacyEncoding
Set this to e.g. 'ISO-8859-1' to perform character-set conversion when loading old revisions that are not marked with the "utf-8" flag. Use this when converting a wiki to UTF-8 without the burdensome mass conversion of old text data. NOTE: this does NOT touch any fields other than old_text. Titles, comments, user names, etc. must still be converted en masse in the database before continuing as a UTF-8 wiki.
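In LocalSettings.php this looks like the following (a sketch; the value must match whatever encoding the old revisions were actually stored in):

$wgLegacyEncoding = 'ISO-8859-1'; // old, pre-conversion revisions were stored as Latin-1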

Older global configs you might run into include $wgInputEncoding and $wgOutputEncoding, mentioned above.

Storage

From what I can tell, MediaWiki uses the BLOB (binary large object) datatype to store wikitext in the text database table. This means that no matter what encoding the input arrives in, MediaWiki will store the bytes just fine. The problem comes when the text is displayed again. MediaWiki can be configured to send different Content-Type HTTP headers, but I think by default it uses UTF-8.


$wgDBmysql5
  • Set to true to engage MySQL 4.1/5.0 charset-related features; for now will just cause sending of 'SET NAMES=utf8' on connect.
  • You should not generally change this value once installed -- if you create the wiki in Binary or UTF-8 schemas, keep this on! If your wiki was created with the old "backwards-compatible UTF-8" schema, it should stay off.
  • (See also $wgDBTableOptions which in recentish versions will include the table type and character set used when creating tables.)
  • May break things if you set it differently when upgrading an existing wiki. Broken symptoms are likely to include incorrect behavior with page titles, usernames, comments, etc. containing non-ASCII characters. Might also cause failures on the object cache and other things.
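In LocalSettings.php (a sketch; the right value depends on the schema the wiki was created with):

// Wiki was created with the Binary or UTF-8 MySQL schema, so keep this on.
$wgDBmysql5 = true;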

includes/search/SearchMySQL.php

When using MySQL as the database backend, MediaWiki will run the following code:

/**
 * Converts some characters for MySQL's indexing to grok it correctly,
 * and pads short words to overcome limitations.
 */
function normalizeText( $string ) {
    global $wgContLang;

    wfProfileIn( __METHOD__ );

    $out = parent::normalizeText( $string );

    // MySQL fulltext index doesn't grok utf-8, so we
    // need to fold cases and convert to hex
    $out = preg_replace_callback(
        "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
        array( $this, 'stripForSearchCallback' ),
        $wgContLang->lc( $out ) );

    // And to add insult to injury, the default indexing
    // ignores short words... Pad them so we can pass them
    // through without reconfiguring the server...
    $minLength = $this->minSearchLength();
    if( $minLength > 1 ) {
        $n = $minLength - 1;
        $out = preg_replace(
            "/\b(\w{1,$n})\b/",
            "$1u800",
            $out );
    }

    // Periods within things like hostnames and IP addresses
    // are also important -- we want a search for "example.com"
    // or "192.168.1.1" to work sanely.
    //
    // MySQL's search seems to ignore them, so you'd match on
    // "example.wikipedia.com" and "192.168.83.1" as well.
    $out = preg_replace(
        "/(\w)\.(\w|\*)/u",
        "$1u82e$2",
        $out );

    wfProfileOut( __METHOD__ );

    return $out;
}

/**
 * Armor a case-folded UTF-8 string to get through MySQL's
 * fulltext search without being mucked up by funny charset
 * settings or anything else of the sort.
 */
protected function stripForSearchCallback( $matches ) {
    return 'u8' . bin2hex( $matches[1] );
}

The take-home message here is that any UTF-8 sequence (which MySQL doesn't "grok", or so the MediaWiki comments say) gets turned into an armored ASCII string, one that survives the standard Latin-1 or ASCII encodings. That means that anyone searching for a non-ASCII character won't find what they're looking for.

For example, this string (from the LacZ product page, containing the Greek letter beta (β)):

Component of [[:Category:Complex:β-galactosidase|β-galactosidase]]

will get turned into this string by the above code:

component of category complex u8ceb2-galactosidase u8ceb2-galactosidase beta-du800-galactosidase
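You can reproduce the armoring step in isolation; here is a sketch with the callback inlined as a closure:

// 'β' is the two bytes 0xCE 0xB2 in UTF-8, so it gets armored as "u8ceb2".
$armored = preg_replace_callback(
    '/([\xc0-\xff][\x80-\xbf]*)/',
    function ( $m ) { return 'u8' . bin2hex( $m[1] ); },
    'β-galactosidase' );
var_dump( $armored ); // string(20) "u8ceb2-galactosidase"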

Searching

Problem
Because of the armoring described above (in Storage), searching for non-ASCII characters won't work; they have been turned into something else before being indexed.
Proposed Solutions
  1. create an extension that hooks into the search somewhere, capturing non-ASCII characters in the query and converting them to their armored equivalents (see the sketch after this list)
  2. create an extension that hooks into the code that saves pages to the searchindex table, converting the non-ASCII characters to something better (HTML entities?)
  3. both 1 and 2 (?)
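A rough sketch of the core of option 1: a hypothetical normalizeSearchTerm() helper (not an actual MediaWiki API) that armors the user's query the same way the index was armored:

// Hypothetical helper: lowercase the query and armor multibyte sequences
// exactly like SearchMySQL::normalizeText() does, so that a search for
// "β-galactosidase" can match the stored "u8ceb2-galactosidase".
function normalizeSearchTerm( $term ) {
    return preg_replace_callback(
        '/([\xc0-\xff][\x80-\xbf]*)/',
        function ( $m ) { return 'u8' . bin2hex( $m[1] ); },
        mb_strtolower( $term, 'UTF-8' ) );
}

echo normalizeSearchTerm( 'β-galactosidase' ); // prints "u8ceb2-galactosidase"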

MySQL

Collations vs. Encodings

Regular Expressions