I've managed to solve the problem.

Essentially, all Unicode text is normalized so that glyphs that look identical are stored identically. Normally this is what you want. If a user has Chinese fonts installed, U+FA46 looks identical to U+6E1A (渚), so you want them to be able to search for either. You normally don't want two different articles both talking about "shoreline" just because some users typed a different code point.
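
If you happen to have the PHP intl extension on a scratch machine, you can watch this normalization happen directly. A minimal sketch (the hex values are the UTF-8 bytes of the two code points):

<?php
// Needs the intl extension for the Normalizer class.
$compat = "\u{FA46}"; // CJK compatibility ideograph
$normal = Normalizer::normalize( $compat, Normalizer::FORM_C );
echo bin2hex( $compat ), ' -> ', bin2hex( $normal ), "\n"; // efa986 -> e6b89a (U+6E1A)

This is the same mapping MediaWiki applies internally to everything you save or search.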

In rare cases, you might want two separate pages. There are two ways to do it, depending on whether you have intl_pecl installed.

If you do *not* have intl_pecl installed, then it is easy.

Step one: cd to your mediawiki/includes/normal directory and look at the file UtfNormalData.inc. This is the character conversion database; it lists the characters in encoded form.
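
Before changing anything, you can confirm the mapping is actually in there. At least in the versions I've seen, the generated file stores the tables as raw UTF-8 bytes, so a quick check with plain PHP (no intl needed) looks like this; the bytes EF A9 86 are the UTF-8 encoding of U+FA46:

<?php
// Run from includes/normal: check that the raw bytes of U+FA46
// appear somewhere in the generated conversion tables.
$data = file_get_contents( 'UtfNormalData.inc' );
var_dump( strpos( $data, "\xEF\xA9\x86" ) !== false ); // bool(true) before regeneration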

Step two: UtfNormalData.inc is automatically generated by UtfNormalGenerate.php, which uses the official Unicode data tables to determine, among other things, which characters map to one another. You will need to download these files:

wget http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

wget http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

wget http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Step three: The file we care about is UnicodeData.txt. To oversimplify the file format: each line describes one character, with columns separated by a ; (semicolon). Column 1 (index 0) is the character's code point, and column 6 (index 5) is the decomposition mapping, i.e. the target code point for conversion. So, for example, if you want to remove the normalization FA46 -> 6E1A, go to the line where column 1 is equal to FA46, and then delete the text 6E1A from column 6.
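
If you have more than one character to decouple, hand-editing gets tedious. Here is a rough sketch in plain PHP (the list of code points is just an example) that blanks column 6 for every code point you name:

<?php
// Blank the decomposition mapping (field index 5) for selected code points.
$targets = array( 'FA46' ); // code points to decouple, as written in column 1
$lines = file( 'UnicodeData.txt' );
foreach ( $lines as $i => $line ) {
    $fields = explode( ';', $line );
    if ( in_array( $fields[0], $targets ) ) {
        $fields[5] = ''; // e.g. drops the 6E1A mapping for FA46
        $lines[$i] = implode( ';', $fields );
    }
}
file_put_contents( 'UnicodeData.txt', implode( '', $lines ) );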

Step four: Once all the modifications to UnicodeData.txt have been made, regenerate UtfNormalData.inc by running

php UtfNormalGenerate.php

Step five: Check that everything is working. Open up UtfNormalData.inc and make sure the characters are now missing from the conversion tables. Then try editing any page in the wiki in your browser and type a character that should no longer be converted. Hit the "Show preview" button and verify that the character was not converted.
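
You can also verify from the command line without touching the browser. A sketch, assuming the old includes/normal layout, where UtfNormal::cleanUp() is the normalization entry point used when intl_pecl is absent:

<?php
// Run from includes/normal after regenerating UtfNormalData.inc.
require_once 'UtfNormal.php';
$in  = "\xEF\xA9\x86"; // U+FA46
$out = UtfNormal::cleanUp( $in );
echo bin2hex( $out ), "\n"; // e6b89a before the edit, efa986 after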

That's it. If you *do* have intl_pecl installed, it's a different story, and I can't give exact steps, but you will have to uninstall intl_pecl, download its source, edit the source (probably its copy of UnicodeData.txt), compile it, and reinstall. Getting hold of the source files was not so easy for me; or rather, not as easy as simply disabling intl_pecl and editing MediaWiki's normalization library.

Note, in general this is *not* something that you want to do. It only makes sense if you have a very specific need for certain Unicode code points never to be converted.