Utfnormal

From MediaWiki.org

utfnormal is a library that contains Unicode normalization routines, including both pure PHP implementations and automatic use of the 'intl' PHP extension when present.

The main function to care about is UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and if necessary convert to normalization form C.
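As a minimal sketch of that call, assuming the package was installed with Composer and that the script sits next to the generated vendor/ directory (the sample input is made up for illustration):

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor next to this script.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Hypothetical untrusted input: "Caf", then "e" followed by
// U+0301 COMBINING ACUTE ACCENT (bytes CC 81), then a stray 0xFF byte,
// which can never appear in valid UTF-8.
$raw = "Caf" . "e\xCC\x81" . "\xFF";

$clean = Validator::cleanUp( $raw );

// PCRE's /u modifier rejects invalid UTF-8, so a match count of 1
// means the cleaned string is well-formed.
var_dump( preg_match( '//u', $clean ) === 1 );

// Dump the bytes to see what happened to the accent and the bad byte.
echo bin2hex( $clean ), "\n";
```

Dumping the bytes rather than echoing the string makes it easier to see that the "e" and the combining accent were composed into a single code point and that the invalid byte is gone.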

If you know the string is already valid UTF-8, you can call UtfNormal\Validator::toNFC(), toNFD(), or toNFKC() directly; each converts a given UTF-8 string to Normalization Form C, D, or KC respectively, if it is not already in that form. These functions assume valid UTF-8 input; if there are corrupt characters, they may produce erroneous results.
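A short sketch of the direct calls, again assuming a Composer install; the inputs are written as explicit byte sequences that are known to be valid UTF-8, since these functions do not validate:

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor next to this script.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// "e" + U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not in NFC.
$decomposed = "e\xCC\x81";
$composed = Validator::toNFC( $decomposed );

// In NFC the pair composes to the single code point U+00E9 (bytes C3 A9).
var_dump( $composed === "\xC3\xA9" );

// U+FB01 LATIN SMALL LIGATURE FI: unchanged by NFC, but NFKC applies the
// compatibility mapping and yields the two ASCII letters "fi".
$ligature = "\xEF\xAC\x81";
$compat = Validator::toNFKC( $ligature );
var_dump( $compat === "fi" );
```

Both comparisons follow from the Unicode definitions of the normalization forms, so they are a reasonable sanity check after installation.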

Performance is poor in absolute terms, though it should be speedy on pure ASCII text. On text that can quickly be determined to already be in NFC it's not too awful, but otherwise it can get uncomfortably slow, particularly for Korean text (the hangul decomposition/composition code is extra slow).
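To get a rough feel for this on your own setup, a timing sketch along these lines may help. The helper timeToNFC() and the sample strings are made up for illustration, and the numbers depend heavily on whether the 'intl' extension is loaded, so no figures are quoted here:

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor next to this script.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Hypothetical helper: time a single toNFC() call and print the result.
function timeToNFC( string $label, string $text ): float {
	$start = microtime( true );
	Validator::toNFC( $text );
	$elapsed = microtime( true ) - $start;
	printf( "%-12s %8d bytes  %.4f s\n", $label, strlen( $text ), $elapsed );
	return $elapsed;
}

// Pure ASCII text.
$ascii = str_repeat( 'plain ASCII text ', 20000 );

// Precomposed hangul syllables U+D55C U+AE00, already in NFC.
$hangulNfc = str_repeat( "\xED\x95\x9C\xEA\xB8\x80 ", 20000 );

// The same two syllables spelled as conjoining jamo
// (U+1112 U+1161 U+11AB, U+1100 U+1173 U+11AF), which must be composed.
$hangulJamo = str_repeat(
	"\xE1\x84\x92\xE1\x85\xA1\xE1\x86\xAB" .
	"\xE1\x84\x80\xE1\x85\xB3\xE1\x86\xAF ",
	20000
);

timeToNFC( 'ascii', $ascii );
timeToNFC( 'hangul NFC', $hangulNfc );
timeToNFC( 'hangul jamo', $hangulJamo );
```

A single run is noisy; repeating each measurement a few times and comparing the ratios between the three inputs is more informative than any one absolute number.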

Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.

To use it in your project, run composer require wikimedia/utfnormal.

This library was first introduced in MediaWiki 1.3 (rev:4965). It was split out of the MediaWiki codebase and published as an independent library during the MediaWiki 1.25 development cycle.

External links