utfnormal is a library that contains Unicode normalization routines. It includes pure PHP implementations, and automatically uses the php-intl extension if installed.
The main function to care about is
UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and if necessary convert to normalization form C (NFC). See also "Unicode equivalence" on Wikipedia.
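As a rough illustration of the call described above, here is a minimal sketch. It assumes the package was installed with Composer, that the script sits next to the `vendor/` directory, and that `cleanUp()` is called statically with a single string argument. The sample strings are made up for the example.

```php
<?php
// Assumption: Composer's autoloader is available at this path.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// "e" followed by U+0301 COMBINING ACUTE ACCENT, i.e. a decomposed sequence.
$decomposed = "e\u{0301}";

// cleanUp() validates the input and converts it to NFC if needed,
// so the pair should come back as the single precomposed code point U+00E9.
$clean = Validator::cleanUp( $decomposed );
var_dump( $clean === "\u{00E9}" );

// Untrusted input containing 0xFF, a byte that never occurs in valid UTF-8.
$dirty = "abc\xFFdef";
$result = Validator::cleanUp( $dirty );

// Whatever cleanUp() does with the bad byte, the result should be valid UTF-8.
// preg_match() with the /u modifier fails on invalid UTF-8 subjects.
var_dump( preg_match( '//u', $result ) === 1 );
```

Running untrusted input through cleanUp() once, at the point where it enters your application, means the rest of the code can assume valid, NFC-normalized UTF-8.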
If you know the string is already valid UTF-8, you can directly call
toNFC() or toNFKC(); these will convert a given UTF-8 string to Normalization Form C or KC, respectively, if it's not already in that form. The functions assume that the input string is already valid UTF-8; if there are corrupt characters this may produce erroneous results.
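To show the difference between the canonical and compatibility forms, here is a small sketch. It assumes `UtfNormal\Validator::toNFC()` and `UtfNormal\Validator::toNFKC()` are available as static methods taking one string, and that the input is already valid UTF-8, as the text above requires. The sample character is chosen only for illustration.

```php
<?php
// Assumption: Composer's autoloader is available at this path.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// U+FB01 LATIN SMALL LIGATURE FI has a compatibility mapping to "fi"
// but no canonical decomposition.
$ligature = "\u{FB01}";

// NFC only applies canonical mappings, so the ligature is left alone.
$nfc = Validator::toNFC( $ligature );

// NFKC also applies compatibility mappings, folding the ligature to two letters.
$nfkc = Validator::toNFKC( $ligature );

var_dump( $nfc === $ligature );
var_dump( $nfkc === 'fi' );
```

NFKC discards distinctions such as ligatures and width variants, so it is usually better suited to search keys and comparisons than to text you intend to store and display.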
Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) On text that can quickly be determined to already be in NFC, it's not too awful, but on anything else it can get uncomfortably slow, particularly for Korean text (the Hangul decomposition/composition code is extra slow).
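If you want to see how this plays out on your own setup, a rough timing sketch like the following may help. The helper `timeCleanUp()` is hypothetical (it is not part of the library), the sample strings are arbitrary, and the code needs PHP 7.3 or later for `hrtime()`. Results will depend heavily on whether the intl extension is installed.

```php
<?php
// Assumption: Composer's autoloader is available at this path.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

/**
 * Hypothetical helper: average wall-clock time of cleanUp() in milliseconds.
 */
function timeCleanUp( string $text, int $rounds = 100 ): float {
	$start = hrtime( true ); // monotonic clock, in nanoseconds
	for ( $i = 0; $i < $rounds; $i++ ) {
		Validator::cleanUp( $text );
	}
	return ( hrtime( true ) - $start ) / 1e6 / $rounds;
}

// Pure ASCII text, which the paragraph above expects to be fast.
$ascii = str_repeat( 'The quick brown fox. ', 500 );

// Decomposed Hangul jamo (U+1112 U+1161 U+11AB), which must be composed
// into a precomposed syllable to reach NFC.
$jamo = str_repeat( "\u{1112}\u{1161}\u{11AB} ", 500 );

printf( "ASCII: %.4f ms per call\n", timeCleanUp( $ascii ) );
printf( "Hangul jamo: %.4f ms per call\n", timeCleanUp( $jamo ) );
```

Treat the numbers as a relative comparison between inputs on one machine rather than as absolute benchmarks.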
Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.
To use it in your project, run
composer require wikimedia/utfnormal.
- Source code on gerrit.wikimedia.org (GitHub mirror)
- Package on Packagist.org
- API documentation
- Issue tracker