utfnormal is a library that contains Unicode normalization routines. It includes pure PHP implementations, and automatically uses the php-intl extension if installed.
The main function to care about is UtfNormal\Validator::cleanUp(). This will strip illegal UTF-8 sequences and characters that are illegal in XML, and, if necessary, convert the text to Normalization Form C (NFC). See also "Unicode equivalence" on Wikipedia.
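A minimal sketch of calling cleanUp(), assuming the package was installed with Composer and that vendor/autoload.php sits next to the script. The autoloader path and the sample strings are illustrative assumptions, not part of the library:

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor relative to this script.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// "e" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not in NFC.
$decomposed = "e\u{0301}";
$composed = Validator::cleanUp( $decomposed );

// NFC composes the pair into the single code point U+00E9.
var_dump( $composed === "\u{00E9}" );

// A stray 0xFF byte is never valid in UTF-8; after cleanUp() the string
// should be well-formed again, which the /u modifier lets us check.
$cleaned = Validator::cleanUp( "abc\xFFdef" );
var_dump( preg_match( '//u', $cleaned ) === 1 );
```

cleanUp() is the safe default for untrusted input, since it validates and normalizes in one call.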
If you know the string is already valid UTF-8, you can call the normalization methods of UtfNormal\Validator directly. These convert a given UTF-8 string to Normalization Form C, D, KC, or KD if it is not already in that form. They assume that the input string is already valid UTF-8; if it contains corrupt characters, the result may be erroneous.
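A sketch of such a direct call, assuming toNFC() and toNFKC() are available as static methods on UtfNormal\Validator (the method names are not spelled out here, so check the API documentation) and that Composer's autoloader is at the illustrative path shown:

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor relative to this script.
require __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// Known-valid UTF-8: U+FB01 LATIN SMALL LIGATURE FI, then "e" + U+0301.
$text = "\u{FB01}e\u{0301}";

// Canonical composition only: the ligature is kept, the accent is composed.
$nfc = Validator::toNFC( $text );

// Compatibility composition: the ligature is additionally folded to "fi".
$nfkc = Validator::toNFKC( $text );

// Print the raw bytes so the difference between the two forms is visible.
var_dump( bin2hex( $nfc ), bin2hex( $nfkc ) );
```

Because these methods skip validation, they are only appropriate for strings that have already been checked, for example ones that previously went through cleanUp().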
Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) On text that can be determined quickly to already be in NFC it's not too awful but it can quickly get uncomfortably slow, particularly for Korean text (the Hangul decomposition/composition code is extra slow).
Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.
To use it in your project, run
composer require wikimedia/utfnormal.
- Source code (Phabricator mirror)
- Composer package
- API Documentation
- Test coverage report
- Issue tracker