Fun with mb_strlen

I posted this on the dev blog as well.

I noticed the fallback implementation for mb_strlen that we had in GlobalSettings.php sucked:

 function mb_strlen( $str, $enc = "" ) {
     preg_match_all( '/./us', $str, $matches );
     return count($matches);
 }

There are two things to note about this code:
 * 1) It doesn't actually work: count($matches) counts the outer result array, which holds a single entry (the list of full-pattern matches), so it always returns 1
 * 2) Even if you fix it to count the matches themselves, it's extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string
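To make the first point concrete, here is a small sketch comparing the fallback as written with a variant that counts the inner match list. The two function names are made up for illustration and are not from GlobalSettings.php.

```php
<?php
// Hypothetical name: the fallback exactly as quoted above.
function old_mb_strlen_as_written( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	// $matches is array( 0 => array( ...every matched character... ) ),
	// so this counts the single outer entry.
	return count( $matches );
}

// Hypothetical name: the same idea, counting the matched characters instead.
function old_mb_strlen_fixed( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	return count( $matches[0] );
}

$sample = "h\xC3\xA9llo"; // "héllo" as UTF-8: 6 bytes, 5 characters

echo old_mb_strlen_as_written( $sample ), "\n"; // 1
echo old_mb_strlen_fixed( $sample ), "\n";      // 5
```

The fixed variant gives the right answer, but it still builds an array with one element per character, which is where the time and memory go.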

I'm replacing this with a new version which uses PHP's count_chars function to count up the ASCII-compatible bytes and multibyte sequence head bytes. It's still a smidge slower than mb_strlen but it's... much better than the old one.
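A minimal sketch of that approach, assuming the input is UTF-8. The function name is hypothetical and this is not necessarily the exact code that went in. The idea is that every character contributes exactly one byte that is either plain ASCII (0x00-0x7F) or a multibyte head byte (0xC0 and up), while continuation bytes (0x80-0xBF) are skipped.

```php
<?php
// Hypothetical name; a sketch of the count_chars()-based fallback.
function sketch_mb_strlen( $str, $enc = "" ) {
	// Mode 0 (the default) returns an array indexed by byte value 0..255,
	// holding how many times each byte occurs in the string.
	$counts = count_chars( $str );
	$total = 0;

	// ASCII-compatible bytes: one byte, one character.
	for ( $i = 0; $i < 0x80; $i++ ) {
		$total += $counts[$i];
	}

	// Multibyte sequence head bytes: one per multibyte character.
	// Continuation bytes 0x80..0xBF are deliberately not counted.
	for ( $i = 0xC0; $i <= 0xFF; $i++ ) {
		$total += $counts[$i];
	}

	return $total;
}

echo sketch_mb_strlen( "h\xC3\xA9llo" ), "\n";                // 5
echo sketch_mb_strlen( "\xE6\x97\xA5\xE6\x9C\xAC" ), "\n";    // 2
```

Because count_chars does a single pass over the bytes and the loops touch a fixed 256-entry array, memory use stays constant no matter how long the string is.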

Some quick benchmarks using the UTF-8 normalization benchmark pages (/code):

 Testing washington.txt:
        strlen     31526 chars    0.007ms
     mb_strlen     31526 chars    0.114ms
 old_mb_strlen     31526 chars 4813.686ms
 new_mb_strlen     31526 chars    0.132ms

 Testing berlin.txt:
        strlen     36320 chars    0.001ms
     mb_strlen     35899 chars    0.129ms
 old_mb_strlen     35899 chars 6328.748ms
 new_mb_strlen     35899 chars    0.127ms

 Testing bulgakov.txt:
        strlen     36849 chars    0.001ms
     mb_strlen     20418 chars    0.076ms
 old_mb_strlen     20418 chars 3003.042ms
 new_mb_strlen     20418 chars    0.133ms

 Testing tokyo.txt:
        strlen     36244 chars    0.001ms
     mb_strlen     19936 chars    0.071ms
 old_mb_strlen     19936 chars 2623.109ms
 new_mb_strlen     19936 chars    0.131ms

 Testing young.txt:
        strlen     36694 chars    0.001ms
     mb_strlen     16676 chars    0.063ms
 old_mb_strlen     16676 chars 2246.179ms
 new_mb_strlen     16676 chars    0.125ms
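For anyone wanting to try something similar, here is one way timings like these might be gathered. This is a sketch only, not the benchmark script used above: the helper name, the iteration count and the generated input string are all assumptions standing in for the real test files.

```php
<?php
// Hypothetical helper: returns array( last result, average ms per call ).
function time_call_ms( $fn, $arg, $iterations = 100 ) {
	$result = null;
	$start = microtime( true );
	for ( $i = 0; $i < $iterations; $i++ ) {
		$result = call_user_func( $fn, $arg );
	}
	$elapsed = microtime( true ) - $start;
	return array( $result, $elapsed * 1000 / $iterations );
}

// Stand-in input (14 bytes repeated 1000 times); the real runs read
// files such as washington.txt.
$text = str_repeat( "h\xC3\xA9llo w\xC3\xB6rld ", 1000 );

list( $len, $ms ) = time_call_ms( 'strlen', $text );
printf( "%14s %9d chars %10.3fms\n", 'strlen', $len, $ms );
```

Any other length function can be passed in place of 'strlen' to fill in the remaining rows.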

--brion 16:31, 9 March 2007 (UTC)