Fun with mb strlen
From MediaWiki.org
(I posted this on the dev blog as well.)
I noticed the fallback implementation for mb_strlen() that we had in GlobalFunctions.php sucked:
function mb_strlen( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	return count( $matches );
}
There are two things to note about this code:
- It doesn't actually work: preg_match_all() puts the list of full-pattern matches in $matches[0], so count($matches) counts the outer array and always returns 1
- Even if you fix it to count $matches[0], it's extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string
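To make the first point concrete, here is a small sketch that runs the quoted fallback next to a variant that counts the full-pattern matches. The name fixed_old_mb_strlen() and the sample string are made up for illustration; old_mb_strlen() is the fallback from above under the name used in the benchmarks.

```php
<?php
// The fallback as quoted above: count() sees only the outer $matches array.
function old_mb_strlen( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	return count( $matches );
}

// Hypothetical "fixed" variant: counts the full-pattern matches instead.
// Correct, but it still builds an array with one entry per character.
function fixed_old_mb_strlen( $str, $enc = "" ) {
	preg_match_all( '/./us', $str, $matches );
	return count( $matches[0] );
}

// "naive" with a diaeresis on the i: 5 characters, 6 bytes in UTF-8.
$sample = "na\xC3\xAFve";

echo old_mb_strlen( $sample ), "\n";       // 1
echo fixed_old_mb_strlen( $sample ), "\n"; // 5
```

The second function is only here to show where the count went wrong; its memory use is exactly the problem described in the second point.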
I'm replacing this with a new version which uses PHP's count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes; every character in valid UTF-8 starts with exactly one of those, so the total is the character count. It's still a smidge slower than mb_strlen but it's... much better than the old one.
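A minimal sketch of that approach, assuming the input is valid UTF-8. It is not necessarily the exact code that was committed; the byte ranges are my reading of "ASCII-compatible bytes" (0x00-0x7F) and "head bytes" (0xC0 and up), and the function is named new_mb_strlen() after the benchmark label so it doesn't clash with a real mb_strlen().

```php
<?php
// Sketch of a count_chars()-based length for UTF-8 strings.
// $enc is accepted for signature compatibility and ignored.
function new_mb_strlen( $str, $enc = "" ) {
	// In its default mode, count_chars() returns a 256-entry array
	// mapping each byte value to the number of times it occurs.
	$counts = count_chars( $str );
	$total = 0;

	// ASCII-compatible bytes: one character each.
	for ( $i = 0; $i < 0x80; $i++ ) {
		$total += $counts[$i];
	}

	// Head bytes of multibyte sequences: one character each.
	// Continuation bytes (0x80-0xBF) are deliberately skipped.
	for ( $i = 0xC0; $i <= 0xFF; $i++ ) {
		$total += $counts[$i];
	}

	return $total;
}

echo new_mb_strlen( "na\xC3\xAFve" ), "\n"; // 5 (the string is 6 bytes long)
```

The work is one pass over the string inside count_chars() plus a fixed 192 additions, and the only allocation is the 256-entry array, no matter how long the string is.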
Some quick benchmarks using the UTF-8 normalization benchmark pages:
Testing washington.txt:
  strlen         31526 chars     0.007ms
  mb_strlen      31526 chars     0.114ms
  old_mb_strlen  31526 chars  4813.686ms
  new_mb_strlen  31526 chars     0.132ms
Testing berlin.txt:
  strlen         36320 chars     0.001ms
  mb_strlen      35899 chars     0.129ms
  old_mb_strlen  35899 chars  6328.748ms
  new_mb_strlen  35899 chars     0.127ms
Testing bulgakov.txt:
  strlen         36849 chars     0.001ms
  mb_strlen      20418 chars     0.076ms
  old_mb_strlen  20418 chars  3003.042ms
  new_mb_strlen  20418 chars     0.133ms
Testing tokyo.txt:
  strlen         36244 chars     0.001ms
  mb_strlen      19936 chars     0.071ms
  old_mb_strlen  19936 chars  2623.109ms
  new_mb_strlen  19936 chars     0.131ms
Testing young.txt:
  strlen         36694 chars     0.001ms
  mb_strlen      16676 chars     0.063ms
  old_mb_strlen  16676 chars  2246.179ms
  new_mb_strlen  16676 chars     0.125ms
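For anyone who wants to try something similar, here is a rough sketch of a timing loop that prints lines in this shape. It is not the script that produced the numbers above: time_length_call() is a hypothetical helper, and the input is a made-up stand-in for the benchmark text files.

```php
<?php
// Hypothetical helper: time a single call of $fn on $text.
// Returns array( length, elapsed milliseconds ).
function time_length_call( $fn, $text ) {
	$start = microtime( true );
	$length = call_user_func( $fn, $text );
	$elapsed = ( microtime( true ) - $start ) * 1000;
	return array( $length, $elapsed );
}

// Stand-in input; the real runs read files such as washington.txt.
$text = str_repeat( "na\xC3\xAFve ", 5000 );

$candidates = array( 'strlen' => 'strlen' );
if ( function_exists( 'mb_strlen' ) ) {
	// Pass the encoding explicitly so the result does not depend on php.ini.
	$candidates['mb_strlen'] = function ( $s ) {
		return mb_strlen( $s, 'UTF-8' );
	};
}

foreach ( $candidates as $name => $fn ) {
	list( $length, $ms ) = time_length_call( $fn, $text );
	printf( "%-14s %d chars %.3fms\n", $name, $length, $ms );
}
```

Any other length function can be added to $candidates by name or as a closure. A single call per function is noisy at sub-millisecond scales, so averaging over many runs would give steadier numbers.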
--brion 16:31, 9 March 2007 (UTC)

