Fun with mb_strlen

From mediawiki.org

I posted this on the dev blog as well.

I noticed the fallback implementation for mb_strlen() that we had in GlobalSettings.php sucked:

	function mb_strlen( $str, $enc = "" ) {
		preg_match_all( '/./us', $str, $matches );
		return count($matches);
	}

There are two things to note about this code:

  1. It doesn't actually work. preg_match_all() stores the full-pattern matches in $matches[0], so $matches itself holds a single element and count($matches) always returns 1.
  2. Even if you fix it to return the matches, it's extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string.
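
For reference, the fix mentioned in point 2 might look roughly like the sketch below. The name regex_mb_strlen is made up for this example and is not what MediaWiki uses. It returns the right count, but it still builds an array with one entry per character, which is the speed and memory problem.

```php
<?php
// Hypothetical name, for illustration only.
function regex_mb_strlen( $str ) {
	// With the default PREG_PATTERN_ORDER flag, $matches[0] holds every
	// full-pattern match, i.e. one entry per UTF-8 character.
	preg_match_all( '/./us', $str, $matches );
	return count( $matches[0] );
}
```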

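I'm replacing this with a new version which uses PHP's count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. In valid UTF-8 every character contains exactly one such byte, and continuation bytes (0x80-0xBF) are skipped, so the sum is the character count. It's still a smidge slower than the native mb_strlen() but it's... much better than the old one.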
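
The counting approach can be sketched as follows. This is a minimal illustration, not necessarily the code that was committed: the name sketch_mb_strlen is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical name; a sketch of the byte-counting idea, assuming valid UTF-8.
function sketch_mb_strlen( $str ) {
	// Mode 0: array indexed by byte value (0..255) with occurrence counts.
	$counts = count_chars( $str, 0 );
	$total = 0;

	// ASCII bytes (0x00-0x7F) are one character each.
	for ( $i = 0x00; $i < 0x80; $i++ ) {
		$total += $counts[$i];
	}

	// Lead bytes of multibyte sequences (0xC0 and up). Continuation
	// bytes (0x80-0xBF) are never added, so each sequence counts once.
	for ( $i = 0xC0; $i <= 0xFF; $i++ ) {
		$total += $counts[$i];
	}

	return $total;
}
```

Whatever the length of the string, the only array involved is the fixed 256-entry table from count_chars(), which is where the memory saving over the regex version comes from.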

Some quick benchmarks using the UTF-8 normalization benchmark pages (/code):

Testing washington.txt:
              strlen      31526 chars    0.007ms
           mb_strlen      31526 chars    0.114ms
       old_mb_strlen      31526 chars 4813.686ms
       new_mb_strlen      31526 chars    0.132ms

Testing berlin.txt:
              strlen      36320 chars    0.001ms
           mb_strlen      35899 chars    0.129ms
       old_mb_strlen      35899 chars 6328.748ms
       new_mb_strlen      35899 chars    0.127ms

Testing bulgakov.txt:
              strlen      36849 chars    0.001ms
           mb_strlen      20418 chars    0.076ms
       old_mb_strlen      20418 chars 3003.042ms
       new_mb_strlen      20418 chars    0.133ms

Testing tokyo.txt:
              strlen      36244 chars    0.001ms
           mb_strlen      19936 chars    0.071ms
       old_mb_strlen      19936 chars 2623.109ms
       new_mb_strlen      19936 chars    0.131ms

Testing young.txt:
              strlen      36694 chars    0.001ms
           mb_strlen      16676 chars    0.063ms
       old_mb_strlen      16676 chars 2246.179ms
       new_mb_strlen      16676 chars    0.125ms
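
A timing loop along these lines would produce output in the same shape as the tables above. It is only a sketch and not the script that generated those numbers: the helper name time_call and the generated sample string (standing in for one of the test files) are assumptions made for the example.

```php
<?php
// Hypothetical helper: time one call of a string-length function.
function time_call( $func, $text ) {
	$start = microtime( true );
	$length = call_user_func( $func, $text );
	$elapsed = ( microtime( true ) - $start ) * 1000; // seconds to milliseconds
	return array( 'length' => $length, 'ms' => $elapsed );
}

// Stand-in for a test file: 1000 copies of a 7-byte, 6-character string.
$text = str_repeat( "h\xC3\xA9llo ", 1000 );
$result = time_call( 'strlen', $text );
printf( "%20s %10d chars %9.3fms\n", 'strlen', $result['length'], $result['ms'] );
```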

--brion 16:31, 9 March 2007 (UTC)