Bugzilla/Notes/164

bugzilla:164: Support collation by a certain locale

Proposed solution

 * bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164: use MySQL collation
 * contact the MySQL developer team to add language-specific collations for UTF-8
 * set the default collation to utf8_bin
 * sort by passing a specific collation in the MySQL query, e.g. mysql> SELECT ... ORDER BY cl_sortkey COLLATE utf8_czech_ci LIMIT 201;
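
A minimal sketch of how the COLLATE clause from the last bullet might be assembled in application code. Collation names generally cannot be bound as ordinary query parameters, so this sketch checks the name against an allowlist before interpolating it. The function name order_by_clause and the contents of ALLOWED_COLLATIONS are assumptions for illustration, not MediaWiki code.

```python
# Hypothetical allowlist; in practice it could be filled from SHOW COLLATION.
ALLOWED_COLLATIONS = {"utf8_bin", "utf8_general_ci", "utf8_czech_ci"}


def order_by_clause(collation, column="cl_sortkey"):
    """Return an ORDER BY fragment that sorts `column` with `collation`."""
    # Reject anything not explicitly allowed, since the name is interpolated.
    if collation not in ALLOWED_COLLATIONS:
        raise ValueError(f"unsupported collation: {collation!r}")
    return f"ORDER BY {column} COLLATE {collation}"


print(order_by_clause("utf8_czech_ci"))
# prints: ORDER BY cl_sortkey COLLATE utf8_czech_ci
```

An unknown collation name raises ValueError instead of reaching the database.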

Drawback

 * bugzilla:164, bugzilla:164: The list of supported collations will depend on the database engine.
 * Requires MySQL >= 4.1 to support the COLLATE keyword in sorting.
 * bugzilla:164: This requires major database alterations and possibly a massive, potentially breaking MySQL upgrade. (This is specific to Wikimedia projects, not to the MediaWiki software).
 * bugzilla:164, bugzilla:164: sorting may still be slow when the requested collation does not match the default collation of the table/column (FIXME: needs confirmation)

Proposed solution

 * bugzilla:164: use libc's LC_COLLATE on GNU/Linux
 * bugzilla:164: use ICU4C/libicu (FIXME: Does this support sort key generation, so that it can be used in the hybrid approach?)
 * bugzilla:164: use the PHP Internationalization extension (Intl), an ICU wrapper for PHP (FIXME: Does this support sort key generation, so that it can be used in the hybrid approach?)

Drawback

 * bugzilla:164: MediaWiki will need to fetch all page titles (for that category) from the database in order to sort them.

Problem

 * bugzilla:164: LC_COLLATE does not work in PHP's sort
 * DONE: It doesn't work when set in an environment variable, but it works when set via setlocale in PHP, for example, setlocale(LC_COLLATE, "en_US"). (See Appendix A.) --Ans 13:11, 12 May 2008 (UTC)

Hybrid database/PHP level sorting
(bugzilla:164: PHP level for key generation and database level for sorting)

Proposed solution

 * bugzilla:164: each language plugs in a filter to generate sort keys from its text.
 * bugzilla:164, bugzilla:164, bugzilla:164: normalize the input string using Unicode normalization form NFKD ("compatibility decomposition")
 * bugzilla:164: This doesn't solve Thai sorting.
 * bugzilla:164, bugzilla:164, bugzilla:164: use Unihan database for Japanese
 * bugzilla:164, bugzilla:164: use a Thai sorting algorithm to generate sort keys for the Thai language
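
The NFKD-based key generation proposed above can be sketched as follows. This is only an illustration under stated assumptions: the function name sort_key is hypothetical, and dropping combining marks plus case folding is one possible filter, not a rule taken from the bug discussion. The resulting key would be what gets stored in the sort key column, so the database can keep using a binary collation.

```python
import unicodedata


def sort_key(text):
    """Build a simple sort key: NFKD-decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks (accents and similar) have a non-zero combining class.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


titles = ["Zebra", "Émile", "apple"]
print(sorted(titles, key=sort_key))
# prints: ['apple', 'Émile', 'Zebra']
```

A plain code point sort of the same list would give Zebra, apple, Émile. Note that this filter treats "É" exactly like "e", which is wrong for languages that sort accented letters separately, and it does not address Thai at all; that is why the list above asks for per-language filters.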

Drawback

 * Key generation code must be written for each language (FIXME: Not sure whether ICU supports this kind of key generation. If it does, this drawback does not apply)

Proposed client interface

 * bugzilla:164: add something like Category-sort-option:en-GB, Category-sort-option:ja, or Category-sort-option:th to the category page
 * bugzilla:164: set the collation in the MediaWiki config, $wgDBcollation
 * bugzilla:164
 * The collation can be specified in a CGI variable, as in http://.../Category:Abc?collation=czech_ci, and in Special:Preferences.
 * For the setting in Special:Preferences, fetch all available collations with mysql> show collation like 'utf8_%'; and put them in a dropdown list.
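
A rough sketch of how the CGI variable could be resolved against the list of available collations. Everything named here is an assumption for illustration: the function collation_from_url, the host wiki.example, the "utf8_" prefix mapping from the short name in the URL to the MySQL collation name, and the fallback to utf8_bin. The available set stands in for the result of the show collation query.

```python
from urllib.parse import urlparse, parse_qs


def collation_from_url(url, available, default="utf8_bin"):
    """Pick the collation named in the `collation` query parameter, if available."""
    params = parse_qs(urlparse(url).query)
    requested = params.get("collation", [None])[0]
    if requested is None:
        return default
    # Assumed mapping: the URL carries the name without the charset prefix.
    candidate = "utf8_" + requested
    return candidate if candidate in available else default


# Stand-in for the rows returned by: show collation like 'utf8_%';
available = {"utf8_bin", "utf8_general_ci", "utf8_czech_ci"}
url = "http://wiki.example/Category:Abc?collation=czech_ci"
print(collation_from_url(url, available))
# prints: utf8_czech_ci
```

Falling back to the default for unknown names keeps a mistyped or malicious parameter from reaching the query.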

Multilingual sorting
This section discusses which of the two we should implement: "per language collation" or "universal multilingual collation". The latter will be needed by multilingual wikis like meta wiki (bugzilla:164). It also concerns sorting phrases that mix several languages, for example,


 * Help Hjælp ヘルプ 帮助 Помощь วิธีใช้
 * Help Hjælp Помощь วิธีใช้ ヘルプ 帮助
 * discussion diskussion ‐ノート 对话 Дискуссия คุย
 * discussion diskussion Дискуссия คุย ‐ノート 对话

With multilingual collation, there is no need to specify the collation for most languages.

Problem

 * Some languages have more than one collation algorithm. --Ans 12:12, 12 May 2008 (UTC)
 * For a given word in the phrase, it is not possible to determine which language the word is in. --Ans 12:12, 12 May 2008 (UTC)

firstChar

 * Language::firstChar must be handled for the Thai language as soon as category sorting (especially Thai sorting) is implemented at the database level or PHP level (no fix is needed in the hybrid approach) --Ans 12:25, 12 May 2008 (UTC)
 * When sorted by Thai collation, the Thai words "กา" and "เก" are put into the same group "ก", so firstChar must return "ก" for both "กา" and "เก". (In binary collation, "เก" would be in group "เ".) --Ans 13:39, 12 May 2008 (UTC)
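
The grouping described above can be illustrated with a deliberately simplified sketch. It assumes that only the five leading vowels (U+0E40 to U+0E44), which are written before the consonant they belong to, need to be moved behind that consonant; real Thai dictionary ordering has further rules (tone marks, for instance) that are ignored here. The names thai_sort_key and first_char are hypothetical and are not the MediaWiki Language::firstChar implementation.

```python
# Leading vowels U+0E40..U+0E44 are written before their consonant.
LEADING_VOWELS = "\u0e40\u0e41\u0e42\u0e43\u0e44"


def thai_sort_key(word):
    """Move each leading vowel behind the character that follows it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def first_char(word):
    """Return the group heading: the first character of the sort key."""
    return thai_sort_key(word)[:1]


words = ["เก", "ข", "กา"]
print(sorted(words))                     # binary order
print(sorted(words, key=thai_sort_key))  # leading vowels rearranged
print(first_char("เก"), first_char("กา"))
```

With this key, "เก" and "กา" both fall under the heading "ก", while a plain code point sort places "เก" after "ข".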

Appendix A: LC_COLLATE and PHP sort
In an experiment with php 5.0.5-2ubuntu1.8 on Ubuntu GNU/Linux, setting the LC_ALL or LC_COLLATE environment variable does not affect the php5 sort function (bugzilla:164).

 $ LC_ALL=C php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }'
 A
 B
 a
 b
 $ LC_ALL=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }'
 A
 B
 a
 b
 $ LC_COLLATE=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }'
 A
 B
 a
 b

But it works when the locale is set via setlocale:

 $ php5 -r 'setlocale(LC_COLLATE, "en_US"); $a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }'
 a
 A
 b
 B
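
For comparison, the same pattern of setting the collation locale explicitly before sorting can be sketched in Python, whose locale module wraps the same C library facility. A likely explanation for the PHP result is that a C program starts in the "C" locale until setlocale is called, so the environment variables alone have no effect; this is an assumption about the cause, not something confirmed in the bug. The helper name locale_sorted is hypothetical.

```python
import locale


def locale_sorted(items, locale_name):
    """Sort strings with the named LC_COLLATE locale, set explicitly."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not installed on the system.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(items, key=locale.strxfrm)
    finally:
        # Restore the earlier setting so other code is not affected.
        locale.setlocale(locale.LC_COLLATE, previous)


print(locale_sorted(["a", "b", "A", "B"], "C"))
# prints: ['A', 'B', 'a', 'b']
```

Passing a locale such as "en_US.UTF-8" instead of "C" should change the order in the same way as the PHP run above, provided that locale is installed on the system.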