Bugzilla/Notes/164

bugzilla:164: Support collation by a certain locale

Proposed solution

 * bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164: use MySQL collation
 * contact MySQL developer team to add language-specific collation on UTF-8
 * set default collation to utf8_bin
 * sorting by sending a specific collation to MySQL query like,
 * use  on-the-fly in MySQL query, to deal with legacy (latin1_bin) table (convert'ing to utf8 need MySQL 5.x to work correctly), then no need to convert existing table to utf8.  The query will then look like,   --Ans 09:59, 13 May 2008 (UTC)


 * 1)   The query can't be simplified using column alias like, , since the column alias can't be used in WHERE clause.  (Use HAVING clause instead? --Ans 11:54, 13 May 2008 (UTC))

Drawback

 * bugzilla:164, bugzilla:164: The list of supported collations will depend on the database engine.
 * need MySQL >= 4.1 to support COLLATE keyword in sorting.
 * bugzilla:164: This requires major database alterations and possibly a massive, potentially breaking MySQL upgrade. (This is specific to Wikimedia projects, not to the MediaWiki software).
 * requires converting properties of existing data (e.g. charset, collation) in table of previous MediaWiki version
 * DONE: solved using  on-the-fly in MySQL query (see proposed solution above) --Ans 09:59, 13 May 2008 (UTC)
 * requires massive MySQL upgrade. The upgrade from MySQL 4.0 to 5.x requires repairing some tables.  From my experience, upgrading from MySQL 4.0.20 to 5.0.22 may crash the table searchindex, if it is not properly upgraded. --Ans 09:59, 13 May 2008 (UTC)
 * bugzilla:164, bugzilla:164: still slow if it is sorted using the collation that not match the default collation of that table/column (See: Summary: here's how indexed collation works:)

Summary: here's how indexed collation works:
 * 1) A collation is a method of comparing strings, which allows you to sort them into some canonical order.
 * 2) An index is a stored list of data fields from the database which have been pre-sorted using a particular defined collation (method of sorting).
 * 3) This index allows you to look up data by a particular value quickly -- because you can jump directly to the necessary part of the index.
 * 4) This index also allows you to return multiple adjacent data rows in that defined order very quickly -- you jump directly to the start point, then simply read out data in index order until you reach the end point.

But, if you want to pull that data out in a different order, you lose all the benefits of the index.

You can imagine this by the example of a telephone book. There are many different ways you might sort names, but let's use simple examples like:
 * alphabetical order by last name
 * alphabetical order by first name

A normal telephone book in most western societies sorts names alphabetically (by the local language's tradition) starting with the last name, then the first name.

It's trivial to retrieve a list of 100 people in alphabetical order by last, first starting at, say, "Wales, Jimmy". You go to the "W"s, quickly skip to "Wales", quickly skip everybody before "Jimmy", and then just copy out the next 100 names in the list.

Now, if you want to retrieve a list of 100 people by first, last order starting with "Jimmy Wales", how will you do this from the phone book?

You have to look through the entire phone book and create a whole new list of people whose names sort after "Jimmy Wales", then discard all but the first 100 of them.

So that's why you can't just arbitrarily choose which collation to use when sorting page, user, category, etc lists. They have to be chosen ahead of time so an index can be built, or else you spend your time sorting and re-sorting through thousands or millions of entries.

Problem

 * Must also specify the collation in the condition  within MySQL query, , but don't know how to specify it. --Ans 07:21, 13 May 2008 (UTC)
 * DONE: specifiy collation like,  --Ans 08:06, 13 May 2008 (UTC)

Proposed solution

 * bugzilla:164: use libc's LC_COLLATE on GNU/Linux
 * bugzilla:164: Setup each wiki to determine its sorting order on its own, perhaps as a string of characters separated by semi-colons and commas, such as: a,A,ā,Ā; æ, Æ; b,B; c, C; (and so on).
 * bugzilla:164: use ICU4C/libicu (FIXME: Is this support sort key generation, so it can be hybrid?)
 * bugzilla:164: use PHP Internationalization extension (Intl), an ICU wrapper for PHP (FIXME: Is this support sort key generation, so it can be hybrid?)

Drawback

 * bugzilla:164: MediaWiki will need to fetch all page title (for that category) from database to do sorting.
 * Which means this “proposed solution” is completely out of the question (except things like sorting the special pages list). --Mormegil 15:04, 14 May 2008 (UTC)

Problem

 * bugzilla:164: LC_COLLATE not work in PHP's sort
 * DONE: It doesn't work when set in environment variable, but work when set via setlocale in PHP, for example, setlocale(LC_COLLATE, "en_US"). (See ) --Ans 13:11, 12 May 2008 (UTC)

Hybrid database/PHP level sorting
(bugzilla:164: PHP level for key generation and database level for (binary) sorting)

Proposed solution

 * bugzilla:164: languages plug in a filter to generate sort keys from thier text.
 * bugzilla:164, bugzilla:164, bugzilla:164: normalize the input string using Unicode normalization form NFKD ("canonical decomposition")
 * bugzilla:164: This doesn't solve Thai sorting.
 * bugzilla:164, bugzilla:164, bugzilla:164: use Unihan database for Japanese
 * bugzilla:164, bugzilla:164: use Thai sorting algorithm to generate sort key for Thai language

Drawback

 * Need to write code for key generation on each language (FIXME: Not sure if ICU support this key generation. If yes, this drawback will not be the case)

Proposed client interface

 * bugzilla:164: add something like, Category-sort-option:en-GB, Category-sort-option:ja, Category-sort-option:th, to category page
 * bugzilla:164: set collation in MediaWiki config, $wgDBcollation
 * bugzilla:164
 * Collation can be specified in CGI variable like, [ http://.../Category:Abc?collation=czech_ci], and in Special:Preferences.
 * For setting in Special:Preferences, fetch all available collations from, mysql> show collation like 'utf8_%';, putting them in a dropdown list in Special:Preferences.

Multilingual sorting
This section will discuss, which one we should implement, between the "per language collation" and "universal multilingual collation". The latter will be needed by the multilingual wikis like meta wiki (bugzilla:164). It also concerns sorting the phrase that mix many languages in one phrase, for example,


 * Help Hjælp ヘルプ 帮助 Помощь วิธีใช้
 * Help Hjælp Помощь วิธีใช้ ヘルプ 帮助
 * discussion diskussion ‐ノート 对话 Дискуссия คุย
 * discussion diskussion Дискуссия คุย ‐ノート 对话

With multilingual collation, then no need to specify the collation in most languages.

Proposed solution

 * use DUCET (as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164, w:Unicode collation algorithm)) for default multilingual collation --Ans 09:12, 14 May 2008 (UTC)

Problem

 * There're some languages that have more than one collation algorithm. --Ans 12:12, 12 May 2008 (UTC). The problem is that, which one should be chosen to be used in the multilingual collation.  For example, considering the combination, "en+fr1+ja+th+..." and "en+fr2+ja+th+...", which one to be used? --Ans 10:29, 13 May 2008 (UTC)
 * "Some" is a weasel term. :-) Is there a way to know how many languages are OK with a specific collation order, and how many is where it doesn't fit the needs? (My educated guess would be that the former is far more numerous than the latter, which would suggest to implement per language collation order first and do the special sorting for the few specific ones.) --Grin 09:11, 13 May 2008 (UTC)
 * DONE: use DUCET as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164) --Ans 09:12, 14 May 2008 (UTC)
 * For each specific word in the phase, it can't determine in what language the words are. For example, the word car is a valid word in at least ten languages. --Ans 10:29, 13 May 2008 (UTC)
 * DONE: use DUCET as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164) --Ans 09:12, 14 May 2008 (UTC)


 * DUCET is useless for languages like Tibetan and Dzongkha where the second or third character in a word is often the Primary sort key. Need to use a UCA tailored collation as found in CLDR for languages like these. For multi-lingual Wikis may need to combine several tailorings for the languages used CFynn 16:24, 2 August 2008 (UTC)
 * Some Thai words also have second character as the primary sortkey, however, DUCET does define a way to reorder the sequence of those Thai characters. I'm not if DUCET define this in Tibetan and Dzongkha or not. --Ans 06:50, 18 September 2008 (UTC)

firstChar

 * must properly handle Language::firstChar for specific languages, immediately, when the category sorting has been implemented (especially Thai and w:European ordering rules (DUCET also implement these)) in database level or PHP level (no need to fix this in the hybrid approach) --Ans 11:42, 14 May 2008 (UTC)
 * When sorted by Thai collation the Thai word "กา" and "เก", will be put into the same group "ก", then firstChar must return "ก" for both "กา" and "เก". (In binary collation, "เก" will be in group "เ") --Ans 13:39, 12 May 2008 (UTC)
 * For european, the words, Ab ab Àb àb Áb áb Âb âb Ãb ãb Äb äb Åb åb Æb æb, will be in the same group, then firstChar should return "A" (or "a"?). --Ans 11:42, 14 May 2008 (UTC)

Appendix A: LC_COLLATE and PHP sort
From experiment with php 5.0.5-2ubuntu1.8 in Ubuntu GNU/Linux, setting LC_ALL or LC_COLLATE environment variable doesn't affect php5 sort function (bugzilla:164).

$ LC_ALL=C php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b $ LC_ALL=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b $ LC_COLLATE=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b

But it work when set via setlocale,

$ php5 -r 'setlocale(LC_COLLATE, "en_US"); $a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' a A b B