Bugzilla/Notes/164

bugzilla:164: Support collation by a certain locale

Proposed solution

 * bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164: use MySQL collation
 * contact MySQL developer team to add language-specific collation on UTF-8
 * set default collation to utf8_bin
 * sorting by sending a specific collation to MySQL query like,
 * use  on-the-fly in MySQL query, to deal with legacy (latin1_bin) table (convert'ing to utf8 need MySQL 5.x to work correctly), then no need to convert existing table to utf8.  The query will then look like,  --Ans 09:59, 13 May 2008 (UTC)

, since the column alias can't be used in WHERE clause. (Use HAVING clause instead? --Ans 11:54, 13 May 2008 (UTC))

Drawback

 * bugzilla:164, bugzilla:164: The list of supported collations will depend on the database engine.
 * need MySQL >= 4.1 to support COLLATE keyword in sorting.
 * bugzilla:164: This requires major database alterations and possibly a massive, potentially breaking MySQL upgrade. (This is specific to Wikimedia projects, not to the MediaWiki software).
 * requires converting properties of existing data (e.g. charset, collation) in table of previous MediaWiki version
 * DONE: solved using  on-the-fly in MySQL query (see proposed solution above) --Ans 09:59, 13 May 2008 (UTC)
 * requires massive MySQL upgrade. The upgrade from MySQL 4.0 to 5.x requires repairing some tables.  From my experience, upgrading from MySQL 4.0.20 to 5.0.22 may crash the table searchindex, if it is not properly upgraded. --Ans 09:59, 13 May 2008 (UTC)
 * bugzilla:164, bugzilla:164: still slow if it is sorted using the collation that not match the default collation of that table/column (See: Summary: here's how indexed collation works:)

Summary: here's how indexed collation works:
 * 1) A collation is a method of comparing strings, which allows you to sort them into some canonical order.
 * 2) An index is a stored list of data fields from the database which have been pre-sorted using a particular defined collation (method of sorting).
 * 3) This index allows you to look up data by a particular value quickly -- because you can jump directly to the necessary part of the index.
 * 4) This index also allows you to return multiple adjacent data rows in that defined order very quickly -- you jump directly to the start point, then simply read out data in index order until you reach the end point.

But, if you want to pull that data out in a different order, you lose all the benefits of the index.

You can imagine this by the example of a telephone book. There are many different ways you might sort names, but let's use simple examples like:
 * alphabetical order by last name
 * alphabetical order by first name

A normal telephone book in most western societies sorts names alphabetically (by the local language's tradition) starting with the last name, then the first name.

It's trivial to retrieve a list of 100 people in alphabetical order by last, first starting at, say, "Wales, Jimmy". You go to the "W"s, quickly skip to "Wales", quickly skip everybody before "Jimmy", and then just copy out the next 100 names in the list.

Now, if you want to retrieve a list of 100 people by first, last order starting with "Jimmy Wales", how will you do this from the phone book?

You have to look through the entire phone book and create a whole new list of people whose names sort after "Jimmy Wales", then discard all but the first 100 of them.

So that's why you can't just arbitrarily choose which collation to use when sorting page, user, category, etc lists. They have to be chosen ahead of time so an index can be built, or else you spend your time sorting and re-sorting through thousands or millions of entries.

Problem

 * Must also specify the collation in the condition  within MySQL query, , but don't know how to specify it. --Ans 07:21, 13 May 2008 (UTC)
 * DONE: specifiy collation like,  --Ans 08:06, 13 May 2008 (UTC)

Proposed solution

 * bugzilla:164: use libc's LC_COLLATE on GNU/Linux
 * bugzilla:164: Setup each wiki to determine its sorting order on its own, perhaps as a string of characters separated by semi-colons and commas, such as: a,A,ā,Ā; æ, Æ; b,B; c, C; (and so on).
 * bugzilla:164: use ICU4C/libicu ( Does this support sort key generation, so it can be hybrid? Yes, ICU supports this by Collator::getSortKey (bugzilla:164, bugzilla:164).)
 * bugzilla:164: use PHP Internationalization extension (Intl), an ICU wrapper for PHP (Does this support sort key generation, so it can be hybrid?  No, it does not yet support directly.  Yes, but need php-intl 1.0.3+)

Drawback

 * bugzilla:164: MediaWiki will need to fetch all page title (for that category) from database to do sorting.
 * Which means this “proposed solution” is completely out of the question (except things like sorting the special pages list). --Mormegil 15:04, 14 May 2008 (UTC)

Problem

 * bugzilla:164: LC_COLLATE not work in PHP's sort
 * DONE: It doesn't work when set in environment variable, but work when set via setlocale in PHP, for example, setlocale(LC_COLLATE, "en_US"). (See ) --Ans 13:11, 12 May 2008 (UTC)

Hybrid database/PHP level sorting
(bugzilla:164: PHP level for key generation and database level for (binary) sorting)

Proposed solution

 * bugzilla:164, bugzilla:164: Implement an extension to generate sort keys.
 * bugzilla:164, bugzilla:164, bugzilla:164: normalize the input string using Unicode normalization form NFKD ("canonical decomposition")
 * bugzilla:164: This doesn't solve Thai sorting.
 * bugzilla:164, bugzilla:164, bugzilla:164: use Unihan database for Japanese
 * bugzilla:164, bugzilla:164: use Thai sorting algorithm to generate sort key for Thai language
 * bugzilla:164, bugzilla:164, bugzilla:164, bugzilla:164: use ICU4C/libicu for key generation
 * This may be done by implementing the direct C/C++ interface to Collator::getSortKey in mediawiki.
 * bugzilla:164, bugzilla:164: use PHP Internationalization extension (Intl), for key generation
 * bugzilla:164, bugzilla:164: PENDING: waiting for Bug #16831: Need direct interface to ICU's Collator::getSortKey

Where to store the generated sort key?

 * bugzilla:164, bugzilla:164: add a new sort key field
 * bugzilla:164: store in existing categorylinks.cl_sortkey field
 * bugzilla:164: cannot store this generated sort key in categorylinks.cl_sortkey, since the firstChar of the original sort key is needed in rendering the Category pages.
 * Problem: the firstChar is dependant of the the collation used (notably from the first grapheme cluster of each group at primary level). It is not uique even within the same locale (e.g. Chinese Pinyin sort order will use ASCII firstChar, Chinese radical/stroke will use the base radical, Hangul Jamo sort order will use the first basic jamo and not the first Hangul syllable).
 * Side note: Each collation needs its own list of firstChar values, but these values can perfectly be given collation keys like other page names (this means adding firstChar entries to the total count of rows returned in the SQL query to generate the subsection names in the catogory page).
 * The list of firstChar values does not need to be inserted in the category index, it is shared for all categories viewed under the same collation, so it can come from an alternate table, or can be generated on the fly, after sorting the page names, just by parsing each returned page name to extract its firstChar and generate the subsection headings on the fly. consequence: don't need to store the firstChar. so it's not a problem if the generated sortkey is stored in a field like "cl_sortkey" or into a new separate field "cl_collkey" (or a field in an additional table for collations, of there are multiple alternate collations supported in each category).
 * When supporting multiple collation orders in each category, don't forget the category-sort-option:locale1,locale2,... for the locales that will be supported in a given category: the first locale will be the default one used, but the wiki (eg. fr.wikipedia) can also specify an additional site-wide default locale (locale=fr) that should probably be present, at end of this list, just before the locale-neutral DUCET collation (locale=root). Don't need to add or support collation keys for all possible locales, only those that are specified in category-sort-option:locale1,locale2, and in the site-wide default locale(s) (there may be several one by default on CJK wikis) and the DUCET. Most categories won't even use the optional category-sort-option, so these categories will just only, by default, the site-wide default locale(s) and the DUCET.
 * bugzilla:164: add a new sort key field in table of pages
 * bugzilla:164: this does not solve the categories case, since the category sort key for each page can be customized to be different from the page's title
 * Not a problem: Don't use just Collator::getsortKey(cl_pagename) or Collator::getsortKey(cl_sortkey), simply use Collator::getsortKey(cl_sortkey + "higher than everything" + cl_fullpagename) to feed the new database field or to use the SQL-supported collation (problem: how to we choose the "higher than everything" delimiter in a locale-independant way? Can ICU's collation support multiple keys to compute the collation key, from a hierarchical pair of strings?)

Drawback

 * Need to write code for key generation on each language (Not sure if ICU support this key generation. If yes, this drawback will not be the case)
 * bugzilla:164, bugzilla:164: ICU support this by Collator::getSortKey.
 * bugzilla:164: The approach of implementing the direct C/C++ interface to Collator::getSortKey in mediawiki, will make the mediawiki installation to be architecture dependent. It also introduce an additional dependency (libicu) of mediawiki installation.
 * bugzilla:164: This will not be the case for mediawiki core, if the key generation is implemented as an extension.
 * bugzilla:164: Using php-intl for key generation, will introduce additional dependencies (php-intl extension and libicu) of mediawiki installation.
 * bugzilla:164: This will not be the case for mediawiki core, if the key generation is implemented as an extension.

Problem

 * bugzilla:164, bugzilla:164: PHP Internationalization extension (Intl) seem to not yet implement the direct interface to Collator::getSortKey. It provides only the indirect interface.
 * Bug #16831: Need direct interface to ICU's Collator::getSortKey
 * fixed in php-intl 1.0.3

Proposed client interface

 * bugzilla:164: add something like, Category-sort-option:en-GB, Category-sort-option:ja, Category-sort-option:th, to category page
 * bugzilla:164: set collation in MediaWiki config, $wgDBcollation
 * bugzilla:164
 * Collation can be specified in CGI variable like, [ http://.../Category:Abc?collation=czech_ci], and in Special:Preferences.
 * For setting in Special:Preferences, fetch all available collations from, mysql> show collation like 'utf8_%';, putting them in a dropdown list in Special:Preferences.

Multilingual sorting
This section will discuss, which one we should implement, between the "per language collation" and "universal multilingual collation". The latter will be needed by the multilingual wikis like meta wiki (bugzilla:164). It also concerns sorting the phrase that mix many languages in one phrase, for example,


 * Help Hjælp ヘルプ 帮助 Помощь วิธีใช้
 * Help Hjælp Помощь วิธีใช้ ヘルプ 帮助
 * discussion diskussion ‐ノート 对话 Дискуссия คุย
 * discussion diskussion Дискуссия คุย ‐ノート 对话

With multilingual collation, then no need to specify the collation in most languages.

Proposed solution

 * use DUCET (as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164, w:Unicode collation algorithm)) for default multilingual collation --Ans 09:12, 14 May 2008 (UTC)

Problem

 * There're some languages that have more than one collation algorithm. --Ans 12:12, 12 May 2008 (UTC). The problem is that, which one should be chosen to be used in the multilingual collation.  For example, considering the combination, "en+fr1+ja+th+..." and "en+fr2+ja+th+...", which one to be used? --Ans 10:29, 13 May 2008 (UTC)
 * "Some" is a weasel term. :-) Is there a way to know how many languages are OK with a specific collation order, and how many is where it doesn't fit the needs? (My educated guess would be that the former is far more numerous than the latter, which would suggest to implement per language collation order first and do the special sorting for the few specific ones.) --Grin 09:11, 13 May 2008 (UTC)
 * DONE: use DUCET as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164) --Ans 09:12, 14 May 2008 (UTC)
 * For each specific word in the phase, it can't determine in what language the words are. For example, the word car is a valid word in at least ten languages. --Ans 10:29, 13 May 2008 (UTC)
 * DONE: use DUCET as specified in http://www.unicode.org/reports/tr10/ (bugzilla:164) --Ans 09:12, 14 May 2008 (UTC)


 * DUCET is useless for languages like Tibetan and Dzongkha where the second or third character in a word is often the Primary sort key. Need to use a UCA tailored collation as found in CLDR for languages like these. For multi-lingual Wikis may need to combine several tailorings for the languages used CFynn 16:24, 2 August 2008 (UTC)
 * Some Thai words also have second character as the primary sortkey, however, DUCET does define a way to reorder the sequence of those Thai characters. I'm not if DUCET define this in Tibetan and Dzongkha or not. --Ans 06:50, 18 September 2008 (UTC)
 * This is plain wrong. The DUCET is not used alone, but as part of the UCA algorithm, that clearly speaks about the case of the few prepended vowels (currently encoded in Unicode only for the Thai, Lao, and Tai Viet scripts : a dozen characters are concerned only). Unicode also defines very clear properties for these prepended vowels. In other words, there's no defect in the DUCET for these vowels, because the DUCET only gives collation weights for each collation element, but does NOT specify how separate collation elements are to be reordered (the special case of "prepended vowels" is specified in the UCA algorithm itself, and it does not even require any specific tailoring, so you won't find data about them in the CLDR dataset, because this is definitely not a custom tailoring). Verdy p 17:04, 17 July 2011 (UTC)
 * Note that the Tibetan script causes no difficulty: there's no encoded characters in the Tibetan script that uses the "prepended vowel" exception.
 * This exception was initially introduced in the Thai script, mostly because of the desire to keep roundtrip compatbility with the legacy (but then widely used) Thai standard TIS-620, which did not use the "Unicode character model" defined much later and where dependant vowels or diacritics are to be encoded in texts after the base letter; this exception was later extended to the Lao and Tai Viet scripts, only due to lots of similarities of these scripts with Thai, and the desire to maintain roundtrip compatibility of algorithmic transcriptions between those two scripts and the Thai script that has been used for long to transcribe other languages of the region, instead of their native scripts; future similar exceptions are likely to occur in some future, when additional minority scripts of Thailand/Laos will be encoded, if those scripts use prepended vowels.
 * Note that the TIS-620 standard has been used extensively for decennials, including with ancestor encodings created by IBM working in Thailand since the 1960's (so this exception also exists in legacy EBCDIC encodings as well, and even in the legacy 8-bit Thai codepage for Windows, or in the MacOS Thai encoding, all of them predating the encoding of the Thai script in the UCS). When the Thai script was encoded in the UCS (since Unicode 1.0, but it has never changed since then), lots of Thai documents and applications were built by converting them to the UCS from TIS-620 sources, without taking int account the "Unicode character model" that still did not exist at this time. So even if the encoding of texts written in the Thai script seems illogical (because it uses the visual order instead of the logical order), this is because of history of the UCS. For stability of the standard, this will never change.
 * Other Indic scripts, not used in Thailand, use prepended vowels but they are effectively encoded in texts after the base letter (e.g. in Malayalam, used in India, which did not use the TIS-620 standard, but the ISCII standard instead, where they were already encoded as diacritics after the base consonnant in all Indic syllables/aksharas ; this has caused another defect with some scripts used in India, where the ISCII model was not very appropriate, notably for the Tamil script which has its own separate set of exceptions).
 * You should also be aware of exceptions of a different kind in Hebrew (due to the "canonical" ordering of combining diacritics, during normalization, notably for Hebrew points diacritics used for vowels, and sin/shin dot modifiers). But they are less critical because these cases do not occur at the primary collation level. But if you are collating at higher levels, you'll also have to "denormalize" the order of these diacritics, when producing the sequence of collation elements from which you'll get the collation weights for each one in the same order. Verdy p 17:12, 17 July 2011 (UTC)
 * The DUCET is valid including for the Thai and Tibetan scripts: you can automatiaclly determine the script for each character, and apply by default the Thai rules or Tibetan rules for the Thai characters and Tibetan characters, even within a multilingual sort ; this requires reordering some spans of characters, but it requiers some complex rules and a dictionary lookup for Thai to determine the syllable boundaries delimiting the spans to reorder, something which is not very easy to implement and can have significant performance costs. Note that this preprocessing needs not be performed when creating the sorted list : it can be precomputed when the article is saved and categorized (you just need to know which collation the target category uses, and then you can compute the correct sort key and store it when filling the category with the indexed article. Any given category does not need several collations (if you wanted several sort orders, it would be simpler to index the same article several times into separate categories, each one with its own collation defined : this would be useful for Chinese notably and there's no need to specify an extra query parameter to render a category with another tailored collation order) --2009-01-14T13:48:28 (GMT+7) 79.94.111.79
 * The DUCET alone is not defining the collation, because the rest of the processing is also important when computing the sort keys (notably determining the boundaries between scripts, where they can't be part of the same words : the exception being between Japanese scripts where there may exist characters of multiple kana scripts in the same words) --2009-01-14T13:48:28 (GMT+7) 79.94.111.79
 * Be careful also with the script boundaries when there are different directions in BiDi strings : the preprocessing needed for correctly handling BiDi strings is described in the UCA algorithm (BiDi ordering is not uncommon: just consider the digits present in strings along with Hebrew or Arabic letters). This part is not specified by the DUCET alone but by the collation-specific preprocessing that requires reordering (this will be also important for the multilingual collation based on DUCET, which should still support the BiDi algorithm for this preprocessing). --2009-01-14T13:48:28 (GMT+7) 79.94.111.79

firstChar

 * must properly handle Language::firstChar for specific languages, immediately, when the category sorting has been implemented (especially Thai and w:European ordering rules (DUCET also implement these)) in database level or PHP level (no need to fix this in the hybrid approach) --Ans 11:42, 14 May 2008 (UTC)
 * When sorted by Thai collation the Thai word "กา" and "เก", will be put into the same group "ก", then firstChar must return "ก" for both "กา" and "เก". (In binary collation, "เก" will be in group "เ") --Ans 13:39, 12 May 2008 (UTC)
 * For european, the words, Ab ab Àb àb Áb áb Âb âb Ãb ãb Äb äb Åb åb Æb æb, will be in the same group, then firstChar should return "A" (or "a"?). --Ans 11:42, 14 May 2008 (UTC)
 * Should handle the Hangûl script specifically and more gracefully:
 * precomposed syllables must be decomposed (into sequences of the few jamos that actually make the Hangul alphabet) to get the first jamo in them ;
 * the jamos themselves should be sorted according to the selected Korean order (however there are two collation orders for Korean, one for North Korea which is more traditional, and another for South Korea that uses simplified rules) ;
 * special care should be taken to handle gracefully the minor differences between a leading consonnant and a trailing consonnant : the trailing consonnant is extremely similar like a leading consonnant except that there's an implied (non-encoded) syllable separator between the last trailing consonnant at end of a syllable and the first leading consonnant (or vowel if there's no filler or if fillers are filtered out as a even lower priority level) of the next syllable : this difference is used in the modern Hangul encoding for easy determination of syllable breaks, because these breaks are driving the effective rendering of all letters that make a syllable into the same composed square
 * note also that there may exist articles containing "compatibility jamos" (i.e. those that do not make any distinction between leading and trailing consonnants, where the syllabic separation is not easy to determine completely without using a complex lookup, in an algorithm quite similar to the hyphenation breaking rules for Latin/Greek/Cyrillic). These compatibility consonnant jamos should sort at the primary level just like the standard leading consonnant and its associated trailing consonnant.
 * For complete support of the jamo ordering, some of the basic jamos of the modern Encoding are in fact composed of two jamos (notably the "SSANG-" consonnant jamos, which are sorted as if there was two identical successive consonnants). When building the sorted list in a category, this can influence what we consider as the "firstChar" (do we have to decompose the modern SSANG- jamos into their historic components? This may not be important for Wikipedia where modern Korean is used in article names, but it may be very important for Wiktionary that is indexing all languages including their historic versions with their historic orthography, and where the correct ordering would be useful for correctly taking into account the etymology).
 * The Hangûl alphabet is in fact very simple (much more than Latin) if you just ignore the 11000 precomposed syllables and just decompose it first to NFD form, which does not remove any textual information, and then handle the compatibility variants and leading/trailing differences at a higher UCA level (there's not more than about 60 letters in all, including historical letters or historical compound jamos that can also be decomposed using compatibility equivalences, because the Hangul script is the only one that is featural with a very simple and logical composition rule).
 * Note however that searching for a word in a Korean category starting by a prefix should also take into account the same decomposition of the prefix search key specified in the query, to get consistant results in a Korean category (for non-Korean categories, or multilingual categories, where no specific collation can be given and that must be sorted in a language-neutral way, you could still use at least the UCA's DUCET as the default sort order)
 * --2009-01-14T13:48:28 (GMT+7) 79.94.111.79
 * Correctly handle Turkish and Azeri (or other Turkic Altaic languages) written in Latin: be careful about the special casing rule when building sort keys with such locales (see Unicode specs for the details of the special casing rules made for some locales like these ones): "i" and "j" are NOT the same letter as "I", "J", because there is NO "soft-dotting" (like in other languages) in these languages ; in Turkic Altaic languages, the dot above i or j is an important diacritic (logically independant of the base letter that has no dot) which is significant at the secondary level, more important than case differences (at ternary level), and that must be preserved (not added or removed) when the letter case is changed or removed when another diacritic is encoded on top of them (e.g. lowercase "i" plus "acute", is decomposed as "i" + dot + acute in Turkic languages, but just "i" + acute in other Latin-based languages). Such special casing however is not possible for multilingual categories (so they must be sorted using the default UCA rules, ignoring the combining dot except at the fourth level if it is explicitly encoded). This is even more important in Wiktionary (because the lettercase is significant there for the first letter). --2009-01-14T13:48:28 (GMT+7) 79.94.111.79

Appendix A: LC_COLLATE and PHP sort
From experiment with php 5.0.5-2ubuntu1.8 in Ubuntu GNU/Linux, setting LC_ALL or LC_COLLATE environment variable doesn't affect php5 sort function (bugzilla:164).

$ LC_ALL=C php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b $ LC_ALL=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b $ LC_COLLATE=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' A B a b

But it work when set via setlocale,

$ php5 -r 'setlocale(LC_COLLATE, "en_US"); $a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING); foreach ($a as $b) { print "$b\n"; }' a A b B