User:Bawolff/collation2

[My older thoughts on collation are at User:Bawolff/collation]

Problems
Collation support has some non-ideal properties, which have limited its deployment in Wikimedia, and have also made it default to off for third parties.


 * updateCollation.php takes a really long time, and it can't process one category at a time.
 * No versioning of the intl library. This means every time you update PHP, all categories break. This is bad.
 * Changing the intl library requires updateCollation.php --force (since we don't record the version of intl used to make the collation)
 * This means update.php won't fix wikis when you update the version of PHP
 * If there is a collation mismatch (e.g. due to a PHP update, or even just changing what the default collation is), there is no auto-fix over time

What we (I) want

 * We want collation support to be good enough that the installer enables it by default.
 * If something bad happens causing collation to go out of sync, we want it to be able to slowly fix itself, instead of requiring that a script be run (Mostly for a third party context. In a Wikimedia context, we would always run a script)
 * When collations are being updated (via script), we want to minimize disruption. DB should not explode, and categories should be updated in order so that each individual category is in an inconsistent state for as short a time as possible. (See also T119173/T58041/bugzilla 45970)
 * update.php should automatically detect if collations need to be updated (i.e. due to an intl library version change) and run the updates.
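
A rough sketch of the last two goals, written in Python rather than MediaWiki's PHP. Everything in it is hypothetical: the row shape, the tag strings, and the function names are made up for illustration and are not existing MediaWiki code. The idea is to find which categories contain out-of-date rows, and then fix them one whole category at a time in small batches, so that each category spends as little time as possible half-updated.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical row shape: (cl_to, cl_from, collation tag stored on the row).
Row = Tuple[str, int, str]


def stale_by_category(rows: List[Row], current_tag: str) -> Dict[str, List[int]]:
    """Group the page ids of out-of-date rows by category name."""
    stale: Dict[str, List[int]] = defaultdict(list)
    for cl_to, cl_from, tag in rows:
        if tag != current_tag:
            stale[cl_to].append(cl_from)
    return dict(stale)


def update_batches(
    stale: Dict[str, List[int]], batch_size: int
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (category, page ids) batches, finishing one category before the next."""
    for category in sorted(stale):
        ids = sorted(stale[category])
        for start in range(0, len(ids), batch_size):
            yield category, ids[start:start + batch_size]


# Made-up data: "old" and "new" stand in for two intl library versions.
rows = [
    ("Cats", 1, "uca-default-old"),
    ("Dogs", 2, "uca-default-new"),
    ("Cats", 3, "uca-default-old"),
    ("Cats", 5, "uca-default-old"),
    ("Birds", 4, "uca-default-old"),
]
stale = stale_by_category(rows, "uca-default-new")
needs_update = bool(stale)  # an update.php-style check would only run the script if True
batches = list(update_batches(stale, batch_size=2))
```

With this data, "Dogs" is left alone, and all of the "Cats" batches are emitted back to back rather than interleaved with other categories.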

Proposed solutions

 * cl_collation field should be changed to be "collation-name<sep>intl version" (where <sep> is a separator that's not likely to be in a collation name; newline, tab, null, ~, random ASCII control characters, etc. are probably all valid choices).
 * LinksUpdate should verify the collation name, and if the collation name is incorrect, it should re-insert the category. (Thus, collation will get fixed upon edits (and null edits), instead of the current situation where you have to remove the category, save, and re-add the category to fix a messed-up collation.)
 * Add an index on (cl_collation, cl_to, cl_type, cl_from) [for updateCollation.php]
 * Previously, it was attempted to use the existing index on (cl_to,cl_type,cl_sortkey,cl_from) in order to update on a category-by-category basis, but this resulted in a filesort of basically the entire categorylinks table (which was unacceptable on large wikis). As an alternative to adding a new index, perhaps the query could be optimized to require strict equality on (cl_to, cl_type) and, instead of paging through the entire categorylinks table, page within those tuples. This would maybe bring the query into an acceptable range performance-wise. (If the wiki has categories with over 500,000 members, which Commons certainly does, I would guess the query is still a bit dicey [unsure]. If it's truly still a problem for giant categories, and we can't add a new index, then we could perhaps use the strict equality for (cl_to, cl_type) pairs when they have fewer than some cut-off number of members [e.g. 20,000], and then switch to the cl_collation index for the big categories. Users are less likely to be disturbed by the big categories being in an inconsistent state, as typically those are more of a tracking nature than a navigational one.)
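
The proposals above could look something like the following, as a minimal sketch in Python rather than MediaWiki's PHP. All of the names are hypothetical, the tab separator is just one of the candidates mentioned above picked for the example, and the version string "54.1" is made up. The sketch builds and parses the versioned cl_collation value, does the LinksUpdate-style check for whether a row needs re-inserting, and picks a paging strategy from a category's size using the 20,000 example cut-off.

```python
from typing import Optional, Tuple

SEPARATOR = "\t"  # assumption: tab, one of the candidate separators


def make_tag(collation: str, intl_version: str) -> str:
    """Build the value that would be stored in cl_collation."""
    if SEPARATOR in collation:
        raise ValueError("collation name must not contain the separator")
    return f"{collation}{SEPARATOR}{intl_version}"


def parse_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a stored value; legacy rows with no version give (name, None)."""
    name, sep, version = tag.partition(SEPARATOR)
    return name, (version if sep else None)


def needs_reinsert(stored_tag: str, collation: str, intl_version: str) -> bool:
    """Re-insert the row if either the collation name or the intl version differs."""
    return stored_tag != make_tag(collation, intl_version)


def paging_strategy(member_count: int, cutoff: int = 20_000) -> str:
    """Small categories: page within (cl_to, cl_type). Big ones: use the cl_collation index."""
    if member_count < cutoff:
        return "per-category"
    return "collation-index"


tag = make_tag("uca-default", "54.1")
legacy_row_is_stale = needs_reinsert("uca-default", "uca-default", "54.1")
```

A side effect of this layout is that rows written before the change (with no version part) never equal the current tag, so they get picked up for re-insertion without any special casing.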

Long term

 * Chinese wikis want multi-collation support, as there is disagreement about what the proper way to alphabetize lists is (see Liangent's zh-collation branch. Unfortunately this never got code review and is probably a bit bit-rotted, but it fully implemented this feature)
 * Some wiktionaries probably want the ability to set a collation on a per-category basis. (e.g. Category:Swedish_nouns gets uca-sv sort order, Category:German_nouns gets uca-de sort order). The approach taken in the zh-collation branch doesn't scale to this use case, but it's certainly possible to make something which would scale (imo)
 * Commons might want the ability to sort images by, say, file size, file type, number of views, etc.
 * more radical ideas like ordering categories based on reader ratings
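
For the per-category case, the lookup itself is simple; the hard parts are where the overrides live and how to keep sort keys in sync with them. A minimal sketch in Python, assuming a hypothetical overrides mapping and a hypothetical wiki-wide default of "uca-default" (neither reflects how MediaWiki or the zh-collation branch actually stores this):

```python
from typing import Dict


def collation_for(category: str, overrides: Dict[str, str], default: str) -> str:
    """Return the category's own collation if it has one, else the wiki default."""
    return overrides.get(category, default)


# Hypothetical per-category settings, using the wiktionary example above.
overrides = {
    "Swedish_nouns": "uca-sv",
    "German_nouns": "uca-de",
}

swedish = collation_for("Swedish_nouns", overrides, "uca-default")
other = collation_for("English_nouns", overrides, "uca-default")
```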

Other things

 * Some people want collations to extend to other lists (e.g. Special:Listusers). This seems like a lot of work in the current architecture. DB collations aren't generally used currently. They might make sense for cases like Special:Listusers (maybe; I have very little knowledge about DB collations)
 * The idea of having categories self-fixing over time, and the idea of having categories in an inconsistent state for as short as possible a time, are kind of contradictory, as the self-fixing will introduce inconsistencies before the script gets to a specific category.
 * Lots of people want to use intl tailoring support, but the PHP bindings don't support it :(