User:TJones (WMF)/Notes/On Generic ICU Folding

From mediawiki.org

December 2016 — See TJones_(WMF)/Notes for other projects. (T132637) For help with the technical jargon used to discuss folding, check out the Language Analysis section of the Search Glossary.

Intro

I wrote this up for T132637, but wanted to include it in my Notes for easy finding later.

The ticket is about generic diacritic folding, and I got asked to weigh in on what to ask from the various language communities. I also got called a language nerd. It was brutal.

My Response

[Lightly edited to remove the need for all the context.]

TL;DR: The question for the communities is, I think, which diacritics should not be folded in your language, and is it fair to assume all the others should be folded on wikis in your language?

People generally have trouble typing text in other languages that uses unfamiliar diacritics, and they expect that searching without those diacritics will still match words that have them.

French speakers usually have no trouble typing French diacritics, but they may have no idea how to type Ancient Greek polytonic diacritics, which speakers of Modern Greek may also have trouble with. Likewise, speakers of Modern English usually don't know how to type ð, þ, æ, or ē, despite them all being used in the first few lines of Beowulf. Hwæt! (You call me a language nerd, now I gotta act like one.)

I think the most general description is that people want to fold most diacritics, except the ones that are relevant to the "host" language of the wiki they are on. On English Wikipedia/Wiktionary/etc, you'd probably want to fold all the Cyrillic and (Modern and Ancient) Greek diacritics, and probably all the Latin ones, too. English does make light use of ´ and ¨ (e.g., resumé and Zoë), and rarely a few others, but many English speakers don't see them as important.

French probably wants to fold mācrōns and hold on to ácúté and gràvè accents, while Hawaiian wants to do the opposite.
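As a rough illustration of "fold everything except what the host language cares about", here is a minimal Python sketch using only the standard `unicodedata` module. The `KEEP_MARKS` table and the `fold_diacritics` helper are hypothetical names made up for this example, and the per-language lists are deliberately incomplete (real French would also keep the circumflex, diaeresis, and cedilla). It only handles diacritics that decompose into combining marks, so letters like ø or ł pass through unchanged. It is not how ICU folding works internally.

```python
import unicodedata

# Hypothetical, incomplete lists of combining marks each "host" language keeps.
KEEP_MARKS = {
    "fr": {"\u0301", "\u0300"},  # combining acute, combining grave
    "haw": {"\u0304"},           # combining macron
}

def fold_diacritics(text, lang):
    """Strip combining marks, except those listed for the host language."""
    keep = KEEP_MARKS.get(lang, set())
    # Decompose so each diacritic becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if unicodedata.combining(ch) == 0 or ch in keep
    ]
    # Recompose whatever marks survived.
    return unicodedata.normalize("NFC", "".join(kept))

sample = "mācrōns ácúté gràvè"
print(fold_diacritics(sample, "fr"))   # macrons ácúté gràvè
print(fold_diacritics(sample, "haw"))  # mācrōns acute grave
print(fold_diacritics(sample, "en"))   # macrons acute grave
```

A language with no entry in the table (English here) falls back to folding every combining mark, which matches the "fold the rest" default described above.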

Generally, precomposed characters in the "host" language are keepers, though there are exceptions: Russian doesn't seem to care about the distinction between ё and е—though Belarusian and Rusyn do! I have no idea what Belarusian speakers searching for Russian words on Belarusian Wiktionary want to do about that, given that the dictionary citation form of a Russian word may use ё (such as чёрная дыра, "black hole"), but usage even in formal and academic sources may not (i.e., a user may have only seen it in print in Russian as черная дыра, as a quick search on Russian Google News shows is at least plausible).

Dealing with precomposed vs decomposed (base letter plus combining mark) versions of the same character is more complex, and probably needs to be handled as it comes up. If I type an e+combining acute accent <é> (U+0065 U+0301) in an article on my test wiki in Vagrant, it gets converted to a single precomposed character <é> (U+00E9). I'm not sure if it's my OS, my browser, or MediaWiki that does the conversion, so I'm not sure how big of a problem it is. We'll see what Phab does when I submit; the preview is keeping them distinct so far. Yep, Phab kept them distinct: éé. They even look slightly different on my screen.
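The difference between the two encodings of é can be checked with Python's standard `unicodedata` module. This is only a sketch of the underlying Unicode behavior (NFC composes, NFD decomposes); it says nothing about which layer did the conversion on the test wiki. The `same_text` helper is a hypothetical name for this example.

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# The two strings render alike but are different code point sequences.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# Normalizing both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```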

There are also non-diacritic foldings that happen, which convert alternate forms of letters to their base forms: Greek word-final ς to standard σ, stylistic ligatures like ﬁ to fi, and fullwidth Latin characters to "normal" halfwidth forms.
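Two of these three foldings fall out of Unicode compatibility normalization (NFKC), while final sigma needs its own rule. The sketch below uses Python's standard `unicodedata` module; `fold_letterforms` is a hypothetical helper for illustration, and NFKC is a stand-in here, not a claim about what ICU folding does under the hood. Note that NFKC also changes other compatibility characters (superscript digits, for example), which may or may not be wanted.

```python
import unicodedata

def fold_letterforms(text):
    """Fold ligatures, fullwidth forms, and Greek final sigma."""
    # NFKC maps compatibility characters to their plain equivalents,
    # e.g. the fi ligature (U+FB01) to "fi" and fullwidth letters to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Final sigma (U+03C2) is not a compatibility character,
    # so it is mapped to standard sigma (U+03C3) explicitly.
    return text.replace("\u03c2", "\u03c3")

print(fold_letterforms("\ufb01sh"))                  # fish
print(fold_letterforms("\uff37\uff49\uff4b\uff49"))  # Wiki
print(fold_letterforms("λογος"))                     # λογοσ
```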

I believe that most wikis would want to implement maximal ICU folding, with exceptions for diacritics (and possibly other folded distinctions) that are relevant in the "host" language of that wiki—and this is what @dcausse has already implemented; we just don't have the exception list for many languages.
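For a sense of what a per-language exception list could look like in configuration, here is a sketch that builds an Elasticsearch-style filter definition as a Python dict. It assumes the analysis-icu plugin's `icu_folding` token filter and its `unicode_set_filter` option, which restricts folding to the given set of characters. The `icu_folding_filter` helper, the filter name `french_folding`, and the French character list are all hypothetical, and this is not necessarily how the existing implementation is configured.

```python
import json

def icu_folding_filter(keep_chars):
    """Build an icu_folding filter definition that skips keep_chars.

    Assumes keep_chars contains only ordinary letters; characters that are
    special in UnicodeSet syntax (such as "[" or "-") would need escaping.
    """
    definition = {"type": "icu_folding"}
    if keep_chars:
        # "[^...]" means: fold everything except the listed characters.
        kept = "".join(sorted(set(keep_chars)))
        definition["unicode_set_filter"] = "[^" + kept + "]"
    return definition

# Illustrative, incomplete exception list for a French-language wiki.
settings = {
    "analysis": {
        "filter": {
            "french_folding": icu_folding_filter("àèéÀÈÉ"),
        }
    }
}
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

With an empty exception list the helper returns a plain `icu_folding` definition, which corresponds to the "maximal folding" default.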

We need to find out those exceptions for each language—preferably by asking people who are familiar with the language, though it can be done by research into the host language's orthography. Additional corner cases will come up and we'll have to figure those out as they happen. We should focus first on languages where people are noting that they have problems, since that's probably where there is the biggest overlap between number of users and the size of the problem.

Note: Unfortunately, I can't find a good list of everything that could/would/should get folded. There are withdrawn drafts of Unicode Consortium technical reports. The Lucene docs have a list of categories, at least:

  • Accent removal
  • Case folding
  • Canonical duplicates folding
  • Dashes folding
  • Diacritic removal (including stroke, hook, descender)
  • Greek letterforms folding
  • Han Radical folding
  • Hebrew Alternates folding
  • Jamo folding
  • Letterforms folding
  • Math symbol folding
  • Multigraph Expansions: All
  • Native digit folding
  • No-break folding
  • Overline folding
  • Positional forms folding
  • Small forms folding
  • Space folding
  • Spacing Accents folding
  • Subscript folding
  • Superscript folding
  • Suzhou Numeral folding
  • Symbol folding
  • Underline folding
  • Vertical forms folding
  • Width folding