User:TJones (WMF)/Notes/Upgrading ASCII Folding to ICU Folding for French and English

From mediawiki.org

September 2016 — See TJones_(WMF)/Notes for other projects. (T146402) For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

After the recent implementation of ASCII folding for French and English, we decided to upgrade to ICU Folding. ASCII folding converts accented and variant forms of ASCII characters into their more basic forms. ICU folding is similar, but works on many more Unicode characters. Some of the “simplifications” of ICU folding are undesirable in particular languages (such as “simplifying” Й to И in Russian—which is a bad idea).
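
The difference between the two kinds of folding can be illustrated with a rough Python sketch. It is not the Lucene ASCII folding or ICU folding filter: it only approximates folding by Unicode decomposition plus removal of combining marks, and the function names are made up for this illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop combining marks.

    A crude stand-in for aggressive, ICU-style folding.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_only_fold(text: str) -> str:
    """Fold a character only when its stripped form is plain ASCII.

    A crude stand-in for ASCII folding: non-Latin characters pass through.
    """
    out = []
    for ch in text:
        folded = strip_diacritics(ch)
        out.append(folded if folded and folded.isascii() else ch)
    return "".join(out)


print(ascii_only_fold("café"))  # accented Latin letter is folded
print(ascii_only_fold("Й"))     # Cyrillic is left alone
print(strip_diacritics("Й"))    # aggressive folding turns Й into И
```

The last line shows the kind of conflation that is harmless on English or French wikis but undesirable for Russian.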

For “foreign” character sets (e.g. Cyrillic or Arabic on English or French Wikipedia), aggressive character folding of Unicode characters isn’t as much of a concern; there are fewer terms to conflate, and the proper use of diacritics and variants is less consistent anyway.

Corpora

To make sure that ICU folding doesn’t do (or fail to do) anything unexpected, I created a corpus of approximately 10K articles for each of French and English. (Each consists of the first 20% of a recent 50K article corpus I created.)

I also extracted individual characters from the larger 50K article corpora, and created character-based corpora with each character as its own term. The 50K article corpora have approximately twice as many distinct characters as the 10K article corpora (5,848 vs 2,864 for English, and 5,057 vs 2,424 for French). These character corpora lose any contextual effects on folding, but they make it possible to isolate the effects on a greater number of individual characters in much less time (less than 3 minutes vs over 400 minutes per language).
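
A character-based corpus of this kind could be built along the following lines. This is only a sketch of the idea, not the script actually used; the function name and the sample text are invented for the example.

```python
from typing import List


def character_corpus(text: str) -> List[str]:
    """Return the distinct non-whitespace characters of `text`, sorted by code point."""
    return sorted({ch for ch in text if not ch.isspace()})


sample = "İstanbul and Moʻiliʻili"
chars = character_corpus(sample)

# One character per line, so the analyzer treats each character as its own term.
corpus = "\n".join(chars)
print(len(chars), "distinct characters")
```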

Character Results

In English, ICU folding affected an additional 189 distinct characters in the 50K article sample. Large classes of characters include:

  • Number variants: e.g., all of ۱, १, ১, ੧, ๑, ၁, 1 are folded to 1.
  • Polytonic Greek: Ά, ά, ἀ, ἁ, ἄ, ἅ, Ἀ, Ἄ, Ἅ, ὰ, ᾶ are all folded to α.
  • Cyrillic: Ѓ, Ғ to г; ё, Ё to е; й, ѝ, Й to и; etc.
  • Arabic: ئ, ى, ی to ي; etc.
  • Japanese: removed dakuten and handakuten (guides to pronunciation), e.g., べ and ぺ to へ.
  • Phonetic symbols: ʀ, ʁ to r; etc.
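
The number-variant class in the list above can be mimicked with Python's standard `unicodedata` module, which knows the decimal value of digits in many scripts. This is an illustration of the effect only, not how ICU folding is implemented; `fold_digits` is a name invented here.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)


print(fold_digits("১০০"))  # Bengali digits become 100
print(fold_digits("abc"))  # non-digits are unchanged
```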

In French, ICU folding affected an additional 191 distinct characters in the 50K article sample. Large classes of characters include a list very similar to that for English above.

Text Results

For the 10K article English corpus, out of 3,044,105 tokens and 139,220 types, 173 distinct types (0.124% of types), accounting for 282 tokens (0.009% of tokens), were folded together in new collisions.

  • The most common collisions are phonetic pronunciation information being folded into the original word (showing how many spelling systems make so much more sense than English!). Examples: ˈaltdɔʁf and ˈɑːltˌdɔːrf folded to Altdorf; sɪtəˈɡlɪptɪn folded to Sitagliptin; nákʰɔ̄ːn folded to Nakhon.
  • Similarly, original and variant native spellings folded to the more typical English spellings for some names: e.g., Moʻiliʻili and Mōʻiliʻili to Moiliili; İstanbul to Istanbul.
  • Number variants are folded to Arabic numerals: e.g., ১০০ to 100.
  • Hebrew/Yiddish Dagesh is removed: בּונד folds to בונד‎.
  • Russian stress accents are folded away: Ива́нович to Иванович. Other Cyrillic “diacritics” are simplified as above.
  • Arabic and Japanese forms are simplified as above.
  • Short words have more potentially undesirable collisions, such as ˈpɐ̃ɰ̃ to pam; alɛ̃ to ale; ˈkarlos to Karlo.
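
One way to count "new collisions" like those above is to group types by their folded form and keep only the groups that merge types which were previously distinct. The sketch below assumes that definition, which may differ from the one used for the reported figures, and uses lowercasing and diacritic stripping as hypothetical stand-ins for the old and new analysis chains.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def old_fold(token):
    # Stand-in for the pre-upgrade analysis: lowercase only.
    return token.lower()


def new_fold(token):
    # Stand-in for the post-upgrade analysis: lowercase, then strip diacritics.
    return strip_diacritics(token.lower())


def new_collisions(types, before, after):
    """Return the types that share a group after folding but did not all share one before."""
    merged = defaultdict(set)
    for t in types:
        merged[after(t)].add(t)
    affected = set()
    for members in merged.values():
        if len({before(t) for t in members}) > 1:
            affected.update(members)
    return affected


# "Ива\u0301нович" is Иванович with a combining acute stress accent.
types = ["Ива\u0301нович", "Иванович", "Altdorf", "altdorf"]
collided = new_collisions(types, old_fold, new_fold)
print(sorted(collided))

# The English percentages quoted above, recomputed from the raw counts.
print(round(100 * 173 / 139220, 3))   # 0.124
print(round(100 * 282 / 3044105, 3))  # 0.009
```

Altdorf and altdorf already collided under lowercasing, so they are not counted as new.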

For the 10K article French corpus, out of 2,761,905 tokens and 158,511 types, 88 distinct types (0.056% of types), accounting for 195 tokens (0.007% of tokens), were folded together in new collisions.

  • Lots of phonetic pronunciations folded into the source word, as above.
  • Greek, Arabic, and Russian as above.
  • Short words have more possibilities for collisions, and there are more “short” words in French because of the stemming and letter deduplication: ʿAlāʾ folds with Alâ, àla, ALA; ʽÉvèr with eV, Ev, EV; ʿĀbbāy with Abay and Abbaye, etc.
  • Parts of equations are being condensed to “words”: n·p to np, J·K to jk, J·m to jm.
  • Modifier letter apostrophes (ʼ, distinct from the right single quotation mark ’) are being stripped, which causes dʼen to fold with den; dʼun and dʼune to fold with dun; and Lʼidée to fold with lid. However, these were not being treated correctly before (stemming as d’en, d’un, and lʼid), as the french_elision filter doesn’t seem to know what to do with them.

Conclusions

As expected, ICU folding affects many non-Latin characters, but also some Latin characters (Hawaiian ō, Turkish İ, phonetic characters) that ASCII-folding misses. These distinctions are important in the source orthographies, but much less so in English and French.

Overall, the impact is small, but a net positive.

I’d suggest adding a char filter to map modifier letter apostrophes (ʼ U+02BC) to straight apostrophes or curly right single quotes (' U+0027 or ’ U+2019), though that would require additional testing. Out of ~10K articles, this affects 5 tokens (1 each of 5 types), so it’s not a big problem.
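
As a rough illustration of what such a char filter would do, the sketch below applies the mapping in Python and also spells out what an Elasticsearch `mapping` character filter definition for it might look like. The filter name `modifier_apostrophe_map` is hypothetical, the choice of U+0027 as the target is an assumption, and none of this has been tested against the actual analysis chain.

```python
import json

MODIFIER_APOSTROPHE = "\u02bc"  # ʼ modifier letter apostrophe
APOSTROPHE_MAP = str.maketrans({MODIFIER_APOSTROPHE: "'"})


def normalize_apostrophes(text: str) -> str:
    """Map U+02BC to U+0027 before tokenization; leave everything else alone."""
    return text.translate(APOSTROPHE_MAP)


# Hypothetical settings fragment for an Elasticsearch "mapping" char filter.
char_filter_settings = {
    "char_filter": {
        "modifier_apostrophe_map": {  # name invented for this example
            "type": "mapping",
            "mappings": ["\u02bc=>'"],
        }
    }
}

print(normalize_apostrophes("d\u02bcen"))
print(json.dumps(char_filter_settings, ensure_ascii=False, indent=2))
```

With the apostrophe normalized first, an elision filter that already recognizes d' and l' would have a chance to strip the article before folding.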