User:TJones (WMF)/Notes/Language-Specific Lowercasing and ICU Normalization

March 2019 — See TJones_(WMF)/Notes for other projects. See also T217602. For help with the technical jargon used in Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
While looking into unpacking the Greek analysis chain to add a filter for zero-length tokens (see also T203117), I ran into the fact that the Greek "lowercase" filter does more than lowercasing—it also converts final sigma (ς) to regular sigma (σ) and, very importantly, removes some very common Greek diacritics (particularly tonos, but also dialytika) that are not removed by ICU normalization (which usually replaces the lowercase filter in our analysis chains).
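
To make the three behaviors concrete, here is a minimal Python sketch that approximates them. It is not the filter's actual implementation: the function name is made up for illustration, and I'm assuming that stripping the combining acute accent (tonos) and combining diaeresis (dialytika) after canonical decomposition is a fair stand-in for what the filter does.

```python
import unicodedata

# Combining marks produced by canonical decomposition of Greek accented letters.
TONOS = "\u0301"      # COMBINING ACUTE ACCENT
DIALYTIKA = "\u0308"  # COMBINING DIAERESIS


def greek_lowercase_sketch(text: str) -> str:
    """Approximate the Greek-specific lowercasing described above (hypothetical helper)."""
    lowered = text.lower()
    # Decompose so each accented letter becomes base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if ch not in (TONOS, DIALYTIKA))
    # Recompose whatever is left, then turn final sigma into regular sigma.
    return unicodedata.normalize("NFC", stripped).replace("\u03c2", "\u03c3")


for word in ("Ελλάς", "ΟΔΟΣ", "προϊόν"):
    print(word, "->", greek_lowercase_sketch(word))
```

Plain lowercasing alone would leave the tonos, the dialytika, and the final sigma in place, which is why swapping this filter out loses more than case handling.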

There are also special lowercase settings for Irish and Turkish. Turkish I knew about, since it has I/ı and İ/i as different letters, and you want to do the right thing and not lowercase I to i or, worse, İ to i̇ (an i with an extra dot). The Irish situation was news to me: Irish has t-prothesis and n-eclipsis, which can add t- or n- to the beginning of a word that starts with a vowel. The added t- or n- is still lowercase, and the original initial vowel can still be uppercase, as in the Irish Wikipedia titles “An tAigéan Ciúin” and “Ceol na nOileán”. So the lowercase of Nathair is nathair, but the lowercase of nAthair is n-athair (which gets stemmed to just athair, which is what you want).
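
The two rules can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the actual filter code: the function names are hypothetical, and the Irish rule (a lowercase n or t directly followed by an uppercase vowel) is my assumption about the simplest condition that covers the examples.

```python
IRISH_UPPER_VOWELS = set("AEIOU\u00c1\u00c9\u00cd\u00d3\u00da")  # AEIOU plus ÁÉÍÓÚ


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: I -> ı (U+0131) and İ (U+0130) -> i."""
    # Handle the two special letters first, then lowercase everything else.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def irish_lower(word: str) -> str:
    """Lowercase one word, hyphenating a lowercase n/t prefix before an uppercase vowel."""
    if len(word) > 1 and word[0] in "nt" and word[1] in IRISH_UPPER_VOWELS:
        word = word[0] + "-" + word[1:]
    return word.lower()


for w in ("I\u015f\u0131kl\u0131", "\u0130kinci", "UYARI"):  # Işıklı, İkinci, UYARI
    print(w, "->", turkish_lower(w))

for w in ("Nathair", "nAthair", "tAig\u00e9an", "nOile\u00e1n"):
    print(w, "->", irish_lower(w))
```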

Rather than hack together a kludge to only address the Greek "text" analysis chain, I wanted to properly address language-specific lowercasing in general. Everywhere that we replace "lowercase" with "icu_normalization" we are losing out on this language-specific normalization. We already created a partial workaround for Greek by enabling ICU Folding for the plain field, but the language-specific "lowercase" normalization should actually happen in many other fields, too.

Data and a Plain Analysis Plan
I grabbed 10K articles/entries each from Turkish Wikipedia and Turkish Wiktionary to test the Turkish changes, and 10K Irish Wikipedia articles and 1K Irish Wiktionary entries to test Irish (Irish Wiktionary is still very small).

I wasn’t planning on unpacking the Turkish or Irish analyzers in the text field just yet. Like all monolithic analyzers, they use the lowercase filter internally (in this case, the language-specific version), so they should be doing the right thing.

Rather, I want to test the effect of restoring the language-specific “lowercasing” in all the places it has been removed by the automatic upgrade to ICU normalization. So I’m testing on the plain field, which doesn’t do anything other than apply ICU normalization and “word_break_helper” (which converts underscores and other characters to spaces to break up tokens like “word_break_helper” into their constituent parts).
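
For orientation, the kind of chain being tested might look roughly like the settings below, written as a Python dict in the shape Elasticsearch expects for index analysis settings. This is a sketch under assumptions, not the real configuration: the helper name, the filter names, and the analyzer name are hypothetical, and the underscore-only mapping is a simplified stand-in for “word_break_helper”. The `lowercase` filter's `language` option and the `icu_normalizer` filter (from the ICU analysis plugin) are standard Elasticsearch features.

```python
def plain_analysis_settings(language: str) -> dict:
    """Build hypothetical analysis settings for a plain field (illustration only)."""
    # Elasticsearch's lowercase token filter has language variants for these three.
    if language not in {"greek", "irish", "turkish"}:
        raise ValueError(f"no language-specific lowercase for {language!r}")
    lowercase_name = f"{language}_lowercase"  # hypothetical filter name
    return {
        "analysis": {
            "char_filter": {
                # Simplified: the real helper maps more than just underscores.
                "word_break_helper": {"type": "mapping", "mappings": ["_=>\\u0020"]},
            },
            "filter": {
                lowercase_name: {"type": "lowercase", "language": language},
                "icu_normalizer": {"type": "icu_normalizer", "name": "nfkc_cf"},
            },
            "analyzer": {
                "plain": {
                    "type": "custom",
                    "char_filter": ["word_break_helper"],
                    "tokenizer": "standard",
                    # Language-specific lowercasing runs first, then ICU normalization.
                    "filter": [lowercase_name, "icu_normalizer"],
                },
            },
        }
    }


print(plain_analysis_settings("turkish")["analysis"]["analyzer"]["plain"]["filter"])
```

The ordering is the point: once the language-specific filter has lowercased I to ı, ICU normalization has nothing left to fold incorrectly.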

Lowercase + ICU Normalization in Turkish
In Turkish, the overall impact in terms of collisions and splits was small but non-trivial.

For the Wikipedia data set:


 * New collisions: 1256 pre-analysis types (0.496% of pre-analysis types) / 16984 tokens (0.868% of tokens) were added to 1224 groups (0.535% of post-analysis types), affecting a total of 2699 pre-analysis types (1.066% of pre-analysis types) in those groups.
 * New splits: 468 pre-analysis types (0.185% of pre-analysis types) / 5282 tokens (0.270% of tokens) were lost from 444 groups (0.194% of post-analysis types), affecting a total of 1006 pre-analysis types (0.397% of pre-analysis types) in those groups.

For the Wiktionary data set the effect was even smaller:


 * New collisions: 89 pre-analysis types (0.224% of pre-analysis types) / 192 tokens (0.138% of tokens) were added to 89 groups (0.237% of post-analysis types), affecting a total of 181 pre-analysis types (0.456% of pre-analysis types) in those groups.
 * New splits: 36 pre-analysis types (0.091% of pre-analysis types) / 70 tokens (0.050% of tokens) were lost from 35 groups (0.093% of post-analysis types), affecting a total of 71 pre-analysis types (0.179% of pre-analysis types) in those groups.

Patterns in the changes that I noticed:


 * Most of the new collisions are clearly good—they are upper/lowercase variants of the same word, like İkinci/ikinci or Işıklı/ışıklı or UYARI/uyarı.
 * There aren’t any double-dot i’s indexed any more, as in i̇nşaat, which picked up an extra dot over the i when İnşaat was ICU normalized.
 * Many splits involve non-Turkish words (notably English, and some German and French) whose uppercase and lowercase versions no longer match because of the I/i split: Ich, Internet, MID, içi (though içi is also a Turkish word, and now stems the same as İçi, as it should).
 * Roman numerals are affected. VIII is indexed as “vııı” while viii is indexed as “viii”. (On the other hand, special Unicode Roman numeral characters, like Ⅲ, are indexed as i’s with dots, as “iii”, thanks to the ICU normalization. In the text field, they just get lowercased, as “ⅲ”, which is a single character.)
 * A handful of mixed-script words are affected. Some of the Caucasian languages use Latin uppercase I in place of the palochka (Ӏ) and this gets lowercased as i (previously) or ı (now).
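
The Roman numeral behavior is easy to reproduce in a small Python sketch. I'm using NFKC normalization plus `str.casefold()` as a rough stand-in for ICU normalization; the real ICU normalizer is not identical to this, and both function names are hypothetical.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: I -> ı (U+0131) and İ (U+0130) -> i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def turkish_plain_sketch(token: str) -> str:
    """Turkish lowercasing, then NFKC + case folding (approximating ICU normalization)."""
    return unicodedata.normalize("NFKC", turkish_lower(token)).casefold()


# ASCII Roman numerals follow the Turkish I/ı rule.
print(turkish_plain_sketch("VIII"))
print(turkish_plain_sketch("viii"))

# U+2162 (ROMAN NUMERAL THREE) lowercases to the single character U+2172,
# which compatibility normalization then expands to three dotted i's.
print(turkish_plain_sketch("\u2162"))
print(len("\u2162".lower()))
```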

Overall, this looks like an improvement. There are going to be inconsistencies because non-Turkish words don’t follow Turkish rules, but the discrepancy will now be in favor of doing things the Turkish way on Turkish-language wikis.

Lowercase + ICU Normalization in Irish
In Irish, the overall impact in terms of collisions and splits was even smaller.

For the Wikipedia data set:


 * New collisions: 1 pre-analysis types (0.001% of pre-analysis types) / 1 tokens (0.000% of tokens) were added to 1 groups (0.001% of post-analysis types), affecting a total of 3 pre-analysis types (0.003% of pre-analysis types) in those groups.
 * New splits: 15 pre-analysis types (0.015% of pre-analysis types) / 67 tokens (0.006% of tokens) were lost from 15 groups (0.016% of post-analysis types), affecting a total of 37 pre-analysis types (0.036% of pre-analysis types) in those groups.

The Wiktionary data set had no new collisions or splits, but the sample is only 1K entries (the entire Irish Wiktionary has fewer than 3000 entries).

Patterns in the changes that I noticed:


 * The Wikipedia splits all seem good. nathair (“snake”) is no longer stemmed with nAthair, a form of athair (“father”).
 * The one collision is the result of an inadvertent fix to a shortcoming in ICU normalization. ICU normalization lowercases dotted I (İ) as a lowercase i with an extra dot (i̇). The normal lowercase filter does this correctly; I routinely have to add a filter to fix this when we unpack monolithic analyzers and switch from lowercase to ICU normalization. Anyway, long story short, İnsan is now grouped with Insan in the plain field.
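
The extra-dot problem shows up in Python's full case folding too, so it can be demonstrated without ICU. In this sketch, NFKC plus `str.casefold()` stands in for ICU normalization, and mapping İ to plain I beforehand is one hypothetical way to fix it; I'm not claiming it is the filter used in production.

```python
import unicodedata

DOTTED_CAPITAL_I = "\u0130"     # İ
COMBINING_DOT_ABOVE = "\u0307"  # the "extra dot"


def fold_sketch(token: str) -> str:
    """NFKC + case folding, a rough stand-in for ICU normalization."""
    return unicodedata.normalize("NFKC", token).casefold()


def fold_with_dotted_i_fix(token: str) -> str:
    """Hypothetical fix: map İ to plain I before folding."""
    return fold_sketch(token.replace(DOTTED_CAPITAL_I, "I"))


dotted = fold_sketch("\u0130nsan")   # İnsan
plain = fold_sketch("Insan")
print(COMBINING_DOT_ABOVE in dotted, dotted == plain)    # extra dot keeps them apart
print(fold_with_dotted_i_fix("\u0130nsan") == plain)     # with the fix they group together
```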

One possible sub-optimality remains: the standard tokenizer splits on dashes, so n-athair and n-Athair (and also n athair, though that isn’t a natural phrase) are tokenized as two tokens: n + athair. But nAthair is tokenized as one token: n-athair, because the dash is introduced after tokenization. This is definitely good in the sense that nAthair won’t match nathair (tokenized as nathair). It’s arguable whether searching for “nAthair” (with quotes) should or should not match “n-athair” (with quotes).
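
The ordering effect can be sketched as follows. The regex tokenizer is only a crude approximation of the standard tokenizer (it splits on anything that isn't a word character), and the function names and the Irish rule are the same hypothetical simplifications as before.

```python
import re

IRISH_UPPER_VOWELS = set("AEIOU\u00c1\u00c9\u00cd\u00d3\u00da")  # AEIOU plus ÁÉÍÓÚ


def irish_lower(word):
    """Lowercase one token, hyphenating a lowercase n/t prefix before an uppercase vowel."""
    if len(word) > 1 and word[0] in "nt" and word[1] in IRISH_UPPER_VOWELS:
        word = word[0] + "-" + word[1:]
    return word.lower()


def analyze_sketch(text):
    """Tokenize first (splitting on dashes and spaces), then lowercase each token."""
    tokens = re.findall(r"\w+", text)
    return [irish_lower(token) for token in tokens]


for text in ("n-athair", "n-Athair", "n athair", "nAthair", "nathair"):
    print(repr(text), "->", analyze_sketch(text))
```

Because the hyphen in n-athair only appears after tokenization, nAthair stays a single token, while the hyphenated and spaced spellings in the input are split into two.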

Overall, this seems like a good step in the right direction for lowercasing.

Next Steps

 * Commit the changes. (IN PROGRESS)
 * Reindex Turkish and Irish-language wikis. (SOON)