User:TJones (WMF)/Notes/Hiragana to Katakana Mapping for English and Japanese

November 2017 — See TJones_(WMF)/Notes for other projects. See also T176197 (and earlier T173650)

Background
After a discussion in T173650 about apparent inconsistencies in search behavior for Japanese words on English Wiktionary, there was a sense that while there wasn't a bug, there was a potential lack of a feature, namely, the ability to search for Japanese words in either hiragana or katakana and find the other.

The consensus was that this would be useful on English projects, but there wasn't a consensus for Japanese projects, though it seems plausible that it could be useful.

So, my goal here is to assess the impact of adding the hiragana-to-katakana (H2K) mapping on English and Japanese Wikipedias and Wiktionaries. An unexpected complication is that applying the mapping to Japanese language projects required unpacking the CJK ("Chinese, Japanese, Korean") analyzer, which has some additional knock-on effects.
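The H2K mapping itself is mechanical: in Unicode, the main hiragana block (U+3041–U+3096) is offset from the corresponding katakana block (U+30A1–U+30F6) by 0x60 code points. A minimal sketch of such a mapping (this is just an illustration of the idea, not the actual CirrusSearch character filter):

```python
# Minimal sketch of a hiragana-to-katakana (H2K) mapping.
# Hiragana U+3041-U+3096 maps to katakana U+30A1-U+30F6,
# a fixed offset of 0x60 code points.
H2K_OFFSET = 0x60

def hiragana_to_katakana(text: str) -> str:
    """Map each hiragana character to its katakana counterpart;
    leave everything else (katakana, Latin, etc.) untouched."""
    return "".join(
        chr(ord(ch) + H2K_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

print(hiragana_to_katakana("おおかみ"))  # オオカミ ("wolf")
```

With this mapping in place at the character-filter stage, hiragana and katakana spellings of the same word index identically.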

Both the English and Japanese/CJK analysis chains use the Elasticsearch "standard" tokenizer; however, the CJK analysis chain re-works all CJK characters into bigrams. Japanese is also configured for all ICU "upgrades", which include upgrading the "standard" tokenizer to the "icu_tokenizer" and the "lowercase" filter to the "icu_normalizer" filter. This doesn't affect the monolithic CJK analyzer used in the text field—which still uses the standard tokenizer and lowercase filter—but it does affect all the other analyzers, including those for the near_match, suggest, and plain fields.

English is Easy
The analysis of English is relatively easy. The analyzer is already unpacked, and the incidence of hiragana and katakana characters is very low in both Wikipedia and Wiktionary. I analyzed the text of 10K random Wikipedia articles and 10K random Wiktionary entries.

I Want a New Bug: The Standard Tokenizer vs Hiragana
I've found something I'll call... an "inconsistency". Looking at the English analysis examples, the standard tokenizer breaks up hiragana character-by-character, while katakana is not broken up that way. As a result, the current English config tokenizes katakana オオカミ ("wolf") as オオカミ, while the hiragana equivalent, おおかみ gets divided into separate characters お | お | か | み.

Since this is the standard tokenizer used by most analyzers, this is what happens to hiragana almost everywhere, except for Japanese and Korean (which use CJK) and a handful of languages that use the icu_tokenizer.

The CJK analyzer creates overlapping bigrams for both katakana and hiragana:
 * オオカミ → オオ | オカ | カミ
 * おおかみ → おお | おか | かみ
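The overlapping bigrams above can be sketched as follows. This is a simplified stand-in for the CJK bigram token filter, which in reality also handles script boundaries, mixed input, and optional unigram output:

```python
def cjk_bigrams(text: str) -> list[str]:
    """Produce overlapping bigrams from a run of CJK characters,
    as the CJK bigram filter does (simplified: single-script input,
    no unigram or boundary handling)."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("オオカミ"))  # ['オオ', 'オカ', 'カミ']
print(cjk_bigrams("おおかみ"))  # ['おお', 'おか', 'かみ']
```

Because both scripts are bigrammed the same way, the H2K mapping changes the content of the bigrams but not their count.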

The icu_tokenizer seems to follow the Unicode Segmentation Algorithm correctly—as the standard tokenizer is supposed to do—and breaks them up like this:


 * オオカミ → オオカミ
 * おおかみ → お | おかみ

I don't quite understand why おおかみ is split like that, but it is what the Unicode Segmentation demo does, too, so it is at least following the rules.

Enabling Kana Mapping in English
I enabled the H2K kana mapping for English and analyzed the effects on the 10K documents from Wikipedia and Wiktionary.
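One way to express an H2K character filter in Elasticsearch index settings is via the analysis-icu plugin's icu_transform char filter with the ICU "Hiragana-Katakana" transform. This is an assumption for illustration—the "kana_map" name and overall shape are mine, and the actual CirrusSearch configuration is generated programmatically and may define the filter differently:

```python
# Sketch of Elasticsearch index settings enabling an H2K character
# filter. "kana_map" is a hypothetical name; the icu_transform char
# filter and the "Hiragana-Katakana" transliterator ID come from the
# analysis-icu plugin and ICU respectively.
settings = {
    "analysis": {
        "char_filter": {
            "kana_map": {
                "type": "icu_transform",
                "id": "Hiragana-Katakana",  # ICU transliterator ID
            }
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["kana_map"],  # runs before tokenization
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}
```

Since char filters run before the tokenizer, hiragana is already katakana by the time the standard tokenizer sees it—which is why it stops being split character by character.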

There were slightly fewer tokens (fewer than 300 out of more than 3 million for Wikipedia, and around 800 out of more than 160K for Wiktionary), because hiragana mapped to katakana is no longer split up character by character.

Overall there was a very small effect size in Wikipedia (0.002% of tokens affected) & Wiktionary (0.009% of tokens affected). An unexpected effect of the standard tokenizer breaking up hiragana is that individual words converted to katakana are preserved. This should improve precision, but may decrease recall.

This is pretty much as expected, other than hiragana being split character by character.

Implementation in English, and Others
I suggest implementing the mapping for English language projects. The question is whether we should implement it for other non-CJK languages using the standard tokenizer.

It would be straightforward to implement it for English, French, Italian, Russian, and Swedish, because those analysis chains use the standard tokenizer and have already been unpacked.

It would probably also be straightforward to add it to Chinese and Hebrew, which have their own analyzers. Chinese breaks up both katakana and hiragana character-by-character, so the mapping might increase unwanted recall a lot. Hebrew keeps both as single words, so the mapping would be properly limited and targeted.

Japanese is Complicated
For Japanese, I pulled 5K Wikipedia articles and 5K Wiktionary entries—only half of the 10K for English on the assumption that all Japanese characters would be considerably more common on Japanese-language wikis.

Unpacking the CJK analyzer
In order to enable an additional character filter (the H2K kana map), we have to unpack the CJK analyzer. I also had to disable the automatic upgrades from the `standard` tokenizer to the `icu_tokenizer` and the `lowercase` filter to the `icu_normalizer` to test unpacking by itself.
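Per the Elasticsearch documentation, the monolithic cjk analyzer is equivalent to a custom analyzer built from the standard tokenizer plus the cjk_width, lowercase, cjk_bigram, and English stop filters. A rough sketch of the unpacked version (names are illustrative; CirrusSearch builds this configuration programmatically):

```python
# Sketch of the cjk analyzer unpacked into its documented parts.
# The "_english_" stopword list used by the unpacked stop filter is
# not byte-for-byte identical to the list baked into the monolithic
# Lucene CJKAnalyzer, which likely explains the small stopword
# differences noted in this write-up.
unpacked_cjk = {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["cjk_width", "lowercase", "cjk_bigram", "english_stop"],
}
english_stop = {"type": "stop", "stopwords": "_english_"}
```

Once unpacked, an additional char_filter entry (the H2K kana map) can be slotted in ahead of the tokenizer.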

Unpacking the CJK analyzer had some unexpected consequences, as expected.


 * The CJK analyzer uses English stopwords, but somehow the lists are not identical when unpacked:
  * an is treated as a stopword when unpacked, but not by the monolithic analyzer.
  * s, t, and www are treated as stopwords by the monolithic analyzer, but not when unpacked.

These differences in stop words had a small effect on the number of tokens in the Wikipedia and Wiktionary texts.

Enabling Kana Mapping
Enabling the kana mapping on the ICU-disabled unpacked CJK analyzer did exactly as expected. There were no differences in token counts, since all CJK text is tokenized as bigrams by the CJK analyzer. Most of the token differences are bigrams, with a handful of unigrams. 10.296% of Wiktionary tokens and 28.283% of Wikipedia tokens were affected, which is a lot!

Enabling ICU Mapping
Allowing the normal configuration process to replace the `standard` tokenizer with the `icu_tokenizer` and the `lowercase` filter with the `icu_normalizer` on the unpacked CJK analyzer had a huge effect on both tokenization and normalization.

Wikipedia had about 6800 more tokens out of ~5.8M, and Wiktionary had about 900 more out of ~210K.

There are a very small number of new collisions (tokens that are mapped to the same indexed form): 0.041% for Wikipedia, 0.042% for Wiktionary.

I Want a New Bug II—Tokenization Boogaloo
Maintaining Context after Whitespace

Some of the changes in tokenization are surprising, and it appears that the ICU tokenizer also has some trouble following the Unicode Segmentation Algorithm.

It appears that the tokenization rules change following a non-space character at code point U+0370 (the start of the Greek block) or above, but revert to the expected rules following a character that is, or normalizes to, a non-empty, non-space character below U+0370.

As a result, x 14th tokenizes as x | 14th, while ァ 14th tokenizes as ァ | 14 | th. Similarly:
 * x _x → x | _x, but ァ _x → ァ | x
 * x __x → x | __x, but ァ __x → ァ | __ | x

This seems like an error to me, since anything after a space should not be affected by what comes before the space, and the Unicode Segmentation demo from earlier agrees.

I also double checked that this is strictly caused by the ICU tokenizer by switching briefly to Tibetan, which uses the ICU normalizer + ICU tokenizer. It also has this problem, as do Min Dong, Cree, Dzongkha, Gan, Hakka, Khmer, Lao, Burmese, Wu, Classical Chinese, Min Nan, Cantonese, which were all switched to the ICU normalizer + ICU tokenizer. (See Phab T147512 & T149717.)

Spaces and Dakuten & Handakuten

There's also some weird behavior going on with dakuten (゛) and handakuten (゜), which are diacritics used to indicate voicing changes. Both have "regular" forms and "combining" forms (which combine with the preceding character).

Depending on the application, when not adjacent to certain katakana or hiragana characters, the combining forms are not rendered, come across as tofu, or do something else infelicitous. Dakuten by itself: (゙), or with ヘ: (ベ). Handakuten by itself: (゚), or with ヘ: (ペ).

Below I'm going to use <゛> and <゜> for the combining forms so we can see them; note that some tokens below also begin with a space that is hard to see.

The non-combining dakuten and handakuten forms are generating tokens with spaces as a result of the interaction between the ICU normalization and CJK Bigram filter.

The ICU normalization converts the regular forms into combining forms and adds a space before them; the bigram filter then breaks those into tokens containing spaces (because it assumes earlier tokenization has already split on any spaces).


 * 魂゜ → 魂<゜> → 魂 | <゜>
 * ヘ゛ → ヘ<゛> → ヘ | <゛>
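This behavior can be reproduced with plain NFKC normalization (which the icu_normalizer approximates): the non-combining marks U+309B and U+309C have compatibility decompositions to a space plus the combining form.

```python
import unicodedata

# U+309B (non-combining dakuten) has the NFKC compatibility
# decomposition <space, U+3099 combining dakuten>; likewise
# U+309C decomposes to <space, U+309A combining handakuten>.
print(unicodedata.normalize("NFKC", "\u309b") == " \u3099")  # True
print(unicodedata.normalize("NFKC", "\u309c") == " \u309a")  # True

# So 魂゜ normalizes to 魂 + space + combining handakuten, and a
# downstream bigram filter that assumes space-free input then emits
# a token containing that space.
print(unicodedata.normalize("NFKC", "魂\u309c") == "魂 \u309a")  # True
```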

Expected Changes
There are plenty of the expected ICU normalization/tokenization changes. The normalization changes are a "light" version of full-blown ICU folding.


 * Mixed-script words are broken up, e.g., Cамооборона → c | амооборона (that initial C is Latin!), eπ → e | π, βII → β | II, ØωØver → ø | ω | øver, etc.
 * The usual regularizations:
  * Greek final-ς is converted to σ
  * ß becomes ss
  * single-character Roman numerals (ⅰ, ⅱ, ⅲ, ⅳ) are converted to plain Latin letters (i, ii, iii, iv)
  * ㍉ (single-character milli- prefix) → ミリ
  * ㄹ (U+3139 "hangul letter rieul") → ᄅ ("hangul choseong rieul" U+1105)
 * IPA strings are broken up (in this case because a Greek letter is used mid-word): lizˈβoɐ → liz | β | oɐ; or normalized: pʰɜːkɪnz → phɜːkɪnz
 * other non-Latin scripts (Khmer, Thai) are broken up, apparently according to the Unicode Segmentation Algorithm
 * bi-directional characters, non-breaking spaces, etc. are stripped
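Several of the regularizations above are plain NFKC compatibility mappings, checkable with Python's stdlib unicodedata (the icu_normalizer also does case folding and more—e.g., ß → ss and final-ς → σ are not NFKC mappings):

```python
import unicodedata

def nfkc(s: str) -> str:
    """NFKC-normalize a string."""
    return unicodedata.normalize("NFKC", s)

print(nfkc("\u2173"))  # U+2173 SMALL ROMAN NUMERAL FOUR → iv
print(nfkc("\u3349"))  # U+3349 SQUARE MIRI (milli- prefix) → ミリ
print(nfkc("\u3139") == "\u1105")  # hangul letter rieul → choseong rieul: True
```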

ICU mapping and Kana mapping
Enabling the kana mapping after enabling the ICU tokenization seems to have had the expected effect of H2K kana mapping and not much else.

Recommendations & Next Steps
Here's the short version of my recommendations:


 * Enable H2K mapping for English, as requested.
 * Do not enable H2K mapping for Japanese—it wasn't part of the original request, the community was ambivalent, enabling it exposes a number of tokenization bugs, and it has a very large impact that may or may not be good.
 * For other languages: post to some of the Wikipedia and Wiktionary Village Pumps for (in search volume order) French, Russian, Italian, Swedish, Chinese, and Hebrew. If there is some enthusiasm for it, add it and test it as needed. I think I'll start with four posts to French and Russian WP/Wikt VPs.
 * File upstream bugs for tokenization problems.

A more detailed look at the issues, trade-offs, and next steps is below (✓ matches recommendations above):


 * ✓ English: Looks good!
  * Enable the mapping.
  * Re-index English-language projects.


 * Japanese: Decide what to do next.
  * It's not clear that the Japanese-language community particularly wants this, and enabling it may cause more problems than it solves.
  * We could enable it and accept the side effects of the weird tokenization. This would require testing on RelForge before deployment.
  * We could enable it but find some way to prevent the ICU tokenizer and ICU normalizer from being enabled for the `text` field. This would complicate the configuration even further.
  * ✓ We could stick with the status quo for Japanese-language projects for now, until we get a request from the community, or the technical problems have been solved.


 * Tokenization problems
  * ✓ File upstream bugs for the following:
   * standard tokenizer: should follow the Unicode Segmentation Algorithm for hiragana
   * ICU tokenizer: pre-whitespace context should not matter after whitespace
   * ICU normalizer: should not add a space before normalized non-combining dakuten and handakuten


 * Other languages
  * Decide whether the mapping is useful enough to enable for French (✓), Russian (✓), Italian, Swedish, Chinese, and Hebrew (in search volume order).
  * If so:
   * Enable it for the right ones.
   * Re-index all of the relevant wikis.
   * Make it a general policy to enable the H2K mapping for non-CJK language analyzers that are unpacked in the future (e.g., German is likely to be unpacked and customized in the near future).
  * Decide whether the mapping is useful enough to warrant unpacking and customizing other analyzers.
   * Pro: We'd have useful H2K kana mapping almost everywhere.
   * Pro: This would also allow us to add other useful custom code everywhere and improve consistency (like better handling of zero-width non-breaking spaces, soft hyphens, and left-to-right/right-to-left markers).
   * Con: It would take a while to test properly.
   * Con: That's a lot of re-indexing.
   * Con: It's possible that we'd lose out on new functionality added to the monolithic analyzers in the future until we changed our unpacked version to match.
   * Semi-Pro: OTOH, those changes wouldn't suddenly appear because we upgraded Elasticsearch; not all improvements turn out to be good.
   * Semi-Con: Some third-party analyzers (e.g., Ukrainian) can't be unpacked.