User:TJones (WMF)/Notes/Kuromoji Analyzer Analysis

June 2017 — See TJones_(WMF)/Notes for other projects. See also T166731.

Intro
The Kuromoji Japanese language analyzer has lots of configuration options, can be unpacked for custom configuration, and is supported by Elastic. It seems like the right place to start.

Baseline (CJK) vs Kuromoji
The Kuromoji analyzer produces many fewer tokens than the baseline CJK analyzer on a 5K jawiki article sample:
 * 5,515,225 (CJK)
 * 2,379,072 (Kuromoji)

It also produces many fewer pre-/post-analysis token types:
 * 310,339 / 305,770 (CJK)
 * 139,282 / 127,983 (Kuromoji)
 * (there are fewer post-analysis types because some merge, like Apple and apple)

Many non-Japanese, non-Latin characters are not handled well:
 * Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, IPA, Mongolian, Myanmar, Thaana, Thai, and Tibetan are removed.

Lots of Japanese tokens (Hiragana, Katakana, and Ideographic characters) also change, but that's because the baseline CJK tokenizer works on bigrams rather than actually trying to segment words.
 * Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
 * Cyrillic words are split on our old friend, the combining acute accent.
 * Greek is treated very oddly, with some characters removed, and words sometimes split into individual letters, sometimes not. I can't figure out the pattern.

Fullwidth numbers starting with １ are split character-by-character (１９３３ → 1 + 9 + 3 + 3 rather than 1933). Mixed fullwidth and halfwidth numbers are inconsistent, depending not only on where the fullwidth and halfwidth forms are, but also which ones they are. Leading fullwidth １ seems to like to split off. That seems odd, and while not all of these mixed patterns occur in my sample, some do.
 * 1９３３ → 1933
 * １９３3 → 1 + 933
 * １9３３ → 1 + 933
 * １９3３ → 1 + 933
 * ２９３3 → 2933
 * ２9３３ → 2933
 * ２９3３ → 2933
 * 2９３３ → 2933

Kuromoji vs Kuromoji Unpacked
I unpacked the Kuromoji analyzer according to the docs on the Elastic website. I disabled our usual automatic upgrade of lowercase filter to icu_normalizer filter to focus on the effect of unpacking.
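For reference, the unpacked configuration looks roughly like this, sketched as a Python dict of the analysis settings. The component names are from the Elastic docs; the analyzer name "ja_unpacked" is my placeholder, and the exact deployed settings may differ.

```python
# A sketch of the unpacked Kuromoji analyzer (component names per the
# Elastic docs; "ja_unpacked" is a placeholder name).
unpacked_kuromoji = {
    "analyzer": {
        "ja_unpacked": {
            "type": "custom",
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
                "kuromoji_baseform",        # inflected forms -> base form
                "kuromoji_part_of_speech",  # drop tokens by part-of-speech tag
                "cjk_width",                # fold fullwidth/halfwidth forms
                "ja_stop",                  # Japanese stop words
                "kuromoji_stemmer",         # strip trailing prolonged sound mark
                "lowercase",                # our usual upgrade target (see below)
            ],
        }
    }
}
```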

On a 1K jawiki article sample, the number of tokens was very similar:
 * 520,971 (Kuromoji)
 * 525,325 (unpacked)

The numbers of pre- and post-analysis types are also similar:
 * 59,350 / 55,448 (Kuromoji)
 * 59,643 / 55,716 (unpacked)

Some non-Japanese, non-Latin characters are treated differently:
 * Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, Mongolian, Myanmar, Thaana, Thai, and Tibetan are now preserved!
 * IPA characters are preserved, but words are split on them: dʒuglini → d + ʒ + uglini.

Others are treated the same as before:
 * Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
 * Cyrillic words are split on our old friend, the combining acute accent.
 * Greek and fullwidth numbers are treated oddly, as before.

Overall, it's a big improvement for non-Japanese, non-Latin character handling.

Lowercase vs ICU Normalizer
I re-enabled the lowercase-to-icu_normalizer upgrade. The differences in the 1K sample were very slight and expected:
 * ², ⑥, and ⑦ were converted to 2, 6, and 7.
 * Greek final sigma ς became regular sigma σ.
 * German ß became ss.
 * Pre-composed Roman numerals (Ⅲ ← that's a single character) were decomposed (III ← that's three I's).

Those are all great, so we'll leave the upgrade on in the unpacked version for now.
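These differences are consistent with Unicode compatibility normalization plus case folding. As a rough Python stand-in for the icu_normalizer's nfkc_cf form (the real ICU implementation interleaves folding and normalization, but it agrees on these examples):

```python
import unicodedata

def nfkc_cf(s):
    # Approximation of ICU's nfkc_cf: NFKC compatibility normalization
    # followed by full Unicode case folding.
    return unicodedata.normalize("NFKC", s).casefold()

print(nfkc_cf("x²⑥⑦"))  # superscripts and circled digits become plain digits
print(nfkc_cf("ς"))      # final sigma folds to regular sigma
print(nfkc_cf("ß"))      # German sharp s folds to "ss"
print(nfkc_cf("Ⅲ"))      # precomposed Roman numeral decomposes (and lowercases)
```

Note that the case fold also lowercases the decomposed Roman numeral, so Ⅲ ends up as iii rather than III.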

Fullwidth Numbers
I did some experimentation, and the problem with the fullwidth numbers is coming from the tokenizer. I've added a custom character filter to convert the fullwidth numbers to halfwidth numbers before the tokenizer, which solves the weird inconsistencies.
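The idea behind the character filter can be sketched in Python (the actual filter is an Elasticsearch character filter applied before the tokenizer; this just shows the digit mapping):

```python
# Map fullwidth digits (U+FF10-U+FF19) to their ASCII equivalents
# before tokenization, so mixed-width numbers tokenize consistently.
FULLWIDTH_DIGITS = "０１２３４５６７８９"
TO_HALFWIDTH = {ord(c): ord("0") + i for i, c in enumerate(FULLWIDTH_DIGITS)}

def halfwidth_numbers(text):
    return text.translate(TO_HALFWIDTH)

print(halfwidth_numbers("１９３３"))  # all-fullwidth
print(halfwidth_numbers("1９３３"))  # mixed forms come out the same
```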

It does have one semi-undesirable side effect: months, like ４月 (4th month = "April"), are now split into two tokens, 4 + 月, just as they already were when a halfwidth number was used. While I think indexing ４月 or 4月 as a unit is better, that only ever happened for the fullwidth version, so while this is slightly worse, it is also more consistent, in that ４月 and 4月 will now be indexed the same.

Bits and Bobs
I tested a number of other options available with Kuromoji. These are the ones that didn't pan out.

Kuromoji Tokenizer Modes
The Kuromoji tokenizer has several modes and a couple of expert parameters (see documentation). I didn't want to dig into all of it, but the "search" mode seemed interesting.

The "normal" mode returns compounds as single terms. So, 関西国際空港 ("Kansai International Airport") is indexed as just 関西国際空港.

In "search" mode, 関西国際空港 is indexed as four terms, the full compound (関西国際空港) and the three component parts (関西, 国際, 空港—"Kansai", "International", "Airport").

It turns out that "search" is the default, even though "normal" is listed first and sounds like it might be the normal mode of operation.

There is also an "extended" mode that breaks up unknown tokens into unigrams. (For comparison, the CJK analyzer breaks up everything into bigrams.)

The "search" mode seems better, so we'll stick with that.
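As a sketch, selecting the mode is just a tokenizer parameter (parameter names per the Elastic docs; "ja_search_tokenizer" is a placeholder name):

```python
# Tokenizer settings selecting the "search" mode explicitly.
# "search" is the default; "normal" and "extended" are the alternatives.
tokenizer_settings = {
    "ja_search_tokenizer": {
        "type": "kuromoji_tokenizer",
        "mode": "search",  # emit compounds and their constituent parts
    }
}
```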

Part of Speech Token Filter
I disabled the kuromoji_part_of_speech token filter, and it had no effect on a 1K sample of jawiki articles. According to the docs, it filters tokens based on part-of-speech tags, but apparently none are configured, so it does nothing. Might as well leave it disabled if it isn't doing anything.

Iteration Mark Expansion
There's an option to expand iteration marks (e.g., 々), which indicate that the previous character should be repeated. However, sometimes the versions of the word with and without the iteration marks have different meanings. More importantly, expanding the iteration mark can change the tokenization—usually resulting in a word being split into pieces. The iteration mark seems much less ambiguous, so I think we should leave it alone.

Stop Words
A small number of types (~125) but a huge number of tokens (278,572 stop words vs 525,320 non-stop words—34.6% of all tokens) are filtered as stop words. A quick survey of the top stop words shows they all make sense.

Prolonged Sound Mark "Stemmer"
The kuromoji_stemmer token filter isn't really a stemmer (the kuromoji_baseform token filter does some stemming for verbs and adjectives), it just strips the Prolonged Sound Mark from the ends of words. A quick test disabling it shows that's exactly what it does.

This seems to be a mark used in loanwords to indicate long vowel sounds. I'm not sure why you'd want to remove it at the end of a word, but it often doesn't make any difference in Google Translate, so it seems to be semi-optional.

The removal is on by default, so I'll leave it that way.
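A minimal sketch of what the filter does, assuming the documented minimum_length parameter (default 4, per the Elastic docs, so very short katakana words keep their mark):

```python
# Sketch of kuromoji_stemmer behavior: strip one trailing prolonged
# sound mark (ー, U+30FC) from tokens at or above a minimum length.
def strip_prolonged_sound_mark(token, minimum_length=4):
    if len(token) >= minimum_length and token.endswith("ー"):
        return token[:-1]
    return token

print(strip_prolonged_sound_mark("コンピューター"))  # mark stripped
print(strip_prolonged_sound_mark("ラー"))            # too short, unchanged
```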

Japanese Numeral Conversion
The kuromoji_number token filter [scroll down at the link to find it] normalizes Japanese numerals (〇, 一, etc.) to Arabic numerals (0, 1, etc.).

However, it is wildly aggressive and ignores spaces, commas, periods, leading zeros, dashes, slashes, colons, number signs, and many other characters. So, 1, 2, 3. : 456 -7 #8 is tokenized as 12345678, and 一〇〇.一九 #七 九:〇 as 10019790. It also tokenizes 0.001 as just 1.

Unfortunately, the rules for parsing properly formatted Japanese numerals can't be implemented as simple digit-by-digit substitution, so a simple char filter can't solve this problem.
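To make the difficulty concrete, here's a hedged sketch of a minimal kanji-numeral parser (my own illustration; it ignores 万, 億, and other complications). Digit-string style numerals like 一九三三 could be handled by substitution, but place-value style numerals like 千九百三十三 require actual parsing of the multipliers:

```python
# Minimal kanji numeral parser: digits accumulate like decimal digits,
# while place multipliers (十, 百, 千) close out the current digit run.
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}

def kanji_to_int(s):
    total, current = 0, 0
    for ch in s:
        if ch in DIGITS:
            current = current * 10 + DIGITS[ch]
        elif ch in MULTIPLIERS:
            total += (current or 1) * MULTIPLIERS[ch]  # bare 千 means 1000
            current = 0
    return total + current

print(kanji_to_int("一九三三"))      # digit-string style
print(kanji_to_int("千九百三十三"))  # place-value style: same number
```

A character-by-character substitution would turn 千九百三十三 into nonsense, which is why a simple char filter can't do this job.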

Fortunately, not using it is no worse than the current situation.

Groups for Review
Below are some groupings for review by fluent speakers of Japanese. These are tokens that are indexed together, so searching for one should find the others. The format is group - [count token].... The group is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the analyzer and the tokens being considered, so take it as a hint to the meaning of the group, but not definitive. Each token is a token found by the language analyzer (more or less a "word" but not exactly—it could be a grammatical particle, etc.) and the count is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters.
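As an illustration of how such groups are assembled (the token pairs below are made up for the example, not taken from the sample): group the analyzer's output by normalized form and count the surface forms that map to it.

```python
from collections import Counter, defaultdict

# Hypothetical (surface form, normalized form) pairs from an analyzer.
pairs = [("き", "くる"), ("き", "くる"), ("くる", "くる"), ("こ", "くる")]

groups = defaultdict(Counter)
for surface, normalized in pairs:
    groups[normalized][surface] += 1

def format_group(normalized, counts):
    # Render a group in the "group - [count token]..." format used below.
    return normalized + " - " + "".join(
        f"[{n} {t}]" for t, n in sorted(counts.items()))

print(format_group("くる", groups["くる"]))
```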

Groups with no common prefix/suffix
These are the groups that don't have a common prefix or suffix across all tokens in the group. That's not necessarily a bad thing (good, better, best is a fine group in English, for example)—but it's worth looking at them just to be sure. I've filtered out the half-width/full-width variants that were obvious to me. Things are a bit complicated by the fact that "る" is in two different groups below. A number of tokens are normalized in multiple ways, which appears to be context-dependent.
 * く - [47 か][14 きゃ][220 く][8 け][351 っ]
 * くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
 * す - [36 さ][4 し][81 しゃ][199 す]
 * たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
 * ぬ - [4 ざり][142 ざる][1 ざれ][4927 ず][5 ずん][407 ぬ][84 ね]
 * り - [512 り][188 る]
 * る - [4 よ][1 りゃ][166 る][13 るる][1 るれ][38 れ][2 ろ]

Examples of where these tokens come from in the text are available on a separate page.

Largest Groups
These are the groups with the largest number of unique tokens. Again, these aren't necessarily wrong, but it is good to review them. They are actually pretty small groups compared to those from other language analyzers. The first two are duplicates from above; the rest are not.
 * たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
 * くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
 * てる - [115 て][5 てっ][1 てよ][7 てら][3 てり][1 てりゃ][282 てる][1 てれ][11 てん]
 * よい - [32 よ][217 よい][22 よかっ][2 よから][1 よかれ][19 よき][138 よく][6 よけれ][134 よし]
 * 悪い - [33 悪][250 悪い][1 悪う][27 悪かっ][1 悪から][1 悪き][143 悪く][1 悪けれ][5 悪し]
 * 良い - [75 良][383 良い][47 良かっ][2 良かれ][38 良き][265 良く][1 良けりゃ][2 良けれ][16 良し]

Examples of where these tokens come from in the text are available on a separate page.

Random Groups
Below are 50 random groups. I filtered out groups that consisted of just a deleted Prolonged Sound Mark (ー), since the "stemmer" (see above) is supposed to do that.

Examples of where these tokens come from in the text are available on a separate page.
 * かき消す - [1 かき消し][2 かき消す][1 かき消そ]
 * きつい - [6 きつ][7 きつい][1 きつう][1 きつき][6 きつく]
 * こうじる - [41 こうじ][3 こうじろ]
 * こじれる - [6 こじれ][1 こじれる]
 * つける - [547 つけ][11 つけよ][179 つける][3 つけれ][9 つけろ][2 つけん]
 * てんじる - [3 てんじ][1 てんじん]
 * のく - [1 のい][12 のき][3 のく][2 のこ]
 * ぶつ - [2 ぶた][2 ぶち][2 ぶちゃ][10 ぶっ][80 ぶつ][1 ぶて]
 * まぎる - [1 まぎ][1 まぎら][1 まぎる][1 まぎれ][1 まぎん]
 * もてる - [12 もて][3 もてる]
 * 丸っこい - [2 丸っこい][1 丸っこく]
 * 乗り換える - [24 乗り換え][19 乗り換える]
 * 乗る - [16 乗][260 乗っ][8 乗ら][67 乗り][92 乗る][2 乗れ][1 乗ろ]
 * 仕上げる - [23 仕上げ][7 仕上げる]
 * 任じる - [111 任じ][4 任じる]
 * 任す - [75 任さ][1 任し]
 * 信じる - [179 信じ][5 信じよ][29 信じる][1 信じろ]
 * 働かせる - [9 働かせ][3 働かせる]
 * 助かる - [9 助かっ][4 助から][10 助かる][1 助かれ]
 * 取りやめる - [29 取りやめ][1 取りやめる]
 * 取り戻す - [3 取り戻さ][85 取り戻し][50 取り戻す][2 取り戻せ][12 取り戻そ]
 * 唸る - [2 唸ら][1 唸り][3 唸る][1 唸れ]
 * 太い - [81 太][25 太い][32 太く]
 * 差し入れる - [1 差し入れ][2 差し入れる]
 * 弱まる - [7 弱まっ][7 弱まり][6 弱まる]
 * 従える - [24 従え][2 従えよ][5 従える]
 * 思いとどまる - [3 思いとどまら][1 思いとどまり][8 思いとどまる]
 * 承る - [164 承][1 承り]
 * 押さえ込む - [1 押さえ込ま][2 押さえ込み][6 押さえ込む][1 押さえ込ん]
 * 振る舞う - [6 振る舞い][17 振る舞う][4 振る舞っ]
 * 携わる - [98 携わっ][3 携わら][23 携わり][48 携わる][1 携われ]
 * 摂る - [32 摂][4 摂っ][3 摂ら][4 摂り][10 摂る][1 摂れ]
 * 暴く - [4 暴い][12 暴か][3 暴き][5 暴く][1 暴こ]
 * 癒える - [11 癒え][1 癒える]
 * 策する - [1 策し][1 策する]
 * 脱ぐ - [10 脱い][2 脱が][4 脱ぎ][4 脱ぐ]
 * 致す - [1 致し][1 致す]
 * 虐げる - [12 虐げ][3 虐げる]
 * 裁く - [1 裁い][6 裁か][1 裁き][10 裁く]
 * 要す - [3 要さ][42 要し][2 要す]
 * 見て取れる - [2 見て取れ][5 見て取れる]
 * 解く - [28 解い][37 解か][41 解き][35 解く][5 解こ]
 * 言い表す - [2 言い表さ][2 言い表し][1 言い表す]
 * 試す - [6 試さ][15 試し][15 試す][3 試そ]
 * 謀る - [5 謀っ][1 謀ら][2 謀り][2 謀る]
 * 譲り渡す - [1 譲り渡さ][3 譲り渡し]
 * 護る - [3 護っ][8 護り][8 護る][1 護れ][1 護ろ]
 * 起こす - [32 起こさ][400 起こし][155 起こす][2 起こせ][9 起こそ]
 * 遠い - [113 遠][77 遠い][2 遠かっ][7 遠き][58 遠く][1 遠し]
 * 闘う - [7 闘い][42 闘う][1 闘え][1 闘お][14 闘っ][4 闘わ]

Longest Tokens
Below are the tokens that are 25 characters or more.

The longest tokens in Latin characters are all reasonable—the first one is the English/basic Latin alphabet, while the next two are rotated versions of the alphabet (I smell a cipher). There are a few long German words, some chemical names, some English words run together as part of a URL, and a really long string that looks to be a transliteration of Sanskrit translated to Tibetan and back. (The corresponding English Wikipedia article breaks it up into multiple words.)
 * abcdefghijklmnopqrstuvwxyz
 * zabcdefghijklmnopqrstuvwxy
 * hijklmnopqrstuvwxyzabcdefg


 * luftwaffenausbildungskommando
 * polizeidienstauszeichnung
 * staedtischermusikvereinduesseldorf


 * chlorobenzalmalononitrile
 * dimethylmethylideneammonium
 * dinitrosopentamethylenetetramine
 * glycerylphosphorylcholine
 * hydroxydihydrochelirubine
 * hydroxydihydrosanguinarine
 * hydroxyphenylacetaldehyde
 * methylenedioxypyrovalerone


 * diggingupbutchandsundance

The long Thai tokens look to be noun phrases in Thai, written without spaces in the usual Thai way. (The Japanese language analyzer isn't really supposed to know what to do with those.)

The longest Japanese tokens can be broken up into two groups. The first group is in katakana. These are long tokens that are indexed both as one long string and as smaller parts (see "Kuromoji Tokenizer Modes" above). The breakdown of the tokens is provided beneath each long token. Notice that some of the sub-tokens are still pretty long ("クリテリウムイベント", "レーシングホールオブフェイムステークス", "カロッツェリアサテライトクルージングシステム").

The second group of Japanese tokens is all hiragana, and a lot of them start with "ょ". When submitted to the analyzer, these all come back as single tokens with no alternate breakdown. Based on Google Translate, I think most or all of these are errors (I wouldn't be shocked if a few turned out to be the Japanese equivalent of antidisestablishmentarianism or supercalifragilisticexpialidocious).

One way to deal with these very long tokens is to change the tokenizer mode to "extended", which breaks up unknown words into unigrams (see "Kuromoji Tokenizer Modes" above). This would improve recall, but at the expense of precision.
 * mahāvairocanābhisaṃbodhivikurvitādhiṣṭhānavaipulyasūtrendrarāja
 * ที่ทําการปกครองอําเภอเทิง
 * ศูนย์เทคโนโลยีสารสนเทศและการสื่อสาร
 * ジャパンカップサイクルロードレースクリテリウムイベント
 * ジャパン - カップ - サイクル - ロードレース - クリテリウムイベント
 * ナショナルミュージアムオブレーシングホールオブフェイムステークス
 * ナショナル - ミュージアム - オブ - レーシングホールオブフェイムステークス
 * パイオニアカロッツェリアサテライトクルージングシステム
 * パイオニア - カロッツェリアサテライトクルージングシステム
 * パシフィックゴルフグループインターナショナルホールディングス
 * パシフィック - ゴルフ - グループ - インターナショナル - ホールディングス
 * ざいさんぎょうだいじんのしょぶんにかかるしんさきじゅんとう
 * ゃくかちょうみたてほんぞうふでつむしこえのとりどり
 * ゅういっかいせんばつちゅうとうがっこうやきゅうたいかい
 * ゅうとうがっこうゆうしょうやきゅうたいかいしこくたいかい
 * ょうがいをりゆうとするさべつのかいしょうにかんするほうりつ
 * ょうさつじんこういをおこなっただんたいのきせいにかんするほうりつ
 * ょうせいほうじんこくりつびょういんきこうかながわびょういん
 * ょうせいほうじんこくりつびょういんきこうもりおかびょういん
 * ょうせんとうきょくによってらちされたひがいしゃとうのしえんにかんするほうりつ
 * ょくかんれんさんぎょうろうどうくみあいそうれんごう
 * ょけんおよびぎじゅつじょうのちしきのこうりゅうをよういにするためのにほんこくせいふと
 * ょせいかんりょうとかぞくのよんひゃくろくじゅうごにち

For now, I say we let the pendulum swing in the other direction (away from indexing bigrams), but keep this potential problem in the back of our minds.

I still need to test the general tokenization of the language analyzer. If that's generally very good, I'll stick with this suggestion. If it's not great, we can reconsider the unigrams for unknown tokens.

Kuromoji Tokenization Analysis
One of the big benefits of a Japanese-specific language analyzer is better tokenization. Written Japanese generally doesn't use spaces, so the default CJK analyzer breaks everything up into bigrams (i.e., overlapping sets of two characters). Trying to find actual words in the text is best, if you can do a decent job.
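A quick sketch of what the bigram approach looks like: the CJK analyzer emits overlapping character bigrams for a run of Japanese text, whereas Kuromoji (in "search" mode) would instead segment 関西国際空港 into 関西 + 国際 + 空港, as described above.

```python
# Overlapping character bigrams, as the baseline CJK analyzer produces
# for a run of Han/kana text (no attempt at word segmentation).
def cjk_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("関西国際空港"))
```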

A rough parallel in English would be to break English text up by some unit of meter (apologies to any poets, I'm gonna just wing it). So "the president of the united states" might be broken up into "the presi", "president", "dent of", "of the", "the unit", "united", and "ed states". You can't apply stop words (i.e., ignoring "the" and "of") and the matches you do get are not guaranteed to be what you intended. So, "..independent of the presiding consul's concerns, the unit belonging to Ed States, United Airlines president ..." matches all the pieces, relatively close together, but isn't at all about the president of the US.

Tools and Corpora
Using the SIGHAN framework I used to test Chinese segmentation, I set out to analyze the Japanese segmentation using Kuromoji.

I was able to extract tokenization information from the much more heavily annotated KNBC corpus. There are 4,186 sentences in the corpus. I had to drop a handful of them because my extracted tokenization did not match the original sentence after "de-tokenizing" it. Some were missing, some mismatched. My final corpus had 4,178 sentences, so I don't think there is any major bias from the dropped sentences.

From the list of Longest Tokens above, we know that Kuromoji's tokenization is not 100% perfect—at least a small number of errors are expected.

I disabled stop words and tested the extended, search, and normal tokenizer modes (see Kuromoji Tokenizer Modes above). The extended mode is expected to have problems since unknown strings are split into individual characters. The search mode also has problems because it includes multiple tokens for the same string (strings recognized as compounds are indexed both as one longer string, and as constituent parts).

Punctuation Problems
I also had to deal with a somewhat unexpected problem with punctuation. The Kuromoji analyzer, rightly, drops punctuation and non-word symbols it isn't going to index. My script that extracts the tokenization fills in the dropped bits from the original string. That's mostly okay, but it sometimes causes problems when there are multiple punctuation marks or other symbols in a row. For example, when tokenizing "「そうだ 、京都 、行こう. 」", Kuromoji returned the tokens そう, だ, 京都, 行こ, and う. My script filled in 「, 、, 、, and ". 」". The problem is that the KNBC annotations split ". 」" into two separate tokens, "." and "」". Disagreement here isn't terribly important, since we aren't indexing those tokens.

I wrote a script to identify correspondences in tokenization, and used it to identify where I should manually munge tokenization differences for punctuation and other symbols (e.g., "−−＞" vs "− − ＞", "…. " vs "… . ", etc.). I also normalized a small number of fullwidth spaces.

Recall and Precision Results
It turns out that the differences are pretty small: recall is roughly 82-83% and precision 72-76%, depending on the tokenizer mode. These are okay results, but not great.
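As a simplified sketch of the scoring (the SIGHAN-style scorer matches tokens by position and boundaries, which is stricter; this bag-of-tokens version just illustrates the two metrics):

```python
from collections import Counter

# Token-multiset precision/recall between a gold segmentation and a
# system segmentation: count the overlap of the two token bags.
def precision_recall(gold, system):
    overlap = sum((Counter(gold) & Counter(system)).values())
    return overlap / len(system), overlap / len(gold)

p, r = precision_recall(["そう", "だ", "京都"], ["そうだ", "京都"])
print(p, r)  # only 京都 matches: precision 1/2, recall 1/3
```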

However, there are some systematic differences between the KNBC tokenization and the Kuromoji tokenization.

Common Tokenization Discrepancies
Below is a list of the most common alternations (10 or more occurrences) found between the KNBC and Kuromoji tokenizations. I've also provided Google translations for the ones with 20 or more occurrences. (I've added a bullet, •, between tokens because I have trouble seeing the spaces in fullwidth text, and I suspect others who don't read Japanese may as well.)

The Google translations are far from definitive and comments from speakers of Japanese would be helpful, but the translations do hint that the most common alternations are not really content words. I've bolded the ones that seem to have some content. The rest account for 3,010 out of 9,214 alternations (32.6%, including those with <10 occurrences, which are not shown here).

Below I have split out the tokens that participate in alternations, to help identify regular patterns across alternations. I've included those with >=50 occurrences.

On the Kuromoji side, the top 9 are single characters, and 8 of them are identified by English Wiktionary as being particles (very briefly: particles are typically small words that provide additional grammatical information). To get a sense of the scope here, this would be like deciding whether "have been" or "look up" should be tokenized as one word or two. Consistency is probably more important than choosing either option.

It seems that Kuromoji is more aggressive about separating particles, and these 8 account for 6,656 of the 17,510 Kuromoji tokens (38.0%) that appear in alternations.

Tokenization Analysis Summary
My sense is that a significant portion of the disagreements between KNBC and Kuromoji are based on aggressiveness in separating particular parts of speech. There are surely a fair number of errors in the Kuromoji tokenization, but I'm not so worried that I'd want to stop rather than proceed to setting up the test index in labs.

Further Review
Below is some additional analysis, done as the result of issues brought up by speaker review or elsewhere.

Some 1- and 2-Character Tokens
In light of the discussion with whym on Phab and the concern that 1- and 2-character tokens are often highly ambiguous and can be grammatical suffixes, I've taken all of the 1- and 2-character tokens in the Groups with no common prefix/suffix above and run some additional analysis. In the tables below we have:
 * token: the 1- or 2-character token from Groups with no common prefix/suffix above
 * char_freq: the number of times the token string occurs in my 10,000-article corpus.
 * omitted: the number of times Kuromoji omitted the string and did not index it.
 * omit%: omitted/char_freq as a percentage. Values below 95% are bolded.

The remaining columns come in triples, which are:
 * freq: the number of times the token was normalized in a particular way.
 * %: freq/char_freq as a percentage. Values above 1% are bolded.
 * norm: the normalized version of the token.

The first table is the single-character tokens, which are generally much more common in the corpus. Many of these are indexed only vanishingly rarely, with 98% or more of the instances in the corpus being omitted from the index. Those with significant rates of indexing are relatively uncommon, occurring hundreds to fewer than ten thousand times in the corpus, rather than one to two hundred thousand times.

The second table is the two-character tokens. I didn't bold the higher % values since almost everything would be bolded. There are many fewer occurrences of these strings in the corpus overall, with some occurring fewer than 10 times and none more than 1,500 times (compared to hundreds of thousands of occurrences above).

Overall, how these 1- and 2-character tokens are indexed is still a concern, but the numbers lean towards it not being a gigantic problem.

To Do

 * Get some native/fluent speaker review of the groupings above (In progress)
 * Test tokenization independently (Done—see above)
 * Figure out what to do about BM25 (in progress)
 * Set up one or more of the configurations in labs (in progress)
 * Post request for feedback to the Village Pump (in progress)
 * Do the deployment + reindexing dance! ♩♫♩