User:TJones (WMF)/Notes/Kuromoji Analyzer Analysis

June-July 2017 — See TJones_(WMF)/Notes for other projects. See also T166731. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro

The Kuromoji Japanese language analyzer has lots of configuration options, can be unpacked for custom configuration, and is supported by Elastic. It seems like the right place to start.

Baseline (CJK) vs Kuromoji

The Kuromoji analyzer produces many fewer tokens than the baseline CJK analyzer on a 5K jawiki article sample:

5,515,225 (CJK)
2,379,072 (Kuromoji)

And many fewer pre-/post-analysis token types:

310,339 / 305,770 (CJK)
139,282 / 127,983 (Kuromoji)
(there are fewer post-analysis types because some merge, like Apple and apple)
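To make the token/type distinction concrete, here is a minimal sketch of how counts like these can be computed. It assumes the analyzer output has already been reduced to (surface form, normalized form) pairs; the function name and the tiny sample are made up for illustration and are not the tooling used for the numbers above.

```python
def count_tokens_and_types(pairs):
    """Return (token count, pre-analysis types, post-analysis types).

    pairs: iterable of (surface_token, normalized_token) tuples.
    """
    pairs = list(pairs)
    pre_types = {surface for surface, _ in pairs}        # distinct strings as found in the text
    post_types = {normalized for _, normalized in pairs}  # distinct strings after analysis
    return len(pairs), len(pre_types), len(post_types)


# Hypothetical sample: "Apple" and "apple" are two pre-analysis types
# that merge into one post-analysis type.
sample = [("Apple", "apple"), ("apple", "apple"), ("Apple", "apple"), ("pie", "pie")]
counts = count_tokens_and_types(sample)
```

Here `counts` is `(4, 3, 2)`: four tokens, three pre-analysis types, two post-analysis types.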

Many non-Japanese, non-Latin characters are not handled well:

Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, IPA, Mongolian, Myanmar, Thaana, Thai, and Tibetan are removed.

Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
Cyrillic words are split on our old friend, the combining acute accent.
Greek is treated very oddly, with some characters removed, and words sometimes split into individual letters, sometimes not. I can't figure out the pattern.

Lots of Japanese tokens (Hiragana, Katakana, and Ideographic characters) also change, but that's because the baseline CJK tokenizer works on bigrams rather than actually trying to segment words.

Fullwidth numbers starting with １ are split character-by-character (１９３３ → 1 + 9 + 3 + 3 rather than 1933). Mixed fullwidth and halfwidth numbers are inconsistent, depending not only on where the fullwidth and halfwidth forms are, but also which ones they are. Leading fullwidth １ seems to like to split off.

1９３３ → 1933
１９３3 → 1 + 933
１9３３ → 1 + 933
１９3３ → 1 + 933
２９３3 → 2933
２9３３ → 2933
２９3３ → 2933
2９３３ → 2933

That seems odd, and while not all of these mixed patterns occur in my sample, some do.

Kuromoji vs Kuromoji Unpacked

I unpacked the Kuromoji analyzer according to the docs on the Elastic website. I disabled our usual automatic upgrade of lowercase filter to icu_normalizer filter to focus on the effect of unpacking.
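For reference, a sketch of what the unpacked analyzer settings look like, written as a Python dict that serializes to the JSON Elasticsearch expects. The tokenizer and filter list follow my reading of the Elastic documentation for the kuromoji analyzer; the analyzer name `kuromoji_unpacked` is a made-up label, and the exact list should be checked against the docs for the Elasticsearch version in use.

```python
import json

# Assumed composition of the default kuromoji analyzer, per the Elastic docs.
UNPACKED_KUROMOJI = {
    "analysis": {
        "analyzer": {
            "kuromoji_unpacked": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "kuromoji_tokenizer",
                "filter": [
                    "kuromoji_baseform",
                    "kuromoji_part_of_speech",
                    "cjk_width",
                    "ja_stop",
                    "kuromoji_stemmer",
                    "lowercase",  # the filter we normally upgrade to icu_normalizer
                ],
            }
        }
    }
}

# Serialize for use in an index-settings request body.
settings_json = json.dumps(UNPACKED_KUROMOJI, ensure_ascii=False, indent=2)
```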

On a 1K jawiki article sample, the number of tokens was very similar:

520,971 (Kuromoji)
525,325 (unpacked)

Also, the numbers of pre- and post-analysis types are similar:

59,350 / 55,448 (Kuromoji)
59,643 / 55,716 (unpacked)

Some non-Japanese, non-Latin characters are treated differently:

Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, Mongolian, Myanmar, Thaana, Thai, and Tibetan are now preserved!
IPA characters are preserved, but words are split on them: dʒuglini → d + ʒ + uglini.

Others, the same:

Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
Cyrillic words are split on our old friend, the combining acute accent.
Greek and full-width numbers are treated oddly, as before.

Overall, it's a big improvement for non-Japanese, non-Latin character handling.

Lowercase vs ICU Normalizer

I re-enabled the lowercase-to-icu_normalizer upgrade. The differences in the 1K sample were very slight and expected:

², ⑥, and ⑦ were converted to 2, 6, and 7.
Greek final-sigma ς became regular sigma σ.
German ß became ss.
Pre-composed Roman numerals (Ⅲ ← that's a single character) were decomposed (III ← that's three i's).

Those are all great, so we'll leave that on in the unpacked version for now.
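The changes listed above can be roughly reproduced with Python's standard library, which helps when sanity-checking individual characters. This is only an approximation of what the ICU normalizer does (NFKC normalization plus case folding); it is not the ICU implementation, and the function name is invented for this sketch.

```python
import unicodedata


def approx_icu_normalize(text):
    """Approximate NFKC + case folding using only the standard library."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Normalize again, since case folding can in principle undo normalization.
    return unicodedata.normalize("NFKC", folded)


examples = {ch: approx_icu_normalize(ch) for ch in ["²", "⑥", "⑦", "ς", "ß", "Ⅲ"]}
```

With this approximation, the superscript and circled digits become plain digits, ς becomes σ, ß becomes ss, and the single-character Ⅲ becomes three lowercase i's.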

Fullwidth Numbers

I did some experimentation, and the problem with the fullwidth numbers is coming from the tokenizer. I've added a custom character filter to convert the fullwidth numbers to halfwidth numbers before the tokenizer, which solves the weird inconsistencies.

It does have one semi-undesirable side effect: months, like ４月 (4th month = "April"), are split into two tokens, 4 + 月, if a halfwidth number is used. While I think indexing ４月 or 4月 as a unit is better, that only happens for the fullwidth version, so while the result is slightly worse, it is also more consistent, in that ４月 and 4月 will be indexed the same.
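A minimal sketch of the fullwidth-to-halfwidth digit mapping, both as a plain Python function and as the rule list for an Elasticsearch `mapping` character filter. The filter name `fullwidth_numbers` and the helper names are assumptions for illustration; the actual filter in our config may be named and laid out differently.

```python
FULLWIDTH_DIGITS = "０１２３４５６７８９"
HALFWIDTH_DIGITS = "0123456789"
_DIGIT_TABLE = str.maketrans(FULLWIDTH_DIGITS, HALFWIDTH_DIGITS)


def to_halfwidth_digits(text):
    """Replace fullwidth digits with ASCII digits; leave everything else alone."""
    return text.translate(_DIGIT_TABLE)


def mapping_rules():
    """Rules in the 'key => value' syntax used by the mapping char filter."""
    return [f"{fw} => {hw}" for fw, hw in zip(FULLWIDTH_DIGITS, HALFWIDTH_DIGITS)]


# Hypothetical char filter definition to place before the tokenizer.
CHAR_FILTER = {"fullwidth_numbers": {"type": "mapping", "mappings": mapping_rules()}}
```

Since the conversion happens before tokenization, mixed strings like １９３3 and １9３３ all reach the tokenizer as 1933, and ４月 reaches it as 4月.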

Bits and Bobs

I tested a number of other options available with Kuromoji. These are the ones that didn't pan out.

Kuromoji Tokenizer Modes

The Kuromoji tokenizer has several modes and a couple of expert parameters (see documentation). I didn't want to dig into all of it, but the "search" mode seemed interesting.

The "normal" mode returns compounds as single terms. So, 関西国際空港 ("Kansai International Airport") is indexed as just 関西国際空港.

In "search" mode, 関西国際空港 is indexed as four terms, the full compound (関西国際空港) and the three component parts (関西, 国際, 空港—"Kansai", "International", "Airport").

It turns out that "search" is the default, even though "normal" is listed first and sounds like it might be the normal mode of operation.

There is also an "extended" mode that breaks up unknown tokens into unigrams. (For comparison, the CJK analyzer breaks up everything into bigrams.)

The "search" mode seems better, so we'll stick with that.
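For anyone who wants to try the other modes, here is a sketch of a custom tokenizer definition with an explicit mode. The three mode names come from the Kuromoji documentation; the tokenizer name `kuromoji_custom_mode` and the helper function are made up for this example.

```python
VALID_MODES = ("normal", "search", "extended")


def kuromoji_tokenizer_settings(mode="search"):
    """Build a tokenizer definition for the given Kuromoji mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown Kuromoji tokenizer mode: {mode!r}")
    return {
        "tokenizer": {
            "kuromoji_custom_mode": {  # hypothetical tokenizer name
                "type": "kuromoji_tokenizer",
                "mode": mode,
            }
        }
    }


search_settings = kuromoji_tokenizer_settings()           # the default mode
extended_settings = kuromoji_tokenizer_settings("extended")
```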

Part of Speech Token Filter

I disabled the kuromoji_part_of_speech token filter, and it had no effect on a 1k sample of jawiki articles. According to the docs, it filters based on part of speech tags, but apparently none are configured, so it does nothing. Might as well leave it disabled if it isn't doing anything.

Iteration Mark Expansion

There's an option to expand iteration marks (e.g., 々), which indicate that the previous character should be repeated. However, sometimes the versions of the word with and without the iteration marks have different meanings. More importantly, expanding the iteration mark can change the tokenization—usually resulting in a word being split into pieces. The iteration mark seems much less ambiguous, so I think we should leave it alone.
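To illustrate what expansion means, here is a simplified sketch that handles only the kanji iteration mark 々. It assumes that a run of n marks repeats the n characters before the run, which is my understanding of the usual convention; it is not the Kuromoji filter itself and ignores the kana iteration marks entirely.

```python
ITERATION_MARK = "々"


def expand_kanji_iteration_marks(text):
    """Expand runs of 々 by repeating the same number of preceding characters."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if chars[i] != ITERATION_MARK:
            out.append(chars[i])
            i += 1
            continue
        # Find the end of this run of iteration marks.
        j = i
        while j < len(chars) and chars[j] == ITERATION_MARK:
            j += 1
        run = j - i
        source = chars[i - run:i] if run <= i else []
        if source and ITERATION_MARK not in source:
            out.extend(source)       # repeat the characters before the run
        else:
            out.extend(chars[i:j])   # nothing sensible to repeat; keep the marks
        i = j
    return "".join(out)
```

For example, 時々 expands to 時時 and 馬鹿々々しい to 馬鹿馬鹿しい. The expanded strings are what the tokenizer would then see, which is where the changed tokenization comes from.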

Stop Words

A small number of types (~125) but a huge number of tokens (278,572 stop words vs 525,320 non-stop words, or 34.6% of all tokens) are filtered as stop words. A quick survey shows that the top stop words all make sense.

Prolonged Sound Mark "Stemmer"

The kuromoji_stemmer token filter isn't really a stemmer (the kuromoji_baseform token filter does some stemming for verbs and adjectives), it just strips the Prolonged Sound Mark from the ends of words. A quick test disabling it shows that's exactly what it does.

This seems to be a mark used in loanwords to indicate long vowel sounds. I'm not sure why you'd want to remove it at the end of a word, but it often doesn't make any difference in Google translate, so it seems to be semi-optional.

The removal is on by default, so I'll leave it that way.
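A rough sketch of the behavior, for illustration only. The katakana-only check and the minimum length of 4 are assumptions based on my reading of the kuromoji_stemmer documentation, not something verified in the test above; the function names are invented.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(token):
    """True if every character is in the Katakana block (U+30A0 to U+30FF)."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def strip_prolonged_sound_mark(token, minimum_length=4):
    """Drop one trailing ー from katakana tokens of at least minimum_length characters."""
    if (
        len(token) >= minimum_length
        and token.endswith(PROLONGED_SOUND_MARK)
        and is_katakana(token)
    ):
        return token[:-1]
    return token
```

Under these assumptions サーバー becomes サーバ, while the shorter コピー is left alone.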

Japanese Numeral Conversion

The kuromoji_number token filter normalizes Japanese numerals (〇, 一, etc.) to Arabic numerals (0, 1, etc.).

However it is wildly aggressive and ignores spaces, commas, periods, leading zeros, dashes, slashes, colons, number signs, and many more. So, 1, 2, 3. : 456 -7 #8 is tokenized as 12345678 and 一〇〇.一九 #七九:〇 as 10019790. It also tokenizes 0.001 as just 1.

Unfortunately, the rules for parsing properly formatted Japanese numerals can't be implemented as simple digit-by-digit substitution, so a simple char filter can't solve this problem.

Fortunately, not using it is no worse than the current situation.
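To show why digit-by-digit substitution falls short, here is a small sketch contrasting it with a parser for unit-based numerals. Both functions are illustrative only: the parser handles just 十, 百, and 千 (values below 10,000) and does not attempt positional forms, larger units, or anything else the real filter covers.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def substitute_digits(text):
    """Naive char-filter-style replacement: one kanji digit, one Arabic digit."""
    return "".join(str(DIGITS[ch]) if ch in DIGITS else ch for ch in text)


def parse_unit_numeral(text):
    """Parse a numeral written with 十/百/千 units, e.g. 三百二十."""
    total = 0
    current = None  # digit waiting for its unit, if any
    for ch in text:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in SMALL_UNITS:
            total += (1 if current is None else current) * SMALL_UNITS[ch]
            current = None
        else:
            raise ValueError(f"not a numeral character: {ch!r}")
    return total + (current or 0)
```

Substitution is fine for positional forms like 一〇〇 (100), but 二十 comes out as "2十" rather than 20, while the parser gives 20 for 二十 and 320 for 三百二十. Getting that right requires looking at more than one character at a time.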

Groups for Review

Below are some groupings for review by fluent speakers of Japanese. These are tokens that are indexed together, so searching for one should find the others. The format is <normalized_form> - [<count> <token>].... The <normalized_form> is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the analyzer and the tokens being considered, so take it as a hint to the meaning of the group, but not definitive. The <token> is a token found by the language analyzer (more or less a "word" but not exactly—it could be a grammatical particle, etc.) and <count> is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters.
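If you want to process the groups below programmatically, a minimal parser for the format just described might look like this. The function name is made up, and it assumes tokens never contain a closing square bracket.

```python
import re

# One "[<count> <token>]" entry.
GROUP_ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")


def parse_group(line):
    """Split '<normalized_form> - [<count> <token>]...' into its parts."""
    normalized, _, rest = line.partition(" - ")
    entries = [(int(count), token) for count, token in GROUP_ENTRY.findall(rest)]
    return normalized, entries


normalized, entries = parse_group("り - [512 り][188 る]")
total = sum(count for count, _ in entries)
```

For that line, `normalized` is り, `entries` is `[(512, "り"), (188, "る")]`, and `total` is 700.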

Groups with no common prefix/suffix

These are the groups that don't have a common prefix or suffix across all tokens in the group. That's not necessarily a bad thing (good, better, best is a fine group in English, for example)—but it's worth looking at them just to be sure. I've filtered out the half-width/full-width variants that were obvious to me.

く - [47 か][14 きゃ][220 く][8 け][351 っ]
くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
す - [36 さ][4 し][81 しゃ][199 す]
たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
ぬ - [4 ざり][142 ざる][1 ざれ][4927 ず][5 ずん][407 ぬ][84 ね]
り - [512 り][188 る]
る - [4 よ][1 りゃ][166 る][13 るる][1 るれ][38 れ][2 ろ]

Things are a bit complicated by the fact that "る" is in two different groups above. A number of tokens are normalized in multiple ways, which appears to be context-dependent.

Examples of where these tokens come from in the text are available on a separate page.

Largest Groups

These are the groups with the largest number of unique tokens. Again, these aren't necessarily wrong, but it is good to review them. They are actually pretty small groups compared to other language analyzers. The first two are duplicates from above.

たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]

These are not duplicates:

てる - [115 て][5 てっ][1 てよ][7 てら][3 てり][1 てりゃ][282 てる][1 てれ][11 てん]
よい - [32 よ][217 よい][22 よかっ][2 よから][1 よかれ][19 よき][138 よく][6 よけれ][134 よし]
悪い - [33 悪][250 悪い][1 悪う][27 悪かっ][1 悪から][1 悪き][143 悪く][1 悪けれ][5 悪し]
良い - [75 良][383 良い][47 良かっ][2 良かれ][38 良き][265 良く][1 良けりゃ][2 良けれ][16 良し]

Examples of where these tokens come from in the text are available on a separate page.

Random Groups

Below are 50 random groups. I filtered out groups that consisted of just a deleted Prolonged Sound Mark (ー), since the "stemmer" (see above) is supposed to do that.

Examples of where these tokens come from in the text are available on a separate page.

かき消す - [1 かき消し][2 かき消す][1 かき消そ]
きつい - [6 きつ][7 きつい][1 きつう][1 きつき][6 きつく]
こうじる - [41 こうじ][3 こうじろ]
こじれる - [6 こじれ][1 こじれる]
つける - [547 つけ][11 つけよ][179 つける][3 つけれ][9 つけろ][2 つけん]
てんじる - [3 てんじ][1 てんじん]
のく - [1 のい][12 のき][3 のく][2 のこ]
ぶつ - [2 ぶた][2 ぶち][2 ぶちゃ][10 ぶっ][80 ぶつ][1 ぶて]
まぎる - [1 まぎ][1 まぎら][1 まぎる][1 まぎれ][1 まぎん]
もてる - [12 もて][3 もてる]
丸っこい - [2 丸っこい][1 丸っこく]
乗り換える - [24 乗り換え][19 乗り換える]
乗る - [16 乗][260 乗っ][8 乗ら][67 乗り][92 乗る][2 乗れ][1 乗ろ]
仕上げる - [23 仕上げ][7 仕上げる]
任じる - [111 任じ][4 任じる]
任す - [75 任さ][1 任し]
信じる - [179 信じ][5 信じよ][29 信じる][1 信じろ]
働かせる - [9 働かせ][3 働かせる]
助かる - [9 助かっ][4 助から][10 助かる][1 助かれ]
取りやめる - [29 取りやめ][1 取りやめる]
取り戻す - [3 取り戻さ][85 取り戻し][50 取り戻す][2 取り戻せ][12 取り戻そ]
唸る - [2 唸ら][1 唸り][3 唸る][1 唸れ]
太い - [81 太][25 太い][32 太く]
差し入れる - [1 差し入れ][2 差し入れる]
弱まる - [7 弱まっ][7 弱まり][6 弱まる]
従える - [24 従え][2 従えよ][5 従える]
思いとどまる - [3 思いとどまら][1 思いとどまり][8 思いとどまる]
承る - [164 承][1 承り]
押さえ込む - [1 押さえ込ま][2 押さえ込み][6 押さえ込む][1 押さえ込ん]
振る舞う - [6 振る舞い][17 振る舞う][4 振る舞っ]
携わる - [98 携わっ][3 携わら][23 携わり][48 携わる][1 携われ]
摂る - [32 摂][4 摂っ][3 摂ら][4 摂り][10 摂る][1 摂れ]
暴く - [4 暴い][12 暴か][3 暴き][5 暴く][1 暴こ]
癒える - [11 癒え][1 癒える]
策する - [1 策し][1 策する]
脱ぐ - [10 脱い][2 脱が][4 脱ぎ][4 脱ぐ]
致す - [1 致し][1 致す]
虐げる - [12 虐げ][3 虐げる]
裁く - [1 裁い][6 裁か][1 裁き][10 裁く]
要す - [3 要さ][42 要し][2 要す]
見て取れる - [2 見て取れ][5 見て取れる]
解く - [28 解い][37 解か][41 解き][35 解く][5 解こ]
言い表す - [2 言い表さ][2 言い表し][1 言い表す]
試す - [6 試さ][15 試し][15 試す][3 試そ]
謀る - [5 謀っ][1 謀ら][2 謀り][2 謀る]
譲り渡す - [1 譲り渡さ][3 譲り渡し]
護る - [3 護っ][8 護り][8 護る][1 護れ][1 護ろ]
起こす - [32 起こさ][400 起こし][155 起こす][2 起こせ][9 起こそ]
遠い - [113 遠][77 遠い][2 遠かっ][7 遠き][58 遠く][1 遠し]
闘う - [7 闘い][42 闘う][1 闘え][1 闘お][14 闘っ][4 闘わ]

Longest Tokens

Below are the tokens that are 25 characters or more.

The longest tokens in Latin characters are all reasonable—the first one is the English/basic Latin alphabet, while the next two are transposed versions of the alphabet (I smell a cipher). There are a few long German words, some chemical names, some English words run together as part of a URL, and a really long string that looks to be a transliteration of Sanskrit translated to Tibetan and back. (The corresponding English Wikipedia article breaks it up into multiple words.)

abcdefghijklmnopqrstuvwxyz
zabcdefghijklmnopqrstuvwxy
hijklmnopqrstuvwxyzabcdefg

luftwaffenausbildungskommando
polizeidienstauszeichnung
staedtischermusikvereinduesseldorf

chlorobenzalmalononitrile
dimethylmethylideneammonium
dinitrosopentamethylenetetramine
glycerylphosphorylcholine
hydroxydihydrochelirubine
hydroxydihydrosanguinarine
hydroxyphenylacetaldehyde
methylenedioxypyrovalerone

diggingupbutchandsundance

mahāvairocanābhisaṃbodhivikurvitādhiṣṭhānavaipulyasūtrendrarāja

The long Thai tokens look to be noun phrases in Thai, written without spaces in the usual Thai way. (The Japanese language analyzer isn't really supposed to know what to do with those.)

ที่ทําการปกครองอําเภอเทิง
ศูนย์เทคโนโลยีสารสนเทศและการสื่อสาร

The longest Japanese tokens can be broken up into two groups. The first group are in katakana. These are long tokens that are indexed both as one long string and as smaller parts (see "Kuromoji Tokenizer Modes" above). The breakdown of the tokens is provided beneath each long token. Notice that some of the sub-tokens are still pretty long ("クリテリウムイベント", "レーシングホールオブフェイムステークス", "カロッツェリアサテライトクルージングシステム").

ジャパンカップサイクルロードレースクリテリウムイベント
- ジャパン - カップ - サイクル - ロードレース - クリテリウムイベント
ナショナルミュージアムオブレーシングホールオブフェイムステークス
- ナショナル - ミュージアム - オブ - レーシングホールオブフェイムステークス
パイオニアカロッツェリアサテライトクルージングシステム
- パイオニア - カロッツェリアサテライトクルージングシステム
パシフィックゴルフグループインターナショナルホールディングス
- パシフィック - ゴルフ - グループ - インターナショナル - ホールディングス

The second group of Japanese tokens are all hiragana, and a lot of them start with "ょ". When submitted to the analyzer, these all come back as single tokens with no alternate breakdown. Based on Google translate, I think most or all of these are errors (I wouldn't be shocked if a few turned out to be the Japanese equivalent of antidisestablishmentarianism and supercalifragilisticexpialidocious).

ざいさんぎょうだいじんのしょぶんにかかるしんさきじゅんとう
ゃくかちょうみたてほんぞうふでつむしこえのとりどり
ゅういっかいせんばつちゅうとうがっこうやきゅうたいかい
ゅうとうがっこうゆうしょうやきゅうたいかいしこくたいかい
ょうがいをりゆうとするさべつのかいしょうにかんするほうりつ
ょうさつじんこういをおこなっただんたいのきせいにかんするほうりつ
ょうせいほうじんこくりつびょういんきこうかながわびょういん
ょうせいほうじんこくりつびょういんきこうもりおかびょういん
ょうせんとうきょくによってらちされたひがいしゃとうのしえんにかんするほうりつ
ょくかんれんさんぎょうろうどうくみあいそうれんごう
ょけんおよびぎじゅつじょうのちしきのこうりゅうをよういにするためのにほんこくせいふと
ょせいかんりょうとかぞくのよんひゃくろくじゅうごにち

One way to deal with these very long tokens is to change the tokenizer mode to "extended", which breaks up unknown words into unigrams (see "Kuromoji Tokenizer Modes" above). This would improve recall, but at the expense of precision.

For now, I say we let the pendulum swing in the other direction (away from indexing bigrams), but keep this potential problem in the back of our minds.

I still need to test the general tokenization of the language analyzer. If that's generally very good, I'll stick with this suggestion. If it's not great, we can reconsider the unigrams for unknown tokens.

Kuromoji Tokenization Analysis

One of the big benefits of a Japanese-specific language analyzer is better tokenization. Written Japanese generally doesn't use spaces, so the default CJK analyzer breaks everything up into bigrams (i.e., overlapping sets of two characters). Trying to find actual words in the text is best, if you can do a decent job.
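As a quick illustration of overlapping bigrams, here is a sketch that mimics only the bigram step on a string of CJK characters. It is not the CJK analyzer itself, which also deals with scripts, widths, and lowercasing.

```python
def cjk_bigrams(text):
    """Return overlapping two-character windows over the text."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


bigrams = cjk_bigrams("関西国際空港")
```

For 関西国際空港 this gives 関西, 西国, 国際, 際空, and 空港. Three of those are the real component words, but 西国 and 際空 straddle word boundaries.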

A rough parallel in English would be to break English text up by some unit of meter (apologies to any poets, I'm gonna just wing it). So "the president of the united states" might be broken up into "the presi", "president", "dent of", "of the", "the unit", "united", and "ed states". You can't apply stop words (i.e., ignoring "the" and "of") and the matches you do get are not guaranteed to be what you intended. So, "..independent of the presiding consul's concerns, the unit belonging to Ed States, United Airlines president ..." matches all the pieces, relatively close together, but isn't at all about the president of the US.

Tools and Corpora

Using the SIGHAN framework I used to test Chinese segmentation, I set out to analyze the Japanese segmentation using Kuromoji.

I was able to extract tokenization information from the much more heavily annotated KNBC corpus. There are 4,186 sentences in the corpus. I had to drop a handful of them because my extracted tokenization did not match the original sentence after "de-tokenizing" it. Some were missing, some mismatched. My final corpus had 4,178 sentences, so I don't think there is any major bias to the dropped sentences.

From the list of Longest Tokens above, we know that Kuromoji's tokenization is not 100% perfect—at least a small number of errors are expected.

I disabled stop words and tested the extended, search, and normal tokenizer modes (see Kuromoji Tokenizer Modes above). The extended mode is expected to have problems since unknown strings are split into individual characters. The search mode also has problems because it includes multiple tokens for the same string (strings recognized as compounds are indexed both as one longer string, and as constituent parts).

Punctuation Problems

I also had to deal with a somewhat unexpected problem of punctuation. The Kuromoji analyzer, rightly, drops punctuation and non-word symbols it isn't going to index. My script that extracts the tokenization fills in the dropped bits from the original string. That's mostly okay, but sometimes it causes problems when there are multiple punctuation and other symbols in a row. For example, when tokenizing "「そうだ、京都、行こう。」", Kuromoji returned the tokens そう, だ, 京都, 行こ, and う. My script filled in 「, 、, 、, and 。」. The problem is that the KNBC annotations split 。」 into separate tokens: 。 and 」. Disagreement here isn't terribly important, since we aren't indexing those tokens.
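The fill-in step can be sketched roughly as follows. This is a reconstruction of the idea, not my actual script, and it assumes every token the analyzer returns appears verbatim and in order in the original string (which is not true once tokens are normalized).

```python
def fill_dropped_spans(text, tokens):
    """Interleave analyzer tokens with the stretches of text the analyzer dropped."""
    filled = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        if start > cursor:
            filled.append(text[cursor:start])  # dropped punctuation or symbols
        filled.append(token)
        cursor = start + len(token)
    if cursor < len(text):
        filled.append(text[cursor:])  # trailing dropped material
    return filled


segments = fill_dropped_spans("「そうだ、京都、行こう。」", ["そう", "だ", "京都", "行こ", "う"])
```

On the example sentence, the trailing 。」 comes back as a single filled-in segment, which is exactly the mismatch with the KNBC annotation described above.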

I wrote a script to identify correspondences in tokenization, and used it to identify where I should manually munge tokenization differences for punctuation and other symbols (e.g., "−−＞" vs "− − ＞", "…。" vs "… 。", etc.). I also normalized a small number of fullwidth spaces.

Recall and Precision Results

It turns out that the differences are pretty small, and recall is roughly 82-83% and precision 72-76%, depending on the tokenizer:

	Recall	Precision	F1
extended	81.8%	72.0%	76.6%
search	82.8%	75.1%	78.7%
normal	82.4%	75.4%	78.8%
norm munged	83.3%	76.1%	79.5%

These are okay results, but not great.

However, there are some systematic differences between the KNBC tokenization and the Kuromoji tokenization.
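For readers unfamiliar with how segmentation is scored, here is a minimal sketch of span-based recall, precision, and F1. It assumes both tokenizations cover exactly the same character string and counts a token as correct only if both of its boundaries match; the SIGHAN scoring script may differ in details, so treat this as an approximation of the idea.

```python
def token_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def score(gold_tokens, predicted_tokens):
    """Return (recall, precision, F1) for one pair of tokenizations."""
    gold = token_spans(gold_tokens)
    predicted = token_spans(predicted_tokens)
    correct = len(gold & predicted)
    recall = correct / len(gold)
    precision = correct / len(predicted)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f1


# Toy example: one side keeps という together, the other splits it.
recall, precision, f1 = score(["と", "いう", "の", "だ"], ["という", "の", "だ"])
```

In the toy example only の and だ match, so recall is 2/4 and precision is 2/3. A single disagreement about whether to split a common sequence costs on both measures.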

Common Tokenization Discrepancies

Below is a list of the most common alternations (10 or more occurrences) found between the KNBC and Kuromoji tokenizations. I've also provided Google translations for the ones with 20 or more occurrences. (I've added a bullet, •, between tokens because I have trouble seeing the spaces in fullwidth text, and I suspect others who don't read Japanese may as well.)

The Google translations are far from definitive and comments from speakers of Japanese would be helpful, but the translations do hint that the most common alternations are not really content words. I've bolded the ones that seem to have some content. The rest account for 3,010 out of 9,214 alternations (32.6%, including those with <10 occurrences, which are not shown here).

Freq	KNBC tokenization	Kuromoji tokenization	KNBC Google translation	Kuromoji Google translation
374	して	し • て	do it	Then. The
256	と • いう	という	When. Say	It is called
236	ました	まし • た	Was	Better. It was
131	した	し • た	did	Then. It was
131	である	で • ある	Is	so. is there
119	いた	い • た	Was there	Yes. It was
103	れて	れ • て	Have been	Re The
101	なって	なっ • て	Become	Become The
93	のだ	の • だ	It was	of. It is
89	のです	の • です	It is	of. is
88	ように	よう • に	like	Looks like. Into
80	と • か	とか	When. Or	And
80	と • して	として	When. do it.	As
76	だった	だっ • た	was	So. It was
71	だろう	だろ • う	right	Right. Cormorant
71	なかった	なかっ • た	There was not	Not. It was
69	いて	い • て	Stomach	Yes. The
63	行って	行っ • て	go	Go. The
60	ような	よう • な	like	Looks like. What
56	見て	見 • て	look	You see. The
55	なった	なっ • た	became	Become It was
55	んです	ん • です	It is	Hmm. is
52	思って	思っ • て	I thought to	I thought. The
50	行った	行っ • た	went	Go. It was
47	れた	れ • た	Was done	Re It was
47	使って	使っ • て	Use	Use. The
45	でした	でし • た	was	It is. It was
43	清水 • 寺	清水寺	Shimizu. temple	Kiyomizudera
41	きた	き • た	Came	き It was
41	しまった	しまっ • た	Oops	Oops. It was
40	的に	的 • に	Specifically	Target. Into
39	でしょう	でしょ • う	Oh, yeah.	right. Cormorant
39	なくて	なく • て	I do not need it.	Not. The
37	あった	あっ • た	there were	Ah. It was
36	あって	あっ • て	There	Ah. The
34	的な	的 • な	Sophisticated	Target. What
33	でも	で • も	But	so. Also
32	いって	いっ • て	Go	I say. The
32	に • とって	にとって	To Handle	for
32	好きな	好き • な	Favorite	Like. What
31	いえば	いえ • ば	Speaking	House. The
31	に • ついて	について	To about	about
31	持って	持っ • て	Wait	Have. The
30	ので	の • で	Because	of. so
30	やって	やっ • て	do it	Do it. The
29	んだ	ん • だ	I	Hmm. It is
29	出て	出 • て	Came out	Out The
28	のだろう	の • だろ • う	Would be	of. Right. Cormorant
27	食べて	食べ • て	eat	eat. The
25	入って	入っ • て	go in	Enter. The
24	きて	き • て	come	き The
24	したり	し • たり	Or	Then. Or
24	考えて	考え • て	think	Thoughts. The
22	わけで	わけ • で	For that	Why so
21	に • よって	によって	To Accordingly	By
21	思った	思っ • た	thought	I thought. It was
20	のでしょう	の • でしょ • う	I guess	of. right. Cormorant
20	みて	み • て	look	Only. The
20	住んで	住ん • で	Live	Live. so
19	お • 寺	お寺
19	であった	で • あっ • た
19	言って	言っ • て
18	いけない	いけ • ない
18	そうです	そう • です
18	修学 • 旅行	修学旅行
18	夏 • 休み	夏休み
18	来て	来 • て
18	様々な	様々 • な
18	確かに	確か • に
18	買って	買っ • て
17	であり	で • あり
17	と • いった	といった
17	みた	み • た
17	書いて	書い • て
16	他の	他 • の
16	知って	知っ • て
15	これ • から	これから
15	そこ • で	そこで
15	なければ	なけれ • ば
15	ひと • つ	ひとつ
15	られた	られ • た
15	一 • つ	一つ
15	有名な	有名 • な
15	来た	来 • た
15	買った	買っ • た
14	お • 茶	お茶
14	のである	の • で • ある
14	られて	られ • て
14	んじゃ	ん • じゃ
14	歩いて	歩い • て
14	非常に	非常 • に
13	いった	いっ • た
13	って • いう	っていう
13	ついて	つい • て
13	もの • の	ものの
13	んだろう	ん • だろ • う
13	聞いて	聞い • て
13	見た	見 • た
13	逆に	逆 • に
12	お • 金	お金
12	それ • で	それで
12	できた	でき • た
12	できて	でき • て
12	ようです	よう • です
12	何度	何 • 度
12	好きだ	好き • だ
12	河原 • 町	河原町
11	しよう	しよ • う
11	わけです	わけ • です
11	三 • 条	三条
11	入れて	入れ • て
11	目の前	目 • の • 前
11	聞いた	聞い • た
10	いつでも	いつ • でも
10	お • 気に入り	お気に入り
10	かけて	かけ • て
10	このような	この • よう • な
10	して • る	し • てる
10	せて	せ • て
10	それ • でも	それでも
10	たかった	たかっ • た
10	に • 対して	に対して
10	みたいな	みたい • な
10	よかった	よかっ • た
10	一 • 度	一度
10	今では	今 • で • は
10	作った	作っ • た
10	四 • 条	四条
10	変わって	変わっ • て
10	始めて	始め • て
10	百人一首	百 • 人 • 一 • 首
10	簡単に	簡単 • に
10	置いて	置い • て

Below I have split out the tokens that participate in alternations, to help identify regular patterns across alternations. I've included those with >=50 occurrences.

On the Kuromoji side, the top 9 are single characters, and 8 of them are identified by English Wiktionary as being particles (very briefly: particles are typically small words that provide additional grammatical information). To get a sense of the scope here, this would be like deciding whether "have been" or "look up" should be tokenized as one word or two. Consistency is probably more important than choosing either option.

It seems that Kuromoji is more aggressive about separating particles, and these 8 account for 6,656 of the 17,510 Kuromoji tokens (38.0%) that appear in alternations.

Most Commonly Alternating Tokens
Freq	KNBC	…	Freq	Kuromoji	Wiktionary description
468	と		2286	て	Request maker sentence-final particle.
466	して		1650	た	interrogative personal pronoun
271	いう		567	し	Conjunctive particle
238	ました		526	で	Particle meaning at/ or with
168	に		494	な	Several particle meanings
133	した		463	に	Several particle meanings
131	である		398	の	case marking particle
119	いた		296	う	?
107	れて		272	だ	nominal predicate particle
104	か		256	という
102	なって		240	まし
95	お		235	です
93	のだ		211	い
89	のです		203	よう
88	ように		184	ある
83	る		174	ば
76	だった		168	なっ
71	だろう		166	れ
71	なかった		138	ん
69	いて		127	行っ
67	寺		116	だろ
65	行って		114	だっ
60	ような		109	あっ
57	んです		100	たり
56	なった		84	的
56	見て		82	とか
53	思って		81	たら
51	行った		80	として
			80	思っ
			78	き
			75	なかっ
			72	も
			72	見
			68	好き
			67	てる
			65	でしょ
			60	いっ
			59	そう
			58	でし
			57	ー
			57	使っ
			54	と
			53	しまっ

Tokenization Analysis Summary

My sense is that a significant portion of the disagreements between KNBC and Kuromoji are based on aggressiveness in separating particular parts of speech. There are surely a fair number of errors in the Kuromoji tokenization, but I'm not so worried that I'd want to stop and not proceed to set up the test index in labs.

Further Review

Below is some additional analysis, done as the result of issues brought up by speaker review or elsewhere. In particular, check out the discussion on Phab with whym, starting here.

Some 1- and 2-Character Tokens

In light of the discussion with whym on Phab and the concern that 1- and 2-character tokens are often highly ambiguous and can be grammatical suffixes, I've taken all of the 1- and 2-character tokens in the Groups with no common prefix/suffix above and run some additional analysis. In the tables below we have:

token: the 1- or 2-character token from Groups with no common prefix/suffix above
char_freq: the number of times the token string occurs in my 10,000-article corpus.
omitted: the number of times Kuromoji omitted the string and did not index it.
omit%: omitted/char_freq as a percentage. Values below 95% are bolded.

The remaining columns come in triples, which are:

freq: the number of times the token was normalized in a particular way
%: freq/char_freq as a percentage. Values above 1% are bolded.
norm: the normalized version of the token.
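The percentage columns can be recomputed from the raw counts. The sketch below assumes that omitted equals char_freq minus the sum of the freq columns, which holds for the rows I spot-checked; the function names are invented for illustration.

```python
def pct(part, whole):
    """part/whole as a percentage."""
    return 100.0 * part / whole


def summarize(char_freq, normalizations):
    """normalizations maps each normalized form to its freq for one token."""
    indexed = sum(normalizations.values())
    omitted = char_freq - indexed
    return {
        "omitted": omitted,
        "omit%": pct(omitted, char_freq),
        "norm%": {norm: pct(freq, char_freq) for norm, freq in normalizations.items()},
    }


# Counts for ね, taken from the single-character table.
ne_row = summarize(2891, {"ぬ": 84, "ね": 402, "ねる": 13})
```

For ね this gives 2,392 omitted instances (about 82.74%), matching the table.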

The first table is the single-character tokens, which are generally much more common in the corpus. Many of these are indexed only vanishingly rarely, with 98% or more of the instances in the corpus being omitted from the index. Those with significant rates of indexing are relatively uncommon, occurring hundreds to fewer than ten thousand times in the corpus, rather than one to two hundred thousand times.

token	char_freq	omitted	omit%	..	freq	%	norm	..	freq	%	norm	..	freq	%	norm
か	78465	78418	99.940%		47	0.060%	く
き	35541	33237	93.517%		5	0.014%	きる		2299	6.469%	くる
く	37657	37326	99.121%		220	0.584%	く		73	0.194%	くい		38	0.101%	くる
け	29808	29594	99.282%		8	0.027%	く		206	0.691%	け
こ	59029	58317	98.794%		187	0.317%	くる		494	0.837%	こ		31	0.053%	こい
さ	71158	71122	99.949%		36	0.051%	す
し	169553	169549	99.998%		4	0.002%	す
す	62089	61890	99.679%		199	0.321%	す
ず	9163	4236	46.229%		4927	53.771%	ぬ
た	205766	205698	99.967%		68	0.033%	たい
っ	79632	79281	99.559%		351	0.441%	く
ぬ	905	498	55.028%		407	44.972%	ぬ
ね	2891	2392	82.740%		84	2.906%	ぬ		402	13.905%	ね		13	0.450%	ねる
よ	43157	42158	97.685%		963	2.231%	よ		32	0.074%	よい		4	0.009%	る
り	71178	70666	99.281%		512	0.719%	り
る	218636	218282	99.838%		188	0.086%	り		166	0.076%	る
れ	120917	120879	99.969%		38	0.031%	る
ろ	6659	6423	96.456%		2	0.030%	る		234	3.514%	ろ

The second table is the two-character tokens. I didn't bold higher % values since almost everything would be bolded. There are many fewer occurrences of these strings in the corpus overall, with some occurring fewer than 10 times and none more than 1500 times (compared to hundreds of thousands of occurrences above).

token	char_freq	omitted	omit%	..	freq	%	norm	..	freq	%	norm	..	freq	%	norm	..	freq	%	norm	..	freq	%	norm
きゃ	68	54	79.41%		14	20.59%	く
くっ	113	106	93.81%		5	4.42%	くう		2	1.77%	くる
くら	816	756	92.65%		53	6.50%	くら		2	0.25%	くらい		5	0.61%	くる
くり	678	589	86.87%		79	11.65%	くり		10	1.47%	くる
くる	976	243	24.90%		733	75.10%	くる
くれ	544	249	45.77%		10	1.84%	くる		285	52.39%	くれる
くろ	183	119	65.03%		13	7.10%	くる		51	27.87%	くろい
こい	162	78	48.15%		14	8.64%	くる		56	34.57%	こい		2	1.23%	こう		5	3.09%	こく		7	4.32%	こぐ
ざり	23	19	82.61%		4	17.39%	ぬ
ざる	160	14	8.75%		4	2.50%	ざる		142	88.75%	ぬ
ざれ	5	4	80.00%		1	20.00%	ぬ
しゃ	488	407	83.40%		81	16.60%	す
ずん	14	9	64.29%		5	35.71%	ぬ
たい	1497	480	32.06%		1015	67.80%	たい		2	0.13%	たく
たき	129	85	65.89%		12	9.30%	たい		21	16.28%	たき		11	8.53%	たく
たく	522	339	64.94%		143	27.39%	たい		40	7.66%	たく
たし	1103	1092	99.00%		4	0.36%	たい		7	0.63%	たす
てぇ	3	0	0.00%		3	100.00%	たい
とう	828	653	78.86%		20	2.42%	たい		155	18.72%	とう
りゃ	16	15	93.75%		1	6.25%	る
るる	47	11	23.40%		13	27.66%	る		23	48.94%	るる
るれ	4	3	75.00%		1	25.00%	る

Overall, how these 1- and 2-character tokens are indexed is still a concern, but the numbers lean towards it not being a gigantic problem.

Non-Indexed Characters

I’ve noticed that the analyzer drops a lot of characters and just doesn’t index them. (This isn’t a disaster: we have the “text” field with the analyzed text, but also the “plain” field, which is generally unchanged, so exact matches are always possible.)

As an example of text being dropped, I analyzed this sentence fragment. The characters in [square brackets] are not indexed. Running the text through Google translate, there don’t seem to be any egregious errors—lots of function words (or at least things that get translated to function words) are getting omitted.

グレート [・]アトラクター [が] 数億光年 [に] 渡る宇宙 [の] 領域内 [にある] 銀河 [とそれが] 属する銀河団 [の] 運動 [に] 及ぼす影響 [の] 観測 [から] 推定 [されたものである。]

To Do

✓ Get some native/fluent speaker review of the groupings above (Done! Thanks whym!)
✓ Test tokenization independently (Done—see above)
✓ Figure out what to do about BM25 (Done—as with Chinese, we'll enable it in the labs version and if it is well received, we'll go enable it in production)
- ✗ Enable BM25 for Japanese in prod if the labs review goes well.
  - It didn't go well...
✓ Set up one or more of the configurations in labs (Done: http://ja-wp-kuromoji-relforge.wmflabs.org/w/index.php?search= )
- ✓ Post request for feedback to the Village Pump (Done: got feedback—check it out!)
✗ Do the deployment + reindexing dance! ♩♫♩

Abandon Ship!

Unfortunately, the user/speaker review from the Village Pump didn't go well. There were some problems with scoring and configuration in Labs, but even with that settled the results were often not as good, and often included lots of extraneous results. (Extra results probably would have been okay if better results ended up at the top of the list, but that didn't happen.)

It's possible that better scoring and weighting would give better results, but there's no simple, obvious fix to try, and careful tuning would require significant time and significant help from a fluent speaker. Since we weren't specifically trying to fix a problem with Japanese, just offering a potential improvement, it's okay to abandon this change.

We can come back to Kuromoji or another analyzer in the future if it offers better accuracy, or if we think it would fix a problem for the Japanese language wikis.