User:TJones (WMF)/Notes/Kuromoji Analyzer Analysis

June-July 2017 — See TJones_(WMF)/Notes for other projects. See also T166731. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro

The Kuromoji Japanese language analyzer has lots of configuration options, can be unpacked for custom configuration, and is supported by Elastic. Seems like the right place to start.

Baseline (CJK) vs Kuromoji

The Kuromoji analyzer produces far fewer tokens than the baseline CJK analyzer on a 5K jawiki article sample:

  • 5,515,225 (CJK)
  • 2,379,072 (Kuromoji)

And many fewer pre-/post-analysis token types:

  • 310,339 / 305,770 (CJK)
  • 139,282 / 127,983 (Kuromoji)
  • (there are fewer post-analysis types because some merge, like Apple and apple)
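Counts like these are easy to compute once the analyzer output is available as (original token, normalized token) pairs. The sketch below only illustrates the bookkeeping; `count_tokens_and_types` and the sample pairs are hypothetical, not the script used for the numbers above.

```python
def count_tokens_and_types(pairs):
    """Return (token count, pre-analysis types, post-analysis types).

    `pairs` is an iterable of (surface_form, normalized_form) tuples,
    one per token emitted by the analyzer.
    """
    pairs = list(pairs)
    pre_types = {surface for surface, _ in pairs}
    post_types = {normalized for _, normalized in pairs}
    return len(pairs), len(pre_types), len(post_types)


# Hypothetical analyzer output: "Apple" and "apple" merge after lowercasing.
sample = [("Apple", "apple"), ("apple", "apple"), ("Tokyo", "tokyo")]
print(count_tokens_and_types(sample))
```

Here three tokens give three pre-analysis types but only two post-analysis types, which is the same merging effect noted for Apple and apple.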

Many non-Japanese, non-Latin characters are not handled well:

  • Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, IPA, Mongolian, Myanmar, Thaana, Thai, and Tibetan are removed.
  • Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
  • Cyrillic words are split on our old friend, the combining acute accent.
  • Greek is treated very oddly, with some characters removed, and words sometimes split into individual letters, sometimes not. I can't figure out the pattern.

Lots of Japanese tokens (Hiragana, Katakana, and Ideographic characters) also change, but that's because the baseline CJK tokenizer works on bigrams rather than actually trying to segment words.

Fullwidth numbers starting with 1 are split character-by-character (1933 → 1 + 9 + 3 + 3 rather than 1933). Mixed fullwidth and halfwidth numbers are inconsistent, depending not only on where the fullwidth and halfwidth forms are, but also which ones they are. Leading fullwidth 1 seems to like to split off.

  • 1933 → 1933
  • 1933 → 1 + 933
  • 1933 → 1 + 933
  • 1933 → 1 + 933
  • 2933 → 2933
  • 2933 → 2933
  • 2933 → 2933
  • 2933 → 2933

That seems odd, and while not all of these mixed patterns occur in my sample, some do.

Kuromoji vs Kuromoji Unpacked

I unpacked the Kuromoji analyzer according to the docs on the Elastic website. I disabled our usual automatic upgrade of lowercase filter to icu_normalizer filter to focus on the effect of unpacking.
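For reference, an unpacked configuration might look roughly like the settings below, written as a Python dict that serializes to the JSON Elasticsearch expects. The component list is my reading of the Elastic documentation for the kuromoji analyzer and should be checked against the plugin version in use; the analyzer name "text" is just a placeholder.

```python
import json

# Unpacked equivalent of the built-in "kuromoji" analyzer, as I understand
# the Elastic documentation. "text" is a hypothetical analyzer name.
UNPACKED_KUROMOJI = {
    "analysis": {
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "kuromoji_tokenizer",
                "filter": [
                    "kuromoji_baseform",
                    "kuromoji_part_of_speech",
                    "cjk_width",
                    "ja_stop",
                    "kuromoji_stemmer",
                    "lowercase",
                ],
            }
        }
    }
}

print(json.dumps(UNPACKED_KUROMOJI, indent=2))
```

Having the pieces spelled out like this is what makes it possible to swap, drop, or insert individual filters in the experiments that follow.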

On a 1K jawiki article sample, the number of tokens was very similar:

  • 520,971 (Kuromoji)
  • 525,325 (unpacked)

Also, the numbers of pre- and post-analysis types are similar:

  • 59,350 / 55,448 (Kuromoji)
  • 59,643 / 55,716 (unpacked)

Some non-Japanese, non-Latin characters are treated differently:

  • Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, Mongolian, Myanmar, Thaana, Thai, and Tibetan are now preserved!
  • IPA characters are preserved, but words are split on them: dʒuglini → d + ʒ + uglini.

Others, the same:

  • Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
  • Cyrillic words are split on our old friend, the combining acute accent.
  • Greek and full-width numbers are treated oddly, as before.

Overall, it's a big improvement for non-Japanese, non-Latin character handling.

Lowercase vs ICU Normalizer

I re-enabled the lowercase-to-icu_normalizer upgrade. The differences in the 1K sample were very slight and expected:

  • ², ⑥, and ⑦ were converted to 2, 6, and 7.
  • Greek final-sigma ς became regular sigma σ.
  • German ß became ss.
  • Pre-composed Roman numerals (Ⅲ ← that's a single character) were decomposed (III ← that's three i's).

Those are all great, so we'll leave that on in the unpacked version for now.

Fullwidth Numbers

I did some experimentation, and the problem with the fullwidth numbers is coming from the tokenizer. I've added a custom character filter to convert the fullwidth numbers to halfwidth numbers before the tokenizer, which solves the weird inconsistencies.
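A minimal sketch of that conversion, assuming only the ten fullwidth digits (U+FF10 through U+FF19) need mapping. The Python function shows the effect on text; the dict shows how the same rules could be expressed for an Elasticsearch "mapping" char filter. The filter name `fullwidth_numbers` is made up for illustration and is not necessarily what the real config uses.

```python
# Fullwidth digits U+FF10..U+FF19 map one-to-one onto ASCII 0-9.
FULLWIDTH_DIGITS = "".join(chr(0xFF10 + i) for i in range(10))
TO_HALFWIDTH = str.maketrans(FULLWIDTH_DIGITS, "0123456789")


def normalize_digits(text):
    """Convert fullwidth digits to halfwidth; leave everything else alone."""
    return text.translate(TO_HALFWIDTH)


# The same mapping written as rules for a "mapping" char filter.
# "fullwidth_numbers" is a hypothetical filter name.
CHAR_FILTER = {
    "fullwidth_numbers": {
        "type": "mapping",
        "mappings": [f"{fw} => {hw}" for hw, fw in enumerate(FULLWIDTH_DIGITS)],
    }
}

# Fullwidth 1933 written with escapes so the example survives copy and paste.
print(normalize_digits("\uff11\uff19\uff13\uff13"))
```

Because the char filter runs before the tokenizer, the tokenizer only ever sees halfwidth digits, so mixed-width numbers can no longer be split inconsistently.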

It does have one semi-undesirable side effect: months, like 4月 (4th month = "April"), are split into two tokens, 4 + 月, if a halfwidth number is used. While I think indexing ４月 or 4月 as a unit is better, that only happens for the fullwidth version, so while this is slightly worse, it is also more consistent, in that ４月 and 4月 will be indexed the same.

Bits and Bobs

I tested a number of other options available with Kuromoji. These are the ones that didn't pan out.

Kuromoji Tokenizer Modes

The Kuromoji tokenizer has several modes and a couple of expert parameters (see documentation). I didn't want to dig into all of it, but the "search" mode seemed interesting.

The "normal" mode returns compounds as single terms. So, 関西国際空港 ("Kansai International Airport") is indexed as just 関西国際空港.

In "search" mode, 関西国際空港 is indexed as four terms, the full compound (関西国際空港) and the three component parts (関西, 国際, 空港—"Kansai", "International", "Airport").

It turns out that "search" is the default, even though "normal" is listed first and sounds like it might be the normal mode of operation.

There is also an "extended" mode that breaks up unknown tokens into unigrams. (For comparison, the CJK analyzer breaks up everything into bigrams.)

The "search" mode seems better, so we'll stick with that.
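Switching between modes for testing only requires changing one tokenizer setting. The helper below is a small sketch of that; `kuromoji_tokenizer_settings` and the tokenizer name `kuromoji_custom` are invented for illustration, while `kuromoji_tokenizer` and its `mode` parameter come from the plugin.

```python
VALID_MODES = ("normal", "search", "extended")


def kuromoji_tokenizer_settings(mode="search"):
    """Build settings for a custom kuromoji_tokenizer with the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # "kuromoji_custom" is a hypothetical tokenizer name.
    return {
        "tokenizer": {
            "kuromoji_custom": {"type": "kuromoji_tokenizer", "mode": mode}
        }
    }


for mode in VALID_MODES:
    print(kuromoji_tokenizer_settings(mode))
```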

Part of Speech Token Filter

I disabled the kuromoji_part_of_speech token filter, and it had no effect on a 1k sample of jawiki articles. According to the docs, it filters based on part of speech tags, but apparently none are configured, so it does nothing. Might as well leave it disabled if it isn't doing anything.

Iteration Mark Expansion

There's an option to expand iteration marks (e.g., 々), which indicate that the previous character should be repeated. However, sometimes the versions of the word with and without the iteration marks have different meanings. More importantly, expanding the iteration mark can change the tokenization—usually resulting in a word being split into pieces. The iteration mark seems much less ambiguous, so I think we should leave it alone.
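To make the expansion concrete, here is a deliberately simplified sketch that handles only the kanji iteration mark 々 repeating a single preceding character. The real Kuromoji option is more involved (it also covers kana iteration marks and multi-character repetition), so treat `expand_iteration_marks` as an illustration rather than a reimplementation.

```python
ITERATION_MARK = "\u3005"  # 々


def expand_iteration_marks(text):
    """Replace each 々 with the character that precedes it.

    Simplification: handles only single-character repetition of 々.
    """
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("時々"))
```

The expanded string is what the tokenizer would see, which is why expansion can change where word boundaries fall.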

Stop Words

A small number of types (~125) but a huge number of tokens (278,572 stop words vs 525,320 non-stop words, or 34.6%) are filtered as stop words. In a quick survey, the top stop words all make sense.

Prolonged Sound Mark "Stemmer"

The kuromoji_stemmer token filter isn't really a stemmer (the kuromoji_baseform token filter does some stemming for verbs and adjectives), it just strips the Prolonged Sound Mark from the ends of words. A quick test disabling it shows that's exactly what it does.
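A sketch of that behavior, under two assumptions that I believe match the filter's documentation but have not verified against the source: it applies only to all-katakana tokens, and only to tokens at least `minimum_length` characters long (default 4).

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(token):
    """True if every character is in the Katakana block (U+30A0..U+30FF)."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def strip_prolonged_sound_mark(token, minimum_length=4):
    """Drop one trailing ー from katakana tokens of at least minimum_length."""
    if (
        len(token) >= minimum_length
        and token.endswith(PROLONGED_SOUND_MARK)
        and is_katakana(token)
    ):
        return token[:-1]
    return token


for word in ("サーバー", "コピー"):
    print(word, "->", strip_prolonged_sound_mark(word))
```

With these assumptions the four-character サーバー loses its final mark, while the three-character コピー is left alone.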

This seems to be a mark used in loanwords to indicate long vowel sounds. I'm not sure why you'd want to remove it at the end of a word, but it often doesn't make any difference in Google translate, so it seems to be semi-optional.

The removal is on by default, so I'll leave it that way.

Japanese Numeral Conversion

The kuromoji_number token filter normalizes Japanese numerals (〇, 一, etc.) to Arabic numerals (0, 1, etc.).

However it is wildly aggressive and ignores spaces, commas, periods, leading zeros, dashes, slashes, colons, number signs, and many more. So, 1, 2, 3. : 456 -7 #8 is tokenized as 12345678 and 一〇〇.一九 #七 九:〇 as 10019790. It also tokenizes 0.001 as just 1.

Unfortunately, the rules for parsing properly formatted Japanese numerals can't be implemented as simple digit-by-digit substitution, so a simple char filter can't solve this problem.

Fortunately, not using it is no worse than the current situation.
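To illustrate why digit-by-digit substitution falls short, the sketch below compares a naive substitution with a small parser that understands the multiplier characters 十, 百, and 千. It is an assumption-laden toy: it only handles numerals below 10,000 and is not how kuromoji_number is implemented.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def substitute_digits(text):
    """Naive approach: swap each kanji digit for its Arabic digit."""
    return "".join(str(DIGITS[ch]) if ch in DIGITS else ch for ch in text)


def parse_numeral(text):
    """Parse a kanji numeral below 10,000 (no 万 or larger units)."""
    if not text:
        raise ValueError("empty numeral")
    if all(ch in DIGITS for ch in text):
        # Purely positional notation such as 一〇〇.
        return int(substitute_digits(text))
    total = 0
    pending = 0
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit (e.g. 百) means one of that unit.
            total += (pending or 1) * UNITS[ch]
            pending = 0
        else:
            raise ValueError(f"not a numeral character: {ch!r}")
    return total + pending


print(substitute_digits("二十三"), parse_numeral("二十三"))
```

Substitution works for positional forms like 一〇〇, but for 二十三 it leaves 十 untouched, while the parser has to read the whole numeral to arrive at 23. That whole-numeral context is exactly what a character-level filter does not have.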

Groups for Review

Below are some groupings for review by fluent speakers of Japanese. These are tokens that are indexed together, so searching for one should find the others. The format is <normalized_form> - [<count> <token>].... The <normalized_form> is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the analyzer and the tokens being considered, so take it as a hint to the meaning of the group, but not definitive. The <token> is a token found by the language analyzer (more or less a "word" but not exactly—it could be a grammatical particle, etc.) and <count> is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters.
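For anyone wanting to reproduce listings in this format, a rough sketch follows. It assumes the analyzer output is available as (token, normalized form) pairs; `format_groups` is a hypothetical helper, not the actual analysis tool.

```python
from collections import Counter, defaultdict


def format_groups(pairs):
    """Group tokens by normalized form and render one line per group."""
    groups = defaultdict(Counter)
    for token, normalized in pairs:
        groups[normalized][token] += 1
    lines = []
    for normalized in sorted(groups):
        members = "".join(
            f"[{count} {token}]"
            for token, count in sorted(groups[normalized].items())
        )
        lines.append(f"{normalized} - {members}")
    return lines


# Hypothetical analyzer output for three tokens.
pairs = [("乗っ", "乗る"), ("乗る", "乗る"), ("乗っ", "乗る")]
print("\n".join(format_groups(pairs)))
```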

Groups with no common prefix/suffix

These are the groups that don't have a common prefix or suffix across all tokens in the group. That's not necessarily a bad thing (good, better, best is a fine group in English, for example)—but it's worth looking at them just to be sure. I've filtered out the half-width/full-width variants that were obvious to me.

  • く - [47 か][14 きゃ][220 く][8 け][351 っ]
  • くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
  • す - [36 さ][4 し][81 しゃ][199 す]
  • たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
  • ぬ - [4 ざり][142 ざる][1 ざれ][4927 ず][5 ずん][407 ぬ][84 ね]
  • り - [512 り][188 る]
  • る - [4 よ][1 りゃ][166 る][13 るる][1 るれ][38 れ][2 ろ]

Things are a bit complicated by the fact that "る" is in two different groups above. A number of tokens are normalized in multiple ways, which appears to be context-dependent.

Examples of where these tokens come from in the text are available on a separate page.
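The test for "no common prefix or suffix" can be sketched in a few lines. This is an assumed reconstruction of the criterion, using `os.path.commonprefix`, which works character by character on any list of strings; `has_common_affix` is a made-up name.

```python
import os


def has_common_affix(tokens):
    """True if all tokens share a non-empty common prefix or suffix."""
    tokens = list(tokens)
    prefix = os.path.commonprefix(tokens)
    # Reverse each token so a common suffix shows up as a common prefix.
    suffix = os.path.commonprefix([t[::-1] for t in tokens])
    return bool(prefix or suffix)


print(has_common_affix(["乗っ", "乗ら", "乗る"]))
print(has_common_affix(["き", "く", "くる", "こ"]))
```

Groups for which this returns False are the ones singled out for review in this section.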

Largest Groups

These are the groups with the largest number of unique tokens. Again, these aren't necessarily wrong, but it is good to review them. They are actually pretty small groups compared to other language analyzers. The first two are duplicates from above.

  • たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
  • くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]

These are not duplicates:

  • てる - [115 て][5 てっ][1 てよ][7 てら][3 てり][1 てりゃ][282 てる][1 てれ][11 てん]
  • よい - [32 よ][217 よい][22 よかっ][2 よから][1 よかれ][19 よき][138 よく][6 よけれ][134 よし]
  • 悪い - [33 悪][250 悪い][1 悪う][27 悪かっ][1 悪から][1 悪き][143 悪く][1 悪けれ][5 悪し]
  • 良い - [75 良][383 良い][47 良かっ][2 良かれ][38 良き][265 良く][1 良けりゃ][2 良けれ][16 良し]

Examples of where these tokens come from in the text are available on a separate page.

Random Groups

Below are 50 random groups. I filtered out groups that consisted of just a deleted Prolonged Sound Mark (ー), since the "stemmer" (see above) is supposed to do that.

Examples of where these tokens come from in the text are available on a separate page.

  • かき消す - [1 かき消し][2 かき消す][1 かき消そ]
  • きつい - [6 きつ][7 きつい][1 きつう][1 きつき][6 きつく]
  • こうじる - [41 こうじ][3 こうじろ]
  • こじれる - [6 こじれ][1 こじれる]
  • つける - [547 つけ][11 つけよ][179 つける][3 つけれ][9 つけろ][2 つけん]
  • てんじる - [3 てんじ][1 てんじん]
  • のく - [1 のい][12 のき][3 のく][2 のこ]
  • ぶつ - [2 ぶた][2 ぶち][2 ぶちゃ][10 ぶっ][80 ぶつ][1 ぶて]
  • まぎる - [1 まぎ][1 まぎら][1 まぎる][1 まぎれ][1 まぎん]
  • もてる - [12 もて][3 もてる]
  • 丸っこい - [2 丸っこい][1 丸っこく]
  • 乗り換える - [24 乗り換え][19 乗り換える]
  • 乗る - [16 乗][260 乗っ][8 乗ら][67 乗り][92 乗る][2 乗れ][1 乗ろ]
  • 仕上げる - [23 仕上げ][7 仕上げる]
  • 任じる - [111 任じ][4 任じる]
  • 任す - [75 任さ][1 任し]
  • 信じる - [179 信じ][5 信じよ][29 信じる][1 信じろ]
  • 働かせる - [9 働かせ][3 働かせる]
  • 助かる - [9 助かっ][4 助から][10 助かる][1 助かれ]
  • 取りやめる - [29 取りやめ][1 取りやめる]
  • 取り戻す - [3 取り戻さ][85 取り戻し][50 取り戻す][2 取り戻せ][12 取り戻そ]
  • 唸る - [2 唸ら][1 唸り][3 唸る][1 唸れ]
  • 太い - [81 太][25 太い][32 太く]
  • 差し入れる - [1 差し入れ][2 差し入れる]
  • 弱まる - [7 弱まっ][7 弱まり][6 弱まる]
  • 従える - [24 従え][2 従えよ][5 従える]
  • 思いとどまる - [3 思いとどまら][1 思いとどまり][8 思いとどまる]
  • 承る - [164 承][1 承り]
  • 押さえ込む - [1 押さえ込ま][2 押さえ込み][6 押さえ込む][1 押さえ込ん]
  • 振る舞う - [6 振る舞い][17 振る舞う][4 振る舞っ]
  • 携わる - [98 携わっ][3 携わら][23 携わり][48 携わる][1 携われ]
  • 摂る - [32 摂][4 摂っ][3 摂ら][4 摂り][10 摂る][1 摂れ]
  • 暴く - [4 暴い][12 暴か][3 暴き][5 暴く][1 暴こ]
  • 癒える - [11 癒え][1 癒える]
  • 策する - [1 策し][1 策する]
  • 脱ぐ - [10 脱い][2 脱が][4 脱ぎ][4 脱ぐ]
  • 致す - [1 致し][1 致す]
  • 虐げる - [12 虐げ][3 虐げる]
  • 裁く - [1 裁い][6 裁か][1 裁き][10 裁く]
  • 要す - [3 要さ][42 要し][2 要す]
  • 見て取れる - [2 見て取れ][5 見て取れる]
  • 解く - [28 解い][37 解か][41 解き][35 解く][5 解こ]
  • 言い表す - [2 言い表さ][2 言い表し][1 言い表す]
  • 試す - [6 試さ][15 試し][15 試す][3 試そ]
  • 謀る - [5 謀っ][1 謀ら][2 謀り][2 謀る]
  • 譲り渡す - [1 譲り渡さ][3 譲り渡し]
  • 護る - [3 護っ][8 護り][8 護る][1 護れ][1 護ろ]
  • 起こす - [32 起こさ][400 起こし][155 起こす][2 起こせ][9 起こそ]
  • 遠い - [113 遠][77 遠い][2 遠かっ][7 遠き][58 遠く][1 遠し]
  • 闘う - [7 闘い][42 闘う][1 闘え][1 闘お][14 闘っ][4 闘わ]

Longest Tokens

Below are the tokens that are 25 characters or more.

The longest tokens in Latin characters are all reasonable—the first one is the English/basic Latin alphabet, while the next two are transposed versions of the alphabet (I smell a cipher). There are a few long German words, some chemical names, some English words run together as part of a URL, and a really long string that looks to be a transliteration of Sanskrit translated to Tibetan and back. (The corresponding English Wikipedia article breaks it up into multiple words.)

  • abcdefghijklmnopqrstuvwxyz
  • zabcdefghijklmnopqrstuvwxy
  • hijklmnopqrstuvwxyzabcdefg
  • luftwaffenausbildungskommando
  • polizeidienstauszeichnung
  • staedtischermusikvereinduesseldorf
  • chlorobenzalmalononitrile
  • dimethylmethylideneammonium
  • dinitrosopentamethylenetetramine
  • glycerylphosphorylcholine
  • hydroxydihydrochelirubine
  • hydroxydihydrosanguinarine
  • hydroxyphenylacetaldehyde
  • methylenedioxypyrovalerone
  • diggingupbutchandsundance
  • mahāvairocanābhisaṃbodhivikurvitādhiṣṭhānavaipulyasūtrendrarāja

The long Thai tokens look to be noun phrases in Thai, written without spaces in the usual Thai way. (The Japanese language analyzer isn't really supposed to know what to do with those.)

  • ที่ทําการปกครองอําเภอเทิง
  • ศูนย์เทคโนโลยีสารสนเทศและการสื่อสาร

The longest Japanese tokens can be broken up into two groups. The first group is in katakana. These are long tokens that are indexed both as one long string and as smaller parts (see "Kuromoji Tokenizer Modes" above). The breakdown of the tokens is provided beneath each long token. Notice that some of the sub-tokens are still pretty long ("クリテリウムイベント", "レーシングホールオブフェイムステークス", "カロッツェリアサテライトクルージングシステム").

  • ジャパンカップサイクルロードレースクリテリウムイベント
    • ジャパン - カップ - サイクル - ロードレース - クリテリウムイベント
  • ナショナルミュージアムオブレーシングホールオブフェイムステークス
    • ナショナル - ミュージアム - オブ - レーシングホールオブフェイムステークス
  • パイオニアカロッツェリアサテライトクルージングシステム
    • パイオニア - カロッツェリアサテライトクルージングシステム
  • パシフィックゴルフグループインターナショナルホールディングス
    • パシフィック - ゴルフ - グループ - インターナショナル - ホールディングス

The second group of Japanese tokens are all hiragana, and a lot of them start with "ょ". When submitted to the analyzer, these all come back as single tokens with no alternate breakdown. Based on Google translate, I think most or all of these are errors (I wouldn't be shocked if a few turned out to be the Japanese equivalent of antidisestablishmentarianism and supercalifragilisticexpialidocious).

  • ざいさんぎょうだいじんのしょぶんにかかるしんさきじゅんとう
  • ゃくかちょうみたてほんぞうふでつむしこえのとりどり
  • ゅういっかいせんばつちゅうとうがっこうやきゅうたいかい
  • ゅうとうがっこうゆうしょうやきゅうたいかいしこくたいかい
  • ょうがいをりゆうとするさべつのかいしょうにかんするほうりつ
  • ょうさつじんこういをおこなっただんたいのきせいにかんするほうりつ
  • ょうせいほうじんこくりつびょういんきこうかながわびょういん
  • ょうせいほうじんこくりつびょういんきこうもりおかびょういん
  • ょうせんとうきょくによってらちされたひがいしゃとうのしえんにかんするほうりつ
  • ょくかんれんさんぎょうろうどうくみあいそうれんごう
  • ょけんおよびぎじゅつじょうのちしきのこうりゅうをよういにするためのにほんこくせいふと
  • ょせいかんりょうとかぞくのよんひゃくろくじゅうごにち

One way to deal with these very long tokens is to change the tokenizer mode to "extended", which breaks up unknown words into unigrams (see "Kuromoji Tokenizer Modes" above). This would improve recall, but at the expense of precision.

For now, I say we let the pendulum swing in the other direction (away from indexing bigrams), but keep this potential problem in the back of our minds.

I still need to test the general tokenization of the language analyzer. If that's generally very good, I'll stick with this suggestion. If it's not great, we can reconsider the unigrams for unknown tokens.

Kuromoji Tokenization Analysis

One of the big benefits of a Japanese-specific language analyzer is better tokenization. Written Japanese generally doesn't use spaces, so the default CJK analyzer breaks everything up into bigrams (i.e., overlapping sets of two characters). Trying to find actual words in the text is best, if you can do a decent job.
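As a concrete picture of the bigram approach, here is a minimal sketch of overlapping two-character windows. It ignores the script detection and other details of the real CJK analyzer; `cjk_bigrams` is just an illustrative name.

```python
def cjk_bigrams(text):
    """Return overlapping two-character windows over the text."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("関西国際空港"))
```

For 関西国際空港 this yields five bigrams, two of which (西国 and 際空) straddle word boundaries and do not correspond to any of the actual words 関西, 国際, or 空港.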

A rough parallel in English would be to break English text up by some unit of meter (apologies to any poets, I'm gonna just wing it). So "the president of the united states" might be broken up into "the presi", "president", "dent of", "of the", "the unit", "united", and "ed states". You can't apply stop words (i.e., ignoring "the" and "of") and the matches you do get are not guaranteed to be what you intended. So, "..independent of the presiding consul's concerns, the unit belonging to Ed States, United Airlines president ..." matches all the pieces, relatively close together, but isn't at all about the president of the US.

Tools and Corpora

Using the SIGHAN framework I used to test Chinese segmentation, I set out to analyze the Japanese segmentation using Kuromoji.

I was able to extract tokenization information from the much more heavily annotated KNBC corpus. There are 4,186 sentences in the corpus. I had to drop a handful of them because my extracted tokenization did not match the original sentence after "de-tokenizing" it. Some were missing, some mismatched. My final corpus had 4,178 sentences, so I don't think there is any major bias to the dropped sentences.

From the list of Longest Tokens above, we know that Kuromoji's tokenization is not 100% perfect—at least a small number of errors are expected.

I disabled stop words and tested the extended, search, and normal tokenizer modes (see Kuromoji Tokenizer Modes above). The extended mode is expected to have problems since unknown strings are split into individual characters. The search mode also has problems because it includes multiple tokens for the same string (strings recognized as compounds are indexed both as one longer string, and as constituent parts).

Punctuation Problems

I also had to deal with a somewhat unexpected problem of punctuation. The Kuromoji analyzer, rightly, drops punctuation and non-word symbols it isn't going to index. My script that extracts the tokenization fills in the dropped bits from the original string. That's mostly okay, but sometimes it causes problems when there are multiple punctuation and other symbols in a row. For example, when tokenizing "「そうだ 、京都 、行こう。」", Kuromoji returned the tokens そう, だ, 京都, 行こ, and う. My script filled in 「, 、, 、, and 。」. The problem is that the KNBC annotations split 。」 into separate tokens: 。 and 」. Disagreement here isn't terribly important, since we aren't indexing those tokens.
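The fill-in step can be sketched as follows. This is a guess at the general approach rather than the actual script, and it assumes the tokens are surface forms that appear verbatim, in order, in the original text. Whitespace in the gaps is discarded.

```python
def fill_dropped(text, tokens):
    """Interleave analyzer tokens with the non-space text dropped between them."""
    filled = []
    pos = 0
    for token in tokens:
        start = text.index(token, pos)  # raises ValueError if the token is missing
        gap = "".join(text[pos:start].split())
        if gap:
            filled.append(gap)
        filled.append(token)
        pos = start + len(token)
    tail = "".join(text[pos:].split())
    if tail:
        filled.append(tail)
    return filled


sentence = "「そうだ 、京都 、行こう。」"
kuromoji_tokens = ["そう", "だ", "京都", "行こ", "う"]
print(fill_dropped(sentence, kuromoji_tokens))
```

Note that the trailing 。」 comes back as a single filled-in chunk, which is the source of the mismatch with the KNBC annotations described above.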

I wrote a script to identify correspondences in tokenization, and used it to identify where I should manually munge tokenization differences for punctuation and other symbols (e.g., "−−>" vs "− − >", "…。" vs "… 。", etc.). I also normalized a small number of fullwidth spaces.
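One way such correspondences can be found is to walk both tokenizations in parallel and collect the stretches where the token boundaries disagree. The sketch below assumes both token lists concatenate to the same string; `find_alternations` is a hypothetical name and this is not necessarily how the original script works.

```python
def find_alternations(gold, predicted):
    """Return (gold_chunk, predicted_chunk) pairs where segmentation differs."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("tokenizations do not cover the same text")
    alternations = []
    i = j = 0
    while i < len(gold) and j < len(predicted):
        g_chunk, p_chunk = [gold[i]], [predicted[j]]
        g_len, p_len = len(gold[i]), len(predicted[j])
        i += 1
        j += 1
        # Extend whichever side is shorter until the boundaries line up again.
        while g_len != p_len:
            if g_len < p_len:
                g_chunk.append(gold[i])
                g_len += len(gold[i])
                i += 1
            else:
                p_chunk.append(predicted[j])
                p_len += len(predicted[j])
                j += 1
        if g_chunk != p_chunk:
            alternations.append((g_chunk, p_chunk))
    return alternations


print(find_alternations(["京都", "に", "行って"], ["京都", "に", "行っ", "て"]))
```

Counting the resulting pairs across a corpus gives a frequency list of alternations, which is useful both for spotting punctuation mismatches and for the discrepancy analysis later on.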

Recall and Precision Results

It turns out that the differences are pretty small, and recall is roughly 82-83% and precision 72-76%, depending on the tokenizer:

             Recall  Precision  F1
extended     81.8%   72.0%      76.6%
search       82.8%   75.1%      78.7%
normal       82.4%   75.4%      78.8%
norm munged  83.3%   76.1%      79.5%

These are okay results, but not great.
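For readers unfamiliar with how segmentation is scored, the sketch below shows the usual span-based calculation: a predicted token counts as correct only if a gold token covers exactly the same character span. This is my understanding of SIGHAN-style scoring rather than the framework's actual code, and it assumes both token lists are non-empty.

```python
def spans(tokens):
    """Convert a token list to a set of (start, end) character spans."""
    result = set()
    pos = 0
    for token in tokens:
        result.add((pos, pos + len(token)))
        pos += len(token)
    return result


def score(gold_tokens, predicted_tokens):
    """Return (recall, precision, F1) for one segmentation against gold."""
    gold, predicted = spans(gold_tokens), spans(predicted_tokens)
    correct = len(gold & predicted)
    recall = correct / len(gold)
    precision = correct / len(predicted)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f1


# Toy example: gold splits と / いう, the prediction keeps という together.
print(score(["と", "いう", "話"], ["という", "話"]))
```

In the toy example only 話 matches, so recall is 1/3 and precision is 1/2. A systematic disagreement over one common pattern can therefore pull both numbers down without any "real" error being involved.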

However, there are some systematic differences between the KNBC tokenization and the Kuromoji tokenization.

Common Tokenization Discrepancies

Below is a list of the most common alternations (10 or more occurrences) found between the KNBC and Kuromoji tokenizations. I've also provided Google translations for the ones with 20 or more occurrences. (I've added a bullet, •, between tokens because I have trouble seeing the spaces in fullwidth text, and I suspect others who don't read Japanese may as well.)

The Google translations are far from definitive and comments from speakers of Japanese would be helpful, but the translations do hint that the most common alternations are not really content words. I've bolded the ones that seem to have some content. The rest account for 3,010 out of 9,214 alternations (32.6%, including those with <10 occurrences, which are not shown here).

Alternations Tokenization Google Translation
Freq KNBC Kuromoji KNBC Kuromoji
374 して し • て do it Then. The
256 と • いう という When. Say It is called
236 ました まし • た Was Better. It was
131 した し • た did Then. It was
131 である で • ある Is so. is there
119 いた い • た Was there Yes. It was
103 れて れ • て Have been Re The
101 なって なっ • て Become Become The
93 のだ の • だ It was of. It is
89 のです の • です It is of. is
88 ように よう • に like Looks like. Into
80 と • か とか When. Or And
80 と • して として When. do it. As
76 だった だっ • た was So. It was
71 だろう だろ • う right Right. Cormorant
71 なかった なかっ • た There was not Not. It was
69 いて い • て Stomach Yes. The
63 行って 行っ • て go Go. The
60 ような よう • な like Looks like. What
56 見て 見 • て look You see. The
55 なった なっ • た became Become It was
55 んです ん • です It is Hmm. is
52 思って 思っ • て I thought to I thought. The
50 行った 行っ • た went Go. It was
47 れた れ • た Was done Re It was
47 使って 使っ • て Use Use. The
45 でした でし • た was It is. It was
43 清水 • 寺 清水寺 Shimizu. temple Kiyomizudera
41 きた き • た Came き It was
41 しまった しまっ • た Oops Oops. It was
40 的に 的 • に Specifically Target. Into
39 でしょう でしょ • う Oh, yeah. right. Cormorant
39 なくて なく • て I do not need it. Not. The
37 あった あっ • た there were Ah. It was
36 あって あっ • て There Ah. The
34 的な 的 • な Sophisticated Target. What
33 でも で • も But so. Also
32 いって いっ • て Go I say. The
32 に • とって にとって To Handle for
32 好きな 好き • な Favorite Like. What
31 いえば いえ • ば Speaking House. The
31 に • ついて について To about about
31 持って 持っ • て Wait Have. The
30 ので の • で Because of. so
30 やって やっ • て do it Do it. The
29 んだ ん • だ I Hmm. It is
29 出て 出 • て Came out Out The
28 のだろう の • だろ • う Would be of. Right. Cormorant
27 食べて 食べ • て eat eat. The
25 入って 入っ • て go in Enter. The
24 きて き • て come き The
24 したり し • たり Or Then. Or
24 考えて 考え • て think Thoughts. The
22 わけで わけ • で For that Why so
21 に • よって によって To Accordingly By
21 思った 思っ • た thought I thought. It was
20 のでしょう の • でしょ • う I guess of. right. Cormorant
20 みて み • て look Only. The
20 住んで 住ん • で Live Live. so
19 お • 寺 お寺
19 であった で • あっ • た
19 言って 言っ • て
18 いけない いけ • ない
18 そうです そう • です
18 修学 • 旅行 修学旅行
18 夏 • 休み 夏休み
18 来て 来 • て
18 様々な 様々 • な
18 確かに 確か • に
18 買って 買っ • て
17 であり で • あり
17 と • いった といった
17 みた み • た
17 書いて 書い • て
16 他の 他 • の
16 知って 知っ • て
15 これ • から これから
15 そこ • で そこで
15 なければ なけれ • ば
15 ひと • つ ひとつ
15 られた られ • た
15 一 • つ 一つ
15 有名な 有名 • な
15 来た 来 • た
15 買った 買っ • た
14 お • 茶 お茶
14 のである の • で • ある
14 られて られ • て
14 んじゃ ん • じゃ
14 歩いて 歩い • て
14 非常に 非常 • に
13 いった いっ • た
13 って • いう っていう
13 ついて つい • て
13 もの • の ものの
13 んだろう ん • だろ • う
13 聞いて 聞い • て
13 見た 見 • た
13 逆に 逆 • に
12 お • 金 お金
12 それ • で それで
12 できた でき • た
12 できて でき • て
12 ようです よう • です
12 何度 何 • 度
12 好きだ 好き • だ
12 河原 • 町 河原町
11 しよう しよ • う
11 わけです わけ • です
11 三 • 条 三条
11 入れて 入れ • て
11 目の前 目 • の • 前
11 聞いた 聞い • た
10 いつでも いつ • でも
10 お • 気に入り お気に入り
10 かけて かけ • て
10 このような この • よう • な
10 して • る し • てる
10 せて せ • て
10 それ • でも それでも
10 たかった たかっ • た
10 に • 対して に対して
10 みたいな みたい • な
10 よかった よかっ • た
10 一 • 度 一度
10 今では 今 • で • は
10 作った 作っ • た
10 四 • 条 四条
10 変わって 変わっ • て
10 始めて 始め • て
10 百人一首 百 • 人 • 一 • 首
10 簡単に 簡単 • に
10 置いて 置い • て

Below I have split out the tokens that participate in alternations, to help identify regular patterns across alternations. I've included those with >=50 occurrences.

On the Kuromoji side, the top 9 are single characters, and 8 of them are identified by English Wiktionary as being particles (very briefly: particles are typically small words that provide additional grammatical information). To get a sense of the scope here, this would be like deciding whether "have been" or "look up" should be tokenized as one word or two. Consistency is probably more important than choosing either option.

It seems that Kuromoji is more aggressive about separating particles, and these 8 account for 6,656 of the 17,510 Kuromoji tokens (38.0%) that appear in alternations.

Most Commonly Alternating Tokens
Freq KNBC Freq Kuromoji
468 2286 Request maker sentence-final particle.
466 して 1650 interrogative personal pronoun
271 いう 567 Conjunctive particle
238 ました 526 Particle meaning at/ or with
168 494 Several particle meanings
133 した 463 Several particle meanings
131 である 398 case marking particle
119 いた 296 ?
107 れて 272 nominal predicate particle
104 256 という
102 なって 240 まし
95 235 です
93 のだ 211
89 のです 203 よう
88 ように 184 ある
83 174
76 だった 168 なっ
71 だろう 166
71 なかった 138
69 いて 127 行っ
67 116 だろ
65 行って 114 だっ
60 ような 109 あっ
57 んです 100 たり
56 なった 84
56 見て 82 とか
53 思って 81 たら
51 行った 80 として
80 思っ
78
75 なかっ
72
72
68 好き
67 てる
65 でしょ
60 いっ
59 そう
58 でし
57
57 使っ
54
53 しまっ

Tokenization Analysis Summary

My sense is that a significant portion of the disagreements between KNBC and Kuromoji are based on aggressiveness in separating particular parts of speech. There are surely a fair number of errors in the Kuromoji tokenization, but I'm not so worried that I'd want to stop and not proceed to set up the test index in labs.

Further Review

Below is some additional analysis, done as the result of issues brought up by speaker review or elsewhere. In particular, check out the discussion on Phab with whym, starting here.

Some 1- and 2-Character Tokens

In light of the discussion with whym on Phab and the concern that 1- and 2-character tokens are often highly ambiguous and can be grammatical suffixes, I've taken all of the 1- and 2-character tokens in the Groups with no common prefix/suffix above and run some additional analysis. In the tables below we have:

  • token: the 1- or 2-character token from Groups with no common prefix/suffix above
  • char_freq: the number of times the token string occurs in my 10,000-article corpus.
  • omitted: the number of times Kuromoji omitted the string and did not index it.
  • omit%: omitted/char_freq as a percentage. Values below 95% are bolded.

The remaining columns come in triples, which are:

  • freq: the number of times the token was normalized in a particular way
  • %: freq/char_freq as a percentage. Values above 1% are bolded.
  • norm: the normalized version of the token.
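The percentage columns are simple ratios, and each row can be sanity-checked because the omitted count plus the per-normalization counts should add up to char_freq. A small sketch, using figures taken from two rows of the first table below (the function names are made up):

```python
def omit_percent(char_freq, omitted):
    """omit% column: share of occurrences that were not indexed."""
    return 100.0 * omitted / char_freq


def row_is_consistent(char_freq, omitted, freqs):
    """True if omitted plus all normalization counts equals char_freq."""
    return omitted + sum(freqs) == char_freq


# Row with char_freq 9163, omitted 4236, one normalization seen 4927 times.
print(round(omit_percent(9163, 4236), 3))
print(row_is_consistent(9163, 4236, [4927]))

# Row with char_freq 35541, omitted 33237, normalizations seen 5 and 2299 times.
print(row_is_consistent(35541, 33237, [5, 2299]))
```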

The first table is the single-character tokens, which are generally much more common in the corpus. Many of these are indexed only vanishingly rarely, with 98% or more of the instances in the corpus being omitted from the index. Those with significant rates of indexing are relatively uncommon, occurring hundreds to fewer than ten thousand times in the corpus, rather than one to two hundred thousand times.

token char_freq omitted omit% .. freq % norm .. freq % norm .. freq % norm
か 78465 78418 99.940% .. 47 0.060% く
き 35541 33237 93.517% .. 5 0.014% きる .. 2299 6.469% くる
く 37657 37326 99.121% .. 220 0.584% く .. 73 0.194% くい .. 38 0.101% くる
け 29808 29594 99.282% .. 8 0.027% く .. 206 0.691%
こ 59029 58317 98.794% .. 187 0.317% くる .. 494 0.837% .. 31 0.053% こい
さ 71158 71122 99.949% .. 36 0.051% す
し 169553 169549 99.998% .. 4 0.002% す
す 62089 61890 99.679% .. 199 0.321% す
ず 9163 4236 46.229% .. 4927 53.771% ぬ
た 205766 205698 99.967% .. 68 0.033% たい
っ 79632 79281 99.559% .. 351 0.441% く
ぬ 905 498 55.028% .. 407 44.972% ぬ
ね 2891 2392 82.740% .. 84 2.906% ぬ .. 402 13.905% .. 13 0.450% ねる
よ 43157 42158 97.685% .. 963 2.231% .. 32 0.074% よい .. 4 0.009% る
り 71178 70666 99.281% .. 512 0.719% り
る 218636 218282 99.838% .. 188 0.086% り .. 166 0.076% る
れ 120917 120879 99.969% .. 38 0.031% る
ろ 6659 6423 96.456% .. 2 0.030% る .. 234 3.514%

The second table is the two-character tokens. I didn't bold higher % values since almost everything would be bolded. There are many fewer occurrences of these strings in the corpus overall, with some occurring fewer than 10 times and none more than 1500 times (compared to hundreds of thousands of occurrences above).

token char_freq omitted omit% .. freq % norm .. freq % norm .. freq % norm .. freq % norm .. freq % norm
きゃ 68 54 79.41% .. 14 20.59% く
くっ 113 106 93.81% .. 5 4.42% くう .. 2 1.77% くる
くら 816 756 92.65% .. 53 6.50% くら .. 2 0.25% くらい .. 5 0.61% くる
くり 678 589 86.87% .. 79 11.65% くり .. 10 1.47% くる
くる 976 243 24.90% .. 733 75.10% くる
くれ 544 249 45.77% .. 10 1.84% くる .. 285 52.39% くれる
くろ 183 119 65.03% .. 13 7.10% くる .. 51 27.87% くろい
こい 162 78 48.15% .. 14 8.64% くる .. 56 34.57% こい .. 2 1.23% こう .. 5 3.09% こく .. 7 4.32% こぐ
ざり 23 19 82.61% .. 4 17.39% ぬ
ざる 160 14 8.75% .. 4 2.50% ざる .. 142 88.75% ぬ
ざれ 5 4 80.00% .. 1 20.00% ぬ
しゃ 488 407 83.40% .. 81 16.60% す
ずん 14 9 64.29% .. 5 35.71% ぬ
たい 1497 480 32.06% .. 1015 67.80% たい .. 2 0.13% たく
たき 129 85 65.89% .. 12 9.30% たい .. 21 16.28% たき .. 11 8.53% たく
たく 522 339 64.94% .. 143 27.39% たい .. 40 7.66% たく
たし 1103 1092 99.00% .. 4 0.36% たい .. 7 0.63% たす
てぇ 3 0 0.00% .. 3 100.00% たい
とう 828 653 78.86% .. 20 2.42% たい .. 155 18.72% とう
りゃ 16 15 93.75% .. 1 6.25% る
るる 47 11 23.40% .. 13 27.66% る .. 23 48.94% るる
るれ 4 3 75.00% .. 1 25.00% る

Overall, how these 1- and 2-character tokens are indexed is still a concern, but the numbers lean towards it not being a gigantic problem.

Non-Indexed Characters

I’ve noticed that the analyzer drops a lot of characters and just doesn’t index them. (This isn’t a disaster: we have the “text” field with the analyzed text, but also the “plain” field, which is generally unchanged, so exact matches are always possible.)

As an example of text being dropped, I analyzed this sentence fragment. The characters in [square brackets] are not indexed. Running the text through Google translate, there don’t seem to be any egregious errors—lots of function words (or at least things that get translated to function words) are getting omitted.

  • グレート [・]アトラクター [が] 数億光年 [に] 渡る宇宙 [の] 領域内 [にある] 銀河 [とそれが] 属する銀河団 [の] 運動 [に] 及ぼす影響 [の] 観測 [から] 推定 [されたものである。]

To Do

  • ✓ Get some native/fluent speaker review of the groupings above (Done! Thanks whym!)
  • ✓ Test tokenization independently (Done—see above)
  • ✓ Figure out what to do about BM25 (Done—as with Chinese, we'll enable it in the labs version and if it is well received, we'll go enable it in production)
    • ✗ Enable BM25 for Japanese in prod if the labs review goes well.
      • It didn't go well...
  • ✓ Set up one or more of the configurations in labs (Done: http://ja-wp-kuromoji-relforge.wmflabs.org/w/index.php?search= )
    • ✓ Post request for feedback to the Village Pump (Done: got feedback—check it out!)
  • ✗ Do the deployment + reindexing dance! ♩♫♩

Abandon Ship!

Unfortunately, the user/speaker review from the Village Pump didn't go well. There were some problems with scoring and configuration in Labs, but even with that settled the results were often not as good, and often included lots of extraneous results. (Extra results probably would have been okay if better results ended up at the top of the list, but that didn't happen.)

It's possible that better scoring and weighting would give better results, but there's no simple, obvious fix to try, and careful tuning would require significant time and significant help from a fluent speaker. Since we weren't specifically trying to fix a problem with Japanese, just offering a potential improvement, it's okay to abandon this change.

We can come back to Kuromoji or another analyzer in the future if it offers better accuracy, or if we think it would fix a problem for the Japanese language wikis.