User:TJones (WMF)/Notes/Nori Analyzer Analysis

August/September 2018 — See TJones_(WMF)/Notes for other projects. See also T178925. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
A new Korean analyzer called Nori, supported by Elastic, is part of Elasticsearch 6.4 / Lucene 7.4. Even if we don't upgrade to ES 6.4 as our first version of ES 6, it seems worth waiting a little longer for an ES-supported analyzer.

My goal is to install ES 6.4 on my local machine and test the analyzer there. If Nori is up to the job, then deployment can wait until we are at ES 6.4+.

Data
As usual, I'm doing my analysis on 10,000 random Korean Wikipedia articles and 10,000 random Korean Wiktionary entries, with most markup removed, and lines deduplicated to reduce unnaturally frequent wiki-specific phrases, like the Korean equivalents of "references", "see also", "noun", etc.

Deep Background
I went looking to see if anything new and interesting had happened in the world of Korean morphological analysis since I made my list of analyzers to look into in October 2017.

MeCab was still high on the list, and there are a couple of ES plugin wrappers around it on Github.

Then I found more discussion of an ES analyzer called Seunjeon, used by Amazon and available on BitBucket. Seunjeon is based on MeCab, so it looked really promising.

I found a blog post on the Elastic blog comparing Seunjeon and two other analyzers, Arirang, and Open-korean-text. I missed it last year because it was published 2 days after I wrote the first draft of my list. Ugh.

The author of the blog post, Kiju Kim, is a Korean-speaking engineer at Elastic, and he did a great job looking at the speed and memory usage of the three analyzers. He also did an analysis of the tokens found by each in a short span of text.

Since that blog post was so helpful, I looked at other blog posts he's written, and discovered a nice series on the basics of working with Korean, Japanese, and Chinese text.

But then there was his newest post—from this August!—announcing a new Korean analyzer in Lucene 7.4 and Elastic 6.4! It's called Nori.

An Elastic-supported analyzer, if it is linguistically adequate, is definitely the way to go. Supporting our own analyzers, like we do for Esperanto, Slovak, and Serbian, is good, but still a maintenance burden, and relying on third-party analyzers is great, except when they don't update to the right version of Elasticsearch quite as fast as we'd like them to.

A Brief Note on Hangeul
If you aren't familiar with the Korean writing system, Hangeul,† you can obviously read a lot more about it on Wikipedia. Very briefly, Hangeul characters are generally "syllabic blocks", that have an internal structure made up of individual consonant and vowel symbols, called jamo.

"† The name of the script can be transliterated in different systems as 'Hangul' (a simplification of the more careful transliteration 'Han'gŭl') or 'Hangeul' in English. Which form is the most correct? '한글', obviously."

The name of the writing system, Hangeul, is written as 한글... han + geul. But 한 (han) is made up of ㅎ + ㅏ + ㄴ (h + a + n) and 글 (geul) is made up of ㄱ + ㅡ + ㄹ (g + eu + l). There are different ways of arranging the jamo into a syllabic block, but they are generally a mix of left-to-right and top-to-bottom, depending on the number of jamo elements and their size and shape.

Knowing this makes it a little easier to follow some of the transformations that happen during stemming. When adding the present tense suffix -ᆫ다/-nda, for example, the ᆫ/n can join the final syllable of the word it is added to. Similarly, when it is removed, the final syllable left behind can lose the ᆫ/n. When stripping the suffix from 간다/ganda, we are left with 가/ga, which may not look like the original word. If you look closely, though, the first syllable 간/gan is 가/ga sitting on top of ᆫ/n.

The ᆫ/n in -ᆫ다/-nda can also replace a final ㄹ/l, so the stem of 안다/anda (note that ㅇ here indicates there is no initial consonant and ㅏ is a) is 알/al. Even if you can't remember the sounds of the individual jamo, knowing the structure of the syllabic blocks makes it easier to see the relationship between 안 and 알: only the final consonant of the syllable changed.
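
The block structure described above can be checked with Unicode normalization. This is a minimal Python sketch, assuming the standard `unicodedata` module; the helper names `jamo` and `compose` are made up for illustration. NFD splits a syllabic block into conjoining jamo, and NFC puts them back together.

```python
import unicodedata

def jamo(text):
    """Decompose Hangeul syllabic blocks into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)

def compose(text):
    """Recompose conjoining jamo into syllabic blocks (NFC)."""
    return unicodedata.normalize("NFC", text)

FINAL_N = "\u11ab"  # ᆫ, the syllable-final form of n
FINAL_L = "\u11af"  # ᆯ, the syllable-final form of l

# 간 (gan) is 가 (ga) plus a final ᆫ (n).
assert jamo("간") == jamo("가") + FINAL_N
assert compose(jamo("가") + FINAL_N) == "간"

# 안 (an) and 알 (al) differ only in their final consonant.
assert jamo("안")[:-1] == jamo("알")[:-1]
assert jamo("안")[-1] == FINAL_N and jamo("알")[-1] == FINAL_L

# Show the jamo code points inside each block of 한글.
for block in "한글":
    print(block, [f"U+{ord(j):04X}" for j in jamo(block)])
```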

In addition to Hangeul, Korean writing also uses Hanja, which are Chinese characters that have been borrowed into Korean.

Status Quo: The CJK Analyzer
Korean-language wikis currently use the CJK analyzer, which does some normalization (like converting Ｆｕｌｌｗｉｄｔｈ characters to halfwidth characters), but its most obvious feature is the way tokenization is done. Strings of CJK characters are broken up into overlapping bigrams. If we did that in English, the word bigram would be indexed as bi, ig, gr, ra, and am. Korean 위키백과 ("Wikipedia") is tokenized as 위키, 키백, and 백과. Spaceless word boundaries (i.e., without a space or some punctuation) are ignored. It's not great, but it mostly works. Better systems are definitely better, but also a lot more complex. (Here's hoping the Nori Korean analyzer is better!)
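
The overlapping-bigram idea can be sketched in a few lines of Python. This only illustrates the tokenization pattern described above, not the CJK analyzer's actual implementation; the function name `bigrams` and the handling of single characters are assumptions.

```python
def bigrams(text):
    """Return the overlapping two-character tokens of a run of characters."""
    if len(text) < 2:
        # Assumption: a lone character is kept as-is rather than dropped.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("bigram"))
print(bigrams("위키백과"))
```

A run of n characters yields n - 1 bigrams, which is why CJK token counts are so much higher than counts from a word-based tokenizer.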

CJK Solo Analysis
When I investigated the Kuromoji analyzer for Japanese last summer, I neglected to do an analysis of the CJK analyzer on its own. (I keep discovering new and weird ways for analyses to go wrong, so my checklist of things to look for keeps growing.)

The most obvious things I see in this solo analysis are:


 * There are very few tokens that get normalized to be the same as other tokens. Essentially, all CJK bigrams are unique.
 * Full width forms like "１２０", "ＩＭＦ", and "：" get normalized to their halfwidth (i.e., "normal" for English) forms, "120", "IMF", and ":".
 * Most longer tokens, especially in the 10 to 20 characters length range, are Latin—including English and German words, domain names, underscored_phrases, or long file names—with a bit of Thai and some long numbers with commas in them thrown in. There are also the occasional multi-byte characters that Elastic can't handle as-is, so "𐰜𐰇𐰚" gets converted at a 12:1 ratio into "\uD803\uDC1C\uD803\uDC07\uD803\uDC1A", and thus shows up internally as a 36-character token.
 * Above 25 characters, the longer tokens in the Wikipedia corpus are actually mostly Korean script!
 * The most common case is Korean words or phrases separated by a middle dot or interpunct (·, U+00B7), which is used like a comma or in-line bullet in lists: "교리·부응교·수원부사·이조참의·병조참판·도승지·대사간·대사헌".
 * It looks like any change in script from Hangeul to Latin or numbers will block the bigramming, so I see tokens like "는1988년11월24일부터1999년8월8일까지미국ktma" and "안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판".
 * The presence of any non-CJK letter seems to block tokenization and bigramming. I replaced the middle dot with a Latin, Cyrillic, Armenian, Devanagari, Hebrew, or Arabic character and got the same result. Unusual space characters (six-per-em space, figure space, no-break space) did not break the tokenization.
 * A general solution would be complex and detailed, but a 95% solution would be to replace middle dots with spaces. We should do that if we don't adopt the Nori analyzer.
 * The Wiktionary data doesn't show any middle dots used for lists, and the vast majority of longer tokens are IPA phonetic transcriptions of phrases, which are often pleasantly detailed, so they have lots of diacritics that up the character count. For example, "s͈ɛ̝ɡɯnba̠ɭt͈a̠ks͈ɛ̝ɡɯnba̠ɭt͈a̠kʰa̠da̠"—which is 26 letters + 12 diacritics. Some are just really long, like "kum.beŋ.i.do.bal.bɨ.mjən.k’um.tʰɨl.han.da"—which uses periods to mark syllable boundaries.
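
The 12:1 expansion of the multi-byte characters mentioned above can be reproduced by writing out each UTF-16 code unit as an escape. This is a small Python sketch of the arithmetic only; the function name `utf16_escape` is hypothetical, and it is not a claim about how Elastic produces the escaped form internally.

```python
def utf16_escape(text):
    """Render each UTF-16 code unit of text as a \\uXXXX escape."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(f"\\u{unit:04X}" for unit in units)

# The three Old Turkic characters "𐰜𐰇𐰚" (U+10C1C, U+10C07, U+10C1A).
old_turkic = "\U00010C1C\U00010C07\U00010C1A"
escaped = utf16_escape(old_turkic)
print(escaped, len(escaped))
```

Each character outside the Basic Multilingual Plane becomes a surrogate pair, and each half of the pair becomes a six-character escape, so three characters turn into 36.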

A New Contender Emerges: The Nori Analyzer
The Nori analyzer consists of several parts:


 * A Korean tokenizer: based on the MeCab dictionary. It can also optionally break up compounds into parts (with an option to keep or discard the original compound; discarding is the default), and it can make use of a user dictionary of additional nouns. If the tokenization works well, it should give much more accurate search results than the CJK bigrams!
 * A part of speech filter: Unsurprisingly, proper tokenization can be a little easier if you take into account parts of speech. In English, for example, the blank in "the ____ is ..." is going to be a noun phrase, which can help you figure out how to parse it. In "the building is ...", building is a noun, and so maybe we don't want to strip off the final -ing because it is not a verbal ending, as it would be in "She is building a fort." Anyway, since we have the parts of speech, we can filter out affixes, particles, and other low-information elements.
 * A reading form filter: This converts Hanja (Chinese characters) into their equivalent Hangeul. The Hangeul is more ambiguous, but may occur instead of the Hanja in some contexts. For common Hangeul equivalents, this can conflate a lot of Hanja.
 * A lowercasing filter: For the stray words and letters that show up in Latin, Cyrillic, Greek, or Armenian scripts.

Nori Solo Analysis
The most common cause for input tokens to be stemmed the same seems to be the Hanja-to-Hangeul (Chinese-to-Korean) normalization. Since the tokens often have no characters in common, my automatic detection of potential problem stems goes crazy and almost every stemming group is a "potential problem" (i.e., there is no common beginning or ending substring across all terms). I will randomly sample some Hanja-to-Hangeul groups for native speaker review.
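
The "potential problem" check described above can be sketched roughly like this in Python. This is a simplified guess at the idea (flag a group when its terms share neither a beginning nor an ending substring), not the actual analysis tool; all function names here are hypothetical.

```python
import os

def common_prefix(terms):
    """Longest string that every term starts with."""
    return os.path.commonprefix(list(terms))

def common_suffix(terms):
    """Longest string that every term ends with."""
    reversed_terms = [t[::-1] for t in terms]
    return os.path.commonprefix(reversed_terms)[::-1]

def is_potential_problem(terms):
    """Flag a stemming group whose terms share no beginning or ending substring."""
    return not common_prefix(terms) and not common_suffix(terms)

# An English-style group shares the prefix "hop", so it is not flagged.
print(is_potential_problem(["hope", "hoped", "hopes", "hoping"]))
# A Hanja/Hangeul group has no characters in common, so it is flagged.
print(is_potential_problem(["森", "삼"]))
```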

Some tokens that stem together are indeed inflections, but because of the syllabic blocks, it's hard to see that they are related. It's possible to decompose them (using NFD normalization). Doing so reveals our friend -ᆫ다/-nda as a likely Korean suffix, but since most stemming groups are Hanja/Hangeul groups, it didn't do too much for narrowing the range of potential problem suffixes.
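
Here is a minimal Python sketch of why decomposition helps, assuming NFD from the standard `unicodedata` module and a hypothetical `common_suffix` helper. On the precomposed forms only 다 is shared; after decomposition the final ᆫ shows up as part of the shared ending too.

```python
import os
import unicodedata

def common_suffix(terms):
    """Longest string that every term ends with."""
    reversed_terms = [t[::-1] for t in terms]
    return os.path.commonprefix(reversed_terms)[::-1]

def nfd(text):
    """Decompose syllabic blocks into conjoining jamo."""
    return unicodedata.normalize("NFD", text)

words = ["간다", "다스린다", "끌어당긴다"]

# On syllabic blocks, only the last block is shared.
composed_suffix = common_suffix(words)
# On decomposed jamo, the shared ending is one jamo longer.
decomposed_suffix = common_suffix([nfd(w) for w in words])

print(repr(composed_suffix))
print([f"U+{ord(j):04X}" for j in decomposed_suffix])
```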

Other tokens that stem together come from compounds. The default behavior for Nori is to break a compound into parts, discard the original and keep the parts. So, 위키백과 ("Wikipedia") gets divided into 위키 (transliteration of wiki) and 백과 (an analog of the "encyclo" part of encyclopedia, meaning "all subjects").

Generally, I'm in favor of indexing the original compound (for increased precision) and the individual parts (for increased recall), and letting the scoring sort it out.

Feeding examples to the Nori stemmer on the command line also makes it clear that the context of a token affects its stemming and status as a compound (probably mediated by the part of speech tagging). For example, when I tokenize the string "기다리. 기다림."—both are forms of 기다리다, meaning "to wait for"—I get back two instances of the stem "기다리". With just a space between them—as "기다리 기다림"—the stems are 기다리 and 다리, with initial 기- apparently removed from the second token. (Though my very poorly-educated guess is that the tokenizer may sometimes ignore spaces and in this case is interpreting -기 as a suffix, since it has several suffixed forms. Tokenizing as "기다리기다림" gives the 기다리 and 다리 stems, too.)

The compound processing results in some input tokens generating multiple output tokens. Out of 128,352 pre-analysis tokens, 126,342 generated only one output token. 1,001 generated 2, 9 generated 3, 2 generated 4, and 1 generated 0! I definitely need to see where that empty token came from, and double check on those potential three- and four-part compounds.

[Table: frequency of number of tokens generated, with examples for 2+]

Oddly, the empty output token maps back to an empty input token (with length zero). It is triggered by the presence of the four characters "그레이맨" (part of the name of a manga character 디 그레이맨). The four characters in the name are parsed as an input token, followed by a zero-length token. It's weird. There's only one in my 10K Wikipedia sample, and none in the Wiktionary sample, but there could potentially be dozens or even hundreds in the full Korean Wikipedia, so an empty-token filter seems to be called for.

Some other things I see in this solo analysis are:


 * The longest non-Korean tokens are similar, though there are no domain names, words_with_underscores, numbers with commas, or long IPA phonetic transcriptions.
 * Nori doesn't have the middle dot problem CJK does, but the longest Korean tokens still look similar: "ㆍ도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구". Some people use an obsolete character called "arae-a" (ㆍ, U+318D) in place of a middle dot. Visually they are very similar— ㆍ vs · —though YMMV, depending on your fonts.
 * Nori doesn't have the non-CJK letter problem the CJK analyzer does, so it breaks on numbers, Latin, Cyrillic, Greek, Armenian, Hebrew, or Arabic characters. It also breaks on Japanese Hiragana and Katakana characters. Chinese characters can be the same as the Hanja based on them, so they seem to be tokenized according to the internal workings of the Nori tokenizer, which often splits them off as separate tokens.
 * Note: The character arae-a (ㆍ, U+318D) mentioned above is technically the "HANGUL LETTER ARAEA", and there is another form, the "HANGUL JUNGSEONG ARAEA" (ᆞ, U+119E), which looks pretty much identical and is also, probably incorrectly, used in lists ("새로운 생각ᆞ 배려하는 마음ᆞ 커가는 꿈"). Its more proper use is as a "jungseong" character, which is the medial character of a syllabic block. If there is no precomposed Unicode character for a given syllabic block, you can specify the block as initial/medial/final parts (choseong/jungseong/jongseong) and if your fonts and operating system are up to the challenge, you get nice-looking syllabic blocks as a result. Below is an example of the same characters in Arial Unicode (which shows them individually) and in Noto Sans CJK (which is a newer font and knows how to do the right thing). There are even fewer of these used incorrectly, and some used correctly for historical syllabic blocks, so the right solution is probably to find them when used as bullet points, and replace them with a different character.
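
Since the three dot-like characters are hard to tell apart by eye, it can help to look them up by code point. A small Python sketch using the standard `unicodedata` module (the `describe` helper is just for illustration):

```python
import unicodedata

def describe(ch):
    """Return the code point, Unicode name, and general category of a character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)

# Middle dot, letter arae-a, and jungseong arae-a.
for ch in ["\u00b7", "\u318d", "\u119e"]:
    print(describe(ch))
```

The middle dot is punctuation, while both arae-a characters are letters, which is consistent with a tokenizer treating them differently.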



CJK vs Nori (monolithic)
Since the tokenization of Korean is wildly different between CJK and Nori, comparing the two is also about how they treat everything else—normalization, treatment of non-Korean CJK text, treatment of non-CJK text, etc.

The token count differences for the Wikipedia corpus are huge: 3,698,656 for CJK and 2,382,379 for Nori. For a string of, say, 8 Korean characters without spaces, CJK will give 7 bigrams, while Nori will likely give only 2 to 4 tokens (the median length of Nori tokens is 3, and the mode is 2). The Wiktionary data shows less of a difference (105,389 tokens for CJK and 103,702 tokens for Nori), but Nori splits up certain non-Korean tokens that CJK doesn't, and there are many more non-Korean tokens in Wiktionary that both treat the same.

Other differences of note (✓=Good for Nori; —=Neutral for Nori; ✗=Bad for Nori):


 * ✓ Nori splits up mixed-Korean/non-CJK and mixed-script sequences. The most common are date elements like "3월" ("third month/March") and "1990년" ("the year 1990"), and measurements like "0.875톤이" ("0.875 tons"). With Nori we can still match phrases, but CJK would have trouble matching "1990" to "1990년".
 * ✓ Another common case is "의" which seems to mean "of", stuck to the end of a non-Korean word, like "allegory의" or "ангел의". Mixed-script CJK bigrams (like "に醤", in which the first character is Japanese and the second is Chinese) are split up, too.
 * ✗ Nori kindly offers the option to get some of its internal details, and it seems to divide non-CJK characters into type "SL(Foreign language)" and "SY(Other symbol)" (among others). It seems to generally break between characters from different character sets, though which Unicode blocks count as "symbols" is weird. Most Latin or Latin-based characters are "foreign", but IPA extensions are "symbols".
 * ✗ The "Greek Extended" block is also treated as "symbols" rather than as Greek, so "εἰμί" gets split up into "ε" + "ἰ" + "μί", because ἰ ("GREEK SMALL LETTER IOTA WITH PSILI" in the "Greek Extended" block) is treated as a symbol.
 * ✗ Our friend "s͈ɛ̝ɡɯnba̠ɭt͈a̠ks͈ɛ̝ɡɯnba̠ɭt͈a̠kʰa̠da̠" gets tokenized by Nori as "s" + "͈ɛ̝ɡɯ" + "nba" + "̠ɭ" + "t" + "͈" + "a" + "̠" + "ks" + "͈ɛ̝ɡɯ" + "nba" + "̠ɭ" + "t" + "͈" + "a" + "̠" + "k" + "ʰ" + "a" + "̠" + "da" + "̠".
 * — Numbers are a different category for Nori, so "UPC600" gets split into "UPC" and "600".
 * ✗ Nori splits on combining characters (also treated as "symbols"), so that "Ба̀лтичко̄" gets split into four tokens: Ба + ̀ + лтичко + ̄. This happens a lot in Cyrillic, where combining accents are used to show stress.
 * ✗ Nori splits on apostrophes and some other apostrophe-like characters, including the curly apostrophe (’) and the Hebrew ׳ (U+05F3, "HEBREW PUNCTUATION GERESH").
 * ✗ A lot of the combining and modifying characters are indexed on their own by Nori, resulting in low-quality and relatively high-volume tokens: ̀ • ́ • ̂ • ̃ • ̄ • ̍ • ̪ʲ • ̪ˠ • ʷ • ʻ • ʼ • ʾ • ʿ • ˀ ... etc.
 * ✗ CJK leaves soft hyphens in place, while Nori splits tokens on them. Neither is ideal. Stripping them before tokenization might be better.
 * ✓ Nori strips bidirectional markers (used when switching between left-to-right and right-to-left scripts), while CJK leaves them in place, which is wrong.
 * ✗ CJK leaves zero-width non-joiner characters in place, while Nori splits on them, both of which are usually wrong, since the usual purpose of the character is to prevent ligatures. Stripping them seems to be the right thing to do.
 * — Nori also splits on periods (example.com and 3.14159), colons (commons:category:vienna), underscores (service_file_system_driver), commas (in numbers), where CJK does not. These are generally good, though it's not great for acronyms.
 * ✓ CJK eats encircled numbers (①②③), "dingbat" circled numbers (➀➁➂), parenthesized numbers (⑴⑵⑶), fractions (¼ ⅓ ⅜ ½ ⅔ ¾), superscript numbers (¹²³), and subscript numbers (₁₂₃); Nori keeps them. Normalizing them is probably best.
 * ✓ It's a minor thing, but CJK eats some characters, like "𝄞", "🎄", and some private use area characters, while Nori keeps them. They both eat other characters, like "♡".
 * — Strings of Japanese Hiragana or Katakana are kept whole. This is probably good for titles and other short phrases, but not good for extended strings of Japanese that would get tokenized as one long token.
 * ✗ Oddly, the string "튜토리얼" gets tokenized with an extra space at the end, as "튜토리얼 ". It's the only token like that I've found. It seems to be stemmed correctly from the forms in the text ("튜토리얼에" and "튜토리얼을"), it just has an extra space. Weird.
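
The effect of splitting on combining characters can be imitated in Python. This is only a sketch that reproduces the observed output for one example, assuming that "combining character" means any character whose Unicode general category starts with M; it is not how Nori actually decides where to split.

```python
import itertools
import unicodedata

def split_on_marks(text):
    """Split text into alternating runs of combining marks and other characters."""
    def is_mark(ch):
        return unicodedata.category(ch).startswith("M")
    return ["".join(run) for _, run in itertools.groupby(text, key=is_mark)]

# "Ба̀лтичко̄" written with a combining grave accent and a combining macron.
word = "Ба\u0300лтичко\u0304"
print(split_on_marks(word))
```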

Nori (monolithic) vs Nori (unpacked)
I unpacked Nori according to the Elasticsearch 6.4 documentation and the results were identical to the monolithic Nori analyzer. So that's good. Now we can test other variations, like changing the compound processing and introducing ICU normalization and custom character filters to address some of the problems Nori has.

Nori: Enabling "Mixed" Compounds
Nori has three options for dealing with compounds: break the compound into pieces and discard the original compound (the default), leave the compound as is, or index both the compound and its sub-parts, which they call "mixed". I prefer the "mixed" option if the compound splitting is good, as I noted above, because it allows for more precise matches on the whole compound, but also reasonable matches on parts of the compound.
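
For reference, an unpacked Nori analyzer with a configurable compound mode might look roughly like the settings below, written as a Python dict that can be dumped to JSON. The analyzer and tokenizer names (`korean_unpacked`, `nori_tok`) are hypothetical; the component types and the `decompound_mode` values (`none`, `discard`, `mixed`) are the ones I understand the Elasticsearch Nori plugin to provide, so check them against the documentation for your ES version.

```python
import json

DECOMPOUND_MODES = ("none", "discard", "mixed")

def nori_settings(decompound_mode="mixed"):
    """Build index analysis settings for an unpacked Nori analyzer."""
    if decompound_mode not in DECOMPOUND_MODES:
        raise ValueError(f"unknown decompound_mode: {decompound_mode!r}")
    return {
        "analysis": {
            "tokenizer": {
                "nori_tok": {  # hypothetical tokenizer name
                    "type": "nori_tokenizer",
                    "decompound_mode": decompound_mode,
                }
            },
            "analyzer": {
                "korean_unpacked": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "nori_tok",
                    # Same order as the parts of the Nori analyzer listed earlier.
                    "filter": ["nori_part_of_speech", "nori_readingform", "lowercase"],
                }
            },
        }
    }

print(json.dumps(nori_settings("mixed"), indent=2))
```

With `discard`, only the parts of a compound are indexed; with `mixed`, the original compound is indexed alongside its parts.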

We'll need native speaker review to judge the quality of the compound splitting, but we can still get a sense of the size of the impact on our 10K corpora.

For the Wikipedia corpus, the "default" Nori config generated 2,382,379 tokens, while "mixed" Nori generated 2,659,656 tokens. The extra ~277K (11.6%) tokens should be the original compounds that were split into sub-parts and discarded in the "default" config. The Wiktionary corpus gave a similar though smaller increase: 103,702 vs 111,331 (~7.6K / 7.4%).
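
The increases quoted above are simple differences and ratios of the token counts. A quick Python check (the `increase` helper is just for illustration):

```python
def increase(before, after):
    """Return the absolute and percentage increase from before to after."""
    extra = after - before
    return extra, round(100 * extra / before, 1)

wikipedia = increase(2_382_379, 2_659_656)
wiktionary = increase(103_702, 111_331)
print(wikipedia, wiktionary)
```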

New collisions are very rare—on the order of 0.1% or less. This makes sense, because a new collision in this case means that a compound "AB" that had previously been indexed only as "A" and "B" is indexed as "AB", but there are already existing tokens indexed as "AB".

For the Wikipedia corpus, 45 pre-analysis types (0.035% of pre-analysis types) / 3644 tokens (0.153% of tokens) were added to 45 groups (0.041% of post-analysis types), affecting a total of 121 pre-analysis types (0.094% of pre-analysis types) in those groups.

For the Wiktionary corpus: 12 pre-analysis types (0.05% of pre-analysis types) / 67 tokens (0.065% of tokens) were added to 12 groups (0.058% of post-analysis types), affecting a total of 28 pre-analysis types (0.116% of pre-analysis types) in those groups.

So, the impact on increased ambiguity (new collisions) is very low, but the number of compounds indexed, which increases precision when searching for those compounds, is high!

Barring negative speaker feedback, indexing in "mixed" mode seems to be the way to go.

Other notes:


 * I did find one token out of the entire 10K Wikipedia corpus that has a middle dot in it, "학동·증심사입구역", where the middle dot seems to be playing the role of a hyphen in the name of a subway station with two names. The English title uses an en dash ("Hakdong–Jeungsimsa Station") but the opening sentence uses the original middot ("Hakdong·Jeungsimsa Station")! A very brief search did not turn up any other instances of titles with middle dots that are tokenized as one long token.
 * Note that this kind of thing may be less of a problem for Nori when it does happen because the bigger token is broken down into smaller tokens for both "default" Nori and "mixed" Nori.

Nori: Enabling ICU Normalization
I compared the Nori "mixed compounds" config against the same config, but with ICU normalization enabled instead of simple lowercasing.

Since this involves normalizing strings after tokenization, the number of tokens found in each corpus remains unchanged (2,659,656 for Wikipedia; 111,331 for Wiktionary).

The impact is again very small, on the order of 0.1% or less.

For the Wikipedia corpus: 44 pre-analysis types (0.028% of pre-analysis types) / 396 tokens (0.015% of tokens) were added to 32 groups (0.024% of post-analysis types), affecting a total of 90 pre-analysis types (0.058% of pre-analysis types) in those groups.

For the Wiktionary corpus: 13 pre-analysis types (0.049% of pre-analysis types) / 887 tokens (0.797% of tokens) were added to 10 groups (0.043% of post-analysis types), affecting a total of 25 pre-analysis types (0.093% of pre-analysis types) in those groups.

New collisions are mostly other versions of letters and numbers (superscript, subscript, fullwidth, encircled, parenthesized), along with ligatures "ﬁ"/"fi", precomposed Roman numerals, German ß -> ss, Greek ς -> σ. The only Korean collisions are the letters ㅔ and ㆍ being converted to their jungseong counterparts.

Other changes include the usual ICU normalizations.

One (familiar) problem is that dotted capital I (İ) is converted to lowercase i with an extra dot (i̇), as in İtalya -> i̇talya. (Though indexing already lowercase i̇talya gives three tokens, since there is no precomposed character for i̇, which is a regular i and a combining dot.) This can be fixed with a character filter converting Turkish İ to I early on.

The primary effect on Korean text is to convert individual Unicode "letters" into the corresponding choseong/jungseong/jongseong. So, "ㅍㅇㅎㄴㅌ" is converted to "ᄑᄋᄒᄂᄐ", which may look the same, unless you have clever fonts that can try to correctly format the choseong/jungseong/jongseong as syllabic blocks.
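
Many of these collisions can be approximated outside of Elasticsearch with Python's standard library. This is a rough sketch, assuming that NFKC normalization plus case folding is a reasonable stand-in for the ICU normalizer's default behavior; the two are not identical, and the function name `approx_icu_normalize` is made up.

```python
import unicodedata

def approx_icu_normalize(text):
    """Approximate ICU normalization with NFKC + case folding + NFKC."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

samples = ["ﬁ", "ß", "ς", "①", "ＩＭＦ", "\u0130talya", "ㅍㅇㅎㄴㅌ"]
for sample in samples:
    result = approx_icu_normalize(sample)
    # Print code points too, since several results look identical to the input.
    print(sample, "->", result, [f"U+{ord(c):04X}" for c in result])
```

The code points make the Korean change visible: the compatibility letters in the U+3130 block come out as conjoining jamo in the U+1100 block, even when the glyphs look the same.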

Early ICU Normalization
The lowercase filter part of the Nori analysis chain happens last. Early on I thought that was a bit odd, so after unpacking Nori, I moved the ICU Normalizer (which replaced the lowercase filter) to be first among the token filters. It didn't make any difference for Wikipedia or Wiktionary, with the default or mixed compound processing.

There is an ICU Normalization character filter (which applies before tokenization) which could have a positive impact on the tokenization of non-Korean text.

I didn't run a full analysis, because after testing some examples ("εἰμί", "Ба̀лтичко̄", "s͈ɛ̝ɡɯnba̠ɭt͈a̠ks͈ɛ̝ɡɯnba̠ɭt͈a̠kʰa̠da̠", "kum.beŋ.i.do.bal.bɨ.mjən.k’um.tʰɨl.han.da"), it didn't actually do anything useful. There were minor changes, like "tʰɨl" becoming th + ɨ + l instead of t + hɨ + l.

More aggressive ICU folding, rather than mere normalization, would probably convert "tʰɨl" to "thil", but ICU folding is only available as a token filter, after tokenization.

Since most of the "weird" problems are for non-Korean text, they aren't show-stoppers if we can't fix them.

Nori + Custom Filters
I added some custom character and token filters to the unpacked, mixed-compound, ICU-normalizing Nori config:


 * A mapping character filter to:
 * convert middle dot (·, U+00B7), and letter arae-a (ㆍ, U+318D) to spaces
 * convert dotted-I (İ) to I
 * remove soft hyphens and zero-width non-joiner
 * A pattern_replace character filter to strip combining diacritic characters from U+0300 to U+0331.
 * A minimum length token filter to remove empty strings
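
The effect of these filters can be imitated in plain Python, which is handy for trying out examples. This is a sketch of the behavior only, with hypothetical helper names; in Elasticsearch the same steps would be a `mapping` character filter, a `pattern_replace` character filter, and a `length` token filter.

```python
import re

# Stand-in for the mapping character filter.
CHAR_MAP = str.maketrans({
    "\u00b7": " ",   # middle dot -> space
    "\u318d": " ",   # letter arae-a -> space
    "\u0130": "I",   # dotted capital I -> I
    "\u00ad": None,  # soft hyphen removed
    "\u200c": None,  # zero-width non-joiner removed
})

# Stand-in for the pattern_replace character filter (U+0300 to U+0331).
COMBINING = re.compile("[\u0300-\u0331]")

def char_filters(text):
    """Apply the character mappings, then strip combining diacritics."""
    return COMBINING.sub("", text.translate(CHAR_MAP))

def drop_empty(tokens):
    """Stand-in for a minimum-length token filter with a minimum of 1."""
    return [t for t in tokens if len(t) >= 1]

print(char_filters("M\u00adB\u00adC"))        # soft hyphens between the letters
print(char_filters("Ви\u0301ктор"))           # combining acute on и
print(char_filters("교리\u00b7부응교"))         # middle dot used as a list separator
print(drop_empty(["그레이맨", ""]))
```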

I'm not sure what to do about the apostrophes and apostrophe-like characters, so I've left that alone for now.

Some stats:


 * The net effect of the filters on tokenization in the Wikipedia corpus was small: 2,659,656 tokens before, 2,659,650 tokens after; presumably the effect of merging tokens that were broken up by diacritics was offset by splitting up tokens joined by dots. The Wiktionary corpus had a bigger net decrease in tokens, from 111,331 to 106,283; pronunciations are still getting split by character type, but no longer also on every diacritic.
 * The impact on the Wikipedia corpus was small, with on the order of 0.1% of tokens or less affected.
 * New collisions: 33 pre-analysis types (0.021% of pre-analysis types) / 37 tokens (0.001% of tokens) were added to 32 groups (0.024% of post-analysis types), affecting a total of 77 pre-analysis types (0.05% of pre-analysis types) in those groups.
 * New splits: 13 pre-analysis types (0.008% of pre-analysis types) / 30 tokens (0.001% of tokens) were lost from 13 groups (0.01% of post-analysis types), affecting a total of 216 pre-analysis types (0.139% of pre-analysis types) in those groups.
 * The impact on the Wiktionary corpus was a bit larger, with up to almost 2% of tokens being affected.
 * New collisions: 202 pre-analysis types (0.754% of pre-analysis types) / 1940 tokens (1.743% of tokens) were added to 181 groups (0.77% of post-analysis types), affecting a total of 405 pre-analysis types (1.512% of pre-analysis types) in those groups.

Observations:


 * The one token that might have been okay with a middle dot (학동·증심사입구역) does not get tokenized as one token anymore. (With the mixed compound config, its parts were getting indexed before, and still are.)
 * There are some momentarily confusing results, such as Cyrillic "Его" no longer being in the same group as "его", because it's actually part of "Его́ров" (which had been tokenized as "его + ́ + ров", but is now kept together as "егоров").
 * Lots of good results like "M­B­C" (with soft hyphens) indexed with "MBC" and "Ви́ктор" with "Виктор".
 * Wiktionary has lots of additional collisions, with phonetic transcription bits grouping with plain text.

Overall, this seems like a reasonable improvement, though the exact list of combining characters to ignore is unclear.

Speaker Review
There's a lot going on with the Korean analyzer beyond stemming, which has often been the focus of my analyses. Tokenization and compound processing are also important. There is also the Hanja-to-Hangeul transformation.

Tokenization and Compounds
Below are ten random example sentences pulled from Korean Wikipedia, and seven example sentences with specific phrases that get processed as compounds and are split into three or more tokens.

Speaker Notes: Please review the 17 examples below for proper tokenization, which is the process of breaking up text into words or other units. There can be some disagreement about the exact way to break up a particular text, so it doesn't have to be perfect, just reasonable. Some words are identified as "compounds", and are also broken up into smaller pieces. For example, 양재시민의숲역 is broken up into 양재, 시민, 숲, and 역. For the purposes of search, searching for any of these five tokens would match a document that contains the full form, 양재시민의숲역.

In the examples below, each token is in [brackets]. When multiple tokens come from the same phrase, they are bracketed together, like this: [양재시민의숲역 • 양재 • 시민 • 숲 • 역].

We are generally only worried about really bad tokenization and compound processing. As an example in English, "football" could be broken up into "foot" and "ball" (i.e., [football • foot • ball]) or it could just occur as one word, [football]. Those are both acceptable. Something like [football • foo • tball] would be bad.

Note: some words or endings may be missing from the tokenization. Nori also removes words/characters/jamo that it determines are in the categories verbal endings, interjections, ending particles, general adverbs, conjunctive adverbs, determiners, prefixes, adjective suffixes, noun suffixes, verb suffixes, and various kinds of punctuation. Words have also been stemmed—that is, reduced to their base forms—which may introduce some additional errors or unexpected words.

Hanja-to-Hangeul
One of the unique features of the Nori analyzer is that it converts Hanja (Chinese characters borrowed into Korean) to Hangeul (the syllabic Korean script) to make them easier to search for. We want to make sure the conversion seems reasonable.

Speaker Notes: Below are 55 random examples of Chinese tokens, which are presumably Hanja, being grouped together with Korean tokens. Searching for either would find the other. Are these groupings reasonable? (Note that the last 14 examples have more than one Chinese token.)


 * [森] [삼]
 * [聯] [련]
 * [核] [핵]
 * [柳] [류]
 * [略] [략]
 * [五絃] [오현]
 * [五道] [오도]
 * [交子] [교자]
 * [分派] [분파]
 * [分配] [분배]
 * [可能] [가능]
 * [單光] [단광]
 * [奇形] [기형]
 * [奉戴] [봉대]
 * [婦家] [부가]
 * [媽媽] [마마]
 * [山中] [산중]
 * [平地] [평지]
 * [形態] [형태]
 * [心術] [심술]
 * [快感] [쾌감]
 * [政變] [정변]
 * [時用] [시용]
 * [武陵] [무릉]
 * [溪湖] [계호]
 * [獨孤] [독고]
 * [現代] [현대]
 * [稅務] [세무]
 * [紀傳] [기전]
 * [聰明] [총명]
 * [西江] [서강]
 * [解放] [해방]
 * [讀券] [독권]
 * [赤核] [적핵]
 * [野生] [야생]
 * [鎔范] [용범]
 * [陽刻] [양각]
 * [雲臺] [운대]
 * [韓日] [한일]
 * [鬪爭] [투쟁]
 * [黃鍾] [황종]
 * [人天] [仁川] [인천]
 * [全國] [戰國] [전국]
 * [孤山] [高山] [고산]
 * [家口] [架構] [가구]
 * [將相] [長上] [장상]
 * [小師] [素砂] [소사]
 * [正式] [程式] [정식]
 * [飛鳥] [鼻祖] [비조]
 * [勇] [庸] [茸] [踊] [용]
 * [假想] [假象] [嘉祥] [가상]
 * [元定] [元正] [遠征] [원정]
 * [大寶] [大輔] [대보] [대본]
 * [刺繡] [字數] [紫綬] [自修] [자수]
 * [代償] [大商] [大相] [大賞] [對象] [隊商] [대상]

Stemming
Speaker Notes: Below are 50 random samples of "stemming groups", which are words grouped together by trying to reduce them to their base form. In English, this groups words like hope, hopes, hoped, and hoping. These would be indicated as "hope: [hope] [hoped] [hopes] [hoping]".

Another example, from below: "빠져나오: [빠져나온] [빠져나왔]" means that searching for either of "빠져나온" or "빠져나왔" will find the other. Both are stored internally as the stemmed form "빠져나오". The stemmed form is usually close to the most basic form of a word, but does not need to be correct. The important question is whether it is good that searching for one form in [brackets] will find the others.

Note that some lists may include compounds, which can be broken up into parts. So, you might see something like "ball: [ball] [football] [baseball] [basketball]" because "football" could be stored internally as "football", "foot", and "ball"; "baseball" as "baseball", "base", and "ball"; etc.


 * 가르다: [가르다] [가르다호]
 * 갈라서: [갈라서] [갈라섰]
 * 귄: [귄] [르귄]
 * 끌어당기: [끌어당겨서] [끌어당겨져] [끌어당기] [끌어당긴다]
 * 눈부시: [눈부시] [눈부신]
 * 다스리: [다스려] [다스렸] [다스리] [다스린] [다스린다] [다스릴] [다스림]
 * 달리: [달려] [달려나간다] [달려라] [달려서] [달려야] [달려져] [달렸] [달리] [달린] [달린다] [달릴]
 * 덤벼들: [덤벼드] [덤벼들]
 * 독하: [독하] [독한]
 * 뒤흔들: [뒤흔드] [뒤흔든] [뒤흔들]
 * 들뜨: [들떠] [들뜨] [들뜸]
 * 링: [링] [바이링]
 * 매달리: [매달려] [매달려서] [매달렸] [매달리] [매달린] [매달릴]
 * 매사추세츠: [매사추세츠] [매사추세츠주]
 * 멋지: [멋져] [멋졌] [멋지] [멋진]
 * 몸부림치: [몸부림쳤] [몸부림치]
 * 무덥: [무더운] [무덥]
 * 무르만스크: [무르만스크] [무르만스크주]
 * 바덴뷔르템베르크: [바덴뷔르템베르크] [바덴뷔르템베르크주]
 * 부러뜨리: [부러뜨렸] [부러뜨리]
 * 불러일으키: [불러일으켰] [불러일으키] [불러일으킨] [불러일으킨다] [불러일으킬]
 * 빙: [리빙] [빙]
 * 빠뜨리: [빠뜨려] [빠뜨렸] [빠뜨리] [빠뜨릴]
 * 빠져나오: [빠져나온] [빠져나왔]
 * 뻗치: [뻗쳐] [뻗쳐서] [뻗치] [뻗친]
 * 사라: [사라] [사라코너]
 * 사우스다코타: [사우스다코타] [사우스다코타주]
 * 사우스캐롤라이나: [사우스캐롤라이나] [사우스캐롤라이나주]
 * 슐레스비히홀슈타인: [슐레스비히홀슈타인] [슐레스비히홀슈타인주]
 * 싹트: [싹터] [싹트] [싹튼] [싹튼다고]
 * 아디: [리아디] [아디]
 * 아키타: [아키타] [아키타현]
 * 애쓰: [애써] [애써도] [애썼] [애쓰] [애쓴다]
 * 야단치: [야단치] [야단친다]
 * 열리: [열려] [열려라] [열려야] [열려져] [열렸] [열렸으며] [열리] [열린] [열린다] [열린다는] [열릴]
 * 오래되: [오래된] [오래됨]
 * 우르: [우러] [우르]
 * 웨스턴오스트레일리아: [웨스턴오스트레일리아] [웨스턴오스트레일리아주]
 * 위안장: [위안장] [위안장강]
 * 유프라테스: [유프라테스] [유프라테스강]
 * 잘츠부르크: [잘츠부르크] [잘츠부르크주]
 * 잠기: [잠겨] [잠겨서] [잠겼] [잠기] [잠긴] [잠긴다]
 * 지내: [지내] [지낸] [지낸다] [지낼] [지냄] [지냈] [지냈으나] [지냈으며] [지어내]
 * 쫓기: [쫓겨] [쫓겨간] [쫓겼] [쫓기]
 * 추하: [추하] [추한]
 * 테네시: [테네시] [테네시주]
 * 펜: [비제이펜] [펜]
 * 후려치: [후려쳐] [후려쳤] [후려치]
 * 후쿠시마: [후쿠시마] [후쿠시마현]
 * 휴: [손휴] [휴]

Large Groups
Very large groups of tokens that are grouped together are sometimes a sign of something going wrong. Sometimes it just means there are a lot of common related forms or a lot of ambiguity. If there are a relatively small number of really bad stems—as might happen with a statistical model—then we can specifically filter the worst ones (like we do for Polish), or add other filters, say, based on part-of-speech tags.

Speaker Notes: Below are some of the largest "stemming groups", which are words grouped together by trying to reduce them to their base form. In English, this groups words like hope, hopes, hoped, and hoping. These would be indicated as "hope: [hope] [hoped] [hopes] [hoping]".

Note that some lists may include compounds, which can be broken up into parts. So, you might see something like "ball: [ball] [football] [baseball] [basketball]" because "football" could be stored internally as "football", "foot", and "ball"; "baseball" as "baseball", "base", and "ball"; etc.

If it is too difficult to understand why some tokens are grouped with others without context, I can try to provide context for these tokens by tracking the specific sentences they came from.

I've also included some notes from my own investigations for the first two. I'm only listing the top three from Wikipedia until we get a sense of what's going on.

[Note that these large groups are not necessarily indicative of the general overall performance of the Nori analyzer.]


 * 지: [之] [地] [志] [摯] [智] [池] [知] [至] [芷] [가까워져] [가까워졌] [가까워진] [가까워진다] [가까워질] [가려져] [가려졌] [가려진] [가려진다] [가려짐] [가르쳐진] [가벼워졌] [가벼워진] [가해져야] [가해졌] [가해진] [가해질] [갈라져서] [갈라졌] [갈라진다] [갈라짐] [감춰질] [강해져] [강해져서] [강해졌] [강해진다] [강해질] [갖춰져] [갖춰진] [건진] [걸려졌] [걸쳐져] [걸쳐졌] [걸쳐진] [겹쳐져] [고쳐졌] [곱해진] [구워진] [구해진다] [그려져] [그려졌] [그려졌으며] [그려진] [그려진다] [그려진다고] [그려질] [그리워질] [길들여진] [길러졌] [길러졌으며] [길러진다] [꺼려졌] [꺼져] [꺼짐] [꾸며져] [꾸며졌] [꾸며진] [끌어당겨져] [끼워져] [나눠져] [나빠져] [나빠져서] [나빠졌] [나빠진] [나빠진다] [나빠질] [나진] [남겨져] [남겨진] [남겨질] [내던져져] [내려진] [내려짐] [넘겨졌] [넘겨진] [넘겨진다] [놓여져] [놓여졌] [놓여진] [느껴졌] [느껴진] [느껴진다] [느껴질] [느려졌] [느려진다] [느려질] [늦춰졌] [늦춰진] [다뤄져] [다뤄져야] [다뤄졌] [다뤄진] [다뤄진다] [달궈진] [달라져야] [달라졌] [달라진다] [달라진다는] [달려져] [담겨져] [담겨진] [더럽혀진] [더렵혀져] [더워질] [더해져] [던져져] [던져진] [덧붙여진] [덮여져] [데워진] [돌려졌] [되돌려졌] [두꺼워져] [두꺼워진다] [둘러져] [드리워진] [들여진] [따라진] [뜨거워진] [뜨거워질] [뜸해졌] [로워질] [말려져] [맞춰져] [맞춰졌] [맞춰진] [맞춰진다] [맡겨져] [맡겨졌] [매겨진] [매겨진다] [매겨질] [멈춰진] [메워져] [모셔져] [모셔졌] [모셔진] [모셔진다] [모아졌] [모아진] [무거워졌] [무거워진] [뭉쳐진] [미뤄져] [미뤄졌] [바쳐진] [받아들여져] [받아들여졌] [받아들여진] [받아들여진다] [받쳐진] [발라져] [밝혀져] [밝혀졌] [밝혀졌으나] [밝혀졌으며] [밝혀졌으므로] [밝혀진] [밝혀진다] [밝혀질] [밝혀짐] [버려져] [버려졌] [버려진] [버려진다] [벌려진] [벌여졌] [벗겨졌] [벗겨진다] [보여졌] [보여진다] [봉해져] [봉해졌] [봉해졌으나] [봉해진] [부드러워졌] [부드러워진] [불려져] [불태워져] [불태워졌] [붙여져] [붙여졌] [붙여진다] [붙여질] [비워졌] [비춰졌] [빨라져서] [빨라졌] [빨라진] [빨라진다] [뿌려졌] [뿌려진] [뿌려진다] [산지] [살려진] [새겨져] [새겨졌] [새겨진] [새겨질] [새로워진] [세워져] [세워져야] [세워졌] [세워졌으며] [세워진] [세워진다] [세워질] [세워짐] [숨겨져] [숨겨져온] [숨겨진] [쉬워졌] [쉬워진다] [스러워진] [시끄러워졌] [심해져] [심해져서] [심해졌] [심해진] [심해진다는] [심해질] [쌓여져] [쌓여졌] [쌓여진] [써져] [써졌] [써진] [쓰여져] [쓰여져야] [쓰여졌] [쓰여진] [쓰여진다] [쓰여진다면] [씌여졌] [씌여진] [안지] [알려져] [알려져서] [알려져야] [알려졌] [알려졌었] [알려졌으나] [알려졌으며] [알려진] [알려진다] [알려질] [알려짐] [앞당겨진] [약해져] [약해졌] [약해진] [약해진다] [어두워진] [어두워질] [어려워졌] [어려워진] [어려워진다] [어려워질] [얹혀져] [얹혀진] [여겨져] [여겨졌] [여겨졌었] [여겨졌으나] [여겨진] [여겨진다] [여겨진다는] [여져] [여져서] [여져야] [여졌] [여진] [여진다] [여짐] [연지] [열려져] [예뻐질] [올려져] [올려졌] [올려진] [옮겨져] [옮겨져서] [옮겨졌] [옮겨진] [옮겨질] [옮겨짐] [워진] [이뤄졌] [읽혀진] [읽혀진다] [입혀져] [잊혀져] [잊혀졌] [잊혀졌으며] [잊혀진] [잊혀진다] [잘려졌] [저질러졌으며] [적혀져] [전해져] [전해져온다] [전해져왔] [전해졌] [전해진] [전해진다] 
[전해질] [전해짐] [정해져] [정해져야] [정해졌] [정해진] [정해진다] [제이지] [져] [져감] [져갔] [져나와] [져도] [져라] [져버린] [져본] [져서] [져야] [져준] [졌] [졌어도] [졌었] [졌으나] [졌으며] [졌을] [좁혀진] [지] [지기] [지면] [지워져] [지워졌] [지워질] [지질] [지켜졌] [지켜진다] [진] [진다] [진다고] [진다는] [진다면] [진단] [질] [질까] [질수록] [짐] [짜여져] [짜여졌] [짜여진] [찢겨진] [차가워진다] [채워져] [채워졌] [채워진] [채워진다] [처해진다] [취해졌] [취해진] [치러져서] [치러졌] [치러졌으며] [치러진] [치러진다] [치뤄졌] [치뤄진] [친해져] [친해졌] [친해진다] [칠해져] [칭해졌] [커져] [커져서] [커졌] [커진] [커진다는] [커질] [커질수록] [커짐] [태워져] [태워졌] [튕겨져] [파여져] [편해진다] [펼쳐져] [합쳐저] [합쳐져] [합쳐져서] [합쳐져야] [합쳐졌] [합쳐진] [해져] [해져갈] [해져갔] [해져서] [해져야] [해졌] [해졌으나] [해졌으며] [해진] [해진다] [해진다고] [해진다는] [해질] [해짐] [행해져] [행해졌으며] [행해진] [행해진다] [행해질] [행해짐] [흩뿌려져]
 * 지/ji has 6 etymologies and 6 meanings on English Wiktionary, so there's bound to be some ambiguity and some errors. Some of the Hanja that are converted, like "智", are listed in Wiktionary as just 지/ji, while others, like "知", have multiple Hangeul versions (in this case, 알/al or 지/ji), and it looks like Nori picked this one. In several other cases, especially where the token ends with -진, the part of speech tagger is marking 지 as an auxiliary verb, which is maybe another category of parts of speech we should filter.
 * 이: [伊] [彝] [異] [离] [갠] [거나] [거든요] [건] [건가] [건데] [건지] [건진] [걸까] [걸까요] [겁니다] [게] [겐지] [겠] [겨] [고] [곤] [구] [구나] [구마] [군] [군데] [그래서인지] [기] [긴] [긴고] [긴데] [까] [까진] [꺼] [꼬] [나라] [남인데] [냐] [냐고] [냐는] [냐며] [냐면] [네] [뇨] [누군가] [누군데] [누군지] [니] [니까] [니다] [다] [다고] [다냐] [다는] [다니] [다라고] [다란] [다만] [단가] [단데] [답] [답니다] [대해서] [더라] [더라도] [던가] [데] [덴] [덴지] [도록] [돈] [돼] [드니] [드라] [든] [든지] [디] [디요] [라] [라고] [라곤] [라기] [라나] [라네] [라뇨] [라는] [라는데] [라니] [라도] [라로] [라며] [라면] [라면서] [라서] [라야] [라오] [라요] [라우] [라지만] [락] [란] [란다] [랄] [람] [랍니다] [래] [래나] [래도] [랜] [러] [러니] [런] [려] [련] [로] [로군] [로다] [론가] [론지] [륜] [리] [마] [머] [먼] [며] [면] [면서] [면은] [모리] [몬데] [몬지] [무어] [무언가] [문지] [뭔가] [뭘까] [므로] [반데] [부턴가] [서] [선가] [선지] [세] [세요] [센가] [센터] [셔] [셨] [소] [쇼] [슈] [신] [신가] [신지] [십니까] [써서] [야] [얘깁니다] [어딘가] [어째서인지] [언고] [언젠간] [에선지] [에요] [엔지] [여] [여도] [여서] [여서라도] [여선] [여야] [열] [였] [였었] [였으나] [였으니] [였으리라] [였으며] [였으므로] [였을] [였을지라도] [였음에도] [였음을] [였음이] [예] [예요] [옌지] [온가] [온데] [올] [왠] [요] [요린데] [우] [원이었다는] [위해서] [유] [의해서] [이] [이고] [이택림] [인] [인가] [인가라는] [인기] [인다] [인데] [인데다] [인데요] [인들] [인듯] [인디] [인즉] [인지] [인진] [일] [일까] [일까요] [일리] [일수록] [일지] [일지라] [임] [입] [입니까] [입니다] [잊어버린다] [작인] [잔] [잖] [잖아] [저인] [전환] [정반대] [제] [제조업체인] [젠] [죠] [쥬] [지] [지만] [짼] [차인] [찬가] [키론] [키지] [테] [테니] [텐] [텐데] [틴디] [프론] [한건] [할지] [함인] [함인데] [합니다] [해서] [해서인지] [해줄테] [형산] [후에]
 * 이/i has 11 etymologies and 16 meanings (one of which has 37 sub-parts?!?) on English Wiktionary, so there's bound to be lots of ambiguity and some errors. Only 4 of 5 Hanja are in English Wiktionary, but all have 이/i as their Hangeul counterpart. For the rest, some are hard to track down: without any other context, the tokens shown here, like "답", don't generate 이 when analyzed.
 * Other examples: "반데" is analyzed as 반데 • 바 • 이, where 이 is marked as a "positive designator". "였을지라도" is analyzed as a series of "verbal endings" with 이 as a "positive designator" in the middle. All the verbal endings are dropped, which is kind of weird.
 * 하: [下] [夏] [河] [거든] [겠다] [겠다고] [겠다는] [고마워했] [기뻐할] [기뻐했] [꺼려한] [더니] [두려워했] [따라한] [래라] [렸으나] [미워할] [부러워한다] [스러워한] [스러워한다] [스러워했] [슬퍼한다] [시고] [시네] [시는] [시던] [아파했] [야] [야겠다] [열] [줘야] [지켜야] [치] [칠] [카] [케한] [키지] [하] [한] [한건] [한걸] [한다] [한다거나] [한다고] [한다는] [한다는데] [한다던가] [한다던지] [한다든지] [한다며] [한다면] [한다면서] [한다지만] [한데] [한데다가] [한들] [한지] [한지라] [할] [할까] [할까요] [할라] [할려고] [할려면] [할수록] [할지] [할지라도] [함] [함인] [합니다] [합니다만] [합시다] [해] [해가] [해댄다] [해도] [해라] [해버렸] [해서] [해서인지] [해선] [해야] [해야겠네] [해옴] [해와] [해왔] [해왔었] [해왔으며] [해요] [해준다] [해준다고] [해준다면] [해줄래] [해줌] [해줘] [해한다] [해했] [해했었] [했] [했어도] [했었] [했었으나] [했었으며] [했으나] [했으니] [했으며] [했으므로] [헀] [허] [헤]
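
If we do decide to filter auxiliary verbs by part of speech, one way it might look is an extra stop tag on Nori's part-of-speech filter. The sketch below builds the `analysis` portion of the index settings as a Python dict. Assumptions on my part: "VX" is the tag Nori uses for auxiliary verbs, the filter and analyzer names ("nori_pos_extra", "korean_text") are hypothetical, and the base tag list passed in below is only an illustrative subset, not the real default list. As far as I understand, setting `stoptags` replaces the defaults rather than extending them, so a real config would need to start from the full default list.

```python
import json


def nori_settings(base_stoptags, extra_stoptags):
    """Build analysis settings with extra POS stop tags (a sketch, not a tested config)."""
    # Keep the base tags and append any extra tags that are not already present.
    stoptags = list(base_stoptags) + [
        tag for tag in extra_stoptags if tag not in base_stoptags
    ]
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; the type is Nori's POS filter.
                "nori_pos_extra": {
                    "type": "nori_part_of_speech",
                    "stoptags": stoptags,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "korean_text": {
                    "type": "custom",
                    "tokenizer": "nori_tokenizer",
                    "filter": ["nori_pos_extra", "nori_readingform", "lowercase"],
                }
            },
        }
    }


# ["E", "J"] is an illustrative subset only; "VX" is the assumed auxiliary-verb tag.
settings = nori_settings(["E", "J"], ["VX"])
print(json.dumps(settings, indent=2))
```

Whether dropping these tokens actually shrinks the 지 group without hurting recall elsewhere would have to be checked by re-running the analysis.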

Next Steps

 * Get speaker review of the samples and examples above. (IN PROGRESS see talk page)
 * If that goes well, then:
   * Determine whether we need to change the config for plain field and the completion suggester.
   * Implement the configs in AnalysisConfigBuilder and add tests.
   * Figure out how re-indexing Korean with a very different analyzer interacts with LTR
   * Re-index Korean-language wikis
 * If speaker review is unclear:
   * Consider setting up an instance with Nori on RelForge for people to test
 * If speaker review is generally negative:
   * Unpack CJK for Korean and add middle-dot-to-space conversion, strip soft hyphens and zero-width non-joiners.
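
The character-level fixes in the fallback plan (middle dot to space, stripping soft hyphens and zero-width non-joiners) are simple enough to sketch in Python. This is only an illustration of the intended text transformation; in Elasticsearch it would presumably be done with a character filter in the analysis chain, and the exact set of code points to handle (for example, arae-a used as a middle dot) is an open question not covered here.

```python
MIDDLE_DOT = "\u00b7"   # ·
SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint
ZWNJ = "\u200c"         # zero-width non-joiner

# Translation table: middle dot becomes a space; the invisible characters are deleted.
CLEANUP = str.maketrans({MIDDLE_DOT: " ", SOFT_HYPHEN: None, ZWNJ: None})


def clean(text):
    """Apply the middle-dot and invisible-character fixes to a string."""
    return text.translate(CLEANUP)


print(clean("서울\u00b7부산"))  # middle dot replaced by a space
print(clean("soft\u00adhyphen and zero\u200cwidth"))
```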


 * CJK follow up:
   * test CJK with Japanese mixed-script tokens, including middle dots
   * look for easily-fixed anomalies with other long Japanese tokens
   * Look at soft hyphens and zero-width non-joiners
   * possibly unpack CJK for Japanese and add fixes
   * possibly consolidate fixes for CJK across Japanese and Korean; may need to test and include Chinese, even though we don't use CJK for Chinese.


 * Open upstream tickets
   * CJK bugs: (DONE)
     * mixed-script tokens are treated as one long token
     * leaves soft hyphens in place
     * leaves bidi markers in place
     * leaves zero-width non-joiner in place
     * eats encircled numbers (①②③), "dingbat" circled numbers (➀➁➂), parenthesized numbers (⑴⑵⑶), fractions (¼ ⅓ ⅜ ½ ⅔ ¾), superscript numbers (¹²³), and subscript numbers (₁₂₃)
   * Nori bugs: (DONE: Elasticsearch & Lucene)
     * arae-a used as middle dot creates one long token
     * empty token after "그레이맨"
     * tokens split on different "types", including IPA extensions, Extended Greek, diacritics, and apostrophes
     * splits on soft hyphens
     * splits on zero-width non-joiner
     * "튜토리얼" gets tokenized with an extra space at the end