Topic on User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Stemming

2 comments • 21:36, 9 October 2018 5 years ago

-revi (talkcontribs)

가르다

Original: 가르다: [가르다] [가르다호]
Have no idea where 가르다호 is coming from.

갈라서

Fine

귄

Original: 귄: [귄] [르귄]
Have no idea where 르귄 is coming from.

끌어당기

LGTM

다스리

LGTM

달리

LGTM

덤벼들

LGTM

독하

LGTM

뒤흔들

LGTM

들뜨

LGTM

링

Original: 링: [링] [바이링]
Have no idea where 바이링 is coming from.

매달리

LGTM

매사추세츠

LGTM

멋지

LGTM

몸부림치

LGTM

무덥

LGTM

무르만스크

LGTM

바덴뷔르템베르크

LGTM

부러뜨리

LGTM

불러일으키

LGTM

빙

Original: 빙: [리빙] [빙]
리빙 is a direct transliteration of "Living". Unrelated.

빠뜨리

LGTM

빠져나오

LGTM

사라

Original: 사라: [사라] [사라코너]
사라코너 is a name: 사라 코너. Unrelated.

사우스다코타

LGTM

사우스캐롤라이나

LGTM

슐레스비히홀슈타인

LGTM

리아디

Original: 리아디: [리아디] [아디]
Looks unrelated.

아키타

LGTM

애쓰

LGTM

야단치

LGTM

열리

LGTM

오래되

LGTM

우르

Original: 우르: [우러] [우르]
If it meant to say 울어, it should've been 울으/울어.

웨스턴오스트레일리아

LGTM

위안장

LGTM

유프라테스

LGTM

잘츠부르크

LGTM

잠기

LGTM

지내

LGTM

쫓기

LGTM

추하

LGTM

테네시

LGTM

펜

Original: 펜: [비제이펜] [펜]
Does not look related.

후려치

LGTM

후쿠시마

LGTM

휴

Original: 휴: [손휴] [휴]
손휴 looks like a name, unrelated.

07:38, 9 October 2018

TJones (WMF) (talkcontribs)

Sorry, the stemming list includes some compounds, which are divided into parts and will be searchable by any of the parts, though exact matches are best. So, the compound cases are fine, assuming the tokenization (breaking into words) is reasonable. Because there's a parser involved, context can change the way characters are treated, which adds to the complexity.

르귄 / 르 / 귄: a compound, with 르 and 귄 tagged as proper nouns.
빙 / 리빙: in isolation, 리빙 comes out as a single token. There are three instances of 리빙 in my Wikipedia corpus, and two of them are treated correctly. However, in "태양의 아이들 (2011, 웅진리빙하우스) ISBN 9788901136059", it gets indexed as a compound. Probably still a parsing error.
사라 / 사라코너: yep, I see it. But for some reason the name 사라코너 is also being treated as a compound [사라코너 • 사라 • 코너].
리아디 / 아디: again, 리아디 is treated as a compound, and the part 아디 is indexed under the whole.
우러 / 우르: 우러 is interpreted as 우르/VV(Verb)+어/E(Verbal endings), so it gets grouped with other instances of 우르.
비제이펜: interpreted as a compound, all proper nouns: "비/NNP(Proper Noun)+제이/NNP(Proper Noun)+펜/NNP(Proper Noun)", and so grouped under each of the parts.
손휴: again, proper nouns.
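
To make the grouping concrete, here is a rough Python sketch of how a stemming report like the one reviewed above could end up listing a compound under each of its parts. The analyses are typed in by hand from the explanations above (they are not analyzer output), the function name is made up for illustration, and dropping the 어 ending from 우러 is an assumption about the filter chain.

```python
from collections import defaultdict

# Hand-written analyses based on the examples in this thread; a real report
# would take these from the analyzer. Each surface form maps to the tokens
# assumed to be indexed for it (whole compound plus its parts).
ANALYSES = {
    "펜": ["펜"],
    "비제이펜": ["비제이펜", "비", "제이", "펜"],
    "우르": ["우르"],
    "우러": ["우르"],  # 우르/VV + 어/E, with the ending assumed to be dropped
    "사라": ["사라"],
    "사라코너": ["사라코너", "사라", "코너"],
}


def group_by_token(analyses):
    """Collect, for every indexed token, the surface forms that produced it."""
    groups = defaultdict(set)
    for surface, tokens in analyses.items():
        for token in tokens:
            groups[token].add(surface)
    # Sort for stable, readable output.
    return {token: sorted(forms) for token, forms in groups.items()}


groups = group_by_token(ANALYSES)
for token in ("펜", "우르", "사라"):
    print(token, groups[token])
```

Under these assumptions, 비제이펜 shows up in the group for 펜 simply because 펜 is one of its indexed parts, which is the pattern behind most of the "unrelated" entries flagged above.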

This brings up the possibility that we should not index compounds by their parts. The default setting throws away the original compound and only keeps the parts. I thought keeping the original would increase precision when you know exactly what you are looking for. Not keeping the parts would get rid of some of these errors, but would also make it harder to match when you have only part of a compound. For example, right now, a many-part compound can match a shorter compound that is part of it: a four-part compound, ABCD, can match the three-part compound, ABC, because A, B, and C are all indexed separately.
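
As a rough illustration of the trade-off, the sketch below shows the three values of the Nori tokenizer's `decompound_mode` option (`none`, `discard`, `mixed`, with `discard` being the default) and a toy model of the ABCD / ABC example. The settings dict is only an example of what such a configuration could look like, not the configuration actually used here; the names `nori_mixed` and `korean_mixed` are made up, and the matching function is a simplification of real query behavior.

```python
# Example index settings selecting a decompound mode for nori_tokenizer.
# "nori_mixed" and "korean_mixed" are hypothetical names for illustration.
NORI_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "nori_mixed": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed",  # keep the compound and its parts
                }
            },
            "analyzer": {
                "korean_mixed": {"type": "custom", "tokenizer": "nori_mixed"}
            },
        }
    }
}


def index_tokens(whole, parts, mode):
    """Toy model of the tokens emitted for one compound in each mode."""
    if mode == "none":  # compound kept whole, no parts
        return {whole}
    if mode == "discard":  # parts only, original compound thrown away
        return set(parts)
    if mode == "mixed":  # original compound plus its parts
        return {whole, *parts}
    raise ValueError(f"unknown decompound mode: {mode}")


def matches(query, doc, mode):
    """Simplified match: the whole query form hits, or all of its parts do."""
    q_whole, q_parts = query
    doc_tokens = index_tokens(doc[0], doc[1], mode)
    whole_hit = mode != "discard" and q_whole in doc_tokens
    parts_hit = mode != "none" and all(p in doc_tokens for p in q_parts)
    return whole_hit or parts_hit


DOC = ("ABCD", ["A", "B", "C", "D"])
QUERY = ("ABC", ["A", "B", "C"])
results = {mode: matches(QUERY, DOC, mode) for mode in ("none", "discard", "mixed")}
print(results)
```

In this toy model, ABC finds ABCD whenever the parts are indexed (`discard` or `mixed`) and misses it when they are not (`none`), which is the recall that would be given up by no longer indexing compounds by their parts.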

Based on the general review of Tokenization and Compounds, though, I think we are okay, with more correct tokenizations than errors.

Thanks again, revi, for all the help! Any more comments on anything would be welcome!

21:36, 9 October 2018
