User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Speaker Review -> Tokenization and Compounds -> 10 random sentences

Bmansurov (WMF) (talkcontribs)

I've checked the tokenization of 10 random sentences. The results look good.

input 김대중 대통령은 2003년까지 학급당 학생수를 35명 이하로 감축한다는내용의 '7.20 교육여건 개선계획' 을 발표했다.
tokens [김대중] — [대통령] — [2003] — [년] — [학급] — [학생] — [수] — [35] — [명] — [이하] — [감축] — [내용] — [7] — [20] — [교육] — [여건] — [개선] — [계획] — [발표]
my tokens [김대중 • 김 • 대중] (person's name which consists of the last name and the first name) — [대통령] — [2003] — [년] — [학급] — [학생] — [수] — [35] — [명] — [이하] — [감축] — [내용] — [7] — [20] — [교육] — [여건] — [개선] — [계획] — [발표]
input 모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원
tokens [모델] — [mmx] — [sse] — [sse] — [2] — [sse] — [3] — [ssse] — [3] — [sse] — [4] — [1] — [sse] — [4] — [2] — [향상] — [인텔] — [스피드스텝 • 스피드 • 스텝] — [기술] — [eist] — [em] — [64] — [t] — [extended] — [memory] — [64] — [technology] — [xd] — [비트] — [가상] — [기술] — [스마트] — [캐시] — [인텔] — [터보] — [부스트] — [지원]
my tokens [모든] (missing word) — [모델] — [mmx] — [sse] — [sse] — [2] — [sse] — [3] — [ssse] — [3] — [sse] — [4] — [1] — [sse] — [4] — [2] — [향상] — [인텔] — [스피드스텝 • 스피드 • 스텝] — [기술] — [eist] — [em] — [64] — [t] — [extended] — [memory] — [64] — [technology] — [xd] — [비트] — [가상화] (missing ending; with the ending the word means "virtualization", without it something different) — [기술] — [스마트] — [캐시] — [인텔] — [터보] — [부스트] — [지원]
input 다 자라면 몸길이는 61 cm, 몸무게는 1.4~2.7 kg 정도가 된다.
tokens [자라] — [몸길이 • 몸 • 길이] — [61] — [cm] — [몸무게 • 몸 • 무게] — [1] — [4] — [2] — [7] — [kg] — [정도] — [된다 • 되]
my tokens [다] (missing word) — [자라] — [몸길이 • 몸 • 길이] — [61] — [cm] — [몸무게 • 몸 • 무게] — [1] — [4] — [2] — [7] — [kg] — [정도] — [된다 • 되]
input 7월 14일에는 태항산에 있던 조선청년연합회 소속 병사들이 하북성에 도착하자, 당일 하북성 섭현에서 김두봉, 박효삼 등과 함께 조선의용군을 발족시키고 총사령관에 취임했다.
tokens [7] — [월] — [14] — [일] — [태항] — [산] — [있] — [조선] — [청년] — [연합회 • 연합 • 회] — [소속] — [병사] — [하북성 • 하북 • 성] — [도착] — [당일] — [하북성 • 하북 • 성] — [섭] — [현] — [김두봉] — [박] — [효] — [삼] — [등] — [조선] — [용군] — [발족] — [총사령관 • 총 • 사령 • 관] — [취임]
my tokens [7] — [월] — [14] — [일] — [태항] — [산] — [있] — [조선] — [청년] — [연합회 • 연합 • 회] — [소속] — [병사] — [하북성 • 하북 • 성] — [도착] — [당일] — [하북성 • 하북 • 성] — [섭현] (should be one word) — [김두봉 • 김 • 두봉] (person's last and first name) — [박효삼 • 박 • 효삼] (person's name) — [등] — [함께] (missing word) — [조선] — [용군] — [발족] — [시키] (missing word) — [총사령관 • 총 • 사령 • 관] — [취임]
input 연합감리교회의 조직은 미국 이외에도 캐나다와 유럽, 아프리카와 필리핀의 교회들을 포함한다.
tokens [연합] — [감리] — [교회] — [조직] — [미국] — [이외] — [캐나다] — [유럽] — [아프리카] — [필리핀] — [교회] — [포함]
my tokens same as above
input 2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.
tokens [2006] — [년] — [중화] — [인민공화국 • 인민 • 공화국] — [단백질 • 단백 • 질] — [함량] — [속여서 • 속이] — [미국] — [수출] — [가축] — [사료] — [원료] — [인 • 이] — [밀] — [글루텐] — [등] — [조단] — [백] — [함량] — [높] — [사료] — [원료] — [단백질 • 단백 • 질] — [양] — [과장] — [부풀리] — [데] — [이용]
my tokens [2006] — [년] — [중화] — [인민공화국 • 인민 • 공화국] — [단백질 • 단백 • 질] — [함량] — [속이] (first form is just 속이+어서) — [미국] — [수출] — [가축] — [사료] — [원료] — (removed [인 • 이] as it's a noun maker and doesn't have a meaning by itself) — [밀] — [글루텐] — [등] — [조단] — [백] — [함량] — [높] — [사료] — [원료] — [단백질 • 단백 • 질] — [양] — [과장] — [부풀] (removed ending) — [데] — [이용]
input 일본 요리는 쇼군 치하 동안에 엘리트주의를 없애려 했던 중세 시대가 출현하며 변화하였다.
tokens [일본] — [요리] — [쇼군] — [치하] — [동안] — [엘리트주의 • 엘리트 • 주의] — [없애] — [했 • 하] — [중세] — [시대] — [출현] — [변화]
my tokens same as above
input 『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.
tokens [산릉도감 • 산릉 • 도감] — [궤] — [등] — [문헌] — [의하] — [세종] — [영릉] — [영릉] — [명종] — [강릉] — [강릉] — [인조] — [장릉] — [장릉] — [효종] — [영릉] — [寧] — [릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [이] — [후대] — [맞배지붕 • 맞배 • 지붕] — [교체] — [현재] — [숭릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [남] — [있]
my tokens [산릉도감 • 산릉 • 도감] — [궤] — [등] — [문헌] — [의하] — [세종] — [영릉] — [영릉] — [명종] — [강릉] — [강릉] — [인조] — [장릉] — [장릉] — [효종] — [영릉] — [영릉](hanja should be correctly detected) — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — (removed [이]) — [후대] — [모두] (was missing) — [맞배지붕 • 맞배 • 지붕] — [교체] — [현재] — [숭릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [남] — [있]
input 1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.
tokens [1934] — [년] — [파울] — [폰] — [힌덴부르크] — [대통령] — [사망] — [후] — [히틀러] — [수상] — [대통령] — [직] — [겸무] — [국방] — [국] — [최고] — [지휘] — [손] — [넣] — [되]
my tokens [1934] — [년] — [파울] — [폰] — [힌덴부르크] — [대통령] — [사망] — [후] — [히틀러] — [수상] — [대통령] — [직] — [겸무] — [국방] — [국] — [최고] — [지휘권 • 지휘 • 권] (compound word) — [손] — [넣] — [되]
input 부산지방법원와 서울형사지방법원 등에서 부장판사를 하다가 부산지방법원, 제주지방법원, 춘천지방법원, 광주고등법원에서 법원장을 역임하였으며 이후 공직에서 물러나 변호사 활동을 했다.
tokens [부산] — [지방] — [법원] — [서울] — [형사] — [지방] — [법원] — [등] — [부장] — [판사] — [하] — [부산] — [지방] — [법원] — [제주] — [지방] — [법원] — [춘천] — [지방] — [법원] — [광주] — [고등] — [법원] — [법원장 • 법원 • 장] — [역임] — [이후] — [공직] — [물러나 • 물러나] — [변호사 • 변호 • 사] — [활동] — [했 • 하]
my tokens [부산] — [지방] — [법원] — [서울] — [형사] — [지방] — [법원] — [등] — [부장] — [판사] — [하] — [부산] — [지방] — [법원] — [제주] — [지방] — [법원] — [춘천] — [지방] — [법원] — [광주] — [고등] — [법원] — [법원장 • 법원 • 장] — [역임] — [이후] — [공직] — [물러나 ] (removed duplicate) — [변호사 • 변호 • 사] — [활동] — [했 • 하]
-revi (talkcontribs)

Input: 모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원

Mine: 모든/모델/MMX/SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2/향상/인텔/스피드/스텝/기술/EIST/EM64T/Extended/Memory/64/Technology/XD/비트/가상화/기술/스마트/캐시/인텔/터보/부스트/지원

Diff: I have 모든.

Input: 2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.

Mine: 2006/년/중화/인민/공화국/단백질/함량/함량/속여서(속이)/미국/수출/가축/사료/원류/밀/글루텐/조단백/함량/높은/사료/원료/단백질/양/과장/부풀리는(부풀리)/이용

Diff: I did not split 조단백.

Input: 『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.

Mine: 산릉/도감/의궤/문헌/의하/세종/영릉/英陵(translates to 영릉)/명종/강릉/康陵(translates to 강릉)/인조/장릉/長陵(translates to 장릉)/효종/영릉/寧陵(translates to 영릉)/정자각/팔작지붕(can be split to 팔작/지붕)/이/후대/모두/맞배지붕(can be split to 맞배/지붕)/교체/현재/숭릉/정자각/팔작지붕/남아(남).

Diff: 의궤 (ko:의궤) is its own word. Should not omit 의 here.

Input: 1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.

Mine: 1934/년/파울/폰/힌덴부르크/대통령/사망/후/히틀러/수상/대통령/직/겸무/국방/국/최고/지휘/권/손/넣/되

Diff: 권 means right. Should not be omitted.

Otherwise LGTM.

-revi (talkcontribs)

Most of my points seem to be covered below, but two still stand: 조단백 and 의궤. I don't know how 조단백 was formed (I'm not good at biology or chemistry), but IMO it's obviously not 조단/백; maybe 조/단백?

TJones (WMF) (talkcontribs)

Re: 조단백—it looks like 백 was interpreted as a number (Wiktionary says 100) and 조단 was just kind of left over as a "general noun". Is it a rare or very technical term? It gets only 7 hits on Korean Wikipedia at the moment. It's not surprising if some rare scientific terms are processed oddly. Fortunately, splitting it up incorrectly won't keep it from being found (it may just increase irrelevant results—but scoring should bring the good ones, including exact matches, to the top).

Re 의궤: yeah, that's an error. It's reading 의 as an "ending particle" which then gets filtered, and 궤 as a "general noun". (I'm starting to think "general noun" means "some leftover characters".) There's something about the phrase "산릉도감의궤" that is causing it, because 의궤 by itself comes out as one word.

-revi (talkcontribs)

I'm not a biology expert, but it does sound like a technical term. 단백 is the protein, so I guess 조 is either something to be omitted or a separate word.

Garam (talkcontribs)
TJones (WMF) (talkcontribs)

Thanks a lot Baha!

I forgot to mention that some words or endings may be intentionally missing from the tokenization. Nori also removes words/characters/jamo that it determines are in the categories verbal endings, interjections, ending particles, general adverbs, conjunctive adverbs, determiners, prefixes, adjective suffixes, noun suffixes, verb suffixes, and various kinds of punctuation.
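
To make the effect of that filtering concrete, here is a minimal Python sketch of a part-of-speech stop filter. It is not Nori itself: the `filter_by_pos` helper and the hand-tagged sample are assumptions for illustration, and only the tag names (MM for determiners, MAG for general adverbs, XSN for noun suffixes, and so on) follow the Nori tag set. Punctuation tags are left out for brevity.

```python
# Illustrative only: a toy part-of-speech stop filter, not Nori's implementation.
# Tags for the filtered categories listed above (punctuation tags omitted).
STOP_TAGS = {
    "E",    # verbal endings
    "IC",   # interjections
    "J",    # ending particles
    "MAG",  # general adverbs
    "MAJ",  # conjunctive adverbs
    "MM",   # determiners
    "XPN",  # prefixes
    "XSA",  # adjective suffixes
    "XSN",  # noun suffixes
    "XSV",  # verb suffixes
}


def filter_by_pos(tagged_tokens, stop_tags=STOP_TAGS):
    """Keep only the surface forms whose tag is not in stop_tags."""
    return [surface for surface, tag in tagged_tokens if tag not in stop_tags]


# Hand-tagged fragments (an assumption) based on cases discussed in this thread.
sample = [
    ("모든", "MM"),   # determiner, filtered
    ("모델", "NNG"),  # general noun, kept
    ("가상", "NNG"),  # general noun, kept
    ("화", "XSN"),    # noun suffix, filtered
    ("함께", "MAG"),  # general adverb, filtered
]

print(filter_by_pos(sample))
```

This is why words such as 모든 and 함께 are absent from the token lists even though the tokenizer recognized them.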

I can re-do the tokenization without the part-of-speech filtering, if you think that would help.

For now, I'll just look into the specific ones that you mentioned are missing.

  • 김대중 대통령은 2003년까지 학급당 학생수를 35명 이하로 감축한다는내용의 '7.20 교육여건 개선계획' 을 발표했다.
    • I'm not terribly surprised it didn't split the name 김대중 correctly, though if it was going to know about any Korean surname, it seems like it would know 김. It did recognize it as a proper noun, though. Are there any other names that are split up like you propose?
  • 모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원.
    • 모든 is filtered as a determiner; based on the English Wiktionary entry, that seems reasonable.
    • 가상화: it is pulling off 화 as a noun suffix.
  • 다 자라면 몸길이는 61 cm, 몸무게는 1.4~2.7 kg 정도가 된다.
    • 다 is filtered as a general adverb.
  • 7월 14일에는 태항산에 있던 조선청년연합회 소속 병사들이 하북성에 도착하자, 당일 하북성 섭현에서 김두봉, 박효삼 등과 함께 조선의용군을 발족시키고 총사령관에 취임했다.
    • 섭현 is split as two "general nouns", so that's an error.
    • 박효삼 is split with 박 as a proper noun, 효 as a general noun, and 삼 as a numeral, which Wiktionary agrees with. Recognizing ambiguous names is hard, but this is an error. It shouldn't prevent search matches, though it will allow potential false matches.
    • 함께 is filtered as a "general adverb".
    • 시키 is filtered as a verb suffix.
  • 2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.
    • 속여서 seems to be treated as a compound and is actually tokenized as [속여서 • 속이 • 어서], but 어서 is dropped. As long as 속이 is the correct stemmed form and is present, it's okay. Though I've noticed this happening elsewhere, and I think it may be a bug. If it was just [속이 • 어서], then 어서 would be dropped as a verbal ending and we'd get the desired result.
    • 인 and 이 are tagged as "positive designators"; we could filter those if this comes up a lot.
    • 부풀리, looks like a stemming error, as it is just tagged as a verb.
  • 『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.
    • 寧陵/영릉 — looks like it detected 陵 as Hanja, but not both characters together. Weird.
    • 팔작지붕/이 — another "positive designator".
    • 모두 — another victim of the adverb filter.
  • 1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.
    • 지휘권 — looks like 권 was parsed as a noun suffix, and then filtered.
  • 부산지방법원와 서울형사지방법원 등에서 부장판사를 하다가 부산지방법원, 제주지방법원, 춘천지방법원, 광주고등법원에서 법원장을 역임하였으며 이후 공직에서 물러나 변호사 활동을 했다.
    • 물러나 somehow gets parsed as 물러나:"물러나/Verb+아/Verbal endings" • 물러나:Verb • 아:Verbal ending (the verbal ending gets dropped). It's weird, but okay in terms of search that it gets duplicated.


Thanks again for all the detail!

Sounds like I might need to ask upstream about verbs getting treated as compounds if that is a more widespread problem, and we might want to consider filtering the "positive designator" part of speech, but I'd have to look at other instances to make sure they are mostly as useless as these.

Does filtering out the adverbs make sense, by the way?

Bmansurov (WMF) (talkcontribs)

> I can re-do the tokenization without the part-of-speech filtering, if you think that would help.

Given your explanation above, I don't think we should re-do the tokenization.

> I'm not terribly surprised it didn't split the name 김대중 correctly, though if it was going to know about any Korean surname, it seems like it would know 김. It did recognize it as a proper noun, though. Are there any other names that are split up like you propose?

I think, in general, Korean names are written like 김대중, but sometimes a person's title may follow the last name. For example, 김 대통령 (President Kim). Sometimes the first name appears by itself (usually in colloquial speech). That's why, in my view, any name may be split as above.

> 시키 is filtered as a verb suffix

My bad, you're correct.

> 부풀리, looks like a stemming error, as it is just tagged as a verb.

I may have made a mistake here. I thought we should take the stem from 부풀다 (become swollen) and not from 부풀리다 (make swollen).

> Does filtering out the adverbs make sense, by the way?

Yes, it does.

TJones (WMF) (talkcontribs)

Okay, so everything is looking pretty good! A tolerable number of minor mistakes, and no absurd mistakes, so far.

I wish I had a better answer on the names and titles. I'll keep an eye out for problems related to that.

Reply to "Speaker Review -> Tokenization and Compounds -> 10 random sentences"
-revi (talkcontribs)
가르다
  • Original: 가르다: [가르다] [가르다호]
  • Have no idea where 가르다호 is coming from.
갈라서
  • Fine
귄
  • Original: 귄: [귄] [르귄]
  • Have no idea where 르귄 is coming from.
끌어당기
  • LGTM
다스리
  • LGTM
달리
  • LGTM
덤벼들
  • LGTM
독하
  • LGTM
뒤흔들
  • LGTM
들뜨
  • LGTM
링
  • Original: 링: [링] [바이링]
  • Have no idea where 바이링 is coming from.
매달리
  • LGTM
매사추세츠
  • LGTM
멋지
  • LGTM
몸부림치
  • LGTM
무덥
  • LGTM
무르만스크
  • LGTM
바덴뷔르템베르크
  • LGTM
부러뜨리
  • LGTM
불러일으키
  • LGTM
빙
  • Original: 빙: [리빙] [빙]
  • 리빙 is direct translation of "Living". Unrelated.
빠뜨리
  • LGTM
빠져나오
  • LGTM
사라
  • Original: 사라: [사라] [사라코너]
  • 사라코너 is a name: 사라 코너. Unrelated.
사우스다코타
  • LGTM
사우스캐롤라이나
  • LGTM
슐레스비히홀슈타인
  • LGTM
리아디
  • Original: 리아디: [리아디] [아디]
  • Looks unrelated.
아키타
  • LGTM
애쓰
  • LGTM
야단치
  • LGTM
열리
  • LGTM
오래되
  • LGTM
우르
  • Original: 우르: [우러] [우르]
  • If it meant to say 울어, it should have been 울으/울어.
웨스턴오스트레일리아
  • LGTM
위안장
  • LGTM
유프라테스
  • LGTM
잘츠부르크
  • LGTM
잠기
  • LGTM
지내
  • LGTM
쫓기
  • LGTM
추하
  • LGTM
테네시
  • LGTM
펜
  • Original: 펜: [비제이펜] [펜]
  • Does not look related.
후려치
  • LGTM
후쿠시마
  • LGTM
휴
  • Original: 휴: [손휴] [휴]
  • 손휴 looks like a name, unrelated.
TJones (WMF) (talkcontribs)

Sorry, the stemming list includes some compounds, which are divided into parts and will be searchable by any of the parts, though exact matches are best. So, the compound cases are fine, assuming the tokenization (breaking into words) is reasonable. Because there's a parser involved, context can change the way characters are treated, which adds to the complexity.

  • 르귄 / 르 / 귄, a compound, with 르 and 귄 tagged as proper nouns.
  • 빙 / 리빙—in isolation, 리빙 comes out as a single token. There are three instances of 리빙 in my Wikipedia corpus, and two of them are treated correctly. However, in "태양의 아이들 (2011, 웅진리빙하우스) ISBN 9788901136059", it gets indexed as a compound. Probably still a parsing error.
  • 사라 / 사라코너—yep, I see it. But for some reason the name 사라코너 is also being treated as a compound [사라코너 • 사라 • 코너].
  • 리아디 / 아디: again, 리아디 is treated as a compound, and the part 아디 is indexed under the whole.
  • 우러 is interpreted as 우르/VV(Verb)+어/E(Verbal endings), so it gets grouped with other instances of 우르.
  • 비제이펜—interpreted as a compound, all proper nouns: "비/NNP(Proper Noun)+제이/NNP(Proper Noun)+펜/NNP(Proper Noun)", and so grouped under each of the parts.
  • 손휴—again, proper nouns.
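
The grouping behaviour in these cases can be sketched in a few lines of Python. This is only an illustration of why a compound appears in the group of each of its parts; the `group_by_indexed_term` helper is hypothetical, and the assumption that every part also occurs on its own is mine. The compound analyses are the ones reported above.

```python
from collections import defaultdict

# Compound analyses taken from the cases above: whole form -> parts.
COMPOUNDS = {
    "르귄": ["르", "귄"],
    "사라코너": ["사라", "코너"],
    "비제이펜": ["비", "제이", "펜"],
}


def group_by_indexed_term(compounds):
    """Map each indexed term (whole compound or part) to the surface forms that produce it."""
    groups = defaultdict(set)
    for whole, parts in compounds.items():
        groups[whole].add(whole)
        for part in parts:
            groups[part].add(whole)  # the compound shows up under each part
            groups[part].add(part)   # assumption: the part also occurs on its own
    return groups


groups = group_by_indexed_term(COMPOUNDS)
print(sorted(groups["펜"]))
```

Under this model the group for 펜 contains both 펜 and 비제이펜, which is the pattern seen in the stemming list.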

This brings up the possibility that we should not index compounds by their parts. The default setting throws away the original compound and only keeps the parts. I thought keeping the original would increase precision when you know exactly what you are looking for. Not keeping the parts would get rid of some of these errors, but also make it harder to match when you have part of a compound. For example, right now, many-part compounds can match a shorter compound that is part of it. So a four-part compound, ABCD, can match the three-part compound, ABC, because A, B, and C are all indexed separately.
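
As a rough sketch of that trade-off, the snippet below models three compound-handling modes: keep only the whole compound, keep only the parts, or keep both. The mode names mirror the `none`, `discard`, and `mixed` values of the Nori tokenizer's `decompound_mode` option, but the `index_terms` function is a hypothetical stand-in rather than the tokenizer's actual logic, and ABCD/ABC are the placeholder compounds from the paragraph above.

```python
def index_terms(compound, parts, mode):
    """Return the terms indexed for one compound under a given decompound mode."""
    if mode == "none":      # keep only the whole compound
        return {compound}
    if mode == "discard":   # keep only the parts (the default described above)
        return set(parts)
    if mode == "mixed":     # keep the whole compound and its parts
        return {compound, *parts}
    raise ValueError(f"unknown mode: {mode}")


# Placeholder compounds: a four-part document term and a three-part query term.
doc = ("ABCD", ["A", "B", "C", "D"])
query = ("ABC", ["A", "B", "C"])

for mode in ("none", "discard", "mixed"):
    shared = index_terms(*doc, mode) & index_terms(*query, mode)
    print(mode, sorted(shared))
```

With parts indexed, the query and the document share A, B, and C; with only whole compounds indexed, they share nothing, so ABC can no longer find ABCD.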

Based on the general review of Tokenization and Compounds, though, I think we are okay, with more correct tokenizations than errors.

Thanks again, revi, for all the help! Any more comments on anything would be welcome!

Reply to "Stemming"
-revi (talkcontribs)

I didn't look at the whole group. I looked at your notes.

(snip) "智", are listed in Wiktionary as just 지/ji, while others, like "知", have multiple Hangeul versions (in this case, 알/al or 지/ji), and it looks like Nori picked this one. In several other cases, especially where the token ends with -진, the part of speech tagger is marking 지 as an auxiliary verb, which is maybe another category of parts of speech we should filter.

知 just has 지/ji, 알/al is the 'meaning' part. enwiktionary is wrong there then.

-진 is -지+ㄴ.

-이- is often used as a suffix, so there are lots of examples of it used as a suffix there.

TJones (WMF) (talkcontribs)

> 知 just has 지/ji, 알/al is the 'meaning' part. enwiktionary is wrong there then.

No, it was just me. I didn't understand the concept of "eumhun" so I interpreted "(eumhun 알 지 (al ji))" incorrectly, and it's hard to find documentation on Wiktionary (there are no links), and it is confusing if you aren't familiar with it.

I'll update my notes.

TJones (WMF) (talkcontribs)

For -이-, I think we need to filter the "positive designator" parse. It shows up a lot, doesn't seem to carry a lot of valuable meaning, and links a lot of otherwise unlinked tokens.

Reply to "Large groups"
-revi (talkcontribs)

IMO hanja sucks except for differentiation, and Koreans these days use less and less hanja in daily language, but it may be worth doing. It all LGTM.

TJones (WMF) (talkcontribs)

I'm going to reply here first since this is the easiest one! The Hanja-to-Hangeul conversion is turned on by default, so I left it in. Also, because of the way our search is configured, exact matches still get a boost, so if you search for Hanja, you are more likely to get Hanja, and if you search for Hangeul you are more likely to get Hangeul, all other things being equal. But for rarer terms, the Hanja-to-Hangeul match could be the only thing that matches, which is good.
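
A toy model of that behaviour, in Python: terms match when their reading forms agree, and an exact match earns an extra boost. The readings come from the examples earlier on this page; the `reading_form`, `score`, and `rank` helpers and the boost value are assumptions made for illustration, not how the search scoring is actually implemented.

```python
# Readings taken from the examples earlier on this page.
HANJA_READINGS = {"英陵": "영릉", "康陵": "강릉", "長陵": "장릉", "寧陵": "영릉"}


def reading_form(term):
    """Return the Hangeul reading for a known Hanja term, else the term unchanged."""
    return HANJA_READINGS.get(term, term)


def score(query, doc_term, exact_boost=1.0):
    """Toy score: 1 point for a reading-form match, plus a boost for an exact match."""
    if reading_form(query) != reading_form(doc_term):
        return 0.0
    return 1.0 + (exact_boost if query == doc_term else 0.0)


def rank(query, doc_terms):
    """Return the matching document terms, best score first."""
    scored = [(score(query, term), term) for term in doc_terms]
    return [term for s, term in sorted(scored, key=lambda pair: -pair[0]) if s > 0]


docs = ["영릉", "英陵", "강릉"]
print(rank("英陵", docs))  # Hanja query: the Hanja document ranks first
print(rank("영릉", docs))  # Hangeul query: the Hangeul document ranks first
```

Either query finds both spellings, and the exact-match boost decides the order.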

Reply to "Hanja to Hangul"

Speaker Review -> Tokenization and Compounds -> 7 sentences with many-part compounds

Bmansurov (WMF) (talkcontribs)
input 양재역 - 양재시민의숲역 - 양재 나들목 (제부여객으로 이관)
tokens [양재역 • 양재 • 역] — [양재시민의숲역 • 양재 • 시민 • 숲 • 역] — [양재] — [나들목 • 나들 • 목] — [제부여객 • 제부 • 여객] — [이관]
my tokens same as above
input 제41권 《비틀스를 위기에서 건진 노란 잠수함》
tokens [41] — [권] — [비틀스] — [위기] — [건진 • 것 • 이 • 지] — [노란 • 노랗] — [잠수함 • 잠수 • 함]
my tokens [41] — [권] — [비틀스] — [위기] — [건지] (건지다 - pull up, 건진 - pulled up) — [노란 • 노랗] — [잠수함 • 잠수 • 함]
input 17번트랙 <좋은날> 브라운아이드걸스 버전을 편곡
tokens [17] — [번] — [트랙] — [좋] — [날] — [브라운아이드걸스 • 브라운 • 아이드 • 걸스] — [버전] — [편곡]
my tokens same as above
input 사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
tokens [사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] — [전해져왔 • 전하 • 지 • 오]
my tokens [사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] (not sure if dividing 동남 further is a good idea. 동 - east, 남 - south, but 동남 - southeast) — [전해져왔 • 전하] (removed 지 • 오, as they are not informative: 전하다 + 아지다 + 오다 + 았다 => 전해져왔다)
input 미국군이 처음으로 라인강을 도하한다
tokens [미국] — [군] — [처음] — [라인강 • 라인 • 강] — [도하]
my tokens same as above
input 전 구간 야마구치현에 소재.
tokens [구간] — [야마구치현 • 야마구치 • 현] — [소재]
my tokens same as above
input 당신이 뭔데 여기서 큰소리를 치는거야.
tokens [당신] — [뭔데 • 뭐 • 이] — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] — [야 • 이]
my tokens [당신] — [뭔데 • 뭐] (removed 이: 뭐 + 인 + 데 => 뭔데) — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] (removed verb ending)
TJones (WMF) (talkcontribs)

Thanks again, Baha!

  • 제41권 《비틀스를 위기에서 건진 노란 잠수함》
    • 건진 / 건지: this was interpreted as "것/NNB(Dependent noun)+이/VCP(Positive designator)+ᆫ/E(Verbal endings)+지/NNB(Dependent noun)+ᆫ/J(Ending Particle)", and the verbal ending and particle were filtered. This looks like a likely error, since it doesn't seem like a verbal ending should go on a noun. The parser appears to have made a mistake.
  • 사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
    • 동남 — I agree that splitting them doesn't seem necessary for directions, but at least it makes sense why it happened.
    • 전해져왔 / 지 / 오: the parser seems to agree with you, because these are both marked as "Auxiliary Verb or Adjective", which sounds eminently ignorable.
  • 당신이 뭔데 여기서 큰소리를 치는거야.
    • 뭔데/이 — another "Positive designator", looking like it should be filtered.
    • 야 — oddly, to me, this gets split into "이/VCP(Positive designator)+야/E(Verbal endings)", where the whole thing, 야, is also an ending. So it originally would have been [[야 • 이] • 야], where the first "야" is a compound, and the second "야" is a verb ending (which did get filtered). Weird. Another vote for "positive designator" to get filtered.

Again, nothing that seems horrible, given the overall complexity of the task. More votes for filtering "positive designator" and "auxiliary verb or adjective", and possibly looking into filtering "negative designator", too.
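
For reference, the kind of change being proposed could look roughly like the Elasticsearch analysis settings built below. This is a sketch under stated assumptions, not the configuration actually in use: the names `nori_mixed`, `nori_pos`, and `korean_sketch` are made up, the base tag list only covers the categories mentioned in this discussion, and the exact option names should be checked against the Nori plugin documentation. The additions are VCP (positive designator), VX (auxiliary verb or adjective), and VCN (negative designator).

```python
import json

# Tags for the categories discussed as already filtered (assumed, not exhaustive).
BASE_STOP_TAGS = ["E", "IC", "J", "MAG", "MAJ", "MM", "XPN", "XSA", "XSN", "XSV"]

# Candidate additions raised in this thread.
EXTRA_STOP_TAGS = ["VCP", "VX", "VCN"]

settings = {
    "analysis": {
        "tokenizer": {
            # "mixed" keeps the original compound as well as its parts.
            "nori_mixed": {"type": "nori_tokenizer", "decompound_mode": "mixed"}
        },
        "filter": {
            "nori_pos": {
                "type": "nori_part_of_speech",
                "stoptags": BASE_STOP_TAGS + EXTRA_STOP_TAGS,
            }
        },
        "analyzer": {
            # Hypothetical analyzer name; the filter order is an assumption.
            "korean_sketch": {
                "type": "custom",
                "tokenizer": "nori_mixed",
                "filter": ["nori_pos", "nori_readingform", "lowercase"],
            }
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Whether the extra tags are worth filtering would still need a check against other instances, as noted above.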

Generally, though, I'm hopeful this will work out with only minor tweaks.

-revi (talkcontribs)

Baha's one LGTM.

Reply to "Speaker Review -> Tokenization and Compounds -> 7 sentences with many-part compounds"