Thread:Talk:Search/Works for Japanese/reply (5)

The Kuromoji plugin looks to be an effort to integrate this, which claims support for lemmatization and readings for kanji. I'm playing with its default setup and I don't see any kanji normalization, but it does a much better job with word segmentation than the analyzer deployed on jawiki now. That one is Lucene's StandardAnalyzer, which implements Unicode word segmentation. I haven't dug into it deeply enough to explain it, but here are some examples.


 * 日本国 becomes
   * 日本 and 国 in kuromoji
   * 日 and 本 and 国 in standard
 * にっぽんこく becomes
   * にっぽん and こい in kuromoji
   * に and っ and ぽ and ん and こ and い in standard

From that it looks like kuromoji should be better, but standard is saved by executing the search for all the characters as a phrase search, which makes everything line up _reasonably_ well. It won't perform as well, but that should be OK too.
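To make the phrase-search point concrete, here is a toy sketch of a positional index over single-character tokens. It is not how Lucene implements phrase queries; the function names (`char_tokens`, `build_index`, `phrase_search`, `any_order_search`) and the sample documents are made up for illustration. It only shows why requiring the characters to appear consecutively recovers most of what per-character tokenization loses.

```python
def char_tokens(text):
    """One token per non-space character, numbered by token position.
    A rough stand-in for per-character tokenization of kanji and kana."""
    return list(enumerate(ch for ch in text if not ch.isspace()))


def build_index(docs):
    """Map token -> {doc_id -> set of positions} (a tiny positional index)."""
    index = {}
    for doc_id, text in docs.items():
        for pos, tok in char_tokens(text):
            index.setdefault(tok, {}).setdefault(doc_id, set()).add(pos)
    return index


def phrase_search(index, query):
    """Docs where the query's characters occur consecutively, in order."""
    tokens = [tok for _, tok in char_tokens(query)]
    if not tokens:
        return set()
    hits = set()
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            if all(start + i in index.get(tok, {}).get(doc_id, set())
                   for i, tok in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits


def any_order_search(index, query):
    """Docs containing every query character, anywhere (no phrase constraint)."""
    tokens = [tok for _, tok in char_tokens(query)]
    if not tokens:
        return set()
    doc_sets = [set(index.get(tok, {})) for tok in tokens]
    return set.intersection(*doc_sets)


# Hypothetical sample documents.
docs = {1: "日本国憲法", 2: "国は日本", 3: "本日の国"}
index = build_index(docs)
```

With these documents, `phrase_search(index, "日本国")` matches only document 1, while `any_order_search(index, "日本国")` matches all three. The phrase version has to compare positions for every candidate document, which is the extra cost mentioned above.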

And it looks like my fancy highlighter chokes on kuromoji, which isn't cool. Look here. There are results with nothing highlighted at all, which isn't good.

With regard to lsearchd: I'm not sure what it uses. It doesn't have the API that lets me see how text is analyzed, so I have to guess from reading the code, and there is a lot of it.
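For anyone who wants to reproduce the kuromoji/standard comparison: a minimal sketch, assuming the analysis API in question is Elasticsearch's `_analyze` endpoint on a recent version that accepts a JSON body, and that the kuromoji analysis plugin is installed. The helper names and the host are placeholders, not anything from the actual deployment.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST to <base_url>/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Pull the token strings out of a parsed _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


def analyze(base_url, analyzer, text):
    """Send the request and return the tokens the analyzer produced."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.load(resp))
```

Calling `analyze("http://localhost:9200", "kuromoji", "日本国")` and then the same with `"standard"` against a test node (the host here is hypothetical) should show the two token lists side by side.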