Thread:Talk:Search/Works for Japanese/reply (5)

The Kuromoji plugin looks to be an effort to integrate this; it claims support for lemmatization and readings for kanji. I've been playing with its default setup and I don't see any kanji normalization, but it does a much better job with word segmentation than what is deployed on jawiki now, which is Lucene's StandardAnalyzer implementing Unicode word segmentation. I haven't dug into that deeply enough to explain it, but here are some examples:

 * 日本国 becomes
 ** 日本 and 国 in kuromoji
 ** 日 and 本 and 国 in standard
 * にっぽんこく becomes
 ** にっぽん and こく in kuromoji
 ** に and っ and ぽ and ん and こ and く in standard

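To make the difference above concrete, here is a toy sketch, not Kuromoji's actual algorithm (it uses a real dictionary plus a Viterbi lattice over candidate segmentations): Unicode-style segmentation falls back to one token per CJK character, while a dictionary-based tokenizer does longest-match against a lexicon. The tiny `DICT` here is made up purely for illustration.

```python
def char_tokens(text):
    # StandardAnalyzer-style fallback: one token per CJK character
    return list(text)

# toy lexicon for illustration only; Kuromoji ships a full dictionary
DICT = {"日本", "国", "にっぽん", "こく"}

def dict_tokens(text):
    # greedy longest-match against the toy lexicon; unknown
    # characters fall back to single-character tokens
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(char_tokens("日本国"))   # ['日', '本', '国']
print(dict_tokens("日本国"))   # ['日本', '国']
```

Greedy longest-match is cruder than what Kuromoji actually does, but it reproduces the segmentations above.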
From that it looks like standard should be worse, but it seems to execute these as phrase queries, so what you get is actually relatively decent.
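A rough sketch of why phrase execution rescues the single-character tokens: a phrase query only matches when the query's tokens appear consecutively and in order, so 日本国 as three one-character tokens still only matches documents containing that exact run. This simplifies positions to list indices; Lucene phrase queries work off stored term positions.

```python
def phrase_match(doc_tokens, query_tokens):
    # True only if query_tokens occur consecutively, in order,
    # somewhere in doc_tokens -- the essence of a phrase query
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

# standard-style segmentation: one token per character
print(phrase_match(list("これは日本国の話"), list("日本国")))    # True
print(phrase_match(list("これは日本の国の話"), list("日本国")))  # False
```

The second query fails because 日, 本, 国 all occur but not adjacently, which is roughly the precision you'd hope for from word-level matching.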

And it looks like my fancy highlighter chokes on kuromoji, which isn't cool. Look here: there are results without anything highlighted at all, which isn't good.

With regards to lsearchd: I'm not sure what it uses. It doesn't have an API that lets me see how text is analyzed, so I have to guess from reading the code, and there is a lot of it.