User:TJones (WMF)/Notes/Nori Analyzer Analysis/Nori Config

From MediaWiki.org

Below is the final command-line config I used for testing the Nori Korean analyzer. It makes the following changes: the Nori analyzer is unpacked; ICU normalization (the icu_normalizer filter) is added; the tokenizer is configured for "mixed" compound processing (keeping both the original and the parts); "VCP", "VCN", and "VX" are added to the part-of-speech filter; a minimum-length filter is added to eliminate empty tokens; a character map is added to handle characters that cause problems in tokenization and to fix the regression on dotted I (İ) from ICU normalization; and another character filter is added to strip the most common problem combining diacritics.

   curl -X PUT "localhost:9200/nori_mixed_icu_custom_pos?pretty" -H 'Content-Type: application/json' -d'
   {
     "settings": {
       "index": {
         "analysis": {
           "tokenizer": {
             "nori_tok": {
               "type": "nori_tokenizer",
               "decompound_mode": "mixed"
             }
           },
           "filter": {
             "nori_posfilter": {
               "type": "nori_part_of_speech",
               "stoptags": [
                 "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV", "VCP", "VCN", "VX"
               ]
             },
             "nori_length": {
               "type": "length",
               "min": 1
             }
           },
           "char_filter": {
             "nori_charfilter": {
               "type": "mapping",
               "mappings": [
                 "\\u0130=>I",
                 "\\u00B7=>\\u0020",
                 "\\u318D=>\\u0020",
                 "\\u00AD=>",
                 "\\u200C=>"
               ]
             },
             "nori_combo_filter": {
               "type": "pattern_replace",
               "pattern": "[\\u0300-\\u0331]",
               "replacement": ""
             }
           },
           "analyzer": {
             "text": {
               "type": "custom",
               "char_filter": [ "nori_charfilter", "nori_combo_filter" ],
               "tokenizer": "nori_tok",
               "filter": [ "nori_posfilter", "nori_readingform", "icu_normalizer", "nori_length" ]
             }
           }
         }
       }
     }
   }
   '
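
To make the effect of the two character filters and the length filter easier to see, here is a rough Python approximation of what they do to text before and after tokenization. This is only a sketch for illustration, not how Elasticsearch implements them; the names CHAR_MAP, COMBINING, apply_char_filters, and drop_empty are made up for this example, and the tokenizer, part-of-speech filter, and ICU normalizer are not modeled at all.

```python
import re

# Approximation of the "nori_charfilter" mapping char filter in the config above.
CHAR_MAP = {
    "\u0130": "I",  # dotted capital I (İ) -> plain I
    "\u00B7": " ",  # middle dot -> space
    "\u318D": " ",  # Hangul letter araea (ㆍ) -> space
    "\u00AD": "",   # soft hyphen -> removed
    "\u200C": "",   # zero-width non-joiner -> removed
}

# Approximation of "nori_combo_filter": strip combining marks U+0300 to U+0331.
COMBINING = re.compile("[\u0300-\u0331]")


def apply_char_filters(text: str) -> str:
    """Apply the mapping first, then the pattern replacement (same order as the analyzer)."""
    mapped = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    return COMBINING.sub("", mapped)


def drop_empty(tokens):
    """Mimic the "length" filter with min 1: discard empty tokens."""
    return [t for t in tokens if len(t) >= 1]
```

For example, apply_char_filters("\u0130stanbul") gives "Istanbul", and apply_char_filters("a\u00B7b") gives "a b", so the middle dot becomes a token boundary rather than part of a token.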

I still need to convert this to the appropriate config in AnalysisConfigBuilder and test it there.