User:TJones (WMF)/Notes/Serbian Analyzer Analysis

From MediaWiki.org

March 2018 — See TJones_(WMF)/Notes for other projects. See also T183015. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

I'd previously looked at four stemmers (first three "Serbian" stemmers and later a "Croatian" stemmer, all from a collection called SCStemmers), and decided with help from the developer and a native speaker (thanks Vuk and Željko!) that the "Croatian" stemmer, with a new Cyrillic-to-Latin filter in front of it, was actually probably the best one for Serbian. (See Bosnian-Croatian-Montenegrin-Serbian on EnWiki for more on the relationship among these languages.)

I wrapped the fourth stemmer into an Elasticsearch plugin, and built a basic analysis chain around it. As of this writing, it is still waiting for review, but the plugin will be available on Gerrit and GitHub as part of the search/extra-analysis repository (for this and any future GPL Elasticsearch plugins). (Thanks to Guillaume and David for helping me navigate Maven, Java, and Elastic to get the plugin repository into reasonable shape.)

Because my analysis analysis tools are designed to work with plugin output, I did my previous stemmer analysis on just the unique tokens from two wiki corpora—10,000 Serbian Wikipedia articles, and 10,000 Serbian Wiktionary entries—because that was just easier to handle with the command line stemmer. Here I used both the "unique tokens" corpora to compare to the original command line stemmer, and the "wiki" corpora for analysis of the full analysis chain, to get a sense of the frequency of any particular tokens. (For comparison—in English, the unique tokens corpora would have "the", "and", "is", etc. in it only once, while the wiki corpora would have them each many, many times.)

Stemmer vs Plugin

Using the unique tokens file, I compared the output of the command line stemmer to the output of a minimal analysis chain with just the standard tokenizer and the new stemmer.

All the differences seem to be related to the way the command line wrapper around the stemmer does tokenization. Tokens with "divider" characters in them—colon, period, mid-dot, apostrophe—are divided into pieces by the command line wrapper, stemmed, and then put back together. The Elasticsearch plugin just stems the whole token.

Some examples:

token               command line       plugin
20mjesnim.pdf       20mjesn.pdf        20mjesnim.pdf
bagnolssurceze.com  bagnolssurcez.com  bagnolssurceze.c
ipa:t̪khə            ip:t̪khə            ipa:t̪khə
jau·la              ja·la              jau·l
luwalhati't         luwalha't          luwalhati't
jusqu'à             jusq'à             jusqu'à

Since there is a minimum stem length, short bits like "la" and "com" are not stemmed to "l" and "c" when processed separately.
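The difference between the two wrappers can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical stand-in for the real stemmer (it drops one final vowel), and `MIN_STEM_LEN = 2` is an assumed value chosen so that the sketch reproduces the "jau·la" row above. It does not reproduce the other rows.

```python
import re

MIN_STEM_LEN = 2  # assumed value, for illustration only
VOWELS = "aeiou"
# Divider characters: colon, period, mid-dot (U+00B7), apostrophe.
DIVIDERS = re.compile(r"([:.\u00b7'])")


def toy_stem(token):
    """Hypothetical stand-in for the real stemmer: drop one final vowel,
    unless that would leave fewer than MIN_STEM_LEN characters."""
    if token and token[-1] in VOWELS and len(token) - 1 >= MIN_STEM_LEN:
        return token[:-1]
    return token


def stem_like_command_line(token):
    """Split on dividers, stem each piece, and glue the pieces back together."""
    parts = DIVIDERS.split(token)
    # Even positions hold text pieces; odd positions hold the captured dividers.
    return "".join(p if i % 2 else toy_stem(p) for i, p in enumerate(parts))


def stem_like_plugin(token):
    """Stem the whole token as a single unit."""
    return toy_stem(token)


print(stem_like_command_line("jau\u00b7la"))  # ja·la
print(stem_like_plugin("jau\u00b7la"))        # jau·l
```

With the piecewise approach, "la" is too short to lose its final vowel, while the whole-token approach sees a long token and trims it.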

Breaking up tokens into pieces, stemming them, and then putting them back together is an interesting idea, though we have generally approached the problem with word_break_helper, which just breaks some such strings into smaller tokens and then relies on phrase matching to find good matches.

Adding ICU Normalization

The Serbian stemmer doesn't lowercase anything (which can cause problems with ALL CAPS), but all of the tokens in the unique tokens file had already been lowercased, so I didn't enable the lowercase filter when comparing the command line stemmer to the plugin. However, our default config upgrades the "lowercase" filter to the "icu_normalizer" filter, which does some minimal folding.

I disabled the lowercase-to-icu_normalizer upgrade and ran a baseline with just lowercasing and the stemmer as the analysis chain, and compared that to icu_normalizer with the stemmer.

The results were the usual ICU normalization changes:

  • lots of RTL tokens with bidirectional markers had the bidi chars stripped
  • "weird" whitespace characters (non-breaking spaces, zero-width non-joiners) and soft hyphens were stripped
  • the usual suspects among Unicode characters got regularized: ς → σ, º → o (which is why you should use the degree sign!), １ → 1, ß → ss, ˤ → ʕ, µ → μ, etc.

ICU normalization is already happening on Serbian Wiki, so this was just a sanity check that it didn't interfere with the stemmer in any bad way.
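For readers who want to poke at character-level changes of this kind without an Elasticsearch instance, Python's standard library gets reasonably close. The sketch below is an approximation, not the ICU filter itself: it applies NFKC normalization followed by case folding, and the function name `approx_icu_normalize` is made up for this example. Unlike the real filter, it does not strip bidi marks, soft hyphens, or zero-width characters.

```python
import unicodedata


def approx_icu_normalize(text):
    """Rough stand-in for ICU normalization: NFKC, then case folding."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can occasionally produce un-normalized text, so normalize again.
    return unicodedata.normalize("NFKC", folded)


# A few characters of the kinds mentioned above, written as escapes so the
# exact code points are unambiguous.
SAMPLES = {
    "final sigma": "\u03c2",
    "masculine ordinal indicator": "\u00ba",
    "fullwidth digit one": "\uff11",
    "sharp s": "\u00df",
    "modifier reversed glottal stop": "\u02e4",
    "micro sign": "\u00b5",
}

for name, char in SAMPLES.items():
    print(f"{name}: {char!r} -> {approx_icu_normalize(char)!r}")
```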

Adding Accent Removal

I had previously noticed ácute, gràve, double grȁve, mācron, and inverted brȇve accents, and got confirmation that those are there to indicate pitch accent (a useful guide to pronunciation but not part of normal spelling) and so should be normalized away.

I introduced an "asciifolding" filter (which gets automagically upgraded to "icu_folding"), with exceptions for the Serbian-specific Latin characters (Đđ Žž Ćć Šš Čč). None of the Serbian Cyrillic characters are modified by ICU folding (and all are converted to Latin by the stemmer anyway).
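As a rough picture of the resulting chain, here is what the analysis settings might look like once the upgrades have been applied, written as a Python dict that could be serialized to JSON. Treat it as a sketch under assumptions: the production config is generated by our tooling and may differ, the filter name `serbian_folding` and analyzer name `text` are made up for this example, `serbian_stemmer` is assumed to be the name under which the plugin registers its stemmer, and the `unicode_set_filter` parameter is spelled `unicodeSetFilter` in some older Elasticsearch versions.

```python
# Serbian-specific Latin letters that folding must leave alone:
# Đ đ Ž ž Ć ć Š š Č č (written as escapes so the code points are unambiguous).
SERBIAN_LATIN = "\u0110\u0111\u017d\u017e\u0106\u0107\u0160\u0161\u010c\u010d"

analysis_settings = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; folds everything except the letters above.
            "serbian_folding": {
                "type": "icu_folding",
                "unicode_set_filter": f"[^{SERBIAN_LATIN}]",
            },
        },
        "analyzer": {
            # Hypothetical analyzer name.
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                # Normalize and fold first, so the stemmer sees clean input.
                # "serbian_stemmer" is an assumed name for the plugin's filter.
                "filter": ["icu_normalizer", "serbian_folding", "serbian_stemmer"],
            },
        },
    },
}
```

The ordering matters: folding runs before the stemmer so that accented Cyrillic letters are reduced to plain ones before Cyrillic-to-Latin conversion and stemming.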

A Wee Bit of Testing

I tested precomposed Serbian Latin characters with the above pitch accent diacritics (áÁ àÀ āĀ ȁȀ ȃȂ / éÉ èÈ ēĒ ȅȄ ȇȆ / íÍ ìÌ īĪ ȉȈ ȋȊ / óÓ òÒ ōŌ ȍȌ ȏȎ / úÚ ùÙ ūŪ ȕȔ ȗȖ), precomposed Cyrillic characters (ѐЀ / ѝЍ ӣӢ / ӯӮ), Latin characters with combining diacritics (áÁ àÀ āĀ ȁȀ ȃȂ / éÉ èÈ ēĒ ȅȄ ȇȆ / íÍ ìÌ īĪ ȉȈ ȋȊ / óÓ òÒ ōŌ ȍȌ ȏȎ / úÚ ùÙ ūŪ ȕȔ ȗȖ), and Cyrillic characters with combining diacritics (а́А́ а̀А̀ а̄А̄ а̏А̏ а̑А̑ / е́Е́ ѐЀ е̄Е̄ е̏Е̏ е̑Е̑ / и́И́ ѝЍ ӣӢ и̏И̏ и̑И̑ / о́О́ о̀О̀ о̄О̄ о̏О̏ о̑О̑ / у́У́ у̀У̀ ӯӮ у̏У̏ у̑У̑), and all came out as expected (as plain Latin characters a, e, i, o, or u).
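The core of this test can be imitated in plain Python. The sketch below is not ICU folding (which does much more); it only decomposes characters and drops combining marks, while protecting the Serbian-specific Latin letters. The function name `strip_pitch_accents` is made up for this example. Note that, unlike the full chain, it leaves Cyrillic letters as Cyrillic, since Latinization happens later in the stemmer.

```python
import unicodedata

# Đ đ Ž ž Ć ć Š š Č č, written as escapes so they are precomposed for certain.
PRESERVE = set("\u0110\u0111\u017d\u017e\u0106\u0107\u0160\u0161\u010c\u010d")


def strip_pitch_accents(text):
    """Remove combining marks, but keep the Serbian-specific Latin letters."""
    out = []
    # Compose first, so that e.g. "z" + combining caron is recognized as ž.
    for char in unicodedata.normalize("NFC", text):
        if char in PRESERVE:
            out.append(char)
            continue
        for part in unicodedata.normalize("NFD", char):
            if unicodedata.category(part) != "Mn":  # Mn = nonspacing mark
                out.append(part)
    return "".join(out)


print(strip_pitch_accents("\u0201"))        # precomposed a with double grave
print(strip_pitch_accents("a\u030f"))       # a + combining double grave
print(strip_pitch_accents("\u0450\u04e3"))  # Cyrillic ie with grave, i with macron
print(strip_pitch_accents("z\u030c"))       # decomposed ž stays ž
```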

All the plain upper and lowercase Serbian Latin and Cyrillic characters (Аа/Aa Бб/Bb Вв/Vv Гг/Gg Дд/Dd Ђђ/Đđ Ее/Ee Жж/Žž Зз/Zz Ии/Ii Јј/Jj Кк/Kk Лл/Ll Љљ/Ljlj Мм/Mm Нн/Nn Њњ/Njnj Оо/Oo Пп/Pp Рр/Rr Сс/Ss Тт/Tt Ћћ/Ćć Уу/Uu Фф/Ff Хх/Hh Цц/Cc Чч/Čč Џџ/Dždž Шш/Šš) also come out as expected, as their lowercase Latin versions—so no desirable Serbian diacritics are lost.
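The letter pairs listed above amount to a simple lookup table. As a sketch (this is not the stemmer's actual code, and `cyr_to_lat` is a name made up here), the lowercase Cyrillic-to-Latin mapping could be written as:

```python
# Lowercase Serbian Cyrillic to Latin, following the pairs listed above.
# The Latin letters with diacritics are written as escapes (đ ž ć č š).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "\u0111",
    "е": "e", "ж": "\u017e", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "\u0107", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "\u010d", "џ": "d\u017e", "ш": "\u0161",
}


def cyr_to_lat(text):
    """Lowercase the text and map Serbian Cyrillic letters to Latin ones.
    Characters without a mapping (including Latin letters) pass through."""
    return "".join(CYR_TO_LAT.get(char, char) for char in text.lower())


print(cyr_to_lat("Београд"))
print(cyr_to_lat("Њујорк"))
```

Because the three digraphs (lj, nj, dž) expand one Cyrillic letter into two Latin ones, the output can be longer than the input.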

Accent Removal/Folding Effects

Comparing the icu_normalized output to the icu_folded output for the wiki corpora showed a lot of changes. About 10% of tokens and 43% of types in the Wiktionary corpus ended up being merged; in the Wikipedia corpus, only 0.05% of tokens and 2% of types were merged.
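To make the token/type distinction concrete, here is one plausible way to compute such percentages. It is a sketch, not the actual analysis tooling: it assumes a type counts as "merged" when folding makes it collide with at least one other type, and a token counts as merged when its type does. The names `merge_stats` and `strip_marks` and the sample words are made up for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_marks(word):
    """Simple stand-in for folding: decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def merge_stats(tokens, fold):
    """Return (share of tokens merged, share of types merged) under fold."""
    counts = Counter(tokens)  # type -> number of tokens of that type
    if not counts:
        return 0.0, 0.0
    groups = defaultdict(list)
    for typ in counts:
        groups[fold(typ)].append(typ)
    # A type is merged if it shares its folded form with another type.
    merged = [t for group in groups.values() if len(group) > 1 for t in group]
    merged_tokens = sum(counts[t] for t in merged)
    return merged_tokens / len(tokens), len(merged) / len(counts)


# Made-up sample: "voda" twice, an accented variant once, and an unrelated word.
sample = ["voda", "v\u00f2da", "voda", "riba"]
token_share, type_share = merge_stats(sample, strip_marks)
print(f"{token_share:.0%} of tokens, {type_share:.0%} of types merged")
```

A dictionary-like corpus has many accented headwords that each occur rarely, which is consistent with the type percentage for Wiktionary being so much higher than the token percentage.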

Categories of changes include:

  • Accented Cyrillic words that were only partially converted (and possibly not stemmed). For example, "атѐгӣрам" was previously converted to the mixed-script, accented form "atѐgӣra" (ѐ and ӣ are still Cyrillic), but is now converted to "ategir", which is both properly Latinized and stemmed.
  • Latin words with combining diacritics and/or precomposed accented characters have the diacritics removed.
  • Arabic with diacritics or non-standard characters had the diacritics dropped and the non-standard characters replaced with standard forms.
  • Devanagari that gets simplified—conjuncts broken up or diacritics removed.
  • A small number of miscellaneous simplifications, normalizations, or diacritic removal in Greek, Hebrew, Japanese, Kannada, Tamil, and a few others.

The vast majority of newly merged forms are accented versions of words, with a small number of the typical Unicode folding examples, as listed above.

Conclusion

Rounding out the Serbian analysis chain with folding greatly improves the performance of the stemmer and allows for the unification of forms that a reader would not consider different (e.g., with pitch accent diacritics). The different tokenization of the command line stemmer and plugin has a small impact.

Based on the previous review of the original command line stemmer, this new analysis config is ready to be deployed to production. We will need to:

  • deploy the search/extra-analysis plugin (DONE see T183015)
  • deploy the new analyzer config to use the plugin (DONE see T183015)
  • re-index Serbian-language wikis (sr/603K Wikipedia articles) (see T189265)

A near-term follow-up will be to evaluate and then activate the same or similar config for the Serbo-Croatian (sh/442K Wikipedia articles), Croatian (hr/185K Wikipedia articles), and Bosnian (bs/77K Wikipedia articles) wikis.