User:TJones (WMF)/Notes/Slovak Analyzer Analysis

March/April 2018 — See TJones_(WMF)/Notes for other projects. See also T190815.

Intro
Stemmer: I previously looked at a family of related stemmers adapted to Slovak from a Czech stemmer, and evaluated two versions—"light" and "aggressive"—of one implementation. The aggressive stemmer was too aggressive, but the light stemmer seemed to do a good job, based on feedback from a Slovak speaker.

Data: For this analysis I used the same corpora I used for the stemmer analysis: 10,000 Slovak Wikipedia articles and 10,000 Slovak Wiktionary entries, with extra markup removed and individual lines deduplicated. The Wiktionary corpus has 60,893 tokens and 26,372 pre-analysis types. The Wikipedia corpus has 1,611,909 tokens and 221,174 pre-analysis types. (Note that types are unique tokens, so in English text "the" counts only once, while tokens are all instances of a word, so "the" would count many, many times.)
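The type/token distinction can be illustrated with a small sketch. This is not the analysis tool used here; the function name `count_tokens_and_types` is made up for the example, and it assumes naive whitespace tokenization with lowercasing rather than a real tokenizer.

```python
from collections import Counter

def count_tokens_and_types(text):
    """Return (token_count, type_count) for a piece of text.

    Assumption: tokens are whitespace-separated and lowercased. A real
    analysis chain would use a proper tokenizer and normalizer.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)          # one entry per distinct token (type)
    return len(tokens), len(counts)

sample = "the cat saw the dog and the dog saw the cat"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 11 tokens, 5 types; "the" is 4 tokens but 1 type
```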

Plugin: To build the Slovak stemmer into an Elasticsearch plugin, I adapted the code for the Czech Elasticsearch stemmer, which has a very efficient implementation for manipulating suffixes. I also added prefix stripping for the superlative prefix naj-, based on a review of Slovak morphology, the presence of similar processing for Polish, and speaker feedback. The new Slovak stemmer will live in the search/extra plugin.

Baseline (Stemmer vs Plugin): With an update to my analysis analysis tools, I was able to use the existing Slovak analysis chain (which is just the "default" analysis, using the standard tokenizer and the ICU normalizer) to test the original stemmer (with the naj- upgrade). After implementing the plugin (and correcting a couple of small bugs in the stemming implementation), there were no differences between the command line stemmer results and the baseline Slovak stemmer plugin results, except for a small bug in my analysis analysis tool, which dropped "0" as a stem. (For the CS nerds: the negation of an empty value is true, but also !0 == true; oops.)

So, at this point we have a working Slovak Elasticsearch analysis chain that uses the same tokenization and normalization as the current default Slovak processing, with the addition of the Slovak stemmer.

Adding ICU Folding
I'm a fan of enabling ICU Folding when working in an analysis chain, with exceptions for the letters that are in the alphabet of the host language. Fun fact: the Slovak alphabet "has 46 letters which makes it the longest Slavic and European alphabet," so there are a lot of letters to be put in the exception list: Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž.
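The idea of folding with exceptions can be sketched in a few lines. This is only an approximation under stated assumptions: it strips combining marks via Unicode decomposition, which is a small part of what ICU folding does, and the names `SLOVAK_EXEMPT` and `fold_except_slovak` are invented for the example. The exempt letters are the ones listed above.

```python
import unicodedata

# Letters from the exception list above; NFC-normalized so each is a single
# precomposed code point.
SLOVAK_EXEMPT = set(unicodedata.normalize(
    "NFC", "ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž"))

def fold_except_slovak(token):
    """Strip diacritics from every character except the exempt Slovak letters.

    Assumption: "folding" here means only "remove combining marks"; real ICU
    folding also handles case, width, and many other character variants.
    """
    out = []
    for ch in unicodedata.normalize("NFC", token):
        if ch in SLOVAK_EXEMPT:
            out.append(ch)            # Slovak letter: keep as-is
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

print(fold_except_slovak("ångström"))  # angstrom
print(fold_except_slovak("žltý"))      # žltý (unchanged)
```

In Elasticsearch itself, this kind of exclusion is normally expressed through the `unicode_set_filter` parameter of the `icu_folding` token filter rather than in code like this.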

Enabling ICU folding causes new mergers in 3.27% of types (unique words) but only 0.187% of tokens (instances of words) in my 10K Wikipedia sample, compared to 9.442% of types and 1.744% of tokens in my 10K Wiktionary sample. That makes sense, because Wiktionary has a lot more non-Slovak words, particularly IPA pronunciations.

A quick review of changes on the Wikipedia and Wiktionary corpora:
 * Stripping vowel marks in Arabic and Hebrew.
 * Removal of non-Slovak diacritics in Cyrillic, Greek, and Latin, which is especially nice for removing stress marks from Cyrillic and Greek.
 * Normalized forms in Devanagari, Arabic, Katakana, Hiragana, Telugu, Thaana, Indic numbers, and IPA.
 * There are a lot of affected IPA forms in the Wiktionary corpus, in particular. Because Slovak has a shallow orthography, many of the IPA forms merge with a form of the word they provide the pronunciation for.
 * Normalization of curly quotes/apostrophes—there are a lot of these that create new, good mergers!

Overall, these look like either low-impact changes, or good changes.

To Fold Before or After
For Serbian, it was clear that folding should occur before stemming to strip the pitch accent diacritics commonly used in Serbian as a guide to pronunciation. For Slovak, I copied that model, but realized it wasn't strictly required. The current English config folds before stemming, while the French config stems before folding. There's not always an obvious right way to do it.

I tested moving the folding to after stemming. The impact on normal text (i.e., Wikipedia articles) is minimal, with 0.034% of types and 0.003% of tokens participating in new mergers, and 0.74% of types and 0.021% of tokens participating in new splits.

The changes are good when a diacritic blocks stemming that would have been wrong. For example, ångström is no longer stemmed to ångstr; with folding first, that happened because the folded -om ending looks like a Slovak suffix. This allows ångström to match ångströmoch, an inflected form of ångström in Slovak.

On the other hand, when a name gets stemmed even though it shouldn't be, its forms with and without diacritics no longer index together. I'm willing to bet that the songwriter Pierre Delanoe and the songwriter Pierre Delanoë are the same person, but without folding first, stemming treats them differently.
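The two orderings can be compared with a toy model. Everything here is an assumption for illustration: `toy_stem` is not the Slovak stemmer, its suffix list (-och, -om, -e) is hypothetical, and `fold` strips all diacritics, which is fine for these examples because none of them contains a Slovak letter from the exception list.

```python
import unicodedata

# Hypothetical suffix list, longest first. NOT the real Slovak stemmer rules.
TOY_SUFFIXES = ("och", "om", "e")

def toy_stem(token):
    """Strip at most one toy suffix, keeping a stem of at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def fold(token):
    """Remove combining marks (a rough stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize(token):
    return unicodedata.normalize("NFC", token.lower())

def fold_then_stem(token):
    return toy_stem(fold(normalize(token)))

def stem_then_fold(token):
    return fold(toy_stem(normalize(token)))

for a, b in [("ångström", "ångströmoch"), ("Delanoe", "Delanoë")]:
    print(a, b,
          "fold first:", fold_then_stem(a) == fold_then_stem(b),
          "stem first:", stem_then_fold(a) == stem_then_fold(b))
```

Under these assumptions, the ångström pair matches only when stemming comes first, and the Delanoe pair matches only when folding comes first, which is the trade-off described above.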

The effect on Wiktionary entries is bigger, with 0.268% of types and 0.044% of tokens participating in new mergers, and a whopping 5.923% of types but only 0.872% of tokens participating in new splits.

The vast majority of Wiktionary changes are IPA pronunciations.

The unifying cause of the weird conditions here is applying stemming to things that ought not be stemmed—names, foreign words, and IPA pronunciations. But those are unavoidable when stemming—at least without some very heavy processing to detect such things.

Another option would be to bring in the "preserve" option, in which both the folded and unfolded version of a token are indexed. Using "preserve" before stemming would allow ångström to match ångströmoch and Delanoe to match Delanoë. However it might be overkill.
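A rough sketch of why "preserve" before stemming helps, again with a toy model: `toy_stem` and its suffix list (-och, -om, -e) are hypothetical stand-ins for the real stemmer, `fold` only strips combining marks, and `preserve_then_stem` and `can_match` are names invented for the example. Two words are treated as matching if they share at least one indexed form.

```python
import unicodedata

# Hypothetical suffix list, longest first. NOT the real Slovak stemmer rules.
TOY_SUFFIXES = ("och", "om", "e")

def toy_stem(token):
    """Strip at most one toy suffix, keeping a stem of at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def fold(token):
    """Remove combining marks (a rough stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def preserve_then_stem(token):
    """Emit both the unfolded and folded token, then stem each one."""
    original = unicodedata.normalize("NFC", token.lower())
    variants = {original, fold(original)}
    return {toy_stem(v) for v in variants}

def can_match(a, b):
    """True if the two words share at least one indexed form."""
    return bool(preserve_then_stem(a) & preserve_then_stem(b))

print(can_match("ångström", "ångströmoch"))  # True, via the unfolded stem
print(can_match("Delanoe", "Delanoë"))       # True, via the folded stem
```

The cost is visible in the sketch as well: words with diacritics produce two indexed forms instead of one, which is part of why this option might be overkill.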

For now, I suggest sticking with what we have and seeing how people react to it.

Conclusion
It makes sense to enable the non-Slovak ICU folding for Slovak-language projects when we enable the stemmer.

Based on the previous review of the original command line stemmer, the new analysis config is ready to be deployed to production.

We will need to:


 * Deploy the updated search/extra plugin (see T190815)
 * Deploy the new analyzer config to use the plugin (see T190815)
 * Since this relies on an update to the search/extra plugin, the plugin should be deployed before the analyzer update is deployed; otherwise any incidental re-indexing in the interim could fail.
 * Re-index Slovak-language wikis (phab ticket TBD)