User:TJones (WMF)/Notes/Slovak Analyzer Analysis

March/April 2018 — See TJones_(WMF)/Notes for other projects. See also T190815.

Intro
Stemmer: I previously looked at a family of related stemmers adapted to Slovak from a Czech stemmer, and evaluated two versions—"light" and "aggressive"—of one implementation. The aggressive stemmer was too aggressive, but the light stemmer seemed to do a good job, based on feedback from a Slovak speaker.

Data: For this analysis I used the same corpora I used for the stemmer analysis: 10,000 Slovak Wikipedia articles and 10,000 Slovak Wiktionary entries with extra markup and deduplicated individual lines. The Wiktionary corpus has 60,893 tokens and 26,372 pre-analysis types in it. The Wikipedia corpus has 1,611,909 tokens and 221,174 pre-analysis types in it. (Note that types are unique tokens, so in English "the" counts only once, while tokens are all instances of a word, so "the" would count many, many times.)
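As a rough illustration of the type/token distinction, here is a minimal Python sketch. The function name and the sample sentence are made up for the example and are not part of the analysis tools.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (number of tokens, number of types) for a list of words."""
    counts = Counter(tokens)
    # Tokens: every instance counts. Types: each distinct word counts once.
    return sum(counts.values()), len(counts)


sample = "the cat saw the dog and the bird".split()
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 8 tokens, 6 types ("the" appears three times)
```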

Plugin: To build the Slovak stemmer into an Elasticsearch plugin, I adapted the code for the Czech Elasticsearch stemmer, which has a very efficient implementation for manipulating suffixes. I also added prefix stripping for the superlative prefix naj-, based on a review of Slovak morphology, the presence of similar processing for Polish, and speaker feedback. The new Slovak stemmer will live in the search/extra plugin.
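The naj- prefix stripping can be sketched in Python as below. This is only an illustration: the real plugin is not written in Python, and the minimum-remainder check (`MIN_REMAINDER`) is an assumption added here to avoid stripping very short words; the notes above do not say what conditions the plugin actually applies.

```python
SUPERLATIVE_PREFIX = "naj"
MIN_REMAINDER = 3  # assumption: leave at least this many characters after stripping


def strip_superlative_prefix(word):
    """Strip a leading naj- if enough of the word remains afterwards."""
    remainder = word[len(SUPERLATIVE_PREFIX):]
    if word.startswith(SUPERLATIVE_PREFIX) and len(remainder) >= MIN_REMAINDER:
        return remainder
    return word


print(strip_superlative_prefix("najlepší"))  # lepší
print(strip_superlative_prefix("naj"))       # naj (nothing left to keep)
```

In a full stemmer this step would run before suffix stripping, so that a superlative and its base form have a chance of ending up with the same stem.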

Baseline (Stemmer vs Plugin): With an update to my analysis analysis tools, I was able to use the existing Slovak analysis chain (which is just the "default" analysis), using the standard tokenizer and the ICU normalizer, to test the original stemmer (with the naj- upgrade). After implementing the plugin (and correcting a couple of small bugs in the stemming implementation), there were no differences between the command-line stemmer results and the baseline Slovak stemmer plugin results, except for a small bug in my analysis analysis tool, which dropped "0" as a stem. (For the CS nerds: !"" == true, but also !0 == true; oops.)

So, at this point we have a working Slovak Elasticsearch analysis chain that uses the same tokenization and normalization as the current default Slovak processing, with the addition of the Slovak stemmer.

Adding ICU Folding
I'm a fan of enabling ICU Folding when working in an analysis chain, with exceptions for the letters that are in the alphabet of the host language. Fun fact: the Slovak alphabet "has 46 letters which makes it the longest Slavic and European alphabet," so there are a lot of letters to be put in the exception list: Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž.
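A configuration along these lines might look like the following sketch, written as a Python dict in the shape of Elasticsearch analysis settings. The `icu_folding` filter and its `unicode_set_filter` parameter come from the Elasticsearch ICU analysis plugin. The filter name `slovak_icu_folding`, the analyzer name `text`, the stemmer filter name `slovak_stemmer`, and the order of the filters are assumptions for illustration, not the deployed config.

```python
import json
import unicodedata

# The Slovak letters listed above, normalized to precomposed (NFC) form.
SLOVAK_LETTERS = unicodedata.normalize(
    "NFC", "ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž"
)

slovak_analysis = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; folds everything except Slovak letters.
            "slovak_icu_folding": {
                "type": "icu_folding",
                "unicode_set_filter": "[^" + SLOVAK_LETTERS + "]",
            },
        },
        "analyzer": {
            # Hypothetical analyzer name and filter order.
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["icu_normalizer", "slovak_stemmer", "slovak_icu_folding"],
            },
        },
    }
}

print(json.dumps(slovak_analysis, ensure_ascii=False, indent=2))
```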

Enabling ICU folding causes new mergers in 3.27% of types (unique words) but only 0.187% of tokens (instances of words) in my 10K Wikipedia sample, versus 9.442% of types and 1.744% of tokens in my 10K Wiktionary sample. That makes sense, because Wiktionary has a lot more non-Slovak words, particularly IPA pronunciations.
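One way to compute numbers like these is sketched below. It is not necessarily how the figures above were produced: the definition of "affected" (a type whose post-analysis group gains at least one member it did not have before) and all names in the code are assumptions for illustration. It expects a non-empty token list.

```python
from collections import Counter, defaultdict


def merger_impact(tokens, old_analyze, new_analyze):
    """Return (% of types, % of tokens) involved in new mergers."""
    freq = Counter(tokens)

    def group(analyze):
        # Map each analyzed form to the set of original types that produce it.
        groups = defaultdict(set)
        for t in freq:
            groups[analyze(t)].add(t)
        return groups

    old_groups = group(old_analyze)
    new_groups = group(new_analyze)

    # A type is affected if its new group has members its old group lacked.
    affected = {
        t for t in freq
        if new_groups[new_analyze(t)] - old_groups[old_analyze(t)]
    }
    type_pct = 100 * len(affected) / len(freq)
    token_pct = 100 * sum(freq[t] for t in affected) / sum(freq.values())
    return type_pct, token_pct


toy = ["zurich", "zurich", "zürich", "bern"]
type_pct, token_pct = merger_impact(
    toy,
    old_analyze=lambda w: w,                    # stand-in for the old chain
    new_analyze=lambda w: w.replace("ü", "u"),  # stand-in for folding
)
print(round(type_pct, 1), token_pct)  # 66.7 75.0
```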

A quick review of changes on the Wikipedia and Wiktionary corpora:
 * Stripping vowel marks in Arabic and Hebrew.
 * Removal of non-Slovak diacritics in Cyrillic, Greek, and Latin, which is especially nice for removing stress marks from Cyrillic and Greek.
 * Normalized forms in Devanagari, Arabic, Katakana, Hiragana, Telugu, Thaana, Indic numbers, and IPA.
 * There are a lot of affected IPA forms in the Wiktionary corpus, in particular. Because Slovak has a shallow orthography, many of the IPA forms merge with a form of the word they provide the pronunciation for.
 * Normalization of curly quotes/apostrophes—there are a lot of these that create new, good mergers!

Overall, these look like either low-impact changes, or good changes.
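To give a feel for the diacritic-stripping and quote-normalizing changes listed above, here is a small Python approximation using only the standard library. It is not ICU folding, which covers far more than this. It only decomposes characters, drops combining marks, and straightens curly quotes, while leaving the Slovak letters alone. The function name is made up for the example.

```python
import unicodedata

SLOVAK_LETTERS = set(
    unicodedata.normalize("NFC", "ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž")
)
QUOTES = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def fold_except_slovak(text):
    """Rough stand-in for ICU folding with Slovak letters exempted."""
    text = unicodedata.normalize("NFC", text).translate(QUOTES)
    out = []
    for ch in text:
        if ch in SLOVAK_LETTERS:
            out.append(ch)  # protected: keep the Slovak letter as-is
            continue
        for part in unicodedata.normalize("NFD", ch):
            # Drop combining marks (category Mn), keep everything else.
            if unicodedata.category(part) != "Mn":
                out.append(part)
    return "".join(out)


print(fold_except_slovak("Zürich"))        # Zurich (ü is not a Slovak letter)
print(fold_except_slovak("žena"))          # žena (ž is protected)
print(fold_except_slovak("мо\u0301ст"))    # мост (stress mark removed)
print(fold_except_slovak("O\u2019Brien"))  # O'Brien (curly apostrophe straightened)
```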

Conclusion
It makes sense to enable the non-Slovak ICU folding for Slovak-language projects when we enable the stemmer.

Based on the previous review of the original command line stemmer, the new analysis config is ready to be deployed to production.

We will need to:


 * Deploy the updated search/extra plugin (see T190815)
 * Deploy the new analyzer config to use the plugin (see T190815)
 * Since this relies on an update to the search/extra plugin, the plugin should be deployed before the analyzer update is deployed; otherwise any incidental re-indexing in the interim could fail.
 * Re-index Slovak-language wikis (phab ticket TBD)