User:TJones (WMF)/Notes/Folding Diacritics in Slovak

June 2019 — See TJones_(WMF)/Notes for other projects. See also T223787. For help with the technical jargon used in Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
In March 2018 I did an analysis of potential Slovak Stemmers and the use of the best stemmer in an analysis chain.

As part of my usual process for new analysis chains—after my experience with doing it exactly wrong for Swedish (see T155822)—I enabled ICU folding (i.e., fairly aggressive normalization of unicode characters, including diacritic removal), with exceptions for letters in the alphabet of the wiki's language (i.e., the Slovak alphabet)—in this case, Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž.
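As a rough illustration of folding with exceptions, here is a minimal Python sketch. It only approximates the diacritic-removal part of ICU folding by stripping combining marks, and it is not the production configuration; the names `SLOVAK_EXEMPT` and `fold` are made up for this example.

```python
import unicodedata

# Slovak letters with diacritics, taken from the list above (illustrative constant).
SLOVAK_EXEMPT = frozenset("ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž")

def fold(text, exempt=frozenset()):
    """Strip combining marks from every character that is not in `exempt`.

    This is only an approximation of ICU folding, which does much more.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exempt:
            out.append(ch)  # exempt letters pass through untouched
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(fold("Pařízek", exempt=SLOVAK_EXEMPT))  # ř is folded, í is kept
print(fold("Pařízek"))                         # no exceptions: everything is folded
```

With the exceptions in place only the non-Slovak ř loses its diacritic; without them the Slovak í is folded as well.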

In a clever bit of foreshadowing, I looked briefly at the question of whether to enable folding before or after stemming, though at the time it didn't seem to matter much because the differences were very slight if you exempt the Slovak letters from the folding. I also cleverly reminded my future self that we have a "preserve" option, which allows us to index both the folded and unfolded version of a token.
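The idea behind the "preserve" option can be sketched as follows. This is a toy token stream in Python, assuming that folding is just mark stripping; the function names are hypothetical and this is not how the real filter is implemented.

```python
import unicodedata

def strip_marks(token):
    """Remove combining marks (a stand-in for real ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_with_preserve(tokens):
    """Yield each original token, plus its folded form when the two differ."""
    for tok in tokens:
        yield tok
        folded = strip_marks(tok)
        if folded != tok:
            yield folded

print(list(fold_with_preserve(["škola", "pole"])))
```

Both škola and skola end up in the index, so an exact-diacritics query and a diacriticless query can each match.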

At the 2019 Hackathon in Prague, Jetam2 and I talked about Slovak search, and he told me why it sucks... he expressed a concern that people don't always have access to a Slovak keyboard, so I said I'd look into the impact of removing the exceptions from ICU folding (and here we are). I looked into the Universal Language Selector and there is already a Slovak keyboard mapping for touch typists, and it would be possible to create a keyboard that could convert more widely available characters into diacritical characters. (For example, á as, ô as , č as  , ä as  .)

However, in the discussion on Phabricator (T155822) and on the Slovak Wikipedia Teahouse, Teslaton pointed out that Slovak search usually ignores diacritics and it usually doesn't cause any problems.

I was still worried (in the abstract) that Wikipedia and Wiktionary in particular have lots of text from other languages (or in IPA) which could cause weird results (though this should be mitigated in many cases by matching in the plain field). There's also the possible interaction of folding and stemming, which might be mitigated by changes to the stemmer, since we maintain the code for it.

Data
I pulled my usual collection of 10,000 Wikipedia articles and 10,000 Wiktionary entries to get a sample of standard formal language (from Wikipedia) and a larger sample of non-Slovak text (from Wiktionary). I deduplicated by line, primarily so headings (local versions of "References", "Noun", etc.) are not over-represented.
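Deduplicating by line can be sketched like this. It is a minimal version that assumes lines are compared after trimming whitespace and that empty lines are dropped; the actual tooling used for the samples may differ.

```python
def dedupe_lines(lines):
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

sample = ["References", "Some article text.", "References", "", "Other text."]
print(dedupe_lines(sample))
```

A heading that appears in thousands of articles is then counted once, so it no longer dominates the token frequencies.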

I also pulled a random collection of 50,000 user queries on Slovak Wikipedia over a couple of months and 9,266 (~9k) user queries on Slovak Wiktionary (that's pretty much everything that was available).

Analyzing the user queries will be a bit of a new thing, since I usually use the Wikipedia article text as a reference for the way people write in a language. Some of the info from there will probably be a bit less detailed compared to the usual analysis.

Query Data: Inspection
I started out by looking at the most common queries and most common words in queries on Slovak Wikipedia. Two of the top results were Zuzana Čaputová and Maroš Šefčovič, the two candidates in the recent Slovak presidential election. There were many variants on their names. Ignoring leading or trailing spaces, single-word queries include:

48 čaputová        29 šefčovič
33 caputova        29 Sefcovic
31 Caputova        27 Šefčovič
26 Čaputová        24 sefcovic
17 Čaputova         3 Šefčovic
11 čaputova         1 šefcovic
 1 CaputovA         1 ŠEFČOVIČ
                    1 Sefčovič
                    1 Sefcovič

If we ignore case, the lowercased lists look like this:

74 čaputová        57 šefčovič
65 caputova        53 sefcovic
28 čaputova         3 šefčovic
                    1 šefcovic
                    1 sefčovič
                    1 sefcovič
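The lowercased counts can be reproduced from the case-sensitive ones with a few lines of Python. The counts below are copied from the Čaputová column above; `merge_case` is just an illustrative helper.

```python
from collections import Counter

# Case-sensitive query counts for the Čaputová variants listed above.
caputova_queries = {
    "čaputová": 48, "caputova": 33, "Caputova": 31, "Čaputová": 26,
    "Čaputova": 17, "čaputova": 11, "CaputovA": 1,
}

def merge_case(counts):
    """Sum the counts of queries that differ only by case."""
    merged = Counter()
    for query, n in counts.items():
        merged[query.lower()] += n
    return merged

print(merge_case(caputova_queries).most_common())
# [('čaputová', 74), ('caputova', 65), ('čaputova', 28)]
```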

Clearly, at least for these two presumably very well-known names, searching without diacritics is common.

I searched for other common words with diacritics and then searched for variants without diacritics, and in many cases that seem relatively unambiguous (such as čím, článok, Kočner, Košice, planéta, škola, štáty, voľby, Žilina, and živí) the diacriticless version is also common, often equally common, as above. (It seems that the length-marking diacritic ´ is a bit more likely to be dropped, especially for the vowels áéíóúý but also for the consonants ĺŕ, but it seems to happen frequently with any diacritic.)

So, clearly Slovak searchers are expecting diacriticless searches to get results, contrary to the expectations of the Swedish searchers.

I have a future concern for Slovak Wiktionary. Right now it only has about 26K entries so there aren't as many other languages represented. However, on English Wiktionary, there are often diacriticless versions of Slovak words in other Slavic languages. (Google Translate also often suggests Czech (and a couple of times Slovenian and Swedish) for the diacriticless versions of words.)

On the other hand, (a) English Wiktionary folds all diacritics, and it usually works okay, (b) if Slovak-speaking searchers are used to diacriticless search, they at least won't be caught off guard, and (c) quotes are always available, and they aren't as restrictive on Wiktionary as they are on Wikipedia (because (i) you are more likely to be looking for an exact form of a word, and (ii) all forms of a word are much more likely to be on the page for a given base form).

Option 1: Enabling Folding
The first thing I tried was disabling the ICU folding exception for the Slovak diacritical letters (Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž).

Interestingly, this led to an increase in the number of post-analysis tokens in the Wikipedia corpus (i.e., the number of distinct words coming out of the analysis chain), from 131,091 to 137,538.

There were a lot of new collisions—words that would be indexed the same: 12,207 pre-analysis types (5.484% of pre-analysis types) / 168,977 tokens (10.305% of tokens) were added to 4,863 groups (3.710% of post-analysis types), affecting a total of 26,744 pre-analysis types (12.014% of pre-analysis types) in those groups.

Collisions are what we expect: words getting folded together. The impact is pretty high, though: about 5% of distinct words and 10% of all words got folded together with something new.

There were also a lot of splits: 9,220 pre-analysis types (4.142% of pre-analysis types) / 41,111 tokens (2.507% of tokens) were lost from 4,475 groups (3.414% of post-analysis types), affecting a total of 29,339 pre-analysis types (13.180% of pre-analysis types) in those groups.

That's 4% of distinct words and 2.5% of all words that would no longer be indexed together.
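As a sanity check on the group percentages quoted for collisions and splits, the arithmetic below assumes that "% of post-analysis types" is measured against the old count of 131,091 post-analysis types. That denominator is an assumption on my part, but it does reproduce the quoted figures.

```python
# Assumed denominator: post-analysis types before the change (131,091 in the text).
OLD_POST_ANALYSIS_TYPES = 131_091

def pct(part, whole):
    """Percentage rounded to three decimals, as in the figures above."""
    return round(100 * part / whole, 3)

print(pct(4_863, OLD_POST_ANALYSIS_TYPES))  # groups that gained members (quoted as 3.710%)
print(pct(4_475, OLD_POST_ANALYSIS_TYPES))  # groups that lost members (quoted as 3.414%)
```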

The main cause of the splits seems to be interference with the stemmer.

The Wiktionary corpus had a roughly similar number of collisions: 5% of distinct words and 4.7% of all words, and very differently balanced splits: <1% of distinct words, but still 5% of all words. The difference seems to come down to a much smaller sample size—the Wikipedia corpus has approximately 25x as many tokens in it—and many more distinct words in the corpus.

[Note: I've collapsed the fold-first examples since we didn't get any speaker review, and stemming first is probably the right way to go.]

Speaker Review: Folding Groups that Lost Members
The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "lost" words no longer found the remaining words, and vice versa?

Random Sample
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that lost members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)

All of these examples seem to be caused by the presence of the following suffixes (with counts of how many times they are lost in the sample below; the total is more than 25 because some groups lost multiple members):

 1 -ách
 1 -aný
12 -ého
 3 -ému
 1 -í
 7 -ú
11 -ých
 5 -ým
 6 -ými

Other obvious suffixes (-á, -é, -ý) don't have problems because unaccented versions (-a, -e, -y) are also suffixes in Slovak (though they have different meanings). -é and -e can cause differences in the way the rest of the word root is normalized, though there's no evidence of that here.
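To make the mechanism concrete, here is a toy suffix stripper in Python. It knows only the accented suffixes counted above and is not the real Slovak stemmer; `toy_stem` and `strip_marks` are hypothetical helpers. It shows why folding before stemming blocks suffix removal, while stemming before folding does not.

```python
import unicodedata

# Toy suffix list: only the accented endings counted above, longest first.
SUFFIXES = sorted(
    ["ách", "aný", "ého", "ému", "í", "ú", "ých", "ým", "ými"],
    key=len, reverse=True,
)

def strip_marks(word):
    """Remove combining marks (a stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(word):
    """Strip the first (longest) matching suffix, leaving at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

word = "gazdovskými"
print(toy_stem(word))               # stem only: suffix -ými is recognized
print(toy_stem(strip_marks(word)))  # fold first: -ymi is not in the list
print(strip_marks(toy_stem(word)))  # stem first, then fold
```

In this toy setup, folding first leaves gazdovskymi unstemmed, which is the same kind of split seen in the gazdovsk group.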

Key:


 * gazdovsk << 1
 * gazdovsk indicates that all of these words were stemmed to gazdovsk. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * << 1 indicates that from "old" to "new", this stemming group lost 1 member.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with Slovak letters folded before stemming
 * [2 gazdovský] — gazdovský occurs 2 times in our corpus (of 10K articles)

Lost members are bolded.

gazdovsk << 1
o: [1 Gazdovská][1 Gazdovské][1 gazdovské][2 gazdovský][1 gazdovskými]
n: [1 Gazdovská][1 Gazdovské][1 gazdovské][2 gazdovský]

odlet << 1
o: [1 odletom][1 odletovou][1 odletových]
n: [1 odletom][1 odletovou]

implementovan << 1
o: [1 implementovaná][3 implementované][1 implementovanú][6 implementovaný]
n: [1 implementovaná][3 implementované][6 implementovaný]

vyzdvihovan << 1
o: [1 vyzdvihované][1 vyzdvihovaný][1 vyzdvihovaných]
n: [1 vyzdvihované][1 vyzdvihovaný]

novonaroden << 1
o: [1 Novonarodená][1 novonarodených]
n: [1 Novonarodená]

rytmick << 2
o: [1 Rytmická][3 rytmickej][1 rytmickou][2 rytmicky][8 rytmická][2 rytmické] [1 rytmickému][3 rytmickú]
n: [1 Rytmická][3 rytmickej][1 rytmickou][2 rytmicky][8 rytmická][2 rytmické]

angarsk << 1
o: [1 Angarského][2 Angarský]
n: [2 Angarský]

komunikuj << 1
o: [6 komunikuje][5 komunikujú]
n: [6 komunikuje]

balneologick << 1
o: [6 Balneologické][2 Balneologického][1 balneologické]
n: [6 Balneologické][1 balneologické]

vyjadren << 1
o: [1 vyjadrenou][14 vyjadrená][4 vyjadrené][1 vyjadreného][5 vyjadrení] [2 vyjadrený]
n: [1 vyjadrenou][14 vyjadrená][4 vyjadrené][5 vyjadrení][2 vyjadrený]

domorod << 4
o: [3 domorodé][2 domorodého][1 domorodí][6 domorodých][1 domorodým] [2 domorodými]
n: [3 domorodé][1 domorodí]

divadl << 1
o: [14 Divadla][12 Divadle][48 Divadlo][1 Divadlom][4 Divadlá][128 divadla] [1 divadlami][51 divadle][127 divadlo][10 divadlom][15 divadlá][10 divadlách]
n: [14 Divadla][12 Divadle][48 Divadlo][1 Divadlom][4 Divadlá][128 divadla] [1 divadlami][51 divadle][127 divadlo][10 divadlom][15 divadlá]

hamersk << 1
o: [2 Hamerského][1 Hamerský]
n: [1 Hamerský]

karpatsk << 6
o: [1 KARPATSKÁ][6 Karpatskej][1 Karpatsko][2 Karpatská][10 Karpatské] [2 Karpatského][1 Karpatskí][1 Karpatskú][3 Karpatský][1 Karpatskými] [2 karpatskej][2 karpatsko][1 karpatské][9 karpatského][2 karpatskému] [1 karpatskí][3 karpatský][5 karpatských]
n: [1 KARPATSKÁ][6 Karpatskej][1 Karpatsko][2 Karpatská][10 Karpatské] [1 Karpatskí][3 Karpatský][2 karpatskej][2 karpatsko][1 karpatské] [1 karpatskí][3 karpatský]

samotn << 9
o: [11 Samotná][16 Samotné][1 Samotného][1 Samotnému][1 Samotní][1 Samotnú] [17 Samotný][35 samotnej][14 samotnom][6 samotnou][28 samotná][33 samotné] [36 samotného][2 samotnému][6 samotní][15 samotnú][36 samotný][12 samotných] [14 samotným][4 samotnými]
n: [11 Samotná][16 Samotné][1 Samotní][17 Samotný][35 samotnej][14 samotnom] [6 samotnou][28 samotná][33 samotné][6 samotní][36 samotný]

krewsk << 1
o: [1 Krewská][1 krewská][1 krewskú]
n: [1 Krewská][1 krewská]

slienit << 2
o: [1 slienité][1 slienitého][1 slienitých]
n: [1 slienité]

madridsk << 1
o: [1 Madridským][1 madridskom]
n: [1 madridskom]

rastr << 2
o: [1 Rastrová][1 rastra][2 rastri][1 rastrovej][1 rastrového][1 rastrový] [1 rastrových]
n: [1 Rastrová][1 rastra][2 rastri][1 rastrovej][1 rastrový]

ontogenetick << 1
o: [1 ontogenetického][1 ontogenetický]
n: [1 ontogenetický]

pondelk << 1
o: [3 pondelka][1 pondelkového]
n: [3 pondelka]

nitovan << 1
o: [1 nitované][1 nitovaných]
n: [1 nitované]

zostupuj << 1
o: [1 Zostupuje][3 zostupuje][1 zostupujú]
n: [1 Zostupuje][3 zostupuje]

umiestnen << 5
o: [1 Umiestnený][6 umiestnenej][6 umiestnenou][62 umiestnená][92 umiestnené] [4 umiestneného][11 umiestnení][4 umiestnenú][50 umiestnený][13 umiestnených] [4 umiestneným][4 umiestnenými]
n: [1 Umiestnený][6 umiestnenej][6 umiestnenou][62 umiestnená][92 umiestnené] [11 umiestnení][50 umiestnený]

rovnocenn << 3
o: [1 rovnocennej][6 rovnocenné][1 rovnocenní][4 rovnocenných][2 rovnocenným] [1 rovnocennými]
n: [1 rovnocennej][6 rovnocenné][1 rovnocenní]
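The "<< N" figure and the lost members of any group can be recomputed from its o: and n: lists. The sketch below does this for the gazdovsk group; the dictionaries are typed in by hand from the listing above and `lost_members` is an illustrative helper, not part of the analysis tooling.

```python
def lost_members(old_group, new_group):
    """Words present in the old stemming group but missing from the new one."""
    return [word for word in old_group if word not in new_group]

# The gazdovsk group: word -> count, before (o:) and after (n:) folding.
old = {"Gazdovská": 1, "Gazdovské": 1, "gazdovské": 1, "gazdovský": 2, "gazdovskými": 1}
new = {"Gazdovská": 1, "Gazdovské": 1, "gazdovské": 1, "gazdovský": 2}

lost = lost_members(old, new)
print(f"gazdovsk << {len(lost)}", lost)
```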

High-Impact Groups
There are thirteen stemming groups that lost 10 or more members. Many of the same suffixes as above show up, so I've excluded the groups that only appear in the list because they are more common words with more variants carrying the suffixes listed above.

There are three new phenomena here:


 * novším and najbohatším: the lack of accent on the í in ím blocks the suffix stripping, similar to the suffixes above. I guess these are just rarer.
 * bohatá and bohatý: the lack of accent on the final vowel allows these to be interpreted as -ata and -aty suffixes, which are removed, giving a stem of boh.
 * angličtina and turečtine: after stripping certain suffixes, čt is converted to ck, but ct is not.

I've bolded the "interesting" examples that are lost below.

anglick << 11
o: [48 Anglicka][3 Anglickej][47 Anglicko][10 Anglickom][1 Anglickou] [8 Anglická][2 Anglické][1 Anglickí][3 Anglický][2 Angličtina] [359 anglickej][6 anglicko][19 anglickom][5 anglickou][1073 anglicky] [30 anglická][16 anglické][75 anglického][2 anglickému][4 anglickú] [190 anglický][16 anglických][9 anglickým][1 anglickými][8 angličtina] [45 angličtine][1 angličtinou][20 angličtiny]
n: [48 Anglicka][3 Anglickej][47 Anglicko][10 Anglickom][1 Anglickou] [8 Anglická][2 Anglické][1 Anglickí][3 Anglický][359 anglickej] [6 anglicko][19 anglickom][5 anglickou][1073 anglicky][30 anglická] [16 anglické][190 anglický]

bohat << 10
o: [2 Bohaté][1 Bohatému][1 Bohatý][12 bohatej][8 bohato][2 bohatom] [13 bohatou][20 bohatá][22 bohaté][7 bohatého][5 bohatí][15 bohatú] [21 bohatý][12 bohatých][10 bohatým][3 bohatými][4 najbohatším]
n: [2 Bohaté][12 bohatej][8 bohato][2 bohatom][13 bohatou][22 bohaté] [5 bohatí]

nov << 14
o: [1 NOV][1 NOVA][1 NOVÁ][8 Nov][14 Nova][1 Nove][66 Novej][9 Novi] [1 Novo][54 Novom][1 Novou][153 Nová][181 Nové][1 NovéHO][47 Nového] [2 Novému][3 Noví][10 Novú][103 Nový][14 Nových][15 Novým][1 Novými] [1 nov][2 nova][3 nove][119 novej][34 novo][34 novom][27 novou] [2 novus][1 novy][98 nová][282 nové][152 nového][15 novému] [12 noví][84 novú][209 nový][156 nových][50 novým][30 novými] [1 novším]
n: [1 NOV][1 NOVA][1 NOVÁ][8 Nov][14 Nova][1 Nove][66 Novej][9 Novi] [1 Novo][54 Novom][1 Novou][153 Nová][181 Nové][3 Noví][103 Nový] [1 nov][2 nova][3 nove][119 novej][34 novo][34 novom][27 novou] [2 novus][1 novy][98 nová][282 nové][12 noví][209 nový]

tureck << 10
o: [26 Turecka][1 Tureckej][31 Turecko][6 Tureckom][5 Turecká][2 Turecké] [1 Tureckí][1 Tureckú][3 Turecký][1 Tureckých][34 tureckej][1 turecki] [1 turecko][1 tureckom][3 tureckou][9 turecky][8 turecká][4 turecké] [9 tureckého][4 tureckému][2 tureckú][16 turecký][18 tureckých] [9 tureckým][3 tureckými][7 turečtine][1 turečtiny]
n: [26 Turecka][1 Tureckej][31 Turecko][6 Tureckom][5 Turecká][2 Turecké] [1 Tureckí][3 Turecký][34 tureckej][1 turecki][1 turecko][1 tureckom] [3 tureckou][9 turecky][8 turecká][4 turecké][16 turecký]

High-Frequency Words
I also looked for high-frequency words that were lost from a group, but there weren't any in the Wikipedia corpus. The Wiktionary corpus had one example, podstatného, which means "of the noun", and so occurs very frequently in Wiktionary. The other lost words in that group have the now-familiar suffixes.

podstatn << 4
o: [8 podstatné][1909 podstatného][1 podstatných][1 podstatným][1 podstatnými]
n: [8 podstatné]

Speaker Review: Folding Groups that Gained Members
The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Random Sample
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)

Key:


 * amali >> 4
 * amali indicates that all of these words were stemmed to amali. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >> 4 indicates that from "old" to "new", this stemming group gained 4 members.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with Slovak letters folded before stemming
 * [12 Amália] — Amália occurs 12 times in our corpus (of 10K articles)

Note that which group is shown as "gaining" new members is always in favor of the stem with no accents. In the case of Putna "gaining" Pútny, pútne, and pútny, you could argue just as well that Pútny, pútne, and pútny, added Putna to their group. What actually happened is that the stem putn and the stem pútn merged. Similarly with the new additions to the budapest group.
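The merging of two stems into one group can be pictured as re-keying the old groups by their folded stem. The sketch below is only an illustration of that idea, using the putn and pútn stems named above; `merge_by_folded_stem` is a hypothetical helper, and mark stripping stands in for ICU folding.

```python
import unicodedata
from collections import defaultdict

def strip_marks(text):
    """Remove combining marks (a stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def merge_by_folded_stem(groups):
    """Re-key stemming groups by folded stem, merging groups that collide."""
    merged = defaultdict(list)
    for stem, words in groups.items():
        merged[strip_marks(stem)].extend(words)
    return dict(merged)

old_groups = {"putn": ["Putna"], "pútn": ["Pútny", "pútne", "pútny"]}
print(merge_by_folded_stem(old_groups))
# {'putn': ['Putna', 'Pútny', 'pútne', 'pútny']}
```

The merged group is reported under the unaccented stem, which is why putn appears to "gain" three members.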

A lot of the changes here are the kinds we'd expect to see, with accented versions of words (especially names) being merged. Some notes.


 * For longer words that aren't names, it's very likely that the words are related. For example, it's hard to imagine that pospolitosti and pospolitosť are not related, though whether searching for one should find the other is a different question.
 * Some of the groups, as always, are weird. The name Snipes loses its p when stemmed because the stemmer lops off a letter before an -es ending. Since sní matches that stem after folding, they are grouped together. This isn't great, but it is expected.
 * Similarly, niób after folding just happens to match the stemmed form of Nioba.

amali >> 4
o: [3 Amalia][1 Amalie]
n: [3 Amalia][1 Amalie][12 Amália][4 Amálie][1 Amáliina][2 Amáliou]

pospolitost >> 1
o: [1 pospolitosti]
n: [1 pospolitosti][2 pospolitosť]

polen >> 1
o: [2 Polen][1 polene]
n: [2 Polen][2 Poleň][1 polene]

sni >> 1
o: [1 Snipes]
n: [1 Snipes][1 sní]

niob >> 1
o: [1 Nioba][1 Niobe][1 Nioby]
n: [1 Nioba][1 Niobe][1 Nioby][1 niób]

dal >> 5
o: [4 Dal][2 Dala][8 Dale][2 Dalo][1 Dalou][1 Dalího][158 dal][47 dala] [57 dali][23 dalo]
n: [4 Dal][2 Dala][8 Dale][2 Dalo][1 Dalou][1 Dalího][158 dal][47 dala] [57 dali][23 dalo][12 dál][1 najďalej][2 Ďale][103 Ďalej][340 ďalej]

ruben >> 1
o: [1 Ruben]
n: [1 Ruben][2 Rubén]

ilov >> 3
o: [1 Ilové]
n: [1 Ilové][1 ílov][1 ílovou][2 ílové]

ultim >> 2
o: [1 Ultima][1 ultimo]
n: [1 Ultima][1 ultimo][2 ultimáta][1 Última]

taih >> 1
o: [2 Taiho]
n: [2 Taiho][2 Taihó]

ods >> 1
o: [6 ODS][6 ods]
n: [6 ODS][6 ods][1 odsať]

spas >> 4
o: [4 SPAS][2 Spas]
n: [4 SPAS][2 Spas][14 spása][1 spásať][4 spáse][2 spásy]

hellad >> 1
o: [2 Hellados]
n: [2 Hellados][1 Helládos]

evoqu >> 1
o: [1 Evoque]
n: [1 Evoque][1 évoque]

bedarieux >> 1
o: [1 Bedarieux]
n: [1 Bedarieux][2 Bédarieux]

parizek >> 1
o: [1 Parizek]
n: [1 Parizek][1 Pařízek]

hojnost >> 1
o: [2 hojnosti]
n: [1 Hojnosť][2 hojnosti]

giap >> 1
o: [1 GIAP]
n: [1 GIAP][1 Giáp]

bazin >> 1
o: [2 Bazin]
n: [2 Bazin][1 bažin]

styri >> 2
o: [1 Styria]
n: [1 Styria][6 Štyria][13 štyria]

baton >> 1
o: [2 Baton]
n: [2 Baton][4 Batón]

putn >> 3
o: [1 Putna]
n: [1 Putna][1 Pútny][1 pútne][1 pútny]

budapest >> 4
o: [7 Budapest]
n: [7 Budapest][1 Budapesť][78 Budapešti][35 Budapešť][1 Budapešťi]

zp >> 1
o: [1 ZP]
n: [1 ZP][3 ŽP]

partizan >> 8
o: [1 Partizanom]
n: [1 Partizanom][2 Partizán][3 Partizáni][2 partizán][1 partizána] [5 partizáni][1 partizánmi][11 partizánom][3 partizánov]

High-Impact Groups
There were 140 groups with 10 or more additions, so I raised the threshold to 15 or more additions, which gave 29 groups. I've removed the groups that are like the putn or budapest groups above, where a large group with diacritics merged with one or two distinct words (after ignoring upper- and lowercase) that have a stem without diacritics.

The remaining 16 groups are shown below. These represent large groups with diacritics merging with medium to large groups without diacritics. The converse—large groups without diacritics merging with smaller groups with diacritics—is not represented. I can go looking for examples if anyone thinks they would be significantly different from the ones here.

The stal and pol groups look to be made up of the largest distinct groups that merged.

byval >> 16
o: [1 ByVal][1 byvalá][1 byvalé][1 byvalý]
n: [1 ByVal][1 Býval][4 Bývalá][2 Bývalé][4 Bývalí][9 Bývalý][1 byvalá] [1 byvalé][1 byvalý][27 býval][9 bývala][46 bývalej][17 bývali] [5 bývalo][20 bývalom][7 bývalou][60 bývalá][13 bývalé][12 bývalí] [189 bývalý]

desperad >> 1
o: [1 Desperado]
n: [1 Desperado][1 desperádmi]

lud >> 19
o: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus]
n: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus][1 luďom][1 ĽUDÍ][1 Ľud] [1 Ľuda][1 Ľudo][1 Ľudoví][1 Ľudí][1 Ľuďom][46 ľud][1 ľude][6 ľudi] [1 ľudmi][7 ľudom][3 ľudy][421 ľudí][42 ľuďmi][13 ľuďoch][25 ľuďom] [1 ľuďí]

narodn >> 18
o: [1 Narodna][1 Narodni][1 narodne]
n: [1 Narodna][1 Narodni][2 NÁRODNÁ][92 Národnej][22 Národnom][13 Národnou] [44 Národná][37 Národné][16 Národní][1 Národního][43 Národný][1 narodne] [10 národne][103 národnej][6 národno][37 národnom][16 národnou] [96 národná][171 národné][9 národní][103 národný]

nas >> 17
o: [77 NASA][1 NaS][1 Nas][1 Naso][4 nas][1 nasi][1 naso]
n: [77 NASA][1 NaS][1 Nas][1 Naso][11 Naša][8 Naše][1 Našej][5 Naši][1 Našou] [9 Náš][4 nas][2 nasatý][1 nasi][1 naso][10 naša][19 naše][30 našej] [4 naši][37 našich][15 našom][2 našou][1 naší][84 nás][19 náš]

plan >> 19
o: [12 Plan][3 Planina][1 plan][2 plane][1 planej][14 planina][6 planine] [3 planinou][15 planiny][1 plané]
n: [12 Plan][3 Planina][11 Plán][2 Plánom][3 Plány][1 Pláň][1 plan][2 plane] [1 planej][14 planina][6 planine][3 planinou][15 planiny][1 plané] [12 planéte][1 planín][49 plán][16 pláne][3 pláni][4 plánmi][5 plánoch] [7 plánom][20 plánov][3 plánovať][37 plány][1 plání][3 pláň][2 pláňami] [1 pláňou]

pol >> 16
o: [1 POLE][5 Pol][6 Pola][24 Pole][2 Poli][5 Polo][1 Polom][1 Polus] [69 pol][81 pole][56 poli][5 polo][2 polom][8 polos][2 poly][19 polí] [1 polích]
n: [1 POLE][5 Pol][6 Pola][24 Pole][2 Poli][5 Polo][1 Polom][1 Polus] [1 Poľa][1 Póly][69 pol][81 pole][1 poletí][56 poli][5 polo][2 polom] [8 polos][2 poly][19 polí][1 polích][1 polôch][39 poľ][57 poľa] [4 poľami][13 poľom][13 pól][5 póla][16 póle][2 pólmi][12 pólo] [2 póloch][5 pólom][1 póly]

polsk >> 17
o: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][1 polski][1 polsko]
n: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][93 Poľska] [20 Poľskej][138 Poľsko][20 Poľskom][15 Poľská][11 Poľské][2 Poľskí] [10 Poľský][1 polski][1 polsko][49 poľskej][22 poľsko][12 poľskom] [6 poľskou][37 poľsky][32 poľská][22 poľské][7 poľskí][82 poľský]

post >> 18
o: [41 Post][102 post][25 poste][4 postoch][1 postom][1 postov][2 posty]
n: [41 Post][2 Pošta][1 Poštovou][2 Poštová][2 Poštové][1 Poštový][102 post] [25 poste][4 postoch][1 postom][1 postov][2 posty][12 pošta][2 pošte] [5 poštou][2 poštovej][1 poštovou][4 poštová][1 poštové][1 poštoví] [4 poštový][15 pošty][1 pôst][1 pôsty][3 pôšt]

povodn >> 16
o: [9 povodne][3 povodni][2 povodní]
n: [127 Pôvodne][1 Pôvodnou][25 Pôvodná][17 Pôvodné][1 Pôvodní][29 Pôvodný] [9 povodne][3 povodni][2 povodní][5 povodňami][3 povodňou][307 pôvodne] [77 pôvodnej][17 pôvodnom][8 pôvodnou][19 pôvodná][61 pôvodné][6 pôvodní] [55 pôvodný]

premier >> 15
o: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][3 première]
n: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][1 Premiér][22 Premiéra] [2 Premiérom][1 Premiérový][3 première][33 premiér][26 premiéra] [14 premiére][10 premiérom][2 premiérou][1 premiérov][4 premiérovo] [1 premiérovom][3 premiérový][6 premiéry][1 pre­miér]

seri >> 17
o: [23 Serie][1 Serio][1 seria]
n: [23 Serie][1 Serio][9 Séria][11 Série][4 Sérii][4 Sériová][3 Sériové] [1 seria][65 séria][1 sériami][102 série][52 sérii][9 sériou][10 sériovej] [11 sériovo][2 sériovou][8 sériová][5 sériové][5 sériový][11 sérií]

stal >> 19
o: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][862 stal][338 stala][1 stale][150 stali][173 stalo]
n: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][3 Stál][4 Stála][16 Stále][3 Stálej][2 Stáli][1 Stálo][1 Stálou] [2 Stály][862 stal][338 stala][1 stale][2 staletí][150 stali][173 stalo] [67 stál][41 stála][249 stále][7 stálej][28 stáli][19 stálo][2 stálom] [3 stálou][13 stály][1 Štál]

stat >> 20
o: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí]
n: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí] [71 stať][4 stát][12 stáť][2 sťatá][2 sťatí][1 sťatý][1 sťať][14 Štát] [1 Štátoch][3 Štátov][3 Štáty][1 štatom][1 štatov][186 štát][124 štáte] [37 štátmi][116 štátoch][66 štátom][238 štátov][91 štáty]

studi >> 21
o: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio] [1 studií]
n: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][1 Stúdió][2 studie] [7 studio][1 studií][12 Štúdia][3 Štúdie][6 Štúdio][2 Štúdiom][3 Štúdiové] [1 Štúdioví][2 Štúdiá][140 štúdia][4 štúdiami][57 štúdie][12 štúdii] [18 štúdio][17 štúdiom][3 štúdiou][3 štúdiovom][1 štúdiová][5 štúdiové] [67 štúdiový][16 štúdiá][50 štúdií]

system >> 17
o: [1 SYSTEM][52 System][17 system][1 systema][1 systeme]
n: [1 SYSTEM][52 System][65 Systém][1 Systémová][1 Systémové][10 Systémy] [17 system][1 systema][1 systeme][415 systém][16 systémami][67 systéme] [37 systémoch][70 systémom][86 systémov][3 systémovej][1 systémovom] [2 systémová][9 systémové][4 systémový][102 systémy][1 sýstéma]

High-Frequency Words
I also looked for high-frequency words that were added to a group.

I dropped groups that are easily interpreted as a small number of words without diacritics being added to a larger group of words with diacritics, one of which is high-frequency. For example, 1 instance of ktoru would be indexed with 1063 instances of ktorú, which isn't actually very interesting (and may just be a typo). (Though, see the mixed groups below for more on ktorú.)

I kept groups where the group being added to had at least 3 different words in it, or at least one of the words had 10 or more instances. The remaining 8 groups with high-frequency words are below.
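The filter just described can be written as a small predicate. This sketch assumes that "different words" are counted case-sensitively (so Wikipedia and wikipedia count as two), which is my reading rather than something stated above; `is_interesting` is a hypothetical name.

```python
def is_interesting(old_group, min_words=3, min_count=10):
    """Keep a group if the group being added to has at least `min_words`
    distinct words, or at least one word with `min_count` or more instances."""
    biggest = max(old_group.values(), default=0)
    return len(old_group) >= min_words or biggest >= min_count

# Old (diacriticless) groups, word -> count, taken from the examples in this section.
ze = {"Ze": 4, "ze": 14}
wikipedi = {"Wikipedia": 2, "Wikipedie": 1, "wikipedia": 2}
ktoru = {"ktoru": 1}

for name, group in [("ze", ze), ("wikipedi", wikipedi), ("ktoru", ktoru)]:
    print(name, is_interesting(group))
```

Under these assumptions ze is kept because of its 14 instances, wikipedi because it has three forms, and ktoru is dropped.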

The most interesting collisions (ignoring case) seem to be:


 * co and čo
 * kedy and keď
 * su, sú, šu, and šú

az >> 2
o: [84 AZ][1 Az][6 az]
n: [84 AZ][1 Az][101 Až][6 az][2357 až]

cast >> 23
o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti]
n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často] [1 Častou][7 Častá][5 Časté][5 Častý][37 Časť][1 častej][1171 časti] [1 častich][541 často][1 častom][3 častou][7 častá][39 časté] [240 častí][6 častý][1 čas­to][1012 časť][18 časťami][122 časťou] [5 část][1 části]

co >> 3
o: [3 CO][28 Co][7 Comes][7 co][2 comes]
n: [3 CO][28 Co][7 Comes][26 Côtes][7 co][2 comes][50 Čo][1418 čo]

ked >> 2
o: [1 Kedy][1 ked][461 kedy]
n: [1 Kedy][297 Keď][1 ked][461 kedy][1084 keď]

podl >> 3
o: [4 Podla][1 Podle][9 podla][3 podle][1 podlete]
n: [1 PODĽA][4 Podla][1 Podle][531 Podľa][9 podla][3 podle][1 podlete] [1254 podľa]

su >> 5
o: [6 SU][141 Su][2 Sü][9 su][3 sü]
n: [6 SU][141 Su][146 Sú][2 Sü][9 su][3855 sú][3 sü][19 ŠÚ][1 šu][3 šú]

wikipedi >> 5
o: [2 Wikipedia][1 Wikipedie][2 wikipedia]
n: [2 Wikipedia][1 Wikipedie][6 Wikipédia][10 Wikipédie][1477 Wikipédii] [1 Wikipédiou][2 wikipedia][1 wikipédie]

ze >> 2
o: [4 Ze][14 ze]
n: [4 Ze][14 ze][4 Že][3268 že]

Speaker Review: Folding Groups that Lost and Gained (Mixed) Members
The question for speakers of Slovak reviewing these sections (Random Sample and High-Frequency Words) is this: would it be bad if searching for the new groups of words found each other, instead of the old groups? (That's a bit clunky, but after looking separately at groups that lost and gained members, the idea should be clear enough.)

Random Sample
I don't see a lot of differences here, other than we happen to have both gains and losses applying at once. However, I'm including them in case there is something non-obvious.

Below is a sample of 10 randomly selected stemming groups (words that would all be indexed together) that both lost and gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)

Key:


 * otcov >< 3
 * otcov indicates that all of these words were stemmed to otcov. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
 * >< 3 indicates that from "old" to "new", this stemming group lost some members and gained some members, and the total lost or gained is 3.
 * o: — the "old" group, in this case, the current behavior
 * n: — the "new" group, in this case, with Slovak letters folded before stemming
 * [9 otcov] — otcov occurs 9 times in our corpus (of 10K articles)

Lost and gained members are bolded.

otcov >< 3
o: [6 Otcov][3 Otcovy][9 otcov][13 otcovej][1 otcových][1 otcovým]
n: [6 Otcov][3 Otcovy][1 Otcové][9 otcov][13 otcovej]

braln >< 2
o: [2 Bralná][1 bralných]
n: [2 Bralná][2 bralnatý]

odtrhnut >< 2
o: [1 odtrhnutí][1 odtrhnutých]
n: [1 odtrhnutí][1 odtrhnúť]

pohyb >< 4
o: [5 Pohyb][2 Pohybová][3 Pohyby][106 pohyb][24 pohybe][4 pohybmi] [20 pohybom][11 pohybov][4 pohybovej][1 pohybovo][3 pohybová] [4 pohybové][2 pohybového][4 pohybovú][1 pohybový][2 pohybových] [19 pohyby]
n: [5 Pohyb][2 Pohybová][3 Pohyby][106 pohyb][24 pohybe][4 pohybmi] [20 pohybom][11 pohybov][17 pohybovať][4 pohybovej][1 pohybovo] [3 pohybová][4 pohybové][1 pohybový][19 pohyby]

stop >< 4
o: [12 Stop][1 Stopové][3 Stopy][5 stop][9 stopa][2 stopami][7 stope] [2 stopom][2 stopou][2 stopový][3 stopových][44 stopy][10 stopách] [1 stopám]
n: [12 Stop][1 Stopové][3 Stopy][5 stop][9 stopa][2 stopami][7 stope] [2 stopom][2 stopou][2 stopový][44 stopy][26 stôp]

tatr >< 3
o: [41 Tatra][2 Tatrami][2 Tatre][2 Tatro][2 Tatrou][60 Tatry] [47 Tatrách]
n: [41 Tatra][2 Tatrami][2 Tatre][2 Tatro][2 Tatrou][60 Tatry] [1 Tatrín][1 Tátra]

lisk >< 11
o: [8 Liskovej][5 Lisková][1 Lištinou]
n: [8 Liskovej][5 Lisková][1 Liška][1 Lišková][2 Líška][1 Líšková] [2 Líščí][1 liška][1 liščí][4 líška][2 líšky][2 líščí]

konkol >< 2
o: [2 Konkol][1 Konkolových][5 Konkoly]
n: [2 Konkol][5 Konkoly][2 Konkoľ]

stol >< 12
o: [1 STOL][3 Stolová][2 Stolového][3 Stolový][1 Stoly][1 stol][5 stola] [8 stole][2 stoloch][2 stolom][4 stolová][1 stolové][1 stolových] [1 stoly][1 stolé]
n: [1 STOL][3 Stolová][3 Stolový][1 Stoly][1 Stół][1 Stôl][1 stol][5 stola] [8 stole][13 století][2 stoloch][2 stolom][4 stolová][1 stolové] [1 stoly][1 stolé][1 stół][18 stôl][1 Štola][1 Štóla][1 štola][5 štóla] [1 štôl]

reform >< 3
o: [2 Reform][2 Reforma][12 reforma][3 reformami][8 reforme][4 reformou] [29 reformy][5 reformách][3 reformám]
n: [2 Reform][2 Reforma][12 reforma][3 reformami][8 reforme][4 reformou] [3 reformovať][29 reformy]

High-Impact Groups
There were 68 groups with 10 or more changes, so I raised the threshold to 18 or more changes, which gave only 7 groups, shown below. Changes are bolded.

elektrick >< 18
 o: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický]
    [45 elektrickej][3 elektrickom][3 elektrickou][9 elektricky]
    [31 elektrická][51 elektrické][37 elektrického][5 elektrickému]
    [17 elektrickú][36 elektrický][19 elektrických][11 elektrickým]
    [1 elektrickými]
 n: [1 Elektrickej][10 Elektrická][3 Elektrické][19 Elektrický][3 Električka]
    [3 Električková][1 Električky][45 elektrickej][3 elektrickom]
    [3 elektrickou][9 elektricky][31 elektrická][51 elektrické]
    [36 elektrický][8 električka][2 električkami][2 električkou]
    [6 električkovej][12 električková][3 električkové][5 električkový]
    [21 električky]

horn >< 19
 o: [6 Horn][17 Hornej][3 Hornina][10 Horniny][13 Hornom][130 Horná]
    [48 Horné][21 Horného][23 Horní][1 Horních][1 Horního][2 Hornú]
    [22 Horný][6 Horných][3 Horným][57 hornej][19 hornina][18 horninami]
    [6 hornine][3 horninou][4 horninové][91 horniny][26 horninách]
    [2 horninám][48 hornom][4 hornou][15 horná][7 horné][15 horného]
    [1 hornému][6 hornú][5 horný][17 horných][7 horným][1 hornými]
 n: [6 Horn][2 Hornatý][17 Hornej][3 Hornina][10 Horniny][13 Hornom]
    [130 Horná][48 Horné][23 Horní][1 Horních][1 Horního][22 Horný]
    [1 Hôrny][3 hornatá][2 hornatý][57 hornej][19 hornina][18 horninami]
    [6 hornine][3 horninou][91 horniny][48 hornom][4 hornou][15 horná]
    [7 horné][97 hornín][5 horný][7 hôrny]

mlad >< 20
 o: [3 Mladej][30 Mladá][19 Mladé][1 Mladého][2 Mladí][2 Mladú][12 Mladý]
    [1 Mladých][1 Mladým][2 mlada][1 mlade][24 mladej][1 mlado][7 mladom]
    [3 mladou][14 mladá][26 mladé][20 mladého][5 mladému][16 mladí]
    [7 mladú][45 mladý][63 mladých][6 mladým][6 mladými][9 mladším]
    [11 najmladším]
 n: [3 Mladej][30 Mladá][19 Mladé][2 Mladí][12 Mladý][1 Mláďa][19 Mláďatá]
    [2 mlada][1 mlade][24 mladej][1 mlado][7 mladom][3 mladou][14 mladá]
    [26 mladé][16 mladí][45 mladý][3 mládí][8 mláďa][3 mláďat][23 mláďatá]
    [1 mláďať][1 mláďaťa]

pas >< 19
 o: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 pas][1 pasy][1 pasú]
 n: [1 PASO][119 Pas][1 Paso][1 Passes][5 Paša][2 Pás][1 Páse][2 Pásy]
    [5 pas][1 pasy][1 pasátoch][1 paša][1 paše][1 paši][41 pás][1 pása]
    [1 pásami][18 páse][8 pásmi][3 pásoch][11 pásom][5 pásy][1 páší]

platn >< 18
 o: [1 Platná][1 Platné][1 Platným][28 platne][2 platnej][4 platni]
    [5 platnom][1 platnou][5 platná][11 platné][2 platného][5 platní]
    [1 platnú][2 platný][11 platných][3 platným][3 platnými]
 n: [1 Platná][1 Platné][3 Platňa][1 Platňová][28 platne][2 platnej]
    [4 platni][5 platnom][1 platnou][5 platná][11 platné][5 platní]
    [2 platný][25 platňa][1 platňami][10 platňou][1 platňovej][1 platňové]
    [7 plátna][6 plátne][6 plátno][3 plátnom][1 plátnová]

svat >< 20
 o: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][2 svaté][2 svatého]
    [1 svatý][1 svatých]
 n: [1 Svatom][1 Svatoš][5 Svatá][4 Svaté][12 Svatý][1 Sváti][39 Svätej]
    [1 Sväto][10 Svätom][1 Svätou][43 Svätá][6 Sväté][3 Svätí][94 Svätý]
    [2 svaté][1 svatý][51 svätej][2 svätom][5 svätou][8 svätá][4 sväté]
    [2 svätí][21 svätý]

velk >< 21
 o: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým]
    [2 velkou][2 velké][1 velkého]
 n: [1 Velkom][1 Velkou][13 Velká][4 Velké][15 Velký][1 Veľk][1 Veľka]
    [151 Veľkej][18 Veľkom][16 Veľkou][163 Veľká][83 Veľké][1 Veľkí]
    [180 Veľký][2 velkou][2 velké][136 veľkej][1 veľko][38 veľkom]
    [1 veľkos][57 veľkou][96 veľká][306 veľké][2 veľkí][223 veľký]
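
For readers who want to check a group by hand, here is a minimal Python sketch that parses the bracketed token lists and works out which tokens were lost and gained. The helper name `parse_group` is hypothetical and is not part of the analysis tooling; the data is the pas group from above. It also assumes, based only on the numbers in these examples, that the figure after `><` is the number of distinct tokens lost plus the number gained.

```python
import re

# Matches entries like "[119 Pas]": a count, a space, then the token.
TOKEN_RE = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(text):
    """Map each token to its count, given a string like '[119 Pas][5 pas]'.

    Hypothetical helper for illustration; not part of the original tooling.
    """
    return {token: int(count) for count, token in TOKEN_RE.findall(text)}

# The "pas" group, copied from the o: (old) and n: (new) lists above.
old = parse_group(
    "[1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 pas][1 pasy][1 pasú]"
)
new = parse_group(
    "[1 PASO][119 Pas][1 Paso][1 Passes][5 Paša][2 Pás][1 Páse][2 Pásy]"
    "[5 pas][1 pasy][1 pasátoch][1 paša][1 paše][1 paši][41 pás][1 pása]"
    "[1 pásami][18 páse][8 pásmi][3 pásoch][11 pásom][5 pásy][1 páší]"
)

lost = sorted(set(old) - set(new))
gained = sorted(set(new) - set(old))

print("lost:", lost)
print("gained:", gained)
print("total changes:", len(lost) + len(gained))
```

For the pas group this reports 2 lost tokens and 17 gained, which adds up to the 19 shown in the group header.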

High-Frequency Words
Again, I looked for groups with mixed losses and gains that involve high-frequency words. The five examples are below.

byt >< 7
 o: [1 Bytom][1 Byty][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][5 bytové]
    [12 byty]
 n: [1 Bytom][1 Byty][4 Byť][1 Být][15 byt][9 byte][7 bytmi][2 bytoch]
    [4 bytom][12 byty][1931 byť][4 být][1 býti][2 býť]

ktor >< 7
 o: [2 Ktorá][1 Ktoré][1 Ktorý][746 ktorej][524 ktorom][117 ktorou]
    [3655 ktorá][3353 ktoré][580 ktorého][80 ktorému][672 ktorí]
    [1063 ktorú][3250 ktorý][692 ktorých][255 ktorým][104 ktorými]
 n: [2 Ktorá][1 Ktoré][1 Ktorý][746 ktorej][524 ktorom][117 ktorou]
    [3655 ktorá][3353 ktoré][672 ktorí][1 ktoróm][3250 ktorý]

ma >< 8
 o: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][1 mA][61 ma][1 makes]
    [1 malém][1 mares]
 n: [5 MA][16 Ma][4 Maheš][1 Mahéš][2 Makes][2 Mamés][1 Manes][1 Mareš]
    [1 Mates][476 Má][2 Mánes][1 mA][61 ma][1 makes][1 mares][3428 má]

neskor >< 8
 o: [1 Neskoro][1 Neskorším][13 neskorej][13 neskoro][3 neskorom]
    [1 neskorou][1 neskory][2 neskoré][10 neskorého][7 neskorý]
    [6 neskorých][9 neskorším]
 n: [1 Najneskôr][1 Neskoro][270 Neskôr][9 najneskôr][13 neskorej]
    [13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré][7 neskorý]
    [1075 neskôr]

ponuk >< 5
 o: [1 Ponuka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách]
 n: [1 Ponuka][8 Ponúka][9 ponuka][9 ponuke][3 ponukou][20 ponuky]
    [3 ponúk][2285 ponúka][5 ponúkať]

Wiktionary Notes
The Wiktionary corpus is generally similar in terms of words lost and gained from stemming groups. The most obvious difference other than the smaller size of the corpus is the presence of pronunciations in IPA—e.g., slɔvniː stems with slovné because it gets folded to slovni before stemming. These generally aren't changed by the folding changes.

Interlude: Some Stemmer Struggles
Ugh. While looking into Option 2—Stem Before Folding—I ran into some unexpected changes.

I noticed that francúzskeho and Francúzského got split up. That makes sense, since the -ého suffix is stripped, but not -eho. However, the numbers were backwards from what I expected: 62 francúzskeho, but only 1 Francúzského, making it look like Francúzského was the typo. A little research later, I discovered that some adjectives take the -eho suffix instead of the -ého suffix, and the stemmer doesn't strip it.

I pulled some Slovak declension and conjugation tables from English Wiktionary and discovered that a lot of Slovak suffixes are not handled by the stemmer, including some unaccented varieties. There are a lot of potential reasons for this, like some suffixes being too ambiguous. For example, in English -ing can be a verbal suffix (hoping, talking, thinking) or just the way a word ends (ceiling, sibling, lightning), which makes stripping -ing harder than it could be. Another likely source of the problem is that -ého could be more common than -eho—though a very rough search on Slovak Wikipedia gives a similar number of instances.

We didn't detect this when looking at the stemmer because the process doesn't really focus on false negatives. As long as everything grouped together is supposed to be together (true positives), it's "right". Plus, you can't always infer that a missing form is a stemmer deficiency (e.g., if you have hope, hoped, and hoping together, but not hopes, is that because hopes isn't processed properly, or because it isn't in your corpus?).

In the future when looking at stemmers, I'll try to pull some relevant data from Wiktionary inflection tables and look at least a little bit for false negatives, too.

I've gathered a few (probably unrepresentative) examples of Slovak adjectives, nouns, and verbs with inflection tables on English Wiktionary, and run all the inflections through the stemmer. The stems are collected on a sub-page for future reference. The first few are perfect—every form has the same stem—but some of the later ones are all over the place.
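
A check of this kind could be sketched roughly as follows in Python. Everything here is a stand-in: `toy_stem` is a hypothetical one-rule stemmer (it strips -ého but not -eho, mirroring the behavior described above) and is not the real Slovak stemmer, and the two forms are the ones from the francúzskeho example rather than a full inflection table. The idea is just to run every form of one word through the stemmer and flag the word if more than one stem comes out.

```python
from collections import defaultdict

def toy_stem(token):
    """Hypothetical one-rule stemmer: strips -ého but not -eho.

    A stand-in for the real stemmer, for illustration only.
    """
    token = token.lower()
    if token.endswith("ého"):
        return token[: -len("ého")]
    return token

def stems_for_paradigm(forms, stem):
    """Group the inflected forms of one word by the stem they receive."""
    groups = defaultdict(list)
    for form in forms:
        groups[stem(form)].append(form)
    return dict(groups)

# Two forms from the example above; a real check would use a whole table.
paradigm = ["francúzskeho", "Francúzského"]
groups = stems_for_paradigm(paradigm, toy_stem)

if len(groups) > 1:
    # More than one stem for one word suggests a possible false negative.
    print("possible false negative:", groups)
else:
    print("all forms share one stem:", groups)
```

With the toy stemmer the two forms get different stems, so the paradigm is flagged. A table where every form shares a stem would pass quietly.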

For now, I'll open a Phab ticket (T227924) and hold off on trying to improve the stemmer until a future project.

Option 2: Stem Before Folding
THIS IS A PLACEHOLDER FOR FUTURE WORK.

The most obvious solution to the problem of the unexpectedly large number of lost tokens is to stem first, then fold. I don't have time to do the full analysis before being out of the office for a few weeks—I'll be back in July—so I've put this up for speaker review, in case there are any other concerns (particularly with the gained tokens).

Some likely results of stemming first will include:


 * We won't lose tokens with diacritical suffixes (and forms involving čt and št will be treated correctly), which seems desirable.
 * Many of the merged groups will still merge, because their stems will merge when folded after stemming.
 * e.g., Amalia will stem to amali, while Amália will stem to amáli and then be folded to amali, so Amalia and Amália will still be grouped together (for better or worse).
 * We won't get false positives on suffix removal, so -áta won't be treated as a suffix.

There will likely be some more complex interplay between the stemming and the folding, but the big changes should be positive.
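
To make the ordering question concrete, here is a small Python sketch comparing fold-then-stem with stem-then-fold. It rests on loud assumptions: `fold` strips combining marks via Unicode NFD decomposition as a rough stand-in for ICU folding (which does much more), and `toy_stem` uses a made-up two-suffix list (-ého and -a) rather than the real stemmer's rules. Only the Amalia/Amália and francúzského examples come from the notes above.

```python
import unicodedata

# Hypothetical suffix list, for illustration only; the real stemmer has
# many more rules and conditions.
TOY_SUFFIXES = ("ého", "a")

def fold(token):
    """Remove diacritics by dropping combining marks after NFD decomposition.

    A rough stand-in for ICU folding, not the real thing.
    """
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_stem(token):
    """Lowercase the token and strip the first matching toy suffix."""
    token = token.lower()
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def fold_then_stem(token):
    return toy_stem(fold(token))

def stem_then_fold(token):
    return fold(toy_stem(token))

for word in ("Amalia", "Amália", "francúzského"):
    print(word, "|", fold_then_stem(word), "|", stem_then_fold(word))
```

In this sketch, Amalia and Amália end up as amali under either ordering, so they still merge. Folding first turns francúzského into francuzskeho, which the toy stemmer no longer recognizes, while stemming first strips -ého and gives francuzsk.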

Option 3: Modify the Stemmer
THIS IS A PLACEHOLDER FOR POSSIBLE FUTURE WORK.

And, getting waaaaay ahead of myself, another option to consider if stemming before folding doesn't work is to modify the stemmer to include an option to work on words without diacritics. This could be a fair amount of work to minimize the number of inaccurate stems.