User:TJones (WMF)/Notes/Folding Diacritics in Slovak

From MediaWiki.org

June/July/October 2019 — See TJones_(WMF)/Notes for other projects. See also T223787 and T235561. For help with the technical jargon used in Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.

Background

In March 2018 I did an analysis of potential Slovak Stemmers and the use of the best stemmer in an analysis chain.

I followed my usual process for new analysis chains, which I developed after my experience with doing it exactly wrong for Swedish (see T155822). I enabled ICU folding (which is fairly aggressive normalization of Unicode characters, including diacritic removal), with exceptions for letters in the alphabet of the wiki's language (the Slovak alphabet). In this case, the exceptions are Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž.
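
To make "folding with exceptions" concrete, here is a minimal Python sketch. It is only an approximation: real ICU folding does much more than strip combining marks (for example, it also handles letters like æ), and the function name `fold_diacritics` and its `keep` parameter are made up for this illustration rather than taken from the search code.

```python
import unicodedata

# The Slovak letters listed above, normalized so each one is a single code point.
SLOVAK_LETTERS = frozenset(
    unicodedata.normalize("NFC", "ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž")
)

def fold_diacritics(text, keep=frozenset()):
    """Strip combining marks from every character that is not in `keep`.

    Rough stand-in for ICU folding: decompose each character (NFD) and
    drop the combining marks, leaving the base letter.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # exempt letter: leave it untouched
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

# With the exceptions, Slovak letters survive but other diacritics are folded.
print(fold_diacritics("Čaputová", keep=SLOVAK_LETTERS))
print(fold_diacritics("Niccolò", keep=SLOVAK_LETTERS))
# Without the exceptions, everything is folded.
print(fold_diacritics("Čaputová"))
```

Removing the exceptions corresponds to calling the function with an empty `keep` set.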

In a clever bit of foreshadowing, I looked briefly at the question of whether to enable folding before or after stemming. At the time it didn't seem to matter much because the differences were very slight if you exclude the Slovak letters from folding. I also cleverly reminded my future self that we have a "preserve" option, which allows us to index both the folded and unfolded version of a token.

At the 2019 Hackathon in Prague, Jetam2 and I talked about Slovak search, and he told me why it sucks... He expressed a concern that people don't always have access to a Slovak keyboard, so I said I'd look into the impact of removing the exceptions from ICU folding (and here we are). I looked into the Universal Language Selector and there is already a Slovak keyboard mapping for touch typists, and it would be possible to create a keyboard that could convert more widely available characters into diacritical characters. (For example, á as a~/, ô as o~^, č as c~v, ä as a~:.)

However, in the discussion on Phabricator (T155822) and on the Slovak Wikipedia Teahouse, Teslaton pointed out that Slovak search usually ignores diacritics and it usually doesn't cause any problems.

I was still worried (in the abstract) that Wikipedia and Wiktionary in particular have lots of text from other languages (or in IPA) which could cause weird results (though this should be mitigated in many cases by matching in the plain field). There's also the possible interaction of folding and stemming, which might be mitigated by changes to the stemmer, since we maintain the code for it.

Data

The usual process for creating a sample of documents (for testing language analysis modifications) is to retrieve 10,000 Wikipedia articles and 10,000 Wiktionary entries for the language in question. Sometimes we get fewer than 10,000 if there aren’t that many articles available in a particular project. Wikipedia articles usually provide a good example of typical formal written text in the language, and Wiktionary usually provides a larger number of distinct forms of words, and some additional variety of foreign scripts and languages. Foreign scripts and languages are not always processed well by language-specific text processing.

I sanitize the documents by removing markup (mostly HTML tags) and leading white space, and deduplicating individual lines. Deduplication reduces the number of instances of wiki-specific words, such as the local equivalent of "References", "See also", "Noun", "Etymology", etc.
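
The sanitizing steps described above can be sketched roughly as follows. This is not the actual script: the tag-matching regex is deliberately crude, and dropping empty lines is an extra assumption of this example.

```python
import re

# Crude tag matcher: anything between angle brackets (an assumption for this sketch).
TAG_RE = re.compile(r"<[^>]+>")

def sanitize(lines):
    """Strip markup and leading whitespace, then drop repeated lines."""
    seen = set()
    cleaned = []
    for line in lines:
        text = TAG_RE.sub("", line).lstrip()
        # Keep only the first occurrence of each non-empty line.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

sample = ["<h2>References</h2>", "  References", "Slovak <i>text</i> here"]
print(sanitize(sample))
```

Deduplicating whole lines is what collapses the many copies of headings like "References" into one.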

For this analysis, I also pulled a random collection of 50,000 user queries from Slovak Wikipedia over a couple of months and 9,266 (~9k) user queries from Slovak Wiktionary (which is everything that was available at the time).

Analyzing the user queries will be a new kind of analysis, since I usually use the Wikipedia article text as a reference for the way people write in a language. Some of the info from the user queries will probably be less detailed compared to the usual analysis.

Query Data: Inspection

I started out by looking at the most common queries and most common words in queries on Slovak Wikipedia. Two of the top results were Zuzana Čaputová and Maroš Šefčovič, the two candidates in the recent Slovak presidential election. There were many variants of their names. Ignoring extra spaces, single-word queries include:

 48 čaputová         29 šefčovič
 33 caputova         29 Sefcovic
 31 Caputova         27 Šefčovič
 26 Čaputová         24 sefcovic
 17 Čaputova          3 Šefčovic
 11 čaputova          1 šefcovic
  1 CaputovA          1 ŠEFČOVIČ
                      1 Sefčovič
                      1 Sefcovič

If we ignore case, the lists look like this:

 74 čaputová         57 šefčovič
 65 caputova         53 sefcovic
 28 čaputova          3 šefčovic
                      1 šefcovic
                      1 sefčovič
                      1 sefcovič

Clearly, at least for these two presumably very well-known names, searching without diacritics is common.
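
The case-insensitive tallies above can be reproduced from the raw counts with a few lines of Python. The counts are the ones listed for Čaputová; the helper `strip_marks` is a simple stand-in for diacritic folding, not the production analysis chain.

```python
from collections import Counter
import unicodedata

# Raw single-word query counts from the first table.
caputova = {
    "čaputová": 48, "caputova": 33, "Caputova": 31, "Čaputová": 26,
    "Čaputova": 17, "čaputova": 11, "CaputovA": 1,
}

def strip_marks(text):
    """Decompose and drop combining marks (rough diacritic folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

by_case = Counter()   # variants after ignoring case only
folded = Counter()    # variants after ignoring case and diacritics
for query, count in caputova.items():
    lowered = unicodedata.normalize("NFC", query).lower()
    by_case[lowered] += count
    folded[strip_marks(lowered)] += count

print(by_case.most_common())
print(folded.most_common())
```

Ignoring case leaves three variants (74, 65, and 28 queries); folding diacritics as well collapses them into a single form covering all 167 queries.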

I searched for other common words with diacritics and then searched for variants without diacritics. There are many cases that seem relatively unambiguous—such as čím, článok, Kočner, Košice, planéta, škola, štáty, voľby, Žilina, and živí. In these cases, the diacriticless version is also common, often equally common, as above. (It seems that the length-marking diacritic ´ is more likely to be dropped—especially for the vowels áéíóúý, but also the consonants ĺŕ. But overall it seems to happen frequently with any diacritic.)

So, clearly Slovak searchers are expecting diacriticless searches to get results, contrary to the expectations of the Swedish searchers.

I have a future concern for Slovak Wiktionary. Right now it only has about 26K entries so there aren't as many other languages represented. However, on English Wiktionary, there are often diacriticless versions of Slovak words in other Slavic languages. (Google Translate also often suggests Czech—and sometimes Slovenian and Swedish—for the diacriticless versions of Slovak words.)

On the other hand, (a) English Wiktionary folds all diacritics, and it usually works okay, (b) if Slovak-speaking searchers are used to diacriticless search, they at least won't be surprised, and (c) quotes are always available, and they aren't as restrictive on Wiktionary as they are on Wikipedia (because (i) you are more likely to be looking for an exact form of a word, and (ii) all forms of a word are much more likely to be on the page for the base form).

Option 1: Enabling Folding

The first thing I tried was disabling the ICU folding exception for the Slovak diacritical letters (Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž).

Interestingly, this led to an increase in the number of post-analysis types in the Wikipedia sample (i.e., the number of distinct words coming out of the analysis chain), from 131,091 to 137,538.

There were a lot of new collisions—words that would be indexed the same: 12,207 pre-analysis types (5.484% of pre-analysis types) / 168,977 tokens (10.305% of tokens) were added to 4,863 groups (3.710% of post-analysis types), affecting a total of 26,744 pre-analysis types (12.014% of pre-analysis types) in those groups.

Collisions are what we expect: words getting folded together. The impact is pretty high, though: about 5% of distinct words and 10% of all words got folded together with something new.

There were also many new splits: 9,220 pre-analysis types (4.142% of pre-analysis types) / 41,111 tokens (2.507% of tokens) were lost from 4,475 groups (3.414% of post-analysis types), affecting a total of 29,339 pre-analysis types (13.180% of pre-analysis types) in those groups.

That's 4% of distinct words and 2.5% of all words that would no longer be indexed together.

The main cause of the splits seems to be interference with the stemmer.
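
For readers unfamiliar with the terms, here is a simplified sketch of how collisions and splits can be detected by comparing two analysis chains. It is not the tool used to produce the numbers above, which counts things somewhat differently. The word pairs are borrowed from the review samples later on this page, but the stems assigned to them here are assumptions made up for the example.

```python
from collections import defaultdict

def group_mates(stem_of):
    """Map each word to the set of other words that share its stem."""
    groups = defaultdict(set)
    for word, stem in stem_of.items():
        groups[stem].add(word)
    return {word: groups[stem] - {word} for word, stem in stem_of.items()}

def compare_chains(old_stem_of, new_stem_of):
    """Return (collisions, splits) between two word -> stem mappings.

    A word is a collision if it gained at least one group-mate,
    and a split if it lost at least one. Both mappings must cover
    the same words.
    """
    old_mates = group_mates(old_stem_of)
    new_mates = group_mates(new_stem_of)
    collisions = {w for w in old_mates if new_mates[w] - old_mates[w]}
    splits = {w for w in old_mates if old_mates[w] - new_mates[w]}
    return collisions, splits

# Hypothetical stems: the old chain keeps é, the new chain folds it.
old = {"Cedric": "cedric", "Cédric": "cédric", "Nasza": "nasz", "naszą": "nasz"}
new = {"Cedric": "cedric", "Cédric": "cedric", "Nasza": "nasz", "naszą": "nasza"}

collisions, splits = compare_chains(old, new)
print(sorted(collisions))
print(sorted(splits))
```

In this toy data Cedric and Cédric collide, while Nasza and naszą are split.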

The Wiktionary sample had a roughly similar number of collisions: 5% of distinct words and 4.7% of all words. Wiktionary had very differently balanced splits: <1% of distinct words, but still 5% of all words. The difference seems to come down to a much smaller sample size—the Wikipedia sample has approximately 25x as many tokens in it—and many more distinct words in the Wiktionary sample.

[Note: I've made the fold-first examples collapsible since we didn't get any speaker review, and stemming first is probably the right way to go.]

Interlude: Some Stemmer Struggles

Ugh. While looking into Option 2—Stem Before Folding—I ran into some unexpected changes.

I noticed that francúzskeho and Francúzského got split up. That makes sense, since the -ého suffix is stripped, but not -eho. However, the numbers were backwards from what I expected: 62 francúzskeho, but only 1 Francúzského, making it look like Francúzského was the typo. A little research later, I discovered that some adjectives take the -eho suffix instead of the -ého suffix, and the stemmer doesn't strip it.

I pulled some Slovak declension and conjugation tables from English Wiktionary and discovered that a lot of Slovak suffixes are not handled by the stemmer, including some unaccented varieties. There are a lot of potential reasons for this, like some suffixes being too ambiguous. For example, in English -ing can be a verbal suffix (hoping, talking, thinking) or just the way a word ends (ceiling, sibling, lightning), which makes stripping -ing harder than it could be. Another likely source of the problem is that -ého could be more common than -eho—though a very rough search on Slovak Wikipedia gives a similar number of instances.

We didn't detect this when looking at the stemmer because the process doesn't really focus on false negatives. As long as everything grouped together is supposed to be together (true positives), it's "right". Plus, you can't always infer that a missing form is a stemmer deficiency. For example, if you have hope, hoped, and hoping together, but not hopes, is that because hopes isn't processed properly, or because it isn't in your sample?

In the future when looking at stemmers, I'll try to pull some relevant data from Wiktionary inflection tables and spend some time looking for false negatives, too.

I've gathered a few (probably unrepresentative) examples of Slovak adjectives, nouns, and verbs with inflection tables on English Wiktionary, and run all the inflections through the stemmer. The stems are collected on a sub-page for future reference. The first few are perfect—every form has the same stem—but some of the later ones are all over the place.

For now, I'll open a Phab ticket (T227924) and leave improving the stemmer for a future project.

Option 2: Stem Before Folding

The most obvious solution to the problem of the unexpectedly large number of lost tokens is to first stem words with diacritics, then fold and remove the diacritics.

One potential problem with this approach is that suffixes that always include diacritics won't be removed by the stemmer if the diacritics are missing—leading to false negatives. Option 3: Modify the Stemmer, below, could address that, though it is possible that it could introduce new problems if it results in the stemmer being too aggressive, or if suffixes that differ only in diacritics should be treated differently.

Some positive aspects of stemming first should include:

  • We won't lose tokens with diacritical suffixes (and forms involving čt will be treated correctly), which seems desirable.
  • Many of the merged groups will still merge, because their stems will merge after stemming.
    • e.g., Amalia will stem to amali, while Amália will stem to amáli, and then be folded to amali, so Amalia and Amália will still be grouped together—for better or worse.
  • We won't get false positives on suffix removal, so -áta won't be treated as a suffix.
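
The effect of ordering can be illustrated with a toy example. The two-suffix "stemmer" below is entirely hypothetical (the real Slovak stemmer has many more rules), and `fold` is only a rough stand-in for ICU folding; the point is just to show why the order of the two steps matters.

```python
import unicodedata

# Toy suffix list, assumed for illustration only: -ého (with diacritic) and -a.
SUFFIXES = tuple(unicodedata.normalize("NFC", s) for s in ("ého", "a"))

def fold(token):
    """Drop combining marks (rough stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(token):
    """Lowercase, then strip the first matching toy suffix, if any."""
    token = unicodedata.normalize("NFC", token).lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_then_fold(token):
    return fold(toy_stem(token))

def fold_then_stem(token):
    return toy_stem(fold(token))

for word in ("Amalia", "Amália", "Francúzského"):
    print(word, stem_then_fold(word), fold_then_stem(word))
```

In this sketch Amalia and Amália end up together either way, but Francúzského only loses its suffix when stemming happens first, because folding first turns -ého into -eho, which the toy stemmer does not recognize.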

The first good sign from stemming before folding is that the total number of distinct post-analysis types (unique words in the sample) decreased, from 131,091 to 125,638—as opposed to increasing when we folded before stemming.

As before, there were a lot of new collisions—words that would be indexed the same: 11,594 pre-analysis types (5.208% of pre-analysis types) / 161,318 tokens (9.838% of tokens) were added to 4,167 groups (3.179% of post-analysis types), affecting a total of 24,699 pre-analysis types (11.096% of pre-analysis types) in those groups.

That's not quite as many new collisions as when folding first, but the impact is very similar: 5% of distinct words and almost 10% of all words got folded together with something new.

There were very few splits: 148 pre-analysis types (0.066% of pre-analysis types) / 294 tokens (0.018% of tokens) were lost from 137 groups (0.105% of post-analysis types), affecting a total of 776 pre-analysis types (0.349% of pre-analysis types) in those groups.

That's much less than before: less than 0.1% of distinct words and less than 0.02% of all words will no longer be indexed together.

So if the new collisions are good, then this arrangement is probably doing much of the same good work as folding first, without the bad side effects.

The Wiktionary sample had roughly similar collision stats: about 3.8% of both distinct words and total words got folded with other words. There were more splits than in the fold-first test, with 1.8% of distinct words and about 0.9% of all words no longer being indexed with something they were indexed with before.

Speaker Review: Overview

The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)

In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common.

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

When we make less extreme modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes.

Old-vs-new groups are presented as follows:

hope >> 2
  o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
  n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]

The first line shows the stem (hope), a pair of arrow heads (>>) indicating whether words were gained or lost by the group, and a number indicating how many gains and/or losses there were (2).

The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.

In terms of gains and losses:

  • >> indicates that words were gained by the group
  • << indicates that words were lost from the group
  • >< indicates that there were both losses and gains

The o: section (for “old”) shows all the words that shared a stem before the change. The n: section (for “new”) shows all the words that shared a stem after the change. Sharing a stem means that searching for any of the words will find all of the others. (Note that while searching for each word in a group will give the same results, the results could be in a very different order—in particular because exact matches are given more weight.)

The numbers with the word—e.g., [1208 hope] and [1 Hopē]—indicate how many times a given word appears in the text sample. In this case, hope is over a thousand times more common than Hopē. Rare words that are not great matches with the rest of a group are less of a problem because they don’t occur very often. When you search for them, exact matching will usually bring them to the top of the results list.

Problems can arise when more common words are grouped together incorrectly. For example, a grouping like [1208 hope][747 hop] would be worse, because these words don’t belong together, and both words are common.
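
If you want to process these listings mechanically, the format is easy to parse. The following is a small sketch, assuming each o: or n: entry has first been joined onto a single line; the function names are made up for this example.

```python
import re

# One bracketed entry: a count, a space, then the word.
ENTRY_RE = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(line):
    """Return {word: count} for one o: or n: line."""
    return {word: int(count) for count, word in ENTRY_RE.findall(line)}

def change_marker(gained, lost):
    """Pick the arrow pair used in the group header."""
    if gained and lost:
        return "><"
    if gained:
        return ">>"
    if lost:
        return "<<"
    return ""

old = parse_group("o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]")
new = parse_group(
    "n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]"
)

gained = sorted(set(new) - set(old))
lost = sorted(set(old) - set(new))
header = f"hope {change_marker(gained, lost)} {len(gained) + len(lost)}"
print(header)
print(gained)
```

For the hope example this recovers the header line hope >> 2, with Hopē and ĥợṕễ as the two gained words.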

Speaker Review: Folding Groups that Lost Members

The question for speakers of Slovak reviewing the Random Sample is this: would it be bad if searching for the "lost" words no longer found the remaining words, and vice versa?

Random Sample

Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.

Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that lost members as a result of folding Slovak diacritical characters after stemming. (These are from the Wikipedia sample.)

The lost terms almost all seem to have the same pattern: a diacritic on one of the last few letters in the word that blocks the stemmer from removing what otherwise looks like a Slovak suffix. (An exception is Gæa, which is too short to be stemmed, while the folded version, Gaea, is not.)

Some of the lost terms look to be incorrectly lost, to me, but possibly unavoidably so. Jarząbcza, Jarząbczej and Jarząbczy look to be inflected forms of the name Jarząbczą, though the final ą blocks the citation form of the name from being stemmed.

Key:

  • bechyn << 1
    • bechyn indicates that all of these words were stemmed to bechyn. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • << 1 indicates that from "old" to "new", this stemming group lost 1 member.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with Slovak letters folded after stemming
  • [19 Bechyně] — Bechyně occurs 19 times in our sample (of 10K articles)

Lost members are the words that appear in the o: list but not in the n: list.

bechyn << 1
  o: [4 Bechyni][19 Bechyně]
  n: [4 Bechyni]
desk << 1
  o: [1 Deskový][2 deska][7 dešti][1 deště]
  n: [1 Deskový][2 deska][7 dešti]
gae << 1
  o: [1 GAE][2 Gaea][2 Gæa]
  n: [1 GAE][2 Gaea]
gyongy << 1
  o: [1 Gyöngyi][2 Gyöngyös]
  n: [1 Gyöngyi]
issar << 1
  o: [1 Issari][1 Issarlès]
  n: [1 Issari]
jarzabcz << 1
  o: [3 Jarząbcza][2 Jarząbczej][2 Jarząbczy][2 Jarząbczą]
  n: [3 Jarząbcza][2 Jarząbczej][2 Jarząbczy]
jesk << 2
  o: [2 Jesko][1 Ještě][2 ještě]
  n: [2 Jesko]
kart << 1
  o: [1 Karta][2 Kartová][1 Kartové][17 karta][5 kartami][3 karte][2 karti]
     [1 kartiny][9 kartou][1 kartovej][4 kartová][5 kartové][2 kartových]
     [27 karty][2 kartách][1 kartą]
  n: [1 Karta][2 Kartová][1 Kartové][17 karta][5 kartami][3 karte][2 karti]
     [1 kartiny][9 kartou][1 kartovej][4 kartová][5 kartové][2 kartových]
     [27 karty][2 kartách]
kork << 1
  o: [6 Korçë][1 korkových]
  n: [1 korkových]
lau << 1
  o: [1 Lau][1 Laua][5 Lauzès]
  n: [1 Lau][1 Laua]
maneth << 3
  o: [2 Manetho][1 Manethos][1 Manethovi][1 manetʰō][1 maˈnetʰō]
     [1 maˈnetʰōs]
  n: [2 Manetho][1 Manethos][1 Manethovi]
melk << 1
  o: [1 Melk][1 Melka][1 mělčině]
  n: [1 Melk][1 Melka]
mu << 1
  o: [3 MU][8 Mu][3 Mureș][1 Musím][625 mu][3 musím]
  n: [3 MU][8 Mu][1 Musím][625 mu][3 musím]
nasz << 1
  o: [1 Nasza][1 naszą]
  n: [1 Nasza]
national << 2
  o: [2 NATIONAL][84 National][1 Nationala][6 Nationale][1 Națională][11 national]
     [4 nationale][1 națională]
  n: [2 NATIONAL][84 National][1 Nationala][6 Nationale][11 national][4 nationale]
nestl << 1
  o: [1 Nestle][1 Nestlé][1 ˈnɛstlə]
  n: [1 Nestle][1 Nestlé]
niccol << 1
  o: [2 Niccola][1 Niccolo][8 Niccolò]
  n: [2 Niccola][1 Niccolo]
nicol << 1
  o: [4 Nicola][8 Nicole][2 Nicolò]
  n: [4 Nicola][8 Nicole]
paran << 1
  o: [15 Paraná][1 Paranã]
  n: [15 Paraná]
sabra << 1
  o: [1 Sabrazes][1 Sabrazès]
  n: [1 Sabrazes]
vor << 2
  o: [1 VOR][1 Vorë][1 Vőrös]
  n: [1 VOR]
vrchov << 1
  o: [2 Vrchovinami][59 vrchovina][19 vrchovine][2 vrchovinou][38 vrchoviny]
     [1 vrchovině]
  n: [2 Vrchovinami][59 vrchovina][19 vrchovine][2 vrchovinou][38 vrchoviny]
vresovisk << 1
  o: [1 vresoviskového][2 vresoviská][2 vřesoviště]
  n: [1 vresoviskového][2 vresoviská]
want << 1
  o: [18 Want][1 Wantą]
  n: [18 Want]
zem << 2
  o: [2 ZEM][2 ZEMÍCH][72 Zem][157 Zeme][66 Zemi][13 Zemou][1 Země][76 zem]
     [47 zeme][43 zemi][8 zemou][7 zemí][1 zemích][6 země]
  n: [2 ZEM][2 ZEMÍCH][72 Zem][157 Zeme][66 Zemi][13 Zemou][76 zem][47 zeme]
     [43 zemi][8 zemou][7 zemí][1 zemích]

High-Impact Groups

There are no stemming groups that lost 10 or more members in either of the Wikipedia or Wiktionary samples.

High-Frequency Words

There are no high-frequency words (> 1000 occurrences) lost from any groups in either of the Wikipedia or Wiktionary samples.

Speaker Review: Folding Groups that Gained Members

The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?

Random Sample

Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.

Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)

Key:

  • alternativ >> 7
    • alternativ indicates that all of these words were stemmed to alternativ. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • >> 7 indicates that from "old" to "new", this stemming group gained 7 members.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with Slovak letters folded after stemming
  • [5 Alternative] — Alternative occurs 5 times in our sample (of 10K articles)

Note that which group is shown as "gaining" new members is always in favor of the stem with no accents.

A lot of the changes here are the kinds we'd expect to see, with accented versions of words (especially names) being merged.

Some notes:

  • For longer words that aren't names, it's very likely that the words are related. For example, it's hard to imagine that pohyblivosti and pohyblivosťou are not related, though whether searching for one should find the other is a different question (hence, speaker review).

Gained members are the words that appear in the n: list but not in the o: list.

alternativ >> 7
  o: [5 Alternative][2 alternative]
  n: [5 Alternative][1 Alternatívou][1 Alternatívy][2 alternative][1 alternatív]
     [14 alternatíva][1 alternatívami][4 alternatívou][2 alternatívy]
ange >> 1
  o: [76 Angeles]
  n: [76 Angeles][1 Ángeles]
cedric >> 1
  o: [1 Cedric]
  n: [1 Cedric][1 Cédric]
dubravk >> 4
  o: [1 Dubravko][1 dubravka]
  n: [1 Dubravko][7 Dúbravka][3 Dúbravke][1 Dúbravkou][1 Dúbravky][1 dubravka]
emili >> 3
  o: [8 Emilia][2 Emilio]
  n: [8 Emilia][2 Emilio][5 Emília][1 Emílie][2 Émilie]
gerard >> 1
  o: [14 Gerard][1 Gerarda][2 Gerardo][1 Gerardus][1 gerard]
  n: [14 Gerard][1 Gerarda][2 Gerardo][1 Gerardus][16 Gérard][1 gerard]
gramatik >> 1
  o: [3 Gramatika][1 Gramatiko][1 Gramatiky][7 gramatik][5 gramatika][5 gramatike]
     [3 gramatikom][3 gramatikou][5 gramatiky]
  n: [3 Gramatika][1 Gramatiko][1 Gramatiky][7 gramatik][5 gramatika][5 gramatike]
     [3 gramatikom][3 gramatikou][5 gramatiky][1 gramatík]
hermely >> 1
  o: [1 Hermelyová]
  n: [1 Hermelyová][1 Hermélyová]
hors >> 8
  o: [5 Horse][5 horse]
  n: [5 Horse][2 Horší][5 horse][1 horšej][1 horšom][2 horší][2 horších]
     [1 najhoršom][2 najhorší][3 najhorších]
hra >> 2
  o: [56 Hra][135 hra][7 hrami][4 hraním]
  n: [56 Hra][11 Hrá][135 hra][7 hrami][4 hraním][95 hrá]
kalabrijsk >> 5
  o: [1 kalabrijské]
  n: [1 Kalabríjska][1 Kalábrijský][1 Kalábrijských][1 kalabrijské]
     [1 kalábrijskom][2 kalábrijský]
karol >> 1
  o: [192 Karol][80 Karola][3 Karolina][15 Karolom][1 Karolova][9 Karolovej]
     [9 Karolovi][1 Karoly]
  n: [192 Karol][80 Karola][3 Karolina][15 Karolom][1 Karolova][9 Karolovej]
     [9 Karolovi][1 Karoly][4 Károly]
kuril >> 1
  o: [2 Kurilová][1 Kurily]
  n: [2 Kurilová][1 Kurily][2 Kuríl]
magic >> 1
  o: [34 Magic]
  n: [34 Magic][1 Mágico]
mocnost >> 2
  o: [2 mocnosti][6 mocností]
  n: [2 mocnosti][6 mocností][2 mocnosť][1 mocnosťami]
pohyblivost >> 2
  o: [2 pohyblivosti]
  n: [2 pohyblivosti][3 pohyblivosť][1 pohyblivosťou]
prohask >> 1
  o: [2 Prohaska]
  n: [2 Prohaska][1 Proháska]
romk >> 3
  o: [1 ROMKY][1 ROMky][2 Romka]
  n: [1 ROMKY][1 ROMky][2 Romka][1 Rómka][1 rómčina][1 rómčine]
sob >> 1
  o: [1 Sob][1 soba][1 sobe][1 soby]
  n: [1 Sob][1 soba][1 sobe][1 soby][1 ŠOBA]
spalovac >> 6
  o: [1 spalovacej]
  n: [1 Spalovač][1 spalovacej][21 spaľovacej][1 spaľovacom][1 spaľovacou]
     [4 spaľovací][3 spaľovacích]
studn >> 4
  o: [1 Studna][5 studne][1 studni]
  n: [1 Studna][1 Studňa][1 Studňou][5 studne][1 studni][7 studňa][2 studňou]
ukladani >> 1
  o: [5 Ukladanie][5 ukladania][10 ukladanie]
  n: [5 Ukladanie][5 ukladania][10 ukladanie][1 ukládanie]
util >> 1
  o: [4 Utila]
  n: [4 Utila][1 Útila]
vals >> 1
  o: [1 Vals]
  n: [1 Vals][1 Valšov]
volov >> 1
  o: [4 volov][1 volovými]
  n: [4 volov][1 volovými][1 vôľovej]

High-Impact Groups

High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.

Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).

The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.
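
The way merged groups get reported can be sketched as follows. This is an illustrative approximation, not the actual reporting code: it assumes the merged group is always listed under the folded (diacriticless) stem, and it uses a handful of words from the greck group with the stems grečk- and gréck-. The function names and the default threshold parameter are made up for the example.

```python
from collections import defaultdict
import unicodedata

def strip_marks(text):
    """Drop combining marks (rough stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def merged_gains(old_groups):
    """Merge groups whose stems fold together; count gains per folded stem.

    Gains are counted relative to the old group (if any) that already
    had the folded stem, which is why a small diacriticless group can
    appear to gain a large number of members.
    """
    merged = defaultdict(set)
    for stem, words in old_groups.items():
        merged[strip_marks(stem)] |= words
    return {
        stem: len(words - old_groups.get(stem, set()))
        for stem, words in merged.items()
    }

def high_impact(gains, threshold=10):
    """Keep only groups with at least `threshold` gained words."""
    return {stem: n for stem, n in gains.items() if n >= threshold}

old_groups = {
    "greck": {"greckej"},
    "grečk": {"Grečka", "Grečko"},
    "gréck": {"Grécka", "Grécke", "grécky"},
}
gains = merged_gains(old_groups)
print(gains)
print(high_impact(gains, threshold=5))
```

With this toy data the one-word greck group is reported as gaining five members, even though it is really the larger accented groups that absorbed it.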

There were 202 groups with 10 or more additions, so I raised the threshold to 15 or more additions, which gave 60 groups. I've removed the groups where a large group with diacritics merged with one or two distinct words (after ignoring upper- and lowercase) that have a stem without diacritics. (Though I kept groups like greck where multiple stems were merged, in this case grečk-, gréck-, and gréčt-.)

The remaining 30 groups are shown below. These represent large groups with diacritics merging with medium to large groups without diacritics. The converse—large groups without diacritics merging with smaller groups with diacritics—is not represented. I can go looking for examples if anyone thinks they would be significantly different from the ones here or above.

One thing I noticed is that Czech ř gets folded to r, which presumably ends up merging Czech/Slovak cognates, which is probably not a bad thing.

Gained members are the words that appear in the n: list but not in the o: list.

byval >> 22
  o: [1 ByVal][1 byvalá][1 byvalé][1 byvalý]
  n: [1 ByVal][1 Býval][4 Bývalá][2 Bývalé][4 Bývalí][9 Bývalý][1 byvalá]
     [1 byvalé][1 byvalý][27 býval][9 bývala][46 bývalej][17 bývali]
     [5 bývalo][20 bývalom][7 bývalou][60 bývalá][13 bývalé][53 bývalého]
     [5 bývalému][12 bývalí][4 bývalú][189 bývalý][63 bývalých]
     [15 bývalým][6 bývalými]
cast >> 29
  o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti]
  n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často]
     [1 Častou][7 Častá][5 Časté][5 Častý][3 Častým][2 Častými]
     [37 Časť][1 častej][1171 časti][1 častich][541 často][1 častom]
     [3 častou][7 častá][39 časté][2 častého][240 častí][6 častý]
     [4 častých][14 častým][9 častými][1 čas­to][1012 časť][18 časťami]
     [122 časťou][5 část][1 části]
ciel >> 19
  o: [3 Ciel][2 Ciele][5 ciel][1 ciela][63 ciele][2 cieli]
  n: [3 Ciel][2 Ciele][4 Cieľ][33 Cieľom][1 Cieľovou][5 ciel][1 ciela][63 ciele]
     [2 cieli][52 cieľ][16 cieľa][4 cieľmi][2 cieľoch][204 cieľom][34 cieľov]
     [4 cieľovej][1 cieľovou][2 cieľová][1 cieľové][4 cieľového]
     [1 cieľovú][4 cieľový][3 cieľových][1 cieľovým][1 cieľovými]
drah >> 15
  o: [1 Drahý][1 drahej][1 draho][5 drahá][7 drahé][2 drahú][6 drahý]
     [3 drahých][1 drahými][3 najdrahším]
  n: [1 Drahý][6 Dráha][1 Dráhovej][1 Dráhovou][5 Dráhy][1 drahej][1 draho]
     [5 drahá][7 drahé][2 drahú][6 drahý][3 drahých][1 drahými][15 dráh]
     [36 dráha][3 dráhami][50 dráhe][17 dráhou][6 dráhovej][2 dráhové]
     [1 dráhovú][1 dráhových][117 dráhy][1 dráze][3 najdrahším]
elektrick >> 15
  o: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický]
     [45 elektrickej][3 elektrickom][3 elektrickou][9 elektricky][31 elektrická]
     [51 elektrické][37 elektrického][5 elektrickému][17 elektrickú]
     [36 elektrický][19 elektrických][11 elektrickým][1 elektrickými]
  n: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický]
     [3 Električka][3 Električková][1 Električky][45 elektrickej][3 elektrickom]
     [3 elektrickou][9 elektricky][31 elektrická][51 elektrické][37 elektrického]
     [5 elektrickému][17 elektrickú][36 elektrický][19 elektrických]
     [11 elektrickým][1 elektrickými][8 električka][2 električkami]
     [2 električkou][6 električkovej][12 električková][3 električkové]
     [4 električkového][2 električkovú][5 električkový][2 električkových]
     [21 električky][1 električkách]
greck >> 23
  o: [1 greckej]
  n: [1 Grečka][2 Grečko][95 Grécka][13 Grécke][5 Gréckej][4 Grécki]
     [56 Grécko][10 Gréckom][1 Gréckou][8 Grécky][1 Gréčtiny][1 greckej]
     [25 grécka][51 grécke][137 gréckej][14 grécki][9 grécko][7 gréckom]
     [4 gréckou][170 grécky][9 gréčtina][9 gréčtine][4 gréčtinou]
     [13 gréčtiny]
katolick >> 15
  o: [1 Katolickom][2 Katolický][1 katolicko][1 katolické]
  n: [1 Katolickom][2 Katolický][19 Katolícka][5 Katolícke][12 Katolíckej]
     [2 Katolícki][3 Katolíckou][8 Katolícky][1 katolicko][1 katolické]
     [21 katolícka][14 katolícke][46 katolíckej][1 katolícki][2 katolíckom]
     [7 katolíckou][34 katolícky][1 katolíckého][1 katolíčky]
kral >> 34
  o: [1 Kral][1 Krali]
  n: [1 KRÁĽ][1 Kral][1 Krali][18 Král][1 Králi][4 Králova][5 Královo]
     [1 Královou][29 Králové][1 Králového][57 Kráľ][9 Kráľa][2 Kráľom]
     [4 Kráľov][7 Kráľova][3 Kráľovej][1 Kráľovi][1 Kráľovo][1 Kráľovou]
     [12 Kráľová][1 kraľ][2 král][1 krála][3 krále][12 králi][1 králom]
     [1 králov][288 kráľ][309 kráľa][76 kráľom][51 kráľov][3 kráľova]
     [1 kráľovej][42 kráľovi][1 kráľovo][1 kráľových]
kriz >> 33
  o: [1 Kriza][2 krizy]
  n: [1 Kriza][5 Kríza][7 Kríž][10 Kríža][1 Krížom][2 Krížovej]
     [2 Krížová][1 Krížové][5 Kříž][2 krizy][2 kríz][17 kríza][6 krízou]
     [1 krízovej][1 krízové][2 krízového][1 krízovú][1 krízový]
     [2 krízových][28 krízy][36 kríž][26 kríža][1 krížmi][1 krížoch]
     [11 krížom][2 krížov][8 krížovej][7 krížovou][3 krížová]
     [2 krížové][2 krížového][2 krížovú][2 krížový][1 krížovým]
     [2 krížovými]
lav >> 23
  o: [1 lava]
  n: [1 Láv][2 Láva][1 Lávy][1 lava][1 láv][2 láva][2 lávami][2 láve]
     [8 lávové][22 lávy][1 Ľavej][1 Ľavom][1 Ľavá][3 Ľavé][1 Ľavý]
     [44 ľavej][47 ľavom][1 ľavou][2 ľavá][4 ľavé][7 ľavého][7 ľavú]
     [8 ľavý][2 ľavým]
minut >> 15
  o: [1 Minute][1 Minutos][1 minute][3 minutus][1 minutých]
  n: [1 Minute][1 Minutos][1 minute][3 minutus][1 minutých][107 minút][6 minúta]
     [12 minúte][1 minútovej][2 minútovom][1 minútovou][2 minútová]
     [1 minútové][2 minútového][1 minútovú][3 minútový][1 minútových]
     [2 minútovými][24 minúty][1 minúť]
nas >> 16
  o: [77 NASA][1 NaS][1 Nas][1 Naso][4 nas][1 nasi][1 naso]
  n: [77 NASA][1 NaS][1 Nas][1 Naso][11 Naša][8 Naše][1 Našej][5 Naši][1 Našou]
     [9 Náš][4 nas][1 nasi][1 naso][10 naša][19 naše][30 našej][4 naši]
     [37 našich][15 našom][2 našou][1 naší][84 nás][19 náš]
pas >> 17
  o: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 pas][1 pasy][1 pasú]
  n: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 Paša][2 Pás][1 Páse][2 Pásy]
     [5 pas][1 pasy][1 pasú][1 paša][1 paše][1 paši][41 pás][1 pása]
     [1 pásami][18 páse][8 pásmi][3 pásoch][11 pásom][5 pásové][5 pásy]
     [1 páší]
plan >> 16
  o: [12 Plan][3 Planina][1 plan][2 plane][1 planej][14 planina][6 planine]
     [3 planinou][15 planiny][1 plané]
  n: [12 Plan][3 Planina][11 Plán][2 Plánom][3 Plány][1 Pláň][1 plan][2 plane]
     [1 planej][14 planina][6 planine][3 planinou][15 planiny][1 plané][49 plán]
     [16 pláne][3 pláni][4 plánmi][5 plánoch][7 plánom][20 plánov][37 plány]
     [1 plání][3 pláň][2 pláňami][1 pláňou]
polsk >> 31
  o: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][1 polski][1 polsko]
  n: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][93 Poľska][20 Poľskej]
     [138 Poľsko][20 Poľskom][15 Poľská][11 Poľské][7 Poľského][2 Poľskí]
     [1 Poľskú][10 Poľský][1 Poľských][1 Poľským][1 polski][1 polsko]
     [49 poľskej][22 poľsko][12 poľskom][6 poľskou][37 poľsky][32 poľská]
     [22 poľské][39 poľského][2 poľskému][7 poľskí][5 poľskú][82 poľský]
     [21 poľských][16 poľským][2 poľskými][2 poľština][8 poľštine]
     [2 poľštinou][6 poľštiny]
post >> 19
  o: [41 Post][102 post][25 poste][4 postoch][1 postom][1 postov][2 posty]
  n: [41 Post][2 Pošta][1 Poštovou][2 Poštová][2 Poštové][1 Poštový]
     [102 post][25 poste][4 postoch][1 postom][1 postov][2 posty][12 pošta]
     [5 poštou][2 poštovej][1 poštovou][4 poštová][1 poštové][2 poštového]
     [1 poštoví][4 poštový][4 poštových][15 pošty][1 pôst][1 pôsty]
     [3 pôšt]
premier >> 18
  o: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][3 première]
  n: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][1 Premiér][22 Premiéra]
     [2 Premiérom][1 Premiérový][3 première][33 premiér][26 premiéra]
     [14 premiére][10 premiérom][2 premiérou][1 premiérov][4 premiérovo]
     [1 premiérovom][1 premiérového][1 premiérovému][1 premiérovú]
     [3 premiérový][6 premiéry][1 pre­miér]
prirodn >> 25
  o: [1 prirodne]
  n: [1 Prírodne][3 Prírodnej][2 Prírodnou][46 Prírodná][4 Prírodné]
     [2 Prírodnú][5 Prírodný][1 Prírodných][4 Přírodní][1 prirodne]
     [1 prírodna][3 prírodne][32 prírodnej][1 prírodniny][1 prírodno]
     [5 prírodnom][19 prírodnou][192 prírodná][49 prírodné][29 prírodného]
     [14 prírodnú][15 prírodný][85 prírodných][6 prírodným][8 prírodnými]
     [17 přírodní]
rimsk >> 18
  o: [1 Rimského]
  n: [1 Rimského][7 Rímska][5 Rímske][24 Rímskej][2 Rímski][4 Rímsko]
     [1 Rímskom][3 Rímskou][10 Rímsky][18 rímska][24 rímske][45 rímskej]
     [12 rímski][12 rímsko][5 rímskom][2 rímskou][41 rímsky][3 Římské]
     [1 římskou]
siet >> 18
  o: [4 SIETe][6 Siete][103 siete][53 sieti][12 sietí]
  n: [4 SIETe][11 SIEŤ][6 Siete][8 Sieť][4 Sieťová][1 Sieťový][103 siete]
     [53 sieti][12 sietí][82 sieť][4 sieťami][16 sieťou][2 sieťovej]
     [1 sieťovom][1 sieťovou][1 sieťová][4 sieťové][1 sieťového]
     [2 sieťovú][1 sieťový][7 sieťových][3 sieťovým][1 sieťovými]
skol >> 15
  o: [2 skole][1 skoly]
  n: [2 skole][1 skoly][1 ŠKOLA][1 ŠKOLY][24 Škola][2 Škole][3 Školy][2 škol]
     [329 škola][2 školami][163 škole][12 školou][306 školy][46 školách]
     [2 školám][1 škoła][65 škôl]
stal >> 18
  o: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi]
     [16 Stalo][862 stal][338 stala][1 stale][150 stali][173 stalo]
  n: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi]
     [16 Stalo][3 Stál][4 Stála][16 Stále][3 Stálej][2 Stáli][1 Stálo]
     [1 Stálou][2 Stály][862 stal][338 stala][1 stale][150 stali][173 stalo]
     [67 stál][41 stála][249 stále][7 stálej][28 stáli][19 stálo][2 stálom]
     [3 stálou][13 stály][1 Štál]
stat >> 20
  o: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí]
  n: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí]
     [71 stať][4 stát][12 stáť][2 sťatá][2 sťatí][1 sťatý][1 sťať]
     [14 Štát][1 Štátoch][3 Štátov][3 Štáty][1 štatom][1 štatov]
     [186 štát][124 štáte][37 štátmi][116 štátoch][66 štátom]
     [238 štátov][91 štáty]
studi >> 27
  o: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio]
     [1 studií]
  n: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio]
     [1 studií][12 Štúdia][3 Štúdie][6 Štúdio][2 Štúdiom][3 Štúdiové]
     [1 Štúdioví][2 Štúdiá][1 študiovým][140 štúdia][4 štúdiami]
     [57 štúdie][12 štúdii][18 štúdio][17 štúdiom][3 štúdiou]
     [3 štúdiovom][1 štúdiová][5 štúdiové][14 štúdiového][67 štúdiový]
     [10 štúdiových][4 štúdiovým][2 štúdiovými][16 štúdiá]
     [34 štúdiách][1 štúdiám][50 štúdií]
styl >> 15
  o: [2 Style][1 Stylos][3 styl][3 style]
  n: [2 Style][1 Stylos][3 styl][3 style][5 Štýl][1 Štýlom][1 Štýlové]
     [69 štýl][61 štýle][4 štýlmi][3 štýloch][20 štýlom][13 štýlov]
     [1 štýlovej][2 štýlovo][3 štýlové][1 štýlovú][1 štýlový]
     [13 štýly]
svat >> 25
  o: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][2 svaté][2 svatého]
     [1 svatý][1 svatých]
  n: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][1 Sváti][39 Svätej]
     [1 Sväto][10 Svätom][1 Svätou][43 Svätá][6 Sväté][47 Svätého]
     [7 Svätému][3 Svätí][3 Svätú][94 Svätý][2 svaté][2 svatého][1 svatý]
     [1 svatých][51 svätej][2 svätom][5 svätou][8 svätá][4 sväté]
     [140 svätého][5 svätému][2 svätí][3 svätú][21 svätý][59 svätých]
     [7 svätým][2 svätými]
system >> 18
  o: [1 SYSTEM][52 System][17 system][1 systema][1 systeme]
  n: [1 SYSTEM][52 System][1 Systémová][1 Systémové][10 Systémy][17 system]
     [1 systema][1 systeme][16 systémami][67 systéme][37 systémoch][70 systémom]
     [86 systémov][3 systémovej][1 systémovom][2 systémová][9 systémové]
     [3 systémového][1 systémovú][4 systémový][6 systémových][102 systémy]
     [1 sýstéma]
velk >> 30
  o: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým]
     [2 velkou][2 velké][1 velkého]
  n: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým]
     [1 Veľk][1 Veľka][151 Veľkej][18 Veľkom][16 Veľkou][163 Veľká]
     [83 Veľké][70 Veľkého][3 Veľkému][1 Veľkí][31 Veľkú][180 Veľký]
     [15 Veľkých][25 Veľkým][3 Veľkými][2 velkou][2 velké][1 velkého]
     [136 veľkej][1 veľko][38 veľkom][1 veľkos][57 veľkou][96 veľká]
     [306 veľké][83 veľkého][15 veľkému][2 veľkí][120 veľkú][223 veľký]
     [137 veľkých][79 veľkým][36 veľkými]
voln >> 19
  o: [1 Volnin][1 Volný][2 volne][1 volné][1 volného][1 volný]
  n: [1 Volnin][1 Volný][1 Voľne][1 Voľná][8 Voľné][1 Voľným][2 volne]
     [1 volné][1 volného][1 volný][2 voľna][64 voľne][23 voľnej][3 voľno]
     [18 voľnom][3 voľnou][8 voľná][21 voľné][17 voľného][1 voľnému]
     [4 voľnú][19 voľný][22 voľných][38 voľným][1 voľnými]
vyber >> 18
  o: [11 vyberá][2 vyberú]
  n: [3 Výber][1 Výberová][1 Výberové][11 vyberá][2 vyberú][76 výber]
     [17 výbere][1 výbermi][9 výberom][1 výberov][2 výberovom][1 výberová]
     [1 výberové][1 výberovú][6 výberový][2 výberových][2 výberovým]
     [1 výberovými][1 výběr][1 výběrový]

High-Frequency Words

I also looked for high-frequency words that were added to a group.

High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

I dropped groups that are easily interpreted as a small number of words without diacritics being added to a larger group of words with diacritics, one of which is high-frequency. For example, 1 instance of clanok would be indexed with 1501 instances of článok, which isn't actually very interesting (and may just be a typo).

I kept groups where the group being added to had at least 3 different words in it, or at least one of the words had 10 or more instances. The remaining 12 groups with high-frequency words are below.
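The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the script used for the analysis: the helper names `parse_group` and `keep_for_review` are hypothetical, the "group being added to" is assumed to be the old (`o:`) group, and the `clanok` group is reduced to the two words mentioned in the text.

```python
import re

# Matches one "[count word]" entry as it appears in the group listings.
TOKEN = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(text):
    """Turn '[84 AZ][1 Az]' into {'AZ': 84, 'Az': 1}."""
    return {word: int(count) for count, word in TOKEN.findall(text)}

def keep_for_review(old, new, high_freq=1000, min_words=3, min_count=10):
    """Assumed reading of the rule: a high-frequency word was gained, and the
    old group has at least 3 distinct words or one word with 10+ instances."""
    gained = {w: c for w, c in new.items() if w not in old}
    has_high_freq_gain = any(c >= high_freq for c in gained.values())
    old_is_substantial = (len(old) >= min_words
                          or any(c >= min_count for c in old.values()))
    return has_high_freq_gain and old_is_substantial

az_old = parse_group("[84 AZ][1 Az][6 az]")
az_new = parse_group("[84 AZ][1 Az][101 Až][6 az][2357 až]")
# Simplified stand-in for the dropped clanok/článok case described above.
clanok_old = parse_group("[1 clanok]")
clanok_new = parse_group("[1 clanok][1501 článok]")

print(keep_for_review(az_old, az_new))          # True
print(keep_for_review(clanok_old, clanok_new))  # False
```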

As before, the most interesting collisions (ignoring case) seem to be:

  • co and čo
  • kedy and keď
  • su, sú, šu, and šú

Gained members are bolded.

az >> 2
  o: [84 AZ][1 Az][6 az]
  n: [84 AZ][1 Az][101 Až][6 az][2357 až]
byt >> 6
  o: [1 Bytom][1 Byty][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][5 bytové]
     [12 byty]
  n: [1 Bytom][1 Byty][4 Byť][1 Být][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom]
     [5 bytové][12 byty][1931 byť][4 být][1 býti][2 býť]
cast >> 29
  o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti]
  n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často]
     [1 Častou][7 Častá][5 Časté][5 Častý][3 Častým][2 Častými]
     [37 Časť][1 častej][1171 časti][1 častich][541 často][1 častom]
     [3 častou][7 častá][39 časté][2 častého][240 častí][6 častý]
     [4 častých][14 častým][9 častými][1 čas­to][1012 časť][18 časťami]
     [122 časťou][5 část][1 části]
co >> 3
  o: [3 CO][28 Co][7 Comes][7 co][2 comes]
  n: [3 CO][28 Co][7 Comes][26 Côtes][7 co][2 comes][50 Čo][1418 čo]
ked >> 2
  o: [1 Kedy][1 ked][461 kedy]
  n: [1 Kedy][297 Keď][1 ked][461 kedy][1084 keď]
ma >> 3
  o: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][1 mA][61 ma][1 makes][1 malém]
     [1 mares]
  n: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][476 Má][2 Mánes][1 mA][61 ma]
     [1 makes][1 malém][1 mares][3428 má]
neskor >> 4
  o: [1 Neskoro][1 Neskorším][13 neskorej][13 neskoro][3 neskorom][1 neskorou]
     [1 neskory][2 neskoré][10 neskorého][7 neskorý][6 neskorých][9 neskorším]
  n: [1 Najneskôr][1 Neskoro][1 Neskorším][270 Neskôr][9 najneskôr]
     [13 neskorej][13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré]
     [10 neskorého][7 neskorý][6 neskorých][9 neskorším][1075 neskôr]
podl >> 3
  o: [4 Podla][1 Podle][9 podla][3 podle][1 podlete]
  n: [1 PODĽA][4 Podla][1 Podle][531 Podľa][9 podla][3 podle][1 podlete]
     [1254 podľa]
ponuk >> 3
  o: [1 Ponuka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách]
  n: [1 Ponuka][8 Ponúka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách]
     [3 ponúk][2285 ponúka]
region >> 9
  o: [6 Region][1 Regione][2 Regionova]
  n: [6 Region][1 Regione][2 Regionova][7 Región][1 Regióny][76 región]
     [1859 regióne][3 regiónmi][13 regiónoch][9 regiónom][28 regiónov]
     [11 regióny]
su >> 5
  o: [6 SU][141 Su][2 Sü][9 su][3 sü]
  n: [6 SU][141 Su][146 Sú][2 Sü][9 su][3855 sú][3 sü][19 ŠÚ][1 šu][3 šú]
ze >> 2
  o: [4 Ze][14 ze]
  n: [4 Ze][14 ze][4 Že][3268 že]

Speaker Review: Folding Groups that Lost and Gained (Mixed) Members

The question for speakers of Slovak reviewing these sections (Random Sample and High-Frequency Words) is this: would it be bad if a search for any word in a new group also found the other words in that group, instead of the words in the old group? (That's clunky, but after looking separately at groups that lost and gained members, the idea should be clear enough.)

Random Sample

Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

I don't see a lot of differences here, other than that gains and losses happen to apply at the same time. However, I'm including them in case there is something non-obvious.

Below is a sample of 10 randomly selected stemming groups (words that would all be indexed together) that both lost and gained members as a result of folding Slovak diacritical characters after stemming. (These are from the Wikipedia sample.)

Key:

  • cest >< 5
    • cest indicates that all of these words were stemmed to cest. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
    • >< 5 indicates that from "old" to "new", this stemming group lost some members and gained some members, and the total lost or gained is 5.
  • o: — the "old" group, in this case, the current behavior
  • n: — the "new" group, in this case, with Slovak letters folded after stemming
  • [1 CESTA] — CESTA occurs 1 time in our sample (of 10K articles)
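To make the key concrete, here is a minimal Python sketch that rebuilds a header such as `otrokyn >< 2` from the `o:` and `n:` lines of a group. The function names are hypothetical, and the `==` marker for an unchanged group is an assumption added for completeness; only `>>`, `<<`, and `><` appear in these notes.

```python
import re

def parse_group(text):
    """Turn '[1 Otrokyně][4 otrokyne]' into {'Otrokyně': 1, 'otrokyne': 4}."""
    return {w: int(c) for c, w in re.findall(r"\[(\d+) ([^\]]+)\]", text)}

def summarize(stem, old_text, new_text):
    """Return the header line plus the lists of lost and gained words."""
    old, new = parse_group(old_text), parse_group(new_text)
    lost = sorted(set(old) - set(new))
    gained = sorted(set(new) - set(old))
    if lost and gained:
        marker = "><"   # mixed: lost some members and gained some
    elif gained:
        marker = ">>"   # gains only
    elif lost:
        marker = "<<"   # losses only
    else:
        marker = "=="   # unchanged (hypothetical marker, not used in the notes)
    return f"{stem} {marker} {len(lost) + len(gained)}", lost, gained

header, lost, gained = summarize(
    "otrokyn",
    "[1 Otrokyně][4 otrokyne]",
    "[4 otrokyne][5 otrokyňa]",
)
print(header)  # otrokyn >< 2
print(lost)    # ['Otrokyně']
print(gained)  # ['otrokyňa']
```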

Lost and gained members are bolded.

cest >< 5
  o: [1 CESTA][76 Cesta][1 Cestami][2 Ceste][12 Cestou][21 Cesty][283 cesta]
     [20 cestami][118 ceste][2 cesto][94 cestou][177 cesty][24 cestách][2 cestám]
     [1 cestě]
  n: [1 CESTA][76 Cesta][1 Cestami][2 Ceste][12 Cestou][21 Cesty][283 cesta]
     [20 cestami][118 ceste][2 cesto][94 cestou][177 cesty][24 cestách][2 cestám]
     [3 Česť][1 česti][1 čestine][10 česť]
dob >< 2
  o: [9 Doba][2 Dobové][2 dob][36 doba][1 dobami][317 dobe][18 dobou][8 dobové]
     [213 doby][15 dobách][1 době]
  n: [9 Doba][2 Dobové][2 dob][36 doba][1 dobami][317 dobe][18 dobou][8 dobové]
     [213 doby][15 dobách][6 dôb]
hlav >< 2
  o: [11 Hlava][1 Hlavina][2 Hlavou][1 Hlavový][4 Hlavy][26 hlava][2 hlavami]
     [27 hlave][44 hlavou][2 hlavová][1 hlavovú][1 hlavový][70 hlavy][2 hlavách]
     [1 hlavým][1 hlavě]
  n: [11 Hlava][1 Hlavina][2 Hlavou][1 Hlavový][4 Hlavy][26 hlava][2 hlavami]
     [27 hlave][44 hlavou][2 hlavová][1 hlavovú][1 hlavový][70 hlavy][2 hlavách]
     [1 hlavým][37 hláv]
kop >< 2
  o: [8 Kop][10 Kopa][1 Kopú][1 Kopę][2 kop][14 kopa][5 kope][1 kopom][2 kopou]
     [17 kopy]
  n: [8 Kop][10 Kopa][1 Kopú][2 kop][14 kopa][5 kope][1 kopom][2 kopou][17 kopy]
     [1 kôp]
mat >< 7
  o: [1 MAT][9 Mat][2 Mate][63 Matej][6 Mato][1 Matom][2 Matěj][5 mat]
  n: [1 MAT][1 MAŤA][9 Mat][2 Mate][63 Matej][6 Mato][1 Matom][3 Mať][1 Maťo]
     [5 mat][192 mať][1 máta][1 máte]
otrokyn >< 2
  o: [1 Otrokyně][4 otrokyne]
  n: [4 otrokyne][5 otrokyňa]
pohrebisk >< 2
  o: [1 Pohrebisko][1 Pohrebiská][1 pohrebiska][6 pohrebisko][1 pohrebiskom]
     [1 pohrebiská][1 pohřebiště]
  n: [1 Pohrebisko][1 Pohrebiská][1 pohrebiska][6 pohrebisko][1 pohrebiskom]
     [1 pohrebiská][3 pohrebísk]
sa >< 2
  o: [14 SA][5 Sa][1 Sages][2 Sales][3 Savès][24614 sa]
  n: [14 SA][5 Sa][1 Sages][2 Sales][1 Sá][24614 sa]
slavk >< 2
  o: [4 Slavkov][1 Slavkova][6 Slavkove][1 Slavkově]
  n: [4 Slavkov][1 Slavkova][6 Slavkove][1 Slávka]
sut >< 5
  o: [2 sute][1 sutí][1 sutě]
  n: [2 sute][1 sutí][2 suť][1 suťové][1 Šuta][1 Šuty]

High-Impact Groups

High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.

Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).

The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.
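The asymmetry described above (a merge looks like an outlier only from the smaller group's point of view) can be shown with a small Python sketch. The word lists are synthetic placeholders sized to match the 10-and-2 example, and `merge_views` is a hypothetical helper, not part of the analysis tooling.

```python
def merge_views(groups, threshold=10):
    """For each pre-merge group, count how many words it gains when all the
    groups collapse into one, and whether that reaches the high-impact threshold."""
    merged = set().union(*groups.values())
    views = {}
    for name, members in groups.items():
        gained = len(merged - set(members))
        views[name] = (gained, gained >= threshold)
    return views

# Synthetic groups: one with 10 distinct words, one with 2.
groups = {
    "big": [f"word{i}" for i in range(10)],
    "small": ["extra0", "extra1"],
}

for name, (gained, high_impact) in merge_views(groups).items():
    print(name, gained, high_impact)
# big 2 False
# small 10 True
```

The same merge is reported once as an unremarkable gain of 2 and once as an apparent high-impact gain of 10, which is why such groups need a second look before being treated as problems.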

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

There were 8 groups with 10 or more changes, which are shown below.

Lost and gained members are bolded.

dom >< 12
  o: [2 DOM][71 Dom][4 Doma][9 Dome][1 Domo][3 Domus][7 Domy][170 dom][58 doma]
     [6 domami][55 dome][10 domoch][7 domom][78 domy][1 domě]
  n: [2 DOM][71 Dom][4 Doma][9 Dome][1 Domo][3 Domus][7 Domy][7 Dóm][1 Dóma]
     [2 Dóme][22 Dôme][170 dom][58 doma][6 domami][55 dome][10 domoch][7 domom]
     [78 domy][9 dóm][1 dóma][4 dómami][3 dóme][1 dómoch][3 dómom][1 dómy]
kon >< 10
  o: [4 Kon][4 Kone][3 Koná][1 Koně][16 kone][11 koni][43 koná][50 koní]
  n: [4 Kon][4 Kone][3 Koná][1 Koňa][1 Kóňa][5 Kôň][16 kone][11 koni]
     [43 koná][50 koní][19 koňa][3 koňmi][5 koňoch][4 koňom][3 koňovi]
     [18 kôň]
lud >< 21
  o: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus]
  n: [1 Lud][1 Lude][1 Ludo][1 Ludus][1 luďom][1 ĽUDÍ][1 Ľud][1 Ľuda][1 Ľudo]
     [4 Ľudové][1 Ľudí][1 Ľuďom][46 ľud][1 ľude][6 ľudi][1 ľudmi]
     [7 ľudom][33 ľudové][3 ľudy][421 ľudí][42 ľuďmi][13 ľuďoch]
     [25 ľuďom][1 ľuďí]
metod >< 10
  o: [9 Metod][20 Metoda][1 Metodom][1 Metodov][1 Metodova][2 Metodovi]
     [1 Metodových][1 Metoděj]
  n: [9 Metod][20 Metoda][1 Metodom][1 Metodov][1 Metodova][2 Metodovi]
     [1 Metodových][8 Metóda][1 Metódou][5 Metódy][34 metód][58 metóda]
     [15 metódami][5 metóde][33 metódou][62 metódy]
roman >< 17
  o: [1 ROMAN][66 Roman][12 Romana][1 Romani][5 Romano][5 Romanom][1 Romanov]
     [1 Romanova][2 Romanovej][3 Romanovi][2 Romanus][1 Romany][1 Româna][3 roman]
     [2 române][1 română]
  n: [1 ROMAN][66 Roman][12 Romana][1 Romani][5 Romano][5 Romanom][1 Romanov]
     [1 Romanova][2 Romanovej][3 Romanovi][2 Romanus][1 Romany][23 Román]
     [1 Románi][1 Româna][3 roman][195 román][19 románe][4 románmi]
     [4 románoch][11 románom][27 románov][9 románovej][1 románovou]
     [5 románová][1 románové][2 románového][1 románový][2 románových]
     [33 romány][2 române]
slovak >< 10
  o: [1 Slovaci][15 Slovak][1 Slovakė][1 Slovači]
  n: [1 Slovaci][15 Slovak][1 Slovači][47 Slováci][22 Slovák][3 Slováka]
     [5 Slovákmi][2 Slovákoch][6 Slovákom][58 Slovákov][1 Slovákovi]
     [1 slováci]
volb >< 10
  o: [1 volba][1 volbě]
  n: [5 Voľba][5 Voľby][1 volba][14 voľba][10 voľbami][15 voľbe][5 voľbou]
     [58 voľby][117 voľbách][3 voľbám]
zbran >< 11
  o: [6 Zbrane][144 zbrane][1 zbrani][78 zbraní][1 zbraně]
  n: [6 Zbrane][9 Zbraň][144 zbrane][1 zbrani][78 zbraní][64 zbraň][16 zbraňami]
     [11 zbraňou][2 zbraňová][2 zbraňové][1 zbraňového][1 zbraňový]
     [10 zbraňových][1 zbraňovými]

High-Frequency Words

High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.

[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]

There are no high-frequency words (1,000 or more occurrences) in any groups that both lost and gained words, in either the Wikipedia or the Wiktionary sample.

Wiktionary Notes

The Wiktionary sample is generally similar in terms of words lost and gained from stemming groups. The most obvious difference, other than the smaller size of the sample, is the presence of pronunciations in IPA. In particular, IPA transcriptions ending with ː (the vowel lengthening mark), such as slɔvniː, no longer get stemmed.

Speaker Review Summary

Update October 2019: Jetam2 looked over the samples here, and said, "When I compare the old and new there, it really seems that the ones we lost are rather foreign, versus the ones we gained are rather useful."

More details on specific groups, specific examples, and some minor concerns:

  • In the random sample where words are lost from groups (the list starting with bechyn), the most common groups are low-frequency non-Slovak words or names. The primary reason for words to be lost from a group is that a diacritic near the end of the word blocks the stemmer from removing what looks like a suffix. So, -e is a valid suffix, but -ě is not a suffix, so Bechyně no longer gets stemmed to bechyn. Some related forms may not group together, but they are non-Slovak, so that’s okay. Overall, this is fine.
  • In the random sample where words are added to groups (the list starting with alternativ), the groups are generally Slovak words and they all look okay. The only concern is Vals/Valšov, where Valšov gets -ov removed by the stemmer, giving valš, which gets folded to vals. This is the kind of thing that happens with diacritic folding, so it’s expected, and based on this sample, not what happens in the majority of cases. This is good!
  • In the random sample where words are both added to and lost from groups (the list starting with cest), Czech cognates ending in -ě are lost because the stemmer doesn’t recognize them anymore. It would be okay if they were grouped with the Slovak cognate, but it’s okay if they are not. This group is basically a combination of the two previous groups: some related non-Slovak words are lost from the group because diacritics block stemming, and the additions are generally good. Overall, this is fine.
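The Bechyně and Vals/Valšov cases above come down to the order of stemming and folding. The Python sketch below illustrates this with a deliberately tiny toy stemmer: the two-suffix list and the length guard are assumptions made for the example and are not the rules of the real Slovak stemmer, and the folding step only strips combining marks, which is much less than ICU folding does.

```python
import unicodedata

# Hypothetical suffix list; the real stemmer has many more rules.
TOY_SUFFIXES = ("ov", "e")

def toy_stem(word):
    """Lowercase and strip the first matching suffix, if the stem stays long enough."""
    lower = unicodedata.normalize("NFC", word).lower()
    for suffix in TOY_SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)]
    return lower

def fold(word):
    """Remove diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem_then_fold(word):
    return fold(toy_stem(word))

def fold_then_stem(word):
    return toy_stem(fold(word))

# -ě is not a suffix the toy stemmer knows, so stemming first leaves it in place.
print(stem_then_fold("Bechyně"))  # bechyne
print(fold_then_stem("Bechyně"))  # bechyn

# Valšov loses -ov, giving valš, which folds to vals and collides with Vals.
print(stem_then_fold("Valšov"))   # vals
print(stem_then_fold("Vals"))     # vals
```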

These three random samples are the most representative of the changes we will see, and they are generally good, so we are good to go.

  • In the groups where high-frequency words were added (the list starting with az), the only case that might be a problem is kedy and keď, but they look to be etymologically related, high-frequency function words, and so not a huge problem. Overall, this is fine.
  • In the groups with high-impact changes that both added and lost from groups (the list starting with dom), most lost words are non-Slovak and most gained words are good additions. This is good!

In summary, folding after stemming is a net improvement, and we should implement the change to enable this version of the analysis chain for now, and follow up on improving the stemmer at a later time. (See T227924.)

Next Steps

  • DONE Modify the Slovak analysis chain to enable diacritic folding for Slovak diacritics. (See T235561.)
  • Once that is merged and in production, re-index Slovak-language wikis. (See T235654.)

Option 3: Modify the Stemmer

THIS IS A PLACEHOLDER FOR POSSIBLE FUTURE WORK.

And, getting waaaaay ahead of myself, another option to consider if stemming before folding doesn't work is to modify the stemmer to include an option to work on words without diacritics. This could be a fair amount of work to minimize the number of inaccurate stems.

I did note above that the stemmer needs some additional suffixes added to it, which is a separate task. (See Phab ticket T227924.)