User:TJones (WMF)/Notes/Serbian Stemmer Analysis

November 2017 — See TJones_(WMF)/Notes for other projects. See also T178926.

Background & Stemmers
My goal is to test the stemming quality of the previously identified stemmers available for Serbian. One of the stemmers, SerbianStemmer, is implemented in Python and doesn't have any licensing info. On closer inspection, it seems that the other project, SCStemmers, which bundles multiple stemmers, also re-implements the one implemented in Python. For the first pass, then, I'm only going to look at the stemmers in SCStemmers.

SCStemmers labels the four stemmers by number, and I'll refer to them by number from now on:
 * 1) "Kešelj & Šipka - Greedy" for Serbian
 * 2) "Kešelj & Šipka - Optimal" for Serbian
 * 3) "Milošević" for Serbian
 * 4) "Ljubešić & Pandžić" for Croatian

See the documentation for more details on the papers and algorithms. Since #4 is for Croatian, I'm going to ignore it.

Data
I gathered 10K random articles/entries each from Serbian Wikipedia and Serbian Wiktionary, and ran them through the default analyzer (standard tokenizer + icu_normalizer) to get tokens. The icu_normalizer lowercases letters, and does a few other regularizations (most relevant for Serbian, ǳ and ǆ to dz and dž, plus some other character variants).

I divided the tokens into a number of groups:
 * Cyrillic (156,120)—only Cyrillic Serbian letters
 * Latin (73,613)—only Latin Serbian letters
 * Mixed (620)—both Cyrillic and Latin Serbian letters
 * These are mostly words with one letter in a different alphabet from the rest—e.g. gаsoviti, where а is Cyrillic—and are probably typos. Some, where the word is Latin but the last letter is Cyrillic and the main term looks technical, like dosу (lowercase of DOSy) or directxа, might be Latin words with Cyrillic inflections, similar to how in English one might refer to the former Soviet Socialist Republics as the ССРs, with only the s being Latin.
 * Number (9,491)—any combination of digits, commas, and periods
 * Cyrillicish (11,643)—first character is Cyrillic (not limited to Serbian Cyrillic)
 * Includes words with non-Serbian characters, mixed letters and numbers, tokens with punctuation, mixed Cyrillic and Latin, etc., but largely Cyrillic.
 * Latinish (14,186)—first character is Latin (not limited to Serbian Latin)
 * Similar to "Cyrillicish", but also includes IPA and Latin-based symbols (e.g. ℝ).
 * Numberish (990)—first character is a number or fullwidth number
 * Lots of letters in here, many look like measurements or ID numbers (e.g., ISBNs).
 * Other (21,148)—everything else, including tokens starting with symbols (3), Greek (336), Coptic (1), Glagolitic (1), Georgian (17), Armenian (10), Hebrew (179), Arabic (2,321), Syriac (5), Tifinagh (3), Ethiopic (79), Devanagari (527), Bengali (1), Tamil (59), Kannada (3), Thai (7), Tibetan (37), Ogham (1), one (1) that I apparently missed, and lots and lots of CJK (17,557). (Bonus!—I got to install more fonts!)
 * Most of these are words or tokens in a given script, but Greek in particular mixes with Latin for measurements and other sciency stuff (e.g., μmol or ΔG).
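
The grouping above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the letter sets, the `classify` and `script_of` names, and the use of Unicode character names to guess the script of the first character are all assumptions, and the real grouping may have drawn some boundaries differently (for example, for tokens containing x or other non-Serbian Latin letters).

```python
import re
import unicodedata

# Assumed letter sets, lowercase only (the tokens were already lowercased).
SERBIAN_CYRILLIC = set("абвгдђежзијклљмнњопрстћуфхцчџш")
# The digraphs dž, lj, nj are made of letters already in this set.
SERBIAN_LATIN = set("abcčćdđefghijklmnoprsštuvzž")
NUMBER_RE = re.compile(r"[0-9.,]+")


def script_of(ch):
    """Guess a coarse script label from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith(("DIGIT", "FULLWIDTH DIGIT")):
        return "Number"
    return "Other"


def classify(token):
    """Assign a token to one of the eight groups (hypothetical rules)."""
    if not token:
        return "Other"
    chars = set(token)
    if NUMBER_RE.fullmatch(token):
        return "Number"
    if chars <= SERBIAN_CYRILLIC:
        return "Cyrillic"
    if chars <= SERBIAN_LATIN:
        return "Latin"
    if chars <= SERBIAN_CYRILLIC | SERBIAN_LATIN:
        return "Mixed"
    # Fall back to the script of the first character only.
    first = script_of(token[0])
    return "Other" if first == "Other" else first + "ish"
```

The exact-match groups are checked before the "-ish" groups, so a token only lands in Cyrillicish, Latinish, or Numberish when it contains something outside the Serbian letter sets.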

Stemming Results
Before we get to the linguistic quality of the stemming results, there are some other issues to deal with.

The implementation of the Serbian stemmers internally converts characters from Cyrillic to Latin (C2L), does the stemming in Latin, and gives the final form in Latin. This is a nice feature, because it also lets a search in either script match text in the other script (though I've noticed that the majority of text in the Serbian Wikipedia is in Cyrillic). It also nicely addresses the problem of accidentally mixing homoglyphic Cyrillic and Latin characters, which obviously happens from time to time.
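
To illustrate the idea, here is a minimal sketch of a Cyrillic-to-Latin conversion using the standard Serbian letter correspondences. It is not the SCStemmers code; the `C2L` table and `cyrillic_to_latin` name are made up for this example, and it only handles lowercase letters.

```python
# Standard Serbian Cyrillic-to-Latin correspondences (lowercase only).
C2L = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(token):
    """Map each Serbian Cyrillic letter to Latin; leave everything else alone."""
    return "".join(C2L.get(ch, ch) for ch in token)
```

Because characters are mapped one at a time, a mixed-script token such as gаsoviti (with a Cyrillic а) comes out as plain Latin gasoviti, which is why the homoglyph problem goes away.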

Anything that is not "a letter" according to Java's  function is unchanged. Anything that is "a letter" but not an English letter, number, or underscore (according to Java's ), or a Serbian Latin or Cyrillic letter (listed explicitly) is dropped. As a result, a couple of undesirable things happen: words in non-Latin, non-Cyrillic scripts get dropped entirely, except for non-letter diacritics. So some words in Hebrew (ַָָּׁ), Arabic (ََُِِّ), Hindi (ि्ु्ा), Tamil (ொ்ி்ொ்ு) and others are reduced to little piles of diacritics.
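
The effect can be reproduced with a small sketch. This is an approximation of the behavior described above, not the library's code: Python's `str.isalpha` stands in for the Java letter test (an assumption; the two may not agree on every character), and the `ALLOWED` set and `strip_unknown_letters` name are hypothetical.

```python
import string

# Assumed "known" characters: ASCII letters, digits, underscore,
# plus the Serbian Cyrillic letters and the Serbian Latin diacritic letters.
_SERBIAN_LOWER = "абвгдђежзијклљмнњопрстћуфхцчџш" + "čćđšž"
ALLOWED = set(string.ascii_letters + string.digits + "_"
              + _SERBIAN_LOWER + _SERBIAN_LOWER.upper())


def strip_unknown_letters(token):
    """Keep non-letters as-is; drop letters that are not in ALLOWED."""
    return "".join(
        ch for ch in token
        if not ch.isalpha() or ch in ALLOWED
    )
```

With these rules über becomes ber and żagań becomes aga, while a Devanagari word loses its base letters and keeps only its vowel signs and virama, since those marks do not count as letters.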

Non-Linguistic Impressions
In general, stemmers 1 and 2, being variants of the same basic algorithm, differ less from each other than either does from stemmer 3. Of the two, stemmer 1 differs more from stemmer 3.


 * Cyrillic—this is where most of the content is, and there are thousands of differences among the stemmers. We'll have to look at groupings and see what's what. Stemmers 1 and 2 return a small number of empty tokens (< 0.1%).
 * Latin—Again, a small number of empty tokens for stemmers 1 and 2 (≤ 0.10%), and thousands of differences among the stemmers.
 * Mixed—not a lot of tokens here, and no empty tokens.
 * Number—no changes: token in, token out.
 * Cyrillicish—A higher number of empty tokens: about 0.25% for stemmers 1 and 2, and 0.16% for stemmer 3. Fewer differences between stemmer 1 and 2, but still plenty.
 * Latinish—much higher rate of empty tokens (0.7%, 0.8%, and 0.4%).
 * Numberish—No empty tokens (the numbers always come through); only 2 tokens differ for stemmers 1 and 2.
 * Other—a bit more than 87% of these tokens come back empty. Stemmers 1 and 2 are identical; stemmer 3 differs from them on 3 tokens that have some Latin or Cyrillic characters in them (the other characters are dropped).
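
The per-group numbers above come down to two simple counts, sketched here. The function names are made up for this example, and `stem`, `stem_a`, and `stem_b` are placeholders for any callable that maps a token to its stem.

```python
def empty_rate(tokens, stem):
    """Fraction of tokens for which the stemmer returns an empty string."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    empty = sum(1 for token in tokens if stem(token) == "")
    return empty / len(tokens)


def count_differences(tokens, stem_a, stem_b):
    """Number of tokens on which two stemmers return different stems."""
    return sum(1 for token in tokens if stem_a(token) != stem_b(token))
```

Running both functions once per token group (Cyrillic, Latin, Mixed, and so on) gives the kind of breakdown listed above.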

Problematic Groupings
Because of the problems with certain non-Serbian letters being dropped, there are lots of weird groupings for stemmers 1 and 2: (2, 2ε, 2π, 2σ, ω2—not too bad), (95, ņ95—okay), (ber, òberū, über, etc.—wait, no...), (aga, agati, żagań, ага, агате—aaaa!), (aa, aа, ğaʿala, аa, ḍaẓaġ, ṯaẖaḏ—noooooo!). The largest group has 397 tokens in it—all words in Devanagari script with a single virama (" ् "), which get stemmed to just the virama. Similarly there are other groups for Arabic, Tamil, and Tibetan.

Along with the not-so-good, a good thing is happening in this grouping: (aa, aа, ğaʿala, аa, ḍaẓaġ, ṯaẖaḏ). It looks like there are 3 instances of aa, but the first is two Latin a's, the second is one Latin and one Cyrillic, and the third is one Cyrillic and one Latin. This is generally a good thing, because these mixed script words get converted to one script and stemmed properly (though aa doesn't get stemmed).

So, I re-ran the analysis just including the Cyrillic, Latin, and Mixed groups (i.e., only words with Serbian letters). There are still a number of very aggressive-looking stemming groups with one-letter stems from stemmer 1 and 2, and a smaller number of two-letter stems from stemmer 3. (Stemmers 1 and 2 also have plenty of two-letter stems.)

Comparison Among Stemmers
I had some difficulty figuring out how to best compare competing stemmers. Because the stemmers are different, the same words could be grouped together, but with a different stem. Or the same stem could represent different groups. Given the fact that the groups can also merge and split, it's not straightforward.

I decided to do my usual sampling of the biggest groups with the most words in them, the "unexpected" groups that have no common initial or final letters, and a few random groups.

Unexpected Groups
Groups with no common first or last letter—in languages with typical European affixing—are suspect. They can be perfectly fine, though, as with forms of English be or good/better/best.
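
A minimal sketch of this check, assuming the words have already been transliterated to a single script; `group_by_stem` and `is_unexpected` are hypothetical names, and `stem` is a placeholder for any token-to-stem callable.

```python
from collections import defaultdict


def group_by_stem(tokens, stem):
    """Collect the distinct tokens that share each stem."""
    groups = defaultdict(set)
    for token in tokens:
        groups[stem(token)].add(token)
    return dict(groups)


def is_unexpected(words):
    """True if the words share neither a first letter nor a last letter."""
    words = [w for w in words if w]
    if len(words) < 2:
        return False
    first_letters = {w[0] for w in words}
    last_letters = {w[-1] for w in words}
    return len(first_letters) > 1 and len(last_letters) > 1
```

For example, `is_unexpected(["good", "better", "best"])` is true, while a group like rad, rada, radu is not flagged because every word starts with r.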

Stemmer 1: None! There aren't any groupings that don't at least start or end with the same letter (or letter mapped from Cyrillic to Latin).

Stemmer 2: Also none!

Stemmer 3: Only 2—


 * [stem: hteti, 26 word types] hoće, hoću, htela, hteli, hteo, hteti, će, ćemo, ćete, ćeš, ću, хоће, хоћемо, хоћеш, хоћу, хтеде, хтела, хтеле, хтели, хтело, хтео, ће, ћемо, ћете, ћеш, ћу
 * A quick trip to Google Translate shows all of these as being versions of "will" or "want", so that doesn't seem unreasonable.


 * [stem: jesam, 22 word types] je, jesi, jeste, jesu, jе, sam, si, smo, ste, su, сам, си, смо, сте, су, јe, је, јесам, јеси, јесмо, јесте, јесу
 * Google Translate says these are all forms of "be", though Wiktionary also says sam/сам can mean "alone" and other things, too. Quite reasonable.

Big Groups
Bigger groups tend to be a sign of over-aggressive stemming or errors.

Aggressive one- and two-letter stems
As mentioned before, there are a lot of one- and two-letter stems that seem much too aggressive. Below are the three biggest for each stemmer, plus stems and type counts for the next few, down to size 60. My expectation is that a lot of these are poor and merge too much. A possible improvement might be to have a minimum stem length.

Stemmer 3 seems likely to be the best of the bunch in this dimension, since it has no one-letter stems and fewer large groups.
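
One way a minimum stem length could be approximated from outside the stemmer is sketched below. This is only an illustration: the `with_min_stem_length` name and the default of 3 are assumptions, `stem` is any token-to-stem callable, and a real fix inside the stemmer would more likely stop stripping suffixes early rather than fall back to the whole token.

```python
def with_min_stem_length(stem, min_length=3):
    """Wrap a stemmer so that too-short stems fall back to the original token."""
    def guarded(token):
        result = stem(token)
        # Reject stems like "p" or "s"; this also covers empty stems.
        return result if len(result) >= min_length else token
    return guarded
```

The trade-off is that words whose stem is rejected are not grouped with anything, so this reduces over-merging at the cost of some missed matches.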

Stemmer 1


 * [stem: p, 121 word types] p, paba, paca, paja, paka, past, pega, pekao, pela, pele, pena, peni, pete, peti, peć, peći, pijem, piju, pila, pile, piliće, pima, pisala, pisali, pisalo, pisan, pisana, pisane, pisani, pisao, pisati, pismo, pite, piti, piće, piše, pišući, poste, pula, puli, pune, puste, puta, pute, puti, puto, puše, п, пава, пади, пака, паст, пата, пајa, паја, пева, пега, пегле, пекао, пела, пеле, пели, пени, пено, пете, пети, пећ, пећи, пила, пиле, пили, пилиће, пило, пирели, писала, писале, писали, писало, писан, писана, писане, писани, писано, писао, писати, писах, писаше, писаће, писаћу, писмо, писте, пите, пити, пише, пишем, пишеш, пишући, пијем, пијемо, пијеш, пију, пиће, пићу, плац, плаца, плима, плиће, пому, поци, поче, птит, пула, пуле, пули, пуне, пусте, пута, путе, пути, путо, пучем.


 * [stem: s, 99] s, saba, sadi, saka, sama, sata, sava, saća, sekte, sela, seli, selo, sena, seno, sevi, seći, sijena, sijene, sila, sile, sili, sisati, site, siti, sivana, skovao, slovima, smeš, sova, steći, stila, stit, sumo, suše, svima, ševa, с, саба, сава, сади, сака, самa, сама, сата, саја, свима, сева, сега, секао, секте, села, селе, сели, село, сена, сене, сени, сено, сете, сети, сећи, сила, силе, сили, сима, симо, сисати, сите, сити, сијело, сијена, сијено, сију, сковала, сковали, скован, скована, сковао, словима, смем, смемо, смете, смеш, сова, сове, сови, стела, стеле, стећи, стивши, стила, сула, суле, сули, сумо, суше, шаху, шева, шеве


 * [stem: b, 97] b, baba, baja, baka, bata, bega, bela, bele, beli, belo, beste, bijeg, bijela, bijeli, bijelo, bila, bile, bili, bilo, bismo, biste, biti, bivalo, bivati, bivši, biće, biću, bješe, boca, boce, boci, bocu, boga, bogn, bruk, buta, б, баба, бака, бата, баја, баћа, бега, бела, беле, бели, бело, бесмо, бети, беше, беју, биla, бивале, бивали, бивало, бивао, бивати, бивши, била, биле, били, било, бима, бисао, бисмо, бисте, бити, биш, бише, бијела, бијеле, бијели, бијело, бијен, бијеше, бију, биће, бићеш, бићу, блаца, блиш, бове, бога, боца, боце, боци, боцу, брук, була, буле, були, було, буне, бута, бује, бјела, бјело

Other stems, by group size: tr (95 types), l (84), m (83), pa (81), č (79), kr (78), br (75), d (71), n (70), v (70), r (68), rad (67), le (66), vid (66), vr (66), dob (65), k (65), par (65), ra (65), pra (64), š (64), ko (63), prav (63), pr (62), bo (60), ma (60).

Stemmer 2


 * [stem: s, 146] s, saba, saka, sala, sama, sana, sanje, sata, sava, scie, sekte, sela, seli, selo, seći, sečem, sijena, sila, sile, sili, site, siti, sivana, slom, smeš, sova, steg, stega, steći, stigao, stigavši, stigla, stigli, stiglo, stigne, stigneš, stila, stilom, stit, stići, sumo, suše, suši, svekar, svemu, svima, svom, šalje, šalju, šem, šeno, ševa, šš, с, саба, сава, сака, сала, самa, сама, сана, сата, сање, свекар, свему, свима, свом, својима, сего, секао, секле, секли, секло, секте, секу, села, селе, сели, село, сећи, сила, силе, сили, сима, симо, сирина, сирија, сирије, сирији, сиријом, сирију, сирт, сите, сити, сијело, сијена, сијено, слом, смем, смемо, смеш, сова, сове, сови, стад, стадима, стега, стела, стеле, стећи, стивши, стигавши, стигао, стигла, стигле, стигли, стигло, стигне, стигнемо, стигну, стигнути, стила, стилом, стим, стих, стићи, сула, суле, сули, сумо, суше, суши, сушимо, шаху, шаље, шаљи, шаљу, шева, шеве, шем, шен, шене, шени, шеш, шч, шш


 * [stem: d, 82] d, daba, dalja, dama, data, datei, degli, dela, dele, deli, delo, dereš, deti, digne, dignem, dignemo, dignuti, digo, dijela, dijete, dima, divan, dići, domu, dove, dođu, došla, došli, д, даеи, дама, дата, даца, даља, дела, деле, дели, дело, дерем, дереш, дећ, джем, диван, дивши, дигао, дигли, дигло, дигне, дигнете, дигни, дигну, дигнуо, дигнут, дигнута, дигнути, дигнуто, дима, дише, дијега, дијего, дијела, дијете, дићи, дога, дому, доца, дошла, дошле, дошли, дођем, дођу, доћ, дују, дјела, дјели, дјело, дјеце, дјецом, дјецу, ђах, ђена, ђене


 * [stem: m, 82] m, maca, mago, maka, maknuti, mata, meće, mećemo, mećeš, mijem, mila, mile, mili, milo, milou, mimo, miradi, mirage, mirel, mirovati, mirta, mirti, miti, miće, miš, mlje, mlju, moca, moga, mogao, mogla, mogle, mogli, moglo, move, moć, м, мака, макне, макнути, мареш, мата, матеа, маца, маћи, маџа, мела, меше, меће, мећемо, мила, миле, мили, мило, мимо, мираж, мирза, мировала, мировали, мировао, мировати, мирта, мите, миш, мише, миће, мови, мога, могао, могла, могле, могли, могло, мого, мому, моца, моше, моћ, мрмка, мулој, мљи, мљу

Other stems, by group size: č (82), v (77), le (76), b (75), kr (74), pos (74), p (71), š (70), tr (69), ba (68), rad (68), n (67), vid (66), gr (62), par (61), pis (61), prav (61), re (61), živ (61), mar (60), me (60), pr (60).

Stemmer 3


 * [stem: kr, 80] kra, krah, kraja, kraka, kralja, kraste, krat, kraća, krem, kremu, krene, krenu, kreće, kreću, kreše, krila, krilo, krio, krku, krlja, kroj, krova, krte, kru, krune, krut, kruta, kruti, krš, krše, кр, крад, крака, крах, краја, краћа, краћу, кре, крег, крега, креле, крем, крене, крену, кресте, креш, креју, креће, крећу, кри, крила, криле, крили, крило, крим, крио, крит, крити, криш, крију, крка, крке, крку, крле, крма, кро, крова, крој, крсте, круне, крут, крута, круте, крути, круто, кручи, крује, крчки, крш, крше


 * [stem: tr, 78] tr, tra, traka, tran, trata, tre, trem, treo, treće, treći, treću, treš, tri, triju, trima, trio, trka, trla, trlja, trna, trom, trska, trske, trti, trule, truo, trut, tрке, тра, траба, трад, трака, трачки, тре, треве, трем, трему, трена, трене, треном, трета, треју, треће, трећи, трећу, три, триван, трим, трима, трио, трију, трка, трке, трку, трла, трле, трна, тро, трова, тровали, трован, трога, тром, трој, трска, трске, трски, тру, труле, труло, трун, труо, трут, трута, трује, трца, трци, трче


 * [stem: sv, 77] sv, sva, svad, svaja, svaka, svat, svata, sve, svega, svemu, svet, sveta, svete, sveti, svetom, svetu, sveće, sveći, sveću, svi, svih, svila, svile, svim, svima, svinja, svinje, svinju, svite, svo, svog, svoga, svoj, svom, svu, св, сва, свад, свака, сван, свао, сват, свата, свати, свах, свац, све, свеви, свег, свега, свему, свео, свет, света, свете, свети, светима, светом, свету, свеће, сви, свила, свиле, свили, свим, свима, свите, свих, свиш, свију, сво, свог, свога, свом, свој, сву, свуче

Other stems, by group size: rad (75), br (70), da (67), gr (67), prav (66), li (62), ma (62), vid (61), bo (60), st (60).

Biggest common three- or four-letter stems
The stems prav, rad and vid were among the largest groups for all stemmers, so I've pulled them out for comparison. I transliterated everything to Latin and deduplicated the lists to make comparison easier. I expect these aren't great groupings just because they are big.

The groupings are presented below. Everything common to all three stemmers is listed under "common", with the remainder listed for each stemmer (with some overlap remaining, since two of the three sometimes agree).


 * common: prav, pravac, pravaca, pravcem, pravci, pravcima, prave, pravi, pravih, pravila, pravili, pravilo, pravim, pravimo, pravio, praviti, praviš
 * 1: pravilima, pravljeni, pravna, pravne, pravni, pravno, pravn, pravljen, pravljena, pravljene, pravljeno
 * 2: pravca, pravcu, pravilima, pravima, pravljeni, pravdu, pravljen, pravljena, pravljene, pravljeno
 * 3: prava, pravca, pravcu, pravima, pravna, pravne, pravni, pravno, pravnog, pravnu


 * common: rad, rada, rade, radeći, radila, radili, radilo, radim, radimo, radio, radit, raditi, rado, radom, radova, radovan, radove, radovi, radovima, radu, radac, radile, radiću, radla, radovali, raduju
 * 1: radiju, radna, radno, radovanje, rađen, rađena, rađeni, radovalo, radovana, radovani, radovati, radovahu, radule, rađene, rađeno
 * 2: radi, radića, radići, radiš, rađen, rađena, rađeni, radić, radovalo, radovana, radovani, radovati, radovahu, raduje, rađene, rađeno
 * 3: radan, radeta, radi, radiju, radiš, radna, radne, radni, radnji, radno, radnog, radnu, ratko, radetom, radul, radule, raduje, ratka


 * common: vid, vidac, vide, videla, videle, videli, videlo, video, videti, vidi, vidim, vidimo, vidio, vidite, vidla, vidom, vidova, vidove, vidu, videvši, videsmo, videše, videće, videći, vidiću, vidli, vidna, vido, vidovi, vidovima
 * 1: vida, vidjeti, vidjevši, viđen, viđeni, videćemo, videćete, vidno, viduše, vidjela, vidjeli, vidjelo, viđena, viđene, viđeno
 * 2: vidak, vidić, vidiš, viđen, viđeni, videćemo, videćete, vidiljivih, vidjela, vidjeli, vidjelo, vidjeti, viđena, viđene, viđeno
 * 3: vida, vidan, vidati, vidiš, vitka, vidni, vidno, vidnog, viduše, vitko

Random Sample of Groups
I chose 15 random words from the corpus, and found the groups those words are in. I transliterated everything to Latin and deduplicated the lists, because it isn't shocking when lumperaj and лумперај are in the same group.

All groups overlap by at least the one word they have in common, and as above, the words shared by all three stemmers are pulled out under "common", with the remainder listed for each stemmer (with some overlap remaining, since two of the three sometimes agree).
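
The "common" and per-stemmer lists can be derived with plain set operations. A minimal sketch, where the `split_common` name is made up for this example and the input is a mapping from stemmer label to the words in that stemmer's group:

```python
def split_common(groups):
    """Split per-stemmer word groups into shared words and remainders."""
    sets = {name: set(words) for name, words in groups.items()}
    common = set.intersection(*sets.values()) if sets else set()
    remainder = {name: sorted(words - common) for name, words in sets.items()}
    return sorted(common), remainder
```

Called with `{"1": ["krepati", "krepak"], "2": ["krepati"], "3": ["krepati", "krep", "krepa", "krepiti"]}`, it returns krepati as the only common word and leaves stemmer 2 with an empty remainder.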


 * common: lumperaj


 * common: krepati
 * 1: krepak
 * 2: (no others)
 * 3: krep, krepa, krepiti


 * common: provera, provere, proverena, proverene, proveri, proveriti, provereno, proverilo, proverom, proveru
 * 1: proverava, proveren, provereni
 * 2: proverava, proveren, provereni
 * 3: (no others)


 * common: besplatna, besplatno, besplatne, besplatni
 * 1: (no others)
 * 2: besplatnih, besplatnog, besplatnom, besplatnu
 * 3: besplatan, besplatnog, besplatnu


 * common: galija, galijama, galije, galiji, galijom
 * 1: (no others)
 * 2: galiju
 * 3: galijano


 * common: deseptikona, deseptikone, deseptikoni


 * common: konzularne, konzularni
 * 1: (no others)
 * 2: konzularnog, konzularnom
 * 3: konzularnog


 * common: pepeljug, pepeljuga
 * 1: (no others)
 * 2: (no others)
 * 3: pepeljugi


 * common: pregažen
 * 1: pregaženo
 * 2: pregaženo
 * 3: (no others)


 * common: preklapanja, preklapanjem, preklapanju
 * 1: preklapanje, preklapanjem
 * 2: preklapanjem, preklapa
 * 3: preklapanje


 * common: preporučen
 * 1: preporučeno, preporučila, preporučio, preporučivati, preporučuje, preporučuju, preporučena, preporučene, preporučeni, preporuči, preporučivao, preporučili
 * 2: preporučeno, preporučila, preporučio, preporučivati, preporučuje, preporučuju, preporučena, preporučene, preporučeni, preporuči, preporučivao, preporučivši, preporučili
 * 3: preporučenih


 * common: pristanak
 * 1: (no others)
 * 2: pristanka, pristankom
 * 3: (no others)


 * common: svetki, svetkog
 * 1: svetkovanje
 * 2: svet, sveta, svete, sveti, svetlo, svetlu, sveto, svetog, svetom, svetovan, svetu, svetima, svetla, svetlima, svetlom, svetova, svetove, svetovi, svetoga, svetoj
 * 3: svetkom, svetkuje


 * common: fiksiranom, fiksiranoj
 * 1: fiksiranog
 * 2: fiksiraju, fiksirana, fiksiranog, fiksiran, fiksirane, fiksirani
 * 3: (no others)


 * common: huligani, huligana
 * 1: huligan
 * 2: huligan
 * 3: (no others)

Lost tokens
Stemmers 1 and 2 lost some of the Serbian-character-only tokens (that is, returned an empty stem for them): 196 and 164, respectively, out of 230,353. Most are 4 characters or fewer, including words like demo, ivan, ivana, memo, and their Cyrillic counterparts, демо, иван, ивана, мемо. Stemmer 3 did not lose any Serbian-character-only tokens.

I will also report these as bugs to the developer.

Recommendations and Next Steps

 * Get speaker review of groupings to assess general stemming quality, and see if it's possible to easily choose among the three stemmers. (DONE)
 * Željko reviewed the groupings and said that other than the very short stems, everything looks reasonable, so I'm waiting to hear back from the developer about fixing the non-Serbian and short-stem problems.
 * File bugs with the developer for the empty token problems (DONE):
 * losing non-Latin, non-Cyrillic characters is one problem: Issue #1 (DONE)
 * the small number of all-Latin or all-Cyrillic tokens that come back empty is a problem: Issue #2 (tentatively declined)
 * overly aggressive stemming resulting in one-letter stems may be an avoidable problem: also Issue #2 (after a peek at the code, it looks like this and the previous one may have the same cause) (tentatively declined)


 * Depending on the developer's response, consider some combination of submitting pull requests, forking the project, or creating a wrapper in the plugin to prevent problems (e.g., don't stem strings with any "bad" characters and replace empty tokens with the original token). (DONE—see below)
 * Test the Croatian stemmer...
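
The wrapper idea from the list above could look something like this. It is a sketch under assumptions, not the plugin code: "bad" characters are taken to mean letters outside ASCII and the Serbian alphabets, Python's `str.isalpha` stands in for the letter test, and `make_safe_stemmer` and `ALLOWED` are hypothetical names. `stem` is any token-to-stem callable.

```python
import string

# Assumed set of characters the stemmers handle safely.
_SERBIAN_LOWER = "абвгдђежзијклљмнњопрстћуфхцчџш" + "čćđšž"
ALLOWED = set(string.ascii_letters + string.digits + "_"
              + _SERBIAN_LOWER + _SERBIAN_LOWER.upper())


def make_safe_stemmer(stem):
    """Wrap a stemmer so it never mangles or loses a token."""
    def safe(token):
        # Don't stem tokens containing letters the stemmer would drop.
        if any(ch.isalpha() and ch not in ALLOWED for ch in token):
            return token
        # Replace an empty stem with the original token.
        return stem(token) or token
    return safe
```

This keeps the wrapper independent of which of the stemmers is eventually chosen, since it only looks at the input token and the returned stem.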

Player 4 Has Entered the Game!
The developer of the SCStemmers library, Vuk Batanović, has recommended stemmer #4 as the overall best. I hadn't considered it because it was labeled as "Croatian" and also didn't handle Cyrillic input, which is critical for Serbian. I've since learned more about Serbo-Croatian from Željko and Wikipedia, and Vuk has added a Cyrillic-to-Latin filter on the front end of the Croatian stemmer—so I'm planning to test it next. If it doesn't do well, we'll go back and look at adding a minimum stem length option to the other stemmers.