User:TJones (WMF)/Notes/Esperanto Stemmer Analysis

June 2018 — See TJones_(WMF)/Notes for other projects. See also T197240.

Background
Esperanto is a bit further down the list of the remaining top 50 languages to look at (T171652), but it jumped to the top because I had a developer ask me to recommend a project to work on, and I suggested an Esperanto stemmer. As a constructed language, Esperanto is very regular and reasonably well documented, so the barrier to implementing a stemmer is much lower for a non-speaker.

It's now available on GitHub; it's written in Java and has a GPL3 license (the structure is based on the Serbian stemmer), so all of the technical details are in good shape. The next step is to get a review of the stemming quality from speakers.

Data
I pulled 5,000 random articles from the Esperanto Wikipedia and 5,000 entries from the Esperanto Wiktionary, and did my usual stripping of markup and deduplication of individual lines (to get rid of excess copies of the equivalent of commonly used headings like "References", "See Also", "Noun", language names, etc.).

On Tokenization
Since this is an external stemmer, I used the default analyzer configuration (standard tokenizer and ICU normalization) to break the text into tokens, and then passed them to the stemmer in a file. The stemmer does its own round of tokenization when reading a file, which creates some discrepancies.

The stemmer tokenizes strings of Esperanto characters with word boundaries at either end, which means that it does not stem words with q, w, x, y, or any letters with non-Esperanto diacritics. Esperanto Wikipedia generally translates names (Wolfgang Amadeus Mozart becomes Volfgango Amadeo Mozarto) but some still slip through (the Mozart article includes Wolfgang-on).
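To make the effect of that rule concrete, here is a minimal sketch in Python (the stemmer itself is in Java). The regular expression is my own approximation of "a run of Esperanto letters with word boundaries at either end", not a copy of the stemmer's code, and `stemmable_tokens` is a hypothetical helper name.

```python
import re

# The 28 Esperanto letters: the Latin alphabet minus q, w, x, y,
# plus the six accented letters. This character class is an assumption
# about what "Esperanto characters" means, not taken from the stemmer.
ESPERANTO_WORD = re.compile(r"\b[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ]+\b")

def stemmable_tokens(text):
    """Return the runs the sketch would hand to a stemmer."""
    return ESPERANTO_WORD.findall(text)

print(stemmable_tokens("Volfgango Amadeo Mozarto"))
print(stemmable_tokens("Wolfgang"))
```

A word containing even one letter outside the set (the W in Wolfgang) fails the word-boundary test at that letter, so no part of it is matched and it passes through unstemmed.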

The stemmer also handles the Esperanto use of a dash for odd plurals (as in Wolfgang-on, I think). It's similar to how English sometimes uses an apostrophe, so that the plurals of a, b and i can be a's, b's, and i's rather than as, bs, and is, which can be confusing. The current default tokenizer breaks on dashes, so those kinds of tokens don't exist in this corpus.

The stemmer also breaks up some tokens the standard tokenizer doesn't, stems them, and reassembles them. So here's becomes her's, and ethnologue.com becomes ethnologu.com. If we deploy the stemmer, we won't be using its tokenizer, so this won't be an issue (though other oddities will inevitably show up).
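The break-up, stem, and reassemble behavior can be imitated in a few lines of Python. This is only a sketch: `toy_stem` is a made-up stand-in that strips a handful of endings, not the real stemmer's rule set, and the letter pattern is my assumption about what counts as an Esperanto character.

```python
import re

# Runs of Esperanto letters bounded by word boundaries (assumed pattern).
ESPERANTO_RUN = re.compile(r"\b[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ]+\b")

def toy_stem(word):
    """Hypothetical stand-in stemmer: strip one common ending
    if at least two characters of stem remain."""
    for ending in ("ojn", "oj", "on", "o", "a", "e", "i"):
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)]
    return word

def stem_in_place(token):
    """Stem each Esperanto-letter run inside the token and leave
    everything between the runs (apostrophes, periods) untouched."""
    return ESPERANTO_RUN.sub(lambda m: toy_stem(m.group(0)), token)

print(stem_in_place("here's"))          # her's
print(stem_in_place("ethnologue.com"))  # ethnologu.com
```

The standard tokenizer keeps here's and ethnologue.com as single tokens, but the inner pass sees here, s, ethnologue, and com as separate words, which is where the odd results come from.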

Stemming Groups for Review
Below are some stemming groupings for review by speakers of Esperanto. These are tokens that would be indexed together, so searching for one would find the others. The format is stem: [count token][count token].... The stem is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the tokens being considered. It's sometimes useful for figuring out the logic the stemmer has used. The token is a token found by the language analyzer (more or less a word) and count is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters. Some errors are expected, because language is messy, though Esperanto is far less messy than most.
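For anyone who wants to work with the groups programmatically, lines in the lists below look like `akademian: [6 akademiano][2 akademianoj]`. A small Python sketch for reading one such line follows; `parse_group` is a hypothetical helper, not part of my analysis tools.

```python
import re

# One "[count token]" entry; tokens never contain "]" in these lists.
ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_group(line):
    """Split a group line into its stem and a list of (token, count) pairs."""
    stem, _, rest = line.strip().lstrip("* ").partition(": ")
    return stem, [(token, int(count)) for count, token in ENTRY.findall(rest)]

stem, members = parse_group(" * akademian: [6 akademiano][2 akademianoj]")
print(stem, members)
```

Anything after the last bracket, such as an annotation on the line, is ignored because only bracketed entries are matched.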

About 10.3% of unique words and 17.9% of all words in the Wikipedia corpus end up stemming together with another word. In the Wiktionary corpus, it's about 9.6% of unique words and 10.0% of all words that end up stemming together with another word.
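The two kinds of percentage can be computed along the following lines. This is a sketch under assumptions: `merge_rates` is a hypothetical name, "unique words" is taken to mean distinct token strings before any case folding, and the example data and stem function are invented.

```python
from collections import Counter, defaultdict

def merge_rates(token_counts, stem):
    """Return (share of types, share of tokens) whose stem is shared
    with at least one other type."""
    groups = defaultdict(list)
    for token in token_counts:
        groups[stem(token)].append(token)
    # Types that landed in a group with more than one member.
    merged = [t for members in groups.values() if len(members) > 1 for t in members]
    type_rate = len(merged) / len(token_counts)
    token_rate = sum(token_counts[t] for t in merged) / sum(token_counts.values())
    return type_rate, token_rate

# Invented mini-corpus; the lambda is a stand-in for a real stemmer.
counts = Counter({"klavoj": 3, "klavojn": 2, "kaj": 5})
print(merge_rates(counts, lambda t: t[:4]))
```

The type rate counts each distinct word once, while the token rate weights each word by its frequency, which is why the two figures can diverge when the merged words are unusually common or rare.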

Large Groups
The largest groups are more likely to have resulted from unrelated words being stemmed together, though none of these are particularly large. They are also often just more common words.

These are the ten largest groups from the Wikipedia corpus, and the groups with 10 or more distinct forms from the Wiktionary corpus.

Wikipedia:


 * est: [1 EST][4 Est][1 Esta][6 Estante][192 Estas][4 Este][3 Esti][2 Estinta][65 Estis][1 Esto][1 Estonto][1 Estos][7 Estu][3 Estus][13 est][7 esta][1 estaj][1 estan][12 estanta][2 estantaj][34 estante][14359 estas][611 esti][18 estinta][3 estintaj][4 estinte][1 estintus][7868 estis][1 estita][11 esto][2 eston][38 estonta][10 estontaj][1 estontajn][6 estontan][7 estonte][4 estonto][1 estonton][89 estos][112 estu][122 estus]
 * far: [5 Far][1 Fare][1 Fari][2 Farita][6 Faro][2 Faru][38 far][2 faranta][1 farantaj][8 farante][1 faranto][81 faras][14 farata][4 farataj][344 fare][112 fari][1 farinte][1 farinto][222 faris][82 farita][36 faritaj][6 faritajn][3 faritan][9 faritis][3 faro][6 faroj][1 farojn][2 faron][1 faronta][4 faros][1 farota][7 faru][2 farus]
 * form: [1 FORMAS][1 Form][1 Forma][1 Forman][1 Formanto][5 Formato][1 Formatoj][2 Formo][2 Formoj][5 forma][2 formaj][3 forman][3 formanta][1 formantaj][1 formantajn][15 formante][4 formantoj][1 formantojn][110 formas][7 formata][1 formatan][18 formato][1 formatoj][3 forme][30 formi][1 forminta][41 formis][27 formita][8 formitaj][1 formitan][268 formo][70 formoj][24 formojn][46 formon][1 formos]
 * kant: [10 Kant][1 Kantate][1 Kantaten][4 Kantato][13 Kanto][1 Kantoj][1 Kantu][3 kanta][1 kantaj][2 kantajn][2 kantanta][3 kantantaj][6 kantante][1 kantanto][3 kantantoj][1 kantanton][22 kantas][6 kantata][5 kantataj][1 kantatan][3 kantato][3 kantatoj][12 kanti][1 kantintaj][29 kantis][5 kantita][97 kanto][67 kantoj][23 kantojn][15 kanton][1 kantus]
 * lud: [1 Ludanta][1 Ludante][4 Ludanto][2 Ludantoj][1 Ludo][7 Ludoj][2 Ludojn][1 Ludon][1 lud][3 luda][1 ludaj][11 ludanta][3 ludantaj][1 ludantajn][1 ludantan][3 ludante][62 ludanto][35 ludantoj][3 ludantojn][1 ludanton][57 ludas][9 ludata][2 ludataj][1 ludatas][25 ludi][150 ludis][8 ludita][2 luditaj][1 luditajn][141 ludo][61 ludoj][6 ludojn][21 ludon][3 ludos][1 ludu]
 * mort: [3 MORTON][2 Mort][6 Morta][1 Mortas][1 Mortintaj][7 Mortis][6 Morto][6 Morton][4 mort][10 morta][4 mortaj][1 mortan][1 mortanta][1 mortantaj][1 mortanto][1 mortantoj][1 mortanton][14 mortas][3 morte][5 morti][135 mortinta][13 mortintaj][1 mortintan][1 mortinte][6 mortinto][17 mortintoj][6 mortintojn][1 mortinton][611 mortis][1 mortitoj][247 morto][8 mortoj][2 mortojn][15 morton][1 mortonta][2 mortos][1 mortu][1 mortus]
 * nom: [4 NOM][1 Noman][3 Nome][1 Nomi][1 Nomis][2 Nomita][7 Nomo][1 Nomoj][1 Nomon][5 nom][3 noma][1 nomajn][1 nomantaj][2 nomante][97 nomas][296 nomata][72 nomataj][3 nomatajn][16 nomatan][11 nomatas][1 nomate][382 nome][16 nomi][1 nominta][64 nomis][345 nomita][28 nomitaj][2 nomitajn][11 nomitan][2 nomite][2 nomitis][928 nomo][74 nomoj][33 nomojn][321 nomon][1 nomos]
 * sekv: [1 Sekva][1 Sekvajn][1 Sekvanta][1 Sekvantan][4 Sekvante][9 Sekvas][1 Sekvata][49 Sekve][1 Sekvintaj][25 Sekvis][1 Sekvoj][1 Sekvos][1 sekv][82 sekva][80 sekvaj][5 sekvajn][14 sekvan][21 sekvanta][37 sekvantaj][2 sekvantajn][10 sekvantan][8 sekvante][9 sekvanto][11 sekvantoj][2 sekvantojn][1 sekvanton][46 sekvas][13 sekvata][1 sekvataj][1 sekvatajn][113 sekve][8 sekvi][4 sekvinta][1 sekvintaj][1 sekvintajn][1 sekvintan][86 sekvis][6 sekvita][2 sekvitaj][2 sekvite][10 sekvo][10 sekvoj][3 sekvojn][3 sekvon][3 sekvonta][3 sekvontaj][3 sekvontan][3 sekvos][1 sekvota][2 sekvu]
 * uz: [1 Uz][2 Uzante][1 Uzas][3 Uzo][1 uzan][8 uzanta][1 uzantaj][30 uzante][8 uzanto][11 uzantoj][3 uzantojn][1 uzanton][185 uzas][265 uzata][88 uzataj][2 uzatajn][3 uzatan][36 uzatas][1 uzate][2 uzatis][83 uzi][1 uzintaj][1 uzintus][132 uzis][78 uzita][27 uzitaj][2 uzitan][1 uzitis][83 uzo][13 uzoj][29 uzon][1 uzonto][8 uzos][1 uzota][1 uzotaj][1 uzotajn][6 uzu][3 uzus]
 * vid: [5 VIDA][2 Vida][1 Vidanto][1 Vidita][3 Vido][1 Vidoj][49 Vidu][4 vid][13 vida][5 vidaj][1 vidan][1 vidanta][2 vidanto][39 vidas][23 vidata][10 vidataj][1 vidatajn][1 vidate][3 vide][46 vidi][1 vidinta][1 vidinto][1 vidintoj][54 vidis][10 vidita][5 viditaj][1 viditajn][1 viditan][9 vido][1 vidojn][5 vidon][2 vidos][119 vidu]

Wiktionary:


 * dir: [3 Dir][7 dir][3 diras][2 dire][10 diri][1 dirinte][5 diris][1 dirite][1 diros][2 diru][1 dirus]
 * est: [2 Estas][1 Estate][1 Esti][1 Esto][5 est][6 esta][2 estante][128 estas][2 estate][5 este][62 esti][1 estinta][32 estis][1 esto][2 estonta][1 estonte][10 estos][1 estu][11 estus]
 * far: [1 Fari][3 far][3 faras][1 farata][1 fare][24 fari][1 faris][2 farita][1 faritas][3 faro][2 farota][1 faru]
 * hav: [1 hava][8 havanta][1 havanto][21 havas][3 have][20 havi][3 havis][2 havo][2 havos][1 havu][1 havus]
 * help: [2 help][3 helpa][2 helpanto][1 helpas][3 helpe][9 helpi][2 helpis][11 helpo][3 helpon][1 helpos][1 helpu][1 helpus]
 * uz: [2 Uzata][4 uzas][14 uzata][1 uzataj][4 uzi][2 uzis][2 uzita][5 uzo][1 uzon][1 uzos][1 uzu]

Random Groups
Here are 25 groups each from the Wikipedia and Wiktionary corpora, chosen at random. These are likely to be more representative of the performance of the stemmer, though weird stuff can always make it into the sample.

Wikipedia:


 * akademian: [6 akademiano][2 akademianoj]
 * bazarad: [2 bazarada][2 bazarado]
 * demokr: [9 Demokrata][1 Demokrataj][2 Demokratan][7 Demokratoj][3 Demokrito][5 demokrata][5 demokrataj][1 demokratan][1 demokrate][1 demokratoj]
 * elkonstru: [1 elkonstrui][1 elkonstruis]
 * enketist: [1 Enketisto][1 Enketistoj][5 enketistoj]
 * halopreĝej: [1 halopreĝejo][1 halopreĝejoj]
 * klav: [3 klavoj][2 klavojn]
 * lodz: [4 Lodzo][1 lodzajn]
 * mais: [1 Maise][8 Maison]
 * motor: [3 Motor][1 Motoren][2 Motoro][2 motora][1 motoraj][1 motoris][21 motoro][12 motoroj][2 motorojn][5 motoron]
 * noval: [4 Novalis][3 novalo]
 * objektivec: [1 objektiveco][1 objektivecon]
 * podlask: [2 Podlaska][1 Podlaski]
 * pop: [2 Pop][1 Pope][15 pop][1 popa]
 * prag: [6 Praga][59 Prago][4 praga][1 pragaj][1 pragan]
 * pretigad: [4 pretigado][2 pretigadon]
 * rapid: [3 Rapida][1 Rapide][5 rapid][39 rapida][9 rapidaj][2 rapidajn][12 rapidan][1 rapidanta][2 rapidas][101 rapide][1 rapidi][1 rapidis][37 rapido][6 rapidoj][1 rapidojn][6 rapidon][1 rapidu]
 * rim: [2 Rima][3 Rime][1 rima]
 * roz: [4 Roza][2 roza][1 rozaj][2 rozan][3 rozo][4 rozoj][2 rozojn][1 rozon]
 * rozkolor: [2 Rozkolora][5 rozkolora][5 rozkoloraj][3 rozkolorajn][2 rozkoloran]
 * sekvoj: [1 Sekvoja][2 sekvojo][2 sekvojoj][1 sekvojon]
 * ŝipestr: [1 ŝipestra][4 ŝipestro]
 * socorrens: [2 socorrense][5 socorrensis]
 * sulfur: [2 sulfura][4 sulfuro]
 * unujar: [3 unujara][3 unujaraj]

Wiktionary:


 * anĉ: [2 anĉo][1 anĉojn]
 * dobr: [2 dobro][1 dobroj]
 * dritt: [2 Dritte][1 dritte][3 dritten]
 * duontag: [1 duontaga][2 duontage]
 * ekonomi: [1 Ekonomia][4 ekonomia][2 ekonomio]
 * far: [1 Fari][3 far][3 faras][1 farata][1 fare][24 fari][1 faris][2 farita][1 faritas][3 faro][2 farota][1 faru]
 * fervor: [1 fervora][2 fervoran]
 * hidrogen: [1 hidrogena][1 hidrogeno]
 * ĵurnal: [1 ĵurnalo][1 ĵurnaloj]
 * kann: [1 Kann][1 Kanne][27 kann]
 * kompakt: [1 kompakt][2 kompakta][1 kompakten]
 * konsil: [1 konsilantoj][1 konsilantojn][2 konsilo][2 konsilon]
 * meksik: [3 Meksiko][2 meksika]
 * miel: [1 miel][1 miela][1 mielaj][7 mielo]
 * miokardi: [1 miokardiito][2 miokardio]
 * narcis: [1 narciso][2 narcisoj][1 narcisojn]
 * ofend: [1 ofendi][1 ofendo]
 * panteism: [1 panteismoj][1 panteismojn]
 * percept: [1 percepti][1 percepto]
 * ple: [1 Pleite][1 plea][1 pleite][1 pleito]
 * richt: [1 richt][1 richten]
 * ŝi: [5 Ŝi][1 Ŝia][8 ŝi][2 ŝia][1 ŝiaj][3 ŝian][4 ŝin] **UPDATED**—see "Stemmer Update" below
 * sopir: [1 sopir][2 sopirata][1 sopiri][2 sopiro]
 * tekst: [1 tekste][1 teksti][2 teksto][1 tekstoj]
 * to: [8 To][75 to][3 too][1 tous]

Problem Groups
There aren't any "problem" groups (those without either a common beginning or common ending substring), since the stemmer is only slicing off suffixes, and nothing in Esperanto is irregular.
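The test for a "problem" group can be sketched as below. The function names are hypothetical, and lowercasing before comparison is an assumption on my part so that forms like Ludo and ludoj still count as sharing a beginning.

```python
import os

def common_prefix(words):
    # os.path.commonprefix compares strings character by character.
    return os.path.commonprefix(words)

def common_suffix(words):
    # Reverse each word, take the common prefix, and reverse it back.
    return os.path.commonprefix([w[::-1] for w in words])[::-1]

def is_problem_group(tokens):
    """True if the tokens share neither a beginning nor an ending."""
    lowered = [t.lower() for t in tokens]
    return not common_prefix(lowered) and not common_suffix(lowered)

print(is_problem_group(["Ludo", "ludoj", "ludantoj"]))
print(is_problem_group(["go", "went"]))
```

A purely suffix-stripping stemmer always leaves a shared beginning within a group, so this check comes back empty for Esperanto; the English pair go and went shows what it would catch in a language with irregular forms.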

Stemmer Update
After getting some feedback from speakers, and reviewing some of the exceptions and corner cases, I made some suggestions to the developer, and he updated the stemmer.

In the Wikipedia corpus, out of 127,776 input words (pre-processing types) 107 changed their stems. 338 types (0.379%) and 13,622 tokens (1.348%) were involved in new mergers. 204 types (0.229%) and 8,783 tokens (0.869%) were involved in new splits.

In the Wiktionary corpus, out of 34,890 input words (pre-processing types) 82 changed their stems. 155 types (0.536%) and 801 tokens (0.881%) were involved in new mergers. 57 types (0.197%) and 523 tokens (0.575%) were involved in new splits.
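One way to count which types are involved in new mergers and new splits is sketched below. The definitions are my assumption for illustration: a type is in a new merger if its new group contains a type its old group did not, and in a new split if its old group contains a type its new group does not. The function names and the stems in the example are hypothetical.

```python
from collections import defaultdict

def group_of(stems):
    """Map each token to the frozenset of tokens sharing its stem."""
    members = defaultdict(set)
    for token, stem in stems.items():
        members[stem].add(token)
    return {token: frozenset(members[stem]) for token, stem in stems.items()}

def merged_and_split(old_stems, new_stems):
    """Both arguments map the same tokens to their old and new stems."""
    old, new = group_of(old_stems), group_of(new_stems)
    merged = {t for t in old if new[t] - old[t]}  # gained a group-mate
    split = {t for t in old if old[t] - new[t]}   # lost a group-mate
    return merged, split

# Illustrative stems only, loosely modeled on the vin/vino case.
old = {"vi": "vi", "vin": "vin", "vino": "vin"}
new = {"vi": "vi", "vin": "vi", "vino": "vin"}
print(merged_and_split(old, new))
```

Token-level figures would follow by summing corpus frequencies over the types in each set. Note that a single type can be in both sets, as vin is here.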

Since most of the words affected were pronouns, determiners, prepositions and such, each change had a relatively large impact (more tokens affected than types).

On manual review of the changed groups, I see there are some very good splits, like separating vin (the accusative of vi, plural "you") from vino ("wine"), and a few bad collisions, like non-Esperanto fin matching fi (interjection: "for shame!", though in Wikipedia it's more likely something related to wi-fi, hi-fi, or Finland). A portion of the changes only happened because of the tokenization the command-line stemmer does. Overall it looks like an improvement.

The only change in the samples is for the ŝi group in the Wiktionary random groups, which added Ŝia, ŝia, and ŝin.

Next Steps

 * Get speaker review of the stemming groups.
 * Assuming the review is positive, convert the stemmer into an Elasticsearch plugin, create an analysis config for it, and test that (which should show only minor differences from the analysis here, mostly due to tokenization differences).
 * Deploy the plugin and plugin-dependent config.
 * Once the config and plugin are deployed, reindex Esperanto-language wikis with the new analyzer.
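
As a rough idea of what the analysis config in the second step might contain, here is a sketch that builds index settings as a Python dictionary. The standard tokenizer and the icu_normalizer filter (from the ICU analysis plugin) match the default setup described above; the stemmer filter name `esperanto_stem` and the analyzer name `text` are placeholders, since the plugin does not exist yet.

```python
import json

def analysis_settings(stem_filter="esperanto_stem"):
    """Build analysis settings for a custom analyzer.
    `stem_filter` is a placeholder for whatever token filter name
    the future plugin registers."""
    return {
        "analysis": {
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # ICU normalization first, then the stemmer.
                    "filter": ["icu_normalizer", stem_filter],
                }
            }
        }
    }

print(json.dumps(analysis_settings(), indent=2))
```

Because tokenization would then be handled by the standard tokenizer rather than the stemmer's own, the here's and ethnologue.com oddities noted earlier should not carry over.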