User:TJones (WMF)/Notes/Esperanto Stemmer Analysis

June/July 2018 — See TJones_(WMF)/Notes for other projects. See also T197240.

Background
Esperanto is a bit further down the list of the remaining top 50 languages to look at (T171652), but it jumped to the top because I had a developer ask me to recommend a project to work on, and I suggested an Esperanto stemmer. As a constructed language, Esperanto is very regular and reasonably well documented, so the barrier to implementing a stemmer is much lower for a non-speaker.

It's now available on GitHub; it's in Java and has a GPL3 license (the structure is based on the Serbian stemmer), so all of the technical details are in good shape. The next step is to get a review of the stemming quality from speakers.

Data
I pulled 5,000 random articles from the Esperanto Wikipedia and 5,000 entries from the Esperanto Wiktionary, and did my usual stripping of markup and deduplication of individual lines (to get rid of excess copies of the equivalent of commonly used headings like "References", "See Also", "Noun", language names, etc.).

On Tokenization
Since this is an external stemmer, I used the default analyzer configuration (standard tokenizer and ICU normalization) to break the text into tokens, and then passed them to the stemmer in a file. The stemmer does its own round of tokenization when reading a file, creating some discrepancies.

The stemmer tokenizes strings of Esperanto characters with word boundaries at either end, which means that it does not stem words with q, w, x, y, or any letters with non-Esperanto diacritics. Esperanto Wikipedia generally translates names (Wolfgang Amadeus Mozart becomes Volfgango Amadeo Mozarto) but some still slip through (the Mozart article includes Wolfgang-on).
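As a rough illustration of that behavior, here is a minimal Python sketch of a tokenizer that only accepts runs of Esperanto letters bounded by word boundaries. The names `ESPERANTO_LETTERS`, `ESPERANTO_WORD`, and `esperanto_only_tokens` are made up for this example, and the regular expression is an assumption about the general approach, not the stemmer's actual Java code (it ignores the dash handling described below, for instance).

```python
import re

# The Esperanto alphabet: basic Latin letters without q, w, x, y,
# plus the six letters that carry diacritics.
ESPERANTO_LETTERS = "abcdefghijklmnoprstuvzĉĝĥĵŝŭ"

# A run of Esperanto letters with a word boundary at either end.
# (Assumed pattern for illustration; not taken from the stemmer source.)
ESPERANTO_WORD = re.compile(
    r"\b[" + ESPERANTO_LETTERS + ESPERANTO_LETTERS.upper() + r"]+\b"
)


def esperanto_only_tokens(text):
    """Return the words in text made up entirely of Esperanto letters."""
    return ESPERANTO_WORD.findall(text)


if __name__ == "__main__":
    print(esperanto_only_tokens("Volfgango Amadeo Mozarto"))
    print(esperanto_only_tokens("Wolfgang kaj Zürich"))
```

Under this sketch, Wolfgang and Zürich are skipped entirely because the w and the ü interrupt the run of Esperanto letters without creating a word boundary.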

The stemmer also handles the Esperanto use of a dash for odd plurals (as in Wolfgang-on, I think). It's similar to how English sometimes uses an apostrophe, so that the plurals of a, b and i can be a's, b's, and i's rather than as, bs, and is, which can be confusing. The current default tokenizer breaks on dashes, so those kinds of tokens don't exist in this corpus.

The stemmer also breaks up some tokens the standard tokenizer doesn't, stems them, and reassembles them. So here's becomes her's, and ethnologue.com becomes ethnologu.com. If we deploy the stemmer, we won't be using its tokenizer, so this won't be an issue (though other oddities will inevitably show up).

Stemming Groups for Review
Below are some stemming groupings for review by speakers of Esperanto. These are tokens that would be indexed together, so searching for one would find the others. The format is stem: [count token][count token].... The stem is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the tokens being considered. It's sometimes useful for figuring out the logic the stemmer has used. The token is a token found by the language analyzer (more or less a word) and count is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters. Some errors are expected, because language is messy, though Esperanto is far less messy than most.
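For anyone who wants to process the group listings programmatically, the following Python sketch parses one line of the kind shown in the lists below (for example " * klav: [3 klavoj][2 klavojn]") into a stem and a list of (count, token) pairs. The function name `parse_group` and both regular expressions are assumptions made for this example; they are inferred from how the groups are printed on this page.

```python
import re

# Optional list bullet, then the stem, a colon, and the bracketed entries.
GROUP_LINE = re.compile(r"^\s*\*?\s*(?P<stem>[^:\s]+):\s*(?P<body>.*)$")
# One bracketed entry: a count, a space, and a token.
ENTRY = re.compile(r"\[(\d+) ([^\]\s]+)\]")


def parse_group(line):
    """Split a group line into (stem, [(count, token), ...])."""
    match = GROUP_LINE.match(line)
    if match is None:
        raise ValueError("not a stemming group line: %r" % line)
    entries = [(int(count), token)
               for count, token in ENTRY.findall(match.group("body"))]
    return match.group("stem"), entries


if __name__ == "__main__":
    print(parse_group(" * klav: [3 klavoj][2 klavojn]"))
```

Any trailing annotation after the last bracket is ignored, since only bracketed entries are collected.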

About 10.3% of unique words and 17.9% of all words in the Wikipedia corpus end up stemming together with another word. In the Wiktionary corpus, it's about 9.6% of unique words and 10.0% of all words that end up stemming together with another word.
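The percentages above can be computed along the following lines. This is a Python sketch under stated assumptions: `toy_stem` is a hypothetical stand-in that strips a few noun endings and is not the real stemmer, `grouping_stats` is a name invented here, and the three-word sample is illustrative rather than taken from the corpora.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for the real stemmer: strips a few noun endings."""
    word = word.lower()
    for suffix in ("ojn", "oj", "on", "o"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def grouping_stats(token_counts, stem=toy_stem):
    """Return (share of types, share of tokens) that share a stem with another type."""
    groups = defaultdict(list)
    for token in token_counts:
        groups[stem(token)].append(token)
    # Keep only types whose group has at least two distinct members.
    grouped = [t for members in groups.values() if len(members) > 1 for t in members]
    type_share = len(grouped) / len(token_counts)
    token_share = sum(token_counts[t] for t in grouped) / sum(token_counts.values())
    return type_share, token_share


if __name__ == "__main__":
    sample = {"klavoj": 3, "klavojn": 2, "kaj": 5}
    print(grouping_stats(sample))
```

In the sample, klavoj and klavojn share a stem, so two of three types and five of ten tokens count as grouped.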

Large Groups
The largest groups are more likely to have resulted from unrelated words being stemmed together, though none of these are particularly large. They are also often just more common words.

These are the ten largest groups from the Wikipedia corpus, and the groups with 10 or more distinct forms from the Wiktionary corpus.

Wikipedia:


 * est: [1 EST][4 Est][1 Esta][6 Estante][192 Estas][4 Este][3 Esti][2 Estinta][65 Estis][1 Esto][1 Estonto][1 Estos][7 Estu][3 Estus][13 est][7 esta][1 estaj][1 estan][12 estanta][2 estantaj][34 estante][14359 estas][611 esti][18 estinta][3 estintaj][4 estinte][1 estintus][7868 estis][1 estita][11 esto][2 eston][38 estonta][10 estontaj][1 estontajn][6 estontan][7 estonte][4 estonto][1 estonton][89 estos][112 estu][122 estus]
 * far: [5 Far][1 Fare][1 Fari][2 Farita][6 Faro][2 Faru][38 far][2 faranta][1 farantaj][8 farante][1 faranto][81 faras][14 farata][4 farataj][344 fare][112 fari][1 farinte][1 farinto][222 faris][82 farita][36 faritaj][6 faritajn][3 faritan][9 faritis][3 faro][6 faroj][1 farojn][2 faron][1 faronta][4 faros][1 farota][7 faru][2 farus]
 * form: [1 FORMAS][1 Form][1 Forma][1 Forman][1 Formanto][5 Formato][1 Formatoj][2 Formo][2 Formoj][5 forma][2 formaj][3 forman][3 formanta][1 formantaj][1 formantajn][15 formante][4 formantoj][1 formantojn][110 formas][7 formata][1 formatan][18 formato][1 formatoj][3 forme][30 formi][1 forminta][41 formis][27 formita][8 formitaj][1 formitan][268 formo][70 formoj][24 formojn][46 formon][1 formos]
 * kant: [10 Kant][1 Kantate][1 Kantaten][4 Kantato][13 Kanto][1 Kantoj][1 Kantu][3 kanta][1 kantaj][2 kantajn][2 kantanta][3 kantantaj][6 kantante][1 kantanto][3 kantantoj][1 kantanton][22 kantas][6 kantata][5 kantataj][1 kantatan][3 kantato][3 kantatoj][12 kanti][1 kantintaj][29 kantis][5 kantita][97 kanto][67 kantoj][23 kantojn][15 kanton][1 kantus]
 * lud: [1 Ludanta][1 Ludante][4 Ludanto][2 Ludantoj][1 Ludo][7 Ludoj][2 Ludojn][1 Ludon][1 lud][3 luda][1 ludaj][11 ludanta][3 ludantaj][1 ludantajn][1 ludantan][3 ludante][62 ludanto][35 ludantoj][3 ludantojn][1 ludanton][57 ludas][9 ludata][2 ludataj][1 ludatas][25 ludi][150 ludis][8 ludita][2 luditaj][1 luditajn][141 ludo][61 ludoj][6 ludojn][21 ludon][3 ludos][1 ludu]
 * mort: [3 MORTON][2 Mort][6 Morta][1 Mortas][1 Mortintaj][7 Mortis][6 Morto][6 Morton][4 mort][10 morta][4 mortaj][1 mortan][1 mortanta][1 mortantaj][1 mortanto][1 mortantoj][1 mortanton][14 mortas][3 morte][5 morti][135 mortinta][13 mortintaj][1 mortintan][1 mortinte][6 mortinto][17 mortintoj][6 mortintojn][1 mortinton][611 mortis][1 mortitoj][247 morto][8 mortoj][2 mortojn][15 morton][1 mortonta][2 mortos][1 mortu][1 mortus]
 * nom: [4 NOM][1 Noman][3 Nome][1 Nomi][1 Nomis][2 Nomita][7 Nomo][1 Nomoj][1 Nomon][5 nom][3 noma][1 nomajn][1 nomantaj][2 nomante][97 nomas][296 nomata][72 nomataj][3 nomatajn][16 nomatan][11 nomatas][1 nomate][382 nome][16 nomi][1 nominta][64 nomis][345 nomita][28 nomitaj][2 nomitajn][11 nomitan][2 nomite][2 nomitis][928 nomo][74 nomoj][33 nomojn][321 nomon][1 nomos]
 * sekv: [1 Sekva][1 Sekvajn][1 Sekvanta][1 Sekvantan][4 Sekvante][9 Sekvas][1 Sekvata][49 Sekve][1 Sekvintaj][25 Sekvis][1 Sekvoj][1 Sekvos][1 sekv][82 sekva][80 sekvaj][5 sekvajn][14 sekvan][21 sekvanta][37 sekvantaj][2 sekvantajn][10 sekvantan][8 sekvante][9 sekvanto][11 sekvantoj][2 sekvantojn][1 sekvanton][46 sekvas][13 sekvata][1 sekvataj][1 sekvatajn][113 sekve][8 sekvi][4 sekvinta][1 sekvintaj][1 sekvintajn][1 sekvintan][86 sekvis][6 sekvita][2 sekvitaj][2 sekvite][10 sekvo][10 sekvoj][3 sekvojn][3 sekvon][3 sekvonta][3 sekvontaj][3 sekvontan][3 sekvos][1 sekvota][2 sekvu]
 * uz: [1 Uz][2 Uzante][1 Uzas][3 Uzo][1 uzan][8 uzanta][1 uzantaj][30 uzante][8 uzanto][11 uzantoj][3 uzantojn][1 uzanton][185 uzas][265 uzata][88 uzataj][2 uzatajn][3 uzatan][36 uzatas][1 uzate][2 uzatis][83 uzi][1 uzintaj][1 uzintus][132 uzis][78 uzita][27 uzitaj][2 uzitan][1 uzitis][83 uzo][13 uzoj][29 uzon][1 uzonto][8 uzos][1 uzota][1 uzotaj][1 uzotajn][6 uzu][3 uzus]
 * vid: [5 VIDA][2 Vida][1 Vidanto][1 Vidita][3 Vido][1 Vidoj][49 Vidu][4 vid][13 vida][5 vidaj][1 vidan][1 vidanta][2 vidanto][39 vidas][23 vidata][10 vidataj][1 vidatajn][1 vidate][3 vide][46 vidi][1 vidinta][1 vidinto][1 vidintoj][54 vidis][10 vidita][5 viditaj][1 viditajn][1 viditan][9 vido][1 vidojn][5 vidon][2 vidos][119 vidu]

Wiktionary:


 * dir: [3 Dir][7 dir][3 diras][2 dire][10 diri][1 dirinte][5 diris][1 dirite][1 diros][2 diru][1 dirus]
 * est: [2 Estas][1 Estate][1 Esti][1 Esto][5 est][6 esta][2 estante][128 estas][2 estate][5 este][62 esti][1 estinta][32 estis][1 esto][2 estonta][1 estonte][10 estos][1 estu][11 estus]
 * far: [1 Fari][3 far][3 faras][1 farata][1 fare][24 fari][1 faris][2 farita][1 faritas][3 faro][2 farota][1 faru]
 * hav: [1 hava][8 havanta][1 havanto][21 havas][3 have][20 havi][3 havis][2 havo][2 havos][1 havu][1 havus]
 * help: [2 help][3 helpa][2 helpanto][1 helpas][3 helpe][9 helpi][2 helpis][11 helpo][3 helpon][1 helpos][1 helpu][1 helpus]
 * uz: [2 Uzata][4 uzas][14 uzata][1 uzataj][4 uzi][2 uzis][2 uzita][5 uzo][1 uzon][1 uzos][1 uzu]

Random Groups
Here are 25 groups each from the Wikipedia and Wiktionary corpora, chosen at random. These are likely to be more representative of the performance of the stemmer, though weird stuff can always make it into the sample.

Wikipedia:


 * akademian: [6 akademiano][2 akademianoj]
 * bazarad: [2 bazarada][2 bazarado]
 * demokr: [9 Demokrata][1 Demokrataj][2 Demokratan][7 Demokratoj][3 Demokrito][5 demokrata][5 demokrataj][1 demokratan][1 demokrate][1 demokratoj]
 * elkonstru: [1 elkonstrui][1 elkonstruis]
 * enketist: [1 Enketisto][1 Enketistoj][5 enketistoj]
 * halopreĝej: [1 halopreĝejo][1 halopreĝejoj]
 * klav: [3 klavoj][2 klavojn]
 * lodz: [4 Lodzo][1 lodzajn]
 * mais: [1 Maise][8 Maison]
 * motor: [3 Motor][1 Motoren][2 Motoro][2 motora][1 motoraj][1 motoris][21 motoro][12 motoroj][2 motorojn][5 motoron]
 * noval: [4 Novalis][3 novalo]
 * objektivec: [1 objektiveco][1 objektivecon]
 * podlask: [2 Podlaska][1 Podlaski]
 * pop: [2 Pop][1 Pope][15 pop][1 popa]
 * prag: [6 Praga][59 Prago][4 praga][1 pragaj][1 pragan]
 * pretigad: [4 pretigado][2 pretigadon]
 * rapid: [3 Rapida][1 Rapide][5 rapid][39 rapida][9 rapidaj][2 rapidajn][12 rapidan][1 rapidanta][2 rapidas][101 rapide][1 rapidi][1 rapidis][37 rapido][6 rapidoj][1 rapidojn][6 rapidon][1 rapidu]
 * rim: [2 Rima][3 Rime][1 rima]
 * roz: [4 Roza][2 roza][1 rozaj][2 rozan][3 rozo][4 rozoj][2 rozojn][1 rozon]
 * rozkolor: [2 Rozkolora][5 rozkolora][5 rozkoloraj][3 rozkolorajn][2 rozkoloran]
 * sekvoj: [1 Sekvoja][2 sekvojo][2 sekvojoj][1 sekvojon]
 * ŝipestr: [1 ŝipestra][4 ŝipestro]
 * socorrens: [2 socorrense][5 socorrensis]
 * sulfur: [2 sulfura][4 sulfuro]
 * unujar: [3 unujara][3 unujaraj]

Wiktionary:


 * anĉ: [2 anĉo][1 anĉojn]
 * dobr: [2 dobro][1 dobroj]
 * dritt: [2 Dritte][1 dritte][3 dritten]
 * duontag: [1 duontaga][2 duontage]
 * ekonomi: [1 Ekonomia][4 ekonomia][2 ekonomio]
 * far: [1 Fari][3 far][3 faras][1 farata][1 fare][24 fari][1 faris][2 farita][1 faritas][3 faro][2 farota][1 faru]
 * fervor: [1 fervora][2 fervoran]
 * hidrogen: [1 hidrogena][1 hidrogeno]
 * ĵurnal: [1 ĵurnalo][1 ĵurnaloj]
 * kann: [1 Kann][1 Kanne][27 kann]
 * kompakt: [1 kompakt][2 kompakta][1 kompakten]
 * konsil: [1 konsilantoj][1 konsilantojn][2 konsilo][2 konsilon]
 * meksik: [3 Meksiko][2 meksika]
 * miel: [1 miel][1 miela][1 mielaj][7 mielo]
 * miokardi: [1 miokardiito][2 miokardio]
 * narcis: [1 narciso][2 narcisoj][1 narcisojn]
 * ofend: [1 ofendi][1 ofendo]
 * panteism: [1 panteismoj][1 panteismojn]
 * percept: [1 percepti][1 percepto]
 * ple: [1 Pleite][1 plea][1 pleite][1 pleito]
 * richt: [1 richt][1 richten]
 * ŝi: [5 Ŝi][1 Ŝia][8 ŝi][2 ŝia][1 ŝiaj][3 ŝian][4 ŝin] **UPDATED**—see "Stemmer Update" below
 * sopir: [1 sopir][2 sopirata][1 sopiri][2 sopiro]
 * tekst: [1 tekste][1 teksti][2 teksto][1 tekstoj]
 * to: [8 To][75 to][3 too][1 tous]

Problem Groups
There aren't any "problem" groups (those without either a common beginning or common ending substring), since the stemmer is only slicing off suffixes, and nothing in Esperanto is irregular.
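The check described here can be sketched in a few lines of Python. The function names are invented for this example, and lowercasing before comparison is an assumption about how case is handled, not a description of the actual analysis tools.

```python
import os


def common_prefix(words):
    """Longest string that every word starts with."""
    return os.path.commonprefix(list(words))


def common_suffix(words):
    """Longest string that every word ends with."""
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]


def is_problem_group(tokens):
    """True if the tokens share neither a beginning nor an ending substring."""
    lowered = [t.lower() for t in tokens]
    return not common_prefix(lowered) and not common_suffix(lowered)


if __name__ == "__main__":
    print(is_problem_group(["Ŝi", "ŝia", "ŝin"]))   # a group from this page
    print(is_problem_group(["go", "went"]))          # English irregular, for contrast
```

A purely suffix-stripping stemmer always leaves a shared beginning, which is why no group on this page would be flagged.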

Stemmer Update
After getting some feedback from speakers, and reviewing some of the exceptions and corner cases, I made some suggestions to the developer, and he updated the stemmer.

In the Wikipedia corpus, out of 127,776 input words (pre-processing types) 107 changed their stems. 338 types (0.379%) and 13,622 tokens (1.348%) were involved in new mergers. 204 types (0.229%) and 8,783 tokens (0.869%) were involved in new splits.

In the Wiktionary corpus, out of 34,890 input words (pre-processing types) 82 changed their stems. 155 types (0.536%) and 801 tokens (0.881%) were involved in new mergers. 57 types (0.197%) and 523 tokens (0.575%) were involved in new splits.
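One way to arrive at merger and split counts like these is to compare each type's group before and after the update. The Python sketch below assumes one plausible definition: a type is involved in a new merger if its new group contains a type that its old group did not, and in a new split if its old group contains a type that its new group does not. The function names and the stems in the example are hypothetical, and the real comparison tools may define these terms differently.

```python
from collections import defaultdict


def groups_by_type(stems):
    """Map each type to the set of types that share its stem."""
    members = defaultdict(set)
    for word, stem in stems.items():
        members[stem].add(word)
    return {word: members[stem] for word, stem in stems.items()}


def mergers_and_splits(old_stems, new_stems):
    """Return (types in new mergers, types in new splits).

    Both arguments map the same set of types to their stems.
    """
    old_groups = groups_by_type(old_stems)
    new_groups = groups_by_type(new_stems)
    merged = {w for w in old_stems if new_groups[w] - old_groups[w]}
    split = {w for w in old_stems if old_groups[w] - new_groups[w]}
    return merged, split


if __name__ == "__main__":
    # Hypothetical stems for illustration only.
    old = {"vi": "vi", "vin": "vin", "vino": "vin"}
    new = {"vi": "vi", "vin": "vi", "vino": "vin"}
    print(mergers_and_splits(old, new))
```

Token counts would follow by summing each affected type's corpus frequency.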

Since most of the words affected were pronouns, determiners, prepositions and such, each change had a relatively large impact (more tokens affected than types).

On manual review of the changed groups, I see there are some very good splits, like separating vin (the accusative of vi, plural "you") from vino ("wine"), and a few bad collisions, like non-Esperanto fin matching fi (interjection: "for shame!", though in Wikipedia it's more likely something related to wi-fi, hi-fi, or Finland). A portion of the changes only happened because of the tokenization the command-line stemmer does. Overall it looks like an improvement.

The only change in the samples is for the ŝi group in the Wiktionary random groups, which added Ŝia, ŝia, and ŝin.

Background
As part of the suggested improvements to the Esperanto stemmer, the list of exceptions was normalized to the standard spelling. See the Wikipedia article on Esperanto orthography for full details, but the highlight is that Esperanto has six letters that use diacritics (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ), and there are two systems in use to replace them when the diacritical characters are not available.

The h-system uses ch, gh, hh, jh, sh, and u. Using u for ŭ is ambiguous, and rarer ambiguities arise from using h, which is also a normal letter in Esperanto.

The x-system uses cx, gx, hx, jx, sx, and ux. Since x is not a normal letter in Esperanto, this is unambiguous for Esperanto words.
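A minimal Python sketch of the two conversions might look like the following. The table and function names are made up for this example, the case handling (capitalize the result when the digraph starts with a capital) is an assumption, and h-system u is left alone because it cannot be told apart from an ordinary u.

```python
import re

X_SYSTEM = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}
# No entry for u -> ŭ: a plain u is ambiguous in the h-system.
H_SYSTEM = {"ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}


def _convert(text, table):
    """Replace every digraph in table with its diacritical letter."""
    pattern = re.compile("|".join(table), re.IGNORECASE)

    def replace(match):
        digraph = match.group(0)
        letter = table[digraph.lower()]
        # Assumed rule: keep the case of the digraph's first letter.
        return letter.upper() if digraph[0].isupper() else letter

    return pattern.sub(replace, text)


def convert_x_system(text):
    return _convert(text, X_SYSTEM)


def convert_h_system(text):
    return _convert(text, H_SYSTEM)


if __name__ == "__main__":
    for word in ("naskigxis", "auxtoro", "Luxemburg"):
        print(word, "->", convert_x_system(word))
    for word in ("Marsh", "lernejhistoriisto"):
        print(word, "->", convert_h_system(word))
```

The conversion is purely mechanical, so it also rewrites non-Esperanto words such as Luxemburg and Marsh, and it misfires on compounds where the h belongs to the next element, as in lernejhistoriisto.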

I wanted to test what happens if we convert apparent h-system and x-system digraphs into their diacritical forms. Of course this will impact non-Esperanto words, too.

Conversion Stats
(N.B.: In the stats below, token percentages are approximate because my tools aren't up to the task of counting precisely. If more exact counting is required, I can work on new tools, but the approximate counts below are definitely the right order of magnitude.)

In the Wikipedia corpus (5,000 articles), there were only 127 types with x-system conversions (< 0.025% of tokens): 117 ux, 4 sx, 3 gx, 2 cx, and 1 hx. The ux conversions are mostly French words; others include a Roman numeral, some ID number-like strings, and three Esperanto words: cxinio, logxis, and naskigxis.

In the Wiktionary corpus (5,000 entries), there were only 13 types with x-system conversions (< 0.025% of tokens): 11 ux, 1 jx, and 1 sx. Mostly lux- words, some French and English, and Esperanto idiomajxa.

The h-system conversions are much more numerous, and ch, which occurs commonly in English, French, and German words, predominates. Wikipedia corpus: 3,353 ch, 613 sh, 285 gh, 16 jh, 5 hh (< 1% of tokens); Wiktionary corpus: 3,416 ch, 104 sh, 42 gh, 8 hh, 2 jh (~7% of tokens). There were too many to review carefully, but as expected in the ch and sh instances, names and English, French, and German words predominated, though there were a few obvious examples of dis- + an Esperanto word starting with h. Instances of gh were similar, with a few obvious Esperanto-ized names with appropriate gh in them, like Ghatoj and Ŝanghajo (though Ŝanhajo seems to be preferred). hh is mostly German compounds. jh has some names and Hungarian words, but also at least one Esperanto compound: lernejhistoriisto.
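Per-digraph type counts like the ones above can be tallied with a short script. The Python sketch below is an assumption about how such counting might be done, not the tooling actually used; `digraph_type_counts` is a hypothetical name, and a type containing two different digraphs is counted once under each.

```python
import re
from collections import Counter

X_DIGRAPHS = ("cx", "gx", "hx", "jx", "sx", "ux")
H_DIGRAPHS = ("ch", "gh", "hh", "jh", "sh")


def digraph_type_counts(types, digraphs):
    """Count how many distinct types contain each digraph (case-insensitive)."""
    pattern = re.compile("|".join(digraphs), re.IGNORECASE)
    counts = Counter()
    for word in set(types):
        # Count each digraph at most once per type.
        found = {m.group(0).lower() for m in pattern.finditer(word)}
        for digraph in found:
            counts[digraph] += 1
    return counts


if __name__ == "__main__":
    sample = ["cxinio", "logxis", "naskigxis", "deux", "Luxemburg", "kanto"]
    print(digraph_type_counts(sample, X_DIGRAPHS))
```

The sample words are ones mentioned on this page; the counts it prints describe only that sample, not either corpus.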

A small number of words across both corpora got both h-system and x-system conversions when I ran both at once. They all look French or German to me: chabroux, charroux, châteaux, chaux, grenicheux, michaux, from Wikipedia; luxusdämchen, luxusmädchen from Wiktionary.

I also pulled 4 weeks' worth of queries from Esperanto Wikipedia, which came out to 9,745 queries, made up of 24,321 words. X-system conversions: 35 sx, 22 ux, 18 cx, 9 jx, 9 gx, 7 hx (< 0.5% of tokens), all predominantly Esperanto words! H-system conversions: 538 ch, 340 sh, 195 gh, 13 jh, 10 hh (~4.5% of tokens). These are less obviously Esperanto. After a quick review, ch, sh, gh words are mostly names and English/French/German words. jh, hh are names, possible Esperanto words that don't match anything, and some gibberish (e.g., jhjfkshfgsjk).

Overall the impact of x-system conversions is pretty small, with most of the impact (and a lot of the mistakes) affecting ux words.

The impact of h-system conversions is bigger, especially for Wiktionary. Most h-system conversions are not Esperanto words.

Note that the search box and other input elements on Esperanto-language wikis also allow for automatic conversion of h-system and x-system input (among others). (This may cause different errors! See "Incidental Errors" below.)

H-system conversion in particular seems likely to be unhelpful, but we're here, so let's give it a try.

Impact on Stemming Groups
Using the updated stemmer as the baseline, I compared it against the updated stemmer plus automated h-system and x-system conversion.

Looking only at h-conversion:

In the Wikipedia corpus, 252 pre-analysis types (0.181%) / 663 tokens (0.066%) were added to 173 groups (0.194% of post-analysis types), affecting a total of 769 pre-analysis types (0.553%) in those groups.

For Wiktionary, 36 pre-analysis types (0.100%) / 225 tokens (0.248%) were added to 24 groups (0.083% of post-analysis types), affecting a total of 87 pre-analysis types (0.242%) in those groups.

Looking only at the x-system conversion:

Wikipedia: 15 pre-analysis types (0.011%) / 54 tokens (0.005%) were added to 13 groups (0.015% of post-analysis types), affecting a total of 108 pre-analysis types (0.078%) in those groups.

Wiktionary: 3 pre-analysis types (0.008%) / 3 tokens (0.003%) were added to 3 groups (0.01% of post-analysis types), affecting a total of 7 pre-analysis types (0.019%) in those groups.

Samples
I took a random sample of 10 from each of the four groups (Wikipedia/Wiktionary + h/x), except for the Wiktionary x-system group, which only had 3 items. I did a quick lookup of the converted and unconverted words to see if they were the same thing, or at least obviously related.

Collisions caused by x-system conversions generally seem to be okay. Some h-system conversions map to transliterations, but others don't look good, especially on Wiktionary.

Wikipedia
Probably good:
 * h-system transliteration: Iljich/Iljiĉ, Menshikov/Menŝikov, Shigeru/Ŝigeru
 * h-system (maybe): reghado/reĝado
 * x-system: auxtoro/aŭtoro, naskigxis/naskiĝis, cxinio/Ĉinio
 * x-system (maybe): SX/Ŝ
 * x-system keyboard errors: Patuxent/Patŭent, dieux/dieŭ, Michaux/Michaŭ, VERREAUX/Verreaŭ, Verreaux/Verreaŭ, verreauxi/verreaŭi

Probably bad:
 * Chikato -> Ĉik, Ĉika, ĉiKIta
 * Ghise -> Ĝis, ĝis, ĝisjn
 * gush -> guŝo, guŝoj
 * Hache -> Haĉi, Haĉo
 * Marchi -> Marĉa, Marĉo, Marĉoj, marĉa, marĉaj, marĉajn, marĉo, marĉoj, marĉojn, marĉon
 * Marsh -> Marŝo, Marŝu, marŝanta, marŝas, marŝi, marŝis, marŝo, marŝoj, marŝojn, marŝon

Wiktionary
Probably good:
 * x-system: idiomajxa/idiomaĵa
 * x-system keyboard errors: deux/deŭ, Luxemburg/Lŭemburg

Not sure:
 * h-system borrowing: shark/ŝarko
 * ghi -> Ĝi, ĝi, ĝia, ĝin

Probably bad:
 * anche -> anĉo, anĉojn
 * chi -> Ĉi, ĉi, ĉia, ĉian, ĉie
 * Mache -> maĉi
 * Machen -> maĉi
 * marchante -> marĉ, marĉa, marĉo
 * marche -> marĉ, marĉa, marĉo
 * mash -> maŝo
 * Suche -> suĉ

Orthography Conversion Summary & Recommendations
Looks like automatic h-system conversion probably isn't worth it. Automatic x-system conversion may help, but the impact would be pretty small. I suggest not doing anything, unless someone else thinks either would be particularly helpful.

That said, there are some apparent x-system keyboard errors (see below) that should probably be fixed.

Incidental Errors
Some of the collisions I see are likely errors, especially in names that are fairly distinctive, and where mere transliteration isn't a possibility. Any sh/ŝ and ch/ĉ alternations are plausibly transliterations, but names where ux is pronounced like English ucks are not plausibly transliterated as ŭ.

The two most obvious examples are Patuxent/Patŭent and Luxemburg/Lŭemburg:


 * Patŭent, especially in USGS Patŭent Bird Identification InfoCenter, is fairly common on Wikipedia.
 * Lŭemburg appears several times on Wiktionary, and once on Wikisource (in a document in German, which seems to have some other transcription errors, including x-system conversions and Esperanto ŭ for German ü).

Other potential errors:


 * Verreaux/Verreaŭ and verreauxi/verreaŭi, Montreux/Montreŭ, Lavaux/Lavaŭ, Michaux/Michaŭ on Wikipedia
 * deux/deŭ on Wiktionary

There's probably some way to mine for these, other than looking for such collisions, but nothing obvious jumps to mind.

Next Steps

 * Get speaker review of the stemming groups.
 * Assuming the review is positive, convert the stemmer into an Elasticsearch plugin, create an analysis config for it, and test that (which should show only minor differences from the analysis here, mostly due to tokenization differences).
 * Deploy the plugin and plugin-dependent config.
 * Once the config and plugin are deployed, reindex Esperanto-language wikis with the new analyzer.