User:TJones (WMF)/Notes/Esperanto Analysis Chain Analysis

August 2018 — See TJones_(WMF)/Notes for other projects. See also T202173. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
Esperanto is a bit further down the list of the remaining top 50 languages to look at (T171652), but it jumped to the top because I had a developer ask me to recommend a project to work on, and I suggested an Esperanto stemmer. As a constructed language, Esperanto is very regular and reasonably well documented, and speakers are available to help, so the barrier to implementing a stemmer was much lower for a non-speaker.

The stemmer is available on GitHub; it's written in Java and released under a GPL3 license (its structure is based on the Serbian stemmer). My analysis of the stemmer, with a lot of speaker input, led to some good feedback, and the stemmer improved a fair bit as a result.

I ported the stemmer into an Elasticsearch plugin, refactored it a fair bit, and updated it to pass our style checks, findbugs rules, etc.

Since the projects the stemmer went into had already been upgraded to Elasticsearch 6, the plugin is an ES6 plugin, but we're still officially on ES5. However, David kindly back-ported the plugin to ES5, so I can test it as part of an analysis chain in the current ES5 environment.

Data
I used the same 5,000 random articles from the Esperanto Wikipedia and 5,000 entries from the Esperanto Wiktionary as for the stemmer analysis, with my usual stripping of markup and deduplication of individual lines (to get rid of excess copies of the equivalent of commonly used headings like "References", "See Also", "Noun", language names, etc.).

Unpacking the Default
Esperanto currently uses the default analysis chain, which is the standard tokenizer plus ICU normalization (which includes lowercasing). I unpacked the default analysis chain into a custom Esperanto analysis chain, without changing or adding anything. To test, I compared the analysis results for both the Wikipedia and Wiktionary corpora to my baseline results from the Esperanto stemmer analysis, and everything was the same. So, unpacking was successful.
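As a rough illustration of what "unpacking" means here, the sketch below writes the default chain out as an explicit custom analyzer, expressed as a Python dict mirroring Elasticsearch index settings. The analyzer name `eo_text` and the exact layout are assumptions for illustration, not the actual CirrusSearch configuration; `standard` and `icu_normalizer` are the tokenizer and token filter the paragraph above describes.

```python
import json

# Hypothetical analyzer name ("eo_text") and layout; an illustrative
# assumption, not the real CirrusSearch config.
UNPACKED_ESPERANTO_ANALYSIS = {
    "analysis": {
        "analyzer": {
            "eo_text": {
                "type": "custom",
                "tokenizer": "standard",
                # icu_normalizer comes from the analysis-icu plugin; its
                # default normalization also case-folds (lowercases).
                "filter": ["icu_normalizer"],
            }
        }
    }
}

# Print the settings as JSON, keeping non-ASCII characters readable.
print(json.dumps(UNPACKED_ESPERANTO_ANALYSIS, indent=2, ensure_ascii=False))
```

Once the chain is spelled out like this, later steps (a stemmer, folding) become extra entries in the `filter` list, and an unchanged baseline confirms that the unpacking itself altered nothing.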

Stemmer vs Plugin
Next I enabled the new stemmer and re-ran the analysis on my samples, and compared them to the newest version of the original stemmer that I had data for.

There were some expected splits and collisions, though the impact was relatively small.


 * New collisions: 787 pre-analysis types (0.566% of pre-analysis types) / 4920 tokens (0.487% of tokens) were added to 581 groups (0.651% of post-analysis types), affecting a total of 2319 pre-analysis types (1.667% of pre-analysis types) in those groups.
 * New splits: 167 pre-analysis types (0.12% of pre-analysis types) / 2850 tokens (0.282% of tokens) were lost from 114 groups (0.128% of post-analysis types), affecting a total of 859 pre-analysis types (0.617% of pre-analysis types) in those groups.
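To make "collisions" and "splits" concrete, here is a minimal sketch of one way to compare two analyses. It assumes each analysis is a plain mapping from a pre-analysis type to its post-analysis form, and it reports, per word, which group-mates were gained or lost. The function names and the toy data are invented for illustration, and the accounting is simpler than the one behind the figures above (which also counts tokens and affected groups).

```python
from collections import defaultdict


def mates(analysis):
    """Map each word to the set of other words that share its stem."""
    groups = defaultdict(set)
    for word, stem in analysis.items():
        groups[stem].add(word)
    return {word: groups[stem] - {word} for word, stem in analysis.items()}


def compare(old, new):
    """Return (collisions, splits) between two word -> stem mappings.

    collisions[word] = group-mates gained under the new analysis
    splits[word]     = group-mates lost under the new analysis
    """
    old_mates, new_mates = mates(old), mates(new)
    collisions, splits = {}, {}
    for word in old.keys() & new.keys():
        gained = new_mates[word] - old_mates[word]
        lost = old_mates[word] - new_mates[word]
        if gained:
            collisions[word] = gained
        if lost:
            splits[word] = lost
    return collisions, splits


# Toy data, invented for illustration only.
old = {"hundo": "hund", "hundoj": "hund", "hundon": "hundon",
       "www.free.fr": "www.fre.fr", "www.fre.fr": "www.fre.fr"}
new = {"hundo": "hund", "hundoj": "hund", "hundon": "hund",
       "www.free.fr": "www.free.fr", "www.fre.fr": "www.fre.fr"}

collisions, splits = compare(old, new)
print("collisions:", sorted(collisions))
print("splits:", sorted(splits))
```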

A lot of the splits had to do with tokenization. The command line stemmer would, for example, split www.free.fr into parts, stem the parts, and then reassemble them, to give www.fre.fr, whereas the Elasticsearch analysis chain would treat it as a single token.
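The difference can be sketched as follows. The `toy_stem` function is a stand-in invented for this example (it only strips a final "e" from longer words) and is not the real Esperanto stemmer; the point is only to contrast stemming the parts of a token with stemming the token as a whole.

```python
import re


def toy_stem(word):
    # Toy stand-in for the real stemmer (an assumption for illustration):
    # strip one final "e" from words longer than three letters.
    if len(word) > 3 and word.endswith("e"):
        return word[:-1]
    return word


def stem_parts_and_reassemble(token, stem=toy_stem):
    # Stem each run of letters separately and leave the separators in
    # place, roughly what the command line stemmer is described as doing.
    return re.sub(r"[^\W\d_]+", lambda m: stem(m.group(0)), token)


def stem_whole_token(token, stem=toy_stem):
    # Treat the token as a single unit, as the analysis chain does when
    # the tokenizer keeps it in one piece.
    return stem(token)


print(stem_parts_and_reassemble("www.free.fr"))
print(stem_whole_token("www.free.fr"))
```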

The command line stemmer also punts on any token containing characters other than Esperanto letters, so tokens with foreign letters, numbers, etc., are not stemmed. The plugin version of the stemmer is not so picky.

During development of the plugin, dealing with various corner cases shifted from the command line stemmer to explicit test cases in the plugin project. As a result, some of those cases behave differently between the last version I tested on the command line and the version I tested as a plugin.

Plugin Stemming Updates
Reviewing the stemmer-vs-plugin diffs, and doing a solo-analysis of just the new plugin results, I found a regression from the stemmer, and a couple of opportunities for easy improvements.


 * My port ended up stemming final j, n, and jn (the plural and direct object markers) even when the final result was an empty string for a stem. That's not right.
 * Reviewing the data I noticed that strings ending in numbers are often inflected without a hyphen (e.g., 1980an instead of 1980-an). It's easy enough to detect the common cases, so I added them to the stemmer.
 * I also noticed that non-Esperanto words were often being stemmed in ways that are nonsensical in Esperanto. For example, barn gets stemmed to bar, mann to man, and djerdj to djerd. In the vast majority of cases, -j, -n, and -jn are preceded by o (as nouns) or a (as adjectives), though there are some exceptions: the number unu (one) acting as a determiner can take -j or -n without the expected -o or -a ending intervening. Pronouns like ĉiu and ci can also take -j or -n to mark plural or accusative uses. But even in these cases, there is always an Esperanto vowel before the Esperanto suffix -j or -n. So, I modified the stemmer to require one, to cut down on the false positive stemming of non-Esperanto words.
 * Of course, there are still false positives like English urb/an, German aff/en, or Spanish tibur/on, and even a few false negatives, such as inflected acronyms, like EEZj, or NPPn. (Other acronyms ending in -A or -O, like NAS/Aj, UNESK/On, or POJ/Ojn also get stemmed incorrectly—with or without the -j/n endings—in one of the few cases where lowercasing before stemming loses important information.) On the whole, it should be a net gain to get rid of a lot of incorrect non-Esperanto stems.
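The two constraints on -j, -n, and -jn described above (never leave an empty stem, and require a preceding Esperanto vowel) can be sketched as below. This is a simplified illustration, not the plugin's actual code: the function name is made up, it handles only this one suffix-stripping step, and it does not cover the number endings or the later removal of -o/-a.

```python
ESPERANTO_VOWELS = set("aeiou")


def strip_plural_accusative(token):
    """Strip -jn, -j, or -n only when an Esperanto vowel precedes it."""
    word = token.lower()
    for suffix in ("jn", "j", "n"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # An empty stem, or one not ending in a vowel, blocks stripping.
            if stem and stem[-1] in ESPERANTO_VOWELS:
                return stem
    return word


# Esperanto forms are stripped; barn, mann, and djerdj are left alone.
for example in ["hundojn", "hundon", "unuj", "ĉiujn", "barn", "mann", "djerdj", "n"]:
    print(example, "->", strip_plural_accusative(example))
```

As the text notes, a rule like this still lets through false positives such as urban, and still misses inflected acronyms such as EEZj, because lowercasing discards the information that would distinguish them.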

ICU Folding
I enabled ICU folding before stemming, excluding the Esperanto letters Ĉ/ĉ, Ĝ/ĝ, Ĥ/ĥ, Ĵ/ĵ, Ŝ/ŝ, and Ŭ/ŭ. It had a moderate impact: about 0.7-1.4% of tokens merged into an existing group, and about 5.8-6.7% of pre-analysis types belonged to a group that something merged into.


 * On the Wikipedia corpus: 2301 pre-analysis types (1.654% of pre-analysis types) / 6707 tokens (0.664% of tokens) were added to 1666 groups (1.881% of post-analysis types), affecting a total of 9270 pre-analysis types (6.663% of pre-analysis types) in those groups.
 * On the Wiktionary corpus: 790 pre-analysis types (2.195% of pre-analysis types) / 1279 tokens (1.407% of tokens) were added to 599 groups (2.1% of post-analysis types), affecting a total of 2074 pre-analysis types (5.762% of pre-analysis types) in those groups.
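The idea of folding everything except the Esperanto letters can be approximated in a few lines. This sketch is not ICU folding: it only strips combining marks after Unicode decomposition, so it misses many things the real filter does (for example, it does not map ə or typographic apostrophes). The function name is hypothetical.

```python
import unicodedata

# Letters that must survive folding, since they are distinct in Esperanto.
ESPERANTO_LETTERS = set("ĈĉĜĝĤĥĴĵŜŝŬŭ")


def fold_except_esperanto(text):
    """Strip diacritics from every character except the Esperanto letters."""
    # Compose first so each Esperanto letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in ESPERANTO_LETTERS:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (nonzero combining class), keep base letters.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


for word in ["Éditions", "Vörös", "Ramón", "ĉiuj", "ŝanĝo"]:
    print(word, "->", fold_except_esperanto(word))
```

Running folding before the stemmer, as described above, means a folded form like Ramon reaches the stemmer looking the same as an unaccented word.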

I did a quick review of a random sample of 100 collisions from each corpus. A bit over 80% looked like clearly good mergers, a little over 10% were unclear, and less than 10% were probably unhelpful.


 * Clearly good folding includes folding accents on Latin characters that Esperanto speakers may not be able to type, or wouldn't know to type—like Éditions/Editions, l’Astrolabe/l'Astrolabe, or Vörös/Voros—or generic folding in non-Latin scripts, like stripping Hebrew vowel diacritics, Russian accent diacritics, Greek accent/breathing diacritics, etc.
 * Unclear folding, to me, includes single letters and International Phonetic Alphabet pronunciations, like merging a with all of à, á, ä, ā, ą, ǎ, ǟ and ə. (I've always been annoyed that "schwa" (ə) is folded as a, while "turned-e" (ǝ) is folded as e.)
 * Potentially bad folding always involves removing a stemming-blocking diacritic from the end of a word, leaving something that looks like an Esperanto suffix, which is then stripped. For example, Ramón gets stemmed as ram. On the other hand, Ramon without the diacritic does, too, so at least they match each other.

Overall, it looks like a net positive, so I'm going to enable it.

Next Steps

 * Upload a patch with the new Esperanto analysis chain, including the Esperanto stemmer and the ICU folding (also T202173).
 * Wait for the Esperanto stemmer plugin to be deployed (the version for Elasticsearch 5 will probably be deployed, but there is a small chance we will have to wait for the Elasticsearch 6 version).
 * Re-index Esperanto-language wikis to enable the new analysis chain.