User:TJones (WMF)/Notes/Elasticsearch 6 vs Elasticsearch 5 Analyzer Analysis

February 2019 — See TJones_(WMF)/Notes for other projects. See also T194849 and T199791. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background & Data
As part of the upgrade to Elasticsearch 6, I pulled a small sample from Wikipedia and Wiktionary (when available) for each of Chinese, Dzongkha, English, Finnish, French, Gan, Greek, Hebrew, Italian, Japanese, Javanese, Mirandese, Polish, Russian, Rusyn, Serbian, Slovak, Swedish, Tibetan, Turkish, and Ukrainian. Each sample has 500 articles/entries, except for Dzongkha, which doesn't have 500 articles. The languages were chosen (in May of 2018) to provide a diversity of scripts and analysis configurations.

Back in May of 2018, I ran the then-current analysis on each of my samples. I have now (Feb 2019) run the current analysis on the same samples and compared the results.

The samples are small because there are a lot of them, and we aren't expecting any changes—though of course we also expect there to be some unexpected changes.

Problems of the Moment
Ah—the expected unexpected. These are the kinds of things we are here to find! These should all get fixed before the final upgrade to ES 6 happens.


 * Serbian: Elasticsearch reports that "extra-analysis-serbian / 6.5.4-SNAPSHOT" is installed, but when I try to reindex, I get an error:
 * Esperanto isn't in my original samples because it didn't have custom processing at the time I took them. However, I noticed that there was no Esperanto plugin in the new ES 6 batch of plugins.

General Changes
Some of the changes are more general even though they only showed up in specific language samples.


 * Addition of remove_empty filter: We added a generic filter that removes empty tokens, which prevents unrelated strings that happen to generate empty tokens from matching each other. It's now automatically added to unpacked analyzers that use ICU folding, and it only affects a small number of tokens. Since the effect is small, we didn't bother to reindex every wiki that would be affected, so for some wikis the change will only take effect when we finish the upgrade to ES 6. The one example I saw came from the French Wikipedia sample, where the character "ː" (U+02D0, modifier letter triangular colon) is folded into nothing.
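To make the empty-token problem concrete, here is a minimal Python sketch. It is not ICU folding and not the actual remove_empty implementation: `toy_fold` is a hypothetical stand-in that drops combining marks and modifier letters, which is just enough to reproduce the "ː" folds to nothing case described above.

```python
import unicodedata


def toy_fold(token):
    """Hypothetical stand-in for ICU folding (an assumption, not the real filter).

    Decomposes the token, then drops combining marks (Mn) and modifier
    letters (Lm), and lowercases what is left.
    """
    decomposed = unicodedata.normalize("NFKD", token)
    kept = [c for c in decomposed if unicodedata.category(c) not in ("Mn", "Lm")]
    return "".join(kept).lower()


def remove_empty(tokens):
    """Drop tokens that became empty strings after folding."""
    return [t for t in tokens if t]


tokens = ["voyelle", "\u02d0", "a\u02d0"]   # "ː" is U+02D0
folded = [toy_fold(t) for t in tokens]      # the lone "ː" becomes ""
indexed = remove_empty(folded)              # the empty token is dropped
```

Without the final step, the lone "ː" would be indexed as an empty token and could match any other string that also folds down to nothing.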


 * Changes to the ICU tokenizer: It looks like there are some changes to the ICU tokenizer, which is used by several languages that don't have custom analyzers.
 * CJK strings are broken up into smaller chunks, presumably based on a dictionary. I see examples of Hiragana, Katakana, and Ideographs in the Gan Wikipedia sample, for example.
 * "Rare characters" are preserved by the tokenizer. ☊ and ☋ are preserved in the Gan Wikipedia sample. The general topic of "rare characters" is discussed more in T211824. I tested the three characters from that ticket description: ☥ (Ankh), 〃 (ditto mark), and 〆 (ideographic closing mark). Of these, ☥ and 〆 are preserved by the ICU tokenizer, so they would get indexed in Gan, but would still not be indexed in English, French, Greek, or other languages that use the standard tokenizer (either unpacked or monolithic).
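One way to repeat the tokenizer comparison is to send the same text through Elasticsearch's `_analyze` API once per tokenizer. The sketch below only builds the request bodies; it does not contact a cluster, and it is not necessarily how the test above was run. The helper names are hypothetical, and `icu_tokenizer` assumes the analysis-icu plugin is installed.

```python
import json

# The three characters from the ticket description: ☥, 〃, 〆
RARE_CHARACTERS = ["\u2625", "\u3003", "\u3006"]

# "standard" is built in; "icu_tokenizer" comes from the analysis-icu plugin.
TOKENIZERS = ("standard", "icu_tokenizer")


def analyze_body(tokenizer, text):
    """Build the JSON body for a POST /_analyze request (hypothetical helper)."""
    return {"tokenizer": tokenizer, "text": text}


def build_requests(chars=RARE_CHARACTERS):
    """Return one request body per tokenizer, all over the same text."""
    text = " ".join(chars)
    return {name: analyze_body(name, text) for name in TOKENIZERS}


requests_by_tokenizer = build_requests()
icu_json = json.dumps(requests_by_tokenizer["icu_tokenizer"], ensure_ascii=False)
```

Posting each body to `_analyze` and comparing the returned token lists shows which characters each tokenizer keeps.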

Specific Changes

 * The Hebrew analysis chain, which uses a manually rebuilt version of HebMorph, now generates stems that start with "או" instead of "א" for a very small number of tokens. (see T214439)
 * Mirandese has been updated with a custom analysis chain that handles elision (so d'admenistración is indexed as just admenistración) and has a custom stop word list, so there are lots of differences in my sample, but they aren't differences versus what's in production. (see T194941)
 * The Polish analysis chain has similarly been updated since I took my sample, including custom filters to remove bad statistical stems, and general effects of unpacking the analyzer in order to customize it. (see T186046)
 * Ukrainian has picked up a new superpower and now folds Ґ/ґ to Г/г, accounting for about 0.030% of the tokens in my Wikipedia sample. The tokenizer also generated one additional token (out of 119,490). I couldn't find it, though, so the new token must not be unique; that is, it must be another instance of a token that already occurs elsewhere in the sample.
 * The promised fix for the high/low surrogate problem in Chinese (where characters outside the Basic Multilingual Plane are broken into their component UTF-16 surrogates; see T168427) has arrived! My sample didn't include the surrogate fix because, again, it was implemented after I took the sample.
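Two of the character-level changes in the list above are easy to illustrate in Python. This is a minimal sketch, not the production configuration: `fold_ghe` is a hypothetical helper that mimics the Ukrainian Ґ/ґ to Г/г folding with a plain character map, and `utf16_code_units` shows the two surrogates that a character outside the Basic Multilingual Plane is stored as in UTF-16, which is what a surrogate-unaware analysis step would split apart.

```python
# Character map mimicking the Ukrainian folding of Ґ/ґ to Г/г (illustrative only).
GHE_FOLD = str.maketrans("Ґґ", "Гг")


def fold_ghe(token):
    """Fold Ґ/ґ to Г/г, leaving every other character untouched."""
    return token.translate(GHE_FOLD)


def utf16_code_units(ch):
    """Return the UTF-16 code units for a string, as a list of integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


folded = fold_ghe("ґанок")                  # "ганок"
bmp_units = utf16_code_units("\u4e2d")      # 中 fits in one code unit
astral_units = utf16_code_units("\U00020000")  # U+20000 needs a surrogate pair
```

Here `astral_units` holds a high surrogate (0xD840) followed by a low surrogate (0xDC00). Neither half is a character on its own, so they need to stay together during tokenization.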