User:TJones (WMF)/Notes/Analyzer Analysis for Elasticsearch Upgrade from 6.5 to 6.8

February 2022 — See TJones_(WMF)/Notes for other projects. See also T300302. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Summary
Between Elastic 6.5 and 6.8, changes in Lucene have caused changes to tokenization for the standard tokenizer:


 * A lot of "interesting" Unicode characters are now surviving tokenization.
 * The tokenizer no longer splits on narrow no-break spaces (U+202F).

It also turns out that both of the above were already true for the ICU tokenizer in ES 6.5.

The Nori (Korean) tokenizer has changed the way it defines character sets (regular vs "extended"), while still breaking on clearly different character sets (Hangul, Cyrillic, Latin, Greek), leading to lots of small changes in Korean tokens: 6.4% fewer tokens in my Korean Wikipedia sample, 9.5% fewer tokens in my Korean Wiktionary sample.

Background & Data
We've already seen one unexpected change in 6.8: a chess piece [♙] became searchable, which caused an existing test to fail. So, as part of the process of upgrading from Elasticsearch 6.5 to 7.10, I'm investigating language analyzer changes from 6.5 to 6.8.

I pulled 500 random documents each from the Wikipedia and Wiktionary for the following 47 languages: Arabic, Bulgarian, Bangla, Bosnian, Catalan, Czech, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Italian, Japanese, Javanese, Khmer, Korean, Lithuanian, Latvian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo-Croatian, Slovak, Serbian, Swedish, Thai, Turkish, Ukrainian, and Chinese; plus 500 random Wikipedia articles from the following 5 languages (which do not have Wiktionaries): Tibetan, Sorani, Gan Chinese, Mirandese, Rusyn; and 150 random articles from the Dzongkha Wikipedia (which does not yet have 500 articles).

A subset of these languages were originally chosen because they had an interesting mix of writing systems and language analysis configurations. The rest were added after my initial findings showed something interesting was going on, to include the rest of the languages with some sort of custom language analysis.

Tokens, Tok ens, Tokenstokens
There are a number of tokenization changes in the 6.8 standard tokenizer, many good, at least one not so good. Plus revelations about the ICU tokenizer!

New Good Tokens!
For many language analysis chains, a number of new tokens showed up, including chess pieces. A sample: ♙ ☥ © ® ↔ ↗ ↘ ↪ ▶ ★ ☉ ♀ ♔ ♚ ♜ ♠ ♡ ♥ ♨ ♪ ⚬ ⚭ ⚲ 🇸🇬 ➡ ♭ ♯ 🀙 🂡 🃞 🍀 🍚 🔗 🖥 🗮 🥺 🧗 🧟 ™. They are all identified as having type "EMOJI", which I'm interpreting as "symbol-like thing".

A few characters with very high Unicode values also showed up as new:


 * 𬶍 — U+2CD8D, a CJK character, type "IDEOGRAPHIC"
 * 𑜂𑜥 — U+11702 U+11725, Ahom characters, type "SOUTHEAST_ASIAN"

New Not-So-Good Tokens
A fair number of new tokens with spaces appeared in my output. These are coming from narrow no-break spaces (NNBSP, U+202F) in the input, which are no longer treated as word boundaries by the standard tokenizer. The ICU normalizer, when present, converts NNBSP to a regular space.

This results in multi-word tokens with spaces in the middle, and single-word tokens with spaces at either end—or both!

Since NNBSPs look like thin spaces, they are very difficult to detect while reading, which makes text containing them effectively unsearchable.
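Because these spaces are nearly invisible, it can help to check tokenizer output for them programmatically. The following is a minimal sketch, assuming the tokens are already available as a list of Python strings; the helper names are hypothetical and not part of any analysis tooling mentioned here.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def tokens_with_nnbsp(tokens):
    """Return only the tokens that contain a narrow no-break space."""
    return [t for t in tokens if NNBSP in t]


def show_codepoints(token):
    """Spell out a token as code points so invisible characters stand out."""
    return " ".join(f"U+{ord(ch):04X}" for ch in token)


def normalize_nnbsp(text):
    """Replace NNBSP with a regular space, as a pre-tokenization cleanup."""
    return text.replace(NNBSP, " ")


if __name__ == "__main__":
    sample_tokens = ["10\u202f000", "km", "\u202fword"]
    for tok in tokens_with_nnbsp(sample_tokens):
        print(repr(tok), "=", show_codepoints(tok))
```

Printing the code points makes it obvious which tokens carry a U+202F, even when the rendered text looks like it has ordinary spaces.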

ICU Tokenizer
I noticed that languages like Tibetan, Javanese, and Khmer that use the ICU tokenizer showed no changes from 6.5 to 6.8. It turns out that both the old and new versions of the ICU tokenizer have the same behavior with the "EMOJI" and NNBSP characters.

My hypothesis is that there has been some harmonization between the standard tokenizer and the ICU tokenizer in Lucene, which resulted in general improvements, but also spread the NNBSP problem to the standard tokenizer.

Aggressive Splitting
The English and Italian analyzers include the  filter, which is a specific   filter. It has the benefit of splitting on NNBSP characters, but it also deletes a lot of the "EMOJI" characters, including these: ♙ ☥ © ® ↔ ↗ ↘ ↪ ▶ ★ ☉ ♀ ♔ ♚ ♜ ♠ ♡ ♥ ♨ ♪ ⚬ ⚭ ⚲ ➡ ♭ ♯.

On the other hand, these are preserved: 🇸🇬 🀙 🂡 🃞 🍀 🍚 🔗 🖥 🗮 🥺 🧗 🧟 ™, and 𬶍 𑜂.

Outlier Analyzers

 * The Hebrew tokenizer breaks on NNBSP characters, but deletes all of the symbol characters passed through by the standard and ICU tokenizers. Unfortunately, the open source version of the analyzer is not being updated anymore. We could possibly update a fork of it in the future, but with the uncertainty around the future of Elasticsearch, it's not clear that it is worth it.


 * The Thai analyzer breaks on NNBSP characters (but not hyphens), and deletes all the symbol characters now passed through by the standard and ICU tokenizers. I haven't unpacked the Thai analyzer, but I suspect the Thai tokenizer is responsible.


 * The Korean/Nori analysis chain had a lot more changes than I was expecting, and they originally appeared to be all over the place. After looking more carefully, I discovered that the changes are actually (mostly) an improvement to the Korean/Nori tokenizer.
   * The tokenizer generally breaks words on character set changes, but earlier versions of the tokenizer were a little too fine-grained in assigning character classes, so that "Extended" character sets (e.g., "Extended Latin" or "Extended Greek"), IPA, and others were treated as completely different character sets, causing mid-word token breaks. Some examples:
     * νοῦθοσ → was: νο, ῦ, θοσ; now: νοῦθοσ
     * пєԓӈа → was: пє, ԓ, ӈа; now: пєԓӈа
     * suɏong → was: su, ɏ, ong; now: suɏong
     * hɥidɯɽɦɥidɯɭ → was: h, ɥ, id, ɯɽɦɥ, id, ɯɭ; now: hɥidɯɽɦɥidɯɭ
     * boːneŋkai → was: bo, ː, neŋkai; now: boːneŋkai
   * On the other hand, numbers are now considered to not be part of any character set, so tokens no longer split on numbers. The standard tokenizer does this, too. The  filter breaks them up for English and Italian. Some Korean examples:
     * 1145년 → was: 1145, 년; now: 1145년
     * 22조의2 → was: 22, 조의, 2; now: 22조의2
     * лык1 → was: лык, 1; now: лык1
     * dung6mak6 → was: dung, 6, mak, 6; now: dung6mak6
   * The Korean tokenizer still splits on character set changes. This generally makes sense for Korean text, which often does not have spaces between words. However, it can give undesirable results for non-CJK mixed-script tokens, including words with homoglyphs, and stylized mixed-script words. Some examples (the Cyrillic letter is the middle token in each):
     * chocоlate → choc, о, late
     * KoЯn → ko, я, n
     * NGiИX → ngi, и, x
   * The tokenizer doesn't seem to have any problems with narrow no-break spaces, which is nice.
   * The changes result in:
     * 6.4% fewer tokens in my Korean Wikipedia sample, with more distinct tokens (e.g., x and 1 were already tokens, now so is x1.)
     * 9.5% fewer tokens in my Korean Wiktionary sample, with fewer distinct tokens (e.g., hɥidɯɽɦɥidɯɭ was originally converted into 5 smaller unique tokens, now it's just 1 big unique token.)
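The Korean splitting behavior described in this list can be illustrated with a toy splitter. This is only a sketch, not the Nori implementation: it assumes the first word of a character's Unicode name is a good-enough script label, treats digits as script-neutral, and does no lowercasing or dictionary lookup. The function names are made up for illustration.

```python
import unicodedata


def script_of(ch):
    """Very rough script label: the first word of the Unicode character name.

    Digits (and unnamed characters) return None, meaning "no script",
    so they never trigger a split on their own.
    """
    if ch.isdigit():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None


def split_on_script_change(word):
    """Split a word wherever the script label changes, ignoring digits."""
    pieces = []
    current = ""
    current_script = None
    for ch in word:
        script = script_of(ch)
        if script is not None and current_script is not None and script != current_script:
            # Script changed: close the current piece and start a new one.
            pieces.append(current)
            current = ""
            current_script = None
        current += ch
        if script is not None:
            current_script = script
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    # "choc" + CYRILLIC SMALL LETTER O + "late"
    print(split_on_script_change("choc\u043elate"))
    print(split_on_script_change("1145년"))
    print(split_on_script_change("dung6mak6"))
```

With these assumptions, the homoglyph example splits into three pieces while the digit examples stay whole. The name-prefix heuristic is crude: for instance, the IPA length mark ː is named as a "MODIFIER LETTER", so this toy would still split boːneŋkai, unlike the 6.8 tokenizer.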

Plain Tokenization
The plain field always (almost always?) uses either the standard tokenizer or the ICU tokenizer, which both have problems with narrow no-break spaces.

Monolithic Analyzers
All of the monolithic analyzers from Elasticsearch use the standard tokenizer, except for Thai, which uses the Thai tokenizer. Until these are unpacked, there's nothing we can do with them.

Miscellaneous Observations

 * I retested the problem of changes in character set from one word to the next affecting the tokenization of the latter word. With the ICU tokenizer, apple 3x λογος 7Ω is tokenized as 4 tokens: apple, 3x, λογος, 7Ω; but λογος 3x apple 7Ω is tokenized as 6 tokens: λογος, 3, x, apple, 7, Ω. The standard tokenizer does the expected thing: 4 tokens in each case.
 * I noticed some more sub-optimal stemming groups in Polish (Stempel), not related to the 6.8 upgrade, and asked Zbyszko to review them before he leaves.
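The ICU versus standard tokenizer comparison above could be reproduced against a test cluster with the `_analyze` API. The sketch below only builds the request bodies and does not contact any server; it assumes the `icu_tokenizer` name provided by the analysis-icu plugin, and the helper names are hypothetical.

```python
import json

# The two word orders from the observation above.
SAMPLES = ["apple 3x λογος 7Ω", "λογος 3x apple 7Ω"]


def analyze_request(tokenizer, text):
    """Build a body for POST /_analyze using a bare tokenizer, no filters."""
    return {"tokenizer": tokenizer, "text": text}


def build_requests(samples=SAMPLES, tokenizers=("standard", "icu_tokenizer")):
    """One request body per (sample, tokenizer) pair."""
    return [analyze_request(tok, s) for s in samples for tok in tokenizers]


if __name__ == "__main__":
    for body in build_requests():
        print(json.dumps(body, ensure_ascii=False))
```

Each body would be sent to the `_analyze` endpoint, and the number of tokens in each response compared across the two word orders.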

Next Steps
I plan to discuss all of this with the rest of the Search Team very soon and decide on our course of action. Options include:


 * Do nothing immediately. There are very few tokens affected, and some of the issues are already present in 6.5.
 * Do nothing until ES 7.10. It's hard to keep up with all the small changes in every new version of Elastic and Lucene, so just wait until ES 7.10 and re-assess then. Who knows what may have been changed or fixed?
 * Start fixing stuff, working through something like the following list, roughly sorted by priority/complexity:
   * Add a  character filter to plain filters and unpacked text filters everywhere.
   * Look into tuning the  filter for English and Italian to see if we can get what we want out of it without losing all the interesting rare characters coming out of the 6.8 tokenizers.
   * Look into harmonizing tokenization +  across all (non-monolithic) languages so that similar inputs get similar outputs when possible (e.g., when not dependent on dictionaries or other language-specific processing). See T219550.
   * Consider upgrading all languages that use the standard tokenizer to the ICU tokenizer.
   * Consider using some form of  everywhere possible. See T219108.
   * Look into patching our version of the Hebrew tokenizer to allow interesting Unicode characters to come through.
 * If various issues still exist in ES 7.10:
   * Open an upstream ticket for the standard and ICU tokenizers to split on NNBSP characters.
   * Open an upstream ticket for the Thai tokenizer to allow interesting Unicode characters to come through.
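One possible shape for an NNBSP fix is a character filter that maps U+202F to a regular space before tokenization. The sketch below builds index analysis settings as a Python dict, assuming Elasticsearch's built-in `mapping` character filter. The filter and analyzer names (`nnbsp_to_space`, `plain_sketch`) are hypothetical, and this is not necessarily the filter the list above has in mind.

```python
import json


def nnbsp_settings(base_tokenizer="standard"):
    """Analysis settings that rewrite NNBSP to a space ahead of the tokenizer."""
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical name; "mapping" is the built-in filter type.
                "nnbsp_to_space": {
                    "type": "mapping",
                    # Escaped forms, so the replacement space is not trimmed away.
                    "mappings": ["\\u202F=>\\u0020"],
                }
            },
            "analyzer": {
                # Hypothetical analyzer, for illustration only.
                "plain_sketch": {
                    "type": "custom",
                    "char_filter": ["nnbsp_to_space"],
                    "tokenizer": base_tokenizer,
                    "filter": ["lowercase"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(nnbsp_settings(), indent=2))
```

Because character filters run before the tokenizer, the same filter could in principle sit in front of either the standard or the ICU tokenizer by passing a different `base_tokenizer`.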