User:TJones (WMF)/Notes/Strip Empty Tokens Generated by ICU Folding

August 2018 — See TJones_(WMF)/Notes for other projects. See also T192502. For help with the technical jargon used in Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background
Under certain circumstances, certain characters can generate empty tokens. The empty strings get indexed, and the original characters that map to empty strings are all conflated at search time. Unfortunately, this conflation can happen even in the plain field—i.e., even if you search with quotes!

Part of the problem comes from the tokenizer splitting on some of these characters; other times, the characters are used independently, e.g., in lists in a Wikipedia article on a character set. The articles with lists of these characters score well, too, since they have lots of hits!

The main culprit seems to be ICU folding, which folds many of these characters to nothing. That is fine when they are part of a longer word, but not okay when they are the entire word. Adding a length token filter to the analysis chain (automatically, whenever ICU folding is enabled as an automatic upgrade from ASCII folding) will keep the empty strings from being indexed. We want the characters to still be matchable via the plain field (even though they are conflated in the plain field at the moment!), so we need preserve_original to work as expected when the folded token is eliminated; this required putting the length filter after the preserve_original filter.
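The effect of the length filter can be illustrated with a toy simulation. This is a minimal sketch, not the actual Elasticsearch or CirrusSearch code: `toy_fold` is a hypothetical stand-in for ICU folding that only strips combining marks and three modifier letters, and `analyze` imitates a preserve_original step followed by an optional length filter that drops empty tokens.

```python
import unicodedata

# Hypothetical stand-in for the characters ICU folding maps to nothing;
# the real list is much longer.
DROPPED = {"\u02c8", "\u02cc", "\u02d0"}  # ˈ ˌ ː


def toy_fold(token):
    """Toy folding: strip combining marks (category Mn) and DROPPED characters."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch not in DROPPED
    )


def analyze(tokens, strip_empty):
    """Return the list of tokens emitted at each input position."""
    positions = []
    for original in tokens:
        folded = toy_fold(original)
        # Imitate preserve_original: keep the original next to the folded
        # token whenever folding changed it.
        emitted = [folded]
        if folded != original:
            emitted.append(original)
        # Imitate a length filter with a minimum length of 1, applied last.
        if strip_empty:
            emitted = [t for t in emitted if len(t) >= 1]
        positions.append(emitted)
    return positions


sample = ["caf\u00e9", "\u02c8"]  # "café" and a lone "ˈ"
print(analyze(sample, strip_empty=False))
print(analyze(sample, strip_empty=True))
```

With `strip_empty=False`, the lone ˈ yields both an empty token and its original; with `strip_empty=True`, only the original survives, so the character stays matchable without being conflated with every other character that folds to nothing.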

Below is a sample of characters that were highlighted in a search for just one of them on English Wikipedia. (This isn't all of the affected characters, and the exact list differs by language, since the language-specific processing might differently normalize or remove a subset of these characters. Hebrew does, for example.)

ˍ ˎ ˏ ˬ ̂ ̟ ̣ ̤ ֪ ۥ ့ ် ႋ ႌ ႍ ႏ ႚ ႛ ᩶ ᩷ ᩸ ᩹ ᩺ ᩻ ᩼ ᱹ ⸯ ㅤ ꜜ ꜞ ꜟ ꞈ ꪿ ꫀ ꫁ ꫂ ﳲ ﳳ ﳴ ︀ ︎ ˈˈ ˌˌ ːː ː̀ ː̃ ـִ ـْ ـٓ ــٰ ـً‎ ــِ ʽ ̃ ̇ ʹ ּ ߴ ็ ้ ๊ ์ ๎ ່ ້ ໊ ໋ ႈ ႉ ႊ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ ្ ៓ ៝ ᩵ ᵎ ꙿ ꜝ ️ ﹹ ﹻ ﹿ ـً ـٌ ـٍ ـ ߴ‎ ߵ‎ ߺ‎ ーー ̶ ๋ ႇ ៉ ᴻ ﹱ ﹷ ﹽ ـَ ـُ ـِ ˁ ˑ ́ ̰ ՙ ่ ໌ ᴯ ｰ ˊ ˮ ̅ ــ ̸ ˌ ॱ ʹ ʺ ˋ ️⃣ ⃣ ʻ ـ ˉ ˈ ˆ ʾ ˇ ʼ ʿ ー ː ˀ

Note: The ᴯ and ᴻ seem out of place in the list, and I found at least one other alphabetic character that is tokenized to nothing: ᵃ. As far as I can tell, all this goes back to the IBM implementation of the ICU tools, which Unicode.org seems to maintain (see GitHub). They don't seem to accept issue tickets, just pull requests. (I guess that's one way to keep the number of complaints down.) Maybe in my infinite free time one day I'll figure out their data format and submit a pull request to fix these. Sigh... so many projects, so little time.

Data
Since this change affects many languages, I've taken smaller samples from the Wikipedias of 11 languages that have some ICU folding config in the AnalysisConfigBuilder. Rather than count articles, I'm counting de-duped lines from Wikipedia articles. Also, I've limited myself to samples I happened to have on hand from previous work, so I was able to gather 50K lines each for Bosnian, English, French, Hebrew, Russian, Slovak, and Serbian, but only 12K for Swedish, 14K for Greek, 17K for Croatian, and 24K for Serbo-Croatian. All together, it's about 86MB of text to analyze. If anything untoward shows up in the data, I can easily enough run more data for a particular language.


 * N.B. Two of the languages are English: "en" and "simple". Russian has config for ICU folding exceptions (i.e., don't fold й/Й to и/И), but doesn't actually use it because ICU folding is not yet enabled for Russian; I believe it was put there as an example/placeholder when the ICU folding exception feature was first introduced. Greek doesn't use ICU folding in its text field, but does use it in its plain field. Many of the languages use ICU folding in other fields.

Since the config affects all analysis chains with icu_folding, and the plain field uses icu_folding and preserve_original, I'm testing both the plain and text analysis for all of these languages.

All of the corpora have at least a couple hundred thousand tokens (218K for Swedish, up to 3.2M for Hebrew—because it generates several output tokens for each input token).

Text Field
The text field is the index used for basic searching. It usually includes more regularization, stemming, and stop words.

The impact is very small, with most corpora showing no changes. The biggest changes were in the English corpus, with 3 types and 4 tokens (out of 100K types and 1.1M tokens) affected.

Plain Field
The plain field is the index used for phrase searching (and single words) in quotes. It usually includes less regularization (though in many cases it still includes ICU folding) and no stemming or stop words. The plain field also contributes to "basic searching" by allowing exact matches; otherwise it would be hard to match "to be or not to be," which is all stop words.

Again, the impact was very small, with most corpora showing no changes and the biggest changes in the English corpus, with 3 types and 4 tokens (out of 100K types and 1.1M tokens) affected.

Tools Update and Incidental Information
I decided to add a section to my tools to call out input tokens that generate empty output tokens. (This is distinct from lost and found tokens—which are based on changes between configs—and stop words, which generally do not generate tokens.) This made it easier to see what input tokens specifically were generating empty output tokens.
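A report of this kind might look roughly like the following. This is only a sketch: `empty_token_report` and the `(input, outputs)` pair format are assumptions made for illustration, not the interface of the actual tools.

```python
from collections import Counter


def empty_token_report(analyzed):
    """Count input tokens that produced at least one empty output token.

    `analyzed` is an iterable of (input_token, output_tokens) pairs; this
    format is hypothetical and chosen only for the example.
    """
    counts = Counter()
    for input_token, output_tokens in analyzed:
        if any(out == "" for out in output_tokens):
            counts[input_token] += 1
    # Most frequent offenders first.
    return counts.most_common()


pairs = [
    ("\u02c8", [""]),            # lone modifier letter folded to nothing
    ("word", ["word"]),          # ordinary token, unaffected
    ("\u02c8", [""]),
    ("the", []),                 # stop word: no output token at all, not reported
    ("\u02d0", ["", "\u02d0"]),  # empty folded token plus preserved original
]
for token, count in empty_token_report(pairs):
    print(repr(token), count)
```

Note that a stop word, which produces no output token, is deliberately kept separate from an input that produces an output token of length zero.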

Oddly, I discovered that the Greek analyzer (which is monolithic and, as far as I can tell, has the factory settings) generates empty tokens for all of these words (and probably others): εστάτο, εστερ, εστέρ, έστερ, έστέρ, εστέρα, εστέρας, εστέρες, εστέρησε, εστερία, εστερικό, εστερικού, εστερικών, εστέρο, εστέρος, εστέρων, ήσανε, ότερ, οτέρι, ότερι, οτερό, οτέρο. I'll open a ticket (T203117) to unpack it and filter empty tokens, and maybe open an upstream ticket, but it seems to have only a small impact.

Next Steps

 * Upload a patch to automatically insert a length token filter after ICU folding when we auto-convert ASCII folding to ICU folding.
 * Reindex a lot of wikis: Bosnian, Greek, English/Simple English, French, Hebrew, Croatian, Serbo-Croatian, Slovak, Serbian, Swedish.
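
The first step could be sketched as a transformation over an analyzer's filter list. This is an illustration in Python rather than the actual AnalysisConfigBuilder code: the function name `upgrade_folding` and the filter name `remove_empty` are hypothetical. The sketch also ignores the preserve_original case, where the length filter would need to come after the preserve_original filter instead. `asciifolding`, `icu_folding`, and the `length` filter type with its `min` parameter are Elasticsearch names.

```python
def upgrade_folding(filters):
    """Replace asciifolding with icu_folding and strip empty tokens after it."""
    upgraded = []
    for name in filters:
        if name == "asciifolding":
            # Insert the empty-token filter directly after ICU folding.
            upgraded.extend(["icu_folding", "remove_empty"])
        else:
            upgraded.append(name)
    return upgraded


# Definition of the hypothetical remove_empty filter:
# a length filter that drops tokens shorter than one character.
filter_definitions = {"remove_empty": {"type": "length", "min": 1}}

print(upgrade_folding(["lowercase", "asciifolding", "kstem"]))
```

Filter lists that never used ASCII folding pass through unchanged, so the length filter is only added where the automatic upgrade applies.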