User:TJones (WMF)/Notes/Harmonizing Common Invisibles Across Tokenizers

From mediawiki.org

October 2025 — See TJones (WMF)/Notes for other projects. See also T405020. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

While looking into T87548, I noticed that some of the tokenizers split on soft hyphens, which is an error. I looked into other common invisibles and found the following minor issues in the various tokenizers.

  • SmartCN tokenizer
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
    • creates empty tokens for ZWNJ, ZWJ, ZWSP, LTR mark, RTL mark
  • Sudachi
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark
  • Kuromoji
    • does not split on ZWSP
  • Hebrew
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
  • Nori
    • splits on ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
  • Thai
    • did not split on ZWSP, as designed during unpacking; harmonized with other analyzers as part of T87548
    • the full list of invisibles below should be checked

The full list of invisibles to check for these tokenizers is

  • soft hyphen (00AD)
  • RTL bidi (200F, 202B, 202E, 2067, 061C)
  • non-joiner (200C, 2063)
  • first strong isolate bidi (2068)
  • joiner (200D, 2060)
  • pop bidi (2069, 202C)
  • variation selector (FE00-FE0F, E0100-E01EF)
  • whitespace (200B, 202F, 3000, FEFF, 00A0)

These can largely be fixed with character filters that either delete characters that should not be split on or convert characters that should split tokens into spaces.
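As a rough illustration of the delete-versus-convert idea, here is a minimal Python sketch. It is not the production char filter: the function names are hypothetical, the two character groups are only a small sample chosen from the list above, and whitespace splitting stands in for a real tokenizer.

```python
# Illustrative only: a small sample of invisibles, not the production config.
# Characters that should NOT split tokens are deleted.
DELETE = [
    0x00AD,  # soft hyphen
    0x200C,  # zero-width non-joiner (ZWNJ)
    0x200D,  # zero-width joiner (ZWJ)
    0x200E,  # left-to-right mark
    0x200F,  # right-to-left mark
]
# Characters that SHOULD split tokens are converted to a plain space.
TO_SPACE = [
    0x200B,  # zero-width space (ZWSP)
]

# str.translate deletes code points mapped to None and replaces the others.
TABLE = {cp: None for cp in DELETE}
TABLE.update({cp: " " for cp in TO_SPACE})


def clean_invisibles(text: str) -> str:
    """Hypothetical stand-in for a char filter that runs before tokenizing."""
    return text.translate(TABLE)


def naive_tokens(text: str) -> list[str]:
    """Whitespace splitting as a stand-in for a real tokenizer."""
    return clean_invisibles(text).split()


if __name__ == "__main__":
    print(naive_tokens("co\u00ADoperate"))   # soft hyphen is deleted
    print(naive_tokens("zero\u200Bwidth"))   # ZWSP becomes a token break
```

Because the cleanup happens before tokenization, the tokenizer never sees the invisible characters, so its own splitting quirks no longer matter for them.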

The SmartCN tokenizer's empty token problem should be fixed by converting these characters earlier, before tokenization, but we should still look for empty tokens.

Observations and Config Changes

  • The textify_icu_tokenizer (and the icu_tokenizer) split on "Arabic letter mark" when it is between a Latin character and something else, but not between Arabic characters. It's invisible, so we shouldn't split on it.
  • Nori tokenizer splits variation selector-17+ into its own token (generating an empty token after normalization). (Variation selectors 1–16 are in one block, U+FE00–U+FE0F, and 17–256 are in a different block, U+E0100–U+E01EF.)
  • Thai w/ the ICU tokenizer is fine.
  • My non-ICU Thai tokenizer treats variation selector-256 (U+E01EF) differently from the other 255. It splits on 256, but not the others. I suspect an incorrect < where a <= should have been!
    • TIL: I say "my" non-ICU Thai tokenizer because the non-ICU Thai tokenizer is based on a dictionary in the JDK! <shudder>
  • Other invisibles exist, like tags (U+E0000 - U+E007F), but we are already far off into the weeds with all the variation selectors. I haven't noticed tags or other invisibles regularly showing up when analyzing samples, so I'm going to ignore them for now, so I don't end up implementing my own version of ICU normalization... though, maybe..... naw...

I added "Arabic letter mark" in globo_norm, because almost everyone needs it almost all of the time.

I added the invis_cleanup char filter to AnalyzerBuilder. It converts zero-width spaces to spaces, and deletes soft hyphens, various joiners and non-joiners, zero-width non-breaking spaces, various bidi marks, invisible math operators, and all 256 variation selectors. I added it to Chinese, Hebrew, Sudachi (Japanese), and Nori (Korean). Not every language that invis_cleanup is applied to needs all of it, but they all need most of it, and having one version is much easier.

Kuromoji only needed to split on zero-width spaces, so it got a tiny new mapping char filter.
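A one-rule filter of this kind might look roughly like the following, written as a Python dict that serializes to Elasticsearch analysis settings. The filter name zwsp_to_space is hypothetical, and the use of \u escapes inside the rule string is an assumption about how the mapping char filter parses its rules; the actual CirrusSearch definition may differ.

```python
import json

# Hypothetical filter name; "mapping" is the Elasticsearch char filter type.
# The single rule maps a zero-width space (U+200B) to a plain space (U+0020).
zwsp_to_space = {
    "zwsp_to_space": {
        "type": "mapping",
        "mappings": ["\\u200B=>\\u0020"],
    }
}

if __name__ == "__main__":
    # Print the fragment as it would appear under "char_filter" in settings.
    print(json.dumps(zwsp_to_space, indent=2))
```

Keeping this separate from invis_cleanup means Kuromoji's handling of the other invisibles, which was already correct, is left untouched.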

I added remove_empty to Chinese, since the SmartCN tokenizer can still generate tokens with odd characters (e.g., tag characters) that get normalized to empty tokens by ICU normalization or folding.
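To show why an empty-token filter is still useful after the char filter, here is a small Python sketch. The fold function is only a crude stand-in for ICU normalization or folding (it drops Unicode format characters, category Cf, which includes tag characters), and both function names are hypothetical.

```python
import unicodedata


def fold(token: str) -> str:
    """Crude stand-in for ICU normalization/folding: drop format (Cf) chars."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Cf")


def remove_empty(tokens: list[str]) -> list[str]:
    """Drop tokens that became empty after normalization."""
    return [t for t in tokens if t]


if __name__ == "__main__":
    tag_a = chr(0xE0041)  # TAG LATIN CAPITAL LETTER A, an invisible tag char
    raw_tokens = ["中文", tag_a, "字"]  # pretend the tokenizer emitted these
    folded = [fold(t) for t in raw_tokens]
    print(folded)                # the tag-only token is now an empty string
    print(remove_empty(folded))  # the empty token is dropped
```

The filter has to run after normalization, since the token only becomes empty once the invisible characters have been stripped from it.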