User:TJones (WMF)/Notes/Enable Yiddish Ligatures

From mediawiki.org

April 2024 — See TJones_(WMF)/Notes for other projects. See also T362501. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Yiddish Ligatures

Yiddish uses ligatures (double-vav װ) (vav-yod ױ) (double-yod ײ) that, in a modern proportional font, can be indistinguishable from the components of the ligatures next to each other (וו וי יי), though in a fixed-width font they can be easier to tell apart: (װ ױ ײ) vs (וו וי יי). To a search engine, of course, they are entirely different strings.
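To make the "different for search" point concrete, here is a minimal Python sketch that prints the code points behind each ligature and its look-alike sequence. The helper name `describe` and the `LIGATURES` table are mine, for illustration only; the code points are the standard Unicode ones.

```python
import unicodedata

# Each Yiddish ligature paired with the visually similar two-letter sequence.
LIGATURES = {
    "\u05F0": "\u05D5\u05D5",  # double-vav vs. vav + vav
    "\u05F1": "\u05D5\u05D9",  # vav-yod    vs. vav + yod
    "\u05F2": "\u05D9\u05D9",  # double-yod vs. yod + yod
}

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for ligature, components in LIGATURES.items():
    # The strings render alike but never compare equal.
    print(describe(ligature), "vs", describe(components),
          "equal:", ligature == components)
```

Standard Unicode normalization does not help here either: as far as I can tell, these three ligatures have no decomposition mapping, so even NFKC leaves them as single characters.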

In reading up on the ligatures, I found another ligature (yod-yod-patah ײַ) that has several variants, one using a ligature from above (double-yod + patah ײַ), one with separate characters (yod + yod + patah ייַ), and a less common variant with the patah in the middle (yod + patah + yod יַי). It looks like icu_normalizer already converts the single-character form (ײַ) to one using the double-yod ligature (ײַ).

Even using an insource regex query, I can't separate yod-yod-patah (ײַ) and double-yod + patah (ײַ); there may be another level of mapping happening in the browser or another layer of the MediaWiki software. In fact, typing them here, I get double-yod + patah (ײַ) for both! Regex searches for either return 1392 results on Yiddish Wikipedia, while the more common decomposition yod + yod + patah (ייַ) gets 629 results, and the less common yod + patah + yod (יַי) gets 19. So all typable variants are in use.
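One plausible explanation for the merger (an assumption on my part, not something verified against the browser or MediaWiki) is Unicode NFC normalization: the single-character yod-yod-patah (U+FB1F) canonically decomposes to double-yod + patah and is not recomposed. The sketch below uses Python's `unicodedata` to show which of the variants NFC changes; the constant names are made up for illustration.

```python
import unicodedata

PRECOMPOSED = "\uFB1F"                 # single-character yod-yod-patah
LIGATURE_PLUS_PATAH = "\u05F2\u05B7"   # double-yod ligature + patah
SEPARATE = "\u05D9\u05D9\u05B7"        # yod + yod + patah
PATAH_MIDDLE = "\u05D9\u05B7\u05D9"    # yod + patah + yod

def nfc(text):
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)

for label, variant in [
    ("precomposed", PRECOMPOSED),
    ("ligature + patah", LIGATURE_PLUS_PATAH),
    ("separate", SEPARATE),
    ("patah in the middle", PATAH_MIDDLE),
]:
    normalized = nfc(variant)
    print(label, [f"U+{ord(c):04X}" for c in normalized],
          "changed" if normalized != variant else "unchanged")
```

If some layer applies NFC to input, the first two variants would be identical by the time they reach the index or the regex engine, while the other two would stay distinct, which matches the result counts above.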

To solve this problem, I added a character filter mapping the ligatures to the component pieces, taking yod + yod + patah (ייַ) as the canonical decomposition of variants of yod-yod-patah (ײַ).
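As a rough illustration of what such a character filter does, here is a small Python emulation. The rule table is inferred from the description above and is an assumption (the real filter's rules are not listed in these notes), and `expand_yiddish_ligatures` is a hypothetical name. It makes a single left-to-right pass and prefers the longest matching source, which is how I understand mapping character filters to behave.

```python
import re

# Assumed rules: each source sequence maps to its component letters, with
# yod + yod + patah as the canonical form for the yod-yod-patah variants.
YIDDISH_LIGATURE_MAP = {
    "\uFB1F": "\u05D9\u05D9\u05B7",              # yod-yod-patah (single char)
    "\u05D9\u05B7\u05D9": "\u05D9\u05D9\u05B7",  # yod + patah + yod
    "\u05F0": "\u05D5\u05D5",                    # double-vav
    "\u05F1": "\u05D5\u05D9",                    # vav-yod
    "\u05F2": "\u05D9\u05D9",                    # double-yod
}

# Longest sources first, so multi-character rules win over shorter ones.
_PATTERN = re.compile(
    "|".join(re.escape(src)
             for src in sorted(YIDDISH_LIGATURE_MAP, key=len, reverse=True))
)

def expand_yiddish_ligatures(text):
    """Replace each matched source in one pass; output is not re-scanned."""
    return _PATTERN.sub(lambda m: YIDDISH_LIGATURE_MAP[m.group(0)], text)

CANONICAL = "\u05D9\u05D9\u05B7"  # yod + yod + patah
for variant in ["\uFB1F", "\u05F2\u05B7", "\u05D9\u05D9\u05B7", "\u05D9\u05B7\u05D9"]:
    print(expand_yiddish_ligatures(variant) == CANONICAL)
```

Note that double-yod + patah needs no rule of its own: expanding the double-yod ligature leaves the patah in place after the two yods.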

I made the mapping global, since the difference is usually invisible to the reader or searcher, and they may have copied something from some other source without realizing it. In testing, examples of the Yiddish ligatures showed up in my small samples from Alemannic, German, Hebrew, and Russian Wikipedias.

Of course, there were lots of (visually identical) mergers in my Yiddish Wikipedia sample, as expected and hoped for. Hundreds for most of the variants, but only a (non-zero!) handful for the oddball yod + patah + yod (יַי), and zero for yod-yod-patah (which may be untypable in a browser?). I'm leaving the yod-yod-patah mapping in place, though, as a backstop just in case it ever shows up. (I can use it on the command line, so it's possible to get it to Elasticsearch in some circumstances—might as well do the right thing if we see it.)

Limited Usefulness of limited_mapping

Side Note: I wanted to add the new Yiddish ligatures to the globo_norm character filter, rather than adding a new filter, to save on the overhead of one more filter. globo_norm is a limited_mapping filter, which I introduced as part of the harmonization project to be a slightly faster mapping for one-character to one-character maps. Since Yiddish ligatures map one-to-many and many-to-many characters, I had to convert globo_norm to a plain mapping.
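The constraint can be sketched as a simple check over a rule table. This is only an illustration under the assumption that "one-character to one-character" means exactly one code point on each side; `needs_plain_mapping` and both rule tables are hypothetical, and the first table is a made-up stand-in rather than the actual contents of globo_norm.

```python
def needs_plain_mapping(rules):
    """True if any rule is not a one-code-point to one-code-point map."""
    return any(len(src) != 1 or len(dst) != 1 for src, dst in rules.items())

# Hypothetical one-to-one rules (micro sign to Greek mu), for illustration.
one_to_one_rules = {"\u00B5": "\u03BC"}

# Adding Yiddish ligature rules introduces one-to-many and many-to-many maps.
with_yiddish = dict(one_to_one_rules)
with_yiddish["\u05F0"] = "\u05D5\u05D5"                    # one-to-many
with_yiddish["\u05D9\u05B7\u05D9"] = "\u05D9\u05D9\u05B7"  # many-to-many

print(needs_plain_mapping(one_to_one_rules))
print(needs_plain_mapping(with_yiddish))
```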

To my surprise, it was ever so slightly faster in my tests (loading ~90MB of Yiddish or English text with the new regular mapping filter enabled). Ugh. Timing tests are noisy, to be sure, but maybe it's not worth the complexity of using limited_mapping filters anymore. Though that's a problem for another day.