User:TJones (WMF)/Notes/Language Analyzer Harmonization Notes

May 2023 — See TJones (WMF)/Notes for other projects. See also T219550. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro, Goals, Caveats

The goal of bringing language analyzers "into harmony" is to make as many of the non–language-specific elements of the analyzers the same as possible. Some split words on underscores and periods, some don't. Some split CamelCase words and some don't. Some use ASCII folding, some use ICU folding, and some don't use either. Some preserve the original word and have two outputs when folding, and some don't. Some use the ICU tokenizer and some use the standard tokenizer (for no particular reason—there are good reasons to use the ICU, Hebrew, Korean, or Chinese tokenizers in particular cases). When there is no language-specific reason for these differences, it's confusing, and we clearly aren't using analysis best practices everywhere.

My design goal is to have all of the relevant upgrades made by default across all language analysis configurations, with only the exceptions having to be explicitly configured.

Our performance goal is to reduce zero-results rate and/or increase the number of results returned for 75% of languages with relevant queries. This goal comes with some caveats, left out of the initial statement to keep it reasonably concise.

  • "All wikis" is, in effect, "all reasonably active wikis"—if a wiki has only had twelve searches last month, none with apostrophes, it's hard to meaningfully measure anything. More details in "Data Collection" below.
    • I'm also limiting my samples to Wikipedias because they have the most variety of content and queries, and to limit testing scope, allowing more languages to be included.
    • I'm going to ignore wikis with unchanged configs (some elements are already deployed on some wikis), since they will have approximately 0% change in results (there's always a bit of noise).
  • "Relevant" queries are those that have the feature being worked on. So, I will have a collection of queries with apostrophe-like characters in them to test improved apostrophe handling, and a collection of queries with acronyms to test better acronym processing. I'll still test general query corpora to get a sense of the overall impact, and to look for cases where queries without the feature being worked on still get more matches (for example, searching for NASA should get more matches to N.A.S.A. in articles).
    • I'm also applying my usual filters (used for all the unpacking impact analyses) to queries, mostly to filter out porn and other junk. For example, I don't think it is super important whether the query s`wsdfffffffsf actually gets more results once we normalize the backtick/grave accent to an apostrophe.
    • Smaller/lower-activity wikis may get filtered out for having too few relevant queries for a given feature.
  • We are averaging rates across wikis so that wiki size isn't a factor (and neither is sample rate—so, I can oversample smaller wikis without having to worry about a lot of bookkeeping).

Data Collection

I started by including all Wikipedias with 10,000 or more articles. I also gathered the number of active editors and the number of full-text queries (with the usual anti-bot filters) for March 2023. I dropped those with fewer than 700 monthly queries and fewer than 50 active editors. My original ideas for thresholds had been ~1000 monthly queries and ~100 active editors, but I didn't want or need a super sharp cut-off. Dropping wikis with very low active editor counts meant fewer samples to gather at the query-gathering step, which is somewhat time-consuming. Limiting by query count also meant less work at the next step of filtering queries, and all later steps, too.

I ran my usual query filters (as mentioned above), and also dropped wikis with fewer than 700 unique queries after filtering. That left 90 Wikipedias to work with. In order of number of unique filtered monthly queries, they are: English, Spanish, French, German, Russian, Japanese, Chinese, Italian, Portuguese, Polish, Arabic, Dutch, Czech, Korean, Indonesian, Turkish, Persian, Vietnamese, Swedish, Hebrew, Ukrainian, Igbo, Finnish, Hungarian, Romanian, Greek, Norwegian, Catalan, Hindi, Thai, Simple English, Danish, Bangla, Slovak, Bulgarian, Swahili, Croatian, Serbian, Tagalog, Slovenian, Lithuanian, Georgian, Tamil, Malay, Uzbek, Estonian, Albanian, Azerbaijani, Latvian, Armenian, Marathi, Burmese, Malayalam, Afrikaans, Urdu, Basque, Mongolian, Telugu, Sinhala, Kazakh, Macedonian, Khmer, Kannada, Bosnian, Egyptian Arabic, Galician, Cantonese, Icelandic, Gujarati, Central Kurdish, Serbo-Croatian, Nepali, Latin, Kyrgyz, Belarusian, Esperanto, Norwegian Nynorsk, Assamese, Tajik, Punjabi, Oriya, Welsh, Asturian, Belarusian-TaraĆĄkievica, Scots, Luxembourgish, Irish, Alemannic, Breton, & Kurdish.

  • Or, in language codes: en, es, fr, de, ru, ja, zh, it, pt, pl, ar, nl, cs, ko, id, tr, fa, vi, sv, he, uk, ig, fi, hu, ro, el, no, ca, hi, th, simple, da, bn, sk, bg, sw, hr, sr, tl, sl, lt, ka, ta, ms, uz, et, sq, az, lv, hy, mr, my, ml, af, ur, eu, mn, te, si, kk, mk, km, kn, bs, arz, gl, zh-yue, is, gu, ckb, sh, ne, la, ky, be, eo, nn, as, tg, pa, or, cy, ast, be-tarask, sco, lb, ga, als, br, ku.

I sampled 1,000 unique filtered queries from each language (except for those that had fewer than 1000).

I also pulled 1,000 articles from each Wikipedia to use for testing.

I used a combined corpus of the ~1K queries and the 1K articles for each language to test analysis changes. This allows me to see interactions between words/characters that occur more in queries and words/characters that occur more in articles.

Relevant Query Corpora

For each task, I plan to pull a corpus of "relevant" queries for each language for before-and-after impact assessment, by grepping for the relevant characters. For each corpus, I'll also do some preprocessing to remove queries that are unchanged by the analysis upgrades being made.

For example, when looking at apostrophe-like characters, ICU folding already converts typical curly quotes (‘’) to straight quotes ('), so for languages with ICU folding enabled, curly quotes won't be treated any differently, so I plan to remove those queries as "irrelevant". Another example is reversed prime (‵), which causes a word break with the standard tokenizer; apostrophes are stripped at the beginning or ending of words, so reversed prime at the edge of a word isn't actually treated differently from an apostrophe in the same place—though the reasons are very different.

For very large corpora (≫1000, for sure), I'll probably sample the corpus down to a more reasonable size after removing "irrelevant" queries.

I'm going to keep (or sample) the "irrelevant" queries (e.g., words with straight apostrophes or typical curly quotes handled by ICU folding) for before-and-after analysis, because they may still get new matches on words in wiki articles that use the less-common characters, though there are often many, many fewer such words on-wiki—because the WikiGnomes are always WikiGnoming!

Another interesting wrinkle is that French and Swedish use ICU folding with "preserve original", so that both the original form and folded form are indexed (e.g., l’apostrophe is indexed as both l’apostrophe and l'apostrophe). This doesn't change matching, but it may affect ranking. I'm going to turn off the "preserve original" filter for the purpose of removing "irrelevant" queries, since we are focused on matching here.

Some Observations

After filtering porn and likely junk queries and uniquifying queries, the percentage of queries remaining generally ranged from 94.52% (Icelandic—so many unique queries!) to 70.58% (Persian), with a median of 87.31% (Simple English), and a generally smooth distribution across that range.

There were three outliers:

  • Swahili (57.51%) and Igbo (37.56%) just had a lot of junk queries.
  • Vietnamese was even lower at 30.03%, with some junk queries but also an amazing number of repeated queries, many of which are quite complex (not like everyone is searching for just famous names or movie titles or something "simple"). A few queries I looked up on Google seem to exactly match titles or excerpts of web pages. I wonder if there is a browser tool or plugin somewhere that is automatically doing wiki searches based on page content.

Re-Sampling & Zero-Results Rate

I found a bug in my filtering process, which did not properly remove certain very long queries that get 0 results, which I classify as "junk". These accounted for less than 1% of any given sample, but it was still weird to have many samples ranging from 990–999 queries instead of the desired 1,000. Since I hadn't used my baseline samples for anything at that point, I decided to re-sample them. This also gave me an opportunity to compare zero-results rates (ZRR) between the old and new samples.

In the case of very small query corpora, the old and new samples may largely overlap, or even be identical. (For example, if there are only 800 queries to sample from, my sample "of 1000" is going to include all of them, every time I try to take a sample.) Since this ZRR comparison was not the point of the exercise, I'm just going to throw out what I found as I found it, and not worry about any sampling biases—though they obviously include overlapping samples, and potential effects of the original filtering error.

The actual old/new ZRR for these samples ranged from 6.3%/6.2% (Japanese) to 75.4%/76.1% (Igbo—wow!). The zero-results rate differences from the old to the new sample ranged from -4.2% (Gujarati, 64.3% vs 60.1%) to +5.6% (Dutch, 22.1% vs 27.7%), with a median of 0.0% and mean of -0.2%. Proportional rates ranged from -19.9% (Galician, 17.5% vs 14.6%) to +20.2% (Dutch, 22.1% vs 27.7%, again), with a median of 0.0%, and a mean of -0.5%.

Looking at the graph, there are some minor outliers, but nothing ridiculous, which is nice to see.

"Infrastructure"

I've built up some temporary "infrastructure" to support impact analysis of the harmonization changes. Since every or almost every wiki will need to be reindexed to enable harmonization changes, timing the "before and after" query analyses for the 90 sampled wikis would be difficult.

Instead, I've set up a daily process that runs all 90 samples each day. There's an added bonus of seeing the daily variation in results without any changes.

I will also pull relevant sub-samples for each of the features (apostrophes, acronyms, word_break_helper, etc.) being worked on and run them daily as well.

There's a rather small chance of having a reindexing finish while a sample is being run, so that half the sample is "before" and half is "after". If that happens, I can change my monitoring cadence to every other day for that sample for comparison's sake and it should be ok.

Apostrophes (T315118)

There are some pretty common apostrophe variations that we see all the time, particularly the straight vs curly apostrophes—e.g., ain't vs ain’t. And of course people (or their software) will sometimes curl the apostrophe the wrong way—e.g., ain‘t. But lots of other characters regularly (and some irregularly) get used as apostrophes, or apostrophes get used for them—e.g., Hawai'i or Hawai’i or Hawai‘i when the correct Hawaiian letter is the okina: Hawaiʻi.

A while back, we worked on a ticket (T311654) for the Nias Wikipedia to normalize some common apostrophe-like variants, and at the time I noted that we should generalize that across languages and wikis as much as possible. ICU normalization and ICU folding already do some of this (see the table below)—especially for the usual ‘curly’ apostrophes/single quotes, but those cases are common enough that we should take care of them even when the ICU plugin is not available. It'd also be nice if the treatment of these characters was more consistent across languages, and not dependent on the specific tokenizer and filters configured for a language.

There are many candidate "apostrophe-like" characters. The list below is a distillation of the list of Unicode Confusables for apostrophe, characters I had already known were potential candidates from various Phab tickets and my own analysis experience (especially working on Turkish apostrophes), and the results of data-mining for apostrophe-like contexts (e.g., Hawai_i).

Apostrophe-Like Candidate Characters
x'x Desc. #q #wiki samp UTF Example std tok (is) icu tok (my) heb tok (he) nori tok (ko) smart cn (zh) icu norm (de) icu fold (de) icu norm (wsp) icu norm + fold (wsp) icu fold (wsp) Nias apos-like trans­itive apos is x-like? final fold
a‵x reversed prime 0 0 U+2035 Ocean‵s split split split split split/keep → , → ' → ' – + – +
bꞌx Latin small letter saltillo 0 0 U+A78C Miꞌkmaq split/‌keep → ' → ' → ' + +
c‛x single high-reversed-9 quo­tation mark 0 1 U+201B Het‛um split split → ' split split/‌keep → , → ' → ' + +
dߎx N'ko high tone apos­trophe 1 0 U+07F4 ĐżĐ°ĐŒßŽŃŃ‚ĐșĐž split/‌keep split/‌keep split/‌keep delete delete delete – – –
eáżŸx Greek dasia 1 2 U+1FFE CháżŸen split split split split split/keep → [ ̔] (sp + U+314) → sp delete – – –
fÊœx modi­fier letter reversed comma 1 8 U+02BD GeÊœez split/‌keep delete delete delete + +
gáŸżx Greek psili 1 11 U+1FBF láŸżancienne split split split split split/keep → [ ̓] (sp + U+313) → sp delete – – –
hៜx Greek koronis 3 3 U+1FBD Maៜlaf split split split split split/keep → [ ̓] (sp + U+313) → sp delete – – –
i՚x Arme­nian apos­trophe 8 1 U+055A Nobatia՚s split split split split split/‌keep – + +
j｀x fullwidth grave accent 11 0 U+FF40 JOLLY｀S split split split split split/keep → , → ` → ` delete – + +
k՝x Armenian comma 12 4926 U+055D People՝s split split split split split/keep → [ ́] (sp + U+301) → sp – – –
lÊŸx modi­fier letter right half ring 18 90 U+02BE BeÊŸer split/‌keep delete delete delete ✓ + +
mˈx modi­fier letter vert­ical line 21 1041 U+02C8 Meˈyer split/‌keep delete delete delete – – –
n＇x fullwidth apostrophe 28 0 U+FF07 China＇s → ' split split/keep → , → ' → ' → ' → ' + +
oÊčx modi­fier letter prime 63 16 U+02B9 KuzÊčmina split/‌keep delete delete delete + +
pÊżx modi­fier letter left half ring 71 166 U+02BF BaÊżath split/‌keep delete delete delete ✓ + +
qâ€Čx prime 93 1133 U+2032 Peopleâ€Čs split split split split split/‌keep → , → ' → ' + +
rˊx modi­fier letter acute accent 107 0 U+02CA kāˊvya split/‌keep delete delete delete – – –
sˋx modi­fier letter grave accent 118 0 U+02CB Sirenˋs split/‌keep delete delete delete + +
t΄x Greek tonos 132 856 U+0384 Adelberg΄s split split split split split/‌keep delete – – –
uÊŒx modi­fier letter apos­trophe 154 1665 U+02BC BahĂĄÊŒĂ­ split/‌keep delete delete delete ✓ + +
vŚłx Hebrew punc­tuation geresh 389 54 U+05F3 AlzheimerŚłs split/‌keep → ' split split/‌keep → ' → ' → ' – + +
wʻx modi­fier letter turned comma 824 14734 U+02BB Chʻeng split/‌keep delete delete delete + +
xÂŽx acute accent 2769 229 U+00B4 CeteraÂŽs split split split split split/keep → , → [ ́] (sp + U+301) → sp delete + +
y`x grave accent 2901 862 U+0060 she`s split split split split split/‌keep → , delete delete ✓ + +
z‘x left single quo­tation mark 3571 4977 U+2018 Hawai‘i → ' split split/‌keep → , → ' → ' → ' ✓ + +
za’x right single quo­tation mark 35333 18472 U+2019 Angola’s → ' split split/‌keep → , → ' → ' → ' ✓ + +
zb'x apos­trophe 114116 148698 U+0027 apostrophe's split split/‌keep → , == ==
zcŚ™x Hebrew letter yod 142513 261471 U+05D9 ArchŚ™olo­giques split/‌keep split/‌keep split/‌keep – – –

Key

  • x'x—It's hard to visually distinguish all the vaguely apostrophe-like characters on-screen, so after ordering them, I put a letter (or two) before them and an x after them. The letter before makes it easier to see where each one is/was when looking at the analysis output, and the x after doesn't seem to be modified by any of the analyzers I'm working with. And x'x is an easy shorthand to refer to a character without having to specify its full name.
    • Also, apostrophe-like characters sometimes get treated differently at the margins of a word. (Schrödinger's apostrophe: inside a word it's an apostrophe, at the margins, it's a single quote.) Putting it between two alpha characters gives it the most apostrophe-like context.
  • Desc.—The Unicode description of the character
  • #q—The number of occurrences of this character (in any usage) in my 90-language full query sample. Samples can be heavily skewed: Hebrew letter yod occurs a lot in Hebrew queries—shocker! Big wiki samples are larger, so English is over-represented. Primary default sort key.
  • #wiki samp—The number of occurrences of this character in my 90-language 1K Wikipedia sample. Samples can be skewed by language (as with Hebrew yod above), but less so by sample size. All samples are 1K articles, but some wikis have longer average articles. Secondary default sort key.
  • UTF—UTF codepoint for the character. Tertiary default sort key.
  • Example—An actual example of the character being used in an apostrophe-like way. Most come from English Wikipedia article or query samples. Others I had to look harder to find—in other samples, or using on-wiki search.
    • Just because a word or a few words exist with the character used in an apostrophe-like way doesn't mean it should be treated as an apostrophe. When looking for words matching the Hawai_i pattern, I found Hawai*i, Hawai,i, and Hawai«i, too. I don't think anyone would suggest that asterisks, commas, or guille­mets should be treated as apostrophes.
    • I never found a real example of Hebrew yod being used as an apostrophe. I only found two instances of it embedded in a Latin-script word (e.g. ArchŚ™ologiques), and there it looked like an encoding error, since it has clearly replaced Ă©. I fixed both of those (through my volunteer account).
    • I really did find an example of apostrophe's using a real apostrophe!
  • std tok (is)—What does the standard tokenizer (exemplified by the is/Icelandic analyzer) do to this character?
  • icu tok (my)—What does the ICU tokenizer (exemplified by the my/Myanmar analyzer) do to this character?
  • heb tok (he)—What does the HebMorph tokenizer (exemplified by the he/Hebrew analyzer) do to this character?
  • nori tok (ko)—What does the Nori tokenizer (exemplified by the ko/Korean analyzer) do to this character?
  • smart cn (zh)—What does the SmartCN tokenizer (exemplified by the zh/Chinese analyzer) do to this character?
  • icu norm (de)—What does the ICU normalizer filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
  • icu fold (de)—What does the ICU folding filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
  • icu norm (wsp)—What does the ICU normalizer filter do to this character, after going through a whitespace tokenizer? (The whitespace tokenizer just splits on spaces, tabs, newlines, etc. There's no language for this, so it was a custom config.)
  • icu norm + fold (wsp)—What does the ICU normalizer filter + the ICU folding filter do to this character, after going through a whitespace tokenizer? (We never enable the ICU folding filter without enabling ICU normalization first—so this is a more "typical" config.)
  • icu fold (wsp)—What does the ICU folding filter do to this character, after going through a whitespace tokenizer, without ICU normalization first?
  • Tokenizer and Normalization Sub-Key
    • split means the tokenizer splits on this character—at least in the context of being between Latin characters. Characters belonging to a specific non-Latin script always get split from surrounding Latin characters by the ICU tokenizer, because it splits on script changes. (General punctuation doesn't belong to a specific script.) So, the standard tokenizer splits a‵x to a and x.
    • split/keep means the tokenizer splits before and after the character, but keeps the character. So, the ICU tokenizer splits dߎx to d, ߎ, and x.
    • → ? means the tokenizer or filter converts the character to another character. So, the HebMorph tokenizer tokenizes c‛x as c'x (with an apostrophe).
      • The most common conversion is to an apostrophe. The SmartCN tokenizer converts most punctuation to a comma. The ICU normalizer converts some characters to space plus another character (I don't get the reasoning, so I wonder if this might be a bug); I've put those in square brackets, though the space doesn't really show up, and put a mini-description in parens, e.g. "(sp + U+301)". Fullwidth grave accent gets normalized to a regular grave accent by ICU normalization.
      • split/keep → ,—which is common in the SmartCN tokenizer column—means that text is split before and after the character, the character is not deleted, but it is converted to a comma. So, the SmartCN tokenizer tokenizes a‵x as a + , + x.
    • delete means the tokenizer or filter deletes the character. So, ICU folding converts dߎx to dx.
  • Nias—For reference, these are the characters normalized specifically for nia/Nias in Phab ticket T311654.
  • apos-like—After reviewing the query and Wikipedia samples, this character does seem to commonly be used in apostrophe-like ways. (In cases of the rarer characters, like bꞌx, I had to go looking on-wiki for examples.)
    • + means it is, – means it isn't, == means this is the row for the actual apostrophe!
  • transitive—This character is not regularly used in an apostrophe-like way, but it is normalized by a tokenizer or filter into a character that is regularly used in an apostrophe-like way.
  • apos is x-like?—While the character is not used in apostrophe-like way (i.e., doesn't appear in Hawai_i, can_t, don_t, won_t, etc.), apostrophes are used where this character should be.
    • + means it is, – means it isn't, blank means I didn't check (because it was already apostrophe-like or transitively apostrophe-like).
  • final fold—Should this character get folded to an apostrophe by default? If it is apostrophe-like, transitively apostrophe-like, or apostrophes get used where it gets used—i.e., a + in any of the three previous columns—then the answer is yes (+).

Character-by-Character Notes

  • a‵x (reversed prime): This character is very rarely used anywhere, but it is normalized to apostrophe by ICU folding.
  • bꞌx (Latin small letter saltillo): This is used in some alphabets to represent a glottal stop, and apostrophes are often used to represent a glottal stop, so they are mixed up. In the English Wikipedia article for Mi'kmaq (apostrophe in the title), miꞌkmaq (with saltillo) is used 144 times, while mi'kmaq (with apostrophe) is used 78 times—on the same page!
  • c‛x (single high-reversed-9 quotation mark): used as a reverse quote and an apostrophe.
  • dߎx (N'ko high tone apostrophe): This seems to be an N'ko character almost always used for N'ko things. It's uncommon off the nqo/N'ko Wikipedia, and on the nqo/N'ko Wikipedia the characters do not seem to be interchangeable.
  • eáżŸx (Greek dasia): A Greek character almost always used for Greek things.
  • fÊœx (modifier letter reversed comma): Commonly used in apostrophe-like ways.
  • gáŸżx (Greek psili): A Greek character almost always used for Greek things.
  • hៜx (Greek koronis): A Greek character almost always used for Greek things.
  • i՚x (Armenian apostrophe): An Armenian character almost always used for Armenian things, esp. in Western Armenian—however, the non-Armenian apostrophe is often used for the Armenian apostrophe.
  • j｀x (fullwidth grave accent): This is actually pretty rare. It is mostly used in kaomoji, like (*Žω*), and for quotes. But it often gets normalized to a regular grave accent, so it should be treated like one, i.e., folded to an apostrophe.
    • It's weird that there's no fullwidth acute accent in Unicode.
  • k՝x (Armenian comma): An Armenian character almost always used for Armenian things, and it generally appears at the edge of words (after the words), so it would usually be stripped as an apostrophe, too.
  • lÊŸx (modifier letter right half ring): On the Nias list, and frequently used in apostrophe-like ways.
  • mˈx (modifier letter vertical line): This is consistently used for IPA transcriptions, and apostrophes don't show up there very often.
  • n＇x (fullwidth apostrophe): Not very common, but does get normalized to a regular apostrophe by ICU normalization and ICU folding, so why fight it?
  • oÊčx (modifier letter prime): Consistently used on-wiki as palatalization in Slavic names, but apostrophes are used for that, too.
  • pÊżx (modifier letter left half ring): On the Nias list, and frequently used in apostrophe-like ways.
  • qâ€Čx (prime): Consistently used for coordinates, but so are apostrophes.
  • rˊx (modifier letter acute accent): Used for bopomofo to mark tone; only occurs in queries from Chinese Wikipedia.
  • sˋx (modifier letter grave accent): Used as an apostrophe in German and Chinese queries.
  • t΄x (Greek tonos): A Greek character almost always used for Greek things.
  • uÊŒx (modifier letter apostrophe): Not surprising that an apostrophe variant is used as an apostrophe.
  • vŚłx (Hebrew punctuation geresh): A Hebrew character almost always used for Hebrew things... however, it is converted to apostrophe by both the Hebrew tokenizer and ICU folding.
  • wÊ»x (modifier letter turned comma): Often used as an apostrophe.
  • xÂŽx (acute accent): Often used as an apostrophe.
  • y`x (grave accent): Often used as an apostrophe.
  • z‘x (left single quotation mark): Often used as an apostrophe.
  • za’x (right single quotation mark): The curly apostrophe, so of course it's used as an apostrophe.
  • zb'x (apostrophe): The original!
  • zcŚ™x (Hebrew letter yod): A Hebrew character almost always used for Hebrew things. It has the most examples because it is an actual Hebrew letter. It showed up on the confusables list, but is never used as an apostrophe. The only examples of apostrophe-like use are encoding issues: PalŚ™orient, ArchŚ™ologiques → PalĂ©orient, ArchĂ©ologiques

Apostrophe-Like Characters, The Official Listℱ

The final set of 19 apostrophe-like characters to be normalized is [`ÂŽÊčÊ»ÊŒÊœÊŸÊżË‹ŐšŚłâ€˜â€™â€›â€Č‵ꞌ＇｀]—i.e.:

  • ` (U+0060): grave accent
  • ÂŽ (U+00B4): acute accent
  • Êč (U+02B9): modifier letter prime
  • Ê» (U+02BB): modifier letter turned comma
  • ÊŒ (U+02BC): modifier letter apostrophe
  • Êœ (U+02BD): modifier letter reversed comma
  • ÊŸ (U+02BE): modifier letter right half ring
  • Êż (U+02BF): modifier letter left half ring
  • ˋ (U+02CB): modifier letter grave accent
  • ՚  (U+055A): Armenian apostrophe
  • Śł (U+05F3): Hebrew punctuation geresh
  • ‘ (U+2018): left single quotation mark
  • ’ (U+2019): right single quotation mark
  • ‛ (U+201B): single high-reversed-9 quotation mark
  • â€Č (U+2032): prime
  • ‵ (U+2035): reversed prime
  • ꞌ (U+A78C): Latin small letter saltillo
  • ＇ (U+FF07): fullwidth apostrophe
  • ｀ (U+FF40): fullwidth grave accent

Other Observations

  • Since ICU normalization converts some of the apostrophe-like characters above to ́ (U+301, combining acute accent), ̓ (U+313, combining comma above), and ̔ (U+314, combining reversed comma above), I briefly investigated those, too. They are all used as combining accent characters and not as separate apostrophe-like characters. The combining commas above are both used in Greek, which makes sense, since they are on the list because Greek accents are normalized to them.
  • In French examples, I sometimes see 4 where I'd expect an apostrophe, especially in all-caps. Sure enough, looking at the AZERTY keyboard you can see that 4 and the apostrophe share a key!
  • The hebrew_lemmatizer in the Hebrew analyzer often generates multiple output tokens for a given input token—this is old news. However, looking at some detailed examples, I noticed that sometimes the multiple tokens (or some subset of the multiple tokens) are the same! Indexing two copies of a token on top of each other doesn't seem helpful—and it might skew token counts for relevance.

apostrophe_norm

The filter for Nias that normalized some of the relevant characters was called apostrophe_norm. Since the new filter is a generalization of that, it is also called apostrophe_norm. There's no conflict with the new generic apostrophe_norm, as there's no longer a need for a Nias-specific filter, or any Nias-specific config at all.
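
For illustration, a generic mapping character filter covering the 19 characters in the list above might look something like the sketch below, written in the PHP-snippet style used further down this page (the variable name and exact escaping are illustrative; the real definition may differ in detail):

// Sketch of a generic apostrophe_norm char filter: map each apostrophe-like
// character (by codepoint) to a plain apostrophe (U+0027).
$charFilters['apostrophe_norm'] = [
    'type' => 'mapping',
    'mappings' => [
        '\u0060=>\u0027', // grave accent
        '\u00B4=>\u0027', // acute accent
        '\u02B9=>\u0027', // modifier letter prime
        '\u02BB=>\u0027', // modifier letter turned comma
        '\u02BC=>\u0027', // modifier letter apostrophe
        '\u02BD=>\u0027', // modifier letter reversed comma
        '\u02BE=>\u0027', // modifier letter right half ring
        '\u02BF=>\u0027', // modifier letter left half ring
        '\u02CB=>\u0027', // modifier letter grave accent
        '\u055A=>\u0027', // Armenian apostrophe
        '\u05F3=>\u0027', // Hebrew punctuation geresh
        '\u2018=>\u0027', // left single quotation mark
        '\u2019=>\u0027', // right single quotation mark
        '\u201B=>\u0027', // single high-reversed-9 quotation mark
        '\u2032=>\u0027', // prime
        '\u2035=>\u0027', // reversed prime
        '\uA78C=>\u0027', // Latin small letter saltillo
        '\uFF07=>\u0027', // fullwidth apostrophe
        '\uFF40=>\u0027', // fullwidth grave accent
    ],
];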

I tested the new apostrophe_norm filter on a combination of ~1K general queries and 1K Wikipedia articles per language (across the 90 harmonization languages). The corpus for each language was run through the analysis config for that particular language. (Languages that already have ICU folding, for example, already fold typical ‘curly’ quotes, so there'd be no change for them, but for other languages there would be.)

I'm not going to give detailed notes on all 90 languages, just note general trends and highlight some interesting examples.

  • In general, there are lots of names and English, French, & Italian words with apostrophes everywhere (OÂŽReilly, R`n`R, d‘Europe, dell’arte).
  • There are also plenty of native apostrophe-like characters in some languages; the typical right curly apostrophe (’) is by far the most common. (e.g., àŠ‡àŠ•'àŠšàŠźàŠżàŠ• vs àŠ‡àŠ•â€™àŠšàŠźàŠżàŠ•, Đ·'Đ”Đ·ĐŽĐ°ĐŒ vs Đ·â€™Đ”Đ·ĐŽĐ°ĐŒ, Bro-C'hall vs Bro-C’hall)
  • Plenty of coordinates with primes (e.g., 09â€Č15) across many languages—though coordinates with apostrophes are all over, too.
  • Half-rings (ÊżÊŸ) are most common in Islamic names.
  • Encoding errors (e.g., Р’ Р±РѕР№ Đ Ń‘Đ Ò‘ĐĄŃ“ĐĄâ€š Đ Ń•Đ Ò‘Đ Đ…Đ Ń‘ Đ’Â«ĐĄĐƒĐĄâ€šĐ Â°ĐĄĐ‚Đ Ń‘Đ Ń”Đ Ń‘Đ’Â» instead of В Đ±ĐŸĐč оЮут ĐŸĐŽĐœĐž «старОĐșО») sometimes have apostrophe-like characters in them. Converting them to apostrophes doesn't help.. it's just kinda funny.
  • Uzbek searchers really like to mix it up with their apostrophe-like options. The apostrophe form o'sha will now match o`sha, oÊ»sha, o‘sha, o’sha, o`sha, oÊ»sha, o‘sha, and o’sha—all of which exist in my samples!

I don't always love how the apostrophes are treated (e.g., aggressive_splitting in English is too aggressive), but for now it's good that all versions of a word with different apostrophe-like characters in it are at least treated the same.

There may be a few instances where the changes decrease the number of results a query gets, but it is usually an increase in precision. For example, l‍®autre would no longer match autre because the tokenizer isn't splitting on ®. However, it will match l'autre. Having to choose between them isn't great—I'm really leaning toward enabling French elision processing everywhere—but in a book or movie title, an exact match is definitely better. (And having to randomly match l to make the autre match is also arguably worse.)

aggressive_splitting (T219108)

aggressive_splitting is enabled on English- and Italian-language wikis, so—assuming it does a good job and in the name of harmonization—we should look at enabling it everywhere. In particular, it splits up CamelCase words, which is generally seen as a positive thing, and was the original issue in the Phab ticket.

word_delimiter(_graph)

The aggressive_splitting filter is a word_delimiter token filter, but the word_delimiter docs say it should be deprecated in favor of word_delimiter_graph. I made the change and ran a test on 1K English Wikipedia articles, and there were no changes in the analysis output, so I switched to the _graph version before making any further changes.

Also, aggressive_splitting, as a word_delimiter(_graph) token filter, needs to be the first token filter if possible. (We already knew it needed to come before homoglyph_norm.) If any other filter makes any changes, aggressive_splitting can lose the ability to track offsets into the original text. Being able to track those changes gives better (sub-word) highlighting, and probably better ranking and phrase matching.

So Many Options, and Some Zombie Code

The word_delimiter(_graph) filter has a lot of options! Options enabling catenate_words, catenate_numbers, and catenate_all are commented out in our code, with a note saying they are potentially useful, but they cause indexing errors. The word_delimiter docs say they cause problems for match_phrase queries. The word_delimiter_graph docs seem to say you can fix the indexing problem with the flatten_graph filter, but still warn against using them with match_phrase queries, so I think we're just gonna ignore them (and remove the commented-out lines from our code).

Apostrophes, English Possessives, and Italian Elision

In the English analysis chain, the possessive_english stemmer currently comes after aggressive_splitting, so it does nothing, since aggressive_splitting splits on apostrophes. However, word_delimiter(_graph) has a stem_english_possessive setting, which is sensibly off by default, but we can turn that on, just for English, which results in a nearly 90% reduction in s tokens.

After too much time looking at apostrophes (for Turkish and in general), always splitting on apostrophes seems like a bad idea to me. We can disable it in aggressive_splitting by recategorizing apostrophes as "letters", which is nice, but that also disables removing English possessive –'s ... so we can put possessive_english back... what a rollercoaster!
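
As a sketch of where that rollercoaster ends up (the option names are real word_delimiter_graph parameters, but the values here are illustrative and the rest of the aggressive_splitting options are omitted):

// Sketch: keep aggressive_splitting from splitting on apostrophes by
// recategorizing the apostrophe as a letter; the separate possessive_english
// stemmer, placed after it in the chain, then strips English -'s again.
$filters['aggressive_splitting'] = [
    'type' => 'word_delimiter_graph',
    'type_table' => [ "' => ALPHA" ], // apostrophe counts as a letter, so no split
    // ...the other existing aggressive_splitting options stay as they are...
];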

In the Italian analysis chain, italian_elision comes before aggressive_splitting, and has been set to be case sensitive. That's kind of weird, but I've never dug into it before—though I did just blindly reimplement it as-is when I refactored the Italian analysis chain. All of our other elision filters are case insensitive and the Elastic monolithic analyzer reimplementation/unpacking specifies case insensitivity for Italian, too. I think it was an error a long time ago because the default value is case sensitive, and I'm guessing someone just didn't specify it explicitly, and unintentionally got the case sensitive version.

Anyway, aggressive_splitting splits up all of the leftover capitalized elision candidates, which makes the content part more searchable, but with a lot of extra bits. The italian_stop filter removes some of them, but not all. Making italian_elision case insensitive seems like the right thing to do, and as I mentioned above, splitting on apostrophes seems bad in general.

Apostrophe Hell, Part XLVII

Now may not be the time to add extra complexity, but I can't help but note that d'– and l'– are overwhelmingly French or Italian, and dell'— is overwhelmingly Italian. Similarly, –'s, –'ve, –'re, and –'ll are overwhelmingly English. Some of the others addressed in Turkish are also predominantly in one language (j'n'–, j't'–, j'–, all'–, nell'–, qu'–, un'–, sull'–, dall'–)... though J's and Nell's exist, just to keep things complicated.

All that said, a simple global French/Italian elision filter for d'– and l'– and English possessive filter for –'s would probably improve recall almost everywhere.

CamelCase (via aggressive_splitting)

Splitting CamelCase seems like a good idea in general (it was the original issue in what became the aggressive_splitting Phab ticket). In the samples I have, actual splits seem to be largely Latin script, with plenty of Cyrillic, and some Armenian, too.

Splitting CamelCase isn't great for Irish, because inflected capitalized words like bhFáinní get split into bh + Fáinní. Normally the stemmer would remove the bh, so the end result isn't terrible, but all those bh–'s are like having all the English possessive –'s in the index. However, we already have some hyphenation cleanup to remove stray h, n, and t, so adding bh (and b, g, and m, which are similar CamelCased inflection bits) to that mini stop word list works, and the plain index can still pick up instances like B.B. King.

Irish also probably has more McNames than other wikis, but they are everywhere. Proximity and the plain index will boost those reasonably well.

Splitting CamelCase often splits non-homoglyph multi-script tokens, like OpenĐœĐžŃ€ĐŸĐČĐŸĐč—some of which may be parsing errors in my data, but any of which could be real or even typos on-wiki. Anyway, splitting them seems generally good, and prevents spurious homoglyph corrections.

Splitting CamelCase is not great for iPad, LaTeX, chemical formulas, hex values, saRcAStiC sPonGEboB, and random strings of ASCII characters (as in URLs, sometimes), but proximity and the plain index take care of them, and we take a minor precision hit (mitigated by ranking) for a bigger, better recall increase.

Splitting CamelCase is good.

Other Things That Are Aggressively Split

aggressive_splitting definitely lives up to its name. Running it on non-English, non-Italian samples showed just how aggressive it is.

The Good

  • Splits web domains on periods, so en.wikipedia.org → en + wikipedia + org
  • Splits on colons

The Bad (or at least not Good)

  • Splitting between letters and numbers is okay sometimes, but often bad, e.g. j2se → j + 2 + se
  • Splitting on periods in IPA is not terrible, since people probably don't search it much; ˈsÉȘl.ə.bəl vs ˈsÉȘləbəl already don't match anyway.
  • Splitting on periods and commas in numbers is.. unclear. Splitting on the decimal divider isn't terrible, but breaking up longer numbers into ones, thousands, millions, etc. sections is not good.
    • On the other hand, having some systems use periods for decimals and commas for dividing larger numbers (3,141,592.653) and some doing it the other way around (3.141.592,653), and the Indian system (31,41,592.653)—plus the fact that the ones, thousands, millions, etc. sections are sometimes also called periods—makes it all an unrecoverable mess anyway.

The Ugly

  • Splitting acronyms, so N.A.S.A. → N + A + S + A —Nooooooooooo!
    • (Spoiler: there's a fix coming!)
  • Splitting on soft hyphens is terrible—an invisible character with no semantic meaning can un pre dictably and ar bi trar i ly break up a word? Un ac cept able!
  • Splitting on other invisibles, like various joiners and non-joiners and bidi marks, seems pretty terrible in other languages, especially in Indic scripts.

Conclusion Summary So Far

aggressive_splitting splits on all the things word_break_helper splits on, so early on I was thinking I could get rid of word_break_helper (and repurpose the ticket for just dealing with acronyms), but aggressive_splitting splits too many things, including invisibles, which ICU normalization handles much more nicely.

I could configure away all of aggressive_splitting's bad behavior, but given the overlap between aggressive_splitting, word_break_helper, and the tokenizers, it looks to be easiest to reimplement the CamelCase splitting, which is the only good thing aggressive_splitting does that word_break_helper doesn't do or can't do.

So, the plan is...

  • Disable aggressive_splitting for English and Italian (but leave it for short_text and short_text_search, used by the ShortTextIndexField, because I'm not aware of all the details of what's going on over there).
  • Create and enable a CamelCase filter to pick up the one good thing that aggressive_splitting does that word_break_helper can't do.
  • Enable word_break_helper and the CamelCase filter everywhere.
    • Create an acronym filter to undo the bad things word_break_helper—and aggressive_splitting!—do to acronyms.
  • Fix italian_elision to be case insensitive.

At this point, disabling aggressive_splitting and enabling a new CamelCase filter on English and Italian are linked to prevent a regression, but the CamelCase filter doesn't depend on word_break_helper or the acronym filter.

Enabling word_break_helper and the new acronym filter should be linked, though, to prevent word_break_helper from doing bad things to acronyms. (Example Bad Things: searching for N.A.S.A. on English Wikipedia does bring up NASA as the first result, but the next few are N/A, S/n, N.W.A, Emerald Point N.A.S., A.N.T. Farm, and M.A.N.T.I.S. Searching for M.A.N.T.I.S. brings up Operation: H.O.T.S.T.U.F.F./Operation: M.I.S.S.I.O.N., B.A.T.M.A.N., and lots of articles with "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z" navigation in them, among others.)

I had linked word_break_helper and aggressive_splitting in my head, because they both split up acronyms, but since the plan is to not enable aggressive_splitting in any text filters, we don't need the acronym fix to accompany it.

But Wait, There's More: CamelCase Encore

So, I created a simple CamelCase pattern_replace filter, split_camelCase. After my experience with Thai, I was worried about regex lookaheads breaking offset tracking. (I now wonder if in the Thai case it's because I merged three pattern_replace filters into one for efficiency. Nope, they're evil.)

However, the Elastic docs provide a simple but very general CamelCase char filter:

"pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
"replacement": " "

My original formulation was pretty similar, except I used \p{Ll} and \p{Lu}, and no lookahead, instead capturing the uppercase letter. But I tested their method, and it works fine in terms of offset mapping. (Apparently, I was wildly mistaken, and lookaheads probably aren't as evil as I feared.)

However, there are rare cases[†] where CamelCase chunks end in combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). Fortunately \p{M} and \p{Cf} cover pretty much the right things. I tried adding [\\p{M}\\p{Cf}]* to the lookbehind, but it was really, really sloooooooow. However, allowing 0–9 combining marks or invisibles seems like overkill when you spell it out like that, and there was no noticeable speed difference using {0,9} instead of * on my machine. Adding the possessive quantifier (overloaded +—why do they do that?) to the range should only make it faster. My final pattern, with lookbehind, lookahead, and optional possessive combining marks and invisibles:

'pattern' => '(?<=\\p{Ll}[\\p{M}\\p{Cf}]{0,9}+)(\\p{Lu})',
'replacement' => ' $1'

(Overly observant readers will note a formatting difference. The Elastic example is a JSON snippet; mine is a PHP snippet. I left it that way because it amuses me, and everything should be clear from context.)
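
For completeness, the whole thing wrapped up as a character filter definition might look something like this (the registration details are assumed, but the pattern and replacement are the ones above):

// Sketch: split_camelCase as a pattern_replace char filter. A space is inserted
// before an uppercase letter that follows a lowercase letter (optionally trailed
// by up to nine combining marks or invisibles).
$charFilters['split_camelCase'] = [
    'type' => 'pattern_replace',
    'pattern' => '(?<=\\p{Ll}[\\p{M}\\p{Cf}]{0,9}+)(\\p{Lu})',
    'replacement' => ' $1',
];

With this in place, CamelCase becomes Camel Case before tokenization, while all-caps runs like NASA are left alone, since there is no lowercase-to-uppercase transition to match.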

[†] As noted above in the Conclusion Summary So Far, I had originally linked the CamelCase filter with word_break_helper and the acronym filter. The combining diacritics and common invisibles are much more relevant to acronym processing—which I've already worked on as I'm writing this, and which made me go back and look for CamelCase cases—of which there are a few.

In Conclusion—No, Really, I Mean It!

So, the plan for this chunk of harmonization is:

  • Disable aggressive_splitting for English and Italian.
  • Create and enable split_camelCase.
  • Fix italian_elision to be case insensitive.

And we can worry about enabling word_break_helper and handling acronyms in the next chunk of harmonization.

Appendix: CamelCase Observations

(Technically this doesn't violate the terms of the conclusion being conclusive, since it's just some extra observations about the data, for funsies.)

The kinds of things that show up in data focused on examples of CamelCase—some we can help with (✓), some we cannot (✗):

  • ✓ Cut and PasteCut and Paste / Double_NamebotDouble_Namebot
  • ✗ mY cAPSLCOCK kEY iS bROKEN
  • ✓ mySpaceBarIsBrokenButMyShiftKeyIsFine
  • ✓ lArt dElision sans lApstrophe
  • ✗ ÐÂșĂÂŸĂÂŽĂÂžĂ‘Â€ĂÂŸĂÂČĂÂ°ĂÂœĂÂžĂÂ” is hard.. oops, I mean, Đ Ń”Đ Ń•Đ Ò‘Đ Ń‘ĐĄĐ‚Đ Ń•Đ Đ†Đ Â°Đ Đ…Đ Ń‘Đ Â” is hard
  • ✗ "Wiki Loves Chemistry"?: 2CH2COOH + NaCO2 → 2CH2COONa + H2O + CO2
  • ✗ WĂ©IRD UPPĂšRCĂ€SE FUñCTIĂŽNS / wÉird lowÈrcÄse fuÑctiÔns
  • ✓ ĐĐ°Đ·ĐČĐ°ĐœĐžĐ”Đ±ĐŸŃ‚,ĐŸŃ€ĐŸĐ±Đ”Đ»ĐŸĐŸĐ”ĐŽĐ°ĐœĐžĐ” (Namebot,SpaceEating)
  • ✓ Lots of English examples in languages without an upper/lowercase distinction

I think the CamelCase fix is going to be very helpful for people who double-paste something (if it starts with uppercase and ends with lowercase, like Mr RogersMr Rogers). On the one hand, it's probably a rare mistake for any given person, but on the other, it still happens many times per day.

word_break_helper and Acronyms (T170625)

We—especially David and I—have been talking about "fixing acronyms" for years. On all wikis, NASA and N.A.S.A. do not match. And while they are not technically acronyms, the same problem arises for initials in names, such as J.R.R. Tolkien and JRR Tolkien; those ought to match! (I'd like to get J. R. R. Tolkien (with spaces) in on the game, too, but that's a different and more difficult issue.)

Long before either David or I were on the search team, English and Italian were configured to use word_break_helper in the text field. Generally this is a good thing, because it breaks up things like en.wikipedia.org and word_break_helper into searchable pieces. However, it also breaks up acronyms like N.A.S.A. into single letters. This is especially egregious for NASA on English-language wikis, where a is a stop word (and thus not strictly required)—lots of one-letter words are stop words in various languages, so it's not just an English problem.

Anyway... there are three goals for this task:

  • Merge acronyms into words (so NASA and N.A.S.A. match).
  • Apply word_break_helper everywhere (once acronyms are mostly safe)
  • Extend word_break_helper to any other necessary characters, particularly colon (:)

Merging Acronyms

I originally thought I would have to create a new plugin with a new filter to handle acronyms. Certainly the basic pattern of letter-period-letter-period... would be easy to match. However, I realized we could probably get away with a regular expression in a character filter, which would also avoid some potential tokenization problems that might prevent some acronyms from being single tokens.

We can't just delete periods between letters, since that would convert en.wikipedia.org to enwikipediaorg. Rather, we want to delete a period only when it is between two single letters. Probably. Certainly, that does the right thing for N.A.S.A. (converts to NASA.) and en.wikipedia.org (nothing happens).

However... and there is always a however... as noted above in the camelCase discussion, sometimes our acronyms can have combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). A simple example would be something like T.É.T.S or ĂŸ.ĂĄ.m. or Ä°.T.Ü.—except that in those cases, Latin characters with diacritics are normalized into single code points.

Indic languages written with abugidas are a good example where more complex units than single letters can be used in acronyms or initials. We'll come back to that in more detail later.

So, what we need are single (letter-based) graphemes separated by periods. Well, and the fullwidth period (．), obviously... and maybe... sigh.

I checked the Unicode confusables list for period and got a lot of candidates, including Arabic-Indic digit zero (Ù ), extended Arabic-Indic digit zero (Û°), Syriac supralinear full stop (܁), musical symbol combining augmentation dot (𝅭), Syriac sublinear full stop (܂), one-dot leader (․), Kharoshthi punctuation dot (𐩐), Lisu letter tone mya ti (ê“ž), and middle dot (·). Vai full stop (꘎) was also on the list, but that does not look like something someone would accidentally use as a period. Oddly, fullwidth period is not on the confusables list.

Given an infinite number of monkeys typing on an infinite number of typewriters (or, more realistically, a large enough number of WikiGnomes cutting-and-pasting), you will find examples of anything and everything, but the only characters I found regularly being used as periods in acronym-like contexts were actually fullwidth periods across languages, and one-dot leaders in Armenian. (Middle dot also gets used more than the others, but not a whole lot, and in both period-like and comma-like ways, so I didn't feel comfortable using it as an acronym separator.)

So, we want single graphemes—consisting of a letter, zero or more combining characters or invisibles—separated by periods or fullwidth periods (or one-dot leaders in the case of Armenian). A "single grapheme" is one that is not immediately preceded or followed by another letter-based grapheme (which may also be several Unicode code points). We also have to take into account the fact that an acronym could be the first or last token in a string being processed, and we have to explicitly account for "not immediately preceded or followed by" to include the case when there is nothing there at all—at the beginning or end of the string.

For Armenian, it turns out that one-dot leader is used pretty much anywhere periods are, though only about 10% as often, so I added a filter to convert one-dot leaders to periods for Armenian.
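
The Armenian-only addition is tiny; as a sketch (with a hypothetical filter name), it is just a one-entry mapping character filter:

// Sketch: for Armenian only, map one-dot leader (U+2024) to a regular period
// before the acronym handling sees the text.
$charFilters['armenian_dot_norm'] = [
    'type' => 'mapping',
    'mappings' => [ '\u2024=>.' ],
];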

My original somewhat ridiculous regex started off with a look-behind for 1.) a start of string (^) or non-letter (\P{L}), followed by 2.) a letter-based grapheme—a letter (\p{L}), followed by optional combining marks (\p{M}) or invisibles (\p{Cf})—then 3.) the period or fullwidth period ([.．]), followed by 4.) optional invisibles, then a capture group with 5.) another letter-based grapheme; and a look-ahead for 6.) a non-letter or end of string ($).

Some notes:

  • In all its hideous, color-coded glory: (?<=(?:^|\P{L})\p{L}[\p{M}\p{Cf}]{0,9}+)[.．]\p{Cf}*+(\p{L}[\p{M}\p{Cf}]*+)(?=\P{L}|$)
  • (1) and (2) in the look-behind aren't part of the matching string, (3) is the period we are trying to drop, (4) is invisible characters we drop anyway, (5) is the following letter, which we want to hold on to, and (6) is in the look-ahead, and not part of the matching string. In the middle of a simple acronym, (1) is the previous period and (2) is the previous letter, and (6) is the next period.
  • For reasons of efficiency, possessive matching is used for the combining marks and invisibles, and combining marks and invisibles are limited to no more than 9 in the look-behind. (I have seen 14 Khmer diacritics stacked on top of each other, but that kind of thing is pretty rare.)
  • The very simple look-ahead does not mess up the token's character offsets—phew!
  • And finally—this doesn't work for certain cases that are relatively common, er, not unheard of in Brahmic scripts!—though they are hard to find in Latin texts.
    • Ugh.

First, an example using Latin characters. We want e.f.g. to be treated as an acronym and converted to efg. We don't want ef.g to be affected. As mentioned above, we want to handle diacritics, such as Ă©.f.g. and Ă©f.g, which are not actually a problem because Ă© is a single code point. However, something like eÌȘ is not. It can only be represented as e +  ÌȘ. Within an acronym, we've got that covered, and d.eÌȘ.f. is converted to deÌȘf. just fine. But  ÌȘ is technically "not a letter" so the period in eÌȘf.g would get deleted, because f is preceded by "not a letter" and thus appears to be a single letter/single grapheme.

In some languages using Brahmic scripts (including Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sinhala, Tamil, Telugu, and Thai), letters followed by separate combining diacritics are really common, because it's the most typical way of doing things. Basic consonant letters include an inherent vowel—Devanagari/Hindi à€ž is "sa", for example. To change the vowel, add a diacritic: à€žà€Ÿ (saa) à€žà€ż (si) à€žà„€ (sii) à€žà„ (su) à€žà„‚ (suu) à€žà„‡ (se) à€žà„ˆ (sai) à€žà„‹ (so) à€žà„Œ (sau).

Acronyms with periods in these languages aren't super common, but when they occur, they tend to / seem to / can use the whole grapheme (e.g., à€žà„‡, not à€ž for a word starting with à€žà„‡). The problem is that the vowel sign (e.g., à„‡) is "not a letter", just like  ÌȘ. So—randomly stringing letters togetherâ€”à€žà„‡à€«.à€ź would have its period removed, because à€« is preceded by "not a letter".

The regex to fix this scenario is a little complicated—we need "not a letter", possibly followed by combining chars (rare, but does happen, as in 9̅) or invisibles (also rare, but they are sneaky and can show up anywhere since you can cut-n-paste them without knowing it). The regex that works—instead of (1) above—is something that is not a letter, not a combining mark, and not an invisible ([^\p{L}\p{M}\p{Cf}])—optionally followed by combining marks or invisibles. That allows us to recognize eÌȘ or à€žà„‡ as a grapheme before another letter.

Some notes:

  • Updated, in all its hideous, color-coded glory: (?<=(?:^|[^\p{L}\p{M}\p{Cf}])[\p{M}\p{Cf}]{0,9}+\p{L}[\p{M}\p{Cf}]{0,9}+)[.．]\p{Cf}*+(\p{L}[\p{M}\p{Cf}]*+)(?=\P{L}|$) (this is wrapped up as a char filter in the sketch after these notes)
  • The more complicated regex (and all in the look-behind!) didn't noticeably change the indexing time on my laptop.
  • While Latin cases like eÌȘf.g are possible, the only language samples affected in my test sets were the ones listed above: Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sinhala, Tamil, Telugu, and Thai. The changes in token counts ranged from 0 to 0.06%, with most below 0.03%—so this is not a huge problem.
    • Hindi, the language with the 0% change in token counts, still had changes. You can change the tokens themselves without changing the number of tokens—they just get split in different places (see e.e.cummings, et al., below)—though not splitting is the more common scenario.
  • In the Kannada sample—the one with the most changes from the regex upgrade—there were some clear examples where the new regex still doesn't work in every case.
    • Ugh.
      • However, these cases seem to be another order of magnitude less common, so I'm going to let them slide for now.
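
For concreteness, here is the updated regex wrapped up as a character filter in the same sketch style (the filter name and the bare $1 replacement are a rendering of the description above, not necessarily the exact production configuration):

// Sketch: acronym handling as a pattern_replace char filter. A period (or
// fullwidth period) between two single letter-based graphemes is dropped, along
// with any invisibles after it; the captured following grapheme ($1) is kept.
$charFilters['acronym_fixer'] = [
    'type' => 'pattern_replace',
    'pattern' => '(?<=(?:^|[^\\p{L}\\p{M}\\p{Cf}])[\\p{M}\\p{Cf}]{0,9}+\\p{L}[\\p{M}\\p{Cf}]{0,9}+)' .
        '[.．]\\p{Cf}*+(\\p{L}[\\p{M}\\p{Cf}]*+)(?=\\P{L}|$)',
    'replacement' => '$1',
];

Run over N.A.S.A. this gives NASA. (the trailing period is not between two letters, and the tokenizer strips it anyway), while en.wikipedia.org is untouched, since wikipedia and org are not single graphemes.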

However-however (However2?), I am going to document the scenario that still slips through the cracks in the regex, in case it is a bigger deal than it currently seems. (Comments to that effect from speakers of the languages are welcome!)

As mentioned before, Brahmic scripts have an inherent vowel, so à€ž is "sa". The inherent vowel can be suppressed entirely with a virama. So à€žà„ is just "s"—and can be used to make consonant clusters.

  • à€ž (sa) + à€€ (ta) + à€° (ra) + à„€ (ii) = à€žà€€à€°à„€ ("satarii", though the second "a" may get dropped in normal speech.. I'm not sure), and it may or may not be a real word.
  • à€ž (sa) + à„ (virama) + à€€ (ta) + à„ (virama) + à€° (ra) + à„€ (ii) = à€žà„à€€à„à€°à„€ (strii/stree), which means "woman".

So, we have a single grapheme that is letter + virama + letter + virama + letter + combining vowel mark. So, basically, we could allow an extra (letter + virama)+ or maybe (letter + virama){0,2} in several places in our regex—though there are almost 30 distinct virama characters across scripts, and optional characters in the look-behind are complicated.

Plus—just to make life even more interesting!—in Khmer the virama-like role is played by the coeng, and conceptually it seems like it comes before the letter it modifies rather than after... though I guess in a sense both the virama and coeng come between the letters they interact with. (I do recall that for highlighting purposes, you want the virama with the letter before, and the coeng with the letter after. So I guess typographically they break differently.)

Anyway, adjusting the regex for these further cases probably isn't worth it at the moment—though, again, if there are many problem cases, we can look into it. (It might take a proper plugin with a new filter instead of a pattern-replace regex filter... though the interaction of such a filter with word_break_helper would be challenging.)

More notes:

  • Names with connected initials like J.R.R.Tolkien and e.e.cummings are converted to JRR.Tolkien and ee.cummings—which isn't great—until word_break_helper comes along and breaks them up properly!
  • I don't love that acronyms go through stemming and stop word filtering, but that's what happens to non-acronym versions (now both SARS and S.A.R.S. will be indexed as sar in English, for example)—they do match each other, though, which is the point.
    • If you have an acronymic stop word, like F.O.R., it will get filtered as a stop word. The plain field has to pick up the slack, where it gets broken into individual letters. There's no great solution here.

word_break_helper, at Long Last

Now that most acronyms won't be exploded into individual letters (or rather, graphemes), we can get word_break_helper up and running.

The current word_break_helper converts underscores, periods, and parentheses to spaces. My planned upgrade was to add colon, and fullwidth versions of underscore, period, and colon. What could be simpler? (Famous last words!)
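
Concretely, assuming word_break_helper stays a simple mapping character filter (the current definition may differ in detail), the planned upgrade is a sketch like this:

// Sketch of the upgraded word_break_helper: map word-joining punctuation to
// spaces so the tokenizer splits on it. The last four entries are the planned
// additions; the first four reflect the current behavior.
$charFilters['word_break_helper'] = [
    'type' => 'mapping',
    'mappings' => [
        '_=>\u0020',      // underscore
        '.=>\u0020',      // period
        '(=>\u0020',      // left parenthesis
        ')=>\u0020',      // right parenthesis
        ':=>\u0020',      // colon (new)
        '\uFF3F=>\u0020', // fullwidth low line, i.e., fullwidth underscore (new)
        '\uFF0E=>\u0020', // fullwidth full stop, i.e., fullwidth period (new)
        '\uFF1A=>\u0020', // fullwidth colon (new)
    ],
];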

Chinese and Korean say, "Not So Fast!"[edit]

I ran some tests, not expecting anything.. unexpected.. to happen. To my surprise, there were some non-obvious changes in my Korean and Chinese samples. Upon further investigation, I discovered that both the Nori (Korean) and SmartCN (Chinese) tokenizers/segmenters take punctuation into account when parsing words—but often not spaces!

The simplest example is that "仁(义)" would be tokenized in Chinese as two different words, 仁 and 义, while "仁义" is tokenized as one. So far, so good. However, "仁 义" (with a space—or with five spaces) will also be tokenized as one word: "仁义".

Another Chinese example—"é™ˆéžżæ–‡ (äž­äżĄć…„ćŒŸ)":

  • With parens, 陈 / éžż / 文 / 䞭俥 / ć…„ćŒŸ
  • Without parens, 陈 / éžż / 文 / äž­ / 俥 / ć…„ćŒŸ

Korean examples are similar. "(970 ë§ˆìŽíŹëĄœìŽˆê°) 였찚ëČ”ìœ„":

  • With parens, 970 / ë§ˆìŽíŹ / 쎈각 / 였찚 / ëČ”ìœ„
  • Without parens, 970 / ë§ˆìŽíŹëĄœìŽˆ / 였찚 / ëČ”ìœ„

Other Korean examples may have less impact on search, because some Korean phrases get indexed once as a full phrase and once as individual words (in English, this would be like indexing football as both football and foot / ball)—"9. ëłŽê±Žëł”ì§€ë¶€ì°šêŽ€":

  • With period, 9 / ëłŽê±Žëł”ì§€ë¶€ / ëłŽê±Ž / ëł”ì§€ / 부 / 찚ꎀ
  • Without period, 9 / ëłŽê±Ž / ëł”ì§€ / 부 / 찚ꎀ

Somehow, the lack of period blocks the interpretation of "ëłŽê±Žëł”ì§€ë¶€" as a phrase. My best guess for both Chinese and Korean is that punctuation resets some sort of internal sentence or phrase boundary.

One more Korean example, shows a bigger difference—"ꔭ가ꎀ할권 (êČ°ì •ì›ìč™)":

  • With parens, ꔭ가ꎀ할권 / ê”­ê°€ / ꎀ할권 / êČ°ì • / 원ìč™
  • Without parens, ê”­ê°€ / ꎀ할 / 권 / êČ°ì • / 원ìč™

This one is extra interesting, because the paren after "ꔭ가ꎀ할권" affects whether or not it is treated as a phrase, but also whether it is broken into two tokens or three.

I found a workaround that works with Nori and SmartCN tokenizers, as well as the standard tokenizer and the ICU tokenizer: replacing punctuation with the same punctuation, but with spaces around it. So wikipedia.org would become wikipedia . org, causing a token split, while "仁(义)" would become "仁 ( 义 )", which still blocks the token merger.

It works, but I really don't like it, because it is a lot of string manipulation to add spaces around, for example, every period in English Wikipedia for no real reason (replacing a single character with a different character happens in place and is much less computationally expensive).

I already knew that the SmartCN tokenizer converts almost all punctuation into tokens with a comma as their text. (We filter those.)

I specifically tested the four tokenizers (SmartCN, Nori, ICU, and standard) on parens, period, comma, underscore, colon, fullwidth parens, fullwidth period (．), ideographic period (。), fullwidth comma, fullwidth underscore, and fullwidth colon.

SmartCN and Nori split on all of them. The standard tokenizer and ICU tokenizer do not split on period, underscore, or colon, or their fullwidth counterparts. (They do strip regular and fullwidth periods and colons at word edges, so x. and .x are tokenized as x by itself, while _x and x_ are tokenized with their underscores. x.x and x_x are tokenized with their punctuation characters.)

The easiest way to solve all of these problems was to make sure word_break_helper includes regular and fullwidth variants of period, underscore, and colon, and prevent word_break_helper from being applied to Chinese or Korean.
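For illustration, the upgraded character map could look something like this sketch, written as a Python dict in the shape of an Elasticsearch "mapping" char filter definition (the names and exact mapping list are assumptions, not the deployed CirrusSearch settings):

# Sketch of a "mapping" char filter as Elasticsearch expects it, written as a
# Python dict (e.g., for use with elasticsearch-py when creating an index).
word_break_helper = {
    "type": "mapping",
    "mappings": [
        r"_=>\u0020",       # underscore -> space
        r".=>\u0020",       # period -> space
        r":=>\u0020",       # colon -> space
        r"(=>\u0020",       # parentheses -> space
        r")=>\u0020",
        r"\uFF3F=>\u0020",  # fullwidth underscore
        r"\uFF0E=>\u0020",  # fullwidth period
        r"\uFF1A=>\u0020",  # fullwidth colon
    ],
}

# In the per-language config builder, this filter is simply left out of the
# Chinese and Korean analysis chains, since their tokenizers use punctuation
# as phrase-boundary hints.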

Finally, We Can Do as Asked[edit]

With that done, I ran some more tests, and everything looked very good. The only place where the results were suboptimal was in IPA transcriptions, where colon (:) is sometimes used for triangular colon (ː), which is used to indicate vowel length.

Add remove_duplicates to Hebrew[edit]

I got ahead of myself a little and looked into adding remove_duplicates to the Hebrew analysis chain. My analysis analysis tools assumed that there wouldn't be two identical tokens (i.e., identical strings) on top of each other. In the Hebrew analyzer, though, that's possible—common, even! I made a few changes, and the impact of adding remove_duplicates is much bigger and easier to see.

The Hebrew tokenizer assigns every token a type: Hebrew, NonHebrew, or Numeric are the ones I've seen so far.

The Hebrew lemmatizer adds one or more tokens of type Lemma for each Hebrew or NonHebrew token. The problem arises when the lemma output is the same as the token input—which is true for many Hebrew tokens, and true for every NonHebrew token.

I hadn't noticed these before because the majority of tokens from Hebrew-language projects are in Hebrew, and I don't read Hebrew, so I can't trivially notice that two tokens are the same.

Adding remove_duplicates removes Lemma tokens that are the same as their corresponding Hebrew/NonHebrew token.

For a 10K sample of Hebrew Wikipedia articles, the number of tokens decreased by 19.1%! For 10K Hebrew Wiktionary entries, it was 22.7%!
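For reference, the change amounts to slotting the stock remove_duplicates token filter in right after the lemmatizer. A sketch of the analyzer definition, with placeholder names rather than the exact production chain:

# Placeholder names; the point is just the position of remove_duplicates.
hebrew_text = {
    "type": "custom",
    "tokenizer": "hebrew_tokenizer",   # assumed name for the Hebrew plugin's tokenizer
    "filter": [
        "hebrew_lemmatizer",           # assumed name; emits Lemma tokens at the same position
        "remove_duplicates",           # built-in filter: drops a token identical to another at the same position
        "lowercase",
    ],
}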

Refactoring and Reimplementing, a.k.a., Plugin! The Musical!†[edit]

† No actual music was made (or harmed) in the making of this plugin.

While I was on vacation, Erik kindly started the global reindexing to enable apostrophe_norm, camelCase splitting, acronym handling, and word_break_helper updates. He noticed after a while that it was going reeeeeeeeeeeeeeally sloooooooooooow. He ran some tests and found it was taking about 3 times as long with the new filters in place. We decided to halt the reindexing while I looked into it some more. We also decided to leave the slower rebuilt indexes in place. Monitoring graphs maybe showed a little more lag, but nothing egregious. (Reindexing a document here and there and analyzing short query strings is not in the same CPU usage ballpark as cramming tens of millions of documents through the reindexing pipeline as fast as possible.) We did decide to temporarily semi-revert the code so that any reindexing during the time I was developing an alternative would be close enough to the old speed.

I hadn't noticed the slowdown before because my analysis analysis tools are built to easily gather stats on tokens, not push text through the pipeline as fast as possible. The overhead of the tools dwarfs the reindexing time. I've since mirrored Erik's timing framework, which nukes the index every time, and times directly stuffing thousands of documents (and tens of MB at a time) into a new index as fast as possible, with as few API calls as possible.
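The shape of that timing loop is roughly the following sketch (elasticsearch-py; not the actual harness, and the index name, analyzer name, and document format are made up):

import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def timed_load(settings, docs, index="timing_test"):
    """Nuke the index, recreate it with the analysis settings under test,
    then bulk-load a fixed document sample and return the elapsed seconds."""
    es.indices.delete(index=index, ignore_unavailable=True)
    es.indices.create(index=index, body={
        "settings": settings,  # assumes the settings define an analyzer named "text"
        "mappings": {"properties": {"text": {"type": "text", "analyzer": "text"}}},
    })
    actions = [{"_index": index, "_source": {"text": d}} for d in docs]
    start = time.monotonic()
    bulk(es, actions, chunk_size=2500, request_timeout=600)
    es.indices.refresh(index=index)
    return time.monotonic() - start

# Discard the first load or two as warm-up, then average three or four runs.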

Notes on Timings[edit]

The timings are necessarily fuzzy. Erik averaged six reloads per timing, while I eventually limited myself to three, sometimes four, because I was doing a lot more testing.

I typically load 2500 documents at a go, which for my English sample is about 72MB of data. The limit for a single data transfer to Elastic in our configuration is about 100MB, so ~70MB is in the ballpark, but can fairly reliably be loaded as a single action. I also have used samples of French and Korean (5K documents) and Hebrew (3K documents) that are from 50–100MB.

My numbers are all relative. Some days my laptop seems randomly faster than others. I'm not sure whether it's the first few loads after rebooting Elastic, or the first few minutes after rebooting Elastic, but for a while there things are just slower, so I always do a few throw-away loads before taking timings. I also used whatever baseline was convenient when doing comparative timings, and as I progressed through adding new plugins, the baseline generally became longer, because I wanted to compare the potential next config to the most current new config.

So, I might say one filter added 3% to load time and another added 6%, versus a given baseline. Later, on a bad hair day slow laptop day, with a different baseline, the numbers might be 4% and 6.5%. Anything with a clear and consistent gap between them was worth taking into account.

To my happy surprise, the load-time increases from individual filters added up fairly linearly on general English data. (And French, Hebrew, and Korean, too.) That's not guaranteed, since the output of one filter can change the input of the next, giving it more or less to do than it would by itself. But those effects, if any, are lost in the general noise of timing.

In many cases, the fine details don't matter at all. The regex-based filters added around 249% to load time (i.e., ~3½x slower). Whether loading takes 240% or 260% longer doesn't really matter... it's waaaaaay too slow!

How Slow Is Slow, Anyway?[edit]

Against my initial baseline:

filter/task         slowdown   notes
word_break_helper   2%         a few character substitutions
apostrophe_norm     5%         many character substitutions
regex camelCase     35%
regex acronyms      206%       Holy guacamole, Batman!

My first approach, now that I had a decent ability to estimate the impact/expense of any changes, was to try to tune the regexes and other config elements to be more efficient.

The regexes for acronyms and camelCase both used lookaheads, lookbehinds, and many optional characters (allowing 0–9‡ non-letter diacritics or common invisibles (e.g., soft hyphens or bidirectional markers)), which could allow for lots of regex backtracking, which can be super slow.

‡ Note that I had originally allowed "0 or more" diacritics or invisibles, but that was so slow that I noticed even with my not-very-sensitive set up. I should have paid more attention and thought harder about that, it seems.

CamelCase[edit]

Decreasing the number of optional diacritics and invisibles from 0–9 to 0–3 improved the camelCase regex filter from a 32% increase in load time (against a different baseline than above) to 18%.

However, the biggest win for the regex-based camelCase filter was to remove the lookbehind. I had done the acronym regex first, and it required the lookbehind because the contexts of periods could overlap. (e.g., F.Y and Y.I overlap in F.Y.I.) It was easy to do the camelCase regex in an analogous way, but not necessary. That improved it from a 32% increase in load time to merely 12%.

Applying both changes—no lookbehind and only 0–3 optional diacritics and invisibles—was no better than just getting rid of the lookbehind.
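For illustration, a no-lookbehind camelCase rule along these lines might look like the sketch below (Python, third-party regex module; the production version is a pattern_replace filter and its exact character classes differ):

import regex  # third-party module with \p{...} support

INVIS = r"[\u00AD\u200B-\u200F]"              # soft hyphen, zero-width chars, bidi marks (a subset)
TRAIL = r"(?:\p{M}|" + INVIS + r"){0,3}"      # up to three trailing diacritics/invisibles

# Lowercase letter (plus trailing junk), followed by an uppercase letter.
# The uppercase letter is only looked ahead at, so overlapping humps all split.
CAMEL = regex.compile(r"(\p{Ll}" + TRAIL + r")(?=\p{Lu})")

def split_camel(text):
    return CAMEL.sub(r"\1 ", text)

# split_camel("wgCanonicalNamespace") -> "wg Canonical Namespace"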

Acronyms[edit]

With a new baseline, the regex-based acronym filter increased runtime by 200%. Allowing only 0–3 diacritics and invisibles dropped the increase to 98%—a huge improvement, but still not acceptable. Limiting it to 0–2 diacritics and invisibles improved to 69%. 0–1 improved to 49%.

Allowing no diacritics or invisibles, which would work correctly more than 90% of the time, but which could fail in ways that are opaque to a reader or searcher, especially in non-alphabetic scripts, increased load time by only 24%.

A ¼ increase in load times might be acceptable on a smaller corpus. If you really worry about acronyms,§ a load time of 2½ hours instead of 2 hours might be worth it. However, our baseline load time for all wikis is 2 weeks. An extra 3½ days is a lot to ask for good acronym processing.

§ David and I have been complaining to each other about acronym processing for more than 5 years, easy, so we do really worry about them.

Apostrophes, Etc.[edit]

Against a very stripped down baseline (just the standard tokenizer and the default lowercase filter—which ran in ⅔ the time of earlier baselines, so all the numbers below are proportionally higher), I got these comparative numbers:

filter/task                      type          slowdown   notes
nnbsp_norm                       char filter   4.9%       one character substitution (narrow non-breaking space)
word_break_helper                char filter   5.8%       a few character substitutions
apostrophe_norm                  char filter   7.4%       many character substitutions
ICU normalization                token filter  6.1%       lots of Unicode normalizations
homoglyph_norm                   token filter  20.1%      normalizes Latin/Cyrillic multiscript tokens
default en analyzer (unpacked)   analyzer      50.2%

Observations:

  • The reciprocal of ⅔ is 1½ (i.e., 1 + 50%), so the default English analyzer running 50% slower than the stripped down version is just about perfect, mathematically speaking.
  • The homoglyph normalization filter is pretty expensive!
  • I was also surprised at how expensive the one-character character filter map for narrow non-breaking spaces is. That is the simplest possible mapping character filter: a one-character-to-one-character mapping.
    • The overall machinery for general mapping filters is pretty complex, since it can map overlapping strings (e.g., ab, bc, cd, and abcd can all have mappings) of any practical length to replacement strings of any practical length. However, we mostly use it to map single characters to either single characters or the empty string. Most of our other mappings are at most two characters being mapped from or to. I was wondering whether it would be possible to get that per–char filter overhead down for simple mappings like most of ours.

Using a similarly stripped-down baseline on a different day, I also evaluated English-specific components from our unpacked English analyzer and in the default English analyzer (except for the keyword filter), and some other filters we use or have used across multiple analyzers:

filter/task                                        type           slowdown   notes
possessive_english                                 token filter   0.60%      strips 's
english_stop                                       token filter   0.40%      removes English stop words
kstem                                              token filter   6.20%      general English stemmer
custom_stem (for English)                          token filter   6.30%      just stems guidelines => guideline
ASCII folding                                      token filter   2.30%      flattens many diacritics
ASCII preserve orig                                token filter   4.30%      keeps both ASCII-folded and original tokens
ICU folding                                        token filter   1.30%      aggressive Unicode diacritic squashing, etc.
ICU preserve orig                                  token filter   3.20%      keeps both ICU-folded and original tokens
kana_map                                           char filter    4.20%      maps hiragana to katakana; currently only for English, but planned to be global (except Japanese)
nnbsp_norm + apostrophe_norm + word_break_helper   char filters   15.30%     three separate filters
nnbsp_norm ∪ apostrophe_norm ∪ word_break_helper   char filter    6.80%      one filter with the same mappings as the three filters

Observations:

  • The custom_stem filter could have other mappings added to it to handle things kstem doesn't. It predates my time on the Search team, but kstem does still get guidelines wrong.
  • I'm surprised that the ICU filters are faster than the ASCII filters. ICU folding is a superset of ASCII folding, but ICU is still faster. ICU preserve is our homegrown parallel to ASCII preserve, so I'm surprised it is noticeably faster.
  • In real life, we probably can't merge nnbsp_norm, apostrophe_norm, and word_break_helper because word_break_helper has to come after acronym processing, but the experiment is enlightening. 8.5% overhead by having three separate mapping filters instead of one is a lot! nnbsp_norm and apostrophe_norm could readily be merged, though, and with appropriate comments in the code it would only be somewhat confusing. :þ

I also ran some timings using the current English, French, Hebrew, and Korean analysis chains mostly on their own Wikipedia samples, but also a few tests running some of each through the others.

Despite different baseline analyzer configs, the timings for English and French were all within 1–2% of each other.

The camelCase filter had a much smaller impact on Hebrew and Korean, presumably because the majority of their text (in Hebrew and Korean script) doesn't have any capital letters for the regex to latch onto. The maximally simplified acronym regex ran much faster on Hebrew and Korean (~7.8% vs ~21.6% on English and French). I'm not 100% sure why. But it was still the most expensive component of the ones tested (apostrophe_norm, word_break_helper, camelCase, acronyms).

These timings of various components aren't directly relevant to the efficiency of the new filters, but they serve as a nice baseline for intuitions about how much processing time a component should take.

It's Plugin' Time![edit]

The various improvements were quite good, relatively speaking, but after some discussion we set a rough heuristic threshold of 10% load time increase as warranting looking at a custom plugin. Both camelCase and acronyms were well over that, and together they were a very hefty increase, indeed.

We discussed folding the new custom filters for acronyms and camelCase into the existing extra plugin, but decided the code complexity wasn't worth it (having to check not only for the existence of a plugin, but also for a specific version of the plugin). The maintenance burden of deploying one more plugin isn't too high (there's a list of plugins, and the work is pretty much the same regardless of the length of the list, within reason), and there doesn't seem to be a run-time performance impact to having more plugins installed (yet). So, the acronym and camelCase handling are in a new plugin called extra-analysis-textify.

Acronyms[edit]

After considering various options for the acronym filter—most importantly whether a token filter or character filter would be better—I decided on a finite-state machine inside a character filter. Being aware of the efficiency issues in large documents, I was trying to avoid the common character filter approach of reading in all the text, modifying it, and then writing it back out one character at a time. The finite-state machine does a good job of processing character-by-character, but it also needs to be able to look ahead to know that a given period is in the right context to be deleted.

The simplest context for period deletion is, as previously discussed in the original acronym write-up above, a period between two letters, with each having a non-letter (space, text boundary, punctuation, etc.) on their other side. So, we'd delete the period in " A.C " or "#r.o.", but not the one in "ny.m".
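Stripped of the diacritic, invisible, and surrogate handling, the core decision looks something like this Python sketch (the real thing is a Java character filter that also maintains offset corrections):

def strip_acronym_periods(text):
    # Core rule only: delete a period between two single letters that each
    # have a non-letter (or the text boundary) on their other side. The real
    # filter also skips over diacritics/invisibles and tracks offsets.
    out = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch == "." and 0 < i < n - 1:
            prev, nxt = text[i - 1], text[i + 1]
            prev_outer = text[i - 2] if i >= 2 else ""
            next_outer = text[i + 2] if i + 2 < n else ""
            if (prev.isalpha() and nxt.isalpha()
                    and not prev_outer.isalpha()
                    and not next_outer.isalpha()):
                continue  # drop this period
        out.append(ch)
    return "".join(out)

# strip_acronym_periods(" A.C ")     -> " AC "
# strip_acronym_periods("#r.o.")     -> "#ro."
# strip_acronym_periods("ny.m")      -> "ny.m"
# strip_acronym_periods("N.A.S.A.")  -> "NASA."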

The extra complication I insisted on inflicting on myself was dealing with combining diacritics and common invisible characters, so that the period would be rightly deleted in "#A`˛.-Cž≫!"—where ` ˛ and ž are combining diacritics, the hyphen is a normally invisible soft hyphen, and ≫ represents a left-to-right bidi mark—so the text looks like this: "#Ą̀.Ç!".

We can maintain the correct state in place up to the period, but after that we have to read ahead to see what comes after. The most ridiculous set of diacritics I've seen in real data on-wiki was 14 of the same Khmer diacritic on a given character; they rendered on top of each other, so the tiny ring just looked slightly bold. I rounded up to allow buffering up to 25 characters (including the actual letter after the period), so even mildly "Zalgo" acronyms (think G.L.I.T.C.H. buried under stacks of combining diacritics) will be processed properly, though that was not a primary concern. (I chose to buffer the readahead text within the filter rather than trying to get the stream it's reading from to back up properly. It's theoretically possible to do that, but not every stream supports it, etc., etc., so it just seemed safer to handle it internally.)

I also decided to handle 32-bit characters correctly. Elasticsearch/Lucene internals are old-school character-by-character (which I learned in Java is 2 bytes / 16 bits at a time, unlike the really oldskool byte-by-byte C that I cut my teeth on), so we also have to read ahead when we see a high surrogate to get the following low surrogate, munge them together, and find out what kind of character that is.

We can properly process all sorts of fancy 𝒜͍.𝕼.ℝ̫.đ“žÌ„.đ˜•Ì.đšˆÌ˜.𝕄⃞.𝔖. now!

I did a regression test against the regex-based acronym filter, and the only thing that came up was in the bookkeeping for the deleted periods. (Frankly, I'm always amazed that the pattern_replace filter that uses the regexes can keep all the bookkeeping straight.) The offset corrections for the regex (which is following very general principles) differ from my special-purpose offsets, which means that there could be extra periods highlighted for the regex-generated tokens in specific cases (notably in non-alphabetic Asian languages where words get tokenized after being de-acronymed). The results are differences in highlights like ឱ.ស. vs ឱ.ស. in Khmer. Only a typography nerd would notice, and almost no one would care, I think.

The speed improvement was amazing! Against a specific baseline of the English analyzer with apostrophe handling added, the fully complex acronym regex filter added 274.1% to the load time, the maximally simplified acronym regex added 27.9%, and the plugin version added only 4.4% (i.e., more than 98% faster, making it the same or slightly faster than a one-character mapping filter, such as the narrow non-breaking space fix).

I also configured our code to fall back to the maximally simplified acronym regex if the textify plugin is unavailable.

CamelCase[edit]

The camelCase plugin was more straightforward, but I also used a finite-state machine inside a character filter.

To recap, the simplest camelCase approach is to put a space between a lowercase letter and a following uppercase letter, but complications include combining diacritics and invisibles, plus 32-bit characters with high and low surrogate characters. Compared to acronyms, though, it was much more straightforward—only the low surrogate and the capital letter following a lowercase letter (and its diacritics and invisibles) need to be buffered. Easy peasy.
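The core of it, minus the buffering details, is tiny; something like this sketch (Python here; the real filter is a Java character filter that also buffers diacritics, invisibles, and surrogate pairs, and fixes up offsets):

import unicodedata

def split_camel_case(text):
    # Insert a space between a lowercase letter and a following uppercase
    # letter; diacritics, invisibles, titlecase letters (see below), and
    # offset corrections are left out of this sketch.
    out = []
    prev_cat = ""
    for ch in text:
        cat = unicodedata.category(ch)
        if prev_cat == "Ll" and cat == "Lu":
            out.append(" ")
        out.append(ch)
        prev_cat = cat
    return "".join(out)

# split_camel_case("camelCase") -> "camel Case"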

Unicode Nerd Fact of the Day: In addition to UPPERCASE and lowercase letters, there are also a few letters that are TitleCase. The examples I found are characters that are made of multiple letters. So, Serbo-Croatian has a Cyrillic letter љ which corresponds to lj in the Latin alphabet. It also comes as a single character: ǉ. It comes in uppercase form, Ǉ, for all uppercase words, and in title case form, ǈ, if a word happens to start with that letter. Thus, in theory, you want to split words between lowercase and either UPPERCASE or TitleCase—not that it comes up very often. Also, as TitleCase letters are essentially UPPERCASE on the frontside and lowercase on the backside, they can make a whole camel hump by themselves. "ǈǈǈ" (3 TitleCase characters) should be camelCased to "ǈ ǈ ǈ". (This is a weird example, but it makes sense (as much sense as it can) because the string "LjLjLj" (6 MiXeD cAsE characters) would be camelCased to "Lj Lj Lj", and the TitleCase version is ICU normalized to that exact UPPER+lower version. So, both rule orderings—ICU normalization before or after camelCasing—give the same result, and that result matches the often visually identical UPPER+lower version.)

I did a regression test against the regex-based camelCase filter, and the only difference I found was in camelCase words with 32-bit characters. The uppercase Unicode regex pattern \p{Lu} correctly matches uppercase 32-bit characters, but only modifies the offset by 1 character/2 bytes/16 bits. So, 𝗥𝗮𝗱𝗶𝗼𝗨𝘁𝗼𝗽𝗶𝗮 (in Unicode math characters) gets correctly split into 𝗥𝗮𝗱𝗶𝗼 and 𝗨𝘁𝗼𝗽𝗶𝗮, but the offset splits the 32-bit 𝗨 in half (and apparently throws away the other half), so the offset/highlight would just include 𝘁𝗼𝗽𝗶𝗮. (So highlighting would be 𝗥𝗮𝗱𝗶𝗼𝗨𝘁𝗼𝗽𝗶𝗮—not great, but hardly a disaster.)

The plugin version of the camelCase splitter doesn't have that problem.

The speed improvement was not as impressive as the acronym speedup, but still great. The regex-based camelCase filter was 19.9% slower than the baseline. The plugin camelCase filter was only 3.6% slower—again faster than a mapping filter!

I also configured our code to fall back to a fairly reasonable camelCase regex if the textify plugin is unavailable.

Also, I thought of a weird corner case... a.c.r.o.C.a.m.e.l. Should we de-acronym first, so the camelCase can be split? I found some real examples—S.p.A., G.m.b.H., and S.u.S.E.—and while I don't love what camelCase splitting does to them after de-acronyming them, it's the same as what happens to SpA, GmbH, and SuSE, so that seems as right as possible.

Apostrophes, et al., and Limited Mappings[edit]

As mentioned above, I was surprised that the simplest possible mapping character filter (mapping one letter to one letter) increases loading time by 4–5%. A lot of our mapping filters are one-char to one-char.

  • apostrophe_norm and nnbsp_norm are universally applied.
  • word_break_helper, the dotted_I_fix, and the upcoming change to kana_map are or will be applied to most languages.
  • Language-specific maps for Armenian, CJK, French, German, Persian, and Ukrainian are all one-char to one-char, as are numeral mappings for Japanese and Khmer.
  • near_space_flattener and word_break_helper_source_text are not used in the text field, but they are also one-char to one-char.

The Korean/Nori-specific character filter is mostly one-char to one-char, though it also has one-char to zero-char/empty string mappings (i.e., delete certain single characters).

Other languages have very simple mappings:

  • Irish uses one-char to two-char mappings.
  • Romanian and Russian use two-char to two-char mappings.

Even our complex mappings aren't that complex:

  • Chinese uses some two-char to longer string mappings.
  • Thai uses some three-char to two-char mappings.

In the simplest case—one-char to one-char mappings—there's no offset bookkeeping to be done, and the mapping can be stored in a simple hash table for fast lookups.

It seems like it should be possible to create a filter for some sort of limited_mapping that is faster for these simple cases. We can have many mapping filters in a given language analyzer: apostrophe_norm, nnbsp_norm, kana_map, word_break_helper, dotted_I_fix, and a language-specific mapping, for up to 6 mappings total in the text field. An improvement of 1% per filter would basically be "buy 5, get 1 free!"

So, I looked into it! I have to give credit to the developers of the general-purpose mapping filter. I read over it, and they use some heavy-duty machinery to build complex data structures that are very efficient at the more complex task they've set themselves. My code does a moderately efficient job at a much less complex task.

And for all but the simplest mappings—involving two characters in either the mapping from or the mapping to direction—the generic mapping filter was the same speed, modulo the fuzziness of the timings. Impressive! However, on the 1-to-1 mappings, my limited_mapping approach was ~50% faster!
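In spirit, a limited_mapping filter is nothing more than a per-character table lookup; Python's str.translate is a convenient way to show the idea (the mappings below are illustrative, and the real filter is a Java character filter configured from the analyzer settings):

# One-char-to-one-char mappings as a plain lookup table.
LIMITED_MAP = {
    ord("`"): "'",       # backtick/grave accent -> apostrophe
    ord("\u2019"): "'",  # curly apostrophe -> straight apostrophe
    0x202F: " ",         # narrow non-breaking space -> plain space
}

def limited_mapping(text):
    # One-char-to-one-char only, so every output character lines up with its
    # input character and there is no offset bookkeeping to do.
    return text.translate(LIMITED_MAP)

# limited_mapping("rock\u2019n\u2019roll") -> "rock'n'roll"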

I decided that rather than trying to automatically detect instances of mapping where limited_mapping could be used (which would require pre-parsing the mappings with new code in PHP), I'd let the humans (just me for now) mark things as being ok for a limited_mapping filter, and then convert all of them back to plain mapping filters if the extra-analysis-textify plugin isn't available.

Plugin Development and Deployment[edit]

The patches for the new textify plugin and the code to use it when building analyzers are still in review, but it should all get straightened out and merged soon.


In keeping with the notion that adding components to existing, established plugins can add unwanted complexity, I've decided to leave the ASCII-folding/ICU-folding task on hold¶ and move up the ICU token repair project, since that will also require a custom filter. Then we can do one plugin deployment and one reindex afterwards to finally get all of these updates and upgrades into the hands of on-wiki searchers!

¶ It got put on hold in order to deal with the reindexing speed problem that necessitated the plugin development so far.

ICU Token Repair (T332337)[edit]

Background[edit]

The ICU tokenizer is much better than the standard tokenizer at processing "some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables."

However, it has some undesirable idiosyncrasies.

(Note that we're using the ICU tokenizer that is compatible with Elasticsearch 7.10, which is the last version of Elasticsearch with a Wikimedia-compatible open-source license.)

UAX #29[edit]

UAX #29 is a Unicode specification for text segmentation, which the ICU tokenizer largely implements. However, it does not quite follow word boundary rule 8 (WB8), which has this comment: Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).

Given the following input text, "3д 3a 3a 3д", the default ICU tokenizer will generate the tokens 3д, 3, a, 3a, 3, д. While this does, however opaquely, follow the internal logic of the ICU tokenizer, it is hard to imagine that this inconsistency is what typical users expect.

More Detailed Examples[edit]

Let's look at a similar example with different numbers and letters for ease of reference. With input "1я 2a 3x 4Ю", the ICU tokenizer gives these tokens: 1я, 2, a, 3x, 4, Ю.

One of the ICU tokenizer's internal rules is to split on character set changes. Problems arise because numbers do not have an inherent character set. (This is also true for punctuation, emoji, and some other non–script-specific characters, many of which are called either "weak" or "neutral" in the context of bidirectional algorithms, and which I generally refer to collectively as "weak" when talking about the ICU tokenizer.)

In the case of a token like z7, the 7 is considered to be "Latin", like the z. Similarly, in щ8, the 8 is "Cyrillic", like the щ. In "1я 2a 3x 4Ю", the 2 is considered "Cyrillic" because it follows я, and the 4 is considered "Latin" because it follows x, even though there are spaces between them. Thus—according to the internal logic of the ICU tokenizer—the "Cyrillic" 2 and Latin a should be split, and the "Latin" 4 and Cyrillic Ю should be split.

This effect can span many non-letter tokens. Given the string "Ю ... 000; 456/789—☂☀☂☀☂☀☂☀ 3a", the ICU tokenizer assigns all the numbers and emoji between Ю and a to be "Cyrillic". (The punctuation characters are discarded, correctly, by the tokenizer.) As a result, the last two tokens generated from the string are 3 (which is "Cyrillic") and a (which is Latin). Changing the first letter of the string to x—i.e., "x ... 000; 456/789—☂☀☂☀☂☀☂☀ 3a"—results in the last token being 3a. This kind of inconsistency based on a long-distance dependency seems sub-optimal.

As a more real-world example, in a text like Напиток 7Up использует слоган "Drink 7Up" (which is a machine translation of the sentence The beverage 7Up uses the slogan "Drink 7Up"), the first 7Up is split into two tokens (7, Up), while the second is left as one token. Similar discussions of 3M, A1 steak sauce, or 23andMe in Armenian, Bulgarian, or Greek texts are subject to this kind of inconsistency.

Homoglyphs[edit]

Another important use case that spurred development of the icu_token_repair filter is homoglyphs. For example, the word "chocоlate"—where the middle о is actually Cyrillic—will be tokenized by the ICU tokenizer as choc, о, late. This seems to be contrary to WB5 in UAX #29 (Do not break between most letters), but the ICU tokenizer is consistent about it, and always makes the split, because there is definitely a legitimate character set change.

On Wikimedia wikis, such homoglyphs are sometimes present as the result of vandalism, but more often as the result of typing errors, lack of easily accessible accented characters or other uncommon characters when translating, or cutting-and-pasting errors from other sources. We have a token filter homoglyph_norm that is able to handle Cyrillic and Latin homoglyphs, and repair "chocоlate" to the more typical "chocolate", but it only works on individual tokens, not across tokens that have already been split up.

Other Mixed-Script Tokens[edit]

Stylized, intentionally mixed-script text or names—such as "lιмιтed edιтιon", "NGiИX", or "KoЯn"—can also occur, and the ICU tokenizer consistently splits them into single-script sub-word tokens.

Sometimes mixed-script numerals, like "2١١2", occur. The ICU tokenizer treats ١ as Arabic, but 2 is still a weak character, so depending on the preceding context, the number could be kept as a single token, or split into 2 and ١١2.

Not a <NUM>ber[edit]

Another issue discovered during development is that the ICU tokenizer will label tokens that end with two or more digits with the token type <NUM> rather than <ALPHANUM>. So, among the tokens abcde1, abcde12, 12345a, a1b2c3, h8i9j10, д1, д12, অ১, অ১১, क१, क११, ت۱, and ت۱۱, the ones ending in two or more digits (abcde12, h8i9j10, д12, অ১১, क११, and ت۱۱) are <NUM> and the rest are <ALPHANUM>. This seems counterintuitive.

This can become particularly egregious in cases of scripts without spaces between words. The Khmer phrase និងម្តងទៀតក្នុងពាក់កណ្តាលចុងក្រោយនៃឆ្នាំ១៩៩២ ("and again in the last half of 1992") ends with four Khmer numerals (១៩៩២, underlined because bolding isn't always clear in Khmer text). It is tokenized (quite nicely—this is why we like the ICU tokenizer!) as និង, ម្តង, ទៀត, ក្នុង, ពាក់កណ្តាល, ចុងក្រោយ, នៃ, and ឆ្នាំ១៩៩២. The bad part is that all of these tokens are given the type <NUM>, even though only the last one has any numerals in it!

If you don't do anything in particular with token types, this doesn't really matter, but parts of the ICU token repair algorithm use the token types to decide what to do, and they can go off the rails a bit when tokens like abcde12 are labelled <NUM>.

The Approach[edit]

The plan is to repair tokens incorrectly split by the ICU tokenizer. To do this, we cache each token, fetch the next one, and decide whether to merge them. If we don't merge, emit the old one and cache the new one. If we do merge, cache the merged token and fetch the next token, and repeat.

The phrase "decide whether to merge them" is doing a lot of the heavy lifting here.

  • Tokens must be adjacent, with the end offset of the previous one being equal to the start offset of the following one. No space, punctuation, ignored characters, etc., can intervene. This immediately rejects the vast majority of token pairs in languages with spaces, since tokens are not adjacent.
  • Tokens must be in different scripts. If you have two Latin tokens in a row, they weren't split because of bad behavior by the ICU tokenizer. (Maybe camelCase processing got them!)
  • Tokens must not be split by camelCase processing. By default, if one token ends with a lowercase letter and the next starts with an uppercase letter—ignoring certain invisible characters and diacritics—we don't rejoin them. A token like ВерблюжийCase should be split for camelCase reasons, not mixed-script reasons. This can be disabled.
  • Tokens must be of allowable types. By default, <EMOJI>, <HANGUL>, and <IDEOGRAPHIC> tokens cannot be rejoined with other tokens.
    • <HANGUL> and <IDEOGRAPHIC> need to be excluded, because they are split from numbers by the ICU tokenizer, regardless of apparent script. So, given the text "3년 X 7년", the ICU tokenizer generates the tokens 3 + 년 + X + 7 + 년; the 3 is "Hangul" (because the nearest script is Hangul) and the 7 is "Latin", because it follows the Latin X, but both are separated and neither should be repaired. Allowing <HANGUL> tokens to merge would only trigger repairing 7년, but not 3년. The <IDEOGRAPHIC> situation is similar.
    • <EMOJI> are excluded because in most cases they are not intended to be part of words.
  • The different scripts to be joined must be on a list of acceptable pairs of scripts. I originally didn't have this requirement, but after testing it became clear that, based on frequency and appropriateness of rejoined tokens, it is possible to make a list of desirable pairs or groups of scripts to be joined that cover almost all of the desirable cases and exclude many undesirable cases.
    • The compatible script requirement is ignored if one of the tokens is of type <NUM> (corrected <NUM>, after fixing types that should be <ALPHANUM>; see below). <NUM> tokens still can't merge with disallowed types, like <HANGUL>.
  • Joined tokens shouldn't be absurdly long. The threshold for absurdity is subjective, but there are very few tokens over 100 characters long that are valuable, search-wise. The ICU tokenizer itself limits tokens to 4096 characters. Without some limit, arbitrarily long tokens could be generated by alternating sequences like xχxχxχ... (Latin x and Greek χ). I've set the default maximum length to 100, but it can be increased up to 5000 (chosen as a semi-random round number greater than 4096).

Merging tokens is relatively straightforward, though there are a few non-obvious steps:

  • Token strings are simply concatenated, and offsets span from the start of the earlier token to the end of the later one.
  • Position increments are adjusted so that the new tokens are counted correctly and if the unjoined tokens were sequential, the joined tokens will be sequential.
  • Merged multi-script tokens generally get a script of "Unknown". (Values are limited to constants defined by IBM's UScript library, so there's no way to specify "Mixed" or joint "Cyrillic/Latin".) If they have different types (other than exceptions below), they get a merged type of <OTHER>.
    • The Standard tokenizer labels tokens with mixed Hangul and other alphanumeric scripts as <ALPHANUM>, so we say <HANGUL> + <ALPHANUM> = <ALPHANUM>, too.
    • When merging with a "weak" token (like numbers or emoji), the other token's script and type values are used. For example, merging "Cyrillic"/<NUM> 7 with Latin/<ALPHANUM> x gives Latin/<ALPHANUM> 7x—rather than Unknown/<OTHER> 7x.
  • "Weak" tokens that are not merged are given a script of "Common", overriding any incorrect specific script they may have had. (This is the script label they get if they are the only text analyzed.)
  • <NUM> tokens that also match the Unicode regex pattern \p{L} are relabelled as <ALPHANUM>. (This applies primarily to mixed letter/number tokens that end in two or more digits, such as abc123, or longer strings of tokens from a spaceless string of text, as in the Khmer example above.)
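Put together, the repair loop looks roughly like this sketch (Python; the real icu_token_repair is a Lucene token filter in Java, and the script groups, type rules, and merge criteria here are reduced to a bare minimum):

from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    start: int
    end: int
    scripts: frozenset  # e.g. frozenset({"Latin"}); the real filter carries a single
                        # UScript constant and labels merged tokens "Unknown"
    type: str           # e.g. "<ALPHANUM>", "<NUM>", "<HANGUL>"

MERGEABLE = [{"Armenian", "Coptic", "Cyrillic", "Greek", "Latin"}, {"Lao", "Thai"}]
NEVER_MERGE = {"<EMOJI>", "<HANGUL>", "<IDEOGRAPHIC>"}
MAX_LEN = 100

def compatible(a, b):
    if a.type == "<NUM>" or b.type == "<NUM>":
        return True                     # numbers override the script-pair check
    return any(a.scripts <= g and b.scripts <= g for g in MERGEABLE)

def repair(tokens):
    prev = None
    for tok in tokens:
        if (prev is not None
                and prev.end == tok.start              # adjacent: nothing in between
                and prev.scripts != tok.scripts        # split on a script change
                and prev.type not in NEVER_MERGE
                and tok.type not in NEVER_MERGE
                and compatible(prev, tok)
                and len(prev.text) + len(tok.text) <= MAX_LEN):
            prev = Tok(prev.text + tok.text, prev.start, tok.end,
                       prev.scripts | tok.scripts,
                       prev.type if prev.type == tok.type else "<OTHER>")
            continue
        if prev is not None:
            yield prev
        prev = tok
    if prev is not None:
        yield prev

# toks = [Tok("Ko", 0, 2, frozenset({"Latin"}),    "<ALPHANUM>"),
#         Tok("Я",  2, 3, frozenset({"Cyrillic"}), "<ALPHANUM>"),
#         Tok("n",  3, 4, frozenset({"Latin"}),    "<ALPHANUM>")]
# list(repair(toks))  ->  a single "KoЯn" token spanning offsets 0-4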

Data, Examples, Testing[edit]

More Data for Script Testing[edit]

For testing purposes, I used the samples I had pulled from 90 different Wikipedias for general harmonization testing, plus an extra thirteen new Wikipedia samples with scripts not covered by the original 90. The new ones include Amharic, Aramaic, Tibetan, Cherokee, Divehi, Gothic, Inuktitut, Javanese, Lao, Manipuri, N’Ko, Santali, and Tamazight (scripts: Ethiopic, Syriac, Tibetan, Cherokee, Thaana, Gothic, Canadian Aboriginal Syllabics, Javanese, Lao, Meetei Mayek, N'Ko, Ol Chiki, Tifinagh; codes: am, arc, bo, chr, dv, got, iu, jv, lo, mni, nqo, sat, zgh).

Spurious Mergers[edit]

I've complained before about felonious erroneous word mergers that occur during export, but now it has a phab ticket (T311051), so it's not just me. A fair number of the dubious mixed-script tokens I found in my samples came from these export errors. The most common type were from bulleted lists, where the intro to the list is in one script ("..Early Greek television shows included") and the first item in the list is in another script ("Το σπίτι με τον φοίνικα"), leading to an apparent token like includedΤο (where the Το is Greek). This also happens to other same-script tokens (e.g., both merged words are in English, or both in Greek, etc.), but it isn't apparent in the current analysis. It's good that these mixed-script word-merging tokens are less common in real data than in my exports, because they would benefit from script-changing token splits if they were real, undermining the motivation for repairing ICU tokens.

What Can Merge With What?[edit]

Mixed-script tokens like piligrimచంఊు (Latin/Telugu), NASDAàź”àźżàź©àŻ (Latin/Tamil), WWEà€šà„à€Żà€Ÿ (Latin/Marathi) are real, not spurious, and seem to be correct, in that they are English words with non-English/non-Latin inflections on them because they appear on non-English Wikipedias. However, splitting on scripts here is a feature, not a bug, since it makes foreign terms more findable on these wikis.

On the other hand, mixed-script tokens like VərəϑraÎłna (Greek/Latin), BoĐčтĐșo (Cyrillic/Latin), λΔÎčÏ„ĂłÏ‚ (Greek/Latin), ĐœĐ”Ń‚Đ°ĐŒÏŒŃ€Ń„ĐŸŃĐž (Cyrillic/Greek), abĐ°Îład (Cyrillic/Greek/Latin) ΞασσօΌχατօÎč (Armenian/Greek), MđŒčđŒșđŒ·đŒ°đŒčđŒ» (Gothic/Latin), KᎀᎠᏩᏎ (Cherokee/Latin), ┉┃EE┓ (Latin/Tifinagh) are the kind we want to keep together. Breaking the words up into random bits isn't going to help anything. Some, like VərəϑraÎłna, are correct; it's a pronunciation guide, using Latin and Greek symbols (though the unusual theta ϑ is pretty weird if you aren't familiar with it). Others, like BoĐčтĐșo can be fixed by our Cyrillic/Latin homoglyph processing, and λΔÎčÏ„ĂłÏ‚, ĐœĐ”Ń‚Đ°ĐŒÏŒŃ€Ń„ĐŸŃĐž, and abĐ°Îład are definitely on the list for future Greek homoglyph processing. The Armenian and Gothic examples are currently farther down the list for homoglyph processing, but if we can keep it efficient, I'd love to be able to handle a much larger many-to-many homoglyph mapping.

Based on the frequency and appropriateness (and spuriousness!) of the kinds of mergers I saw when allowing all scripts to merge, I came up with the following groups to merge: Armenian/Coptic/Cyrillic/Greek/Latin, Lao/Thai, Latin/Tifinagh, Cherokee/Latin, Gothic/Latin, Canadian Aboriginal/Latin. (Groups with more than two scripts mean each of the scripts in the group is allowed to merge with any of the other scripts in the group.)

These groups are in fact mostly based on the presence of homoglyphs, which seems like the most obvious reason for not splitting on scripts. Alphabetic scripts also seem to like to mix-n-match in stylistic non-homoglyph ways, as in lιмιтed edιтιon. Technical non-homoglyph examples include words like sьrebro, which is a reconstructed Proto-Slavic form, ΑΝ΀âȰΝΙΝΟÏč, which mixes Greek and Coptic to transcribe words on an ancient coin, and sciency usages like Δv, μm, or ΛCDM.

Timing & Efficiency[edit]

Using an uncustomized analysis config with the ICU tokenizer—in this case Tibetan/bo—adding the default icu_token_repair filter increased the load time of a sizable chunk of English text by an average of 6.01% across 4 reloads. Using the more complex English analysis config with the ICU tokenizer enabled as a baseline, the increase was only 3.95%, averaged across 4 reloads. So, "about 5% more" is a reasonable estimate of the runtime cost of implementing icu_token_repair.

Limitations and Edge Cases[edit]

  • The icu_token_repair filter should probably be as early as possible in the token filter part of the analysis chain, both because other filters might do better working on its output (e.g., homoglyph normalization), or because they might hamper its ability to make repairs (e.g., stemming).
  • Java 8 and v8.7 of the ICU tokenizer do not always handle uncommon 32-bit characters well. For example, they don't recognize some characters as upper or lowercase (e.g., Osage/đ“đ“˜đ“»đ“˜đ“»đ“Ÿ) or digits (e.g., Tirhuta 𑓓 and Khudawadi 𑋳), or as having a script at all (like Mathematical Bold/Italic/Sans Serif/etc. Latin and Greek characters, like đ–đ±đČ and 𝒳𝓎𝓏), labelling them as "Common".
  • When numerals are also script-specific—like Devanagari २ (digit two)—they can be rejoined with other tokens, despite not being in the list of allowable scripts, because they have type <NUM>. So, x२ will be split and then rejoined. This is certainly a feature rather than a bug in the case of chemical formulas and numerical dimensions, like CH৩CO২, C۱۴H۱۲N۴O۲S, or ૪૦૩X૧૦૩૮—especially when there is a later decimal normalization filter that converts them to ch3co2, c14h12n4o2s, and 403x1038.
    • On the other hand, having the digits in a token like २২੨૨᠒᥈߂᧒᭒ (digit two in Devanagari, Bengali, Gurmukhi, Gujarati, Mongolian, Limbu, N'ko, New Tai Lue, and Balinese) split and then rejoin doesn't seem particularly right or wrong, but it is what happens.
    • Similarly, splitting apart and then rejoining the text x5क5x5x5क5क5д5x5д5x5γ into the tokens x5, क5, x5x5, क5क5, д5x5д5x5γ isn't exactly fabulous, but at least it is consistent (tokens are split after numerals, mergeable scripts are joined), but the input is kind of pathological anyway.
  • Script-based splits can put apostrophes at token edges, where they are dropped, blocking remerging. rock'ո'roll (Armenian ո) or О'Connor (Cyrillic О) cannot be rejoined because the apostrophe is lost during tokenization (unlike all-Latin rock'n'roll or O'Connor).
  • And, of course, typos always exist, and sometimes splitting on scripts fixes that, and sometimes it doesn't. I didn't see evidence that real typos (i.e., not data export problems) were a huge problem that splitting on scripts would readily fix.

Replacing the Standard Tokenizer with the ICU Tokenizer plus ICU Token Repair (also T332337)[edit]

In my previous analysis building and testing the icu_token_repair filter above, I replaced the standard tokenizer with the ICU tokenizer and used that as my baseline, looking at what icu_token_repair would and could do to the tokens that were created by the icu_tokenizer.

In this analysis, I'm starting with the production status quo and replacing the standard tokenizer with the icu_tokenizer‖ plus icu_token_repair, and adding icu_token_repair anywhere the icu_tokenizer is already enabled. This gives different changes to review, because changes made by the icu_tokenizer that are not affected by icu_token_repair are now relevant. (For example, tokenizing Chinese text or—foreshadowing—deleting © symbols.)

‖ Technically, it's the textify_icu_tokenizer, which is just a clone of the icu_tokenizer that lives within the right scope for icu_token_repair to be able to read and modify its data.

Lost and Found: ICU Tokenizer Impact[edit]

It's a very rough measure, but we can get a sense of the impact of enabling the ICU tokenizer (or, for some languages,# where it was already enabled, just adding icu_token_repair) by counting the number of "lost" and "found" tokens in a sample. Lost tokens are ones that do not exist at all after a change is made, while found tokens are ones that did not exist at all before a change is made.

# The ICU tokenizer is always enabled for Buginese, Burmese, Cantonese, Chinese, Classical Chinese, Cree, Dzongkha, Gan Chinese, Hakka, Japanese, Javanese, Khmer, Lao, Min Dong, Min Nan, Thai, Tibetan, and Wu. Some of these have other language-specific tokenizers in the text field, but still use the icu_tokenizer in the plain and suggest fields.

For example, when using the standard tokenizer, the text "维基百科" would be tokenized as 维 + 基 + 百 + 科. Switching to the ICU tokenizer, it would be tokenized as 维基 + 百科. So, in our diffs, 维, 基, 百, and 科 might be lost tokens (if they didn't occur anywhere else) and 维基 and 百科 would be found tokens (because they almost certainly don't exist anywhere else in the text because all Chinese text was broken up into single character unigrams by the standard tokenizer).

Another example: if the ICU tokenizer is already being used, and icu_token_repair is enabled, the text "KoЯn" might go from being tokenized as Ko + Я + n to being tokenized as KoЯn. Because, in a large sample, Я and n are likely to exist as single-letter tokens (say, discussing the Cyrillic alphabet for Я, or doing math with n), maybe only Ko would be a lost token, while KoЯn would be a found token.
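Concretely, the lost/found rates are computed along these lines (a sketch; the real analysis analysis tooling does more bookkeeping):

from collections import Counter

def lost_and_found(before_tokens, after_tokens):
    # "Lost" tokens occur only before the change; "found" tokens only after.
    # Rates are relative to the total token count on each side.
    before, after = Counter(before_tokens), Counter(after_tokens)
    lost = sum(c for tok, c in before.items() if tok not in after)
    found = sum(c for tok, c in after.items() if tok not in before)
    return lost / sum(before.values()), found / sum(after.values())

# lost_rate, found_rate = lost_and_found(old_sample_tokens, new_sample_tokens)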

So.. the median percentage of lost and found tokens in my 97 Wikipedia samples with changes (a handful had no changes at all) after turning on the ICU tokenizer + icu_token_repair is 0.04% (1 in 2500) for lost tokens, and 0.02% (1 in 5000) for found tokens. The maximums were 0.34% (1 in ~300) for lost tokens and 0.29% (1 in ~350) for found tokens (for Malay and Malayalam, respectively). More than 85% of samples were at or below 0.10% (1 in 1000) for both. Note that text on wikis is much more carefully edited and refined than queries on wikis, so we'd expect a higher impact on queries, where typos/editos/cut-n-paste-os are more common. The low impact is generally good, because we're generally trying to do better parsing uncommon text (e.g., CJK text on a non-CJK wiki) and fixing uncommon errors that come with that (splitting multi-script words with or without homoglyphs).

Survey of Changes[edit]

Parsing the Unparsed[edit]

The most common changes are the ones we expect to see, and the reason we want to use the ICU tokenizer over the standard tokenizer:

  • Hanzi/kanji/hanja (Chinese characters in Chinese, Japanese, or Korean contexts) and Hiragana: unigrams become words!
  • Hangul, Katakana, Lao, Myanmar, Thai: long tokens become shorter tokens (e.g., Thai tokens in the English Wikipedia sample had an average length of 11.6 with the standard tokenizer, but only 3.5 with the ICU tokenizer)
    • Some mixed Han/Kana tokens are parsed out of CJK text, too.
    • It's odd that the standard tokenizer splits most Hiragana to unigrams, but lumps most Katakana into long tokens.

Unsplitting the Split[edit]

Some less common changes are the ones we see in languages that already use the ICU tokenizer, so the only change we see is from icu_token_repair.

  • Tokens with leading numbers are no longer incorrectly split by the ICU tokenizer, like 3rd, 4x, 10x10. Some of these tokens are not "found" because other instances exist in contexts where they don't get split.
    • Reminder: the ICU tokenizer splits on script changes, but numbers inherit script from the text before them, even across white space! Thus, the text "3rd αβγ 3rd" is tokenized as 3rd + αβγ + 3 + rd, because the second 3 is considered "Greek", since it follows "αβγ". icu_token_repair fixes this to the more intuitive tokenization 3rd + αβγ + 3rd.

Unsmushing the Smushed[edit]

There are also a fair number of dual-script tokens that are made up of two strings in different scripts joined together (e.g., Bibliometricsதகவல்), rather than characters of two or more scripts intermixed (e.g., KoЯn). These are properly split by the icu_tokenizer and for many script pairs they are not repaired by icu_token_repair. These come in a few flavors:

  • Typos—someone left out a space, and the ICU tokenizer puts it back. Yay!
    • In the cases where icu_token_repair rejoins them, we have a case of "garbage in, garbage out", so it's a wash.
  • Spaceless languages—the space doesn't need to be there in the given language, and we don't have a language-specific tokenizer. Splitting them up does make the foreign word from the non-spaceless language more findable.
  • Inflected foreign words: some languages will add their morphology to foreign words, even in another script. As an unlikely but more understandable example, in English one could pluralize космонавт (Russian "cosmonaut") and talk about космонавтs (with a Latin plural -s on the end, there). Breaking these up for certain script pairs makes the foreign word more findable.
  • Export errors: These look like one of the cases above, particularly the first one, but are actually artefacts in my data. Bibliometricsதகவல் is an example: it appears on-wiki as Bibliometrics on one line of a bullet list and தகவல் on the next line. These are impossible to detect automatically, though I have investigated them in some cases when there are a lot of them. The underlying error has been reported on Phabricator in T311051.

Miscellaneous Errors, Changes, and Observations[edit]

Some things still don't work the way we might want or expect them to. Some of these I may try to address now or in future patches; others I may just accept as a rare exception to an exception to an exception that won't work right all the time.

Some cases, examples, and observations (roughly in order of how much they annoy me) include:

  • Symbols can be split off like numbers, but are harder to detect and correct.
    • Example: µl (with µ, the micro sign) is distinct from μl (with μ, Greek mu), but both are used for "microliters". As a symbol, the micro sign has no inherent script (it is labelled "Common" if it is the only text analyzed). When preceded by non-Latin characters, the micro sign gets a script label matching those non-Latin characters and gets split off from the Latin l. On the Tamil Wikipedia, for example, the micro sign (coming after Tamil characters) is tagged as "Tamil" script and we don't normally want to rejoin Tamil and Latin tokens. For numbers, we avoid this problem because the token type is <NUM>, which overrides the Tamil/Latin script mismatch. For the micro sign, the token type is <ALPHANUM>, the same as most normal words, so it can't override the script mismatch.
    • Another example: Phonetic transcriptions like ˈdʒɒdpʊər lose the initial stress mark (ˈ) when the preceding word is not in a script that can merge with Latin script. Of course, the use of stress marks is also inconsistent. Maybe we should just nuke them all! (That would also prevent us from indexing tokens that are just stress marks.)
    • In theory, we could look for tokens that are all symbols or all Common characters or something similar, but I'm hesitant to start extensively and expensively re-parsing so many tokens. (Maybe I should have tried to hack the ICU tokenizer directly. Ugh.)
  • Word-internal punctuation can prevent repairs. Some punctuation marks are allowed word-internally but not at word boundaries, so text like "he 'can't' do it" will generate the token can't rather than 'can't'.
    • The punctuation marks like this that I've found so far are straight and curly apostrophes/single quotes (' ‘ ’), middle dots (·), regular and fullwidth periods (. ．), and regular and fullwidth colons (: ：). Two instances of punctuation will cause a token split. Regular and fullwidth underscores (_ ＿) are retained before and after letters, even if there are multiple underscores.
    • Interestingly, for Hanzi, Hiragana, Katakana, and Hangul (i.e., CJK characters), word-internal punctuation generally seems to be treated as a token boundary. (Underscores can glom on to Katakana in my minimal test.) Arabic, Cyrillic, and Greek, as well as spacelessΔ Khmer, Lao, Myanmar, and Thai are treated the same as Latin text with respect to word-internal punctuation.

      Δ These languages aren't all 100% spaceless; some use spaces between phrases, but in general many words are run together with their neighbors.

    • The periods and colons generally aren't a problem for our analysis, since word_break_helper converts them to spaces. The curly quotes are straightened by the apostrophe normalization. The middle dots should probably get added to word_break_helper. (They do get used as syl·la·ble separators, but so do hy-phens and en dash–es, so breaking on them would make it more consistent. They can also be used for multiplication in equations or units, like kW·h, but dot operators or spaces (kW⋅h or kW h) can be used, too. (Can't do anything about unspaced versions like kWh, though.) On balance, splitting seems like a good choice.)
    • So, ideally, only apostrophes should be in this category of internally-allowed punctuation. If there's a script change on one side of an internal apostrophe, then, because the apostrophe is now at a token boundary, it is discarded. And because there is a character (the apostrophe) between the tokens, they can't be rejoined.
      • There are many examples from the Macedonian Wikipedia sample, though most turned out to be from a single article about a particular dialect of Macedonian. The article uses Latin ä, ą, á, é, ó, ú and surprisingly Greek η along with general Cyrillic for transcription of the dialect.
      • An actual homoglyph error of this type is O'Хара (a transliteration of O'Hara or O'Hare, but the O is Latin). Once split by the ICU tokenizer, it can't be rejoined by icu_token_repair (and thus it can't be fixed by our homoglyph processing).
      • Intentional non-homoglyph examples, where the split-blocking apostrophe is a benefit, include:
        • Microsoft'тун in Kyrgyz (ky), where -тун is a genitive inflection—similar to English -'s, plus it looks like it uses the apostrophe because Microsoft is foreign, a proper name, or non-Cyrillic.
        • The Tamazight (zgh) Wikipedia, generally written in Tifinagh (ⵜⵉⴼⵉⵏⴰⵖ), quotes a French source (written in Latin) about liturgical Greek (written in Greek), so we have l'Ακολουθική Ελληνική (« grec liturgique ») embedded in the Tamazight/Tifinagh text!
  • Numbers are separated by the icu_tokenizer from words in some spaceless languages/scripts, but not others. Numbers are separated from CJK scripts, but not Khmer, Lao, Myanmar, and Thai. (These line up with the ones that are split on all internal punctuation and those that are not. Must be some category internal to the ICU tokenizer.)
    • Having Arabic (0-9) or same-script numbers (e.g., Lao numbers between Lao words) in the text seems to glue together the word on each side of it. With other-script numbers (e.g., Khmer numbers between Lao words), icu_token_repair rejoins them, which is some sort of consistent, but not really desirable.
    • This is a regression for Thai, since the Thai tokenizer does split numbers (Arabic, Thai, and others) away from Thai script characters, and I added a filter to reimplement that when we switched to the otherwise generally better ICU tokenizer. However, icu_token_repair makes this inconsistent again, because Thai digits are "Thai" and can't rejoin, but Arabic digits are "Common" and can. I can hack around it by pre-converting Arabic digits to Thai digits, blocking the rejoining, and letting the Thai analyzer's decimal_digit filter convert them back to Arabic.
      • However, that would introduce a further potential wrinkle.. The Thai analyzer would take Arabic digits embedded in Lao (or Khmer, or Myanmar) text, and convert them to Thai digits; the icu_tokenizer would split them, but icu_token_repair would put them back together. Again, it's consistent in that Arabic, Thai, and Lao digits between Lao words would all be treated equally poorly by the Thai tokenizer. Maybe it's worth it to improve the treatment of Arabic and Thai digits in Thai text, or maybe I should let them all (Khmer, Lao, Myanmar, and Thai) suck equally for now and fix icu_token_repair in the future. Or fix it now (which means redeploying again before reindexing—that'd be popular!) Epicycles within epicycles.
  • Arabic thousands separators (٬ U+066C) are marked as particularly Arabic rather than as "Common" punctuation, which means that "abc 123٬456٬789" is tokenized as abc (Latin) + 123 (Latin) + 456٬789 (Arabic), while dropping the abc or replacing it with Arabic script keeps 123٬456٬789 as one token. Since we see all combinations of comma or Arabic thousands separator and Western/Eastern Arabic digits—like 1٬234, ۱٬۲۳۴, 1,234, and ۱,۲۳۴—and a few cases where the Arabic thousands separator is used like a comma, it makes sense to normalize it to a comma.
    • I found that the Arabic comma (، U+060C) is used between numbers, too. Converting that to a comma seems like it wouldn't hurt.
  • Other intentional multi-script words
    • Dhivehi Wikipedia seems to use the Arabic word for "God" (usually as the single-character ligature ﷲ) in Arabic names that contain "Allah", such as ȚąȚŠȚ„Ț°Ț‹Țšﷲ & ȚąȚŠȚ„Ț°Ț‹ȚȘﷲ (both seem to be used as forms of "Abdallah"). In other places where Dhivehi and Arabic text come together, they don't seem to intentionally be one word. Neither the desirable nor undesirable mixed script tokens seem to be very common, and ﷲ by itself is uncommon, so I'm not too worried about this case.
  • Hiragana and Katakana are not treated as consistently as I would expect. Sometimes they get split into unigrams. Perhaps that's what happens when no word is recognized by the ICU tokenizer.
  • The Arabic tatweel character is used to elongate Arabic words for spacing or justification. It also seems to be labelled as "Common" script by the ICU tokenizer, despite being very Arabic. As such, it picks up a script label from nearby words, like numbers do. The odd result is that in a string like "computerـەوە" (which, despite however it is displayed here, is "computer" + tatweel (probably for spacing) + Sorani/ckb grammar bits in Arabic script) the tatweel gloms on to computer, giving computerـ. In the Sorani analysis chain, the tatweel gets stripped later anyway (as it does in several other Arabic-script analysis chains).
    • In the abstract, this falls into the same category as µ and ˈ, but in the most common cases it gets used in places where it will eventually be deleted.
  • The ICU tokenizer deletes ©, ®, and ™. The standard tokenizer classifies these as <EMOJI> and splits them off from words they may be attached to, which seems like the right thing to do. The ICU tokenizer ignores them entirely, which doesn't seem horrible, either, though doing so will make them unsearchable with quotes (e.g., searching "MegaCorp™" (with quotes) won't find instances with ™.. you'll have to use insource).
  • Empty tokens can occur—though this is not specific to the ICU tokenizer. Certain characters like tatweel and Hangul Choseong Filler (U+115F) and Hangul Jungseong Filler (U+1160) can be parsed as individual tokens if they appear alone, but are deleted by the icu_normalizer, leaving empty tokens. We can filter empty tokens—and we do after certain filters, like icu_folding, where they seem more likely to pop up, but we don't currently filter them everywhere by default.
  • Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E) are split points in the standard tokenizer. This is not so good because they are diacritics that are placed on a vowel for stress (emphasis mark) or just the last vowel of the word they apply to (exclamation and question marks). So, it's as if English wrote questions like "What happeneˀd". The ICU tokenizer doesn't split on them, which is better, but dropping them seems to be an even better option.
  • The standard tokenizer has a token length limit of 255. Tokens longer than that get truncated. The ICU tokenizer's length limit is 4096, so the rare Very Long Token is allowed by the ICU tokenizer.
  • Language-specific diacritics can glom on to foreign script characters: ó್ (Kannada virama), a् (Devanagari virama), gੰ (Gurmukhi tippi).. on the other hand, "Latin" diacritics glom on to non-Latin scripts, too: უ̂ (Georgian + circumflex), ܪ̈ (Syriac + diaeresis), 𐌹̈ (Gothic + diaeresis). There's no obvious better thing to do if someone decides to use something like this to notate something or other—it's just confusing and looks like a probable error when it pops up. When it is an error—a Latin character pops up in the middle of a Kannada word, for example—there's no real right thing to do... garbage in, garbage out. The standard tokenizer splits on spaces and gives a garbage token; the ICU tokenizer splits on spaces and scripts, and gives three garbage tokens. (The extra garbage tokens are arguably an extra expense to index, but the numbers are minuscule.)
  • The standard tokenizer splits on a couple of rare and archaic Malayalam characters, the vertical bar virama ഻ (U+0D3B) and circular virama ഼ (U+0D3C). They are so rare that my generally extensive font collection doesn't cover them, so I had an excuse to install a new font, which always makes for a fun day! The vertical bar virama is so rare that I couldn't find any instances on the Malayalam Wikipedia, Wiktionary, or Wikisource! (with a quick insource regex search) Anyway, the ICU tokenizer keeps these viramas and makes better tokens, even if they do look like tofu on non–Malayalam readers' screens.

Analysis Updates to Make[edit]

Based on all of the above, these are the changes I plan to make.

  • Add middle dot (· U+00B7) to word_break_helper. The standard tokenizer doesn't split on them, either.
  • Delete primary (ˈ U+02C8) and secondary (ˌ U+02CC) stress markers, since they are inconsistently used across phonetic transcriptions, and the icu_tokenizer will generate tokens that are just stress marks. (Out of 938 examples with stress marks across my Wikipedia samples, all but two cases of either stress mark appear to be IPA, and one of the other two was clearly a non-IPA pronunciation.)
    • The remaining inconsistency would come from using apostrophe (' U+0027) as a primary stress mark, but that already doesn't match (ˈ U+02C8), so it isn't any worse than the status quo.
  • Delete tatweel (ـ U+0640) everywhere early on. This shouldn't cause any problems, but it would require checking to be sure.
    • This might be a chance to merge some small, universal mappings (e.g., nnbsp_norm, apostrophe_norm) into one mapping to save on overhead. (See globo_norm below.)
  • Update the Armenian character map to delete Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E).
    • On second thought, we can apply this globally. These characters occur in small numbers on non-Armenian wikis, usually in Armenian text. Global application as part of globo_norm (see below) would have a low marginal cost and would make the affected Armenian words on those wikis more searchable.
  • Convert Arabic thousands separator (٬ U+066C) to comma. Might as well convert Arabic comma (، U+060C) while we are here, too.
  • Convert Arabic digits to Thai digits in the Thai analyzer, so the digit hack we have there can do the right thing without icu_token_repair "fixing" the wrong thing.
  • Convert µ (micro sign) to μ (Greek mu) because it is the most commonly used "symbol" that gets split off of tokens in my samples.

It makes sense to create a globo_norm char filter, combining small universal mappings: fold in nnbsp_norm & apostrophe_norm, delete primary (ˈ U+02C8) and secondary (ˌ U+02CC) stress markers, delete tatweel (ـ U+0640), convert Arabic thousands separator (٬ U+066C) and Arabic comma (، U+060C) to comma, delete Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E), and convert µ (micro sign) to μ (Greek mu).
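
For my own notes, here's roughly what such a combined char filter could look like as Elasticsearch index settings, written out as a Python dict. The name globo_norm is ours, but the nnbsp and apostrophe mappings shown are just illustrative stand-ins for the full lists, and the real thing gets generated by AnalysisConfigBuilder, so treat this as a sketch.

    # Sketch only: a single "mapping" char filter bundling the small universal
    # normalizations described above. Mappings use JSON-style \uXXXX escapes.
    globo_norm_sketch = {
        "type": "mapping",
        "mappings": [
            "\\u202F=>\\u0020",  # narrow no-break space -> space (stand-in for nnbsp_norm)
            "\\u2019=>'",        # curly apostrophe -> straight (stand-in for apostrophe_norm)
            "\\u02C8=>",         # delete primary stress mark
            "\\u02CC=>",         # delete secondary stress mark
            "\\u0640=>",         # delete tatweel
            "\\u066C=>,",        # Arabic thousands separator -> comma
            "\\u060C=>,",        # Arabic comma -> comma
            "\\u055B=>",         # delete Armenian emphasis mark
            "\\u055C=>",         # delete Armenian exclamation mark
            "\\u055E=>",         # delete Armenian question mark
            "\\u00B5=>\\u03BC",  # micro sign -> Greek mu
        ],
    }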

Testing all those little changes showed nothing unexpected! Whew!

Timing Tests[edit]

My usual procedure for timing tests involves deleting all indices in the CirrusSearch docker image on my laptop and then timing bulk loading between 1,000 and 5,000 documents, depending on language.◊ There is definitely variation in load times, so this time I ran five reloads and averaged the four fastest for the timing of a given config. Successive reloads are often very similar, though having one outlier is common. Re-running the same config 20 minutes later can easily result in variation in the 1–2% range, which is as big as some of the effect sizes we are looking at, so all of the seemingly precise numbers should be taken with a grain of salt.

◊ The bulk load can only handle files less than 100MB in a single action. Some of my exports are limited by number of documents (max 5K, though I often aimed for a smaller sample of 2.5K), and some are limited by file size (because the average document length is greater for a given wiki or because the encoding of the script uses more bytes per character).
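
For the record, the reload-and-time loop is roughly the following. This is a sketch with hypothetical file names and a plain local Elasticsearch endpoint; in practice the index gets recreated with the config under test via the CirrusSearch maintenance scripts, which is elided here.

    import statistics
    import time

    import requests

    ES = "http://localhost:9200"                 # hypothetical local instance
    BULK_FILE = "enwiki-sample.bulk.ndjson"      # hypothetical pre-built bulk export (< 100MB)

    def one_reload(index: str = "enwiki_content") -> float:
        requests.delete(f"{ES}/{index}")         # drop the old index
        # ... recreate the index with the analysis config under test here ...
        body = open(BULK_FILE, "rb").read()
        start = time.monotonic()
        requests.post(f"{ES}/_bulk", data=body,
                      headers={"Content-Type": "application/x-ndjson"})
        return time.monotonic() - start

    # five reloads, average the four fastest
    times = sorted(one_reload() for _ in range(5))
    print("avg of 4 fastest:", statistics.mean(times[:4]))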

I ran some timing tests for different languages, comparing the old config and the new config for that specific language, and using data from a recent dump for the corresponding Wikipedia (1K–5K articles, as above). When sorted by load time increases, the languages fall into nice groupings, based on how much has changed for the given language analyzer—though there is still a fair amount of variation.

Δ Load Time   Wikipedia Sample   Analyzer Notes
 3.20%        zh / Chinese       uses other tokenizer for text field; already uses icu_tokenizer elsewhere; add icutokrep_no_camel_split
 7.23%        ja / Japanese
 7.76%        he / Hebrew        uses other tokenizer for text field; adding icu_tokenizer elsewhere, plus icu_token_repair + camel-safe version
 8.97%        ko / Korean
11.89%        th / Thai          already uses icu_tokenizer in text field and elsewhere; adding icu_token_repair + camel-safe version
18.40%        bo / Tibetan
19.86%        de / German        introducing icu_tokenizer in text field and elsewhere; adding icu_token_repair + camel-safe version
22.48%        hy / Armenian
23.08%        ru / Russian
23.62%        en / English
24.73%        it / Italian
25.61%        ar / Arabic
27.24%        fr / French
27.63%        es / Spanish


Roughly, indexing time has increased by 7–12% for analyzers using the ICU tokenizer, and 20–27% for those that switch from the standard tokenizer to the ICU tokenizer. That's more than expected based on my previous timing tests—which I realized is because the earlier tests only covered adding ICU token repair to the text fields, not the cost of the ICU tokenizer itself... it's a standard option, so it shouldn't be wildly expensive, right? (When working on the current config updates, I realized I should also take advantage of the opportunity to use the icu_tokenizer everywhere the standard tokenizer was being used, if icu_token_repair is available, which further increased the computational cost a bit.)

English, at ~24% increase in indexing time, is close to the median for analyzers among the eight largest wikis (Arabic, Armenian, English, French, German, Italian, Russian, Spanish) that are getting the ICU tokenizer for the first time. So, I did a step-by-step analysis of the elements of the config update in the English analysis chain to see where the increased load time is coming from.

My initial analysis was done over a larger span of time, and the stages were not done in any particular order, as I was still teasing out the individual pieces of the upgrade that needed accounting for. I later re-ran the analysis with only two reloads per stage, and re-ran all stages in as quick succession as I could. I also ran two extra reloads for the beginning and ending stages (i.e., the old and new configs) at the beginning and end of the re-run timings, and they showed 1–1.5% variability in these timings taken ~30 minutes apart.

The table below shows the somewhat idealized, smoothed merger of the two step-by-step timing experiments. As such, the total of the stages shown adds up to 24.5%, rather than the 23.62% in the table above.

In summary, for English, using the ICU tokenizer instead of the standard tokenizer everywhere (3 & 4) is about 7% of the increase (it does more, so it costs more); adding ICU token repair to the text field (6) is about 4% (below previous estimates); adding ICU token repair to the plain field (7) is 9.5%; adding ICU token repair to the suggest, source_text_plain, and word_prefix fields (8) is only about 2.5%. Merging character filter normalizers (2) is a net gain in speed (–2.5%), though we spend some of it (+1.5%) by increasing the number of characters we want to normalize (5).

Stage                                                      vs baseline   vs previous stage   Full re-index estimate
(1) Baseline: old config with standard tokenizer                —              —             14–17.5 days
(2) Combine nnbsp_norm and apostrophe_norm                    –2.5%          –2.5%
(3) Switch from standard tokenizer to icu_tokenizer            +7%           +9.5%
(4) Switch from icu_tokenizer to textify_icu_tokenizer         +7%            0%
(5) Add other character mappings                              +8.5%          +1.5%
(6) Enable icu_token_repair for text field                   +12.5%          +4%
(7) Enable icu_token_repair for text & plain fields           +22%           +9.5%
(8) Enable icu_token_repair for text, plain, suggest,
    source_text_plain, word_prefix fields                    +24.5%          +2.5%           17.4–21.8 days

In recent times, we've said that reindexing all wikis takes "about two weeks", though it can be quite a bit longer if too many indexes fail (or large indexes like Commons or Wikidata fail). It may be a little longer than two weeks because no one minds when reindexes run a little longer overnight or over the weekend, and since we don't babysit them the whole time, we might not notice if they take a day or two longer than exactly two weeks. To put the increased load time in context: estimating "about two weeks" as 14–17.5 days, a ~24% increase would be 17.4–21.7 days, or "about 2Âœ to 3 weeks".↓

↓ The overall average index time increase should be a little bit less, since some large wikis (Chinese, Japanese, Korean) have smaller increases (~3–9%). We shall see!

Next Steps[edit]

  • ✔ Get the icu_token_repair plugin ("textify", which also features acronym and camelCase processing) through code review.
  • ✔ Deploy the textify plugin.
  • ✔ Configure the AnalysisConfigBuilder to be aware of the textify plugin and enable the various features appropriately (ICU tokenizer, ICU token repair, acronyms, camelCase, etc.).
    • This includes testing the ICU tokenizer itself. For the ICU token repair, I enabled the ICU tokenizer almost everywhere as a baseline. I probably won't actually want to enable it where we currently use other custom tokenizers.
    • Note to Future Self: I'm going to have to figure out the current Thai number splitting filter, which was put in place because it was a feature of the Thai tokenizer that the ICU tokenizer didn't have. ICU token repair can, in the right contexts, reassemble some of the number+Thai word tokens we tried to avoid creating. I'll have to see whether not splitting numbers or not repairing tokens is the best compromise for Thai. (Or, updating icu_token_repair to be aware of this issue for Thai.. and Lao, and others...)
  • Reindex all the wikis and enable all this great stuff (T342444)—and incidentally take care of some old index issues (T353377).
  • Try not to think too hard about whether I should have gone and just learned the ICU tokenizer Rule-Based Break Iterator syntax and spec rather than treating the ICU tokenizer as a black box. (I'm sure there's a conversation I had in a comment section somewhere that stated or implied that those rules wouldn't be able to solve the split-token problem... so I probably did the right thing....)

Harmonization Post-Reindex Evaluation, Part I (T359100)[edit]

Background[edit]

See the Data Collection and "Infrastructure" sections above for more background.

To summarize, I took large samples of queries from 90 "reasonably active" wikis, and created samples of approximately 1000 queries for evaluation. Those queries are run daily, and each day's stats can be compared to the previous day's stats.

The daily run after reindexing for a given wiki lets us see the impact of the new analysis chain on that wiki by comparing it to the previous day. We also have daily runs for several days before and after reindexing to gauge what the typical amount of variability ("noise") is for a given wiki.

Examples of noise:

  • The zero-results rate is very stable from day to day for a fixed set of queries on virtually all wikis. ZRR may drift slowly (usually downward, since the amount of text on-wiki that a query can match tends to grow over time), but rarely by more than 0.1% over a couple of weeks.
  • The share of queries that increase their result counts day-to-day rarely gets over 1% on some wikis, while on others it can reliably be over 20% every day.
  • Similarly, the share of queries that change their top result day-to-day is usually under 1%, but on some wikis it is routinely 3–5%.
    • Note that index shards may play a role in top result changes for large, active wikis, as a different shard may serve the query on a different day. When there's no obvious best result (i.e., for mediocre queries), the top result can change from shard-to-shard because the shards have slightly different term weight statistics.
      • Results lower down the list may also swap around more easily, even for good queries with an obvious top result, but that doesn't affect our stats here. Smaller wikis only have one shard, and so are less prone to this kind of search/re-search volatility.
  • English Wikipedia is very active and somewhat volatile. Over a 3-week period, 21.2–27.9% of queries in a sample of 1000 increased their number of results returned every day. 3.1–8.3% changed their top result every day.

Samples: Relevant and "Irrelevant"[edit]

There is a lot of data to review! We have a general sample for each language wiki, plus "relevant" samples (see Relevant Query Corpora above) for as many languages as possible for each of acronyms, apostrophes, camelCase, word_break_helper, and the ICU tokenizer (with token repair). Relevant samples have a feature, and their analysis results changed when the config changed. (For example, a word with curly apostrophes, like aujourd’hui, would change to a word with straight apostrophes—aujourd'hui—so that counts as "relevant".)

I also held on to the "irrelevant" queries—those that had a feature but didn't change their analysis output for a given set of changes. For example a word with a straight apostrophe didn't change with apostrophe handling, but it might now match a word on-wiki that had a curly apostrophe. (I originally downplayed this a bit, but I think that not all wikis have managed to be as strict about on-wiki content as the bigger wikis, and some non-standard stuff always slips through.)

And of course in some cases queries without a given feature could match on-wiki text with that feature. For example, after the updates, the query Fred Rogers could match FredRogers or fred_rogers in an article, hence the general sample for each language.

Earlier Reindexing: A False Start[edit]

The early regex-based versions of acronym handling and camelCase handling were waaaaaaay tooooooo sloooooooow, and reindexing was stopped after it got through the letter e.

The new plugin-based versions of acronym handling and camelCase handling have almost identical results (they can handle a few more rare corner cases that are really, really far into the corner), and the apostrophe normalization and word_break_helper changes are functionally the same.

As a result, most general samples for wikis starting with a–e and feature-specific samples for everything but the ICU tokenizer upgrade (again, for wikis starting with a–e) are mostly unchanged with the most recent reindexing. I have daily diffs from the time of the earlier reindexing, though, so I will be looking at both.

Heuristics and Metrics[edit]

Because there is so much data to review, I'm making some assumptions and using some heuristics to streamline the effort.

Increased recall is the general goal, so any sample that has a net measurable decrease in the zero-results rate (i.e., by at least 0.1%) the day after reindexing counts as a success. If there's no change in ZRR, then a marked increase in the number of queries getting more results counts as a somewhat lesser success. If there is no increase in the number of queries getting more results, then a marked increase in the number of queries changing their top result counts as a potential precision success (though it requires inspection).

Since the typical daily change in increased results varies wildly by language, I'm looking for a standout change the day after reindexing. Some are 5x the max for other days and thus pretty obvious, others are a bit more subjective. I'm counting a marked increase as either (i) a change of more than ~5% over the max (e.g., 20% to 25%) or (ii) more than ~1.5x the max (e.g. 1.2% to 1.8%), where the max is from the ~10 days before and after reindexing.
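
In code, the "marked increase" heuristic is roughly this (values are percentages; the function name is just for illustration):

    def marked_increase(day_after: float, surrounding_max: float) -> bool:
        """Rough version of the heuristic above: surrounding_max is the largest
        day-to-day value seen in the ~10 days before and after reindexing."""
        return (day_after > surrounding_max + 5.0) or (day_after > surrounding_max * 1.5)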

Sometimes ZRR increases, which is not what we generally want. But in those cases, I'm looking at the specifics to see if there is something that makes sense as an increase in precision. That tends to happen with an analysis change that prevents things from being split apart. For example, before the ICU tokenizer, 维基百科 is split into four separate characters, which could match anywhere in a document; with the ICU tokenizer, it's split into two 2-character tokens (维基 & 百科), so it may match fewer documents, but presumably with higher precision/relevance. Similarly, pre-acronym-handling, N.A.S.A. matches any document with an n and s in it (a is a stop word!).. either of which can show up as initials in references, for example. So, treating W.E.I.R.D.A.C.R.O.N.Y.M. as "weirdacronym" and getting no results is actually better than treating it as a collection of individual letters that matches A–Z navigation links on list pages willy-nilly.

Also, when ZRR increases, I will also note marked increases in the number of queries getting more results or marked increase in the number of queries changing their top result, so the total tagged samples can add up to more than 100%.

I'm putting samples with fewer than 30 queries into the "small sample" category. They are interesting to look at, but will not count (for or against) the goal of reducing the zero-results rate and/or increasing the number of results returned for 75% of languages with relevant queries. I filtered as much junk as I could from the samples, but I can't dig into every language to sort reasonable queries (FredRogers) from unreasonable queries (FrrrredddЯojerzz was here). Small samples (especially < 5) can easily be dominated by junky queries I couldn't or didn't have time to identify.

I'm going to ignore "small" "irrelevant" samples with no changes... a tiny sample where nothing happened is not too interesting, so I didn't even annotate them.

General Sample Results[edit]

In the general larger samples (~1K queries) from each of 90 Wikipedias, we had the following results (net across acronym handling, apostrophe normalization, camelCase handling, word_break_helper changes, and ICU tokenization with multiscript token repair):

The day after reindexing...

  • 59 (66%) had their zero-results rate decrease
  • 22 (24%) had the number of queries getting more results increase
  • 6 (7%) had their zero-results rate increase
  • 6 (7%) had no noticeable change in ZRR, number of results, or change in top result

The total is more than 90 (i.e., more than 100%) because a handful of wikis are in multiple groups. A few wikis that had their ZRR increase also had the number of queries getting more results increase, for example. Esperanto wiki (eo) was one of the wikis that was caught up in the earlier reindexing false start. The earlier reindexing caused a decrease in ZRR, while the later reindexing caused an increase in ZRR, so I just counted it once for each.

Of the six wikis that had their zero-results rate increase, the queries that got no results fell into a few overlapping groups:

  • English, Esperanto, Italian, and Vietnamese all had Chinese or Japanese text that was no longer being broken up into single characters, so that's a likely increase in precision (and a drop in recall).
  • The English and Italian language configs (which apply to the English, Simple English, and Italian Wikipedias) previously had aggressive_splitting enabled, which would also split words with numbers (e.g., 3D would become 3, D). Disabling that generally improved precision (with a related decrease in recall).
  • Russian had a query using an acute accent like an apostrophe. The acute accent caused a word break with the old config, while it was converted to an apostrophe with the new config, creating a single longer token. (This is not a clear win, but Russian is also in the group that had the number of queries getting more results increase.)

So, five of the six cases of ZRR increases reflect improved precision, and the sixth had an increase in the number of results, so I'm calling all of those good. That leaves only six samples with no changes: Belarusian-Taraškievica, Hebrew, Japanese, Khmer, Kurdish, and Chinese (be-tarask, he, ja, km, ku, zh).

Thus 93% of general samples show positive effects overall from the combined harmonization efforts so far.

Note: The English general sample really highlights the limitation of looking at a single number (though that is all that is generally feasible with so much data and so little time to analyze it). Looking at everything but the ICU tokenizer (earlier reindexing), ZRR decreased. With only the ICU tokenizer, ZRR increased. General samples starting with f–z don't have this split, but still have the same ZRR tension internally. In the case of those where the decreased ZRR won out, I didn't look any closer. In the cases where increased ZRR dominated, I did, and most were cases of improved precision dampening recall. There are likely many more cases of improved precision dampening recall, but improved recall winning out overall in these samples.

Reminders & Results Key[edit]

  • The queries in a relevant sample showed changes in analysis to the query itself. In the irrelevant sample the queries had the relevant feature (e.g., something that matched an acronym-seeking grep regex) but had no change in analysis to the query itself—so there might be interesting changes in matches, but we aren't particularly expecting them.
  • Samples that are small have fewer than 30 queries in them, and are reported for the sake of interest, but are not officially part of the stats we are collecting. Small samples with no changes (noΔ) are noted in brackets, but are not counted as part of the "small" total, since the reason nothing interesting happened is likely that the samples are too small.
  • Net zero-results rate decreases (ZRR↓) are assumed to be a sign of good things happening.
  • Net zero-results rate increases (ZRR↑) require investigation. They could be bad news, or they could be the result of restricting matches to improve precision (no result is better than truly terrible nonsensical results). Zero-results rate increases that are most likely a net improvement will be re-counted under ZRR↑+, and discussed in the notes.
  • If ZRR did not go down (including ZRR going up—ZRR↑), but the number of queries getting more results markedly increased (Res↑), that is assumed to be a sign of somewhat lesser good things happening—but good things nonetheless.
  • If ZRR did not go down, and the number of queries getting more results did not go up, but the number of queries changing their top result (topΔ) markedly increased, those samples require investigation as possibly improved results. The same or fewer results, but with changed ranking could be the result of restricting or re-weighting matches to improve precision. Top result changes that are most likely the result of improvements in ranking or increased precision will be re-counted under topΔ+, and discussed in the notes.

Acronym Results[edit]

The table below shows the stats for the acronym-specific samples.

            rel         rel (sm)    irrel       irrel (sm)
total       40          26          1           0
ZRR↓        27 (68%)    5 (19%)     —           —
Res↑        7 (18%)     20 (77%)    1 (100%)    —
topΔ        2 (5%)      1 (4%)      —           —
— topΔ+     1 (3%)      —           —           —
ZRR↑        4 (10%)     —           —           —
— ZRR↑+     4 (10%)     —           —           —
noΔ         —           —           —           [14]

Notes:

  • Zero-results rate increases (ZRR↑) look like generally improved precision for English and Italian (though there are also a few queries from each that are affected by no longer splitting a1ph4num3ric words). The Korean queries are 40% improved precision, 20% improved recall, and 40% junk, so overall an improvement. Chinese are similar: 60% improved precision, 20% improved recall, 20% junk.
  • The two acronym samples with changes to the top result are Simple English and Thai. Simple English has some good improvements, but also a lot of garbage-in-garbage-out queries in this sample, so I'm calling it a wash. The Thai examples I looked at seem to be more accurate. Thai acronyms seem unlikely to get zero results because each character has a good chance of occurring somewhere else. English acronyms on Thai Wikipedia, though, don't necessarily match as well. For example, the query R.M.S Lusitania originally only matched the article for the RMS Olympic (which mentions the Lusitania, and has individual R, M, and S in it); the de-acronymized RMS Lusitania matches the English name in the opening text of the correct Thai article.
  • Irrelevant acronym queries are weird. There really shouldn't have been any, because acronyms are de-acronymized. The only way it happens is if both the acronym parts and the de-acronymized whole are removed from the query. This can happen in a few ways.
    • In several languages, there were just one or two queries with a backslash that I didn't properly escape during the relevant/irrelevant determination. This caused an Elastic error, resulting in 0 tokens. With the new config, there is still an error, still resulting in 0 tokens.
    • In Italian, there are non-Italian acronyms, like L.A. where the individual letters (l and a) and the de-acronymized token (la) are all stop words, so they generate no tokens!
      • Polish is somewhat similar, in that all of the unchanged queries contain Sp. z o.o., which is the equivalent to English LLC. Without acronym processing or word_break_helper, o.o. is a stop word. With word_break_helper, each individual o is a stop word. With acronym processing, the two-letter oo gets removed by a filter set up to tamp down on the idiosyncrasies of the statistical stemmer. In all cases, no tokens are generated, so there are no changes.
    • In Japanese, there are a couple of multiscript acronyms, like N.デ., which is split up by the tokenizer it uses (the standard tokenizer).
      • Thai is somewhat similar, with a de-acronymized Thai/Latin token split by the ICU tokenizer, which is a combo that is not allowed to be repaired by icu_token_repair. (The other Thai example isn't actually real because my acronym-seeking regex failed.. Thai is hard, man!)
    • The Chinese tokenizer (smartcn_tokenizer) is even more aggressive and splits up any non-Chinese/non-ASCII characters into individual tokens, so Cyrillic, Thai, and Latin mixed ASCII/non-ASCII de-acronymized tokens get broken up (along with regular Cyrillic, Thai, and Latin mixed ASCII/non-ASCII words). Parsing of Chinese characters ignores punctuation, so acronyms and de-acronymized tokens are usually treated the same. (Acronym processing still works on all-Latin acronyms, like N.A.S.A.)

Apostrophe Normalization Results[edit]

The table below shows the stats for the apostrophe normalization–specific samples.

            rel         rel (sm)    irrel       irrel (sm)
total       16          23          25          4
ZRR↓        10 (63%)    16 (70%)    11 (44%)    —
Res↑        4 (25%)     6 (26%)     9 (36%)     4 (100%)
topΔ        1 (6%)      —           1 (4%)      —
— topΔ+     1 (6%)      —           —           —
ZRR↑        5 (31%)     1 (4%)      5 (20%)     —
— ZRR↑+     4 (25%)     —           —           —
noΔ         —           —           —           [25]

  • Czech had both an increase in zero-results rate and an increase in changes to the top result. Almost all relevant queries used an acute accent (´) for an apostrophe, which causes a word split without apostrophe normalization. In a query like don´t know why, searching for don't, know, and why is obviously better than searching for don, t, know, and why (and not being able to match don't). These are precision improvements for both topΔ and ZRR↑.
  • German's net increase in ZRR is a mix of recall and precision improvements, all from acute (´) and grave (`) accents being used mostly for apostrophes. A small number are typos or trying to use the accent as an accent, but most, like l`auberge and l´ame, are clearly apostrophes.
    • Spanish and Portuguese are the same.
  • The English increase in ZRR comes down to one example, hoʻomū, which uses a modifier letter turned comma instead of an apostrophe. Without apostrophe normalization, it gets deleted by icu_folding and matches hoomu. Looking at the top result changes for English, I see several modifier letters (modifier letter turned comma, modifier letter apostrophe, modifier letter reversed comma, modifier letter right half ring) used as apostrophes. In English-language queries, the change to apostrophe is a help (jacobʻs ladder). In non-English names and words (on English Wikipedia) it's a mixed bag. For example, both Oahu and O'ahu are used to refer to the Hawaiian island; without apostrophe processing, oʻahu matches Oahu (deleted by icu_folding), but with apostrophe processing it matches O'ahu. Visually, I guess matching the version with an apostrophe is better. I think this is an improvement, but we were looking at ZRR, not top result changes, so I'll call this one a miss.

CamelCase Results[edit]

The table below shows the stats for the camelCase–specific samples.

            rel         rel (sm)    irrel       irrel (sm)
total       27          38          2           —
ZRR↓        27 (100%)   34 (87%)    —           —
Res↑        —           4 (10%)     2 (100%)    —
topΔ        —           —           —           —
— topΔ+     —           —           —           —
ZRR↑        —           —           —           —
— ZRR↑+     —           —           —           —
noΔ         —           —           —           [13]

CamelCase is an easy one, apparently!

Italian and English had some minor activity, but since they both had aggressive_splitting enabled before, they had no camelCase-related changes. Italian has one query with both camelCase and an a1ph4num3ric word, and the change in the a1ph4num3ric word was actually the cause of the change in results.

ICU Tokenization Results[edit]

The table below shows the stats for the samples specific to ICU tokenization (with ICU token repair).

            rel         rel (sm)    irrel       irrel (sm)
total       24          24          31          11
ZRR↓        8 (33%)     4 (17%)     7 (23%)     7 (64%)
Res↑        —           1 (4%)      16 (52%)    2 (18%)
topΔ        —           1 (4%)      1 (3%)      1 (9%)
— topΔ+     —           —           —           —
ZRR↑        16 (67%)    18 (75%)    16 (52%)    1 (9%)
— ZRR↑+     16 (67%)    —           —           —
noΔ         —           —           6 (19%)     —

In all cases where the zero-results rate increased for relevant queries, the queries are Chinese characters, or Japanese Hiragana, which the standard tokenizer splits up into single characters and which the ICU tokenizer parses more judiciously, so these are all precision improvements. Most of them also feature recall improvements, from Thai, Myanmar, or Japanese Katakana—which the standard tokenizer lumps into long strings—being parsed into smaller units; or, from incompatible mixed-script tokens (i.e., spacing errors) being split up.

word_break_helper Results[edit]

The table below shows the stats for the word_break_helper–specific samples.

            rel         rel (sm)    irrel       irrel (sm)
total       49          16          25          17
ZRR↓        39 (80%)    8 (50%)     14 (56%)    1 (6%)
Res↑        6 (12%)     4 (25%)     7 (28%)     15 (88%)
topΔ        3 (6%)      4 (25%)     —           1 (6%)
— topΔ+     1 (2%)      —           —           —
ZRR↑        —           —           4 (16%)     —
— ZRR↑+     —           —           —           —
noΔ         1 (2%)      —           —           [15]

Two of the three with changes to the top result seemed to be mediocre queries (Marathi and Myanmar, but in English/Latin script!) that had random-ish changes. The other (Malayalam, though also largely with English queries) showed some fairly clear precision improvements.

Summary Results[edit]

Below are the summary results for the general samples and the relevant non-"small" samples for each feature. net+ shows the percentage that had a positive impact (ZRR↓, Res↑, topΔ+, or ZRR↑+). Note that the categories can overlap (esp. ZRR↑(+) with Res↑ or topΔ(+)), so net+ is not always a simple sum.

ZRR↓ Res↑ topΔ topΔ+ ZRR↑ ZRR↑+ noΔ net+
general 66% 24% — — 7% 6% 7% 93%
acronym 68% 18% 5% 3% 10% 10% — 98%
apostrophe 63% 25% 6% 6% 31% 25% — 94%
camelCase 100% — — — — — — 100%
ICU tokens 33% — — — 67% 67% — 100%
wbh 80% 12% 6% 2% — — 2% 94%

Our target of reducing the zero-results rate (ZRR↓, preferred) and/or increasing the number of results returned (Res↑) for 75% of languages with relevant queries held true for all but the introduction of ICU tokenization, where we expected (and found) improvements in both recall and precision, depending on which "foreign" script a query and the text on-wiki are in.

For each collection of targeted queries for a given feature, we saw an improvement across more than 90% of query samples.

For the general random sample of Wikipedia queries, we also saw improvement across more than 90% of language samples, mostly in direct improvements to the zero-results rate—indicating that the features we are introducing to all (or almost all) wikis are useful upgrades!

Generally Enable dotted_i_fix (T358495)[edit]

Background[edit]

Some Turkic languages (like Turkish) use a different uppercase form of i and a different lowercase form of I, so that the upper/lowercase pairs are İ/i & I/ı. It's common in non-Turkish wikis to see İstanbul, written in the Turkish way, along with names like İbrahim, İlham, İskender, or İsmail.

On Turkish wikis, we want İ/i & I/ı to be lowercased correctly. The Turkish version of the lowercase token filter does exactly that, so it is enabled for Turkish.

On non-Turkish wikis, we want everything to be normalized to i, since most non-Turkish speakers can't type İ or ı, plus, they may know the names without the Turkish letters, like the common English form of Istanbul.

The default lowercase token filter lowercases both İ and I to i. However, the ICU upgrade, icu_normalizer, does not normalize İ to i. It generates a lowercase i with an extra combining dot above, as in i̇stanbul. The extra dot is rendered differently by different fonts, apps, and OSes. (On my MacBook, it looks especially terrible here in italics!) Sometimes it lands almost exactly on the dot of the i (or perhaps replaces it); other times it is placed above it. Sometimes it is a slightly different size, or a different shape, or slightly off-center. See the sample of fonts below.

The icu_normalizer lowercases İ to an i with an extra dot above, which is rendered differently by different fonts. The display differences can be very subtle: In Arial, the dot on the initial i is a little wider (probably two dots overlapping) than the later i; in Times New Roman, the initial dot is a little lower (probably it was replaced as any combining diacritic might be).

Of course, by default search treats i and i-with-an-extra-dot as different letters.
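
It's easy to see the problem with the _analyze API against any Elasticsearch instance that has the ICU plugin (the localhost setup here is hypothetical):

    import requests

    resp = requests.post("http://localhost:9200/_analyze", json={
        "tokenizer": "standard",
        "filter": ["icu_normalizer"],
        "text": "İstanbul",
    })
    print([t["token"] for t in resp.json()["tokens"]])
    # expect something like ['i̇stanbul'], an i plus a combining dot above (U+0307),
    # which is not the same as a plain i at search time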

When unpacking the various monolithic Elasticsearch analyzers, I noticed this problem, and started adding dotted_i_fix (a char_filter that maps İ to I, to preserve any camelCase or other case-dependent processing that may follow) to unpacked analyzers by default. I left dotted_i_fix out of any analyzers that I only refactored.

In our default analyzer, we upgrade lowercase to icu_normalizer if ICU components are available, which means default wikis have the dotted-I problem, too, but not the fix.

İ ❀ Turkıc Languages[edit]

Looking at the English Wikipedia pages for İ and ı, it looks like exceptions should be made for Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar, and Karakalpak (az, crh, gag, kk, tt, kaa)—all Turkic languages—and that they should use Turkish lowercasing.

When testing, I saw that the Karakalpak results didn't look so great, and upon further research, I discovered that Karakalpak uses I/i and İ/ı (sometimes I/ı, and maybe formerly I/i and Í/ı)... it's listed on the wiki page for ı, but not İ, so I should have known something was up! I took it off the list.

The others now have minimally customized analyzers to use Turkish lowercase—which is applied before the icu_normalizer, preventing any double-dot-i problems.

Things Not to Do: Before settling on a shared custom config for Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar, I tried adding a limited character filter mapping I=>ı and İ=>i, which seems like it could be more lightweight than an extra lowercase token filter (icu_normalizer still needs to run afterward for the more "interesting" normalizations, so it's an addition, not a replacement). However, it can cause camelCase processing to split or not split incorrectly, including during ICU token repair, it can interact oddly with homoglyph processing, and the exact ordering was brittle, so it hurt more than it helped. Turkish lowercase is just easier.
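
The minimal customization those five languages share boils down to an Elasticsearch lowercase filter in Turkish mode, sketched here as a settings fragment; the filter name turkic_lowercase is made up for the example, and the real config is generated by AnalysisConfigBuilder. The rest of each language's unpacked analysis chain stays as it was.

    # Sketch: Turkish-aware lowercasing, applied before icu_normalizer runs.
    turkic_lowercase = {
        "type": "lowercase",
        "language": "turkish",   # İ -> i and I -> ı, instead of the default I -> i
    }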

Global Custom Filters to the Rescue[edit]

The heart of the solution for non-Turkic languages is to add dotted_I_fix to the list of Global Custom Filters, which are generally applied to every analyzer, modulo any restrictions specified in the config. A few character normalizations, acronym handling, camelCase handling, homoglyph normalization, and ICU tokenizer repair are already all on the list of Global Custom Filters.

I discovered that not only does lowercase not cause the problem, but icu_folding solves it, too—converting i-with-an-extra-dot to plain i. I hadn't given this a lot of thought previously because in my step-by-step unpacking analysis, I fixed the icu_normalizer problem before enabling icu_folding.
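
The same _analyze check as above, with icu_folding added, shows the dotted i getting cleaned up (again assuming a hypothetical local instance with the ICU plugin):

    import requests

    resp = requests.post("http://localhost:9200/_analyze", json={
        "tokenizer": "standard",
        "filter": ["icu_normalizer", "icu_folding"],
        "text": "İstanbul",
    })
    print([t["token"] for t in resp.json()["tokens"]])   # expect ['istanbul']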

Letting icu_folding handle İ instead of dotted_I_fix causes a few changes.

  • If icu_folding comes after stemming and stop word filtering, then words like İN are not dropped as stop words, and words like HOPİNG are not properly stemmed. This affects very few words across my samples in ~100 languages, and it is in line with what already happens to words like ın and hopıng, as well as unrelated diacriticized words like ín, ÎN, hopïng, HOPÌNG, ĩn, ĬN, hopīng, HOPǏNG, etc.
  • Rarely, homoglyph processing is thwarted. A word like КОПİЙОК, which is all Cyrillic except for the Latin İ, doesn't get homoglyphified, but that's because homoglyph_norm doesn't currently link Latin İ and Cyrillic І̇.
  • (These issues occur in lots of languages, though many of my examples here are just English.)

However, as icu_folding needs to be custom configured per language, it isn't available everywhere, so dotted_I_fix is still needed, especially in the default config.

There are also some unusual legacy configs out there! For example, Italian does not upgrade asciifolding to icu_folding. It got customized (along with English) more than 9 years ago, and I haven't worked on it since, so while I've refactored the code, I never had time to tweak and re-test the resulting config. (Another task, T332342 "Standardize ASCII-folding/ICU-folding across analyzers", will address and probably change this—but I didn't want to look at it now. Got to keep the bites of the elephant reasonably small, eh?)

Other languages with custom config but without ICU folding: Chinese, Indonesian, Japanese/CJK, Khmer, Korean, Malay, Mirandese, and Polish. These were mostly unpacked (many by me, some before I got here, like Italian) before adding icu_folding became a standard part of unpacking. Mirandese has a mini-analysis chain with no stemmer and didn't get the full work up. These should all generally get dotted_I_fix in the new scheme.

Sometimes we have asciifolding_preserve, which we upgrade to preserve_original + icu_folding, which then holds on to the i-with-an-extra-dot, which could improve precision in some cases, I guess. (Making a firm decision is left as an exercise for the reader (probably me) in T332342.)

Overall, removing dotted_I_fix in cases where icu_folding is available—which in the past has often been on larger wikis in major languages with Elasticsearch/Lucene analyzers available—may marginally improve efficiency.

So, with all that in mind, next comes the fun part: determining how to decide whether dotted_I_fix should be applied in a given analyzer.

I originally tried excluding certain languages (Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar, and Turkish, obviously, but also Greek and Irish, because they have language-specific lowercasing). And I could skip enabling dotted_I_fix if certain filters were used—but ascii_folding is only upgraded if the ICU plugin is available.. except for Italian, which doesn't upgrade.. though that may change when T332342 gets worked on. Have I mentioned the epicycles?

Eventually I realized that (İ) if I moved enabling Global Custom Filters to be the very last upgrade step, then (ıı) I wouldn't have to guess whether lowercase or ascii_folding is upgraded, or (İIİ) try to maintain the long-distance tight coupling between language configs and the dotted_I_fix config in Global Custom Filters (which can be tricky—Ukrainian sneaks in ICU folding only if it is unpacked, which is dependent on the right plugin being installed!).

Finally, all I actually had to do was block dotted_I_fix on the presence of lowercase or icu_folding in the otherwise final analyzer config. Phew!
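
The gist of that final rule, sketched in Python (the real check lives in the PHP AnalysisConfigBuilder and looks at the otherwise final analyzer config):

    def wants_dotted_i_fix(analyzer: dict) -> bool:
        """Skip dotted_I_fix if plain lowercase or icu_folding is already in the
        final filter chain, since either of those handles İ on its own."""
        filters = analyzer.get("filter", [])
        return not any(f in ("lowercase", "icu_folding") for f in filters)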

Analysis Results[edit]

Overall, the mergers in non-Turkic languages are what we want to see: İstanbul analyzed the same as Istanbul, istanbul, and ıstanbul.

For the most part, the mergers in Turkic languages are also good, though there are occasional "impossible" triples. Given English Internet and Turkish Ä°nternet, which should merge with lowercase internet? On a Turkish wiki, Turkish Ä°nternet wins by default.

(I also rediscovered that the Chinese smartcn_tokenizer splits on non-ASCII Latin text, so as noted in my write up, fußball → fu, ß, ball.. or RESPUBLİKASI → respubl, İ, kasi (I also-also just realized that the smartcn_tokenizer lowercases A-Z, but not diacriticized uppercase Latin.. and it splits on every diacritical uppercase letter, so ÉÎÑŠȚÏẼǾ → É, Î, Ñ, Š, Ț, Ï, Ẽ, Ǿ—yikes!) With dotted_I_fix, RESPUBLİKASI does okay in Chinese, though.)

Extreme Miscellany[edit]

Out of nowhere, I noticed that the generic code for Norwegian (no) doesn't have a config, though the two more specific codes—Norwegian BokmĂ„l (nb) and Norwegian Nynorsk (nn)—do have configs. no.wikipedia.org uses nb, and nn.wikipedia.org uses nn (and nb.wikipedia.org rolls over to no.wikipedia.org). Both nb and nn have the same config. (There are nn-specific stemmers available, but they don't get used at the moment.) We probably won't use the no config, but since I figured all this out, I added it in anyway, because no, nb, and nn all using the same config is probably not ideal, but it is less dumb than nb and nn using the same config while no doesn't have any config at all!

Enab... no, wait.. Disable Hiragana-to-Katakana Mapping (T180387)[edit]

Background[edit]

A long time ago (2017) in a Phabricator ticket far away (T176197), we enabled a mapping from Japanese hiragana to katakana for English-language wikis, to make it easier to find words that could be written in either (such as "wolf", which could be either オオカミ or おおかみ, both "ōkami").

After much discussion on various village pumps, the consensus was that this wasn't a good idea on Japanese wikis, but would probably be helpful on others, and it got positive feedback (or at least no negative feedback) on French, Italian, Russian, and Swedish Wikipedias and Wiktionaries.

Since then, much water has flowed under the proverbial bridge—and many code and config changes have flowed through Gerrit—which has changed the situation and the applicability of the hiragana-to-katakana mapping in our current NLP context.

Back in the old days, under the rule of the standard tokenizer, hiragana was generally broken up into single characters, and katakana was generally kept as a single chunk. On non-Japanese wikis, both are much more likely to occur as single words, so converting おおかみ (previously indexed as お + お + か + み) to オオカミ (indexed as オオカミ) not only allowed cross-kana matching, but also improved precision, since we weren't trying to match individual hiragana characters.

Recently, though, we upgraded to the ICU tokenizer, exactly because it is better at parsing Japanese (and Chinese, Khmer, Korean, Lao, Myanmar, and Thai) and usually works much better for those languages, especially on wikis for other languages (e.g., parsing Japanese, et al., on English, Hungarian, or Serbian wikis).

Enabling the Kana Map[edit]

My first test was to enable the previously English-only kana_map character filter almost everywhere. Note that character filters apply before tokenization, and treat the text as one big stream of characters.
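
For reference, kana_map is a mapping character filter, so it rewrites the text before the tokenizer sees it; the shape of it is something like the following. Only a few hiragana-to-katakana rules are shown, and I haven't copied the exact list from the config, so treat this as a sketch.

    kana_map_sketch = {
        "type": "mapping",
        "mappings": [
            "お=>オ",
            "か=>カ",
            "み=>ミ",
            # ... one rule per hiragana character ...
        ],
    }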

As above, it makes sense to skip Japanese—and I was considering not enabling kana_map for Korean and Chinese, too, because they have different tokenizers—but the big question was how all the other languages behaved with it enabled.

The result was a bit of a mess. There were a lot of unexpected parsing changes, rather than just mappings between words written in different kana. Hmmmm.

I See You, ICU[edit]

The ICU tokenizer uses a dictionary of Chinese and Japanese words to parse Chinese and Japanese text. (The dictionary is available as a ~2.5MB plain text file. This may not be the exact version we are using on our projects since we are currently paused on Elastic 7.10 and its compatible ICU components, but there have probably not been any huge changes.)

One of the big changes I was seeing after enabling the kana_map was that words with mixed kanji (Chinese characters used in Japanese) and hiragana were being parsed differently. As a random example, 憅曞り (two kanji and one hiragana) is in the ICU dictionary, so it is parsed as one token. However, the version with katakana, 憅曞リ (i.e., the output of kana_map), is not in the dictionary, so it gets broken up into three tokens: 憅 + 曞 + リ.

Sometimes this kind of situation also resulted in the trailing katakana character being grouped with following characters. So, where we might have had something tokenized like "CCH+KK" before, with the hiragana-to-katakana mapping applied, we get "C+C+KK+K"... not only are words broken up, word boundaries are moving around. That's double plus ungood.

A New Hope.. uh.. New Plan[edit]

So, converting hiragana to katakana before tokenization isn't working out; what about converting it after tokenization?

This should have been easy, but it wasn't. There is no generic character-mapping token filter, though there are lots of language-specific normalization token filters that do one-to-one (and sometimes more complex) mappings, all of which we use for various languages. I looked at the code for a couple of them, and they all use hard-coded switch statements rather than a generic mapping capability, presumably for speed.

It took a fair amount of looking, but I found a well-hidden feature of the ICU plugin, the icu_transform filter, which can link together various conditions and pre-defined transformations and transliterations... and Hiragana-Katakana is one of them! ICU Transforms are a generic and powerful capability, which means the filter is probably pretty expensive, but it would do for testing, for sure!
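
Testing it is a one-liner against _analyze (hypothetical localhost setup; icu_tokenizer and the icu_transform filter both come with the ICU plugin, and "Hiragana-Katakana" is a standard ICU transliterator ID):

    import requests

    resp = requests.post("http://localhost:9200/_analyze", json={
        "tokenizer": "icu_tokenizer",
        "filter": [{"type": "icu_transform", "id": "Hiragana-Katakana"}],
        "text": "おおかみ",
    })
    print([t["token"] for t in resp.json()["tokens"]])
    # the kana get converted, but only after the tokenizer has already made its
    # segmentation decisions on the original hiragana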

The results were... underwhelming. All of the parsing problems went away, which was nice, but there were very few cross-kana mappings, and the large majority of those were single letters (e.g., お (hiragana "o") and オ (katakana "o") both being indexed as オ).

I See You, ICU—Part 2[edit]

So that Chinese/Japanese dictionary that the ICU tokenizer uses... it has almost 316K words (and phrases) in it. Many have hiragana (44.5K), many have katakana (22K), a majority have hanzi/kanji (Chinese characters—287K), and—based on adding up those values—a fair number have various mixes of the three (including, for example, 女のコ—an alternate spelling of "girl"—which uses all three in one word!).

I pulled out the words with kana (58,921 of them) and converted all of the hiragana in that list to katakana, and then looked for duplicates, of which there were 2,268 (only 3.85%).
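
My counting script was roughly the following. The dictionary file name and format are from memory, so treat the parsing as a guess; the main hiragana block (U+3041–U+3096) sits exactly 0x60 below the corresponding katakana, which is what makes the conversion easy.

    from collections import Counter

    def hira_to_kata(s: str) -> str:
        # shift main-block hiragana up to the corresponding katakana
        return "".join(chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c for c in s)

    def has_kana(s: str) -> bool:
        return any("\u3041" <= c <= "\u30FF" for c in s)

    # assumes one entry per line; the real cjdict file may carry extra fields
    words = [w.strip() for w in open("cjdict.txt", encoding="utf-8") if has_kana(w)]
    counts = Counter(hira_to_kata(w) for w in words)
    # merged forms with more than one source spelling (my "duplicates" tally may
    # have been counted slightly differently)
    dupes = sum(1 for c in counts.values() if c > 1)
    print(len(words), "kana entries;", dupes, "collide after hiragana->katakana")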

So, there are only about two thousand possible words that the ICU tokenizer could conceivably parse as both containing hiragana and as containing katakana, and then match them up after the hiragana-to-katakana mapping. And the original example, オオカミ vs おおかみ, is not even on the list. In Chrome or Safari on a Mac, if you double click on the end of オオカミ, it highlights the whole word. If you double click on the end of おおかみ, it only highlights the last three characters. That's because オオカミ is in the dictionary, but おおかみ is not. おかみ, meaning "landlady", is in the list, as is お, meaning "yes" or "okay". So the ICU tokenizer treats おおかみ as お + おかみ (which probably reads like terrible machine translation for "yes, landlady"). Converting お + おかみ to オ + オカミ still won't match オオカミ.
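
And checking the segmentation claims directly (hypothetical localhost setup with the ICU plugin):

    import requests

    for text in ("オオカミ", "おおかみ"):
        resp = requests.post("http://localhost:9200/_analyze", json={
            "tokenizer": "icu_tokenizer",
            "text": text,
        })
        print(text, "->", [t["token"] for t in resp.json()["tokens"]])
    # if the dictionary behaves as described, オオカミ stays whole
    # while おおかみ comes back as お + おかみ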

Also, the icu_transform filter is pretty slow. In a very minimal timing test, adding it to the English analysis chain makes loading a big chunk of English text take 10.7% longer, so a custom hiragana/katakana mapping token filter would probably be much faster.

No Hope.. No Plan[edit]

(That section title is definitely a bit dramatic.)

On the one hand, mapping hiragana and katakana is clearly worth doing to people other than us. Some browsers support this in on-page searching. For example, in Safari and Chrome on a Mac, searching on the page for おおかみ finds オオカミ, though in Firefox it doesn't. (I haven't tested other browsers or operating systems.)

On the other hand, Google Search is still treating them differently. Searching for おおかみ and オオカミ gives very different numbers of results—11.8M for おおかみ (which includes 狼, the Chinese character for "wolf") vs 33.1M for オオカミ. Yahoo! Japan gives 12.9M vs 28.2M. These tell the same story as the numbers from 2017 on Phab.

Comparing the analysis in English, with and without the kana_map character filter enabled, it's definitely better to turn it off. And as with all the other non-CJK languages, enabling the icu_transform solves the worst problems, but doesn't do much positive, and it's very expensive.

So, in conclusion, the best thing seems to be to do the opposite of the original plan, and disable the hiragana-to-katakana mapping in favor of getting the value of the ICU tokenizer parsing Japanese text (and text in other Asian languages).

Other Things to Do[edit]

A list of incidental things to do that I noticed while working on the more focused sub-projects above.

The first list is relatively simple things that should definitely be done.

  • Unpack Ukrainian even if extra-analysis-ukrainian is not available
  • Add Latin İ / Cyrillic І̇ to the homoglyph_norm list.
  • ✔ Enable dotted_I_fix (almost?) everywhere, and maybe enable Turkish lowercase for languages that distinguish I/ı and Ä°/i.
  • ✔ Add remove_duplicates after hebrew_lemmatizer in the Hebrew analysis chain, to remove exact duplicates. (See "Add remove_duplicates to Hebrew" above).
  • ✔ Refactor English, Japanese, etc. configs to use AnalysisBuilder
  • ✔ Merge mapping filters when possible
    • nnbsp_norm and apostrophe_norm are universal and (can) occur first, so merging makes sense. nnbsp_norm is used in other places so it needs to exist on its own, too, though.
    • kana_map will not be universal (it will not be used in Japanese), so merging it would be... tricky? I've thought about trying to build up single mapping filter with all the general and language-specific mappings in it, but it might be too much maintenance burden and code complexity.
      • And it could cause a mess in multi-lingual configs like Wikidata.
  • ✔ Enable a config for no/Norwegian.

The second list involves somewhat more complicated issues or lower priority issues that could use looking at.

  • Either add a reverse number hack and number/period split (similar to the one for Thai) to Khmer, Lao, and Myanmar, or change icu_token_repair to not merge those specific scripts with adjacent numbers.
  • Investigate identifying tokens generated by the ICU tokenizer that are all symbols or all Common characters or something similar, and allowing them to merge in certain situations (to handle the micro sign case more generally).
  • See if any parts of the Armenian (hy) analysis chain can do useful things for Western Armenian (hyw) wikis.
  • Consider a minor global French/Italian elision filter for d'– and l'– and English possessive filter for –'s (almost?) everywhere.
  • Try to make homoglyph norm more efficient, especially if I ever get around to expanding it to include Greek
  • Do some timings on the Khmer syllable reordering, just to see how terrible it is!
  • Investigate nn/Nynorsk stemmers for nnwiki.