User:TJones (WMF)/Notes/Language Analyzer Harmonization Notes

May 2023 — See TJones_(WMF)/Notes for other projects. See also T219550. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro, Goals, Caveats
The goal of bringing language analyzers "into harmony" is to have as many of the non–language-specific elements of the analyzers to be the same as possible. Some split words on underscores and periods, some don't. Some split CamelCase words and some don't. Some use ASCII folding, some use ICU folding, and some don't use either. Some preserve the original word and have two ouptuts when folding, and some don't. Some use the ICU tokenizer and some use the standard tokenizer (for no particular reason—there are good reasons to use the ICU, Hebrew, Korean, or Chinese tokenizers in particular cases). When there is no language-specific reason for these differences, it's confusing, and we clearly aren't using analysis best practices everywhere.

My design goal is to have all of the relevant upgrades made by default across all language analysis configurations, with only the exceptions having to be explicitly configured.

Our performance goal is to reduce zero-results rate and/or increase the number of results returned for 75% of relevant queries averaged across all wikis. This goal comes with some caveats, left out of the initial statement to keep it reasonably concise.


 * "All wikis" is, in effect, "all reasonably active wikis"—if a wiki has only had twelve searches last month, none with apostrophes, it's hard to meaningfully measure "75% of the queries with apostrophes" in them. More details in "Data Collection" below.
 * I'm also limiting my samples to Wikipedias because they have the most variety of content and queries, and to limit testing scope, allowing more languages to be included.
 * I'm going to ignore wikis with unchanged configs (some elements are already deployed on some wikis), since they will have approximately 0% change in results (there's always a bit of noise).
 * "Relevant" queries are those that have the feature being worked on. So, I will have a collection of queries with apostrophe-like characters in them to test improved apostrophe handling, and a collection of queries with acronyms to test better acronym processing. I'll still test general query corpora to get a sense of the overall impact, and to look for cases where queries without the feature being worked on still get more matches (for example, searching for NASA should get more matches to N.A.S.A. in articles).
 * I'm also applying my usual filters (used for all the unpacking impact analyses) to queries, mostly to filter out porn and other junk. For example, I don't think it is super important whether the query s`wsdfffffffsf actually gets more results once we normalize the backtick/grave accent to an apostrophe.
 * Smaller/lower-activity wikis may get filtered out for having too few relevant queries for a given feature.
 * We are averaging rates across wikis so that wiki size isn't a factor (and neither is sample rate—so, I can oversample smaller wikis without having to worry about a lot of bookkeeping).

Data Collection
I started by including all Wikipedias with 10,000 or more articles. I also gathered the number of active editors and the number of full-text queries (with the usual anti-bot filters) for March 2023. I dropped those with less than 700 monthly queries and fewer than 50 active editors. My original ideas for thresholds had been ~1000 monthly queries and ~100 active editors, but I didn't want or need a super sharp cut off. Limiting by very low active editor counts meant fewer samples to get at the query-gathering step, which is somewhat time-consuming. Limiting by query count also meant less work at the next step of filtering queries, and all later steps, too.

I ran my usual query filters (as mentioned above), and also dropped wikis with fewer than 700 unique queries after filtering. That left 90 Wikipedias to work with. In order of number of unique filtered monthly queries, they are: English, Spanish, French, German, Russian, Japanese, Chinese, Italian, Portuguese, Polish, Arabic, Dutch, Czech, Korean, Indonesian, Turkish, Persian, Vietnamese, Swedish, Hebrew, Ukrainian, Igbo, Finnish, Hungarian, Romanian, Greek, Norwegian, Catalan, Hindi, Thai, Simple English, Danish, Bangla, Slovak, Bulgarian, Swahili, Croatian, Serbian, Tagalog, Slovenian, Lithuanian, Georgian, Tamil, Malay, Uzbek, Estonian, Albanian, Azerbaijani, Latvian, Armenian, Marathi, Burmese, Malayalam, Afrikaans, Urdu, Basque, Mongolian, Telugu, Sinhala, Kazakh, Macedonian, Khmer, Kannada, Bosnian, Egyptian Arabic, Galician, Cantonese, Icelandic, Gujarati, Central Kurdish, Serbo-Croatian, Nepali, Latin, Kyrgyz, Belarusian, Esperanto, Norwegian Nynorsk, Assamese, Tajik, Punjabi, Oriya, Welsh, Asturian, Belarusian-Taraškievica, Scots, Luxembourgish, Irish, Alemannic, Breton, & Kurdish.


 * Or, in language codes: en, es, fr, de, ru, ja, zh, it, pt, pl, ar, nl, cs, ko, id, tr, fa, vi, sv, he, uk, ig, fi, hu, ro, el, no, ca, hi, th, simple, da, bn, sk, bg, sw, hr, sr, tl, sl, lt, ka, ta, ms, uz, et, sq, az, lv, hy, mr, my, ml, af, ur, eu, mn, te, si, kk, mk, km, kn, bs, arz, gl, zh-yue, is, gu, ckb, sh, ne, la, ky, be, eo, nn, as, tg, pa, or, cy, ast, be-tarask, sco, lb, ga, als, br, ku.

I sampled 1,000 unique filtered queries from each language (except for those that had fewer than 1000).

I also pulled 1,000 articles from each Wikipedia to use for testing.

I used a combined corpus of the ~1K queries and the 1K articles for each language to test analysis changes. This allows me to see interactions between words/characters that occur more in queries and words/characters that occur more in articles.

Relevant Query Corpora
For each task, I plan to pull a corpus of "relevant" queries for each language for before-and-after impact assessment, by grepping for the relevant characters. For each corpus, I'll also do some preprocessing to remove queries that are unchanged by the analysis upgrades being made.

For example, when looking at apostrophe-like characters, ICU folding already converts typical curly quotes (‘’) to straight quotes ('), so for languages with ICU folding enabled, curly quotes won't be treated any differently, so I plan to remove those queries as "irrelevant". Another example is reversed prime (‵), which causes a word break with the standard tokenizer; apostrophes are stripped at the beginning or ending of words, so reversed prime at the edge of a word isn't actually treated differently from an apostrophe in the same place—though the reasons are very different.

For very large corpora (≫1000, for sure), I'll probably sample the corpus down to a more reasonable size after removing "irrelevant" queries.

I'm going to keep (or sample) the "irrelevant" queries (e.g., words with straight apostrophes or typical curly quotes handled by ICU folding) for before-and-after analysis, because they may still get new matches on words in wiki articles that use the less-common characters, though there are often many, many fewer such words on-wiki—because the WikiGnomes are always WikiGnoming!

Another interesting wrinkle is that French and Swedish use ICU folding with "preserve original", so that both the original form and folded form are indexed (e.g., l’apostrophe is indexed as both l’apostrophe and l'apostrophe). This doesn't change matching, but it may affect ranking. I'm going to turn off the "preserve original" filter for the purpose of removing "irrelevant" queries, since we are focused on matching here.

Some Observations
After filtering porn and likely junk queries and uniquifying queries, the percentage of queries remaining generally ranged from 94.52% (Icelandic—so many unique queries!) to 70.58% (Persian), with a median of 87.31% (Simple English), and a generally smooth distribution across that range.

There were three outliers:


 * Swahili (57.51%) and Igbo (37.56%) just had a lot of junk queries.
 * Vietnamese was even lower at 30.03%, with some junk queries but also an amazing number of repeated queries, many of which are quite complex (not like everyone is searching for just famous names or movie titles or something "simple"). A few queries I looked up on Google seem to exactly match titles or excerpts of web pages. I wonder if there is a browser tool or plugin somewhere that is automatically doing wiki searches based on page content.

Re-Sampling & Zero-Results Rate
I found a bug in my filtering process, which did not properly remove certain very long queries that get 0 results, which I classify as "junk". These accounted for less than 1% of any given sample, but it was still weird to have many samples ranging from 990–999 queries instead of the desired 1,000. Since I hadn't used my baseline samples for anything at that point, I decided to re-sample them. This also gave me an opportunity to compare zero-results rates (ZRR) between the old and new samples.

In the case of very small queries corpora, the old and new samples may largely overlap, or even be identical. (For example, if there are only 800 queries to sample from, my sample "of 1000" is going to include all of them, every time I try to take a sample.) Since this ZRR comparison was not the point of the exercise, I'm just going to throw out what I found as I found it, and not worry about any sampling biases—though they obviously include overlapping samples, and potential effects of the original filtering error.

The actual old/new ZRR for these samples ranged from 6.3%/6.2% (Japanese) to 75.4%/76.1% (Igbo—wow!). The zero-results rate differences from the old to the new sample ranged from -4.2% (Gujarati, 64.3% vs 60.1%) to +5.6% (Dutch, 22.1% vs 27.7%), with a median of 0.0% and mean of -0.2%. Proportional rates ranged from -19.9% (Galician, 17.5% vs 14.6%) to +20.2% (Dutch, 22.1% vs 27.7%, again), with a median of 0.0%, and a mean of -0.5%.

Looking at the graph, there are some minor outliers, but nothing ridiculous, which is nice to see.



"Infrastructure"
I've built up some temporary "infrastructure" to support impact analysis of the harmonization changes. Since every or almost every wiki will need to be reindexed to enable harmonization changes, timing the "before and after" query analyses for the 90 sampled wikis would be difficult.

Instead, I've set up a daily process that runs all 90 samples each day. There's an added bonus of seeing the daily variation in results without any changes.

I will also pull relevant sub-samples for each of the features (apostrophes, acronyms,, etc.) being worked on and run them daily as well.

There's a rather small chance of having a reindexing finish while a sample is being run, so that half the sample is "before" and half is "after". If that happens, I can change my monitoring cadence to every other day for that sample for comparison's sake and it should be ok.

Apostrophes (T315118)
There are some pretty common apostrophe variations that we see all the time, particularly the straight vs curly apostrophes—e.g., ain't vs ain’t. And of course people (or their software) will sometimes curl the apostrophe the wrong way—e.g., ain‘t. But lots of other characters regularly (and some irregularly) get used as apostrophes, or apostrophes get used for them—e.g., Hawai'i or Hawai’i or Hawai‘i when the correct Hawaiian letter is the okina: Hawaiʻi.

A while back, we worked on a ticket (T311654) for the Nias Wikipedia to normalize some common apostrophe-like variants, and at the time I noted that we should generalize that across languages and wikis as much as possible. ICU normalization and ICU folding already do some of this (see the table below)—especially for the usual ‘curly’ apostrophes/single quotes, but those cases are common enough that we should take care of them even when the ICU plugin is not available. It'd also be nice if the treatment of these characters was more consistent across languages, and not dependent on the specific tokenizer and filters configured for a language.

There are many candidate "apostrophe-like" characters. The list below is distillation of the list of Unicode Confusables for apostrophe, characters I had already known were potential candidates from various Phab tickets and my own analysis experience (especially working on Turkish apostrophes), and the results of data-mining for apostrophe-like contexts (e.g., Hawai_i).

Key


 * x'x—It's hard to visually distinguish all the vaguely apostrophe-like characters on-screen, so after ordering them, I put a letter (or two) before them and an x after them. The letter before makes it easier to see where each one is/was when looking at the analysis output, and the x after doesn't seem to be modified by any of the analyzers I'm working with. And x'x is an easy shorthand to refer to a character without having to specify its full name.
 * Also, apostrophe-like characters sometimes get treated differently at the margins of a word. (Schrodinger's apostrophe: inside a word it's an apostrophe, at the margins, it's a single quote.) Putting it between two alpha characters gives it the most apostrophe-like context.
 * Desc.—The Unicode description of the character
 * #q—The number of occurrences of this character (in any usage) in my 90-language full query sample. Samples can be heavily skewed: Hebrew letter yod occurs a lot in Hebrew queries—shocker! Big wiki samples are larger, so English is over-represented. Primary default sort key.
 * #wiki samp—The number of occurrences of this character in my 90-language 1K Wikipedia sample. Samples can be skewed by language (as with Hebrew yod above), but less so by sample size. All samples are 1K articles, but some wikis have longer average articles. Secondary default sort key.
 * UTF—UTF codepoint for the character. Tertiary default sort key.
 * Example—An actual example of the character being used in an apostrophe-like way. Most come from English Wikipedia article or query samples. Others I had to look harder to find—in other samples, or using on-wiki search.
 * Just because a word or a few words exist with the character used in an apostrophe-like way doesn't mean it should be treated as an apostrophe. When looking for words matching the Hawai_i pattern, I found Hawai*i, Hawai,i, and Hawai«i, too. I don't think anyone would suggest that asterisks, commas, or guille­mets should be treated as apostrophes.
 * I never found a real example of Hebrew yod being used as an apostrophe. I only found two instances of it embedded in a Latin-script word (e.g. Archיologiques), and there it looked like an encoding error, since it has clearly replaced é. I fixed both of those (through my volunteer account).
 * I really did find an example of apostrophe's using a real apostrophe!
 * std tok (is)—What does the standard tokenizer (exemplified by the is/Icelandic analyzer) do to this character?
 * icu tok (my)—What does the ICU tokenizer (exemplified by the my/Myanmar analyzer) do to this character?
 * heb tok (he)—What does the HebMorph tokenizer (exemplified by the he/Hebrew analyzer) do to this character?
 * nori tok (ko)—What does the Nori tokenizer (exemplified by the ko/Korean analyzer) do to this character?
 * smart cn (zh)—What does the SmartCN tokenizer (exemplified by the zh/Chinese analyzer) do to this character?
 * icu norm (de)—What does the ICU normalizer filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
 * icu fold (de)—What does the ICU folding filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
 * icu norm (wsp)—What does the ICU normalizer filter do to this character, after going through a whitespace tokenizer? (The whitespace tokenizer just splits on spaces, tabs, newlines, etc. There's no language for this, so it was a custom config.)
 * icu norm + fold (wsp)—What does the ICU normalizer filter + the ICU folding filter do to this character, after going through a whitespace tokenizer? (We never enable the ICU folding filter without enabling ICU normalization first—so this is a more "typical" config.)
 * icu fold (wsp)—What does the ICU folding filter do to this character, after going through a whitespace tokenizer, without ICU normalization first?
 * Tokenizer and Normalization Sub-Key
 * split means the tokenizer splits on this character—at least in the context of being between Latin characters. Specifically non-Latin characters get split by the ICU tokenizer between Latin characters in general because it always splits on script changes. (General punctuation doesn't belong to a specific script.) So, the standard tokenizer splits a‵x to a and x.
 * split/keep means the tokenizer splits before and after the character, but keeps the character. So, the ICU tokenizer splits dߴx to d, ߴ, and x.
 * → ? means the tokenizer or filter converts the character to another character. So, the HebMorph tokenizer tokenizer c‛x as c'x (with an apostrophe).
 * The most common conversion is to an apostrophe. The SmartCN tokenzier converts most punctuation to a comma. The ICU normalizer converts some characters to space plus another character (I don't get the reasoning, so I wonder if this might be a bug); I've put those in square brackets, though the space doesn't really show up, and put a mini-description in parens, e.g. "(sp + U+301)". Fullwidth grave accent gets normalized to a regular grave accent by ICU normalization.
 * split/keep → ,—which is common in the SmartCN tokenizer column—means that text is split before and after the character, the character is not deleted, but it is converted to a comma. So, the SmartCN tokenizer tokenizes a‵x as a + , + x.
 * delete means the tokenizer or filter deletes the character. So, ICU folding converts dߴx to dx.
 * Nias—For reference, these are the characters normalized specifically for nia/Nias in Phab ticket T311654.
 * apos-like—After reviewing the query and Wikipedia samples, this character does seem to commonly be used in apostrophe-like ways. (In cases of the rarer characters, like bꞌx, I had to go looking on-wiki for examples.)
 * + means it is, – means it isn't, == means this is the row for the actual apostrophe!
 * transitive—This character is not regularly used in an apostrophe-like way, but it is normalized by a tokenizer or filter into a character that is regularly used in an apostrophe-like way.
 * apos is x-like?—While the character is not used in apostrophe-like way (i.e., doesn't appear in Hawai_i, can_t, don_t, won_t, etc.), apostrophes are used where this character should be.
 * + means it is, – means it isn't, blank means I didn't check (because it was already apostrophe-like or transitively apostrophe-like).
 * final fold—Should this character get folded to an apostrophe by default? If it is apostrophe-like, transitively apostrophe-like, or apostrophes get used where it gets used—i.e., a + in any of the three rpevious columns—then the answer is yes (+).

Character-by-Character Notes

 * a‵x (reversed prime): This character is very rarely used anywhere, but it is normalized to apostrophe by ICU folding
 * bꞌx (Latin small letter saltillo): This is used in some alphabets to represent a glottal stop, and apostrophes are often used to represent a glottal stop, so they are mixed up. In the English Wikipedia article for Mi'kmaq (apostrophe in the title), miꞌkmaq (with saltillo) is used 144 times, while mi'kmaq (with apostrophe) is used 78 times—on the same page!
 * c‛x (single high-reversed-9 quotation mark): used as a reverse quote and an apostrophe.
 * dߴx (N'ko high tone apostrophe): This seems to be an N'ko character almost always used for N'ko things. It's uncommon off the nqo/N'ko Wikipedia, and on the nqo/N'ko Wikipedia the characters do not seem to be not interchangeable.
 * e῾x (Greek dasia): A Greek character almost always used for Greek things.
 * fʽx (modifier letter reversed comma): Commonly used in apostrophe-like ways.
 * g᾿x (Greek psili): A Greek character almost always used for Greek things.
 * h᾽x (Greek koronis): A Greek character almost always used for Greek things.
 * i՚x (Armenian apostrophe): An Armenian character almost always used for Armenian things, esp. in Western Armenian—however, the non-Armenian apostophe is often used for the Armenian apostrophe.
 * j｀x (fullwidth grave accent): This is actually pretty rare. It is mostly used in kaomoji, like (*´ω｀*), and for quotes. But it often gets normalized to a regular grave accent, so it should be treated like one, i.e., folded to an apostrophe.
 * It's weird that there's no fullwidth acute accent in Unicode.
 * k՝x (Armenian comma): An Armenian character almost always used for Armenian things, and it generally appears at the edge of words (after the words), so it would usually be stripped as an apostrophe, too.
 * lʾx (modifier letter right half ring): On the Nias list, and frequently used in apostrophe-like ways.
 * mˈx (modifier letter vertical line): This is consistently used for IPA transcriptions, and apostrophes don't show up there very often.
 * n＇x (fullwidth apostrophe): Not very common, but does get normalized to a regular apostrophe by ICU normalization and ICU folding, so why fight it?
 * oʹx (modifier letter prime): Consistently used on-wiki as palatalization in Slavic names, but apostrophes are used for that, too.
 * pʿx (modifier letter left half ring): On the Nias list, and frequently used in apostrophe-like ways.
 * q′x (prime): Consistently used for coordinates, but so are apostrophes.
 * rˊx (modifier letter acute accent): Used for bopomofo to mark tone; only occurs in queries from Chinese Wikipedia.
 * sˋx (modifier letter grave accent): Used as an apostrophe in German and Chinese queries.
 * t΄x (Greek tonos): A Greek character almost always used for Greek things.
 * uʼx (modifier letter apostrophe): Not surprising that a apostrophe variant is used as an apostrophe.
 * v׳x (Hebrew punctuation geresh): A Hebrew character almost always used for Hebrew things... however, it is converted to apostrophe by both the Hebrew tokenizer and ICU folding.
 * wʻx (modifier letter turned comma): Often used as an apostrophe.
 * x´x (acute accent): Often used as an apostrophe.
 * y`x (grave accent): Often used as an apostrophe.
 * z‘x (left single quotation mark): Often used as an apostrophe.
 * za’x (right single quotation mark): The curly apostrophe, so of course it's used as an apostrophe.
 * zb'x (apostrophe): The original!
 * zcיx (Hebrew letter yod): A Hebrew character almost always used for Hebrew things. The most examples because it is an actual Hebrew letter. Showed up on the confusabled list, but is never used as an apostrophe. Only examples are encoding issues: Palיorient, Archיologiques → Paléorient, Archéologiques

Apostrophe-Like Characters, The Official List™
The final set of 19 apostrophe-like characters to be normalized is [`´ʹʻʼʽʾʿˋ՚׳‘’‛′‵ꞌ＇｀]—i.e.:


 * ` (U+0060): grave accent
 * ´ (U+00B4): acute accent
 * ʹ (U+02B9): modifier letter prime
 * ʻ (U+02BB): modifier letter turned comma
 * ʼ (U+02BC): modifier letter apostrophe
 * ʽ (U+02BD): modifier letter reversed comma
 * ʾ (U+02BE): modifier letter right half ring
 * ʿ (U+02BF): modifier letter left half ring
 * ˋ (U+02CB): modifier letter grave accent
 * ՚  (U+055A): Armenian apostrophe
 * ׳ (U+05F3): Hebrew punctuation geresh
 * ‘ (U+2018): left single quotation mark
 * ’ (U+2019): right single quotation mark
 * ‛ (U+201B): single high-reversed-9 quotation mark
 * ′ (U+2032): prime
 * ‵ (U+2035): reversed prime
 * ꞌ (U+A78C): Latin small letter saltillo
 * ＇ (U+FF07): fullwidth apostrophe
 * ｀ (U+FF40): fullwidth grave accent

Other Observations

 * Since ICU normalization converts some of the apostrophe-like characters above to ́ (U+301, combining acute accent),  ̓ (U+313, combining comma above), and  ̔ (U+314, combining reversed comma above), I briefly investigated those, too. They are all used as combining accent characters and not as separate apostrophe-like characters. The combining commas above are both used in Greek, which makes sense, since they are on the list because Greek accents are normalized to them.
 * In French examples, I sometimes see 4 where I'd expect an apostrophe, especially in all-caps. Sure enough, looking at the AZERTY keyboard you can see that 4 and the apostrophe share a key!
 * The  in the Hebrew analyzer often generates multiple output tokens for a given input token—this is old news. However, looking at some detailed examples, I noticed that sometimes the multiple tokens (or some subset of the multiple tokens) are the same! Indexing two copies of a token on top of each other doesn't seem helpful—and it might skew token counts for relevance.

The filter for Nias that normalized some of the relevant characters was called. Since the new filter is a generalization of that, it is also called. There's no conflict because with the new generic, as there's no longer a need for a Nias-specific filter, or any Nias-specific config at all.

I tested the new  filter on a combination of ~1K general queries and 1K Wikipedia articles per language (across the 90 harmonization languages). The corpus for each language was run through the analysis config for that particular language. (Languages that already have ICU folding, for example, already fold typical ‘curly’ quotes, so there'd be no change for them, but for other languages there would be.)

I'm not going to give detailed notes on all 90 languages, just note general trends and highlight some interesting examples.


 * In general, there are lots of names and English, French, & Italian words with apostrophes everywhere (O´Reilly, R`n`R, d‘Europe, dell’arte).


 * There are also plenty of native apostrophe-like characters in some languages; the typical right curly apostrophe (’) is by far the most common. (e.g., ইক'নমিক vs ইক’নমিক, з'ездам vs з’ездам, Bro-C'hall vs Bro-C’hall)


 * Plenty of coordinates with primes (e.g., 09′15) across many languages—though coordinates with apostrophes are all over, too.


 * Half-rings (ʿʾ) are most common in Islamic names.


 * Encoding errors (e.g., Р’ Р±РѕР№ РёРґСѓС‚ РѕРґРЅРё В«СЃС‚Р°СЂРёРєРёВ» instead of В бой идут одни «старики») sometimes have apostrophe-like characters in them. Converting them to apostrophes doesn't help.. it's just kinda funny.


 * Uzbek searchers really like to mix it up with their apostrophe-like options. The apostrophe form o'sha will now match o`sha, oʻsha, o‘sha, o’sha, o`sha, oʻsha, o‘sha, and o’sha—all of which exist in my samples!

I don't always love how the apostrophes are treated (e.g.,  in English is too aggressive), but for now it's good that all versions of a word with different apostrophe-like characters in it are at least treated the same.

There may be a few instances where the changes decrease the number of results a query gets, but it is usually an increase in precision. For example, l‍´autre would no longer match autre because the tokenizer isn't splitting on ´. However, it will match l'autre. Having to choose between them isn't great—I'm really leaning toward enabling French elision processing everywhere—but in a book or movie title, an exact match is definitely better. (And having to randomly match l to make the autre match is also arguably worse.)

(T219108)
is enabled on English- and Italian-language wikis, so—assuming it does a good job and in the name of harmonization—we should look at enabling it everywhere. In particular, it splits up CamelCase words, which is generally seen as a positive thing, and was the original issue in the Phab ticket.

The  filter is a  token filter, but the   docs say it should be deprecated in favor of. I made the change and ran a test on 1K English Wikipedia articles, and there were no changes in the analysis output, so I switched to the  version before making any further changes.

Also,, as a   token filter, needs to be the first token filter if possible. (We already knew it needed to come before homoglyph_norm.) If any other filter makes any changes,  can lose the ability to track offsets into the original text. Being able to track those changes gives better (sub-word) highlighting, and probably better ranking and phrase matching.

So Many Options, and Some Zombie Code
The  filter has a lot of options! Options enabling,  , and   are commented out in our code, saying they are potentially useful, but they cause indexing errors. The docs say they cause problems for  queries. The docs seem to say you can fix the indexing problem with the  filter, but still warn against using them with   queries, so I think we're just gonna ignore them (and remove the commented out lines from our code).

Apostrophes, English Possessives, and Italian Elision
In the English analysis chain, the  stemmer currently comes after , so it does nothing, since   splits on apostrophes. However,  has a   setting, which is sensibly off by default, but we can turn that on, just for English, which results in a nearly 90% reduction in s tokens.

After too much time looking at apostrophes (for Turkish and in general), always splitting on apostrophes seems like a bad idea to me. We can disable it in  by recategorizing apostrophes as "letters", which is nice, but that also disables removing English possessive –'s ... so we can put   back... what a rollercoaster!

In the Italian analysis chain,  comes before , and has been set to be case sensitive. That's kind of weird, but I've never dug into it before—though I did just blindly reimplement it as-is when I refactored the Italian analysis chain. All of our other elision filters are case insensitive and the Elastic monolithic analyzer reimplementation/unpacking specifies case insensitivity for Italian, too. I think it was an error a long time ago because the default value is case sensitive, and I'm guessing someone just didn't specify it explicitly, and unintentionally got the case sensitive version.

Anyway,  splits up all of the leftover capitalized elision candidates, which makes the content part more searchable, but with a lot of extra bits. The  filter removes some of them, but not all. Making  case insensitive seems like the right thing to do, and as I mentioned above, splitting on apostrophes seems bad in general.

Apostrophe Hell, Part XLVII
Now may not be the time to add extra complexity, but I can't help but note that d'– and l'– are overwhelmingly French or Italian, and dell'— is overwhelmingly Italian. Similarly, –'s, –'ve, –'re, and –'ll are overwhelmingly English. Some of the others addressed in Turkish are also predominantly in one language (j'n'–, j't'–, j'–, all'–, nell'–, qu'–, un'–, sull'–, dall'–)... though J's and Nell's exist, just to keep things complicated.

All that said, a simple global French/Italian elision filter for d'– and l'– and English possessive filter for –'s would probably improve recall almost everywhere.

CamelCase (via )
Splitting CamelCase seems like a good idea in general (it was the original issue in what became the  Phab ticket). In the samples I have, actual splits seem to be largely Latin script, with plenty of Cyrillic, and some Armenian, too.

Splitting CamelCase isn't great for Irish, because inflected capitalized words like bhFáinní get split into bh + Fáinní. Normally the stemmer would remove the bh, so the end result isn't terrible, but all those bh– 's are like having all the English possessive –'s in the index. However, we already have some hyphenation cleanup to remove stray h, n, and t, so adding bh (and b, g, and m, which are similar CamelCased inflection bits) to that mini stop word list works, and the plain index can pick still up instances like B.B. King.

Irish also probably has more McNames than other wikis, but they are everywhere. Proximity and the plain index will boost those reasonably well.

Splitting CamelCase often splits non-homoglyph multi-script tokens, like OpenМировой—some of which may be parsing errors in my data, but any of which could be real or even typos on-wiki. Anyway, splitting them seems generally good, and prevents spurious homoglyph corrections.

Splitting CamelCase is not great for iPad, LaTeX, chemical formulas, hex values, saRcAStiC sPonGEboB, and random strings of ASCII characters (as in URLs, sometimes), but proximity and the plain index take care of them, and we take a minor precision hit (mitigated by ranking) for a bigger, better recall increase.

Splitting CamelCase is good.

Others Things That Are Aggressively Split
definitely lives up to its name. Running it on non-English, non-Italian samples showed just how aggressive it is.

The Good
 * Splits web domains on periods, so en.wikipedia.org → en + wikipedia + org
 * Splits on colons

The Bad (or at least not Good) The Ugly
 * Splitting between letters and numbers is okay sometimes, but often bad, e.g. j2se → j + 2 + se
 * Splitting on periods in IPA is not terrible, since people probably don't search it much; ˈsɪl.ə.bəl vs ˈsɪləbəl already don't match anyway.
 * Splitting on periods and commas in numbers is.. unclear. Splitting on the decimal divider isn't terrible, but breaking up longer numbers into ones, thousands, millions, etc. sections is not good.
 * On the other hand, having some systems use periods for decimals and commas for dividing larger numbers (3,141,592.653) and some doing it the other way around (3.141.592,653), and the Indian system (31,41,592.653)—plus the fact that the ones, thousands, millions, etc. sections are sometimes also called periods—makes it all an unrecoverable mess anyway.
 * Splitting acronyms, so N.A.S.A. → N + A + S + A —Nooooooooooo!
 * (Spoiler: there's a fix coming!)
 * Splitting on soft hyphens is terrible—an invisible character with no semantic meaning can un pre dictably and ar bi trar i ly break up a word? Un ac cept able!
 * Splitting on other invisibles, like various joiners and non-joiners and bidi marks, seems pretty terrible in other languages, especially in Indic scripts.

Conclusion Summary So Far
splits on all the things  splits on, so early on I was thinking I could get rid of   (and repurpose the ticket for just dealing with acronyms), but   splits too many things, including invisibles, which ICU normalization handles much more nicely.

I could configure away all of 's bad behavior, but given the overlap between ,  , and the tokenizers, it looks to be easiest to reimplement the CamelCase splitting, which is the only good thing   does that   doesn't do or can't do.

So, the plan is...


 * Disable  for English and Italian (but leave it for   and , used by the ShortTextIndexField, because I'm not aware of all the details of what's going on over there).
 * Create and enable a CamelCase filter to pick up the one good thing that  does that   can't do.
 * Enable  and the CamelCase filter everywhere.
 * Create an acronym filter to undo the bad things —and  !—do to acronyms.
 * Fix  to be case insensitive.

At this point, disabling  and enabling a new CamelCase filter on English and Italian are linked to prevent a regression, but the CamelCase filter doesn't depend on   or the acronym filter.

Enabling  and the new acronym filter should be linked, though, to prevent   from doing bad things to acronyms. (Example Bad Things: searching for N.A.S.A. on English Wikipedia does bring up NASA as the first result, but the next few are N/A, S/n, N.W.A, Emerald Point N.A.S., A.N.T. Farm, and M.A.N.T.I.S. Searching for M.A.N.T.I.S. brings up Operation: H.O.T.S.T.U.F.F./Operation: M.I.S.S.I.O.N., B.A.T.M.A.N., and lots of articles with "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z" navigation in them, among others.)

I had linked  and   in my head, because they both split up acronyms, but since the plan is to not enable   in any text filters, we don't need the acronym fix to accompany it.

But Wait, There's More: CamelCase Encore
So, I created a simple CamelCase  filter,. After my experience with Thai, I was worried about regex lookaheads breaking offset tracking. ( I now wonder if in the Thai case it's because I merged three  filters into one for efficiency. Nope, they're evil.)

However, the Elastic docs provide a simple but very general CamelCase char_filter:

My original formulation was pretty similar, except I used  and , and no lookahead, instead capturing the uppercase letter. But I tested their method, and it works fine in terms of offset mapping. (Apparently, I was wildly mistaken, and lookaheads probably are evil, as I feared.)

However, there are rare cases[†] where CamelCase chunks end in combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). Fortunately  and   cover pretty much the right things. I tried adding  to the lookbehind, but it was really, really sloooooooow. However, allowing 0–9 combining marks or invisibles seems like overkill when you spell it out like that, and there was no noticeable speed difference using  instead of   on my machine. Adding the possessive quantifier (overloaded —why do they do that?) to the range should only make it faster. My final pattern, with lookbehind, lookahead, and optional possessive combining marks and invisibles:

(Overly observant readers will note a formatting difference. The Elastic example is a JSON snippet, mine is in PHP snippet. I left it because it amuses me, and everything should be clear from context.)

[†] As noted above in the Not A Conclusion, I had originally linked the CamelCase filter with  and the acronym filter. The combining diacritics and common invisibles are much more relevant to acronym processing—which I've already worked on as I'm writing this, and which made me go back and look for CamelCase cases—of which there are a few.

In Conclusion—No, Really, I Mean It!
So, the plan for this chunk of harmonization is:


 * Disable  for English and Italian.
 * Create and enable.
 * Fix  to be case insensitive.

And we can worry about enabling  and handling acronyms in the next chunk of harmonization.

Appendix: CamelCase Observations
 (Technically this doesn't violate the terms of the conclusion being conclusive, since it's just some extra observations about the data, for funsies.) 

The kinds of things that show up in data focused on examples of CamelCase—some we can help with (✓), some we cannot (✗):


 * ✓ Cut and PasteCut and Paste / Double_NamebotDouble_Namebot
 * ✗ mY cAPSLCOCK kEY iS bROKEN
 * ✓ mySpaceBarIsBrokenButMyShiftKeyIsFine
 * ✓ lArt dElision sans lApstrophe
 * ✗ ÐºÐ¾Ð´Ð¸ÑÐ¾Ð²Ð°Ð½Ð¸Ðµ is hard.. oops, I mean, РєРѕРґРёСЂРѕРІР°РЅРёРµ is hard
 * ✗ "Wiki Loves Chemistry"?: 2CH2COOH + NaCO2 → 2CH2COONa + H2O + CO2
 * ✗ WéIRD UPPèRCäSE FUñCTIôNS / wÉird lowÈrcÄse fuÑctiÔns
 * ✓ Названиебот,ПробелПоедание (Namebot,SpaceEating)
 * ✓ Lots of English examples in languages without an upper/lowercase distinction

I think the CamelCase fix is going to be very helpful for people who double-paste something (if it starts with uppercase and ends with lowercase, like Mr RogersMr Rogers). On the one hand, it's probably a rare mistake for any given person, but on the other, it still happens many times per day.

and Acronyms (T170625)
We—especially David and I—have been talking about "fixing acronyms" for years. On all wikis, NASA and N.A.S.A. do not match. And while they are not technically acronyms, the same problem arises for initials in names, such as J.R.R. Tolkien and JRR Tolkien; those ought to match! (I'd like to get J. R. R. Tolkien (with spaces) in on the game, too, but that's a different and more difficult issue.)

Long before either David or I were on the search team, English and Italian were configured to use  in the text field. Generally this is a good thing, because it breaks up things like en.wikipedia.org and word_break_helper into searchable pieces. However, it also breaks up acronyms like N.A.S.A. into single letters. This is especially egregious for NASA on English-language wikis, where a is a stop word (and thus not strictly required)—lots of one-letter words are stop words in various languages, so it's not just an English problem.

Anyway... there are three goals for this task:
 * Merge acronyms into words (so NASA and N.A.S.A. match).
 * Apply  everywhere (once acronyms are mostly safe)
 * Extend  to any other necessary characters, particularly colon

Merging Acronyms
I originally thought I would have to create a new plugin with a new filter to handle acronyms. Certainly the basic pattern of letter-period-letter-period... would be easy to match. However, I realized we could probably get away with a regular expression in a character filter, which would also avoid some potential tokenization problems that might prevent some acronyms from being single tokens.

We can't just delete periods between letters, since that would convert en.wikipedia.org to enwikipediaorg. Rather we want a period between two single letters. Probably. Certainly, that does the right thing for N.A.S.A. (converts to NASA.) and en.wikipedia.org (nothing happens).

However... and there is always a however... as noted above in the camelCase discussion, sometimes our acronyms can have combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). A simple example would be something like T.É.T.S or þ.á.m. or İ.T.Ü.—except that in those cases, Latin characters with diacritics are normalized into single code points.

Indic languages written with abugidas are a good example where more complex units than single letters can be used in acronyms or initials. We'll come back to that in more detail later.

So, what we need are single (letter-based) graphemes separated by periods. Well, and the full-width period (．), obviously... and maybe... sigh.

I checked the Unicode confusables list for period and got a lot of candidates, including Arabic-Indic digit zero (٠), extended Arabic-Indic digit zero (۰), Syriac supralinear full stop (܁), musical symbol combining augmentation dot (𝅭), Syriac sublinear full stop (܂), one-dot leader (․), Kharoshthi punctuation dot (𐩐), Lisu letter tone mya ti (ꓸ), and middle dot (·). Vai full stop (꘎) was also on the list, but that does not look like something someone would accidentally use as a period. Oddly, full-width period is not on the confusables list.

Given an infinite number of monkeys typing on an infinite number of typewriters  a large enough number of WikiGnomes cutting-and-pasting, you will find examples of anything and everything, but the only characters I found regularly being used as periods in acronym-like contexts were actually full-width periods across languages, and one-dot leaders in Armenian. (Middle dot also gets used more than the others, but not a whole lot, and in both period-like and comma-like ways, so I didn't feel comfortable using it as an acronym separator.)

So, we want single graphemes—consisting of a letter, zero or more combining characters or invisibles—separated by periods or full-width periods (or one-dot leaders in the case of Armenian). A "single grapheme" is one that is not immediately preceded or followed by another letter-based grapheme (which may also be several Unicode code points). We also have to take into account the fact that an acronym could be the first or last token in a string being processed, and we have to explicitly account for "not immediately preceded or followed by" to include the case when there is nothing there at all—at the beginning or end of the string.

For Armenian, it turns out that one-dot leader is used pretty much anywhere periods are, though only about 10% as often, so I added a filter to convert one-dot leaders to periods for Armenian.

My original somewhat ridiculous regex started off with a look-behind for 1.) a start of string (^) or non-letter (\P{L}), followed by 2.) a letter-based grapheme—a letter (\p{L}), followed by optional combining marks (\p{M}) or invisibles (\p{Cf}) —then 3.) the period or full-width period ([.．]) , followed by 4.) optional invisibles , then a capture group with 5.) another letter-based grapheme ; and a look-ahead for 6.) a non-letter or end of string ($).

Some notes:
 * In all its hideous, color-coded glory:
 * (1) and (2) in the look-behind aren't part of the matching string, (3) is the period we are trying to drop, (4) is invisible characters we drop anyway, (5) is the following letter, which we want to hold on to, and (6) is in the look-ahead, and not part of the matching string. In the middle of a simple acronym, (1) is the previous period and (2) is the previous letter, and (6) is the next period.
 * For reasons of efficiency, possessive matching is used for the combining marks and invisibles, and combining marks and invisibles are limited to no more than 9 in the look-behind. (I have seen 14 Khmer diacritics stacked on top of each other, but that kind of thing is pretty rare.)
 * The very simple look-ahead does not mess up the token's character offsets—phew!
 * And finally—this doesn't work for certain cases that are relatively common not unheard of in Brahmic scripts!—though they are hard to find in Latin texts.
 * Ugh.

First, an example using Latin characters. We want e.f.g. to be treated as an acronym and converted to efg. We don't want ef.g to be affected. As mentioned above, we want to handle diacritics, such as é.f.g. and éf.g, which are not actually a problem because é is a single code point. However, something like e̪ is not. It can only be represented as e +  ̪. Within an acronym, we've got that covered, and d.e̪.f. is converted to de̪f. just fine. But  ̪ is technically "not a letter" so the period in e̪f.g would get deleted, because f is preceded by "not a letter" and thus appears to be a single letter/single grapheme.

In some languages using Brahmic scripts (including Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sinhala, Tamil, Telugu, and Thai), letters followed by separate combining diacritics are really common, because it's the most typical way of doing things. Basic consonant letters include an inherent vowel—Devanagari/Hindi स is "sa", for example. To change the vowel, add a diacritic: सा (saa) सि (si) सी (sii) सु (su) सू (suu) से (se) सै (sai) सो (so) सौ (sau).

Acronyms with periods in these languages aren't super common, but when they occur, they tend to / seem to / can use the whole grapheme (e.g., से, not स for a word starting with से). The problem is that the vowel sign (e.g., े) is "not a letter", just like  ̪. So—randomly stringing letters together—सेफ.म would have its period removed, because फ is preceded by "not a letter".

The regex to fix this scenario is a little complicated—we need "not a letter", possibly followed by combining chars (rare, but does happen, as in 9̅) or invisibles (also rare, but they are sneaky and can show up anywhere since you can cut-n-paste them without knowing it). The regex that works—instead of (1) above—is something that is not a letter, not a combining mark, and not an invisible ([^\p{L}\p{M}\p{Cf}])—optionally followed by combining marks or invisibles. That allows us to recognize e̪ or से as a grapheme before another letter.

Some notes:
 * Updated, in all its hideous, color-coded glory:
 * The more complicated regex (and all in the look-behind!) didn't noticeably change the indexing time on my laptop.
 * While Latin cases like e̪f.g are possible, the only language samples affected in my test sets were the ones listed above: Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sinhala, Tamil, Telugu, and Thai. The changes in token counts ranged from 0 to 0.06%, with most below 0.03%—so this is not a huge problem.
 * Hindi, the language with the 0% change in token counts, still had changes. You can change the tokens themselves without changing the number of tokens—they just get split in different places (see e.e.cummings, et al., below)—though not splitting is the more common scenario.
 * In the Kanada sample—the one with the most changes from the regex upgrade—there were some clear examples where the new regex still doesn't work in every case.
 * Ugh.
 * However, these cases seem to be another order of magnitude less common, so I'm going to let them slide for now.

However-however (However2?), I am going to document the scenario that still slips through the cracks in the regex, in case it is a bigger deal than it currently seems. (Comments to that effect from speakers of the languages are welcome!)

As mentioned before, Brahmic scripts have an inherent vowel, so स is "sa". The inherent vowel can be suppressed entirely with a virama. So स् is just "s"—and can be used to make consonant clusters.
 * स (sa) + त (ta) + र (ra) +  ी (ii) = सतरी ("satarii", though the second "a" may get dropped in normal speech.. I'm not sure), and it may or may not be a real word.
 * स (sa) + ् (virama) + त (ta) + ् (virama) + र (ra) + ी (ii) = स्त्री (strii/stree), which means "woman".

So, we have a single grapheme that is letter + virama + letter + virama + letter + combining vowel mark. So, basically, we could allow an extra   or maybe     in several places in our regex—though there are almost 30 distinct virama characters across scripts, and optional characters in the look-behind are complicated.

Plus—just to make life even more interesting!—in Khmer the virama-like role is played by the coeng, and conceptually it seems like it comes before the letter it modifies rather than after... though I guess in a sense both the virama and coeng come between the letters they interact with. (I do recall that for highlighting purposes, you want the virama with the letter before, and the coeng with the letter after. So I guess typographically they break differently.)

Anyway, adjusting the regex for these further cases probably isn't worth it at the moment—though, again, if there are many problem cases, we can look into it. (It might take a proper plugin with a new filter instead of a patter-replace regex filter... though the interaction of such a filter with  would be challenging.)

More notes:
 * Names with connected initials like J.R.R.Tolkien and e.e.cummings are converted to JRR.Tolkien and ee.cummings—which isn't great—until  comes along and breaks them up properly!
 * I don't love that acronyms go through stemming and stop word filtering, but that's what happens to non-acronym versions (now both SARS and S.A.R.S. will be indexed as sar in English, for example)—they do match each other, though, which is the point.
 * If you have an acronymic stop word, like F.O.R., it will get filtered as a stop word. The plain field has to pick up the slack, where it gets broken into individual letters. There's no great solution here.

... (notes on  yet to come)

Add to Hebrew
I got ahead of myself a little and looked into adding  to the Hebrew analysis chain. My analysis analysis tools assumed that there wouldn't be two identical tokens (i.e., identical strings) on top of each other. In the Hebrew analyzer, though, that's possible—common, even! I made a few changes, and the impact of adding  is much bigger and easier to see.

The Hebrew tokenizer assigns every token a type: Hebrew, NonHebrew, or Numeric that I've seen so far.

The Hebrew lemmatizer adds one or more tokens of type Lemma for each Hebrew or NonHebrew token. The problem arises when the lemma output is the same as the token input—which is true for many Hebrew tokens, and true for every nonHebrew token.

I hadn't noticed these before because the majority of tokens from Hebrew-language projects are in Hebrew, and I don't read Hebrew, so I can't trivially notice tokens are the same.

Adding  removes Lemma tokens that are the same as their corresponding Hebrew/NonHebrew token.

For a 10K sample of Hebrew Wikipedia articles, the number of tokens decreased by 19.1%! For 10K Hebrew Wiktionary entries, it was 22.7%!

Things to Do
A list of incidental things to do that I noticed while working on the more focused sub-projects above.

The first list is relatively simple things that should definitely be done.


 * Enable  (almost?) everywhere, and maybe enable Turkish   for languages that distinguish I/ı and İ/i.
 * Unpack Ukrainian even if extra-analysis-ukrainian is not available
 * Add  after   in the Hebrew analysis chain, to remove exact duplicates. (See "Add   to Hebrew" above).
 * Refactor English, Japanese, etc. configs to use AnalysisBuilder

The second list involves somewhat more complicated issues that could use looking at.


 * See if any parts of the Armenian (hy) analysis chain can do useful things for Western Armenian (hyw) wikis.
 * Consider a minor global French/Italian elision filter for d'– and l'– and English possessive filter for –'s (almost?) everywhere.