User:TJones (WMF)/Notes/Language Analyzer Harmonization Notes

May 2023 — See TJones_(WMF)/Notes for other projects. See also T219550. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro, Goals, Caveats
The goal of bringing language analyzers "into harmony" is to have as many of the non–language-specific elements of the analyzers to be the same as possible. Some split words on underscores and periods, some don't. Some split CamelCase words and some don't. Some use ASCII folding, some use ICU folding, and some don't use either. Some preserve the original word and have two ouptuts when folding, and some don't. Some use the ICU tokenizer and some use the standard tokenizer (for no particular reason—there are good reasons to use the ICU, Hebrew, Korean, or Chinese tokenizers in particular cases). When there is no language-specific reason for these differences, it's confusing, and we clearly aren't using analysis best practices everywhere.

My design goal is to have all of the relevant upgrades made by default across all language analysis configurations, with only the exceptions having to be explicitly configured.

Our performance goal is to reduce zero-results rate and/or increase the number of results returned for 75% of relevant queries averaged across all wikis. This goal comes with some caveats, left out of the initial statement to keep it reasonably concise.


 * "All wikis" is, in effect, "all reasonably active wikis"—if a wiki has only had twelve searches last month, none with apostrophes, it's hard to meaningfully measure "75% of the queries with apostrophes" in them. More details in "Data Collection" below.
 * I'm also limiting my samples to Wikipedias because they have the most variety of content and queries, and to limit testing scope, allowing more languages to be included.
 * I'm going to ignore wikis with unchanged configs (some elements are already deployed on some wikis), since they will have approximately 0% change in results (there's always a bit of noise).
 * "Relevant" queries are those that have the feature being worked on. So, I will have a collection of queries with apostrophe-like characters in them to test improved apostrophe handling, and a collection of queries with acronyms to test better acronym processing. I'll still test general query corpora to get a sense of the overall impact, and to look for cases where queries without the feature being worked on still get more matches (for example, searching for NASA should get more matches to N.A.S.A. in articles).
 * I'm also applying my usual filters (used for all the unpacking impact analyses) to queries, mostly to filter out porn and other junk. For example, I don't think it is super important whether the query s`wsdfffffffsf actually gets more results once we normalize the backtick/grave accent to an apostrophe.
 * Smaller/lower-activity wikis may get filtered out for having too few relevant queries for a given feature.
 * We are averaging rates across wikis so that wiki size isn't a factor (and neither is sample rate—so, I can oversample smaller wikis without having to worry about a lot of bookkeeping).

Data Collection
I started by including all Wikipedias with 10,000 or more articles. I also gathered the number of active editors and the number of full-text queries (with the usual anti-bot filters) for March 2023. I dropped those with less than 700 monthly queries and fewer than 50 active editors. My original ideas for thresholds had been ~1000 monthly queries and ~100 active editors, but I didn't want or need a super sharp cut off. Limiting by very low active editor counts meant fewer samples to get at the query-gathering step, which is somewhat time-consuming. Limiting by query count also meant less work at the next step of filtering queries, and all later steps, too.

I ran my usual query filters (as mentioned above), and also dropped wikis with fewer than 700 unique queries after filtering. That left 90 Wikipedias to work with. In order of number of unique filtered monthly queries, they are: English, Spanish, French, German, Russian, Japanese, Chinese, Italian, Portuguese, Polish, Arabic, Dutch, Czech, Korean, Indonesian, Turkish, Persian, Vietnamese, Swedish, Hebrew, Ukrainian, Igbo, Finnish, Hungarian, Romanian, Greek, Norwegian, Catalan, Hindi, Thai, Simple English, Danish, Bangla, Slovak, Bulgarian, Swahili, Croatian, Serbian, Tagalog, Slovenian, Lithuanian, Georgian, Tamil, Malay, Uzbek, Estonian, Albanian, Azerbaijani, Latvian, Armenian, Marathi, Burmese, Malayalam, Afrikaans, Urdu, Basque, Mongolian, Telugu, Sinhala, Kazakh, Macedonian, Khmer, Kannada, Bosnian, Egyptian Arabic, Galician, Cantonese, Icelandic, Gujarati, Central Kurdish, Serbo-Croatian, Nepali, Latin, Kyrgyz, Belarusian, Esperanto, Norwegian Nynorsk, Assamese, Tajik, Punjabi, Oriya, Welsh, Asturian, Belarusian-Taraškievica, Scots, Luxembourgish, Irish, Alemannic, Breton, & Kurdish.


 * Or, in language codes: en, es, fr, de, ru, ja, zh, it, pt, pl, ar, nl, cs, ko, id, tr, fa, vi, sv, he, uk, ig, fi, hu, ro, el, no, ca, hi, th, simple, da, bn, sk, bg, sw, hr, sr, tl, sl, lt, ka, ta, ms, uz, et, sq, az, lv, hy, mr, my, ml, af, ur, eu, mn, te, si, kk, mk, km, kn, bs, arz, gl, zh-yue, is, gu, ckb, sh, ne, la, ky, be, eo, nn, as, tg, pa, or, cy, ast, be-tarask, sco, lb, ga, als, br, ku.

I sampled 1,000 unique filtered queries from each language (except for those that had fewer than 1000).

I also pulled 1,000 articles from each Wikipedia to use for testing.

I used a combined corpus of the ~1K queries and the 1K articles for each language to test analysis changes. This allows me to see interactions between words/characters that occur more in queries and words/characters that occur more in articles.

Relevant Query Corpora
For each task, I plan to pull a corpus of "relevant" queries for each language for before-and-after impact assessment, by grepping for the relevant characters. For each corpus, I'll also do some preprocessing to remove queries that are unchanged by the analysis upgrades being made.

For example, when looking at apostrophe-like characters, ICU folding already converts typical curly quotes (‘’) to straight quotes ('), so for languages with ICU folding enabled, curly quotes won't be treated any differently, so I plan to remove those queries as "irrelevant". Another example is reversed prime (‵), which causes a word break with the standard tokenizer; apostrophes are stripped at the beginning or ending of words, so reversed prime at the edge of a word isn't actually treated differently from an apostrophe in the same place—though the reasons are very different.

For very large corpora (≫1000, for sure), I'll probably sample the corpus down to a more reasonable size after removing "irrelevant" queries.

I'm going to keep (or sample) the "irrelevant" queries (e.g., words with straight apostrophes or typical curly quotes handled by ICU folding) for before-and-after analysis, because they may still get new matches on words in wiki articles that use the less-common characters, though there are often many, many fewer such words on-wiki—because the WikiGnomes are always WikiGnoming!

Another interesting wrinkle is that French and Swedish use ICU folding with "preserve original", so that both the original form and folded form are indexed (e.g., l’apostrophe is indexed as both l’apostrophe and l'apostrophe). This doesn't change matching, but it may affect ranking. I'm going to turn off the "preserve original" filter for the purpose of removing "irrelevant" queries, since we are focused on matching here.

Some Observations
After filtering porn and likely junk queries and uniquifying queries, the percentage of queries remaining generally ranged from 94.52% (Icelandic—so many unique queries!) to 70.58% (Persian), with a median of 87.31% (Simple English), and a generally smooth distribution across that range.

There were three outliers:


 * Swahili (57.51%) and Igbo (37.56%) just had a lot of junk queries.
 * Vietnamese was even lower at 30.03%, with some junk queries but also an amazing number of repeated queries, many of which are quite complex (not like everyone is searching for just famous names or movie titles or something "simple"). A few queries I looked up on Google seem to exactly match titles or excerpts of web pages. I wonder if there is a browser tool or plugin somewhere that is automatically doing wiki searches based on page content.

Re-Sampling & Zero-Results Rate
I found a bug in my filtering process, which did not properly remove certain very long queries that get 0 results, which I classify as "junk". These accounted for less than 1% of any given sample, but it was still weird to have many samples ranging from 990–999 queries instead of the desired 1,000. Since I hadn't used my baseline samples for anything at that point, I decided to re-sample them. This also gave me an opportunity to compare zero-results rates (ZRR) between the old and new samples.

In the case of very small queries corpora, the old and new samples may largely overlap, or even be identical. (For example, if there are only 800 queries to sample from, my sample "of 1000" is going to include all of them, every time I try to take a sample.) Since this ZRR comparison was not the point of the exercise, I'm just going to throw out what I found as I found it, and not worry about any sampling biases—though they obviously include overlapping samples, and potential effects of the original filtering error.

The actual old/new ZRR for these samples ranged from 6.3%/6.2% (Japanese) to 75.4%/76.1% (Igbo—wow!). The zero-results rate differences from the old to the new sample ranged from -4.2% (Gujarati, 64.3% vs 60.1%) to +5.6% (Dutch, 22.1% vs 27.7%), with a median of 0.0% and mean of -0.2%. Proportional rates ranged from -19.9% (Galician, 17.5% vs 14.6%) to +20.2% (Dutch, 22.1% vs 27.7%, again), with a median of 0.0%, and a mean of -0.5%.

Looking at the graph, there are some minor outliers, but nothing ridiculous, which is nice to see.



"Infrastructure"
I've built up some temporary "infrastructure" to support impact analysis of the harmonization changes. Since every or almost every wiki will need to be reindexed to enable harmonization changes, timing the "before and after" query analyses for the 90 sampled wikis would be difficult.

Instead, I've set up a daily process that runs all 90 samples each day. There's an added bonus of seeing the daily variation in results without any changes.

I will also pull relevant sub-samples for each of the features (apostrophes, acronyms, word_break_helper, etc.) being worked on and run them daily as well.

There's a rather small chance of having a reindexing finish while a sample is being run, so that half the sample is "before" and half is "after". If that happens, I can change my monitoring cadence to every other day for that sample for comparison's sake and it should be ok.

Apostrophes (T315118)
There are some pretty common apostrophe variations that we see all the time, particularly the straight vs curly apostrophes—e.g., ain't vs ain’t. And of course people (or their software) will sometimes curl the apostrophe the wrong way—e.g., ain‘t. But lots of other characters regularly (and some irregularly) get used as apostrophes, or apostrophes get used for them—e.g., Hawai'i or Hawai’i or Hawai‘i when the correct Hawaiian letter is the okina: Hawaiʻi.

A while back, we worked on a ticket (T311654) for the Nias Wikipedia to normalize some common apostrophe-like variants, and at the time I noted that we should generalize that across languages and wikis as much as possible. ICU normalization and ICU folding already do some of this (see the table below)—especially for the usual ‘curly’ apostrophes/single quotes, but those cases are common enough that we should take care of them even when the ICU plugin is not available. It'd also be nice if the treatment of these characters was more consistent across languages, and not dependent on the specific tokenizer and filters configured for a language.

There are many candidate "apostrophe-like" characters. The list below is distillation of the list of Unicode Confusables for apostrophe, characters I had already known were potential candidates from various Phab tickets and my own analysis experience (especially working on Turkish apostrophes), and the results of data-mining for apostrophe-like contexts (e.g., Hawai_i).

Key


 * x'x—It's hard to visually distinguish all the vaguely apostrophe-like characters on-screen, so after ordering them, I put a letter (or two) before them and an x after them. The letter before makes it easier to see where each one is/was when looking at the analysis output, and the x after doesn't seem to be modified by any of the analyzers I'm working with. And x'x is an easy shorthand to refer to a character without having to specify its full name.
 * Also, apostrophe-like characters sometimes get treated differently at the margins of a word. (Schrodinger's apostrophe: inside a word it's an apostrophe, at the margins, it's a single quote.) Putting it between two alpha characters gives it the most apostrophe-like context.
 * Desc.—The Unicode description of the character
 * #q—The number of occurrences of this character (in any usage) in my 90-language full query sample. Samples can be heavily skewed: Hebrew letter yod occurs a lot in Hebrew queries—shocker! Big wiki samples are larger, so English is over-represented. Primary default sort key.
 * #wiki samp—The number of occurrences of this character in my 90-language 1K Wikipedia sample. Samples can be skewed by language (as with Hebrew yod above), but less so by sample size. All samples are 1K articles, but some wikis have longer average articles. Secondary default sort key.
 * UTF—UTF codepoint for the character. Tertiary default sort key.
 * Example—An actual example of the character being used in an apostrophe-like way. Most come from English Wikipedia article or query samples. Others I had to look harder to find—in other samples, or using on-wiki search.
 * Just because a word or a few words exist with the character used in an apostrophe-like way doesn't mean it should be treated as an apostrophe. When looking for words matching the Hawai_i pattern, I found Hawai*i, Hawai,i, and Hawai«i, too. I don't think anyone would suggest that asterisks, commas, or guille­mets should be treated as apostrophes.
 * I never found a real example of Hebrew yod being used as an apostrophe. I only found two instances of it embedded in a Latin-script word (e.g. Archיologiques), and there it looked like an encoding error, since it has clearly replaced é. I fixed both of those (through my volunteer account).
 * I really did find an example of apostrophe's using a real apostrophe!
 * std tok (is)—What does the standard tokenizer (exemplified by the is/Icelandic analyzer) do to this character?
 * icu tok (my)—What does the ICU tokenizer (exemplified by the my/Myanmar analyzer) do to this character?
 * heb tok (he)—What does the HebMorph tokenizer (exemplified by the he/Hebrew analyzer) do to this character?
 * nori tok (ko)—What does the Nori tokenizer (exemplified by the ko/Korean analyzer) do to this character?
 * smart cn (zh)—What does the SmartCN tokenizer (exemplified by the zh/Chinese analyzer) do to this character?
 * icu norm (de)—What does the ICU normalizer filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
 * icu fold (de)—What does the ICU folding filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
 * icu norm (wsp)—What does the ICU normalizer filter do to this character, after going through a whitespace tokenizer? (The whitespace tokenizer just splits on spaces, tabs, newlines, etc. There's no language for this, so it was a custom config.)
 * icu norm + fold (wsp)—What does the ICU normalizer filter + the ICU folding filter do to this character, after going through a whitespace tokenizer? (We never enable the ICU folding filter without enabling ICU normalization first—so this is a more "typical" config.)
 * icu fold (wsp)—What does the ICU folding filter do to this character, after going through a whitespace tokenizer, without ICU normalization first?
 * Tokenizer and Normalization Sub-Key
 * split means the tokenizer splits on this character—at least in the context of being between Latin characters. Specifically non-Latin characters get split by the ICU tokenizer between Latin characters in general because it always splits on script changes. (General punctuation doesn't belong to a specific script.) So, the standard tokenizer splits a‵x to a and x.
 * split/keep means the tokenizer splits before and after the character, but keeps the character. So, the ICU tokenizer splits dߴx to d, ߴ, and x.
 * → ? means the tokenizer or filter converts the character to another character. So, the HebMorph tokenizer tokenizer c‛x as c'x (with an apostrophe).
 * The most common conversion is to an apostrophe. The SmartCN tokenzier converts most punctuation to a comma. The ICU normalizer converts some characters to space plus another character (I don't get the reasoning, so I wonder if this might be a bug); I've put those in square brackets, though the space doesn't really show up, and put a mini-description in parens, e.g. "(sp + U+301)". Fullwidth grave accent gets normalized to a regular grave accent by ICU normalization.
 * split/keep → ,—which is common in the SmartCN tokenizer column—means that text is split before and after the character, the character is not deleted, but it is converted to a comma. So, the SmartCN tokenizer tokenizes a‵x as a + , + x.
 * delete means the tokenizer or filter deletes the character. So, ICU folding converts dߴx to dx.
 * Nias—For reference, these are the characters normalized specifically for nia/Nias in Phab ticket T311654.
 * apos-like—After reviewing the query and Wikipedia samples, this character does seem to commonly be used in apostrophe-like ways. (In cases of the rarer characters, like bꞌx, I had to go looking on-wiki for examples.)
 * + means it is, – means it isn't, == means this is the row for the actual apostrophe!
 * transitive—This character is not regularly used in an apostrophe-like way, but it is normalized by a tokenizer or filter into a character that is regularly used in an apostrophe-like way.
 * apos is x-like?—While the character is not used in apostrophe-like way (i.e., doesn't appear in Hawai_i, can_t, don_t, won_t, etc.), apostrophes are used where this character should be.
 * + means it is, – means it isn't, blank means I didn't check (because it was already apostrophe-like or transitively apostrophe-like).
 * final fold—Should this character get folded to an apostrophe by default? If it is apostrophe-like, transitively apostrophe-like, or apostrophes get used where it gets used—i.e., a + in any of the three rpevious columns—then the answer is yes (+).

Character-by-Character Notes

 * a‵x (reversed prime): This character is very rarely used anywhere, but it is normalized to apostrophe by ICU folding
 * bꞌx (Latin small letter saltillo): This is used in some alphabets to represent a glottal stop, and apostrophes are often used to represent a glottal stop, so they are mixed up. In the English Wikipedia article for Mi'kmaq (apostrophe in the title), miꞌkmaq (with saltillo) is used 144 times, while mi'kmaq (with apostrophe) is used 78 times—on the same page!
 * c‛x (single high-reversed-9 quotation mark): used as a reverse quote and an apostrophe.
 * dߴx (N'ko high tone apostrophe): This seems to be an N'ko character almost always used for N'ko things. It's uncommon off the nqo/N'ko Wikipedia, and on the nqo/N'ko Wikipedia the characters do not seem to be not interchangeable.
 * e῾x (Greek dasia): A Greek character almost always used for Greek things.
 * fʽx (modifier letter reversed comma): Commonly used in apostrophe-like ways.
 * g᾿x (Greek psili): A Greek character almost always used for Greek things.
 * h᾽x (Greek koronis): A Greek character almost always used for Greek things.
 * i՚x (Armenian apostrophe): An Armenian character almost always used for Armenian things, esp. in Western Armenian—however, the non-Armenian apostophe is often used for the Armenian apostrophe.
 * j｀x (fullwidth grave accent): This is actually pretty rare. It is mostly used in kaomoji, like (*´ω｀*), and for quotes. But it often gets normalized to a regular grave accent, so it should be treated like one, i.e., folded to an apostrophe.
 * It's weird that there's no fullwidth acute accent in Unicode.
 * k՝x (Armenian comma): An Armenian character almost always used for Armenian things, and it generally appears at the edge of words (after the words), so it would usually be stripped as an apostrophe, too.
 * lʾx (modifier letter right half ring): On the Nias list, and frequently used in apostrophe-like ways.
 * mˈx (modifier letter vertical line): This is consistently used for IPA transcriptions, and apostrophes don't show up there very often.
 * n＇x (fullwidth apostrophe): Not very common, but does get normalized to a regular apostrophe by ICU normalization and ICU folding, so why fight it?
 * oʹx (modifier letter prime): Consistently used on-wiki as palatalization in Slavic names, but apostrophes are used for that, too.
 * pʿx (modifier letter left half ring): On the Nias list, and frequently used in apostrophe-like ways.
 * q′x (prime): Consistently used for coordinates, but so are apostrophes.
 * rˊx (modifier letter acute accent): Used for bopomofo to mark tone; only occurs in queries from Chinese Wikipedia.
 * sˋx (modifier letter grave accent): Used as an apostrophe in German and Chinese queries.
 * t΄x (Greek tonos): A Greek character almost always used for Greek things.
 * uʼx (modifier letter apostrophe): Not surprising that a apostrophe variant is used as an apostrophe.
 * v׳x (Hebrew punctuation geresh): A Hebrew character almost always used for Hebrew things... however, it is converted to apostrophe by both the Hebrew tokenizer and ICU folding.
 * wʻx (modifier letter turned comma): Often used as an apostrophe.
 * x´x (acute accent): Often used as an apostrophe.
 * y`x (grave accent): Often used as an apostrophe.
 * z‘x (left single quotation mark): Often used as an apostrophe.
 * za’x (right single quotation mark): The curly apostrophe, so of course it's used as an apostrophe.
 * zb'x (apostrophe): The original!
 * zcיx (Hebrew letter yod): A Hebrew character almost always used for Hebrew things. The most examples because it is an actual Hebrew letter. Showed up on the confusabled list, but is never used as an apostrophe. Only examples are encoding issues: Palיorient, Archיologiques → Paléorient, Archéologiques

Apostrophe-Like Characters, The Official List™
The final set of 19 apostrophe-like characters to be normalized is [`´ʹʻʼʽʾʿˋ՚׳‘’‛′‵ꞌ＇｀]—i.e.:


 * ` (U+0060): grave accent
 * ´ (U+00B4): acute accent
 * ʹ (U+02B9): modifier letter prime
 * ʻ (U+02BB): modifier letter turned comma
 * ʼ (U+02BC): modifier letter apostrophe
 * ʽ (U+02BD): modifier letter reversed comma
 * ʾ (U+02BE): modifier letter right half ring
 * ʿ (U+02BF): modifier letter left half ring
 * ˋ (U+02CB): modifier letter grave accent
 * ՚  (U+055A): Armenian apostrophe
 * ׳ (U+05F3): Hebrew punctuation geresh
 * ‘ (U+2018): left single quotation mark
 * ’ (U+2019): right single quotation mark
 * ‛ (U+201B): single high-reversed-9 quotation mark
 * ′ (U+2032): prime
 * ‵ (U+2035): reversed prime
 * ꞌ (U+A78C): Latin small letter saltillo
 * ＇ (U+FF07): fullwidth apostrophe
 * ｀ (U+FF40): fullwidth grave accent

Other Observations

 * Since ICU normalization converts some of the apostrophe-like characters above to ́ (U+301, combining acute accent),  ̓ (U+313, combining comma above), and  ̔ (U+314, combining reversed comma above), I briefly investigated those, too. They are all used as combining accent characters and not as separate apostrophe-like characters. The combining commas above are both used in Greek, which makes sense, since they are on the list because Greek accents are normalized to them.
 * In French examples, I sometimes see 4 where I'd expect an apostrophe, especially in all-caps. Sure enough, looking at the AZERTY keyboard you can see that 4 and the apostrophe share a key!
 * The  in the Hebrew analyzer often generates multiple output tokens for a given input token—this is old news. However, looking at some detailed examples, I noticed that sometimes the multiple tokens (or some subset of the multiple tokens) are the same! Indexing two copies of a token on top of each other doesn't seem helpful—and it might skew token counts for relevance.

and Acronyms (T170625)
...

Things to Do
A list of incidental things to do that I noticed while working on the more focused sub-projects above.

The first list is relatively simple things that should definitely be done.


 * Add  after   in the Hebrew analysis chain, to remove exact duplicates.

The second list involves somewhat more complicated issues that could use looking at.


 * See if any parts of the Armenian (hy) analysis chain can do useful things for Western Armenian (hyw) wikis.