User:TJones (WMF)/Notes/Unpacking Notes

See TJones_(WMF)/Notes for other projects. See also T272606. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Why We Are Here
The purpose of unpacking analyzers is to enable them to be customized and upgraded with improvements that can be either language-specific (e.g., custom ICU folding) or generic (e.g., ICU normalization or homoglyph processing).
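As a rough illustration of what "unpacking" means, the sketch below contrasts a monolithic Spanish analyzer with an unpacked equivalent, following the "rebuilt" Spanish example in the Elasticsearch language analyzer documentation (minus the optional keyword_marker filter). The analyzer name "text" and the helper with_extra_filter are assumptions made for this example, not the names used in the actual configuration.

```python
import copy

# Monolithic form: a single opaque analyzer type with no filter chain to edit.
MONOLITHIC = {"analyzer": {"text": {"type": "spanish"}}}

# Unpacked form: the same behavior spelled out as tokenizer + token filters.
# The analyzer name "text" is an assumption for illustration.
UNPACKED = {
    "filter": {
        "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
        "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
    },
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "spanish_stop", "spanish_stemmer"],
        }
    },
}


def with_extra_filter(config, analyzer_name, filter_name):
    """Return a copy of config with filter_name appended to the analyzer's chain."""
    updated = copy.deepcopy(config)
    updated["analyzer"][analyzer_name]["filter"].append(filter_name)
    return updated


# Only the unpacked form has a chain that can be extended, e.g., with ICU folding.
upgraded = with_extra_filter(UNPACKED, "text", "icu_folding")
print(upgraded["analyzer"]["text"]["filter"])
```

The monolithic dictionary has no "filter" list at all, which is why upgrades cannot be slotted into it.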

Note: One of our higher-level objectives has been to increase support for languages in emerging regions, so lately I have been prioritizing languages on that list. That list currently includes: Afrikaans, Arabic, Bengali, Cantonese, French, Hindi, Indonesian, Korean, Malayalam, Malaysian (Malay), Portuguese, Spanish, Tagalog, Telugu, Thai, and Ukrainian. Languages in bold still need to be unpacked.* Languages in italics have already been unpacked. Languages in neither don't have monolithic analyzers and don't need to be unpacked.

[*] I'm currently working on Thai, and Arabic is done but waiting to be deployed with Thai. Ukrainian is a special case as a third-party analyzer, which I will look into separately after Arabic and Thai.

The Unpacking Process
Gather Data → Run Baselines → Unpack Analyzers → Re-enable Analyzer Upgrades → Repair Unpacked & Upgraded Analyzers → Enable ICU Folding → Compare Final Analyzer to Baseline → Merge Your Patch → Prep Query Data → Reindexing and Before-And-After Analysis
 * Gather 10K articles (without repeats) each from Wikipedia and Wiktionary for each language (custom Perl script, )
 * Manual review/editing: remove leading white space, dedupe lines, review potential HTML tags ( search for  )
 * Gather 10K/4weeks query data from Wikipedia for each language (Jupyter notebook,  on  )
 * Per language (I've been working on three at a time recently to somewhat streamline the process):
 * set language to target and reindex
 * run  as baseline for wiki and wikt 10K samples
 * Disable homoglyphs and icu_norm upgrades at these locations in
 * in
 * in
 * Per language:
 * unpack analyzer in  /
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
 * ignore
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * should be zero diffs in count files; otherwise unpacking is not correct
 * Re-enable homoglyphs and icu_norm upgrades
 * Per language:
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for  /  for wiki and wikt;   comparison for wiki and wikt
 * solo—just trying to get the lay of the land
 * look at potential problem stems
 * look at largest Type Group Counts
 * anything around 20+ is interesting; well over 20 is surprising (but not necessarily wrong)
 * look at Tokens Generated per Input Token; usually expect 1 in baseline; some 2s with homoglyphs
 * look at Final Type Lengths; 1s are often CJK, longest are often URLs, German, spaceless languages, or  encoded
 * comparison—see what changed
 * expect dotted-I regression
 * lots of hidden characters removed (soft hyphens, bidi marks, joiners and non-joiners)
 * Super- and subscript characters get converted, ß to ss, too
 * Regularization of non-Latin characters is common, particularly, Greek ς to σ
 * investigate anything that doesn’t make sense
 * Per language:
 * Make any needed “repairs” to accommodate ICU normalization
 * possibly just
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for repaired for wiki and wikt;   comparison for wiki and wikt
 * solo—just trying to get the lay of the land
 * comparison—look for expected changes (maybe just dotted-I)
 * Per language:
 * enable ICU Folding
 * add language code to, and any folding exceptions to
 * add  to   list, usually in last place
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for repaired for wiki and wikt;   comparison for wiki and wikt
 * solo—potential problem stems can show systematic changes, even if they aren’t really problems
 * elision (l’elision, d’elision, qu’elision, s’etc.) can throw this off
 * comparison—look for expected changes (rare characters and variants folded, diacritics folded, etc.)
 * Per language:
 * run  —   comparison for wiki and wikt
 * comparison—look at the overall impact of unpacking, upgrades, and ICU folding
 * Token delta: expect small numbers (<100) unless something “interesting” happened
 * New Collision Stats gives a sense of the overall impact, # of tokens that merge into other groups.
 * Typically < 3% on each number, with higher values in Wiktionary
 * Possibly a few Lost pre-analysis tokens
 * Net Gains: expect plenty of changes; high-impact changes are usually—
 * one- or two- letter tokens (e.g., a picks up á, à, ă, â, å, ä, ã, ā, ə, ɚ)
 * something with a lot of variants that includes a folded character (e.g., abc, abcs, l'abc, l'abcs, d'abc, d'abcs, qu'abc, qu'abcs, etc., with straight quotes, picks up l’abc, l’abcs, d’abc, d’abcs, qu’abc, qu’abcs, etc., with curly quotes)
 * or a diacriticless typo (Francois) picks up all the forms with diacritics (François—it’s hard to find an example in English)
 * Don’t expect any New Splits, Found pre-analysis tokens, or Net Losses unless there was additional customization
 * Summarize findings (here and in Phabricator)
 * When everything looks good and makes sense, submit the patch.
 * When the patch is merged, it’s time to reindex.
 * Before reindexing, using the 10K Wikipedia query sample:
 * Filter “bad queries” and randomly sample 3K queries (using a custom Perl script, )
 * Review the “bad queries” to make sure the filters are behaving reasonably for the given language
 * While reindexing Wikipedia for a given language, kick off “brute-force” sampling (using a custom Perl script, )
 * The brute-force script runs the same 3K queries every 10 minutes while reindexing
 * Let it run 2–3 more iterations after reindexing is complete
 * You may have to throw out a query run if reindexing finished in the middle of the run
 * Using time stamps from the reindexing and query runs, figure out the smallest gap between a “before” and an “after” query run and compare them (using a custom Perl script, ), noting differences in zero results rate, increases and decreases in results counts, and changes in top results.
 * Use similarly spaced pre-reindexing runs and post-reindexing as controls to get a handle on normal variability and compare to the before-and-after results.
 * Comparing the earliest and latest pre-reindexing runs also allows you to judge what is random fluctuation and what is directional. e.g.:
 * if 10-minute interval comparisons all give 2-3% changes in top result, and a 60-minute interval gives 2.3% changes in top results, it’s probably random noise.
 * If 10-minute interval comparisons all give 2-3% changes in increased results, and a 60-minute interval gives 6% change in increased results, it’s probably partly noise overlaying a general increasing trend.
 * Summarize findings (here and in Phabricator)
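The before-and-after comparison described above can be sketched as follows. This is a minimal Python stand-in for the custom Perl scripts, which are not shown here; the Result tuple, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Optional


class Result(NamedTuple):
    total_hits: int            # number of results returned for the query
    top_title: Optional[str]   # title of the top result, None if no results


def zero_results_rate(run: Dict[str, Result]) -> float:
    """Percentage of queries in a run that returned no results."""
    return 100.0 * sum(r.total_hits == 0 for r in run.values()) / len(run)


def compare_runs(before: Dict[str, Result], after: Dict[str, Result]) -> Dict[str, float]:
    """Percentage of shared queries with more results, fewer results, or a new top result."""
    shared = [q for q in before if q in after]
    n = len(shared)
    more = sum(after[q].total_hits > before[q].total_hits for q in shared)
    fewer = sum(after[q].total_hits < before[q].total_hits for q in shared)
    top_changed = sum(after[q].top_title != before[q].top_title for q in shared)
    return {
        "more": 100.0 * more / n,
        "fewer": 100.0 * fewer / n,
        "different": 100.0 * (more + fewer) / n,
        "top_changed": 100.0 * top_changed / n,
    }


# Toy data, invented for the example.
before = {"a": Result(0, None), "b": Result(10, "X"), "c": Result(5, "Y"), "d": Result(3, "Z")}
after = {"a": Result(2, "A"), "b": Result(10, "X"), "c": Result(4, "Y"), "d": Result(3, "W")}
stats = compare_runs(before, after)
print(zero_results_rate(before), zero_results_rate(after), stats)
```

Running the same comparison on two pre-reindexing runs gives the control numbers that the before-and-after numbers are judged against.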

Post-Reindexing Top Result Changes
I've seen something of a trend across wikis: The number of searches that have their top result change decreases dramatically after reindexing. It is possible that there is some effect from changed word stats from merging words after ICU Normalization or ICU Folding (e.g., resume and résumé are counted together). And of course new content may have been added to the Wiki that rightfully earns a place as the new top result for a given query.

However, after consulting with the Elasticsearch Brain Trust™, we decided that the best explanation for this is increased consistency across shards after reindexing.

The most common cause of short term changes in top results is having the query served by a different shard. In addition to having different statistics for uncommon words that are spread unevenly across shards, word statistics are not immediately updated when documents are deleted or changed. Over time the shards are more likely to differ from each other.

After reindexing, every shard has a reasonably balanced brand-spanking new index with no history of deletions and changes, so the shards are likely more similar in their stats (and thus in their reporting of the top result).
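To make the shard effect concrete, here is a small sketch using the BM25 inverse document frequency formula that Lucene uses by default. The shard sizes and document frequencies are invented, and real scoring involves more than IDF, so treat this only as an illustration of how differing per-shard statistics can reorder closely scored results.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25 inverse document frequency for a term on one shard."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the same rare word on two shards of an old index.
# Shard B still counts documents that were deleted or changed but not yet merged away.
shard_a = bm25_idf(doc_count=1000, doc_freq=3)
shard_b = bm25_idf(doc_count=1200, doc_freq=9)

# The same word gets a different weight depending on which shard serves the query.
print(round(shard_a, 2), round(shard_b, 2))

# After reindexing, shards with near-identical statistics give near-identical weights.
fresh_a = bm25_idf(doc_count=1000, doc_freq=3)
fresh_b = bm25_idf(doc_count=1000, doc_freq=3)
print(fresh_a == fresh_b)
```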

Spanish Notes (T277699)

 * Usual 10K sample each from Wikipedia and Wiktionary.
 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
 * Enabled homoglyphs and found a few examples in each sample
 * Enabled ICU normalization and saw the usual normalization
 * Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Potential concerns:
 * 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
 * However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
 * nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
 * Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
 * Updated test fixtures for Spanish and multi-language tests.


 * Refactored building of mapping character filters. There are so many that are just dealing with dotted I after unpacking.
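The dotted-I regression and its fix can be reproduced in miniature. In the sketch below, Python's str.lower stands in for the lowercasing done during ICU normalization, and apply_mapping imitates an Elasticsearch mapping character filter. The filter definition shown (DOTTED_I_FIX and its single rule) is an assumed shape for illustration, not a copy of the deployed configuration.

```python
# Assumed shape of a mapping char_filter that rewrites dotted capital I
# before tokenization and lowercasing.
DOTTED_I_FIX = {"type": "mapping", "mappings": ["İ=>I"]}


def apply_mapping(text: str, char_filter: dict) -> str:
    """Apply 'source=>target' rules the way a mapping character filter would."""
    for rule in char_filter["mappings"]:
        source, target = rule.split("=>")
        text = text.replace(source, target)
    return text


# Without the fix: U+0130 lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
regressed = "İstanbul".lower()
print([hex(ord(ch)) for ch in regressed[:2]])

# With the fix: the character is mapped first, so lowercasing yields a plain "i".
fixed = apply_mapping("İstanbul", DOTTED_I_FIX).lower()
print(fixed)
```

Because so many unpacked analyzers need only this one rule, building the mapping filters in one place avoids repeating it per language.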

Tokenization/Indexing Impacts


 * Spanish Wikipedia (eswiki)
 * There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, º, which normalize to no, a, o, which are stop words (but captured by the plain field).
 * About 1.2% of tokens merged with other tokens. The tokens in queries are likely to be somewhat similar.


 * Spanish Wiktionary (eswikt)
 * There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
 * Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia (T282808)
Summary
 * While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background
 * I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, and filtered 89 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
 * I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (the run took about 9 minutes to complete) 6 times. When the reindexing finished, I stopped the 7th iteration because it was mixed and had just started; it started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
 * I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
 * I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.

Expected Results
 * Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results
 * The number of queries getting zero results held steady at 19.3%
 * The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10 minute intervals; 5.2% over 50 minutes)
 * The number of queries getting fewer results is noise (0.1% to 1.4% in 10 minute intervals; 1.4% over 50 minutes)
 * The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10 minute intervals; 3.8% over 50 minutes)
 * The number of queries changing their top result is noise (0.7% to 0.9% in 10 minute intervals; 0.7% over 50 minutes)
 * These results are also generally consistent with the control tests I ran in April and May.
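The "noise" versus "increasing over time" labels above follow a simple rule of thumb, sketched below: if the long-interval figure stays within the range seen at short intervals, treat it as noise; if it exceeds that range, treat it as a trend. The function name and the exact threshold are assumptions; the numbers are the Spanish control figures listed above.

```python
def classify(short_interval_pcts, long_interval_pct):
    """Label a metric 'trend' if the long-interval change exceeds every short-interval change."""
    return "trend" if long_interval_pct > max(short_interval_pcts) else "noise"


# (range over 10-minute intervals, value over 50 minutes), in percent of queries.
spanish_control = {
    "different result count": ((0.7, 2.3), 5.2),
    "fewer results": ((0.1, 1.4), 1.4),
    "more results": ((0.5, 2.2), 3.8),
    "top result changed": ((0.7, 0.9), 0.7),
}

labels = {name: classify(short, long) for name, (short, long) in spanish_control.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```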

Reindexing Results
 * The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
 * The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
 * The number of queries getting a different number of results increased to 20.2% (vs. the 0.7%–2.4% range seen in control).
 * The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). It is improbable, but not impossible, that this is still random noise. I don't have any obvious explanation after looking at the queries in question.
 * The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
 * The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.
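For reference, the absolute and relative changes quoted for the zero results rate are computed as in this small sketch (the function name is an assumption; the 19.3% and 18.9% figures are the Spanish Wikipedia rates given above).

```python
def rate_change(before_pct: float, after_pct: float):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return round(absolute, 1), round(relative, 1)


# Spanish Wikipedia zero results rate: 19.3% before reindexing, 18.9% after.
absolute, relative = rate_change(19.3, 18.9)
print(absolute, relative)
```

The relative change is the more useful number when comparing wikis whose baseline zero results rates differ.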

Post-Reindex Control
 * The one control test I ran after reindexing showed changes approximately within the normal range, except for the changes in top result, which was 0 (vs 0.7–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.

German/Dutch/Portuguese Notes (T281379)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language.
 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
 * Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
 * Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * German required customization to maintain ß for stopword processing.
 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
 * Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)

German Notes
General German


 * ICU normalization interacts with German stop words. mußte gets filtered (as musste) and daß does not get filtered (as dass). Fortunately, a few years ago, David patched  in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding!! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss', rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
 * Sorting all this out also seems to have fixed T87136.
 * There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
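The stop word interaction can be imitated in a few lines. In this sketch, Python's str.casefold stands in for a normalization step that turns ß into ss, and the two-word stop list is an illustrative subset chosen to match the examples above, not the real German stop word list. All names are assumptions.

```python
# Illustrative subset only: one entry spelled with ß, one spelled with ss.
STOP_WORDS = {"daß", "musste"}


def normalize(token: str, keep_eszett: bool) -> str:
    """Lowercase a token, either preserving ß or folding it to ss."""
    if keep_eszett:
        # Map capital ẞ to ß explicitly, then lowercase (Ä, Ö, Ü still become ä, ö, ü).
        return token.replace("ẞ", "ß").lower()
    # casefold() maps both ß and ẞ to "ss", approximating the problematic behavior.
    return token.casefold()


def is_stop_word(token: str, keep_eszett: bool) -> bool:
    return normalize(token, keep_eszett) in STOP_WORDS


for word in ("daß", "mußte"):
    print(word, is_stop_word(word, keep_eszett=False), is_stop_word(word, keep_eszett=True))
```

With ß folded, mußte is wrongly filtered and daß wrongly kept; with ß preserved, both match the stop list as written.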

German Wikipedia


 * Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
 * It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
 * About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.

German Wiktionary


 * Most common ICU normalizations are long-s's (ſ) (e.g., Auguſt), but that's not bad.
 * The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
 * About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
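The \u-encoded tokens are UTF-16 surrogate pairs written out as escapes, which is why a ten-letter Gothic word shows up as a 120-character token. The sketch below decodes such a token back to the original characters; the function name is an assumption, and the arithmetic is the standard UTF-16 surrogate pair formula.

```python
import re

# The long token from the German Wiktionary sample, as literal backslash escapes.
GOTHIC_TOKEN = (
    r"\uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43"
    r"\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D"
)


def decode_surrogate_escapes(token: str) -> str:
    """Turn a run of \\uXXXX escapes into text, combining surrogate pairs."""
    units = [int(h, 16) for h in re.findall(r"\\u([0-9A-Fa-f]{4})", token)]
    chars = []
    i = 0
    while i < len(units):
        high = units[i]
        is_pair = (
            0xD800 <= high <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            low = units[i + 1]
            chars.append(chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            chars.append(chr(high))
            i += 1
    return "".join(chars)


decoded = decode_surrogate_escapes(GOTHIC_TOKEN)
print(len(GOTHIC_TOKEN), len(decoded), decoded)
```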

Dutch Notes
General Dutch


 * The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German, rather than Dutch, so that's a good thing.
 * There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.

Dutch Wikipedia


 * Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
 * About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.

Dutch Wiktionary


 * The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
 * About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.

Portuguese Notes
Portuguese Wikipedia


 * There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
 * The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
 * Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
 * About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.

Portuguese Wiktionary


 * There's a very small impact on token counts (0.008% out of ~147K), which are mostly homoglyphs.
 * Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
 * About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.

Impact Tool Filtering Improvements During German, Dutch, Portuguese Testing
While working on German, I discovered that 28 of the filtered German queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.

I also added additional filtering for more URLs, email addresses, Cyrillic-flavored junk, and very long queries (≥100 characters) that get 0 results.

I tested these filtering changes on German, Dutch, Portuguese, Spanish, English, Khmer, Basque, Catalan, and Danish query corpora.
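A heavily simplified sketch of this kind of filtering is shown below. It is not the actual filter: the cluster list, the regular expressions, and the function names are all assumptions chosen so that the two German examples above pass while keyboard-mash junk does not.

```python
import re

# Hypothetical list of common German consonant clusters, each counted as one consonant.
GERMAN_CLUSTERS = ["sch", "ch", "ck", "ng", "pf", "tz"]
# Six or more letters in a row that are not vowels (y and umlauts count as vowels here).
CONSONANT_RUN = re.compile(r"[^\W\d_aeiouyäöü]{6,}")
URL_OR_EMAIL = re.compile(r"https?://|www\.|\S+@\S+\.\S+")


def has_junk_consonant_run(query: str) -> bool:
    """True if the query has 6+ consonants in a row after collapsing known clusters."""
    text = query.lower()
    for cluster in GERMAN_CLUSTERS:
        # "X" is a safe placeholder: the text is lowercase, and X counts as one consonant.
        text = text.replace(cluster, "X")
    return CONSONANT_RUN.search(text) is not None


def is_probable_junk(query: str, total_hits: int) -> bool:
    """Flag URLs, email addresses, very long zero-result queries, and consonant mash."""
    if URL_OR_EMAIL.search(query):
        return True
    if len(query) >= 100 and total_hits == 0:
        return True
    return has_junk_consonant_run(query)


for q in ("Deutschschweizer", "Angstschweiß", "sdfghjklm"):
    print(q, is_probable_junk(q, total_hits=1))
```

Reviewing what gets filtered, per language, remains necessary, since patterns that are junk in one language can be ordinary words in another.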

Unpacking + ICU Norm + ICU Folding + ß/ss Split Impact on German Wikipedia (T284185)
Summary
 * While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
 * I also disabled the folding of ß to ss in the plain field, which had a small negative impact on recall in certain corner cases. (See T87136 for rationale.)

Background
 * I pulled a 10K sample of German Wikipedia queries from April of 2021, and filtered 134 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
 * I later discovered that 28 of the filtered queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.
 * I used a brute-force strategy to attempt to detect the impact of reindexing on German Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
 * I compared each iteration against the subsequent one, and compared the first and last runs before reindexing to get insight into "trends" vs "noise" in the comparisons.

Control Results
 * The number of queries getting zero results held steady at 22.0%
 * The number of queries getting a different number of results increases slightly over time (0.3% to 1.6% in 10 minute intervals; 3.6% over 90 minutes)
 * The number of queries getting fewer results is noise (0.0% to 0.4% in 10 minute intervals; 0.5% over 90 minutes)
 * The number of queries getting more results increases slightly over time (0.2% to 1.5% in 10 minute intervals; 3.2% over 90 minutes)
 * The number of queries changing their top result is noise (1.5% to 2.2% in 10 minute intervals; 1.9% over 90 minutes)

Reindexing Results
 * While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
 * The zero results rate dropped to 21.7% (-0.3% absolute change; -1.4% relative change).
 * The number of queries getting a different number of results increased to 13.6% (vs. the 0.3%–1.6% range seen in control).
 * The number of queries getting fewer results was about 4 times the max of the control range (1.8% vs 0.0%–0.4%). 7 of 54 involve ss or ß, but I don't see a pattern for the rest. 37 of 54 only got 1 fewer result, so the impact is not large.
 * The number of queries getting more results was 11.5% (vs the control range of 0.2%–1.5%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
 * The number of queries that changed their top result was 4.0% (vs. the control range of 1.5%–2.2%; that's a less than 2x increase). I looked at some of these, and some are definitely the result of folding allowing for matching words in the top result.

Post-Reindex Control
 * The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.2% vs 1.5%–2.2%).

Observations
 * The most dramatic decrease in results (both in absolute terms and percentage-wise) was for the query "was heisst s.w.a.t." ("what does S.W.A.T. mean?"): from 3369 down to 67 results. Currently,  is configured for the plain field, but not the text field (as before), and ß no longer maps to ss in the plain field.   breaks up s.w.a.t. into four separate letters in the plain field (but not the text field), improving recall. So, the query in the plain field is was + heisst + s + w + a + t, while the text field query is heisst/heißt + s.w.a.t. Since heißt is much more common than heisst (68K vs 2K results), the plain query returns many fewer results.
 * On the one hand, enabling  everywhere would be nice, but we also need proper acronym support! (T170625)

Unpacking + ICU Norm + ICU Folding Impact on Dutch Wikipedia (T284185)
Summary
 * While unpacking an analyzer should have no impact on results, adding ICU folding had a likely minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU Folding—buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote), and a decrease in changes to top queries (a general side-effect of reindexing).

Background
 * I pulled a 10K sample of Dutch Wikipedia queries from April of 2021, and filtered 125 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
 * I used a brute-force strategy to attempt to detect the impact of reindexing on Dutch Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
 * I was unable to time the query runs with reindexing just right, so the reindexing finished during one of the query runs. I had to drop that one, so comparisons are across every other run (i.e., ~20 minutes apart). I also compared the first and last runs before and after reindexing to try to get insight into "trends" vs "noise" in the comparisons, but the shorter total time (~30 minutes) wasn't really long enough to let the signal emerge from the noise.

Control Results
 * The number of queries getting zero results held steady at 23.3%
 * The number of queries getting a different number of results is hard to judge (0.7% to 1.1% in 20 minute intervals; 1.2% over 30 minutes)
 * The number of queries getting fewer results is possibly noise (0.2% to 0.8% in 20 minute intervals; 0.8% over 30 minutes)
 * The number of queries getting more results is probably noise (0.3% to 0.8% in 20 minute intervals; 0.5% over 30 minutes)
 * The number of queries changing their top result is probably noise (1.2% to 1.4% in 20 minute intervals; 1.2% over 30 minutes)

Reindexing Results
 * While unpacking an analyzer should have no impact on results, adding ICU folding had a likely minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU Folding), and a decrease in changes to top queries (a general side-effect of reindexing).
 * The zero results rate dropped to 23.2% (-0.1% absolute change; -0.4% relative change).
 * The number of queries getting a different number of results increased to 8.0% (vs. the 0.7%–1.1% range seen in control).
 * The number of queries getting fewer results was within the control range (0.3% vs 0.2%–0.8%).
 * The number of queries getting more results was 7.5% (vs the control range of 0.3%–0.8%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
 * The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control
 * The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations
 * Zero-results changes are all due to ICU folding, so that buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote. These are all fairly rare words that got ≤5 results with ICU Folding.
 * Large increases in the number of results and changes in the top result are, for the most part, clearly from ICU folding.
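The kinds of matches described here can be approximated with the standard library. The sketch below is only a rough stand-in for ICU folding: it strips combining marks, straightens curly apostrophes, and drops middle dots, while honoring a per-language exception set. Real ICU folding covers far more (for example, letters like ø that have no Unicode decomposition), and the function name and parameters are assumptions.

```python
import unicodedata


def approx_fold(text: str, exceptions: str = "") -> str:
    """Lowercase and strip diacritics, leaving characters in `exceptions` untouched."""
    folded = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch in exceptions:
            folded.append(ch)          # e.g., keep ñ for Spanish
        elif ch in "\u2018\u2019":
            folded.append("'")         # curly apostrophes become straight
        elif ch == "\u00b7":
            continue                   # drop middle dots used for syllabification
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


print(approx_fold("bûthúsbankje"))
print(approx_fold("d’Europe"))
print(approx_fold("Xe·no·kra·tie"))
print(approx_fold("año"), approx_fold("año", exceptions="ñ"))
```

Applying the same function to both the indexed text and the query is what lets an unaccented query match the accented form.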

Unpacking + ICU Norm + ICU Folding Impact on Portuguese Wikipedia (T284185)
Summary
 * ICU folding increases recall for some queries, affecting zero results rate and the total number of results returned. Missing tildes (a instead of ã, or o instead of õ) are the biggest source of changes, so this is a very good change for Portuguese searchers who omit them!

Background
 * I pulled the usual sample of 10K queries from Portuguese Wikipedia (April 2021), filtered 149 queries, and randomly sampled 3K from the remainder.
 * I used a brute-force diff strategy, with control diffs before and after (at ~10 minute intervals).
 * The before/after time difference was 15 minutes because of the exact time reindexing finished.

Control Results
 * The number of queries getting zero results held steady at 18.7%
 * The number of queries getting a different number of results is increasing (0.8% in 10 minute intervals; 1.5% over 20 minutes)
 * The number of queries getting fewer results is noise (0.1% to 0.3% in 10 minute intervals; 0.3% over 20 minutes)
 * The number of queries getting more results is increasing (0.6% to 0.7% in 10 minute intervals; 1.2% over 20 minutes)
 * The number of queries changing their top result is noise (1.1% to 1.4% in 10 minute intervals; 1.4% over 20 minutes)

Reindexing Results
 * The zero results rate dropped to 18.3% (-0.4% absolute change; -2.1% relative change).
 * The number of queries getting a different number of results increased to 15.3% (vs. the 0.8% seen in control).
 * The number of queries getting fewer results was similar to the control range (1.0% in 15 minutes vs 0.6%–0.7% in 10 minutes and 1.2% in 20 minutes).
 * The number of queries getting more results was 13.8% (vs the control range of 0.6%–0.7%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
 * The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control
 * The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations
 * Zero-results changes are, for the most part, clearly due to ICU folding.
 * Large increases in number of results and changes in the top result are largely obviously from ICU folding. Particularly sao matching são—which increased hits from 300 to 21K!
 * The one query I couldn't figure out was 1926~. The absolute increase is fairly large (~5K) but the relative increase is not (2.3%—out of 218K).
 * Overall, missing tildes (a instead of ã, or o instead of õ) are the biggest sources of changes.

Basque, Catalan, and Danish Notes (T283366)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language.
 * Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.


 * Stemming observations:
 * Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
 * Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
 * Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.


 * Enabled homoglyphs and found a handful of examples in all six samples.
 * Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
 * Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.
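To make "mixed-script token" concrete, here is a minimal Python sketch. It is not the homoglyph plugin's logic; it only guesses each letter's script from the first word of its Unicode character name, which is an assumption that happens to work for Latin, Greek, and Cyrillic letters. The function names are hypothetical.

```python
import unicodedata

def scripts_in(token):
    """Return the set of scripts seen among the letters of a token.

    The script is guessed from the first word of the Unicode character
    name (e.g. 'CYRILLIC SMALL LETTER O' -> 'CYRILLIC'). This is a
    heuristic for illustration, not a real script-property lookup.
    """
    found = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

def is_mixed_script(token):
    """True if the token mixes at least two of Latin, Greek, and Cyrillic."""
    return len(scripts_in(token) & {"LATIN", "GREEK", "CYRILLIC"}) >= 2

# "chocolate" with both o's replaced by Cyrillic o (U+043E)
sample = "ch\u043ec\u043elate"
print(sorted(scripts_in(sample)))   # ['CYRILLIC', 'LATIN']
print(is_mixed_script(sample))      # True
print(is_mixed_script("chocolate")) # False
```

A real check would also need to decide which mixed tokens are homoglyph mistakes and which are legitimate (e.g., scientific abbreviations that combine Greek and Latin letters).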


 * Enabled ICU normalization and saw the usual normalizations.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
 * Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). "e" and "o" are stop words—"o" means "or", but "e" just seems to refer to the letter; weird. The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.
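The dotted I regression mentioned above can be reproduced with plain Unicode lowercasing. The sketch below uses Python's built-in lowercasing as a stand-in for the analyzer, and it assumes the char_filter map rewrites İ to a plain I before lowercasing; the notes only say that a char_filter map is used, so the exact mapping here is an assumption.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

def lowercase_naive(text):
    # Generic Unicode lowercasing turns İ into "i" + combining dot above (U+0307).
    return text.lower()

def lowercase_with_char_map(text):
    # Stand-in for the char_filter map: rewrite İ to plain I, then lowercase.
    return text.replace(DOTTED_CAPITAL_I, "I").lower()

word = "\u0130stanbul"
print([hex(ord(c)) for c in lowercase_naive(word)[:2]])  # ['0x69', '0x307']
print(lowercase_with_char_map(word))                     # istanbul
```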


 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
 * ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
 * Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
 * Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
 * Catalan Wikipedia:
 * Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
 * The majority are ç → c; most look ok
 * A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
 * Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
 * Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
 * Danish: Also straightened a fair number of curly quotes.

Overall Impact

 * There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
 * ICU folding is the biggest source of changes in all wikis—as expected.
 * Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
 * Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
 * Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
 * Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
 * Catalan Wikipedia:
 * 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
 * 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
 * Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
 * Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)

An Unexpected Experiment
David needed to reindex over 800 wikis for the  →   rename, including all of the large wikis covered by unpacking Catalan, Danish, and Basque. (There was another small wiki for the Denmark Wikimedia chapter, which I reindexed.)

Because I couldn't control the exact timing of the reindexing, I ran 5 pre-reindex control query runs at 10 minute intervals for comparison, and then ran follow-up query runs at approximately 1-day intervals (usually ±15 minutes, sometimes ±2 hours).

The exact number of pre-reindex controls and post-reindex controls for each language differed because they were reindexed on different days.

General Notes
Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary
 * Catalan has a very large improvement in zero-results rate (8.1% relative improvement, or 1 in 12), largely driven by the fact that people type -cio for -ció (which is cognate with Spanish -ción and English -tion).
 * In general, the impact on Danish was very mild; the general variability in Danish query results is lower than for other wikis.
 * Basque improvements are in large part due to queries in Spanish that are missing the expected Spanish accents.

Background
 * I pulled a sample of 10K Wikipedia queries from April of 2021 (1 week each for Catalan and Danish, the whole month for Basque). I filtered obvious porn, urls, and other junk queries from each sample (ca:237, da:396, eu:438, urls most common category in all cases) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Catalan Wikipedia (T284691)
Reindexing Results
 * Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
 * The zero results rate dropped from 14.9% to 13.7% (-1.2% absolute change; -8.1% relative change).
 * The number of queries that got more results right after reindexing was 30.4%, vs. the pre-reindex control of 17.1% and post-reindex control of 14.8–17.6%.
 * The number of queries that changed their top result right after reindexing was 6.2%, vs. the pre-reindex control of 1.0% and post-reindex control of 0.6–2.0%.
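For reference, the absolute and relative changes quoted for the zero-results rate are related as in this small Python sketch (the function name is hypothetical):

```python
def rate_change(before_pct, after_pct):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)

# Catalan Wikipedia zero-results rate before and after reindexing
print(rate_change(14.9, 13.7))  # (-1.2, -8.1)
```

The relative change is the more useful number for comparing wikis whose baseline zero-results rates differ a lot.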

Observations
 * The most common cause of improvement in zero-results is matching -cio in the query with -ció in the text, and they generally look very good.
 * Some of the most common causes of an increased number of results include -cio/-ció, other accents missing in queries, and c/ç matches. Not all of the highest impact c/ç matches look great, but these are edge cases. From the earlier analysis chain analysis (see above), I expect c/ç matches are overall a good thing, though we should keep an eye out for reports of problems.

Unpacking + ICU Norm + ICU Folding Impact on Danish Wikipedia (T284691)
Reindexing Results
 * Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
 * The zero results rate dropped from 28.6% to 28.2% (-0.4% absolute change; -1.4% relative change).
 * The number of queries that got more results right after reindexing was 9.0%, vs. the pre-reindex control of 2.1–3.1% and post-reindex control of 2.0–3.0%.
 * The number of queries that changed their top result right after reindexing was 1.7%, vs. the pre-reindex control of 0.7–0.9% and post-reindex control of 0.2–0.9%.

Observations
 * Generally the impact on Danish Wikipedia was very muted compared to most others we've seen so far.

Unpacking + ICU Norm + ICU Folding Impact on Basque Wikipedia (T284691)
Reindexing Results
 * Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
 * The zero results rate dropped from 24.4% to 23.1% (-1.3% absolute change; -5.3% relative change).
 * The number of queries that got more results right after reindexing was 21.6%, vs. the pre-reindex control of 6.9–7.9% and post-reindex control of 7.2–10.0%.
 * The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.2–0.6% and post-reindex control of 0.1–0.2%.

Observations
 * A lot of the rescued zero-results and some of the other improved queries are in Spanish, and are missing the expected Spanish accents.

Unexpected Experiment, Unexpected Results!
The results of this unexpected experiment are actually very good. With fairly different behavior from all three of these samples (Catalan with big improvements, Basque with more typical improvements, and Danish with smaller improvements and generally less variability), the impacts—especially now that we know where to expect them—are easy to detect at one-day intervals, despite the general variability in results over time. This means I can back off my sampling rate from ~10 minutes (which is sometimes hard to achieve) to something a little easier to handle—like half-hourly or hourly.

Czech, Finnish, and Galician Notes (T284578)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language.
 * Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary) etc.


 * Stemming observations:
 * Czech Wikipedia had 37 distinct tokens in its largest stemming group.
 * The Czech stemmer stems single letters c → k, z → h, č → k, and ž → h (though plain z is a stop word) and ek → k and eh → h. This seems like an over-aggressive stemmer... looking at the code, it is modifying endings even when there is nothing that looks like a stem. I will submit a ticket or maybe work on a patch as a 10% project.
 * Finnish Wikipedia had 61 distinct tokens in its largest stemming group.
 * Galician Wikipedia had 66 distinct tokens in its largest stemming group.
 * Since I can recognize some cognates in other Romance languages, I can say that the largest group is a little aggressive; it includes Ester, Estaban, estación, estato, estella, estiño, plus many forms of estar.
 * Galician also has a very large number of words in other scripts, which lead to some very long tokens, like the 132-character \u-encoded version of 𐍀𐌰𐌿𐍂𐍄𐌿𐌲𐌰𐌻𐌾𐌰, Gothic for "Portugal".
 * Galician Wiktionary likes to use superscript numbers for different meanings of the same word, so the entry for canto has canto¹ through canto⁴, which get indexed as canto1 through canto4; there are a fair number of such tokens. Fortunately, the unnumbered version should always be on the same page.


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).


 * Enabled homoglyphs and found plenty of examples.
 * There are some Greek/Latin examples in Czech
 * Including "incorrect" Greek letters in IPA on cswikt (oddly, there are some Greek letters that are commonly used in IPA and others that have Latin equivalents that are used instead, and for a couple it's a free-for-all!)
 * There are Cyrillic/Greek and Latin/Greek examples in Finnish Wikipedia and Galician Wiktionary.
 * Galician Wikipedia had lots of Latin/Greek tokens—though many seem to be abbreviations for scientific terms... but there are a few actual mistakes in there, too.


 * Enabled ICU normalization and saw the usual normalizations.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Most common normalizations:
 * Czech: the usual various character regularizations, invisibles (bidi, zero-width (non)joiners, soft hyphens), a few #ª ordinals
 * Finnish: mostly ß/ss & soft hyphens
 * Galician: lots of #ª ordinals, lots of invisibles


 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Exempted [Áá, Čč, Ďď, Éé, Ěě, Íí, Ňň, Óó, Řř, Šš, Ťť, Úú, Ůů, Ýý, and Žž] for Czech.
 * Exempted [Åå, Ää, Öö] for Finnish.
 * Exempted [Ññ] for Galician.
 * ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
 * Czech: lots more tokens with Latin + diacritics than usual, since the list of exemptions is pretty big, and exempts some characters used in other languages, like French and Polish.
 * Finnish: lots of š and ž, which are supposed to be used in loan words and foreign names, but are often simplified to s or z (or sh and zh, but that is probably outside our scope).
 * Galician: Nothing really sticks out as particularly common; just a collection of the usual folding mergers.
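The effect of folding with per-language exemptions can be approximated in a few lines of Python. This is only a sketch: real ICU folding does far more than strip combining marks, and the function name and the way the exemption list is passed in are assumptions for illustration.

```python
import unicodedata

def fold_diacritics(text, exempt=""):
    """Strip combining marks from every character not in the exempt set."""
    exempt_set = set(unicodedata.normalize("NFC", exempt))
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exempt_set:
            out.append(ch)  # exempted letters pass through untouched
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

# Finnish exemptions from the list above: Åå, Ää, Öö
FINNISH_EXEMPT = "\u00c5\u00e5\u00c4\u00e4\u00d6\u00f6"

print(fold_diacritics("\u0161akki", FINNISH_EXEMPT))   # sakki (š folded to s)
print(fold_diacritics("t\u00e4hti", FINNISH_EXEMPT))   # tähti (ä is exempt)
print(fold_diacritics("t\u00e4hti"))                   # tahti (no exemptions)
```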

General Notes
Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary
 * The Czech and Finnish Wikipedia samples showed clear but rather muted impact on user query results. The Galician results are a little more robust and show a more consistent pattern of searchers not using standard accents (rather than just problems with "foreign" diacritics).

Background
 * I pulled a sample of 10K Wikipedia queries from approximately July of 2021 (1 week each for Czech and Finnish, June through August for Galician). I filtered obvious porn, urls, and other junk queries from each sample (Czech:152, Finnish:226, Galician:928, urls are the most common category in all cases, with numbers and junk being common for all, as well. Galician also had a lot of porn queries, and overall more useless queries, which is a trend on smaller wikis). I randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Czech Wikipedia (T290079)
Reindexing Results
 * The zero results rate dropped from 23.8% to 23.6% (-0.2% absolute change; -0.8% relative change).
 * The number of queries that got more results right after reindexing was 8.4%, vs. the pre-reindex control of 0.2–0.7% and post-reindex control of 0.1–0.7%.
 * The number of queries that changed their top result right after reindexing was 1.6%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0%.

Observations
 * Generally the impact on Czech Wikipedia was rather muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Finnish Wikipedia (T290079)
Reindexing Results
 * The zero results rate dropped from 24.6% to 24.4% (-0.2% absolute change; -0.8% relative change).
 * The number of queries that got more results right after reindexing was 9.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
 * The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.4–0.5% and post-reindex control of 0.0–0.1%.

Observations
 * Generally the impact on Finnish Wikipedia was also muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Galician Wikipedia (T290079)
Reindexing Results
 * The zero results rate dropped from 18.1% to 17.5% (-0.6% absolute change; -3.3% relative change).
 * The number of queries that got more results right after reindexing was 18.6%, vs. the pre-reindex control of 0.2–0.5% and post-reindex control of 0.1–0.7%.
 * The number of queries that changed their top result right after reindexing was 4.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.

Observations
 * The most common causes of improvement in zero-results came from matching missing accents on words that end with vowel + n. Cognate with what we've seen before, -cion for -ción is common, along with general accents missing from -ón/-ín/-ún endings.
 * The most common causes of an increased number of results and changes in the top result include correcting for missing accents from final vowel + n, and general incorrect (missing, extra, or wrong) diacritics.

Hindi, Irish, Norwegian Notes (T289612)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language.
 * Except for Irish Wiktionary, which is quite small; I used a 1K sample for gawikt.
 * Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, compounds, a bit of likely vandalism; etc.


 * Stemming observations:
 * Irish Wikipedia had 16 distinct tokens in its largest stemming group.
 * Norwegian Wikipedia had 18 distinct tokens in its largest stemming group.
 * Hindi Wikipedia had 46 distinct tokens in its largest stemming group.
 * The first pass at analysis showed 1780 "potential problem" stems in the Hindi Wikipedia data, which are ones where the stemming group has no common prefix and no common suffix. This isn't particularly rare, but there usually aren't so many. It turns out that the majority (~1400) were caused by Devanagari numerals and Arabic numerals (e.g., १ and 1). I added folding rules to my analysis to handle those cases. Another common cause was long versions of vowels, such as अ (a) and आ (ā), which seem to frequently alternate at the beginning of words that have the same stem. A few more folding rules and I got down to a more normal number of "potential problem" stems (just 12), and they were all reasonable.
 * A smattering of mixed-script tokens.
 * Hindi had many non-homoglyph mixed script tokens, mostly Devanagari and another script. Many of these were separated by colons or periods, making me think  could be useful, especially with better acronym handling.
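The "potential problem" check described above for Hindi (a stemming group with no common prefix and no common suffix) can be sketched as follows. This is not the actual analysis tool; the function names and the digit-folding rule are hypothetical stand-ins for the folding rules mentioned in the notes.

```python
import os

# Devanagari digits U+0966..U+096F mapped to ASCII digits 0..9
DEVANAGARI_DIGITS = {0x0966 + i: str(i) for i in range(10)}

def fold_for_analysis(token):
    """Hypothetical analysis-side folding rule: Devanagari digits to ASCII."""
    return token.translate(DEVANAGARI_DIGITS)

def common_prefix(tokens):
    return os.path.commonprefix(list(tokens))

def common_suffix(tokens):
    # Reverse each token, take the common prefix, and reverse it back.
    return os.path.commonprefix([t[::-1] for t in tokens])[::-1]

def is_potential_problem(tokens, fold=None):
    """True if the group's distinct tokens share no prefix and no suffix."""
    distinct = {fold(t) if fold else t for t in tokens}
    if len(distinct) < 2:
        return False
    return not common_prefix(distinct) and not common_suffix(distinct)

group = ["\u0967", "1"]  # Devanagari digit one and ASCII 1 in one stemming group
print(is_potential_problem(group))                     # True
print(is_potential_problem(group, fold_for_analysis))  # False
```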


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).


 * Enabled homoglyphs and ICU normalization and saw the usual stuff.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Though not for Irish! Since Irish has language-specific lowercasing rules, both lowercasing and ICU normalization happen and lowercasing handles İ correctly.
 * Most common normalizations:
 * Irish Wikipedia also uses Mathematical Bold Italic characters (e.g., 𝙄𝙧𝙚𝙡𝙖𝙣𝙙) rather than bold and italic styling in certain cases, such as names of legal cases.
 * One instance of triple diacritics stuck out: gCúbå̊̊
 * Hindi had lots of bi-directional symbols, including on many words that are not RTL.
 * Norwegian had the usual various character regularizations, mostly diacritics, plus a handful of invisibles.


 * Further Customization—Irish
 * Older forms of Irish orthography used an overdot (ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ/ẛ ṫ) to indicate lenition, which is now usually indicated with a following h (bh ch dh fh gh mh ph sh th). Since these are not commonly occurring characters, it is easy enough to do the mapping (, etc.) as a character filter. It doesn't cause a lot of changes, but it does create a handful of good mergers.
 * Another feature of Gaelic script is that its lowercase i is dotless (ı). However, since there is no distinction between i and ı in Irish, i is generally used in printing and electronic text. ICU folding already converts ı to i.
 * As an example, amhráin ("songs") appears in my corpus both in its modern form, and its older form, aṁráın (with dotted ṁ and dotless ı). Adding the overdot character filter (plus the existing ICU folding) allows these to match!
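The overdot mapping plus the dotless-i folding can be illustrated with a short Python sketch. It mimics the character filter with a translation table; the lowercase-only table and the function name are assumptions for illustration, and the single replace for ı is only a stand-in for what ICU folding does.

```python
import unicodedata

# Overdotted (lenited) consonants mapped to consonant + h.
OVERDOT_TO_H = {
    "\u1e03": "bh",  # ḃ
    "\u010b": "ch",  # ċ
    "\u1e0b": "dh",  # ḋ
    "\u1e1f": "fh",  # ḟ
    "\u0121": "gh",  # ġ
    "\u1e41": "mh",  # ṁ
    "\u1e57": "ph",  # ṗ
    "\u1e61": "sh",  # ṡ
    "\u1e9b": "sh",  # ẛ (long s with dot above)
    "\u1e6b": "th",  # ṫ
}
TABLE = str.maketrans(OVERDOT_TO_H)

def modernize_irish(token):
    # Compose base letter + combining dot into a single character first.
    token = unicodedata.normalize("NFC", token)
    token = token.translate(TABLE)
    # Stand-in for the part of ICU folding that turns dotless ı into i.
    return token.replace("\u0131", "i")

old_form = "a\u1e41r\u00e1\u0131n"  # aṁráın
print(modernize_irish(old_form))    # amhráin
```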


 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Nothing exempted for Irish or Hindi.
 * Exempted Ææ, Øø, and Åå for Norwegian.
 * ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
 * Irish uses a fair number of acute accents to mark long vowels, though they seem to be omitted sometimes (perhaps as a mistake). There are quite a few mergers between undiacriticked (or partly diacriticked) forms and fully diacriticked forms, such as cailíochta and cáilíochta. There are a few potential incorrect forms (I recognize some English words that happen to look like forms of Irish words), but there aren't a lot, and some of them are already conflated by the current search.
 * Hindi: Most folding affects Latin words, and most of the Hindi words that were affected had bidi and other invisible characters stripped.
 * Norwegian Wiktionary had a surprising number of apparently Romance-language words that had their non-Norwegian diacritics normalized away.

Overall Impact

 * There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
 * ICU folding is the biggest source of changes in all wikis—as expected.
 * Irish Wikipedia: 134,095 tokens (15.887% of tokens) were merged into 2,524 groups (2.822% of groups).
 * Irish Wiktionary: 130 tokens (1.272% of tokens) were merged into 44 groups (1.074% of groups).
 * Irish Wiktionary mergers may be less numerous because of the smaller 1K sample size.
 * Irish had a much bigger apparent impact (15.887% of tokens), which is partially an oddity of accounting.
 * Looking at amhrán ("song") as an example, the original main stemming group consisted of amhrán, Amhrán, amhránaíocht, Amhránaíocht, amhránaíochta, Amhránaíochta, d’amhrán, nAmhrán, and tAmhrán. Another group without acute accents—possibly typos—consisted of amhran and Amhran. The larger group (which has more members that are also more common) is counted as merging into the smaller group because the new folded stem is amhran, not amhrán, giving 9 mergers rather than 2.
 * Hindi Wiktionary: 4 tokens (0.002% of tokens) were merged into 4 groups (0.012% of groups).
 * Hindi Wikipedia: 296 tokens (0.019% of tokens) were merged into 150 groups (0.128% of groups).
 * Hindi was barely affected by ICU folding, since it doesn't do much to Hindi text.
 * Norwegian Wiktionary: 1,310 tokens (1.229% of tokens) were merged into 990 groups (4.302% of groups)
 * Norwegian Wikipedia: 6,731 tokens (0.424% of tokens) were merged into 1,633 groups (0.979% of groups)
 * Generally, the merges that resulted from ICU folding in Norwegian were significant, but not extreme.
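The accounting oddity described for amhrán above can be shown with a toy count. The sketch assumes that a token counts as "merged" whenever the stem of its group changes after folding; that rule is my reading of the description, not the actual tool's code, and the function and variable names are hypothetical.

```python
def count_merged(old_groups, new_stem_of):
    """Count distinct tokens whose group stem changed after folding.

    old_groups maps each old stem to its list of distinct tokens;
    new_stem_of maps each old stem to the stem produced after folding.
    """
    merged = 0
    for old_stem, tokens in old_groups.items():
        if new_stem_of[old_stem] != old_stem:
            merged += len(tokens)
    return merged

ACCENTED = "amhr\u00e1n"  # amhrán
PLAIN = "amhran"

old_groups = {
    ACCENTED: ["amhrán", "Amhrán", "amhránaíocht", "Amhránaíocht",
               "amhránaíochta", "Amhránaíochta", "d’amhrán", "nAmhrán",
               "tAmhrán"],
    PLAIN: ["amhran", "Amhran"],
}
# After ICU folding, both groups end up under the unaccented stem.
new_stem_of = {ACCENTED: PLAIN, PLAIN: PLAIN}

print(count_merged(old_groups, new_stem_of))  # 9
```

Counting the smaller group as the one that moved would give 2 instead of 9, which is why the Irish percentage looks inflated.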

General Notes
Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary
 * Specific new matches in all three (Irish, Hindi, & Norwegian) Wikipedias are good.
 * The impact overall on the zero-results rate is fairly small for all three.
 * The zero-results rate for Hindi Wikipedia, independent of recent changes, is really high (60+%), so I investigated a bit. Transliteration of Latin queries to Devanagari could have a sizable impact.
 * Irish and Norwegian had a sizable increase in total results, and a noticeable increase in top results. Hindi had much smaller increases for both.
 * Irish changes were dominated by Irish diacritics (which are not part of the alphabet), while the Norwegian changes were dominated by foreign diacritics.

Background
 * I tried to pull a sample of 10K Wikipedia queries from June–August of 2021 (1 week in July each for Hindi and Norwegian, almost three months for Irish). I was only able to get 2,543 queries for Irish Wikipedia. I filtered obvious porn, urls, and other junk queries from each sample (Irish:959, Hindi:528, Norwegian:250, with urls and porn being the most common categories) and randomly sampled 3000 queries from the remainder (there were only 1448 unique queries left for the Irish sample).

Unpacking + ICU Norm + ICU Folding Impact on Irish Wikipedia (T294257)
Reindexing Results
 * The zero results rate dropped from 32.6% to 30.5% (-2.1% absolute change; -6.4% relative change).
 * The number of queries that got more results right after reindexing was 12.3%, vs. the pre-reindex control of 0% and post-reindex control of 0%.
 * The number of queries that changed their top result right after reindexing was 5.4%, vs. the pre-reindex control of 0.2% and post-reindex control of 0%.

Observations
 * The most common cause of improvement in zero-results is matching missing Irish diacritics.
 * The most common cause of an increased number of results is also matching missing Irish diacritics.
 * Unaccented versions of names like Seamus, Padraig, and O Suilleabhain now can find the accented versions (Séamus, Pádraig, Ó Súilleabháin).
 * Not all diacritical matches are the best. Irish bé matches English be, which occurs in titles of English works. bé matches are still ranked highly because of exact matches.
 * The most common cause of changes in the top result is—you guessed it!—matching missing Irish diacritics; often with a near exact title match.
 * The negligible or zero changes in number of results and top results in the controls stem from, I believe, the small size and low activity of the wiki; basically, there is virtually no noise at the 15–30 minute scale.

Unpacking + ICU Norm + ICU Folding Impact on Hindi Wikipedia (T294257)
Reindexing Results
 * The zero results rate dropped from 62.1% to 62.0% (-0.1% absolute change; -0.2% relative change).
 * The number of queries that got more results right after reindexing was 2.3%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
 * The number of queries that changed their top result right after reindexing was 0.9%, vs. the pre-reindex control of 0.1% and post-reindex control of 0%.

Observations
 * The most common cause of improvement in zero-results is matching missing foreign diacritics. (e.g., shito/shitō and nippo/nippō)
 * The most common causes of an increased number of results are matching missing foreign diacritics, removal of invisibles, and—to a much lesser degree—ICU normalization of some Hindi and other Brahmic accents, including Devanagari and Odia/Oriya virama and Sanskrit udātta.
 * The most common causes of changes in the top result are the same as for the increased number of results, since there is a lot of overlap (i.e., searches that got more results often changed their top result).

Hindi Wikipedia Zero Results Queries
Because the zero results rate was so high, I decided there was no time like the present to do a little investigating into why. I did a little digging into the 1,861 queries that got no results. (A reminder where this sample comes from: 10K Hindi Wikipedia queries were extracted from the search logs, 528 were filtered as porn, URLs, numbers-only, other junk, etc., and the remainder was deduped, leaving 9,060 unique queries. A random sub-sample of 3K was chosen from there, and the 1,861 (62.0%) of those that got zero results are under discussion here.)

The large majority (84%) of zero-results queries are in the Latin script, with Devanagari (13%) and mixed Latin + Devanagari (2%) making up most of the rest.


 * 1566 (84.1%) Latin
 * 244 (13.1%) Devanagari
 * 43 (2.3%) Latin + Devanagari
 * 4 (0.2%) Gujarati
 * 1 Gurmukhi (Punjabi) + Devanagari
 * 1 CJK
 * 1 emoji
 * 1 misc/wtf (punct + Devanagari combining chars)

I reviewed a random sample of 50 of the Latin queries, and divided them into two broad (and easy for me to discern) categories: English and non-English. The non-English generally looks like transliterated Devanagari/Hindi, but I did not explicitly verify that in all cases. There are a relatively small number of English queries and a larger number of mixed English and non-English queries, and the majority (~70%) are non-English.

50 Latin sample


 * 34 non-English
 * 13 Mixed English + non-English
 * 2 English
 * 1 ???

I took a separate random sample of 20 non-English queries and used Google Translate in Hindi to convert them to Devanagari. About half couldn't be automatically converted (I didn't dig into that to figure out why), but 25% got some Wikipedia results after conversion, and 15% that got no Wikipedia results got some sister-search (Wiktionary, etc.) results. The remaining 15% got no results.

20 non-English sample


 * 9 can't convert
 * 5 some results
 * 3 sister search results
 * 3 no results

Taking this naive calculation with a huge grain of salt (or at least with huge error bars), 84.1% of zero-result queries are in Latin script, 68% of those are likely transliterated Devanagari, and 40% of those get results when transliterated back to Devanagari. That's 22.9% (probably ±314.59%)... actually, the math nerd in me couldn't let it go... using the Wilson Score Interval and the standard error propagation formula for multiplication, I get 23.3% ± 11.9%.
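The calculation above can be reproduced with the following Python sketch. It assumes a 95% interval (z = 1.96), uses the Wilson interval's center and half-width for each proportion, and combines the relative errors in quadrature; the notes do not state these details, so they are assumptions, though they do reproduce the figures quoted. The function names are hypothetical.

```python
import math

Z = 1.96  # two-sided 95% normal quantile (assumed)

def wilson(successes, n, z=Z):
    """Return (center, half_width) of the Wilson score interval."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center, half

def product_with_error(intervals):
    """Multiply the centers; combine relative errors in quadrature."""
    estimate = math.prod(c for c, _ in intervals)
    rel = math.sqrt(sum((h / c) ** 2 for c, h in intervals))
    return estimate, estimate * rel

steps = [
    wilson(1566, 1861),  # zero-result queries that are in Latin script
    wilson(34, 50),      # Latin sample that looked non-English
    wilson(8, 20),       # non-English sample with results after conversion
]
estimate, error = product_with_error(steps)
print(f"{estimate:.1%} ± {error:.1%}")  # 23.3% ± 11.9%
```

Almost all of the uncertainty comes from the 20-query sample, so a larger sample at that step would tighten the estimate the most.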

So, in very round numbers, almost ¼ of non-junk zero-result queries (and likely at least ⅒ and at most ⅓) on Hindi Wikipedia could be rehabilitated with some sort of decent Latin-to-Devanagari transliteration. The number could be noticeably higher, too—most optimistically doubled—if the queries that Google Translate could not automatically convert got some sort of results with a more robust transliteration scheme; on the other hand, they could all be junk, too. It is also possible that the mixed English and transliterated Devanagari zero-result queries could get some results—though transliterating the right part of the mixed queries could present a significant challenge.

I have opened a ticket with this info (T297761) to go on our backlog as a possible future improvement for Hindi.

I also looked at a random sample of 20 of the zero-result Devanagari queries. The most common grouping is what I call "homework". These are queries that are phrased like all or part of a typical homework question, or other information-seeking question. Something like What is the airspeed velocity of an unladen swallow?, How does aspirin find a headache, or hyperbolic geometry parallel lines.

I also found four names, one porn query, and three I couldn't readily decipher.

20 Devanagari sample
 * 12 "homework"
 * 4 names
 * 1 porn
 * 3 ???

Homework-type questions in general sometimes benefit from removing stop words, but sometimes there are too many specific but only semi-relevant content words to find a match.

Unpacking + ICU Norm + ICU Folding Impact on Norwegian Wikipedia (T294257)
Reindexing Results
 * The zero results rate dropped from 26.4% to 26.2% (-0.2% absolute change; -0.8% relative change).
 * The number of queries that got more results right after reindexing was 9.2%, vs. the pre-reindex control of 0.1–0.3% and post-reindex control of 0.1–0.2%.
 * The number of queries that changed their top result right after reindexing was 4.3%, vs. the pre-reindex control of 1.1–1.2% and post-reindex control of 0%.

Observations
 * The most common cause of improvement in zero-results is matching missing foreign diacritics. (e.g., Butragueno/Butragueño and Bockmann/Böckmann)
 * The most common cause of an increased number of results is matching foreign diacritics.
 * The most common cause of changes in the top result is matching foreign diacritics.

Bengali Notes (T294067)
The situation with Bangla/Bengali is a little different than others I've worked on so far. The Bengali analyzer from Elasticsearch has not been enabled, so I need to enable it, verify it with speakers, and unpack it so that we don't have any regressions in terms of handling ICU normalization or homoglyph normalization.

Since enabling a new analyzer is more complex than the other unpacking projects, I've put the details on their own page.

General Notes
Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries), and any unexpected impacts.

Summary
 * Bengali Wikipedia had a very high zero-results rate (49.0%), and introducing stemming (and other changes—but mostly stemming) provided results for about ⅐ of zero-results queries, lowering the zero-results rate to 42.3%—which is still very high, but definitely better.

Background
 * I pulled a sample of 10K Wikipedia queries from one week in July of 2022. I filtered obvious porn, urls, and other junk queries from the sample (185 queries filtered, porn and strings of numbers are the most common categories) and randomly sampled 3000 unique queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Bengali Wikipedia (T315265)
Reindexing Results
 * The zero results rate dropped from 49.0% to 42.3% (-6.7% absolute change; -13.7% relative change).
 * The number of queries that got more results right after reindexing was 33.0%, vs. the pre-reindex control of 0.0–0.2% and post-reindex control of 0.0–0.6%.
 * The number of queries that changed their top result right after reindexing was 19.6%, vs. the pre-reindex control of 0.1% and post-reindex control of 0.0%.

Observations
 * The most common cause of all changes seems to be stemming.

Arabic and Thai Notes (T294147)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language.
 * Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, chemical names, etc.

Since Thai became so involved, I'm going to split my notes on Arabic and Thai, rather than have them interleaved as I usually do.

Arabic Notes

 * Some Arabic observations:
 * Lots of (invisible) bi-directional markers everywhere.
 * There are a number of empty tokens, which result from 1 to 4 tatweel characters (ـ), which are used to elongate words or characters to justify text. They are rightly ignored, but there are a few hundred instances where it appears by itself, creating empty tokens.
 * There are a handful of homoglyph tokens—Cyrillic/Latin and Greek/Latin... gotta work on those Greek homoglyphs!
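
To illustrate how tatweel produces empty tokens, here is a small Python sketch. It only mimics the "ignore tatweel" step with a string replace; it is not the Elasticsearch Arabic normalization filter itself, and the sample tokens are made up.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida), used to stretch words

# Made-up sample: one word stretched with tatweel, plus tatweel runs standing alone.
tokens = [
    "\u0643\u062a\u0640\u0640\u0627\u0628",  # kitab with two tatweels inside
    "\u0640\u0640\u0640",                    # three tatweels by themselves
    "\u0640",                                # a single tatweel by itself
]

# Dropping tatweel leaves the real word intact, but solo tatweel runs become "".
stripped = [t.replace(TATWEEL, "") for t in tokens]

# Empty tokens have to be filtered out afterwards.
non_empty = [t for t in stripped if t]
```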


 * Stemming observations:
 * Arabic Wikipedia had 98(!) distinct tokens in its largest stemming group.


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.


 * For Arabic, enabling homoglyphs and ICU normalization resulted in the usual stuff.
 * A smattering of mixed-script tokens.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Most common normalizations:
 * Arabic had lots of loose bidi marks and a few zero-width (non)joiners and non-breaking spaces that get cleaned up.
 * Arabic Wiktionary had a fair number of long s (ſ) characters that are properly folded to s, and an fi digraph (ﬁ) is folded to fi.
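
The dotted I regression is easy to reproduce outside of Elasticsearch. The sketch below uses Python's built-in lowercasing, which follows the same Unicode special-casing rule. Mapping İ to plain I before lowercasing is my assumption about what the char_filter map does; the notes don't spell out the mapping.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
COMBINING_DOT_ABOVE = "\u0307"

word = DOTTED_CAPITAL_I + "stanbul"

# Plain lowercasing turns İ into i + combining dot above (two code points).
lowered = word.lower()

# Assumed fix: map İ to I first, as a character filter would, then lowercase.
mapped_then_lowered = word.replace(DOTTED_CAPITAL_I, "I").lower()
```

With the mapping in place, İstanbul and Istanbul end up as the same token.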


 * Enabled custom ICU folding, saw lots of the usual folding effects.
 * Arabic-specific ICU folding includes:
 * ignoring inverted damma ( ٗ — "The bulk of Arabic script is written without ḥarakāt"—short vowel marks, including damma)
 * ignoring dagger alif (  ٰ — "it is seldom written")
 * converting Farsi yeh (ی) to Arabic yeh (ي — which might not make sense in Farsi, but does in Arabic)
 * converting keheh (ک) to kaf (ك — again makes sense to convert "foreign" letter variants to native ones)
 * removing hamza from waw (ؤ to و) and yeh (ئ to ي) — these were a little less obviously good to me; thanks to Mike R. for giving me the lowdown. While not perfect, these normalizations are generally positive and reflect the way hamza is often used in practice.
 * A note as to scale: in my sample of 10K Arabic Wikipedia articles, there are 211K distinct token types in the text before language analysis (from 1.6M total tokens), and 112K distinct token types after analysis. Of those 112K types, only 550 are affected by all these ICU folding changes. In my 10K-entry Wiktionary sample, only 57 types are affected by all Arabic ICU folding normalizations—out of 27K distinct token types after language analysis (39K types before analysis; 100K tokens total).
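
The Arabic-specific foldings listed above amount to a small character map. Here is a hedged Python sketch of that map using str.translate. It is only an illustration of the listed mappings, not the ICU folding implementation, and ARABIC_FOLD and fold_arabic are hypothetical names.

```python
# Character-level mappings taken from the list above (None means "delete").
ARABIC_FOLD = {
    0x0657: None,    # inverted damma: ignored
    0x0670: None,    # dagger (superscript) alif: ignored
    0x06CC: 0x064A,  # Farsi yeh -> Arabic yeh
    0x06A9: 0x0643,  # keheh -> kaf
    0x0624: 0x0648,  # waw with hamza -> waw
    0x0626: 0x064A,  # yeh with hamza -> yeh
}

def fold_arabic(text):
    """Apply the illustrative Arabic folding map to a string."""
    return text.translate(ARABIC_FOLD)
```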

Overall Impact for Arabic

 * There were few token count differences in most cases, mostly from fewer solo combining characters and Arabic tatweel.
 * ICU folding is the biggest source of changes—as expected.
 * Generally, the merges that resulted from ICU folding were significant, but not extreme (1.1% to 2.1% of tokens being redistributed into 0.5% to 1% of stemming groups).

Thai Notes

 * Lots of Thai observations:
 * I was surprised to see that the only stemming groups with multiple members in the Thai Wiktionary data are numbers! For example, 1, ๑ (Thai), ໑ (Lao), and ᧑ (New Tai Lue) are all analyzed as 1. I checked the wiki page for Thai grammar, and it is indeed analytic—with apparently no inflections! (English is usually classified as (kinda) analytic—but Thai really means it!)
 * In the Thai Wikipedia data, there are two non-number (and non-Thai) groups with multiple members, and they both feature our old friend dotted I, as in Istanbul / İstanbul.
 * Looking more closely at the built-in Thai analysis chain, there is no Thai stemmer. Being analytic, I guess it doesn't need one—neat!
 * Thai is the only Elastic built-in analyzer that doesn't use the "standard" tokenizer; there is a specific Thai tokenizer. This leads to some differences—and it kind of looks like the Thai tokenizer is lagging behind the standard and ICU tokenizers for non-Thai characters.
 * There are a lot of really long Thai tokens. The longest is 204 characters: สู่⎵⎵การส่งเสริมความก้าวหน้าของโรคมะเร็งความเข้าใจที่ดีขึ้นของอณูชีววิทยาและชีววิทยาของเซลล์ที่ได้จากการวิจัยโรคมะเร็งได้นำไปสู่⎵⎵การรักษาใหม่จำนวนมากสำหรับโรคมะเร็งนับแต่ประธานาธิบดีนิกสันแห่งสหรัฐประกาศ
 * The ⎵ here represents a zero width space (U+200B). A lot of the really long tokens (but not quite all) have zero width spaces in them. Removing them gives much more reasonable tokenizations—in this case 29 separate tokens.
 * The good-ish news is that the plain field, using the ICU tokenizer, isn't freaked out by the zero width spaces, and generated 49 tokens—so ~20 tokens are potential stop words that are not ignored.
 * The better news is that this is easily fixed with a char_filter before tokenization.
 * There are a few mixed Latin/Cyrillic homoglyph tokens in the Wikipedia data that should be fixed by the homoglyph filter.
 * There are a fair number of tokens with bidi characters—including Arabic and Hebrew, but also CJK, Thai, Latin, and others. ICU normalization should fix those.
 * A few other invisibles show up in output tokens: zero width joiners & non-joiners, zero width spaces, and variation selectors in Myanmar text. Variation selectors are new to me, and I'm not sure whether ICU normalization and/or ICU folding will clean them up, but we'll see.
 * There are a surprising number of tokens with a single double quote in them. For example, Let"s, which looks like a typo. Others, like CD"Just, don't appear on-wiki and seem to be caused by errors in my export process. Not sure if it's from the export itself or my subsequent clean up.
 * The standard tokenizer and the ICU tokenizer strip double quotes from the edges of tokens, and only allow them inside Hebrew tokens (where they frequently substitute for gershayim, which usually indicate Hebrew acronyms). Not sure if this is worth fixing since it may be an artifact of my export process.
 * There are a lot—thousands!—of hyphenated tokens; mostly Latin, but also in plenty of other scripts. And also plenty of other separators...
 * Other dash-like separators remaining in tokens include: – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), － fullwidth hyphen-minus (U+FF0D), and the ‧ hyphenation point (U+2027).
 * It's not clear whether we should break on hyphenation points; they are mostly used to break up syllables in words on Thai Wiktionary. However, if we break on hyphens (which would generally be a good thing), then things would be more consistent if we also break on hyphenation points, since hyphens are used to break up syllables, too.
 * I also learned that the usual hyphen, which also functions as a minus sign (- U+002D "HYPHEN-MINUS") and which I thought of as the hyphen, is not the only hyphen... there is also ‐ (U+2010 "HYPHEN"). Since I had been labeling the typical "hyphen-minus" as "hyphen" in my reports, it took me a while to realize that the character called just "hyphen" is distinct. Fun times!
 * These could all be cleaned up with a char_filter.
 * There are a fair number of tokens using fullwidth Latin characters. So, ＩＭＰＯＳＳＩＢＬＥ gets normalized to ｉｍｐｏｓｓｉｂｌｅ, rather than the more searchable impossible. I expect ICU normalization or ICU folding to take care of this.
 * Percent signs (%) are not skipped during tokenizing, which means that there are percentages (15.3%) in the final tokens. URL-encoded strings (like %E2%80%BD) get parsed the wrong-way around, with the percent signs attached to the preceding rather than following letters/numbers (%E2%80%BD is parsed as E2% + 80% + BD).
 * There are also a lot of tokens that end with an ampersand (&). These seem to mostly come from URLs as well, with the little twist that the ampersand is only attached to the token if the character right before the ampersand is a number (including non-Arabic numerals). Tokenizing the (very artificial) URL fragment q=1&q=๑&q=໑&q=١&q=१&q=১&q=੧&q=૧&q=୧&q=༡&id=xyz gives 10 tokens that are all normalized (presumably by the ubiquitous  filter) to 1&. (In practice, non-ASCII letters are often URL-encoded, so all natural examples are Arabic digits (0-9) with or without plain-ASCII Latin character accompaniments.) These can be broken up by a char_filter or cleaned up with a token filter.
 * In summary, it looks like Thai could benefit from an even more aggressive version of  to "split on"—by converting them to spaces—hyphens (both hyphen-minus and "true" hyphens), en dashes, em dashes, horizontal bars, full-width hyphens, and ampersands, and probably on hyphenation points, percent signs, and double quotes. We should also either delete or convert to spaces (not sure yet) zero width spaces as early as possible, so that the Thai tokenizer isn't confused by them.
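
The summary above can be sketched as a simple character map applied before tokenization. This is a Python stand-in for a mapping char_filter, not the actual configuration. SPLIT_CHARS, CHAR_MAP, and prefilter are hypothetical names, and deleting zero width spaces (rather than converting them to spaces) is an assumption, since that choice is still open in the notes.

```python
# Characters to "split on" by converting them to spaces:
# hyphen-minus, hyphen, en dash, em dash, horizontal bar, fullwidth hyphen-minus,
# ampersand, hyphenation point, percent sign, and double quote.
SPLIT_CHARS = "-\u2010\u2013\u2014\u2015\uff0d&\u2027%\""
ZERO_WIDTH_SPACE = "\u200b"

CHAR_MAP = {ord(ch): " " for ch in SPLIT_CHARS}
CHAR_MAP[ord(ZERO_WIDTH_SPACE)] = None  # assumption: delete instead of adding a space

def prefilter(text):
    """Apply the illustrative pre-tokenization character map."""
    return text.translate(CHAR_MAP)

# Whitespace splitting stands in for the real tokenizer here.
url_tokens = prefilter("%E2%80%BD").split()
query_tokens = prefilter("q=1&q=2").split()
```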


 * Stemming observations:
 * Thai Wikipedia had only 1(!!!) distinct token in its Thai, non-number groups!
 * Numbers have up to four different numeral systems in my samples.
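
For a sense of why the number groups merge, the digits in question all carry the same Unicode decimal value. The sketch below uses Python's unicodedata module to show this; it illustrates the Unicode property only, and is not a claim about which filter in the analysis chain does the normalizing.

```python
import unicodedata

# The digit "one" in the four numeral systems mentioned in the Thai notes.
ones = {
    "Arabic (ASCII)": "1",
    "Thai": "\u0e51",
    "Lao": "\u0ed1",
    "New Tai Lue": "\u19d1",
}

# Each character has the Unicode decimal digit value 1.
values = {name: unicodedata.digit(ch) for name, ch in ones.items()}

# Rewriting every digit as its ASCII equivalent puts them all in one group.
normalized = {str(unicodedata.digit(ch)) for ch in ones.values()}
```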


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.


 * For Thai, enabling homoglyphs and ICU normalization had a bigger-than-usual impact!
 * As usual, a smattering of mixed-script tokens.
 * As usual, the expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map.
 * A lot of tokens with invisibles (mostly zero width spaces) were grouped with their invisible-less counterparts.
 * There are a sprinkling of fullwidth Latin tokens (e.g., ＩＭＰＯＳＳＩＢＬＥ) that are now grouped with their normal Latin counterparts.
 * Unexpectedly, 1.0% of Wiktionary tokens (~2900) and 0.8% of Wikipedia tokens (~28K) were lost! The lost tokens were lost because, after ICU normalization, they were identified as stop words.
 * The vast majority of the normalization was from ำ (SARA AM) to  ํ + า (NIKHAHIT + SARA AA). According to the Wiktionary entry for  ำ (SARA AM), the combined single glyph is preferred, but apparently the Thai stop word list doesn't know that. The two versions are virtually identical—I couldn't see any difference in a dozen fonts that I checked, though I do not have an expert eye. For example, ทำ is not a stop word, but ทํา is (ICU normalization converts the former to the latter.)
 * The slim remainder of tokens (all in the Thai Wikipedia sample) had either zero width spaces (90 tokens) or bidi marks (2 tokens) that left stop words after they were removed.
 * A few other non–stop word tokens with SARA AM merged with their NIKHAHIT + SARA AA counterparts.
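
The SARA AM behavior can be reproduced with standard Unicode normalization. The sketch below uses Python's NFKC as a stand-in for ICU normalization (an approximation; the ICU normalizer in Elasticsearch also case-folds), and a one-word hypothetical stop list based on the ทำ / ทํา example above.

```python
import unicodedata

THO_THAHAN = "\u0e17"
SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"

composed = THO_THAHAN + SARA_AM               # the preferred single-glyph form
decomposed = THO_THAHAN + NIKHAHIT + SARA_AA  # looks identical, different code points

# Compatibility normalization splits SARA AM into NIKHAHIT + SARA AA.
normalized = unicodedata.normalize("NFKC", composed)

# Hypothetical stop list that only knows the decomposed spelling.
stop_words = {decomposed}
is_stop_before = composed in stop_words
is_stop_after = normalized in stop_words

# Fullwidth Latin is also mapped to plain Latin by the same normalization.
fullwidth = "\uff29\uff2d\uff30\uff2f\uff33\uff33\uff29\uff22\uff2c\uff25"
plain = unicodedata.normalize("NFKC", fullwidth)
```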

An Excursus on Tokenization
Before continuing with ICU folding for Thai, I decided to look into and make some implementation decisions about tokenization.

The ICU tokenizer has algorithms/dictionaries for a lot of spaceless languages, including Thai, so it is a potentially viable alternative to the Thai tokenizer. However, neither tokenizer is perfect.

Some issues include...


 * Thai analysis in general
 * There are two obsolete characters, ฃ and ฅ, that have been replaced with the similar looking and similar sounding ข and ค, respectively. Easily fixed with a character filter.
 * Similar to Khmer—but apparently (thankfully!) nowhere near as complex—Thai has some ambiguous representations of diacritics and characters with diacritics.
 * ำ (SARA AM, U+0E33) and ํ + า (NIKHAHIT, U+0E4D + SARA AA, U+0E32) look identical: ทำ vs ทํา.
 * When sara am is combined with some other diacritics, there are three combinations that look the same, in many fonts (occurrences are in Thai Wikipedia, and are found with regular expressions, which can time out and give incomplete results, so counts are sometimes not exact):
 * กล่ำ = ก + ล + ่ +  ำ	(≥8900 occs)
 * กลํ่า = ก + ล + ํ +  ่ + า	(80 occs)
 * กล่ํา = ก + ล + ่ +  ํ + า	(6 occs)
 * There is a fourth combination that looks the same in some fonts/applications:
 * กลำ่ = ก + ล + ำ +  ่	(≥2 occs)
 * Approximately 1% of apparent instances of sara am and ่ (MAI EK, U+0E48) are in the wrong order.
 * This split of sara am also occurs around  ้ (MAI THO, U+0E49), but not any other Thai diacritics.
 * ึ (SARA UE, U+0E36) and ิ  +  ํ (SARA I, U+0E34 + NIKHAHIT, U+0E4D) and  ํ  +  ิ (NIKHAHIT, U+0E4D + SARA I, U+0E34) often look identical (depending on font and application): กึ vs กิํ vs กํิ
 * All of these can potentially screw up tokenization, and certainly will mess with matching in general. Fortunately, all of these can also be fixed with character filters.
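
A minimal sketch of what such character filters could look like, in Python rather than Elasticsearch config. The choice of target form (tone mark followed by SARA AM, the most common spelling in the counts above) is my assumption, as are the function and constant names. Only the two tone marks mentioned above are handled.

```python
import re

SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"
TONES = "\u0e48\u0e49"  # MAI EK and MAI THO

# Obsolete letters and their modern replacements.
OBSOLETE = {0x0E03: 0x0E02, 0x0E05: 0x0E04}  # kho khuat -> kho khai, kho khon -> kho khwai

def normalize_thai(text):
    """Illustrative clean-up of obsolete letters and SARA AM variants."""
    text = text.translate(OBSOLETE)
    # nikhahit + tone + sara aa -> tone + sara am
    text = re.sub(f"{NIKHAHIT}([{TONES}]){SARA_AA}", rf"\1{SARA_AM}", text)
    # tone + nikhahit + sara aa -> tone + sara am
    text = re.sub(f"([{TONES}]){NIKHAHIT}{SARA_AA}", rf"\1{SARA_AM}", text)
    # sara am + tone (wrong order) -> tone + sara am
    text = re.sub(f"{SARA_AM}([{TONES}])", rf"\1{SARA_AM}", text)
    # bare nikhahit + sara aa -> sara am
    return text.replace(NIKHAHIT + SARA_AA, SARA_AM)

base = "\u0e01\u0e25"  # ko kai + lo ling
variants = [
    base + "\u0e48" + SARA_AM,
    base + NIKHAHIT + "\u0e48" + SARA_AA,
    base + "\u0e48" + NIKHAHIT + SARA_AA,
    base + SARA_AM + "\u0e48",
]
unified = {normalize_thai(v) for v in variants}
```

If ICU normalization runs later in the chain, it would presumably decompose SARA AM again, so the important part is only that all variants end up identical.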


 * Thai Tokenizer
 * — Doesn't split on em dash, en dash, hyphen-minus, hyphen, horizontal bar, fullwidth hyphen, double quote, colon, or hyphenation point. Easily fixed with character filter similar to.
 * ⬆⬆ Splits on periods between Thai characters.
 * — Zero width spaces, obsolete ฃ and ฅ, and improperly ordered/normalized diacritics can break the tokenization process and result in absurdly long tokens (~200 characters in the extreme). Readily fixable with character filters.
 * ⬇︎⬇︎ Sometimes the nikhahit character—which is generally kind of rare ("infrequently used")—can also cause tokenization to go awry, resulting in overly long tokens. I can't find a way to handle this. The character doesn't seem to be incorrect, so I can't delete it or substitute it.
 * ⬇⬇⬇ I found a buffering bug in the Thai tokenizer. When a string to be analyzed is over 1024 characters long, it gets broken up into 1024-character chunks, even if that splits a token. This can also cause weird long-distance effects ("spooky action at a distance"?) depending on how text is chunked and submitted for tokenization. I saw a split of an English token (BONUS → B + ONUS) that had spaces on either side of it—so there is no effort made in the tokenizer to prevent this problem, even when it is straightforward.
 * ⬇ The tokenizer treats some characters, like some symbols & emoji, Ahom (𑜒𑜑𑜪𑜨), and Grantha (𑌗𑍍𑌰𑌨𑍍𑌥) essentially like punctuation, and ignores them entirely.
 * ⬇ Oddly ignores New Tai Lue (ᦟᦲᧅᦷᦎᦺᦑᦟᦹᧉ) tokens starting with ᦵ, ᦶ, ᦷ, ᦺ, ᧚, and has complicated/inconsistent processing of tokens with ᧞.
 * ⬇ Prefers longer tokens for compound words (e.g., พรรคประชาธิปัตย์ ("Democratic party") is one token instead of two).
 * — Sometimes fails to split some other tokens that the ICU tokenizer splits (these are harder to assess).
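
The buffering problem is easy to picture with a toy model. The sketch below is not the Lucene Thai tokenizer; it is a made-up whitespace tokenizer that naively processes text in fixed 1024-character chunks, which is enough to reproduce the BONUS → B + ONUS effect.

```python
def chunked_tokens(text, chunk_size=1024):
    """Toy tokenizer: split on whitespace, but only within fixed-size chunks."""
    tokens = []
    for start in range(0, len(text), chunk_size):
        tokens.extend(text[start:start + chunk_size].split())
    return tokens

# 1023 characters of filler, so that "B" is the last character of the first chunk.
filler = "ab " * 341
text = filler + "BONUS end"

naive = chunked_tokens(text)
expected = text.split()
```

The token is surrounded by spaces, so backing up to the last space before the chunk boundary would have avoided the split.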


 * ICU Tokenizer
 * ⬆ Allows tokens for Ahom, Grantha, and some symbols and emoji to come through.
 * ⬆⬆ Better parsing for CJK, Khmer, Lao, New Tai Lue, and other spaceless languages
 * ⬇ Explodes homoglyph tokens: creates a new token whenever a new script is encountered, so a Latin token with a Cyrillic character in the middle gets broken into three tokens, making it unfindable. This is a known problem with the ICU tokenizer, but we still use it elsewhere.
 * ⬇ Relatedly, maintains "current" script set across spaces and assigns digits the "current" script, so the presence of particular earlier tokens can affect the parsing of later letter-number tokens. E.g., x 3a is parsed as x + 3a, while ร 3a is parsed as ร + 3 + a. This is a known problem with the ICU tokenizer, but we still use it elsewhere.
 * ⬆ Prefers multiple shorter tokens for compound words (e.g., พรรคประชาธิปัตย์ ("Democratic party") is two tokens, พรรค + ประชาธิปัตย์, instead of one).
 * — Sometimes splits some other tokens that the Thai tokenizer does not split (these are harder to assess).
 * — Allows apostrophes in Thai tokens. These tokens are kind of screwy because they are part of a pronunciation guide in Wiktionary, and they get split oddly no matter what. (For comparison, it would be like saying that ดอกจัน "asterisk" is pronounced dok'chan, and then parsing that as do + k'chan, because do is an English word.) These aren't super common and aren't real words; the word-looking non-word parts that are split off are more of a concern than the bogus tokens with apostrophes in them.
 * — Digits (Arabic 0-9 and Thai ๐-๙) glom on to non-digit words/tokens. This is reasonable in languages with spaces, where tokens are more obvious, but in a spaceless language, this seems like a bad idea—it renders both the number and the word it attaches to almost unsearchable. This was originally " ⬇⬇ ", but it can be fixed with a character filter, though it's a little ugly.
 * — Doesn't split on colon, hyphenation point, middot, semicolons (between numbers), or underscores. Easily fixed with character filter similar to.
 * - Doesn't split on periods between Thai characters. Should also be fixable with a character filter.
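
To make the homoglyph "explosion" concrete, here is a toy Python model that starts a new token whenever the script changes. It is only a rough imitation: the real ICU tokenizer uses proper Unicode script data and word-break rules, while this sketch guesses the script from the first word of each character's Unicode name.

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Crude script guess: first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def split_on_script_change(token):
    """Break a token wherever the (guessed) script changes."""
    return ["".join(group) for _, group in groupby(token, key=script_of)]

# Latin C followed by Cyrillic letters, as in the mixed-script example in these notes.
mixed_start = "C\u043e\u0432\u0435\u0442\u0441\u043a\u0438\u0435"
# Latin word with a single Cyrillic o in the middle.
mixed_middle = "ch\u043ecolate"

start_parts = split_on_script_change(mixed_start)
middle_parts = split_on_script_change(mixed_middle)
```

Since the homoglyph filter runs after tokenization, it never sees the original mixed-script token and can't repair it.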

And the Winner is...
... the ICU tokenizer.

I'm personally rather annoyed by the behavior of the ICU tokenizer on homoglyph/mixed-script tokens and the parsing of mixed letter-number tokens, but they aren't super common, and we already accept that behavior for other languages that use the ICU tokenizer.

The improvements to rarer characters and scripts (emoji, Ahom, Grantha, and surely others not in my samples), improved parsing for other spaceless languages, and especially the handling of invisibles (like the zero width space) and the parsing of Thai compounds are all a lot better—so it makes sense to try to use the ICU tokenizer.

I added a character filter that handles:
 * the obsolete characters, by replacing them with the preferred characters.
 * re-ordering diacritics (fortunately there are only a few cases to handle—nothing as complex as the Khmer situation).
 * breaking words/tokens on colon, hyphenation point, middot, semicolons, and underscores, by replacing them with spaces.

Separating numbers from Thai tokens, and splitting only on periods in Thai words turned out to be an interesting exercise in regexes and character filters. The big problem occurs when a single letter is the second half of one regex match and the first half of the next regex match.

For example, I'd like ด3ด to be tokenized as ด + 3 + ด. But once the regex matches "ด3" in order to put a space between them, it continues matching after the 3, and so can't put a space between 3 and the next ด. The obvious (but potentially inefficient) solution is using a lookahead, but there's a bug that screws up the offsets into the original text, by excluding the final character. This screws up my tools—a minor problem—and would also screw up highlighting—a moderate problem.

Similarly, I want ด.ด.ด to be tokenized as ด + ด + ด, but for now—see T170625—I only want to replace periods with spaces between Thai characters (matching the behavior of the Thai tokenizer). In the case of single letters separated by periods, we have the same regex-matching problem as above.

Rather than trying to do something extra cunning with lookaheads and complex replacements,* the most straightforward solution is to break the number padding into two steps—Thai letter + number and number + Thai letter—and to run the exact same period replacement filter a second time to pick up any leftover strays. (I was a little worried about efficiency, but regex lookaheads aren't exactly the most efficient thing ever, and the simplified regexes are very simple, so there's no real difference on my virtual machine running Elasticsearch and CirrusSearch.)
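
The overlapping-match problem and the two-step workaround can be sketched in Python, whose re.sub also consumes matched characters and moves on. The character classes and function names here are rough assumptions for illustration, not the production patterns.

```python
import re

# Rough character classes (assumptions): Thai letters, vowels, and tone marks; digits.
THAI = "\u0e01-\u0e3a\u0e40-\u0e4e"
DIGIT = "0-9\u0e50-\u0e59"

def pad_numbers_one_pass(text):
    """Single alternation without lookahead: misses overlapping matches."""
    pattern = f"([{THAI}])([{DIGIT}])|([{DIGIT}])([{THAI}])"
    def add_space(m):
        left = m.group(1) or m.group(3)
        right = m.group(2) or m.group(4)
        return f"{left} {right}"
    return re.sub(pattern, add_space, text)

def pad_numbers_two_steps(text):
    """Thai letter + number first, then number + Thai letter."""
    text = re.sub(f"([{THAI}])([{DIGIT}])", r"\1 \2", text)
    return re.sub(f"([{DIGIT}])([{THAI}])", r"\1 \2", text)

def split_periods(text):
    """Replace a period between two Thai characters with a space."""
    return re.sub(f"([{THAI}])\\.([{THAI}])", r"\1 \2", text)

D = "\u0e14"  # do dek
one_pass = pad_numbers_one_pass(D + "3" + D)
two_steps = pad_numbers_two_steps(D + "3" + D)
periods_once = split_periods(D + "." + D + "." + D)
periods_twice = split_periods(periods_once)
```

In the one-pass version the 3 is consumed by the first match, so the following letter has nothing left to pair with. The same thing happens to the middle letter in the period case, which is why running the period filter twice picks up the stray.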

____ * As an exercise for the interested reader, this is the single pattern I would have used for spacing out numbers, if not for lookaheads resulting in incorrect offsets:

I also configured the  to check for the ICU plugin before assuming the ICU tokenizer is available, and I configured some additional char_filters (deleting zero width spaces and splitting on many dash-like things, and double quotes) to accommodate some of the Thai tokenizer's weaknesses, if the ICU tokenizer isn't available.

New Tokenizer Results
Comparing the ICU tokenizer to the Thai tokenizer, a lot is going on. Beyond what has been mentioned so far...


 * There are generally more tokens.
 * My Thai Wiktionary sample had 21% more tokens! (61K/291K)
 * My Thai Wikipedia sample had 4% more tokens (142K/3.4M)
 * The vast majority of new tokens are Thai (46K for Wiktionary, 132K for Wikipedia). The Wikipedia data also showed a dramatic decrease in the number of Thai types (distinct Thai words)—from 103K with the Thai tokenizer, down to 41K with the ICU tokenizer. The average Thai type length also dropped from 5.3 to 4.5 for the Wiktionary sample and 7.6 to 5.1 for the Wikipedia sample. These are both indicative of longer, more distinctive phrases (like พรรคประชาธิปัตย์, "Democratic party") being broken into smaller words (like พรรค + ประชาธิปัตย์), many of which are then also independently seen elsewhere.
 * There are also hundreds to thousands more tokens from Chinese, Japanese, and Lao, because the ICU tokenizer knows how to do basic segmenting for these languages. The Wiktionary sample had about 9K more Chinese tokens with the ICU tokenizer!
 * There are thousands more Latin tokens (and to a much lesser degree many other scripts) because of splitting on hyphens and other separators.
 * Multi-script tokens (including those with homoglyphs) are split up by the ICU tokenizer.
 * So Cоветские, which starts with a Latin C, is split into c and оветские, instead of being fixed by the homoglyph plugin (which runs after tokenization). There aren't very many of these—fewer than 10 in each 10K document sample.
 * On the other hand, fairly ridiculous tokens like ๆThe no longer exist.
 * There are no longer any ridiculously long Thai tokens.
 * The Thai Wikipedia sample with the Thai tokenizer had 50 tokens that were at least 50 characters long, including 2 over 200 characters long. The longest Thai token with the ICU Tokenizer is 20 characters long. These longer tokens seem to be names and technical terms, not whole sentences, so they are much more reasonable.
 * There are more long non-Thai tokens because the ICU tokenizer recognizes rare scripts, and they are converted to \u-encoding for Unicode (usually at a 12-to-1 increase in length), e.g., Gothic 𐌰𐍄𐍄𐌰 is indexed as \uD800\uDF30\uD800\uDF44\uD800\uDF44\uD800\uDF30.
 * There's evidence of the buffering bug in the Thai tokenizer when comparing it to the ICU tokenizer output. One example is that the only instance of all lowercase hitman had disappeared. Looking in the original text, I see that it was part of the name Whitman! So, that's an improvement!
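
The 12-to-1 length increase for rare scripts follows from how characters outside the Basic Multilingual Plane are written as UTF-16 surrogate pairs. A small Python sketch (to_u_escapes is a hypothetical helper, not part of any of the tools mentioned here):

```python
def to_u_escapes(text):
    """Write each UTF-16 code unit of the text as a \\uXXXX escape."""
    data = text.encode("utf-16-be")
    return "".join(
        f"\\u{data[i]:02X}{data[i + 1]:02X}" for i in range(0, len(data), 2)
    )

# The four-letter Gothic example from the notes (U+10330, U+10344, U+10344, U+10330).
gothic = "\U00010330\U00010344\U00010344\U00010330"
escaped = to_u_escapes(gothic)

# Each character becomes two escapes of six characters each.
ratio = len(escaped) // len(gothic)
```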

Back to Our Regularly Scheduled Program

 * Enabled custom ICU folding, saw lots of the usual folding effects.
 * I exempted Thai diacritics from folding.
 * Words like กลอง / กล่อง / กล้อง ("drum" / "box" / "camera")—which differ only by diacritics—are clearly distinct words and should not be folded together on Thai-language wikis.
 * ICU folding changes include:
 * ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around—in dozens of scripts.
 * Modifying diacritics in general and Arabic "tatweel" & Japanese "katakana-hiragana prolonged sound mark" when occurring alone became empty tokens (and were filtered).
 * Thai Wiktionary:
 * The biggest changes to Thai tokens are the stripping of non-Thai diacritics (e.g., mācron, cîrcumflex, tĩlde, uṉderline, etc., e.g., ด̄, ด̂, ด̃, ด̰, or ด̱), and the removal of modifier primesʹ and double primesʺ.
 * The most common token mergers were similar spellings and phonetic spellings, or either with stress markers, e.g., Japan, jaːpan, jaːˈpɑn, and jāpān.
 * Thai Wikipedia:
 * Very few Thai tokens were affected.
 * The most common token mergers are Latin tokens with diacritics, and a fair number of quotes being straightened.
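
As a rough picture of "fold everything except Thai diacritics", here is a Python sketch that strips combining marks after decomposition unless they are in the Thai block. This is a much cruder operation than ICU folding, which does far more than remove combining marks, and fold_except_thai is a hypothetical name.

```python
import unicodedata

def fold_except_thai(text):
    """Strip combining marks, but keep anything in the Thai block (U+0E00 to U+0E7F)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) or "\u0e00" <= ch <= "\u0e7f"
    ]
    return unicodedata.normalize("NFC", "".join(kept))

# Latin diacritics are removed.
latin = fold_except_thai("j\u0101p\u0101n")

# A non-Thai diacritic (combining macron) on a Thai letter is removed.
thai_with_macron = fold_except_thai("\u0e14\u0304")

# "drum" / "box" / "camera" differ only by Thai tone marks and stay distinct.
words = ["\u0e01\u0e25\u0e2d\u0e07", "\u0e01\u0e25\u0e48\u0e2d\u0e07", "\u0e01\u0e25\u0e49\u0e2d\u0e07"]
folded_words = [fold_except_thai(w) for w in words]
```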

Overall Impact for Thai

 * There are a lot more Thai Wiktionary tokens (~58K, ~20%) and a few more Thai Wikipedia tokens (~114K, ~3%), but many fewer distinct tokens, which comes from the ICU tokenizer dividing up words more finely, as discussed above.
 * ICU tokenization is the biggest source of changes in both wikis.
 * Thai Wiktionary: 3.1K tokens (1.1% of tokens) were merged into 1.4K groups (2.6% of groups).
 * Thai Wikipedia: 11K tokens (0.3% of tokens) were merged into 1.4K groups (0.8% of groups).