User:TJones (WMF)/Notes/Unpacking Notes


See TJones_(WMF)/Notes for other projects. See also T272606. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Why We Are Here

The purpose of unpacking analyzers is to enable them to be customized and upgraded with improvements that can be both language-specific (e.g., custom ICU folding) or generic (e.g., ICU normalization, or homoglyph processing).

The Unpacking Process

Gather Data

  • Gather 10K articles (without repeats) each from Wikipedia and Wiktionary for each language (custom Perl script, wikitext.pl)
    • Manual review/editing: remove leading white space, dedupe lines, review potential HTML tags ( search for <[a-z]+ ) (see the cleanup sketch after this list)
  • Gather 10K queries (up to 4 weeks' worth) from Wikipedia for each language (Jupyter notebook, Sample_Queries.ipynb on stat1007)
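
Not the actual wikitext.pl workflow, just a minimal sketch of the manual cleanup described above (strip leading whitespace, dedupe lines, flag possible HTML tags); file names and the review channel (just printing) are placeholders.

```python
import re

seen = set()
with open("wiki_sample_10k.txt", encoding="utf-8") as src, \
        open("wiki_sample_10k.clean.txt", "w", encoding="utf-8") as out:
    for line in src:
        line = line.lstrip()               # remove leading white space
        if not line or line in seen:       # drop empty lines and exact duplicates
            continue
        seen.add(line)
        if re.search(r"<[a-z]+", line):    # flag potential HTML tags for manual review
            print("REVIEW:", line.rstrip()[:80])
        out.write(line)
```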

Run Baselines

  • Per language (I've been working on three at a time recently to somewhat streamline the process):
    • set language to target and reindex
    • run analyze_counts.pl as baseline for wiki and wikt 10K samples

Unpack Analyzers

Re-enable Analyzer Upgrades

  • Re-enable homoglyphs and icu_norm upgrades
  • Per language:
    • set language to target and reindex
    • run analyze_counts.pl as upgraded for wiki and wikt 10K samples (see the _analyze sketch after this list)
    • run compare_counts.pl solo for baseline/upgraded for wiki and wikt; baseline_vs_upgraded comparison for wiki and wikt
      • solo—just trying to get the lay of the land
        • look at potential problem stems
        • look at largest Type Group Counts
          • anything around 20+ is interesting; well over 20 is surprising (but not necessarily wrong)
        • look at Tokens Generated per Input Token; usually expect 1 in baseline; some 2s with homoglyphs
        • look at Final Type Lengths; 1s are often CJK, longest are often URLs, German, spaceless languages, or \u encoded
      • comparison—see what changed
        • expect dotted-I regression
        • lots of hidden characters removed (soft hyphens, bidi marks, joiners and non-joiners)
        • Super- and subscript characters get converted; ß is converted to ss, too
        • Regularization of non-Latin characters is common, particularly Greek ς to σ
        • investigate anything that doesn’t make sense
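
The analyze_counts.pl runs above boil down to pushing sample text through the index's analysis chain and tallying what comes out. Below is a minimal sketch of that kind of spot check using Elasticsearch's standard _analyze API; the host, index name, field, and sample words are placeholders, not the real script or data.

```python
import requests

ES = "http://localhost:9200"     # assumed local test cluster
INDEX = "dewiki_content"         # hypothetical index name

def analyze(text, field="text"):
    """Return the tokens the index's analyzer for `field` produces for `text`."""
    resp = requests.post(f"{ES}/{INDEX}/_analyze", json={"field": field, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

# rough "tokens generated per input token" check on a few sample words
for word in ["Größe", "İstanbul", "résumé", "resume"]:
    tokens = analyze(word)
    print(f"{word!r} -> {tokens} ({len(tokens)} token(s))")
```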

Repair Unpacked & Upgraded Analyzers

  • Per language:
    • Make any needed “repairs” to accommodate ICU normalization
      • possibly just dotted_I_fix (see the sketch after this list)
    • set language to target and reindex
    • run analyze_counts.pl as repaired for wiki and wikt 10K samples
    • run compare_counts.pl solo for repaired for wiki and wikt; upgraded_vs_repaired comparison for wiki and wikt
      • solo—just trying to get the lay of the land
      • comparison—look for expected changes (maybe just dotted-I)
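
For reference, the dotted_I_fix repair mentioned above is just a small mapping char_filter applied before lowercasing/ICU normalization; a rough sketch (names are illustrative, not the exact CirrusSearch settings):

```python
import json

dotted_I_fix = {
    "type": "mapping",
    "mappings": ["İ=>I"],   # map dotted capital I to plain I before normalization
}
print(json.dumps(dotted_I_fix, ensure_ascii=False, indent=2))

# Why it's needed: generic Unicode lowercasing turns U+0130 into "i" plus a
# combining dot above, which then fails to match a plain "i".
print("İ".lower() == "i\u0307")   # True
```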

Enable ICU Folding

  • Per language:
    • enable ICU Folding
      • add language code to $languagesWithIcuFolding, and any folding exceptions to getICUSetFilter() (see the settings sketch after this list)
      • add asciifolding to filter list, usually in last place
    • set language to target and reindex
    • run analyze_counts.pl as folded for wiki and wikt 10K samples
    • run compare_counts.pl solo for folded for wiki and wikt; repaired_vs_folded comparison for wiki and wikt
      • solo—potential problem stems can show systematic changes, even if they aren’t really problems
        • elision (l’elision, d’elision, qu’elision, s’etc.) can throw this off
      • comparison—look for expected changes (rare characters and variants folded, diacritics folded, etc.)
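
In CirrusSearch this is driven from PHP ($languagesWithIcuFolding and getICUSetFilter()); the sketch below shows roughly the Elasticsearch analysis filters that result. The Danish-style exception set is just an example, and the exact parameter spelling (unicodeSetFilter vs unicode_set_filter) depends on the ICU plugin version.

```python
import json

analysis_filters = {
    "icu_folding_filter": {
        "type": "icu_folding",
        # fold everything *except* the language's own letters (Danish shown here)
        "unicodeSetFilter": "[^æøåÆØÅ]",
    },
    # per the checklist above, asciifolding goes at (or near) the end of the
    # analyzer's token filter list
    "asciifolding": {"type": "asciifolding"},
}
print(json.dumps(analysis_filters, ensure_ascii=False, indent=2))
```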

Compare Final Analyzer to Baseline

  • Per language:
    • run compare_counts.pl baseline_vs_folded comparison for wiki and wikt
      • comparison—look at the overall impact of unpacking, upgrades, and ICU folding
        • Token delta: expect small numbers (<100) unless something “interesting” happened
        • New Collision Stats gives a sense of the overall impact: the number of tokens that merge into other groups (see the sketch after this list).
          • Typically < 3% on each number, with higher values in Wiktionary
        • Possibly a few Lost pre-analysis tokens
        • Net Gains: expect plenty of changes; high-impact changes are usually—
          • one- or two-letter tokens (e.g., a picks up á, à, ă, â, å, ä, ã, ā, ə, ɚ)
          • something with a lot of variants that includes a folded character (e.g., abc, abcs, l'abc, l'abcs, d'abc, d'abcs, qu'abc, qu'abcs, etc. (with straight quotes) picks up l’abc, l’abcs, d’abc, d’abcs, qu’abc, qu’abcs, etc. (with curly quotes))
          • or a diacriticless typo (Francois) picks up all the forms with diacritics (François—it’s hard to find an example in English)
        • Don’t expect any New Splits, Found pre-analysis tokens, or Net Losses unless there was additional customization
    • Summarize findings (here and in Phabricator)
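
A toy illustration of how the collision numbers above are counted: a token "merges" when the new analysis chain moves it into a stemming group that already contained other tokens. The token-to-stem maps here are made up; compare_counts.pl derives them from the real analyzer output.

```python
from collections import defaultdict

baseline = {"resume": "resume", "resumes": "resume", "résumé": "résumé", "résumés": "résumé"}
folded   = {"resume": "resume", "resumes": "resume", "résumé": "resume", "résumés": "resume"}

def groups(token_to_stem):
    g = defaultdict(set)
    for token, stem in token_to_stem.items():
        g[stem].add(token)
    return g

before = groups(baseline)

# tokens that changed stem and landed in a group that already existed
merged = [t for t in folded if folded[t] != baseline[t] and folded[t] in before]
print(f"{len(merged)} of {len(folded)} tokens merged into other groups: {sorted(merged)}")
```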

Merge Your Patch

  • When everything looks good and makes sense, submit the patch.
    • When the patch is merged, it’s time to reindex.

Prep Query Data

  • Before reindexing, using the 10K Wikipedia query sample:
    • Filter “bad queries” and randomly sample 3K queries (using a custom Perl script, run_queries.pl; see the sketch after this list)
      • Review the “bad queries” to make sure the filters are behaving reasonably for the given language
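
Not run_queries.pl itself; a minimal sketch of the filter-and-sample step. Only the mechanical filters (URL-ish strings, overlong queries) are shown; the real filtering also drops porn and other junk and should be sanity-checked per language. File names and patterns are placeholders.

```python
import random
import re

with open("queries_10k.txt", encoding="utf-8") as f:
    queries = [q.strip() for q in f if q.strip()]

def looks_bad(q):
    # URL-ish strings or very long queries; the real filters cover much more
    return bool(re.search(r"https?://|\.(com|org|net)\b", q)) or len(q) >= 100

kept = [q for q in queries if not looks_bad(q)]
random.seed(0)                                   # reproducible sample
sample = random.sample(kept, min(3000, len(kept)))

with open("queries_3k.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sample) + "\n")
```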

Reindexing and Before-And-After Analysis

  • While reindexing Wikipedia for a given language, kick off “brute-force” sampling (using a custom Perl script, brute.pl)
    • The brute-force script runs the same 3K queries every 10 minutes while reindexing
    • Let it run 2–3 more iterations after reindexing is complete
    • You may have to throw out a query run if reindexing finished in the middle of the run
    • Using time stamps from the reindexing and query runs, figure out the smallest gap between a “before” and an “after” query run and compare them (using a custom Perl script, comp_queries.pl), noting differences in zero results rate, increases and decreases in result counts, and changes in top results (see the comparison sketch after this list).
    • Use similarly spaced pre-reindexing and post-reindexing runs as controls to get a handle on normal variability, and compare that to the before-and-after results.
    • Comparing the earliest and latest pre-reindexing runs also allows you to judge what is random fluctuation and what is directional. e.g.:
      • if 10-minute interval comparisons all give 2-3% changes in top result, and a 60-minute interval gives 2.3% changes in top results, it’s probably random noise.
      • If 10-minute interval comparisons all give 2-3% changes in increased results, and a 60-minute interval gives 6% change in increased results, it’s probably partly noise overlaying a general increasing trend.
  • Summarize findings (here and in Phabricator)
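
Not comp_queries.pl itself, but a sketch of the kind of before/after comparison it does, assuming each query run was saved as a TSV of query, result count, and top result (the real run format differs).

```python
import csv

def load(path):
    with open(path, encoding="utf-8") as f:
        return {q: (int(n), top) for q, n, top in csv.reader(f, delimiter="\t")}

before = load("run_before.tsv")
after = load("run_after.tsv")
shared = before.keys() & after.keys()

zrr_before = sum(before[q][0] == 0 for q in shared) / len(shared)
zrr_after = sum(after[q][0] == 0 for q in shared) / len(shared)
more = sum(after[q][0] > before[q][0] for q in shared) / len(shared)
fewer = sum(after[q][0] < before[q][0] for q in shared) / len(shared)
top_changed = sum(after[q][1] != before[q][1] for q in shared) / len(shared)

print(f"zero-results rate: {zrr_before:.1%} -> {zrr_after:.1%}")
print(f"more results: {more:.1%}   fewer results: {fewer:.1%}   top result changed: {top_changed:.1%}")
```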

Post-Reindexing Top Result Changes

I've seen something of a trend across wikis: The number of searches that have their top result change decreases dramatically after reindexing. It is possible that there is some effect from changed word stats from merging words after ICU Normalization or ICU Folding (e.g., resume and résumé are counted together). And of course new content may have been added to the Wiki that rightfully earns a place as the new top result for a given query.

However, after consulting with the Elasticsearch Brain Trust™, we decided that the best explanation for this is increased consistency across shards after reindexing.

The most common cause of short term changes in top results is having the query served by a different shard. In addition to having different statistics for uncommon words that are spread unevenly across shards, word statistics are not immediately updated when documents are deleted or changed. Over time the shards are more likely to differ from each other.

After reindexing, every shard has a reasonably balanced brand-spanking new index with no history of deletions and changes, so the shards are likely more similar in their stats (and thus in their reporting of the top result).

Spanish Notes (T277699)

  • Usual 10K sample each from Wikipedia and Wiktionary.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a few examples in each sample
  • Enabled ICU normalization and saw the usual normalization
    • Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Potential concerns:
      • 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
      • However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
      • nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
  • Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
  • Updated test fixtures for Spanish and multi-language tests.
  • Refactored building of mapping character filters. There are so many that are just dealing with dotted I after unpacking.

Tokenization/Indexing Impacts

  • Spanish Wikipedia (eswiki)
    • There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, º, which normalize to no, a, o, which are stop words (but captured by the plain field).
    • About 1.2% of tokens merged with other tokens. The tokens in queries are likely to be somewhat similar.
  • Spanish Wiktionary (eswikt)
    • There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
    • Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia (T282808)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background

  • I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, and filtered 89 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (each run took about 9 minutes to complete) 6 times. When the reindexing finished, I stopped the 7th iteration, since it had only just started and would have been a mixed before/after run; it started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
  • I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
  • I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.

Expected Results

  • Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results

  • The number of queries getting zero results held steady at 19.3%
  • The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10 minute intervals; 5.2% over 50 minutes)
  • The number of queries getting fewer results is noise (0.1% to 1.4% in 10 minute intervals; 1.4% over 50 minutes)
  • The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10 minute intervals; 3.8% over 50 minutes)
  • The number of queries changing their top result is noise (0.7% to 0.9% in 10 minute intervals; 0.7% over 50 minutes)
  • These results are also generally consistent with the control tests I ran in April and May.

Reindexing Results

  • The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
  • The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
  • The number of queries getting a different number of results increased by 20.2% (vs. the 0.7%–2.4% range seen in control).
  • The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). It's improbable, but not impossible, that this is still random noise. I don't have any obvious explanation after looking at the queries in question.
  • The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.

Post-Reindex Control

  • The one control test I ran after reindexing showed changes approximately within the normal range, except for the changes in top result, which was 0% (vs 0.7%–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.

German/Dutch/Portuguese Notes (T281379)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
  • Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • German required customization to maintain ß for stopword processing.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
    • Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)

German Notes

General German

  • ICU normalization interacts with German stop words: mußte gets filtered (as musste) and daß does not get filtered (as dass). Fortunately, a few years ago, David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding!! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss' rather than lowercase ß, so I mapped ẞ to ß in the same character filter used to fix the dotted-I regression (see the sketch after this list).
    • Sorting all this out also seems to have fixed T87136.
  • There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
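
A sketch of how the German repairs above might look as analysis settings: the dotted-I char_filter also maps ẞ to ß, ICU normalization gets a small exception set so ß survives for stop word processing, and ICU folding keeps its own, larger set. The exception sets and names are illustrative, and the parameter spelling (unicodeSetFilter vs unicode_set_filter) depends on the ICU plugin version.

```python
import json

german_analysis = {
    "char_filter": {
        "german_charfilter": {
            "type": "mapping",
            "mappings": ["İ=>I", "ẞ=>ß"],   # dotted-I fix plus capital sharp s
        }
    },
    "filter": {
        "german_icu_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "unicodeSetFilter": "[^ß]",       # leave ß alone until after stop words
        },
        "german_icu_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^äöüÄÖÜß]", # assumed folding exceptions, for illustration
        },
    },
}
print(json.dumps(german_analysis, ensure_ascii=False, indent=2))
```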

German Wikipedia

  • Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
  • It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
  • About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.

German Wiktionary

  • Most common ICU normalizations are long-s's (ſ) (e.g., Auguſt), but that's not bad.
  • The longest tokens in my German Wiktionary sample are of this sort: \uD800­\uDF30­\uD800­\uDF3D­\uD800­\uDF33­\uD800­\uDF30­\uD800­\uDF43­\uD800­\uDF44­\uD800­\uDF30­\uD800­\uDF3F­\uD800­\uDF39­\uD800­\uDF3D (here with extra soft hyphens so it will wrap), which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
  • About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.

Dutch Notes

General Dutch

  • The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German, rather than Dutch, so that's a good thing.
  • There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.

Dutch Wikipedia

  • Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
  • About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.

Dutch Wiktionary

  • The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
  • About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.

Portuguese Notes

Portuguese Wikipedia

  • There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
  • The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and the ﬁ and ﬂ ligatures being expanded to fi and fl.
  • Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
  • About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.

Portuguese Wiktionary

  • There's a very small impact on token counts (0.008% out of ~147K), which are mostly homoglyphs.
  • Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
  • About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.

DE/NL/PT Reindexing Impacts

Impact Tool Filtering Improvements During German, Dutch, Portuguese Testing

While working on German, I discovered that 28 of the filtered German queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.

I also added additional filtering for more URLs, email addresses, Cyrillic-flavored junk, and very long queries (≥100 characters) that get 0 results.

I tested these filtering changes on German, Dutch, Portuguese, Spanish, English, Khmer, Basque, Catalan, and Danish query corpora.

Unpacking + ICU Norm + ICU Folding + ß/ss Split Impact on German Wikipedia (T284185)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
  • I also disabled the folding of ß to ss in the plain field, which had a small negative impact on recall in certain corner cases. (See T87136 for rationale.)

Background

  • I pulled a 10K sample of German Wikipedia queries from April of 2021, and filtered 134 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
    • I later discovered that 28 of the filtered queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on German Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
  • I compared each iteration against the subsequent one, and compared the first and last runs before reindexing to get insight into "trends" vs "noise" in the comparisons.

Control Results

  • The number of queries getting zero results held steady at 22.0%
  • The number of queries getting a different number of results increases slightly over time (0.3% to 1.6% in 10 minute intervals; 3.6% over 90 minutes)
    • The number of queries getting fewer results is noise (0.0% to 0.4% in 10 minute intervals; 0.5% over 90 minutes)
    • The number of queries getting more results increases slightly over time (0.2% to 1.5% in 10 minute intervals; 3.2% over 90 minutes)
  • The number of queries changing their top result is noise (1.5% to 2.2% in 10 minute intervals; 1.9% over 90 minutes)

Reindexing Results

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
  • The zero results rate dropped to 21.7% (-0.3% absolute change; -1.4% relative change).
  • The number of queries getting a different number of results increased to 13.6% (vs. the 0.3%–1.6% range seen in control).
    • The number of queries getting fewer results was about 4 times the max of the control range (1.8% vs 0.0%–0.4%). 7 of 54 involve ss or ß, but I don't see a pattern for the rest. 37 of 54 only got 1 fewer result, so the impact is not large.
    • The number of queries getting more results was 11.5% (vs the control range of 0.2%–1.5%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 4.0% (vs. the control range of 1.5%–2.2%; that's less than 2x increase). I looked at some of these, and some are definitely the result of folding allowing for matching words in the top result.

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.2% vs 1.5%–2.2%).

Observations

  • The most dramatic decrease in results (both in absolute terms and percentage-wise), was for the query was heisst s.w.a.t. ("what does S.W.A.T. do?"): from 3369 down to 67 results. Currently, word_break_helper is configured for the plain field, but not the text field (as before), and ß no longer maps to ss in the plain field. word_break_helper breaks up s.w.a.t. into four separate letters in the plain field (but not the text field), improving recall. So, the query in the plain field is was + heisst + s + w + a + t, while the text field query is heisst/heißt + s.w.a.t. Since heißt is much more common than heisst (68K vs 2K results), the plain query returns many fewer results.
    • On the one hand, enabling word_break_helper everywhere would be nice, but we also need proper acronym support! (T170625)

Unpacking + ICU Norm + ICU Folding Impact on Dutch Wikipedia (T284185)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding likely had a minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU Folding—buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote), and a decrease in changes to the top result (a general side-effect of reindexing).

Background

  • I pulled a 10K sample of Dutch Wikipedia queries from April of 2021, and filtered 125 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on Dutch Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
  • I was unable to time the query runs with reindexing just right, so the reindexing finished during one of the query runs. I had to drop that one, so comparisons are across every other run (i.e., ~20 minutes apart). I also compared the first and last runs before and after reindexing to try to get insight into "trends" vs "noise" in the comparisons, but the shorter total time (~30 minutes) wasn't really long enough to let the signal emerge from the noise.

Control Results

  • The number of queries getting zero results held steady at 23.3%
  • The number of queries getting a different number of results is hard to judge (0.7% to 1.1% in 20 minute intervals; 1.2% over 30 minutes)
    • The number of queries getting fewer results is possibly noise (0.2% to 0.8% in 20 minute intervals; 0.8% over 30 minutes)
    • The number of queries getting more results is probably noise (0.3% to 0.8% in 20 minute intervals; 0.5% over 30 minutes)
  • The number of queries changing their top result is probably noise (1.2% to 1.4% in 20 minute intervals; 1.2% over 30 minutes)

Reindexing Results

  • While unpacking an analyzer should have no impact on results, adding ICU folding likely had a minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU folding), and a decrease in changes to the top result (a general side-effect of reindexing).
  • The zero results rate dropped to 23.2% (-0.1% absolute change; -0.4% relative change).
  • The number of queries getting a different number of results increased to 8.0% (vs. the 0.7%–1.1% range seen in control).
    • The number of queries getting fewer results was within the control range (0.3% vs 0.2%–0.8%).
    • The number of queries getting more results was 7.5% (vs the control range of 0.3%–0.8%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations

  • Zero-results changes are all due to ICU folding, so that buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote. These are all fairly rare words that got ≤5 results with ICU Folding.
  • Large increases in the number of results and changes in the top result are mostly clearly attributable to ICU folding.

Unpacking + ICU Norm + ICU Folding Impact on Portuguese Wikipedia (T284185)

Summary

  • ICU folding increases recall for some queries, affecting zero results rate and the total number of results returned. Missing tildes (a instead of ã, or o instead of õ) are the biggest source of changes, so this is a very good change for Portuguese searchers who omit them!

Background

  • I pulled the usual sample of 10K queries from Portuguese Wikipedia (April 2021), filtered 149 queries, and randomly sampled 3K from the remainder.
  • I used a brute-force diff strategy, with control diffs before and after (at ~10 minute intervals).
    • The before/after time difference was 15 minutes because of the exact time reindexing finished.

Control Results

  • The number of queries getting zero results held steady at 18.7%
  • The number of queries getting a different number of results is increasing (0.8% in 10 minute intervals; 1.5% over 20 minutes)
    • The number of queries getting fewer results is noise (0.1% to 0.3% in 10 minute intervals; 0.3% over 20 minutes)
    • The number of queries getting more results is increasing (0.6% to 0.7% in 10 minute intervals; 1.2% over 20 minutes)
  • The number of queries changing their top result is noise (1.1% to 1.4% in 10 minute intervals; 1.4% over 20 minutes)

Reindexing Results

  • The zero results rate dropped to 18.3% (-0.4% absolute change; -2.1% relative change).
  • The number of queries getting a different number of results increased to 15.3% (vs. the 0.8% seen in control).
    • The number of queries getting fewer results was similar to the control range (1.0% in 15 minutes vs 0.6%–0.7% in 10 minutes and 1.2% in 20 minutes).
    • The number of queries getting more results was 13.8% (vs the control range of 0.6%–0.7%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations

  • Zero-results changes are mostly clearly due to ICU folding.
  • Large increases in the number of results and changes in the top result are also mostly attributable to ICU folding—particularly sao matching são, which increased hits from 300 to 21K!
    • The one query I couldn't figure out was 1926~. The absolute increase is fairly large (~5K) but the relative increase is not (2.3%, out of 218K).
  • Overall, missing tildes (a instead of ã, or o instead of õ) are the biggest sources of changes.

Basque, Catalan, and Danish Notes (T283366)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.
  • Stemming observations:
    • Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
    • Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
    • Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a handful of examples in all six samples.
    • Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
    • Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.
  • Enabled ICU normalization and saw the usual normalizations.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
    • Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). "e" and "o" are stop words—"o" means "or", but "e" just seems to refer to the letter; weird. The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wikipedia:
      • Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
        • The majority are ç → c; most look ok
        • A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
        • Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
      • Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
    • Danish: Also straightened a fair number of curly quotes.

Overall Impact

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
    • Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
    • Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
    • Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
    • Catalan Wikipedia:
      • 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
      • 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
    • Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
    • Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)

CA/DA/EU Reindexing Impacts

An Unexpected Experiment

David needed to reindex over 800 wikis for the ores_articletopicsweighted_tags rename, including all of the large wikis covered by unpacking Catalan, Danish, and Basque. (There was another small wiki for the Denmark Wikimedia chapter, which I reindexed.)

Because I couldn't control the exact timing of the reindexing, I ran 5 pre-reindex control query runs at 10 minute intervals for comparison, and then ran follow-up query runs at approximately 1-day intervals (usually ±15 minutes, sometimes ±2 hours).

The exact number of pre-reindex controls and post-reindex controls for each language differed because they were reindexed on different days.

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary

  • Catalan has a very large improvement in zero-results rate (8.1% relative improvement, or 1 in 12), largely driven by the fact that people type -cio for -ció (which is cognate with Spanish -ción and English -tion).
  • In general, the impact on Danish was very mild; the general variability in Danish query results is lower than for other wikis.
  • Basque improvements are in large part due to queries in Spanish that are missing the expected Spanish accents.

Background

  • I pulled a sample of 10K Wikipedia queries from April of 2021 (1 week each for Catalan and Danish, the whole month for Basque). I filtered obvious porn, urls, and other junk queries from each sample (ca:237, da:396, eu:438, urls most common category in all cases) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Catalan Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 14.9% to 13.7% (-1.2% absolute change; -8.1% relative change).
  • The number of queries that got more results right after reindexing was 30.4%, vs. the pre-reindex control of 17.1% and post-reindex control of 14.8–17.6%.
  • The number of queries that changed their top result right after reindexing was 6.2%, vs. the pre-reindex control of 1.0% and post-reindex control of 0.6–2.0%.

Observations

  • The most common cause of improvement in zero-results is matching -cio in the query with -ció in the text, and they generally look very good.
  • Some of the most common causes of an increased number of results include -cio/-ció, other accents missing in queries, and c/ç matches. Not all of the highest impact c/ç matches look great, but these are edge cases. From the earlier analysis chain analysis (see above), I expect c/ç matches are overall a good thing, though we should keep an eye out for reports of problems.

Unpacking + ICU Norm + ICU Folding Impact on Danish Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 28.6% to 28.2% (-0.4% absolute change; -1.4% relative change).
  • The number of queries that got more results right after reindexing was 9.0%, vs. the pre-reindex control of 2.1–3.1% and post-reindex control of 2.0–3.0%.
  • The number of queries that changed their top result right after reindexing was 1.7%, vs. the pre-reindex control of 0.7–0.9% and post-reindex control of 0.2–0.9%.

Observations

  • Generally the impact on Danish Wikipedia was very muted compared to most others we've seen so far.

Unpacking + ICU Norm + ICU Folding Impact on Basque Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 24.4% to 23.1% (-1.3% absolute change; -5.3% relative change).
  • The number of queries that got more results right after reindexing was 21.6%, vs. the pre-reindex control of 6.9–7.9% and post-reindex control of 7.2–10.0%.
  • The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.2–0.6% and post-reindex control of 0.1–0.2%.

Observations

  • A lot of the rescued zero-results and some of the other improved queries are in Spanish, and are missing the expected Spanish accents.

Unexpected Experiment, Unexpected Results!

The results of this unexpected experiment are actually very good. With fairly different behavior from all three of these samples (Catalan with big improvements, Basque with more typical improvements, and Danish with smaller improvements and generally less variability), the impacts—especially now that we know where to expect them—are easy to detect at one-day intervals, despite the general variability in results over time. This means I can back off my sampling rate from ~10 minutes (which is sometimes hard to achieve) to something a little easier to handle—like half-hourly or hourly.

Czech, Finnish, and Galician Notes (T284578)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary) etc.
  • Stemming observations:
    • Czech Wikipedia had 37 distinct tokens in its largest stemming group.
      • The Czech stemmer stems single letters c → k, z → h, č → k, and ž → h (though plain z is a stop word) and ek → k and eh → h. This seems like an over-aggressive stemmer... looking at the code, it is modifying endings even when there is nothing that looks like a stem. I will submit a ticket or maybe work on a patch as a 10% project.
    • Finnish Wikipedia had 61 distinct tokens in its largest stemming group.
    • Galician Wikipedia had 66 distinct tokens in its largest stemming group.
      • Since I can recognize some cognates in other Romance languages, I can say that the largest group is a little aggressive; it includes Ester, Estaban, estación, estato, estella, estiño, plus many forms of estar.
      • Galician also has a very large number of words in other scripts, which lead to some very long tokens, like the 132-character \u-encoded version of 𐍀𐌰𐌿𐍂𐍄𐌿𐌲𐌰𐌻𐌾𐌰, Gothic for "Portugal".
      • Galician Wiktionary likes to use superscript numbers for different meanings of the same word, so the entry for canto has canto¹ through canto⁴, which get indexed as canto1 through canto4—there are a fair number of such tokens. Fortunately, the unnumbered version should always be on the same page.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and found plenty of examples.
    • There are some Greek/Latin examples in Czech
      • Including "incorrect" Greek letters in IPA on cswikt (oddly, there are some Greek letters that are commonly used in IPA and others that have Latin equivalents that are used instead, and for a couple it's a free-for-all!)
    • There are Cyrillic/Greek and Latin/Greek examples in Finnish Wikipedia and Galician Wiktionary.
    • Galician Wikipedia had lots of Latin/Greek tokens—though many seem to be abbreviations for scientific terms... but there are a few actual mistakes in there, too.
  • Enabled ICU normalization and saw the usual normalizations.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • Czech: the usual various character regularizations, invisibles (bidi, zero-width (non)joiners, soft hyphens), a few #ª ordinals
      • Finnish: mostly ß/ss & soft hyphens
      • Galician: lots of #ª ordinals, lots of invisibles
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [Áá, Čč, Ďď, Éé, Ěě, Íí, Ňň, Óó, Řř, Šš, Ťť, Úú, Ůů, Ýý, and Žž] for Czech.
    • Exempted [Åå, Ää, Öö] for Finnish.
    • Exempted [Ññ] for Galician.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Czech: lots more tokens with Latin + diacritics than usual, since the list of exemptions is pretty big, and exempts some characters used in other languages, like French and Polish.
    • Finnish: lots of š and ž, which are supposed to be used in loan words and foreign names, but are often simplified to s or z (or sh and zh, but that is probably outside our scope).
    • Galician: Nothing really sticks out as particularly common; just a collection of the usual folding mergers.

Czech, Finnish, Galician Reindexing Impacts

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary

  • The Czech and Finnish Wikipedia samples showed clear but rather muted impact on user query results. The Galician results are a little more robust and show a more consistent pattern of searchers not using standard accents (rather than just problems with "foreign" diacritics).

Background

  • I pulled a sample of 10K Wikipedia queries from approximately July of 2021 (1 week each for Czech and Finnish, June through August for Galician). I filtered obvious porn, urls, and other junk queries from each sample (Czech:152, Finnish:226, Galician:928, urls are the most common category in all cases, with numbers and junk being common for all, as well. Galician also had a lot of porn queries, and overall more useless queries, which is a trend on smaller wikis). I randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Czech Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 23.8% to 23.6% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 8.4%, vs. the pre-reindex control of 0.2–0.7% and post-reindex control of 0.1–0.7%.
  • The number of queries that changed their top result right after reindexing was 1.6%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0%.

Observations

  • Generally the impact on Czech Wikipedia was rather muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Finnish Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 24.6% to 24.4% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 9.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.4–0.5% and post-reindex control of 0.0–0.1%.

Observations

  • Generally the impact on Finnish Wikipedia was also muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Galician Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 18.1% to 17.5% (-0.6% absolute change; -3.3% relative change).
  • The number of queries that got more results right after reindexing was 18.6%, vs. the pre-reindex control of 0.2–0.5% and post-reindex control of 0.1–0.7%.
  • The number of queries that changed their top result right after reindexing was 4.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.

Observations

  • The most common causes of improvement in zero-results came from matching missing accents on words that end with vowel + n. Cognate with what we've seen before, -cion for -ción is common, along with general accents missing from -ón/-ín/-ún endings.
  • The most common causes of an increased number of results and changes in the top result include correcting for missing accents from final vowel + n, and general incorrect (missing, extra, or wrong) diacritics.

Hindi, Irish, Norwegian Notes (T289612)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
    • Except for Irish Wiktionary, which is quite small; I used a 1K sample for gawikt.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, compounds, a bit of likely vandalism; etc.
  • Stemming observations:
    • Irish Wikipedia had 16 distinct tokens in its largest stemming group.
    • Norwegian Wikipedia had 18 distinct tokens in its largest stemming group.
    • Hindi Wikipedia had 46 distinct tokens in its largest stemming group.
      • The first pass at analysis showed 1780 "potential problem" stems in the Hindi Wikipedia data, which are ones where the stemming group has no common prefix and no common suffix. This isn't particularly rare, but there usually aren't so many. It turns out that the majority (~1400) were caused by Devanagari numerals and Arabic numerals (e.g., १ and 1). I added folding rules to my analysis to handle those cases. Another common cause were long versions of vowels, such as अ (a) and आ (ā), which seem to frequently alternate at the beginning of words that have the same stem. A few more folding rules and I got down to a more normal number of "potential problem" stems—just 12—and they were all reasonable.
    • A smattering of mixed-script tokens.
      • Hindi had many non-homoglyph mixed script tokens, mostly Devanagari and another script. Many of these were separated by colons or periods, making me think word_break_helper could be useful, especially with better acronym handling.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
      • Though not for Irish! Since Irish has language-specific lowercasing rules, both lowercasing and ICU normalization happen and lowercasing handles İ correctly.
    • Most common normalizations:
      • Irish Wikipedia also uses Mathematical Bold Italic characters (e.g., 𝙄𝙧𝙚𝙡𝙖𝙣𝙙) rather than bold and italic styling in certain cases, such as names of legal cases.
        • One instance of triple diacritics stuck out: gCúbå̊̊
      • Hindi had lots of bi-directional symbols, including on many words that are not RTL.
      • Norwegian had the usual various character regularizations, mostly diacritics, plus a handful of invisibles.
  • Further Customization—Irish
    • Older forms of Irish orthography used an overdot (ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ/ẛ ṫ) to indicate lenition, which is now usually indicated with a following h (bh ch dh fh gh mh ph sh th). Since these are not commonly occurring characters, it is easy enough to do the mapping (ḃ => bh, etc.) as a character filter (see the sketch after this list). It doesn't cause a lot of changes, but it does create a handful of good mergers.
    • Another feature of Gaelic script is that its lowercase i is dotless (ı). However, since there is no distinction between i and ı in Irish, i is generally used in printing and electronic text. ICU folding already converts ı to i.
      • As an example, amhráin ("songs") appears in my corpus both in its modern form, and its older form, aṁráın (with dotted ṁ and dotless ı). Adding the overdot character filter (plus the existing ICU folding) allows these to match!
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Nothing exempted for Irish or Hindi.
    • Exempted Ææ, Øø, and Åå for Norwegian.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
      • Irish uses a fair number of acute accents to mark long vowels, though it seems to sometimes be omitted (perhaps as a mistake). There are quite a few mergers between diacriticked (or partly diacriticked) forms and fully diacriticked forms, such as cailíochta and cáilíochta. There are a few potential incorrect forms—I recognize some English words that happen to look like forms of Irish words—but there aren't a lot, and some of them are already conflated by the current search.
    • Hindi: Most folding affects Latin words, and most of the Hindi words that were affected had bidi and other invisible characters stripped.
    • Norwegian Wiktionary had a surprising number of apparently Romance-language words that had their non-Norwegian diacritics normalized away.
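
A sketch of the Irish overdot repair described in the list above, as a mapping char_filter (only the lowercase forms listed there; uppercase variants and analyzer wiring are omitted):

```python
import json

overdot = "ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ẛ ṫ".split()
plain = "bh ch dh fh gh mh ph sh sh th".split()

irish_overdot_fix = {
    "type": "mapping",
    "mappings": [f"{o}=>{p}" for o, p in zip(overdot, plain)],
}
print(json.dumps(irish_overdot_fix, ensure_ascii=False, indent=2))
# ICU folding separately handles dotless ı -> i, so aṁráın can match amhráin.
```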

Overall Impact

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
  • ICU folding is the biggest source of changes in all wikis—as expected.
    • Irish Wikipedia: 134,095 tokens (15.887% of tokens) were merged into 2,524 groups (2.822% of groups).
    • Irish Wiktionary: 130 tokens (1.272% of tokens) were merged into 44 groups (1.074% of groups).
      • Irish Wiktionary mergers may be less numerous because of the smaller 1K sample size.
      • Irish had a much bigger apparent impact (15.887% of tokens), which is partially an oddity of accounting.
        • Looking at amhrán ("song") as an example, the original main stemming group consisted of amhrán, Amhrán, amhránaíocht, Amhránaíocht, amhránaíochta, Amhránaíochta, d’amhrán, nAmhrán, and tAmhrán. Another group without acute accents—possibly typos—consisted of amhran and Amhran. The larger group (which has more members that are also more common) is counted as merging into the smaller group because the new folded stem is amhran, not amhrán, giving 9 mergers rather than 2.
    • Hindi Wiktionary: 4 tokens (0.002% of tokens) were merged into 4 groups (0.012% of groups).
    • Hindi Wikipedia: 296 tokens (0.019% of tokens) were merged into 150 groups (0.128% of groups).
      • Hindi was barely affected by ICU folding, since it doesn't do much to Hindi text.
    • Norwegian Wiktionary: 1,310 tokens (1.229% of tokens) were merged into 990 groups (4.302% of groups)
    • Norwegian Wikipedia: 6,731 tokens (0.424% of tokens) were merged into 1,633 groups (0.979% of groups)
      • Generally, the merges that resulted from ICU folding in Norwegian were significant, but not extreme.

Irish, Hindi, Norwegian Reindexing Impacts

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary

  • Specific new matches in all three (Irish, Hindi, & Norwegian) Wikipedias are good.
  • The impact overall on the zero-results rate is fairly small for all three.
    • The zero-results rate for Hindi Wikipedia, independent of recent changes, is really high (60+%), so I investigated a bit. Transliteration of Latin queries to Devanagari could have a sizable impact.
  • Irish and Norwegian had a sizable increase in total results, and a noticeable increase in top results. Hindi had much smaller increases for both.
    • Irish changes were dominated by Irish diacritics (which are not part of the alphabet), while the Norwegian changes were dominated by foreign diacritics.

Background

  • I tried to pull a sample of 10K Wikipedia queries from June–August of 2021 (1 week in July each for Hindi and Norwegian, almost three months for Irish). I was only able to get 2,543 queries for Irish Wikipedia. I filtered obvious porn, urls, and other junk queries from each sample (Irish:959, Hindi:528, Norwegian:250, with urls and porn being the most common categories) and randomly sampled 3000 queries from the remainder (there were only 1448 unique queries left for the Irish sample).

Unpacking + ICU Norm + ICU Folding Impact on Irish Wikipedia (T294257)

Reindexing Results

  • The zero results rate dropped from 32.6% to 30.5% (-2.1% absolute change; -6.4% relative change).
  • The number of queries that got more results right after reindexing was 12.3%, vs. the pre-reindex control of 0% and post-reindex control of 0%.
  • The number of queries that changed their top result right after reindexing was 5.4%, vs. the pre-reindex control of 0.2% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing Irish diacritics.
  • The most common cause of an increased number of results is also matching missing Irish diacritics.
    • Unaccented versions of names like Seamus, Padraig, and O Suilleabhain now can find the accented versions (Séamus, Pádraig, Ó Súilleabháin).
    • Not all diacritical matches are the best. Irish bé matches English be, which occurs in titles of English works. bé matches are still ranked highly because of exact matches.
  • The most common cause of changes in the top result is—you guessed it!—matching missing Irish diacritics; often with a near exact title match.
  • The negligible or zero changes in the number of results and top results in the control runs stem from, I believe, the small size and low activity of the wiki; basically, there is virtually no noise at the 15–30 minute scale.

Unpacking + ICU Norm + ICU Folding Impact on Hindi Wikipedia (T294257)

Reindexing Results

  • The zero results rate dropped from 62.1% to 62.0% (-0.1% absolute change; -0.2% relative change).
  • The number of queries that got more results right after reindexing was 2.3%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 0.9%, vs. the pre-reindex control of 0.1% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing foreign diacritics. (e.g., shito/shitō and nippo/nippō)
  • The most common causes of an increased number of results are matching missing foreign diacritics, removal of invisibles, and—to a much lesser degree—ICU normalization of some Hindi and other Brahmic accents, including Devanagari and Odia/Oriya virama and Sanskrit udātta.
  • The most common causes of changes in the top result are the same as for the increased number of results, since there is a lot of overlap (i.e., searches that got more results often changed their top result).
Hindi Wikipedia Zero Results Queries[edit]

Because the zero results rate was so high, I decided there was no time like the present to do a little investigating into why. I did a little digging into the 1,861 queries that got no results. (A reminder of where this sample comes from: 10K Hindi Wikipedia queries were extracted from the search logs, 528 were filtered as porn, URLs, numbers-only, other junk, etc., and the remainder was deduped, leaving 9,060 unique queries. A random sub-sample of 3K was chosen from there, and the 1,861 (62.0%) of those that got zero results are under discussion here.)

The large majority (84%) of zero-results queries are in the Latin script, with Devanagari (13%) and mixed Latin + Devanagari (2%) making up most of the rest.

  • 1566 (84.1%) Latin
  • 244 (13.1%) Devanagari
  • 43 (2.3%) Latin + Devanagari
  • 4 (0.2%) Gujarati
  • 1 Gurmukhi (Punjabi) + Devanagari
  • 1 CJK
  • 1 emoji
  • 1 misc/wtf (punct + Devanagari combining chars)
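
As an aside, this kind of script bucketing is easy to approximate; here's a minimal sketch (not the actual tooling I used, and the example queries are made up), using the third-party regex module for its Unicode script properties:

 import regex  # third-party module (pip install regex); supports \p{Script=...}
 
 def script_bucket(query):
     """Rough script bucketing, in the spirit of the breakdown above."""
     found = set()
     for name in ('Latin', 'Devanagari', 'Gujarati', 'Gurmukhi', 'Han'):
         if regex.search(r'\p{Script=%s}' % name, query):
             found.add(name)
     return ' + '.join(sorted(found)) or 'other'
 
 print(script_bucket('mahatma gandhi'))  # Latin
 print(script_bucket('महात्मा गांधी'))      # Devanagari
 print(script_bucket('mahatma गांधी'))    # Devanagari + Latin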

I reviewed a random sample of 50 of the Latin queries, and divided them into two broad (and easy for me to discern) categories—English and non-English. The non-English generally looks like transliterated Devanagari/Hindi, but I did not explicitly verify that in all cases. There are a relatively small number of English queries, a larger number of mixed English and non-English queries, and the majority (~70%) are non-English.

50 Latin sample

  • 34 non-English
  • 13 Mixed English + non-English
  • 2 English
  • 1 ???

I took a separate random sample of 20 non-English queries and used Google Translate in Hindi to convert them to Devanagari. About half couldn't be automatically converted (I didn't dig into that to figure out why), but 25% got some Wikipedia results after conversion, and 15% that got no Wikipedia results got some sister-search (Wiktionary, etc.) results. The remaining 15% got no results.

20 non-English sample

  • 9 can't convert
  • 5 some results
  • 3 sister search results
  • 3 no results

Taking this naive calculation with a huge grain of salt (or at least with huge error bars), 84.1% of zero-result queries are in Latin script, 68% of those are likely transliterated Devanagari, and 40% of those get results when transliterated back to Devanagari. That's 22.9% (probably ±314.59%)... actually, the math nerd in me couldn't let it go... using the Wilson Score Interval and the standard error propagation formula for multiplication, I get 23.3% ± 11.9%.
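
For anyone who wants to check that arithmetic, here is a minimal sketch of the calculation (the counts come from the samples above; the Wilson interval and the quadrature error propagation are just the textbook versions, not any special tooling):

 import math
 
 def wilson(successes, n, z=1.96):
     """Wilson score interval for a binomial proportion: returns (center, halfwidth)."""
     p = successes / n
     denom = 1 + z * z / n
     center = (p + z * z / (2 * n)) / denom
     halfwidth = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
     return center, halfwidth
 
 stages = [wilson(1566, 1861),  # Latin-script share of zero-results queries
           wilson(34, 50),      # non-English share of the 50-query Latin sample
           wilson(8, 20)]       # share of the 20-query sample that got results after transliteration
 
 product = math.prod(c for c, h in stages)
 rel_err = math.sqrt(sum((h / c) ** 2 for c, h in stages))  # combine relative errors in quadrature
 print(f'{product:.1%} ± {product * rel_err:.1%}')          # 23.3% ± 11.9%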

So, in very round numbers, almost ¼ of non-junk zero-result queries (and likely at least ⅒ and at most ⅓) on Hindi Wikipedia could be rehabilitated with some sort of decent Latin-to-Devanagari transliteration. The number could be noticeably higher, too—most optimistically doubled—if the queries that Google Translate could not automatically convert got some sort of results with a more robust transliteration scheme; on the other hand, they could all be junk, too. It is also possible that the mixed English and transliterated Devanagari zero-result queries could get some results—though transliterating the right part of the mixed queries could present a significant challenge.

I have opened a ticket with this info (T297761) to go on our backlog as a possible future improvement for Hindi.

I also looked at a random sample of 20 of the zero-result Devanagari queries. The most common grouping is what I call "homework". These are queries that are phrased like all or part of a typical homework question, or other information-seeking question. Something like What is the airspeed velocity of an unladen swallow?, How does aspirin find a headache, or hyperbolic geometry parallel lines.

I also found four names, one porn query, and three I couldn't readily decipher.

20 Devanagari sample

  • 12 "homework"
  • 4 names
  • 1 porn
  • 3 ???

Homework-type questions in general sometimes benefit from removing stop words, but sometimes there are too many specific but only semi-relevant content words to find a match.

Unpacking + ICU Norm + ICU Folding Impact on Norwegian Wikipedia (T294257)[edit]

Reindexing Results

  • The zero results rate dropped from 26.4% to 26.2% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 9.2%, vs. the pre-reindex control of 0.1–0.3% and post-reindex control of 0.1–0.2%.
  • The number of queries that changed their top result right after reindexing was 4.3%, vs. the pre-reindex control of 1.1–1.2% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing foreign diacritics. (e.g., Butragueno/Butragueño and Bockmann/Böckmann)
  • The most common cause of an increased number of results is matching foreign diacritics.
  • The most common cause of changes in the top result is matching foreign diacritics.

Bengali Notes (T294067)[edit]

The situation with Bangla/Bengali is a little different than others I've worked on so far. The Bengali analyzer from Elasticsearch has not been enabled, so I need to enable it, verify it with speakers, and unpack it so that we don't have any regressions in terms of handling ICU normalization or homoglyph normalization.

Since enabling a new analyzer is more complex than the other unpacking projects, I've put the details on their own page.

Bengali Reindexing Impacts[edit]

General Notes[edit]

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries), and any unexpected impacts.

Summary

  • Bengali Wikipedia had a very high zero-results rate (49.0%), and introducing stemming (and other changes—but mostly stemming) provided results for about ⅐ of zero-results queries, lowering the zero-results rate to 42.3%—which is still very high, but definitely better.

Background

  • I pulled a sample of 10K Wikipedia queries from one week in July of 2022. I filtered obvious porn, urls, and other junk queries from the sample (185 queries filtered, porn and strings of numbers are the most common categories) and randomly sampled 3000 unique queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Bengali Wikipedia (T315265)[edit]

Reindexing Results

  • The zero results rate dropped from 49.0% to 42.3% (-6.7% absolute change; -13.7% relative change).
  • The number of queries that got more results right after reindexing was 33.0%, vs. the pre-reindex control of 0.0–0.2% and post-reindex control of 0.0–0.6%.
  • The number of queries that changed their top result right after reindexing was 19.6%, vs. the pre-reindex control of 0.1% and post-reindex control of 0.0%.

Observations

  • The most common cause of all changes seems to be stemming.

Arabic and Thai Notes (T294147)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, chemical names, etc.

Since Thai became so involved, I'm going to split my notes on Arabic and Thai, rather than have them interleaved as I usually do.

Arabic Notes[edit]

  • Some Arabic observations:
    • Lots of (invisible) bi-directional markers everywhere.
    • There are a number of empty tokens, which result from runs of 1 to 4 tatweel characters (ـ), which are used to elongate words or characters to justify text. Tatweel is rightly ignored, but there are a few hundred instances where it appears by itself, creating these empty tokens.
    • There are a handful of homoglyph tokens—Cyrillic/Latin and Greek/Latin... gotta work on those Greek homoglyphs!
  • Stemming observations:
    • Arabic Wikipedia had 98(!) distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • For Arabic, enabling homoglyphs and ICU normalization resulted in the usual stuff.
    • A smattering of mixed-script tokens.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • Arabic had lots of loose bidi marks and a few zero-width (non)joiners and non-breaking spaces that get cleaned up.
      • Arabic Wiktionary had a fair number of long s (ſ) characters that are properly folded to s, and an fi digraph (fi) is folded to fi.
  • Enabled custom ICU folding, saw lots of the usual folding effects.
    • Arabic-specific ICU folding includes:
      • ignoring inverted damma ( ٗ — "The bulk of Arabic script is written without ḥarakāt"—short vowel marks, including damma)
      • ignoring dagger alif (  ٰ — "it is seldom written")
      • converting Farsi yeh (ی) to Arabic yeh (ي — which might not make sense in Farsi, but does in Arabic)
      • converting keheh (ک) to kaf (ك — again makes sense to convert "foreign" letter variants to native ones)
      • removing hamza from waw (ؤ to و) and yeh (ئ to ي) — these were a little less obviously good to me; thanks to Mike R. for giving me the lowdown. While not perfect, these normalizations are generally positive and reflect the way hamza is often used in practice.
      • A note as to scale: in my sample of 10K Arabic Wikipedia articles, there are 211K distinct token types in the text before language analysis (from 1.6M total tokens), and 112K distinct token types after analysis. Of those 112K types, only 550 are affected by all these ICU folding changes. In my 10K-entry Wiktionary sample, only 57 types are affected by all Arabic ICU folding normalizations—out of 27K distinct token types after language analysis (39K types before analysis; 100K tokens total).
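
As an aside, the dotted-I regression mentioned above is handled with a mapping char_filter applied before lowercasing/ICU normalization; here's a minimal sketch (the actual CirrusSearch dotted_I_fix config may differ in details):

 # map İ (U+0130) to plain I so that lowercasing doesn't produce i + combining dot above
 dotted_I_fix = {
     'type': 'mapping',
     'mappings': ['İ=>I'],
 }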

Overall Impact for Arabic[edit]

  • There were few token count differences in most cases, mostly from fewer solo combining characters and Arabic tatweel.
  • ICU folding is the biggest source of changes—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (1.1% to 2.1% of tokens being redistributed into 0.5% to 1% of stemming groups).

Thai Notes[edit]

  • Lots of Thai observations:
    • I was surprised to see that the only stemming groups with multiple members in the Thai Wiktionary data are numbers! For example, 1, ๑ (Thai), ໑ (Lao), and ᧑ (New Tai Lue) are all analyzed as 1. I checked the wiki page for Thai grammar, and it is indeed analytic—with apparently no inflections! (English is usually classified as (kinda) analytic—but Thai really means it!)
      • In the Thai Wikipedia data, there are two non-number (and non-Thai) groups with multiple members, and they both feature our old friend dotted I, as in Istanbul / İstanbul.
      • Looking more closely at the built-in Thai analysis chain, there is no Thai stemmer. Being analytic, I guess it doesn't need one—neat!
    • Thai is the only Elastic built-in analyzer that doesn't use the "standard" tokenizer; there is a specific Thai tokenizer. This leads to some differences—and it kind of looks like the Thai tokenizer is lagging behind the standard and ICU tokenizers for non-Thai characters.
    • There are a lot of really long Thai tokens. The longest is 204 characters: สู่⎵⎵การส่งเสริมความก้าวหน้าของโรคมะเร็งความเข้าใจที่ดีขึ้นของอณูชีววิทยาและชีววิทยาของเซลล์ที่ได้จากการวิจัยโรคมะเร็งได้นำไปสู่⎵⎵การรักษาใหม่จำนวนมากสำหรับโรคมะเร็งนับแต่ประธานาธิบดีนิกสันแห่งสหรัฐประกาศ
      • The ⎵ here represents a zero width space (U+200B). A lot of the really long tokens (but not quite all) have zero width spaces in them. Removing them gives much more reasonable tokenizations—in this case 29 separate tokens.
      • The good-ish news is that the plain field, using the ICU tokenizer, isn't freaked out by the zero width spaces, and generated 49 tokens—so ~20 tokens are potential stop words that are not ignored.
      • The better news is that this is easily fixed with a char_filter before tokenization.
    • There are a few mixed Latin/Cyrillic homoglyph tokens in the Wikipedia data that should be fixed by the homoglyph filter.
    • There are a fair number of tokens with bidi characters—including Arabic and Hebrew, but also CJK, Thai, Latin, and others. ICU normalization should fix those.
    • A few other invisibles show up in output tokens: zero width joiners & non-joiners, zero width spaces, and variation selectors in Myanmar text. Variation selectors are new to me, and I'm not sure whether ICU normalization and/or ICU folding will clean them up, but we'll see.
    • There are a surprising number of tokens with a single double quote in them. For example, Let"s, which looks like a typo. Others, like CD"Just, don't appear on-wiki and seem to be caused by errors in my export process. Not sure if it's from the export itself or my subsequent clean up.
      • The standard tokenizer and the ICU tokenizer strip double quotes from the edges of tokens, and only allow them inside Hebrew tokens (where they frequently substitute for gershayim, which usually indicate Hebrew acronyms). Not sure if this is worth fixing since it may be an artifact of my export process.
    • There are a lot—thousands!—of hyphenated tokens; mostly Latin, but also in plenty of other scripts. And also plenty of other separators...
      • Other dash-like separators remaining in tokens include: – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), - fullwidth hyphen-minus (U+FF0D), and the ‧ hyphenation point (U+2027).
        • It's not clear whether we should break on hyphenation points; they are mostly used to break up syllables in words on Thai Wiktionary. However, if we break on hyphens (which would generally be a good thing), then things would be more consistent if we also break on hyphenation points, since hyphens are used to break up syllables, too.
      • I also learned that the usual hyphen, which also functions as a minus sign (- U+002D "HYPHEN-MINUS") and which I thought of as the hyphen, is not the only hyphen... there is also ‐ (U+2010 "HYPHEN"). Since I had been labeling the typical "hyphen-minus" as "hyphen" in my reports, it took me a while to realize that the character called just "hyphen" is distinct. Fun times!
      • These could all be cleaned up with a char_filter.
    • There are a fair number of tokens using fullwidth Latin characters. So, IMPOSSIBLE gets normalized to impossible, rather than the more searchable impossible. I expect ICU normalization or ICU folding to take care of this.
    • Percent signs (%) are not skipped during tokenizing, which means that there are percentages (15.3%) in the final tokens. URL-encoded strings (like %E2%80%BD) get parsed the wrong-way around, with the percent signs attached to the preceding rather than following letters/numbers (%E2%80%BD is parsed as E2% + 80% + BD).
    • There are also a lot of tokens that end with an ampersand (&). These seem to mostly come from URLs as well, with the little twist that the ampersand is only attached to the token if the character right before the ampersand is a number (including non-Arabic numerals). Tokenizing the (very artificial) URL fragment q=1&q=๑&q=໑&q=١&q=१&q=১&q=੧&q=૧&q=୧&q=༡&id=xyz gives 10 tokens that are all normalized (presumably by the ubiquitous decimal_digit filter) to 1&. (In practice, non-ASCII letters are often URL-encoded, so all natural examples are Arabic digits (0-9) with or without plain-ASCII Latin character accompaniments.) These can be broken up by a char_filter or cleaned up with a token filter.
    • In summary, it looks like Thai could benefit from an even more aggressive version of word_break_helper to "split on"—by converting them to spaces—hyphens (both hyphen-minus and "true" hyphens), en dashes, em dashes, horizontal bars, full-width hyphens, and ampersands, and probably on hyphenation points, percent signs, and double quotes. We should also either delete or convert to spaces (not sure yet) zero width spaces as early as possible, so that the Thai tokenizer isn't confused by them.
  • Stemming observations:
    • Thai Wikipedia had only 1(!!!) distinct token in its largest Thai, non-number stemming group!
      • Numbers have up to four different numeral systems in my samples.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • For Thai, enabling homoglyphs and ICU normalization had a bigger-than-usual impact!
    • As usual, a smattering of mixed-script tokens.
    • As usual, the expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map.
    • A lot of tokens with invisibles (mostly zero width spaces) were grouped with their invisible-less counterparts.
    • There are a sprinkling of fullwidth Latin tokens (e.g., IMPOSSIBLE) that are now grouped with their normal Latin counterparts.
    • Unexpectedly, 1.0% of Wiktionary tokens (~2900) and 0.8% of Wikipedia tokens (~28K) were lost! They were lost because, after ICU normalization, they were identified as stop words. (There's a quick verification sketch of the relevant normalization after this list.)
      • The vast majority of the normalization was from ำ (SARA AM) to ํ + า (NIKHAHIT + SARA AA). According to the Wiktionary entry for ำ (SARA AM), the combined single glyph is preferred, but apparently the Thai stop word list doesn't know that. The two versions are virtually identical—I couldn't see any difference in a dozen fonts that I checked, though I do not have an expert eye. For example, ทำ is not a stop word, but ทํา is (ICU normalization converts the former to the latter.)
      • The slim remainder of tokens (all in the Thai Wikipedia sample) had either zero width spaces (90 tokens) or bidi marks (2 tokens) that left stop words after they were removed.
      • A few other non–stop word tokens with SARA AM merged with their NIKHAHIT + SARA AA counterparts.
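
The sara am normalization above is easy to verify, since it is a standard Unicode compatibility decomposition (Elasticsearch's icu_normalizer defaults to NFKC case folding, so it applies the same decomposition); a quick sketch:

 import unicodedata
 
 tham = 'ท\u0e33'  # ทำ: THO THAHAN + SARA AM (U+0E33)
 norm = unicodedata.normalize('NFKC', tham)
 print([f'U+{ord(c):04X}' for c in tham])  # ['U+0E17', 'U+0E33']
 print([f'U+{ord(c):04X}' for c in norm])  # ['U+0E17', 'U+0E4D', 'U+0E32'] (nikhahit + sara aa)
 print(tham == norm)                       # False, though the two usually look identical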

An Excursus on Tokenization[edit]

Before continuing with ICU folding for Thai, I decided to look into and make some implementation decisions about tokenization.

The ICU tokenizer has algorithms/dictionaries for a lot of spaceless languages, including Thai, so it is a potentially viable alternative to the Thai tokenizer. However, neither tokenizer is perfect.

Some issues include...

  • Thai analysis in general
    • There are two obsolete characters, ฃ and ฅ, that have been replaced with the similar looking and similar sounding ข and ค, respectively. Easily fixed with a character filter.
    • Similar to Khmer—but apparently (thankfully!) nowhere near as complex—Thai has some ambiguous representations of diacritics and characters with diacritics.
      • ำ (SARA AM, U+0E33) and ํ + า (NIKHAHIT, U+0E4D + SARA AA, U+0E32) look identical: ทำ vs ทํา.
        • When sara am is combined with some other diacritics, there are three combinations that look the same, in many fonts (occurrences are in Thai Wikipedia, and are found with regular expressions, which can time out and give incomplete results, so counts are sometimes not exact):
          • กล่ำ = ก + ล + ่ + ำ (≥8900 occs)
          • กลํ่า = ก + ล + ํ + ่ + า (80 occs)
          • กล่ํา = ก + ล + ่ + ํ + า (6 occs)
        • There is a fourth combination that looks the same in some fonts/applications:
          • กลำ่ = ก + ล + ำ + ่ (≥2 occs)
        • Approximately 1% of apparent instances of sara am and ่ (MAI EK, U+0E48) are in the wrong order.
        • This split of sara am also occurs around ้ (MAI THO, U+0E49), but not any other Thai diacritics.
    • ึ (SARA UE, U+0E36) and ิ + ํ (SARA I, U+0E34 + NIKHAHIT, U+0E4D) and ํ + ิ (NIKHAHIT, U+0E4D + SARA I, U+0E34) often look identical (depending on font and application): กึ vs กิํ vs กํิ
      • All of these can potentially screw up tokenization, and certainly will mess with matching in general. Fortunately, all of these can also be fixed with character filters.
  • Thai Tokenizer
    • — Doesn't split on em dash, en dash, hyphen-minus, hyphen, horizontal bar, fullwidth hyphen, double quote, colon, or hyphenation point. Easily fixed with character filter similar to word_break_helper.
    • ⬆⬆ Splits on periods between Thai characters.
    • — Zero width spaces, obsolete ฃ and ฅ, and improperly ordered/normalized diacritics can break the tokenization process and result in absurdly long tokens (~200 characters in the extreme). Readily fixable with character filters.
    • ⬇︎⬇︎ Sometimes the nikhahit character—which is generally kind of rare ("infrequently used")—can also cause tokenization to go awry, resulting in overly long tokens. I can't find a way to handle this. The character doesn't seem to be incorrect, so I can't delete it or substitute it.
    • ⬇⬇⬇ I found a buffering bug in the Thai tokenizer. When a string to be analyzed is over 1024 characters long, it gets broken up into 1024-character chunks, even if that splits a token. This can also cause weird long-distance effects ("spooky action at a distance"?) depending on how text is chunked and submitted for tokenization. I saw a split of an English token (BONUSB + ONUS) that had spaces on either side of it—so there is no effort made in the tokenizer to prevent this problem, even when it is straightforward.
    • The tokenizer treats some characters, like some symbols & emoji, Ahom (𑜒𑜑𑜪𑜨), and Grantha (𑌗𑍍𑌰𑌨𑍍𑌥) essentially like punctuation, and ignores them entirely.
    • Oddly ignores New Tai Lue (ᦟᦲᧅᦷᦎᦺᦑᦟᦹᧉ) tokens starting with ᦵ, ᦶ, ᦷ, ᦺ, ᧚, and has complicated/inconsistent processing of tokens with ᧞.
    • Prefers longer tokens for compound words (e.g., พรรคประชาธิปัตย์ ("Democratic party") is one token instead of two).
    • — Sometimes fails to split some other tokens that the ICU tokenizer splits (these are harder to assess).
  • ICU Tokenizer
    • Allows tokens for Ahom, Grantha, and some symbols and emoji to come through.
    • ⬆⬆ Better parsing for CJK, Khmer, Lao, New Tai Lue, and other spaceless languages
    • Explodes homoglyph tokens: creates a new token whenever a new script is encountered, so a Latin token with a Cyrillic character in the middle gets broken into three tokens, making it unfindable. This is a known problem with the ICU tokenizer, but we still use it elsewhere.
      • Relatedly, maintains "current" script set across spaces and assigns digits the "current" script, so the presence of particular earlier tokens can affect the parsing of later letter-number tokens. E.g., x 3a is parsed as x + 3a, while ร 3a is parsed as ร + 3 + a. This is a known problem with the ICU tokenizer, but we still use it elsewhere.
    • Prefers multiple shorter tokens for compound words (e.g., พรรคประชาธิปัตย์ ("Democratic party") is two tokens, พรรค + ประชาธิปัตย์, instead of one).
    • — Sometimes splits some other tokens that the Thai tokenizer does not split (these are harder to assess).
    • — Allows apostrophes in Thai tokens. These tokens are kind of screwy because they are part of a pronunciation guide in Wiktionary, and they get split oddly no matter what. (For comparison, it would be like saying that ดอกจัน "asterisk" is pronounced dok'chan, and then parsing that as do + k'chan, because do is an English word.) These aren't super common and aren't real words; the word-looking non-word parts that are split off are more of a concern than the bogus tokens with apostrophes in them.
    • — Digits (Arabic 0-9 and Thai ๐-๙) glom on to non-digit words/tokens. This is reasonable in languages with spaces, where tokens are more obvious, but in a spaceless language, this seems like a bad idea—it renders both the number and the word it attaches to almost unsearchable. This was originally rated "⬇⬇", but it can be fixed with a character filter, though it's a little ugly.
    • — Doesn't split on colon, hyphenation point, middot, semicolons (between numbers), or underscores. Easily fixed with character filter similar to word_break_helper.
      • - Doesn't split on periods between Thai characters. Should also be fixable with a character filter.
And the Winner is...[edit]

... the ICU tokenizer.

I'm personally rather annoyed by the behavior of the ICU tokenizer on homoglyph/mixed-script tokens and the parsing of mixed letter-number tokens, but they aren't super common, and we already accept that behavior for other languages that use the ICU tokenizer.

The improvements to rarer characters and scripts (emoji, Ahom, Grantha, and surely others not in my samples), improved parsing for other spaceless languages, and especially the handling of invisibles (like the zero width space) and the parsing of Thai compounds are all a lot better—so it makes sense to try to use the ICU tokenizer.

I added a character filter that handles:

  • the obsolete characters, by replacing them with the preferred characters.
  • re-ordering diacritics (fortunately there are only a few cases to handle—nothing as complex as the Khmer situation).
  • breaking words/tokens on colon, hyphenation point, middot, semicolons, and underscores, by replacing them with spaces.
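
A minimal sketch of the kinds of mappings involved (illustrative entries only, not the production config; the \uXXXX escapes use the mapping char_filter's own escape syntax):

 thai_mapping_sketch = {
     'type': 'mapping',
     'mappings': [
         'ฃ=>ข',                            # obsolete kho khuat -> kho khai
         'ฅ=>ค',                            # obsolete kho khon -> kho khwai
         '\\u0E33\\u0E48=>\\u0E48\\u0E33',  # sara am + mai ek -> mai ek + sara am (one re-ordering case)
         ':=>\\u0020',                      # colon -> space
         '\\u2027=>\\u0020',                # hyphenation point -> space
         '\\u00B7=>\\u0020',                # middot -> space
         '_=>\\u0020',                      # underscore -> space
     ],
 }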

Separating numbers from Thai tokens, and splitting only on periods in Thai words turned out to be an interesting exercise in regexes and character filters. The big problem occurs when a single letter is the second half of one regex match and the first half of the next regex match.

For example, I'd like ด3ด to be tokenized as ด + 3 + ด. But once the regex matches "ด3" in order to put a space between them, it continues matching after the 3, and so can't put a space between 3 and the next ด. The obvious (but potentially inefficient) solution is using a lookahead, but there's a bug that screws up the offsets into the original text, by excluding the final character. This screws up my tools—a minor problem—and would also screw up highlighting—a moderate problem.

Similarly, I want ด.ด.ด to be tokenized as ด + ด + ด, but for now—see T170625—I only want to replace periods with spaces between Thai characters (matching the behavior of the Thai tokenizer). In the case of single letters separated by spaces, we have the same regex-matching problem as above.

Rather than trying to do something extra cunning with lookaheads and complex replacements,* the most straightforward solution is to break the number padding into two steps—Thai letter + number and number + Thai letter—and to run the exact same period replacement filter a second time to pick up any leftover strays. (I was a little worried about efficiency, but regex lookaheads aren't exactly the most efficient thing ever, and the simplified regexes are very simple, so there's no real difference on my virtual machine running Elasticsearch and CirrusSearch.)

____
* As an exercise for the interested reader, this is the single pattern I would have used for spacing out numbers, if not for lookaheads resulting in incorrect offsets:

	'pattern' => '(\\p{Nd})(?=[ก-๏])|([ก-๏])(?=\\p{Nd})',
	'replacement' => '$1$2 ',


Update: David pointed out during code review that the text has to be loaded 4 times to run the four replace filters, and suggested using lookaheads in a single replace filter. I had done my previous testing on ES6.8, so I re-ran the tests on ES7.10, just in case it had fixed the problem. It had not. However, I realized I could exclusively use lookbehinds, which do not seem to cause any offset problems. (Presumably because when looking behind you don't really have to care about the offsets for characters you have already passed and aren't going back to change.) The new, moderately hideous regex, for your amusement, is:

	'pattern' => '(?<=\\p{Nd})([ก-๏])|(?<=[ก-๏])(\\p{Nd})|(?<=[ก-๏])\.([ก-๏])',
	'replacement' => ' $1$2$3',
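
As a quick sanity check on the examples above, here's a sketch using Python's built-in re module (with an explicit digit class standing in for \p{Nd}, which re doesn't support) to approximate the pattern_replace behavior:

 import re
 
 # Arabic 0-9 and Thai ๐-๙ standing in for \p{Nd}
 pattern = r'(?<=[0-9๐-๙])([ก-๏])|(?<=[ก-๏])([0-9๐-๙])|(?<=[ก-๏])\.([ก-๏])'
 repl = r' \1\2\3'  # unmatched groups substitute as empty strings
 
 print(re.sub(pattern, repl, 'ด3ด'))    # ด 3 ด
 print(re.sub(pattern, repl, 'ด.ด.ด'))  # ด ด ด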

I also configured the AnalysisConfigBuilder to check for the ICU plugin before assuming the ICU tokenizer is available, and I configured some additional char_filters (deleting zero width spaces and splitting on many dash-like things, and double quotes) to accommodate some of the Thai tokenizer's weaknesses if the ICU tokenizer isn't available.

New Tokenizer Results[edit]

Comparing the ICU tokenizer to the Thai tokenizer, a lot is going on. Beyond what has been mentioned so far...

  • There are generally more tokens.
    • My Thai Wiktionary sample had 21% more tokens! (61K/291K)
    • My Thai Wikipedia sample had 4% more tokens (142K/3.4M)
      • The vast majority of new tokens are Thai (46K for Wiktionary, 132K for Wikipedia). The Wikipedia data also showed a dramatic decrease in the number of Thai types (distinct Thai words)—from 103K with the Thai tokenizer, down to 41K with the ICU tokenizer. The average Thai type length also dropped from 5.3 to 4.5 for the Wiktionary sample and 7.6 to 5.1 for the Wikipedia sample. These are both indicative of longer, more distinctive phrases (like พรรคประชาธิปัตย์, "Democratic party") being broken into smaller words (like พรรค + ประชาธิปัตย์), many of which are then also independently seen elsewhere.
      • There are also hundreds to thousands more tokens from Chinese, Japanese, and Lao, because the ICU tokenizer knows how to do basic segmenting for these languages. The Wiktionary sample had about 9K more Chinese tokens with the ICU tokenizer!
      • There are thousands more Latin tokens (and to a much lesser degree many other scripts) because of splitting on hyphens and other separators.
  • Multi-script tokens (including those with homoglyphs) are split up by the ICU tokenizer.
    • So Cоветские, which starts with a Latin C, is split into c and оветские, instead of being fixed by the homoglyph plugin (which runs after tokenization). There aren't very many of these—less than 10 in each 10K document sample.
    • On the other hand, fairly ridiculous tokens like ๆThe no longer exist.
  • There are no longer any ridiculously long Thai tokens.
    • The Thai Wikipedia sample with the Thai tokenizer had 50 tokens that were at least 50 characters long, including 2 over 200 characters long. The longest Thai token with the ICU Tokenizer is 20 characters long. These longer tokens seem to be names and technical terms, not whole sentences, so they are much more reasonable.
    • There are more long non-Thai tokens because the ICU tokenizer recognizes rare scripts, and they are converted to \u-encoding for Unicode (usually at a 12-to-1 increase in length; e.g., Gothic 𐌰𐍄𐍄𐌰 is indexed as \uD800\uDF30\uD800\uDF44\uD800\uDF44\uD800\uDF30).
  • There's evidence of the buffering bug in the Thai tokenizer when comparing it to the ICU tokenizer output. One example is that the only instance of all lowercase hitman had disappeared. Looking in the original text, I see that it was part of the name Whitman! So, that's an improvement!

Back to Our Regularly Scheduled Program[edit]

  • Enabled custom ICU folding, saw lots of the usual folding effects.
    • I exempted Thai diacritics from folding.
      • Words like กลอง / กล่อง / กล้อง ("drum" / "box" / "camera")—which differ only by diacritics—are clearly distinct words and should not be folded together on Thai-language wikis.
    • ICU folding changes include:
      • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around—in dozens of scripts.
      • Modifying diacritics in general and Arabic "tatweel" & Japanese "katakana-hiragana prolonged sound mark" when occurring alone became empty tokens (and were filtered).
    • Thai Wiktionary:
      • The biggest changes to Thai tokens are the stripping of non-Thai diacritics (e.g., mācron, cîrcumflex, tĩlde, uṉderline, etc., e.g., ด̄, ด̂, ด̃, ด̰, or ด̱), and the removal of modifier primesʹ and double primesʺ.
      • The most common token mergers were similar spellings and phonetic spellings, or either with stress markers, e.g., Japan, jaːpan, jaːˈpɑn, and jāpān.
    • Thai Wikipedia:
      • Very few Thai tokens were affected.
      • The most common token mergers are Latin tokens with diacritics, and a fair number of quotes being straightened.

Overall Impact for Thai[edit]

  • There are a lot more Thai Wiktionary tokens (~58K, ~20%) and a few more Thai Wikipedia tokens (~114K, ~3%), but many fewer distinct tokens, which comes from the ICU tokenizer dividing up words more finely, as discussed above.
  • ICU tokenization is the biggest source of changes in both wikis.
    • Thai Wiktionary: 3.1K tokens (1.1% of tokens) were merged into 1.4K groups (2.6% of groups).
    • Thai Wikipedia: 11K tokens (0.3% of tokens) were merged into 1.4K groups (0.8% of groups).

Arabic/Thai Reindexing Impacts[edit]

General Notes[edit]

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary

  • Arabic showed the usual small recall improvements from ICU folding.
  • Thai had significantly more momentous changes because of the introduction of a different tokenizer.
    • Without doing significantly more analysis with a Thai speaker, I can't say for sure that the changes are all positive, but they are generally in line with what we saw before, and highlight the cases where the Thai tokenizer seems to be overly aggressive (breaking text up into one-character tokens).

Background

  • I pulled a sample of 10K Wikipedia queries from July of 2022 (1 week's worth each). I filtered obvious porn, urls, and other junk queries from each sample (Arabic:535, Thai:388; porn, urls, and numbers are the most common categories in both cases) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Arabic Wikipedia (T319420)[edit]

Reindexing Results

  • The zero results rate dropped from 23.7% to 23.2% (-0.5% absolute change; -2.1% relative change).
  • The number of queries that got more results right after reindexing was 21.0%, vs. the pre-reindex control of 0.4–1.5% and post-reindex control of 0.3–0.5%.
  • The number of queries that changed their top result right after reindexing was 8.1%, vs. the pre-reindex control of 0.8–1.1% and post-reindex control of 0.0–0.3%.

Observations

  • The most common cause of improvement in zero-results is ICU folding—in ⅕ of cases in Latin tokens, and in the other ⅘ in Arabic—particularly variants of ی.
  • The most common cause of an increased number of results or changes in the top result is ICU folding, again particularly variants of ی, and variants of characters with "hamza above".

Unpacking + ICU Tokenizer + ICU Norm + ICU Folding Impact on Thai Wikipedia (T319420)[edit]

Reindexing Results

  • The zero results rate increased from 16.1% to 17.6% (+1.5% absolute change; +9.3% relative change).
    • We can break this down further as 1.9% went from some results to no results, and 0.4% went from no results to some results.
  • The number of queries that got more results right after reindexing was 8.7%, vs. the pre-reindex control of 0.1–0.6% and post-reindex control of 0.2–0.9%.
  • The number of queries that got fewer results right after reindexing was 44.2%, vs. the pre-reindex control of 0.1–0.8% and post-reindex control of 0–0.6%.
  • The number of queries that changed their top result right after reindexing was 20.3%, vs. the pre-reindex control of 0.2–0.3% and post-reindex control of 0%.

Observations

The big changes here are caused largely by the ICU tokenizer. I looked at queries that went from some results to no results, focusing on the ones that had the most results before the change.

Interestingly, many of the queries have spaces in them, which is a strategy we've seen used before in spaceless languages (in Chinese, for example). If the searcher splits their query into words—even though it isn't required by the language—it prevents some tokenizing errors by creating word boundaries that will be respected. A parallel in English would be writing therein as there in to prevent it possibly being treated as a single word or split as the rein.

Anyway, the Thai tokenizer splits up certain strings into tokens of one or two characters. In many cases, the strings are already short—only three characters—and separated by spaces from other words or phrases in the query. Following the example above, this would be like separating therein as there in, but then the tokenizer further splits it into the r e in.

Queries on these one- and two-character tokens get a lot more hits than their longer token counterparts, especially when similar aggressive tokenization is happening in the indexed text, too. (E.g., if are in the were also tokenized as a r e in the, it would match there in tokenized as the r e in.)

In the cases where queries that previously got no results did get results with the new tokenizer, the pattern is much the opposite, with the ICU tokenizer either splitting up words differently, or, more often, splitting into smaller words. The ICU tokenizer is somewhat less exuberant about splitting, and generated mostly two-character tokens when splitting more finely, rather than one-character tokens.

Egyptian Arabic and Moroccan Arabic Notes (T316817)[edit]

Background and Notes[edit]

While working on unpacking the Arabic analyzer (see above), my discussion with Mike R. led me to wonder whether part or all of the analysis chain for Standard Arabic could be applied to Egyptian and Moroccan Arabic.

We individually reviewed the effects of each of the filters from the Arabic analysis chain on 10K Egyptian Arabic Wikipedia docs and 2500 Moroccan Wikipedia docs (there are only ~6K total in that wiki). The filters may not do everything that could be done, but the things they do all seem to be good. The effects are generally similar to the previous Arabic unpacking results.

Mike also identified an additional 129 stop words for both dialects, including orthographic variants and those with prefixes. We opted to go with one list for both dialects, since the stopwords in one are not likely to be content words in the other.

Overall Impact[edit]

Egyptian Arabic[edit]

  • 11.8% fewer tokens, all matching 149 distinct stop words (we kept the standard stopword list in place, too).
    • Of the tokens removed as stop words, about 84% are accounted for by the top 5, and 35% by the top word, فى, which is an Egyptian Arabic variant of the word for "in".
  • 9.0% of types (distinct tokens) accounting for 17.2% of all tokens merged into 4.2% of stemming groups. Or, about 1 in 6 words got stemmed into something that matched something else after stemming.

Moroccan Arabic[edit]

  • 24.2% fewer tokens, matching 202 distinct stop words.
    • Of the tokens removed as stop words, about 46% are accounted for by the top 5, and 14% by the top word, ف, meaning "then".
  • 19.8% of types (distinct tokens) accounting for 18.7% of all tokens merged into 8.5% of stemming groups. Or, about 1 in 5 words got stemmed into something that matched something else after stemming.

Egyptian and Moroccan Arabic Reindexing Impacts[edit]

Summary

  • Huge improvements in zero-results rate! About 1 in 5 former zero-results queries on Moroccan Arabic Wikipedia and more than 1 in 3 on Egyptian Arabic Wikipedia now get results!

Background

  • I pulled 10K Wikipedia queries from September and October of 2022 for Egyptian Arabic. I filtered obvious porn, urls, and other junk queries from the sample (247 queries; porn, urls, and numbers are the most common categories). I randomly sampled 3000 queries from the remainder for Egyptian.
  • I was only able to pull 1680 Moroccan Arabic Wikipedia queries for all of 2022 (though with our 90-day retention policy, I didn't actually go back that far). I filtered obvious porn, urls, and other junk queries from each sample (452 queries; urls are the most common category). I took all 1082 unique queries for Moroccan.

Arabic Analysis Reindexing Impact on Egyptian and Moroccan Arabic Wikipedia (T322044)[edit]

Moroccan Arabic Reindexing Results

  • The zero results rate dropped from 55.3% to 44.8% (-10.5% absolute change; -19.0% relative change).
  • The number of queries that got more results right after reindexing was 19.5%, vs. the pre-reindex control of 0.0% and post-reindex control of 0.0–0.7%.
  • The number of queries that changed their top result right after reindexing was 9.4%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.3%.

Egyptian Arabic Reindexing Results

  • The zero results rate dropped from 54.5% to 34.2% (-20.3% absolute change; -37.2% relative change).
  • The number of queries that got more results right after reindexing was 36.0%, vs. the pre-reindex control of 0.0% and post-reindex control of 0.0–1.0%.
  • The number of queries that changed their top result right after reindexing was 20.3%, vs. the pre-reindex control of 0.1% and post-reindex control of 0.0%.

Observations

  • Most changes are, of course, due to the introduction of the stemmer.
  • Egyptian Arabic also had a handful of ZRR improvements due to ICU folding (accented Latin-script terms matching unaccented ones).
  • A few queries had really large increases in the number of matches. On the Egyptian Arabic Wikipedia, the query for the word for "Egyptians" now matches the word for "Egypt", which matches a lot of articles. On the Moroccan Arabic Wikipedia, a formerly zero-results query now matches a common administrative category. But overall the changes should be a big improvement.

Ukrainian Notes (T318264)[edit]

Background[edit]

The Ukrainian analyzer is on my list of analyzers to unpack, but it is not on Elastic's list of built-in language analyzers. The `analysis-ukrainian` plugin is provided by Elastic, though, and it is a simple wrapper around the analyzer provided by Lucene.

The Lucene Ukrainian analyzer is pretty straightforward—simple character filter, standard tokenizer, lowercasing, stopword filtering, and a dictionary-based stemmer using the Morfologik framework—but its components are not provided separately, so it can't be unpacked.

I discussed it with David, and we decided that it should be relatively straightforward to build a new plugin that provides the Ukrainian stemming—Morfologik is provided by Lucene and the Ukrainian dictionary is available as an independent artifact. We later also decided to include a Ukrainian stopword filter in the plugin, using the list available from Lucene, rather than recreating it in CirrusSearch. The character filter is easy to reproduce outside of a plugin, so we left it out. The other parts of the full analysis chain are standard components (tokenizer), and/or something we want to upgrade anyway (lowercasing to ICU normalization).

Plugin V1[edit]

I built the plugin, using the most current version of the Ukrainian dictionary I found (4.9.1). I recreated the character filter and ran an unpacked version of the analysis chain. To my surprise, there were significant changes. About 2% of tokens changed, and about 0.25% of tokens disappeared. I worked backwards and eventually found that the ES 7.10 Ukrainian analyzer is only using version 3.7.5 of the Ukrainian dictionary.

Plugin V0 / Baseline[edit]

I temporarily downgraded to dictionary 3.7.5, which resulted in zero diffs between the unpacked version (with the homoglyph and ICU normalization upgrades disabled) and the monolithic ES 7.10 version.

I then went back to the 4.9.1 version of the dictionary, and used that as my new baseline for the rest of the upgrades.

Data[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.
    • Also a fair number of tokens joined by periods or colons, like форми:іменники:прикметники:дієслова:прислівники.
    • One surprisingly long token, Тау­ма­та­уа­ка­тан­гі­ан­га­ко­ау­ау­ота­ма­теа­ту­ри­пу­ка­ка­пі­кі­маун­га­хо­ро­ну­ку­по­ка­ну­­енуа­­кі­та­на­та­ху.. which is just the Ukrainian name of Taumata­whakatangihanga­koauau­o­tamatea­turi­pukaka­piki­maunga­horo­nuku­pokai­whenua­ki­tana­tahu, of course.
  • Stemming observations:
    • A relatively small number of tokens (5.2% in the Wikipedia sample, 7.8% in the Wiktionary sample) generate two output tokens. A smaller number (1.0% in Wikipedia, 1.6% in Wiktionary) generate more than two output tokens—up to 7! They generally look like additional inflections of the word.
    • A fair number of output tokens are capitalized by the stemmer, even if the input token is lowercased. For example both Іван and іван ("Ivan"/"ivan") have as output the token Іван.
      • These are mostly identifiable as proper names.
      • In some cases, when there are multiple output tokens, they only differ by upper-/lowercase.
    • There are some "unexpected" stemming groupings, but they are the same ones we saw when first enabling the Ukrainian analyzer.
    • Ukrainian Wikipedia had 22 distinct tokens in its largest stemming group.
      • It's an unfortunate (and unavoidable) group, as it has the stem мати, which means both "mother" and "to have".

Lowercase and Remove Duplicates[edit]

Based on the fact that there seemed to be a fair number of input tokens that generate output tokens that only differ by case, I applied an additional lowercase filter after the stemmer, and a remove_duplicates filter after that. (Remove duplicates only removes identical tokens that have the same offsets, so "Іван іван" still generates two tokens total.)
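
In config terms, the tail end of the text-field filter chain looks something like this (a sketch; earlier filters elided, and the exact names in the production config may differ):

 ukrainian_filter_tail = [
     # ... ICU normalization, stopwords, and the Morfologik-based stemmer go here ...
     'lowercase',          # fold the stemmer's capitalized lemmas (e.g., Іван -> іван)
     'remove_duplicates',  # drop now-identical stems emitted at the same position
 ]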

There are possibly some words that do differ by case (in English, rose the flower and Rose the name, would be an example), but given the fact that sentence-initial words get capitalized, and searchers are often very lax with their capitalization, this doesn't seem like a distinction worth maintaining.

In the Wiktionary sample, the total number of tokens decreased by 6.8%! In the Wikipedia sample, it decreased by 4.4%. That's a lot of duplicates! That said, the number of new collisions (words that stem together that didn't before) is still low.

For Wiktionary: 0.074% of types (unique tokens), accounting for 0.155% of all tokens, were added to existing groups.

For Wikipedia: 0.066% of types, accounting for 0.186% of tokens, were added to existing groups.

The actual count of meaningful collisions is a bit lower. In some cases, for example, the tokens were previously grouped under a capitalized stem, but are now grouped under the lowercase stem. For example, Галич and Галичі both previously stemmed to галич, while Галич, Галичі, and Галича all stemmed to Галич. After lowercasing, all three stem to галич (and only галич)—with Галича being "added" to the галич group, which recreates the previous Галич group.

Back to Our Regularly Scheduled Program[edit]

  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • A lot of mixed-script tokens!
      • Ukrainian Cyrillic і can be hard to type on non-Ukrainian keyboards, so it is often replaced with Latin i, though there are lots of other Cyrillic/Latin homoglyphs.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • In the Wiktionary sample, almost all the collisions are homoglyphs.
      • In the Wikipedia sample, there are a few invisibles (bidi, no-break spaces), but also most collisions are homoglyphs.
  • Enabled ICU folding, saw lots of the usual folding effects.
    • Did not include any exceptions for ICU folding (see below).
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Most mergers were in Latin-script tokens in both wikis.

й/и and ї/і

By default, ICU folding would fold й to и and ї to і. All four are letters of the Ukrainian alphabet, and I usually don't fold letters that are explicitly included in the alphabet. However, I let them be folded at first for testing, before configuring any exceptions.

  • The Wiktionary sample has ~226,000 words that get put into ~46,000 groups. Out of all of those, there are only 23 groups that have words that differ by и/й or і/ї. (~0.05% of groups)
  • The Wikipedia sample is much bigger—~2.4 million words divided into ~197,000 groups—but fewer than 100 groups were affected by и/й or і/ї. (also ~0.05%)

The words that are affected tend to be Russian, transliterated names, obviously related, or obvious typos. A few words grouped together are distinct, but the preference for exact matches will tend to rank them higher.

Given the very small impact and the general benefit of the mergers, I decided not to add й and ї as folding exceptions.

ґ/г

The Ukrainian analyzer that we are unpacking explicitly includes folding ґ to г in its character filter, so I'm leaving that as-is for now. It's redundant with ICU folding, but I'm not sure if it interacts with the Ukrainian stemmer (the char filter is before stemming; ICU folding is after). Also, if the ICU plugin is not available, we want to maintain the char filter folding to be closer to the monolithic analyzer.

Both ґ and г are in the Ukrainian alphabet, but ґ was only officially reintroduced in 1990, which may be part of why its use would be inconsistent and why folding makes sense.

Overall Impact[edit]

  • There were significantly fewer tokens overall (6.7% for Wiktionary, 4.4% for Wikipedia), mostly due to the effects of lowercasing and removing duplicate tokens from the stemmer.
  • Homoglyph processing had a small but noticeable effect:
    • Ukrainian Wiktionary: 99 tokens (0.04% of tokens) were added to 83 stemming groups (0.18% of groups)
    • Ukrainian Wikipedia: 847 tokens (0.04% of tokens) were added to 479 stemming groups (0.24% of groups)
  • ICU folding is the next biggest source of changes in both wikis. Generally, the merges that resulted from ICU folding were significant, but not extreme:
    • Ukrainian Wiktionary: 1,305 tokens (0.58% of tokens) were merged into 369 stemming groups (0.79% of groups)
    • Ukrainian Wikipedia: 5,660 tokens (0.24% of tokens) were merged into 714 stemming groups (0.36% of groups)

Ukrainian Reindexing Impacts[edit]

General Notes[edit]

Summary

  • Small but meaningful decrease in zero-results rate (about 1 in 25 former zero-results queries now get results), with ICU folding and homoglyph normalization (woo hoo!) creating the biggest impact!

Background

  • I pulled a sample of 10K Wikipedia queries from July of 2022 (1 week's worth). I filtered obvious porn, urls, and other junk queries from each sample (165 queries, numbers and Latin & Cyrillic junk queries are the most common categories) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Ukrainian Wikipedia (T323927)[edit]

Reindexing Results

  • The zero results rate dropped from 24.8% to 23.8% (-1.0% absolute change; -4.0% relative change).
  • The number of queries that got more results right after reindexing was 14.5%, vs. the pre-reindex control of 0.4–0.6% and post-reindex control of 0.1–0.9%.
  • The number of queries that changed their top result right after reindexing was 3.4%, vs. the pre-reindex control of 0.7–0.8% and post-reindex control of 0.0%.

Observations

  • The most common cause of improvement in zero-results is—to my delight and amazement—homoglyph normalization! ICU folding is second, split between generic folding, and й/и and ї/і folding. The rest I couldn't immediately tell, but I think they are due to the stemming dictionary upgrade.
  • The most common causes of an increased number of results seem to be stemming dictionary updates and homoglyph normalization.
  • The most common cause of changes in the top result is not clear, but ICU folding seems to have contributed to changed matches in the opening text—a lot of new top hits have a stress-accented term in the snippet that would not have matched before.

Enable ICU Folding for Russian (Gerrit: 859549)[edit]

This isn't really unpacking, but I was looking at SonarQube and noticed that the Russian ICU folding exceptions don't get tested. Turns out ICU folding was not enabled for Russian when the exceptions were configured.

So... I pulled 10K random Russian Wikipedia articles and 10K random Russian Wiktionary entries for testing.

  • There was already an ICU Folding exception configured for Йй for Russian (see the sketch after this list).
  • There are a handful of lost tokens (1 in Wikipedia, ~100 in Wiktionary), all of which are either ー (U+30FC, katakana-hiragana prolonged sound mark) or ː (U+02D0, modifier letter triangular colon). They are still findable via the "plain" field.
  • The most common ICU folding is diacritics on Latin characters, though there are the normal regularizations across many scripts. The most common Cyrillic normalizations are қ, ң, ұ, ї, and other non-Russian characters, along with straightening curly apostrophes, and stripping grave and double grave accents (which are often used to show stress in names).
  • In Wiktionary: 0.81% of tokens were added to 1.6% of stemming groups.
  • In Wikipedia: 0.09% of tokens were added to 0.5% of stemming groups.
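
For reference, the Йй exception mentioned above is expressed as a UnicodeSet filter on the icu_folding token filter; a minimal sketch (the exact CirrusSearch config may differ):

 # fold everything except Й and й (the set is the complement [^Йй])
 icu_folding_ru = {
     'type': 'icu_folding',
     'unicode_set_filter': '[^Йй]',
 }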

Russian ICU Reindexing Impacts[edit]

Summary

  • Impact for one small change (ICU folding) is small, but about 1 in 200 zero-results queries—or 1 in 1000 general queries—will get some results now; and even more queries will have additional good results (e.g., revolution francaise matching révolution française).

Background

  • I pulled a sample of 10K Wikipedia queries from November of 2022 (2 weeks' worth). I filtered obvious porn, urls, and other junk queries from each sample (229 queries; junk queries are the most common category in all cases) and randomly sampled 3000 queries from the remainder.

Other Notes

As a general rule, sequences of 6 consonants or more are a pretty good indicator of a junk query. There are some longer sequences—usually only 3 letters in English, usually involving s- or sh-like sounds, like "str" or "tch"—that can end up together when one ends a syllable and another starts the next syllable, as in "watchstrap". Other valid long sequences can occur when a language requires a lot of letters because of its spelling conventions. There's no consistent spelling for English "sh" and "ch" sounds across languages, so щ is transliterated into English as "shch", French as "chtch", and German as "schtsch". Add in a typo that deletes a consonant and you get false positive junk identifications.

I had previously had to add rules to reduce these kinds of clusters in Latin text, and this time I had to add some extra rules to reduce similar clusters in Cyrillic!
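
A minimal sketch of the kind of consonant-cluster heuristic involved (simplified character classes; the real rules have more exceptions):

 import re
 
 latin_junk = re.compile(r'[bcdfghjklmnpqrstvwxz]{6,}', re.IGNORECASE)
 cyrillic_junk = re.compile(r'[бвгджзйклмнпрстфхцчшщ]{6,}', re.IGNORECASE)
 
 print(bool(latin_junk.search('asdfghjkl')))   # True: keyboard mashing
 print(bool(latin_junk.search('watchstrap')))  # True: a false positive ("tchstr"), hence the extra rules
 print(bool(latin_junk.search('shchedrin')))   # False: "shch" is only four consonants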

Reindexing Results

  • The zero results rate dropped from 21.1% to 21.0% (-0.1% absolute change; -0.5% relative change).
  • The number of queries that got more results right after reindexing was 6.7%, vs. the pre-reindex control of 1.3–1.9% and post-reindex control of 0.7–1.2%.
  • The number of queries that changed their top result right after reindexing was 8.7%, vs. the pre-reindex control of 3.1–3.7% and post-reindex control of 0.0–0.5%.

Observations

  • All of the zero-results improvements are from ICU folding changes, of course.
  • There are definitely some ICU folding–related effects in the increased results and top-result changes, but some of that is also just the churn in a large wiki.
  • There are examples in both Latin script (revolution francaise/révolution française) and Cyrillic (қўшиқлари/кушиклари—note that қ and ў are not Russian letters).

Japanese / CJK Unpacking Notes (T326822)[edit]

I started with the usual 10K sample each from Wikipedia and Wiktionary for Japanese.

Data observations[edit]

  • Usual distribution of tokens—plenty of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers (including a new record-holder: 10⁶⁰, spelled out, with commas!), and—in Wiktionary—German words, the occasional Thai word, and IPA transcriptions, etc.
    • Of course, there are lots of two-character CJK tokens, since all longer tokens get broken up into bigrams.
  • So many scripts in Japanese Wiktionary! Including several I hadn't heard of before: Bamum, Elbasan, Lepcha, Mende Kikakui, Sora Sompeng, and Tirhuta.
  • I hadn't seen this before, but the standard lowercase filter converts Cherokee letters to their lowercase form, which is weird, because the lowercase forms are rarely used. icu_normalizer, which we use more often, converts lowercase Cherokee to uppercase.
  • There are smatterings of invisible characters in tokens: bidi marks (quite common on Arabic and Hebrew tokens, but generally all over the place), variation selectors, zero-width non-joiners, zero-width joiners, and zero-width no-break spaces.
  • The Japanese Wikipedia sample features tokens with Latin/Cyrillic homoglyphs, and mixed Latin/Greek tokens (stylistic and IPA transcriptions), and mixed Latin/Cyrillic Proto-Slavic tokens (where Cyrillic ъ is often used in transcriptions).

Analysis Observations[edit]

  • Both Japanese Wikipedia and Wiktionary have very few groups of distinct tokens that are grouped together by the cjk analyzer. The most common are Latin words (mostly English) and numbers in regular and ＦＵＬＬＷＩＤＴＨ forms. There are also a handful of regular and halfwidth Japanese characters (e.g. セ/ｾ or ラ/ﾗ).
    • However, I did also see our old friend Istanbul/İstanbul and a few other dotted-İ examples.
  • A weird tokenizing situation (not quite a bug) leads to a few other examples of tokens indexed together, and input tokens that generate (alleged) multiple output tokens. The cjk_width filter converts halfwidth Japanese and fullwidth Latin to their regular forms (ＦＵＬＬＷＩＤＴＨ → FULLWIDTH or ﾗ → ラ). However, there are no precomposed halfwidth forms with dakuten or handakuten diacritics. cjk_width smartly composes them, but doesn't have the bookkeeping information cjk_bigram would need to recover exact spans in the original text.
    • For example, halfwidth ﾗﾄﾞｳｨｯｸﾞ ("Ludwig") is 8 characters (spanning, say, offsets 0–8), with ﾄﾞ as two characters and ｸﾞ as two characters (a base character plus the halfwidth form of ゛). cjk_width converts it to ラドウィッグ, which is only six characters (ド and グ are now single characters); the offsets into the original text for the whole word are still 0–8. Since cjk_width is working on whole tokens, it doesn't track the offsets for each character of its output. cjk_bigram then converts the 6-character ラドウィッグ into 5 bigrams—ラド / ドウ / ウィ / ィッ / ッグ—but can't assign specific offsets to each bigram, so all five of them are marked as 0–8. Thus, ラドウィッグ generates (allegedly) 5 tokens. The incorrect offsets would affect highlighting, and could affect ranking, since all 5 tokens (ラド / ドウ / ウィ / ィッ / ッグ) are indexed as being on top of each other, rather than sequentially—but I haven't looked into the ranking implications. (See the sketch below.)
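
This is easy to see with the _analyze API (standard, cjk_width, and cjk_bigram are all built in, so no plugins are needed). A quick sketch, assuming a local Elasticsearch instance:

    # Sketch: cjk_width composes the halfwidth kana, then cjk_bigram emits five
    # bigrams that all report the offsets of the whole original 8-character token.
    # Assumes Elasticsearch is listening on localhost:9200.
    import requests

    body = {
        "tokenizer": "standard",
        "filter": ["cjk_width", "cjk_bigram"],
        "text": "ﾗﾄﾞｳｨｯｸﾞ",  # halfwidth "Ludwig", 8 characters
    }
    tokens = requests.post("http://localhost:9200/_analyze", json=body).json()["tokens"]
    for t in tokens:
        print(t["token"], t["start_offset"], t["end_offset"])
    # Expected, per the analysis above: ラド / ドウ / ウィ / ィッ / ッグ, all with offsets 0–8.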

Unpacking & Upgrades[edit]

  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff, plus some Japanese/CJK–specific normalizations:
    • Visually identical CJK characters (always "potentially" depending on your fonts)—like ⼈/人, ⼥/女, ⼦/子, ⾨/門, and ⾯/面.
    • Expansion of "square" characters, like ㍐ becoming ユアン (and bigrammed to ユア + アン).
    • Normalization of encircled characters, like ㋜/ス, ㊙/秘, and Ⓕ/F.
  • Unexpectedly, there were some new tokens with spaces in them! There are two sources:
    • The character ͺ (Greek ypogegrammeni) always gets normalized as " ι" (with a space).
    • Standalone dakuten and handakuten (゛/゜)—which are rarely used—get normalized by icu_normalizer as a space plus the combining form of the same diacritic. Even with the space, they are labeled as <KATAKANA>, so the space character gets swept up by cjk_bigram, generating bigrams with a random CJK character and a space. Fixed this with a char_filter to map them to the combining forms, which get regularized properly.
  • Saw the expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map. (Both char_filter fixes are sketched after this list.)
  • Enabled custom ICU folding, saw lots of the usual folding effects. Plus...
    • Exempted characters with dakuten and handakuten (voiced sound mark, semi-voiced sound mark, e.g. ざ and ぽ (vs さ and ほ)), and the chōonpu (prolonged sound mark, ー). Because of their distribution in the Unicode blocks (interspersed with the unaccented versions), I used a range that covered them all, but also included characters without dakuten and handakuten that don't get changed by ICU folding.
    • A few more visually identical CJK characters (always "potentially" depending on your fonts)—like ⻑/長 and ⻄/西. Why some are merged by ICU normalization and some by ICU folding, I couldn't say.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around—mostly Latin, with some Cyrillic, with a sprinkling of others.
    • A few digits in different character sets (e.g., １/१/௧/๑/1)
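
Both char_filter fixes are simple mapping filters. A minimal sketch (filter names and exact mappings here are illustrative, not necessarily the production config):

    # Sketch of the two mapping char_filters described in the list above.
    import json

    char_filters = {
        "kana_sound_marks": {
            "type": "mapping",
            "mappings": [
                "\u309b=>\u3099",  # standalone dakuten ゛ -> combining dakuten
                "\u309c=>\u309a",  # standalone handakuten ゜ -> combining handakuten
            ],
        },
        "dotted_I_fix": {
            "type": "mapping",
            # map İ to plain I before lowercasing, so we get "i" rather than "i" + combining dot
            "mappings": ["İ=>I"],
        },
    }

    print(json.dumps(char_filters, ensure_ascii=False, indent=2))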

Stopwords[edit]

The CJK analyzer uses a slightly altered list of English stopwords. It adds s, t, and www, and drops an. Since stopwords are searchable on-wiki because they are indexed in the "plain" field, I just used the default _english_ stopword config to keep the unpacking simple.
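
In raw config terms, that just means using the predefined set rather than recreating the cjk variant; a minimal sketch:

    # Sketch: use the predefined _english_ stopword set instead of reproducing
    # the cjk analyzer's altered list (which adds "s", "t", "www" and drops "an").
    english_stop = {"type": "stop", "stopwords": "_english_"}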

ICU Tokenizer[edit]

Internally, our configuration has had Japanese set to use the ICU tokenizer by default for a while. The monolithic CJK analyzer used by the "text" field could not be modified by our code, so it used the standard tokenizer, while the "plain" field used the ICU tokenizer. Unpacking the CJK analyzer makes the "text" analyzer subject to modification, including the ICU tokenizer config.

The filter that actually creates the bigrams (cjk_bigram) merges and splits incoming tokens as it sees fit, so it negates any benefit the ICU tokenizer provides when parsing CJK text. The ICU tokenizer also does a few things that are not great (breaking multi-script tokens, and weird things with tokens that have numbers in multi-script text), so it's better not to use it in the "text" field / unpacked cjk analyzer.

However, the "plain" field does benefit from the ICU tokenizer, so I figured out a scheme to split the config so they can use different tokenizers.
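
Roughly, the split looks like this (analyzer definitions and filter lists are trimmed and illustrative; the real config is generated by our configuration code):

    # Sketch: the unpacked "text" chain keeps the standard tokenizer, since
    # cjk_bigram re-segments CJK runs anyway; the "plain" chain gets icu_tokenizer.
    analyzers = {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["cjk_width", "lowercase", "cjk_bigram", "stop"],
        },
        "plain": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_normalizer"],
        },
    }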

Overall Impact[edit]

Overall, there was a much bigger impact on my sample from Japanese Wiktionary than the sample from Japanese Wikipedia, largely because it has a larger percentage of Latin and other non-CJK text (where ICU normalization & ICU folding have the most effect).

  • Japanese Wiktionary: 5,485 tokens (1.421% of tokens) were merged into 2,593 stemming groups (2.758% of stemming groups)
  • Japanese Wikipedia: 1,884 tokens (0.015% of tokens) were merged into 650 stemming groups (0.138% of stemming groups)
    • In the Japanese Wiktionary sample, the vast majority of mergers are Latin tokens, such as acquisito/acquīsītō, plus a decent handful of Cyrillic, and a sprinkling of others, like Bopomofo, Greek, Hebrew, Arabic, and others. The Japanese/CJK tokens that merge are the rare ones that are normalized (e.g., ㍐ with アン and ユア, or the visually identical characters mentioned above).
    • In the Japanese Wikipedia sample, the distribution was similar, but the overall number much smaller.
  • Net token count differences were very small, less than ±0.05%, since not many CJK tokens were affected.
  • ICU folding is the biggest source of changes in all wikis—as expected.
    • Lots of bidi marks removed, along with some zero-width non-joiners.
    • Cherokee letters were (properly) uppercased—though it makes no difference in searchability.

Unpacking + ICU Norm + ICU Folding Impact on Japanese Wikipedia & Wiktionary (T327720)[edit]

Wikipedia Reindexing Results

  • The zero results rate was unchanged at 4.3% (0.0% absolute change; 0.0% relative change).
  • The number of queries that got more results right after reindexing was 3.4%, vs. the pre-reindex control of 0.2–1.0% and post-reindex control of 0.6–1.0%. (Largely due to the time between the before and after stages.)
  • The number of queries that changed their top result right after reindexing was 1.7%, vs. the pre-reindex control of 0.5–1.0% and post-reindex control of 0.0%.

Wiktionary Reindexing Results

  • The zero results rate was unchanged at 16.1% (0.0% absolute change; 0.0% relative change).
  • The number of queries that got more results right after reindexing was 0.6%, vs. the pre-reindex control of 1.3–2.1% and post-reindex control of 0.5–1.9%.
  • The number of queries that changed their top result right after reindexing was 1.0%, vs. the pre-reindex control of 0.9–1.3% and post-reindex control of 1.0–1.1%.

Observations

  • There were very few queries affected overall beyond the usual level of expected noise, with virtually none in the Wiktionary query data. I guess while the Wiktionary content has a lot more non-Japanese text, the queries don't.
  • The biggest change in results was in the Wikipedia query sample, with a few queries that lost thousands of results—up to almost 45K results (75%). My heart skipped a beat when I saw the number, but looking into it, it's a good thing. It's a change in stop words; s and t are no longer stop words, so the query s-hybrid, for example, matches an article discussing a car called s-hybrid, rather than the article on "HYBRID UNIVERSE".
  • While the zero-results change for the Wikipedia query sample was "0.0%", there was actually one query that went from no results to some results: muso jikiden matching Musō Jikiden, which is what we are looking for from ICU folding.

Armenian, Latvian, Hungarian Notes (T325089)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary) etc.
    • The Armenian Wikipedia sample has quite a few Latin and Cyrillic tokens.
  • Stemming observations:
    • Hungarian Wikipedia had 99 distinct tokens in its largest stemming group.
    • Latvian Wikipedia had 30 distinct tokens in its largest stemming group.
    • Armenian Wikipedia had 93 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • A smattering of mixed-script tokens.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • Hungarian: Lots of bidi marks and soft hyphens
      • Latvian: More than the usual number of Latin/Cyrillic homoglyph tokens
      • Armenian: Also a fair number of Latin/Cyrillic homoglyph tokens, and some Armenian tokens! (See below.)

The ICU normalizer converts և to եւ (which, depending on your Armenian font, could look fairly different or almost identical). Historically, և was a ligature of ե + ւ, but it is now its own letter—sort of:

  • It has no uppercase form—the first letter of եւ is capitalized, giving Եւ.
  • On my Mac, using the find function in several applications finds և and եւ interchangeably, in the same way that searching for X typically finds x and vice versa.
  • On English Wiktionary, several forms with ե + ւ are redirects to forms with և. For example, արեւածագ → արևածագ
  • On Armenian Wiktionary, I didn't see a redirect, but searching, for example, for արեւածագ finds արևածագ anyway because the ICU normalizer is in use in the plain field.

For all these reasons, I decided to let the ICU normalizer continue to convert և to եւ.

One side effect is that two stop words, նաև & և, get normalized to նաեւ & եւ and aren't caught by the stopword filter—so I added an additional little stopword filter to pick those up, too.
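
A minimal sketch of that extra filter (the name is illustrative), placed after the ICU normalizer so it sees the normalized forms:

    # Sketch: a tiny secondary stop filter for the two stop words whose
    # ICU-normalized forms (with եւ) slip past the standard Armenian list.
    armenian_norm_stop = {"type": "stop", "stopwords": ["եւ", "նաեւ"]}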

  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [Áá, Éé, Íí, Óó, Öö, Őő, Úú, Üü, Űű] for Hungarian. I see some examples of possible typos and imperfect stemming, but I talked to Gergő about it, and a blanket removal of accents is not a good idea!
    • Exempted [Āā, Čč, Ēē, Ģģ, Īī, Ķķ, Ļļ, Ņņ, Šš, Ūū, Žž] for Latvian.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Armenian: Most changes are for Latin tokens. In the Wiktionary sample, they are of the sort where ju·ris·pru·dent now matches jurisprudent—Wiktionary likes their middle-dot syllabification! In the Wikipedia sample, it's mostly the usual diacritics on many Latin and fewer Cyrillic words.
    • Hungarian & Latvian: Mostly the usual diacritics on many Latin words, especially pronunciations in Wiktionary.

Overall Impact[edit]

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
    • Armenian Wikipedia had a few hundred net losses from normalized stop words being filtered.
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (0.1% to 1.1% of tokens being redistributed into 0.4% to 2.4% of stemming groups).
    • Armenian Wiktionary: 984 tokens (0.460% of tokens) were merged into 193 groups (0.380% of groups)
    • Armenian Wikipedia: 22,008 tokens (1.093% of tokens) were merged into 946 groups (0.530% of groups)
      • The Armenian Wikipedia token numbers are rather inflated because when an infrequent token with ե + ւ (like պարգեւատրվել) is merged with much more frequent tokens with և (like պարգևատրվել), my counting heuristics consider the 300+ և tokens to be merged into the 2 ե + ւ tokens, rather than the other way around, because the merged stem is the ե + ւ version.
    • Hungarian Wiktionary: 2,062 tokens (0.889% of tokens) were merged into 882 groups (2.259% of groups)
    • Hungarian Wikipedia: 4,315 tokens (0.163% of tokens) were merged into 1,403 groups (0.598% of groups)
    • Latvian Wiktionary: 1,128 tokens (1.244% of tokens) were merged into 581 groups (2.429% of groups)
    • Latvian Wikipedia: 1,836 tokens (0.098% of tokens) were merged into 809 groups (0.630% of groups)

Armenian, Latvian, Hungarian Reindexing Impacts[edit]

General Notes[edit]

Summary

  • Small impact in Armenian, with most relevant being Armenian-specific ե+ւ/և folding.
  • Small impact in Latvian, possibly because there are a lot of characters blocked from ICU folding.
  • Moderate impact in Hungarian, due to ICU folding.

Background

  • I pulled a sample of 10K Wikipedia queries from Nov-Dec of 2022 (1 month each). I filtered obvious porn, urls, and other junk queries from each sample (Armenian:129 (mostly URLs), Latvian:127 (mostly junk and URLs), Hungarian:121 (mostly junk)) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Armenian Wikipedia (T327801)[edit]

Reindexing Results

  • The zero results rate dropped from 36.7% to 36.6% (-0.1% absolute change; -0.3% relative change).
  • The number of queries that got more results right after reindexing was 6.5%, vs. the pre-reindex control of 0.0–0.3% and post-reindex control of 0.0–0.4%.
  • The number of queries that changed their top result right after reindexing was 1.1%, vs. the pre-reindex control of 0.6% and post-reindex control of 0.0%.

Observations

  • The most common cause of improvement in zero-results is ICU folding.
  • The most common cause of an increased number of results and changes in the top result is ICU folding, split between general diacritic folding and Armenian-specific ե+ւ/և folding.

Unpacking + ICU Norm + ICU Folding Impact on Latvian Wikipedia (T327801)[edit]

Reindexing Results

  • The zero results rate dropped from 30.6% to 30.4% (-0.2% absolute change; -0.7% relative change).
  • The number of queries that got more results right after reindexing was 4.0%, vs. the pre-reindex control of 0.0% and post-reindex control of 0.0%.
  • The number of queries that changed their top result right after reindexing was 0.7%, vs. the pre-reindex control of 0.1–0.3% and post-reindex control of 0.0%.

Observations

  • The most common cause of all changes here is ICU folding.

Unpacking + ICU Norm + ICU Folding Impact on Hungarian Wikipedia (T327801)[edit]

Reindexing Results

  • The zero results rate dropped from 21.3% to 20.9% (-0.4% absolute change; -1.9% relative change).
  • The number of queries that got more results right after reindexing was 12.2%, vs. the pre-reindex control of 0.4–0.5% and post-reindex control of 0.1–0.4%.
  • The number of queries that changed their top result right after reindexing was 1.9%, vs. the pre-reindex control of 0.8–0.9% and post-reindex control of 0.0%.

Observations

  • The most common cause of all changes here is ICU folding.

Bulgarian, Persian, Lithuanian Notes (T325090)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary) etc.
    • In the Bulgarian Wiktionary sample, long tokens are mostly Cyrillic, and there are lots of mixed-script tokens (lots of Latin/Cyrillic, but also a few Cyrillic/Greek, Greek/Latin, and a couple with all three!), and very few IPA tokens
    • In the Bulgarian Wikipedia sample, two of the most common Latin tokens are "Wayback" and "Machine"!
    • The Persian samples have a few 0-length tokens, which are usually from tatweel characters, as we saw with Arabic, or other Arabic script diacritics standing alone.
    • In the Lithuanian samples, I see that č is stemmed to t. Looks like a stemmer gone a little overboard, in that it is normalizing an ending that has no root! Not the worst thing a stemmer ever did, though.
  • Stemming observations:
    • Bulgarian Wikipedia had 17 distinct tokens in its largest stemming group.
    • Persian Wikipedia had 11 (!) distinct tokens in its largest stemming group.
    • Lithuanian Wikipedia had 129 (!!) distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • Rather a lot of mixed-script tokens in the Bulgarian wikis. 1.11% of unique tokens (0.25% of all tokens—i.e., 1 in 400) in the Bulgarian Wiktionary sample generated 2 or 3 tokens because of homoglyphs.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • Bulgarian: homoglyphs and soft hyphens
      • Persian: bidi marks and zero-width joiners
      • Lithuanian: soft hyphens
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [Йй] for Bulgarian.
    • Exempted [Ąą, Čč, Ęę, Ėė, Įį, Šš, Ųų, Ūū, Žž] for Lithuanian.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Bulgarian: A lot of mergers from both Latin diacritic removal and removal of Cyrillic stress accents used to show pronunciation (e.g., практѝчност vs практичност)—plus a handful of Greek diacritic removals and others.
    • Persian: A big chunk of mergers from removal of Latin diacritics, plus a lot of Persian token mergers, mostly from the same kinds of folding we saw in Arabic[1]. At worst these seem to be merging Arabic cognates or Arabic spellings of names, which seems reasonable.
    • Lithuanian: Mergers are mostly non-Lithuanian Latin diacritics, with a sprinkling of others.

Overall Impact[edit]

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
    • Bulgarian had its largest increase in tokens—hundreds of extra tokens in both samples from homoglyphs—though that's still a small percentage of tokens overall.
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (0.21% to 0.85% of tokens being redistributed (or created from homoglyphs and distributed) into 0.38% to 1.4% of stemming groups).
    • Bulgarian Wiktionary: 338 tokens (0.224% of tokens) were merged into 246 groups (0.702% of groups)
    • Bulgarian Wikipedia: 4,536 tokens (0.214% of tokens) were merged into 1,955 groups (1.114% of groups)
    • Persian Wiktionary: 1,089 tokens (0.852% of tokens) were merged into 576 groups (1.407% of groups)
    • Persian Wikipedia: 7,208 tokens (0.717% of tokens) were merged into 821 groups (0.902% of groups)
    • Lithuanian Wiktionary: 116 tokens (0.076% of tokens) were merged into 78 groups (0.379% of groups)
    • Lithuanian Wikipedia: 2,425 tokens (0.159% of tokens) were merged into 1,140 groups (1.035% of groups)

Bulgarian, Persian, Lithuanian Reindexing Impacts[edit]

General Notes[edit]

Summary

  • Moderate impact in Bulgarian, split between ICU folding diacritics and homoglyph matching.
  • Moderate impact in Persian, due to ICU folding of Persian diacritics.
  • Minimal impact in Lithuanian, but a few extra results from ICU folding.

Background

  • I pulled a sample of 10K Wikipedia queries from Nov-Dec of 2022 (1 month for each). I filtered obvious porn, urls, and other junk queries from each sample (Bulgarian:109 (spread across various categories), Persian:885 (mostly porn), Lithuanian:116 (mostly junk)) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Bulgarian Wikipedia (T328315)[edit]

Reindexing Results

  • The zero results rate dropped from 27.0% to 26.5% (-0.5% absolute change; -1.9% relative change).
  • The number of queries that got more results right after reindexing was 6.4%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 1.5%, vs. the pre-reindex control of 0.4–0.7% and post-reindex control of 0.0–0.1%.

Observations

  • The most common causes of improvement in zero-results are ICU folding diacritics and homoglyph matching (almost 50/50).
  • The most common causes of an increased number of results are mostly homoglyph matching and some ICU folding.
  • The most common cause of changes in the top result is ICU folding diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Persian Wikipedia (T328315)[edit]

Reindexing Results

  • The zero results rate dropped from 22.8% to 22.5% (-0.3% absolute change; -1.3% relative change).
  • The number of queries that got more results right after reindexing was 10.5%, vs. the pre-reindex control of 0.1–0.3% and post-reindex control of 0.3–0.5%.
  • The number of queries that changed their top result right after reindexing was 8.7%, vs. the pre-reindex control of 1.8% and post-reindex control of 0.1–0.2%.

Observations

  • The most common cause of improvement in zero-results, increased number of results, and changes in the top result is ICU folding of Persian diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Lithuanian Wikipedia (T328315)[edit]

Reindexing Results

  • The zero results rate was unchanged at 32.2% (0.0% absolute change; 0.0% relative change).
  • The number of queries that got more results right after reindexing was 9.0%, vs. the pre-reindex control of 0.0–0.6% and post-reindex control of 0.1–0.3%.
  • The number of queries that changed their top result right after reindexing was 0.7%, vs. the pre-reindex control of 0.2% and post-reindex control of 0.0%.

Observations

  • The zero-results rate seemed to have some random fluctuation: two queries went from one result to no results, one went from no results to one result, and one went from no results to six results—that last one was thanks to ICU folding!
  • The most common cause of an increased number of results and changes in the top result is ICU folding.

Romanian and Sorani Notes (T325091)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • No Wiktionary sample for Sorani (ckb) because it is still in the incubator. Since Wiktionary data often has more of a variety of scripts and tokens in it, I created a multi-Wiktionary sample of 200 lines each from 47 other Wiktionary samples I have. It won't be representative of the Sorani Wiktionary, but it should reveal any unexpected problems from processing non-Sorani text.
  • Stemming observations:
    • Usual distribution of tokens in general.
    • Sorani Wikipedia had 39 distinct tokens in its largest stemming group.
      • There is a lot of normalization of characters and invisibles happening in the Sorani chain, apparently in the aptly named sorani_normalization filter, which runs even before lowercasing. (Elastic docs.)
    • Romanian Wikipedia had 33 distinct tokens in its largest stemming group.
      • A fair number of Cyrillic/Latin homoglyphs, and one Cyrillic/Greek/Latin token.
        • Point of interest: that is one messed-up word, search-wise: Aλоύταϛ, with a Latin A and a Cyrillic о, and then to top it off the final character ϛ is stigma (a sigma/tau ligature—i.e., σ (or ς) + τ), not a regular final sigma (ς). It was too much, so I fixed it from my volunteer account.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • A smattering of mixed-script tokens.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • In the Sorani Wikipedia data, there are lots of initial/final/isolated forms of Arabic characters (special forms for guaranteeing a particular version of a letter rather than leaving it up to the font rendering in context). These get folded by ICU normalization to their "normal" forms, mostly leading to good collisions, and a few lost tokens as stop words get normalized into a recognizable form.
      • In the Romanian Wikipedia data, there are lots of invisibles, particularly word joiners and soft hyphens—lots more than I usually see. It's a small number of tokens (only 0.060%), but almost 0.6% of stemming groups had a term added to them after ICU normalization.
      • ... plus handfuls of the usual various character regularizations in both.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Romanian: exempted [ĂăÂâÎîȘșȚț]. Most impact is the expected Latin normalization.
    • Sorani: Nothing exempted; none of the letters in the Sorani alphabet are changed by ICU normalization.
      • More than half of the impact is on Latin tokens, but still plenty on Arabic-script tokens. Normalizations seem to be mostly on Arabic or Persian words/letters.

Romanian ş & ţ vs ș & ț[edit]

  • While reading up on the Romanian alphabet, I learned that there is a common confusion between ş and ţ (with cedilla, not officially Romanian letters) and ș and ț (with comma, the correct Romanian letters), and there are plenty of examples on Romanian Wikipedia. So I fixed it with a character filter. (See Part III below.)
  • The impact on Romanian Wiktionary is small, only a couple dozen tokens, roughly even between splits and collisions.
  • The impact on Romanian Wikipedia is larger, with over a thousand tokens affected. The impact is obviously positive. As an example, one can debate whether Stefan and Ștefan should be kept separate (they are), but it's clear that Ştefan goes better with the latter, not the former (and so it will!).

Romanian Stop Words (ş & ţ vs ș & ț Redux!)[edit]

An odd thing I almost didn't notice was that 160 tokens of toţi (1) & şi (159) were no longer being dropped by the Romanian analysis chain after adding the comma/cedilla mapping. Both have cedillas, and neither is a "proper" Romanian word; they should be toți & și, with commas. (My guess is that the list is old, and comes from the before times, when the cedilla versions were all that was available in many fonts/operating systems, and the list has just been carried along for years without being updated.)

I looked at the Romanian stop word list on GitHub and to my surprise, it only uses the cedilla forms of words! I converted the cedilla forms to comma forms and added a secondary stopword filter with just those forms.

The indexing impact is enormous!

  • In my Romanian Wiktionary sample, 2,190 tokens (-1.427%) were dropped, with 2,104 being și, which means "and".
  • In my Romanian Wikipedia sample, 65,223 tokens (-3.406%) were dropped, with 61,867 being și.

We'll have to see how it shakes out in post-reindexing testing of queries, but this could have an impact on both precision (și / "and" won't be required to match) and ranking (și / "and" will be heavily discounted).

Imagine searching for Bosnia și Herțegovina ("Bosnia and Herzegovina") and having the și/and be weighted as much as either Bosnia or Herzegovina. That can't be helping.

I will also try to notify the creator of the stop word list and make a pull request upstream.

Late Breaking Development!—Romanian Stemmer (ş & ţ vs ș & ț Part IΙІ—T330893)[edit]

I opened a pull request upstream for additions to the Lucene Romanian stop word list and a dev there pointed out that the Snowball stemmer is also pre–Unicode 3 and uses the cedilla forms instead of the comma form. It rubs me the wrong way, but the quickest fix is to convert the comma forms to the cedilla forms so the stemmer will do the right thing, and have the internal representation be the (incorrect) cedilla forms. Users don't see them, but it's still a little weird, so I've liberally sprinkled comments in the code.
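
The end result is a small mapping char_filter that runs before the stemmer and stopword filters; a minimal sketch (the name is illustrative):

    # Sketch: map the correct comma forms to the legacy cedilla forms, because the
    # Snowball Romanian stemmer and the Lucene stopword list predate Unicode 3 and
    # only know the cedilla spellings. Internal representation only; users never
    # see these tokens.
    romanian_comma_to_cedilla = {
        "type": "mapping",
        "mappings": [
            "ș=>ş", "Ș=>Ş",  # s with comma below -> s with cedilla
            "ț=>ţ", "Ț=>Ţ",  # t with comma below -> t with cedilla
        ],
    }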

The original mapping from cedilla forms to comma forms actually caused a small bit of trouble, but it was so small that I didn't interpret it correctly at the time. Basically, the stemmer properly stemmed them, but they are so uncommon that it had very little impact. (In my Wiktionary sample, only 8 tokens (0.005% of tokens) were lost from 7 groups (0.016% of groups), and in my Wikipedia sample, only 754 tokens (0.039% of tokens) were lost from 101 groups (0.075% of groups).)

The reverse mapping, which allows the stemmer to operate on words with comma forms, has a much bigger impact:

  • Romanian Wiktionary: 1,424 tokens (0.941% of tokens) were merged into 601 groups (1.400% of groups)
  • Romanian Wikipedia: 33,978 tokens (1.837% of tokens) were merged into 1,778 groups (1.313% of groups)

So 0.9%–1.8% of tokens were not being stemmed correctly, and now will be!

Overall Impact[edit]

  • Sorani Wikipedia ICU folding is the biggest source of changes—as expected.
    • 764 tokens (0.070% of tokens) were merged into 337 groups (0.273% of groups)
    • A few hundred Sorani tokens were dropped, almost all normalized to stop words, plus a handful of solo modifiers.
  • For both Romanian samples, a few homoglyph tokens were gained, and a handful of solo modifier tokens were dropped. The biggest impact is from merging ş/ţ and ș/ț.
    • Stats below marked "pre–stemmer hack" are the superseded values; there's now a bigger impact because of the reverse cedilla/comma mapping and its interaction with the stemmer. See ş & ţ vs ș & ț Part III above.
  • Romanian Wiktionary:
    • (pre–stemmer hack) 959 tokens (0.625% of tokens) were merged into 656 groups (1.499% of groups)
    • 2,428 tokens (1.583% of tokens) were merged into 1,237 groups (2.827% of groups)
    • Net 2,143 tokens (-1.397%) were dropped, most of which were stop words with s/t with comma (mostly și).
  • Romanian Wikipedia:
    • (pre–stemmer hack) 8,586 tokens (0.448% of tokens) were merged into 2,555 groups (1.849% of groups)
    • 69,431 tokens (3.626% of tokens) were merged into 4,148 groups (3.002% of groups)
    • Net 65,022 tokens (-3.396%) were dropped, mostly comma stop words, mostly și.

Romanian/Sorani Reindexing Impacts[edit]

General Notes[edit]

Summary

  • Romanian Wikipedia saw a big impact in the increase in number of results for more than 1 in 4 queries because of the merger of ş/ș and ţ/ț forms, which affects queries directly, but also affects stemmed matches for words in queries and in the article text. There was also a nice (but more standard) change to zero-results rate from diacritic folding.
  • Sorani Wikipedia also had a nice but normal-range change to zero-results rate, from Arabic-script normalizations, and shows some instability for <1% of queries because of its smaller size.

Background

  • I pulled a sample of 10K Wikipedia queries from late 2022/early 2023 (one month for Romanian and three months for Sorani). I filtered obvious porn, urls, and other junk queries from each sample (Romanian:164, lots of porn, URLs, and junk; Sorani:148, mostly URLs) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Romanian Wikipedia (T330783)[edit]

Reindexing Results

  • The zero results rate dropped from 24.4% to 23.7% (-0.7% absolute change; -2.9% relative change).
  • The number of queries that got more results right after reindexing was 27.5%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 4.9%, vs. the pre-reindex control of 0.3–0.5% and post-reindex control of 0.0%.

The zero results rate change is nice, but 1 in 4 queries getting more results is huge, and shows the impact of the comma/cedilla merger (i.e., ş matching ș, and ţ matching ț) and the stemmer only working on the uncommon/incorrect comma forms. The change not only affects queries with those characters in the query, but also those that match other words in the wiki content. There are Romanian inflections with ş/ș or ţ/ț that are now being stemmed correctly, increasing matches with other words that were always stemmed correctly.

Observations

  • The most common cause of improvement in zero-results is foreign diacritic folding, which usually resulted in a smaller number of matches; a few cases of ş/ș or ţ/ț folding resulted in more matches per query, because moderately common words with the cedilla forms could fail to exact match anything, or the comma form could fail to stem properly.
  • The most common cause of an increased number of results is ş/ș or ţ/ț folding, again with cedilla forms in queries, or comma forms in inflected words that previously weren't being stemmed.
    • A small number of queries got fewer results because they used comma forms, which are not recognized by the stemmer without the comma/cedilla merger. This put them in a different stemming group for a different, more common word.
  • The most common cause of changes in the top result is a moderate to large increase in the total number of results, so the same as that above. In cases where there's no change in the number of results, the most common change is swapping the top two results, which I assume is a general change of stats from the mergers caused by ş/ș and ţ/ț folding.

Unpacking + ICU Norm + ICU Folding Impact on Sorani Wikipedia (T330783)[edit]

Reindexing Results

  • The zero results rate dropped from 38.8% to 38.3% (-0.5% absolute change; -1.3% relative change).
  • The number of queries that got more results right after reindexing was 3.3%, vs. the pre-reindex control of 0.0% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 2.3%, vs. the pre-reindex control of 0.7–0.9% and post-reindex control of 0.0%.

Observations

  • The improvements in zero-results are mostly due to Arabic-script normalizations, with one curly quote being straightened in a Latin-script query.
  • The increases in the number of results are also mostly due to Arabic-script normalizations.
  • The most common cause of changes in the top result not attributable to more results seems to be randomness. There was a 0.7%–0.9% change in top results over 15-minute intervals before the reindex, and some of the top-result changes swapped the top two results, which swapped back when I checked them, and then swapped again when I reloaded the page a little later—indicating that the stats for some words/queries are not super stable, which we expect to some degree between servers on smaller wikis.

Turkish Notes (T329762)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for Turkish.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, numbers, IPA transcriptions (in Wiktionary), and words joined by periods (in Wikipedia), etc.
  • Stemming observations:
    • Turkish Wikipedia had 77 distinct tokens in its largest "real" stemming group.
      • The largest stemming group has 149 distinct tokens that stem to d. All of them except plain "d" start with d', and almost all of them are either obviously French or Italian (d'immortalite), or names (D'Souza).
      • The next largest stemming group has 141 distinct tokens that stem to l. Almost all are French, except for plain l.
      • See notes on Turkish Apostrophe below for more details.
    • The stemmer is a little aggressive.
      • Three tokens, ları, leri, and siniz, are reduced to empty strings. They are all common endings, but it seems like there should be some sort of stem left.
      • In Turkish there are some voiced/devoiced alternations (e.g., p/b, k/ğ—though I can't find any "normal"-looking t/d alternations) in the final consonant of a stem, when some endings starting with a vowel are added. So we have ahşap/ahşaba and alabalık/alabalığı, and we generally see b→p and ğ→k in stems. It gets a little too aggressive when the "stem" is only one letter! So, kına and ğına are both stemmed as k, and bunsen and puma (neither of which I suspect are native Turkish words) are stemmed to p. Despite no other evidence of t/d alternations in normal stems, tümüyse and dinden (and dine and time) are stemmed to t.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.

Side Quest: Apostrophes[edit]

Apostrophes are used in Turkish to separate proper names from suffixes attached to them—e.g., Türkiye'den, "from Turkey"—presumably because without the apostrophe the boundary between an unfamiliar name and the suffixes could be ambiguous. English does something similar with a's, i's, and u's—which are the plurals of a, i, and u—to distinguish them from words as, is, and us, or with other unusually spelled forms like OK'd which is used as the past tense of to OK something.

The Elastic/Lucene apostrophe token filter removes the first apostrophe it finds in a word, plus everything after the apostrophe. This is disastrous for non-Turkish words and names, like D'Artagnan, d'Ivoire, and d'Urbervilles (which are all reduced to d) or O'Connell, O'Keefe, and O'Sullivan (which are all reduced to o, which is also a stopword!). It's even more inappropriate to chop everything after the first apostrophe in cases like O'Connor'a, O'Connor'dan, O'Connor'un, and O'Connor'ın, where the second apostrophe is clearly the one delineating the proper name from the suffixes.

Processing text can be complex, but there are some additional fairly straightforward cases of inappropriate chopping by the apostrophe filter:

  • Perhaps more subtly, bābā'ī and arc'teryx shouldn't be chopped, because ī and x are not native Turkish letters, and so they are not going to appear in Turkish suffixes.
  • On the other hand, egregiously, επ'ευκαιρία, прем'єр, and ג'אלה are not even in the Latin alphabet, and so are clearly not Turkish, but they are still subject to chopping by the apostrophe filter.

I noticed a number of other patterns looking at words with apostrophes in the Turkish Wikipedia and Wiktionary samples:

  • Certain pre-apostrophe strings are almost all French or Italian elision (like d' and l'): l', d', dell', j', all', nell', qu', un', sull', dall', plus the French double elisions j'n' and j't'.
    • An exception to the French/Italian elision pattern is when the elision-looking pre-apostrophe string is followed by a common Turkish suffix post-apostrophe. So, d'de, d'den, d'nin, and d'ye are probably Turkish words about the letter d, and not French.
      • My favorite is d'nin, since both d' and 'nin mean "of"—so it's either "of nin" or "of d." In a Turkish context, we assume "of d" is more likely.
      • An exception to the exception is d'un. While un is a Turkish suffix, d'un is "of a" in French and it is just too likely to be French. Similarly with l'un, qu'un, s'il, and qu'il.
  • Many characters get used as apostrophes: modifier apostrophe ʼ, left curly quote ‘, right curly quote ’, grave accent `, acute accent ´, modifier grave accent ˋ, and modifier acute accent ˊ all occur. I didn't have the fullwidth apostrophe (＇) in my sample, but it is so unambiguous that I included it.
    • The standard tokenizer and icu_tokenizer split on the non-modifier versions of grave and acute accents (`, ´), but they still occur in Turkish Wikipedia in place of apostrophes, so I include them.
  • Certain words (and morphemes) show up with apostrophes often enough to get special cases.
    • Words related to Kur'an/Qur'an that start with /^([kq]ur)'([aâā]n)/
    • Words ending with (generally English) -n't, e.g. ain't
    • Words with (generally English) -'n'-, e.g., rock'n'roll
    • Words with -'s'- (which is almost always English 's + apostrophe + Turkish suffix(es)), e.g., McDonald's'ın
  • As mentioned above, when a word has two or more apostrophes, only the last one could possibly be the one followed by suffixes, e.g., nuku'alofa'nın

Side Note: Turkish Suffixes[edit]

Turkish is an agglutinating language, so it can really pile on the suffixes. Without trying to break down multiple suffixes, I analyzed the most common post-apostrophe endings in my samples.

In the Wikipedia sample, the top 10 endings accounted for ~44% of words with apostrophes, while in the Wiktionary data, it was ~72%.

    Top N endings   Wikipedia tokens   Wikipedia %   Wiktionary tokens   Wiktionary %
    10              74,676             44.30%        6,667               71.90%
    20              106,455            63.20%        7,478               80.60%
    30              121,020            71.80%        7,810               84.20%
    40              127,952            75.90%        8,030               86.60%
    50              131,822            78.20%        8,165               88.10%
    60              134,136            79.60%        8,239               88.80%
    100             137,819            81.80%        8,386               90.40%
    all             168,470            100%          9,273               100%

I decided to take into account the top ~50 endings from each sample, plus any obvious variants (due to vowel harmony and voiced/devoiced variation mentioned above), resulting in about 90 post-apostrophe endings counting as "common" endings.

  • Any word with a combination of those endings after an apostrophe and not accounted for by something above is probably a Turkish word.
    • In practice, I didn't see more than four of the endings stacked (note that some endings may be two suffixes).
    • There are some pretty complex cases, though: b'dekilere = "to those in b"
    • There are some false positives, such as in + di + a + na and i + ta + li + a, but (a) some are caught by other patterns (d'italia is probably not Turkish) and (b) no simple set of rules is going to catch everything.
  • Words not accounted for by any of the above that have one letter before the apostrophe (e.g., s'appelle) are generally not Turkish words. (The Turkish cases, like b'dekilere, got caught by the previous rule.)
    • It's hard to divine the structure of these various non-Turkish words from unknown sources, so removing the apostrophe seems like the best thing to do. s'appelle would be ideally analyzed as appelle, but we're doing Turkish here, not French, and sappelle is much better than s.
  • As mentioned above, apostrophes followed by non-Turkish letters—which could be Cyrillic, Hangul, Hebrew, or Greek, or even Latin letters like ī, ë, or ß, or even q, x, or w—do not separate proper nouns and suffixes.
    • The list of Turkish letters includes abcçdefgğhıijklmnoöprsştuüvyz, plus âîû, which are commonly used diacritics in Turkish words.
    • Non-Turkish letters before the apostrophe are fine. You have to inflect non-Turkish words sometimes, too: Þórunn'nun, α’nın, дз'dır, բազկաթոռ'dan, قاعدة‎'nin.
  • There are a small number of fairly common pre-apostrophe strings that are almost never intended to be inflected words; that is, they are almost always part of a longer name or foreign word. They are ch, ma, ta, and te (e.g., ch'ang and ta'rikh).

Obviously, these heuristics interact with each other and have to be carefully ordered when applied.
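
To give a flavor of how a few of these rules combine, here is a simplified Python sketch of the logic (not the actual better_apostrophe token filter; it omits most of the rules above, including the elision and common-suffix checks):

    # Simplified, illustrative sketch of a few of the apostrophe heuristics,
    # applied in order; the real better_apostrophe filter has many more rules.
    import re

    APOSTROPHE_LIKE = r"['’‘`´ˊˋʼ＇]"   # common apostrophe look-alikes
    TURKISH_LETTERS = r"[abcçdefgğhıijklmnoöprsştuüvyzâîû]"

    def better_apostrophe_sketch(token: str) -> str:
        t = re.sub(APOSTROPHE_LIKE, "'", token.lower())    # fold look-alikes to '
        t = re.sub(r"^([kq]ur)'([aâā]n)", r"\1\2", t)      # Kur'an / Qur'an family
        if "'" not in t:
            return t
        head, _, suffix = t.rpartition("'")                # only the last ' can precede suffixes
        if re.fullmatch(TURKISH_LETTERS + "+", suffix):    # plausibly Turkish suffixes?
            return head.replace("'", "")                   # drop them, keep the rest whole
        return t.replace("'", "")                          # otherwise just remove apostrophes

    print(better_apostrophe_sketch("O'Connor'dan"))  # oconnor
    print(better_apostrophe_sketch("Kur'an"))        # kuran
    print(better_apostrophe_sketch("επ'ευκαιρία"))   # επευκαιρία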

To handle all that apostrophe-related complexity, I created a new token filter, better_apostrophe, in a new plugin, extra-analysis-turkish. The code and docs are in all the usual places, including GitHub.

Overall, better_apostrophe is better than apostrophe, but still not perfect, of course.

A couple of final observations on apostrophe-chomping:

  • Another smaller pattern I noticed is that there are a fair number of words joined by periods. If the first one has an apostrophe in it, the second one gets chomped. (Which isn't actually terrible, because nothing is going to match the combined word.) I didn't investigate too closely, but I think enabling word_break_helper could be useful here, as elsewhere—but I think that's for T170625.
  • An interesting conundrum of apostrophe stripping is that it makes the boundaries of proper nouns clearer than they normally would be. For example, the proper stemmed form of Fransa'nın ("of France") is clearly Fransa. However, when Fransa appears by itself, the Turkish stemmer roughs it up a little and the output is fra. I originally thought that putting the apostrophe handling before the stemmer was a mistake, because Fransa'nın gets apostrophized as Fransa, which is then stemmed to fra. However, this is probably the best outcome, because then it at least matches plain Fransa, which is also stemmed to fra.

Back to Unpacking[edit]

  • The impact of enabling better_apostrophe isn't huge, but it's quite noticeable at certain focal points:
    • In the Wikipedia sample, 164 d'something words are no longer stemmed as d. The remaining 5 in the stemming group are plain d, and d followed by very Turkish suffixes: d'de, d'den, d'nin, and d'ye. Similarly, 157 l'something words are no longer stemmed as l.
      • Lots of mergers between d'something, l'something, and plain something, and a handful of mergers like K'omoks/Komoks and Kur'an/Kuran.
    • There's no longer an algorithmic bias against certain Irish names! Over a hundred O'Names were restored.
    • Numerical minutes/seconds (e.g., 1'23) are no longer truncated.
    • Common apostrophe-like characters are treated properly, so Mary‘nin now stems with Mary'nin and Mary’nin.
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • A fair number of mixed-script tokens—more than 200 Cyrillic/Latin in the Wikipedia sample—mostly homoglyphs.
      • An oddly common pattern is Cyrillic schwa ә in an otherwise Latin word—though it looks like they may have all come from one article.
      • Because of Turkish suffixing, there are a few surprising "homoglyph" tokens. Completely Cyrillic words, like МСАТ can get a Turkish/Latin suffix, giving МСАТ’ta. All-Cyrillic МСАТ would not normally be converted to all-Latin MCAT, but because of the 'ta suffix it is. It's rare, though.
    • The expected Dotted I (İ) regression did not happen because Turkish lowercasing does the right thing—or at least a right thing.
    • Most common normalizations for Turkish: invisibles (bidi, zero-width (non)joiners, soft hyphens), ß → ss, and a fair number of homoglyphs.
  • Enabled custom ICU folding for Turkish, saw lots of the usual folding effects.
    • Exempted Çç, Ğğ, Iı, İi, Öö, Şş, Üü for Turkish.
      • I and i aren't strictly necessary but they keep the Turkish upper/lower pairs Iı & İi together and make it clear both are intended.
      • I didn't include Ââ, Îî, Ûû since they are not letters of the Turkish alphabet, and their use seems to be somewhat inconsistent. There are words distinguished only by circumflexes, but the circumflexes are sometimes considered optional and their use has varied over the years.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • In the Wiktionary sample, there were a fair number of middot (·) collapses, e.g., mat·ka·lauk·ku → matkalaukku, but no corresponding hyphenation point (‧) collapse.
    • As part of enabling ICU folding, we also add a filter to remove empty tokens (which sometimes occur when we have a token made up of just a combining diacritic by itself, like ʾ). It also filtered out the three empty tokens the stemmer generates for ları, leri, and siniz, which is a good thing. (They will also still get picked up in the plain field.)
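
For reference, a sketch of what such an empty-token filter can look like in raw config, assuming it is built on the standard length filter (the name is illustrative):

    # Sketch: drop zero-length tokens by requiring a minimum length of 1.
    remove_empty = {"type": "length", "min": 1}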

Overall Impact[edit]

  • There were few token count differences (~0.02%), mostly from extra homoglyph tokens or fewer solo combining characters.
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding, etc., were significant, but not extreme (0.5% to 2.4% of tokens being redistributed into 1.4% to 1.8% of stemming groups).
    • Turkish Wiktionary: 3,515 tokens (2.421% of tokens) were merged into 566 groups (1.844% of groups)
    • Turkish Wikipedia: 7,976 tokens (0.502% of tokens) were merged into 1,824 groups (1.358% of groups)

better_apostrophe Impact on Turkish Wikipedia (T337064)[edit]

Reindexing Results

  • The zero results rate was unchanged at 25.9%.
  • The number of queries that got more results right after reindexing was 4.4%, vs. the pre-reindex control of 0.6–0.8% and post-reindex control of 0.4–0.9%.
  • The number of queries that got fewer results right after reindexing was 1.9%, vs. the pre-reindex control of 0.1–0.6% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 1.0%, vs. the pre-reindex control of 0.4% and post-reindex control of 0.0%.

Observations

  • The only possible causes of changes are updates to the text of the wiki over the 4 hours of testing, differences in ranking between shards before reindexing, and better apostrophe processing.
  • The biggest decreases in results are generally the obvious candidates, like l or s no longer matching l'administration or s'il.
    • There are some other interesting cases like kimlerle ("with whom") being stemmed a bit too aggressively to just k, which then used to match names/transliterations like K'erosili and K'riat, but no longer does.
  • The biggest increase in terms of raw count comes from the stemmed form of aidin (ai) matching French j'ai, but the best new match is not in the top 15, and the percentage increase in results is just 3.62%. I can't figure out the second largest increase in results in terms of raw numbers because it is only a 0.16% increase in relative terms (i.e., 1 in 625—so I just can't find them!)
    • The proportionally largest increases in results come from matches like itikat and i'tikat, which seem to be correct, and matches on australie, which is the French word for Australia, which—being French—often come in the form of d'Australie or l'Australie.
  • I only found one clear example of the top result changing for apostrophe-related reasons, and that was a new Italian result where all'oggetto was indexed as oggetto rather than all. The rest seem to be random variation, and the difference between the pre-reindex control value of 0.4% and the reindex value of 1.0% is a matter of timing.
    • The controls (pre- and post-reindex) are 15 minutes apart, the reindex was about 2.5 hours after the last control (getting the timing just right on reindexes is impossible without spamming the production servers with queries). Comparing the first and last controls (45 minutes apart) gives a top result change of 0.6%, so that tracks. The top result stops changing after reindexing because the fresh indexes are more similar across shards and so different shards are less likely to give different results.

Brazilian Portuguese Notes (T325092)[edit]

Side Quest: Brazil v Portugal[edit]

The Brazilian Portuguese situation is a little different from our usual scenario. The Brazilian Portuguese analysis chain (labelled "brazilian" by Elastic/Lucene) is only enabled on the Wiki Movimento Brasil wiki (br.wikimedia.org). Other Portuguese-language wikis (8 wiki projects and the Wikimedia Portugal wiki) use the Portuguese analysis chain (labelled "portuguese").

  • Data—since only one small wiki uses the Brazilian Portuguese analysis chain, the data I used for the analysis analysis included:
    • The same 10K sample each from Portuguese Wikipedia and Wiktionary used when unpacking the Portuguese analyzer.
    • A sample of 500 pages from the Wiki Movimento Brasil wiki (which has fewer than 1K total content pages and fewer than 6K total pages).
  • Analysis observations—I did some very basic comparisons between the brazilian and portuguese analysis chains, using the original monolithic Elastic analyzer for each, on the Portuguese Wikipedia data.
    • Total tokens in the Portuguese Wikipedia sample:
      • brazilian: 2 million
      • portuguese: 1.9M
    • Distinct tokens in its largest stemming group in the Wikipedia sample:
      • brazilian: 39
      • portuguese: 14

I looked into the analysis chain implementations for the brazilian and portuguese stemmers and stopword lists, and they seem to be completely independent implementations. Brazilian stopwords (128 words) and stemmer. Portuguese stopwords (203 words) and stemmer.

Looking at "plain Latin" tokens—no punctuation, numbers, invisibles, or combining/modifier characters—the brazilian analyzer found 163,636 distinct tokens and 1,868,827 total tokens, while the portuguese analyzer found 163,544 distinct tokens and 1,754,784 total tokens. Note that at this stage, capitalization differences count as "distinct" tokens, so no and No are two separate words. portuguese has 92 fewer distinct plain Latin tokens, and about 114K fewer total plain Latin tokens. Looking at the 25 most common token lists for each, brazilian doesn't filter no ("in the"), para ("for"), and foi ("he/she was" or "he/she went")—accounting for about 68K tokens—and about a dozen others—accounting for about 118K tokens total—while portuguese doesn't filter sobre ("on")—accounting for about 3.5K tokens.

Looking briefly at the differences in stemming groups, it looks like the brazilian stemmer is much better with verb forms, but also more aggressive with derivational morphology. The portuguese stemmer seems to get more plural forms in -s, but may also pick up some false positives. My initial impression is that the brazilian stemmer is probably better, but may be too aggressive—but I didn't spend enough time looking into it, since I wanted to get back to finishing the unpacking.

However, I asked Diego about Brazilian and European Portuguese, and it seems like they are similar enough that it is definitely worth investigating whether the brazilian analyzer could be useful on Portuguese-language wikis.

Back to Unpacking[edit]

  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • It's not worth reporting on the content of the samples, since they come from wikis where the brazilian analyzer isn't used.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • The most common normalizations in the br.wikimedia.org data were #ª/#º ordinals converted to a/o, and there was one bidi mark in the sample, despite there being no RTL scripts in the sample—those things can hide anywhere!
  • Enabled custom ICU folding—no exemptions needed.
    • In the br.wikimedia.org sample, the main impact is diacritics and curly apostrophes.

Overall Impact[edit]

  • There were no token count differences in the br.wikimedia.org data.
  • ICU folding is the main source of changes.
    • 989 tokens (0.549% of tokens) were merged into 33 groups (0.214% of groups)—i.e., about 1 in 180 tokens match something they didn't match before.

Estonian Notes (T332322)[edit]

  • Usual 10K sample each from Wikipedia and Wiktionary for Estonian.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are (in Wikipedia) long Estonian words (likely compounds), words joined by periods, file names, numbers, and (in Wiktionary) lots of German compounds, IPA transcriptions, \u encoded tokens, etc.
  • Enabled the new (to us) Estonian analyzer.
    • Using Wiktionary and DeepL, plus the fact that many Estonian words have fairly long stems and Estonian has nicely modular morphology, it's pretty clear that the stemmer is doing the right thing. It may not be getting every form, but it is clearly grouping lots of related words, and generally only related words. (I did see a few errors with non-Estonian words—English barrel & barren, and the name Barret all stem as barre, for example.)
  • Stemming observations:
    • Estonian Wikipedia had 52 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations for Estonian: Lots of German ß/ss normalizations and invisibles (bidi, zero-width (non)joiners, soft hyphens), and a few mixed-script homoglyphs.
  • Enabled custom ICU folding and saw lots of the usual folding effects.
    • Exempted Šš, Žž, Õõ, Ää, Öö, Üü for Estonian (see the config sketch after this list).
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • A fair number of curly apostrophes normalized to straight apostrophes
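
A rough sketch of the Estonian-specific bits: the ICU folding exemptions go into the filter's Unicode set filter (everything except the listed letters gets folded), and the stemmer sanity check above can be reproduced with the _analyze API. The parameter spelling (unicode_set_filter vs. unicodeSetFilter) depends on the analysis-icu plugin version, and using the built-in estonian analyzer for the check is an assumption, not the exact unpacked config.

  import requests

  ES = "http://localhost:9200"   # assumed local Elasticsearch with analysis-icu installed

  # Exempt Š/Ž/Õ/Ä/Ö/Ü from ICU folding: fold everything *except* these letters.
  estonian_icu_folding = {
      "type": "icu_folding",
      "unicode_set_filter": "[^ŠšŽžÕõÄäÖöÜü]",
  }

  # Quick stemmer sanity check: run a few words through the estonian analyzer
  # and look at the stems that come back.
  def stems(words):
      resp = requests.post(f"{ES}/_analyze",
                           json={"analyzer": "estonian", "text": " ".join(words)})
      resp.raise_for_status()
      return [t["token"] for t in resp.json()["tokens"]]

  print(stems(["barrel", "barren", "Barret"]))   # all three come back as "barre"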

Overall Stemming Impact[edit]

  • 4.372% of Wiktionary tokens (532 distinct, including case variants) and 13.146% of Wikipedia tokens (987 distinct) were filtered as stop words.
  • The merges from stemming were quite significant (even more so for Wikipedia):
    • Estonian Wiktionary: 11,044 tokens (6.912% of tokens) were merged into 2,487 groups (2.895% of groups)
    • Estonian Wikipedia: 497,501 tokens (22.675% of tokens) were merged into 23,075 groups (7.373% of groups)

Overall Unpacking Impact[edit]

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (about 0.24% to 0.92% of tokens being redistributed into 0.66% to 1.12% of stemming groups).
    • Estonian Wiktionary: 1,411 tokens (0.923% of tokens) were merged into 892 groups (1.121% of groups)
    • Estonian Wikipedia: 4,612 tokens (0.242% of tokens) were merged into 1,453 groups (0.662% of groups)

Estonian Reindexing Impacts[edit]

General Notes[edit]

Summary

  • Estonian got a new stemmer, and it had a pretty big impact! 1 in 6 previous Estonian Wikipedia zero-results queries now get results, almost 1 in 3 non-zero-results queries get more results, and about 1 in 8 queries had their top result change.

Background

  • I pulled a sample of 10K Wikipedia queries from March of 2023 (2 weeks). I filtered obvious porn, URLs, and other junk queries from the sample (109 queries filtered; junk queries and numbers were the most common categories) and randomly sampled 3,000 queries from the remainder.
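
A minimal sketch of the sampling step (not the actual notebook); the file name and junk patterns are placeholders, and the real filtering also involved manual review.

  import random
  import re

  # Placeholder junk patterns: URLs and number-only queries; porn and other junk
  # queries were also filtered in the real pass.
  URL_RE = re.compile(r"^https?://", re.IGNORECASE)
  NUMBERS_RE = re.compile(r"^[\d\s.,:/-]+$")

  def is_junk(query):
      return bool(URL_RE.search(query) or NUMBERS_RE.match(query))

  with open("etwiki_queries_10k.txt", encoding="utf-8") as f:   # hypothetical file name
      queries = [q.strip() for q in f if q.strip()]

  kept = [q for q in queries if not is_junk(q)]
  random.seed(0)                        # make the sample reproducible
  sample = random.sample(kept, 3000)    # the 3,000-query evaluation set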

Unpacking + New Stemmer + ICU Folding Impact on Estonian Wikipedia (T335704)[edit]

Reindexing Results

  • The zero results rate dropped from 35.1% to 29.4% (-5.7% absolute change; -16.2% relative change).
  • 32.7% of queries got more results right after reindexing, vs. 0.0% for the pre-reindex control and 0.1% for the post-reindex control.
  • 11.8% of queries changed their top result right after reindexing, vs. 0.1–0.2% for the pre-reindex control and 0.0% for the post-reindex control.
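
The comparison itself boils down to running the same query set before and after reindexing and diffing result counts and top hits; here is a rough Python sketch under assumed index and field names (not the actual harness).

  import requests

  ES = "http://localhost:9200"

  def run_query(query, index="etwiki_content"):
      """Return (total hits, id of the top hit or None) for a simple match query."""
      resp = requests.post(f"{ES}/{index}/_search",
                           json={"query": {"match": {"text": query}}, "size": 1})
      resp.raise_for_status()
      hits = resp.json()["hits"]
      top = hits["hits"][0]["_id"] if hits["hits"] else None
      return hits["total"]["value"], top

  def compare(queries, before, after):
      """before/after map each query to (total, top_id); return the three rates."""
      zero_fixed = more_results = top_changed = 0
      for q in queries:
          n_before, top_before = before[q]
          n_after, top_after = after[q]
          if n_before == 0 and n_after > 0:
              zero_fixed += 1
          if n_after > n_before:
              more_results += 1
          if top_before != top_after:
              top_changed += 1
      n = len(queries)
      return zero_fixed / n, more_results / n, top_changed / n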

Observations

  • The most common cause of improvement in zero-results is the new stemmer, though there are some improvements from ICU folding as well.
  • The most common cause of an increased number of results is the new stemmer.
  • The most common cause of changes in the top result for queries that didn't get many or any new results is probably a change in scoring of query terms—an uncommon form of a common word would have been high scoring before stemming, but lower scoring after.
  • The most common cause of changes in the top result for queries that did get lots of new results is the new stemmer.