User:TJones (WMF)/Notes/Unpacking Notes

April-September 2021 — See TJones_(WMF)/Notes for other projects. See also T272606. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

The Unpacking Process

Gather Data

  • Gather 10K articles (without repeats) each from Wikipedia and Wiktionary for each language (custom Perl script, wikitext.pl)
    • Manual review/editing: remove leading white space, dedupe lines, review potential HTML tags ( search for <[a-z]+ )
  • Gather 10K/4weeks query data from Wikipedia for each language (Jupyter notebook, Sample_Queries.ipynb on stat1007)
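
The manual review steps above can be partly automated. The following is a minimal Python sketch, not the actual wikitext.pl workflow: the function names and the toy sample are hypothetical, and only the `<[a-z]+` pattern comes from the notes.

```python
import re

# Same pattern the manual review searches for.
TAG_RE = re.compile(r"<[a-z]+")

def clean_sample(lines):
    """Strip leading white space, drop empty lines and repeats (first-seen order kept)."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.lstrip().rstrip("\n")
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned

def possible_html(lines):
    """Return lines that may still contain HTML tags and need a human look."""
    return [line for line in lines if TAG_RE.search(line)]

# Toy data, purely illustrative.
sample = [
    "  Hola mundo\n",
    "Hola mundo\n",
    "\tTexto con <span class='x'>etiqueta</span>\n",
    "a < b\n",
]
cleaned = clean_sample(sample)
flagged = possible_html(cleaned)
```

Lines are only flagged, not deleted, since a match such as `<b` may be legitimate text in some samples.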

Run Baselines

  • Per language (I've been working on three at a time recently to somewhat streamline the process):
    • set language to target and reindex
    • run analyze_counts.pl as baseline for wiki and wikt 10K samples

Unpack Analyzers

Re-enable Analyzer Upgrades

  • Re-enable homoglyphs and icu_norm upgrades
  • Per language:
    • set language to target and reindex
    • run analyze_counts.pl as upgraded for wiki and wikt 10K samples
    • run compare_counts.pl: solo for baseline/upgraded for wiki and wikt; baseline_vs_upgraded comparison for wiki and wikt
      • solo—just trying to get the lay of the land
        • look at potential problem stems
        • look at largest Type Group Counts
          • anything around 20+ is interesting; well over 20 is surprising (but not necessarily wrong)
        • look at Tokens Generated per Input Token; usually expect 1 in baseline; some 2s with homoglyphs
        • look at Final Type Lengths; 1s are often CJK, longest are often URLs, German, spaceless languages, or \u encoded
      • comparison—see what changed
        • expect dotted-I regression
        • lots of hidden characters removed (soft hyphens, bidi marks, joiners and non-joiners)
        • Super- and subscript characters get converted, ß to ss, too
        • Regularization of non-Latin characters is common, particularly, Greek ς to σ
        • investigate anything that doesn’t make sense

Repair Unpacked & Upgraded Analyzers

  • Per language:
    • Make any needed “repairs” to accommodate ICU normalization
      • possibly just dotted_I_fix
    • set language to target and reindex
    • run analyze_counts.pl as repaired for wiki and wikt 10K samples
    • run compare_counts.pl: solo for repaired for wiki and wikt; upgraded_vs_repaired comparison for wiki and wikt
      • solo—just trying to get the lay of the land
      • comparison—look for expected changes (maybe just dotted-I)

Enable ICU Folding

  • Per language:
    • enable ICU Folding
      • add language code to $languagesWithIcuFolding, and any folding exceptions to getICUSetFilter()
      • add asciifolding to filter list, usually in last place
    • set language to target and reindex
    • run analyze_counts.pl as folded for wiki and wikt 10K samples
    • run compare_counts.pl: solo for repaired for wiki and wikt; repaired_vs_folded comparison for wiki and wikt
      • solo—potential problem stems can show systematic changes, even if they aren’t really problems
        • elision (l’elision, d’elision, qu’elision, s’etc.) can throw this off
      • comparison—look for expected changes (rare characters and variants folded, diacritics folded, etc.)

Compare Final Analyzer to Baseline

  • Per language:
    • run compare_counts.pl: baseline_vs_folded comparison for wiki and wikt
      • comparison—look at the overall impact of unpacking, upgrades, and ICU folding
        • Token delta: expect small numbers (<100) unless something “interesting” happened
        • New Collision Stats gives a sense of the overall impact, # of tokens that merge into other groups.
          • Typically < 3% on each number, with higher values in Wiktionary
        • Possibly a few Lost pre-analysis tokens
        • Net Gains: expect plenty of changes; high-impact changes are usually—
          • one- or two- letter tokens (e.g., a picks up á, à, ă, â, å, ä, ã, ā, ə, ɚ)
          • something with a lot of variants that includes a folded character (e.g., abc, abcs, l'abc, l'abcs, d'abc, d'abcs, qu'abc, qu'abcs, etc., with straight quotes, picks up l’abc, l’abcs, d’abc, d’abcs, qu’abc, qu’abcs, etc., with curly quotes)
          • or a diacriticless typo (Francois) picks up all the forms with diacritics (François—it’s hard to find an example in English)
        • Don’t expect any New Splits, Found pre-analysis tokens, or Net Losses unless there was additional customization
    • Summarize findings (here and in Phabricator)
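
To make the idea of "tokens that merge into other groups" concrete, here is a rough Python sketch. It is an assumption about how such a count could be made, not the definition compare_counts.pl actually uses; the function names and the toy mappings are hypothetical.

```python
from collections import defaultdict

def groups(mapping):
    """Group distinct input types by the final form the analyzer produced."""
    by_form = defaultdict(set)
    for typ, form in mapping.items():
        by_form[form].add(typ)
    return by_form

def newly_merged_types(old, new):
    """Types whose group under the new analyzer contains types that were
    not in their group under the old analyzer (every key of old must be in new)."""
    old_groups = groups(old)
    new_groups = groups(new)
    merged = set()
    for typ in old:
        before = old_groups[old[typ]]
        after = new_groups[new[typ]]
        if after - before:
            merged.add(typ)
    return merged

# Toy mappings (type -> analyzed form), purely illustrative.
baseline = {"resume": "resum", "resumes": "resum", "résumé": "résum",
            "dia": "dia", "día": "día", "casa": "cas"}
folded = {"resume": "resum", "resumes": "resum", "résumé": "resum",
          "dia": "dia", "día": "dia", "casa": "cas"}

merged = newly_merged_types(baseline, folded)
pct_types_merged = round(100.0 * len(merged) / len(baseline), 1)
```

This counts both sides of a merge (resume gains résumé and vice versa); a tool could just as reasonably count only the smaller side.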

Merge Your Patch

  • When everything looks good and makes sense, submit the patch.
    • When the patch is merged, it’s time to reindex.

Prep Query Data

  • Before reindexing, using the 10K Wikipedia query sample:
    • Filter “bad queries” and randomly sample 3K queries (using a custom Perl script, run_queries.pl)
      • Review the “bad queries” to make sure the filters are behaving reasonably for the given language
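
The filter-then-sample step can be sketched as follows. This is not run_queries.pl: the URL pattern, the function names, and the fixed seed are all illustrative assumptions, and a real filter would cover more kinds of junk.

```python
import random
import re

# Hypothetical, deliberately simple "bad query" test: empty strings and URL-like queries.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)

def looks_bad(query):
    q = query.strip()
    return not q or bool(URL_RE.search(q))

def filter_and_sample(queries, k, seed=0):
    """Split queries into good and bad, then randomly sample up to k good ones."""
    good = [q for q in queries if not looks_bad(q)]
    bad = [q for q in queries if looks_bad(q)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(good, min(k, len(good)))
    return sample, bad

queries = ["jose marti", "http://example.com/x", "dia de muertos", "  ",
           "www.example.org", "peru", "gomez"]
sample, bad = filter_and_sample(queries, k=3)
```

Returning the rejected queries alongside the sample makes the review step easy: the `bad` list is what gets checked by hand for each language.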

Reindexing and Before-And-After Analysis

  • While reindexing Wikipedia for a given language, kick off “brute-force” sampling (using a custom Perl script, brute.pl)
    • The brute-force script runs the same 3K queries every 10 minutes while reindexing
    • Let it run 2–3 more iterations after reindexing is complete
    • You may have to throw out a query run if reindexing finished in the middle of the run
    • Using time stamps from the reindexing and query runs, figure out the smallest gap between a “before” and an “after” query run and compare them (using a custom Perl script, comp_queries.pl), noting differences in zero results rate, increases and decreases in results counts, and changes in top results.
    • Use similarly spaced pre-reindexing runs and post-reindexing as controls to get a handle on normal variability and compare to the before-and-after results.
    • Comparing the earliest and latest pre-reindexing runs also allows you to judge what is random fluctuation and what is directional. e.g.:
      • If 10-minute interval comparisons all give 2-3% changes in top result, and a 60-minute interval gives 2.3% changes in top results, it’s probably random noise.
      • If 10-minute interval comparisons all give 2-3% changes in increased results, and a 60-minute interval gives 6% change in increased results, it’s probably partly noise overlaying a general increasing trend.
  • Summarize findings (here and in Phabricator)
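
The kind of comparison described above can be sketched in Python as follows. This is not comp_queries.pl: the data layout (query mapped to a result count and a top result title), the function name, and the toy numbers are all assumptions made for illustration.

```python
def compare_runs(before, after):
    """Compare two runs of the same queries.

    Each run maps query -> (result_count, top_result_title_or_None).
    Returns percentages over the queries present in both runs.
    """
    shared = [q for q in before if q in after]
    if not shared:
        raise ValueError("no queries in common")
    stats = {"zero_before": 0, "zero_after": 0, "more": 0, "fewer": 0, "top_changed": 0}
    for q in shared:
        count_b, top_b = before[q]
        count_a, top_a = after[q]
        stats["zero_before"] += count_b == 0
        stats["zero_after"] += count_a == 0
        stats["more"] += count_a > count_b
        stats["fewer"] += count_a < count_b
        stats["top_changed"] += top_a != top_b
    return {k: round(100.0 * v / len(shared), 1) for k, v in stats.items()}

# Toy runs, purely illustrative.
run_before = {"q1": (300, "A"), "q2": (0, None), "q3": (50, "C"), "q4": (900, "D")}
run_after = {"q1": (2100, "A2"), "q2": (12, "B"), "q3": (50, "C"), "q4": (898, "D")}
report = compare_runs(run_before, run_after)
```

Running the same function on two pre-reindexing runs gives the control numbers to compare against.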

Post-Reindexing Top Result Changes

I've seen something of a trend across wikis: The number of searches that have their top result change decreases dramatically after reindexing. It is possible that there is some effect from changed word stats from merging words after ICU Normalization or ICU Folding (e.g., resume and résumé are counted together). And of course new content may have been added to the Wiki that rightfully earns a place as the new top result for a given query.

However, after consulting with the Elasticsearch Brain Trust™, we decided that the best explanation for this is increased consistency across shards after reindexing.

The most common cause of short term changes in top results is having the query served by a different shard. In addition to having different statistics for uncommon words that are spread unevenly across shards, word statistics are not immediately updated when documents are deleted or changed. Over time the shards are more likely to differ from each other.

After reindexing, every shard has a reasonably balanced brand-spanking new index with no history of deletions and changes, so the shards are likely more similar in their stats (and thus in their reporting of the top result).

Spanish Notes (T277699)

  • Usual 10K sample each from Wikipedia and Wiktionary.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a few examples in each sample
  • Enabled ICU normalization and saw the usual normalization
    • Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Potential concerns:
      • 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
      • However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
      • nº gets mapped to no, which is a stop word, and a similar abbreviation gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
  • Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
  • Updated test fixtures for Spanish and multi-language tests.
  • Refactored building of mapping character filters. There are so many that are just dealing with dotted I after unpacking.
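
Several of the normalizations above can be reproduced with Python's standard library. This is only an approximation of ICU normalization (NFKC plus case folding, without the removal of invisible characters), and mapping İ to I before lowercasing is an assumption about what the char_filter map does, not a copy of the real configuration.

```python
import unicodedata

def approx_icu_normalize(text):
    """Rough stand-in for ICU normalization: NFKC, case folding, NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

# Hypothetical stand-in for the dotted-I char_filter map: İ (U+0130) -> I.
DOTTED_I_FIX = str.maketrans({"\u0130": "I"})

ordinal = approx_icu_normalize("1\u00AA")                 # 1ª
numero = approx_icu_normalize("n\u00BA")                  # nº
long_s = approx_icu_normalize("confe\u017F\u017Fion")     # confeſſion
regressed = approx_icu_normalize("\u0130stanbul")         # İstanbul, no fix
repaired = approx_icu_normalize("\u0130stanbul".translate(DOTTED_I_FIX))
```

Without the map, İ becomes i followed by a combining dot above (U+0307), which no longer matches a plain i; with the map applied first, the word normalizes to istanbul.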

Tokenization/Indexing Impacts

  • Spanish Wikipedia (eswiki)
    • There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, º, which normalize to no, a, o, which are stop words (but captured by the plain field).
    • About 1.2% of tokens merged with other tokens. The tokens in queries are likely to be somewhat similar.
  • Spanish Wiktionary (eswikt)
    • There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
    • Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia (T282808)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background

  • I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, and filtered 89 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (the run took about 9 minutes to complete) 6 times. When the reindexing finished, I stopped the 7th iteration because it was mixed and had just started; it started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
  • I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
  • I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.

Expected Results

  • Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results

  • The number of queries getting zero results held steady at 19.3%
  • The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10 minute intervals; 5.2% over 50 minutes)
  • The number of queries getting fewer results is noise (0.1% to 1.4% in 10 minute intervals; 1.4% over 50 minutes)
  • The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10 minute intervals; 3.8% over 50 minutes)
  • The number of queries changing their top result is noise (0.7% to 0.9% in 10 minute intervals; 0.7% over 50 minutes)
  • These results are also generally consistent with the control tests I ran in April and May.

Reindexing Results

  • The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
  • The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
  • The number of queries getting a different number of results increased to 20.2% (vs. the 0.7%–2.4% range seen in control).
  • The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). That's improbable but not impossible to still be random noise. I don't have any obvious explanation after looking at the queries in question.
  • The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.
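
The absolute and relative changes quoted for the zero results rate follow from simple arithmetic. A small Python sketch (the function name is hypothetical) using the rates reported in these notes:

```python
def rate_change(before_pct, after_pct):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return round(absolute, 1), round(relative, 1)

# Spanish Wikipedia zero results rate: 19.3% before, 18.9% after.
eswiki_change = rate_change(19.3, 18.9)

# Ratio of the post-reindex top-result change rate to the top of the control range.
top_result_ratio = round(6.4 / 0.9, 1)
```

The relative change is computed against the "before" rate, so a small drop in percentage points on a rate near 19% is a relative drop of about 2%.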

Post-Reindex Control

  • The one control test I ran after reindexing showed changes approximately within the normal range, except for the changes in top result, which was 0 (vs 0.7–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.

German/Dutch/Portuguese Notes (T281379)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
  • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
  • Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • German required customization to maintain ß for stopword processing.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
    • Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
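
The two high-impact foldings just described can be imitated with a tiny translation table. This Python sketch covers only those two characters and assumes the syllabification dot is U+00B7 (middle dot); real ICU folding handles far more.

```python
# Curly apostrophe (U+2019) becomes a straight one; middle dot (U+00B7) is removed.
MINI_FOLD = str.maketrans({"\u2019": "'", "\u00B7": None})

def mini_fold(text):
    return text.translate(MINI_FOLD)

curly = mini_fold("d\u2019Europe")
syllabified = mini_fold("Xe\u00B7no\u00B7kra\u00B7tie")
portuguese = mini_fold("co.ro.no.gra.fo")  # periods are untouched
```

After folding, the curly and straight spellings index to the same token, and the syllabified German form matches the plain headword, while Portuguese forms syllabified with periods stay as they are.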

German Notes

General German

  • ICU normalization interacts with German stop words. mußte gets filtered (as musste) and daß does not get filtered (as dass). Fortunately, a few years ago, David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding!! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss', rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
    • Sorting all this out also seems to have fixed T87136.
  • There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.

German Wikipedia

  • Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
  • It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
  • About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.

German Wiktionary

  • Most common ICU normalizations are long-s's (ſ) (e.g., Auguſt), but that's not bad.
  • The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
  • About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
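
The \u-encoded token above is a sequence of UTF-16 surrogate pairs. As a short illustration of how such a token maps back to the Gothic text, here is a Python sketch (the function names are hypothetical):

```python
import re

def from_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def decode_escaped(token):
    """Turn a string of literal \\uXXXX escapes into the characters they encode."""
    units = [int(h, 16) for h in re.findall(r"\\u([0-9A-Fa-f]{4})", token)]
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            chars.append(chr(from_surrogates(unit, units[i + 1])))
            i += 2
        else:
            chars.append(chr(unit))
            i += 1
    return "".join(chars)

escaped = (r"\uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43"
           r"\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D")
gothic = decode_escaped(escaped)
```

Each Gothic letter takes 12 characters in the escaped form, which is why a ten-letter word shows up among the longest tokens.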

Dutch Notes

General Dutch

  • The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German, rather than Dutch, so that's a good thing.
  • There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.

Dutch Wikipedia

  • Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
  • About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.

Dutch Wiktionary

  • The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
  • About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.

Portuguese Notes

Portuguese Wikipedia

  • There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
  • The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
  • Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
  • About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.

Portuguese Wiktionary

  • There's a very small impact on token counts (0.008% out of ~147K), which are mostly homoglyphs.
  • Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
  • About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.

DE/NL/PT Reindexing Impacts

Impact Tool Filtering Improvements During German, Dutch, Portuguese Testing

While working on German, I discovered that 28 of the filtered German queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.

I also added additional filtering for more URLs, email addresses, Cyrillic-flavored junk, and very long queries (≥100 characters) that get 0 results.

I tested these filtering changes on German, Dutch, Portuguese, Spanish, English, Khmer, Basque, Catalan, and Danish query corpora.
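
To see why a simple consonant-run filter trips on German, consider this Python sketch. The regular expression is a guess at what a naive "6+ consonants" junk filter might look like; the German-specific patterns mentioned above are not spelled out in these notes, so they are not reproduced here.

```python
import re

# Hypothetical naive junk filter: six or more consonants in a row (y treated as a vowel).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxzß]{6,}", re.IGNORECASE)

def naive_is_junk(query):
    return bool(CONSONANT_RUN.search(query))

def longest_run(query):
    """Length of the longest consonant run, to show how far real words go."""
    runs = re.findall(r"[bcdfghjklmnpqrstvwxzß]+", query, re.IGNORECASE)
    return max((len(r) for r in runs), default=0)

false_positive_1 = naive_is_junk("Deutschschweizer")
false_positive_2 = naive_is_junk("Angstschweiß")
real_junk = naive_is_junk("xkjqwrtz")
ordinary = naive_is_junk("Wikipedia")
```

Both German examples contain a run of eight consonants (tschschw and ngstschw), so a length threshold alone cannot separate them from keyboard mashing.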

Unpacking + ICU Norm + ICU Folding + ß/ss Split Impact on German Wikipedia (T284185)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
  • I also disabled the folding of ß to ss in the plain field, which had a small negative impact on recall in certain corner cases. (See T87136 for rationale.)

Background

  • I pulled a 10K sample of German Wikipedia queries from April of 2021, and filtered 134 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
    • I later discovered that 28 of the filtered queries should not have been filtered (28 out of 10K isn't too, too many, though). Sequences of 6+ consonants are not too uncommon in German (e.g., Deutschschweizer, "German-speaking Swiss person", or Angstschweiß, "cold sweat"), but they do follow certain patterns, which I've now incorporated into my filtering.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on German Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
  • I compared each iteration against the subsequent one, and compared the first and last runs before reindexing to get insight into "trends" vs "noise" in the comparisons.

Control Results

  • The number of queries getting zero results held steady at 22.0%
  • The number of queries getting a different number of results increases slightly over time (0.3% to 1.6% in 10 minute intervals; 3.6% over 90 minutes)
    • The number of queries getting fewer results is noise (0.0% to 0.4% in 10 minute intervals; 0.5% over 90 minutes)
    • The number of queries getting more results increases slightly over time (0.2% to 1.5% in 10 minute intervals; 3.2% over 90 minutes)
  • The number of queries changing their top result is noise (1.5% to 2.2% in 10 minute intervals; 1.9% over 90 minutes)

Reindexing Results

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for German Wikipedia. Folding diacritics had a noticeable impact on the zero results rate and the total number of results returned. For example, searching for surangama sutra now finds Śūraṅgama-sūtra. Reindexing in general seems to decrease variability in the top result.
  • The zero results rate dropped to 21.7% (-0.3% absolute change; -1.4% relative change).
  • The number of queries getting a different number of results increased to 13.6% (vs. the 0.3%–1.6% range seen in control).
    • The number of queries getting fewer results was about 4 times the max of the control range (1.8% vs 0.0%–0.4%). 7 of 54 involve ss or ß, but I don't see a pattern for the rest. 37 of 54 only got 1 fewer result, so the impact is not large.
    • The number of queries getting more results was 11.5% (vs the control range of 0.2%–1.5%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 4.0% (vs. the control range of 1.5%–2.2%; that's less than 2x increase). I looked at some of these, and some are definitely the result of folding allowing for matching words in the top result.

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.2% vs 1.5%–2.2%).

Observations

  • The most dramatic decrease in results (both in absolute terms and percentage-wise) was for the query was heisst s.w.a.t. ("what does S.W.A.T. mean?"): from 3369 down to 67 results. Currently, word_break_helper is configured for the plain field, but not the text field (as before), and ß no longer maps to ss in the plain field. word_break_helper breaks up s.w.a.t. into four separate letters in the plain field (but not the text field), improving recall. So, the query in the plain field is was + heisst + s + w + a + t, while the text field query is heisst/heißt + s.w.a.t. Since heißt is much more common than heisst (68K vs 2K results), the plain query returns far fewer results.
    • On the one hand, enabling word_break_helper everywhere would be nice, but we also need proper acronym support! (T170625)
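
The effect of word_break_helper on this query can be imitated in a few lines of Python. This sketch assumes the filter turns periods and underscores into spaces before tokenizing, which is consistent with the behavior described here but is not a copy of the real configuration; white space splitting stands in for the actual tokenizer.

```python
def word_break_helper(text):
    """Assumed behavior: periods and underscores become spaces."""
    return text.replace(".", " ").replace("_", " ")

def plain_tokens(text):
    """Tokens as the plain field might see them (with word_break_helper)."""
    return word_break_helper(text.lower()).split()

def text_tokens(text):
    """Tokens without word_break_helper; a trailing period is stripped for simplicity."""
    return [t.rstrip(".") for t in text.lower().split()]

query = "was heisst s.w.a.t."
with_helper = plain_tokens(query)
without_helper = text_tokens(query)
long_name = plain_tokens("la_pasion_por_goya_en_zuloaga_y_su_circulo")
```

With the helper the acronym turns into four one-letter tokens, which improves recall but throws away the fact that they belong together, hence the wish for proper acronym support.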

Unpacking + ICU Norm + ICU Folding Impact on Dutch Wikipedia (T284185)

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding had a likely minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU Folding—buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote), and a decrease in changes to top queries (a general side-effect of reindexing).

Background

  • I pulled a 10K sample of Dutch Wikipedia queries from April of 2021, and filtered 125 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on Dutch Wikipedia, similar to the method used on Spanish Wikipedia. A number of control diffs were run every ~10 minutes before and after reindexing.
  • I was unable to time the query runs with reindexing just right, so the reindexing finished during one of the query runs. I had to drop that one, so comparisons are across every other run (i.e., ~20 minutes apart). I also compared the first and last runs before and after reindexing to try to get insight into "trends" vs "noise" in the comparisons, but the shorter total time (~30 minutes) wasn't really long enough to let the signal emerge from the noise.

Control Results

  • The number of queries getting zero results held steady at 23.3%
  • The number of queries getting a different number of results is hard to judge (0.7% to 1.1% in 20 minute intervals; 1.2% over 30 minutes)
    • The number of queries getting fewer results is possibly noise (0.2% to 0.8% in 20 minute intervals; 0.8% over 30 minutes)
    • The number of queries getting more results is probably noise (0.3% to 0.8% in 20 minute intervals; 0.5% over 30 minutes)
  • The number of queries changing their top result is probably noise (1.2% to 1.4% in 20 minute intervals; 1.2% over 30 minutes)

Reindexing Results

  • While unpacking an analyzer should have no impact on results, adding ICU folding had a likely minor impact for Dutch Wikipedia. There was a small decrease in zero-results queries, a general increase in recall (both attributable to ICU Folding), and a decrease in changes to top queries (a general side-effect of reindexing).
  • The zero results rate dropped to 23.2% (-0.1% absolute change; -0.4% relative change).
  • The number of queries getting a different number of results increased to 8.0% (vs. the 0.7%–1.1% range seen in control).
    • The number of queries getting fewer results was within the control range (0.3% vs 0.2%–0.8%).
    • The number of queries getting more results was 7.5% (vs the control range of 0.3%–0.8%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations

  • Zero-results changes are all due to ICU folding, so that buthusbankje matches bûthúsbankje, or a curly quote is converted to a straight quote. These are all fairly rare words that got ≤5 results with ICU Folding.
  • Large increases in number of results and changes in the top result are largely, and obviously, from ICU folding.

Unpacking + ICU Norm + ICU Folding Impact on Portuguese Wikipedia (T284185)

Summary

  • ICU folding increases recall for some queries, affecting zero results rate and the total number of results returned. Missing tildes (a instead of ã, or o instead of õ) are the biggest source of changes, so this is a very good change for Portuguese searchers who omit them!

Background

  • I pulled the usual sample of 10K queries from Portuguese Wikipedia (April 2021), filtered 149 queries, and randomly sampled 3K from the remainder.
  • I used a brute-force diff strategy, with control diffs before and after (at ~10 minute intervals).
    • The before/after time difference was 15 minutes because of the exact time reindexing finished.

Control Results

  • The number of queries getting zero results held steady at 18.7%
  • The number of queries getting a different number of results is increasing (0.8% in 10 minute intervals; 1.5% over 20 minutes)
    • The number of queries getting fewer results is noise (0.1% to 0.3% in 10 minute intervals; 0.3% over 20 minutes)
    • The number of queries getting more results is increasing (0.6% to 0.7% in 10 minute intervals; 1.2% over 20 minutes)
  • The number of queries changing their top result is noise (1.1% to 1.4% in 10 minute intervals; 1.4% over 20 minutes)

Reindexing Results

  • The zero results rate dropped to 18.3% (-0.4% absolute change; -2.1% relative change).
  • The number of queries getting a different number of results increased to 15.3% (vs. the 0.8% seen in control).
    • The number of queries getting fewer results was similar to the control range (1.0% in 15 minutes vs 0.6%–0.7% in 10 minutes and 1.2% in 20 minutes).
    • The number of queries getting more results was 13.8% (vs the control range of 0.6%–0.7%). These are largely due to ICU folding. The biggest increases are not the former zero results queries.
  • The number of queries that changed their top result was 3.4% (vs. the control range of 1.2%–1.4%).

Post-Reindex Control

  • The three control tests I ran after reindexing showed changes approximately within the normal range, except for changes in the top result, which was much lower (0.0%–0.1% vs 1.2%–1.4%).

Observations

  • Zero-results changes are mostly, and obviously, due to ICU folding.
  • Large increases in number of results and changes in the top result are largely, and obviously, from ICU folding. Particularly sao matching são, which increased hits from 300 to 21K!
    • The one query I couldn't figure out was 1926~. The absolute increase is fairly large (~5K) but the relative increase is not (2.3%, out of 218K).
  • Overall, missing tildes (a instead of ã, or o instead of õ) are the biggest sources of changes.

Basque, Catalan, and Danish Notes (T283366)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.
  • Stemming observations:
    • Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
    • Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
    • Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a handful of examples in all six samples.
    • Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
    • Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.
  • Enabled ICU normalization and saw the usual normalizations.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
    • Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). "e" and "o" are stop words—"o" means "or", but "e" just seems to refer to the letter; weird. The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wikipedia:
      • Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
        • The majority are ç → c; most look ok
        • A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
        • Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
      • Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
    • Danish: Also straightened a fair number of curly quotes.
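
The dotted I regression and its fix can be illustrated outside Elasticsearch. This is a minimal sketch, assuming Python's Unicode-aware lowercasing is a fair stand-in for what happens in the analysis chain (the two are not identical, but both turn İ into i plus a combining dot above). The name `DOTTED_I_MAP` is hypothetical; it plays the role of the char_filter map by rewriting İ to I before lowercasing.

```python
# Stand-in for the char_filter map: rewrite İ (U+0130) before lowercasing.
# DOTTED_I_MAP is a hypothetical name, not the production filter name.
DOTTED_I_MAP = str.maketrans({"\u0130": "I"})


def lowercase_naive(text: str) -> str:
    """Lowercase without the fix; İ becomes 'i' + U+0307 (combining dot above)."""
    return text.lower()


def lowercase_fixed(text: str) -> str:
    """Apply the mapping first, so İ ends up as a plain 'i'."""
    return text.translate(DOTTED_I_MAP).lower()


# Show the code points so the extra combining character is visible.
print([hex(ord(c)) for c in lowercase_naive("\u0130")])
print(lowercase_fixed("\u0130stanbul"))
```

Doing the mapping as a character filter means it runs before tokenizing and lowercasing, so no later step ever sees the stray combining dot.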

Overall Impact

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
    • Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
    • Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
    • Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
    • Catalan Wikipedia:
      • 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
      • 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
    • Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
    • Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)

CA/DA/EU Reindexing Impacts

An Unexpected Experiment

David needed to reindex over 800 wikis for the ores_articletopicsweighted_tags rename, including all of the large wikis covered by unpacking Catalan, Danish, and Basque. (There was another small wiki for the Denmark Wikimedia chapter, which I reindexed.)

Because I couldn't control the exact timing of the reindexing, I ran 5 pre-reindex control query runs at 10 minute intervals for comparison, and then ran follow-up query runs at approximately 1-day intervals (usually ±15 minutes, sometimes ±2 hours).

The exact number of pre-reindex controls and post-reindex controls for each language differed because they were reindexed on different days.
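
The comparisons between query runs reported in the sections below boil down to a few per-query checks. Here is a minimal sketch, assuming each run is stored as a mapping from query string to a (hit count, top result title) pair. The `Run` type, the `compare_runs` function, and the toy data are all illustrative and not the actual tooling.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation of one query run: query -> (hits, top title or None).
Run = Dict[str, Tuple[int, Optional[str]]]


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place; 0.0 for an empty sample."""
    return round(100.0 * count / total, 1) if total else 0.0


def compare_runs(before: Run, after: Run) -> Dict[str, float]:
    """Compare two runs over the queries they share."""
    shared = [q for q in before if q in after]
    n = len(shared)
    return {
        "zero_results_before": pct(sum(before[q][0] == 0 for q in shared), n),
        "zero_results_after": pct(sum(after[q][0] == 0 for q in shared), n),
        "more_results": pct(sum(after[q][0] > before[q][0] for q in shared), n),
        "top_result_changed": pct(sum(after[q][1] != before[q][1] for q in shared), n),
    }


# Toy data, invented for illustration only.
before = {"q1": (300, "A"), "q2": (0, None), "q3": (12, "B"), "q4": (7, "C")}
after = {"q1": (21000, "D"), "q2": (4, "E"), "q3": (12, "B"), "q4": (7, "C")}
print(compare_runs(before, after))
```

Running the same comparison between two control runs (with no reindex in between) gives the baseline noise level that the post-reindex numbers are measured against.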

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top results from folding diacritics), and any unexpected impacts.

Summary

  • Catalan has a very large improvement in zero-results rate (8.1% relative improvement, or 1 in 12), largely driven by the fact that people type -cio for -ció (which is cognate with Spanish -ción and English -tion).
  • In general, the impact on Danish was very mild; the general variability in Danish query results is lower than for other wikis.
  • Basque improvements are in large part due to queries in Spanish that are missing the expected Spanish accents.

Background

  • I pulled a sample of 10K Wikipedia queries from April of 2021 (1 week each for Catalan and Danish, the whole month for Basque). I filtered obvious porn, urls, and other junk queries from each sample (ca:237, da:396, eu:438, urls most common category in all cases) and randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Catalan Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 14.9% to 13.7% (-1.2% absolute change; -8.1% relative change).
  • The number of queries that got more results right after reindexing was 30.4%, vs. the pre-reindex control of 17.1% and post-reindex control of 14.8–17.6%.
  • The number of queries that changed their top result right after reindexing was 6.2%, vs. the pre-reindex control of 1.0% and post-reindex control of 0.6–2.0%.
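
The absolute change in the zero results rate is a difference in percentage points, while the relative change expresses that difference as a share of the original rate. A small sketch of the arithmetic (the function name `zrr_change` is just for illustration):

```python
from typing import Tuple


def zrr_change(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return round(absolute, 1), round(relative, 1)


# Catalan Wikipedia figures from the list above.
print(zrr_change(14.9, 13.7))
```

The relative figure is the more useful one for comparing wikis, since the starting zero results rates differ quite a bit from wiki to wiki.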

Observations

  • The most common cause of improvement in zero-results is matching -cio in the query with -ció in the text, and they generally look very good.
  • Some of the most common causes of an increased number of results include -cio/-ció, other accents missing in queries, and c/ç matches. Not all of the highest impact c/ç matches look great, but these are edge cases. From the earlier analysis chain analysis (see above), I expect c/ç matches are overall a good thing, though we should keep an eye out for reports of problems.

Unpacking + ICU Norm + ICU Folding Impact on Danish Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 28.6% to 28.2% (-0.4% absolute change; -1.4% relative change).
  • The number of queries that got more results right after reindexing was 9.0%, vs. the pre-reindex control of 2.1–3.1% and post-reindex control of 2.0–3.0%.
  • The number of queries that changed their top result right after reindexing was 1.7%, vs. the pre-reindex control of 0.7–0.9% and post-reindex control of 0.2–0.9%.

Observations

  • Generally the impact on Danish Wikipedia was very muted compared to most others we've seen so far.

Unpacking + ICU Norm + ICU Folding Impact on Basque Wikipedia (T284691)

Reindexing Results

  • Note that the sampling rate is ~1 day, rather than ~10 minutes as in previous measurements.
  • The zero results rate dropped from 24.4% to 23.1% (-1.3% absolute change; -5.3% relative change).
  • The number of queries that got more results right after reindexing was 21.6%, vs. the pre-reindex control of 6.9–7.9% and post-reindex control of 7.2–10.0%.
  • The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.2–0.6% and post-reindex control of 0.1–0.2%.

Observations

  • A lot of the rescued zero-results and some of the other improved queries are in Spanish, and are missing the expected Spanish accents.

Unexpected Experiment, Unexpected Results!

The results of this unexpected experiment are actually very good. With fairly different behavior from all three of these samples (Catalan with big improvements, Basque with more typical improvements, and Danish with smaller improvements and generally less variability), the impacts—especially now that we know where to expect them—are easy to detect at one-day intervals, despite the general variability in results over time. This means I can back off my sampling rate from ~10 minutes (which is sometimes hard to achieve) to something a little easier to handle—like half-hourly or hourly.

Czech, Finnish, and Galician Notes (T284578)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary) etc.
  • Stemming observations:
    • Czech Wikipedia had 37 distinct tokens in its largest stemming group.
      • The Czech stemmer stems single letters c → k, z → h, č → k, and ž → h (though plain z is a stop word) and ek → k and eh → h. This seems like an over-aggressive stemmer... looking at the code, it is modifying endings even when there is nothing that looks like a stem. I will submit a ticket or maybe work on a patch as a 10% project.
    • Finnish Wikipedia had 61 distinct tokens in its largest stemming group.
    • Galician Wikipedia had 66 distinct tokens in its largest stemming group.
      • Since I can recognize some cognates in other Romance languages, I can say that the largest group is a little aggressive; it includes Ester, Estaban, estación, estato, estella, estiño, plus many forms of estar.
      • Galician also has a very large number of words in other scripts, which lead to some very long tokens, like the 132-character \u-encoded version of 𐍀𐌰𐌿𐍂𐍄𐌿𐌲𐌰𐌻𐌾𐌰, Gothic for "Portugal".
      • Galician Wiktionary likes to use superscript numbers for different meanings of the same word, so the entry for canto has canto¹ through canto⁴, which get indexed as canto1 through canto4; there are a fair number of such tokens. Fortunately, the unnumbered version should always be on the same page.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and found plenty of examples.
    • There are some Greek/Latin examples in Czech
      • Including "incorrect" Greek letters in IPA on cswikt (oddly, there are some Greek letters that are commonly used in IPA and others that have Latin equivalents that are used instead, and for a couple it's a free-for-all!)
    • There are Cyrillic/Greek and Latin/Greek examples in Finnish Wikipedia and Galician Wiktionary.
    • Galician Wikipedia had lots of Latin/Greek tokens—though many seem to be abbreviations for scientific terms... but there are a few actual mistakes in there, too.
  • Enabled ICU normalization and saw the usual normalizations.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Most common normalizations:
      • Czech: the usual various character regularizations, invisibles (bidi, zero-width (non)joiners, soft hyphens), a few #ª ordinals
      • Finnish: mostly ß/ss & soft hyphens
      • Galician: lots of #ª ordinals, lots of invisibles
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [Áá, Čč, Ďď, Éé, Ěě, Íí, Ňň, Óó, Řř, Šš, Ťť, Úú, Ůů, Ýý, and Žž] for Czech.
    • Exempted [Åå, Ää, Öö] for Finnish.
    • Exempted [Ññ] for Galician.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Czech: lots more tokens with Latin + diacritics than usual, since the list of exemptions is pretty big, and exempts some characters used in other languages, like French and Polish.
    • Finnish: lots of š and ž, which are supposed to be used in loan words and foreign names, but are often simplified to s or z (or sh and zh, but that is probably outside our scope).
    • Galician: Nothing really sticks out as particularly common; just a collection of the usual folding mergers.
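
The idea of folding with per-language exemptions can be sketched in a few lines. This is a simplified stand-in, not ICU folding itself: it only strips combining marks via Unicode decomposition, so it misses many things real ICU folding handles (for example ø, ß, or dotless ı). The exemption sets are taken from the notes above; the names `EXEMPT` and `fold_with_exemptions` are hypothetical.

```python
import unicodedata

# Exempted characters per language, as listed above (NFC-normalized to be safe).
EXEMPT = {
    "cs": set(unicodedata.normalize("NFC", "ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž")),
    "fi": set(unicodedata.normalize("NFC", "ÅåÄäÖö")),
    "gl": set(unicodedata.normalize("NFC", "Ññ")),
}


def fold_with_exemptions(text: str, lang: str) -> str:
    """Strip diacritics from every character except the language's exempted ones."""
    keep = EXEMPT.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # exempted: leave untouched
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold_with_exemptions("šakki jää", "fi"))
print(fold_with_exemptions("estación España", "gl"))
print(fold_with_exemptions("Škoda façade", "cs"))
```

With the Finnish set, š is folded to s while ä is kept; with the Czech set, Š survives but the French ç is still folded, which is the pattern described above.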

Czech, Finnish, Galician Reindexing Impacts

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top results from folding diacritics), and any unexpected impacts.

Summary

  • The Czech and Finnish Wikipedia samples showed clear but rather muted impact on user query results. The Galician results are a little more robust and show a more consistent pattern of searchers not using standard accents (rather than just problems with "foreign" diacritics).

Background

  • I pulled a sample of 10K Wikipedia queries from approximately July of 2021 (1 week each for Czech and Finnish, June through August for Galician). I filtered obvious porn, URLs, and other junk queries from each sample (Czech: 152, Finnish: 226, Galician: 928). URLs are the most common category in all cases, with numbers and junk also being common for all. Galician also had a lot of porn queries, and overall more useless queries, which is a trend on smaller wikis. I randomly sampled 3000 queries from the remainder.

Unpacking + ICU Norm + ICU Folding Impact on Czech Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 23.8% to 23.6% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 8.4%, vs. the pre-reindex control of 0.2–0.7% and post-reindex control of 0.1–0.7%.
  • The number of queries that changed their top result right after reindexing was 1.6%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0%.

Observations

  • Generally the impact on Czech Wikipedia was rather muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Finnish Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 24.6% to 24.4% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 9.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 4.0%, vs. the pre-reindex control of 0.4–0.5% and post-reindex control of 0.0–0.1%.

Observations

  • Generally the impact on Finnish Wikipedia was also muted. Changes in results were generally from missing diacritics.

Unpacking + ICU Norm + ICU Folding Impact on Galician Wikipedia (T290079)

Reindexing Results

  • The zero results rate dropped from 18.1% to 17.5% (-0.6% absolute change; -3.3% relative change).
  • The number of queries that got more results right after reindexing was 18.6%, vs. the pre-reindex control of 0.2–0.5% and post-reindex control of 0.1–0.7%.
  • The number of queries that changed their top result right after reindexing was 4.1%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.

Observations

  • The most common causes of improvement in zero-results came from matching missing accents on words that end with vowel + n. Cognate with what we've seen before, -cion for -ción is common, along with general accents missing from -ón/-ín/-ún endings.
  • The most common causes of an increased number of results and changes in the top result include correcting for missing accents from final vowel + n, and general incorrect (missing, extra, or wrong) diacritics.

Hindi, Irish, Norwegian Notes (T289612)

  • Usual 10K sample each from Wikipedia and Wiktionary for each language.
    • Except for Irish Wiktionary, which is quite small; I used a 1K sample for gawikt.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, compounds, a bit of likely vandalism; etc.
  • Stemming observations:
    • Irish Wikipedia had 16 distinct tokens in its largest stemming group.
    • Norwegian Wikipedia had 18 distinct tokens in its largest stemming group.
    • Hindi Wikipedia had 46 distinct tokens in its largest stemming group.
      • The first pass at analysis showed 1780 "potential problem" stems in the Hindi Wikipedia data, which are ones where the stemming group has no common prefix and no common suffix. This isn't particularly rare, but there usually aren't so many. It turns out that the majority (~1400) were caused by Devanagari numerals and Arabic numerals (e.g., १ and 1). I added folding rules to my analysis to handle those cases. Another common cause was long versions of vowels, such as अ (a) and आ (ā), which seem to frequently alternate at the beginning of words that have the same stem. A few more folding rules and I got down to a more normal number of "potential problem" stems (just 12), and they were all reasonable.
    • A smattering of mixed-script tokens.
      • Hindi had many non-homoglyph mixed script tokens, mostly Devanagari and another script. Many of these were separated by colons or periods, making me think word_break_helper could be useful, especially with better acronym handling.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and saw the usual stuff.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
      • Though not for Irish! Since Irish has language-specific lowercasing rules, both lowercasing and ICU normalization happen and lowercasing handles İ correctly.
    • Most common normalizations:
      • Irish Wikipedia also uses Mathematical Bold Italic characters (e.g., 𝙄𝙧𝙚𝙡𝙖𝙣𝙙) rather than bold and italic styling in certain cases, such as names of legal cases.
        • One instance of triple diacritics stuck out: gCúbå̊̊
      • Hindi had lots of bi-directional symbols, including on many words that are not RTL.
      • Norwegian had the usual various character regularizations, mostly diacritics, plus a handful of invisibles.
  • Further Customization—Irish
    • Older forms of Irish orthography used an overdot (ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ/ẛ ṫ) to indicate lenition, which is now usually indicated with a following h (bh ch dh fh gh mh ph sh th). Since these are not commonly occurring characters, it is easy enough to do the mapping (ḃ => bh, etc.) as a character filter. It doesn't cause a lot of changes, but it does create a handful of good mergers.
    • Another feature of Gaelic script is that its lowercase i is dotless (ı). However, since there is no distinction between i and ı in Irish, i is generally used in printing and electronic text. ICU folding already converts ı to i.
      • As an example, amhráin ("songs") appears in my corpus both in its modern form, and its older form, aṁráın (with dotted ṁ and dotless ı). Adding the overdot character filter (plus the existing ICU folding) allows these to match!
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Nothing exempted for Irish or Hindi.
    • Exempted Ææ, Øø, and Åå for Norwegian.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
      • Irish uses a fair number of acute accents to mark long vowels, though they seem to sometimes be omitted (perhaps as a mistake). There are quite a few mergers between undiacriticked (or partly diacriticked) forms and fully diacriticked forms, such as cailíochta and cáilíochta. There are a few potential incorrect forms (I recognize some English words that happen to look like forms of Irish words), but there aren't a lot, and some of them are already conflated by the current search.
    • Hindi: Most folding affects Latin words, and most of the Hindi words that were affected had bidi and other invisible characters stripped.
    • Norwegian Wiktionary had a surprising number of apparently Romance-language words that had their non-Norwegian diacritics normalized away.
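
The Irish overdot mapping described above can be sketched as a simple character translation. This is a minimal illustration, not the production character filter: only the lowercase letters listed in the notes are mapped, and the dotless ı replacement stands in for what ICU folding already does. The names `OVERDOT_MAP` and `modernize_irish` are hypothetical.

```python
import unicodedata

# Overdotted consonants from the notes above, mapped to modern "h" spellings.
# Uppercase forms are left out of this sketch.
OVERDOT_MAP = {
    "\u1e03": "bh",  # ḃ
    "\u010b": "ch",  # ċ
    "\u1e0b": "dh",  # ḋ
    "\u1e1f": "fh",  # ḟ
    "\u0121": "gh",  # ġ
    "\u1e41": "mh",  # ṁ
    "\u1e57": "ph",  # ṗ
    "\u1e61": "sh",  # ṡ
    "\u1e9b": "sh",  # ẛ (long s with dot above)
    "\u1e6b": "th",  # ṫ
}
_TABLE = str.maketrans(OVERDOT_MAP)


def modernize_irish(text: str) -> str:
    """Map overdotted consonants to consonant + h, and dotless ı to i."""
    # NFC first, so a base letter followed by U+0307 becomes one code point.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TABLE)
    return text.replace("\u0131", "i")


print(modernize_irish("a\u1e41r\u00e1\u0131n"))  # older spelling of amhráin
```

In the full analysis chain, folding would then also drop the acute accent, so the older and modern spellings end up with the same indexed form.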

Overall Impact

  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters.
  • ICU folding is the biggest source of changes in all wikis—as expected.
    • Irish Wikipedia: 134,095 tokens (15.887% of tokens) were merged into 2,524 groups (2.822% of groups).
    • Irish Wiktionary: 130 tokens (1.272% of tokens) were merged into 44 groups (1.074% of groups).
      • Irish Wiktionary mergers may be less numerous because of the smaller 1K sample size.
      • Irish had a much bigger apparent impact (15.887% of tokens), which is partially an oddity of accounting.
        • Looking at amhrán ("song") as an example, the original main stemming group consisted of amhrán, Amhrán, amhránaíocht, Amhránaíocht, amhránaíochta, Amhránaíochta, d’amhrán, nAmhrán, and tAmhrán. Another group without acute accents—possibly typos—consisted of amhran and Amhran. The larger group (which has more members that are also more common) is counted as merging into the smaller group because the new folded stem is amhran, not amhrán, giving 9 mergers rather than 2.
    • Hindi Wiktionary: 4 tokens (0.002% of tokens) were merged into 4 groups (0.012% of groups).
    • Hindi Wikipedia: 296 tokens (0.019% of tokens) were merged into 150 groups (0.128% of groups).
      • Hindi was barely affected by ICU folding, since it doesn't do much to Hindi text.
    • Norwegian Wiktionary: 1,310 tokens (1.229% of tokens) were merged into 990 groups (4.302% of groups)
    • Norwegian Wikipedia: 6,731 tokens (0.424% of tokens) were merged into 1,633 groups (0.979% of groups)
      • Generally, the merges that resulted from ICU folding in Norwegian were significant, but not extreme.

Irish, Hindi, Norwegian Reindexing Impacts

General Notes

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top results from folding diacritics), and any unexpected impacts.

Summary

  • Specific new matches in all three (Irish, Hindi, & Norwegian) Wikipedias are good.
  • The impact overall on the zero-results rate is fairly small for all three.
    • The zero-results rate for Hindi Wikipedia, independent of recent changes, is really high (60+%), so I investigated a bit. Transliteration of Latin queries to Devanagari could have a sizable impact.
  • Irish and Norwegian had a sizable increase in total results, and a noticeable increase in top results. Hindi had much smaller increases for both.
    • Irish changes were dominated by Irish diacritics (which are not part of the alphabet), while the Norwegian changes were dominated by foreign diacritics.

Background

  • I tried to pull a sample of 10K Wikipedia queries from June–August of 2021 (1 week in July each for Hindi and Norwegian, almost three months for Irish). I was only able to get 2,543 queries for Irish Wikipedia. I filtered obvious porn, urls, and other junk queries from each sample (Irish:959, Hindi:528, Norwegian:250, with urls and porn being the most common categories) and randomly sampled 3000 queries from the remainder (there were only 1448 unique queries left for the Irish sample).

Unpacking + ICU Norm + ICU Folding Impact on Irish Wikipedia (T294257)

Reindexing Results

  • The zero results rate dropped from 32.6% to 30.5% (-2.1% absolute change; -6.4% relative change).
  • The number of queries that got more results right after reindexing was 12.3%, vs. the pre-reindex control of 0% and post-reindex control of 0%.
  • The number of queries that changed their top result right after reindexing was 5.4%, vs. the pre-reindex control of 0.2% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing Irish diacritics.
  • The most common cause of an increased number of results is also matching missing Irish diacritics.
    • Unaccented versions of names like Seamus, Padraig, and O Suilleabhain now can find the accented versions (Séamus, Pádraig, Ó Súilleabháin).
    • Not all diacritical matches are the best. Irish bé matches English be, which occurs in titles of English works. Exact bé matches are still ranked highly.
  • The most common cause of changes in the top result is—you guessed it!—matching missing Irish diacritics; often with a near exact title match.
  • The negligible or zero changes in number of results and top results in the control runs stem from, I believe, the small size and low activity of the wiki; basically, there is virtually no noise at the 15–30 minute scale.

Unpacking + ICU Norm + ICU Folding Impact on Hindi Wikipedia (T294257)

Reindexing Results

  • The zero results rate dropped from 62.1% to 62.0% (-0.1% absolute change; -0.2% relative change).
  • The number of queries that got more results right after reindexing was 2.3%, vs. the pre-reindex control of 0.0–0.1% and post-reindex control of 0.0–0.1%.
  • The number of queries that changed their top result right after reindexing was 0.9%, vs. the pre-reindex control of 0.1% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing foreign diacritics (e.g., shito/shitō and nippo/nippō).
  • The most common causes of an increased number of results are matching missing foreign diacritics, removal of invisibles, and—to a much lesser degree—ICU normalization of some Hindi and other Brahmic accents, including Devanagari and Odia/Oriya virama and Sanskrit udātta.
  • The most common causes of changes in the top result are the same as for the increased number of results, since there is a lot of overlap (i.e., searches that got more results often changed their top result).

Hindi Wikipedia Zero Results Queries

Because the zero results rate was so high, I decided there was no time like the present to do a little investigating into why. I did a little digging into the 1,861 queries that got no results. (A reminder where this sample comes from: 10K Hindi Wikipedia queries were extracted from the search logs, 528 were filtered as porn, URLs, numbers-only, other junk, etc., and the remainder was deduped, leaving 9,060 unique queries. A random sub-sample of 3K was chosen from there, and the 1,861 (62.0%) of those that got zero results are under discussion here.)

The large majority (84%) of zero-results queries are in the Latin script, with Devanagari (13%) and mixed Latin + Devanagari (2%) making up most of the rest.

  • 1566 (84.1%) Latin
  • 244 (13.1%) Devanagari
  • 43 (2.3%) Latin + Devanagari
  • 4 (0.2%) Gujarati
  • 1 Gurmukhi (Punjabi) + Devanagari
  • 1 CJK
  • 1 emoji
  • 1 misc/wtf (punct + Devanagari combining chars)
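
A rough version of this kind of script breakdown can be computed from Unicode character names. The sketch below is a heuristic under stated assumptions, not the method actually used for the counts above: it looks only at letters, and takes the first word of each character's Unicode name as its script label. The function name `scripts_in` and the sample queries are made up for illustration.

```python
import unicodedata
from collections import Counter


def scripts_in(query: str) -> str:
    """Label a query by the scripts of its letters, e.g. 'LATIN' or 'DEVANAGARI + LATIN'."""
    found = set()
    for ch in query:
        if unicodedata.category(ch).startswith("L"):
            # First word of the Unicode name, e.g. 'LATIN', 'DEVANAGARI', 'CJK'.
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return " + ".join(sorted(found)) or "OTHER"


# Invented sample queries, for illustration only.
queries = ["taj mahal", "\u0924\u093e\u091c", "\u0924\u093e\u091c mahal", "123 !!"]
print(Counter(scripts_in(q) for q in queries))
```

Queries with no letters at all (numbers, punctuation, emoji) fall into the catch-all "OTHER" bucket and would need a manual look.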

I reviewed a random sample of 50 of the Latin queries, and divided them into two broad (and easy for me to discern) categories: English and non-English. The non-English generally looks like transliterated Devanagari/Hindi, but I did not explicitly verify that in all cases. There are a relatively small number of English queries and a larger number of mixed English and non-English queries, but the majority (~70%) are non-English.

50 Latin sample

  • 34 non-English
  • 13 Mixed English + non-English
  • 2 English
  • 1 ???

I took a separate random sample of 20 non-English queries and used Google Translate in Hindi to convert them to Devanagari. About half couldn't be automatically converted (I didn't dig into that to figure out why), but 25% got some Wikipedia results after conversion, and 15% that got no Wikipedia results got some sister-search (Wiktionary, etc.) results. The remaining 15% got no results.

20 non-English sample

  • 9 can't convert
  • 5 some results
  • 3 sister search results
  • 3 no results

Taking this naive calculation with a huge grain of salt (or at least with huge error bars), 84.1% of zero-result queries are in Latin script, 68% of those are likely transliterated Devanagari, and 40% of those get results when transliterated back to Devanagari. That's 22.9% (probably ±314.59%)... actually, the math nerd in me couldn't let it go... using the Wilson Score Interval and the standard error propagation formula for multiplication, I get 23.3% ± 11.9%.
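
For anyone who wants to redo the arithmetic, here is a sketch of one way to get there. It assumes a 95% confidence level (z = 1.96), uses the center and half-width of each Wilson score interval as the estimate and its error, and combines the relative errors in quadrature as in the usual propagation rule for products. Those choices are assumptions about the details, but under them the sketch lands on roughly the same 23.3% ± 11.9%.

```python
import math

Z = 1.96  # assumed 95% confidence level


def wilson(successes: int, n: int, z: float = Z):
    """Return (center, half_width) of the Wilson score interval for successes/n."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center, half


def product_with_error(intervals):
    """Multiply the centers; combine the relative errors in quadrature."""
    value = math.prod(c for c, _ in intervals)
    rel = math.sqrt(sum((h / c) ** 2 for c, h in intervals))
    return value, value * rel


steps = [
    wilson(1566, 1861),  # zero-result queries in Latin script
    wilson(34, 50),      # Latin sample judged non-English
    wilson(8, 20),       # non-English sample with results after conversion
]
estimate, error = product_with_error(steps)
print(f"{100 * estimate:.1f}% ± {100 * error:.1f}%")
```

Nearly all of the uncertainty comes from the two small samples (50 and 20 queries); the 1,861-query script breakdown contributes almost nothing to the error.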

So, in very round numbers, almost ¼ of non-junk zero-result queries (and likely at least ⅒ and at most ⅓) on Hindi Wikipedia could be rehabilitated with some sort of decent Latin-to-Devanagari transliteration. The number could be noticeably higher, too—most optimistically doubled—if the queries that Google Translate could not automatically convert got some sort of results with a more robust transliteration scheme; on the other hand, they could all be junk, too. It is also possible that the mixed English and transliterated Devanagari zero-result queries could get some results—though transliterating the right part of the mixed queries could present a significant challenge.

I have opened a ticket with this info (T297761) to go on our backlog as a possible future improvement for Hindi.

I also looked at a random sample of 20 of the zero-result Devanagari queries. The most common grouping is what I call "homework". These are queries that are phrased like all or part of a typical homework question, or other information-seeking question. Something like What is the airspeed velocity of an unladen swallow?, How does aspirin find a headache?, or hyperbolic geometry parallel lines.

I also found four names, one porn query, and three I couldn't readily decipher.

20 Devanagari sample

  • 12 "homework"
  • 4 names
  • 1 porn
  • 3 ???

Homework-type questions in general sometimes benefit from removing stop words, but sometimes there are too many specific but only semi-relevant content words to find a match.

Unpacking + ICU Norm + ICU Folding Impact on Norwegian Wikipedia (T294257)

Reindexing Results

  • The zero results rate dropped from 26.4% to 26.2% (-0.2% absolute change; -0.8% relative change).
  • The number of queries that got more results right after reindexing was 9.2%, vs. the pre-reindex control of 0.1–0.3% and post-reindex control of 0.1–0.2%.
  • The number of queries that changed their top result right after reindexing was 4.3%, vs. the pre-reindex control of 1.1–1.2% and post-reindex control of 0%.

Observations

  • The most common cause of improvement in zero-results is matching missing foreign diacritics (e.g., Butragueno/Butragueño and Bockmann/Böckmann).
  • The most common cause of an increased number of results is matching foreign diacritics.
  • The most common cause of changes in the top result is matching foreign diacritics.