User:TJones (WMF)/Notes/Unpacking Notes

April-June 2021 — See TJones_(WMF)/Notes for other projects. See also T272606. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

The Unpacking Process
Gather Data | Run Baselines | Unpack Analyzers | Re-enable Analyzer Upgrades | Repair Unpacked & Upgraded Analyzers | Enable ICU Folding | Compare Final Analyzer to Baseline | Merge Your Patch | Prep Query Data | Reindexing and Before-And-After Analysis
 * Gather 10K articles (without repeats) each from Wikipedia and Wiktionary for each language (custom Perl script, )
 * Manual review/editing: remove leading white space, dedupe lines, review potential HTML tags ( search for  )
 * Gather 10K/4weeks query data from Wikipedia for each language (Jupyter notebook,  on  )
 * Per language (I've been working on three at a time recently to somewhat streamline the process):
 * set language to target and reindex
 * run  as baseline for wiki and wikt 10K samples
 * Disable homoglyphs and icu_norm upgrades at these locations in
 * in
 * in
 * Per language:
 * unpack analyzer in  /
 * https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
 * ignore
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * should be zero diffs in count files; otherwise unpacking is not correct
 * Re-enable homoglyphs and icu_norm upgrades
 * Per language:
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for  /  for wiki and wikt;   comparison for wiki and wikt
 * solo—just trying to get the lay of the land
 * look at potential problem stems
 * look at largest Type Group Counts
 * anything around 20+ is interesting; well over 20 is surprising (but not necessarily wrong)
 * look at Tokens Generated per Input Token; usually expect 1 in baseline; some 2s with homoglyphs
 * look at Final Type Lengths; 1s are often CJK, longest are often URLs, German, spaceless languages, or  encoded
 * comparison—see what changed
 * expect dotted-I regression
 * lots of hidden characters removed (soft hyphens, bidi marks, joiners and non-joiners)
 * Super- and subscript characters get converted, ß to ss, too
 * Regularization of non-Latin characters is common, particularly, Greek ς to σ
 * investigate anything that doesn’t make sense
 * Per language:
 * Make any needed “repairs” to accommodate ICU normalization
 * possibly just
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for repaired for wiki and wikt;   comparison for wiki and wikt
 * solo—just trying to get the lay of the land
 * comparison—look for expected changes (maybe just dotted-I)
 * Per language:
 * enable ICU Folding
 * add language code to, and any folding exceptions to
 * add  to   list, usually in last place
 * set language to target and reindex
 * verify  config at http://127.0.0.1:8080/w/api.php?action=cirrus-settings-dump
 * run  as   for wiki and wikt 10K samples
 * run  —   for repaired for wiki and wikt;   comparison for wiki and wikt
 * solo—potential problem stems can show systematic changes, even if they aren’t really problems
 * elision (l’elision, d’elision, qu’elision, s’etc.) can throw this off
 * comparison—look for expected changes (rare characters and variants folded, diacritics folded, etc.)
 * Per language:
 * run  —   comparison for wiki and wikt
 * comparison—look at the overall impact of unpacking, upgrades, and ICU folding
 * Token delta: expect small numbers (<100) unless something “interesting” happened
 * New Collision Stats gives a sense of the overall impact, # of tokens that merge into other groups.
 * Typically < 3% on each number, with higher values in Wiktionary
 * Possibly a few Lost pre-analysis tokens
 * Net Gains: expect plenty of changes; high-impact changes are usually—
 * one- or two-letter tokens (e.g., a picks up á, à, ă, â, å, ä, ã, ā, ə, ɚ)
 * something with a lot of variants that includes a folded character (e.g., abc, abcs, l'abc, l'abcs, d'abc, d'abcs, qu'abc, qu'abcs, etc., with straight quotes, picks up l’abc, l’abcs, d’abc, d’abcs, qu’abc, qu’abcs, etc., with curly quotes)
 * or a diacriticless typo (Francois) picks up all the forms with diacritics (François—it’s hard to find an example in English)
 * Don’t expect any New Splits, Found pre-analysis tokens, or Net Losses unless there was additional customization
 * Summarize findings (here and in Phabricator)
 * When everything looks good and makes sense, submit the patch.
 * When the patch is merged, it’s time to reindex.
 * Before reindexing, using the 10K Wikipedia query sample:
 * Filter “bad queries” and randomly sample 3K queries (using a custom Perl script, )
 * Review the “bad queries” to make sure the filters are behaving reasonably for the given language
 * While reindexing Wikipedia for a given language, kick off “brute-force” sampling (using a custom Perl script, )
 * The brute-force script runs the same 3K queries every 10 minutes while reindexing
 * Let it run 2–3 more iterations after reindexing is complete
 * You may have to throw out a query run if reindexing finished in the middle of the run
 * Using time stamps from the reindexing and query runs, figure out the smallest gap between a “before” and an “after” query run and compare them (using a custom Perl script, ), noting differences in zero results rate, increases and decreases in results counts, and changes in top results.
 * Use similarly spaced pre-reindexing runs and post-reindexing as controls to get a handle on normal variability and compare to the before-and-after results.
 * Comparing the earliest and latest pre-reindexing runs also allows you to judge what is random fluctuation and what is directional. For example:
 * If 10-minute interval comparisons all give 2–3% changes in top result, and a 60-minute interval gives 2.3% changes in top results, it’s probably random noise.
 * If 10-minute interval comparisons all give 2–3% changes in increased results, and a 60-minute interval gives a 6% change in increased results, it’s probably partly noise overlaying a general increasing trend.
 * Summarize findings (here and in Phabricator)
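The before-and-after comparison described above can be sketched roughly as follows. This is a minimal illustration, not the custom Perl script mentioned in the steps; the `QueryResult` record, the `compare_runs` function, and the shape of a "run" (a dict from query string to result) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class QueryResult:
    """Hypothetical record of one query in one run."""
    total_hits: int
    top_title: Optional[str]  # None when the query got zero results


def compare_runs(before: Dict[str, QueryResult],
                 after: Dict[str, QueryResult]) -> Dict[str, float]:
    """Compare two runs of the same query set; all values are percentages."""
    shared = [q for q in before if q in after]
    if not shared:
        raise ValueError("the two runs have no queries in common")

    def pct(count: int) -> float:
        return 100.0 * count / len(shared)

    return {
        "zero_results_before": pct(sum(before[q].total_hits == 0 for q in shared)),
        "zero_results_after": pct(sum(after[q].total_hits == 0 for q in shared)),
        "more_results": pct(sum(after[q].total_hits > before[q].total_hits for q in shared)),
        "fewer_results": pct(sum(after[q].total_hits < before[q].total_hits for q in shared)),
        "top_result_changed": pct(sum(after[q].top_title != before[q].top_title for q in shared)),
    }
```

Running the same function on two pre-reindexing runs gives the control numbers; running it on the closest "before" and "after" pair gives the numbers to compare against that control range.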

Spanish Notes (T277699)

 * Usual 10K sample each from Wikipedia and Wiktionary
 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
 * Enabled homoglyphs and found a few examples in each sample
 * Enabled ICU normalization and saw the usual normalization
 * Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Potential concerns:
 * 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
 * However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
 * nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
 * Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
 * Updated test fixtures for Spanish and multi-language tests.


 * Refactored building of mapping character filters. There are so many that are just dealing with dotted I after unpacking.
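The dotted-I regression mentioned above is easy to reproduce outside of Elasticsearch, because Unicode-aware lowercasing turns İ (U+0130) into i plus a combining dot above (U+0307). The sketch below uses Python's built-in lowercasing as a stand-in for the analysis chain and applies a character mapping before lowercasing, which is the general idea of a mapping char_filter. The specific mapping target (İ to I) and the function names are assumptions for illustration, not a copy of the deployed config.

```python
# Assumed mapping, applied before lowercasing (like a mapping char_filter).
DOTTED_I_MAP = {"\u0130": "I"}  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> I


def naive_lowercase(text: str) -> str:
    """Lowercase without any character filter; shows the regression."""
    return text.lower()


def lowercase_with_char_filter(text: str) -> str:
    """Apply the character mapping first, then lowercase."""
    for source, target in DOTTED_I_MAP.items():
        text = text.replace(source, target)
    return text.lower()


if __name__ == "__main__":
    word = "\u0130stanbul"
    print([hex(ord(c)) for c in naive_lowercase(word)[:2]])  # i + U+0307
    print(lowercase_with_char_filter(word))
```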

Tokenization/Indexing Impacts


 * Spanish Wikipedia (eswiki)
 * There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, º, which normalize to no, a, o, which are stop words (but captured by the plain field).
 * About 1.2% of tokens merged with other tokens. The tokens in queries are likely to be somewhat similar.


 * Spanish Wiktionary (eswikt)
 * There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
 * Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia (T282808)
Summary
 * While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background
 * I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, and filtered 89 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
 * I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (the run took about 9 minutes to complete) 6 times. When the reindexing finished, I stopped the 7th iteration because it was mixed and had just started; it started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
 * I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
 * I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.

Expected Results
 * Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results
 * The number of queries getting zero results held steady at 19.3%
 * The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10-minute intervals; 5.2% over 50 minutes)
 * The number of queries getting fewer results is noise (0.1% to 1.4% in 10-minute intervals; 1.4% over 50 minutes)
 * The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10-minute intervals; 3.8% over 50 minutes)
 * The number of queries changing their top result is noise (0.7% to 0.9% in 10-minute intervals; 0.7% over 50 minutes)
 * These results are also generally consistent with the control tests I ran in April and May.

Reindexing Results
 * The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
 * The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
 * The number of queries getting a different number of results increased by 20.2% (vs. the 0.7%–2.4% range seen in control).
 * The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). It is improbable, but not impossible, that this is still random noise. I don't have any obvious explanation after looking at the queries in question.
 * The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
 * The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.
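As a quick check on the arithmetic in the list above, the absolute and relative changes and the "times the control max" figures can be recomputed from the reported percentages. The helper names here are made up for the example; the input numbers are the ones quoted in these notes.

```python
def absolute_change(before_pct: float, after_pct: float) -> float:
    """Change in percentage points."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change as a percentage of the starting value."""
    return 100.0 * (after_pct - before_pct) / before_pct


def times_control_max(observed_pct: float, control_max_pct: float) -> float:
    """How many times larger the observed value is than the control maximum."""
    return observed_pct / control_max_pct


if __name__ == "__main__":
    # Zero results rate: 19.3% before, 18.9% after.
    print(round(absolute_change(19.3, 18.9), 1))
    print(round(relative_change(19.3, 18.9), 1))
    # Fewer results: 2.1% vs a control max of 1.4%.
    print(round(times_control_max(2.1, 1.4), 1))
    # Top result changed: 6.4% vs a control max of 0.9%.
    print(round(times_control_max(6.4, 0.9), 1))
```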

Post-Reindex Control
 * The one control test I ran after reindexing showed changes approximately within the normal range, except for the rate of changes in top result, which was 0% (vs 0.7–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.

German/Dutch/Portuguese Notes (T281379)

 * Usual 10K sample each from Wikipedia and Wiktionary for each language
 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
 * Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
 * Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * German required customization to maintain ß for stopword processing.
 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
 * Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
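The two most impactful foldings described above (curly apostrophes becoming straight, and middle dots being removed) can be imitated with a plain character translation table. This is only an illustration of the effect on tokens; real ICU folding covers far more characters, and the table and function name below are assumptions for the example.

```python
# Illustrative subset of folding behavior, not the ICU folding rules themselves.
FOLD_TABLE = str.maketrans({
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK -> straight apostrophe
    "\u00b7": None,  # MIDDLE DOT (syllable separator) -> removed
})


def fold_punctuation(token: str) -> str:
    """Straighten curly apostrophes and drop middle dots."""
    return token.translate(FOLD_TABLE)


if __name__ == "__main__":
    print(fold_punctuation("d\u2019Europe"))
    print(fold_punctuation("Xe\u00b7no\u00b7kra\u00b7tie"))
    # Periods are untouched, so Portuguese-style syllabification remains.
    print(fold_punctuation("co.ro.no.gra.fo"))
```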

German Notes
General German


 * ICU normalization interacts with German stop words. mußte gets filtered (as musste) and daß does not get filtered (as dass). Fortunately, a few years ago, David patched  in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding!! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss', rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
 * Sorting all this out also seems to have fixed T87136.
 * There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
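The ß interaction described above can be imitated in a few lines. In this sketch, Python's `str.casefold()` stands in for ICU normalization (both turn ß and ẞ into ss), a per-character exception set stands in for the exception filter, and the two-word stop list is a hypothetical sample chosen to match the behavior reported in these notes (daß is a stop word, dass is not; musste is, mußte is not). None of the names below come from the real configuration.

```python
from typing import AbstractSet

# Hypothetical two-word sample, consistent with the behavior described above.
STOP_WORDS_SAMPLE = frozenset({"da\u00df", "musste"})

# Assumed pre-normalization mapping: capital sharp s -> lowercase sharp s.
CHAR_FILTER_MAP = {"\u1e9e": "\u00df"}


def char_filter(text: str) -> str:
    for source, target in CHAR_FILTER_MAP.items():
        text = text.replace(source, target)
    return text


def normalize(token: str, exceptions: AbstractSet[str] = frozenset()) -> str:
    """casefold() each character unless it is in the exception set."""
    return "".join(ch if ch in exceptions else ch.casefold() for ch in token)


def is_stop_word(token: str, exceptions: AbstractSet[str] = frozenset()) -> bool:
    return normalize(char_filter(token), exceptions) in STOP_WORDS_SAMPLE


if __name__ == "__main__":
    print(is_stop_word("da\u00df"))               # normalized to dass
    print(is_stop_word("da\u00df", {"\u00df"}))   # sharp s protected
    # Reusing a folding-style exception set leaves capitals unlowercased:
    print(normalize("\u00c4rger", {"\u00c4", "\u00e4", "\u00df"}))
```

The last line shows why a single exception set for both folding and normalization is a problem: protecting Ä from folding also protects it from lowercasing.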

German Wikipedia


 * Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
 * It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
 * About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.

German Wiktionary


 * Most common ICU normalizations are long-s's (ſ) (e.g., Auguſt), but that's not bad.
 * The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
 * About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
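The \u-encoded tokens above are UTF-16 surrogate pairs written out as escapes. A small helper (hypothetical, written for this note) can turn such escaped text back into the real characters by combining each high/low surrogate pair into one code point, which makes the long tokens much easier to read.

```python
import re

# Matches one escaped high surrogate followed by one escaped low surrogate.
_SURROGATE_PAIR = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
)


def decode_surrogate_escapes(text: str) -> str:
    """Replace each escaped surrogate pair with the character it encodes."""
    def join_pair(match: "re.Match[str]") -> str:
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        # Standard UTF-16 decoding of a surrogate pair.
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

    return _SURROGATE_PAIR.sub(join_pair, text)


if __name__ == "__main__":
    escaped = r"\uD800\uDF30\uD800\uDF3D\uD800\uDF33"
    decoded = decode_surrogate_escapes(escaped)
    print(decoded, [hex(ord(c)) for c in decoded])
```

Note that each Gothic or Old Italic letter takes 12 characters in escaped form, which is why these tokens show up among the longest in the samples.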

Dutch Notes
General Dutch


 * The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German, rather than Dutch, so that's a good thing.
 * There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.

Dutch Wikipedia


 * Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
 * About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.

Dutch Wiktionary


 * The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
 * About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.

Portuguese Notes
Portuguese Wikipedia


 * There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
 * The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
 * Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
 * About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.

Portuguese Wiktionary


 * There's a very small impact on token counts (0.008% out of ~147K), which are mostly homoglyphs.
 * Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
 * About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.

Basque, Catalan, and Danish Notes (T283366)

 * Usual 10K sample over a 1–4 week period from Wikipedia and Wiktionary for each language.
 * Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.


 * Stemming observations:
 * Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
 * Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
 * Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.


 * Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
 * Note that  is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.


 * Enabled homoglyphs and found a handful of examples in all six samples.
 * Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
 * Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.


 * Enabled ICU normalization and saw the usual normalizations.
 * The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
 * Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
 * Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). "e" and "o" are stop words—"o" means "or", but "e" just seems to refer to the letter; weird. The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.


 * Enabled custom ICU folding for each language, saw lots of the usual folding effects.
 * Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
 * ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
 * Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
 * Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
 * Catalan Wikipedia:
 * Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
 * The majority are ç → c; most look ok
 * A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
 * Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
 * Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
 * Danish: Also straightened a fair number of curly quotes.
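Folding with per-language exemptions, as described above, can be approximated by decomposing each character and dropping its combining marks unless the character is on an exemption list. This is a rough sketch using Python's `unicodedata`, not ICU folding itself: it only handles characters that decompose into a base letter plus combining marks, so letters like æ and ø are left alone whether or not they are exempted. The function name and signature are assumptions for the example.

```python
import unicodedata
from typing import AbstractSet


def fold_diacritics(text: str, exemptions: AbstractSet[str] = frozenset()) -> str:
    """Strip combining marks from each character, except exempted characters.

    Assumes the input is in NFC form so that exempted characters such as
    n-with-tilde appear as single code points.
    """
    folded = []
    for ch in text:
        if ch in exemptions:
            folded.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


if __name__ == "__main__":
    print(fold_diacritics("d\u00eda"))                      # accent folded
    print(fold_diacritics("a\u00f1o", {"\u00f1"}))          # tilde-n exempted
    print(fold_diacritics("gar\u00e7on"))                   # cedilla folded to c
    print(fold_diacritics("\u00e5r", {"\u00e6", "\u00f8", "\u00e5"}))  # Danish exemptions
```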

Overall Impact

 * There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
 * ICU folding is the biggest source of changes in all wikis—as expected.
 * Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
 * Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
 * Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
 * Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
 * Catalan Wikipedia:
 * 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
 * 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
 * Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
 * Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)