User:TJones (WMF)/Notes/Language Analyzer Harmonization Notes

May 2023 — See TJones_(WMF)/Notes for other projects. See also T219550. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro, Goals, Caveats
The goal of bringing language analyzers "into harmony" is to have as many of the non–language-specific elements of the analyzers be the same as possible. Some split words on underscores and periods, some don't. Some split CamelCase words and some don't. Some use ASCII folding, some use ICU folding, and some don't use either. Some preserve the original word and emit two outputs when folding, and some don't. Some use the ICU tokenizer and some use the standard tokenizer (for no particular reason—there are good reasons to use the ICU, Hebrew, Korean, or Chinese tokenizers in particular cases). When there is no language-specific reason for these differences, it's confusing, and we clearly aren't using analysis best practices everywhere.
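
For readers who'd like something concrete, here's a minimal sketch of what a "harmonized" analyzer could look like as Elasticsearch analysis settings. It's illustrative, not the deployed CirrusSearch config: the analyzer name is made up, the word_break_helper mappings are abbreviated, and the preserve_original/preserve_original_recorder token filters come from the search-extra plugin.

```python
# Illustrative only: a "harmonized" analyzer with the shared,
# non-language-specific pieces discussed above.
harmonized_analysis = {
    "analysis": {
        "char_filter": {
            "word_break_helper": {
                "type": "mapping",
                # \u0020 is a space: turn word-joining punctuation into word breaks
                "mappings": ["_=>\\u0020", ".=>\\u0020"],
            },
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["word_break_helper"],
                "tokenizer": "icu_tokenizer",
                "filter": [
                    "lowercase",
                    "preserve_original_recorder",  # remember the unfolded token...
                    "icu_folding",
                    "preserve_original",  # ...and emit it alongside the folded one
                ],
            },
        },
    },
}
```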

My design goal is to have all of the relevant upgrades made by default across all language analysis configurations, with only the exceptions having to be explicitly configured.
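
As a sketch of that default-plus-exceptions shape (the names and values here are hypothetical, not our actual settings): shared defaults apply everywhere, and a per-language entry overrides only what it has a language-specific reason to change.

```python
# Hypothetical sketch: upgrades are on by default; a language config only
# spells out its exceptions (e.g., a language-specific tokenizer).
DEFAULTS = {
    "tokenizer": "icu_tokenizer",
    "folding": "icu_folding",
    "word_break_helper": True,
}

EXCEPTIONS = {
    "ko": {"tokenizer": "nori_tokenizer"},  # Korean-specific tokenizer
}

def build_config(lang: str) -> dict:
    """Merge the shared defaults with a language's explicit exceptions."""
    return {**DEFAULTS, **EXCEPTIONS.get(lang, {})}

assert build_config("fr")["tokenizer"] == "icu_tokenizer"
assert build_config("ko")["tokenizer"] == "nori_tokenizer"
```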

Our performance goal is to reduce the zero-results rate and/or increase the number of results returned for 75% of relevant queries, averaged across all wikis. This goal comes with some caveats, which I left out of the initial statement to keep it reasonably concise.


 * "All wikis" is, in effect, "all reasonably active wikis"—if a wiki has only had twelve searches last month, none with apostrophes, it's hard to meaningfully measure "75% of the queries with apostrophes" in them. More details in "Data Collection" below.
 * I'm also limiting my samples to Wikipedias, both because they have the widest variety of content and queries, and to keep the testing scope manageable, which allows more languages to be included.
 * I'm going to ignore wikis with unchanged configs (some elements are already deployed on some wikis), since they will have approximately 0% change in results (there's always a bit of noise).
 * "Relevant" queries are those that have the feature being worked on. So, I will have a collection of queries with apostrophe-like characters in them to test improved apostrophe handling, and a collection of queries with acronyms to test better acronym processing. I'll still test general query corpora to get a sense of the overall impact, and to look for cases where queries without the feature being worked on still get more matches (for example, searching for NASA should get more matches to N.A.S.A. in articles).
 * I'm also applying my usual filters (used for all the unpacking impact analyses) to queries, mostly to filter out porn and other junk. For example, I don't think it is super important whether the query s`wsdfffffffsf actually gets more results once we normalize the backtick/grave accent to an apostrophe.
 * Smaller/lower-activity wikis may get filtered out for having too few relevant queries for a given feature.
 * We are averaging rates across wikis so that wiki size isn't a factor (and neither is sample rate—so I can oversample smaller wikis without having to worry about a lot of bookkeeping); see the sketch after this list.
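
To make the averaging concrete, here's a tiny sketch (with made-up numbers): each wiki contributes one rate, and the overall figure is the unweighted mean of the per-wiki rates, so English's huge query volume counts no more than a small wiki's.

```python
# Unweighted (macro) average of per-wiki rates: every wiki counts equally,
# regardless of size or sampling rate. The rates below are made up.
zero_results_rate = {"en": 0.22, "ig": 0.35, "vi": 0.28}

macro_avg = sum(zero_results_rate.values()) / len(zero_results_rate)
print(f"{macro_avg:.3f}")  # 0.283, not weighted by query volume
```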

Data Collection
I started by including all Wikipedias with 10,000 or more articles. I also gathered the number of active editors and the number of full-text queries (with the usual anti-bot filters) for March 2023. I dropped those with fewer than 700 monthly queries, and those with fewer than 50 active editors. My original idea for the thresholds had been ~1000 monthly queries and ~100 active editors, but I didn't want or need a super sharp cutoff. Limiting by very low active editor counts meant fewer samples to gather at the query-gathering step, which is somewhat time-consuming. Limiting by query count also meant less work at the next step of filtering queries, and at all later steps, too.
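
The selection step is simple enough to show as a sketch; the records and field names here are hypothetical stand-ins for the stats I actually pulled.

```python
# Hypothetical records; the thresholds are the ones described above.
wikis = [
    {"code": "fr", "articles": 2_500_000, "monthly_queries": 800_000, "active_editors": 16_000},
    {"code": "xx", "articles": 12_000, "monthly_queries": 350, "active_editors": 20},
]

selected = [
    w for w in wikis
    if w["articles"] >= 10_000
    and w["monthly_queries"] >= 700
    and w["active_editors"] >= 50
]
# [{'code': 'fr', ...}]
```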

I ran my usual query filters (as mentioned above), and also dropped wikis with fewer than 700 unique queries after filtering. That left 90 Wikipedias to work with. In order of number of unique filtered monthly queries, they are: English, Spanish, French, German, Russian, Japanese, Chinese, Italian, Portuguese, Polish, Arabic, Dutch, Czech, Korean, Indonesian, Turkish, Persian, Vietnamese, Swedish, Hebrew, Ukrainian, Igbo, Finnish, Hungarian, Romanian, Greek, Norwegian, Catalan, Hindi, Thai, Simple English, Danish, Bangla, Slovak, Bulgarian, Swahili, Croatian, Serbian, Tagalog, Slovenian, Lithuanian, Georgian, Tamil, Malay, Uzbek, Estonian, Albanian, Azerbaijani, Latvian, Armenian, Marathi, Burmese, Malayalam, Afrikaans, Urdu, Basque, Mongolian, Telugu, Sinhala, Kazakh, Macedonian, Khmer, Kannada, Bosnian, Egyptian Arabic, Galician, Cantonese, Icelandic, Gujarati, Central Kurdish, Serbo-Croatian, Nepali, Latin, Kyrgyz, Belarusian, Esperanto, Norwegian Nynorsk, Assamese, Tajik, Punjabi, Oriya, Welsh, Asturian, Belarusian-Taraškievica, Scots, Luxembourgish, Irish, Alemannic, Breton, & Kurdish.


 * Or, in language codes: en, es, fr, de, ru, ja, zh, it, pt, pl, ar, nl, cs, ko, id, tr, fa, vi, sv, he, uk, ig, fi, hu, ro, el, no, ca, hi, th, simple, da, bn, sk, bg, sw, hr, sr, tl, sl, lt, ka, ta, ms, uz, et, sq, az, lv, hy, mr, my, ml, af, ur, eu, mn, te, si, kk, mk, km, kn, bs, arz, gl, zh-yue, is, gu, ckb, sh, ne, la, ky, be, eo, nn, as, tg, pa, or, cy, ast, be-tarask, sco, lb, ga, als, br, ku.

I sampled 1,000 unique filtered queries from each language (except for those that had fewer than 1,000). I also made a slight mistake in my filtering (I adapted the code from another script I use and missed a bit), so many of the 1K query corpora are actually 990–999 queries when run against production indexes... Sigh.
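
The sampling itself is nothing fancy; a hedged sketch (sample_queries and its inputs are hypothetical, not my actual script):

```python
import random

def sample_queries(filtered_queries, k=1000):
    """Take k unique filtered queries, or all of them if there are fewer."""
    pool = sorted(set(filtered_queries))  # uniquify; sort for reproducibility
    if len(pool) <= k:
        return pool
    return random.sample(pool, k)
```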

I also pulled 1,000 articles from each Wikipedia to use for testing.

Some Observations
After filtering porn and likely junk queries and uniquifying what was left, the percentage of queries remaining generally ranged from 94.68% (Icelandic—so many unique queries!) down to 70.64% (Persian), with a median of 87.5% (87.51% for Marathi and 87.49% for Azerbaijani) and a generally smooth distribution across that range.

There were three outliers:


 * Swahili (57.67%) and Igbo (37.74%) just had a lot of junk queries.
 * Vietnamese was even lower at 30.00%, with some junk queries but also an amazing number of repeated queries, many of which are quite complex (it's not as if everyone is searching for just famous names or movie titles or something "simple"). A few queries I looked up on Google seem to exactly match titles or excerpts of web pages. I wonder if there is a browser tool or plugin somewhere that is automatically doing wiki searches based on page content.

"Infrastructure"
I've built up some temporary "infrastructure" to support impact analysis of the harmonization changes. Since every or almost every wiki will need to be reindexed to enable harmonization changes, timing the "before and after" query analyses for the 90 sampled wikis would be difficult.

Instead, I've set up a daily process that runs all 90 samples each day. There's an added bonus of seeing the daily variation in results without any changes.
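
As a rough sketch of what that daily process looks like (the function, paths, and file layout here are hypothetical, and result_count() stands in for actually running a query against the live index):

```python
import datetime
import json
import pathlib

def result_count(wiki, query):
    """Hypothetical stand-in for running one full-text query against the live index."""
    raise NotImplementedError

def run_daily(samples_dir="samples", out_dir="daily-counts"):
    """Rerun every wiki's sample and record hit counts, one JSON file per wiki per day."""
    today = datetime.date.today().isoformat()
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    for sample in pathlib.Path(samples_dir).glob("*.txt"):  # e.g., samples/frwiki.txt
        wiki = sample.stem
        counts = {q: result_count(wiki, q) for q in sample.read_text().splitlines()}
        pathlib.Path(out_dir, f"{wiki}.{today}.json").write_text(
            json.dumps(counts, ensure_ascii=False))
```

With one file per wiki per day, the "before" and "after" of any reindexing, as well as the day-to-day noise, can be read straight off the time series.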

I will also pull relevant sub-samples for each of the features (apostrophes, acronyms, word_break_helper, etc.) being worked on and run them daily as well.

There's a rather small chance of having a reindexing finish while a sample is being run, so that half the sample is "before" and half is "after". If that happens, I can change my monitoring cadence to every other day for that sample for comparison's sake and it should be ok.

Apostrophes (T315118)
There are many candidate "apostrophe-like" characters...
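
For a sense of what the normalization could look like, here's a sketch of an Elasticsearch mapping char filter covering a few of the best-known candidates (the filter name is made up, and the real candidate list is much longer):

```python
# A few apostrophe-like characters mapped to the plain ASCII apostrophe
# (U+0027). Illustrative only.
apostrophe_norm = {
    "type": "mapping",
    "mappings": [
        "\\u2019=>\\u0027",  # ’ right single quotation mark
        "\\u02BC=>\\u0027",  # ʼ modifier letter apostrophe
        "\\u0060=>\\u0027",  # ` grave accent / backtick
        "\\u00B4=>\\u0027",  # ´ acute accent
    ],
}
```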

Acronyms (T170625)
...
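
One plausible shape for this (an assumption on my part, not necessarily what T170625 settles on) is a pattern_replace char filter that strips the internal periods from dotted acronyms, so that N.A.S.A. in an article matches a query for NASA. Note that ordering matters: it would have to run before a period-to-space mapping like word_break_helper's.

```python
# Hypothetical sketch: delete a period that is followed by another
# letter-plus-period, so "N.A.S.A." becomes "NASA." (the trailing period
# is dropped by the tokenizer anyway). The pattern is Java regex, as
# Elasticsearch expects.
acronym_fixer = {
    "type": "pattern_replace",
    "pattern": "(\\p{L})\\.(?=\\p{L}\\.)",
    "replacement": "$1",
}
```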