User:TJones (WMF)/Notes/Analysis of DYM Method 2

October 2020, February 2021 — See TJones_(WMF)/Notes for other projects. See also T244800. For help with technical jargon, check out the Search Glossary.

Background
Spelling corrections and other "Did you mean…?" (DYM) suggestions on Wikipedia and its sister projects are often not great. So, we've undertaken a project to improve the DYM suggestions on Wikipedia.

The goal is to implement multiple methods for making suggestions that address the issues of different languages and projects, including difficulties introduced by writing systems, and the amount of data available. (See T212884.) "Method 2" (M2, T212891) uses external resources other than search logs (e.g., dictionaries with word frequencies) as a source for spelling corrections. It is only applicable to languages with relevant linguistic resources, in particular the CJK languages (Chinese, Japanese, Korean).

The goal of this analysis is to get a sense of how broadly Method 2 applies to queries on Chinese, Japanese, and Korean Wikipedias and a baseline of the current DYM process to compare it to. I also want to get a rough idea of the quality of the suggestions, and some information on the queries themselves.

Data
I pulled user query logs for CJK Wikipedias, along with the current production suggestions from the built-in phrase suggester. I also pulled the current Glent M2 suggestions (which are being calculated, but not shown to users), and linked them to the matching user queries.

Because of a bug in Glent filtering, there is a hole in the Glent Japanese data. I used a sample of data I pulled in February instead, so that queries and suggestions were all from the same time period.

I had to do my own normalization to link the queries. I did basic normalization to lowercase and normalize whitespace. I also did heavy normalization (lowercasing and removing all spaces) as a backup. In all cases, the majority of query matches were exact, but some queries matched only after basic or heavy normalization in all three sample sets.
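The linking cascade described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, "spaces" is assumed to mean any Unicode whitespace, and the M2 suggestions are assumed to be available as a simple query-to-suggestion mapping.

```python
import re


def basic_normalize(query: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", query.strip()).lower()


def heavy_normalize(query: str) -> str:
    """Lowercase and remove all whitespace (assumed reading of 'all spaces')."""
    return re.sub(r"\s+", "", query).lower()


def build_indexes(m2_suggestions: dict) -> dict:
    """Pre-compute one lookup table per normalization level."""
    return {
        "exact": dict(m2_suggestions),
        "basic": {basic_normalize(q): s for q, s in m2_suggestions.items()},
        "heavy": {heavy_normalize(q): s for q, s in m2_suggestions.items()},
    }


def link(query: str, indexes: dict):
    """Try exact, then basic, then heavy matching; return (level, suggestion)."""
    keys = (
        ("exact", query),
        ("basic", basic_normalize(query)),
        ("heavy", heavy_normalize(query)),
    )
    for level, key in keys:
        if key in indexes[level]:
            return level, indexes[level][key]
    return None, None
```

Trying the levels in order of strictness means each linked query is counted under the least aggressive normalization that produces a match.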

The table below shows the number of queries in each set, and the number of exact-match, normalized-match, and heavily normalized–match M2 suggestions that were linked up. Unsurprisingly, basic normalization doesn't do much, since CJK characters don't change when "lowercased". The heavily normalized suggestions and queries match more often, when spaces are removed.

Note: The data we are using is not a perfect reflection of the user experience. There are some "last mile" tricks used to improve suggestions. For example, if a query is a title match for an article, we don't make a suggestion. These deviations are minor, and shouldn't affect the overall impression we get from our corpus.

Common Analysis Notes
Below are some notes that apply to all three languages; rather than repeat them, I've put them here.

Multiple Suggestions
The current DYM system only makes one suggestion at a time, and we're only looking at one suggestion from M2 for any given query.

However, because the current DYM suggestions come from different shards of the index—or from the same shard accessed weeks apart—the underlying word statistics can differ, and the exact DYM suggestion shown to users can vary.

When queries with multiple suggestions make it into a review sample, we rate all the options offered and average them.

Evaluations
Based on my experience with M0 and M1, and the fact that I needed to ask someone else to do the manual review, I decided to analyze samples of the production suggestions, the M2 suggestions, and the head-to-head cases (where both made a suggestion). I did not review the queries where prod and M2 agree; there aren't very many of them, but it would still require reviewing a fair number of them.

I asked the reviewer to look at samples of 25 from each group (so ~100 suggestions total, since there are two suggestions for each query in the head-to-head group).

Queries were rated from 5 (best) to 1 (worst) based on the following examples (in English):

While 2 is better than 1 and 5 is better than 4, I consider 1s and 2s to be poor suggestions and 4s and 5s to be good suggestions. 3s are generally neither good nor bad, and also not that common.

95% confidence intervals are calculated using the Wilson score interval, which accounts for very small sample sizes, using this tool. As a result, the confidence intervals can seem a little skewed: 3/9 is centered on 38%, rather than 33%, for example.
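For reference, the Wilson score interval can be computed as in the sketch below. It assumes z = 1.96 for a 95% interval; the exact constant used by the linked tool is not stated here, and the function name is only illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return (center, half_width) of the Wilson score interval.

    z = 1.96 is an assumed value for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    # The center is pulled toward 50%, which is why 3/9 is not centered on 33%.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center, half_width


center, half_width = wilson_interval(3, 9)
print(f"{center:.0%} ±{half_width:.0%}")
```

With 3 successes out of 9 this gives a center near 38% and a half-width near 26%, in line with the figures quoted for the small samples in this write-up.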

Head-to-Head Comparison
I pulled out a sample of 25 queries where the current DYM and M2 both had suggestions and those suggestions differ, and the reviewer reviewed them manually. These are not necessarily a representative sample of all queries, or the rest of the queries for either system.

"Clearly better" suggestions are ones where one suggestion is a 4 or 5 and the other is a 1 or 2, or where exactly one of the two suggestions is a 3 (e.g., 4 vs 3, or 3 vs 2). If one system gets a 1 and the other gets a 2, they are both bad, even if one is a little less bad.
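The categorization rule above can be written down as a small function. This is only a sketch of the rule as described: the names are hypothetical, the "both meh" label for 3 vs 3 is an assumption (the text does not name that case), and averaged ratings strictly between 2 and 4 are assumed to count as "meh".

```python
def bucket(rating: float) -> str:
    """Map a 1-5 rating to good (4-5), bad (1-2), or meh (in between)."""
    if rating >= 4:
        return "good"
    if rating <= 2:
        return "bad"
    return "meh"


def head_to_head(prod_rating: float, m2_rating: float) -> str:
    """Categorize one head-to-head pair of ratings."""
    prod_bucket, m2_bucket = bucket(prod_rating), bucket(m2_rating)
    if prod_bucket == m2_bucket:
        # Same bucket: 1 vs 2 is "both bad", 4 vs 5 is "both good".
        return f"both {prod_bucket}"
    # Different buckets: the higher rating is the clearly better suggestion.
    return "prod better" if prod_rating > m2_rating else "m2 better"


print(head_to_head(4, 3), "|", head_to_head(1, 2), "|", head_to_head(1, 5))
```

Because the buckets are ordered, two ratings in different buckets always have a strict winner, so the comparison in the last line of `head_to_head` never ties.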

Queries by Script Type
I also broke down the queries by script type.

The "CJK, Mixed" category catches mixed Chinese, Japanese, and Korean queries.

The "IPA-ish" category catches queries in writing systems that use characters that are also used in the International Phonetic Alphabet (IPA); in this case, many are IPA or IPA-related, with a few leetspeak-style encodings of normal text using IPA symbols.

The "Other/Mixed" category is generally made up of queries in multiple scripts/languages, or those that have emoji or other rarer characters.

The "Punctuation" category has queries that are all punctuation, while "Symbols" has all-number queries, plus queries with punctuation and other non-punctuation symbols.
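A simplified version of this kind of script bucketing can be sketched with Python's `unicodedata` module. This is not the classifier used for the analysis: the rules below are assumptions for illustration (Unicode character-name prefixes decide the script, punctuation and digits are ignored when letters are present, and the "IPA-ish" category, fullwidth Latin, and halfwidth Katakana are not handled).

```python
import unicodedata

CJK_SCRIPTS = {"Hangul", "Hiragana", "Katakana", "Ideographic"}
NAME_PREFIXES = (
    ("HANGUL", "Hangul"),
    ("HIRAGANA", "Hiragana"),
    ("KATAKANA", "Katakana"),
    ("CJK UNIFIED IDEOGRAPH", "Ideographic"),
    ("LATIN", "Latin"),
)


def char_script(ch: str):
    """Return a coarse script label for one character (None for whitespace)."""
    if ch.isspace():
        return None
    category = unicodedata.category(ch)
    if category.startswith("P"):
        return "Punctuation"
    if category.startswith("N") or category in ("Sm", "Sc", "Sk"):
        return "Symbols"
    name = unicodedata.name(ch, "")
    for prefix, script in NAME_PREFIXES:
        if name.startswith(prefix):
            return script
    return "Other"  # emoji, unnamed characters, and everything else


def classify_query(query: str) -> str:
    """Bucket a query by the scripts of its non-whitespace characters."""
    scripts = {s for s in map(char_script, query) if s is not None}
    if not scripts:
        return "Empty"
    if scripts == {"Punctuation"}:
        return "Punctuation"
    if scripts <= {"Punctuation", "Symbols"}:
        return "Symbols"
    letters = scripts - {"Punctuation", "Symbols"}
    if len(letters) == 1:
        (only,) = letters
        return "Other/Mixed" if only == "Other" else only
    if letters <= CJK_SCRIPTS:
        return "CJK, Mixed"
    return "Other/Mixed"
```

Under these assumed rules, a query mixing Chinese characters and Katakana lands in "CJK, Mixed", while one mixing Chinese characters and Latin letters lands in "Other/Mixed".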

Korean
Below are the details for the Korean sample. Thanks to Jerry Kim for the manual review of the various samples!

Korean Stats
There are 312,698 queries in our test corpus. 276,745 (88.502%) of them are identical after (basic) normalization.


 * Current DYM: 34,103 queries (10.906%) got a DYM suggestion from the current production system. 33,598 queries (10.745%) got suggestions from the current DYM system but not from M2.
 * M2 DYM: 7,424 queries (2.374%) got a DYM suggestion from M2. 6,919 queries (2.213%) got suggestions from M2 but not from the current DYM system.
 * Both: 505 queries (0.161%) got suggestions from both systems. 9 of the suggestions (0.003%) were the same for both the current DYM system and M2. There weren't any additional suggestions that differed only by case.
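The breakdown in these lists (here and in the Japanese and Chinese sections) amounts to a few set operations. The sketch below assumes, hypothetically, that each system's output is available as a mapping from query to a single suggestion; the real corpus counts every query instance, including repeats, so this only illustrates the logic.

```python
def overlap_stats(total_queries: int, prod: dict, m2: dict) -> dict:
    """Count queries with suggestions from prod only, M2 only, or both."""
    both = prod.keys() & m2.keys()
    same = {q for q in both if prod[q] == m2[q]}
    case_only = {
        q for q in both - same if prod[q].lower() == m2[q].lower()
    }
    counts = {
        "prod": len(prod),
        "prod_only": len(prod.keys() - m2.keys()),
        "m2": len(m2),
        "m2_only": len(m2.keys() - prod.keys()),
        "both": len(both),
        "same": len(same),
        "differ_only_by_case": len(case_only),
    }
    # Pair each count with its percentage of all queries in the corpus.
    return {k: (v, 100 * v / total_queries) for k, v in counts.items()}
```

By construction `prod_only + both == prod` and `m2_only + both == m2`, which is the same relationship the reported counts satisfy (for Korean, 33,598 + 505 = 34,103 and 6,919 + 505 = 7,424).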

Multiple Suggestions in Korean
In our corpus, the current prod DYM system gave different suggestions for the same query for 334 distinct queries, which is not very many. Most had two different suggestions, but 11 queries each got 3 different suggestions, and 4 queries each got 4 different suggestions. About half of the queries with multiple suggestions are in the Latin script.

Korean Head-to-Head Comparison
The head-to-head categorizations are as follows:
 * 1: M2 DYM gave a clearly better suggestion
 * 10: the current DYM gave a clearly better suggestion
 * 13: both suggestions were bad
 * 1: both suggestions were good

Overall, the distribution of M2 head-to-head suggestions was 2 good (one 5, one 4), 23 bad (one 2, 22 1s). The distribution of current prod head-to-head suggestions was 11 good (four 5s, seven 4s), 14 bad (four 2s, 10 1s).

While the sample size is not good enough to make fine-grained estimates (the 95% confidence interval for 40% (10/25) is ~23-59%, for 4% (1/25) it's ~0.7-20%), it's clear that M2 is worse than the current DYM when they both make suggestions.

Production DYM for Korean
The production DYM suggestions are not entirely Korean. In my sample of 25, 9 were in the Latin alphabet (most were obviously English or names).
 * 8/16 Korean suggestions were good, and 8/16 were bad. (95% confidence interval: 50% ±22%)
 * 3/9 Latin suggestions were good (38% ±26%), 5/9 were bad (54% ±27%)
 * 11/25 of all suggestions were good (45% ±18%), 13/25 were bad (52% ±18%)

Glent Method 2 for Korean
In my sample of 25, the Method 2 suggestions were all in Korean.
 * 6/25 suggestions were good (27% ±16%), 17/25 were bad (66% ±17%).

Production DYM vs Glent Method 2 for Korean
Reiterating the above in graphical form:



While there is significant overlap in most of the error bars, M2 is clearly giving more bad suggestions than good suggestions, and the data suggests that production DYM is better than M2 on suggestions in Korean (when the average is not brought down by the similarly poor suggestions in Latin script).

Korean Queries by Script Type
About 73½% of queries are in the Hangul script, which is lower than I expected! About 21% of queries were in the Latin script, which is higher than I expected!

The Other/Mixed category is predominantly mixed Korean and Latin script, with some other weird stuff thrown in there.

Japanese
Below are the details for the Japanese sample. Thanks to Lisa Hiraide for the manual review of the various samples!

Japanese Stats
There are 292,351 queries in our test corpus. 249,197 (85.239%) of them are identical after (basic) normalization.


 * Current DYM: 24,198 queries (8.277%) got a DYM suggestion from the current production system. 24,170 queries (8.267%) got suggestions from the current DYM system but not from M2.
 * M2 DYM: 2,122 queries (0.726%) got a DYM suggestion from M2. 2,094 queries (0.716%) got suggestions from M2 but not from the current DYM system.
 * Both: 28 queries (0.010%) got suggestions from both systems. 1 of the suggestions (0.000%) was the same for both the current DYM system and M2. There weren't any additional suggestions that differed only by case.

Multiple Suggestions in Japanese
In our corpus, the current prod DYM system gave different suggestions for the same query for 111 distinct queries, which is not very many. Most had two different suggestions, but 1 query got 3 different suggestions. About three quarters of the queries with multiple suggestions are in the Latin script.

Japanese Head-to-Head Comparison
There were only 28 head-to-head queries, and with duplicates there were only 14 distinct queries, so our sample for Japanese is even smaller than expected—but it is all the data there was in almost 300K queries. Also, thanks to an editing error on my part, only 13 of the 14 made it to the reviewer.

The head-to-head categorizations are as follows:
 * 3: M2 DYM gave a clearly better suggestion
 * 3: the current DYM gave a clearly better suggestion
 * 7: both suggestions were good

Overall, the distribution of M2 head-to-head suggestions was 10 good (three 5s, seven 4s), 3 bad (three 2s). The distribution of current prod head-to-head suggestions was 9 good (one 5, eight 4s), 3 meh, 2 bad (two 2s).

While the sample size is not good enough to make fine-grained estimates, M2 and the current DYM seem to be roughly the same when they both make suggestions.

Production DYM for Japanese
The production DYM suggestions are not entirely Japanese. In my sample of 25, 13 were in the Latin alphabet (most were obviously English or names).
 * 5/12 Japanese suggestions were good, and 5/12 were bad. (95% confidence interval: 44% ±24%)
 * 3/13 Latin suggestions were good (29% ±21%), 8/13 were bad (59% ±23%)
 * 8/25 of all suggestions were good (34% ±17%), 13/25 were bad (52% ±18%)

Glent Method 2 for Japanese
In my sample of 25, the Method 2 suggestions were all in Japanese.
 * 7/25 suggestions were good (31% ±17%), 17/25 were bad (66% ±17%).

Production DYM vs Glent Method 2 for Japanese
Reiterating the above in graphical form:



While there is significant overlap in most of the error bars, M2 is giving more bad suggestions than good suggestions (with just the narrowest gap between the 95% confidence intervals), and the data suggests that production DYM is better than M2 on suggestions in Japanese (when the average is not brought down by the similarly poor suggestions in Latin script).

Japanese Queries by Script Type
Up to (see below) about 86% of queries are plausibly "Japanese". About 13% of queries were in the Latin script, which is higher than I expected—though I should have learned to expect such things by now.

"Japanese" includes Katakana (~17%), Hiragana (~6½%), Ideographic (~29%), most of "CJK, Mixed" (~20%), and most of "Other/Mixed" (~14%).
 * The Ideographic category could include queries in Chinese, but the characters are frequently used in Japanese, too.
 * "CJK, Mixed" could include Hangul characters, but in this case it generally does not.
 * Much of the "Other/Mixed" category is Katakana, Hiragana, and Ideographic/Chinese characters mixed with Latin script.

Chinese
February 2021—Below are the details for the Chinese sample. Thanks to David Chan for the manual review of the various samples!

Chinese Stats
There are 743,092 queries in our test corpus. 620,660 (83.524%) of them are identical after (basic) normalization.


 * Current DYM: 51,674 queries (6.954%) got a DYM suggestion from the current production system. 51,472 queries (6.927%) got suggestions from the current DYM system but not from M2.
 * M2 DYM: 21,028 queries (2.830%) got a DYM suggestion from M2. 20,826 queries (2.803%) got suggestions from M2 but not from the current DYM system.
 * Both: 202 queries (0.027%) got suggestions from both systems. None of the suggestions were the same for both the current DYM system and M2. There weren't any additional suggestions that differed only by case.

Multiple Suggestions in Chinese
In our corpus, the current prod DYM system gave different suggestions for the same query for 876 distinct queries, which is not very many. Most had two different suggestions, but 68 queries each got 3 different suggestions, 11 queries each got 4 different suggestions, and one query each got 6 or 7 suggestions. (The query with the most suggestions was the correctly spelled milli vanilli, which got suggestions like miller vanini, miller vanilla, miller valli, miles vanilla, miami vanilla, miami vanillae, and mille vanilla.) The majority of queries with multiple suggestions are in the Latin script; only 51 queries (5.822%) had a CJK character in either the query or one of the suggestions.

Chinese Head-to-Head Comparison
All of the queries where both methods provided suggestions had CJK characters in them.

The head-to-head categorizations are as follows:
 * 13: M2 DYM gave a clearly better suggestion
 * 12: both suggestions were bad

Overall, the distribution of M2 head-to-head suggestions was 12 good (all 5s), 12 bad (two 2s, eight 1s). The distribution of current prod head-to-head suggestions was 0 good, 23 bad (18 2s, five 1s).

While the sample size is not good enough to make fine-grained estimates (the 95% confidence interval for 48% (12/25) is ~30-67%, for 0% (0/25) it's ~0-13%, for 92% (23/25) it's ~75-98%), it's clear that M2 is much better than the current DYM when they both make suggestions.

Production DYM for Chinese
The production DYM suggestions are generally not Chinese. In my sample of 26 (one query had two suggestions), all but 1 were in the Latin alphabet (most were obviously English or names).
 * 1/1 Chinese suggestions were bad (95% confidence interval: 60% ±40%), 0/1 were good (40% ±40%)
 * 4/25 Latin suggestions were good (21% ±14%), 19/25 were bad (73% ±16%)
 * 4/26 of all suggestions were good (20% ±14%), 20/26 were bad (73% ±16%)

Glent Method 2 for Chinese
In my sample of 25, the Method 2 suggestions were all in Chinese (two were mixed Chinese/Latin script).
 * 14/25 suggestions were good (55% ±18%), 9/25 were bad (38% ±18%).
 * The majority of the good suggestions (9 of the 14, or 9/25 overall) and all of the ones with a score of 5 were annotated as "traditional to simplified conversion". Given ideal traditional to simplified conversion, these wouldn't actually change the search results (since traditional characters are converted to simplified for search and indexing).
 * Counting the traditional to simplified conversion as neither good nor bad, only 5/25 (24% ±15%) are still good.

Production DYM vs Glent Method 2 for Chinese
Reiterating the above in graphical form:

Note that the production DYM results on Chinese queries have ridiculous error bars because they are both based on exactly one query. The other production DYM categories show that more suggestions are bad than good. The M2 suggestions are not definitively more good than bad, though the trend is obviously in that direction, but they are clearly better than the production DYM suggestions.

Chinese Queries by Script Type
About 65% of queries are CJK Ideographs, which is lower than I expected! About 25% of queries were in the Latin script, which is higher than I expected—and even slightly more extreme than the previous two test sets.

The Other/Mixed category is predominantly mixed Chinese and Latin script, with some other weird stuff thrown in there.

Summary & Recommendations
Summary

Korean, Japanese, and Chinese follow a vaguely similar pattern:

(Updated to include Chinese in Feb 2021)
 * ~65–80% of queries are in the expected writing system(s).
 * ~10–25% of queries are in Latin (and the rest are a mixed bag).
 * ~7–12% of queries get suggestions from the current production DYM; they are generally mediocre (~⅕–½ are rated as good).
 * ~⅓ to all of the suggestions made are for Latin queries, and they are generally poor (between none and ⅓ are rated as good).
 * Suggestions in the expected writing system(s) are generally mediocre (up to ½ are rated as good).

For Korean and Japanese, M2 suggestions are generally poor (~30% are rated as good); Chinese suggestions do better (~55% are good, and they are generally better than the production DYM suggestions, although if we ignore the traditional-to-simplified conversions, the share rated as good falls back to the 25–30% range).
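 * M2 has a small overall impact (it makes suggestions for ~¾–3% of queries), but that is a non-trivial increase in coverage relative to the current production DYM (~8¾%–40% more queries getting a suggestion).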
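The impact and coverage ranges can be reproduced from the per-language counts reported in the Stats sections. The sketch below assumes that "impact" means M2-only queries as a share of all queries and "increase in coverage" means M2-only queries relative to queries that already get a production suggestion; that reading matches the quoted ranges but is an interpretation.

```python
# Counts copied from the Korean, Japanese, and Chinese Stats sections.
COUNTS = {
    "Korean": {"total": 312_698, "prod": 34_103, "m2_only": 6_919},
    "Japanese": {"total": 292_351, "prod": 24_198, "m2_only": 2_094},
    "Chinese": {"total": 743_092, "prod": 51_674, "m2_only": 20_826},
}


def coverage_gain(counts: dict):
    """Return (absolute, relative) gain from M2-only suggestions, in percent."""
    absolute = 100 * counts["m2_only"] / counts["total"]
    relative = 100 * counts["m2_only"] / counts["prod"]
    return absolute, relative


for language, counts in COUNTS.items():
    absolute, relative = coverage_gain(counts)
    print(f"{language}: +{absolute:.2f}% of queries, +{relative:.1f}% coverage")
```

Japanese sits at the low end of both ranges and Chinese at the high end, with Korean in between at roughly a one-fifth increase in coverage.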

Recommendations

I don't have any great insight into the causes of the poor quality of M2 suggestions, and the reviewers who helped me didn't offer any additional feedback about patterns of errors—which is totally fine. Perhaps the assumption that similar-looking characters are a major source of errors is not correct.

The results aren't great, but the new M2 suggestions are largely orthogonal to the existing prod/phrase suggester suggestions, and of roughly similar quality. We should run an A/B test and then decide whether the additional effort to implement M2 is worth whatever increase in clickthrough we see.