User:TJones (WMF)/Notes/Analysis of DYM Method 1

October 2019 — See TJones_(WMF)/Notes for other projects. See also T232760. For help with technical jargon, check out the Search Glossary.

Background
Spelling corrections and other "Did you mean…?" (DYM) suggestions on Wikipedia and its sister projects are often not great. So, we've undertaken a project to improve the DYM suggestions on Wikipedia.

The goal is to implement multiple methods for making suggestions that address the issues of different languages and projects, including difficulties introduced by writing systems and the amount of data available. (See T212884.) "Method 1" (M1; T212889) mines search logs for common queries and creates an efficient method for choosing candidate suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. It is only applicable to languages with relatively small writing systems (alphabets, abjads, syllabaries, etc.; in particular, not CJK).

The main point of Method 1 compared to Method 0 is that it should have higher recall, because it is not limited to self-corrections made by a user, but rather can branch out to similar queries made by any user. Because there is no direct correlation between the query and the correction, precision is likely to be lower than for Method 0.

The goal of this analysis is to get a sense of how broadly Method 1 applies to queries on English Wikipedia and a baseline of the current DYM process to compare it to. I also want to get a rough idea of the quality of the suggestions, and some information on the queries themselves. Finally, I'll have some suggestions for improving M1 processing.

Data
As with M0, David ran the M1 training and evaluation on query data for me, giving me about 225K queries for evaluation. He pulled the original query and the current DYM suggestion, if any, made at the time, and then computed the M1 DYM suggestion, if any.

Note #0: The data we are using is not a perfect reflection of the user experience. There are some "last mile" tricks used to improve suggestions. For example, if a query is a title match for an article, we don't make a suggestion. These deviations are minor, and shouldn't affect the overall impression we get from our corpus.

Note #1: M1 normalizes queries before making suggestions. Simple normalizations include lowercasing and using standard whitespace, so that "same thing", "Same Thing", "SaMe   ThiNG   ", and "      sAmE      tHiNg     " are all treated as the same thing. [badum tiss]
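As a rough illustration of the two simple normalizations named in Note #1, here is a minimal sketch. The function name normalize_query is hypothetical, and the real M1 normalization may do more than this.

```python
import re


def normalize_query(query: str) -> str:
    """Collapse whitespace runs to one space, trim the ends, and lowercase.

    This only covers the two "simple" normalizations mentioned in the note;
    anything else M1 does is not modelled here.
    """
    return re.sub(r"\s+", " ", query).strip().lower()


variants = [
    "same thing",
    "Same Thing",
    "SaMe   ThiNG   ",
    "      sAmE      tHiNg     ",
]
normalized = {normalize_query(v) for v in variants}
print(normalized)  # all four variants collapse to a single normalized form
```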

Stats
There are 225,436 queries in our test corpus. 112,422 (49.869%) of them are unchanged by normalization.


 * Current DYM: 93,727 (41.576%) got a DYM suggestion from the current production system. 80,478 queries (35.699%) got suggestions from the current DYM system but not from M1.
 * M1 DYM: 27,395 queries (12.152%) got a DYM suggestion from M1. 14,146 queries (6.275%) got suggestions from M1 but not from the current DYM system.
 * Both: 13,249 queries (5.877%) got suggestions from both systems. 1,677 of the suggestions (0.744%) were the same for both the current DYM system and M1. There weren't any additional suggestions that differed only by case.
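The counts in the list above can be cross-checked with a few lines of arithmetic. This is only a sanity check: the variable names are mine, and the "neither" figure is derived from the numbers above rather than reported in the data.

```python
TOTAL = 225_436        # queries in the test corpus
PROD_TOTAL = 93_727    # queries with a current production DYM suggestion
M1_TOTAL = 27_395      # queries with an M1 suggestion
BOTH = 13_249          # queries with a suggestion from both systems

prod_only = PROD_TOTAL - BOTH            # production suggestion, no M1 suggestion
m1_only = M1_TOTAL - BOTH                # M1 suggestion, no production suggestion
neither = TOTAL - (prod_only + m1_only + BOTH)  # derived, not a reported figure


def pct(count: int) -> float:
    """Share of the whole corpus, as a percentage rounded to three places."""
    return round(100 * count / TOTAL, 3)


for label, count in [("prod only", prod_only), ("M1 only", m1_only),
                     ("both", BOTH), ("neither", neither)]:
    print(f"{label:>9}: {count:>7,} ({pct(count)}%)")
```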

Comparison to M0: The method-independent stats (percentage of queries unchanged by normalization and percentage of production suggestions) are roughly the same, which is what we'd expect. M1 makes many more suggestions than M0 (12% vs 1.5%) and has more overlap with production suggestions, though a slightly lower rate of identical suggestions (0.7% vs 0.9%).

Multiple Suggestions
The current DYM system only makes one suggestion at a time, and we're only looking at one suggestion from M1 for any given query. (Longer term we may consider multiple suggestions for a given query. It is difficult, but possible, to get Google to generate multiple suggestions—I usually do it by putting together two ambiguously misspelled words that don't have anything to do with each other. For example, glake bruck—at the time of this writing—gets four suggestions from Google: glade brook, glaze brook, glass brick, and glow brick.)

However, because the current DYM suggestions come from different shards of the index—or from the same shard accessed weeks apart—the underlying word statistics can differ, and the exact DYM suggestion shown to users can vary.

In our corpus, the current production DYM system gave more than one suggestion for the same query in only 77 cases, which is very, very few. Most of those queries had two different suggestions, but two queries each got 3 different suggestions!

Comparison to M0: Since this is about the production DYM system, and thus independent of M0 and M1, it is no surprise that the numbers are similar to the M0 corpus, though the rates are a bit smaller because the M1 corpus is larger.

Evaluations
After completing the head-to-head comparison (see below), I decided to dig more deeply into the other categories that I ignored for M0, namely, suggestions where production DYM and M1 agree, and samples of production-DYM-only and M1-only suggestions.

I tried to rate suggestions as a user, not as an NLP-aware search developer. So for many of the suggestions, I can see where they came from and why they are reasonable based on the algorithms behind them, but they would still be unhelpful to a user. Of course, on-wiki search is also limited by the content available on-wiki. A common source of poor results I see is a minor celebrity who has a website and fan sites (so Google knows who they are) but no Wikipedia page, so we don't.

In my ratings, "good" or "reasonable" suggestions are ones that seem likely to help the person find what they are looking for better than what they originally typed in.

The "meh" suggestions don't look particularly bad, but they aren't particularly helpful. For example, 2019 nba drafts gets the suggestion 2019 nba draft, which, because of stemming, gives the same results (though possibly ranked differently). There is currently no Wikipedia article for author alexander s presley, so the recommendation of alexander s prestley isn't terribly useful, but it's not unreasonable.

The "poor" or "not very good" suggestions are not going to help the user find what they are looking for. I also took note of suggestions that seem to be particularly bad, in that they are likely to seem to the user to be unmotivated and unrelated to the original query.

Head-to-Head Comparison
I pulled out a sample of 100 queries where the current DYM and M1 both had suggestions and those suggestions differ, and I reviewed them manually.

The final categorization counts are as follows:


 * 7: M1 DYM gave a better suggestion.
   * For 2 of them, M1 had extra punctuation in the suggestion. In one case (martin luther king*) the asterisk is actually a search operator, and gives the query more results since it matches any word starting with king, including Kingsville, Kingoué, and Kingiseppsky.
 * 40: The current DYM gave a better suggestion.
   * For 22 of these, the M1 suggestion is also noticeably worse. 8 of them start with a dash (-ous) and 14 of them were really bad (e.g., jebait getting a suggestion of web it or korea population getting a suggestion of korea popularation).
 * 26: Both suggestions were reasonable.
   * For 6 of these, the M1 suggestion is a plural of the prod DYM suggestion (e.g., axoloti gets a prod suggestion of axolotl while M1 suggests axolotls; this isn't really wrong, but it strikes me as weird).
   * For 7 of these, the M1 suggestion has additional punctuation or other symbols in the suggestion, such as interstellar+; jason_schwartzman; judgment, (with the comma as part of the suggestion); milton+friedman; phosphorus\; ricochet. (with the period).
 * 23: Neither suggestion was very good.
   * For 15 of these, the M1 suggestions were noticeably worse. 2 of them started with a dash (-enza) and 5 of them were really bad (e.g., abbys getting a suggestion of a b s or bone screw getting a suggestion of long screw).
 * 4: User intent was too unclear.

While the sample size is not good enough to make fine-grained estimates (the 95% confidence interval for 40% (40/100) is ~31-50%, for 7% (7/100) it's ~3-14%), it's clear that M1 is worse than the current DYM when they both make suggestions.
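The intervals quoted above are consistent with a 95% Wilson score interval. The method actually used is not stated, so treating it as Wilson is an assumption; the sketch below only shows one standard calculation that reproduces the rounded figures.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


for successes in (40, 7):
    low, high = wilson_interval(successes, 100)
    print(f"{successes}/100: {100 * low:.1f}% to {100 * high:.1f}%")
```

With a sample of 100, the interval for the 40% figure is roughly twenty points wide, which is why only the coarse conclusion (M1 is worse head-to-head) is drawn from it.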

Production DYM
Because M1's performance fell well short of our hopes and expectations, I reviewed a sample of 100 M1 DYM suggestions to gain additional insight into the kinds of problems M1 is having and to check whether the suggestions where M1 and prod both have suggestions are somehow different from M1 suggestions in general (which would be contrary to our expectations and understanding of M1 internals). I also reviewed a sample of 100 production DYM suggestions for comparison.

The final categorization counts for the Production DYM sample are as follows:


 * 30: Suggestions were reasonable.
 * 22: Suggestions were not helpful or clearly not right, but not actively bad. ("meh")
 * 45: Suggestions were poor.
 * 3: User intent was too unclear.

Glent Method 1
The final categorization counts for the Method 1 sample are as follows:


 * 17: Suggestions were reasonable.
   * 3 of these had unnecessary punctuation in them, including + instead of a space between words, or an extra space at the beginning of the suggested query. These give the same results as the more typically formatted queries, but they seem odd as suggestions.
   * 3 of these had unnecessary inflections (extra -s or -ed), which can affect ranking. A misspelling of Atlanta was corrected to atlantas, which does find Atlanta, but it also ranks Atlantis highly because Atlantas is a redirect to Atlantis.
 * 11: Suggestions were not helpful or clearly not right, but not actively bad. ("meh")
 * 72: Suggestions were poor.
   * 2 of these generated negations (-x or !x).
   * 16 of these stood out as changing one or more words to much more common but unrelated words in order to get more results.

Method 1 and Production DYM Agree
I had expected that when the current production DYM and M1 agree, the results would be better than either alone, and that seems to be what happened when I pulled 100 examples where that was the case.


 * 53: Suggestions were reasonable.
 * 24: Suggestions were not helpful or clearly not right, but not actively bad. ("meh")
 * 23: Suggestions were poor.

Method 1 Patterns and Anti-Patterns
Anti-Patterns

After looking at the examples I see a number of anti-patterns in Method 1 that lead to the poor suggestions. The biggest problem seems to be that it is optimizing heavily for the largest number of search results regardless of the sorts of edits it needs to make to get there.

The most obvious examples of this are negated queries. One example is fogus, which is a surname, but also a reasonable misspelling of focus, which is what the production DYM suggested. However, you can get to -ous in two edits (the M1 limit) by deleting the g and changing f to a dash. -ous gets 5.9 million results, because it returns every article that doesn't have a form of the word ous in it, which is most of them.
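One way to keep negated suggestions like -ous out of the candidate pool is to drop any logged query that contains a search operator. This is a minimal sketch, assuming whitespace tokenization and treating only token-initial - or ! and token-final * as operators; has_search_operator is a hypothetical helper, not part of M1.

```python
NEGATION_PREFIXES = ("-", "!")  # token-initial negation operators
WILDCARD_SUFFIX = "*"           # token-final prefix-match operator


def has_search_operator(query: str) -> bool:
    """Return True if any whitespace-separated token looks like an operator use."""
    for token in query.split():
        if len(token) < 2:
            continue  # a lone "-", "!" or "*" has nothing to operate on
        if token.startswith(NEGATION_PREFIXES) or token.endswith(WILDCARD_SUFFIX):
            return True
    return False


candidates = ["-ous", "focus", "martin luther king*", "red-herring", "!x"]
kept = [q for q in candidates if not has_search_operator(q)]
print(kept)  # only the operator-free queries remain
```

A hyphen inside a token, as in red-herring, is left alone, since only a token-initial dash acts as negation.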

Another anti-pattern I see is converting one word in a query into a stop word or other very common word. The edit distance limit is 2, but it applies at the string level, not the token level. Thus cf gene gets the suggestion a gene which effectively just removes cf from the query. Similarly, hot instagram gets the suggestion her instagram which is much less limiting.
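To make the string-level versus token-level distinction concrete, here is a minimal sketch using plain Levenshtein distance. The per-token limits are an assumption borrowed from the thresholds of Elasticsearch's AUTO fuzziness setting (0 edits for 1-2 characters, 1 for 3-5, 2 for longer tokens), not something M1 does today, and the helper names are hypothetical. Suggestions that change the number of tokens are simply rejected here; a real implementation would need token alignment.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def within_string_limit(query: str, suggestion: str, limit: int = 2) -> bool:
    """String-level check: the whole query may change by up to `limit` edits."""
    return levenshtein(query, suggestion) <= limit


def token_limit(token: str) -> int:
    """Assumed length-based limit: short tokens get fewer allowed edits."""
    if len(token) <= 2:
        return 0
    return 1 if len(token) <= 5 else 2


def within_token_limits(query: str, suggestion: str) -> bool:
    """Token-level check: each token is compared with its counterpart."""
    q_tokens, s_tokens = query.split(), suggestion.split()
    if len(q_tokens) != len(s_tokens):
        return False
    return all(levenshtein(q, s) <= token_limit(q)
               for q, s in zip(q_tokens, s_tokens))


for query, suggestion in [("cf gene", "a gene"),
                          ("hot instagram", "her instagram"),
                          ("appolonius", "apollonius")]:
    print(query, "->", suggestion,
          "| string-level ok:", within_string_limit(query, suggestion),
          "| token-level ok:", within_token_limits(query, suggestion))
```

Both problem suggestions pass the string-level limit of 2 but fail the token-level check, because all of the edits land on one short token.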

I particularly see the wisdom now (despite its limitations) of not allowing the first letter of a token to change. In addition to "cf" to "a" as above, I saw "pad" to "and", and others. A somewhat humorous example is cia assassinations, which gets mi6 assasssinations as the suggestion.

Some of the edits inadvertently take advantage of stemming. red-herring gets the suggestion red hering because hering gets stemmed to here which of course gets many more results than herring. Similarly, the misspelling greek goddeses gets greek godness as a recommendation because godness is stemmed to god which is much more common than goddess on English Wikipedia.

Another small pattern that I observed is suggesting duplicate query terms. battle rattle gets the suggestion rattle rattle and youtyoup gets the suggestion you you—these duplicated query terms are essentially the same as single word queries, and so they get a lot of results.
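Deduplicating terms makes it obvious that these suggestions are really single-word queries. A minimal sketch, assuming whitespace tokenization on already-normalized text; dedupe_terms is a hypothetical helper.

```python
def dedupe_terms(query: str) -> str:
    """Drop repeated tokens, keeping the first occurrence and original order."""
    # dict.fromkeys preserves insertion order, so it works as an ordered set.
    return " ".join(dict.fromkeys(query.split()))


for suggestion in ["rattle rattle", "you you", "battle rattle"]:
    print(repr(suggestion), "->", repr(dedupe_terms(suggestion)))
```

Applied before candidates are compared, this would expose rattle rattle as the one-word query rattle, which drops a whole term from battle rattle.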

Changing letters to spaces is another way to carve off a more common sub-word and maybe convert part of the word to a stop word or other much more common word. The most egregious example is abbys which gets the suggestion a b s (which returns the same number of results as just b s). Another example, redmax gets the m changed to a space and the x changed to an n giving red an which gets almost all the same results as just red by itself.

Patterns

People seem to have a lot of trouble with doubled letters (appolonius vs apollonius or myrrdin vs myrddin) so decreasing the cost of those edits would make such suggestions look better.

People also seem to swap letters frequently. I see two classes of such swaps:
 * Typos, where two random adjacent letters are swapped, often resulting in something that doesn't look like a word (e.g., queit vs quiet or thrid vs third).
 * Mis-remembered pronunciations, where two nearby vowels or consonants are swapped (e.g., dinasours vs dinosaurs or levasimole vs levamisole).

Again, decreasing the cost of those edits would make such suggestions look better.
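As a sketch of what cheaper doubled-letter and adjacent-swap edits could look like, here is a small weighted variant of the optimal string alignment distance. The costs of 0.5 are arbitrary placeholders, the function name is hypothetical, and swaps of letters that are not directly adjacent (as in dinasours vs dinosaurs) are not covered.

```python
def weighted_distance(a: str, b: str,
                      swap_cost: float = 0.5,
                      double_cost: float = 0.5) -> float:
    """Edit distance with discounts for adjacent swaps and doubled letters."""

    def del_cost(i: int) -> float:
        # Deleting a[i] is cheap when it repeats the previous letter of a.
        return double_cost if i > 0 and a[i] == a[i - 1] else 1.0

    def ins_cost(j: int) -> float:
        # Inserting b[j] is cheap when it repeats the previous letter of b.
        return double_cost if j > 0 and b[j] == b[j - 1] else 1.0

    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + del_cost(i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + ins_cost(j - 1)

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best = min(
                d[i - 1][j] + del_cost(i - 1),
                d[i][j - 1] + ins_cost(j - 1),
                d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else 1.0),
            )
            # Adjacent transposition: "ei" <-> "ie"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + swap_cost)
            d[i][j] = best
    return d[len(a)][len(b)]


for typo, target in [("queit", "quiet"), ("thrid", "third"),
                     ("appolonius", "apollonius"), ("hot", "her")]:
    print(typo, "->", target, weighted_distance(typo, target))
```

Under these placeholder costs the swap and doubled-letter typos come out cheaper than an unrelated rewrite such as hot to her, which is the ordering the patterns above argue for.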

Queries by Script Type
I also broke down the queries by script type. This list isn't exhaustive; it only includes the easily categorizable ones, though I worked a little harder on it than I did for M0.

Almost 97% of queries are in the Latin script, so obviously that's a reasonable place to put our focus, though it is surprising that almost 1% of queries are in Arabic script and a bit more than 0.5% are Cyrillic. (These numbers are unchanged from the M0 data!!)

The "CJK, Mixed" category catches mixed Chinese, Japanese, and Korean queries. The sample I looked at was generally Japanese, either mixed Hiragana and Katakana, or Japanese with Chinese characters.

The "IPA-ish" category attracts writing systems that use characters that are also used in the International Phonetic Alphabet (IPA); in this case, most are IPA or IPA-related, with one ASCII art and one leetspeak-style encoding of normal text using IPA symbols.

The "Other/Mixed" category is generally made up of queries in multiple scripts/languages, or those that have emoji or other rarer characters.

The "Punctuation" category has queries that are all punctuation, while "Symbols" has all-number queries, plus queries with punctuation and other non-punctuation symbols.

Invisibles
Looking at language analyzers, I often run into invisible characters that block proper searching. The most common are zero-width non-joiners, zero-width joiners, non-breaking spaces, soft hyphens, and bi-directional marks.

These marks all occur in our query sample, with Arabic, Latin, Myanmar, Sinhala, and Telugu scripts. It makes sense to me to strip these characters out (or, for non-breaking spaces, replace them with regular spaces) during the M1 normalization process.
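A minimal sketch of that clean-up step follows. The code points are my reading of the character names listed above (U+200C, U+200D, U+00A0, U+00AD, and the left-to-right and right-to-left marks U+200E and U+200F); other bi-directional controls are not covered, and strip_invisibles is a hypothetical helper.

```python
# Map each invisible character to None (delete) or to a replacement string.
INVISIBLE_TABLE = str.maketrans({
    "\u200c": None,  # zero-width non-joiner
    "\u200d": None,  # zero-width joiner
    "\u00ad": None,  # soft hyphen
    "\u200e": None,  # left-to-right mark
    "\u200f": None,  # right-to-left mark
    "\u00a0": " ",   # non-breaking space becomes a regular space
})


def strip_invisibles(query: str) -> str:
    """Remove common invisible characters; turn non-breaking spaces into spaces."""
    return query.translate(INVISIBLE_TABLE)


print(repr(strip_invisibles("soft\u00adhyphen")))
print(repr(strip_invisibles("non\u00a0breaking")))
```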

Summary & Recommendations
Without the continuity of context provided by single-session edits (as in M0), with edit distance limits imposed per-string rather than per-token, and with too much weight apparently put on result counts, M1 (in English) ends up optimizing for edits that give unrelated but shorter and more common words, often (and more often than current production DYM) making poor suggestions.

I don't think Method 1 should go to A/B test until we fix some of these problems.

Recommendations

Items in bold seem the most pressing to me.


 * Improve filtering during training:
   * Filter queries with search operators in them (token-initial dash (-) and token-final asterisk (*) in particular).
   * Filter more aggressively on queries that are all numbers, spaces, and punctuation.


 * Improve normalization during training:
   * Normalize or filter queries with punctuation in them.
     * Doing this across languages could be hard because the analysis chains may do different things in different languages.
   * Deduplicate search terms from queries.
   * Consider filtering stop words from strings as part of normalization.
   * Remove the most common invisible characters (except for non-breaking spaces, which should just be regular spaces).


 * Improve edit distance calculations:
   * Implement per-token (rather than per-string) edit distance limits.
   * Consider/develop/research a more complex edit distance metric that averages edit distances per token changed or something similar. (The intuition here is that if you have, say, two five-letter tokens, then a one-letter change in both tokens is probably "better" than a two-letter change in one token.)
   * Increase the cost for changing a letter to a space or vice versa when computing edit distance.
     * Inserting a space is less often a problem, though depending on the costs the edit distance computation could "cheat" by adding a space and then deleting a letter.
   * Increase the cost for changing or deleting the first letter of a token when computing edit distance. (Have to think about how to implement this depending on the script.)
   * Decrease the cost for changing a doubled letter to single or vice versa.
   * Decrease the cost for swapping two adjacent letters.
   * Decrease the cost for swapping two consecutive vowels or consonants (possibly too complex); or decrease the cost for swapping two not quite adjacent letters (i.e., with one letter between them).
   * More generally, it might make sense to have the ability to specify non-standard edit distance weights. Some things may be nearly universal (e.g., everyone uses vowels as vowels in the Latin alphabet, so swapping vowels could be lower cost than swapping a vowel and a consonant). These would have to be script-dependent, and could possibly be language-dependent, but they would give a more meaningful edit distance metric, as would the other, probably easier changes above.


 * Improve suggestion ranking calculations:
   * Change the weighting of suggestions based on result count to not overpower other factors.
     * Obvious options include some sort of saturating function, or a very squashy function, like.
     * I realize that I've assumed that edit distance plays a role in the weighting of suggestions, but I'm not sure that's the case. If not, it probably should be, rather than letting result count reign supreme.
   * This is pretty speculative, but some of the worst suggestions that rely on weird stemming effects could be eliminated by generating counts on token-quoted versions of the tokens. So, greek godness gets 25K results (because godness is stemmed to god), but "greek" "godness" gets only 1. red hering gets 74K (hering stems to here), but "red" "hering" only gets 216. It would definitely take more research to determine whether this can be done without causing more harm than good, and it could complicate issues of counting results in general, but maybe some average of the plain query and the token-quoted version would work.
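To illustrate how a squashing function tames result counts, here is a minimal sketch using two stand-ins: a logarithm and a hyperbolic x/(x+k) curve. Both functions, the half_point constant, and the rounded counts (taken from the examples above) are assumptions for illustration, not the weighting M1 uses or a recommendation of a specific formula.

```python
import math


def log_score(result_count: int) -> float:
    """Squashy option: grows with the order of magnitude of the count."""
    return math.log10(1 + result_count)


def saturating_score(result_count: int, half_point: float = 1000.0) -> float:
    """Saturating option: approaches 1.0; equals 0.5 at `half_point` results."""
    return result_count / (result_count + half_point)


# Approximate counts quoted in this write-up.
examples = {
    "-ous": 5_900_000,
    "red hering": 74_000,
    "greek godness": 25_000,
    '"red" "hering"': 216,
    '"greek" "godness"': 1,
}
for query, count in examples.items():
    print(f"{query:>18}: raw={count:>9,} "
          f"log={log_score(count):.2f} sat={saturating_score(count):.3f}")
```

On the raw counts, -ous outweighs "red" "hering" by a factor of more than 27,000; on the log scale the gap is less than a factor of 3, which leaves room for edit distance or query frequency to matter.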