User:TJones (WMF)/Notes/Glent Update Notes

There are a number of smaller topics related to the ongoing Glent updates (currently mostly driven by Method 1 improvements). Rather than create a bajillion separate pages or forget some details, I'm lumping them all together here. (I'll start with current work and document future work, and backfill older stuff when I have time.)

M1 Scoring Improvements ("frhedScore")
(May 2020—T238151)

After looking at M1 examples in English, French, and Korean, I've come up with a better scoring function for choosing the suggestion:

$$\qquad suggestion\_score = {log_{10}( f \cdot f \cdot h' ) \over 3} - ed$$

The parts of the formula, using the example query motorgead, are below:


 * f = frequency of suggestion (motorhead: 721; motored: 161; motorhead\: 9)
 * h = # hits for suggestion (motorhead: 1,985; motored: 115,834; motorhead\: 1,982)
   * note that the difference in hit counts between motorhead and motorhead\ is probably chronological
 * h′ = $min( h, 10^4 ) + {h \over 10^3}$ (motorhead: ~1,987; motored: 10,115.8; motorhead\: ~1,984)
   * this maxes out the effects of # hits at ~11K; the $h \over 10^3$ term keeps values sorted by size if everything else is actually equal.
   * the exponent of the $h \over 10^3$ term could be increased from 3 to further downplay the value of large hit counts. When h = 1M, h′ = 11K, so a 100x increase in hit counts (from 10K to 1M) corresponds to only a ~10% increase in h′. After taking the log10, the effect on the final score is smaller still: log10(10K) = 4 and log10(11K) ≈ 4.04 (a ~1% increase), and that term is then divided by 3, so the score rises by only ~0.014.
 * ed = edit distance from query (motorhead: 0.92; motored: 1.68; motorhead\: 0.92)

 * suggestion_score = ${log_{10}( f \cdot f \cdot h' ) \over 3} - ed$ (motorhead: 2.085; motored: 1.126; motorhead\: 0.815)
   * the final score emphasizes the frequency of the suggestion, downplays really large numbers of results, and penalizes large edit distance (and smooths over small differences in the weights used in the edit distance scoring)
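
As a quick check on the arithmetic, here is a minimal Python sketch of the scoring function applied to the motorgead candidates listed above. The function and variable names (`h_prime`, `suggestion_score`, `candidates`) are made up for illustration and are not taken from the Glent code.

```python
import math


def h_prime(hits: float) -> float:
    """Clamp the hit count at 10^4, plus a small h/10^3 tie-breaker term."""
    return min(hits, 10**4) + hits / 10**3


def suggestion_score(freq: float, hits: float, edit_dist: float) -> float:
    """log10(f * f * h') / 3 - ed, following the formula above."""
    return math.log10(freq * freq * h_prime(hits)) / 3 - edit_dist


# (frequency, hits, edit distance) for each candidate, copied from the list above
candidates = {
    "motorhead": (721, 1985, 0.92),
    "motored": (161, 115834, 1.68),
    "motorhead\\": (9, 1982, 0.92),
}

scores = {s: suggestion_score(*vals) for s, vals in candidates.items()}
best = max(scores, key=scores.get)

for suggestion, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{suggestion}: {score:.3f}")
```

Despite having roughly 58x more hits, motored loses to motorhead because its hit count is clamped and its frequency and edit distance are both worse.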

Note: scale-based increases to frequency (say, a doubling of traffic or a doubling of the length of time we keep data) have no effect on suggestion ranking. Doubling all frequencies increases all scores by ~0.2, for example.
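
To spell out why, here is a short derivation, assuming every frequency is multiplied by the same constant k (k is introduced here only for illustration) while h′ and ed stay fixed:

$$\begin{aligned} \qquad suggestion\_score(k \cdot f) &= {log_{10}( k f \cdot k f \cdot h' ) \over 3} - ed \\ &= {log_{10}( f \cdot f \cdot h' ) \over 3} - ed + {2 \cdot log_{10}( k ) \over 3} \end{aligned}$$

The extra term does not depend on the suggestion, so every score shifts by the same amount and the ranking is unchanged. For k = 2 the shift is 2 · log10(2) / 3 ≈ 0.2.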

We may talk about changing the normalization of h′ for smaller wikis in the future. For hit counts below 10K, scale-based changes don’t change ranking, but for hit counts that cross that threshold in either direction, they can.

Glent Combination Scoring
(May 2020, update June 2020—T238151)

The current Glent implementation uses a collection of multiplicative weights to combine suggestions from M0 and M1. M2 is its own thing and should not be combined with M0 and M1. (Not 100% sure if this was always the plan, but it's the current consensus (June 2020) after discussing it again.)

Assumptions

 * M0 ≫ M1 *
 * M2 should not interact with M0 and M1
 * edit dist == 1 ≫ edit dist == 2
 * more hits > fewer hits

Edit distance is Levenshtein distance (insertions, deletions, substitutions all cost 1 edit)

* Note that “>” is “greater than”, and “≫” is “much greater than”; I resisted the urge to use “⋙”.

Formula
$$\qquad score(M\#) = hits \cdot \left\{\!\begin{aligned} &ed==1: &100\\ &ed==2: &1 \end{aligned}\right\} \cdot \left\{\!\begin{aligned} &M0: &10^9\\ &M1: &1\\ &M2: &(10 - ed) * 10^5 \end{aligned}\right\} $$
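
A minimal Python sketch of this multiplicative scheme may make the relative weights easier to see. The names `current_score` and `ED_WEIGHT` are hypothetical and not from the Glent code, and the sketch assumes the Levenshtein distance is always 1 or 2, as in the formula.

```python
# Edit-distance weights from the formula above: ed == 1 is worth 100x ed == 2.
ED_WEIGHT = {1: 100, 2: 1}


def current_score(method: str, hits: int, ed: int) -> int:
    """Multiplicative score: hits * edit-distance weight * method weight."""
    if ed not in ED_WEIGHT:
        raise ValueError(f"edit distance must be 1 or 2, got {ed}")
    if method == "M0":
        method_weight = 10**9
    elif method == "M1":
        method_weight = 1
    elif method == "M2":
        method_weight = (10 - ed) * 10**5
    else:
        raise ValueError(f"unknown method: {method}")
    return hits * ED_WEIGHT[ed] * method_weight


# An M0 suggestion with a single hit at ed == 2 still outranks
# an M1 suggestion with a million hits at ed == 1.
print(current_score("M0", 1, 2) > current_score("M1", 1_000_000, 1))
```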

Background
The new scoring method for M1 that takes into account suggestion frequency (fr), suggestion hit count (h—effectively maxed out at ~11K by h′), and suggestion-vs.-query token-aware edit distance (ed) is:

$$\begin{align} \qquad &h' = min( h, 10^4 ) + {h \over 10^3} \\ \\ &frhedScore = {log_{10}( fr^2 \cdot h' ) \over 3} - ed \end{align}$$

Notes: ${ log_{10}( fr^2 \cdot h' ) \over 3 }$ is log10 of the weighted geometric mean of suggestion frequency and clamped suggestion hit count, with frequency counted twice as heavily. The edit distance works as a penalty against suggestions that are less similar to the original query.
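
For reference, the geometric-mean reading follows directly from the log rules. With weight 2 on fr and weight 1 on h′, the weighted geometric mean is the cube root of fr² · h′, so:

$$\qquad log_{10}\left( \sqrt[3]{ fr^2 \cdot h' } \right) = { log_{10}( fr^2 \cdot h' ) \over 3 } = { 2 \cdot log_{10}( fr ) + log_{10}( h' ) \over 3 }$$

The last form shows the 2:1 weighting explicitly: a 10x change in frequency moves the score by ~0.67, while a 10x change in h′ moves it by ~0.33.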

In a recent sample of M1 suggestions, the new token-aware edit distance generates scores from ~0.6 to ~1.8, suggestion frequency runs from 1 to ~20K (with ~4K being a common high value), and hit count from ~100 to ~1.1M (with 500–1000 being a typical low value, and 50K–200K being a typical high value).

frhedScore values could theoretically range from about –2 (e.g., fr=1, h=1, ed=2) to about 5 (e.g., fr=250K, h=5M, ed=0), but values from about –0.5 (e.g., fr=3, h=450, ed=1.7) to about 3.5 (e.g., fr=20K, h=75K, ed=0.7) are what I’ve actually seen.

Assumptions

 * higher suggestion frequency ≫ lower suggestion frequency
 * lower edit dist > higher edit dist
 * more hits > fewer hits
 * M0 > M1
 * M2 should not interact with M0 and M1

Edit distance is token-aware edit distance with non-integer values, when possible.

The first three assumptions are built into the frhedScore. I suggest using this score for M0 suggestions as well, and that we come up with a new way to combine M0 and M1 into the final score for a given suggestion.

Reward….
Rather than scaling the scores (i.e., multiplying them by some constant) to satisfy the M0 > M1 ranking requirement, I suggest shifting them (with simple addition and subtraction) so they overlap in a way that accomplishes the same goal. Adding 0.5 to M0 would do the trick.

…and Punishment
Looking through a number of M1 suggestions in English, French, and Korean, it looks like a good approximation of the cut-off between “good” and “very good” suggestions is a frhedScore of ~1.5. Also, suggestions scoring below ~0.5 are generally not great and shouldn’t win out over better suggestions from other algorithms.

I suggest further pushing down scores below 0.5 (so that other methods can overtake them) by giving an additional –5 penalty.

Fudging M2
Assuming M2 doesn't interact with M0 and M1, this is no longer necessary.

Looking at a small sample of Chinese M2 data, all of the Levenshtein distances are 1 or 2, but the best ranked suggestions (ranked by a speaker) are Traditional-to-Simplified conversions, and all of the ed==2 suggestions are in that group. The default M1 token-aware edit distance params don’t work well with this data; in particular, because Chinese words are much shorter than English words (often 1 or 2 characters), even a three-word query (which in English might be ~15 letters) might not allow any edits with an edit limit of 30%. Removing the proportional edit limit would help, but that requires more investigation. For now we can stick with the Levenshtein distance, which will have a value of 1 or 2.

The problem with frhedScore for M2 is that we don’t have suggestion frequency or hit count data for M2 suggestions. We could give all suggestions a value of, say, 2.75 for the “log₁₀( fr² * h′ ) / 3” portion of the frhedScore (corresponding to approximately fr=500, h=750; or fr=200, h=5000), making the fakeFrhedScore:

$$\qquad fakeFrhedScore(M2) = 2.75 - ed$$

If ed==1, the fakeFrhedScore would be 1.75 (at the low end of “very good” for M1), and if ed==2, the fakeFrhedScore would be 0.75 (at the low end of “good” for M1).

Other values instead of 2.75 might be reasonable, in the 2.5–3.0 range.

Formula
$$\begin{aligned} \qquad &edScore(M\#) = \left\{\!\begin{aligned} &M0:\ frhedScore(fr, h, ed)\\ &M1:\ frhedScore(fr, h, ed)\\ &M2:\ 10 - ed \end{aligned}\right\} \\ \\ &methodScore = edScore(M\#) + \left\{\!\begin{aligned} &M0:\ &0.5\\ &otherwise:\ &0.0 \end{aligned}\right\} + \left\{\!\begin{aligned} &edScore < 0.5:& -5\\ &otherwise: & 0 \end{aligned}\right\} \end{aligned} $$

Note that the $-5$ adjustment would never apply to M2 because its scores are always in the 8–10 range.

(I'm suggesting changing M2 to $10 - ed$ rather than $(10 - ed) * 10^5$ to save a multiplication and so all methodScores have the same rough order of magnitude, which is within [-10, 10].)
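
The proposal can be sketched in a few lines of Python. This is only an illustration under the definitions in this section; the names `frhed_score`, `ed_score`, and `method_score` are hypothetical, not the names used in the Glent code.

```python
import math
from typing import Optional


def frhed_score(fr: float, h: float, ed: float) -> float:
    """frhedScore = log10(fr^2 * h') / 3 - ed, with h' = min(h, 10^4) + h / 10^3."""
    h_prime = min(h, 10**4) + h / 10**3
    return math.log10(fr**2 * h_prime) / 3 - ed


def ed_score(method: str, ed: float,
             fr: Optional[float] = None, h: Optional[float] = None) -> float:
    """M0 and M1 use frhedScore; M2 has no frequency or hit data, so it uses 10 - ed."""
    if method in ("M0", "M1"):
        if fr is None or h is None:
            raise ValueError("M0 and M1 need both fr and h")
        return frhed_score(fr, h, ed)
    if method == "M2":
        return 10 - ed
    raise ValueError(f"unknown method: {method}")


def method_score(method: str, ed: float,
                 fr: Optional[float] = None, h: Optional[float] = None) -> float:
    """Shift M0 up by 0.5 and push any edScore below 0.5 down by a further 5."""
    base = ed_score(method, ed, fr, h)
    bonus = 0.5 if method == "M0" else 0.0
    penalty = -5.0 if base < 0.5 else 0.0
    return base + bonus + penalty


print(round(method_score("M1", ed=0.7, fr=20_000, h=75_000), 2))
print(round(method_score("M1", ed=1.7, fr=3, h=450), 2))
print(method_score("M2", ed=2))
```

Note that the penalty is tested against the edScore before the M0 bonus is added, as in the formula.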

Next Steps: M0 Complications
If this all sounds reasonable, then I will investigate M0 a little more closely, and…

 * ✔︎ Make sure M0 suggestions are scored reasonably well with the current token-aware edit distance algorithm and settings.
 * ✔︎ Make code changes to implement the new scoring proposal.

Unfortunately, I optimistically did these out of order. Making the code changes wasn't too much work, but when I went to verify that M0 suggestions scored well, things kind of fell apart. There are two main complications:

 * The frequency scale for M0 is completely different than for M1.
   * The 0th percentile for both is 1, of course, but the 95th percentile frequencies for English Wikipedia are 8 for M0 and 799 for M1, for French, 8 vs 324, for German, 8 vs 576. The 99.9th percentile frequencies are English 102 vs 22,047, French 53 vs 10,271, German 83 vs 6,780.
   * I went down the rabbit hole trying to scale the M0 scores to make them commensurate. A power function seems to be the best fit for scaling M0 frequencies to M1 frequencies: ${ fr_{M1} \approx a \cdot {fr_{M0}}^b + c}$, for various values of a (0.5–3), b (1.5–3.5), and c (>0), but M0 is very chunky and has a very long tail of frequencies equal to 1, so the fit isn't perfect, and it isn't super consistent across wikis. And, given the way frequency plays into the frhedScore, c needs to be large enough so that a frequency of 1 doesn't score too low. It might make sense to optimize the limit for calculating h′ in the frhedScore for M0, though some upper bound is probably still useful, since the max M0 frequency I saw was 177,472 on German Wikipedia (though it is so specific that I don't think it could be normal "organic" search traffic).
 * The M1 settings for the token-aware edit distance are too restrictive.
   * M0 didn't have the problems that M1 did with edit distance, particularly because there is usually a direct relationship between the query and the suggestion, so the optimal edit distance params are much different. More edits are okay, and "bad" edits in M1 (like changing the first letter of a word) are not bad in M0. The current "plain" edit distance used for M0 is less restrictive than the token-aware edit distance optimized for M1, so it's better for M0. The better tokenization and a less restrictive set of parameters optimized for M0 would probably be better still (especially for duplicate letters), but it's okay as is for now, especially given the impact of M0.

Long story short, the assumptions and settings for M1 & frhedScore do not generalize well enough to M0 to put them on the same scale. When M0 and M1 both have suggestions, the best suggestion is often the same suggestion, or M0 is better. When only M0 has a suggestion, it's pretty good. For now, we should let M0 always beat M1 until we take the time to optimize edit distance params for M0 and find the "reasonable" range of frhedScores for M0 (which will be lower, because of the lower frequency scores).

However, the frhedScore is still useful for M0 because it takes into account the relative frequency of suggestions and allows them to override small differences in the number of hits (which often occur only because the number of results for a given query changes over time), and can even override differences in edit distance (1 vs 2 for M0) for large enough frequency differences.

New Assumptions

 * higher suggestion frequency ≫ lower suggestion frequency
 * lower edit dist > higher edit dist
 * more hits > fewer hits
 * M0 ≫ M1
 * M2 should not interact with M0 and M1

New Coke Formula
$$\begin{aligned} \qquad &edScore(M\#) = \left\{\!\begin{aligned} &M0:\ frhedScore(fr, h, ed)\\ &M1:\ frhedScore(fr, h, ed)\\ &M2:\ 10 - ed \end{aligned}\right\} \\ \\ &methodScore = edScore(M\#) + \left\{\!\begin{aligned} &M0:\ &20.0\\ &otherwise:\ &0.0 \end{aligned}\right\} \end{aligned} $$

Adding 20 to M0 means that M0 will always outperform M1.
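
A small Python sketch, using hypothetical names (`frhed_score`, `method_score`) and the theoretical extremes quoted in the Background section, shows how much headroom the +20 offset leaves:

```python
import math
from typing import Optional


def frhed_score(fr: float, h: float, ed: float) -> float:
    """frhedScore = log10(fr^2 * h') / 3 - ed, with h' = min(h, 10^4) + h / 10^3."""
    h_prime = min(h, 10**4) + h / 10**3
    return math.log10(fr**2 * h_prime) / 3 - ed


def method_score(method: str, ed: float,
                 fr: Optional[float] = None, h: Optional[float] = None) -> float:
    """edScore plus a flat +20 for M0; M2 stays on its own 10 - ed scale."""
    if method == "M2":
        return 10 - ed
    if method not in ("M0", "M1"):
        raise ValueError(f"unknown method: {method}")
    if fr is None or h is None:
        raise ValueError("M0 and M1 need both fr and h")
    offset = 20.0 if method == "M0" else 0.0
    return frhed_score(fr, h, ed) + offset


# Theoretical extremes from the Background section:
worst_m0 = method_score("M0", ed=2, fr=1, h=1)               # frhedScore of about -2
best_m1 = method_score("M1", ed=0, fr=250_000, h=5_000_000)  # frhedScore of about 5

print(f"worst M0: {worst_m0:.2f}, best M1: {best_m1:.2f}")
```

Since frhedScore values span roughly –2 to 5, any offset larger than about 7 would guarantee the same ordering; 20 simply leaves a wide margin.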

Future Steps (for M0 & M2)

 * M0
   * We could look into optimizing token-aware edit distance params for M0.
   * We could look into enabling tokenization-based normalization to get even better suggestions by ignoring punctuation, for example.
   * Determine the "good" and "not so good" ranges for the frhedScore for M0 candidates, since they are different from those for M1.
 * M2
   * We could look into bringing in M2 hits information (at least for a subset of likely suggestion candidates) by re-running queries in batch mode.
   * Might it be possible to get frequency information for M2 hits, too, by querying the Glent logs? We could give all candidates that aren’t in the logs a suggestion frequency of 1.