User:TJones (WMF)/Notes/Glent Update Notes

There are a number of smaller topics related to the ongoing Glent updates (currently mostly driven by Method 1 improvements). Rather than create a bajillion separate pages or forget some details, I'm lumping them all together here. (I'll start with current work and document future work, and backfill older stuff when I have time.)

M1 Scoring Improvements ("frhedScore")
(May 2020—T238151)

After looking at M1 examples in English, French, and Korean, I've come up with a better scoring function for choosing the suggestion:

$$\qquad suggestion\_score = {log_{10}( f \cdot f \cdot h' ) \over 3} - ed$$

The parts of the formula, using the example query motorgead, are below:


 * f = frequency of suggestion (motorhead: 721; motored: 161; motorhead\: 9)
 * h = # hits for suggestion (motorhead: 1,985; motored: 115,834; motorhead\: 1,982)
 * note that diffs between motorhead and motorhead\ are probably chronological
 * h′ = $min( h, 10^4 ) + {h \over 10^3}$ (motorhead: 1,987.0; motored: 10,115.8; motorhead\: 1,984.0)
 * this maxes out the effects of # hits at ~11K; the $h \over 10^3$ term keeps values sorted by size if everything else is actually equal.
 * the exponent of the $h \over 10^3$ term could be increased from 3 to further downplay the value of large hit counts. When h = 1M, h′ = 11K, so a 100x increase in hit counts (from 10K to 1M) corresponds to a ~10% increase in h′. After taking the log10 and dividing by 3, though, the increase in the final score attributable to h′ is only ~0.014 (log10(10K) = 4; log10(11K) ≈ 4.04, a ~1% increase, and that difference of ~0.04 then gets divided by 3).
 * ed = edit distance from query (motorhead: 0.92; motored: 1.68; motorhead\: 0.92)


 * suggestion_score = ${log_{10}( f \cdot f \cdot h' ) \over 3} - ed$ (motorhead: 2.085; motored: 1.126; motorhead\: 0.815)
 * the final score emphasizes the frequency of the suggestion, downplays really large numbers of results, and penalizes large edit distance (and smooths over small differences in the weights used in the edit distance scoring)
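As a quick sanity check, here is a minimal Python sketch of the formula applied to the motorgead numbers listed above. The function and constant names are mine, for illustration only; this is not the Glent implementation.

```python
import math

HIT_CAP = 10**4        # hit counts above this are clamped
HIT_TIEBREAK = 10**3   # divisor for the small tie-breaking term


def clamped_hits(h: float) -> float:
    """h' = min(h, 10^4) + h / 10^3"""
    return min(h, HIT_CAP) + h / HIT_TIEBREAK


def suggestion_score(f: float, h: float, ed: float) -> float:
    """log10(f * f * h') / 3 - ed"""
    return math.log10(f * f * clamped_hits(h)) / 3 - ed


# (frequency, hits, edit distance) for the query "motorgead", from the list above
candidates = {
    "motorhead": (721, 1985, 0.92),
    "motored": (161, 115834, 1.68),
    "motorhead\\": (9, 1982, 0.92),
}

scores = {s: suggestion_score(*v) for s, v in candidates.items()}
for s, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{s}: {score:.3f}")
# motorhead: 2.085
# motored: 1.126
# motorhead\: 0.815
```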

Note: scale-based increases to frequency (say, doubling of traffic or doubling of the length of time we keep data) have no effect on suggestion ranking. Doubling all frequencies increases all scores by ~0.2, for example.
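To spell out the arithmetic, take k as an assumed constant scale factor applied to every frequency. The shift depends only on k, so it is the same for every suggestion and the ranking is unchanged:

$$\qquad {log_{10}( kf \cdot kf \cdot h' ) \over 3} - ed = {log_{10}( f \cdot f \cdot h' ) \over 3} - ed + {2 \cdot log_{10}( k ) \over 3}$$

For k = 2, the shift is $2 \cdot log_{10}( 2 ) \over 3$ ≈ 0.2.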

We may talk about changing the normalization of h′ for smaller wikis in the future. For hit counts below 10K, scale-based changes don’t change ranking, but for hit counts that cross that threshold in either direction, they can.
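Here is a small Python sketch of that threshold effect. The two candidates A and B are made-up numbers chosen to sit near the 10^4 clamp, not real Glent data, and the helper names are hypothetical.

```python
import math


def clamped_hits(h: float) -> float:
    """h' = min(h, 10^4) + h / 10^3"""
    return min(h, 10**4) + h / 10**3


def score(f: float, h: float, ed: float) -> float:
    return math.log10(f * f * clamped_hits(h)) / 3 - ed


def ranking(cands: dict, hit_scale: float) -> list:
    """Rank candidate names by score after multiplying every hit count by hit_scale."""
    return sorted(
        cands,
        key=lambda name: score(cands[name][0], cands[name][1] * hit_scale, cands[name][2]),
        reverse=True,
    )


# name: (frequency, hits, edit distance); illustrative values only
cands = {"A": (100, 5000, 1.0), "B": (71, 10000, 1.0)}

print(ranking(cands, 1))    # ['B', 'A']
print(ranking(cands, 0.5))  # ['B', 'A']  (all hits stay below the clamp)
print(ranking(cands, 10))   # ['A', 'B']  (hits cross the clamp, order flips)
```

Once both hit counts are clamped, B's advantage in hits mostly disappears and A's higher frequency wins.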

Glent Combination Scoring
(May 2020—T238151)

The current Glent implementation uses a collection of multiplicative weights to combine suggestions from M0, M1, and M2.

Assumptions

 * M0 ≫ M2 ≫ M1 *
 * edit dist == 1 ≫ edit dist == 2
 * more hits > fewer hits

Edit distance is Levenshtein distance (insertions, deletions, substitutions all cost 1 edit).
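For reference, a minimal Python sketch of plain Levenshtein distance (the textbook dynamic-programming version, not necessarily what Glent runs):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("motorgead", "motorhead"))  # 1
print(levenshtein("motorgead", "motored"))    # 2
```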

* Note that “>” is “greater than”, and “≫” is “much greater than”; I resisted the urge to use “⋙”.

Formula
$$\qquad score(M\#) = hits \cdot \left\{\!\begin{aligned} &ed==1: &100\\ &ed==2: &1 \end{aligned}\right\} \cdot \left\{\!\begin{aligned} &M0: &10^9\\ &M1: &1\\ &M2: &(10 - ed) \cdot 10^5 \end{aligned}\right\} $$
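Read as code, the formula above looks roughly like this. It is a sketch of the formula as written here, with a hypothetical function name, and it assumes ed is an integer Levenshtein distance of 1 or 2; the actual Glent code may be organized differently.

```python
def current_score(method: str, hits: int, ed: int) -> int:
    """Multiplicative combination score: hits * edit-distance weight * method weight."""
    ed_weight = {1: 100, 2: 1}[ed]
    method_weight = {
        "M0": 10**9,
        "M1": 1,
        "M2": (10 - ed) * 10**5,
    }[method]
    return hits * ed_weight * method_weight


print(current_score("M1", 1000, 1))  # 100000
print(current_score("M2", 100, 2))   # 80000000
print(current_score("M0", 1, 2))     # 1000000000
```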

Background
The new scoring method for M1 that takes into account suggestion frequency (fr), suggestion hit count (h—effectively maxed out at ~11K by h′), and suggestion-vs.-query token-aware edit distance (ed) is:

$$\begin{aligned} \qquad &h' = min( h, 10^4 ) + {h \over 10^3} \\ \\ &frhedScore = {log_{10}( fr^2 \cdot h' ) \over 3} - ed \end{aligned}$$

Notes: ${ log_{10}( fr^2 \cdot h' ) \over 3 }$ is log10 of the weighted geometric mean of suggestion frequency and clamped suggestion hit count, with frequency counted twice as heavily. The edit distance works as a penalty against suggestions that are less similar to the original query.
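Spelled out, with the weights 2/3 and 1/3 made explicit:

$$\qquad {log_{10}( fr^2 \cdot h' ) \over 3} = log_{10}\left( (fr \cdot fr \cdot h')^{1/3} \right) = log_{10}\left( fr^{2/3} \cdot (h')^{1/3} \right)$$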

In a recent sample of M1 suggestions, the new token-aware edit distance generates scores from ~0.6 to ~1.8, suggestion frequency runs from 1 to ~20K (with ~4K being a common high value), and hit count from ~100 to ~1.1M (with 500–1000 being a typical low value, and 50K–200K being a typical high value).

frhedScore values could theoretically range from about –2 (e.g., fr=1, h=1, ed=2) to about 5 (e.g., fr=250K, h=5M, ed=0), but values from about –0.5 (e.g., fr=3, h=450, ed=1.7) to about 3.5 (e.g., fr=20K, h=75K, ed=0.7) are what I’ve actually seen.

Assumptions

 * higher suggestion frequency ≫ lower suggestion frequency
 * lower edit dist > higher edit dist
 * more hits > fewer hits
 * M0 > M2 > M1

Edit distance is token-aware edit distance with non-integer values, when possible.

The first three assumptions are built into the frhedScore. I suggest using this score for M0 suggestions as well, and an adapted version of it for M2, and that we come up with a new way to combine them all into the final score for a given suggestion.

Reward….
Rather than scaling the scores (i.e., multiplying them by some constant) to satisfy the M# ranking requirement, I suggest shifting them (with simple addition and subtraction) so they overlap in a way that accomplishes the same goal. Adding 0.5 to M2 and 1.0 to M0 would do the trick.

…and Punishment
Looking through a number of M1 suggestions in English, French, and Korean, it looks like a good approximation of the cut-off between “good” and “very good” suggestions is a frhedScore of ~1.5. Also, suggestions scoring below ~0.5 are generally not great and shouldn’t win out over better suggestions from other algorithms.

I suggest further pushing down scores below 0.5 (so that other methods can overtake them) by giving an additional –5 penalty.

Fudging M2
Looking at a small sample of Chinese M2 data, all of the Levenshtein distances are 1 or 2, but the best ranked suggestions (ranked by a speaker) are Traditional-to-Simplified conversions, and all of the ed==2 suggestions are in that group. The default M1 token-aware edit distance params don’t work well with this data. In particular, because Chinese words are much shorter than English words (often 1 or 2 characters), even a three-word query (which in English might be ~15 letters) might not allow any edits with an edit limit of 30%. Removing the proportional edit limit would help, but that requires more investigation. For now we can stick with the Levenshtein distance, which will have a value of 1 or 2.

The problem with frhedScore for M2 is that we don’t have suggestion frequency or hit count data for M2 suggestions. We could give all suggestions a value of, say, 2.75 for the “log₁₀( fr² * h′ ) / 3” portion of the frhedScore (corresponding to approximately fr=500, h=750; or fr=200, h=5000), making the fakeFrhedScore:

$$\qquad frhedScore(M2) = 2.75 - ed$$

If ed==1, the fakeFrhedScore would be 1.75 (at the low end of “very good” for M1), and if ed==2, the fakeFrhedScore would be 0.75 (at the low end of “good” for M1).

Other values instead of 2.75 might be reasonable, in the 2.5–3.0 range.
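To check that 2.75 is in the right neighborhood, this short Python sketch computes the “log₁₀( fr² * h′ ) / 3” portion for the two example (fr, h) pairs mentioned above, and then the resulting fake scores. The names `mean_term` and `fake_frhed_score` are hypothetical.

```python
import math

M2_BASE = 2.75  # stand-in value for the missing frequency and hit count data


def mean_term(fr: float, h: float) -> float:
    """log10(fr^2 * h') / 3, with h' = min(h, 10^4) + h / 10^3"""
    h_prime = min(h, 10**4) + h / 10**3
    return math.log10(fr**2 * h_prime) / 3


def fake_frhed_score(ed: float) -> float:
    return M2_BASE - ed


print(round(mean_term(500, 750), 2))   # 2.76
print(round(mean_term(200, 5000), 2))  # 2.77
print(fake_frhed_score(1))             # 1.75
print(fake_frhed_score(2))             # 0.75
```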

Formula
$$\begin{aligned} \qquad &frhedScore(M\#) = \left\{\!\begin{aligned} &M0:\ frhedScore(fr, h, ed)\\ &M1:\ frhedScore(fr, h, ed)\\ &M2:\ 2.75 - ed \end{aligned}\right\} \\ \\ &methodScore = frhedScore(M\#) + \left\{\!\begin{aligned} &M0:\ 1.0\\ &M1:\ 0.0\\ &M2:\ 0.5 \end{aligned}\right\} + \left\{\!\begin{aligned} &frhedScore < 0.5:& -5\\ &otherwise: & 0 \end{aligned}\right\} \end{aligned} $$
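A minimal Python sketch of the proposed combination, assuming the penalty is checked against the frhedScore before the method bonus is added (which is how I read the formula). All names are illustrative; none of this is existing Glent code.

```python
import math

METHOD_BONUS = {"M0": 1.0, "M1": 0.0, "M2": 0.5}
M2_BASE = 2.75            # stand-in for the log10(fr^2 * h') / 3 term for M2
LOW_SCORE_CUTOFF = 0.5
LOW_SCORE_PENALTY = -5.0


def frhed_score(fr: float, h: float, ed: float) -> float:
    h_prime = min(h, 10**4) + h / 10**3
    return math.log10(fr**2 * h_prime) / 3 - ed


def method_score(method: str, ed: float, fr: float = None, h: float = None) -> float:
    """frhedScore for the method, plus the method shift, plus the low-score penalty."""
    if method == "M2":
        base = M2_BASE - ed            # no frequency or hit data for M2
    else:
        base = frhed_score(fr, h, ed)  # M0 and M1 use the full score
    penalty = LOW_SCORE_PENALTY if base < LOW_SCORE_CUTOFF else 0.0
    return base + METHOD_BONUS[method] + penalty


print(method_score("M2", ed=1))                               # 2.25
print(method_score("M2", ed=2))                               # 1.25
print(round(method_score("M0", ed=0.92, fr=721, h=1985), 3))  # 3.085
print(round(method_score("M1", ed=1.7, fr=3, h=450), 3))      # -5.497
```

The last line shows the penalty at work: a weak M1 suggestion (frhedScore of about –0.5) drops far enough that any unpenalized suggestion from another method will beat it.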

Next Steps
If this all sounds reasonable, then I will investigate M0 and M2 a little more closely, and…


 * Make sure M0 suggestions are scored reasonably well with the current token-aware edit distance algorithm and settings.
 * Investigate M2 edit distance scoring options (Levenshtein vs token-aware edit distance with no proportional limit).
 * Make code changes to implement the new scoring proposal.

Future Steps

 * We could look into bringing in M2 hits information (at least for a subset of likely suggestion candidates) by re-running queries in batch mode.
 * Might it be possible to get frequency information for them, too, by querying the Glent logs? We could give all candidates that aren’t in the logs a suggestion frequency of 1.