User:TJones (WMF)/Notes/Glent Update Notes

There are a number of smaller topics related to the ongoing Glent updates (currently mostly driven by Method 1 improvements). Rather than create a bajillion separate pages or forget some details, I'm lumping them all together here. (I'll start with current work and document future work, and backfill older stuff when I have time.)

Glent Combination Scoring
The current Glent implemnentation uses a collection of multiplicative weights to combing suggestions from M0, M1, and M2.

Assumptions

 * M0 ≫ M2 ≫ M1 *
 * edit dist == 1 ≫ edit dist == 2
 * more hits > fewer hits

Edit distance is Levenshtein distance (insertions, deletions, substitutions all cost 1 edit)

* Note that “>” is “greater than, and “≫” is “much greater than”; I resisted the urge to use “⋙”.

Formula
$$\qquad score(M\#) = hits \cdot \left\{\!\begin{aligned} &ed==1: &100\\ &ed==2: &1 \end{aligned}\right\} \cdot \left\{\!\begin{aligned} &M0: &10^9\\ &M1: &1\\ &M2: &(10 - ed) * 10^5 \end{aligned}\right\} $$

Background
The new scoring method for M1 that takes into account suggestion frequency (fr), suggestion hit count (h—effectively maxed out at ~11K by h′), and suggestion-vs.-query token-aware edit distance (ed) is:

$$\begin{align} \qquad &h' = min( h, 10^4 ) + {h \over 10^3} \\ \\ &frhedScore = {log_{10}( fr^2 \cdot h' ) \over 3} - ed \end{align}$$

Notes: ${ log_{10}( fr^2 \cdot h' ) \over 3 }$ is log10 of the weighted geometric mean of suggestion frequency and clamped suggestion hit count, with frequency counted twice as heavily. The edit distance works as a penalty against suggestions that are less similar to the original query.

In a recent sample of M1 suggestions, the new token-aware edit distance generates scores from ~0.6 to ~1.8, suggestion frequency runs from 1 to ~20K (with ~4K being a common high value), and hit count from ~100 to ~1.1M (with 500–1000 being a typical low value, and 50K–200K being a typical high value).

frhedScore values could theoretically range from about –2 (e.g., fr=1, h=1, ed=2) to about 5 (e.g., fr=250K, h=5M, ed=0), but values from about –0.5 (e.g., fr=3, h=450, ed=1.7) to about 3.5 (e.g., fr=20K, h=75K, ed=0.7) are what I’ve actually seen.

Assumptions

 * higher suggestion frequency ≫ lower suggestion frequency
 * lower edit dist > higher edit dist
 * more hits > fewer hits
 * M0 > M2 > M1

Edit distance is token-aware edit distance with non-integer values, when possible.

The first three assumptions are built into the frhedScore. I suggest using this score for M0 suggestions as well, and an adapted version of it for M2, and that we come up with new way to combine them all into the final score for a given suggestion.

Reward….
Rather than scaling the scores (i.e., multiplying them by some constant) to satisfy the M# ranking requirement, I suggest shifting them (with simple addition and subtraction) so they overlap in a way that accomplishes the same goal. Adding 0.5 to M2 and 1.0 to M1 would do the trick.

…and Punishment
Looking through a number of M1 suggestions in English, French, and Korean, it looks like a good approximation of the cut-off between “good” and “very good” suggestions is a frhedScore of ~1.5. Also, suggestions scoring below ~0.5 are generally not great and shouldn’t win out over better suggestions from other algorithms.

I suggest further pushing down scores below 0.5 (so that other methods can overtake them) by giving an additional –5 penalty.

Fudging M2
Looking at a small sample of Chinese M2 data, all of the Levenshtein distances are 1 or 2, but the best ranked suggestions (ranked by a speaker) are Traditional-to-Simplified conversions, and all of the ed==2 suggestions are in that group. The default M1 token-aware edit distance params don’t work well with this data; in particular, because Chinese words are much shorter than English words—often 1 or 2 characters—even a three word query (which in English might be ~15 letters) might not allow any edits with an edit limit of 30%. Removing the proportional edit limit would help, but that requires more investigation. For now we can stick with the Levenshtein distance, which will have a value of 1 or 2.

The problem with frhedScore for M2 is that we don’t have suggestion frequency or hit count data for M2 suggestions. We could give all suggestions a value of, say, 2.75 for the “log₁₀( fr² * h′ ) / 3” portion of the frhedScore (corresponding to approximately fr=500, h=750; or fr=200, h=5000), making the fakeFrhedScore:

$$\qquad frhedScore(M2) = 2.75 - ed$$

If ed==1, the fakeFrhedScore would be 1.75 (at the low end of “very good” for M1), and if ed==2, the fakeFrhedScore would be 0.75 (at the low end of “good” for M1).

Other values instead of 2.75 might be reasonable, in the 2.5–3.0 range

Formula
$$\begin{aligned} \qquad &frhedScore(M\#) = \cdot \left\{\!\begin{aligned} &M0:\ frhedScore(fr, h, ed)\\ &M1:\ frhedScore(fr, h, ed)\\ &M2:\ 2.75 - ed \end{aligned}\right\}

\\ \\

&methodScore = frhedScore(M\#) + \left\{\!\begin{aligned} &M0:\ 1.0\\ &M1:\ 0.0\\ &M2:\ 0.5 \end{aligned}\right\} + \left\{\!\begin{aligned} &frhedScore < 0.5:& -5\\ &otherwise: & 0 \end{aligned}\right\}

\end{aligned} $$

Next Steps
If this all sounds reasonable, then I will investigate M0 and M2 a little more closely, and…


 * Make sure M0 suggestions are scored reasonably well with the current token-aware edit distance algorithm and settings.
 * Investigate M2 edit distance scoring options (Levenshtein vs token-aware edit distance with no proportional limit).
 * Make code changes to implement the new scoring proposal

Future Steps

 * We could look into bringing in M2 hits information (at least for a subset of likely suggestion candidates) by re-running queries in batch mode.
 * Might it be possible to get frequency information for them, too, by querying the Glent logs? We could give all candidates that aren’t in the logs a suggestion frequency of 1.