Topic on User talk:TJones (WMF)

Diacritic overfolding in Vietnamese

2 comments • 20:58, 2 May 2023

Mxn (talk · contribs)

Hi Trey, thanks for your recent blog post – it's a good overview of many of the challenges I encounter in multilingual text processing as a software engineer, not only in search.

Since you mentioned Vietnamese, I'd like to call your attention to phab:T78485: if the user enters search terms that contain diacritics, especially tone marks, MediaWiki should not direct the user to a different article title that matches only the base letters but not the diacritics. This is important because Vietnamese words pack a lot of meaning into diacritical marks. There are a great many sets of 6–12 three-letter words that differ from one another only by diacritics, especially among proper names.

You have a point that we can't rely on users to always enter all the diacritical marks. But if they enter any diacritical marks, they expect those particular marks to be respected for the most part. Folding the diacritics out of text that already carries them has much the same effect as redirecting a query for "résumé" to "resume" in English. Sometimes users do enter the wrong diacritics, especially when using the VNI input method or the "VIQR" keyboard on iOS, which both place all the diacritic keys next to each other. But such mistakes can be weighted less than a base-character difference when calculating edit distance; these typos don't necessarily require diacritic folding.
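A minimal sketch of that weighting, assuming a plain Levenshtein distance in which substituting one letter for another with the same base letter costs less than any other edit. The function names and the 0.5 cost are hypothetical, chosen only for illustration; nothing here reflects how MediaWiki or CirrusSearch actually scores typos.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip every combining mark from one character; map đ/Đ to d/D."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")


def weighted_distance(a: str, b: str, diacritic_cost: float = 0.5) -> float:
    """Levenshtein distance where a diacritic-only substitution is cheaper.

    diacritic_cost is an assumed weight, not a tuned value.
    """
    # Compare precomposed letters so that each Vietnamese letter is one unit.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif base_letter(ca) == base_letter(cb):
                sub = diacritic_cost  # same base letter, different marks
            else:
                sub = 1.0             # different base letters
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + sub))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("hộp", "hợp"))  # wrong diacritic on one letter
    print(weighted_distance("hộp", "hột"))  # wrong base letter
```

With these weights a mistyped diacritic ranks closer to the query than a mistyped base letter, and no folding of the query is involved.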

What's more, Vietnamese organizes the marks into two tiers: one tier (such as circumflexes) is considered part of the base letter, while another tier of tone marks is considered separate from the base letter. Traditionally, the tone marks apply to the word as a whole. Analytics bear out the fact that, in an autocompleting textbox, users commonly enter some diacritics while omitting the tone marks until after they spell out the whole word. So if anything, Vietnamese queries should be evaluated three times: first literally, then after folding the tone marks in the target text, then after folding all diacritics in the target text. But diacritics in the query should never be folded automatically.
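The three passes can be sketched as follows. This is only an illustration under stated assumptions: the tone marks are taken to be the five combining characters listed in `TONE_MARKS` (which is specific to Vietnamese, since the tilde is part of the letter in other languages), and `tiered_match` and its pass labels are hypothetical names. The query is compared as typed; only the target titles are folded.

```python
import unicodedata

# Assumed Vietnamese tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def strip_tones(text: str) -> str:
    """Remove tone marks but keep circumflex, breve, and horn."""
    decomposed = unicodedata.normalize("NFD", text)
    return nfc("".join(c for c in decomposed if c not in TONE_MARKS))


def strip_all(text: str) -> str:
    """Remove every combining mark and map đ/Đ to d/D."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")


def tiered_match(query: str, titles: list[str]) -> tuple[str, list[str]]:
    """Try literal, tone-folded, then fully folded targets, in that order."""
    query = nfc(query)
    titles = [nfc(t) for t in titles]
    passes = [
        ("literal", lambda t: t),
        ("tones folded", strip_tones),
        ("all folded", strip_all),
    ]
    for name, fold_target in passes:
        hits = [t for t in titles if fold_target(t) == query]
        if hits:
            return name, hits
    return "none", []


if __name__ == "__main__":
    titles = ["trường hợp", "trường hộp"]
    for q in ["trường hộp", "trương hơp", "truong hop"]:
        print(q, "->", tiered_match(q, titles))
```

Because the query is never folded, a query typed with tone marks can only match a title that carries the same marks.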

Edited 01:02, 29 April 2023

TJones (WMF) (talk · contribs)

Hi Minh, sorry that phab task has been languishing for so long. Unfortunately it is from before the early days of what is now the Search Platform team, so it is on the wrong phab board... or, rather, it is not also on the right one. I've added it to our main board and put it in "needs triage" status so the team will discuss it next week. It still has to make its way through our triage and prioritization process, and it'll probably end up on the backlog for now, but I'd prioritize it as high.

I'm not 100% sure what's going on, but I think I've got it down to the fact that the "Go" feature (which is what you get when you hit the "search" button at the top of the page) has a secondary index that uses ASCII folding. The primary index is boosted more strongly, and it uses ICU normalization, which is much less aggressive.

A practical short-term fix would be to upgrade Vietnamese to use ICU folding, because that allows for folding exceptions. ASCII folding does not (which presents a potential problem for third-party MediaWiki users without the ICU plugin, but we can think about that more later).

I'm not sure that we have separate analyzers for query text and title text in the Go feature (we do for full text, for example), so I may need to check with our Elastic expert. He's not available this week, so I may not get a quick answer.

Assuming we don't have separate analyzers for the Go feature, would it be better for the secondary index to maintain the full diacritics (e.g., "trường hộp" & "trường hợp"), or to drop the tone diacritics but preserve the base letters ("trương hôp" & "trương hơp", if I did that correctly)? It sounds like the second option is better, but I defer to your opinion. Looking more carefully, the first option makes the secondary index identical to the primary index, other than a 5000 character limit, which shouldn't come up too much. Also, does the same level of folding (either none or just removing tone diacritics) make sense for full text results? The exceptions are linked in our current config.
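To make the two options concrete, here is a rough sketch that computes the index key for both titles under each scheme, plus a fully folded key for comparison. The function names are hypothetical, the tone-mark set is an assumption, and `ascii_like` only approximates ASCII folding by stripping combining marks; none of this is the actual analyzer configuration.

```python
import unicodedata

# Assumed Vietnamese tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def keep_all(title: str) -> str:
    """Option 1: keep the full diacritics (only normalize to NFC)."""
    return unicodedata.normalize("NFC", title)


def drop_tones(title: str) -> str:
    """Option 2: drop tone marks, keep letter-forming marks."""
    decomposed = unicodedata.normalize("NFD", title)
    kept = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def ascii_like(title: str) -> str:
    """Rough stand-in for ASCII folding: drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")


if __name__ == "__main__":
    titles = ["trường hộp", "trường hợp"]
    for name, make_key in [("full diacritics", keep_all),
                           ("tones dropped", drop_tones),
                           ("all folded", ascii_like)]:
        keys = [make_key(t) for t in titles]
        status = "collide" if len(set(keys)) < len(keys) else "distinct"
        print(f"{name}: {keys} ({status})")
```

For this pair, both options keep the two titles apart, because ô and ơ are different letters once the tone mark is gone; only the fully folded key merges them.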

If just adding the ICU upgrade and a mapping to remove tone diacritics is enough, I might be able to get to this much sooner as a 10% project outside our team's normal plans. (I've got a soft spot for language fixes that are approaching 10 years old! Ask me about Crimean Tatar transliteration!)

Final question for now! While I can see that this is terribly annoying, is it particularly common? It seems that it requires the exact match to fail and then for there to be exactly one alternative after folding. Given the number of similar syllables, I wouldn't think that happens a lot. I typed cờm and hit "search", and I got rolled over to full text results because there are at least 7 similarly-folded alternatives (cơm, cớm, còm, cộm, cốm, cỡm, cỏm). I guess it's more likely with longer titles like trường hộp / trường hợp because most similarly-folded combinations of syllables don't occur.

Edited 20:58, 2 May 2023
