User talk:TJones (WMF)/Notes/Khmer Reordering/Examples


បេឡា (talkcontribs)

Hi, I'm just a new user who cares a great deal about the quality of my little Khmer community on Wikipedia. Fortunately, I happened to come across your page, and this is one of many issues with Khmer script on the site! With my understanding of the language, I would like to lend you a hand on this one:

(Pardon me if I come across as rude; I taught myself English.)

??? : The subscripts for ត and ដ are (almost) identical; there is no rule for their order, since they cannot both be used on the same consonant.

From the context, the subscript of ត (្ត) is the grammatically correct one (ស្តា : ស + ្ត + ា), though you can use the subscript of ដ (្ដ) to the same effect (ស្ដា : ស + ្ដ + ា).

Syllable boundary errors: The original កុំែ ( ក + ុ + ំ + ែ ) is in the correct order; កុែំ ( ក + ុ + ែ + ំ ) is not. The same goes for the others involving ុំ ( ុ + ំ ), ុះ ( ុ + ះ ) and េះ ( េ + ះ ), which should use the order shown here. (The contexts are full of grammatical typos.)

Another one is ពា្ឈ ( ព + ា + ្ឈ ). From a grammatical standpoint, that is certainly incorrect, but I think the author is using it to achieve this effect: ញ + ្ឈ = ញ្ឈ. A possible explanation is that Khmer Unicode wasn't fully developed back then, so the writer had to substitute ព + ា = ពា for ញ.

ច៎ា ( ច + ៎ + ា ) is the correct one. ៎ is used to emphasize the sound of ច, making it shorter and sharper, so the reordered one is incorrect.

That is all I can help with for now. And thanks for your hard work!

(P.S. Half of the samples you're using are broken beyond recognition; common reordering won't make them readable.)

TJones (WMF) (talkcontribs)

Thanks for the feedback—your English is great and I appreciate the help!

I've been off this project for a while, so I may have some more questions later if you are available. I have some questions right now:

Should I just ignore the weird case with both subscript ត and ដ as an error in the text? Or should I try to fix it? Would it make sense to treat subscript ត and ដ as the same since they look the same? That would be easy to do.
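A minimal sketch of what treating the two subscripts as the same could look like, assuming the choice is to fold subscript ដ into subscript ត before comparing text. The function name `fold_subscript_da` is hypothetical, and the direction of the fold is an arbitrary assumption; folding the other way would work equally well for matching.

```python
# Hypothetical helper: fold subscript DA into subscript TA so that the two
# near-identical subscripts compare as equal. Not taken from any real library.
COENG = "\u17D2"  # KHMER SIGN COENG, marks the next consonant as a subscript
TA = "\u178F"     # KHMER LETTER TA
DA = "\u178A"     # KHMER LETTER DA
SA = "\u179F"     # KHMER LETTER SA
AA = "\u17B6"     # KHMER VOWEL SIGN AA


def fold_subscript_da(text: str) -> str:
    """Replace COENG+DA with COENG+TA; a full-size DA is left untouched."""
    return text.replace(COENG + DA, COENG + TA)


sda_with_ta = SA + COENG + TA + AA  # SA + subscript TA + AA
sda_with_da = SA + COENG + DA + AA  # SA + subscript DA + AA
print(fold_subscript_da(sda_with_da) == sda_with_ta)
```

Only the subscript form is folded, because the full-size letters ត and ដ are visually distinct.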

I'm not sure how to understand កុំែ. In most fonts, it has a dotted circle (like this: ◌) at the end. That means the font can't display it correctly. Other fonts display it correctly.

For ពា្ឈ / ញ្ឈ, is there a consistent rule to apply? Should "ព + ា" followed by a subscript consonant be converted to "ញ"? That seems like it could cause errors, but I would have to test it.

Is there a more general pattern for the case of ច៎ា? Should ៎ always be close to the main consonant, or just ច, or just in this one word?

Thanks for the help!

Reply to "Some Corrections?"
Eltimbalino (talkcontribs)


Firstly, TJones, you're doing an incredibly good job on a very difficult task. Well done and thank you.

I am no expert in Khmer script, after spending four years half-heartedly learning to read, write, and speak it. But I'm sure I'm better than nothing.

This section looks like it is all typos to me. And your assumption about incorrect syllable boundaries is correct.

The characters following this កុ in this កុែំ do not belong after the syllable កុ on their own and are not connected to it. They are the start of the next syllable or word and missing their consonant.


Your interpretation of this one looks correct:

ញុាំ ( ញ + ុ + ា + ំ ) ញុំាំ ( ញ + ុ + ំ + ា + ំ )

The word means to eat.

But as you discovered in the "Original is More Common" section, there are multiple ways of typing this. It is possible that there is not actually a correct key sequence and that it is a matter of "if it looks right, then it is right".

ញុាំ ( ញ + ុ + ា + ំ ) ញុំា ( ញ + ុ + ំ + ា )


I believe your interpretation of this one is also correct:

ឆ្មាំ ( ឆ + ្ម + ា + ំ ) ឆាំ្ម ( ឆ + ា + ំ + ្ម )

This is because the subconsonant ( ្ម)m follows the consonant (ឆ)ch when spoken and the vowels follow that.


Questionably Reordered Syllables

There are some incorrect syllable boundaries in here. But there is also a problem where the subconsonant has been typed last, instead of directly after the initial consonant. I can't think of an instance where within a single syllable anything gets between the initial consonant and its subconsonant.

The original in this one is a commonly used syllable at the start of a word. So my guess is that the following syllable is missing its consonant.

កុេំ ( ក + ុ + េ + ំ ) កុំេ ( ក + ុ + ំ + េ )

Basically, if it fails to render, then it is incorrect and probably a typo.


Visible Duplicates

I'm pretty confident that these are all just typos.


Duplicate Supplementary Consonants

Subconsonants in Khmer do get stacked up sometimes. I've got a feeling that I once saw one that was a stacked duplicate, but my memory is very vague there. If I did, it was probably in a loanword.

Where there is a consonant that is typed, but it doesn't render, I'm going to deduce that it is always wrong because what is the point of typing a character that is never seen?

ស្កូ ( ស + ្ក + ូ ) ស្កូ្ក ( ស + ្ក + ូ + ្ក )




Eltimbalino (talkcontribs)

This ក្ដេា looks okay, but it may not be a real word. It is very similar to ក្ដៅ, which means hot. If you paste ក្ដេា into the "read" tab of https://kheng.info/ you'll see that it breaks it into syllables that don't render. Matt, the creator of kheng.info, has done some really great work on breaking sentences into words; his partner is Khmer, and he is a friendly and helpful person. Maybe you should get in touch with him?


Consonants Swaps

I think you're right to put the subconsonant ( ្រ) r after pairs of preceding consonants/subconsonants, because that is the order in which the sounds would be made.


Zero-Width (Non-)Joiners

When these go between a consonant and a vowel, I think they must always be wrong. These are used to separate words that in the script look joined together. They should never be inside a word, and much less inside a syllable.


Split Vowels

This one is really tricky. Different keyboards have different options. So while some keyboards let you type a combined vowel in a single keystroke, other keyboards require you to enter it as a sequence of vowels that appear and re-form as you type. On my keyboard, I was able to type the left column's version. My partner can type all three of these vowel characters with a single keystroke using shift+; on her keyboard, which in English produces : but in Khmer produces the vowel in (ហោះ) flight and (កោះ) island.

កោះ ( ក + ោ + ះ ) កេាះ ( ក + េ + ា + ះ )

Invisible Duplicates

I think all of these are typos.


Reordered Syllables

These look good to me, but I am not even nearly informed enough to judge. This is the collection that inspired me to raise the issue in the first place. Even if mediawiki were to run a script that corrected everything to be in an approved sequence, that would be only half of the battle. That script would also need to be run on any search phrase before the normal processes took over.

TJones (WMF) (talkcontribs)

Thanks so much, @Eltimbalino!

> But I'm sure I'm better than nothing.

Your help is much, much better than nothing—and much appreciated!

???

The general pattern I’m getting for a lot of these is that if there is a second vowel (other than the ones that can be a split vowel, like េ and ា) we should consider it a different syllable. Sound right?

Alternatively, we could say that typos are typos and they mess things up, and whatever happens, happens. (I’d prefer to fix things when I can, but I may have more limitations in the final implementation than I have in this prototype.)

Cases like ញុាំ ( ញ + ុ + ា + ំ ) / ញុំាំ ( ញ + ុ + ំ + ា + ំ ) and ឆ្មាំ ( ឆ + ្ម + ា + ំ ) / ឆាំ្ម ( ឆ + ា + ំ + ្ម ) render just differently enough for me not to be sure. I’ll be a little more forgiving about the ones that are very close.

I need to think about this section more when I have more time—definitely on Monday.

Questionably Reordered Syllables

It sounds like I should take the “vowel + sub-consonant” ones as correct to re-order. I was unsure about them because they don’t render the same (for me) in the two different orders, unlike some others.
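One way the “vowel + sub-consonant” re-ordering could be sketched is with a regular expression that moves a stranded subscript back to directly after its base consonant. This is only an illustration: the character ranges are assumptions read off the Unicode Khmer block, the name `move_subscripts_forward` is hypothetical, and it does not attempt to detect the syllable-boundary typos discussed above.

```python
import re

# Assumed character ranges from the Unicode Khmer block (U+1780-U+17FF).
CONSONANT = r"\u1780-\u17A2"  # KHMER LETTER KA .. KHMER LETTER QA
MARKS = r"\u17B6-\u17C8"      # dependent vowels plus nikahit, reahmuk, yuukaleapintu
COENG = r"\u17D2"             # KHMER SIGN COENG, the subscript marker

# A consonant, then one or more vowels/signs, then one or more subscripts
# (COENG + consonant) that were typed after the vowels.
MISPLACED = re.compile(
    rf"([{CONSONANT}])([{MARKS}]+)((?:{COENG}[{CONSONANT}])+)"
)


def move_subscripts_forward(text: str) -> str:
    """Move subscripts typed after the vowels to just after the consonant."""
    return MISPLACED.sub(r"\1\3\2", text)


typed = "\u1786\u17B6\u17C6\u17D2\u1798"     # CHA + AA + NIKAHIT + COENG + MO
expected = "\u1786\u17D2\u1798\u17B6\u17C6"  # CHA + COENG + MO + AA + NIKAHIT
print(move_subscripts_forward(typed) == expected)
```

Because the comparison is done on code points, it is not affected by how a given font renders either order.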

> Basically, if it fails to render, then it is incorrect and probably a typo.

The problem I’m having is that rendering seems to be very font-specific, and even application-specific; I’m working on a Mac and TextEdit sometimes renders the same fonts differently than Chrome!
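Since rendering varies by font and application, it can help to look at the underlying code points instead of the glyphs. The following is a small sketch using Python's standard `unicodedata` module; the helper names `code_points` and `describe` are made up for illustration.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return each character as a U+XXXX label, independent of any font."""
    return [f"U+{ord(ch):04X}" for ch in text]


def describe(text: str) -> str:
    """Join code points and their Unicode names in a 'x + y + z' layout."""
    return " + ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text
    )


sample = "\u1780\u17BB\u17C6\u17C2"  # the sequence KA + U + NIKAHIT + AE
print(code_points(sample))
print(describe(sample))
```

Two strings that look the same in one font but differ in this listing are different sequences, whatever TextEdit or Chrome happens to draw.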

The rest that don’t ever render correctly I’ll move up into the ??? section for more thinking.

Duplicate Supplementary Consonants

> but it doesn't render, I'm going to deduce that it is always wrong because what is the point of typing a character that is never seen?

I agree! But the problem, again, is that different fonts render differently. So I’ll take it that if I have a font that doesn’t render them both, it’s okay to deduplicate.
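A rough sketch of the deduplication step, assuming any misplaced subscript has already been moved next to its base consonant so that the duplicates are adjacent. The name `dedupe_subscripts` and the consonant range are assumptions for illustration only.

```python
import re

# COENG (U+17D2) followed by a consonant is one subscript; the backreference
# \1 matches immediate repeats of that exact same subscript.
DUPLICATE_SUBSCRIPT = re.compile(r"(\u17D2[\u1780-\u17A2])\1+")


def dedupe_subscripts(text: str) -> str:
    """Collapse runs of the same subscript consonant into a single copy."""
    return DUPLICATE_SUBSCRIPT.sub(r"\1", text)


doubled = "\u179F\u17D2\u1780\u17D2\u1780\u17BC"  # SA + COENG KA + COENG KA + UU
single = "\u179F\u17D2\u1780\u17BC"               # SA + COENG KA + UU
print(dedupe_subscripts(doubled) == single)
```

Different subscripts stacked on one consonant are left alone, since only exact repeats are collapsed.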

(As a side note, this one—ស្ត្ដា ( ស + ្ត + ្ដ + ា ) / ស្តា្ដ ( ស + ្ត + ា + ្ដ )—from the ??? section is listed there not here because the sub-consonants are ត and ដ!)

Visible Duplicates / Consonants Swaps / Invisible Duplicates / Reordered Syllables

Good news! Woo hoo!

Zero-Width (Non-)Joiners

My info (Unicode spec (PDF), page 382) says they are used to control ligatures in Muul/Muol/Mool–type fonts, and to keep muusikatoan or triisap from being subscripts (which also varies by font). My plan is to just ignore them.
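Ignoring the joiners could be as simple as deleting them before any other processing. A minimal sketch, where the function name `strip_joiners` is hypothetical:

```python
# Map ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D) to None
# so that str.translate deletes them.
JOINERS = {0x200C: None, 0x200D: None}


def strip_joiners(text: str) -> str:
    """Remove ZWNJ and ZWJ characters and leave everything else as-is."""
    return text.translate(JOINERS)


with_joiner = "\u1780\u200D\u17B6"  # KA + ZWJ + AA
print(strip_joiners(with_joiner) == "\u1780\u17B6")
```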

Split Vowels

So it sounds like merging these is a good thing. If someone were using a different keyboard they might type េ + ា and not realize they were still two separate characters because they look like  ោ.
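Merging the split vowel can be sketched as a simple replacement table. Only the េ + ា pair from this discussion is included; the table and function names are hypothetical, and other pairs would need to be checked before being added.

```python
# Two-character sequences mapped to the single combined vowel sign.
# Only the pair discussed here is listed.
SPLIT_VOWELS = {
    "\u17C1\u17B6": "\u17C4",  # VOWEL SIGN E + VOWEL SIGN AA -> VOWEL SIGN OO
}


def merge_split_vowels(text: str) -> str:
    """Replace each known two-part vowel sequence with its single character."""
    for parts, combined in SPLIT_VOWELS.items():
        text = text.replace(parts, combined)
    return text


typed_in_parts = "\u1780\u17C1\u17B6\u17C7"  # KA + E + AA + REAHMUK
single_vowel = "\u1780\u17C4\u17C7"          # KA + OO + REAHMUK
print(merge_split_vowels(typed_in_parts) == single_vowel)
```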

---

> Even if mediawiki were to run a script that corrected everything to be in an approved sequence, that would be only half of the battle. That script would also need to be run on any search phrase before the normal processes took over.

Ahh! You’ve hit on the crux of the problem—and there is a plan! I don’t actually plan to correct the text in the articles. That would be a never-ending task, since people would always be adding new content that could have differently ordered text. (It may be possible to have something like a spell-checker that corrects text as people type, but that’s far outside my area of expertise and there may be rare cases where you wouldn’t want to make those corrections.)

Instead, the plan is to re-order the text on the way into the search index. Both article text and search queries get the same treatment, so everything would match! (We do the same kind of thing for English, for example, just much less complicated: we lowercase and strip diacritics before putting things in the index, so Einstein matches ÉÎÑSTËÌŃ.)
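The English case can be illustrated with a toy example. This is only a sketch of the general idea (apply one normalization function to both indexed text and queries), not the actual search pipeline; the `fold` function and the dictionary standing in for an index are assumptions for illustration.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip diacritics, as a stand-in for index-time analysis."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Toy "index": normalized term -> titles containing it.
index = {fold("Einstein"): ["Albert Einstein"]}

# The query goes through the same function before lookup.
print(index.get(fold("ÉÎÑSTËÌŃ")))
```

The same shape would apply to Khmer: whatever re-ordering is chosen, running it on both sides means the stored text never has to be corrected.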

---

> Matt, the creator of kheng.info

I will definitely ping him and see if I can get him into the conversation. Thanks!

---

Whew! On Monday I’ll reorganize some of the samples based on our conversation and think harder about some of the ??? examples.

Any additional replies based on what I’ve tried to understand here would be great, too!

Thanks so much. This is definitely helpful, and I feel more confident that we are going in the right direction!

Reply to "???"