User:TJones (WMF)/Notes/Khmer Reordering/Examples

Below are examples of Khmer syllables I found in a sample of 5,000 Khmer Wikipedia articles, and that I have automatically re-ordered. They are divided into groups that have similar re-ordering. I'm looking for feedback on what is right and what is wrong and advice on how to fix the things that are wrong.

The groups are sorted by how much help I need understanding them. The ones that are the most confusing to me are listed first.

These are only a (diverse) sample of all the syllables I found and re-ordered. Many more examples are on the Khmer Reordering/Examples/More sub-page.

The columns are:


 * rewritten, the re-ordered version of the syllable, expanded out so all the elements are visible.
 * original, the syllable as found on Khmer Wikipedia.
 * context, a selection of text containing the original syllable.
   * The original syllable is highlighted in red. Finding and highlighting the original syllable was done automatically, so there may be errors.
   * The entire context is a link to Khmer Wikipedia, which should bring up a link to the original article containing the text. Of course, there may be no result because the original article has changed since I took the sample.
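For reviewers who want to see exactly which elements make up a syllable, one option is to list its code points. This is only a sketch of one way to inspect a syllable, not necessarily how the "rewritten" column was produced; the helper name `expand` and the sample syllable are hypothetical.

```python
import unicodedata

def expand(syllable: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in the syllable."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in syllable
    ]

# Hypothetical sample: KA + COENG + RO + vowel sign OO
for line in expand("\u1780\u17D2\u179A\u17C4"):
    print(line)
```

Listing code points this way makes invisible elements, such as zero-width characters or duplicated diacritics, easy to spot.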

???
These syllables don't actually have a lot in common other than they are confusing to me and I don't know what to make of them. Perhaps these are not actually single syllables and I have found incorrect syllable boundaries, or they are typing mistakes in the original text, or something else is going on. Any ideas on how to treat these correctly would be appreciated!

Update: After speaker review, I've split this table into three. The first has the one that is still confusing (it has both ត and ដ as subscript consonants, which look the same as subscripts). The second table has the ones that are split into syllables incorrectly because of typos, so I know I need to work on those. The last table has the ones that look funny to me, but are probably reasonably re-ordered.

Questionably Reordered Syllables
These seem to be in the correct order according to the rules I have found, but they look different in all or most fonts.

These usually include ះ or ្រ (though the first few include ្ស, ្ឈ, and ្យ). My best guess is that I have found incorrect syllable boundaries, but I don't know what the right thing to do is.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, I've split the table into two. The first has syllable boundary errors, as in the ??? section above, which I know I need to work on. The second has the ones where the sub-consonant is after the vowel, and even though it renders differently for me, it is probably reasonable to re-order them.

Visible Duplicates
These multiple vowels and other diacritics always show up in all the fonts I have tried. My understanding is that each syllable should have only one dependent vowel. These have multiple dependent vowels (and one has duplicated ះ). I don't think they are mistakes because the duplicates are easy to see when typing. Maybe they look correct using a font or operating system I don't have.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, these are probably reasonably re-ordered.

Duplicate Subscript Consonants
Depending on the font, these duplicates are sometimes visible, sometimes not. So, I think they are rewritten correctly, but I want to make sure.

Update: After speaker review, these are probably reasonably re-ordered.

Original Is More Common
These look the same or very similar when rewritten, but the rewritten form is much less common (in my sample). Some of these have hundreds more instances of the "original" form than the "rewritten" form. Others appear 3 or 4 times, but only as the "original" form. This makes me worry that there is something wrong with the way I'm re-ordering them, though I think they are correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, these are probably reasonably re-ordered.

Consonant Swaps
These all have ្រ before another subscript consonant. As far as I can tell, ្រ should always be the third consonant if there are three consonants. In some fonts, the original form doesn't render properly, so I think these are correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, these are probably reasonably re-ordered.
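The swap described in this section can be sketched in Python as follows. This is a minimal illustration, assuming a subscript consonant is encoded as COENG (U+17D2) followed by a consonant in the range U+1780 to U+17A2; the function name `move_coeng_ro_last` is hypothetical, and the real re-ordering rules may handle more cases.

```python
import re

COENG = "\u17D2"  # KHMER SIGN COENG, marks the next consonant as subscript
RO = "\u179A"     # KHMER LETTER RO

# Subscript RO immediately followed by another subscript consonant
# (any Khmer consonant except RO itself).
RO_BEFORE_SUBSCRIPT = re.compile(
    f"({COENG}{RO})({COENG}[\u1780-\u1799\u179B-\u17A2])"
)

def move_coeng_ro_last(text: str) -> str:
    """Move a subscript RO so that it follows the other subscript consonant."""
    return RO_BEFORE_SUBSCRIPT.sub(r"\2\1", text)

# Hypothetical sample: SA + subscript RO + subscript TA
original = "\u179F\u17D2\u179A\u17D2\u178F"
rewritten = move_coeng_ro_last(original)
print(rewritten == "\u179F\u17D2\u178F\u17D2\u179A")
```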

Zero-Width Spaces & (Non-)Joiners
These have U+200B (zero-width space [ZWSP]), U+200C (zero-width non-joiner, [ZWNJ]), or U+200D (zero-width joiner [ZWJ]) in them, which I believe is intended to change the rendering (but not the meaning) of diacritics or other elements. The rewritten form here isn't necessarily better, but I think it is the form that should be indexed for search.

Update: After speaker review, these are probably reasonably re-ordered. (They may be typos or they may be intended to control ligatures, but either way, the zero-width elements should not affect meaning, so they should be ignored—especially in the cases where they don't change the meaning.)
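Ignoring these three characters for indexing could look like the sketch below. It simply deletes U+200B, U+200C, and U+200D wherever they occur; the function name `strip_zero_width` is hypothetical, and this is not necessarily how the actual rewriting is implemented.

```python
import re

# ZWSP (U+200B), ZWNJ (U+200C), ZWJ (U+200D)
ZERO_WIDTH = re.compile("[\u200B\u200C\u200D]")

def strip_zero_width(text: str) -> str:
    """Remove zero-width space, non-joiner, and joiner characters."""
    return ZERO_WIDTH.sub("", text)

# Hypothetical sample: KA + ZWNJ + vowel sign AA
print(strip_zero_width("\u1780\u200C\u17B6") == "\u1780\u17B6")
```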

Soft Hyphens
NEW! The ICU tokenizer for Khmer ignores soft hyphens (U+00AD), so we should too. These all seem reasonable.
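A minimal way to drop soft hyphens before indexing is shown below. It is only a sketch using Python's built-in `str.translate`; the function name `strip_soft_hyphens` is hypothetical.

```python
# Map U+00AD (SOFT HYPHEN) to None so str.translate deletes it.
SOFT_HYPHEN_TABLE = {0x00AD: None}

def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen from the text."""
    return text.translate(SOFT_HYPHEN_TABLE)

# Hypothetical sample: KA + soft hyphen + KHA
print(strip_soft_hyphens("\u1780\u00AD\u1781") == "\u1780\u1781")
```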

Split Vowels
These have េ +  ា or  េ +  ី (or   ី +  េ) instead of  ោ and  ើ. Since they look the same, I assume that the single vowel form is correct. In some fonts   ី +  េ does not render properly, so I think swapping them is correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, these are probably reasonably re-ordered.
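The replacements described in this section can be sketched as a small lookup table. The code points are the standard Unicode ones (េ is U+17C1, ា is U+17B6, ី is U+17B8, ោ is U+17C4, ើ is U+17BE); the function name `merge_split_vowels` is hypothetical, and the sketch only handles the pairs when they are directly adjacent.

```python
# Split vowel sequence -> single precomposed vowel sign
SPLIT_VOWELS = {
    "\u17C1\u17B6": "\u17C4",  # E + AA -> OO
    "\u17C1\u17B8": "\u17BE",  # E + II -> OE
    "\u17B8\u17C1": "\u17BE",  # II + E -> OE
}

def merge_split_vowels(text: str) -> str:
    """Replace adjacent split vowel pairs with the single vowel sign."""
    for split, single in SPLIT_VOWELS.items():
        text = text.replace(split, single)
    return text

# Hypothetical sample: KA + E + AA becomes KA + OO
print(merge_split_vowels("\u1780\u17C1\u17B6") == "\u1780\u17C4")
```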

Invisible Duplicates
In many fonts I looked at, these multiple vowels, subscript consonants, or other multiple diacritics render only once, so I take these to be mistakes that should be de-duplicated.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Update: After speaker review, these are probably reasonably re-ordered.
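De-duplication of this kind can be sketched with two regular expressions. This is an assumption-laden illustration: it collapses only immediate repeats, treats U+17B6 to U+17D1 as the vowels and signs of interest, treats COENG (U+17D2) plus a consonant as a subscript consonant, and makes no attempt to tell visible duplicates from invisible ones. The function name `dedupe_marks` is hypothetical.

```python
import re

# An immediately repeated dependent vowel or sign (U+17B6..U+17D1)
REPEATED_MARK = re.compile(r"([\u17B6-\u17D1])\1+")
# An immediately repeated subscript consonant (COENG + consonant)
REPEATED_SUBSCRIPT = re.compile(r"(\u17D2[\u1780-\u17A2])\1+")

def dedupe_marks(text: str) -> str:
    """Collapse immediately repeated vowels, signs, and subscript consonants."""
    text = REPEATED_SUBSCRIPT.sub(r"\1", text)
    return REPEATED_MARK.sub(r"\1", text)

# Hypothetical sample: KA + AA + AA becomes KA + AA
print(dedupe_marks("\u1780\u17B6\u17B6") == "\u1780\u17B6")
```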

Reordered Syllables
These seem to be reasonably reordered. These are the ones I am most confident in because they always look the same, or the original renders incorrectly in certain fonts. This is the largest group, but I hope these are easy to review because they are mostly correct!

There are a lot of additional samples like these on the Khmer Reordering/Examples/More page. Part 1, Part 2, Part 3.

Update: After speaker review, these are probably reasonably re-ordered.