User:TJones (WMF)/Notes/Khmer Reordering/Examples

Below are examples of Khmer syllables that I found in a sample of 5,000 Khmer Wikipedia articles and automatically re-ordered. They are divided into groups with similar re-ordering. I'm looking for feedback on what is right and what is wrong, and for advice on how to fix the things that are wrong.

The groups are sorted by how much help I need understanding them. The ones that are the most confusing to me are listed first.

These are only a (diverse) sample of all the syllables I found and re-ordered. Many more examples are on the Khmer Reordering/Examples/More sub-page.

The columns are:


 * rewritten, the re-ordered version of the syllable, expanded out so all the elements are visible.
 * original, the syllable as found on Khmer Wikipedia.
 * context, a selection of text containing the original syllable.
   * The original syllable is highlighted in red. Finding and highlighting the original syllable was done automatically, so there may be errors.
   * The entire context is a link to Khmer Wikipedia, which should bring up a link to the original article containing the text. Of course, there may be no result because the original article has changed since I took the sample.

???
These syllables don't have much in common other than that they confuse me and I don't know what to make of them. Perhaps they are not actually single syllables and I have found incorrect syllable boundaries, or they are typing mistakes in the original text, or something else is going on. Any ideas on how to treat these correctly would be appreciated!

Questionably Reordered Syllables
These seem to be in the correct order according to the rules I have found, but the rewritten form looks different from the original in all or most fonts.

These usually include ះ or ្រ (though the first few include ្ស, ្ឈ, and ្យ). My best guess is that I have found incorrect syllable boundaries, but I don't know what the right thing to do is.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Visible Duplicates
These multiple vowels and other diacritics always show up in all the fonts I have tried. My understanding is that each syllable should have only one dependent vowel. These have multiple dependent vowels (and one has duplicated ះ). I don't think they are mistakes because the duplicates are easy to see when typing. Maybe they look correct using a font or operating system I don't have.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Duplicate Supplementary Consonants
Depending on the font, these duplicates are sometimes visible, sometimes not. So, I think they are rewritten correctly, but I want to make sure.

Original Is More Common
These look the same or very similar when rewritten, but the rewritten form is much less common in my sample. Some of these have hundreds more instances of the "original" form than the "rewritten" form. Others appear only 3 or 4 times, and only as the "original" form. This makes me worry that there is something wrong with the way I'm re-ordering them, though I think they are correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.

Consonant Swaps
These all have ្រ before another supplementary consonant. As far as I can tell, ្រ should always be the third consonant if there are three consonants. In some fonts, the original form doesn't render properly, so I think these are correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.
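The swap described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual re-ordering code: the function name `move_coeng_ro_last` is hypothetical, and the sketch assumes that a supplementary consonant is encoded as U+17D2 (coeng) followed by a consonant in the range U+1780 to U+17A2.

```python
import re

COENG = "\u17D2"  # KHMER SIGN COENG: makes the following consonant a subscript
RO = "\u179A"     # KHMER LETTER RO

# Subscript RO immediately followed by a subscript of any other consonant.
# The character class covers U+1780..U+17A2 with RO itself left out.
_RO_FIRST = re.compile(
    "(" + COENG + RO + ")(" + COENG + "[\u1780-\u1799\u179B-\u17A2])"
)

def move_coeng_ro_last(text: str) -> str:
    """Move subscript RO after the other subscript consonant (hypothetical helper)."""
    return _RO_FIRST.sub(r"\2\1", text)

# SA + coeng RO + coeng TA + II  becomes  SA + coeng TA + coeng RO + II
before = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"
after = move_coeng_ro_last(before)
```

Text that already has ្រ in last position is left unchanged, so the function is safe to apply more than once.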

Zero-Width Spaces & (Non-)Joiners
These have U+200B (zero-width space [ZWSP]), U+200C (zero-width non-joiner [ZWNJ]), or U+200D (zero-width joiner [ZWJ]) in them, which I believe is intended to change the rendering (but not the meaning) of diacritics or other elements. The rewritten form here isn't necessarily better, but I think it is the form that should be indexed for search.
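As a rough sketch of what a search-oriented normalization could look like, the snippet below simply drops the three characters named above. That the rewritten form removes them outright (rather than moving them) is an assumption, and `strip_zero_width` is a hypothetical name.

```python
# Map each zero-width code point to None so str.translate deletes it.
ZERO_WIDTH = {
    0x200B: None,  # ZERO WIDTH SPACE (ZWSP)
    0x200C: None,  # ZERO WIDTH NON-JOINER (ZWNJ)
    0x200D: None,  # ZERO WIDTH JOINER (ZWJ)
}

def strip_zero_width(text: str) -> str:
    """Remove ZWSP, ZWNJ, and ZWJ from text (hypothetical helper)."""
    return text.translate(ZERO_WIDTH)
```

Because these characters affect only rendering, removing them at index time should let a query match text typed with or without them.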

Split Vowels
These have េ + ា or េ + ី (or ី + េ) instead of ោ and ើ. Since they look the same, I assume that the single vowel form is correct. In some fonts ី + េ does not render properly, so I think swapping them is correct.

There are additional samples like these on the Khmer Reordering/Examples/More page.
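The three replacements described in this section can be written as a small lookup table. This is a minimal sketch, assuming the split parts are directly adjacent in the text; `merge_split_vowels` is a hypothetical name.

```python
import re

E = "\u17C1"   # KHMER VOWEL SIGN E
AA = "\u17B6"  # KHMER VOWEL SIGN AA
II = "\u17B8"  # KHMER VOWEL SIGN II
OO = "\u17C4"  # KHMER VOWEL SIGN OO
OE = "\u17BE"  # KHMER VOWEL SIGN OE

# Two-character sequences and the single vowel each one is rewritten to.
_SPLIT_VOWELS = {
    E + AA: OO,
    E + II: OE,
    II + E: OE,  # same parts in the other order
}

_SPLIT_RE = re.compile("|".join(re.escape(seq) for seq in _SPLIT_VOWELS))

def merge_split_vowels(text: str) -> str:
    """Replace split two-part vowels with the single vowel sign (hypothetical helper)."""
    return _SPLIT_RE.sub(lambda m: _SPLIT_VOWELS[m.group(0)], text)
```

Keeping the pairs in a dictionary makes it easy to add or remove a mapping if review shows that one of them is wrong.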

Invisible Duplicates
In many fonts I looked at, these multiple vowels, supplementary consonants, or other multiple diacritics render only once, so I take these to be mistakes that should be de-duplicated.

There are additional samples like these on the Khmer Reordering/Examples/More page.
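De-duplication of this kind can be sketched as a single regular expression that collapses an immediately repeated mark. Which marks belong in the pattern is an assumption here: the sketch uses dependent vowels and signs in U+17B6 to U+17D1 plus U+17DD, and supplementary consonants written as U+17D2 followed by a consonant. It does not try to tell these cases apart from the Visible Duplicates group, and `collapse_repeats` is a hypothetical name.

```python
import re

# Assumed set of marks: dependent vowels and signs, plus U+17DD.
_MARK = "[\u17B6-\u17D1\u17DD]"
# A supplementary consonant: coeng (U+17D2) followed by a consonant.
_SUBSCRIPT = "\u17D2[\u1780-\u17A2]"

# One mark or subscript, followed by one or more exact copies of itself.
_REPEATED = re.compile("(" + _SUBSCRIPT + "|" + _MARK + ")\\1+")

def collapse_repeats(text: str) -> str:
    """Keep only the first of several identical adjacent marks (hypothetical helper)."""
    return _REPEATED.sub(r"\1", text)
```

Only exact adjacent repeats are collapsed; two different vowels next to each other are left alone.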

Reordered Syllables
These seem to be reasonably reordered. These are the ones I am most confident in because they always look the same, or the original renders incorrectly in certain fonts. This is the largest group, but I hope these are easy to review because they are mostly correct!

There are many additional samples like these on the Khmer Reordering/Examples/More page: Part 1, Part 2, Part 3.