User:TJones (WMF)/Notes/Khmer Reordering

September 2019 — See TJones_(WMF)/Notes for other projects. See also T185721.

Background
Khmer, also called Cambodian, is the official language of Cambodia. The Khmer script is a syllabary closely related to Thai and Lao and more distantly to all the Brahmic scripts.

Khmer is written left to right, and often without spaces between words. That's the easy part! (Though it's still hard.)

Khmer orthographic syllables[*] are built around a "base character" that represents a consonant and a vowel. Up to two additional "supplementary" consonants may be added to the onset of the syllable by stacking them underneath the base character (usually), and the vowel of the base character can be changed by adding other characters to the base character. Other diacritics may also be added to alter the pronunciation of the consonant or vowel.

[* Orthographic syllables are a new concept for me, and they are not well documented on English Wikipedia, alas. The basic idea is that orthographic syllables represent the way words are broken down when writing, which may not line up 1-to-1 with they way they are broken down in speaking. This makes more sense in some of the Brahmic scripts—which includes Khmer—than it does in English.]

The various elements that glom on to the base character can attach themselves below, above, to the left, or to the right of the base character, and sometimes multiple elements can stack, especially above or below the base character. Some supplementary consonants can also go to one side of the base character, and some of the vowel diacritics have two parts—one to the left, and the other above or to the right. Some diacritics change shape or location in the presence of other diacritics—for example, if both would normally go on top of the base character—presumably to keep things from getting too crowded.

Different fonts and different operating systems support typing the code points of an orthographic syllable in different orders, in that they will render the resulting syllable the same (or very nearly the same). This happens because—to simplify a bit—there isn't really an obvious linear order to elements that glom on top, to the left, and below the base character.

The Unicode Standard sets out a preferred order, which corresponds to the order the elements are spoken (though there seem to still be some elements that are ordered arbitrarily). Incorrect ordering should result in glyphs with a dotted circle (◌) showing that they aren't combining correctly, but many fonts, applications, and OSes will still render non-canonical orders perfectly fine, or at least reasonably well, and some don't show the dotted circle even when they render poorly.

Another issue is that for some of the diacritics, multiple copies will render directly on top of each other so that you can't see that there are multiple copies of the diacritic. This can apply to vowels, supplementary consonants, and other diacritics.

All of this variability causes great difficulty in search, because two words that look exactly the same could underlyingly be composed of very different sequences of code points.

A similar but much smaller scale problem in English is that é can be either a single character or two characters (e + ´) composed together—but turned up to 11! A more analogous example would be if the character sequences strap, srtap, satrp, and straaaaap all looked identical.

Khmer Examples
As an example, the syllable ង្រ្កា consists of a base character ង, plus the supplementary consonants ្រ and  ្ក, and the vowel ា. In terms of rendering  ្រ goes to the left,  ្ក goes below, and ា goes to the right. Depending on how smart and how forgiving the font you are using is, the elements can be re-ordered to get the same or nearly the same final visual form:


 * ង្រ្កា ( ង + ្រ + ្ក + ា )
 * ង្រា្ក ( ង + ្រ + ា + ្ក )
 * ង្ក្រា ( ង + ្ក + ្រ + ា )

Another example, ញ៉ាំ, includes the diacritic ៉, which can be rendered in this context similar to  ុ, which can be used instead to generate a similar looking result. ុ is also more flexible in its ordering than  ៉, so there are up to seven variants that look approximately the same (depending on your font, OS, and application support):


 * ញ៉ាំ ( ញ + ៉ + ា + ំ )
 * ញុាំ ( ញ + ុ + ា + ំ )
 * ញុំា ( ញ + ុ + ំ + ា )
 * ញាុំ ( ញ + ា + ុ + ំ )
 * ញាំុ ( ញ + ា + ំ + ុ )
 * ញំុា ( ញ + ំ + ុ + ា )
 * ញំាុ ( ញ + ំ + ា + ុ )

As mentioned before, some of the vowel diacritics are multi-part, with a component to the left and another to the right or above, such as ើ   or  ោ. The components also exist as independent pieces, and can be combined as such (as always depending on font support):


 * កើ ( ក + ើ )
 * កេី ( ក + េ + ី )

or:


 * កោ ( ក + ោ )
 * កេា ( ក + េ + ា)

The most extreme case of duplicate diacritics I found in my sample include fourteen copies of ំ on top of each other. Depending on the font, it can be rendered as a barely noticeable thickening of the lines, which is almost impossible to see (and definitely easy to miss) in a small text size.


 * តិំ ( ត + ិ + ំ )
 * តិំំំំំំំំំំំំំំ ( ត + ិ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ )

The Plan
The goal, then, is to create an Elasticsearch plugin (or other config) that can re-order Khmer syllables into a standard order before indexing.

To make it easier for me and for Khmer speakers to evaluate the results, my initial goal is to re-order syllables into the correct canonical order, though it may be necessary to back off to a standard order that is not always correct, but always consistent.

Several people have suggested simply "alphabetizing" the syllable elements to give them a standard order. Readers would never see the re-ordered syllables, since they would only be used internally as part of the index. I thought of doing that, too, and it would definitely work for simpler syllables. For the more complex syllables, I think it could cause problems. The elements of the syllable can have structure of their own. For example, the supplementary consonant ្ក is made up of a special symbol, "coeng" (  ្), plus the full consonant ក. I also found at least one case where three consonants seem to be able to appear in different orders: ក្ម្ស (kms-) vs. ក្ស្ម (ksm-).

(Update from the future: There is also an issue of determining syllable boundaries, which turns out to be a problem with some typos. Cleverer processing will make this more accurate.)

The Prototype
I decided that my first task should be to build a prototype command line tool to parse out Khmer syllables and re-order them. This make it much easier to iterate on the re-ordering algorithm, debug, etc. It also puts off any problems that might crop up with Elasticsearch tokenizing while I figure out the re-ordering.

I did find some existing repositories on GitHub that do "Khmer re-ordering". However, the vast majority of them were copies of a library to re-order elements for display as a font, and the one that works on orthographic syllables assumes much cleaner data than what I'm seeing.

With a sample of 5,000 Khmer Wikipedia articles, 5,000 Wiktionary entries, and some guidelines on the theoretical order of syllable elements (see Some References, below), I set out to find and re-order the syllables of the text.

The Algorithm
My current algorithm is as follows (parts of it are driven by the references I've found, and parts of it are data-driven, so it doesn't cover all possible cases, but it covers everything I've found so far, except for the syllable boundary problem):


 * Define useful character classes
 * consonants = ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ឝ ឞ ស ហ ឡ អ [U+1780–U+17A2]
 * ro = រ [U+179A]
 * independent vowels = ឣ ឤ ឥ ឦ ឧ ឨ ឩ ឪ ឫ ឬ ឭ ឮ ឯ ឰ ឱ ឲ ឳ [U+17A3–U+17B3]
 * The first two are deprecated and shouldn't occur after regularization
 * dependent vowels = ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ [U+17B6–U+17C5]
 * coeng = ្  [U+17D2]
 * diacritics = ំ  ះ  ៈ  ៉  ៊  ់  ៌  ៍  ៎  ៏  ័  ៑ [U+17C6–U+17D1 U+17DD]
 * register shifters = [U+17C9 U+17CA]
 * robat = [U+17CC]
 * non-spacing diacritics = [U+17C6 U+17CB U+17CD–U+17D1 U+17DD]
 * spacing diacritics = [U+17C7 U+17C8]
 * zero-width elements = zero width space (ZWSP), zero width non-joiner (ZWNJ), zero-width joiner (ZWJ), soft-hyphen (SHY), invisible separator (InvSep) [U+200B–U+200D U+00AD U+2063]


 * Regularize the text
 * replace obsolete ligature ឨ (U+17A8) with ឧក (U+17A7 U+1780)
 * replace deprecated independent vowel ឣ (U+17A3) with អ (U+17A2)
 * replace deprecated independent vowel digraph ឤ (U+17A4) with អា (U+17A2 U+17B6)
 * replace ឲ (U+17B2) as a variant of ឱ (U+17B1)
 * replace deprecated trigraph ៘ (U+17D8) with ។ល។ (U+17D4 U+179B U+17D4)
 * delete non-visible inherent vowels (឴) (U+17B4) and (឵) (U+17B5) (used for transliteration)
 * replace obsolete ATTHACAN ៝ (U+17DD) with VIRIAM  ៑ (U+17D1)
 * replace deprecated BATHAMASAT ៓ (U+17D3) with NIKAHIT  ំ (U+17C6) as a likely error


 * Find syllables
 * A syllable is a consonant or independent vowel followed by any sequence of coeng+consonant clusters, coeng+independent vowel clusters, dependent vowels, diacritics, or zero-width elements.
 * KNOWN BUG: This works well for reasonably formatted text, but when there are certain typos—in particular a missing base consonant in the next syllable—this can gather up too many vowels. I haven't decided yet whether to try to detect the excess vowels or just let typos do silly things, as typos often do. (We want to catch duplicate vowels and split vowels, so we do need to allow multiple vowels in some cases.)


 * Re-order each syllables
 * remove zero-width elements
 * save the first character (consonant or independent vowel) as the base character
 * split the remainder of the syllable into chunks
 * coeng + (consonant or independent vowel) + register shifter
 * coeng + (consonant or independent vowel)
 * everything else splits into one-character chunks
 * collect the chunks, in their original order, into the following groups:
 * chunks starting with a coeng
 * dependent vowels
 * register shifters
 * robats
 * non-spacing diacritics
 * spacing diacritics
 * [note that there shouldn't be any leftovers]
 * de-duplicate each group
 * if the same chunk occurs multiple times in a row, reduce it to one instance
 * repair split vowels
 * replace េ +  ី (U+17C1 U+17B8) with  ើ (U+17BE)
 * replace ី +  េ (U+17B8 U+17C1) with  ើ (U+17BE)
 * replace េ +  ា (U+17C1 U+17B6) with  ោ (U+17C4)
 * re-order supplementary consonants (ro is always last)
 * if coeng + ro comes before coeng + consonant, swap them
 * join the elements: base character + register shifters + robat + coeng chunks + dependent vowels + spacing diacritics + non-spacing diacritics
 * note that some register shifters could be in the coeng chunks

Early Results
I've broken the results up into several categories for review by readers of Khmer. I've had one round of review so far, and things look pretty good. There are lots of re-ordered syllables that look the same or very nearly the same[*] as the originals.

[* I've slowly relaxed my criteria for "looking the same" as I've gained more experience with Khmer text. I installed a couple dozen Khmer fonts, and I've found that different ones are more or less forgiving of different kinds of non-canonical orderings. So, if things look the same or very nearly so in the more forgiving fonts, I consider them to be the same.]

In order of decreasing confidence, the groups are:


 * Syllables that were relatively simple and that looked the same when their elements were re-ordered into the canonical order.
 * Syllables with "invisible duplicates"—those with multiple copies of elements that in most fonts don't show up as duplicates.
 * Split vowels: syllables where the multipart vowels are entered in parts instead of as a single character.
 * Syllables with zero-width spaces, zero-width non-joiners, and zero-width joiners—these are supposed to control ligatures and shouldn't have any effect on meaning.
 * Syllables with swapped supplementary consonants—these all have the supplementary consonant ្រ ("ro") in them. Ro always comes third in a set of three consonants, but it is written to the left of the base character, so it is often typed second.
 * Syllables where the frequency of the non-canonical order is much higher than that of the canonical order. These all seem reasonable by looking at them, but the inverted frequency was worrisome, so I pulled them out for special consideration.
 * Syllables with duplicate supplementary consonants. These duplicates are more often visible than not (using my font collection), so I wasn't as comfortable putting them in the "invisible duplicates" section.
 * Other visible duplicates—these have duplicate elements that show up in almost all of my fonts.

All of the above groups, after speaker review, are generally probably reasonably re-ordered.

The difficult groups include:


 * "Questionably Reordered Syllables"—these seem to follow the rules, but often looked really different before and after re-ordering, or rendered poorly in all or most fonts even after re-ordering.
 * "???"—these were ones that were so confusing I didn't know how to categorize them.

After speaker review, these mostly fell into two groups:


 * The majority of the "Questionably Reordered Syllables" and a minority of the "???" syllables were actually re-ordered okay.
 * The majority of the "???" syllables and a minority of the "Questionably Reordered Syllables"—the ones that looked the worst—have syllable boundary errors, caused by typos in the original text.

(There's one example left in the "???" category, which has both ត ("ta") and ដ ("da") as subscript consonants. They look the same as subscripts, but they are technically different letters, and in this case they both always show up. Maybe in some font somewhere they overlay each other and you can only see one of them. Not sure—but if I escape with only one really unclear example unresolved, that'll be pretty good!)

For the syllables with boundary errors, after the correct syllable there are additional unattached dependent vowels. These seem to be missing their base consonant. The best approach is probably to split these extra vowels off into a separate syllable. It's a little complicated because "split vowels" can have two vowel elements, and both visible and invisible duplicates can have the same element repeated.

Alternatively, we could say that typos are typos and they mess things up, and whatever happens, happens. I’d prefer to fix things when I can, but I may have more limitations in the final implementation in Elasticsearch than I have in the prototype.

Syllable Stats
From the 5K Wikipedia article sample:
 * 4,165,057 total syllables
 * 7,172 changed syllables (0.17% of total)
 * 28 syllable boundary errors (0.00067% of total, 0.39% of changed)
 * 19 distinct syllables

From the 5K Wiktionary item sample:


 * 178,392 total syllables
 * 260 changed syllables (0.15% of total)
 * 1 syllable boundary error (0.00056% of total, 0.38% of changed)
 * 1 distinct syllable, obviously

These results are surprisingly consistent, compared to some of the stats I've seen when doing language analyzer analysis. I think that is because this is based on how people generally type words, rather than on the differences in style and content between Wikipedia and Wiktionary.

It's overall good that less than 0.2% of syllables need to be re-ordered and very good that less than 0.4% of syllables that we try to re-order (and approximately 0% of all syllables) have boundary errors.

Quick Review of Khmer Wikipedia Queries
I pulled 90 days'–worth of queries from Khmer Wikipedia, giving 27,747 queries total. I did a quick analysis of the queries since I had the data in front of me.

Slightly less than half of all queries contain any Khmer characters. Slightly more than half were primarily Latin characters, with the usual preponderance of porn-related search terms (primarily xxx or some variant of the xnxx web site name).


 * 27,747 queries
 * 13,624 with some Khmer characters (49.1%)
 * 13,904 Mostly Latin (50.1%)
 * 5,447 xnx (xnxx, xnx, xxnx, etc.) (19.6%)
 * 1,855 xxx (6.7%)

A handful of queries were mostly numbers, symbols, or emojis:


 * 176 mostly numerals
 * 62 all punctuation
 * 12 mostly emojis

And there was the usual small collection of queries in other scripts:


 * 52 Thai
 * 28 Chinese
 * 23 Cyrillic
 * 8 Korean
 * 3 Hebrew
 * 3 Greek
 * 2 Katakana
 * 2 Hiragana
 * 1 Lao
 * 1 Devanagari
 * 1 Bengali

Syllable Stats
I also processed the queries using my command line prototype...

From the 27K Wikipedia query sample:


 * 60,800 total syllables
 * 800 changed syllables (1.3% of total)
 * 3 syllable boundary errors (0.0049% of total, 0.38% of changed)
 * 3 distinct syllables

There are roughly 7 or 8 times the rate (1.3%) of syllables being re-ordered as in the Wikipedia and Wiktionary article text, and 8 or 9 times the rate of syllable boundary errors—though that is still only 3 total.

Other observations

 * There is the usual junk queries (especially repeated characters, like ឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥ or ,,,,,,,       zzzzzzzz), and at least one of the syllable boundary errors look like a junk query (as opposed to a typo).
 * The re-ordered syllables in the query text are generally similar to the ones in the article text, and the majority of them were ones I'd seen before (597/800, or 74.6%)—which makes me think that certain errors are reasonably common, but there is a long, long tail.
 * There were probably more unattached coeng characters in the queries, though.

Enter Elasticsearch
Khmer-language wikis use Elasticsearch's ICU tokenizer. Some documentation for it says that it detects syllables, but other documentation says it also has a dictionary to detect longer words.

Testing shows that it does have a dictionary. That's probably good for text with properly ordered Khmer elements, but might be bad for text with improperly ordered elements—the dictionary may not recognize all the variants, even if they look the same on-screen.

The easiest approach to implement would be to re-order elements after the tokenizer has tokenized them. However, improperly ordered elements could change the tokenization.

Next Steps

 * ✔︎ Get a sense of the scope of the typo-caused incorrect syllable boundary problem. The complexity of fixing it and the scope of the problem will help determine whether it's worth it to try to handle these cases.
 * DONE: The syllable boundary error rate is less than 0.001% of all syllables, so it isn't a huge concern. I will address it if it is easy to do so, but I won't worry about it too much, since these are caused by typos—which often lead to strange results—and we can fix 250x as many currently unfindable syllables.
 * ✔︎ Gather query data and check the prevalence of syllable re-ordering in queries. (I expect it will be a bit higher, as queries generally are less carefully/formally written.)
 * DONE: See notes above.
 * Test the effects of re-ordering on tokenization and matching of re-ordered words. I'll look at re-ordering using the command line prototype both before tokenization and after, and see how big a difference it makes on the results. (My guess is that pre-tokenization re-ordering will be much better, but it's complex enough that it's worth it to see how big a difference it makes.)

Some References
These are some references that have proven useful to me in trying to figure out Khmer syllables.


 * The order of components in Khmer orthographical syllables (PDF)
 * The Unicode Standard, Version 6.1—Southeast Asian Scripts (PDF), specifically Section 11.4.
 * Ordering rules for Khmer (PDF)
 * Report on the Current Status of United Nations Romanization Systems for Geographical Names—Khmer (PDF)
 * Khmer Romanization Table - Proposed Revision 2011 (PDF)
 * Khmer character notes
 * Inherent Vowels