User:TJones (WMF)/Notes/Myanmar Zawgyi Encoding Initial Survey

March/April 2018 — See TJones_(WMF)/Notes for other projects. See also T191535.

Summary
Before considering some form of Zawgyi detection and transliteration for Myanmar-language wikis, we should:


 * get a sense of the frequency of Zawgyi-encoded queries
 * get a sense of the accuracy of Google’s detection library on short (i.e., query-length) strings
 * evaluate available transliteration tools and transliteration complexity
 * maybe evaluate other detection tools that would be more convenient to implement (like TextCat)
 * evaluate detection and transliteration on non-Myanmar text, too

Fonts and Scripts
In the pre-Unicode days, a font called Zawgyi came to be widely used on Myanmar-language (aka Burmese) websites, and it is still the most popular font on Myanmar-language websites. It differs from the Unicode standard, but shares many code points, and includes additional characters to render ligatures and generally handle the complex text rendering needed for the Burmese script.

The Myanmar Wikipedia and Wiktionary both require Unicode for editing and reading, but can’t really enforce it for search.

However, it's possible, if not likely, that mywiki users know to use Unicode. It seems to be a known issue in the Myanmar online community, and there's a big notice on the Myanmar Wikipedia home page and it is listed prominently on the Myanmar Wiktionary, too. (Those are the only two currently open projects in Myanmar.) Both have links to conversion tools and fonts, operating system info, etc. So it's actually quite possible that mywiki and mywikt users are fairly Unicode-compliant already.

Tools and Implementation
Google’s Apache 2.0–licensed Myanmar Tools support detection of Zawgyi in several programming languages (C++, Java, Javascript, and Ruby at the moment). The Myanmar Tools READMEs point to additional libraries for transliteration, which differ by programming language.

The best way to implement the detection and transliteration is not entirely clear, since there are many options:


 * in Javascript in the search box itself
 * in PHP in CirrusSearch—though that may require a new port of the detector
 * in Java inside Elasticsearch, perhaps as a token filter (though I worry that a token filter, because it is working on individual tokens, could perform much more poorly, and could do silly things, like convert half a query but not the other half)

I’ve only looked briefly at the Javascript and Java implementations. They use a Markov model, which makes sense since the encodings have lots of overlapping code points, and the model file itself is small—only 25.4 KB—so it's quite reasonably sized and could be hosted locally.
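
The point about overlapping code points can be illustrated with a generic first-order character Markov classifier: when two encodings share characters, the sequences they form are what tell them apart. This is only a minimal sketch, not the Myanmar Tools model or its file format. The function names, the add-one smoothing, and the toy ASCII training strings are all assumptions made for illustration.

```python
import math
from collections import Counter

START, END = "\x02", "\x03"  # hypothetical string-boundary markers


def train(texts):
    """Count character bigrams (with boundary markers) in training texts."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = set()
    for text in texts:
        padded = START + text + END
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, alphabet


def log_prob(text, model, vocab_size):
    """Log-probability of text under a bigram model with add-one smoothing."""
    pair_counts, context_counts, _ = model
    padded = START + text + END
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) /
                          (context_counts[a] + vocab_size))
    return total


def score(text, model_a, model_b):
    """Positive when text looks more like model_a's data, negative for model_b."""
    vocab_size = len(model_a[2] | model_b[2])
    return log_prob(text, model_a, vocab_size) - log_prob(text, model_b, vocab_size)


# Toy stand-ins: both "encodings" use the same two characters,
# but in different orders.
model_a = train(["abab", "ababab"])
model_b = train(["aabb", "bbaa"])
```

Here `score("abab", model_a, model_b)` comes out positive and `score("aabb", model_a, model_b)` negative, even though both strings use exactly the same characters. A real detector would be trained on large Unicode and Zawgyi corpora rather than toy strings.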

We still need to think about additional use cases, too. For example, an editor might want to search for an exact string of a common word encoded in Zawgyi in order to find places where another editor made contributions in Zawgyi instead of Unicode. Is using quotes enough of an escape mechanism? Or do we only offer Did You Mean–like suggestions? Do we want something in the Completion Suggester, too, offering conversion and/or completion based on the conversion?

Obviously, if we want to move forward, we’d need to consult with the community, but it’s best to have some idea of the prevalence of Zawgyi-encoded searches first.

Queries
We can pull a reasonable sample of Myanmar Wikipedia and Wiktionary queries for testing. While the Burmese script is used for a few other languages, separating out the queries by script should give us a rough but reasonable estimate of how often and what kind of non-Burmese queries show up on Wikipedia and Wiktionary. Wiktionary seems to have more English head words than Burmese, and at least a few in other languages, like Thai.
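
Separating queries by script could be sketched roughly as follows. The bucket names and the function names are hypothetical; the code point ranges are the Unicode Myanmar, Myanmar Extended-A, and Myanmar Extended-B blocks. As far as I know, Zawgyi text reuses code points from the main Myanmar block, so this step only separates Burmese-script queries from everything else and says nothing about which encoding a query uses.

```python
# Unicode blocks that hold Burmese-script characters.
MYANMAR_RANGES = (
    (0x1000, 0x109F),  # Myanmar
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
)


def is_myanmar_char(ch):
    """True if ch falls in one of the Myanmar Unicode blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MYANMAR_RANGES)


def script_bucket(query):
    """Return 'myanmar', 'other', 'mixed', or 'none' for a query string.

    Digits, punctuation, and whitespace outside the Myanmar blocks are
    ignored, so a query made only of those lands in 'none'.
    """
    has_myanmar = False
    has_other = False
    for ch in query:
        if is_myanmar_char(ch):
            has_myanmar = True
        elif ch.isalpha():
            has_other = True
    if has_myanmar and has_other:
        return "mixed"
    if has_myanmar:
        return "myanmar"
    if has_other:
        return "other"
    return "none"
```

Counting the buckets over a query sample (for example with `collections.Counter`) would give the rough per-script breakdown described above. The 'mixed' bucket is the one worth reading by hand.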

Using the detection tools, we can also estimate how many Zawgyi-encoded queries there are.

Evaluation Plan
There are lots of online conversion tools. It should be possible to get a sense of how straightforward transliteration is by converting known Unicode text (i.e., a moderately sized sample of text from Myanmar Wikipedia) to Zawgyi and back in several converters and seeing how often they agree. 100% agreement indicates a straightforward conversion; much lower agreement indicates much more complexity.
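
The round-trip comparison could be scripted along these lines. This is a sketch under assumptions: each converter is represented by a hypothetical pair of Python callables, and wrapping real tools (online or otherwise) in such callables is left open. The stand-in converters at the bottom are toy ASCII functions, not real Zawgyi converters.

```python
from itertools import combinations


def round_trip_rate(samples, to_zawgyi, to_unicode):
    """Fraction of samples that survive Unicode -> Zawgyi -> Unicode unchanged."""
    unchanged = sum(1 for s in samples if to_unicode(to_zawgyi(s)) == s)
    return unchanged / len(samples)


def pairwise_agreement(samples, to_zawgyi_by_name):
    """Fraction of samples on which each pair of converters gives identical output."""
    rates = {}
    names = sorted(to_zawgyi_by_name)
    for name_a, name_b in combinations(names, 2):
        fn_a = to_zawgyi_by_name[name_a]
        fn_b = to_zawgyi_by_name[name_b]
        same = sum(1 for s in samples if fn_a(s) == fn_b(s))
        rates[(name_a, name_b)] = same / len(samples)
    return rates


# Toy stand-ins for two converters; "b" deliberately mishandles one character.
def toy_a_forward(s):
    return s.upper()


def toy_b_forward(s):
    return s.upper().replace("X", "Y")


def toy_back(s):
    return s.lower()


samples = ["abc", "xyz"]
agreement = pairwise_agreement(samples, {"a": toy_a_forward, "b": toy_b_forward})
```

With the toy data, converter "a" round-trips every sample, converter "b" round-trips only half of them, and the two agree on half of the forward conversions. Samples where converters disagree are the ones worth inspecting by hand to see what makes the conversion complex.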

With texts of known encoding in hand, we can estimate the accuracy of Zawgyi detection by pulling out phrases of similar length to queries from both Unicode and Zawgyi texts. This won’t be exact, because different kinds of phrases (like noun phrases) are probably more likely than others in queries, and that may skew how often obviously Zawgyi encodings appear, since particular words or morphemes may be more or less likely to be present. However, the accuracy on such a sample is a useful data point.
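
A rough sketch of that sampling and scoring step follows. The detector is a hypothetical callable that returns True for Zawgyi, and the snippet lengths are assumed to be character lengths taken from real queries. Slicing at arbitrary character offsets can cut a syllable in half, which real queries rarely do, so a more careful version would cut at syllable or phrase boundaries.

```python
import random


def sample_snippets(text, lengths, count, seed=0):
    """Draw `count` random substrings whose lengths are picked from `lengths`."""
    rng = random.Random(seed)
    snippets = []
    for _ in range(count):
        n = min(rng.choice(lengths), len(text))
        start = rng.randrange(len(text) - n + 1)
        snippets.append(text[start:start + n])
    return snippets


def detection_accuracy(unicode_snippets, zawgyi_snippets, is_zawgyi):
    """Fraction of snippets for which the detector gives the known-correct answer."""
    correct = sum(1 for s in unicode_snippets if not is_zawgyi(s))
    correct += sum(1 for s in zawgyi_snippets if is_zawgyi(s))
    return correct / (len(unicode_snippets) + len(zawgyi_snippets))
```

Reporting accuracy separately per snippet length would also show how quickly detection degrades as strings get shorter, which is the main worry for query-length input.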

Assuming the Zawgyi detection is reasonably accurate on the known Zawgyi and known Unicode texts, we can run it against the Burmese script query data and get a sense of how often Zawgyi-encoded text appears in Myanmar Wikipedia and Wiktionary queries. We should also run it against non–Burmese script queries (especially mixed-script queries, if any) and make sure there are no surprises there. Transliteration tools should also not affect non-Burmese scripts, but it wouldn’t hurt to check a bit; I’ve had some interesting surprises with other script conversion tools.

Considering TextCat
This is also possibly something that TextCat could handle. It would be easy to take Myanmar Wikipedia pages, convert them to Zawgyi, and use that as training and testing data. It would also be easy to see how often a TextCat implementation and Google's Myanmar Tools agree. I wouldn't be shocked if Google's were better, but the difference might be small enough that it isn't worth the extra implementation complexity. That said, there's a lot of overlap in the encodings, so I won't predict success at this time.
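
For reference, the general technique behind TextCat-style classifiers (ranked character n-gram profiles compared with an out-of-place distance) can be sketched as below. This is not the Wikimedia TextCat code or its model format. The function names, the profile size, and the toy ASCII training strings are assumptions for illustration only.

```python
from collections import Counter


def profile(text, max_n=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending frequency, breaking ties alphabetically for stability.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(sample_profile, category_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the category cost max_penalty."""
    total = 0
    for gram, rank in sample_profile.items():
        if gram in category_profile:
            total += abs(rank - category_profile[gram])
        else:
            total += max_penalty
    return total


def classify(text, category_profiles, top_k=300):
    """Return the name of the category whose profile is closest to the text's."""
    sample = profile(text, top_k=top_k)
    return min(sorted(category_profiles),
               key=lambda name: out_of_place(sample, category_profiles[name], top_k))


# Toy stand-ins for the two training corpora.
profiles = {
    "unicode": profile("abababababab"),
    "zawgyi": profile("cdcdcdcdcdcd"),
}
```

In the real experiment the two profiles would be built from Myanmar Wikipedia text and its Zawgyi conversion, and agreement with Myanmar Tools could be measured by running both classifiers over the same set of snippets.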

Using TextCat would be the same intervention as the Wrong Keyboard detection for Russian (T138958) or Hebrew (T155104)—though we don't have a concrete plan for implementing that yet, either. I think we should wait until after David's parser refactoring (T185108), which should make it easier and more logical to insert this kind of detection (whether with TextCat or with a PHP port of Google's detector if we go that way).