User:TJones (WMF)/Notes/Crimean Tatar Transliteration

May-July 2017 — See TJones_(WMF)/Notes for other projects. See also T23582.

Implementation
I picked up the Phabricator ticket (T23582) for Crimean Tatar transliteration in May 2017 and worked on it at the Vienna Hackathon, and as part of my 10% time since then. I've been re-implementing a not-quite-working transliteration module for Crimean Tatar (Latin to Cyrillic and Cyrillic to Latin) from 2010.

Part of the original implementation included a lot of exceptions (including names, acronyms, and a few general patterns). There were about 200 Cyrillic-to-Latin (C2L) exceptions and about 300 Latin-to-Cyrillic (L2C) exceptions, though most included three variants (lowercase, UPPERCASE, and Capitalized), and many were present in both directions. I consolidated the exceptions into a list of bi-directional mappings and refactored out the lower/UPPER/Caps variation; this is a bit less computationally efficient, but much easier on the human who has to maintain the exception list.
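The case refactoring can be sketched as a small helper that expands one lowercase exception pair into its three case variants at load time. This is a minimal illustration, not the module's actual code; only the köy/кой pair is taken from the exception list discussed below.

```python
def case_variants(latin, cyrillic):
    """Expand one lowercase exception pair into its
    lowercase, UPPERCASE, and Capitalized variants."""
    return [
        (latin.lower(), cyrillic.lower()),
        (latin.upper(), cyrillic.upper()),
        (latin.capitalize(), cyrillic.capitalize()),
    ]

# köy/кой is a real exception pair from the list below
print(case_variants("köy", "кой"))
# [('köy', 'кой'), ('KÖY', 'КОЙ'), ('Köy', 'Кой')]
```

Storing one pair and generating the variants keeps the maintained list a third of the size, at the cost of a little extra work when the table is built.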

There are also a number of fairly complicated regexes for converting common prefixes and suffixes, and for dealing with certain more difficult context-dependent characters in both directions. I consolidated the simple prefixes and suffixes into bi-directional mappings, again applying the automatic lower/UPPER/Caps variation, and created unidirectional mappings for the more complex context-dependent regexes.

For L2C, there were also a number of additional "clean up" regexes that apply after the transliteration is done, mostly to remove unneeded soft signs (Ь/ь) and make some adjustments to multi-word representations of numbers.
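The clean-up stage has the shape of an ordered list of (pattern, replacement) rules applied after the main transliteration pass. The two rules below are invented placeholders to show the mechanism; the real soft-sign and number-adjustment rules live in the module itself.

```python
import re

# Hypothetical post-processing rules; the real ones are in the module.
CLEANUPS = [
    (re.compile("ьь"), "ь"),        # placeholder: collapse doubled soft signs
    (re.compile("ь(?=ъ)"), ""),     # placeholder: drop soft sign before hard sign
]

def post_clean(text):
    """Apply the ordered clean-up rules after the main L2C pass."""
    for pattern, replacement in CLEANUPS:
        text = pattern.sub(replacement, text)
    return text
```

Keeping the clean-ups as an ordered data table, rather than ad hoc code, makes it easy for a maintainer to add or reorder rules.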

What is Subject to Transliteration
One of the design decisions I made was that only words that "look like" Crimean Tatar words are transliterated. That is, words containing Cyrillic or Latin letters that are not part of the Crimean Tatar Cyrillic or Latin alphabets are not transliterated. If they were, they would come out in a mixed alphabet.
 * For example, "Waterfront" (from the movie On the Waterfront, in the article about Elia Kazan) would be transliterated into Cyrillic as "Wатерфронт", with the English/Latin W still present.
 * Similarly, "Фернґейм" (the Ukrainian name for Ferngeym) would be transliterated into Latin as "Fernґyeym", with the Ukrainian/Cyrillic ґ still present.
This still isn't a perfect solution. In the article on Aşğabat, the Persian version of the name is given in Latin characters as "Aşq-ābād". Since hyphens separate tokens, this would be transliterated to Cyrillic as "Ашкъ-ābād", because the first half, "Aşq", doesn't have any non-Crimean-Tatar letters in it.
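The "looks like Crimean Tatar" check amounts to testing every letter of a word against the relevant alphabet. The alphabet strings below are my own reading of the Crimean Tatar Latin alphabet and the standard Cyrillic letter inventory (the Cyrillic digraphs гъ, къ, нъ, дж are built from these letters); the module's own definitions are authoritative.

```python
# Assumed alphabets for illustration; verify against the module.
CRH_LATIN = set("abcçdefgğhıijklmnñoöpqrsştuüvyz")
CRH_CYRILLIC = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

def transliterable(word, alphabet):
    """True if every letter of the word belongs to the given alphabet."""
    return all(ch.lower() in alphabet for ch in word if ch.isalpha())

print(transliterable("köy", CRH_LATIN))          # True
print(transliterable("Waterfront", CRH_LATIN))   # False: 'w' is not Crimean Tatar
print(transliterable("Фернґейм", CRH_CYRILLIC))  # False: Ukrainian 'ґ'
```

Words that fail the check are passed through untouched, which avoids the mixed-alphabet output shown in the examples above.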

Roman numerals are also not subject to transliteration. This may or may not be desirable, as it could affect some initials (I, V, X, C, D, M), but it seems to be present in some of the other transliteration modules, too.

Ideally, pronunciations, foreign titles of books or movies, names of people and places, etc, would be marked as -{not-for-transliteration}- anyway, so this shouldn't come up too often.

Testing
Once I got it working, being the big nerd that I am, I had to systematically test it on real data, of course.

On Types and Tokens
A lot of my analysis looks at types and tokens, and it's important to distinguish the two. Tokens refer to individual words, counted each time they appear. Types count all instances of a word as one thing. So, in the sentence, The brown dog jumped and the grey dog jumped., there are nine tokens (usually more or less "words"), but only six types (the, brown, dog, jumped, and, grey).
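The distinction can be shown in a couple of lines, using the example sentence above (note that "The" and "the" count as one type, so the count is done case-insensitively):

```python
sentence = "The brown dog jumped and the grey dog jumped"
tokens = sentence.lower().split()  # every word occurrence
types = set(tokens)                # distinct words only
print(len(tokens), len(types))     # 9 6
```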

Parallel Corpora
I was fortunate to find parallel Cyrillic and Latin corpora online, which is perfect for testing the effectiveness of the transliteration. I tokenized a relatively large sample text and reviewed the tokens to ensure alignment. There were 23,630 tokens (total words in the corpus), and 8,816 types (distinct words in the corpus).

I loaded up the list of parallel words (types) in Latin and Cyrillic, and applied the transliteration to them. I was then able to compare the automatically transliterated form with the form from the parallel corpus. (Unfortunately, I don't know whether the parallel forms there were transliterated automatically or manually, but it's a corpus of literature, so I assume there was some form of review to check for errors.)
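The comparison step is straightforward: run the transliterator over each word and collect the mismatches. Here `transliterate_c2l` is a stand-in stub for the real module, and its (deliberately wrong) output for кой is invented purely to demonstrate what a mismatch record looks like.

```python
def transliterate_c2l(word):
    """Placeholder for the real Cyrillic-to-Latin transliterator."""
    table = {"кой": "koy"}  # hypothetical (wrong) output, for illustration only
    return table.get(word, word)

# (Cyrillic, expected Latin) pairs as extracted from the parallel corpus
parallel = [("кой", "köy")]

mismatches = [(cyr, expected, got)
              for cyr, expected in parallel
              if (got := transliterate_c2l(cyr)) != expected]
print(mismatches)  # [('кой', 'köy', 'koy')]
```

The resulting (input, expected, got) triples are exactly what gets tallied into the token/type error counts below and extracted for speaker review.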

This corpus is primarily for automated review, with problem cases extracted for speaker review.

Wikipedia Corpus
I also extracted the text of 500 Crimean Tatar Wikipedia articles and tokenized them. There were 13,920 tokens (individual words) and 5,685 types (distinct words). As I understand it, the Crimean Tatar Wikipedia is currently generally written in the Latin alphabet, so most of the words are in the Latin script, though there are also various words in Greek, Cyrillic, Georgian, Armenian, Arabic, Devanagari, and Chinese.

I assume that at least some of the Cyrillic words are not Crimean Tatar, so if the transliteration is enabled, they will need to be marked as -{not-for-transliteration}-.

This corpus is primarily for speaker review.

Cyrillic to Latin (C2L) vs Parallel Corpus
Errors: After the C2L transformation, there were a small number of mismatches in the tokens extracted from the parallel corpus: 72 tokens / 55 types (out of 23,630 tokens / 8,816 types), so the transformation is generally very accurate (>99%). Exceptions: The exception list matched 52 types and 187 tokens. The most common of the exceptions, кой/köy, appeared 24 times, putting it just barely in the top 100 most common words in the corpus. So, it seems the exception list is useful.
 * 72/23,630 tokens = 0.30% of the text is transliterated incorrectly.
 * 55/8,816 types = 0.62% of individual words are transliterated incorrectly.
 * There is a higher percentage of types than tokens transliterated incorrectly because the more common types are done correctly.

Exceptions as Errors: A few exceptions on the list also showed up as C2L parallel transliteration errors. Possible reasons include errors in the parallel corpus (which may have been partially automatically generated), and errors in the exception list.

Speaker review notes: in the table below, what is the correct Latin transliteration of the Cyrillic? These should be reviewed by a speaker and corrected if needed.

Patterns of Errors: Of the 55 types that had transliteration errors, all of them involve ü/u, ö/o, or y. Clearly those are the hard letters to transliterate to. The list of exceptions is below. Parallel Cyrillic/Latin are the words as found in the parallel texts. Transliterated Latin is the result of automatically transliterating the Cyrillic.

''Speaker review notes: in the table below, what is the correct Latin transliteration of the Cyrillic? Are there any obvious general patterns we are not taking advantage of?'' Speaker review of the errors for corrections, a list of exceptions, or a better transliteration rule would be helpful.

Latin to Cyrillic (L2C) vs Parallel Corpus
Unfortunately, Latin to Cyrillic seems to be considerably more difficult.

Errors: After the L2C transformation, there were significantly more mismatches than with C2L in the tokens extracted from the parallel corpus: 1,811 tokens / 704 types (out of 23,630 tokens / 8,816 types), so the transformation is generally moderately accurate (>90%). Exceptions: The exception list matched 53 types and 263 tokens. The most common of the exceptions, İsmail/Исмаил/Исмаиль, appeared 73 times, and appears to be incorrect! It is the 23rd most common word in the corpus; presumably İsmail is a character in the story, and its frequency does not reflect general text. The next most common exception is кой/köy, with 24 occurrences, as in the C2L case. The exception list still seems useful, though we need to get the transliteration of İsmail worked out.
 * 1,811/23,630 tokens = 7.66% of the text is transliterated incorrectly.
 * 704/8,816 types = 7.99% of individual words are transliterated incorrectly.
 * There is a higher percentage of types than tokens transliterated incorrectly because the more common types are done correctly.
 * Getting definitive transliteration answers for the top 28 words (with ≥ 10 occurrences) would reduce the number of type errors by 28 (to 7.67%), but the number of token errors by 710 (to 4.66%).
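The arithmetic behind that last estimate, reproduced from the counts reported above:

```python
# Figures from the L2C results above.
type_errors, token_errors = 704, 1811
types_total, tokens_total = 8816, 23630
top_types, top_tokens = 28, 710  # error types with >= 10 occurrences, and their tokens

print(f"{(type_errors - top_types) / types_total:.2%}")    # 7.67%
print(f"{(token_errors - top_tokens) / tokens_total:.2%}")  # 4.66%
```

Because those 28 types account for 710 of the 1,811 error tokens, fixing them cuts the token error rate much more than the type error rate.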

Exceptions as Errors: A few exceptions on the list also showed up as L2C parallel transliteration errors. Possible reasons include errors in the parallel corpus (which may have been partially automatically generated), and errors in the exception list.

Speaker review notes: in the table below, what is the correct Cyrillic transliteration of the Latin? These should be reviewed by a speaker and corrected if needed.

Patterns of Errors: Of the 704 types that had transliteration errors, most of them involve ю/у, ё/о, э/е, or ь. A smaller number involve ц/тс, щ/шч, and ъ/ь. Clearly those are the hard letters to transliterate to. The list of exceptions is below. Parallel Cyrillic/Latin are the words as found in the parallel texts. Transliterated Cyrillic is the result of automatically transliterating the Latin.

''Speaker review notes: in the table below, what is the correct Cyrillic transliteration of the Latin? Are there any obvious general patterns we are not taking advantage of?'' Speaker review of the errors for corrections, a list of exceptions, or a better transliteration rule would be helpful.

Conclusions and Next Steps
I feel like the Cyrillic to Latin accuracy is pretty good at >99%. However, the Latin to Cyrillic is only ~92%, which may not be good enough.

The immediate next step is to get some speaker review of the inconsistencies in the transliteration of the parallel texts, make improvements, and see where to go from there.

Steps after that could include:
 * Put the improved version of the transliteration in a patch for review. There may be some technical/programming issues beyond transliteration accuracy, so it may be a good idea to submit this to Gerrit as [WIP] even if the accuracy isn't quite good enough.
 * Get speaker review of CRH Wikipedia tokens. This list is all manual review, but it should be more representative of words in Wikipedia.
 * Process more parallel texts to add to the parallel corpus. I processed one long story (~20K words, ~8,800 unique words). I could add several more and try to get the total corpus up to ~100K words. It's a semi-manual process, but it's less tedious than looking at all the Wiki tokens.