User:TJones (WMF)/Notes/Removing Stress Accents and Folding Ё to Е for Russian Wikis

September 2016 — See TJones_(WMF)/Notes for other projects. (T102298 / T124592) For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Summary
Stress accents (e.g., бори́тесь) are only used in Russian dictionaries and encyclopedias as an aid to pronunciation. These extra diacritics make text matching difficult and interfere with stemming, so we should remove them. The letter ё is rarely used outside pedagogical contexts and is usually replaced with е, so we should do likewise when stemming.

Highlights:
 * Unpacking the default Russian analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
 * Removing stress accents and folding ё to е affect a fairly large number of words, but the effects are almost all desirable (with the possible exception of a few etymological roots in the Latin alphabet, but the number is small and the effect is arguably positive or at least non-negative).

Introduction
As noted in T102298, acute accents are only used in Russian to mark stress as a pronunciation guide. There are no precomposed Cyrillic vowels with the acute accent, so the combining acute accent is always used for this purpose. These stress accents are a big problem in Russian Wiktionary, where sometimes the only instance of an inflected form has the stress accent. In Russian Wikipedia the stress accents are often used in the first, bolded instance of the subject of the article as an aid to pronunciation, but unaccented versions are present in the title and throughout the article (e.g., Михаи́л Серге́евич Горбачёв vs the title, Горбачёв, Михаил Сергеевич).
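As a small illustration of the point about precomposed characters, here is a Python sketch using only the standard `unicodedata` module (the helper name `strip_stress` is made up for this note). NFC normalization leaves a Cyrillic vowel plus U+0301 as two code points, while the Latin equivalent composes into a single one, so deleting U+0301 is enough to recover the unaccented Cyrillic form.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"
CYRILLIC_VOWELS = "аеиоуыэюя"


def strip_stress(text: str) -> str:
    """Remove every combining acute accent (hypothetical helper)."""
    return text.replace(COMBINING_ACUTE, "")


# Cyrillic vowel + U+0301 has no precomposed form, so NFC keeps two code points.
cyrillic_lengths = [
    len(unicodedata.normalize("NFC", vowel + COMBINING_ACUTE))
    for vowel in CYRILLIC_VOWELS
]

# Latin e + U+0301 composes into the single code point U+00E9.
latin_length = len(unicodedata.normalize("NFC", "e" + COMBINING_ACUTE))

# The bolded article subject from the example above, with stress marks.
accented = "Михаи\u0301л Серге\u0301евич Горбачёв"
plain = strip_stress(accented)
```

Note that ё is untouched here; folding it to е is a separate step.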

As noted in T124592 and on the English Wikipedia page for Ё, the use of ё is not obligatory (it’s really interesting, go read it!), and so it isn’t common: “By and large, it is used only in dictionaries and in pedagogical literature intended for children and students of Russian as a second language.” In Russian Wikipedia, this is overcome by the liberal use of redirects (see Черная дыра for Чёрная дыра), though the difference can lead to changes in stemming. In Russian Wiktionary, the results can even be misleading; when searching for “черная дыра” the top results are entries citing that form, while the “proper” dictionary form, чёрная дыра, is the 10th result.

Of course, Russian Wiktionary is a dictionary, so both ё and stress accents are common there. To address this, we are unpacking the Elasticsearch Russian analyzer and adding a character filter to remove the stress accent and fold ё to е, for both the plain analyzer (exact matches, no stemming) and the text analyzer (with stemming).
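A minimal sketch of what such a configuration might look like, written as a Python dict that serializes to Elasticsearch index settings. The unpacked chain (standard tokenizer, then lowercase, stop, and stemmer filters) follows the rebuilt `russian` analyzer in the Elasticsearch documentation, with the optional keyword_marker filter left out. The name `russian_charfilter`, the exact mapping entries, and the shape of the `plain` analyzer are assumptions for illustration, not the deployed configuration.

```python
import json

# Hypothetical character filter: strip stress marks and fold ё to е
# before tokenization, using the Elasticsearch "mapping" char filter.
russian_charfilter = {
    "type": "mapping",
    "mappings": [
        "\u0301=>",        # combining acute accent (stress mark) -> removed
        "\u0451=>\u0435",  # ё -> е
        "\u0401=>\u0415",  # Ё -> Е
    ],
}

settings = {
    "analysis": {
        "char_filter": {"russian_charfilter": russian_charfilter},
        "filter": {
            "russian_stop": {"type": "stop", "stopwords": "_russian_"},
            "russian_stemmer": {"type": "stemmer", "language": "russian"},
        },
        "analyzer": {
            # Text analyzer: unpacked Russian chain plus the char filter.
            "text": {
                "type": "custom",
                "char_filter": ["russian_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "russian_stop", "russian_stemmer"],
            },
            # Plain analyzer (assumed shape): same char filter, no stemming.
            "plain": {
                "type": "custom",
                "char_filter": ["russian_charfilter"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            },
        },
    }
}

settings_json = json.dumps(settings, indent=2)
```

Putting the folding in a character filter rather than a token filter means it runs before tokenization and stemming, so both analyzers see the same unaccented input.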

As with unpacking the French analyzer, there are some differences in the way some Unicode characters are treated, though most are positive.

Corpora Generation and Processing
I extracted 10,000 articles/entries each from Russian Wikipedia and Russian Wiktionary. I also extracted smaller corpora for testing code and for more careful inspection of results.

I ran the analysis chain on the extracted texts, unpacked the Russian analysis chain and re-ran them, and then made the desired changes—stripping the stress marks and folding ё to е—and re-ran them again.

Differences in Unpacked Analysis Chain
The differences were of the same sort seen when unpacking the French analysis chain, though there were fewer specific characters affected, since I was using a smaller corpus (10K vs 50K).

The differences seem to be mostly improvements. There is one regression, the same one seen when unpacking the French analyzer: Turkish dotted I (İ) is not folded to I. The improvements:
 * Invisible Unicode characters are stripped; they would otherwise keep some words from being indexed properly. Examples:
   * 'zero width non-joiner' (u+200c)
   * 'zero width joiner' (u+200d)
   * 'left-to-right mark' (u+200e)
   * 'right-to-left mark' (u+200f)
   * 'zero width no-break space' (u+feff)
   * soft hyphens
 * Misc Unicode characters
   * ß -> ss
   * Raised letters: ʲ -> j, ª -> a, º -> o
   * Ligatures: ﬁ -> fi
   * Roman numerals: Ⅿ -> M
   * Special use symbols: Ｚ -> Z; µ -> μ. (Could go either way.)
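To make the categories above concrete, here is a rough Python illustration using only the standard library. It is not how the Elasticsearch filters are implemented. It only shows that the listed invisible characters all fall in Unicode general category Cf (format), and that most of the listed foldings coincide with NFKC compatibility normalization. Treating either as equivalent to the analyzer's behaviour is an assumption.

```python
import unicodedata

# The invisible characters listed above, plus the soft hyphen (U+00AD).
INVISIBLE = ["\u200c", "\u200d", "\u200e", "\u200f", "\ufeff", "\u00ad"]


def strip_format_chars(text: str) -> str:
    """Drop characters in Unicode category Cf (hypothetical helper)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Source character -> folded form, as given in the list above.
FOLDING_EXAMPLES = {
    "\u02b2": "j",       # raised j
    "\u00aa": "a",       # feminine ordinal indicator
    "\u00ba": "o",       # masculine ordinal indicator
    "\ufb01": "fi",      # fi ligature
    "\u216f": "M",       # Roman numeral one thousand
    "\uff3a": "Z",       # fullwidth Z
    "\u00b5": "\u03bc",  # micro sign -> Greek small letter mu
}

nfkc_results = {
    src: unicodedata.normalize("NFKC", src) for src in FOLDING_EXAMPLES
}

# ß is not changed by NFKC; the ß -> ss folding needs a separate rule.
eszett_after_nfkc = unicodedata.normalize("NFKC", "\u00df")

# A word with a hidden soft hyphen only matches its plain form once stripped.
hidden = "сло\u00adво"
visible = strip_format_chars(hidden)
```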

Results—Built-in Russian vs Unpacked Russian
The size of the effect is very small, and generally positive. Note that "new collisions" are post-analysis types (final buckets), and all others are pre-analysis types (original forms). This is confusing, sorry. The number of types changes (it is reduced) after analysis, but the number of tokens doesn't.
 * total tokens: basically the total number of words in the corpora
 * pre-analysis types: the number of unique different forms of words before stemming and folding
 * post-analysis types: the number of unique different forms of words after stemming and folding
 * new collision types: the number of unique post-analysis forms of words that are bucketed together by the change (the % changed is in comparison to the post-analysis types)
 * new collision tokens: the number of individual words that are bucketed together by the change (the % changed is in comparison to the original total tokens)
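The definitions above can be sketched in Python as follows. This is one plausible reading, not the script actually used for this analysis: a post-analysis bucket under the new analyzer counts as a "new collision" when its members fell into more than one bucket under the old analyzer. The function names and the toy analyzers are invented for the example.

```python
from collections import Counter, defaultdict


def bucket(types, analyze):
    """Group pre-analysis types by their post-analysis form."""
    buckets = defaultdict(set)
    for word in types:
        buckets[analyze(word)].add(word)
    return buckets


def collision_stats(tokens, old_analyze, new_analyze):
    """Count tokens, types, and new collisions (hypothetical helper)."""
    counts = Counter(tokens)  # pre-analysis type -> token count
    new_buckets = bucket(counts, new_analyze)
    collision_types = 0
    collision_tokens = 0
    for members in new_buckets.values():
        # Members that the old analyzer kept apart are newly merged.
        if len({old_analyze(word) for word in members}) > 1:
            collision_types += 1
            collision_tokens += sum(counts[word] for word in members)
    return {
        "total_tokens": sum(counts.values()),
        "pre_analysis_types": len(counts),
        "post_analysis_types": len(new_buckets),
        "new_collision_types": collision_types,
        "new_collision_tokens": collision_tokens,
    }


def old_toy(word):
    """Toy stand-in for the old chain: lowercase only."""
    return word.lower()


def new_toy(word):
    """Toy stand-in for the new chain: lowercase, then fold ё to е."""
    return word.lower().replace("ё", "е")


sample = ["Чёрная", "черная", "черная", "дыра", "Дыра"]
stats = collision_stats(sample, old_toy, new_toy)
```

In this toy sample, Чёрная and черная land in one bucket only under the new analyzer, so that bucket is a new collision covering three tokens. дыра and Дыра were already merged by lowercasing, so they are not counted.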

Also note that the effect is smaller, percentage-wise, than it will be for the entire corpus of either Russian Wiktionary or Wikipedia because the sample is small, and not all potential collisions occur.

Results—Unpacked Russian vs Unpacked and accent-stripped and ё-folded Russian
On Russian Wiktionary, a small number of non-Cyrillic characters are affected by the stripping of the combining acute accent, in particular, ā́ -> ā and v́ -> v. These occur in etymologies, and only accounted for 10 tokens in this sample. I don’t think it poses a problem, but if it does, we can make the character filter more complex and explicitly map accented vowels to unaccented ones (e.g., и́ -> и, etc.), though that wouldn’t catch accents misplaced on consonants, or multiple accents (see below).

On Russian Wikipedia, in addition to accents on Cyrillic vowels, a few words have accents on Cyrillic consonants, and there is one instance of an accent on an accent (probably all typos). The double accent is probably something that would not be found for a long, long time. (I only noticed it because my text editor doesn’t combine accents properly, so instead of Ли́́вшиц (it’s double accented*), I saw Ли´´вшиц.)

* Hmm. It really depends on the application. My non-dev text editor showed one accent (presumably the two overlapped). Chrome shows them double stacked. Anyway, it's a weird case that may or may not be obvious depending on how it's rendered, and it could be very hard to search for.
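The trade-off between a blanket strip and explicit vowel mappings can be shown with a short Python sketch. The two helper functions are hypothetical stand-ins for the two character filter designs, not the filter itself.

```python
import re

ACUTE = "\u0301"

# Explicit approach: remove the accent only right after a Cyrillic vowel.
VOWEL_ACUTE = re.compile("([аеиоуыэюя])" + ACUTE, re.IGNORECASE)


def strip_all(text: str) -> str:
    """Blanket approach: delete every combining acute accent."""
    return text.replace(ACUTE, "")


def strip_after_vowels(text: str) -> str:
    """Explicit approach: one accent per Cyrillic vowel is removed."""
    return VOWEL_ACUTE.sub(r"\1", text)


stressed = "бори" + ACUTE + "тесь"       # ordinary stress mark
double = "Ли" + ACUTE + ACUTE + "вшиц"   # accent on an accent
consonant = "к" + ACUTE                  # accent misplaced on a consonant
latin = "v" + ACUTE                      # Latin letter from an etymology
```

Both approaches handle the ordinary case. The explicit mapping leaves the Latin form alone, but it also leaves the second accent on the double-accented name and the accent on the consonant, which is the limitation noted above.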

Conclusions

 * The overall impact of performing this character filtering on Russian Wiktionary and Russian Wikipedia is positive. We should do it!
 * We should probably set up a character map from İ to I so as not to regress on Turkish names.