User:TJones (WMF)/Notes/Greek and Unexpected Empty Tokens

February/March 2019 — See TJones_(WMF)/Notes for other projects. See also T203117. For help with the technical jargon used in Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background[edit]

While looking into empty tokens created by ICU folding, I discovered that the monolithic Greek analyzer generates some empty tokens, too, particularly for these words: εστάτο, εστερ, εστέρ, έστερ, έστέρ, εστέρα, εστέρας, εστέρες, εστέρησε, εστερία, εστερικό, εστερικού, εστερικών, εστέρο, εστέρος, εστέρων, ήσανε, ότερ, οτέρι, ότερι, οτερό, οτέρο.

As a result, searching for any of them finds the others. Some are related, but as far as I can tell, searching for εστάτο (estáto) should not return articles with Εστέρες (estéres) and Οτερό (oteró) in the title as top hits—yet that's what happens!

A straightforward solution would be to unpack the Greek analyzer and add a filter for empty tokens. These words would no longer be conflated, and exact matches would still be available through the plain index.
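To make the proposal concrete, here is a minimal sketch of what the unpacked analyzer could look like, written as Elasticsearch index settings in Python dict form. The filter names (e.g., `greek_length`) and the exact chain are illustrative assumptions, not the actual production config.

```python
# Hypothetical sketch of an unpacked Greek analyzer as Elasticsearch
# index settings (Python dict form). The "length" filter at the end
# drops the empty tokens the stemmer sometimes produces. Filter names
# are illustrative, not the real production configuration.
unpacked_greek = {
    "analysis": {
        "filter": {
            "greek_lowercase": {"type": "lowercase", "language": "greek"},
            "greek_stop": {"type": "stop", "stopwords": "_greek_"},
            "greek_stemmer": {"type": "stemmer", "language": "greek"},
            # drop zero-length tokens left behind by the stemmer
            "greek_length": {"type": "length", "min": 1},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "greek_lowercase",
                    "greek_stop",
                    "greek_stemmer",
                    "greek_length",
                ],
            }
        },
    }
}
```

With a chain like this, εστέρες and οτερό would each stem (or fail to stem) on their own merits, and any empty result would be discarded before indexing.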

Data and Unpacking the Analyzer[edit]

As usual, I pulled 10,000 random Wikipedia articles and 10,000 random Wiktionary articles to test changes on. Markup was stripped and lines were deduplicated.

I unpacked the monolithic analyzer—making sure the “greek” option was specified for the lowercasing filter—and disabled the automatic upgrade to ICU normalization. The analysis results were identical, as expected.

I re-enabled the ICU normalization upgrade and that’s when all hell broke loose!

ICU Normalization[edit]

The first big surprise was that there were 6.362% more tokens for the Wikipedia sample (3,194,734 vs 3,003,632) and 3.118% more for the Wiktionary sample (102,233 vs 99,142).

And while there were a tiny number of new collisions (words being newly analyzed the same), there were a huge number of splits:

  • Wikipedia: 27,697 pre-analysis types (8.849% of pre-analysis types) / 1,126,081 tokens (37.491% of tokens) were lost from 4,117 groups (2.242% of post-analysis types), affecting a total of 33,933 pre-analysis types (10.842% of pre-analysis types) in those groups.
  • Wiktionary: 595 pre-analysis types (1.644% of pre-analysis types) / 6,251 tokens (6.305% of tokens) were lost from 194 groups (0.715% of post-analysis types), affecting a total of 826 pre-analysis types (2.283% of pre-analysis types) in those groups.

So, over a third of tokens in the Wikipedia data got split from the analysis group they were in because ICU normalization got turned on? That’s nuts!

Greek “Lowercasing”[edit]

As noted before, when unpacking the Greek analyzer, I had to enable the Greek option for the lowercase filter. The Greek version of lowercasing not only lowercases letters but also converts final sigma (ς) to regular sigma (σ) and, very importantly, removes a few diacritics—tonos (ά), dialytika (ϊ), or both (ΰ)—from various vowels.
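The Greek-specific lowercasing can be approximated in a few lines of Python. This is a rough sketch of the behavior described above, not the actual Lucene `GreekLowerCaseFilter` implementation (which works character-by-character over a fixed set of vowels).

```python
import unicodedata

def greek_lowercase(text):
    """Rough approximation of Greek-specific lowercasing: lowercase,
    strip tonos and dialytika, and fold final sigma to regular sigma.
    (A sketch only, not the actual Lucene filter.)"""
    # Decompose so tonos (U+0301) and dialytika (U+0308) become
    # separate combining marks, then drop just those two marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(c for c in decomposed if c not in "\u0301\u0308")
    recomposed = unicodedata.normalize("NFC", stripped)
    # Fold final sigma (ς) to regular sigma (σ).
    return recomposed.replace("ς", "σ")

print(greek_lowercase("Δωρικός"))  # δωρικοσ
print(greek_lowercase("ΰ"))        # υ
```

Note how both the tonos and the final sigma are gone from δωρικοσ—exactly the form the downstream stemmer and stop word filter expect.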

Side Note: There are also special lowercase settings for Irish and Turkish. Turkish I knew about, since it has I/ı and İ/i as different letters, and you want to do the right thing and not lowercase I to i or, worse, İ to i̇ (an i with an extra dot). The Irish situation was news to me: Irish has t-prothesis and n-eclipsis, which can add t- or n- to the beginning of a word starting with a vowel, though the added t- or n- is still lowercase, and the original initial vowel can still be uppercase, as in the Irish Wikipedia titles “An tAigéan Ciúin” and “Ceol na nOileán”. So the lowercase of Nathair is nathair, but the lowercase of nAthair is n-athair (which gets stemmed to just athair, which is what you want).

It seems that the Greek stemmer and Greek stop word filter don’t handle diacritics. For example, από (“from”) is not filtered by the stop word filter, but απο is. The stemmer can’t handle the diacritic on Δωρικός (ICU normalized to δωρικόσ) and returns it unchanged, while Δωρικος is stemmed to δωρικ.
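The stop word mismatch is easy to model: the filter's word list holds diacritic-free forms (because Greek lowercasing normally strips the diacritics first), so a form with a tonos slips past. The stop word set below is a tiny illustrative subset, not the real Lucene Greek list.

```python
# Hypothetical illustration: the Greek stop word list stores forms
# without diacritics, since Greek lowercasing normally strips them
# before the stop filter runs. (Tiny illustrative subset, not the
# real Lucene list.)
greek_stopwords = {"απο", "με", "για", "και"}

print("απο" in greek_stopwords)  # True: diacritic-free form is caught
print("από" in greek_stopwords)  # False: the tonos form slips through
```

If plain ICU normalization replaces Greek lowercasing, the tonos survives to this point, and από is indexed as a regular word—which is precisely where the extra tokens come from.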

So, all the extra tokens are stop words with diacritics that aren’t being dropped, and all the extra splits are either words that failed to stem because a diacritic on the affix blocked its removal, or roots that still differ by a diacritic after stemming (e.g., Φιλόσοφοι is stemmed to φιλόσοφ (with a diacritic) while φιλοσοφια is stemmed to φιλοσοφ (without one), creating a split).

Unpacked vs Greek Lowercase + ICU Normalization[edit]

I refactored the code that automatically upgrades the lowercase filter to ICU normalization: when the lowercase filter has a “language” attribute defined (i.e., for Greek and Turkish as currently configured, and for Irish as it could be configured), ICU normalization is now added after the lowercase filter rather than replacing it.
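The decision logic can be sketched as a small function over a filter chain. This is an illustrative Python model of the rule just described—the real code lives in the search plugins and looks nothing like this.

```python
# Illustrative model of the refactored upgrade rule (not the actual
# plugin code): a language-specific lowercase filter is kept and
# followed by icu_normalizer; a plain lowercase filter is simply
# replaced by icu_normalizer.
def upgrade_filter_chain(filters, filter_defs):
    upgraded = []
    for name in filters:
        definition = filter_defs.get(name, {})
        if definition.get("type") == "lowercase":
            if "language" in definition:
                # keep language-specific lowercasing, add ICU normalization
                upgraded.append(name)
                upgraded.append("icu_normalizer")
            else:
                # plain lowercasing is subsumed by ICU normalization
                upgraded.append("icu_normalizer")
        else:
            upgraded.append(name)
    return upgraded

defs = {
    "greek_lowercase": {"type": "lowercase", "language": "greek"},
    "lowercase": {"type": "lowercase"},
}
print(upgrade_filter_chain(["greek_lowercase", "greek_stop"], defs))
# ['greek_lowercase', 'icu_normalizer', 'greek_stop']
print(upgrade_filter_chain(["lowercase", "stop"], defs))
# ['icu_normalizer', 'stop']
```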

So, for Greek, this keeps the diacritic-stripping features of the Greek-specific lowercase filter, plus all the other ICU normalizations.

Adding ICU normalization had a very small impact on the Wiktionary corpus: 3 pre-analysis types (0.008% of pre-analysis types) / 3 tokens (0.003% of tokens) were added to 3 groups (0.011% of post-analysis types), affecting a total of 6 pre-analysis types (0.017% of pre-analysis types) in those groups.

The Wikipedia corpus lost 10 tokens (~0%) from words normalized to stop words, and had a few more collisions: 259 pre-analysis types (0.083% of pre-analysis types) / 302 tokens (0.010% of tokens) were added to 241 groups (0.131% of post-analysis types), affecting a total of 2,259 pre-analysis types (0.722% of pre-analysis types) in those groups.

The main changes that affected Greek words included:

  • normalization of vowels with ypogegrammeni (iota subscript): ᾳ → α, ῃ → η, ῳ → ω.
  • normalization of µ (U+00B5, micro sign) to μ (mu), which, depending on your font, can be visually identical.
    • Two types of tokens were lost: µε and µετά (both with micro signs)—once regularized, they are stop words!
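The micro-sign case is easy to verify with Python’s compatibility normalization, which here stands in for ICU normalization (the actual `icu_normalizer` applies a fuller transformation, but agrees on this character):

```python
import unicodedata

# The micro sign (U+00B5) looks like Greek mu but is a distinct
# character; compatibility normalization (NFKC, standing in for ICU
# normalization) folds it to real mu (U+03BC), turning "µε" with a
# micro sign into the Greek stop word "με".
micro_me = "\u00b5\u03b5"  # µε with MICRO SIGN
greek_me = "\u03bc\u03b5"  # με with GREEK SMALL LETTER MU

print(micro_me == greek_me)                                # False
print(unicodedata.normalize("NFKC", micro_me) == greek_me)  # True
```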

Other changes include the usual:

  • Removal of invisible characters, including bi-directional markers, soft hyphens, zero-width non-joiners, and non-breaking spaces.
  • Normalization of characters in Arabic, German, IPA, and Korean.

Overall, that’s what we expected.

Adding the Minimum Length Filter[edit]

And now, finally, the original reason we are here: adding a minimum-length token filter to drop the zero-length tokens.

For the Wiktionary corpus, there was no change.

For the Wikipedia corpus, the only changes are the loss of the empty tokens.

All as expected!

ICU Folding in the Plain Field[edit]

In order to address the problems of Greek accents, including those used in Ancient Greek, we introduced ICU Folding into the Greek plain analyzer. I considered removing the ICU Folding and letting the Greek lowercasing do the work it should have been doing all along. However, the ICU Folding, with preservation of the original tokens, is probably providing help in non-Greek searches, so it makes sense to leave it—even though we’ll now be triple normalizing the Greek plain field (it should be very normal!!!).

Next Steps[edit]

  • ✓ Investigate the effects of the general approach to language-specific lowercasing on Turkish and Irish. (DONE)
  • ✓ Commit the changes. (DONE)
  • Reindex Greek-language wikis. (SOON)