User:TJones (WMF)/Notes/Unpacking Notes/Bengali

Bengali Notes (T294067)

The situation with Bangla/Bengali is a little different than others I've worked on so far. The Bengali analyzer from Elasticsearch has not been enabled, so I need to enable it, verify it with speakers, and unpack it so that we don't have any regressions in terms of handling ICU normalization or homoglyph normalization.

  • Usual 10K sample each from Wikipedia and Wiktionary.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u-encoded tokens, file names, numbers, IPA transcriptions (in Wiktionary), etc.
  • A few mixed Latin/Cyrillic words with homoglyphs (in Wikipedia)

Baseline

  • ICU & homoglyph normalization enabled; no stopwords, no stemmer (a rough sketch of this baseline chain follows after this list)
  • Lots of invisibles (stripped out by ICU normalization)
    • Tons of zero-width joiner and zero-width non-joiners (U+200C, U+200D)—mostly in Bengali words
    • Up to 5 versions of the same word with different variations of invisibles
    • Several bidi marks, esp. LTR marks—mostly Arabic, but also with Hebrew, Bengali, and Latin
  • Some normalization of Unicode characters
    • Mostly Arabic and Latin characters
  • A handful of double-dotted I tokens (after lowercasing, e.g., i̇stanbul), and one token with double umlauts: kü̈ltü̈r
  • Out of ~3.7M tokens in the Wikipedia sample, there are 401 Bengali, 7 Latin, and 37 integer token types that occur ≥1K times each, with the top entry being এবং ("and") with over 57K occurrences—that's over 1.5% of all tokens!
    • Looks like plenty of good stop word candidates!
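
For reference, the baseline chain is roughly the generic default analyzer plus the ICU and homoglyph upgrades. A minimal sketch, assuming the analysis-icu plugin (icu_normalizer) and the Wikimedia extra-analysis plugin (homoglyph_norm); names and ordering are illustrative rather than the exact production config:

```python
# Rough sketch of the baseline analysis chain described above: ICU and
# homoglyph normalization, but no Bengali stop words and no stemmer.
# Assumes the analysis-icu plugin (icu_normalizer) and the Wikimedia
# extra-analysis plugin (homoglyph_norm); names and ordering are illustrative.
baseline_settings = {
    "analysis": {
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["homoglyph_norm"],  # fixes Latin/Cyrillic homoglyphs
                "tokenizer": "standard",
                "filter": ["icu_normalizer"],       # strips invisibles, folds variants
            }
        }
    }
}
```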

Enabling the Bengali Analyzer

I enabled the Bengali analyzer, ran it against my corpus, and brought some of the "potential problem" stemming groups (i.e., those that have neither a prefix nor a suffix in common) to a group of Bangla speakers who work at the Foundation. The most common unexpected alternation in my sample was initial শ, ষ, স (shô, ṣô, sô). The Bengali speakers were skeptical about some of the groupings, and I found more examples for them, including word-internal শ, ষ, স alternations. Nahid in particular explained some of the not-great-looking groupings, such as বিশ (the number '20'), বিষ ('poison'), and বিস ('lotus stalk') all being grouped together, and several others. The grouped/merged words are obviously similar, but it's not clear that they make good groupings. (Thanks, Nahid!)
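
"Enabling the Bengali analyzer" just means switching to the prebuilt bengali language analyzer that ships with Elasticsearch (the production change goes through the CirrusSearch config). For local testing, a minimal sketch, with an illustrative index name and URL:

```python
import requests

# Create a test index whose text analyzer is the prebuilt, monolithic
# "bengali" language analyzer. Index name and URL are illustrative.
requests.put(
    "http://localhost:9200/bnwiki_test",
    json={"settings": {"analysis": {"analyzer": {"text": {"type": "bengali"}}}}},
)
```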

I generated my usual, more complete analysis of the new stemmer's groupings (and upgraded my tools to make that more automatic), including a random sample, the largest groups, and the "potential problem" groups. Aisha helped me a lot by looking over them. (I am always very very grateful for the help I get from native speakers when looking at analyzers, because without it I'd be stuck, but having a data analyst review the data is just the best thing ever! Thanks, Aisha!)
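
The core idea of the grouping analysis is simple, even though the real tooling is more involved: run each distinct word through the _analyze API and group the original words by the token they produce; groups whose members share neither a prefix nor a suffix are the "potential problem" ones. A rough sketch (index and analyzer names are illustrative):

```python
import collections
import os
import requests

# Minimal sketch of the grouping analysis (not the actual tooling): run each
# distinct word through _analyze and group the originals by output token.
ANALYZE_URL = "http://localhost:9200/bnwiki_test/_analyze"

def analyze(word, analyzer="text"):
    resp = requests.post(ANALYZE_URL, json={"analyzer": analyzer, "text": word})
    return [t["token"] for t in resp.json()["tokens"]]

def stemming_groups(words):
    groups = collections.defaultdict(set)
    for w in words:
        for token in analyze(w):
            groups[token].add(w)
    return groups

def is_potential_problem(group):
    # "potential problem" groups: members share neither a common prefix
    # nor a common suffix
    words = list(group)
    shared_prefix = os.path.commonprefix(words)
    shared_suffix = os.path.commonprefix([w[::-1] for w in words])
    return len(words) > 1 and not shared_prefix and not shared_suffix
```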

Obvious Changes

Fewer Tokens: My Wikipedia sample had 10.8% fewer tokens, and my Wiktionary sample had 5.7% fewer tokens. These are presumably from the new stopword list enabled as part of the Bengali analyzer.

Collisions/Mergers:

  • Wikipedia: 34,858 pre-analysis types (11.911% of pre-analysis types) / 933,885 tokens (25.363% of tokens) were added to 15,453 groups (5.407% of post-analysis types), affecting a total of 50,312 pre-analysis types (17.191% of pre-analysis types) in those groups.
  • Wiktionary: 1,606 pre-analysis types (4.137% of pre-analysis types) / 8,767 tokens (8.310% of tokens) were added to 895 groups (2.331% of post-analysis types), affecting a total of 2,501 pre-analysis types (6.443% of pre-analysis types) in those groups.
  • TL;DR: A lot of words got merged: ~25% from Wikipedia, ~8% from Wiktionary—which is what you expect when you enable a stemmer!

Splits:

  • Wikipedia: 489 pre-analysis types (0.167% of pre-analysis types) / 1,052 tokens (0.029% of tokens) were lost from 465 groups (0.163% of post-analysis types), affecting a total of 963 pre-analysis types (0.329% of pre-analysis types) in those groups.
  • Wiktionary: 13 pre-analysis types (0.033% of pre-analysis types) / 18 tokens (0.017% of tokens) were lost from 13 groups (0.034% of post-analysis types), affecting a total of 26 pre-analysis types (0.067% of pre-analysis types) in those groups.
  • TL;DR: A very small number of tokens got removed from groups they previously joined with: ~0.03% from Wikipedia, ~0.02% from Wiktionary. (How these merger and split counts are tallied is sketched after this list.)
  • These are largely from invisible characters that are no longer normalized with the monolithic analyzer—mostly bidi marks in RTL tokens (but also Bengali tokens), and zero-width joiners & zero-width non-joiners in Bengali and other Brahmic script tokens. There aren't many of them—though there are somewhat more than these counts suggest because many are unique—but not normalizing them makes words with these invisible characters unfindable.
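
For what it's worth, the merger and split counts above come from comparing the stemming groups before and after a change; the bookkeeping is roughly like the sketch below (an illustrative reconstruction, not my actual analysis tool).

```python
from collections import defaultdict

# Illustrative reconstruction of the merger/split bookkeeping: map each word
# to the set of words it shares a group with, before and after, and compare.
def group_members(word_to_group):
    groups = defaultdict(set)
    for word, key in word_to_group.items():
        groups[key].add(word)
    return {word: groups[key] for word, key in word_to_group.items()}

def mergers_and_splits(old_word_to_group, new_word_to_group):
    old = group_members(old_word_to_group)
    new = group_members(new_word_to_group)
    shared = old.keys() & new.keys()
    # a word "merged" if it now groups with words it didn't before;
    # it "split" if it no longer groups with words it did before
    merged = {w for w in shared if new[w] - old[w]}
    split = {w for w in shared if old[w] - new[w]}
    return merged, split
```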

Misc:

  • Bengali numerals are normalized to Arabic numerals, increasing mergers (and recall). (A quick way to see this folding in isolation is sketched after this list.)
  • Some tokens with Cyrillic/Latin homoglyphs no longer get fixed.
  • Miscellaneous normalization in other scripts is no longer done. Includes Arabic, Armenian, Devanagari, Greek, IPA, Lao, Latin, Malayalam, Thai, and misc STEMmy symbols.
  • The monolithic analyzer fixes the existing double dotted I (which comes from capital İ, as in İstanbul/Istanbul).
  • There's some general Indic normalization that happens, like Devanagari क + ़ (U+0915 + U+093C) → क़ (U+0958).
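
The numeral folding comes from the standard decimal_digit token filter that the bengali analyzer includes; it's easy to check in isolation with an inline filter chain in _analyze (URL illustrative):

```python
import requests

# Check the numeral folding in isolation: the built-in decimal_digit token
# filter maps Bengali digits to ASCII digits. No index is needed when the
# tokenizer and filters are given inline; the URL is illustrative.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "standard", "filter": ["decimal_digit"], "text": "১৯৭১"},
)
print([t["token"] for t in resp.json()["tokens"]])  # expected: ['1971']
```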

Unpacking the analyzer and doing the usual upgrades should fix all the regressions while enabling the new cool features. The one thing I'm still worried about is the শ, ষ, স merging that the Bengali speakers all noted...

Double the Metaphone, Double the Fun(etics)

I set out to track down the source of the শ, ষ, স confusion, and it turns out to be the bengali_normalization filter in the Bengali analyzer.

I tracked down the relevant Lucene class, BengaliNormalizer.java, and right there in the comments, it claims to be an implementation of a particular paper, "A Double Metaphone Encoding for Bangla and its Application in Spelling Checker", by Naushad UzZaman and Mumit Khan. The provided link to the paper is dead, but the PDF is available at the Internet Archive's Wayback Machine (and elsewhere on the web).

The paper title immediately set off some alarm bells for me, because Metaphone (and Double Metaphone) are well-known phonetic algorithms. Phonetic algorithms are designed to provide an encoding of a word based on its pronunciation. Phonetic algorithms have been used in genealogy to group similar-sounding names and in spell checkers (for example, to help people spell genealogy correctly—or to better find the many spellings of Caitlin... though few will have any chance of matching KVIIIlyn—but I digress).

The "compression" of a phonetic algorithm can be more or less "aggressive", but in general, the English ones I'm familiar with are much too aggressive for indexing for search. As an example, it's good for a spell checker to take the unknown word foar and recommend for, four, and fore as possible replacements. It is not so good to index for, four, and fore together as "the same word" (in the way that hope, hopes, hoped, and hoping are indexed together by stemming). More aggressive phonetic compression (like English Double Metaphone), might also group all of these words and names with for, four, and fore: fairy, farrah, farraj, farrow, fear, ferrao, ferrier, ferry, fire, foyer, fray, fry, furry, phair, pharao, phare, vario, veer, verhaeghe, vieira, viray, wahr, war, waroway, wear, weary, weir, werry, weyer, wire.

Given my level of familiarity with Bengali, it isn't clear from a quick perusal of the paper or the code whether the "Double Metaphone Encoding for Bangla" is as aggressive as Double Metaphone for English, but it does explain the otherwise unexpected merger of শ, ষ, স (shô, ṣô, sô). (However, the fact that it is called an encoding is a hint that it is more aggressive than milder forms of normalization, like lowercasing or ICU normalization.)

I've used the comments from the code as labels in the list below, replaced the \u-encoded characters with the actual characters (and descriptions based on their Unicode names), and replaced the code logic with text descriptions. The simpler conversions—like doubled vowel to vowel or the variants of na, ra, or sa—are easy enough to understand, but definitely feel like something for spelling correction search rather than normalization. (A small Python sketch of the simpler one-for-one rules follows after the list.)

  1. "delete Chandrabindu"
    • Delete ঁ (chandrabindu / candrabindu)
  2. "DirghoI kar -> RosshoI kar"
    • convert ী (ii) to ি (i)
  3. "DirghoU kar -> RosshoU kar"
    • convert ূ (uu) to ু (u)
  4. "Khio (Ka + Hoshonto + Murdorno Sh)"
    • ক্ি (ka + virama + i)
      • convert to খ (kha) at the beginning of a word
      • convert to কখ (ka + kha) elsewhere
  5. "Nga to Anusvara"
    • convert ঙ (nga) to ং (anusvara)
  6. "Ja Phala"
    • if a word starts with <any> + virama + ya, e.g. ত্য, or <any> + virama + ya + aa, e.g. ত্যা, convert it to <any> + e, e.g. তে
    • otherwise, delete ্য (virama + ya)
  7. "Ba Phalaa"
    • if a word starts with <any> + virama + ba, e.g. ত্ব, delete ্ব (virama + ba)
    • otherwise, given the sequence <any> + <any> + virama + <any> + virama + ba, delete the final ্ব (virama + ba)
    • otherwise, delete ্ব (virama + ba) and double the preceding letter.
  8. "Visarga"
    • given ঃ (visarga)
      • convert ঃ to হ (ha) if it is the last letter of a word of length 3 or less
      • delete ঃ if it is the last letter of a longer word
      • delete ঃ and double the following letter if it is not the last letter
  9. "All sh"
    • convert শ (sha) and ষ (ssa) to স (sa)
  10. "check na"
    • convert ণ (nna) to ন (na)
  11. "check ra"
    • convert ড় (rra) and ঢ় (rha) to র (ra)
  12. (no comment provided)
    • convert ৎ (khanda ta) to ত (ta)
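
For concreteness, here is a small Python sketch of just the simple one-for-one rules from the list above. It is an illustration, not the Lucene implementation, and the positional rules (4, 6, 7, and 8) are omitted because they depend on context within the word.

```python
# Illustrative sketch of only the simple one-for-one rules above; NOT the
# Lucene implementation. Positional rules (4, 6, 7, 8) are omitted.
SIMPLE_MAP = {
    "\u0981": "",        # 1. delete chandrabindu ঁ
    "\u09c0": "\u09bf",  # 2. ii kar ী -> i kar ি
    "\u09c2": "\u09c1",  # 3. uu kar ূ -> u kar ু
    "\u0999": "\u0982",  # 5. nga ঙ -> anusvara ং
    "\u09b6": "\u09b8",  # 9. sha শ -> sa স
    "\u09b7": "\u09b8",  # 9. ssa ষ -> sa স
    "\u09a3": "\u09a8",  # 10. nna ণ -> na ন
    "\u09dc": "\u09b0",  # 11. rra ড় -> ra র
    "\u09dd": "\u09b0",  # 11. rha ঢ় -> ra র
    "\u09ce": "\u09a4",  # 12. khanda ta ৎ -> ta ত
}

def simple_bengali_normalize(word: str) -> str:
    return "".join(SIMPLE_MAP.get(ch, ch) for ch in word)

# e.g., বিশ ('20') and বিষ ('poison') both come out as বিস ('lotus stalk'),
# which is why all three end up indexed together.
```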

Aisha looked over these rules individually, and rejected all of them. As expected, they are phonetic:

  • None of these rules should be applied to all words or even most words. These are only based on the sound of the word/letter. So when searching by writing, I don't think these rules should apply.

The plan now is to do the usual unpacking and make sure there are no changes; then disable the bengali_normalization filter, review those focused changes, and make sure they don't do anything unexpected (like interact with the stemmer in unexpected ways—which I don't expect, but we gotta check to be sure); and then enable the regular analysis upgrades and make sure they do what is expected.
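
The unpacked configuration should end up roughly like the sketch below: the Elasticsearch docs' recipe for rebuilding the bengali analyzer as a custom analyzer, with bengali_normalization left out and the usual homoglyph/ICU upgrades added. Filter names, ordering, and the omitted keyword_marker filter are illustrative rather than the exact production config.

```python
# Rough sketch of the unpacked analyzer: rebuild the monolithic "bengali"
# analyzer as a custom analyzer, minus bengali_normalization, plus the usual
# homoglyph/ICU upgrades. Names and ordering are illustrative.
unpacked_settings = {
    "analysis": {
        "filter": {
            "bengali_stop": {"type": "stop", "stopwords": "_bengali_"},
            "bengali_stemmer": {"type": "stemmer", "language": "bengali"},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["homoglyph_norm"],  # Wikimedia extra-analysis plugin
                "tokenizer": "standard",
                "filter": [
                    "icu_normalizer",        # in place of plain lowercase
                    "decimal_digit",
                    "indic_normalization",
                    # "bengali_normalization",  # deliberately disabled (see above)
                    "bengali_stop",
                    "bengali_stemmer",
                    # icu_folding gets added later (see the ICU Folding section)
                ],
            }
        },
    }
}
```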

Upgrade ES 6.5 to ES 6.8

Well, I took too long working on this and we've upgraded to Elastic 6.8, so I also upgraded my dev environment to ES 6.8 and looked at the diffs it generated. The results are similar to what I saw in general for the upgrade:

  • There were 70 (0.002%) more tokens than before
    • Most were new "EMOJI" characters (8 types, 69 tokens): ▫ ☃ ★ ☉ ♄ 🇧🇩 ® ™
    • Plus one Modi character: 𑘎
  • There were 24 changed tokens with narrow no-break spaces (NNBSP, U+202F) added.
    • The majority of these are 7-digit numbers that are part of arXiv paper IDs. They are a bit tricky to find because the current Bengali language config doesn't strip the NNBSPs either, so searching for the number alone returns 0 results!

These changes were all in my Wikipedia sample. My Wiktionary sample (which is less text overall) didn't show any changes.

Back to Our Regularly Scheduled Program

Now that all (well, most... okay, at least some) of the changes caused by enabling the Bengali analyzer and upgrading to ES 6.8 are accounted for, we can get back to unpacking.

  • Stemming observations:
    • Bengali Wikipedia had 62 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades).
  • Enabled homoglyphs and ICU normalization and mostly saw the usual stuff.
    • There are a fair number of mixed-script tokens, but they aren't confusing like homoglyphs. There are also 3 Latin/Cyrillic homoglyphs (that are automatically fixed) and a few potential Greek-Latin homoglyph confusions.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map (sketched after this list)
    • Most common normalizations:
      • the usual various character regularizations
      • lots of invisibles (bidi and zero-width (non)joiners in particular)
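
The dotted-I fix is just a small mapping char filter applied ahead of the rest of the chain; roughly like this (the filter name here is illustrative, not necessarily the production name):

```python
# Sketch of the dotted-I regression fix: a mapping char filter that rewrites
# İ (U+0130) to plain I before lowercasing, so it doesn't become
# i + combining dot above. The filter name is illustrative.
dotted_i_char_filter = {
    "char_filter": {
        "dotted_I_fix": {
            "type": "mapping",
            "mappings": ["İ=>I"],  # U+0130 => I
        }
    }
}
```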

Disable bengali_normalization

Token counts and stop words: An unexpected consequence of disabling the bengali_normalization filter, which is obvious in retrospect, is that it changed the total number of tokens in the corpus.

Stop words with characters affected by the bengali_normalization filter were no longer the same word by the time they got to the stop word filter. For example, the words সাধারণ, বেশি, and অবশ্য are in the Bengali stop word list. However, they come out of bengali_normalization as সাধারন, বেসি, and অবস, which are not stop words.

A smaller number of words became stop words as a result of bengali_normalization: উপড়ে and কড়া, which are not stop words, come out of bengali_normalization as উপরে and করা, which are.

Overall, more stop words were correctly filtered out than non–stop words were restored when disabling bengali_normalization, which resulted in a net decrease in tokens by about 0.9% for both the Wikipedia and Wiktionary samples.
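
Filter order is the whole story here: the stop filter only ever sees what bengali_normalization hands it. One way to poke at this directly is to run the same text through two inline filter chains with the _analyze API (a sketch; it assumes অবশ্য is in the built-in _bengali_ stop list, per the example above, and the URL is illustrative).

```python
import requests

# Compare the same text through two inline chains: with and without
# bengali_normalization ahead of the stop filter.
def analyze(filters, text):
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"tokenizer": "standard", "filter": filters, "text": text},
    )
    return [t["token"] for t in resp.json()["tokens"]]

bengali_stop = {"type": "stop", "stopwords": "_bengali_"}

analyze([bengali_stop], "অবশ্য")                           # stop word is removed
analyze(["bengali_normalization", bengali_stop], "অবশ্য")  # normalized form survives
```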

Widespread token changes: In the neighborhood of 20-25% of tokens in each sample were new, never before seen tokens. This isn't shocking. A parallel of just one of the bengali_normalization rules in English would be if every instance of sh were changed to s. Changing them back would introduce a lot of new words.

In many cases, there wasn't a big impact on stemming groups. As a parallel in English, if shipwreck, shipwrecks, shipwrecked, and shipwrecking were all changed to sipwreck, sipwrecks, sipwrecked, and sipwrecking, or changed back, they would all likely still stem together (as shipwreck or sipwreck) and not with anything else.

Of course, shorter words are more likely to differ by only a letter or two, and be affected by the aggressive normalization of bengali_normalization. (English parallels: shear/sear, sheep/seep, shave/save, or shack/sack being indexed together.) So there is plenty of that...

Stemming group splits (and just a few mergers): This is where we expect to see the most happening, and we do. In the Wikipedia sample, about 5.1% of distinct pre-analysis words and about 12.1% of all words were split off from stemming groups. In the Wiktionary sample, about 2.2% of distinct pre-analysis words and about 4.0% of all words were split off from stemming groups.

There were also some collisions/mergers of stemming groups, but it affected less than 0.4% of words and distinct words in both samples. The overwhelming majority of words added to stemming groups end in the suffix - ী, which seems to mean something like "person of", roughly parallel to -ist or -ian in English. It turns "Greece" into "Greek" and "science" into "scientist", for example.

Back to the splits:

  • It looks like sometimes ঃ (visarga) is used as a colon in times and chapter & verse numbers. The normalization rule for visarga doubles the digit after it, so ১৬ঃ৯ (16:9—the widescreen aspect ratio) becomes ১৬৯৯/1699. Good for that to stop happening.
  • The vast majority of the splits are readily attributable to the easier-to-understand rules in the bengali_normalization filter, such as "convert ii ী to i ি ", "convert nna ণ to na ন", and "convert sha শ and ssa ষ to sa স".
  • I asked Aisha to review a random selection of splits, to see whether the words were being correctly removed from stemming groups where they don't belong. About 80-85% of the changes were clearly positive, so it's best to take the filter out!

Enable ICU Folding

  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • A few tokens were removed that consist only of modifier characters and others that usually don't appear alone.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Variants in various scripts are normalized.
    • Specific to Bengali, the virama (which suppresses the inherent vowel) and the nukta/nuqta (which indicates that a character has a different pronunciation not otherwise found in the script) are both stripped.
      • I was worried these should not be removed (so we'd need ICU folding exceptions), but Aisha was happy with the results. Apparently virama is not always written/typed, and nukta is somewhat esoteric/academic and not commonly used.
      • Virama and nukta in other Indic scripts—Devanagari, Tamil, Telugu—are also stripped, but that's less of a concern.
      • On a side note—I was talking to Mike about the nukta and he pointed out that Hebrew has something similar called a geresh—though an apostrophe can also be used. Other languages have similar diacritics or other marks, and I realized that in English, our nukta/geresh–equivalent is the letter h. Need something like a letter but not exactly that letter? Add an h! Some are so standard we don't even notice them anymore: ch, sh, th, sometimes gh; others are historical but don't sound different anymore: ph, wh for most speakers, and sometimes gh; some are common enough but not 100% standardized: zh, kh; and some are only regularly used in transliteration of certain languages: bh, dh, jh. Kinda neat!

Since I didn't have to do anything about the virama and nukta, no ICU folding exceptions are needed for Bengali.
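
For languages that do need ICU folding exceptions, the usual mechanism is a custom icu_folding filter with a Unicode set filter that excludes the protected characters. Had the Bengali virama and nukta needed protecting, it would have looked roughly like the sketch below (hypothetical; note that older versions of the analysis-icu plugin spell the parameter unicodeSetFilter).

```python
# Hypothetical example only: Bengali needs no folding exceptions, but if the
# virama (U+09CD) and nukta (U+09BC) had needed protecting, a custom
# icu_folding filter with a Unicode set filter excluding them would do it.
icu_folding_with_exceptions = {
    "filter": {
        "bengali_icu_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^\u09cd\u09bc]",  # fold everything except ্ and ়
        }
    }
}
```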

Overall Impact

  • About 5-10% of tokens were removed from the index—mostly stop words.
  • Stemming is the biggest source of words newly being indexed together.