User:TJones (WMF)/Notes/Analysis of Applying Indonesian Analysis Chain to Malay

June 2018 — See TJones_(WMF)/Notes for other projects. See also T196780.

Background
After working on Serbian (T178926/T192395) and Slovak (T178929) and looking at the papers they were based on or translated from, I decided to reconsider what counts as "implementable" for Malay, review the papers on Malay stemming, and compare them to the existing Indonesian analysis.

My understanding of Indonesian and Malay was pretty simple: that they are "more distinct than American and British English, but less distinct than Spanish and Portuguese". Also, Malay and Indonesian didn't interact in my investigation into fallback languages, where each is used as a fallback language for other languages.

However, looking at the wiki page on the matter, and reviewing some other sources, it seems that a lot of the difference is in Dutch-influenced vs English-influenced spelling of certain sounds, Dutch vs English loanwords, other vocabulary differences, and some pronunciation differences—all of which can decrease mutual intelligibility—but the grammar of the two standard forms seems to be essentially the same.

I also compared the Malay stemmer papers with the Lucene Indonesian stemmer implementation, and verified that they are working on similar affixes. There are some discrepancies, but the core affixes are the same, and the differences seem to come down to what affixes to try to account for (some derivational vs inflectional).

While it's possible that spelling differences or vocabulary differences could increase the error rate for Malay vs Indonesian, it seems to be worth testing; if it is successful, all we need to do is configure it. Everything is not only already built, it's already installed, too!

Data
I pulled 10,000 random Malay Wikipedia articles and 1,000 random Malay Wiktionary entries (there are only ~3,600), and did my usual stripping of markup and deduplication of individual lines (to get rid of excess copies of the equivalent of commonly used headings like "References", "See Also", "Noun", language names, etc.).
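
The line-level deduplication step can be sketched roughly as below. This is only an illustration, not the script I actually used: the function name `dedupe_lines` and the sample lines are made up, and markup stripping is assumed to have happened already.

```python
def dedupe_lines(lines):
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical sample: repeated headings collapse to a single copy.
sample = [
    "Rujukan",
    "Kuala Lumpur ialah ibu negara Malaysia.",
    "Rujukan",
    "",
    "Lihat juga",
    "Rujukan",
]
print(dedupe_lines(sample))
```

Deduplicating whole lines rather than whole articles is what keeps boilerplate headings from inflating the token counts.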

Baseline vs Indonesian Analysis Chain
The current default analysis chain (used for Malay) uses the standard tokenizer and the ICU Normalizer filter. The default/monolithic Indonesian analysis chain uses the same tokenizer, but only does lowercase normalization.

The Indonesian analysis chain also uses a stopword list (see "Appendix: Stopwords" below), which removes a significant number of tokens. The Wikipedia corpus has 1,118,865 tokens with the current default analysis chain (using no stopwords), but only 847,280 tokens with the Indonesian analysis chain; that is, 24.3% fewer tokens after stopwords are removed. (Note that stopwords are still indexed in the plain field, so ignoring stopwords in the text field generally improves recall while still allowing for exact matches for phrases in the plain field.)

The stopword list has a smaller impact on the Wiktionary corpus. The default analysis chain gave 64,613 tokens, while the Indonesian analysis chain gave 59,370 (an 8.1% decrease).
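
As a quick check, the two percentages above can be recomputed from the token counts. The helper name `percent_drop` is just for illustration.

```python
def percent_drop(before, after):
    """Percentage decrease from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


wikipedia = percent_drop(1_118_865, 847_280)
wiktionary = percent_drop(64_613, 59_370)
print(wikipedia, wiktionary)  # 24.3 8.1
```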

Stemming affects 14.6% of types (unique words) and 15.0% of tokens (individual instances of words) in the Wikipedia corpus—that is ~15% of words are merged with at least one other word as the result of stemming. For Wiktionary, it's 15.5% of types but only 6.4% of tokens.
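
To make the type/token distinction concrete, here is a small sketch of how such percentages could be computed. It is not my actual tooling: the function `stemming_impact` and the hand-written stem table are assumptions, and the toy counts are taken (case-folded) from a few of the random groups further down this page.

```python
from collections import Counter, defaultdict


def stemming_impact(token_counts, stem_of):
    """Return (share of types, share of tokens) merged with another type by stemming.

    token_counts: Counter mapping each lowercased token to its frequency.
    stem_of: function mapping a token to its stem.
    """
    groups = defaultdict(list)
    for token in token_counts:
        groups[stem_of(token)].append(token)
    # A type is "affected" if its stem group contains at least one other type.
    merged = [t for members in groups.values() if len(members) > 1 for t in members]
    type_share = len(merged) / len(token_counts)
    token_share = sum(token_counts[t] for t in merged) / sum(token_counts.values())
    return type_share, token_share


# Toy data with a hand-written stem table standing in for the real stemmer.
counts = Counter({"lembu": 72, "lembunya": 1, "juri": 43, "jurinya": 1,
                  "optimum": 4, "canggih": 24})
toy_stems = {"lembunya": "lembu", "jurinya": "juri"}
types, tokens = stemming_impact(counts, lambda t: toy_stems.get(t, t))
print(f"{types:.1%} of types, {tokens:.1%} of tokens")
```

A handful of very frequent merged words can push the token share well above the type share, and a long tail of rare merged words does the opposite, which is one plausible reading of the Wiktionary numbers.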

The difference in normalization (ICU normalization vs lowercase) results in some of the typical ICU normalizations being lost when switching to the Indonesian analysis chain, including regularization of some characters to standard forms (like Greek ς vs σ, German ß vs ss, some IPA characters, etc.); removal of bi-directional markers, soft hyphens, zero width spaces, and other space characters; conversion of full-width characters, etc.

For the Wikipedia corpus, 0.076% of types and 0.005% of tokens were "split" (that is, the normalization group they were in was split). For Wiktionary, 0.367% of types and 0.076% of tokens. So, overall the effect is small.
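
The idea of "splits" (and their opposite, "collisions") can be sketched as a comparison of the groups produced by two analyzers. The code below is an illustration under stated assumptions, not my analysis tool: `groups` and `compare` are hypothetical helpers, and Python's `casefold()` plus soft-hyphen removal is only a rough stand-in for ICU normalization.

```python
from collections import defaultdict


def groups(tokens, analyze):
    """Map each token to the frozenset of tokens sharing its analyzed form."""
    by_form = defaultdict(set)
    for tok in tokens:
        by_form[analyze(tok)].add(tok)
    return {tok: frozenset(members) for members in by_form.values() for tok in members}


def compare(tokens, old_analyze, new_analyze):
    """Return (collided, split): types that gained or lost group-mates."""
    old, new = groups(tokens, old_analyze), groups(tokens, new_analyze)
    collided = {t for t in tokens if new[t] - old[t]}
    split = {t for t in tokens if old[t] - new[t]}
    return collided, split


def icu_like(token):
    """Rough stand-in for ICU normalization (an assumption, not the real filter)."""
    return token.casefold().replace("\u00ad", "")


tokens = ["strasse", "straße", "dilakukan", "di\u00adlakukan"]
collided, split = compare(tokens, icu_like, str.lower)
print(sorted(split))
```

With plain lowercasing as the new analyzer, every token in this toy list loses its group-mate, so all four count as split and none as collided.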

However, since we're here, it makes sense to unpack the Indonesian analysis chain and re-enable the ICU normalization so we don't have any normalization regressions for Malay. It also makes sense to test the same config for Indonesian, to improve normalization there.

As usual, the analysis chain applies to things it probably shouldn't, including names, foreign words, and URLs. Some of these will result in unexpected but understandable collisions, while others will have no effect because the oddly stemmed form doesn't match anything else.

Indonesian vs Unpacked Indonesian Analysis Chain
I unpacked the Indonesian analysis chain and disabled the lowercase-to-ICU-normalization upgrade, and there were no differences in the analysis of either corpus, so the unpacking was done correctly!

Re-enabling the ICU normalization upgrade showed the usual changes, as above. The overall impact is very small: on the Wikipedia corpus new collisions—all expected—affected 0.432% of types, 0.012% of tokens.

There were some unexpected (but not unfamiliar!) splits, affecting 0.008% of types and 0.001% of tokens. They are the problems we've seen before with dotted-I (İ) not being handled properly, and as such İstanbul and Istanbul no longer index together. I added a character filter to fix that.
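
The dotted-I problem can be reproduced outside the search stack. The sketch below uses Python's Unicode case folding as a rough stand-in for ICU normalization (an assumption; the two are not identical), and `fold_with_dotted_i_fix` is a hypothetical name for the effect of the character filter.

```python
import unicodedata


def fold(text):
    """Rough stand-in for ICU folding: full Unicode case folding, then NFKC."""
    return unicodedata.normalize("NFKC", text.casefold())


def fold_with_dotted_i_fix(text):
    """Map İ (U+0130) to plain I before folding, as a character filter would."""
    return fold(text.replace("\u0130", "I"))


# Without the fix, İ folds to "i" plus a combining dot above (U+0307).
print(fold("İstanbul") == fold("Istanbul"))  # False
print(fold_with_dotted_i_fix("İstanbul") == fold_with_dotted_i_fix("Istanbul"))  # True
```

Doing the mapping before tokenization and folding means no later step ever sees the combining dot.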

Test on Indonesian Data
Since the unpacked analysis chain with the ICU normalization seems to offer the best of all worlds, I ran a quick test on 5K Indonesian Wikipedia articles and 5K Indonesian Wiktionary articles, comparing the current Indonesian monolithic analysis to the unpacked Indonesian analysis chain with ICU normalization.

There were no new splits, and only the Indonesian Wikipedia articles had any new collisions, all of which were expected (and good!) ICU normalizations: removal of bi-directional marks, normalization to standard forms of letters, etc.

Despite the lack of collisions, about 150 tokens were changed; they were still unique after normalization. However, these changes increase search recall for words with "unusual" characters.

Baseline vs Unpacked Indonesian Analysis Chain
So, the unpacked, ICU-normalized, dotted-I folding Indonesian analysis chain does pretty much what we want and expect, so it's time for review!
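
For reference, an unpacked chain along these lines could look roughly like the Elasticsearch settings below, written here as a Python dict. This is a sketch, not the config that was committed: the names `text`, `dotted_I_fix`, `indonesian_stop`, and `indonesian_stemmer` are assumptions, the filter list follows the general pattern for rebuilding the built-in `indonesian` analyzer, and `icu_normalizer` assumes the ICU analysis plugin is installed.

```python
import json

settings = {
    "analysis": {
        "char_filter": {
            # Map dotted capital I (U+0130) to plain I before tokenizing.
            "dotted_I_fix": {"type": "mapping", "mappings": ["\u0130=>I"]},
        },
        "filter": {
            "indonesian_stop": {"type": "stop", "stopwords": "_indonesian_"},
            "indonesian_stemmer": {"type": "stemmer", "language": "indonesian"},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["dotted_I_fix"],
                "tokenizer": "standard",
                # icu_normalizer takes the place of the plain lowercase filter.
                "filter": ["icu_normalizer", "indonesian_stop", "indonesian_stemmer"],
            },
        },
    },
}

print(json.dumps(settings, indent=2, ensure_ascii=False))
```

The same settings would be used for both Malay and Indonesian, since nothing in the chain is specific to one of the two.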

Stemming Groups for Review
Below are some stemming groupings for review by speakers of Malay. These are tokens that would be indexed together, so searching for one would find the others. The format is <stem> - [<count> <token>][<count> <token>].... The <stem> is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the tokens being considered. The <token> is a token found by the language analyzer (more or less a word) and <count> is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters. Some errors are expected, because language is messy.
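
For anyone who wants to process the groups programmatically, a minimal parser for the lines as they appear in the lists on this page (stem, colon, then bracketed count and token pairs) might look like this. The function name `parse_group` and the regular expressions are my own illustration.

```python
import re

GROUP_RE = re.compile(r"^\s*\*?\s*(?P<stem>[^:]+):\s*(?P<items>.*)$")
ITEM_RE = re.compile(r"\[(\d+) ([^\]]+)\]")


def parse_group(line):
    """Parse 'stem: [count token][count token]...' into (stem, {token: count})."""
    match = GROUP_RE.match(line)
    if match is None:
        raise ValueError(f"not a stemming group: {line!r}")
    items = {token: int(count) for count, token in ITEM_RE.findall(match["items"])}
    return match["stem"].strip(), items


stem, items = parse_group(" * bengal: [15 Bengal][7 Bengali]")
print(stem, items, sum(items.values()))  # bengal {'Bengal': 15, 'Bengali': 7} 22
```

Summing the counts per group gives a quick sense of how much traffic a questionable grouping could affect.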

Large Groups
None of the groups are particularly large, given the number of affixes there are. Nonetheless, the largest groups from the Wikipedia (>20, ignoring case) and Wiktionary (>10, ignoring case) corpora are below.

Wikipedia:


 * dapat: [3 Dapatkah][4 Didapati][1 MENDAPAT][15 Mendapat][2 Mendapati][4 Pendapat][3 Pendapatan][673 Terdapat][2 Terdapatnya][2 dapatan][16 dapati][1 dapatlah][1 dapatnya][2 didapat][194 didapati][1 didapatinya][2 didapatkan][1 medapatkan][1 memdapat][516 mendapat][92 mendapati][193 mendapatkan][1 mendapatkannya][1 mendapatnya][2 men­dapatkan][59 pendapat][57 pendapatan][1 pendapatmu][2 pendapatnya][747 terdapat][15 terdapatnya]
 * ingat: [1 Beringat][1 Diingatkan][9 Ingatan][2 Ingatkan][2 Ingatlah][3 Peringat][15 Peringatan][1 diingat][8 diingati][2 diingatkan][1 diperingati][11 ingat][35 ingatan][3 ingatannya][1 ingatkah][7 ingatlah][19 memperingati][1 memperingatinya][1 memperingatkan][3 mengingat][7 mengingati][9 mengingatkan][1 mengingatkannya][29 peringatan][1 peringatannya][6 teringat][2 teringatkan]
 * ajar: [2 Ajar][12 Ajaran][1 Ajari][1 BELAJAR][3 Belajar][1 Dipelajari][2 Mengajar][46 Pelajar][1 Pelajarnya][7 Pembelajaran][1 Pengajar][4 Pengajaran][5 ajar][44 ajaran][3 ajarannya][1 ajarkan][132 belajar][19 diajar][5 diajarkan][11 dipelajari][2 membelajari][29 mempelajari][47 mengajar][4 mengajarkan][2 mengajarnya][1125 pelajar][8 pelajarnya][48 pembelajaran][1 pembelajarannya][13 pengajar][20 pengajaran][2 pengajarannya][1 pengajarnya]
 * cinta: [5 Bercinta][2 CINTA][279 Cinta][1 Cintailah][1 Cintakan][4 Cintaku][5 Cintamu][1 Cintanya][1 Kecintaan][2 MENCINTAI][1 Mencintai][3 Mencintaiku][3 Mencintaimu][1 Pecinta][1 Pencinta][1 Percintaan][8 bercinta][55 cinta][2 cintai][3 cintakan][4 cintanya][3 dicintai][2 dicintainya][2 kecintaan][7 mencintai][1 mencintaimu][1 mencintainya][1 pecinta][4 pencinta][1 pencintaan][9 percintaan][1 percintaannya][3 tercinta][2 tercintanya]
 * guna: [1 Berguna][4 Digunakan][5 Guna][3 Gunaan][4 Gunakan][11 Kegunaan][15 Menggunakan][28 Pengguna][52 Penggunaan][2 Penggunaannya][44 berguna][19 diguna][1866 digunakan][3 digunakannya][8 dipergunakan][35 guna][1 gunaan][12 gunakan][1 gunakanlah][1 gunakannya][2 gunanya][202 kegunaan][10 kegunaannya][6 mempergunakan][8 mengguna][958 menggunakan][24 menggunakannya][156 pengguna][234 penggunaan][14 penggunaannya][8 penggunanya][1 pergunakan]
 * laku: [9 Kelakuan][1 Kelakuannya][14 Lakukan][1 Lakukannya][1 Melakukan][1 Pelakunya][3 Perlakuan][32 berlakunya][2 diberlakukan][257 dilakukan][11 dilakukannya][3 diperlakukan][1 di­lakukan][31 kelakuan][3 kelakuannya][40 laku][28 lakukan][1 lakukannya][261 melakukan][17 melakukannya][2 memberlakukan][2 memperlakukan][1 me­lakukan][2 pelakunya][17 perlakuan][1 perlakuannya][2 perlakukan][1 perlakukannya]

Wiktionary:
 * bilang: [1 Pembilangan][6 berbilang][3 bilang][33 bilangan][4 bilangannya][1 dibilang][2 kebilangan][8 membilang][1 membilangi][2 membilangkan][1 pembilang][1 pembilangan][1 perbilangan][4 terbilang]
 * bahasa: [270 Bahasa][1 Bahasanya][169 bahasa][4 berbahasa][1 dibahasakannya][1 diperbahasakan][2 kebahasaan][2 membahasakan][1 memperbahasakan][1 perbahasa][1 perbahasaan][1 perbahasaannya]
 * lepas: [1 berlepas][4 lepas][1 lepasan][1 melepas][5 melepaskan][1 memperlepaskan][1 pelepas][2 pelepasan][1 penglepasan][1 perlepasan][1 terlepas]

Random Groups
Below are 25 random groups from each corpus.

Wikipedia:


 * ambara: [1 Pengambaraan][1 mengambara][1 pengambaraan]
 * asosiasi: [1 Asosiasi][1 asosiasi][1 diasosiasikan]
 * bebas: [23 Bebas][11 Kebebasan][11 Pembebasan][203 bebas][1 bebaskan][1 bebasnya][44 dibebaskan][52 kebebasan][1 kebebasannya][1 membebas][49 membebaskan][4 membebaskannya][27 pembebasan][1 pembebasannya][2 terbebas]
 * bengal: [15 Bengal][7 Bengali]
 * bijibenih: [1 bijibenih][1 bijibenihnya]
 * canggih: [1 Canggih][1 Kecanggihan][23 canggih][4 kecanggihan][3 tercanggih]
 * cucu: [3 Cucu][1 Cucuku][1 Cucunya][24 cucu][1 cucukan][1 cucuku][5 cucunya]
 * definisi: [7 Definisi][17 definisi][3 didefinisikan][1 mendefinisi][5 mendefinisikan]
 * dina: [3 Dina][8 Medina]
 * electric: [11 Electric][1 dielectric]
 * endali: [1 Mengendalikan][4 Pengendali][1 Pengendalian][56 mengendalikan][1 mengendalikannya][15 pengendali][22 pengendalian][2 pengendalinya]
 * gendang: [7 Gendang][2 gendang][1 pergendangan]
 * guna: [1 Berguna][4 Digunakan][5 Guna][3 Gunaan][4 Gunakan][11 Kegunaan][15 Menggunakan][28 Pengguna][52 Penggunaan][2 Penggunaannya][44 berguna][19 diguna][1866 digunakan][3 digunakannya][8 dipergunakan][35 guna][1 gunaan][12 gunakan][1 gunakanlah][1 gunakannya][2 gunanya][202 kegunaan][10 kegunaannya][6 mempergunakan][8 mengguna][958 menggunakan][24 menggunakannya][156 pengguna][234 penggunaan][14 penggunaannya][8 penggunanya][1 pergunakan]
 * inovasi: [4 Inovasi][11 inovasi][1 inovasinya]
 * jara: [13 Penjara][76 penjara][1 penjaraan][1 penjaranya]
 * juri: [8 Juri][35 juri][1 jurinya]
 * khuatir: [3 dikhuatiri][2 khuatir][1 khuatiri]
 * kuang: [18 Kuang][19 Mengkuang][1 kuang][3 mengkuang]
 * lembu: [10 Lembu][62 lembu][1 lembunya]
 * lintasi: [1 dilintasi][20 melintasi]
 * ohon: [1 Pemohon][41 memohon][3 memohonkan][2 pemohon][1 pemohonan]
 * optimum: [4 dioptimumkan][1 mengoptimumkan][4 optimum]
 * sangka: [1 Disangkakan][1 Sangkaan][9 disangka][4 disangkakan][14 menyangka][2 menyangkakan][8 sangka][1 tersangka]
 * tersedia: [4 ketersediaan][2 ketersediaannya]
 * use: [22 Meuse][4 Use][48 use]

Wiktionary:


 * anut: [1 anut][1 anutan][1 menganut][1 menganuti][2 penganut][1 penganutan]
 * asap: [2 asap][1 berasap][1 mengasap][1 mengasapi][1 pengasapan][1 perasapan]
 * asing: [5 asing][1 berasing][1 mengasingkan][1 pengasingan][1 terasing]
 * bincang: [2 dibincangkan][4 perbincangan]
 * cuci: [1 cuci][1 cucian][3 mencuci][2 pencuci]
 * dalam: [1 dalaman][1 dalamnya][2 mendalam]
 * dukung: [1 dukung][1 pendukung]
 * ena: [5 mengenai][1 mengenainya][4 mengenakan]
 * etam: [6 mengetam][1 mengetamkan][3 pengetam][2 pengetaman]
 * gulung: [2 gulung][1 gulungan][1 menggulung]
 * habis: [1 Habis][2 habis][3 menghabiskan]
 * ikan: [12 ikan][1 menikan]
 * imigrasi: [1 imigrasi][1 keimigrasian]
 * jatuh: [1 jatuh][1 menjatuhkan]
 * jebak: [1 jebak][1 menjebak][1 pejebak][1 penjebakan][1 terjebak]
 * kerangka: [1 mengkerangkai][1 terkerangka]
 * kosong: [1 Kekosongan][1 kekosongan][10 kosong]
 * lecut: [1 lecut][1 melecut]
 * lubang: [2 berlubang][9 lubang]
 * mabuk: [1 kemabukan][5 mabuk]
 * majmuk: [3 Majmuk][1 kemajmukan][18 majmuk]
 * pangkat: [2 berpangkat][5 pangkat]
 * putera: [1 kepu­teraan][1 putera]
 * rongkong: [1 kerongkong][1 kerongkongan]
 * umum: [1 Umum][1 diumumkan][1 mengumumkan][3 pengumuman][10 umum][1 umumnya]

Problem Groups
My analysis tools couldn't quite fully handle Malay grammar. Some of the prefixes and suffixes are a bit ambiguous for certain words—for example, there are both -an and -kan suffixes, which on the end of a stem ending in -k are ambiguous. The stemmer does a better job of doing the right thing than my simple process that strips the longest matching affix. It did get the number of items I had to inspect down from thousands to fewer than a hundred, though. None are actual "problems".
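
The "strip the longest matching affix" heuristic can be sketched as follows. The affix lists here are a small illustrative subset that I made up for the example, not the inventory used by my tools or by the stemmer, and the minimum stem length of 3 is likewise an assumption.

```python
# Illustrative affix lists only; not the real inventory.
PREFIXES = ["meng", "mem", "men", "me", "ber", "di", "ter", "pe"]
SUFFIXES = ["kan", "an", "i", "nya"]
MIN_STEM = 3  # assumed minimum length of what is left after stripping


def strip_longest_affixes(word):
    """Strip the longest matching prefix, then the longest matching suffix."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[: -len(suffix)]
            break
    return word


print(strip_longest_affixes("digunakan"))  # guna
print(strip_longest_affixes("masakan"))    # masa, although masak + -an is the likelier reading
```

The second example shows the -an/-kan ambiguity: greedy matching takes -kan off a stem that ends in -k, which is why the output of this kind of process only narrows down what needs manual inspection.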

Next Steps

 * Get speaker review of the stemming groups.
 * Assuming the review is positive, commit the unpacked ICU-normalizing config for Malay and Indonesian.
 * Once the config is deployed, reindex Malay- and Indonesian-language wikis with the new config.

Appendix: Stopwords
Below is a list of stopwords, which are words that are dropped from the text field. These come from the 10K Wikipedia article corpus. (Note that stopwords are still indexed in the plain field, so ignoring stopwords in the text field generally improves recall while still allowing for exact matches for phrases in the plain field.)

This first group includes the most common stop words in the Indonesian analysis chain; they each occur more than 5K times in the 10K Wikipedia corpus.


 * yang (27309), di (26216), dan (25852), dalam (10008), pada (9633), dengan (8691), merupakan (8532), sebuah (7985), ini (7040), dari (6023), sebagai (5587), oleh (5290), adalah (5139)

Next is the full list of words dropped by the Indonesian analysis chain. There may be other words in the stopword list that didn't appear in the 10K Wikipedia corpus, but there shouldn't be many.


 * ada, adalah, adanya, adapun, agak, agar, akan, akhirnya, aku, akulah, amat, amatlah, anda, antar, antara, antaranya, apa, apabila, apakah, apalagi, apatah, atau, ataupun, bagai, bagaikan, bagaimana, bagaimanakah, bagaimanapun, bagi, bahkan, bahwa, banyak, beberapa, begini, beginilah, begitu, begitulah, begitupun, belum, berapa, berapakah, bermacam, bersama, biasa, biasanya, bila, bilakah, bisa, bisakah, boleh, bolehkah, bolehlah, buat, bukan, bukanlah, bukannya, cuma, dahulu, dalam, dan, dapat, dari, daripada, dekat, demi, demikian, demikianlah, dengan, depan, di, dia, dialah, diantara, diantaranya, dini, diri, dirinya, disini, disinilah, dong, dulu, entah, hal, hampir, hanya, hanyalah, harus, haruslah, harusnya, hendak, hendaklah, hingga, ia, ialah, ibarat, ingin, inginkan, ini, inikah, inilah, itu, itulah, jangan, janganlah, jika, jikalau, juga, justru, kala, kalau, kalian, kami, kamilah, kamu, kan, kapan, karena, ke, kecil, kemudian, kenapa, kepada, kepadanya, ketika, khususnya, kini, kiranya, kita, kitalah, kok, lagi, lah, lain, lainnya, lalu, lama, lamanya, lebih, macam, maka, makanya, makin, malah, malahan, mampu, mampukah, mana, manakala, masih, masing, mau, maupun, melainkan, melalui, memang, mengapa, mereka, merekalah, merupakan, meski, meskipun, mungkin, nah, namun, nanti, nyaris, oleh, olehnya, pada, padahal, padanya, paling, pantas, para, pasti, per, percuma, pernah, pula, pun, rupanya, saat, saatnya, saja, saling, sama, sambil, sampai, sana, sangat, sangatlah, saya, se, sebab, sebabnya, sebagai, sebagaimana, sebagainya, sebaliknya, sebanyak, sebegini, sebegitu, sebelum, sebelumnya, sebenarnya, seberapa, sebetulnya, sebuah, sedang, sedangkan, sedemikian, sedikit, segala, segalanya, segera, seharusnya, sehingga, sejak, sejenak, sekali, sekalian, sekaligus, sekalipun, sekarang, seketika, sekiranya, sekitar, sekitarnya, sela, selagi, selain, selaku, selalu, selama, selamanya, seluruh, seluruhnya, semacam, semakin, sementara, sempat, semua, semuanya, semula, sendiri, sendirinya, seolah, seorang, sepanjang, seperti, sepertinya, sering, seringnya, serta, serupa, sesaat, sesama, sesekali, seseorang, sesuatu, sesudah, sesudahnya, setelah, seterusnya, setiap, setidaknya, sewaktu, siapa, siapakah, siapapun, sini, sinilah, suatu, sudah, supaya, tadi, tadinya, tak, tanpa, tapi, telah, tentang, tentu, tentulah, tentunya, terdiri, terhadap, terhadapnya, terlalu, terlebih, tersebut, tersebutlah, tertentu, tetapi, tiap, tidak, tidakkah, tidaklah, toh, wah, wahai, walau, walaupun, wong, yaitu, yakni, yang