TJones (WMF) (talkcontribs)

@Tofeiku, @Bennylin, and @Yosri: thanks for your questions and comments!

My goal is to decide whether to enable the Indonesian language analyzer for Malay wikis. It includes the stemmer and stop word list.

The stemmer isn’t perfect, but it has been enabled on the Indonesian wikis for a long time, and I think it is a lot better than nothing. You can try it out on the Indonesian Wikipedia. For example, searching for mengajar, belajar, or pembelajaran shows how the ranking differs depending on the exact form of the word used.

Do we have enough information to decide what to do?

  • (1) If using the Indonesian language analyzer is clearly better than nothing, we can deploy it. (Also note that it can be removed as easily as it is deployed. It takes one to two weeks, but it is easy to do.)
    • If we find something better in the future, we can replace it, too.
  • (2) If the Indonesian language analyzer is terrible for Malay, then we can abandon the project.
  • If the decision is not clear, we can (3) look for more people to bring into the discussion, or we can (4) set up a search demo that uses the Indonesian language analyzer on Malay Wikipedia data, or both.

What are your thoughts? Thanks!

TJones (WMF) (talkcontribs)

@Tofeiku, @Bennylin, @Yosri, and any others who are interested: My plan is to go ahead with the implementation of the Indonesian language analysis on the Malay-language wikis. The results look reasonable to me, and the feedback so far doesn’t indicate that the word groupings are so bad that they wouldn’t be useful.

If we have an opportunity to implement a better stemmer in the future, I’d support that. Also, we can edit the stop word list in the future if we need to, though the effect of stop words on our system is only to discount them, not to ignore them completely, so I’m not too worried about them.

If you have any objections to me continuing with the implementation, please let me know. Thanks!

TJones (WMF) (talkcontribs)

After some delays, this is now available on Malay-language wikis.

Tofeiku (talkcontribs)

I'm a native Malay speaker. There are some wrong base words. For example, ohon is supposed to be pohon.

TJones (WMF) (talkcontribs)

Thanks, @Tofeiku! The base forms don't have to be correct, since they are only an internal representation of the stem and the users will not see them. I show them because sometimes seeing them makes it easier to understand how the inflected forms ended up grouped together. And of course the closer they are to being correct the less likely other errors are. In that case, as long as memohon, memohonkan, pemohon, and pemohonan are all forms of the same word, the form as a stem doesn't matter very much.
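To illustrate the point, here is a minimal sketch in Python. It uses a deliberately crude affix stripper (the prefix and suffix lists are my own assumption for this example, not the Lucene stemmer's rules) to show that the four forms above still land in one group even though the shared key is the "wrong" stem ohon.

```python
from collections import defaultdict

# Hypothetical, deliberately crude affix lists; NOT the Lucene Indonesian stemmer.
PREFIXES = ("mem", "pem")
SUFFIXES = ("kan", "an")


def crude_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, with no letter restoration."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    return word


def group_by_stem(words):
    """Map each internal stem to the surface forms that share it."""
    groups = defaultdict(list)
    for w in words:
        groups[crude_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["memohon", "memohonkan", "pemohon", "pemohonan"])
print(groups)
# {'ohon': ['memohon', 'memohonkan', 'pemohon', 'pemohonan']}
```

The key ohon is never shown to a searcher; it only has to be the same for all four forms.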

Any other thoughts on the groupings?

Tofeiku (talkcontribs)

Oh ok. And "use" and "electric" are not Malay words.

TJones (WMF) (talkcontribs)

I saw electric and dielectric in the list and suspected they weren't Malay. That's expected. A rules-based stemmer isn't very good at detecting foreign words—though sometimes you can rule them out based on a foreign letter or impossible stem form, but that's subject to errors, too. I consider "stupid but understandable" errors to be tolerable; even though I don't speak Malay, I understand why electric and dielectric end up together, because di- is a common prefix, and stemmers are dumb. (I was less sure about "use" since almost any three letter CVC or VCV sequence could be a word in many languages.)
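As a rough illustration of that kind of "stupid but understandable" error, here is a toy Python rule that strips a leading di- whenever enough letters remain. The rule and its length threshold are assumptions made up for this sketch, not the actual Lucene logic.

```python
def strip_di_prefix(word: str) -> str:
    """Remove a leading 'di' if at least three letters remain (hypothetical rule)."""
    if word.startswith("di") and len(word) - 2 >= 3:
        return word[2:]
    return word


for w in ["dibaca", "dielectric", "electric"]:
    print(w, "->", strip_di_prefix(w))
# dibaca -> baca
# dielectric -> electric
# electric -> electric
```

The rule cannot tell a genuine passive form like dibaca from a foreign word that merely starts with the same two letters, so dielectric and electric end up with the same stem.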

Anyway, it's good to see how the stemmer treats foreign words and names and what kinds of errors it makes, because foreign words and names are unavoidable on our projects.

Tofeiku (talkcontribs)

I don't think there's a VCV with an "e" at the end in Malay. Sorry for the late replies. Busy celebrating Eid.

TJones (WMF) (talkcontribs)

No worries about slow replies! I guess I meant that VCV is plausible in any language you don't know a lot about. It's (linguistically) interesting that it doesn't happen in Malay.

Bennylin (talkcontribs)

About ten years ago I made an Indonesian stemmer for my workplace. I'll try to find the code and refresh my memory.

TJones (WMF) (talkcontribs)

I saw your link to Github. Your stemmer looks a lot more complex than the Lucene one, so I'm guessing it does a better job, too. For now, though, I think having the Indonesian projects and Malay projects use the same stemmer makes sense, and we could look at upgrading them both in the future if you think your stemmer could do a better job.

Bennylin (talkcontribs)

  • Addendum from my notes:

engkau dikau beliau kau begituan betapa bilamana daku gue situ ah ayo deh ding kek hei halo lho lo mari yuk sih ya yah aduh astaga duh hai eh oh ah ih ai aih asyik alhamdulillah insyaallah astagafirullah masyaallah semoga alamak sang sri kaum umat si andai andaikan biarpun untuk beserta semenjak menjelang mengenai kecuali lewat menuju menurut seantero sekeliling seputar tatkala lantas sungguhpun

  • Non-rootwords: these are not necessary, as their meaning is the same as that of their rootwords

karenanya dikarenakan olehnya padanya sesudahnya hendaknya inginkan khususnya lainnya lamanya saatnya akankah akulah amatlah andalah ataukah bagaimanakah beginikah beginilah begitukah begitulah begitupun belumlah berapakah berapalah berapapun betulkah bilakah bisakah bolehkah bolehlah bukanlah demikianlah dialah diantara diantaranya dirinya disini disinilah enggaknya entahlah hanyalah haruslah harusnya inginkah inikah itukah janganlah kalaulah kamilah kamulah kapankah kapanpun kepadanya kinilah kitalah mampukah masihkah masing merekalah mungkinkah pastilah sajalah sangatlah sayalah sebabnya sebagainya sebelumnya sekitarnya seluruhnya semaunya seolah sepantasnyalah seringnya sesuatunya siapakah siapapun sinilah sudahkah sudahlah tentulah tentunya terhadapnya tersebutlah tidakkah tidaklah

  • Non-stopwords: these are either clearly non-stopwords or have more than one meaning, only some of which are stopwords

dalam dini kecil kini nanti lama lebih hampir nyaris amat paling pantas pasti tentu percuma wong

  • Maybe non-stopwords: I need to check them again

ada adanya agak agaknya akhirnya antaranya banyak bermacam biasa biasanya boleh dekat depan diri entah hal harus kala lagi lagian lah lain macam makin mampu masih memang mungkin per pernah saat saja saling sama sangat se sebagai sebagaimana sebaliknya sebanyak sebegini sebegitu sebenarnya seberapa sebetulnya sebisanya sebuah sedang sedikit sedikitnya segala segalanya segera seharusnya sejenak sekali sekalian sekaligus sekalipun seketika sekiranya sela selagi selain selaku selalu semacam semakin semasih sementara sempat semua semuanya semula sendiri sendirinya seorang sepantasnya sering serupa sesaat sesama sesegera sesekali seseorang sesuatu setelah seterusnya setiap setidaknya sewaktu suatu supaya tadi tadinya tak tapi telah terdiri terlalu terlebih tersebut tertentu tiap | apakah bukankah bukannya dapatkah sepertinya selamanya hendaklah jangankan kiranya makanya nantinya rupanya

TJones (WMF) (talkcontribs)

Thanks, @Bennylin.

None of the words you have listed under "Addendum from my notes" are currently stop words. It is always difficult to draw a perfect line between what should and should not be a stop word. If any of these are critical, we might be able to add them, possibly at a later time. I think the biggest and most obvious impact will come from having a decent stemmer.

For "Non-rootwords", I checked: the stop words are filtered before stemming, so only exact spellings are affected, and these forms do have to be listed.
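A small Python sketch of why the order matters. The stop word sets and the one-rule stemmer here are illustrative assumptions, not the real stop list or the real stemmer.

```python
def strip_nya(token: str) -> str:
    """Toy stemmer: drop a trailing '-nya' (hypothetical stand-in for the real one)."""
    return token[:-3] if token.endswith("nya") and len(token) > 5 else token


def analyze(tokens, stop_words):
    """Filter stop words on the exact surface form first, then stem what is left."""
    kept = [t for t in tokens if t not in stop_words]
    return [strip_nya(t) for t in kept]


tokens = ["karena", "karenanya", "hujan"]

# Only the root is listed: the suffixed form slips past the stop filter.
only_root = analyze(tokens, stop_words={"karena"})
# Both spellings are listed: both are dropped.
both_listed = analyze(tokens, stop_words={"karena", "karenanya"})

print(only_root)    # ['karena', 'hujan']
print(both_listed)  # ['hujan']
```

Because the filter runs on the unstemmed token, karenanya is only removed if that exact spelling is in the list.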

For the "Non-stopwords" list, again it's hard to draw the perfect line, and if words have multiple meanings, some people prefer to let the stop word meaning take priority. After a quick look on English Wiktionary and using Google Translate (which is extremely imperfect), these don't look terrible as stop words. Also, we analyze text multiple ways for search, and stop words are only dropped from the "text" analysis. There is also a "plain" analysis which does not do any stemming or filter any stop words, and which allows for exact matches, especially for phrases. So, stop words are only "discounted" not entirely ignored like they are in some search engines.
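Here is a simplified Python sketch of the two analyses side by side. The function names, the tiny stop word set, and the one-rule stemmer are assumptions for illustration only; the real analysis chains do considerably more.

```python
STOP_WORDS = {"yang", "di"}  # illustrative subset, an assumption


def toy_stem(token: str) -> str:
    """Toy stemmer: drop a trailing '-nya' (hypothetical)."""
    return token[:-3] if token.endswith("nya") and len(token) > 5 else token


def text_analysis(s: str):
    """'text'-style analysis: drop stop words, then stem."""
    return [toy_stem(t) for t in s.lower().split() if t not in STOP_WORDS]


def plain_analysis(s: str):
    """'plain'-style analysis: lowercase only, no stemming, no stop word filter."""
    return s.lower().split()


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1))


doc = "Rumahnya yang besar"
print(text_analysis(doc))   # ['rumah', 'besar']
print(plain_analysis(doc))  # ['rumahnya', 'yang', 'besar']
print(contains_phrase(plain_analysis(doc), plain_analysis("yang besar")))  # True
```

The stop word yang disappears from the "text" tokens but is still there in the "plain" tokens, so an exact phrase containing it can still match.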

If there are potential problems with the stop words, we have a few options:

  • We can disable stop words entirely.
  • We might be able to create a custom list of stop words for Malay, but I'll have to look into how complicated it is to set up and maintain.
  • If we aren't sure what to do, I can set up a test environment with a snapshot of the Malay Wikipedia for people to try. It's extra work, but if we need it to determine whether all of this is a good idea, I can do it.

Thanks for the feedback! Any thoughts on the stemming groups?

Bennylin (talkcontribs)

I came across the list in 2014, but I'm still not sure how they got their data or what their approach was. My approach was a linguistic one. The stopwords are mostly from the Particle class (Articles, Conjunctions, Prepositions, Interjections, Phatic) and Pronouns. The purpose of my addendum is to add the remaining particles and pronouns that were not in the original list, while the non-stopwords come from other classes, such as Adjectives and Adverbs, and as such should not be stopwords.

Caveat: these words came from Indonesian KBBI dictionary.

Yosri (talkcontribs)

Listing the words is a herculean task, since prefixes and suffixes can be attached to any word to turn it into a verb, etc. For example, taking the word WEIGHT and treating it as a Malay word would give:

  1. diWEIGHT - being weight
  2. diWEIGHTkan - given more weight/consideration
  3. pemWEIGHTan - more weight/consideration
  4. pemWEIGHT - the weight/cause became weight
  5. memWEIGHTkan -
  6. WEIGHTnya

In addition to above example, you can refer to:-

I am off grid for a week. Contact me later if you need more clarifications. Yosri (talk) 02:37, 14 June 2018 (UTC)

TJones (WMF) (talkcontribs)

Thanks for the link to the paper, @Yosri. I'm not quite sure what you are trying to say about listing words. We don't have to list all the words specifically. Instead there are rules that do a pretty good job of removing prefixes and suffixes (and in some cases restoring the likely original letter, such as removing "meny-" and restoring "s" at the beginning of the word).
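A minimal Python sketch of one such rule, to show what "restoring the likely original letter" means. This single rule and its length check are assumptions written for this example; the real stemmer has many more rules and conditions.

```python
def restore_meny(word: str) -> str:
    """Replace a leading 'meny' with 's' (hypothetical single rule)."""
    if word.startswith("meny") and len(word) > 6:
        return "s" + word[4:]
    return word


for w in ["menyapu", "menyewa", "menyanyi", "makan"]:
    print(w, "->", restore_meny(w))
# menyapu -> sapu
# menyewa -> sewa
# menyanyi -> sanyi
# makan -> makan
```

The guess is right for menyapu (sapu) and menyewa (sewa), but wrong for menyanyi, whose root is nyanyi. That is the sense in which the restored letter is only the likely one.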

The on-wiki search engine, CirrusSearch, is built on Elasticsearch, which is built on Lucene. Lucene provides the Indonesian stemmer that I am hoping to use. The stemmer is based on this paper (and implemented in Java, available on GitHub). The goal of that paper is to provide a stemmer that is only based on rules and doesn't need a big dictionary. That approach sacrifices some accuracy for speed and ease of implementation. The hope is that the stemmer, even if it isn't perfect, improves searching by providing many more helpful results than results with errors.
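For anyone curious what this looks like at the Elasticsearch level, here is a sketch of index settings, written as a Python dict, that rebuild the built-in indonesian analyzer from its parts in the way the Elasticsearch documentation describes. The analyzer and filter names are just labels, the optional keyword_marker filter is left out, and whether CirrusSearch configures things exactly this way is an assumption on my part.

```python
import json

# Sketch of Elasticsearch index settings; names are illustrative labels.
settings = {
    "analysis": {
        "filter": {
            "indonesian_stop": {"type": "stop", "stopwords": "_indonesian_"},
            "indonesian_stemmer": {"type": "stemmer", "language": "indonesian"},
        },
        "analyzer": {
            "rebuilt_indonesian": {
                "tokenizer": "standard",
                # Order matters: stop words are removed before stemming.
                "filter": ["lowercase", "indonesian_stop", "indonesian_stemmer"],
            }
        },
    }
}

chain = settings["analysis"]["analyzer"]["rebuilt_indonesian"]["filter"]
print(json.dumps(settings, indent=2))
```

Swapping the stemmer or the stop word list later would mean changing one entry in this chain rather than rebuilding everything.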

Also, the examples I've given aren't supposed to include every version of a word that could exist. They are based on words that occur in a sample of Wikipedia articles and Wiktionary entries, which helps give us an idea of what will happen most often with real text. For example, some stemming errors may only happen to words that are very rare and so we can worry about them less.

Please let me know if I've missed something!

