Topic on User talk:TJones (WMF)/Notes/Analysis of Applying Indonesian Analysis Chain to Malay

Bennylin (talkcontribs)
  • Addendum from my notes:

engkau dikau beliau kau begituan betapa bilamana daku gue situ ah ayo deh ding kek hei halo lho lo mari yuk sih ya yah aduh astaga duh hai eh oh ah ih ai aih asyik alhamdulillah insyaallah astagafirullah masyaallah semoga alamak sang sri kaum umat si andai andaikan biarpun untuk beserta semenjak menjelang mengenai kecuali lewat menuju menurut seantero sekeliling seputar tatkala lantas sungguhpun

  • Non-rootwords: these are not necessary, as their meaning is the same as that of their root words

karenanya dikarenakan olehnya padanya sesudahnya hendaknya inginkan khususnya lainnya lamanya saatnya akankah akulah amatlah andalah ataukah bagaimanakah beginikah beginilah begitukah begitulah begitupun belumlah berapakah berapalah berapapun betulkah bilakah bisakah bolehkah bolehlah bukanlah demikianlah dialah diantara diantaranya dirinya disini disinilah enggaknya entahlah hanyalah haruslah harusnya inginkah inikah itukah janganlah kalaulah kamilah kamulah kapankah kapanpun kepadanya kinilah kitalah mampukah masihkah masing merekalah mungkinkah pastilah sajalah sangatlah sayalah sebabnya sebagainya sebelumnya sekitarnya seluruhnya semaunya seolah sepantasnyalah seringnya sesuatunya siapakah siapapun sinilah sudahkah sudahlah tentulah tentunya terhadapnya tersebutlah tidakkah tidaklah

  • Non-stopwords: these are either clearly non-stopwords or have more than one meaning, of which some are stopword senses and some are not

dalam dini kecil kini nanti lama lebih hampir nyaris amat paling pantas pasti tentu percuma wong

  • Maybe non-stopwords: I need to check them again

ada adanya agak agaknya akhirnya antaranya banyak bermacam biasa biasanya boleh dekat depan diri entah hal harus kala lagi lagian lah lain macam makin mampu masih memang mungkin per pernah saat saja saling sama sangat se sebagai sebagaimana sebaliknya sebanyak sebegini sebegitu sebenarnya seberapa sebetulnya sebisanya sebuah sedang sedikit sedikitnya segala segalanya segera seharusnya sejenak sekali sekalian sekaligus sekalipun seketika sekiranya sela selagi selain selaku selalu semacam semakin semasih sementara sempat semua semuanya semula sendiri sendirinya seorang sepantasnya sering serupa sesaat sesama sesegera sesekali seseorang sesuatu setelah seterusnya setiap setidaknya sewaktu suatu supaya tadi tadinya tak tapi telah terdiri terlalu terlebih tersebut tertentu tiap | apakah bukankah bukannya dapatkah sepertinya selamanya hendaklah jangankan kiranya makanya nantinya rupanya

TJones (WMF) (talkcontribs)

Thanks, @Bennylin.

None of the words you have listed under "Addendum from my notes" are currently stop words. It is always difficult to draw a perfect line between what should and should not be a stop word. If any of these are critical, we might be able to add them, possibly at a later time. I think the biggest and most obvious impact will come from having a decent stemmer.

For "Non-rootwords", I checked: stop words are filtered before stemming, so only exact spellings are affected, which means these do have to be listed.
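
To illustrate why the order matters, here is a minimal sketch. The stop list and the suffix stripper are toy assumptions made up for this example; they are not the actual Indonesian stop list or stemmer.

```python
# Toy stop list and suffix stripper. Both are illustrative assumptions,
# not the real Indonesian stop list or stemmer.
STOP_WORDS = {"bukan", "sudah"}
SUFFIXES = ("lah", "kah", "nya", "pun")


def toy_stem(token):
    """Strip one known suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stop_then_stem(tokens):
    """Filter stop words first, then stem (the order described above)."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def stem_then_stop(tokens):
    """Stem first, then filter stop words (the alternative order)."""
    stems = (toy_stem(t) for t in tokens)
    return [s for s in stems if s not in STOP_WORDS]


tokens = ["bukan", "bukanlah", "sudahkah", "rumah"]
print(stop_then_stem(tokens))
print(stem_then_stop(tokens))
```

With filtering before stemming, "bukanlah" and "sudahkah" survive the stop word filter because they are not exact matches, and only afterwards get reduced to their roots. That is why the suffixed forms need their own entries in the list.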

For the "Non-stopwords" list, again it's hard to draw the perfect line, and if words have multiple meanings, some people prefer to let the stop word meaning take priority. After a quick look on English Wiktionary and using Google Translate (which is extremely imperfect), these don't look terrible as stop words. Also, we analyze text multiple ways for search, and stop words are only dropped from the "text" analysis. There is also a "plain" analysis, which does not do any stemming or filter any stop words, and which allows for exact matches, especially for phrases. So stop words are only "discounted", not entirely ignored as they are in some search engines.
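
A rough sketch of the two analyses side by side follows. The function names, the stop list, and the example sentence are all hypothetical, and stemming is left out of the "text" side for brevity.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"yang", "di"}


def plain_analyze(text):
    """'Plain' analysis: lowercase and split; no stemming, no stop words."""
    return text.lower().split()


def text_analyze(text):
    """'Text' analysis: same tokens minus stop words (stemming omitted)."""
    return [t for t in plain_analyze(text) if t not in STOP_WORDS]


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


doc = "Rumah yang besar"
plain_tokens = plain_analyze(doc)
text_tokens = text_analyze(doc)

# The stop word "yang" is gone from the "text" tokens but still present
# in the "plain" tokens, so an exact phrase can still be matched there.
print(text_tokens)
print(contains_phrase(plain_tokens, plain_analyze("yang besar")))
```

In this sketch the phrase "yang besar" still matches exactly against the plain tokens, even though "yang" has been dropped from the text tokens.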

If there are potential problems with the stop words, we have a few options:

  • We can disable stop words entirely.
  • We might be able to create a custom list of stop words for Malay, but I'll have to look into how complicated it is to set up and maintain.
  • If we aren't sure what to do, I can set up a test environment with a snapshot of the Malay Wikipedia for people to try. It's extra work, but if we need it to determine whether all of this is a good idea, I can do it.

Thanks for the feedback! Any thoughts on the stemming groups?

Bennylin (talkcontribs)

I came across the list in 2014, but I'm still not sure how they got their data or what their approach was. My approach was a linguistic one. The stopwords are mostly from the Particle class (Articles, Conjunctions, Prepositions, Interjections, Phatic) and Pronouns. The purpose of my addendum is to add the remaining particles and pronouns that were not in the original list. The non-stopwords, by contrast, come from other classes, such as Adjectives and Adverbs, and as such should not be stopwords.

Caveat: these words come from the Indonesian KBBI dictionary.
