Topic on User talk:TJones (WMF)/Notes/Analysis of Applying Indonesian Analysis Chain to Malay

Yosri (talk | contribs)

Listing the words is a herculean task, since prefixes and suffixes can be attached to almost any word to turn it into a verb, a noun, etc. For example, taking the word WEIGHT and treating it as a Malay word would give:

  1. diWEIGHT - being weight
  2. diWEIGHTkan - given more weight/consideration
  3. pemWEIGHTan - more weight/consideration
  4. pemWEIGHT - the weight/cause became weight
  5. memWEIGHTkan -
  6. WEIGHTnya

In addition to the examples above, you can refer to: http://repository.um.edu.my/25530/1/ExhaustiveAffixStripping.pdf

I am off the grid for a week. Contact me later if you need more clarification. Yosri (talk) 02:37, 14 June 2018 (UTC)

TJones (WMF) (talk | contribs)

Thanks for the link to the paper, @Yosri. I'm not quite sure what you are trying to say about listing words. We don't have to list all the words specifically. Instead there are rules that do a pretty good job of removing prefixes and suffixes (and in some cases restoring the likely original letter, such as removing "meny-" and restoring "s" at the beginning of the word).
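To make the idea of rule-based stripping concrete, here is a minimal Python sketch. The rule lists, the minimum stem length, and the function names (`strip_suffix`, `strip_prefix`, `toy_stem`) are all assumptions made up for illustration; they are not the rules used by the Lucene Indonesian stemmer, which are more careful.

```python
# Toy rule-based affix stripper. Every rule below is an illustrative
# assumption, NOT the rule set of Lucene's Indonesian stemmer.

# (prefix, letter to restore) pairs, checked in order, longest first.
PREFIX_RULES = [
    ("meny", "s"),  # menyapu -> sapu (restore the likely original "s")
    ("meng", ""),
    ("mem", ""),    # membaca -> baca
    ("men", ""),
    ("me", ""),
    ("di", ""),     # dibaca -> baca
]

# Suffixes, checked in order.
SUFFIXES = ["kan", "nya", "an", "i"]

# Hypothetical guard: never leave a stem shorter than this.
MIN_STEM = 3


def strip_suffix(word):
    """Remove the first matching suffix that leaves a long-enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


def strip_prefix(word):
    """Remove the first matching prefix, restoring a letter if the rule says so."""
    for prefix, restore in PREFIX_RULES:
        if word.startswith(prefix):
            candidate = restore + word[len(prefix):]
            if len(candidate) >= MIN_STEM:
                return candidate
    return word


def toy_stem(word):
    """Lowercase, strip one suffix, then strip one prefix."""
    return strip_prefix(strip_suffix(word.lower()))


for w in ["menyapu", "dibaca", "membacakan", "diWEIGHTkan", "WEIGHTnya"]:
    print(w, "->", toy_stem(w))
```

No word list is needed: the same handful of rules applies to any input, including a made-up form like "diWEIGHTkan". The cost is that rules this blunt also make mistakes; for instance, this sketch reduces "makan" to "mak" because it looks like it ends in the suffix "-an".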

The on-wiki search engine, CirrusSearch, is built on Elasticsearch, which is built on Lucene. Lucene provides the Indonesian stemmer that I am hoping to use. The stemmer is based on this paper (and implemented in Java, available on GitHub). The goal of that paper is to provide a stemmer that is only based on rules and doesn't need a big dictionary. That approach sacrifices some accuracy for speed and ease of implementation. The hope is that the stemmer improves searching by providing many more helpful results than results with errors, even if it isn't perfect.
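For readers who have not seen how a Lucene stemmer is exposed through Elasticsearch, the following Python sketch builds index settings for a custom analyzer that ends in the Indonesian stemmer. It mirrors the general shape of Elasticsearch's documented language analyzers; the analyzer and filter names (`sketch_indonesian`, `indonesian_stop`, `indonesian_stemmer`) are made up here, and this is not CirrusSearch's actual configuration.

```python
import json

# Sketch of Elasticsearch index settings that chain a tokenizer, lowercasing,
# Indonesian stop words, and the Indonesian stemmer. Names are hypothetical;
# this is NOT the configuration CirrusSearch uses.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "indonesian_stop": {
                    "type": "stop",
                    "stopwords": "_indonesian_",  # built-in stop word list
                },
                "indonesian_stemmer": {
                    "type": "stemmer",
                    "language": "indonesian",  # backed by Lucene's stemmer
                },
            },
            "analyzer": {
                "sketch_indonesian": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "indonesian_stop",
                        "indonesian_stemmer",
                    ],
                }
            },
        }
    }
}

# The JSON body that would be sent when creating an index.
print(json.dumps(settings, indent=2))
```

Applying the same chain to Malay text is the experiment under discussion: the filters stay the same and only the language of the input changes.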

Also, the examples I've given aren't supposed to include every version of a word that could exist. They are based on words that occur in a sample of Wikipedia articles and Wiktionary entries, which helps give us an idea of what will happen most often with real text. For example, some stemming errors may only happen to words that are very rare and so we can worry about them less.
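The kind of sample-based check described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual analysis tooling: the helper names (`group_by_stem`, `report`), the tiny token list, and the stand-in stemmer `strip_nya` are all assumptions.

```python
from collections import Counter, defaultdict


def group_by_stem(tokens, stem):
    """Group the distinct word forms in a sample by the stem they map to,
    keeping a count of how often each form occurs."""
    counts = Counter(t.lower() for t in tokens)
    groups = defaultdict(Counter)
    for word, n in counts.items():
        groups[stem(word)][word] = n
    return groups


def report(groups):
    """Return (stem, total occurrences, sorted forms), most frequent first."""
    rows = sorted(groups.items(), key=lambda kv: -sum(kv[1].values()))
    return [(s, sum(forms.values()), sorted(forms)) for s, forms in rows]


def strip_nya(word):
    """Hypothetical stand-in stemmer: only removes a trailing '-nya'."""
    return word[:-3] if word.endswith("nya") and len(word) > 5 else word


# Made-up sample; a real check would use tokens from actual articles.
sample = ["buku", "bukunya", "Buku", "rumah"]
for row in report(group_by_stem(sample, strip_nya)):
    print(row)
```

Sorting by total occurrences puts the most common groups first, so a bad grouping that affects frequent words is easy to spot, while errors confined to rare words sink to the bottom of the list.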

Please let me know if I've missed something!
