Analysis of Applying Indonesian Analysis Chain to Malay

TJones (WMF) (talkcontribs)

@Tofeiku, @Bennylin, and @Yosri: thanks for your questions and comments!

My goal is to decide whether to enable the Indonesian language analyzer for Malay wikis. The analyzer includes a stemmer and a stop word list.
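For readers unfamiliar with what an analyzer made of a stemmer plus a stop word list looks like, here is a rough sketch. It assumes an Elasticsearch-style backend and uses that project's documented Indonesian stop filter and stemmer. The analyzer name `malay_text` and the filter names are hypothetical and chosen only for illustration. This is not the actual configuration deployed on the wikis.

```python
import json

# Sketch only: index settings that reuse Elasticsearch's built-in Indonesian
# stop word list and stemmer. "malay_text", "indonesian_stop", and
# "indonesian_stemmer" are illustrative names, not the production config.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            # Built-in Indonesian stop word list.
            "indonesian_stop": {"type": "stop", "stopwords": "_indonesian_"},
            # Built-in Indonesian stemmer.
            "indonesian_stemmer": {"type": "stemmer", "language": "indonesian"},
        },
        "analyzer": {
            "malay_text": {
                "tokenizer": "standard",
                # Order matters: lowercase first, then stop words, then stemming.
                "filter": ["lowercase", "indonesian_stop", "indonesian_stemmer"],
            }
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```

The point of the sketch is only that the stemmer and the stop word list are separate, replaceable pieces of one analysis chain.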

The stemmer isn’t perfect, but it has been enabled on the Indonesian wikis for a long time, and I think it is a lot better than nothing. You can try it out on the Indonesian Wikipedia. For example, searching for mengajar, belajar, or pembelajaran shows how the ranking differs depending on the exact form of the word used.
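If you would rather compare the three searches side by side than type them in by hand, something like the following could build the request URLs. It is a minimal sketch that assumes the standard MediaWiki Action API search module (`action=query&list=search`) on the Indonesian Wikipedia. The helper name `search_url` and the result limit of 5 are my own choices for illustration.

```python
from urllib.parse import urlencode

# Assumption: the standard MediaWiki Action API endpoint on Indonesian Wikipedia.
API_URL = "https://id.wikipedia.org/w/api.php"


def search_url(term, limit=5):
    """Build a full-text search URL for one query term (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


# The three related word forms mentioned above.
TERMS = ["mengajar", "belajar", "pembelajaran"]
URLS = {term: search_url(term) for term in TERMS}

if __name__ == "__main__":
    for term, url in URLS.items():
        print(term, url)
```

Opening each URL returns the top results as JSON, so the rankings for the three forms can be compared directly.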

Do we have enough information to decide what to do?

  • (1) If using the Indonesian language analyzer is clearly better than nothing, we can deploy it. (Also note that it can be removed as easily as it is deployed. It takes one to two weeks, but it is easy to do.)
    • If we find something better in the future, we can replace it, too.
  • (2) If the Indonesian language analyzer is terrible for Malay, then we can abandon the project.
  • If the decision is not clear, we can (3) look for more people to bring into the discussion, or we can (4) set up a search demo that uses the Indonesian language analyzer on Malay Wikipedia data, or both.

What are your thoughts? Thanks!

TJones (WMF) (talkcontribs)

@Tofeiku, @Bennylin, @Yosri, and any others who are interested: My plan is to go ahead with the implementation of the Indonesian language analysis on the Malay-language wikis. The results look reasonable to me, and the feedback so far doesn’t indicate that the word groupings are so bad that they wouldn’t be useful.

If we have an opportunity to implement a better stemmer in the future, I’d support that. We can also edit the stop word list later if we need to. In our system, stop words are only discounted, not ignored completely, so I’m not too worried about them.

If you have any objections to me continuing with the implementation, please let me know. Thanks!

TJones (WMF) (talkcontribs)

After some delays, this is now available on Malay-language wikis.

