Topic on Help talk:Extension:AdvancedSearch

Translating "stemming"

Amire80 (talkcontribs)

I was translating this extension in translatewiki and ran into the message about stemming. It gives an example from English, with "car" and "cars".

How should this be translated? It would be nice to give an example in the same language into which the message is translated, but not all languages have stemming support and it's bad to mislead users by giving them an example that doesn't work. In fact, it would be nice to say explicitly that this is not supported for these languages or not to mention this at all.

Is there a list of languages for which stemming is supported? I couldn't find it on Help:CirrusSearch, but that's a pretty long page and maybe I missed it :)

(Tagging also @TJones (WMF), who may know something about it.)

Nikerabbit (talkcontribs)
Amire80 (talkcontribs)

Yeah, but I'm not sure that all of these are actually supported in Wikimedia code. @TJones (WMF) probably knows.

TJones (WMF) (talkcontribs)

As far as I know, there's no canonical list of stemmers in use. There's not even a canonical list of analyzers in use—note that not all analyzers have stemmers.

The ones from the Elasticsearch page that are used on the usual Wiki projects include: Arabic, Armenian, Basque, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish.

Brazilian Portuguese is not used on a typical Wiki project (Wikipedia, Wiktionary, etc), but is used on br.wikimedia.org, which is the Wikimedia Users Group of Brazil.

The Persian, Thai, and CJK analyzers don't have stemmers. CJK is used for Japanese- and Korean-language projects, but not for Chinese.

Chinese-language projects use a custom analysis chain that pulls together a few Elasticsearch plugins, but no stemmer.

Polish, Ukrainian, and Hebrew use third-party plugins that do feature stemming. Others are in the works: I'm working on Serbian now, and expect to apply that to Croatian, Serbo-Croatian, and Bosnian projects next quarter; others will hopefully follow.

In terms of examples, it wouldn't hurt to verify that specific examples actually work. The stemmers sometimes don't do what you'd expect them to for various reasons. You can test them on the Wikipedia in the appropriate language by searching for term -"term" (so, for example, car -"car"); this will return stemmed matches but not exact matches. The bold terms in the snippets are stemmed matches, so you can see that car, cars, Carly, Carment, Caral, and Carrer all stem to the same thing. (You can remove more common terms from the results by negating the terms in quotes, like car -"car" -"cars" -"carly", to see more of the rarer words.) I don't suggest sharing all the unexpected ones in a short example, though -ly, -ment, -al, and -er are perfectly fine regular suffixes! As one counterexample, walk and walker do not stem together, presumably because Walker is a common name.
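
If you want to check several candidate words, the term -"term" trick can be scripted. Below is a minimal sketch in Python that only builds the query string and a search URL; it does not send any request. The helper names stem_probe_query and search_api_url are made up for this illustration, and I'm assuming the standard MediaWiki Action API search module (action=query, list=search, srsearch) on the Wikipedia for the language in question.

```python
from urllib.parse import urlencode


def stem_probe_query(term, exclude=()):
    """Build a query that keeps stemmed matches of `term` but drops exact forms.

    `term` itself is always negated in quotes; `exclude` lists further
    exact forms (e.g. common variants) to negate as well.
    """
    exact_forms = [term, *exclude]
    negations = " ".join(f'-"{form}"' for form in exact_forms)
    return f"{term} {negations}"


def search_api_url(lang, query, limit=20):
    """Return a search API URL for the `lang` Wikipedia (hypothetical helper).

    Only the URL is constructed here; fetching it is left to the caller.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


query = stem_probe_query("car", exclude=["cars", "carly"])
print(query)  # car -"car" -"cars" -"carly"
print(search_api_url("en", query))
```

Pasting the printed query into the search box of the relevant wiki works just as well; the URL is only a convenience if you want to go through a list of words. Whichever way you run it, you still need to read the highlighted terms in the result snippets to see which forms actually stem together.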

A quick side note: when trying to come up with specific examples in languages you aren't familiar with, Wiktionary is a big help. I look up a common word ("dog" is translated into tons of languages) and then the Wiktionary entry for that word often has a declension/conjugation table with several variant forms you can try out.

Let me know if I can help with anything else!

Lea Voget (WMDE) (talkcontribs)

Thank you for looking into this! I created a ticket so we can look into improving the info messages to reflect the fact that stemming is not applied for all languages.
