Topic on User talk:TJones (WMF)

Followup on the link suggestion tool

Ostrzyciel (talkcontribs)

Hi! At the search team meeting you asked me why I decided to decline the titles we look for before searching, instead of relying on a stemmer. I tried a few things and I think I remember why now :D

The biggest problem is we're looking for exact matches of an inflected title, so we can't use the standard "no operator" search mode that uses stemming. A partial result or a result with two matching words separated by other words is useless in our case, so we have to use double quotes. I don't think it's possible to search the stemmed text with exact matching… or is it? To be honest I'm not much of an expert in Elastic, so I may be wrong here :)

Another problem is that we sometimes don't want certain words to be declined. For example, "Sejm Rzeczpospolitej Polskiej" (our parliament thing) keeps "Rzeczpospolitej Polskiej" fixed in its grammatical case. We would decline it as Sejm, Sejmu, Sejmowi, etc. without changing the rest of the name. This is a minor problem though, since the invalid forms that declining every word would produce are rather unlikely to occur in articles.
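To make the idea concrete, here is a minimal sketch of building exact-phrase queries where only the head noun is declined and the rest of the name stays fixed. The function name is hypothetical and the list of forms is typed in by hand from the example above; a real tool would get the forms from its own declension step.

```python
def exact_phrase_queries(head_forms, fixed_tail):
    """Return one double-quoted (exact phrase) query per inflected head form.

    head_forms: inflected forms of the first word only.
    fixed_tail: the part of the name that must not be declined.
    """
    return ['"{} {}"'.format(form, fixed_tail) for form in head_forms]


# Forms copied from the example above, not generated by a real declension engine.
queries = exact_phrase_queries(
    ["Sejm", "Sejmu", "Sejmowi"],
    "Rzeczpospolitej Polskiej",
)

for query in queries:
    print(query)
```

Each resulting string is wrapped in double quotes, so the search only matches the words in that order with nothing in between.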

So that's why :)

TJones (WMF) (talkcontribs)

Thanks for the information! There are always interesting new use cases to be discovered.

Not surprisingly, we focus on search on Wikipedia and its sister projects, but we certainly want to help other users of MediaWiki when we can. The information below is based on my experience and testing with Wikipedia, etc., but may be helpful to you.

The tilde operator (~) is terribly and confusingly overloaded in our search. I think it means "do something different", but what is different changes for each use case.

One use case is after a phrase in quotes, like "Sejmowi Rzeczpospolita polski"~. In this case, it maintains the order of the words and doesn't allow any words in between, but still allows stemming. So, on Polish Wikipedia, "Sejmowi Rzeczpospolita polski"~ brings up good-looking results even though there are no exact matches. The ranking for the phrase-with-tilde is not great because there aren't any exact phrase matches.

In general, you can add &cirrusDumpQuery to a query to see the full query we build up to send to Elasticsearch:

  • "Sejmowi Rzeczpospolita polski"~
    • this stems the words, but keeps them in order
    • the query string is "Sejmowi Rzeczpospolita polski"
  • "Sejmowi Rzeczpospolita polski"
    • this uses our "plain" fields, which don't do stemming
    • the query string is (title.plain:"Sejmowi Rzeczpospolita polski"~0^20 OR redirect.title.plain:"Sejmowi Rzeczpospolita polski"~0^15 OR category.plain:"Sejmowi Rzeczpospolita polski"~0^8 OR heading.plain:"Sejmowi Rzeczpospolita polski"~0^5 OR opening_text.plain:"Sejmowi Rzeczpospolita polski"~0^3 OR text.plain:"Sejmowi Rzeczpospolita polski"~0^1 OR auxiliary_text.plain:"Sejmowi Rzeczpospolita polski"~0^0.5)
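As a rough illustration of the two variants above, the sketch below builds a search URL that carries cirrusDumpQuery and reassembles the "plain" query string from the field boosts listed in the dump. The function names, the index.php base URL, and passing cirrusDumpQuery with an empty value are assumptions made for the example; the field names and boosts are simply copied from the dumped query above and may differ on other wikis.

```python
from urllib.parse import urlencode

# Assumed base URL for Polish Wikipedia's search entry point.
BASE_URL = "https://pl.wikipedia.org/w/index.php"

# Field names and boosts copied from the dumped query string above.
# Boosts are kept as strings so "0.5" and "20" print exactly as shown.
PLAIN_FIELD_BOOSTS = [
    ("title", "20"),
    ("redirect.title", "15"),
    ("category", "8"),
    ("heading", "5"),
    ("opening_text", "3"),
    ("text", "1"),
    ("auxiliary_text", "0.5"),
]


def phrase_query(phrase, stemmed):
    """Quote the phrase; append ~ to keep word order but allow stemming."""
    return '"{}"{}'.format(phrase, "~" if stemmed else "")


def dump_query_url(query):
    """Build a search URL with cirrusDumpQuery added (empty value assumed)."""
    return BASE_URL + "?" + urlencode({"search": query, "cirrusDumpQuery": ""})


def plain_query_string(phrase):
    """Reassemble the non-stemming query string over the .plain fields."""
    clauses = [
        '{}.plain:"{}"~0^{}'.format(field, phrase, boost)
        for field, boost in PLAIN_FIELD_BOOSTS
    ]
    return "(" + " OR ".join(clauses) + ")"


phrase = "Sejmowi Rzeczpospolita polski"
print(dump_query_url(phrase_query(phrase, stemmed=True)))
print(dump_query_url(phrase_query(phrase, stemmed=False)))
print(plain_query_string(phrase))
```

Opening the two printed URLs in a browser should show the difference between the stemmed phrase and the plain-field phrase; the last line only mirrors the dumped string locally and is not sent anywhere.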

Maybe some of that will help if you decide to implement stemming on your project. Feel free to come back to our office hours and talk to us again if you want to chat more!

Ostrzyciel (talkcontribs)

Thanks! I had no idea CirrusSearch could do that.
