User talk:TJones (WMF)

Followup on the link suggestion tool

Ostrzyciel (talkcontribs)

Hi! At the search team meeting you asked me why I decided to decline the titles we look for before searching instead of relying on a stemmer. I tried a few things and I think I remember why now :D

The biggest problem is we're looking for exact matches of an inflected title, so we can't use the standard "no operator" search mode that uses stemming. A partial result or a result with two matching words separated by other words is useless in our case, so we have to use double quotes. I don't think it's possible to search the stemmed text with exact matching… or is it? To be honest I'm not much of an expert in Elastic, so I may be wrong here :)

Another problem is that we sometimes don't want certain words to be declined. For example, "Sejm Rzeczpospolitej Polskiej" (our parliament thing) keeps "Rzeczpospolitej Polskiej" fixed in that grammatical case. We would decline only the first word (Sejm, Sejmu, Sejmowi, etc.) without changing the rest of the name. This is a rare case though, as the invalid forms that a stemmer might also match are rather unlikely to occur in articles.
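
To make that concrete, here is a minimal sketch (not the tool's actual code) of building exact-phrase queries where only the head word is inflected. The function name `phrase_queries` is hypothetical, and the list of forms is supplied by hand from the example above; a real tool would generate it.

```python
def phrase_queries(head_forms, fixed_tail):
    """Build one double-quoted (exact phrase) query per inflected head form.

    The tail of the name is appended unchanged to every form.
    """
    return ['"{} {}"'.format(form, fixed_tail) for form in head_forms]


# Forms taken from the example above, not generated automatically.
queries = phrase_queries(["Sejm", "Sejmu", "Sejmowi"], "Rzeczpospolitej Polskiej")
for query in queries:
    print(query)
```

Each resulting string can be sent as a separate quoted search, so only exact matches of that inflected form come back.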

So that's why :)

TJones (WMF) (talkcontribs)

Thanks for the information! There are always interesting new use cases to be discovered.

Not surprisingly, we focus on search on Wikipedia and its sister projects, but we certainly want to help other users of MediaWiki when we can. The information below is based on my experience and testing with Wikipedia, etc., but may be helpful to you.

The tilde operator (~) is terribly and confusingly overloaded in our search. I think it means "do something different"—but what is different changes for each use case.

One use case is after a phrase in quotes, like "Sejmowi Rzeczpospolita polski"~. In this case, it maintains the order of the words and doesn't allow any words in between, but still allows stemming. So, on Polish Wikipedia, "Sejmowi Rzeczpospolita polski"~ brings up good-looking results even though there are no exact matches. The ranking for the phrase-with-tilde is not great because there aren't any exact phrase matches.

In general, you can add &cirrusDumpQuery to a query to see the full query we build up to send to Elasticsearch:

  • "Sejmowi Rzeczpospolita polski"~
    • this stems the words, but keeps them in order
    • the query string is "Sejmowi Rzeczpospolita polski"
  • "Sejmowi Rzeczpospolita polski"
    • this uses our "plain" fields, which don't do stemming
    • the query string is (title.plain:"Sejmowi Rzeczpospolita polski"~0^20 OR redirect.title.plain:"Sejmowi Rzeczpospolita polski"~0^15 OR category.plain:"Sejmowi Rzeczpospolita polski"~0^8 OR heading.plain:"Sejmowi Rzeczpospolita polski"~0^5 OR opening_text.plain:"Sejmowi Rzeczpospolita polski"~0^3 OR text.plain:"Sejmowi Rzeczpospolita polski"~0^1 OR auxiliary_text.plain:"Sejmowi Rzeczpospolita polski"~0^0.5)
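
As a rough sketch of how to compare the two variants, the snippet below builds search URLs with `cirrusDumpQuery` appended. The host `pl.wikipedia.org`, the `/w/index.php` path, and the helper name `dump_query_url` are assumptions for illustration; adjust them for your own wiki.

```python
from urllib.parse import urlencode


def dump_query_url(search, host="pl.wikipedia.org"):
    """Return a search URL that asks CirrusSearch to dump the generated query.

    The host and the /w/index.php path are assumptions, not fixed values.
    """
    params = urlencode({"search": search})
    return "https://{}/w/index.php?{}&cirrusDumpQuery".format(host, params)


# Quoted phrase with a trailing tilde: stemmed, words kept in order.
stemmed_url = dump_query_url('"Sejmowi Rzeczpospolita polski"~')
# Plain quoted phrase: searched against the unstemmed "plain" fields.
plain_url = dump_query_url('"Sejmowi Rzeczpospolita polski"')

print(stemmed_url)
print(plain_url)
```

Opening both URLs and comparing the dumped output should show which fields each variant searches.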

Maybe some of that will help if you decide to implement stemming on your project. Feel free to come back to our office hours and talk to us again if you want to chat more!


Ostrzyciel (talkcontribs)

Thanks! I had no idea CirrusSearch could do that.


CirrusSearch suggestion (^ and $ anchors)

Zabavuju flašku chlastu maskovanou jako zubní pastu (talkcontribs)

Hi, thank you for solving the "blocking" thread. And since I can see you are probably a CirrusSearch developer (?), I'd like to show you my suggestion to the Community Wishlist Survey 2020. What do you think? I personally ran across some cases where having ^ and $ anchors would have helped a lot.

TJones (WMF) (talkcontribs)

Thanks for contacting me! Yeah, I'm on the Search Platform team at WMF, so I do work with CirrusSearch and the underlying technology stack.

I think it makes a lot of sense for intitle searching, especially Wiktionary. I'm not sure about whole documents, but with the multiline option it could still be useful there. I've also run into some cases where it would have been helpful.
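
For readers unfamiliar with anchors, here is a small illustration using Python's `re` module. It only shows what `^`, `$`, and a multiline option mean in general; it says nothing about how CirrusSearch would implement them, and the sample titles are made up.

```python
import re

# Made-up page titles, for illustration only.
titles = ["cat", "catalog", "bobcat", "concatenate"]

# Without anchors, the pattern matches anywhere in the title.
unanchored = [t for t in titles if re.search(r"cat", t)]
# ^ pins the match to the start of the title, $ to the end.
starts_with = [t for t in titles if re.search(r"^cat", t)]
exact = [t for t in titles if re.search(r"^cat$", t)]

# In a whole document, ^ and $ refer to the start and end of the text
# unless multiline mode makes them match at every line boundary.
page = "first line\ncat\nlast line"
whole_doc_match = re.search(r"^cat$", page) is not None
per_line_match = re.search(r"^cat$", page, re.MULTILINE) is not None

print(unanchored, starts_with, exact, whole_doc_match, per_line_match)
```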

There are a couple of potential hold-ups. The Community Tech team, which sponsors the Community Wishlist, may not have the skills needed to work on this project, though we (the Search Platform team) do look at the Community Wishlist, too, and see if there are promising projects there that we should take on. We don't follow the Community Wishlist timeline, though. Also, it might turn out to be too expensive in multiline mode on large Wikipedia docs on big wikis; I'm not sure, and I don't think it should be, but it's possible.

That said, I do think it's definitely worth proposing and discussing!

Quotes in search

Alsee (talkcontribs)

Hi. I just came across your message on wikimedia-l, specifically the part about "removing quotes within queries by default".

I think the focus on avoiding zero-results is leading to a misstep here. The true goal is to have the best search engine with the most useful results. If I'm using quotes for an exact phrase search, and that phrase doesn't exist, then "zero hits" is the exact answer I wanted! That's far more valuable than digging through junk results, trying to figure out whether my quoted phrase exists.

If you still want to avoid a zero-result, just do what Google does. Give the zero-hit answer and re-run the search:

No results found for "foo bar baz".
Results for foo bar baz (without quotes):
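
A minimal sketch of that fallback, assuming a generic `search_fn` callable; the names `search_with_fallback`, `toy_search`, and `DOCS` are hypothetical and stand in for a real search backend.

```python
DOCS = ["foo and bar", "baz bar foo", "unrelated"]


def toy_search(query):
    """Toy backend: a quoted query is an exact phrase, otherwise all words must occur."""
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        phrase = query[1:-1]
        return [doc for doc in DOCS if phrase in doc]
    words = query.split()
    return [doc for doc in DOCS if all(word in doc.split() for word in words)]


def search_with_fallback(query, search_fn):
    """Run the query as typed; if a quoted query finds nothing, say so and retry unquoted."""
    results = search_fn(query)
    if results or '"' not in query:
        return {"query": query, "results": results, "notice": None}
    relaxed = query.replace('"', "")
    notice = "No results found for {}. Results for {} (without quotes):".format(query, relaxed)
    return {"query": relaxed, "results": search_fn(relaxed), "notice": notice}


outcome = search_with_fallback('"foo bar baz"', toy_search)
print(outcome["notice"])
print(outcome["results"])
```

The notice keeps the zero-hit answer visible, so the user still learns that the exact phrase does not exist.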
TJones (WMF) (talkcontribs)

Hi @Alsee. Sorry for the confusion. The short version is that our long term plan is to do what you suggest.

The longer version:

While the zero results rate is an easy indicator to compute, we do recognize that it is low resolution and of limited value. A big swing up or down is a cause for concern—so it's a good metric to track on the dashboards—but getting it to zero is no longer a goal. One of my earliest write-ups covers lots of cases that do, in fact, deserve no results (the write up itself is a bit of a mess—sorry).

Mikhail's Zero to Hero analysis, which Deb linked to in the email, highlights the text characteristics that are most often associated with zero results. While zero results may be appropriate, a very high failure rate points to places where we could possibly make improvement.

An area that I'd identified earlier in my research was queries in the wrong language, so now we run language detection on poorly performing queries for some wikipedias and search other more appropriate wikipedias. As an example, a search in Russian on English Wikipedia can show results from Russian Wikipedia.

Two areas that Mikhail's report found as potentially high-impact (both relatively common and relatively unsuccessful query types) were queries with question marks and queries with quotation marks. I did a quick analysis of both and found that they did look promising. This led to a more thorough analysis of dropping question marks, and eventually a change in the question mark syntax that makes naive use of question marks behave as a naive searcher would expect.

Quotes are harder, because as you point out, the query you intend (with quotes) and the modified query (without quotes) are not the same query. We would of course want to show the "before" and "after" like we do with "Did You Mean" queries, and cross-wiki searches based on language detection (as above).

The actual implementation of quote removal is complicated by the fact that it could interact with Did You Mean, language detection, sister-wiki results (as discussed in Deb's email), the modified question mark syntax, and other forms of "second chance" searches we might implement in the future. We have an outline of the problem and the beginning of a discussion of how to deal with it, but it's not high on the priority list right now.

So, to sum up, what Deb was referring to in her email was the idea of automatically/"by default" taking poorly-performing queries that have quotes and re-processing them without the quotes, rather than relying on the user to do it, if they choose to and they realize that it could help. The UI for such a process would include the before and after versions, as you suggest. Again, sorry for the confusion.

Alsee (talkcontribs)

Thanks, sounds good. The links you gave were interesting too. It's a very curious detail that zero-results from Ireland are 1/3 the zero rate from Australia. By the way, I found and fixed the 'paperr' typo mentioned in your research. Chuckle. It remained in that article for two years.

TJones (WMF) (talkcontribs)

I wonder if the Australian numbers are still coming from the National Library of Australia. They had a glitch that seemed to be converting quotation marks to &quot; and then presumably sanitizing that to plain quot, which makes for a poor search term.

Thanks for fixing paperr. I have to fight against the urge to go on an error-fixing spree when I stumble across them, especially semi-systematic ones. One of my favorites is when people accidentally type one letter in another character set. There are at least dozens of cases of Cyrillic о used in place of Latin o on English Wikipedia. Depending on your font, they are indistinguishable–but it messes up searching for those words.

Alsee (talkcontribs)

Error-fixing sprees are always welcome, chuckle. Tho I guess it's not supposed to interfere with your paid job. If you discover some group of errors that need clean up, you could post it to EN:WP:VPM (Village Pump Miscellaneous). There's a good chance someone will pick up the task.

I just tried looking for cases of Cyrillic о you mentioned. I get 10,595 hits for the letter. I tried to narrow the search somewhat, I got to 4,583 hits for о -Cyrillic -insource:"«о". Scanning through a bunch of the hits, all I could find was clear cases of Russian text containing the letter.

Do you have any suggestion on how to find cases that need cleanup?

TJones (WMF) (talkcontribs)

> Do you have any suggestion on how to find cases that need cleanup?

Sure! The idea is that it's very unlikely that you'd have a Cyrillic letter next to a Latin letter in a word (possible, but unlikely). So, you want an insource regex search for any Latin character next to a Cyrillic character, or vice versa. It's an expensive query and it times out (fortunately we now get partial results!), so you can break it into two pieces that are still too expensive, but less so than one combined regex:

  • insource:/[А-Яа-яЅІЈѕіј][A-Za-z]/ — a Cyrillic character followed by a Latin character
  • insource:/[A-Za-z][А-Яа-яЅІЈѕіј]/ — a Latin character followed by a Cyrillic character

You do get false positives like "KoЯn" and "NGiИX". You also get unexpected typos like "LГhomme Amérindien dans son environnement" which is almost certainly supposed to start with "L'homme".
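
The same check can be sketched offline in Python, for example against text you have already downloaded. This is only an illustration with the `re` module, not the insource search itself; the helper name `mixed_script_words` is hypothetical, and the character classes mirror the two regexes above, written as Unicode escapes so look-alike letters cannot sneak into the pattern.

```python
import re

# U+0410-U+042F and U+0430-U+044F: the basic Cyrillic alphabet (upper, lower),
# plus the six extra Serbian/Macedonian letters listed in the regexes above.
CYRILLIC = "\u0410-\u042F\u0430-\u044F\u0405\u0406\u0408\u0455\u0456\u0458"
LATIN = "A-Za-z"

# A Cyrillic character followed by a Latin one, or the reverse.
MIXED = re.compile("[{c}][{l}]|[{l}][{c}]".format(c=CYRILLIC, l=LATIN))


def mixed_script_words(text):
    """Return the words that contain adjacent Latin and Cyrillic letters."""
    return [word for word in re.findall(r"\w+", text) if MIXED.search(word)]


# "Ko\u042Fn" is the band name with a Cyrillic Ya; "c\u043Elor" hides a Cyrillic o.
sample = "Ko\u042Fn and c\u043Elor are flagged, color and \u041C\u043E\u0441\u043A\u0432\u0430 are not"
print(mixed_script_words(sample))
```

Purely Latin or purely Cyrillic words pass through untouched, so ordinary Russian text is not flagged.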

I think the two most common sources of actual errors are probably

  • Users of phonetic keyboards—these keyboards have, for example, the Russian letters on the same keys as their English counterparts, so you can't tell if you mis-typed a Cyrillic or Latin o because they are the same key.
  • People working on Serbian or Macedonian topics that don't have ready access to keyboards for those languages and so substitute non-Russian Cyrillic ЅІЈѕіј with Latin SIJsij.

You could extend this with more accented Cyrillic and Latin letters, but this is a good start.
