Topic on User talk:TJones (WMF)

Alsee (talkcontribs)

Hi. I just came across your message on wikimedia-l, specifically the part about "removing quotes within queries by default".

I think the focus on avoiding zero-results is leading to a misstep here. The true goal is to have the best search engine with the most useful results. If I'm using quotes for an exact phrase search, and that phrase doesn't exist, then "zero hits" is the exact answer I wanted! That's far more valuable than digging through junk results, trying to figure out whether my quoted phrase exists.

If you still want to avoid a zero-result, just do what Google does. Give the zero-hit answer and re-run the search:

No results found for "foo bar baz".
Results for foo bar baz (without quotes):

TJones (WMF) (talkcontribs)

Hi @Alsee. Sorry for the confusion. The short version is that our long term plan is to do what you suggest.

The longer version:

While the zero results rate is an easy indicator to compute, we do recognize that it is low-resolution and of limited value. A big swing up or down is a cause for concern, so it's a good metric to track on the dashboards, but getting it to zero is no longer a goal. One of my earliest write-ups covers lots of cases that do, in fact, deserve no results (the write-up itself is a bit of a mess, sorry).

Mikhail's Zero to Hero analysis, which Deb linked to in the email, highlights the text characteristics that are most often associated with zero results. While zero results may be appropriate, a very high failure rate points to places where we could possibly make improvements.

An area that I'd identified earlier in my research was queries in the wrong language, so now we run language detection on poorly performing queries for some Wikipedias and search other, more appropriate Wikipedias. As an example, a search in Russian on English Wikipedia can show results from Russian Wikipedia.

Two areas that Mikhail's report found as potentially high-impact (both relatively common and relatively unsuccessful query types) were queries with question marks and queries with quotation marks. I did a quick analysis of both and found that they did look promising. This led to a more thorough analysis of dropping question marks, and eventually a change in the question mark syntax that makes naive use of question marks behave as a naive searcher would expect.

Quotes are harder, because as you point out, the query you intend (with quotes) and the modified query (without quotes) are not the same query. We would of course want to show the "before" and "after" like we do with "Did You Mean" queries, and cross-wiki searches based on language detection (as above).

The actual implementation of quote removal is complicated by the fact that it could interact with Did You Mean, language detection, sister-wiki results (as discussed in Deb's email), the modified question mark syntax, and other forms of "second chance" searches we might implement in the future. We have an outline of the problem and the beginning of a discussion of how to deal with it, but it's not high on the priority list right now.

So, to sum up, what Deb was referring to in her email was the idea of automatically/"by default" taking poorly-performing queries that have quotes and re-processing them without the quotes, rather than relying on the user to do it, assuming they realize that it could help and choose to. The UI for such a process would include the before and after versions, as you suggest. Again, sorry for the confusion.
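A minimal sketch of that "second chance" flow, in Python. Everything here is an assumption made for illustration: the function names, the toy in-memory `DOCS` index, and the choice of "zero hits" as the trigger. It is not how the real search backend is implemented; it only shows the before/after behaviour described above.

```python
import re

# Hypothetical toy corpus standing in for a real search index.
DOCS = ["foo baz bar", "bar foo", "something else"]

def toy_search(query):
    """Quoted phrases must appear verbatim; unquoted words must all appear."""
    phrases = re.findall(r'"([^"]*)"', query)
    words = re.sub(r'"[^"]*"', " ", query).split()
    return [d for d in DOCS
            if all(p in d for p in phrases)
            and all(w in d.split() for w in words)]

def search_with_quote_fallback(query, search):
    """Run the query as typed; if it has quotes and finds nothing, retry without quotes."""
    original_hits = search(query)
    result = {"original_query": query, "original_hits": original_hits,
              "fallback_query": None, "fallback_hits": []}
    if not original_hits and '"' in query:
        result["fallback_query"] = query.replace('"', "")
        result["fallback_hits"] = search(result["fallback_query"])
    return result

def render(result):
    """Show both the 'before' and the 'after' so the user knows the query changed."""
    if result["fallback_query"] is None:
        return "\n".join(result["original_hits"])
    lines = [f'No results found for {result["original_query"]}.',
             f'Results for {result["fallback_query"]} (without quotes):']
    return "\n".join(lines + result["fallback_hits"])

print(render(search_with_quote_fallback('"foo bar baz"', toy_search)))
```

The zero-hit answer for the quoted phrase is still reported first, so a user who wanted to know whether the exact phrase exists gets that answer before seeing the relaxed results.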

Alsee (talkcontribs)

Thanks, sounds good. The links you gave were interesting too. It's a very curious detail that zero-results from Ireland are 1/3 the zero rate from Australia. By the way, I found and fixed the 'paperr' typo mentioned in your research. Chuckle. It remained in that article for two years.

TJones (WMF) (talkcontribs)

I wonder if the Australian numbers are still coming from the National Library of Australia. They had a glitch that seemed to be converting quotation marks to &quot; and then presumably sanitizing that to plain quot, which makes for a poor search term.
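As a rough illustration of how that kind of glitch could arise, here is a small Python sketch. The two-step pipeline (HTML-escape, then strip `&` and `;`) is purely a guess that reproduces the symptom; the function name `presumed_glitch` is hypothetical and nothing here is taken from the library's actual code.

```python
import html
import re

def presumed_glitch(query):
    """Guessed pipeline: escape quotes as HTML entities, then strip '&' and ';'."""
    escaped = html.escape(query, quote=True)   # '"' becomes '&quot;'
    return re.sub(r"[&;]", "", escaped)        # '&quot;' becomes 'quot'

print(presumed_glitch('"exact phrase"'))  # prints: quotexact phrasequot
```

The resulting search terms contain the stray letters "quot", which are unlikely to match anything useful.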

Thanks for fixing paperr. I have to fight against the urge to go on an error-fixing spree when I stumble across errors, especially semi-systematic ones. One of my favorites is when people accidentally type one letter in another character set. There are at least dozens of cases of Cyrillic о used in place of Latin o on English Wikipedia. Depending on your font, they are indistinguishable, but it messes up searching for those words.
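To make the problem concrete, here is a short Python sketch using the standard `unicodedata` module. The helper name `scripts_in` and the sample word are made up for illustration; taking the first word of the Unicode character name as the "script" is a rough heuristic, not a proper script lookup.

```python
import unicodedata

latin_o = "o"            # U+006F
cyrillic_o = "\u043e"    # U+043E, looks the same in many fonts

print(unicodedata.name(latin_o))     # LATIN SMALL LETTER O
print(unicodedata.name(cyrillic_o))  # CYRILLIC SMALL LETTER O

def scripts_in(word):
    """Rough heuristic: first word of each letter's Unicode name (LATIN, CYRILLIC, ...)."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

typo = "c" + cyrillic_o + "py"   # renders like "copy" but is a different string
print(typo == "copy")            # False, so an exact search for "copy" misses it
print(scripts_in(typo))          # contains both LATIN and CYRILLIC
```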

Alsee (talkcontribs)

Error-fixing sprees are always welcome, chuckle. Tho I guess it's not supposed to interfere with your paid job. If you discover some group of errors that need cleanup, you could post it to EN:WP:VPM (Village Pump Miscellaneous). There's a good chance someone will pick up the task.

I just tried looking for the cases of Cyrillic о you mentioned. I get 10,595 hits for the letter. I tried to narrow the search somewhat and got down to 4,583 hits for о -Cyrillic -insource:"«о". Scanning through a bunch of the hits, all I could find were clear cases of Russian text containing the letter.

Do you have any suggestion on how to find cases that need cleanup?

TJones (WMF) (talkcontribs)

> Do you have any suggestion on how to find cases that need cleanup?

Sure! The idea is that it's very unlikely that you'd have a Cyrillic letter next to a Latin letter in a word: possible, but unlikely. So, you want an insource regex search for any Latin character next to a Cyrillic character, or vice versa. It's an expensive query and it times out (fortunately we now get partial results!), so you can break it into two pieces that are still too expensive, but less so than one combined regex:

  • insource:/[А-Яа-яЅІЈѕіј][A-Za-z]/ — a Cyrillic character followed by a Latin character
  • insource:/[A-Za-z][А-Яа-яЅІЈѕіј]/ — a Latin character followed by a Cyrillic character

You do get false positives like "KoЯn" and "NGiИX". You also get unexpected typos like "LГhomme Amérindien dans son environnement" which is almost certainly supposed to start with "L'homme".
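If you have article text locally (from a dump, say), the same idea can be sketched in Python. This is only a local approximation of the two insource queries above, using Python's `re` rather than the on-wiki regex engine; the function name `find_mixed_script_words` and the sample sentence are hypothetical.

```python
import re

# Same character classes as the insource queries, written with escapes so the
# script of each character is unambiguous:
#   \u0410-\u042F = А-Я, \u0430-\u044F = а-я,
#   \u0405 \u0406 \u0408 = ЅІЈ, \u0455 \u0456 \u0458 = ѕіј
CYRILLIC = "\u0410-\u042F\u0430-\u044F\u0405\u0406\u0408\u0455\u0456\u0458"
LATIN = "A-Za-z"

# A Cyrillic letter followed by a Latin letter, or the reverse.
MIXED = re.compile(f"[{CYRILLIC}][{LATIN}]|[{LATIN}][{CYRILLIC}]")

def find_mixed_script_words(text):
    """Return whitespace-separated tokens with adjacent Latin and Cyrillic letters."""
    return [word for word in text.split() if MIXED.search(word)]

sample = "Ko\u042Fn and c\u043epy are flagged, plain copy is not"
print(find_mixed_script_words(sample))
```

Intentional stylizations like "KoЯn" are flagged along with real typos, so the output is a list of candidates to review by hand, not a list of errors.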

I think the two most common sources of actual errors are probably

  • Users of phonetic keyboards—these keyboards have, for example, the Russian letters on the same keys as their English counterparts, so you can't tell if you mis-typed a Cyrillic or Latin o because they are the same key.
  • People working on Serbian or Macedonian topics that don't have ready access to keyboards for those languages and so substitute non-Russian Cyrillic ЅІЈѕіј with Latin SIJsij.

You could extend this with more accented Cyrillic and Latin letters, but this is a good start.
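One possible way to do that extension, sketched in Python for local text. The ranges are my own choice, not something from the queries above: the whole Cyrillic block (U+0400 to U+04FF, which also contains a few non-letter symbols) and basic Latin plus the accented letters in Latin-1 and Latin Extended-A. Whether the on-wiki regex engine handles these ranges efficiently is not something this sketch checks, and the name `has_mixed_script` is hypothetical.

```python
import re

# Assumed ranges, chosen for illustration only.
CYRILLIC_EXT = "\u0400-\u04FF"   # includes Ё/ё and the Serbian/Macedonian letters
LATIN_EXT = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u017F"

MIXED_EXT = re.compile(
    f"[{CYRILLIC_EXT}][{LATIN_EXT}]|[{LATIN_EXT}][{CYRILLIC_EXT}]"
)

def has_mixed_script(word):
    """True if a Cyrillic-block character sits next to a (possibly accented) Latin letter."""
    return MIXED_EXT.search(word) is not None

# Cyrillic ё (U+0451) lies outside а-я, so the narrower classes would miss this one.
print(has_mixed_script("F\u0451dor"))   # True
print(has_mixed_script("Amérindien"))   # False, all Latin
```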

Reply to "Quotes in search"