Topic on User talk:TJones (WMF)/Notes/Top Unsuccessful Search Queries

Think outside the box, return queries that are very close to article titles or n-gram

197.218.83.53 (talkcontribs)

Given Wikimedia's huge corpora, a lot of junk can be filtered out simply by excluding all unsuccessful queries that don't:

  1. Have more than 95% fuzzy match to an existing article title
  2. Have more than 95% match to a token in the search index
  3. Have at least one token of X length that is an exact or close match to an existing article

Then, to prevent possible exposure of private data, show only those tokens. At most there will be a few letters added, and given the volume of queries that will hardly be enough to identify a single person. Alternatively, just expose the titles that closely match the search tokens.
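A minimal sketch of the first criterion, assuming a plain in-memory list of titles and Python's difflib as a stand-in for whatever fuzzy matcher the search backend would really use. The function names and the way the threshold is applied are illustrative assumptions; criteria 2 and 3 need access to the search index and are left out. Each kept query is mapped to the title that would be exposed in its place.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_title(query, titles, threshold=0.95):
    """Return the best-matching title if it scores above the threshold, else None."""
    best = max(titles, key=lambda t: similarity(query, t), default=None)
    if best is not None and similarity(query, best) > threshold:
        return best
    return None


def reportable(failed_queries, titles, threshold=0.95):
    """Map each kept query to the title that would be shown instead of it."""
    kept = {}
    for query in failed_queries:
        title = closest_title(query, titles, threshold)
        if title is not None:
            kept[query] = title
    return kept
```

Only the matched title is reported, never the raw query, so extra words or digits typed by the searcher are not exposed.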

It is also a good idea to turn this around. Top search queries that match existing articles are very useful to both editors and readers. For instance, an editor would be more interested in improving a stub article that a lot of people search for, or potentially deleting it if it is being used as a hoax or the like. It can also expose hot-spot areas (or categories) that a lot of people try to find more information about.

As a reader, knowing that people who searched for X also searched for Y is very useful, as that may help me discover more information about a subject when the right keyword to search for was hard to find.
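The "searched for X, also searched for Y" idea can be sketched as a simple co-occurrence count. This assumes queries are already grouped into sessions (a list of query strings per session), which is a hypothetical input format and not something the search logs are known to provide in this shape.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_searched(sessions):
    """Count, for each query, which other queries appear in the same session."""
    related = defaultdict(Counter)
    for queries in sessions:
        # set() so a query repeated within one session is only counted once
        for x, y in permutations(set(queries), 2):
            related[x][y] += 1
    return related
```

Calling `also_searched(sessions)["t. rex"].most_common(5)` would then list the queries most often paired with "t. rex".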

Page views can easily be distorted by bots, and so can search terms, but combined they are greater than the sum of their parts.

TJones (WMF) (talkcontribs)

Interesting ideas!

If I'm reading this right, though, then a query like "john smith 1 main street houston 77001" (in this case, that's the address for the University of Houston–Downtown's main building) would pass your filter, unless you mean to meet all the criteria at once.

I'm not sure, but I think many queries that meet your criteria might also do well with the completion suggester, which allows looser matches because it only compares queries to titles (and redirects).

A bigger problem is that there's just not a lot of info to be gained from the zero-results queries. The most common ones are mostly junk, and the less common ones are too numerous to sift through effectively.

197.218.83.53 (talkcontribs)

>If I'm reading this right, though, then a query like "john smith 1 main street houston 77001"

Ah, close, but that wouldn't be an issue. The idea here is not to show the whole search string whenever the query contains more than one token. So in this case the query would be discarded for having too many tokens (a maximum of two or three would probably suffice). The whole query needs to be a close match to a title in the first place. So if someone searches for:

  • "James bond 1" - This is more than a 95% match, and the result can be shown even if the number 1 is discarded.
  • "james smith 1 main street houston" - This would fail because the tokens are too different from any article title, and frankly it has too many tokens.
  • "Tyronasouras animal" - CirrusSearch fails completely on this simple typo, while Google's fuzzy matching catches it as "Tyrannosaurus". Even if you discard the word "animal", this is still a valuable query, even if the term didn't exist at all in the wiki titles but did exist as a token within the corpora or even Wiktionary.
  • "yrannosaurus" - Even an almost 100% match fails.
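The examples above can be sketched roughly as follows. This assumes whitespace tokenization, a cap of three tokens, purely numeric tokens being dropped before comparison, and difflib's ratio as the similarity measure; all of these are illustrative choices, not how CirrusSearch works.

```python
from difflib import SequenceMatcher

MAX_TOKENS = 3  # hypothetical cap; the post suggests two or three


def match_title(query, titles, threshold=0.95, max_tokens=MAX_TOKENS):
    """Return the first title the query closely matches, or None if it is rejected."""
    tokens = query.lower().split()
    if len(tokens) > max_tokens:
        return None  # too many tokens: likely too specific to expose
    # Drop purely numeric tokens such as the "1" in "James bond 1"
    core = " ".join(t for t in tokens if not t.isdigit())
    for title in titles:
        if SequenceMatcher(None, core, title.lower()).ratio() > threshold:
            return title
    return None
```

With titles "James Bond" and "Tyrannosaurus", the queries "James bond 1" and "yrannosaurus" are kept, while the address-like query is rejected on token count. "Tyronasouras animal" still falls below a 95% cutoff with this measure, so catching that kind of typo would need a looser threshold or per-token matching.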

These are just simple examples, but I'd bet that people search for items that might match tokens in a Wiktionary but not here. For example, it may be a Latin word related to some field of science.

Although there are plans for interwiki searches, those won't cover terms that don't exist in the same-language versions, and Wiktionaries are probably often out of sync.

There is still value in identifying the non-zero matches; there might just be no value in identifying the top 100 or even top 10000, especially if the focus is on queries with few tokens instead of overly complex searches.

TJones (WMF) (talkcontribs)

Interesting. I'll keep this token-based approach in mind for when I have a chance to look at this again. I still think there might not be enough there to be worth mining, but I could take a look and see what comes of it.

197.218.81.64 (talkcontribs)

Thanks, it's a long shot but worth checking anyway.
