User:TJones (WMF)/Notes/Top Unsuccessful Search Queries

July 2016 — See TJones_(WMF)/Notes for other projects.

Introduction
As a 10% project I decided to look into the top 100 zero-result queries for a one-month period (May 2016). There has been a lot of discussion both recently and in the past about mining the top 100+ zero-results queries for items that could be included in Wikipedia (particularly English Wikipedia in the discussions I've seen). Typos and newly trending topics seem like the most likely result of this mining; however, my experience over the last year—reviewing tens of thousands of queries—has made me skeptical that much good would come of this. In addition to privacy concerns—which are paramount—I just don't expect that there's much there worth gathering.

I've written about this on the Discovery mailing list, and in a message forwarded to the Wikimedia mailing list. Rather than re-summarize, I'll just quote myself:

"So I want to say that this is an awesome idea—which is why many people have thought of it. It was apparently one of the first ideas the Discovery department had when they formed (see Dan's notes linked below). It was also one of the first ideas I had when I joined Discovery a few months later.""Dan Garry's notes on T8373 and the following discussion pretty much quash the idea of automated extraction and publication from a privacy perspective. People not only divulge their own personal information, they also divulge other people's personal information. One example: some guy outside the U.S. was methodically searching long lists of real addresses in Las Vegas. I will second Dan's comments in the T8373 discussion; all kinds of personal data end up in search queries. A dump of search queries was provided in September 2012, but had to be withdrawn over privacy concerns.""Another concern for auto-published data: never underestimate the power of random groups of bored people on the internet. 4chan decided to arrange Time Magazine poll results so the first letter spelled out a weird message. It would be easy for 4chan, Reddit, and other communities to get any message they want on that list if they happened to notice that it existed. See also Boaty McBoatface and Mountain Dew 'Diabeetus' (which is not at all the worst thing on that list). We don't want to have to try to defend against that.""In my experience, the quality of what's actually there isn't that great. One of my first tasks when I joined Discovery was to look at daily lists of top 100 zero-results queries that had been gathered automatically. I was excited by this same idea. The top 100 zero-results query list was a wasteland. (Minimal notes on some of what I found are here.) We could make it better by focusing on human-ish searchers, using basic bot-exclusion techniques, ignoring duplicates from the same IP, and such, but I don't think it would help. 
And while Wikipedia is not for children, there could be an annoying amount of explicit adult material on the list, too. We would probably find some interesting spellings of Facebook and WhatsApp, though."

The purpose of the current exercise is to pull the top 100 zero-results queries from a one-month period and review them for quality and relevance. If it's mostly not worthwhile, we'll know; if there is a lot of worthwhile material there, then we'll be able to make better informed decisions about pursuing this properly in the future. (Please note that no matter what I find, there is no guarantee of prioritizing related work in the near or even distant future.)

Data
The data collected consisted of all zero-results queries for the month of May 2016, with some basic bot exclusion applied:

 * The query came (or appeared to have come) from the search box on en.wikipedia.org
 * Any IP that made 100 or more queries in a day was excluded
 * Only the en_content index was searched
 * The query had 0 results
 * Each query was counted from the same IP only once, across the whole month

In previous sampled corpora, we only considered one query from each IP per day; here we are allowing multiple queries from the same IP, but only counting each distinct query once (so if a given IP searches for the same thing twelve times on five different days, it only counts as one query). In previous corpora we excluded IPs that made more than 30 queries; I wanted to try to be more inclusive of heavier users (while still excluding the apparent bots that generate thousands of queries in a day).

For the purposes of search, case doesn’t matter, so I lowercased everything before generating frequency statistics.
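The counting scheme described above—lowercasing, and counting each distinct query from a given IP only once for the whole month—can be sketched roughly as follows. The `(ip, query)` input format is a hypothetical stand-in for the real log schema, not the actual Hive table layout:

```python
from collections import Counter

def query_frequencies(rows):
    """Count zero-results queries, counting each distinct
    (IP, query) pair only once for the whole month.

    `rows` is an iterable of (ip, query) pairs; the schema is
    a hypothetical stand-in for the real search logs.
    """
    seen = set()
    counts = Counter()
    for ip, query in rows:
        key = (ip, query.lower())  # case doesn't matter for search
        if key in seen:
            continue               # same query from same IP: count once
        seen.add(key)
        counts[query.lower()] += 1
    return counts

rows = [
    ("1.2.3.4", "Foo"),
    ("1.2.3.4", "foo"),  # duplicate from the same IP, ignored
    ("5.6.7.8", "foo"),  # same query from a different IP, counted
]
freqs = query_frequencies(rows)
# freqs["foo"] == 2
```

The per-day exclusion of heavy IPs (100+ queries/day) would happen upstream of this counting step.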

Corpus Stats
My main corpus (May 2016) included 8,654,954 (8.7 million) queries—with 7,437,747 (7.4 million) unique queries. I compared it to a corpus limited to users with fewer than 30 queries per day. The more limited corpus had 8.4 million queries and the top 100 most common (see below) were not significantly different—the counts were slightly smaller and particular queries moved a few places in the rankings, but the differences are pretty much what you’d expect from slightly different samples.

The top three most frequent queries occurred 11,944 (0.14%), 1,959 (0.02%), and 1,807 (0.02%) times. The top 100 most frequent queries have a total count of 53,907 (0.62%). The top 1,000 have a total count of 125,709 (1.45%). The distribution conforms fairly well to a power law distribution.
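The top-N coverage percentages quoted throughout these notes (0.62% for the top 100, 1.45% for the top 1,000, and so on) are just the cumulative share of the most frequent queries. A minimal sketch, using toy counts rather than the real data:

```python
from collections import Counter

def top_n_coverage(counts, n):
    """Fraction of total query volume covered by the n most frequent queries."""
    total = sum(counts.values())
    top = counts.most_common(n)
    return sum(c for _, c in top) / total

# Toy frequency table (illustrative only, not real query counts).
counts = Counter({"a": 50, "b": 30, "c": 15, "d": 5})
# top_n_coverage(counts, 2) == 0.8  (50 + 30 out of 100)
```

For a heavy-tailed, power-law-like distribution such as this one, coverage grows very slowly with n—which is the point of the long-tail observations below.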

Review
For the privacy reasons that have been discussed before, and to avoid the extra work of a privacy-related review, I’m not going to include any of the queries here. I also don’t particularly want to give free advertising to all the web sites and internet personalities on the list.

I categorized the queries primarily based on what they seem to be referring to. The categories included:

 * ???: either junk or something I just can’t figure out, such as a single character repeated many times or an apparently nonsensical string
 * sites: websites, whether by name (e.g., “Wikipedia”), by URL (“wikipedia.org”), or by “mangled URL” (looks like a URL, but isn’t well-formed, e.g., “wikipedia org”)
 * internet meme: something going around the internet that may or may not be real; includes creepypasta, urban legends, etc.
 * media: TV shows, films, anime, etc.
 * person: seems to be a particular real live person; internet personalities, porn stars, people in the news, historical figures, politicians, etc. In a couple of cases the line between internet meme and person is blurry, especially with the creepypasta-type stories.
 * porn query: seems to be a search for generic or specific kinds of pornography
 * xxx: almost but not quite captured by the regex /w+\.?x+(\.?com)?/
 * misc: everything else; not enough instances of any particular thing to warrant its own category
 * non-English: queries in some language other than English (reported on separately); doesn’t include names of people or websites in the Latin alphabet
 * on-other-wiki: the query was a title/redirect match for an article on another wiki in the same language
 * adult-content: explicit sexual content, porn sites, etc.

I don’t have any opinions on the appropriateness of the inclusion of adult content in any list. I mention the adult content separately because there’s plenty of it, and it has led to confusion/frustration/bad press in the past.
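For the “xxx” category, the regex from the list can be tried directly against some safe synthetic strings to see what it does and doesn’t capture (the example strings here are illustrative, not actual queries):

```python
import re

# The pattern from the "xxx" category above, applied with fullmatch
# so the whole query string must match.
pattern = re.compile(r"w+\.?x+(\.?com)?")

matches = [s for s in ["www.xxx.com", "wwwxxx", "wxx.com"]
           if pattern.fullmatch(s)]
# all three match

misses = [s for s in ["xxx.com", "www.xxx.org"]
          if not pattern.fullmatch(s)]
# neither matches: "xxx.com" lacks a leading w, and ".org" isn't covered
```

The near-misses are presumably why the category is described as “almost but not quite” captured by the pattern.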

This list is sorted by total count (i.e., the sum of all counts of all queries in the category), except for Non-English, which is listed last.

I tested all of the top 100 most frequent queries by searching for them myself on English Wikipedia. I checked for results from both the completion suggester (suggestions made while you type) and “did you mean” spelling correction (suggestions made after you submit your query). I also checked to see how many queries got results (searching them verbatim), which shows the effects of updates to English Wikipedia over the course of about two months.

For “did you mean” and the completion suggester, I counted a success when the apparently correct item came up in the top three results. For “results”, I just checked that any results were shown, which would mean that the query would likely not show up in later lists of top zero-results queries.
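These checks were done by hand, but they could in principle be scripted against the public MediaWiki Action API: `action=opensearch` returns prefix suggestions, and a `list=search` query returns a `searchinfo.suggestion` field when a spelling correction is available. This sketch only constructs the request URLs; actually fetching and parsing the responses (and rate-limiting appropriately) is left out:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def suggester_url(query):
    """URL for prefix suggestions (top 3), roughly analogous to
    what the search-box completion suggester shows."""
    return API + "?" + urlencode({
        "action": "opensearch", "search": query,
        "limit": 3, "format": "json"})

def did_you_mean_url(query):
    """URL for a full-text search; the JSON response carries
    searchinfo.suggestion when a spelling correction exists."""
    return API + "?" + urlencode({
        "action": "query", "list": "search", "srsearch": query,
        "srlimit": 1, "format": "json"})
```

Whether the API-side prefix search behaves identically to the on-site completion suggester is an assumption worth verifying before relying on such a script.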

Observations

 * There’s less gibberish than I expected in the top 100, but about the expected amount of adult content.
 * For one of the gibberish queries (which also happened to be in Arabic script), I found a URL-shortened link to the query in the wild on the internet.
 * The most common query is the name of a porn site. It always shows up in my data when I look at poorly performing queries. I found five variants of it in the top 100, and they account for 15,380 of the 53,907 total count in the top 100.
 * All of the typos (including all the media queries and two of the three misc queries), except for a couple of poor spellings of “sex video” and “video porno”, could be corrected by the completion suggester, the “did you mean” spelling suggestions, or both.
 * The most common people who failed to be found are either internet personalities or porn stars.
 * Misspellings of Facebook, Whatsapp, and YouTube did make the list (one of each), and they all could be corrected by the completion suggester.
 * There’s an unexpected minor theme of weird tales/unsolved mysteries/urban legends in the zero-results queries—in particular the creepypasta internet memes, a person and a local sports team in the news, etc.
 * I don’t have any opinions on the notability of any particular person or website, but some of these are clearly popular (based on searches) yet deemed not notable by the community. The most frequent porn site, the two most frequent internet personalities, a couple of other sites, and a couple of the internet memes show up in the deletion logs as having been created—some multiple times—and deleted. The page for the top porn site was deleted in 2007, 2009, and 2015. With a little bit of extra digging, I was able to find other variants of the pages for websites (e.g., with different capitalization) that were also deleted.

More Data—Top 1000
I don’t have time right now for a careful review, but I skimmed the top 1000 most frequent zero-results queries from May 2016. Some more observations:


 * The lowest frequency count in the top 1000 most common queries is only 48 (out of 8.6 million), which is not terribly frequent.
 * Out of an entire month of data, only 281 queries come from at least 100 different IP addresses/day, accounting for 77,651 (0.90%) of the 8.6 million zero-result queries for the month. (This 100-IP threshold has been suggested as a way to deal with personally identifiable information.)
 * The most common porn site has 23 variants accounting for 16,999 queries in the top 1000. Using a very rough regular expression, there are almost 5000 distinct queries including this site, accounting for over 23K total queries. The long tail includes specifics of various kinds of porn. I think there are a lot of misdirected searches here.
 * There are a handful of apparent gibberish queries (all punctuation, or long strings of repeated letters)—strings of q are particularly popular.
 * There are at least 10 obvious addresses, a phone number, an ISBN number, an obfuscated email address, a Twitter account, and two instances of a person + city (which looks like a mis-aimed public records search to me).
 * There are lots of long strings that I don’t believe can be coincidental. They are coming from links somewhere (note that these are all from different IPs). Either our HQL is not filtering linked-to queries (which may not be possible), or people are actively cutting and pasting these queries from somewhere.
 * There are more variants of the websites in the top 100, and a lot of other websites.
 * There are plenty of additional names.
 * I recognized misspellings of some really famous people/things with existing wiki pages (this is not exhaustive or careful, just what I noticed and recognized), including one for Celine Dion, one for Cristiano Ronaldo, two for Deadpool, four for Donald Trump, two for Draymond Green. All but one were corrected by the completion suggester, and all but one were corrected by the “did you mean” spelling suggestions (different ones).
 * I found 17 variants for Facebook (starting with f). All but three were corrected by the completion suggester, and about half were corrected by “did you mean” spelling suggestions.
 * The long tail is very long. The top 100, as mentioned before, account for 0.62% of the zero-results queries. Items 901–1000 collectively account for only 0.057% (4,938 queries). Items 1901–2000 account for only 0.035% (2,986 queries). While new higher-frequency items will appear on the list over time, it would take a lot of review to put a serious dent in the zero-results query list. The top 25,000 queries in May account for just shy of 5% of the total, and the frequency of the 25,000th item on the list is only 7 (and it is corrected by “did you mean”).
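The 100-distinct-IP threshold mentioned above (suggested as a way to limit personally identifiable information) amounts to keeping only queries issued by many independent IPs. A minimal month-level sketch, again assuming a hypothetical `(ip, query)` log schema (a per-day version would add the date to the grouping key):

```python
from collections import defaultdict

def queries_above_ip_threshold(rows, min_ips=100):
    """Return the queries issued by at least `min_ips` distinct IPs.

    `rows` is an iterable of (ip, query) pairs (hypothetical schema).
    """
    ips_per_query = defaultdict(set)
    for ip, query in rows:
        ips_per_query[query.lower()].add(ip)
    return {q for q, ips in ips_per_query.items() if len(ips) >= min_ips}

# Toy example with a threshold of 2 distinct IPs:
rows = [("a", "q1"), ("b", "q1"), ("a", "q2")]
# queries_above_ip_threshold(rows, min_ips=2) == {"q1"}
```

As noted above, only 281 queries in the whole month cleared the real 100-IP bar, so this filter is aggressive.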

More Data—Top Queries in June
Again, I didn’t have time for a careful review, but I looked at the top 100 most frequent zero-results queries from the next month, June 2016, and skimmed the top 1,000. More observations:


 * Stats: 7,991,530 (8.0 million) queries (6,867,420 / 6.9 million unique); the top 100 account for 48,662 (0.61%). The top 3 had counts of 10,465 (0.13%), 1,937 (0.02%), and 1,588 (0.02%), for a total of 13,990 (0.18%). The top 1000 account for 114,458 (1.43%)—all very similar to the May numbers.
 * 71 of the top 100 queries from May showed up on the June list. The top 10 on both lists are the same, though the order is a bit different. All are websites.
 * Presumably, if we excluded previously seen queries month-over-month, at least we’d get to something different—though the frequency per query would go down significantly.
 * Of the 11 non-English queries from May, seven are repeated in June. Six of them are related to porn/adult content.
 * It would be very interesting to extract referrer information where available and see just how many links out there are floating around generating these repeated queries.
 * There were 8 misspellings of Brexit in the top 1,000 (none in the top 100—the most frequent was #102). The completion suggester corrected 6 and “did you mean” spelling suggestions corrected 3.
 * In the top 1000, some of the same street addresses show up, so I hypothesize that these are coming from a link on the internet (my best guess), or a better class of bot.
 * Other street addresses have potentially pattern-defeating typos, too.

Conclusions
The next obvious refinements to the search strategy to consider would be to:
 * exclude previously reviewed items from earlier months
 * exclude items that get “did you mean” results
 * exclude items with identifiable deleted pages
 * take referrer information into account and try to filter queries that come from a common source
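Most of these refinements reduce to set-based filtering of the candidate list. A minimal sketch, with all inputs hypothetical (in practice, each exclusion set would come from logs, the search API, or the deletion logs):

```python
def candidate_queries(current_top, seen_before, has_did_you_mean, has_deleted_page):
    """Filter a ranked list of top zero-results queries, dropping any that
    were reviewed in earlier months, get a "did you mean" suggestion,
    or match an identifiable deleted page. All inputs are hypothetical."""
    return [q for q in current_top
            if q not in seen_before
            and q not in has_did_you_mean
            and q not in has_deleted_page]

# Toy example: "a" was seen last month, "b" gets a spelling suggestion.
remaining = candidate_queries(["a", "b", "c"], {"a"}, {"b"}, set())
# remaining == ["c"]
```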

I think the problem with all of these strategies is that they would eliminate so many high-frequency queries that any useful mining would come down to slogging through the low-impact long tail.

I don’t think there’s a lot here worth extracting, though others may disagree. The privacy concerns expressed earlier are genuine, and simple attempts to filter PII (using patterns, minimum IP counts, etc.) are not guaranteed to be effective.