User:TJones (WMF)/Notes/Survey of Zero-Results Queries

Introduction
In an effort to reduce the rate of Wikipedia search queries that produce no results (see the Discovery team's proposal), I've undertaken a manual review of three batches of 500,000 full-text queries that returned no results (taken from the top 52 wikis, with 100K+ articles).

Samples
My first sample ("7/24") is the first 500,000 zero-result full-text queries from the 2015-07-24 Cirrus Search Request log. The queries are time-stamped from 2015-07-23 07:51:29 to 2015-07-23 10:11:29. (The time zone is not indicated, but I assume it is consistent from file to file.) I reviewed this sample the most extensively, reviewing patterns I had previously found in a sample of 100,000 similar queries restricted to enwiki, as well as looking for new patterns.

I also reviewed similar samples of 500,000 zero-result full-text queries from the Cirrus Search Request logs dated 2015-07-10 ("7/10", time-stamped from 2015-07-09 07:43:26 to 2015-07-09 10:20:40) and 2015-07-17 ("7/17", 2015-07-16 07:42:37 to 2015-07-16 09:52:23). In these samples I only looked for the patterns I had previously identified in the 7/24 sample.

Caveats and Limitations
An important part of this process has been looking for patterns of queries that would not show up when listing the individual top zero-result queries. However, I could not manually review every unique query individually in a timely fashion, so I have resorted to heuristics (mostly grep patterns) for counting instances of the various patterns. The numbers are not exact, but they sometimes vary significantly from sample to sample anyway. I do believe that large systematic query patterns have been identified.
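As a rough illustration of the kind of heuristic counting described above, here is a minimal Python sketch. The pattern names and regular expressions are hypothetical stand-ins, not the grep patterns actually used in this review, and the 500,000 default is simply the sample size stated above.

```python
import re
from collections import Counter

# Hypothetical pattern table: illustrative guesses only, not the
# grep patterns actually used for the counts in this document.
PATTERNS = {
    "quot": re.compile(r"\bquot\b"),
    "film": re.compile(r'^".+" film$'),
    "plus_joined": re.compile(r"^\S+\+\S+"),
}

def count_patterns(queries):
    """Count how many queries match each heuristic pattern."""
    counts = Counter()
    for query in queries:
        for name, pattern in PATTERNS.items():
            if pattern.search(query):
                counts[name] += 1
    return counts

def share(count, total=500_000):
    """Express a count as a percentage of the sample, to two decimals."""
    return round(100 * count / total, 2)
```

A query can match more than one pattern here, so the per-pattern counts need not sum to the number of queries.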

The samples were limited to the first 500K relevant queries in each log file, and do not represent the full 24-hour day.

This review is necessarily subjective and also limited by my familiarity with the languages and writing systems involved. (Hence, there's often more detail in enwiki and the top wikis in various Romance languages.)

Recurring Patterns
I've sorted the patterns by maximum frequency of occurrence in my three samples. Unless otherwise noted, we haven't yet tracked down a source for the queries, and their intent is generally unclear.

DOI
Examples: We had between 15,393 and 96,998 such queries in our samples, representing 3.08% to 19.40% of all zero-results queries. And while there were more in enwiki, there were many in these wikis as well: nl, ja, zh, war, vi, uk, sv, pt, pl, no, ko, it, id, hu, fr, fi, fa, de, cs, ceb, ca, ar, es, ru.
 * "10.3897/zookeys.457.6760" OR "http://zookeys.pensoft.net/articles.php?id=4267"
 * "10.1371/journal.pntd.0003900" OR "http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0003900"
 * "10.3332/ecancer.2013.301"
 * "10.7821/naer.1.1.2-6"

We were able to track the source of these queries down to a software package called Lagotto, used to track references to articles made online and elsewhere.

These queries are well-formed and can return results, but many do not because not all published academic articles are referenced in Wikipedia.
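A minimal sketch of how such queries could be flagged. The regular expression is an assumption about what a DOI looks like ("10.", a 4 to 9 digit prefix, a slash, then a suffix); it is not taken from Lagotto or from the heuristics used for the counts above.

```python
import re

# Assumed DOI shape: "10.", 4-9 digits, "/", then any run of characters
# up to whitespace or a double quote. Real DOIs can be more varied.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"]+')

def extract_dois(query):
    """Return every DOI-like substring found in the query."""
    return DOI_RE.findall(query)

def looks_like_doi_query(query):
    """True if the query contains at least one DOI-like substring."""
    return bool(DOI_RE.search(query))
```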

Unix Timestamps
Examples: We had between 26,351 and 42,650 such queries in our samples, representing 5.27% to 8.53% of all zero-results queries. They were spread across many wikis, with substantial numbers in en, it, ru, ja, fa, tr, nl, he, ar, id, cs, hi, vi, ro, hu, and uk.
 * 1431786835781:بيت لحم
 * 1436436482196:Илюзия
 * 1432198699732:Meryl Streep

Though the leading number looks like a Unix timestamp in epoch milliseconds (it has 13 digits), the numbers don't make sense. The time indicated spans from last year to next month.

These queries do not return results, though removing the number and colon from the beginning of the query results in a title match on the relevant wiki in all the examples tested.
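The following sketch splits such a query into its numeric prefix and the remaining search text. Reading the 13-digit prefix as milliseconds since the Unix epoch is an assumption based on its length, and the function name is hypothetical.

```python
import re
from datetime import datetime, timezone

# A 13-digit number, a colon, then the rest of the query.
PREFIX_RE = re.compile(r"^(\d{13}):(.*)$", re.DOTALL)

def split_timestamp_query(query):
    """Return (UTC datetime, remaining text), or None if no prefix is found.

    Assumes the prefix is milliseconds since the Unix epoch.
    """
    match = PREFIX_RE.match(query)
    if match is None:
        return None
    millis = int(match.group(1))
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return when, match.group(2)
```

The second element of the result is the text that, per the observation above, tends to produce a title match when searched on its own.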

"Article_title" AND "title of link taken from article"
Examples: We had an estimated 8,174 to 16,657 such queries in our samples, representing 1.63% to 3.33% of all zero-results queries. These are generally restricted to enwiki.
 * "Argentine_football_league_system" AND "Football in Moldova"
 * "Argentine_football_league_system" AND "Football in Mongolia"
 * "Argentine_football_league_system" AND "Football in Mozambique"
 * "Argentine_football_league_system" AND "Football in Papua New Guinea"

These queries seem to consist of a quoted article title (with underscores) ANDed with the quoted title of an article (without underscores) linked to in the first article. There can be hundreds of different queries with the same first component.

Variants of the queries with spaces instead of underscores or with neither quotes nor underscores do not return results. Searching for just the first component does give results.
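A minimal sketch of recognizing this pattern and recovering the two titles. The regular expression and the underscore check are assumptions drawn from the examples above, not the heuristic actually used for the estimates.

```python
import re

# Two double-quoted strings joined by AND, with nothing else in the query.
PAIR_RE = re.compile(r'^"([^"]+)" AND "([^"]+)"$')

def parse_title_link_query(query):
    """Return (article title with spaces, link title), or None.

    Requires an underscore in the first component, as in the examples.
    """
    match = PAIR_RE.match(query)
    if match is None:
        return None
    title, link = match.groups()
    if "_" not in title:
        return None
    return title.replace("_", " "), link
```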

TV Episodes / Movies—"..." film
Examples: We had between 7,878 and 8,794 such queries in our samples, representing 1.58% to 1.76% of all zero-results queries. These were common in the following wikis: en, nl, de, fr, ja.
 * "88 Minutos" film
 * "Castle S1E1" film
 * "Como treinar o seu Dragão 2 Filmes Completos Dublados" film
 * "30 Rock - Season 3 S3E18" film

These queries consist of quoted material (generally a movie or TV show title, often followed by a season/episode number, e.g. S3E18), followed by the word film, even if the quoted material is not a film.

Many of the titles used in the queries (88 Minutos, Como treinar o seu Dragão 2) return results in the appropriate language wiki.

I believe that these queries are intended to find websites to download these films or TV episodes.
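The pattern can be picked out with a sketch like the one below, which also extracts a season/episode marker when one is present. Both regular expressions are assumptions based on the examples shown, and the function name is hypothetical.

```python
import re

# Quoted material followed by the bare word "film".
FILM_RE = re.compile(r'^"(?P<title>.+)" film$')
# Season/episode marker such as S3E18.
EPISODE_RE = re.compile(r"\bS(\d+)E(\d+)\b")

def parse_film_query(query):
    """Return (quoted title, (season, episode) or None), or None if no match."""
    match = FILM_RE.match(query)
    if match is None:
        return None
    title = match.group("title")
    marker = EPISODE_RE.search(title)
    episode = (int(marker.group(1)), int(marker.group(2))) if marker else None
    return title, episode
```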

quot
Examples: We had between 5,888 and 7,768 such queries in our samples, representing 1.18% to 1.55% of all zero-results queries. These were generally found in enwiki.
 * quot Anesthesia quot
 * quot Albert Payne quot
 * Canberra AND quot Andy Fisher quot
 * Moira East quot

These queries contain the word quot in them. It appears that quotation marks in the original query were converted to entities (&quot;) and then sanitized (removing & and ; from the query), leaving quot.

Many of the queries without the quot tokens are exact matches for article titles, and many match with the quot tokens either dropped or converted to straight quotation marks (").
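The suspected corruption, and one way to undo it, can be sketched as follows. This is a guess at the mechanism rather than a known implementation: replacing & and ; with spaces (instead of deleting them outright) is my assumption, chosen because it reproduces the spacing seen in the examples.

```python
import html
import re

def mangle(query):
    """Reproduce the suspected corruption of a quoted query.

    Assumption: quotes are escaped to entities, then & and ; are
    replaced with spaces and runs of whitespace are collapsed.
    """
    escaped = html.escape(query, quote=True)  # '"' becomes '&quot;'
    stripped = re.sub(r"[&;]", " ", escaped)
    return " ".join(stripped.split())

def drop_quot(query):
    """Remove standalone quot tokens from a query."""
    return " ".join(token for token in query.split() if token != "quot")
```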

term+term+term country
Examples: We had an estimated 3,437 to 6,725 such queries in our samples, representing 0.69% to 1.35% of all zero-results queries. These were generally found in eswiki, with some in enwiki. The country names are generally in Spanish or English, though some were in other languages.
 * ópera+del+estado+de+hamburgo Bangladés
 * zanthoxylum+thomasianum Bélgica
 * finance+and+revenue+f+c Germany
 * finance+and+revenue+f+c Ecuador

The countries included in the query don't seem to necessarily have any relationship with the other search terms.

Many of the queries do return results if the country name is excluded, in both enwiki and eswiki.

term+term+term
(This pattern is out of order to be grouped with the previous one.)

Examples: We had an estimated 1,382 to 2,536 such queries in our samples, representing 0.28% to 0.51% of all zero-results queries. These were generally found in eswiki, with some in enwiki.
 * accountable+care+organizations+and+evidence+based+payment+reform
 * android+phone+gone+awol+just+google
 * como+afilar+una+batidora+picadora+electrica+recupera+batidoras
 * el+peine+de+las+sirenas

These queries are characterized by being run together with pluses instead of spaces.
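A minimal sketch covering both plus-joined patterns: it converts the pluses back to spaces and separates any trailing text (such as a country name). Treating the pluses as form-encoded spaces is an assumption, and the function name is hypothetical.

```python
from urllib.parse import unquote_plus

def split_plus_query(query):
    """Return (terms with spaces, trailing text or None), or None.

    The plus-joined part contains no spaces, so the first space (if any)
    separates it from whatever follows. Note that unquote_plus would
    also decode any %xx escapes in the terms.
    """
    joined, _, trailing = query.partition(" ")
    if "+" not in joined:
        return None
    return unquote_plus(joined), trailing or None
```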

paint
Examples: We had an estimated 1,094 to 3,554 such queries in our samples, representing 0.22% to 0.71% of all zero-results queries. These were generally found in enwiki.
 * ""abel boulineau"" paint
 * ""wilhelmina k. lagerholm"" paint
 * Carl Friedrich Schulz - Zeitungslektüre am Biertisch (1851) paint
 * Karr par Breton d'après Petit paint

These queries are characterized by ending with the word paint. They come in two formats. In the first, the text before paint seems to be the name of an artist, double quoted twice (as in the first two examples). In the second, it is the name of a file on Wikimedia Commons, without the file suffix (e.g., .jpg), as in the last two examples. The artist names I tried did not return results, though Google found them in various art galleries. The Commons files generally return results when searched on Commons.

Highly repeated searches
These are idiosyncratic searches, but can be repeated up to hundreds of times per hour, indicating that they are probably not driven by a human typing in the search over and over.

Examples: These are hard to quantify, but I looked for queries that were repeated more than 50 times in my samples, didn't fall into other categories, and were unique enough not to be driven by the day's events or random searching (e.g., one word searches).
 * ou as I can get tonight without being detected, and Tuck and Clay will be there too, along with an undercover team. You’ll have an earpiece no one will be able to see, so we
   * Google books says this is a snippet from a novel. In a two-week sample of the top 100 queries per day, this came up 12 times. It was probably there the other two days, but didn't make the top 100. There are up to 964 queries in a day.
 * Iamlookingfornodethree
   * There is a weird pattern here of iamlookingfornode followed by a suffix (three in this case), where the suffix can be a few different things.
 * form 1+ 3dprinter
   * Just found one day, but 668 times in less than three hours.
 * Dounload feer game
   * Just found one day, but 248 times in less than three hours.

We had an estimated 892 to 3,019 such queries in our samples, representing 0.18% to 0.60% of all zero-results queries.
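The first step of this search (finding queries repeated more than 50 times) can be sketched as below. The function name is hypothetical, and the later manual steps, such as excluding other categories and event-driven or one-word searches, are not modeled.

```python
from collections import Counter

def highly_repeated(queries, threshold=50):
    """Return {query: count} for queries seen more than `threshold` times."""
    counts = Counter(queries)
    return {query: n for query, n in counts.items() if n > threshold}
```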

{searchTerms}
Examples: We had between 1,909 and 2,314 such queries in our samples, representing 0.38% to 0.46% of all zero-results queries. These were generally found in ruwiki.
 * {searchTerms}
 * %7bsearchTerms%7d
 * Liste der {searchTerms} Episoden
 * {searchTerms}'||'

This is likely developer error in an app or other automated search.

Similarly, we had a number of examples of search_suggest_query (440 to 509, 0.09% to 0.10%, en, de, fr) and \{@} (148 to 205, 0.03% to 0.04%, en, de, ru).
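A sketch for spotting unfilled template placeholders, including the percent-encoded variant shown in the examples. The placeholder list is taken from the strings named above; the function name is hypothetical, and the \{@} variant is left out for simplicity.

```python
from urllib.parse import unquote

# Placeholder strings observed in the samples.
PLACEHOLDERS = ("{searchTerms}", "search_suggest_query")

def has_unfilled_placeholder(query):
    """True if the query, after percent-decoding, contains a placeholder."""
    decoded = unquote(query)
    return any(placeholder in decoded for placeholder in PLACEHOLDERS)
```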

tel fax
Examples: We had between 33 and 1,293 such queries in our samples, representing 0.01% to 0.26% of all zero-results queries. These were generally found in dewiki.
 * Aluminum Bracket 44 uk tel fax
 * diecast aluminum housing 31 nl tel fax
 * plastic injection mold makers 34 es tel fax

These queries seem to consist of manufacturing terms, a two-digit number, a country code, and "tel fax".

Chinese product descriptions .xyz
Examples: We had an estimated 172 to 989 such queries in our samples, representing 0.03% to 0.20% of all zero-results queries. These were generally found in enwiki.
 * 花店用品/蓝色妖姬着色剂 根部吸水浅宝蓝玫瑰*13230782866*QQ34040316座机03177896222*psuddf.zhijieranliao.xyz
 * 直接染料/直接深棕GTL *13230782866*QQ34040316座机03177896222*zkdact.baohuabanranliao.xyz

These appear to be product descriptions in Chinese, along with additional information. They all end in .xyz. Note that 座机 means landline, indicating contact info.

Online searches for parts of these reveal a similar pattern on Chinese-language business/manufacturing sites.

Massive snippets
This particular category is fairly rare, but may incur significant computational cost, so it is worth noting. These are searches that are 500 characters in length or more (up to more than 5,000 characters). Many look like snippets from larger texts, such as books or articles, and are in several different languages.

We had between 183 and 261 such queries in our samples, representing 0.04% to 0.05% of all zero-results queries. These were generally found in en, fr, de, and ru.
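Selecting these queries is a simple length filter, sketched below. The 500-character threshold comes from the description above; counting Unicode code points rather than bytes is an assumption.

```python
def massive_snippets(queries, min_chars=500):
    """Return queries that are at least `min_chars` characters long.

    len() counts Unicode code points, not bytes, so multi-byte
    scripts are measured by character.
    """
    return [query for query in queries if len(query) >= min_chars]
```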

Miscellany
Below are some very general impressions from the larger collections of zero-results (10K+ from a given wiki). My ability to analyze languages I don't know is limited, but here is what I noticed:
 * dewiki has a few hundred OR'd together wildcard searches, some of which seem to be trying to handle variations in declension.
 * jawiki has lots of "..." film searches.
 * ruwiki has a few non-Cyrillic searches.
 * itwiki has lots of queries that are multi-word phrases with underscores instead of spaces.
 * eswiki and frwiki have a fair number of build up searches and searches in Arabic, and frwiki has a fair number of searches in Chinese.
 * zhwiki has lots of non-Chinese searches in various languages.
 * plwiki has a fair number of queries of the form * * AND (muzyk* OR Dyskografia) (with asterisks), where the leading term seems to be an artist, band, album, or something similar.