User:TJones (WMF)/Notes/Review of Commons Queries

July 2020 — See TJones_(WMF)/Notes for other projects. See also T258297 and T252544.

I undertook a review of queries on Wikimedia Commons to get a sense of how people are using Commons, and how we might improve search on Commons.

= Full Query Corpus Analysis =

Quick Summary
In three months' worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.

Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related, but even the most common queries are not really that common: only 6 queries out of over 1.09M unique (lightly normalized) queries were searched 1,000 times or more, and only 660 were searched 20 times or more. Over 950K were searched only once. There is not really a head; it's pretty much all long tail.

In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)

In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)

Only 1.6% of queries used a namespace, 0.9% had a file extension. Boolean and special operators were very rare.

10% of queries got zero results. Less than 1% got a million results or more.

If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.

Data
I pulled three months' worth of Commons queries from  to analyze—from April 15 to July 15, 2020.

The sample does not include all queries from the time period; I applied some of the usual filters that the search team has found to be useful to get a reasonable sample from "normal" human users:


 * We require the queries to have used the search box in the upper corner. This eliminates some bots, screen scrapers, and links that are queries.


 * We eliminate users who make more than 100 queries in a day. This helps prevent us from oversampling bots, power users and editors, script/gadget users, shared IP addresses, and other users who are either not "normal" or not human.
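A minimal sketch of the per-user daily cap described above. The `(user_id, day, query)` row layout is hypothetical, and dropping only the offending user's queries for that day (rather than all of that user's queries) is an assumption; the notes don't say which was done.

```python
from collections import Counter

def filter_heavy_users(rows, max_per_day=100):
    """Drop every query from any (user, day) pair with more than max_per_day queries.

    rows: iterable of (user_id, day, query) tuples (hypothetical schema).
    """
    rows = list(rows)
    # Count how many queries each user issued on each day.
    per_user_day = Counter((user, day) for user, day, _ in rows)
    # Keep only rows whose (user, day) pair stays at or under the cap.
    return [row for row in rows if per_user_day[(row[0], row[1])] <= max_per_day]
```

A user with exactly 100 queries in a day is kept; a user with 101 is removed for that day.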

Other caveats:


 * This sample only includes queries performed on Commonswiki. It does not include queries from other wikis that also search Commons data (like Wikipedias do).


 * I performed some very minimal normalization on all the data to make it easier to process: I converted tabs and newlines in any query to spaces. This shouldn't change the results shown or the intent of the query, but it sure does make a tab-separated file easier to process. (See below for info on other kinds of normalization done to group queries together.)


 * Note that this is a different data source from the one that Erik used for his Top N queries per day. See T257361.

The sample contains 1,532,070 (~1.5M) queries.

Scripts and Languages
The first thing I do with a pile of data, of course, is try to figure out what scripts and languages are contained in it.

Scripts

 * The vast majority of queries (1,414,693; 92.3%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.
   * 1,170,330 (76.4%) of queries consist only of A-Z (upper- and lowercase) and spaces.
   * Another 70,876 (4.6%) are only A-Z and the digits 0-9.


 * The next biggest groups are Arabic (17,675; 1.2%), Cyrillic (15,180; 1.0%), and CJK (25,239; 1.6%)—again possibly with additional numbers, punctuation, and symbols.


 * The 25,239 "CJK" queries include 13,974 queries that are CJK Unified Ideographs, 5,454 queries in Hangul/Korean, 3,007 queries in Japanese Katakana, 497 queries in Japanese Hiragana, and 2,307 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).

The breakdown by script is below:


 * An additional 2,352 queries (0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Latin/CJK (~500), Cyrillic/Latin (~400), and Arabic/Latin (~300). My favorite query in this group is mixed Cyrillic/Greek/Latin: Jolly Zοmbіеѕ (the bolded characters are not Latin).
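One rough way to spot mixed-script queries like the one above is to take the first word of each letter's Unicode character name as a stand-in for its script. This is only a sketch of the idea, not the Unicode regexes actually used for this analysis; the function name `scripts_in` is made up for illustration, and the Python standard library does not expose the real Unicode Script property.

```python
import unicodedata

def scripts_in(query):
    """Return the set of script-like name prefixes for the letters in a query."""
    scripts = set()
    for ch in query:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "GREEK SMALL LETTER OMICRON" -> "GREEK"
                scripts.add(name.split()[0])
    return scripts

# "Jolly Zombies" with a Greek omicron and Cyrillic i, ie, and dze swapped in.
lookalike = "Jolly Z\u03bfmb\u0456\u0435\u0455"
mixed = len(scripts_in(lookalike)) > 1
```

Digits, punctuation, and spaces are ignored, so a query like `john smith 42` still counts as Latin only.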

Numbers
A small number of queries are mostly numbers:


 * 2,448 (0.2%) are integers (a small number—just three—with invisibles or diacritics, and some—eleven—are longer numbers with commas)


 * 457 look like measurements (e.g., 3mm or 5x5)


 * 313 look like IP addresses


 * Plus a handful (69) of other numbers, including decimals, ordinals, hex numerals, malformed IP addresses, etc.
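The number categories above could be approximated with a few regular expressions. The patterns below are illustrative assumptions (the unit list in particular is invented), not the patterns used to produce the counts in this section.

```python
import re

INTEGER = re.compile(r"^[0-9]+$")
# Hypothetical unit list; "5x5"-style dimensions are handled by the second alternative.
MEASUREMENT = re.compile(
    r"^[0-9]+(?:\.[0-9]+)?\s?(?:mm|cm|km|m|kg|g)$|^[0-9]+\s?x\s?[0-9]+$",
    re.IGNORECASE,
)
IP_LIKE = re.compile(r"^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$")

def classify_numeric(query):
    """Return a rough label for number-like queries, or None for everything else."""
    q = query.strip()
    if INTEGER.match(q):
        return "integer"
    if MEASUREMENT.match(q):
        return "measurement"
    m = IP_LIKE.match(q)
    if m:
        # Four dotted groups only count as an IP address if each octet is 0-255.
        if all(int(octet) <= 255 for octet in m.groups()):
            return "ip_address"
        return "malformed_ip"
    return None
```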

Misc
There are a small number of additional text patterns not included above.


 * 2,720 (0.2%) queries look like identifiable web domains.


 * 626 queries look like email addresses.


 * 160 Latin-script acronyms (91 upper case, 69 lowercase). Traditionally we don't handle acronyms very well in search, so at least there aren't a lot of them.

Symbols
A small number of queries are all symbols (711; e.g., $600) or punctuation (107; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).

126 additional queries have characters that my Unicode regexes identify as "unassigned" code points, but these are mostly (though not entirely) emoji. (I assume most are newer emoji that have recently been assigned code points.)

Invisibles
1,390 queries (0.1%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.

If these aren't normalized well, they can screw up query results; however, these are clearly not a huge problem on Commons.
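As a sketch, invisible characters of the kinds listed above can be found with Unicode general categories. Mapping "bi-directional markers and formatting characters" to category `Cf`, "control characters" to `Cc`, and "odd whitespace" to any separator other than a plain space is an assumption about how the categories line up.

```python
import unicodedata

def invisible_chars(query):
    """List the invisible characters found in a query, in order of appearance."""
    found = []
    for ch in query:
        cat = unicodedata.category(ch)
        if cat in ("Cf", "Cc"):
            # Format characters (e.g. bidi marks) and control characters.
            found.append(ch)
        elif cat in ("Zs", "Zl", "Zp") and ch != " ":
            # Whitespace other than the ordinary ASCII space.
            found.append(ch)
    return found
```

For example, a left-to-right mark (U+200E) or a non-breaking space (U+00A0) inside a query would be reported, while ordinary spaces are not.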

Languages
I took a random sample of 1,000 queries and tried to identify them by language.

The categorizations are almost certainly not perfect, but they should be close enough to get a sense of the proportions of different languages used on Commons.

As a general rule, I don't try to identify the "language" of names. North Americans in particular like to combine names from various ethnolinguistic origins, though they aren't the only ones. Some names—Maria is the most egregious—are too ambiguous to categorize. I make exceptions for names in fairly unambiguous scripts; for example, 엔리코 콜런토니 is arguably "in Korean", even though it's a very Italian name ("Enrico Colantoni") of a Canadian actor.

If a query includes a name and text in a particular language, I count that as in that language. So, Abraham Lincoln is a name, but birthplace of Abraham Lincoln would be categorized as English and local de nascimento de Abraham Lincoln would be categorized as Portuguese.

19 queries were "unidentifiable" because they weren't words (11q!) or were too ambiguous as to language (e.g., a single-word query that could be English, French, or Spanish).

I categorized 42 queries as "technical terms" which are either too ambiguous or not really words (xml, t-800). There was also 1 number and 1 measurement.

A few items are assignable to a given language even though they are in the wrong script. The Russian song В путь is transliterated as V Put or V Put', which doesn't mean anything in English, so I guess it's in Russian? Sure, why not—there were only a small handful. Also, there were two wrong-keyboard Russian queries that I noticed, which I also counted as Russian.

A couple of queries were mixed-language. I counted them as the predominant language and noted the other language.

The most common categories then are English queries (507; 50.7%) and names (237; 23.7%), with a smattering of technical terms (43; 4.3%), German (34; 3.4%), Spanish (28; 2.8%), French (27; 2.7%), Chinese (14; 1.4%), Russian (13; 1.3%), Italian (11; 1.1%), and others.

Query Patterns
Here are some potentially interesting patterns I noticed in the queries:

Query Frequency
We wanted to look at the "head, torso, and tail" of the distribution of queries by frequency—however, there looks to me to be at most a tiny head and a long, long tail.

I did some very basic normalization of the queries for bucketing; I lowercased them, and normalized whitespace (removing leading and trailing spaces, and reducing multiple spaces to just one space), so that " JoHN   SMiTh  " and "john smith" count as the same query.
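The light normalization described above can be sketched in a couple of lines. Note that `str.split()` with no arguments splits on any Unicode whitespace, which may be slightly broader than what was done for this analysis; the function names are illustrative.

```python
from collections import Counter

def light_normalize(query):
    """Lowercase, trim, and collapse runs of whitespace to a single space."""
    return " ".join(query.lower().split())

def normalized_frequencies(queries):
    """Count how often each lightly normalized query occurs."""
    return Counter(light_normalize(q) for q in queries)
```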

There are 1,090,396 unique normalized queries (out of 1,532,070 queries).

I grouped the query frequencies into quasi-logarithmic binary/decimal buckets (1/2/5/10/20/50...) which is approximately logarithmic in both binary and decimal, has relatively fine-grained buckets, and is human-friendly. I also added buckets for 3 and 4, since there are many queries with these very low frequencies.
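A sketch of the bucketing, assuming each bucket is labeled by its lower bound and that the 1/2/5 pattern simply continues up to 2,000 (the exact upper buckets are an assumption):

```python
import bisect

# 1/2/5 steps per decade, plus the extra low-frequency buckets for 3 and 4.
BUCKET_STARTS = [1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1000, 2000]

def bucket_for(freq):
    """Return the lower bound of the bucket that a frequency (>= 1) falls into."""
    index = bisect.bisect_right(BUCKET_STARTS, freq) - 1
    return BUCKET_STARTS[index]
```

With these bounds, a query seen 7 times lands in the 5 bucket, and one seen 1,999 times lands in the 1,000 bucket.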

Only 2 queries appeared more than 2,000 times, and only 4 between 1,000 and 2,000. Altogether, only 660 queries (0.06%) appeared 20 or more times.

So, 99.94% of queries occur fewer than 20 times in 3 months. 97.89% occur fewer than 5 times. It's all long tail.

Normalized Query Frequency Distribution mean count: 1.40506



Where to break the head and tail in a distribution is generally subjective; however, Wikipedia has a page on the [Head/tail Breaks](https://en.wikipedia.org/wiki/Head/tail_Breaks) algorithm, which breaks the head and tail at the mean value in the distribution. The mean frequency for the normalized queries is 1.40506, which means the head would be everything with a frequency of 2+ (12.4% of queries), and the tail would be all of the queries that occur only once (87.6% of queries). Having a frequency of 2 (or even 10) out of over a million queries doesn't seem like the "head" to me, so I'm going to stick with my claim that there is no head!
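The mean used for the split is just total queries divided by unique normalized queries. The sketch below checks that arithmetic and shows a single head/tail split at the mean; the full Head/tail Breaks algorithm repeats the split on the head, which is not shown here.

```python
def head_tail_split(frequencies):
    """Split frequencies at their mean: values above it are head, the rest tail."""
    mean = sum(frequencies) / len(frequencies)
    head = [f for f in frequencies if f > mean]
    tail = [f for f in frequencies if f <= mean]
    return mean, head, tail

# Figures from this section: total queries and unique normalized queries.
total_queries = 1_532_070
unique_normalized = 1_090_396
mean_frequency = total_queries / unique_normalized  # about 1.40506
```

Because the mean falls between 1 and 2, every query seen exactly once is tail and anything seen twice or more is head.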

Query Intent
While talking to Cormac about this analysis, we talked a bit about "modifiers" to queries, such as looking for things with specific colors, etc. (y'know, the kinds of things that might show up in structured data!). It's a difficult and subjective line to draw, but I tried to divide between things that are essential and things that are preferences (like color of a car), and maybe things that would be reified in Wikidata (e.g., "Chinese art") or not (e.g., "ugly art"), with reifiable things not counting as "modified". There are not many queries with modifiers, and none in the top 100 most common queries.

In the random sample of 1,000 queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 95 (9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc. So, roughly 9-10% of Commons queries are likely porn-themed.

I took a sub-sample of 100 queries from this set and tried to categorize them by intent, to compare to the head-ish top 100 queries:


 * 30 were about specific or general things.
   * 23 things hard to classify more specifically
   * 1 had additional modifiers/specifiers
   * 3 specific products
   * 2 logos (also counted as images)
   * 1 film
   * 1 list (of Latin phrases??)
 * 22 were about specific people
   * 1 had additional modifiers/specifiers
   * 1 was about 2 specific people
 * 14 were about specific places
   * 1 had additional modifiers/specifiers
   * 1 was about an activity at that place
 * 12 were porn
 * 11 were about specific organizations
 * 3 were about general concepts
 * 2 were about an activity (one in a particular place)
 * 2 seemed to be looking for specific images (e.g., something fairly specific.jpg)
 * 1 was about art

I reviewed the top 100 most common queries and tried to categorize them by intent:


 * 61 were porn, and another 5 were likely porn. That's 2/3 of the most common queries.
   * The 4 non-English non-name higher-frequency queries were here: two German words and two Persian words.
   * There was one Category in the sample, and it was also in the porn category.
 * 7 were about "facts" (e.g., map of a place).
   * 1 list of...
 * 7 were about specific or general things.
 * 6 were about current events and topics in the news.
 * 5 were about non-pornographic celebrities or historical figures
 * 2 were about art or artists.
 * 2 were about specific places.
 * 2 were about tech topics or companies.
 * 3 I couldn't categorize (2 of them got 0 results).

Query Generality
I tried to categorize queries in various samples as narrow, broad, or somewhere in the middle.


 * From the sample of 1,000 random queries I categorized by language (representing the tail), I took a random subsample of 100 and categorized them by generality:
   * 60 were narrow (specific person, place, or object)
   * 22 were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
   * 17 were broad (many are one word); these were all porn but one
   * 1 was uncategorizable.
 * From the top 100 most common queries (the stubby head):
   * 24 were narrow (specific person, place, or object)
   * 27 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
   * 46 were broad (often one word); these were all porn but one
   * 3 were uncategorizable

Keywords and Specific Purposes

 * 25,047 queries (1.6%) specify a namespace. The most common are Category (15,519), File (7,044), and Template (823). 282 namespace queries were in Talk namespaces. There was one instance of a nonexistent namespace: media.
 * 83 queries use the character ː, which is normally used in phonetic transcription to indicate a lengthened vowel (less often a lengthened consonant). It's only used that way once in these 83 queries. 81 of them use it in place of a colon with a namespace!—which does not work. (And there was one other weird one I couldn't figure out.)


 * 14,118 queries (0.9%) end in a file extension (but don't use the File: keyword). The most common are .jpg (6,482), .svg (3,142), and .png (1,786). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .stl, .tab, .tif, .tiff, .txt, .wav, .webm.


 * 13,239 queries (0.9%) have easily detected URL bits—e.g., ?q=query&thing=whatchamacallit. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search, though it isn't clear how the URL components are getting copied from Google to Commons. I suppose it could be a logging error of some sort. A partly sanitized example query is below (the source language is Polish, and kudłaty stwór means "shaggy creature").
 * Chewbacca&tbnid=12345678901234&vet=1&docid=12345678901234&w=3456&h=5184&q=kudłaty+stwór&hl=pl&source=sh/x/im
 * Searching for kudłaty stwór in images on google.pl does give the Polish Wikipedia article and image for Chewbacca as the 4th result... so something is going on here, but it isn't clear what.
 * Of the 12,275 queries with tbnid= in them, all but one start with what looks like a query in Latin script (the other one is Bengali).
 * Of the 7,273 tbnid queries with a q= parameter, all but about 25 are largely Latin script; the others are Arabic, Bengali, Cyrillic, Devanagari, Emoji, Greek, Hebrew, Korean, Tamil, and Thai.


 * Only a handful of queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
 * AND: 83 of 234 queries with AND seem to be using Boolean operators, even though they don't really do anything (everything is ANDed together by default).
 * OR: 44 of 77 queries with OR seem to be Boolean operators.
 * NOT: 2 of 17 queries with NOT seem to be Boolean operators.
 * !: None of the 475 queries with ! seem to be well-formed, intentional Boolean queries. Most are not formatted as Booleans; Welcome! is not a Boolean query. The ones that are look like typos, like !mismatched quotes" or !st (looks like a mis-shift key for 1st). Others don't seem terribly useful, like Category:!Dogs, which returns everything that is not in Category:Dogs. Fewer than 10 queries are plausible Boolean queries.
 * -: 533 queries plausibly use - as negation. I did a quick skim, and the large majority of them look to be using it intentionally. 70 queries start with - and most of them don't look to be using it intentionally (or at least not correctly), since it's very hard (but not entirely impossible) to imagine -172 is a useful query.
 * ~: 110 queries use ~. Almost half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Many look like typos for a hyphen, as in 1900~2000. About 15 look to be intentional use of ~.
 * ( ): A handful of the queries using AND, OR, or NOT as Booleans also use parens, even though they don't do anything in our current search system.
 * +: Ignoring tbnid queries, 2,304 queries use +. Most seem to be using it in place of a space.
 * There are no special keywords with colons other than namespaces in my sample. Carly asked about  so I looked a little harder for those and found two instances of , six  , and one  , none with any other search terms. There was one malformed query:  , but it is also missing the colon.
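For the queries with URL bits in the list above, a sketch like the following can separate the leading search text from the trailing parameters. It assumes the parameters always follow the first `&`, which matches the sanitized Chewbacca example but has not been checked against the whole sample; `split_url_bits` is an illustrative name.

```python
from urllib.parse import parse_qs

def split_url_bits(query):
    """Split a query into its leading text and any trailing URL-style parameters."""
    head, sep, tail = query.partition("&")
    params = parse_qs(tail) if sep else {}
    if not params:
        # No key=value pairs after the "&", so treat it as an ordinary query.
        return query, {}
    return head, params

example = ("Chewbacca&tbnid=12345678901234&vet=1&docid=12345678901234"
           "&w=3456&h=5184&q=kudłaty+stwór&hl=pl&source=sh/x/im")
text, params = split_url_bits(example)
```

Here `text` is the part a user would presumably have wanted to search for, and `params["q"]` holds the original Google query with `+` decoded to a space.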

Light vs Heavy Normalization
While talking to Erik about the Top N queries report he's putting together, he mentioned that he's doing a stronger form of normalization for that report, and replacing all punctuation with spaces before normalizing whitespace.

This is more likely to group queries that get different results than my "light" normalization used above—though even the light normalization may do so because of the way we treat CamelCase search terms. The most obvious case is removing quotes from around a query: John Smith will get many more results than "John Smith". However, as Erik rightly pointed out, in most cases, such variants probably generally represent the same query intent.

It turns out that in my sample, light vs heavy normalization makes only a small difference, about 1.5-2.5% increase in buckets other than the singleton bucket.

The most common change among the heavily normalized top 100 queries is stripping quotes. The most impactful change—in terms of increasing the number of queries grouped together—is stripping hyphens, though it only applied to one query.
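A sketch of the heavier normalization, as I understand it from the description above: replace punctuation with spaces, then lowercase and normalize whitespace. This is not Erik's actual code. One known limitation of this version is that Python's `\W` also matches combining marks, so it would split words in scripts that rely on them.

```python
import re

def heavy_normalize(query):
    """Replace punctuation and symbols with spaces, then lowercase and collapse whitespace."""
    # [\W_] matches anything that is not a letter or digit, including the underscore.
    return re.sub(r"[\W_]+", " ", query.lower()).strip()
```

Under this version, a quoted name and its unquoted form fall into the same bucket, as do `covid-19` and `covid 19`.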

Distribution of Hits
Below is a summary of the distribution of results from all of the queries in the sample.

153,549 queries (10%) get zero results. 9,566 queries (0.62%) get more than a million results. The ideal number is probably somewhere in between.

Result Count Frequency Distribution

Distribution of Token Counts
The number of tokens (roughly, words) in a query is an easy proxy for the complexity of a query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.

Here, we're using a very simple process of breaking tokens on spaces and punctuation. So, a long Chinese sentence would be counted as one token, and ain't would be counted as two, and ''.. ,, ;; --'' would be counted as zero. It isn't perfect, but it gives us a reasonable approximation of what we have.
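The token counting described above can be sketched as follows. The regular expression is an assumption about what "spaces and punctuation" covers; it treats any run of letters and digits as one token.

```python
import re

# A token is a maximal run of letters or digits (no whitespace, punctuation, or underscore).
TOKEN = re.compile(r"[^\W_]+")

def count_tokens(query):
    """Count tokens by breaking on whitespace and punctuation."""
    return len(TOKEN.findall(query))
```

This reproduces the examples in the text: an unspaced Chinese sentence counts as one token, `ain't` as two, and a string of only punctuation as zero.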

Of note, 1,011,221 queries (66.0%) are one or two tokens. 1,417,935 queries (92.6%) are one to four tokens.

All of the "zero-token" queries are strings of punctuation and symbols.

Of the 20 queries with 40 or more tokens, 4 are gibberish, and most of the rest seem to be captions from images (presumably looking for the original image, but who knows?). 5 are in French, 1 in Spanish, and 11 are in English.

Tokens Count Distribution

Miscellaneous Odd or Interesting Queries

 * Erik's Top N report doesn't exclude anything, and covers all wikis. We noticed some politics-themed searches on English Wikipedia like 2020 Texas US Senate Election with no variation in whitespace or capitalization. I'm not sure I found the source of the query, but I did find a political website that helps you determine what politicians you align with by asking you questions about various topics. For each topic, they have a "Learn More" link which links to a search on English Wikipedia. So it is definitely possible that unexpected Commons searches (e.g., a relatively complex query with no variation) may be the result of links, which may not be publicly available.
 * I saw a handful of queries that look like attempts at SQL injection.
 * There are some random-looking queries, e.g. 510d47d9-4f0a-a3d9-e040-e00a18064a99, that turn out to be identifiers in collections that images have been taken from.

= Zero-Results Query Sub-Corpus Analysis =

After writing up the initial report, I realized that 10% of the queries get zero results, which is a lot, but not as many as we see on many Wikipedias. I investigated just the zero-results queries separately.

Zero-Results Summary
In three months' worth of likely-human queries issued on Commons, zero-result queries make up about 10% of all queries (which is less than the zero-results rate on Wikipedias). Subjectively, the zero-results queries seem to have less junk than on Wikipedia, and so may be more salvageable. Also, there seem to be more spelling errors/typos in the zero-results queries.

80% of the zero-results queries are in the Latin script (which is less than in the total corpus, which is 90% Latin text). Only 32% are in English (vs 50%), and roughly 25% are names (same as overall). Only 6.5% are porn-related (vs 9.5% overall).

Only 31 of the top 100 most common zero-results queries are porn-related, vs 66 overall.

Zero-results are more heavily skewed toward unique queries.

In a sample of 200 random zero-results queries (the long tail), 37% were about specific things, 20.5% people, 13% places, 5.5% facts, 3% organizations, and 6.5% were porn. This is roughly similar to the full corpus, with a bit less porn. 60% of zero-results queries were narrow and fairly specific, 10% were fairly broad, and 21.5% were in the middle. (Broad zero-results queries were often one word.) This is very similar to the full corpus.

In a sample of the 100 most common zero-results queries (the head-ish), 31 were porn, 28 were specific things, 23 were people. This is much more specific and has half the porn of the full corpus. 57 queries were narrow and fairly specific, 31 were fairly broad, and 9 were in the middle. (Broad zero-results queries were often one word.) This is skewed much more toward narrow queries compared to the full corpus.

Breaking on whitespace and punctuation (less than ideal, but easy), 60% of queries are one or two words; 86% are four words or fewer. This is slightly less than the full corpus. More than half of all high-token queries (10+) give zero results.

Spelling errors seem more common in the zero-results queries (and there is less junk in the zero-results queries than in Wikipedia data); 32% of the random sample of zero-results queries have spelling errors, and 38% of the top 100 zero-results queries have spelling errors. "Did you mean" suggestions and the completion suggester do okay, but could be much better. The current completion suggester doesn't have much to work with because it is limited to page/file/category names, which are not always good matches with what people are searching for. T250436 could be a big help!

The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or similar "non-organic" source.

Zero-Results Data
This is a sub-sample of the earlier-described data set, limited to queries that got zero results.

The sample contains 153,525 zero-results queries (almost exactly 10.0% of the full sample).

Zero-Results Scripts
The majority of zero-results queries (125,326; 81.6%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category. The next biggest groups are Arabic (6,618; 4.3%), Cyrillic (2,361; 1.5%), and CJK (2,733; 1.8%)—again possibly with additional numbers, punctuation, and symbols.
 * The Latin-script share is notably lower than for queries overall (81.6% vs 92.3%).
   * 88,804 (57.8%) of zero-results queries consist only of A-Z (upper- and lowercase) and spaces. This group is notably smaller (57.8% vs 76.4%).
   * Another 5,434 (3.5%) are only A-Z and the digits 0-9.
 * The Arabic percentage is notably higher (4.3% vs 1.2%).
 * The 2,733 "CJK" zero-results queries include 1,737 queries in Hangul/Korean, 432 queries in Japanese Katakana, 1 query in Japanese Hiragana, 139 queries that are CJK Unified Ideographs, and 424 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).
   * The mix here is very different from the general queries, with significantly more Hangul, and significantly fewer Ideographs and Hiragana.

The breakdown by script is below:

An additional 1,182 zero-results queries (0.8%, up from 0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Arabic/Latin (~240), Cyrillic/Latin (~200), Latin/CJK (~160), and Hangul/Latin (~120). Not surprisingly, the mixed Cyrillic/Greek/Latin: Jolly Zοmbіеѕ got zero results.

Zero-Results Numbers
A small number of zero-results queries are mostly numbers:


 * 923 (0.6%) are integers.


 * 1 looks like a measurement (e.g., 3mm or 5x5)


 * Plus a handful (7) of other numbers, including decimals, etc.

Zero-Results Misc
There are a small number of additional text patterns not included above.


 * 666 (0.4%) zero-results queries look like identifiable web domains.


 * 572 (0.4%) zero-results queries look like email addresses.

Zero-Results Symbols
A small number of zero-results queries are all symbols (74; e.g., $600) or punctuation (97; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).

Zero-Results Invisibles
393 zero-results queries (0.3%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.

Zero-Results Languages
I took a random sample of 200 zero-result queries and tried to identify them by language.

Reminders:
 * I generally don't try to identify the language of names.
 * If a zero-results query includes a name and text in a particular language, I count that as in that language.

14 zero-results queries (7.0%) were "unidentifiable" because they weren't words (11q!). This is significantly more than the full corpus (7.0% vs 1.9%).

I categorized 12 (6.0% vs 4.2% in the full corpus) zero-results queries as "technical terms" which are either too ambiguous or not really words (xml, t-800). There was also 1 number.

A couple of zero-results queries were mixed-language. I counted them as the predominant language and noted the other language.

The most common categories then are English queries (63; 31.5% vs 50.7%) and names (47; 23.5% vs 23.7%), with a smattering of technical terms (12; 6.0% vs 4.3%), German (12; 6.0% vs 3.4%), Arabic (9; 4.5% vs 0.2%), Spanish (8; 4.0% vs 2.8%), Korean (5; 2.5% vs 0.8%), French (4; 2.0% vs 2.7%), and others.

Zero-Results Query Patterns
Here are some potentially interesting patterns I noticed in the zero-results queries...

Zero-Results Query Frequency
The frequency distribution of the zero-results queries has a stronger skew towards unique queries, even with light normalization, with the mean frequency being 1.08 (vs 1.41 for all queries).

There are 142,406 unique normalized zero-results queries (out of 153,525 zero-results queries).

Only 2 zero-results queries appeared more than 200 times, and only 2 between 50 and 100. Altogether, only 35 zero-results queries (0.02%) appeared 20 or more times.

So, 99.98% of zero-results queries occur fewer than 20 times in 3 months. 99.7% occur fewer than 5 times. It's all long tail.

Zero-Results Query Intent
In the random sample of 200 zero-results queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 13 (6.5% vs 9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc.

I also tried to categorize this set by intent, to compare to the head-ish top 100 zero-results queries:

 * 74 were about specific or general things.
   * 5 were about specific events
   * 3 specific products
   * 2 films
   * 1 song
   * 1 website
 * 41 were about specific people
 * 26 were about specific places
 * 13 were porn
 * 11 were about "facts"
 * 6 were about specific organizations
 * 6 were about general concepts
 * 4 were about art
 * 8 were junk
 * 4 were categories
 * 3 seemed to be looking for specific files (e.g., something fairly specific.jpg)

I reviewed the top 100 most common zero-results queries and tried to categorize them by intent:

 * 31 were porn
 * 28 were about specific or general things.
   * 5 films
   * 5 websites
 * 23 were about specific people
 * 5 were about specific organizations
 * 3 were about specific places
 * 2 were about "facts"
 * 2 were about general concepts
 * 2 malformed keywords ("category:portal:mathematics")
 * 3 I couldn't categorize
 * 1 seemed to be looking for a specific file

Zero-Results Query Generality
I tried to categorize zero-results queries in various samples as narrow, broad, or somewhere in the middle.

From the sample of 200 random zero-results queries I categorized by language (representing the tail), I categorized them by generality:

 * 120 (60.0%) were narrow (specific person, place, or object)
 * 43 (21.5%) were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
 * 20 (10.0%) were broad (many are one word); these were all porn but one
 * 17 (8.5%) were uncategorizable.

From the top 100 most common zero-results queries (the stubby head):

 * 57 were narrow (specific person, place, or object)
 * 9 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
 * 31 were broad (often one word); these were all porn but one
 * 3 were uncategorizable.

Zero-Results Keywords and Specific-Purposes

 * 7,292 zero-results queries (5.1% vs 1.6%) specify a namespace. The most common are Category (5,809), File (970), and Template (206). 60 namespace zero-results queries were in Talk namespaces. The previously mentioned nonexistent namespace, media, got zero results.
 * 70 zero-results queries use the character ː, which is normally used in phonetic transcription, instead of a colon in an obvious namespace-style query. (There are 81 uses of ː in the whole corpus).


 * 2,077 zero-results queries (1.5%) end in a file extension (but don't use the File: keyword). The most common are .jpg (1,211), .png (236), and .svg (197). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .tab, .tif, .wav, .webm.


 * 13,193 zero-results queries (9.3% vs 0.9%) have easily detected URL bits. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search. As most of the queries with URL bits get zero results, these are pretty much the same set.


 * Only a handful of zero-results queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
 * AND: 9 of 23 zero-results queries with AND seem to be using Boolean operators.
 * OR: There are no zero-results queries with OR.
 * NOT: Neither of the 2 zero-results queries with NOT seems to be a Boolean operator.
 * -: 30 zero-results queries plausibly use - as negation. "Covid -19" is common among the ones that do not seem to be using it intentionally as negation.
 * ~: 37 zero-results queries use ~. More than half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Maybe 1 looks to be intentional use of ~.
 * +: Ignoring tbnid zero-results queries, 301 queries use +. Most seem to be using it in place of a space.
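
The checks described in this list can be roughly sketched in Python as below. This is only an illustration, not the script used for the analysis: the namespace list, extension list, URL fragments, and the function name query_features are all assumptions made for the example, and the real lists were presumably longer.

```python
import re

# Assumed, illustrative lists; the actual analysis may have used different ones.
NAMESPACES = {"category", "file", "template", "talk", "user"}
EXTENSIONS = {"jpg", "jpeg", "png", "svg", "gif", "pdf", "djvu",
              "ogg", "ogv", "tif", "wav", "webm"}
URL_BITS = ("tbnid=", "source=sh/x/i", "http://", "https://")

MODIFIER_COLON = "\u02d0"  # the phonetic length mark that looks like a colon


def query_features(query):
    """Return a set of coarse feature labels for one query string."""
    features = set()
    q = query.strip()

    # Namespace-style prefix, written with a real colon or with U+02D0.
    prefix = re.match(r"([^:\u02d0]+)([:\u02d0])", q)
    if prefix and prefix.group(1).strip().lower() in NAMESPACES:
        features.add("namespace")
        if prefix.group(2) == MODIFIER_COLON:
            features.add("phonetic_colon")

    # Known file extension at the very end of the query.
    ext = re.search(r"\.([A-Za-z0-9]{2,4})$", q)
    if ext and ext.group(1).lower() in EXTENSIONS:
        features.add("file_extension")

    # Fragments that suggest a pasted URL (e.g., from an image search).
    if any(bit in q for bit in URL_BITS):
        features.add("url_bits")

    return features
```

Boolean operators are deliberately left out of the sketch: as noted above, telling Portland, OR apart from a real OR query took manual judgment based on context and capitalization.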

Zero-Results Distribution of Token Counts
The number of tokens (roughly, words) in a zero-results query is an easy proxy for the complexity of the query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.

Of note, 86,230 zero-results queries (60.6% vs 66.0%) are one or two tokens. 122,420 zero-results queries (86.0% vs 92.6%) are one to four tokens.

All of the "zero-token" zero-results queries are strings of punctuation and symbols.

Given that zero-results queries are 10% of all queries in the sample, the table below shows where zero-results queries are over-represented.
 * "Zero-token" queries are punctuation and symbols, so it isn't too surprising most of them get zero results.
 * Queries of 7 or more tokens are more than twice as likely to give zero results.
 * For queries with more than 10 tokens, more than half give zero results.
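
As a rough sketch of how the token counts and the over-representation comparison could be computed, the Python below splits on whitespace and punctuation and reports the zero-results rate per token count. The function names and the (query, result count) input format are assumptions for illustration, and the tokenizer shares the limitation noted above for spaceless languages.

```python
import re
from collections import Counter


def count_tokens(query):
    """Count tokens by splitting on whitespace and punctuation."""
    # \w+ matches runs of letters, digits, and underscores (Unicode-aware);
    # everything else acts as a separator.
    return len(re.findall(r"\w+", query))


def zero_result_rate_by_length(queries):
    """Map token count -> share of queries of that length with zero results.

    `queries` is an iterable of (query_string, result_count) pairs.
    """
    totals, zeros = Counter(), Counter()
    for text, hits in queries:
        n = count_tokens(text)
        totals[n] += 1
        if hits == 0:
            zeros[n] += 1
    return {n: zeros[n] / totals[n] for n in totals}
```

Each rate can then be compared with the overall zero-results rate of about 10%: a rate above 20% corresponds to "more than twice as likely", and a query made only of punctuation counts as a "zero-token" query.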

Zero-Results Spelling Errors
I noticed more obvious spelling errors in the zero-results sample than in the larger sample, so I tagged items that seemed to be spelling errors. Spelling errors in languages that use the Latin script are easier for me to detect, so there may be more that I missed.

65 (32.5%) of the random sample of 200 zero-results queries look like spelling errors, in several languages: mostly English, but also German, Spanish, French, Polish, Portuguese, and Swedish.
 * 26 (13.0%) had good Did You Mean suggestions. 8 DYM suggestions were bad, and 7 were mediocre.
 * 5 (2.5%) had good completion suggester suggestions. 1 was mediocre.
 * 22 (11.0%) had no useful corrections.

38 (38.0%) of the 100 top zero-results queries look like spelling errors: one in Spanish, the rest in English.
 * 23 (23.0%) had good Did You Mean suggestions. 2 DYM suggestions were bad, and 2 were mediocre.
 * 7 (7.0%) had good completion suggester suggestions. 1 was bad, and 2 were mediocre.
 * 8 (8.0%) had no useful corrections.

The completion suggester often didn't have much to work with, because there are no pages, files, or categories that match the text of the query, though they could match part of the query. For example, nothing matches Waashington state flag, though Waashington by itself gets several suggestions which start with Washington.

Erik's planned work (T250436) on improving suggestions could help a lot with these kinds of queries on Commons.

Zero-Results Miscellaneous Odd or Interesting Queries
Some of the queries that look like attempts at SQL injection unsurprisingly get zero results.

The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as one person repeating the search over and over, someone linking to the search results, or a similar "non-organic" source.

Schematized Top 8 Queries with Frequency:
 * 318 Internátional website.com Airport
   * international airport with a website in the middle of its name
   * all with the same capitalization and accents!
 * 232 1234abcd
 * 82 "Firstname Lastname"
 * 54 Merkel Potrait / merkel potrait / Merkel potrait
   * This one looks "organic", because of the variation. I guess portrait is hard to spell. I found several other instances of potrait (though not this many), as well as mnerkel potrait, and other instances of angela maerkel, angela merkeöl, and Angela Merklel.
 * 45 Firstname Middlename Lastname
 * 36 porn related searc
   * porn-related search with the last letter of the last word missing
 * 34 titleof famous painting
   * famous painting title with two words run together

Other than Merkel potrait, all instances of these are identical under heavy normalization (lowercasing, removing punctuation, and normalizing whitespace).
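
A minimal sketch of heavy normalization as described above (lowercasing, removing punctuation, and normalizing whitespace), along with a helper that groups raw queries by their normalized form. The function names are hypothetical, and replacing punctuation with a space rather than deleting it outright is an assumption; the original analysis may have handled it differently.

```python
import re
from collections import defaultdict


def heavy_normalize(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = query.lower()
    # Assumption: punctuation and symbols become spaces rather than vanishing,
    # so "merkel-potrait" and "merkel potrait" end up with the same key.
    no_punct = re.sub(r"[^\w\s]", " ", lowered).replace("_", " ")
    return " ".join(no_punct.split())


def group_variants(queries):
    """Map each normalized form to the set of raw strings that produced it."""
    groups = defaultdict(set)
    for q in queries:
        groups[heavy_normalize(q)].add(q)
    return dict(groups)
```

A normalized form that occurs many times but has only one raw spelling is a candidate "non-organic" query, while one with several raw spellings, like the Merkel example, looks more like independent searchers.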