User:TJones (WMF)/Notes/Review of Commons Queries

July 2020 — See TJones_(WMF)/Notes for other projects. See also T258297 and T252544.

I undertook a review of queries on Wikimedia Commons to get a sense of how people are using Commons, and how we might improve search on Commons.

Quick Summary
In three month's worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.

Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related, but even the most common queries are not really that common, and only 6 queries out of over 1.04M unique (lightly normalized) queries were searched 1,000 times or more, and only 660 were searched 50 times or more. Over 950K were unique. There is not really a head—it's pretty much all the long tail.

In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)

In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)

Only 1.6% of queries used a namespace, 0.9% had a file extension. Boolean and special operators were very rare.

10% of queries got zero results. Less than 1% got a million results or more.

If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.

Data
I pulled three months' worth of Commons queries from  to analyze—from April 15 to July 15, 2020.

The sample does not include all queries from the time period; I applied some of the usual filters that the search team has found to be useful to get a reasonable sample from "normal" human users:


 * We require the queries to have used the search box in the upper corner. This eliminates some bots, screen scrapers, and links that are queries.


 * We eliminate users who make more than 100 queries in a day. This helps prevent us from oversampling bots, power users and editors, script/gadget users, shared IP addresses, and other users who are either not "normal" or not human.

Other caveats:


 * This sample only includes queries performed on Commonswiki. It does not include queries from other wikis that also search Commons data (like Wikipedias do).


 * I performed some very minimal normalization on all the data to make it easier to process: I converted tabs and newlines in any query to spaces. This shouldn't change the results shown or the intent of the query, but it sure does make a tab-separated file easier to process. (See below for info on other kinds of normalization done to group queries together.)


 * Note that this is a different from the data source that Erik used for his Top N queries per day. See T257361.

The sample contains 1,532,070 (~1.5M) queries.

Scripts and Languages
The first thing I do with a pile of data, of course, is try to figure out what scripts and languages are contained in it.

Scripts

 * The vast majority of queries (1,414,693; 92.3%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.
 * 1,170,330 (76.4%) of queries consist only of A-Z (upper- and lowercase) and spaces.
 * Another 70,876 (4.6%) are only A-Z and the digits 0-9.


 * The next biggest groups are Arabic (17,675; 1.2%), Cyrillic (15,180; 1.0%), and CJK (25,239; 1.6%)—again possibly with additional numbers, punctuation, and symbols.


 * The 25,239 "CJK" queries include 13,974 queries that are CJK Unified Ideographs, 5,454 queries in Hangul/Korean, 3,007 queries in Japanese Katakana, 497 queries in Japanese Hiragana, and 2,307 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).

The breakdown by script is below:


 * An additional 2,352 queries (0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Latin/CJK (~500), Cyrillic/Latin (~400), and Arabic/Latin (~300). My favorite query in this group is mixed Cyrillic/Greek/Latin: Jolly Zοmbіеѕ (the bolded characters are not Latin).

Numbers
A small number of queries are mostly numbers:


 * 2,448 (0.2%) are integers (a small number—just three—with invisibles or diacritics, and some—eleven—are longer numbers with commas)


 * 457 look like measurements (e.g., 3mm or 5x5)


 * 313 look like IP addresses


 * Plus a handful (69) of other numbers, including decimals, ordinals, hex numerals, malformed IP addresses, etc.

Misc
There are a small number of additional text patterns not included above.


 * 2,720 (0.2%) queries look like identifiable web domains.


 * 626 queries look like email addresses.


 * 160 Latin-script acronyms (91 upper case, 69 lowercase). Traditionally we don't handle acronyms very well in search, so at least there aren't a lot of them.

Symbols
A small number of queries are all symbols (711; e.g., $600) or punctuation (107; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).

126 additional queries have characters that my Unicode regexes identify as "unassigned" code points, but these are mostly—but not entirely—emoji. (I assume most are newer emoji than have recently been assigned code points.)

Invisibles
1,390 queries (0.1%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.

If these aren't normalized well, they can screw up query results, however, these are clearly not a huge problem on Commons.

Languages
I took a random sample of 1,000 queries and tried to identify them by language.

The categorizations are almost certainly not perfect, but they should be close enough to get a sense of the proportions of different languages used on Commons.

As a general rule, I don't try to identify the "language" of names. North Americans in particular like to combine names from various ethnolinguistic origins, though they aren't the only ones. Some names—Maria is the most egregious—are too ambiguous to categorize. I make exceptions for names in fairly unambiguous scripts; for example, 엔리코 콜런토니 is arguably "in Korean", even though it's a very Italian name ("Enrico Colantoni") of a Canadian actor.

If a query includes a name and text in a particular language, I count that as in that language. So, Abraham Lincoln is a name, but birthplace of Abraham Lincoln would be categorized as English and local de nascimento de Abraham Lincoln would be categorized as Portuguese.

19 queries were "unidentifiable" because they weren't words (11q!) or were too ambiguous as to language (e.g., a single-word query that could be English, French, or Spanish).

I categorized 42 queries as "technical terms" which are either too ambiguous or not really words (xml, t-800). There was also 1 number and 1 measurement.

A few items are assignable to a given language even though they are in the wrong script. The Russian song В путь is transliterated as V Put or V Put', which doesn't mean anything in English, so I guess it's in Russian? Sure, why not—there were only a small handful. Also, there were two wrong-keyboard Russian queries that I noticed, which I also counted as Russian.

A couple of queries were mixed-language. I counted them as the predominate language and noted the other language.

The most common categories then are English queries (507; 50.7%) and names (237; 23.7%), with a smattering of technical terms (43; 4.3%), German (34; 3.4%), Spanish (28; 2.8%), French (27; 2.7%), Chinese (14; 1.4%), Russian (13; 1.3%), Italian (11; 1.1%), and others.

Query Patterns
Here are some potentially interesting patterns I noticed in the queries:

Query Frequency
We wanted to look at the "head, torso, and tail" of the distribution of queries by frequency—however, there looks to me to be at most a tiny head and a long, long tail.

I did some very basic normalization of the queries for bucketing; I lowercased them, and normalized whitespace (removing leading and trailing spaces, and reducing multiple spaces to just one space), so that " JoHN   SMiTh  " and "john smith" count as the same query.

There are 1,090,396 unique normalized queries (out of 1,532,070 queries).

I grouped the query frequencies into quasi-logarithmic binary/decimal buckets (1/2/5/10/20/50...) which is approximately logarithmic in both binary and decimal, has relatively fine-grained buckets, and is human-friendly. I also added buckets for 3 and 4, since there are many queries with these very low frequencies.

Only 2 queries appeared more than 2,000 times, only 4 between 1,000 and 2,000. All together, only 660 queries (0.06%) appeared more than 20 times.

So, 99.94% of queries occur less than 20 times in 3 months. 97.89% occur fewer than 5 times. It's all long tail.

Normalized Query Frequency Distribution mean count: 1.40506



Where to break the head and tail in a distribution is generally subjective, however Wikipedia has a page on the [Head/tail Breaks](https://en.wikipedia.org/wiki/Head/tail_Breaks) algorithm, which breaks the head and tail at the mean value in the distribution. The mean frequency for the normalized queries is 1.40506, which means the head would be everything with a frequency of 2+ (12.4% of queries), and the tail would be all of the unique queries (87.6% of queries). Having a frequency of 2 (or even 10) out of over a million queries doesn't seem like the "head" to me, so I'm going to stick with my claim that there is no head!

Query Intent
While talking to Cormac about this analysis, we talked a bit about "modifiers" to queries, such as looking for things with specific colors, etc.—y'know, the kinds of things that might show up in structured data! It's a difficult and subjective line to draw, but I tried to divide between things that are essential and things that are preferences (like color of a car), and maybe things that would be reified in wikidata (e.g., "Chinese art") or not (e.g., "ugly art"), with reifiable things not counting as "modified". There are not many queries with modifiers—none in the top 100 most common queries.

In the random sample of 1,000 queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 95 (9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc. So, roughly 9-10% of Commons queries are likely porn-themed.

I took a sub-sample of 100 queries from this set and tried to categorize them by intent, to compare to the head-ish top 100 queries:


 * 30 were about specific or general things.
 * 23 things hard to classify more specifically
 * 1 had additional modifiers/specifiers
 * 3 specific products
 * 2 logos (also counted as images)
 * 1 film
 * 1 list (of latin phrases—??)
 * 22 were about specific people
 * 1 had additional modifiers/specifiers
 * 1 was about 2 specific people
 * 14 were about specific places
 * 1 had additional modifiers/specifiers
 * 1 was about an activity at that place
 * 12 were porn
 * 11 were about specific organizations
 * 3 were about general concepts
 * 2 were about an activity (one in a particular place)
 * 2 seemed to be looking for specific images (e.g., something fairly specific.jpg)
 * 1 was about art

I reviewed the top 100 most common queries and tried to categorize them by intent:


 * 61 were porn, and another 5 were likely porn. That's 2/3 of the most common queries.
 * The 4 non-English non-name higher-frequency queries were here: two German words and two Persian words.
 * There was one Category in the sample, and it was also in the porn category.
 * 7 were about "facts" (e.g., map of a place).
 * 1 list of...
 * 7 were about specific or general things.
 * 6 were about current events and topics in the news.
 * 5 were about non-pornographic celebrities or historical figures
 * 2 were about art or artists.
 * 2 were about specific places.
 * 2 were about tech topics or companies.
 * 3 I couldn't categorize (2 of them got 0 results).

Query Generality
I tried to categorize queries in various samples as narrow, broad, or somewhere in the middle.


 * From the sample of 1,000 random queries I categorized by language (representing the tail), I took a random subsample of 100 and categorized them by generality:
 * 60 were narrow (specific person, place, or object)
 * 22 were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
 * 17 were broad (many are one word); these were all porn but one
 * 1 was uncategorizable.


 * From the top 100 most common queries (the stubby head):
 * 24 were narrow (specific person, place, or object)
 * 27 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
 * 46 were broad (often one word); these were all porn but one
 * 3 were uncategorizable

Keywords and Specific Purposes

 * 25,047 queries (1.6%) specify a namespace. The most common are Category (15,519), File (7,044), and Template (823). 282 namespace queries were in Talk namespaces. There was one instance of a non-existant namespace: media.
 * 83 queries use the character ː, which is normally used in phonetic transcription to indicate a lengthened vowel (less often a lengthened consonant). It's only used that way once in these 83 queries. 81 of them use it in place of a colon with a namespace!—which does not work. (And there was one other weird one I couldn't figure out.)


 * 14,118 queries (0.9%) end in a file extension (but don't use the File: keyword). The most common are .jpg (6,482), .svg (3,142), and .png (1,786). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .stl, .tab, .tif, .tiff, .txt, .wav, .webm.


 * 13,239 queries (0.9%) have easily detected URL bits—e.g., ?q=query&thing=whatchamacallit. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search, though it isn't clear how the URL components are getting copied from Google to Commons. I suppose it could be a logging error of some sort. A partly sanitized example query is below (the source language is Polish, and kudłaty stwór means "shaggy creature").
 * Chewbacca&tbnid=12345678901234&vet=1&docid=12345678901234&w=3456&h=5184&q=kudłaty+stwór&hl=pl&source=sh/x/im
 * Searching for kudłaty stwór in images on google.pl does give the Polish Wikipedia article and image for Chewbacca as the 4th result... so something is going on here, but it isn't clear what.
 * Of the 12,275 queries with tbnid= in them, all but one start with what looks like a query in Latin script (the other one is Bengali).
 * Of the 7,273 tbnid queries with a q= parameter, all but about 25 are largely Latin script; the others are Arabic, Bengali, Cyrillic, Devanagari, Emoji, Greek, Hebrew, Korean, Tamil, and Thai.


 * Only a handful of queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
 * AND: 83 of 234 queries with AND seem to be using Boolean operators, even though they don't really do anything (everything is ANDed together by default).
 * OR: 44 of 77 queries with OR seem to be Boolean operators.
 * NOT: 2 of 17 queries with NOT seem to be Boolean operators.
 * !: None of the 475 queries with ! seem to be well-formed, intentional Boolean queries. Most are not formatted as Booleans—Welcome! is not a Boolean query. The ones that are look like typos, like !mismatched quotes" or !st (looks like a mis-shift key for 1st). Others don't seem terribly useful, like Category:!Dogs returns everything that is not in Category:Dogs. Fewer than 10 queries are plausible Boolean queries.
 * -: 533 queries plausibly use - as negation. I did a quick skim, and the large majority of them look to be using it intentionally. 70 queries start with - and most of them don't look to be using it intentionally (or at least not correctly), since it's very hard (but not entirely impossible) to imagine -172 is a useful query.
 * ~: 110 queries use ~. Almost half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Many look like typos for a hyphen, as in 1900~2000. About 15 look to be intentional use of ~.
 * : A handful of the queries using AND, OR, or NOT as Booleans also use parens—even though they don't do anything in our current search system.
 * +: Ignoring tbnid queries, 2,304 queries use +. Most seem to be using it in place of a space.
 * There are no special keywords with colons other than namespaces in my sample. Carly asked about  so I looked a little harder for those and found two instances of , six  , and one  , none with any other search terms. There was one malformed query:  , but it is also missing the colon.

Light vs Heavy Normalization
While talking to Erik about the Top N queries report he's putting together, he mentioned that he's doing a stronger form of normalization for that report, and replacing all punctuation with spaces before normalizing whitespace.

This is more likely to group queries that get different results than my "light" normalization used above—though even the light normalization may do so because of the way we treat CamelCase search terms. The most obvious case is removing quotes from around a query: John Smith will get many more results than "John Smith". However, as Erik rightly pointed out, in most cases, such variants probably generally represent the same query intent.

It turns out that in my sample, light vs heavy normalization makes only a small difference, about 1.5-2.5% increase in buckets other than the singleton bucket.

The most common change among the heavily normalized top 100 queries is stripping quotes. The most impactful change—in terms of increasing the number of queries grouped together—is stripping hyphens, though it only applied to one query.

Distribution of Hits
Below is a summary of the distribution of results from all of the queries in the sample.

153,549 queries (10%) get zero results. 9,566 queries (0.62%) get more than a million results. The idea number is probably somewhere in between.

Result Count Frequency Distribution

Distribution of Token Counts
The number of tokens (roughly, words) in a query is an easy proxy for the complexity of a query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.

Here, we're using a very simple process of breaking tokens on spaces and punctuation. So, a long Chinese sentence would be counted as one token, and ain't would be counted as two, and ''.. ,, ;; --'' would be counted as zero. It isn't perfect, but it gives us a reasonable approximation of what we have.

Of note, 1,011,221 queries (66.0%) are one or two tokens. 1,417,935 queries (92.6%) are one to four tokens.

All of the "zero-token" queries are strings of punctuation and symbols.

Of the 20 queries with 40 or more tokens, 4 are gibberish, and most of the rest seem to be captions from images (presumably looking for the original image, but who knows?). 5 are in French, 1 in Spanish, and 11 are in English.

Tokens Count Distribution

Miscellaneous Odd or Interesting Queries

 * Erik's Top N report doesn't exclude anything, and covers all wikis. We noticed some politics-themed searches on English Wikipedia like 2020 Texas US Senate Election with no variation in whitespace or capitalization. I'm not sure I found the source of the query, but I did find a political website that helps you determine what politicians you align with by asking you questions about various topics. For each topic, they have a "Learn More" link which links to a search on English Wikipedia. So it is definitely possible that unexpected commons searches (e.g., a relatively complex query with no variation) may be the result of links—which may not be publicly available.
 * I saw a handful of queries that look like attempts at SQL injection.
 * There are some random-looking queries, e.g. 510d47d9-4f0a-a3d9-e040-e00a18064a99, that turn out to be identifiers in collections that images have been taken from.