User:TJones (WMF)/Notes/Dropping Final Question Marks in the Top 10 Wikipedias

June 2016 — See TJones_(WMF)/Notes for other projects. See also Phab ticket T133711.

Background
We are trying to determine the best way to deal with final question marks, which, at least in enwiki, dewiki, eswiki, frwiki, itwiki, and ptwiki, are more likely to be used as question marks than as wildcards, though they are currently treated as wildcards (matching one character) by Search.

In this analysis we are trying to get a sense of the impact of stripping final question marks from queries for each of the top ten Wikipedias by search volume per language (English, German, Spanish, Russian, French, Portuguese, Japanese, Italian, Polish, and Chinese), as determined by the Zero Result Rate by Language and Project dashboard.

Corpus Creation
I pulled 100K fulltext queries for each of the English, German, Spanish, Russian, French, Portuguese, Japanese, Italian, Polish, and Chinese Wikipedias.

In general, I got the 100K random queries from the week of May 19-25, 2016. In some cases, the Hive query failed to finish reducing, so I switched to another week (May 9-15 for Portuguese); when that didn't work, I broke the queries up into smaller pieces (~14K queries per day, or ~7K per 12-hour period for English, and even as far as ~3.5K per 6-hour period for Chinese).

The queries were not limited to poorly-performing queries (as my previous quick analysis had done). However, the other criteria for query selection were used:


 * Query came from the search box on .wikipedia.org
 * Exclude any IP that made more than 30 queries per day
 * Include not more than one query from any given IP for any given day
 * Only the _content index was searched (except for Wikipedias that search multiple indexes by default)
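The IP-based criteria above can be sketched as follows. This is a minimal illustration, not the Hive query actually used: the `(ip, day, query)` row layout and the function name `sample_queries` are hypothetical, and it assumes that an IP over the 30-query limit is excluded only for the day on which it exceeded the limit.

```python
import random
from collections import defaultdict


def sample_queries(log_rows, max_per_ip_per_day=30, seed=0):
    """Apply the per-IP selection criteria to (ip, day, query) rows.

    The row layout is a hypothetical stand-in for the real search logs.
    """
    rng = random.Random(seed)

    # Group all queries by (ip, day).
    by_ip_day = defaultdict(list)
    for ip, day, query in log_rows:
        by_ip_day[(ip, day)].append(query)

    sampled = []
    for key in sorted(by_ip_day):
        queries = by_ip_day[key]
        # Exclude any IP that made more than 30 queries on that day
        # (assumption: the exclusion applies per day, not globally).
        if len(queries) > max_per_ip_per_day:
            continue
        # Include at most one query per IP per day.
        sampled.append(rng.choice(queries))
    return sampled
```

Choosing one query at random per IP and day keeps heavy users from dominating the sample while leaving the final 100K draw to a separate step.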

Question Mark Prevalence
Below are the number of queries that have trailing question marks (?$), and the number of queries with any question marks (?), and the percentage of queries with question marks that have final question marks (?$%):

After deduping and removing queries that consist of nothing but spaces and punctuation [.,:;?!-] the number of unique queries with trailing question marks from each wiki were:

A quick look at the resulting query sets reveals the usual elements: a certain amount of junk, a number of queries in languages other than the language of the wiki, and a lot of obvious questions (some more obvious to me than others depending on the language). In the data from Spanish Wikipedia, there were also a number of queries starting with ¿. (It seems that ¿ is ignored on eswiki, though its presence prevents exact title matches.)

Samples for RelForge Testing
For each wiki/language, I took the full set of ?-final queries and a random 5K sample for each, to run through RelForge with and without final question marks. I excluded queries that consist of nothing but spaces and punctuation from the 5K sample, as those cause problems for RelForge.

In the 5K samples, there were a number of ?-final and ?-containing queries. The breakdown by wiki/language as above is:

The proportion of ?-final queries among all ?-containing queries is roughly the same as we saw in the 100K samples, especially when you take into account the small sample sizes (e.g., there are only 3 ?-containing queries in the Japanese sample).
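The counts described in this section (`?$`, `?`, and `?$%`, plus the deduped count of ?-final queries) could be computed along these lines. This is a sketch only: the function name is hypothetical, and it assumes that a query followed by trailing spaces still counts as ?-final.

```python
import re

# Queries made up of nothing but spaces and the punctuation [.,:;?!-]
JUNK_ONLY = re.compile(r"^[\s.,:;?!-]*$")


def question_mark_stats(queries):
    """Count ?-containing and ?-final queries in a list of query strings."""
    any_q = [q for q in queries if "?" in q]
    # Assumption: trailing whitespace after the final ? is ignored.
    final_q = [q for q in any_q if q.rstrip().endswith("?")]
    # Dedupe, and drop queries that are only spaces and punctuation.
    unique_final = {q.strip() for q in final_q if not JUNK_ONLY.match(q)}
    pct = 100.0 * len(final_q) / len(any_q) if any_q else 0.0
    return {
        "?$": len(final_q),
        "?": len(any_q),
        "?$%": pct,
        "unique ?$": len(unique_final),
    }
```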

Sample Munging and RelForge Hacking
For each wiki/language I created new corpora from the corpora above by stripping all final question marks and spaces. I ran the ?-final corpora and the 5K sample corpora and their ?-final-stripped counterparts through RelForge.
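The stripping step is small enough to show in full. This sketch assumes that only the ASCII question mark is stripped and that a trailing run of question marks and spaces is removed as a unit; the function name is hypothetical.

```python
import re

# A trailing run of question marks and whitespace (ASCII ? only).
TRAILING = re.compile(r"[?\s]+$")


def strip_final_question_marks(query):
    """Remove all final question marks and spaces from a query."""
    return TRAILING.sub("", query)
```

Question marks inside a query are left alone, so `b?t` is unchanged while `why ???` becomes `why`.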

I also added a new metric to RelForge that counts poorly-performing queries (i.e., those with < 3 results), since that category has come up so often in recent discussions. I plan to refactor this change and commit it to RelForge.

Results and Conclusions
Below is a numeric summary for the ?-final corpora and 5K sample corpora for each of the ten Wikipedias.

Key:


 * #?$: the number of ?-final queries in the corpus. For the ?-final corpus, this is all of the ?-final queries (except those composed of only ?, spaces, and punctuation) out of the 100K sample. In the 5K sample, it's just whatever was in the random sample.
 * err: the number of errors when running the corpus in RelForge. (See below.)
 * ZRR before: The Zero Results Rate with the original queries.
 * ZRR after: The Zero Results Rate with the final ? and spaces stripped.
 * ZRR Δ: The difference between ZRR before and after.
 * PP before: The percentage of Poorly Performing queries (fewer than 3 results), with the original queries.
 * PP after: The percentage of Poorly Performing queries with the final ? and spaces stripped.
 * PP Δ: The difference between PP before and after.
 * Top 3 Δ: The percentage of queries with a change in their Top 3 sorted results (i.e., new results moved into the top 3, or the top 3 were re-ordered).
 * Hits Δ: Median change in Total Hits across all queries. (The Mean is sometimes wildly skewed by one or two results; see below.)
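As a rough sketch, the metrics in the key could be computed from paired before/after runs like this. The per-query record layout (`total_hits` and an ordered `top` list of titles) is a hypothetical structure, not RelForge's actual output format, and the sign convention for the Δ columns (after minus before) is an assumption.

```python
from statistics import median


def rate(runs, predicate):
    """Percentage of runs for which predicate(run) is true."""
    return 100.0 * sum(1 for r in runs if predicate(r)) / len(runs)


def summarize(before, after, pp_threshold=3):
    """Compare per-query results before and after stripping final ?s.

    before/after are parallel lists of dicts such as
    {"total_hits": 12, "top": ["Title A", "Title B", "Title C"]}.
    """
    def zero(r):
        return r["total_hits"] == 0

    def poor(r):
        return r["total_hits"] < pp_threshold

    pairs = list(zip(before, after))
    zrr_b, zrr_a = rate(before, zero), rate(after, zero)
    pp_b, pp_a = rate(before, poor), rate(after, poor)
    # Any new result or re-ordering within the top 3 counts as a change.
    top3_changed = sum(1 for b, a in pairs if b["top"][:3] != a["top"][:3])
    return {
        "ZRR before": zrr_b,
        "ZRR after": zrr_a,
        "ZRR delta": zrr_a - zrr_b,
        "PP before": pp_b,
        "PP after": pp_a,
        "PP delta": pp_a - pp_b,
        "Top 3 delta": 100.0 * top3_changed / len(pairs),
        "Hits delta": median(a["total_hits"] - b["total_hits"] for b, a in pairs),
    }
```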

Errors
RelForge uses non-production copies of the various indexes on a non-production server. As a result, it is more prone to timeout errors. I did have to re-run some of the 5K samples to get error-free results.

In the case of the ?-final corpora, all of the errors are in queries that have one to three question marks at the end of the query, separated by a space. For example, how ?, what ??, or why ???

Since ? is a wildcard that matches a single character, these queries are actually asking to match every one-character, two-character, or three-character word, respectively. On English Wikipedia, ? would match every instance of a, every one-digit number, every article about a letter of the Latin alphabet, or the Cyrillic or Greek alphabets, and lots, lots more, so it's no wonder they time out in RelForge.
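To make the single-character semantics concrete, here is a toy illustration using Python's `fnmatch` module, whose shell-style `?` also matches exactly one character. It is only a stand-in for the search engine's wildcard handling, and the word list is made up.

```python
from fnmatch import fnmatchcase

# A tiny made-up vocabulary standing in for the terms in an index.
words = ["a", "I", "7", "an", "of", "the", "dog", "poison"]


def matching(pattern):
    """Return the words matched by a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


one_char = matching("?")      # every one-character word
two_char = matching("??")     # every two-character word
three_char = matching("???")  # every three-character word
```

Even in this eight-word vocabulary, a free-standing `?` matches three of the words; against a full index the match set is enormous.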

Statistics for the ?-final corpora do not include the queries with errors. Fortunately there are only one or two per corpus, and only nine total. I was able to check them all by hand, and they all returned something (between 5 and 33,282 results).

Medians and Skewed Means
Sometimes queries without final question marks will give a wildly different number of results, which can make the mean and standard deviations on changes to total hits meaningless.

For example, in itwiki, n!? gets 0 hits, but n! (which apparently ignores the !), gets 456,341 results. On frwiki, f? gets 354,103 results (every 2-letter string beginning with f matches), but f by itself only gets 153,478 results.
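A small worked example shows why the median is the more useful summary here. The first two before/after pairs are the itwiki and frwiki examples quoted above; the remaining pairs are made-up values chosen to look like more typical queries.

```python
from statistics import mean, median

# (hits before, hits after) for the two examples quoted above.
reported = [(0, 456341), (354103, 153478)]
# Made-up, more typical pairs, for illustration only.
hypothetical = [(2, 5), (10, 10), (0, 12), (7, 8), (3, 3), (40, 45)]

deltas = [after - before for before, after in reported + hypothetical]
mean_delta = mean(deltas)      # dominated by the two extreme queries
median_delta = median(deltas)  # reflects the typical query
```

With these values the mean change is in the tens of thousands of hits, while the median change is 2.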

Impressions & Observations

 * Across these ten Wikipedias, for ?-final queries, dropping the final ? improves the zero results rate (ZRR) between 15.4% and 37.9% (mean: 26.0%), and the poorly performing (PP) query rate 10.4% to 47.4% (mean: 26.4%). Note that many ZRR improvements are also PP improvements, and that some queries do worse without the final question mark, both on the ZRR and PP metrics. However, overall it's a big improvement, among ?-final queries.


 * Across these ten Wikipedias, for a random sample of 5,000 queries, dropping the final ? has little overall impact. ZRR and PP change is less than 0.3% in all cases, and the mean is 0.1%, with the biggest impact on ptwiki, which has a very high incidence of ?-final queries (see tables above). So, overall, it doesn't have a proportionally large impact on these Wikipedias.


 * Among the ?-final enwiki queries, most are obviously questions. I skimmed the list, and there are also a few queries that are in other languages (the ones I could easily read were questions, e.g., qu'est-ce que la jalousie?). There are some junk queries (e.g., ffttttoouuwji99?), as always. There are also a few titles that have question marks in them.
 * There is a pattern of a noun or noun phrase followed by a question mark, like leader?—that could be a wildcard, though in English there are no other non-names that match that pattern other than the plural, leaders.
 * Others following the noun? or noun phrase? pattern seem to make sense as questions, though they don't have any question words, such as nobel prize in economics? I don't see that as "economics + one more letter". I've taken to interpreting noun? as "tell me about noun."
 * Sometimes the pattern of noun? and typos collide to make for worse results. For example, Arnold Schwarzenegge? In this case there doesn't seem to be anything this could be about other than Arnold Schwarzenegger, and "Did You Mean" does the right thing if you remove the final question mark. Not all cases are so clear, though. (In fact, they are so unclear that I used this made-up example instead of the real one because I haven't figured out what the real one refers to.)
 * As an example of titles with question marks, there is a Japanese light novel series called And you thought there is never a girl online? Unfortunately, searching for the near-match title And you thought there was never a girl online? gets 0 results! (Though dropping the ? gets good results!)
 * A quick skim of dewiki, eswiki, frwiki, itwiki, and ptwiki queries shows the same patterns.


 * Not all of the results returned by stripping the final ?s are necessarily good. Longer queries with many high-frequency function words (as opposed to content words) generally don't perform well. As examples, on enwiki try what is a dog, what are tomatoes, or what is linguistics (without question marks!). The question word (what) and other function words (is, a, are) adversely affect the results.
 * On the other hand, the results without question marks are significantly less ridiculous. The query how can poison enter the body? has only one result, Matriarchy, because that article contains a reference with the word bodye in the title and just happens to contain all the other words as well. Dropping the question mark at least gives articles more relevant to poison. The query are viruses living? gets results primarily based on the name Livings appearing in the article; dropping the question mark gives an excellent result with a partial answer right in the snippet!

Conclusions & Recommendations

 * While ?-final queries aren't a huge proportion of all queries, there are still a lot of them in terms of raw numbers. Most seem to be questions (in the six of the ten top Wikipedias by search volume that I could easily read: English, German, Spanish, French, Portuguese, and Italian), and often give no results, or unexpected results. Since we can't expect most visitors, especially new visitors, to be familiar with bash-style wildcards, we should do something about it.


 * A longer-term solution is an "advanced mode" in which ambiguous search syntax (like ?) is treated as search syntax, and a default mode in which it is not (in this case, ? would be stripped from the query everywhere, or at least at the end of any word; queries with multiple questions are uncommon, but do occur). This may apply to other specific bits of search syntax, too.


 * Short term, we could strip final question marks to improve results in most cases for most users.
 * It would make sense to ignore queries with other special syntax in them, like insource, prefix, intitle, or incategory.
 * As with "Did you mean" queries, we could also offer a one-click way to search with the original unmodified query. Options include an API parameter to disable final-? stripping, or a special syntax, like an escaped ? (e.g., \?). Both are a bit hacky, but leave current functionality in place.
 * It may also make sense to make the feature configurable by wiki, as there may well be wikis in which it does more harm than good. What the default setting of such an option should be is unclear.
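The short-term proposal could look roughly like the sketch below. Nothing here is existing Search code: the function name, the `strip_enabled` flag (standing in for a per-wiki setting or an API parameter), and the choice to leave a query untouched when stripping would empty it are all assumptions made for illustration.

```python
import re

# Queries using these keywords are left alone, as suggested above.
SPECIAL_SYNTAX = re.compile(r"\b(insource|prefix|intitle|incategory):", re.IGNORECASE)
TRAILING = re.compile(r"[?\s]+$")


def rewrite_query(query, strip_enabled=True):
    """Strip final question marks unless the user opted out in some way."""
    # Hypothetical per-wiki setting or API parameter to disable stripping.
    if not strip_enabled:
        return query
    # Don't touch queries that already use special search syntax.
    if SPECIAL_SYNTAX.search(query):
        return query
    # An escaped final ? (\?) is kept as the user's explicit choice.
    if query.rstrip().endswith("\\?"):
        return query
    stripped = TRAILING.sub("", query)
    # Assumption: if nothing is left, fall back to the original query.
    return stripped if stripped else query
```

Returning the original query in the opt-out cases keeps current wildcard behavior available, which matches the goal of leaving existing functionality in place.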

Queries With Multiple Question Marks
Kevin asked: ''Of queries that end with a question mark, how many of those had other question marks as well? To me, that would distinguish a likely question from a likely other use of question marks. Would it be possible to compute those percentages?''

My first thought, having noticed some multi-question queries, was that additional question marks followed by a space would likely still be questions. Since there were only 39 queries with extra question marks, I just went ahead and looked at all of them.


 * ?$: queries with trailing question marks (out of a 100K sample), after deduping and removing queries that consist of nothing but spaces and punctuation [.,:;?!-]
 * ?…?: queries with at least one other question mark besides the trailing question mark
 * ?…?%: percentage of queries with trailing question marks that have another question mark in the query. That is, ?…?/?$.
 * ?_…?: queries with at least one other question mark followed by a space, besides the trailing question mark.
 * ?_…?%: percentage of queries with multiple question marks where the additional question mark is followed by a space. That is, ?_…?/?…?.
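The columns defined above could be computed as in the following sketch. It assumes the input is the deduped list of ?-final queries and treats a trailing run of question marks (as in why ???) as a single trailing unit rather than as extra question marks; that reading is an assumption, and the function name is hypothetical.

```python
import re

TRAILING = re.compile(r"[?\s]+$")
QMARK_THEN_SPACE = re.compile(r"\?\s")


def multi_question_stats(final_queries):
    """Count ?-final queries that contain additional question marks."""
    extra = []        # another ? besides the trailing one(s)
    extra_space = []  # ... where that ? is followed by a space
    for q in final_queries:
        body = TRAILING.sub("", q)  # drop the trailing ?s and spaces
        if "?" in body:
            extra.append(q)
            if QMARK_THEN_SPACE.search(body):
                extra_space.append(q)
    n = len(final_queries)
    return {
        "?$": n,
        "?…?": len(extra),
        "?…?%": 100.0 * len(extra) / n if n else 0.0,
        "?_…?": len(extra_space),
        "?_…?%": 100.0 * len(extra_space) / len(extra) if extra else 0.0,
    }
```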

Of the 39 queries with a second question mark, 19 (about half) have a space after the question mark. Of those, 18 were being used to ask an additional question, and the last one, in zhwiki, was this: ( ?° ?? ?°)? I'm not sure what to make of that, but it could be a wildcard pattern.

As for the other 20:


 * de (1)
   * 1: second question without a space after it (i.e., something like warum?wie?).
 * en (3)
   * 1: second question without a space after it.
   * 1: possible wildcard pattern: ?ur?i? (though we don't seem to support initial wildcards; for what it's worth, the pattern matches turnip).
   * 1: possible junk query (maybe a wildcard?): ?铲8?
 * es (2)
   * 2: initial ?, looks to be ? used for ¿
 * ja (2)
   * 1: second question without a space after it. (Japanese doesn't use spaces after ?)
   * 1: extreme wildcarding or junk query: ????Ｄ???
 * pl (1)
   * 1: I'm going with junk: vvbbbbbgfffffg ,?,??????
 * pt (7)
   * 6: second questions without a space after them.
   * 1: junk query: tggg?.ttg'g.g'tg'j'ggg,.g??j!g?
 * ru (1)
   * 1: second question without a space after it.
 * zh (3)
   * 2: hard to tell. Format is ???? or ??? where  is a Chinese character.
   * 1: looks like a question, format is john1009?john?????? (but with a less common name than "john").

In summary, out of the 1M queries sampled across 10 Wikipedias, there are 6 queries (2 on enwiki, 1 on jawiki, and 3 on zhwiki) that end in a question mark, have a second question mark in them, and that might be patterns. All 6 will fail anyway because the first element of a query term is a question mark wildcard. At least this is not a huge problem.

Note that I did not carefully review all 2416 ?-final queries; some of those may be wildcards. Most obviously are not, but some can be very hard to decide on.

Awesome and interesting question, Kevin. Thanks!

Queries with Non-Final Question Marks
Since we were originally only looking at queries with final question marks, a small number of queries with non-final question marks were ignored in the analysis above. I searched these same ten 100K corpora for queries with non-final question marks. The numbers are small, so I also reviewed all of the queries manually. Surprisingly many were in English, regardless of source, and there were a lot of URLs, typos, and junk queries. Also plenty of questions! There were a few potential wildcards, though sometimes it's hard to tell with very short queries (e.g., b?t: is that a wildcard or a junk query?). I've erred on the side of wildcards in counting them. One interesting pattern I noticed is characters with diacritics being replaced with a question mark. It's hard to tell if those are wildcards, or just a conversion problem from one encoding to another. I've generally treated them as wildcards. So that's 26 out of a million (0.0026%) potential wildcard patterns, which doesn't seem to be too many.