September 2015 — See TJones_(WMF)/Notes for other projects.
Beginning to Quantify Why People Use Search Engines Instead of Wikimedia Search
The initial hypothesis is that there are two big reasons why people would use internet search engines instead of Wikimedia search:
- external search engines give better results than Wikimedia search
- it's just habit, it's more convenient, users don't know about our search capabilities, etc.
We can try to begin to quantify this by looking at queries that come from search engines and lead to Wikimedia pages (at least those with referrers), and testing whether those same queries (possibly minus wiki, wikipedia, and similar search terms) return the destination page in our own search results (say, in the top 5 or top 10).
If they don't, then it gives credence to the idea that people are using external search engines because they give better results.
If they do, then it's habit or convenience of using an external engine or ignorance of our search capabilities.
In the former case, we have examples of what we need to work on in our own search engine. In the latter case, we need to work on our advertising!
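The counting scheme described above might look something like the sketch below (the function and variable names are mine, not from any real pipeline): a query counts as a top-5/top-10 hit if the page Google referred the user to, or its post-redirect "canonical" title, shows up in our search results.

```python
# Sketch of the hit-counting logic; names are hypothetical.
def classify_hit(target, canonical, results):
    """Return (top5_hit, top10_hit) for one query's ranked result titles."""
    wanted = {target, canonical} if canonical else {target}
    top5 = any(r in wanted for r in results[:5])
    top10 = any(r in wanted for r in results[:10])
    return top5, top10

# Example: the page linked from Google redirects to a canonical page,
# and the canonical page is what our search returns.
print(classify_hit(
    "Sun_(astrology)", "Planets_in_astrology",
    ["Astrology", "Planets_in_astrology", "Zodiac"]))   # (True, True)
```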
For simplicity in this first investigation, we're only looking at referrals from Google to English Wikipedia (en.wikipedia.org or en.m.wikipedia.org). However, we didn't filter by which Google domain the referral came from, and it turns out that results vary by which version of Google you use. हेमा मालिनी (Hema Malini) is an Indian actress. The query (constructed by me for illustration) हेमा मालिनी wikipedia on google.com gives results for the Hindi Wikipedia. However, the same query on www.google.co.in gives results from both the Hindi and English Wikipedias.
We don't know which position the Wikipedia result was in among the Google search results; it could have been first, fifth, or fortieth. And those results may vary from search to search, based on personalization of results, so it isn't necessarily recoverable, either.
We don't know whether Google made a suggestion and showed those results instead. For example, barca vikipedia shows results for barca wikipedia instead (for me, today). (And I missed this in my list of "wiki" terms below, so it's a miss for us.)
We don't know whether the target page is actually what the user was looking for, only that they clicked on it. I didn't try to reconstruct search sessions (e.g., multiple clicks in quick succession from the Google results, with the same query terms and the same user). We also have no way of knowing whether the user just found something else interesting to look at on Wikipedia.
We don't know when Google is pulling out additional information from a Wiki page and displaying additional results. Searching on roy has the top result as Roy_(film) with several actors mentioned in the article provided as additional links. One of those was apparently the link the user clicked on.
We are not taking into account good matches that are not the same as the target page. For example, when searching for rem, a disambiguation page is the first result (and an exact title match), and R.E.M. is the first item in the list. Is that an effective way of getting the user what they want? Probably. Does it count here? Alas, no.
Comparing some queries to the target pages, they don't look like a great match, but who knows what people really want.
From a 5-day sample of logs from Sept 11-15, 2015, we were able to pull out 93,411 referrals from Google with recoverable query strings. 41,794 landed on en.wikipedia.org or en.m.wikipedia.org.
These were randomly sampled down 10-to-1, to 4,257 samples for testing. Of those, 89 were instances of fully specified URLs for Wikipedia pages (e.g., https://en.wikipedia.org/wiki/Coldplay ), and were removed from the sample, leaving 4,168 samples.
121 queries had "wiki" search terms in them, which I have interpreted generally as instructions for Google to get results from Wikipedia. The terms include:
- in wiki
- wikipedia the free encyclopedia
On later review of failed queries, I discovered that I missed some potential "wiki"-like terms, including vikipedia and encyclopedia which, when removed, give good results.
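A minimal "de-wikify" pass over queries might look like this; the term list is assembled from the terms above plus the vikipedia/encyclopedia variants found on later review, and is not the exact list used in the analysis.

```python
import re

# "Wiki"-like terms interpreted as instructions for Google to pull
# results from Wikipedia. Longest alternatives come first so the regex
# prefers "wikipedia the free encyclopedia" over bare "wiki".
WIKI_TERMS = re.compile(
    r'\b(wikipedia the free encyclopedia|in wiki|wikipedia|vikipedia|'
    r'wiki|encyclopedia)\b', re.IGNORECASE)

def dewikify(query):
    """Strip 'wiki'-like instructions-to-Google from a query."""
    return ' '.join(WIKI_TERMS.sub(' ', query).split())

print(dewikify("moses malone wiki"))          # moses malone
print(dewikify("barca vikipedia"))            # barca
print(dewikify("alesha dixon wikipedia"))     # alesha dixon
```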
There were also 8 instances of users searching for Wikipedia to get to the Main_Page. (Note that the main portal at wikipedia.org isn't on en.wikipedia.org, so links there weren't included.)
For each of the wiki pages that users linked to from Google, I looked up the "canonical" version of the page (i.e., the result of any redirects) and noted that. For example, https://en.wikipedia.org/wiki/Sun_(astrology) redirects to https://en.wikipedia.org/wiki/Planets_in_astrology. For the purposes of counting matches, results from either the page linked to by Google or the "canonical" version of the page the user would be redirected to were considered hits. There were only 15 such pages, and only 13 unique ones.
For each query, I ran the query against enwiki, noted whether the paged referred to by Google (the "target" page) was present in the top 5 or top 10 results, re-ran any suggested queries, and noted whether the target page was present in the top 5 or top 10 results from the suggestion.
For queries with a "wiki" term (as listed above), I dropped the wiki term(s) and re-ran the query and any suggestions, again noting whether the target page was present in the top 5 or top 10 results.
enwiki search takes "wiki" search terms seriously, and they modify the results. In many cases, the queries with "wiki" terms in them were exact matches to Wikipedia article titles, plus the wiki term, such as moses malone wiki. In this case, moses malone is an exact title match and gives good results, while moses malone wiki does not. Other queries are unaffected, such as alesha dixon wikipedia vs. alesha dixon.
Of these 121 queries with "wiki" terms, 17 got results with wiki term, but 76 got results without wiki term (and the 17 were all included in the 76).
For the rest of this analysis, I used the de-wikified queries.
Out of 4,168 queries:
- 2,445 queries (58.7%) had a top-5 hit for the target page (2,437) or its canonical version (8).
- 2,609 queries (62.6%) had a top-10 hit for the target page (2,600) or its canonical version (9).
- 846 queries resulted in suggestions.
- 169 suggestions had a top-5 hit.
- 180 suggestions had a top-10 hit.
- In 51 cases, both the original and suggestion matched.
- Thus suggestions added helpful information for 118 queries (top-5) or 129 queries (top-10), i.e., 2.8% and 3.1% of all queries, respectively.
Of the 4,168 queries I looked at, 1,124 were "exact" matches to the target page, and 4 were exact matches to the canonical version of the target page (27.1% overall). "Exact" matches ignore case and extra whitespace, parens, commas, and periods.
Most of the "exact" matches (27.0% of all Google referrals) got top-5 results from either the target page (1,119) or the canonical version (6). For an example of the canonical match, sophie simmons gives Gene_Simmons as the first result. (KISS band-member Gene Simmons is Sophie Simmons' father, and Sophie_Simmons redirects to Gene_Simmons.)
There was one "exact" match that was top-10 rather than top-5 (rem to R.E.M. at #8, though the disambiguation page was #1, as above).
The two "exact" matches that failed were xxx and xxx., both of which had .xxx as their target.
The Sophie/Gene Simmons example brings up a question about how we display redirects in results. In this case (see results), it seems that since "Simmons" partially matches "Gene Simmons", there's no redirect information in the search results indicating that Sophie_Simmons redirects to Gene_Simmons (and Sophie doesn't appear in the matched snippet). Compare that to the result "Shannon Tweed (redirect from Shannon Tweed-Simmons)". If this is in fact based on matches in the redirected title and a lack of matches in the canonical title, it might be good to re-evaluate the criteria used; instead of binary match/no-match, we might compare "match" and "better match". Clearly "Sophie_Simmons" is a better match for sophie simmons than "Gene_Simmons", though the exact method for evaluating that in the more general case isn't 100% clear.
xxx-related searches were the most ambiguous, with targets including the various parts of the xXx film series, the .xxx domain, and Super Bowl XXXI.
So what about the 35% or so that didn't match, and didn't get a good suggestion?
It's impossible to generalize about all the missed queries, and I did not have time for an exhaustive survey: looking at each query, trying to divine the user's intent, and checking the query's results on both Google and Wikipedia.
Instead, I skimmed the queries, looking for patterns and interesting mismatches between queries and target pages. Below are some of my observations.
Sometimes Wikipedia doesn't know
One example is ampiclox, a combination drug composed of ampicillin and cloxacillin. The only mentions of ampiclox on Wikipedia are two dead links to an article on how to identify fake ampiclox. Google has access to the whole web, and can more readily make the connection between ampiclox, ampicillin, and cloxacillin—so a Google search for ampiclox wiki returns the wiki page for Cloxacillin, which also mentions ampicillin in the Antibacterials template at the bottom of the page.
Also, Google has a better handle on popular topics, and ranks "popular" results higher.
A small number of targets are not actually possible results from Wikipedia search without invoking the Advanced search options. I didn't check whether these were returned when searching the appropriate namespace.
- 14 Category: pages
- 6 User: pages
- 3 Talk: pages
- 3 File: pages
- 1 Portal: page
- 1 Wikipedia: page
- 1 Template: page
We could consider searching for really, really good matches in some namespaces—especially the Category and File namespaces.
Old friends: typos and ?
I'm seeing a lot of typos that occur in the first two letters of the query, which we can't handle in fulltext search now because of the prefix length problem (the prefix length limit is 2, so the first two letters have to be correct—a reverse index would help with this). I didn't quantify these, I just noticed them as I was skimming.
I'm also seeing a fair number of question marks (?) in questions. The current Help:Searching documentation incorrectly suggests typing in questions, even though ? actually works as a wildcard. These are easier to quantify, since I could search for ? and quickly check them all. Only 28 queries had a ? in them, and all are questions. (Two targets had ? in them: List_of_Who_Wants_to_Be_a_Millionaire?_top_prize_winners and Where's_Wally?)
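One possible shape for the question detection floated in the suggestions at the end of these notes: if a query looks like a question, a trailing ? probably isn't meant as a wildcard. The heuristics here are my own guesses, not a tested classifier, and article titles that are themselves questions show why this would need tuning.

```python
# Crude question detector; trigger words are hypothetical.
QUESTION_STARTS = ('who', 'whos', 'what', 'whats', 'when', 'where',
                   'wheres', 'why', 'how', 'is', 'are', 'was', 'were',
                   'do', 'does', 'did', 'can')

def looks_like_question(query):
    q = query.strip().lower()
    if not q:
        return False
    first = q.split()[0].replace("'", "")   # "what's" -> "whats"
    return q.endswith('?') or first in QUESTION_STARTS

print(looks_like_question("who wrote the story of my life"))   # True
print(looks_like_question("moses malone"))                     # False
```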
Abbreviation & disambiguations
There are a number of short queries that are abbreviations for their targets:
- rem / R.E.M. (discussed above)
- bap / B.A.P_(South_Korean_band)
- mgh / Monumenta_Germaniae_Historica
- AFIS / Automated_Fingerprint_Identification_System
- b2 / Northrop_Grumman_B-2_Spirit
- csi / CSI:_Crime_Scene_Investigation
- go / Go_(programming_language)
In each of these cases, the query matches the title of a disambiguation page, with the target page (and others, sometimes many others) listed on it. I would consider these successful searches (though I haven't accounted for them in the numbers here), since the target is presented in an organized fashion, along with other very reasonable alternatives.
We could conceivably get some mileage out of processing initialisms better. In this case, R.E.M. and B.A.P. go directly to the right page. M.G.H. doesn't seem to be a relevant abbreviation, though.
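Processing initialisms might just mean indexing dotted tokens under a squashed key, so that "rem" can find R.E.M. and "bap" can find B.A.P. The trigger (single letters separated by periods) is my own guess at a reasonable rule.

```python
import re

# Matches tokens like "R.E.M." or "B.A.P" (trailing period optional).
DOTTED = re.compile(r'^[A-Za-z](?:\.[A-Za-z])+\.?$')

def initialism_key(token):
    """Return a squashed lowercase index key for a dotted initialism,
    or None if the token doesn't look like one."""
    if DOTTED.match(token):
        return token.replace('.', '').lower()
    return None

print(initialism_key("R.E.M."))   # rem
print(initialism_key("B.A.P"))    # bap
print(initialism_key("rem"))      # None
```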
Sometimes the default page is not a disambiguation page:
- d / D_(programming_language)
In this case, the disambiguation page is one extra click away. Given the small amount of info to work with, this is a decent result.
A similar situation applies to many other one-word queries that are ambiguous:
- compton / Compton,_California
- continuum / Continuum_(TV_series)
- daddy / Daddy_(1989_film)
- koti / Saluri_Koteswara_Rao
In each case, the query lands on a disambiguation page that includes the target. These are reasonable results. This also applies to some other short queries, like the town, the word, and the visit.
Sometimes, one extra step is required (as with d above):
- indra / Indra_(2002_film)
These human-curated disambiguation pages are really a valuable asset. Disambiguation in the absence of context (i.e., a one-word query) is impossible. Continuum, for example, has more than 50 links off the disambiguation page. They can't all be top-5 results for the one-word query.
We don't handle "exploded" abbreviations well, but we could certainly do better in this specific scenario:
- A S L O / ASLO
- I C T / Information_and_communications_technology
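Handling this specific scenario could be as simple as the sketch below: if every term in a query is a single capitalized letter, run them together and try the collapsed form as well.

```python
# Collapse "exploded" abbreviations like "A S L O" into "ASLO".
def collapse_exploded(query):
    terms = query.split()
    if len(terms) > 1 and all(len(t) == 1 and t.isupper() for t in terms):
        return ''.join(terms)
    return query

print(collapse_exploded("A S L O"))   # ASLO
print(collapse_exploded("I C T"))     # ICT
print(collapse_exploded("go"))        # go (unchanged)
```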
Sometimes capitalization matters, and if a query isn't all lowercase or ALL CAPS, maybe the user is trying to tell us something:
- Abbreviations in IT / List_of_computing_and_IT_abbreviations
IT isn't a pronoun/low-value-term/stopword here, and it's obvious from the capitalization. Indexing to deal with this could be hard, but might be worth pursuing. As a comparison, Abbreviations in computing fares better (but not great), since computing is not discounted as a terrible search term the way IT/it is.
I noticed another example when looking at MGH: MgH is something specific (Magnesium monohydride—Mg + H) that is distinct from all the other MGH's out there.
Much to my surprise, adding quotes around a single word changes the results! "dating" can successfully make a title match if you search in the upper right search box, but if you search in the full text search box on the search page, Dating isn't in the first 20 results!
This happens with almost any one-word query: sometimes the obvious targets don't show up at all, and sometimes they are ranked lower than expected.
We also don't seem to do spelling correction inside quotes. The example "give us back our elevn days" illustrates this. Without the quotes, we can correct elevn and get the right result, and of course with the quotes and eleven spelled correctly, we get the right result.
Questions and expression templates
In addition to the problems caused by ? (see above), there are lots of questions and other templated language where the "framing" text leads our search engine astray.
Some examples, with the potentially templated "framing" language at the start (or end) of each query:
- define analytical balance
- about chinatown london
- about rajpal yadav.
- explain compounding
- History of christ embacy
- how did the mission system affect society
- how is gluuronic acid produced
- how many cylons in battlestar galactica
- how many double o agents are there
- how tall is kellita smith
- Types of crops
- what a cell wall structure
- What are herb
- what happened to muammar gaddafi bodyguards
- what is a plaster
- Whats visa debit card
- where is cromer manitoba
- who is the voice for the geico gecko
- who wrote the story of my life
- why did edgar davids wear glasses
- Wwe brock lesoner real weight and height
- height of uluru
- david s k lee net worth
Of course, there are counter examples for all of these, where the "framing" text is clearly relevant:
- About a Boy
- Explain Plan
- Define the Great Line
- History of the United States
- Who Is She 2 U
- How Did It Ever Come to This?
- How Many Drinks?
- Types of motorcycles
- What a Cartoon!
- What Are Little Girls Made Of?
- What Happened to Us
- Who Wants to Be a Millionaire?
- Why Did I Get Married?
- Height of Buildings Act of 1910
- List of Swiss billionaires by net worth
- Weight and height percentile
So, this would require careful experimentation and tuning.
In many cases, it seems that queries have a loose topic-comment structure, where the main subject is presented, followed by some elaboration:
- donnie dumphy without make up
- baku azerbaijan
- elapidae family
- ellie goulding husband
In each of these cases, which get no results or lots of incidental results (where the target is mentioned but isn't central), the results would be greatly improved by including the longest prefix that matches an article title: donnie dumphy, baku, elapidae, and ellie goulding.
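The longest-prefix idea might be sketched as below; TITLES is a stand-in for a real article-title lookup, and the boosting step is left out.

```python
# Find the longest prefix of the query that exactly matches an article
# title; that article is a candidate for boosting to the top.
TITLES = {"donnie dumphy", "baku", "elapidae", "ellie goulding"}

def longest_title_prefix(query, titles=TITLES):
    words = query.lower().split()
    for n in range(len(words), 0, -1):     # try the longest prefix first
        candidate = ' '.join(words[:n])
        if candidate in titles:
            return candidate
    return None

print(longest_title_prefix("ellie goulding husband"))   # ellie goulding
print(longest_title_prefix("baku azerbaijan"))          # baku
```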
Thesauri and reformatting queries
The query hero movie 2015 does not return Hero_(2015_film) in the top 20 results! Of course, Wikipedia articles seem to always have "film" in the title. hero film 2015 gets the right result in fifth place. However, the canonical format, hero 2015 film, gets the desired result in first place—and gets some very nice results in second and third: Hero_(2015_Hindi_film) and Hero_(2015_Japanese_film).
There's an argument to be made here that if we can't get good results from a given class of query, we could modify the query to get better results. One could argue that if "movie" or "film" and a likely date appear at the edge of a query (i.e., not in the middle, like Cannes Film Festival Award 2015 for Best Actress), we could change movie to film and move both terms to the end of the query as film <date>.
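A regex sketch of that rewriting, only firing when movie/film plus a plausible year sit at the end of the query, so queries like the Cannes example are left alone:

```python
import re

# "<title> movie|film <year>" at the end of the query becomes the
# canonical "<title> <year> film" title shape.
PATTERN = re.compile(r'^(.+?)\s+(movie|film)\s+((?:19|20)\d\d)$',
                     re.IGNORECASE)

def canonicalize_film_query(query):
    m = PATTERN.match(query.strip())
    if m:
        title, _, year = m.groups()
        return f"{title} {year} film"
    return query

print(canonicalize_film_query("hero movie 2015"))   # hero 2015 film
print(canonicalize_film_query(
    "Cannes Film Festival Award 2015 for Best Actress"))  # unchanged
```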
In general, a thesaurus that maps from common query terms to common article title terms might be helpful. movie list to filmography, Chevy to Chevrolet, etc.
Similarly, a mapping between query terms to categories and templates might help float better results to the top. mad max remake cast doesn't bring up Mad_Max:_Fury_Road. But mapping "cast" and "remake" to film-related categories (and others, like TV shows, plays, etc), might help, especially since Fury Road is not technically a remake / reboot.
Surprisingly, india cricket world cup didn't give India_at_the_Cricket_World_Cup in the top 20. That's just weird. Not even any quotes to cause trouble.
The Wikipedia Rabbit Hole
Sometimes queries and their supposed targets don't line up so well. I think in some cases it's the Wikipedia Rabbit Hole problem.... "I need to know about X, but Y just showed up and darn that's interesting!"
In order of increasing randomness:
- lord lymington / Quentin_Wallop,_10th_Earl_of_Portsmouth
- BIID / Quid_Pro_Quo_(film)
- Types of hexagons / Hexagonal_number
- what is hummus / Hillary_Clinton
You can't necessarily re-create the exact flavor of any given serendipity.
Foreign languages and non-Latin diacritics
We had a small number of foreign language queries: 4 in or partly in Arabic, 1 in Hindi, and 1 in Greek.
One thing I noticed that we aren't good at is matching accented versions of non-Latin character sets.
The name of Vassiliki Thanou, the Prime Minister of Greece, can be written as either Βασιλική Θάνου or Βασιλικη Θανου. Of course, enwiki has the more formal form, with the diacritics. In the Greek wiki (elwiki), the one with diacritics takes you to her page, the one without gives her page as the first hit. In enwiki, the one with diacritics gives her page as the first hit, the one without fails. In enwiki, we could probably index words in Greek without diacritics as well.
A quick check shows that we do reasonable things with Latin diacritics (apologies to any Francophones):
- "Tete a Tete": 239 results
- "Tête à Tête": 152 results
- "téte a téte": 3 results
- "tète a tète": 3 results
(I also checked on Czech, and we do fine there in enwiki.)
Basically, if you search for the unaccented version, you match anything, accented or not. If you use an accented version, you match only that. That's quite reasonable, and is probably achieved by indexing both the original and the de-accented versions. We should do that for other alphabets / character sets.
Including Cyrillic: searching for Серге́й Кошелев (with diacritic) gives Sergei Koshelev's page. Searching for Сергей Кошелев (without diacritic) gives a link to a different Sergei Koshelev who translated part of Lord of the Rings. On ruwiki, Кошель,_Сергей_Викторович has no diacritics in the title, but they are there in the first few words of the article: Серге́й Ви́кторович Ко́шель.
Other alphabets / character sets may make sense, too.
And this could be expanded beyond enwiki. Searching for Βασιλική Θάνου on ruwiki works, but Βασιλικη Θανου does not.
Suggestions / Future work
Of course, as all of these notions are inspired by Google queries, any that we think of working on should be validated against our own queries as likely to be useful—or as part of a strategy to support more query types. (Most of these things do look familiar from my time spent with other Wikipedia queries.)
At some point there aren't going to be any really big wins to be had anymore from small changes. At that point, we either have to go for really dramatic changes, or make small changes and chip away at the long tail. Both take a long time to show results, and pursuing both is a good idea, if we have the resources to do so.
Some of these are amenable to machine learning approaches, and some may best be handled by hand-crafted solutions. Much would be language specific—in which case we need to decide whether they are worth it, based on the impact they could have.
Below are suggestions to consider that are relevant at different places along that long tail, roughly grouped and in no particular order.
Search syntax and documentation
- Figure out a good way to deal with the ? problem. We could skim our on-wiki queries and see how often question marks are used in an obviously non-questioning way. Maybe we should deprecate the ? as wildcard (or do something else, though whatever it is would be less obvious to experienced users), or do some sort of question detection and not treat ? as a wildcard when it looks like a question.
- Proofread Help:Search documentation to be sure it is accurate about wildcards and other aspects of search.
- Keep working on the reverse index! Prefix length limit makes all kinds of sense, but with a reverse index we could at least have a chance to match words that don't have both an error in their first two letters and an error in their last two letters. (We could also think about ways to make searching more robust to typos—maybe some sort of sounds-like search option? See phabricator:T5140.)
- Handle initialisms better: index R.E.M. as rem and B.A.P as bap, for example.
- Pay attention to capitalization. If a user goes to the trouble to capitalize some letters and not others, maybe it means something. IT isn't necessarily it, and MgH isn't MGH.
- Index Greek, Cyrillic, and possibly other alphabets or character sets in their original form, and stripped of diacritics, as we do for Latin characters with diacritics.
Sorting and Displaying results
- Re-evaluate how redirects are displayed in results. See the example of Sophie Simmons above.
- Pull out information for/from particular templates, infoboxes, or categories to present additional, related results embedded in the main results.
- For disambiguation pages that are exact matches, give the "best" n entries (see Erik's recent PageRank experiments).
- For movies, list stars, directors, etc.
- For authors, list notable works, etc.
- For books, list authors, etc.
- etc., etc., etc.—and any of these could pull out relevant links from infoboxes, or "best" links using some metric.
- Look at other measures of relevancy. Erik has a great experiment with PageRank-like link scoring. We could also measure recent article activity (visits and edits) and query activity (changes in frequency of terms and phrases) to identify "hot" topics. Whether we can do this fast enough to keep up with trends is another question—i.e., is the necessary hardware worth the search quality boost? We need to figure out both the extent of the hardware and the size of the boost!
- Work out a way to merge result streams from different related queries. If we mangle a query to improve it, or run several versions of it, how do we merge the results from these streams into a single result set?
Query mangling and other alternate searches and results
- "Exploded abbreviation" recognition: if all the terms in a query are capitalized single letters, maybe run them all together and try again? See A S L O and I C T above.
- Look into query mangling for certain categories of popular query types. Using thesauri (see below), term re-ordering, etc.
- Thesauri—general thesauri exist (e.g., linking lawyer and attorney), and statistical "related terms" thesauri can be computed based on statistically improbable co-occurrence (on the assumption of random distribution). These would allow us to look for related terms when we can't find any/many results, suggest additional disambiguating terms for ambiguous queries, disambiguate terms based on co-occurrence in queries, etc.
- Handle questions and other query templates to remove irrelevant text from the query, when appropriate.
- Look into longest-prefix-of-query vs. article-title matching (or maybe longest-substring within a query)—possibly requiring at least two words. If this gives a good article title match, pop it to the top!
- Handle quotes better:
- Either figure out why quoted single words don't do well, or strip the quotes—at least for potential title matches.
- Figure out why we don't do spelling correction inside quotes (esp. longer phrases) and fix it.
- Investigate searching for really, really good matches in some namespaces—especially the Category and File namespaces—even if not requested by the user.
- Investigate why india cricket world cup didn't match India_at_the_Cricket_World_Cup, and see if there's a more general issue.
We've talked before about having a relevance lab, and we have lots of pieces that would go into it, but everyone is still doing their own thing. It would be nice to have a unified framework to work in, and potentially open it up to others for experimentation!
I see at least four kinds of experiments we'd want to run, which have different infrastructure needs:
- differing queries: I want to compare how tall is X / height of X queries against just X for many values of X.
- differing configs: I want to compare slop=0 to slop=1.
- differing code: I want to compare running with this Gerrit patch against a baseline.
- differing indexes: I want to compare having a reverse index to not having a reverse index.
Some experiments would overlap these categories, but the requirements for some are greater than others. Obviously just running different queries and collating the results is the easiest. And config changes are easier than having different indexes.
The automatic and manual evaluations we could run would be the same in all cases. Some automatic evaluations:
- # queries with zero results
- # queries with changes in order in the top-N (5?, 10?, 20?) results
- # queries with new results in the top-N results
- # queries with changes in total results (very pretty 2-D graphs await!)
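The automatic evaluations above could be sketched as a simple comparison over {query: [ranked result titles]} dicts for two runs; the counting rules here are my own interpretation of the metrics listed.

```python
def compare_runs(baseline, candidate, n=10):
    """Count per-query differences between two runs' ranked results."""
    stats = {'zero_results': 0, 'order_changed': 0,
             'new_in_topn': 0, 'total_changed': 0}
    for q, b_all in baseline.items():
        c_all = candidate.get(q, [])
        b, c = b_all[:n], c_all[:n]
        if not c_all:
            stats['zero_results'] += 1
        if set(b) == set(c) and b != c:       # same results, new order
            stats['order_changed'] += 1
        if set(c) - set(b):                   # new results in top-N
            stats['new_in_topn'] += 1
        if len(b_all) != len(c_all):          # total result count changed
            stats['total_changed'] += 1
    return stats

baseline = {'rem': ['Rem', 'R.E.M.'], 'go': ['Go']}
candidate = {'rem': ['R.E.M.', 'Rem'], 'go': []}
print(compare_runs(baseline, candidate))
# {'zero_results': 1, 'order_changed': 1, 'new_in_topn': 0, 'total_changed': 1}
```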
Changes that make little difference would be fairly obvious, and in cases where there is some difference in results, we could link to example queries that show side-by-side results (with diff highlighting, if we want to get fancy), allowing for manual review of affected queries.