User:TJones (WMF)/Notes/Cross Language Wiki Searching

September 2015 — See TJones_(WMF)/Notes for other projects.

Summary
More than 85% of failed queries to enwiki are in English, or are not in a particular language. Only about 35% of non-English queries in some language (<4.5% of zero-results queries), if funneled to the right language wiki, get any results.

The types of queries most likely to get results from the non-enwikis are names and queries in English. There are lots of English words in non-English wikis (enough that they can do decent spelling correction!), and the idiosyncrasies of language processing on other wikis allow certain classes of typos in names and English words to match, or the typos happen to exist uncorrected in the non-enwiki.

Perhaps a better approach to handling non-English queries is user-specified alternate languages.

Introduction
We want to evaluate the effect of searching other wikis when presented with non-English queries on enwiki that get no results. Even if language identification is perfect, does it help? When it isn't perfect, does it help? Does it hurt?

See Language Detection Evaluation for more details on how the queries were collected, how languages were identified, and other features of the data we're using.

I used the same query data here that I looked at before when evaluating language detection approaches, but dropped the DOI searches, since (a) they get searched on 25 wikis already, and (b) zero results are expected in most cases. This left a set of 1419 zero-results full-text queries from enwiki.

Experimental setup
I ran two sets of experiments.

In the first experiment, for queries identified as being in a language I ran the queries against the appropriate wiki (except in the case of the one query in Hmong, since there doesn't seem to be a Hmong Wikipedia), to see if any results were returned.

In the second experiment, I ran all 1419 queries against each of 27 wikis, including the biggest wikis, the wikis in the languages I identified the most queries as being in, and Romanian (since the ES language detector plugin likes Romanian so much). The goal here was to see how often results were returned, and for what type of queries. The full list of wikis queried is: ar, bg, bn, ceb, de, en, es, fa, fr, hi, id, it, ko, ms, nl, no, pl, pt, ro, ru, sv, sw, tl, tr, vi, war, zh.
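A minimal sketch of how the second experiment could be reproduced, assuming the public MediaWiki Action API (`action=query&list=search`) is used as the search entry point. The notes do not say how the queries were actually run, so the helper names `search_url` and `has_results` are hypothetical, and no network call is made here.

```python
from urllib.parse import urlencode

# The 27 wiki language codes listed above.
WIKIS = ("ar bg bn ceb de en es fa fr hi id it ko ms nl no pl pt ro ru "
         "sv sw tl tr vi war zh").split()

def search_url(lang, query, limit=1):
    """Build a full-text search URL for one Wikipedia (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

def has_results(response_json):
    """True if a decoded list=search response contains at least one hit."""
    return len(response_json.get("query", {}).get("search", [])) > 0

# One URL per wiki for a single zero-results query from the data set.
urls = {lang: search_url(lang, "franko potente") for lang in WIKIS}
```

Fetching each URL and passing the decoded JSON to `has_results` would give a yes/no answer per query per wiki, which is all the second experiment needs.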

I did not run the experiment where the queries are run against the wikis of the language chosen by the ES language detector plugin. We may evaluate other language detectors, and it's straightforward enough to pull that information out of the current results sets if we need to.

Note that we are not accounting for suggestions originally provided by enwiki, or other wikis. This is in part because the original hypothesis was that more of the failed searches would be in particular languages other than English. But most failed enwiki queries are either in English (41.3%), or are not in any language per se (46.6%).

Perfect identification, ignoring non-language queries
Assuming a perfect ability to detect languages (maybe not perfect, but at least as good as a motivated human, i.e., me) and ignoring non-language queries, how would we do? Note that only 176 of the 1419 queries are in a language other than English, and 1 is in Hmong, which has no wiki, leaving 175 (12.3%) of zero-results non-DOI queries that fall into this category.

Only about 35% of languagey queries (62 queries, 4.4% of all zero-results queries), if sent to the correct wiki, would give any results. This is much lower than our overall successful search rate, but it isn't too surprising, given the number of typos and other oddities in these queries. See table below for details.

While the error bars on this measure are relatively large (and absolutely huge for the individual languages), the overall pattern is clear.
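To make the size of those error bars concrete, here is a rough sketch that recomputes the headline proportions and puts a 95% Wilson score interval around the 62-of-175 success rate. The notes do not say how the error bars were computed, so the choice of the Wilson interval (and the function name `wilson_interval`) is an assumption made for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts taken from the text above.
total_queries = 1419   # zero-results, non-DOI queries
non_english = 175      # in a language other than English, with a wiki
with_results = 62      # got results on the matching wiki

print(f"non-English share: {non_english / total_queries:.1%}")
print(f"success rate:      {with_results / non_english:.1%}")
print(f"share of all:      {with_results / total_queries:.1%}")

# Assumption: Wilson interval, not necessarily the author's method.
low, high = wilson_interval(with_results, non_english)
print(f"95% interval on success rate: {low:.1%} to {high:.1%}")
```

With only 175 queries the interval spans well over ten percentage points, and per-language counts are far smaller still, which is consistent with the caution above about individual languages.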

On Names

Note that the analysis above doesn't take into account cultural information in non-language queries. Names are not really in a language, though often they are characteristic of a language (i.e., the ethnolinguistic source of a person's name comes from a particular language/culture, which is reflected in the character set and spelling details). However, that doesn't mean such a person is well-known in that language/culture.

A good example is Kris Brkljac, an Australian, whose last name (historically probably Brkljač) is Serb/Croat. He recently came to fame (i.e., showed up in our logs) for marrying American TV actress Stana Katic (Katić), who was born to Serbian parents from Croatia who emigrated from Yugoslavia to Canada and raised Katic in Canada and the US. Katic is on American TV, but has fans all over the world. Exercise for the reader: What's the best wiki to look up info on Kris Brkljac, based on his name?

In short, names are hard.

Also, not all of the "Names" in the data set are well-formed names of notable people. There are Unix timestamps before perfectly good names, partial names or name elements (e.g., Naomi Campb or McWashingt), and names of non-notable people (i.e., egosurfing for self or friends).

Searching other wikis, ignoring language identification
At the other end of the spectrum, I ran all the 1419 zero-results (non-DOI) queries against the 27 wikis listed above. enwiki is on the list, too, because things change over time (9 new results popped up!). See more below in Observations & Anecdota.

It's not appropriate to calculate precision based on these numbers. Names of people, places, and things can get appropriate results, and are more cultural than language-based (see above).

There are oddities of language processing that allow non-enwikis to account for typos without having to make a suggestion, returning hits on names and even English words. And there's plenty of garbage in, garbage out going on, too. See Observations & Anecdota below for more examples.

To answer a question originally posed by David some time back, fr, de, es, zh, it, and pt are pretty good but far from great at getting some sort of results for queries (6.1% to 9.3%) that fail on enwiki. Note that es, zh, pt, and fr are also languages that have a fair number of queries on enwiki.

In many cases, the performance of these wikis at least partly corresponds to language-specific or other wiki-specific processing that compensates for typos. For example, Romance language wikis handle extra, missing, or incorrect -o and -a, and Germanic languages similarly handle -er errors. frwiki can ignore many repeated letters, and the Chinese wiki (zhwiki) deals differently with numbers, making it able to find English words without failing on queries with the Unix Timestamp bug.
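The effect of this kind of processing can be imitated with a few toy normalizers. This is only an illustrative sketch: the three functions below are hypothetical stand-ins, not the analyzers the wikis actually run, and real stemmers are considerably more involved. The point is just that a typo query and a correct indexed word can collapse to the same token.

```python
import re

def strip_final_vowel(token):
    """Toy Romance-style stemming: drop one trailing -o or -a."""
    if len(token) > 3 and token[-1] in "oa":
        return token[:-1]
    return token

def collapse_repeats(token):
    """Toy version of the frwiki behaviour: squeeze runs of a repeated letter."""
    return re.sub(r"(.)\1+", r"\1", token)

def drop_numeric_prefix(query):
    """Toy version of ignoring a leading 'digits:' prefix such as a timestamp."""
    return re.sub(r"^\d+:\s*", "", query)

# Query token and indexed token end up identical after normalization.
print(strip_final_vowel("kingdomo"), strip_final_vowel("kingdom"))
print(strip_final_vowel("franko"), strip_final_vowel("franka"))
print(collapse_repeats("hhhhoooooommmmmmmmmmmeeeee"), collapse_repeats("homme"))
print(drop_numeric_prefix("1438218314546:Wild Bill Hickok"))
```

The same looseness that rescues these typos also produces spurious matches, which is one reason precision cannot be read off the raw hit counts.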

Unexpectedly, many of the results are for words in English that happen to appear (correctly or incorrectly) in the various wikis and match typo queries (exactly or via the normal language processing on the other wiki).

Again, see Observations & Anecdota below for more examples.

What gets results?
Below are the results for the ten wikis that gave the most non-zero results, broken down by what category their results were in. Categories with no results are excluded from each table.

In all categories but Spanish, the wikis are returning results mostly for names and English queries. In Spanish, it's names, English queries, and Spanish queries.

Observations & Anecdota
This is not a comprehensive list of phenomena that explain search results on various wikis, just observations I made when looking into some unexpected results. I do feel that some of the explanations generalize well, such as the Romance and Germanic language wikis doing particularly well on errors that look like normal variation in those languages, and the Chinese wiki's handling of numbers.

Generally, queries not in the language in question generate a lot of noise. So, high-precision language detection seems like a necessity.
 * Some articles have popped up since these queries were originally submitted. An article has been created on Toby Sheldon/Tobias Strebel. Kris Brkljac married Stana Katic, and so is mentioned on her page. There is, for the moment, a typo, paperr, in a new article. (Which it is hard not to fix. Must. Not. Interfere.)
 * flastest, presumably a typo for fastest, flattest, flash test or something similar, gets hits in dewiki, where it matches flast and flaster. Any search for a word ending in -est seems to give results for the form ending in -er and for the bare stem. My German's not so good, so I don't know if this is a regular pattern, or if English morphology is being applied to German, or if endings of words are just generally discounted.
 * dodge challanger (typo for challenger) matches one article in the Swedish wikipedia because it has Dodge and the word Challange in it.
 * pestl, presumably a typo for pestle (as in mortar and pestle), gets hits in Germanic and Romance languages, perhaps because they are looser about losing word endings.
 * In general, different languages tend to get results when a word (often with a typo) happens to have extra bits that look like normal inflections in that language. So, kingdomo gets matches on the English word kingdom in es, pt, and it wikis, where -o is a regular modifiable word ending.
 * Similarly, hollyo (which I think is an incomplete hollyoak) gets matches on the name Holly in the same three wikis.
 * franko potente (typo of German actress Franka Potente) gets lots of hits in es, pt, and it wikis, and the top hit is the right one. dewiki gets one hit, and it's not her.
 * deadly devoti (incomplete deadly devotion) gets hits in rowiki because Deadly is present in English titles of movies, etc., and devoti (which looks like Romanian devóți without the diacritics) matches devotat once you ignore the endings.
 * Some matches on typos (like straigt for straight, or allemagn for allemagne) match in other wikis where the typo is present (e.g., a reference to Scared Straigt, or a typo in an article title in French for allemagn, both in ruwiki).
 * Interestingly, non-enwikis sometimes have better spelling correction suggestions than enwiki. enwiki suggests strange for straigt, but ruwiki suggests straight. A quick frequency check shows that in both cases it prefers the rarer word (in terms of document frequency), but this doesn't hold elsewhere. Hmmm.
 * The Chinese wiki (zhwiki) does quite well with the Unix Timestamps, even though all the queries it hit on are in English. 1438218314546:Wild Bill Hickok gets exactly the right results. Looks like zhwiki just ignores those unneeded numbers.
 * Thanks to Lsjbot, Swedish, Cebuano, and Waray-Waray wikis do very well on some of the species searches.
 * Some of my categorizations—which must always be a bit suspect—may have been wrong. astronomem looked like a typo of astronomer to me, but may be well-formed in German or Polish.
 * de, es, fr, and pt are more likely to give results for queries with quot in them, often because the English words quote or quotes appear in the article, and because endings don't matter so much, quot is a match.
 * frwiki seems to ignore repeated letters in indexing and searching. To my surprise, pppppppppppo got hits, on Po. In a separate test, hhhhoooooommmmmmmmmmmeeeee is a perfectly fine query on frwiki, as it matches homme.
 * Numbers randomly match in other wikis, in a couple of cases because an ISBN is in fact used in the non-enwiki.

Conclusions and Recommendations
Failed queries on enwiki often generate junk on other wikis (garbage in, garbage out), and many matches are to English words with typos or names, and are generally low precision. Perfect performance on language detection provides a chance to give at most a small boost to overall recall.

In many cases I looked into, the English suggestion is as good as or better than the cross-wiki search. I would suggest not using cross-language search unless there is no suggestion, or at least I wouldn't use it instead of suggestions. (We also need to figure out the UI for cross-wiki searching and determine whether we want to give results, a link to results, or what.)

I also think that maybe rather than language detection (unless we get much better precision than with the default ES language detection plugin), we would get more bang for our buck from some sort of alternate language list provided by the user, and then searching for results in those wikis. As mentioned before, I don't care how good the results are on zhwiki for 1438218314546:Wild Bill Hickok because I don't read Chinese. An important question is how to handle this for anonymous users and API users. A list of alternate language codes would be easy to tack onto the API. For anonymous users, can we give them a cookie?
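The two recommendations above (prefer the suggestion, and take alternate languages from the user) can be sketched as a small fallback policy. Everything here is hypothetical: no such API parameter exists in the source, the pipe-separated format is borrowed from how the MediaWiki API usually takes multiple values, and the cap of three wikis is an arbitrary choice for illustration.

```python
def parse_alt_langs(raw, home="en", limit=3):
    """Parse a hypothetical pipe-separated list of alternate language codes."""
    langs = []
    for code in raw.split("|"):
        code = code.strip().lower()
        # Skip blanks, the home wiki, and duplicates; keep the user's order.
        if code and code != home and code not in langs:
            langs.append(code)
    return langs[:limit]

def zero_results_fallback(suggestion, raw_alt_langs):
    """Decide what to do after a zero-results query on the home wiki."""
    if suggestion:
        # A spelling suggestion takes priority over cross-wiki search.
        return ("suggestion", suggestion)
    langs = parse_alt_langs(raw_alt_langs or "")
    if langs:
        return ("cross_wiki", langs)
    return ("none", None)

print(zero_results_fallback("straight", "ru|es"))
print(zero_results_fallback(None, "es| ZH |en|es|pt|fr"))
print(zero_results_fallback(None, None))
```

For anonymous users the raw string could come from a cookie rather than a request parameter; the policy itself would not need to change.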

As mentioned above, queries not in the language in question generate a lot of noise. So, high-precision language detection seems like a necessity if we go that route, both for good precision in results, and to prevent wasting processing time.