User:TJones (WMF)/Notes/Searching for Punctuation Gives Weird Results

From mediawiki.org

June 2018 — See TJones_(WMF)/Notes for other projects. See also T196826.

Background

At the time that I wrote this, there was a problem on Farsi Wikipedia where searching for a hyphen (-, which is technically a hyphen-minus) would automatically redirect you to the page for the apostrophe. See T196826.

The problem is caused by the way we normalize isolated punctuation marks—either ignoring them or converting them to spaces.

Comments & Explanation

On-wiki search isn't really optimized for single punctuation characters, and so it can do weird things. In this case, a number of different factors are interacting to get this behavior.

First, a detour to explain how we analyze text to find matches. There are several ways:

  • "text" analysis does as much language-specific normalization as possible: breaking the text into words, lowercasing words, stemming (so that hope, hopes, hoped, and hoping all match), removing foreign diacritics, dropping stop words (common words like the or without that don't carry much content), etc. It's used for general full text searching. "text" analysis generally ignores punctuation, especially when it is on its own.
  • "plain" analysis does as little as possible to the text other than breaking the text into words, lowercasing them, and doing some basic normalization of uncommon characters for most languages. In English, it also strips diacritics, because English almost always ignores them. It's used for "exact" matching, like when you search with quotes. "plain" analysis also generally ignores punctuation, especially when it is on its own.
  • "near match" analysis also does as little as possible, like "plain", but it does not break the text into words. It's used for title matches. It does discount some punctuation marks by converting them to spaces, so that hyphenated-man, hyphenated_man, and hyphenated man are all equivalent.
  • "near match ASCII folding" is the same as "near match", but it also aggressively removes diacritics.
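
The four analyses above can be sketched roughly in Python. This is only a toy illustration, not the actual search configuration: the function names, the three-word stop list, the suffix-stripping "stemmer", and the choice of which characters become spaces are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical, tiny stop word list (real lists are language-specific and longer).
STOP_WORDS = {"the", "a", "of"}


def strip_diacritics(s):
    """Remove combining marks after canonical decomposition.

    Note: this does not handle letters such as "ł" that have no
    decomposition; real ASCII folding is more aggressive.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def text_analyze(s):
    """'text': tokenize, lowercase, strip diacritics, drop stop words, stem."""
    words = re.findall(r"\w+", strip_diacritics(s.lower()))
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def plain_analyze(s):
    """'plain': tokenize and lowercase only."""
    return re.findall(r"\w+", s.lower())


def near_match_analyze(s):
    """'near match': one string, lowercased, some punctuation turned into spaces."""
    return re.sub(r"[-_]", " ", s.lower())


def near_match_folded(s):
    """'near match ASCII folding': near match plus diacritic removal."""
    return strip_diacritics(near_match_analyze(s))


print(text_analyze("hope hopes hoped hoping"))
print(plain_analyze("Hoping, hopes!"))
print(repr(near_match_analyze("hyphenated-man")))
print(repr(near_match_folded("Üdem")))
```

The point of the sketch is the shape of the output: "text" and "plain" return lists of words, so a lone punctuation mark yields an empty list, while the two "near match" variants return a single string in which the mark survives as a space.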

When you go to the search box, it looks for an exact title or redirect match, and if there is one, you are taken to it. (It's a little more complicated than this in the cases where you have entries that only differ by capitalization, like jack and Jack or ebay and eBay on English Wiktionary. If you search for jaCK or eBaY you will get sent to the one that's all lower case.)

If not, then it processes the text with "near match" and if there is exactly one title match (after deduplicating redirects), you are taken to it. Thus on English Wikipedia you can search for Albert Einstein, Albert_Einstein, or Albert-Einstein and get the expected result. On Farsi Wikipedia, آلبرت_اینشتین, آلبرت اینشتین, and آلبرت-اینشتین also all work.

If "near match" doesn't get any results, "near match ASCII folding" takes a turn and if there is only one result (ignoring redirect duplicates), you are taken to it. On English Wikipedia, you can search for Ḁłɓęȑṭ Ǝḭɲṧʈɇḯȵ and get taken directly to "Albert Einstein".

If "near match" has more than one result, or "near match ASCII folding" gets no results or more than one result, then the query gets sent to the full text search, which uses a combination of different analyses to get results. As an example, on English Wikipedia, if you search for udem you get taken right to the "UdeM" page (that's a "near match" result). If you search for üdem you get taken to the "Üdem" page (which is also a "near match" result, though it redirects to "Uedem" because German spelling is like that).

Now here's where it gets tricky. If you search for udëm there are no "near match" results, but there are two "near match ASCII folding" results: UdeM and Üdem. Since it can't choose between them, you get rolled over to full text search.
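
The fallback order described above (exact match, then "near match", then "near match ASCII folding", then full text) can be sketched as a small Python function. This is a simplified model under stated assumptions, not the real implementation: the function names and the TITLES table are hypothetical, the capitalization special cases are ignored, and the function returns the redirect target rather than the redirect page itself.

```python
import unicodedata


def near_match(s):
    """Toy 'near match': lowercase, and turn hyphens/underscores into spaces."""
    out = unicodedata.normalize("NFC", s).lower()
    for ch in "-_":
        out = out.replace(ch, " ")
    return out


def fold(s):
    """Toy 'near match ASCII folding': near match plus diacritic removal."""
    decomposed = unicodedata.normalize("NFD", near_match(s))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def go(query, titles):
    """Return (stage, target); target is None when falling through to full text.

    `titles` maps every title or redirect to the article it leads to,
    which is how redirect duplicates get collapsed here.
    """
    if query in titles:
        return ("exact", titles[query])
    stages = (("near match", near_match), ("near match ASCII folding", fold))
    for name, key in stages:
        targets = {t for title, t in titles.items() if key(title) == key(query)}
        if len(targets) == 1:
            return (name, targets.pop())
        if targets:
            break  # more than one candidate: do not try the next stage
    return ("full text", None)


# Hypothetical miniature title table modeled on the examples in the text.
TITLES = {
    "UdeM": "UdeM",
    "Üdem": "Uedem",  # redirect
    "Uedem": "Uedem",
    "Albert Einstein": "Albert Einstein",
}

for q in ("Albert-Einstein", "udem", "üdem", "udëm"):
    print(q, go(q, TITLES))
```

In this model udëm finds nothing at the "near match" stage and two distinct targets at the folding stage, so it falls through to full text, which mirrors the behavior described for English Wikipedia.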

Why is all of this relevant? Isolated punctuation marks get reduced to nothing by "text" and "plain" processing, but get indexed as a space by "near match". As a result, if you do full text search on English Wikipedia for a plain single quote ('), a hyphen-minus (-), or a curly apostrophe (’), "text" and "plain" reduce them to nothing, and "near match" converts them to a space. Since there is more than one result for a space as a title, you get rolled over to the full text search results. (The modifier apostrophe (ʼ) is also returned because "near match" converts it to a space, but searching for it directly gives lots of results because "text" and "plain" do not reduce it to nothing. As I said, on-wiki search isn't really optimized for single punctuation characters.)
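
As a rough illustration of that difference, the sketch below runs the four marks through a toy word tokenizer and a toy "near match" normalizer. Both functions are assumptions for the sake of the example. In particular, Python's `\w` happens to treat the modifier apostrophe (U+02BC) as a letter, which is only an analogy for why the real "text" and "plain" analyses keep it.

```python
import re

MARKS = {
    "single quote": "'",
    "hyphen-minus": "-",
    "curly apostrophe": "\u2019",
    "modifier apostrophe": "\u02bc",
}


def plain_tokens(s):
    """Toy stand-in for 'text'/'plain': keep only runs of word characters."""
    return re.findall(r"\w+", s.lower())


def near_match_form(s):
    """Toy stand-in for 'near match': these marks become spaces (assumed list)."""
    return re.sub(r"['\-_\u2019\u02bc]", " ", s.lower())


for name, mark in MARKS.items():
    # An empty token list means the mark was reduced to nothing.
    print(name, plain_tokens(mark), repr(near_match_form(mark)))
```

All four marks come out of the "near match" stand-in as a single space, so they all collide with one another as titles, while only the modifier apostrophe survives word tokenization.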

Here are links to results on English Wikipedia: search ', search -, or search ’.

You get similar results on English Wiktionary: search ', search -, or search ’.

Now, on Farsi Wikipedia, there's only one full text result for these three characters: search ', search -, or search ’.

Single quote (') and curly apostrophe (’) work in the search box because both have an exact match to a redirect to the "آپاستروف" (apostrophe) article.

The hyphen has no exact title match, so the search falls back to "near match", which converts it to a space (" "). The only title match for a space is the apostrophe article (via its redirects), so you get sent there.

We could try to figure out a smarter way to process everything and handle all the special cases of punctuation and such, but the most straightforward solution is to add a redirect from "-" to the right article on Farsi Wikipedia.
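
The Farsi Wikipedia behavior and the effect of the proposed redirect can be modeled in a few lines of Python. This is a sketch under assumptions: the redirect table is a hypothetical miniature of the wiki, the set of characters converted to spaces is assumed, and "HYPHEN_ARTICLE" is a placeholder because the note does not name the right target article.

```python
# Assumed set of marks that the toy "near match" converts to spaces.
SPACE_CHARS = "-_'\u2019\u02bc"

APOSTROPHE = "آپاستروف"  # the apostrophe article named in the text


def near_match(s):
    """Toy 'near match': lowercase and replace the marks above with spaces."""
    return "".join(" " if c in SPACE_CHARS else c for c in s.lower())


def go(query, redirects):
    """Exact match first, then a unique 'near match'; None means full text."""
    if query in redirects:
        return redirects[query]
    targets = {
        target
        for title, target in redirects.items()
        if near_match(title) == near_match(query)
    }
    return targets.pop() if len(targets) == 1 else None


# Hypothetical miniature of the relevant titles and redirects.
fawiki = {
    "'": APOSTROPHE,       # redirect
    "\u2019": APOSTROPHE,  # redirect
    APOSTROPHE: APOSTROPHE,
}

print(go("-", fawiki) == APOSTROPHE)  # the hyphen lands on the apostrophe article

# Proposed fix: add an explicit redirect for "-" (placeholder target).
fixed = dict(fawiki)
fixed["-"] = "HYPHEN_ARTICLE"
print(go("-", fixed))
```

Because the exact-match step runs first, the new redirect wins before "near match" ever gets a chance to turn the hyphen into a space.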