Topic on Extension talk:TitleKey

Accent and special characters independent search

Rbirmann (talkcontribs)

Would it be possible to use this extension to make MW search independent of accents, umlauts and diacritical marks on page titles?

I have a Portuguese installation of MW, and users cannot find what they are looking for unless the search term includes every accent.

What I NEED:

  • Accented page title results for unaccented queries: user should be able to search for unaccented terms and get accented results of page titles, so an article named "Estratégia de atuação" would show up on search results for search queries "estrategia" and "atuacao", for instance.

What I would LIKE (but this is not crucial):

  • Accent-independent search for article content (as well as title)

What I DO NOT NEED at this time:

  • Unaccented results for accented queries

Krinkle (talkcontribs)

  • The normalisation of accents in title search could indeed be handled by the TitleKey extension. I'd recommend filing an issue on bugzilla.wikimedia.org under "MediaWiki extensions > TitleKey".
  • For content search, you'll need to look into the capabilities of the "search backend". This is beyond the scope of the TitleKey extension. The default MySQL search backend will likely not be able to support this. Look into Extension:MWSearch, for example, and Lucene search.
  • TitleKey normalises both the query and the index, so it will naturally work in both "directions" from a user point of view.
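
The "both directions" point can be illustrated with a minimal sketch. This is not TitleKey's code: `foldAccents` is a hypothetical helper that uses a small hand-written lookup table (common Portuguese letters only) and `mb_strtolower` as a stand-in for MediaWiki's case folding. It only shows that when the same function is applied to the stored title and to the query, it no longer matters which side carries the accents.

```php
<?php
/**
 * Hypothetical helper, not part of TitleKey: lowercase the text and
 * replace a few accented letters with their unaccented equivalents.
 * The table is deliberately incomplete; it is an illustration only.
 */
function foldAccents( string $text ): string {
	static $map = [
		'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a',
		'é' => 'e', 'ê' => 'e', 'í' => 'i',
		'ó' => 'o', 'ô' => 'o', 'õ' => 'o',
		'ú' => 'u', 'ü' => 'u', 'ç' => 'c',
	];
	// Lowercase first so the table only needs lowercase entries.
	return strtr( mb_strtolower( $text, 'UTF-8' ), $map );
}

// Unaccented query against an accented title: both fold to the same key.
var_dump( foldAccents( 'estrategia de atuacao' ) === foldAccents( 'Estratégia de atuação' ) ); // bool(true)

// Accented query against an unaccented title: same result.
var_dump( foldAccents( 'Estratégia' ) === foldAccents( 'estrategia' ) ); // bool(true)
```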
Rbirmann (talkcontribs)

Apparently bug 20097 is exactly this, but since it's been there for a while, I am not sure anyone is looking into it.

As a "quick and dirty" fix, I used iconv to work around this problem.

I have patched extensions/TitleKey/TitleKey_body.php by changing the 'normalize' function to:


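static function normalize( $text ) {
	global $wgContLang;
	// iconv's //TRANSLIT output depends on the active locale, so set one first.
	setlocale( LC_ALL, 'pt_BR' );
	// Transliterate accented characters to their closest ASCII equivalents.
	$newtext = iconv( 'UTF-8', 'ASCII//TRANSLIT', $text );
	// Case-fold with the content language so matching is case-insensitive.
	return $wgContLang->caseFold( $newtext );
}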

With the new file in place I ran TitleKey/rebuildTitleKeys.php and things seem to be working...
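
Because the result of `ASCII//TRANSLIT` depends on the iconv implementation and on the active locale, it may be worth checking what it produces on a given server before rebuilding the keys. The following is a rough sketch only: `transliterateOrNull` is a hypothetical helper, and the locale names passed to it are assumptions (the ones actually installed can be listed with `locale -a`).

```php
<?php
/**
 * Hypothetical helper: try to transliterate $text to ASCII with iconv.
 * Returns null when none of the requested locales is installed or when
 * iconv reports a failure.
 */
function transliterateOrNull( string $text, array $locales ): ?string {
	// setlocale() accepts a list of candidates and returns false if none could be set.
	if ( setlocale( LC_CTYPE, $locales ) === false ) {
		return null;
	}
	// Suppress the notice iconv raises for characters it cannot convert.
	$result = @iconv( 'UTF-8', 'ASCII//TRANSLIT', $text );
	return $result === false ? null : $result;
}

// Assumed locale names; adjust them to whatever the server provides.
$ascii = transliterateOrNull( 'Estratégia de atuação', [ 'pt_BR.UTF-8', 'pt_BR.utf8', 'pt_BR' ] );
echo $ascii === null ? "transliteration unavailable\n" : $ascii . "\n";
```

If the printed text contains question marks or other placeholder characters instead of plain letters, the locale or the iconv build is probably not transliterating as intended.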

Will post updates here if I notice any undesirable side-effects...

Cheers,

Rbirmann (talk) 00:58, 10 October 2013 (UTC)

UPDATE:

This is not a complete fix. Following my previous example, if the article title is "Estratégia de atuação", after this fix searching for "estrategia de atuacao" finds the article, but searching for "estrategia" or "atuacao" does not. It is something, but still far from a fix.

The quest continues...

Wikinaut (talkcontribs)

You wrote:

This is not a complete fix. Following my previous example, if the article title is "Estratégia de atuação", after this fix searching for "estrategia de atuacao" finds the article, but searching for "estrategia" or "atuacao" does not. It is something, but still far from a fix.
The quest continues...

In my view you must apply the normalize function twice:

  1. when generating the tk_key table column in rebuildTitleKeys.php, i.e. when running php rebuildTitleKeys.php
  2. when actually performing the search (which is what you already do)

so that "translit" input values (substrings while typing) are searched against "translit" database column tk_key entries.
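
The two steps above can be sketched with an in-memory array standing in for the tk_key column. Everything here is an assumption made for illustration: `normalizeKey`, `buildIndex` and `prefixSearch` are hypothetical names, the accent table is a tiny stand-in for the iconv-based normalize function, and prefix matching is only assumed to resemble how the typed text is compared against the stored keys.

```php
<?php
// Stand-in for the patched normalize(): lowercase plus a tiny accent table.
function normalizeKey( string $text ): string {
	$map = [ 'é' => 'e', 'ç' => 'c', 'ã' => 'a' ];
	return strtr( mb_strtolower( $text, 'UTF-8' ), $map );
}

// Step 1 (the rebuild step): store a normalized key for every title.
// Titles that normalize to the same key overwrite each other in this sketch.
function buildIndex( array $titles ): array {
	$index = [];
	foreach ( $titles as $title ) {
		$index[ normalizeKey( $title ) ] = $title;
	}
	return $index;
}

// Step 2 (the search step): normalize the typed text the same way,
// then keep every title whose key starts with it.
function prefixSearch( array $index, string $typed ): array {
	$needle = normalizeKey( $typed );
	$hits = [];
	foreach ( $index as $key => $title ) {
		if ( strncmp( $key, $needle, strlen( $needle ) ) === 0 ) {
			$hits[] = $title;
		}
	}
	return $hits;
}

$index = buildIndex( [ 'Estratégia de atuação', 'Estatuto' ] );
print_r( prefixSearch( $index, 'estrate' ) );    // finds "Estratégia de atuação"
print_r( prefixSearch( $index, 'Estratégia' ) ); // finds it as well
print_r( prefixSearch( $index, 'atuacao' ) );    // empty: not a prefix of any key
```

In this sketch a word from the middle of a title never matches, because only the start of each key is compared; whether that explains the behaviour reported above would need to be checked against the extension itself.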

Let me know if that works. If it does, perhaps you could send me your code, as I am interested in it.
