Topic on Help talk:CirrusSearch

Can't find words that include ʿ or ʾ

Pooya72 (talkcontribs)

Hello everybody,


In our MediaWiki we have titles that use diacritics for transliterations. Since CirrusSearch does accent/diacritic folding out of the box, we are able to search for pages without typing the diacritics, which is especially important for mobile users. However, if a word includes either ʿ or ʾ, the page cannot be found by searching for that word unless the query includes ʿ, ʾ, or a \? (wildcard). Is it possible to add ʿ and ʾ to the folding filter so that such words are searchable without them? For example:

Search for ʿilm by typing ilm, or 'ilm (normal quote).


This is our setup:

MediaWiki 1.35.3
PHP 7.4.22 (apache2handler)
MariaDB 10.4.20-MariaDB
ICU 66.1
Semantic MediaWiki 3.2.3
Elasticsearch 6.5.4
CirrusSearch 6.5.4 (ad4210f)
Elastica 6.1.1


Regards,


Pooya

TJones (WMF) (talkcontribs)

Different languages have different analysis chains configured, even apart from their language-specific components. We have a long-term plan (see T219550) to make them more consistent. I mention that because your description of the situation doesn't match the behavior I'm seeing on English Wiktionary. (If I search for ʿayin on English Wiktionary, then ʿayin is the first result and ayin is second. If I search for ayin, then ayin is first and ʿayin is third. If I search for 'ayin, then ayin is second and ʿayin is third.) What language config are you using?

You should be able to add something to your analysis chain to deal with this. I looked in detail at the English analysis, and ʿ & ʾ are both removed by icu_folding. Ahh... that could be it! If you don't have analysis-icu installed, then the default English config gives you ascii_folding, not icu_folding, and ascii_folding does not remove ʿ & ʾ.

If you don't want to or can't install analysis-icu or don't want all the aggressive folding of icu_folding, then you could have a much more targeted solution by adding a character filter to remove ʿ & ʾ. (You could map them to ' if ' gets used elsewhere in similar contexts, but it may lead to unwanted behavior. The standard tokenizer strips ' at the edges of words, and aggressive_splitting splits on ', so a'ilm'b gets tokenized as a, ilm, b, while aʿilmʾb gets tokenized as ailmb—at least when icu_folding is enabled.)
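A minimal sketch of the targeted option, written in Python for illustration only. It assumes Elasticsearch's built-in pattern_replace character filter type; the filter name strip_half_rings is hypothetical, and how the filter gets wired into the analyzers that CirrusSearch generates is not shown here. The small helper function mimics the filter's effect locally so the expected result can be checked without a cluster.

```python
import json
import re

# The two characters discussed in this thread.
HALF_RINGS = "ʿʾ"

# Sketch of an index "analysis" settings fragment. "pattern_replace" is a
# built-in Elasticsearch character filter type; the name "strip_half_rings"
# is a hypothetical label chosen for this example.
char_filter_settings = {
    "analysis": {
        "char_filter": {
            "strip_half_rings": {
                "type": "pattern_replace",
                "pattern": "[" + HALF_RINGS + "]",
                "replacement": "",
            }
        }
    }
}


def strip_half_rings(text: str) -> str:
    """Local stand-in for the character filter: delete ʿ and ʾ."""
    return re.sub("[" + HALF_RINGS + "]", "", text)


if __name__ == "__main__":
    print(json.dumps(char_filter_settings, ensure_ascii=False, indent=2))
    print(strip_half_rings("ʿilm"))     # ilm
    print(strip_half_rings("aʿilmʾb"))  # ailmb
```

Deleting the characters rather than mapping them to ' avoids the edge-stripping and splitting behavior described above, at the cost of not matching queries typed as 'ilm unless the tokenizer also drops the leading quote.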

I hope that helps. If that doesn't address all your problems, let us know what plugins you have installed, please!

—Trey

Pooya72 (talkcontribs)

Thanks @TJones (WMF)! Looks like installing analysis-icu is the way to go. Tokenizing aʿilmʾb as ailmb is what we want. If you could point me towards the relevant documentation that would be great.


Edit: I installed the plugin as per the instructions here, and saw the plugin come up when I restarted Elasticsearch. I also added $wgCirrusSearchUseIcuFolding = 'yes'; to LocalSettings.php, but the search behavior is still the same. My current language code is:

$wgLanguageCode = "en-gb";

TJones (WMF) (talkcontribs)

Glad to help. Looks like you found the documentation on your own!

Did you also reindex with UpdateSearchIndexConfig.php? It will rebuild the analysis chain with the ICU upgrades and reindex.

You can search for "text" and "plain" (with quotes) on the English Wikipedia config to see what the analysis config should look like, and see where yours differs (if it does).

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

Sounds good, @Pooya72! Monday is a holiday for us, so we'll check in on Tuesday.

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

@Pooya72, check out the docs on in-place reindexing for UpdateSearchIndexConfig.php usage. You need to call it with mwscript. Internally, we use the reindex() function, or some variation of it, to reindex our wikis. In addition to calling UpdateSearchIndexConfig.php, it keeps track of the time the reindex started (REINDEX_START) and calls ForceSearchIndex.php to catch up on activity that happened while the reindex was running. Reindexing English Wikipedia, for example, can take hours, so a lot can happen in the meantime.

TJones (WMF) (talkcontribs)

Just wanted to call out that --reindexAndRemoveOk and --indexIdentifier now are the parameters that are really necessary to make the reindex happen. You also need to specify the wiki, and the cluster if you have more than one (in which case you need to reindex each cluster, too). If things don't seem to work, please share the command you used and its output.
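The two-step flow described above (record the start time, rebuild and reindex, then catch up on edits made in the meantime) could be sketched as follows. This is an illustration in Python, not the internal reindex() function: it only builds the command lines and does not run them. The scripts are invoked directly with php, as on a single-wiki install, the maintenance path is an assumed default location, and the --from option on ForceSearchIndex.php is an assumption that should be checked against the script's --help output.

```python
from datetime import datetime, timezone


def reindex_commands(mw_root=".", start=None):
    """Build the two command lines for an in-place reindex plus catch-up.

    mw_root is the MediaWiki install directory (assumed layout); start is
    the moment the reindex begins, defaulting to the current UTC time.
    """
    start = start or datetime.now(timezone.utc)
    stamp = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    maint = f"{mw_root}/extensions/CirrusSearch/maintenance"
    return [
        # Step 1: rebuild the analysis config and reindex in place.
        ["php", f"{maint}/UpdateSearchIndexConfig.php",
         "--reindexAndRemoveOk", "--indexIdentifier", "now"],
        # Step 2: catch up on pages changed since the reindex started.
        # ASSUMPTION: --from takes a timestamp in this format.
        ["php", f"{maint}/ForceSearchIndex.php", "--from", stamp],
    ]


if __name__ == "__main__":
    for cmd in reindex_commands():
        print(" ".join(cmd))
```

On a small wiki with little activity the catch-up step matters much less, since few or no edits arrive while the reindex runs.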

Pooya72 (talkcontribs)

Thanks again @TJones (WMF). I ran php ./extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier=now and this was the output: pastebin. We only have one instance, and it's at the beginning of the project so there are only a handful of entries, and no users.

TJones (WMF) (talkcontribs)

That looks like a successful run, but I see that you only have the analysis-icu plugin. You also need the Wikimedia extra plugin to enable icu_folding. I'm sorry that I'm not familiar with the base MediaWiki install, so I didn't realize this sooner; I had assumed extra and experimental-highlighter and maybe ltr would be installed by default. You can install extra like so: elasticsearch-plugin install org.wikimedia.search:extra:6.5.4
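A quick way to confirm that both plugins are present before reindexing is to inspect the output of Elasticsearch's GET /_cat/plugins?format=json endpoint. The Python sketch below only parses such a response; it assumes each row carries a "component" field and that the Wikimedia plugin reports itself as "extra". The sample row and its node name are made up for illustration.

```python
import json

# Plugins this thread identifies as needed for icu_folding.
# ASSUMPTION: these are the component names reported by _cat/plugins.
REQUIRED = {"analysis-icu", "extra"}


def missing_plugins(cat_plugins_json, required=REQUIRED):
    """Return the required plugin names absent from a _cat/plugins response."""
    rows = json.loads(cat_plugins_json)
    installed = {row["component"] for row in rows}
    return sorted(set(required) - installed)


if __name__ == "__main__":
    # Hypothetical response from a node with only analysis-icu installed.
    sample = '[{"name": "node-1", "component": "analysis-icu", "version": "6.5.4"}]'
    print(missing_plugins(sample))  # ['extra']
```

An empty list means both plugins are visible to the node that answered the request.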

After that, reindex again and let's see where we are.

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

Woo-hoo! Glad to help.
