Topic on Extension talk:CirrusSearch

CirrusSearch for MW1.23 with ICU plugin support?

Spiros71 (talkcontribs)

I had Lucene installed and upgraded to Elastica for an Ancient Greek dictionary. However, although Lucene provided diacritics-insensitive autocomplete and search results by default, it appears that with Elastica 1.7 one needs to use the ICU plugin. But no such option appears in CirrusSearch for MW1.23 (nor in CirrusSearch.php for MW1.26; I could only see it here as $wgCirrusSearchUseIcuFolding). Is there any way to provide diacritics-insensitive autocomplete for MW1.23?

DCausse (WMF) (talkcontribs)

Cirrus will use the ICU plugin as long as it is installed in elasticsearch. Note that cirrus uses the ICU analyzer only for unicode normalization but still uses the asciifolding filter for diacritics-insensitive search.

ICU folding support is very new in Cirrus and is only supported by the completion suggester. It's still experimental because it lacks the preserve_original option.
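
To make the idea of diacritics folding concrete, here is a minimal Python sketch. It is not how Elasticsearch's asciifolding or icu_folding filters are implemented; it only approximates their effect by decomposing characters and dropping combining marks. The function name fold_diacritics is made up for this illustration.

```python
import unicodedata


def fold_diacritics(text):
    """Approximate diacritics folding (illustrative only, not the ICU algorithm)."""
    # Split each character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn": accents, breathings, ...).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose and casefold so that case differences do not matter either.
    return unicodedata.normalize("NFC", stripped).casefold()


if __name__ == "__main__":
    for word in ("αἰών", "εἰμί", "Άνθρακας"):
        print(word, "->", fold_diacritics(word))
```

With folding applied to both the indexed text and the query, a query typed without diacritics can match a title that carries them.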

Concerning MW1.23, I don't see an option that would allow you to activate it with a simple config flag. If your wiki is configured as Greek, then some diacritics will be handled in fulltext search (not autocomplete) by the Greek analyzer (but as far as I know, not all Ancient Greek stress marks are supported by the Lucene Greek analyzer).

Unfortunately the only option that does not involve a modification in the CirrusSearch source code would be to use a more recent version of cirrus and activate the completion suggester with the $wgCirrusSearchUseIcuFolding option enabled.

Spiros71 (talkcontribs)

Thank you very much for the reply, David. This is quite strange, as Lucene with MWSearch (which was installed before) performed the diacritics-insensitive autocomplete by default; sounds like a standard functionality was lost with Elastica?

I just checked searching for αιων in https://en.wiktionary.org and neither the autocomplete nor the search results display the ancient Greek equivalent (only difference being the diacritics). However, the word https://en.wiktionary.org/wiki/αἰών exists, and it can only be found if one enters the exact diacritics, which is quite cumbersome and not very user-friendly.

Checking the Cirrus extension code in different versions for MW, I could not find any differences indicating IcuFolding support. Which Cirrus extension versions for MW support enabling IcuFolding via a config flag? Would they work with MW 1.23?

DCausse (WMF) (talkcontribs)

IcuFolding is currently enabled only for autocomplete queries on the Greek Wikipedia (searching for ανθρακας in the autocomplete box will suggest Άνθρακας, and with your example αιων you'll see Αιώνας suggested).

I agree with you that this is a major regression compared to MWSearch for non-Latin wikis.

I doubt that the Cirrus version supporting IcuFolding will work with MW 1.23, but I think the code change would be minimal to enable it for your wiki.

If you feel comfortable hacking some PHP code then you can probably add the code to support it?

I can assist you if you send me a link to the version of the file extensions/CirrusSearch/includes/Maintenance/AnalysisConfigBuilder.php you are using?

Spiros71 (talkcontribs)
DCausse (WMF) (talkcontribs)

This version of AnalysisConfigBuilder.php should allow you to force ICU folding for fulltext and autocomplete searches.

You will have to:

  1. set $wgCirrusSearchForceIcuFolding = true for this wiki
  2. make sure the icu analyzer plugin is properly installed on your elasticsearch cluster
  3. Reindex your wiki with this new config

Note that you will lose the behavior behind preserve_original. This feature lets search queries that include words with diacritics rank pages whose words match the diacritics higher. Example: searching for thé would certainly display pages with the word thé first, then pages with the word the; without preserve_original, elasticsearch will make no distinction between thé and the.
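
The effect of preserve_original can be sketched roughly as follows. This is a toy model in Python, not Elasticsearch's real scoring: the helper names (fold, index_tokens, score) are hypothetical, and the "score" is just the number of shared tokens. It only shows why keeping the original token next to the folded one lets an exact match rank higher.

```python
import unicodedata


def fold(token):
    """Strip combining marks (toy stand-in for a folding filter)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def index_tokens(word, preserve_original):
    """Tokens emitted for one word, with or without the original form kept."""
    word = unicodedata.normalize("NFC", word)
    folded = fold(word)
    if preserve_original and folded != word:
        return [word, folded]  # keep both the accented and the folded form
    return [folded]            # only the folded form survives


def score(query, doc_words, preserve_original):
    """Toy relevance: number of distinct tokens shared by query and document."""
    query_tokens = set(index_tokens(query, preserve_original))
    doc_tokens = set()
    for word in doc_words:
        doc_tokens.update(index_tokens(word, preserve_original))
    return len(query_tokens & doc_tokens)


if __name__ == "__main__":
    for preserve in (True, False):
        print("preserve_original =", preserve,
              "| page with thé:", score("thé", ["thé"], preserve),
              "| page with the:", score("thé", ["the"], preserve))
```

In this model the page containing thé outscores the page containing the only when the original token is preserved; with plain folding both pages tie.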

Note that I haven't tested this code.

Spiros71 (talkcontribs)

Thank you so much David. So my LocalSettings.php will be like this before indexing?

require_once "$IP/extensions/Elastica/Elastica.php";

require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";

$wgCirrusSearchServers = array( '127.0.0.1' );

#$wgDisableSearchUpdate = true;

$wgSearchType = 'CirrusSearch';

$wgCirrusSearchForceIcuFolding = true;

DCausse (WMF) (talkcontribs)

Yes, make sure:

- you use the modified version of AnalysisConfigBuilder.php I provided

- analysis-icu plugin is properly installed on elasticsearch

Good luck.

Spiros71 (talkcontribs)

Thank you. I just tried it: when I enter "ειμι" in search, it does not display "εἰμί". Similarly, when I enter "ει", it does not display "εἴλω" or other words with diacritics on "ι".

DCausse (WMF) (talkcontribs)

Can you verify that "asciifolding"/"asciifolding_preserve" under the "filter" section uses "type": "icu_folding" when looking at api.php?action=cirrus-settings-dump on your wiki?

If your wiki looks like the example I provided ("type": "asciifolding"), then icu_folding is probably not activated. The cause is one of:

- a bug in the modified php file I provided

- analysis-icu plugin is not installed

- you did not reindex your wiki properly
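
If the settings dump is large, the check above can be automated. The following Python sketch searches a parsed settings dump for the two filter names and reports their "type". The exact nesting of the real dump is an assumption here, so the code walks the whole structure instead of relying on a fixed path; SAMPLE_DUMP is a hand-written, abbreviated example, not real output.

```python
import json

FOLDING_FILTERS = ("asciifolding", "asciifolding_preserve")


def find_folding_filter_types(settings):
    """Collect the "type" values of the folding filters found anywhere in settings."""
    found = {}

    def walk(node):
        if isinstance(node, dict):
            filters = node.get("filter")
            if isinstance(filters, dict):
                for name in FOLDING_FILTERS:
                    definition = filters.get(name)
                    if isinstance(definition, dict) and "type" in definition:
                        found.setdefault(name, set()).add(definition["type"])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(settings)
    return found


def icu_folding_active(settings):
    """True only if every folding filter found uses type icu_folding."""
    types = find_folding_filter_types(settings)
    return bool(types) and all(t == {"icu_folding"} for t in types.values())


# Hypothetical, abbreviated dump for illustration only.
SAMPLE_DUMP = json.loads("""
{"content": {"page": {"index": {"analysis": {"filter": {
    "asciifolding": {"type": "asciifolding", "preserve_original": "false"},
    "asciifolding_preserve": {"type": "asciifolding", "preserve_original": "true"}
}}}}}}
""")

if __name__ == "__main__":
    print(find_folding_filter_types(SAMPLE_DUMP))
    print("icu_folding active:", icu_folding_active(SAMPLE_DUMP))
```

For the sample, both filters still report the asciifolding type, which is the "not activated" case described above.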

Spiros71 (talkcontribs)
DCausse (WMF) (talkcontribs)

I don't know why this api is not working properly on your wiki...

But you can still display the index settings by asking elasticsearch directly:

- identify the index: run curl -s localhost:9200/_cat/indices and you should see a list of names of the form wikiname_(content or general)_(first or a timestamp)

- pick the one that matches your wiki name, then run: curl -s 'localhost:9200/wikiname_content_XXXXX?pretty=true'
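
The two steps above can also be scripted. This Python sketch picks the index names for one wiki out of the text returned by _cat/indices and builds the settings URL for each. The column layout of SAMPLE_CAT_OUTPUT and the wiki names in it are invented for illustration, so the parser only looks for a field that starts with the wiki name followed by _content_ or _general_.

```python
def find_wiki_indices(cat_output, wiki_name):
    """Return index names of the form <wiki>_content_* or <wiki>_general_*."""
    prefixes = (wiki_name + "_content_", wiki_name + "_general_")
    names = []
    for line in cat_output.splitlines():
        for field in line.split():
            if field.startswith(prefixes):
                names.append(field)
    return names


def settings_url(index_name, host="localhost:9200"):
    """URL that returns the index definition, pretty-printed."""
    return "http://{}/{}?pretty=true".format(host, index_name)


# Hypothetical output; a real cluster may show different columns.
SAMPLE_CAT_OUTPUT = """\
green open mywiki_content_first 4 0 1200 0 5mb 5mb
green open mywiki_general_first 4 0 300 0 1mb 1mb
green open otherwiki_content_1457964000 4 0 10 0 100kb 100kb
"""

if __name__ == "__main__":
    for name in find_wiki_indices(SAMPLE_CAT_OUTPUT, "mywiki"):
        print(settings_url(name))
```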

Spiros71 (talkcontribs)
DCausse (WMF) (talkcontribs)

OK, the settings are not correct and still use the wrong asciifolding filter:

- Did you restart elasticsearch after installing the plugin? Can you see the plugin when running: curl -s 'localhost:9200/_nodes/plugins?pretty=true'

- How did you reindex the wiki? The timestamp in the settings indicates that this index was created on Monday at 2pm UTC.

Maybe you used forceSearchIndex? To reindex and update the index settings, you need to run extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now on this wiki.
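
Both checks can be done on the JSON that elasticsearch returns. The Python sketch below assumes the _nodes/plugins response has a "nodes" object whose entries each carry a "plugins" list with "name" fields, and that the index settings expose creation_date as epoch milliseconds; the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def nodes_missing_plugin(nodes_response, plugin="analysis-icu"):
    """Return the ids of nodes whose plugin list does not include the plugin."""
    missing = []
    for node_id, info in nodes_response.get("nodes", {}).items():
        names = {p.get("name") for p in info.get("plugins", [])}
        if plugin not in names:
            missing.append(node_id)
    return sorted(missing)


def creation_time_utc(creation_date_ms):
    """Convert an epoch-milliseconds creation_date to an aware UTC datetime."""
    return datetime.fromtimestamp(int(creation_date_ms) / 1000, tz=timezone.utc)


# Hypothetical response: one node has the plugin, the other does not.
SAMPLE_NODES = {
    "nodes": {
        "node-a": {"plugins": [{"name": "analysis-icu"}]},
        "node-b": {"plugins": []},
    }
}

if __name__ == "__main__":
    print("nodes without analysis-icu:", nodes_missing_plugin(SAMPLE_NODES))
    # Made-up timestamp, chosen only to show the conversion.
    print("index created:", creation_time_utc("1457964000000").isoformat())
```

If the creation time predates the change to the folding configuration, the index was not rebuilt with the new settings.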

Spiros71 (talkcontribs)
DCausse (WMF) (talkcontribs)

I'm glad it worked in the end.

I hope it'll work as you expect. It's really something we'd like to fix properly in Cirrus, so please comment on T129545 if you encounter any strange or undesirable behavior related to folding.