Dear community and developers,
Allow me to share my recent experience of upgrading and configuring Elasticsearch and MediaWiki to meet some search requirements of mine.
I upgraded PHP to 7.3.x, Elasticsearch to 6.5.4 and MediaWiki to 1.35.1, installed the extra and icu_folding plugins into ES, updated Composer for Elastica, and indexed the wiki again. I was hoping the newer versions would bring better configuration options or better compatibility, but none of this led to any major improvement.
Steps I followed:
After installing MW, ES, Elastica and CirrusSearch, installing the icu_folding and extra plugins for ES, adding the CirrusSearch and Elastica definitions to LocalSettings.php, running update.php and updating Composer for Elastica, I followed the documented steps for indexing:
1. Add this to LocalSettings.php:
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgDisableSearchUpdate = true;
2. Now run this script to generate your elasticsearch index:
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php
3. Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates should start heading to Elasticsearch.
4. Next bootstrap the search index by running:
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse
Note that this can take some time. For large wikis read "Bootstrapping large wikis" below.
5. Once that is complete add this to LocalSettings.php to funnel queries to ElasticSearch:
$wgSearchType = 'CirrusSearch';
I have to admit I am quite lost, and I feel I have not put the pieces of information together correctly. Am I missing some step or point? After some testing and research I ended up with the findings below:
Comparison of my index vs. the cs.mediawiki index:
I will mention two examples of search/index settings here. One is the csmediawiki one you provided above, the second is from my wiki. My question is: which steps do I have to follow to make my index/search settings the same as, or similar to, those of cs.mediawiki?
From my own wiki (on Google Docs): https://docs.google.com/document/d/e/2PACX-1vRMnWjIrTsN9Y_V84Cxq4Ys_V899Qup9hfOx0MCYxhYX9-CKGuQ6eyhoN6eqsXy9j7OMFPHfon0-Fzq/pub
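To make the comparison less of an eyeball exercise, here is a minimal Python sketch of how the two settings dumps could be diffed. It assumes both dumps have been saved as JSON and loaded into dictionaries; the function names and the tiny sample fragments are my own, hypothetical and heavily shortened, and are not taken from either real index.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            # Leaves (and empty dicts) are kept as they are.
            flat[path] = value
    return flat


def diff_settings(mine, reference):
    """Return {path: (my_value, reference_value)} for every path that differs."""
    a, b = flatten(mine), flatten(reference)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


# Hypothetical fragments for illustration only; real dumps are much larger.
mine = {"analysis": {"filter": {"lowercase": {"type": "lowercase"}}}}
reference = {
    "analysis": {
        "filter": {
            "lowercase": {"type": "lowercase"},
            "icu_folding": {"type": "icu_folding"},
        }
    }
}

for path, (left, right) in diff_settings(mine, reference).items():
    print(f"{path}: mine={left!r} reference={right!r}")
```

With real data, the two dictionaries would come from `json.load` on the saved dumps instead of the inline samples. A path that exists on only one side shows up with `None` on the other.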
Partial indexing?
Is it possible that not all pages and categories are indexed correctly or fully? I have just seen a word containing the character "á" not being found on the first search, and then appearing after the second and third search (refreshing the search page for the same keyword).
On another occasion I saw something similar: a page containing some of the characters "í, é, á, ý" could not be found until I visited the particular page that has the character in its title. Did I do something horribly wrong?
Or is this expected behaviour? Apologies for my possibly limited knowledge here, but shouldn't this be explained in a structured way in some documentation for this particular use case, i.e. how to combine Elastica, CirrusSearch, Elasticsearch and MediaWiki so that everything is connected and works together?
Could this be related to Topic:Ud6sblxvbtlzlm16? Or to Topic:V5iwq5ev1fmwnkq5?
Reindexing a customized index?
May I ask which steps I have to follow to reach the same or similar settings as on the cs Wikipedia? My point is: how can I customize the way the index is created and filled? Is the correct way to let MediaWiki be indexed first, then change the index settings and reindex it again?
Standalone ICU server lib/sw?
If I understand correctly, if I do not want to use icu_folding as a plugin in Elasticsearch, I can instead use the ICU library as server software, as listed for cs Wikipedia under "ICU".
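To make concrete the behaviour I am hoping for from folding, whichever component ends up providing it, here is a rough Python sketch. It only approximates accent folding with the standard `unicodedata` module; it is neither the icu_folding plugin nor the ICU library, and real ICU folding covers more cases. The function names are my own.

```python
import unicodedata


def fold(text):
    """Strip combining marks and casefold: a rough stand-in for accent folding."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, title):
    """True when the folded query occurs in the folded title."""
    return fold(query) in fold(title)


print(fold("í é á ý"))         # i e a y
print(matches("kun", "Kůň"))   # True
```

In other words, I would expect a search for "kun" to find a page titled "Kůň", and a search containing "á" to behave consistently on every attempt.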
Additional post-install settings of CirrusSearch?
What I am missing is the post-install configuration for CirrusSearch, because there is no clear explanation of which settings are crucial and which are optional. Example of settings in this topic: Topic:Ud6sblxvbtlzlm16
All these questions share a common ground, which in my humble opinion is that I am simply not able to find proper documentation explaining what is optional and what is crucial, whether in the very brief README files or in the even briefer official MediaWiki documentation for the extensions.
I am sorry for the long post, and I thank you for your time and effort,
Svrl