Extension talk:CirrusSearch

About this board

Discussion related to the CirrusSearch MediaWiki extension.

See also the open tasks for CirrusSearch on phabricator.

Storage option for CirrusSearch in MW 1.39 Docker

Testergt1302 (talkcontribs)

Hi,

We are trying to move our wiki into a Docker container (Azure Kubernetes). We got MediaWiki working, but Elasticsearch is not working as expected: it builds the indexes the first time, but after one or two days the index data is deleted. We are not sure why this is happening. We used an Azure Blob container as the data volume for ES, but I think that is not supported.

Is there a compatibility list of the storage types supported by Elastic? I looked on the Elastic website and in the docs, but I could not find one. Does anybody have an idea about this?

Thanks in advance.

Username and password authentication for Elastic server?

Brooke Vibber (WMF) (talkcontribs)

I tried setting up a local development instance using a default Docker installation of Elastic. This creates a username "elastic" and a generated password for authentication, however I can't find anything about specifying authentication in search host configuration for CirrusSearch, and the updater script won't connect with anything I've devised yet.

It doesn't seem to work to specify "elastic:<password>@localhost:9200" as the host:

Elastica\Exception\Connection\HttpException from line 186 of /var/www/html/w/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: Malformed URL


nor "elastic:<password>@localhost" with default port assumed:

Elastica\Exception\Connection\HttpException from line 186 of /var/www/html/w/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: Couldn't connect to host, Elasticsearch down?

Prefacing with 'http:' or 'https:' makes no difference.

Any ideas? I'm hoping to get this running so I can do some fixes on Extension:MediaSearch on my local development site. Thanks!

EBernhardson (WMF) (talkcontribs)

Generally I would suggest using docker-registry.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image:v7.10.2-5, as it contains the extra plugins we use. It is based on the elasticsearch-oss image. Alternatively you can use the https://www.docker.elastic.co/r/elasticsearch/elasticsearch-oss image directly. This image doesn't include authentication, because auth isn't part of the OSS offering.

I haven't tested it, but you should be able to provide authentication as part of the connection configuration. For example (untested):

    $wgCirrusSearchServers = [
        [
            'host' => 'localhost',
            'port' => '9200',
            'username' => '...',
            'password' => '...',
        ]
    ];

Completion suggestions for other namespaces

Alex44019 (talkcontribs)

Hi,

As in the title: are these supported? If not, could you point me to a Phabricator task? Our wiki has two content namespaces (one for official content, the other for unofficial content), but unfortunately pages from the second namespace never show up in suggestions. Are there any workarounds, maybe (without disabling Cirrus suggestions)?

Alternatively, if someone has pointers on how to implement it in the extension, I'd gladly appreciate them - though I've never done any search work in the past.

DCausse (WMF) (talkcontribs)

Getting suggestions (title completion) should be supported.

For 2 pages:

  • My_Page
  • Unofficial:My_Page

I suspect that what you want when typing "Ma Pag" in the search box is getting at least these two pages suggested?

If yes I think that the way to get this working is:

  1. Configure wgNamespacesToBeSearchedDefault with [ 0 => 1, 100 => 1 ] (assuming that 100 is the Unofficial namespace)

Note that changing wgNamespacesToBeSearchedDefault will require reindexing your wiki.

You can see it in action on https://es.wikipedia.org, for example, where the Author and Portal namespaces are searched by default: if you search for `Lenguas portuguesa` you should get results from both the main content namespace and the Portal namespace.

Note that it is likely that the suggestions from these extra namespaces are ranked very low compared to the ones from the main namespace.

Alex44019 (talkcontribs)

I'll link the wiki as it may be helpful for the thread: https://ark.wiki.gg/. We use the main namespace for official game content, and a "Mod" namespace for unofficial modifications, all following a format of a mod's main page at "Mod:modname", and mod's content as sub-pages to that main page. For example "Mod:ARK Additions/Acrocanthosaurus".

I would expect typing "Acro", "Acrocantho", or the full title "Acrocanthosaurus" in the mw-head search bar to suggest the article in the Mod namespace. We have no other page titled Acrocanthosaurus in any namespace (ignoring files, of course). However, no results are returned at all.

To get the suggestions, the reader has to type the mod namespace prefix and the mod's name. "Mod:ARK Additions/Acr" returns valid suggestions. There's no "partial" completion, the prefix must be complete and without typos. And that's not very intuitive or useful.

Regular Special:Search already handles this well [enough], and our mod namespace is weighted below main.

(I've put "enough" in brackets, as searching for "acrocanth" in Special:Search yields no results until a wildcard is added to the end. I'm not familiar with Cirrus's configuration though, so not sure if there's a setting to alter the behaviour so search acts as if there was a wildcard at all times. However, this is not related to this thread.)

DCausse (WMF) (talkcontribs)

You seem to use the fuzzy-subphrases profile of the completion suggester, which allows it to complete in the middle of titles. When running a completion search across multiple namespaces, the CompletionSuggester (if enabled) only uses this algorithm for the main namespace; the other namespaces are searched using the classic prefix search algorithm. This is why searching for Acro does not yield Mod:ARK Additions/Acrocanthosaurus: you have to search for ARK Additions Acro for it to work.

So indeed, to support subphrase matching in your context the CompletionSuggester would have to be adapted to support multiple namespaces; sadly it was not designed with this use case in mind. I'm not sure what the main difficulty in adapting the codebase would be, but at a glance I think the context suggester has to be used, and I fear that the assumption that only NS_MAIN is indexed is probably hard-coded in many places.

An alternative might be to change how the classic prefix search works by enabling wgCirrusSearchPrefixSearchStartsWithAnyWord. We never enabled this on WMF wikis, so I don't have much experience with how it behaves, but it might greatly help to increase recall on non-main namespaces in your case.

Note that enabling wgCirrusSearchPrefixSearchStartsWithAnyWord requires re-indexing your wiki with updateSearchIndexConfig.php.

Alex44019 (talkcontribs)

Interesting, thank you. I'll get in contact with our hosting platform provider about current Cirrus settings, and I'll set up a sandbox to test out the variable you mentioned. I might have a try at getting more familiar with the extension's internals for the CompletionSuggester (mainly for fun), but currently need to burn through my existing to-do lists...

Also... it seems the slash is required in "ARK Additions/Acro" to get article results. Dropping the slash only returns our legacy redirects. Still useful to know!

contradiction about version of Elasticsearch for 1.39

Aloist (talkcontribs)

The page Extension:CirrusSearch states:

MediaWiki 1.39+ require Elasticsearch 7.10.2

When I download the extension for Mediawiki 1.39 and look in README, it says:

Installation

------------

Get Elasticsearch up and running somewhere. Only Elasticsearch v6.8 is supported.

I would like that to be true, because I have 6.8.23. But is it true?

EBernhardson (WMF) (talkcontribs)

Unfortunately the README is wrong and the wiki page is correct. As linked in the wiki page, there is a compatibility layer that can be activated for 1.39 to talk to 6.8.23, but it is focused on ensuring write compatibility, and it's possible you would run into query issues.

Aloist (talkcontribs)

Thank you.

I face the problem of upgrading from 1.35 to 1.39 on RHEL9.

I have already established that 1.35 works with 10.5.22-MariaDB, so the database version can remain the same when I switch MW versions. I expect update.php to do the job for the database wikidb.

But having to upgrade Elasticsearch at the same time as MediaWiki is a problem.

Elasticsearch > 6.8 is not in the Red Hat repositories. I can get 8.x but not 7.10.2.

Can I have two Elasticsearch versions installed at the same time? Like one on port 9200 and another on 9250?

Are there instructions somewhere?

EBernhardson (WMF) (talkcontribs)

It's technically possible to run multiple versions of Elasticsearch on the same host, but I'm not aware of any documentation to that end. Much would depend on your available infrastructure, and in my experience it generally leads to ongoing complexities. In WMF infra we run multiple instances (of the same version) of Elasticsearch on a single host, and it has led to a number of minor problems and headaches over the years. If you have the ability to spin up virtual machines, then one plausible way forward is to spin up a new instance running the newer version. Another potential option might be to use the Docker container Elastic makes available; those are isolated enough that it should reduce the complexities of running two instances on one host.

Aloist (talkcontribs)

Can the person who created the compatibility layer found in the 1.39 version be reached?

./CirrusSearch/includes/Elastica/ES6CompatTransportWrapper.php

This person might be able to say what problems it creates.

In my wikis, I make few demands on search. All we do is the very common search for articles or for text inside articles.

May I suggest that extensions/CirrusSearch/README be updated?

Would anyone be able to tell whether 1.39 works with Elasticsearch 8.10.4-1?

Ciencia Al Poder (talkcontribs)
EBernhardson (WMF) (talkcontribs)

My teammate DCausse wrote the layer, but if you look inside you can see it is very simple. The problem this compatibility layer solves is a breaking change in the bulk write API of Elasticsearch. It doesn't do anything with search requests. In WMF production we ran the upgrade such that we had one cluster running 7.10 and one cluster running 6.8. As the code that knew how to talk to 7.10 was deployed, it would also switch its query endpoint between clusters. Only the write layer required compatibility, because it had to write to both clusters at the same time.

There is a reasonable chance it would work for most simple queries. The general problem is that when Elastic releases a major version update they make a wide variety of breaking changes (see breaking changes list for 7.0). You could test and see what happens to work, but if problems do arise I don't know if there will be much we can do to help you.

CirrusSearch does not update automatically

Davidgbc (talkcontribs)

Hi,

we are currently using Mediawiki 1.39.4, PHP 7.4.33, MariaDB 10.4.12 and Elasticsearch 7.10.2.

I was given the task of updating my company's wiki from version 1.35 to 1.39. Now I am confused about CirrusSearch and Elasticsearch (which we run locally on the server).

On the extension page of CirrusSearch it says you need Elasticsearch version 7.10.2, but in the CirrusSearch README it says that only version 6.8 is supported. Which of these is true?

I followed the steps in the README normally and the search works fine.

But when I create a new article, it is not found in Special:Search, and its content is not found either.


Please help me... :(

Ciencia Al Poder (talkcontribs)

The README is outdated. The correct version is on the wiki page.

Updates to the search index are triggered by jobs. See Manual:Job queue. Check whether jobs are running, whether they're failing, or whether there's a large backlog of jobs that may delay the indexing of new content.

Davidgbc (talkcontribs)

Thanks for the answer, but shouldn't new pages still be indexed automatically?

We have a separate department in the company that only edits wiki pages and they say it worked with the old version...

If I create a job, would I have to index the database very often? What should such a job look like so that a new page is found right away? I'm completely lost.

Ciencia Al Poder (talkcontribs)

You don't have to create jobs; they're automatically created by MediaWiki (usually after saving an edit on a page or performing any other modification) and placed on the job queue. Then, jobs are picked from the queue on subsequent page loads, or by a job runner, depending on how you configured things. See Manual:Job queue for more information.

Checking if there are stuck jobs with Manual:showJobs.php and setting custom log groups for exceptions may give you more information.
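As a complement to showJobs.php, the approximate size of the job queue is also reported by the siteinfo API (meta=siteinfo with siprop=statistics). The following is only a minimal Python sketch: the wiki URL in the usage comment is a placeholder, the function names are made up for illustration, and the reported number is an estimate rather than an exact count.

```python
import json
import urllib.request
from urllib.parse import urlencode


def job_count_from_statistics(payload: dict) -> int:
    """Extract the (approximate) job queue length from a siteinfo response."""
    return int(payload["query"]["statistics"]["jobs"])


def fetch_job_count(api_url: str) -> int:
    """Ask the wiki's api.php for site statistics and return the job count."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    with urllib.request.urlopen(f"{api_url}?{urlencode(params)}") as resp:
        return job_count_from_statistics(json.load(resp))


# Usage (hypothetical URL, replace with your own wiki's api.php):
#   print(fetch_job_count("https://wiki.example/w/api.php"))
```

If the number stays high or keeps growing after edits, the queue is probably not being processed, which would explain why new pages do not show up in search.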

insource search by default

Sphynkx (talkcontribs)

Maybe useful for somebody.

Modification in extensions/CirrusSearch/includes/Searcher.php (function buildFullTextSearch( $term ) ) for search in insource mode as default:

@@ -294,11 +294,14 @@
                // whitespace. Cirrussearch treats them both as normal whitespace, but
                // the preceding isn't appropriately trimmed.
                // No searching for nothing! That takes forever!
+               global $wgInSourceSearchDefault;
                $term = trim( str_replace( "\xE3\x80\x80", " ", $term ) );
                if ( $term === '' ) {
                        $this->searchContext->setResultsPossible( false );
                }
-
+               if ( isset( $wgInSourceSearchDefault ) && $wgInSourceSearchDefault === true ) {
+                       $term = "insource:" . $term;
+               }
                $builderSettings = $this->config->getProfileService()
                        ->loadProfileByName( SearchProfileService::FT_QUERY_BUILDER,
                                $this->searchContext->getFulltextQueryBuilderProfile() );

Also set in LocalSettings.php:

 $wgInSourceSearchDefault = true;

Feature request: it would be nice to have a checkbox on the search page for insource searching.

Ciencia Al Poder (talkcontribs)

It would probably be easier, or less prone to breaking on upgrade, to add a JavaScript gadget that automatically prepends the insource: text to the search term when the form is submitted.

Indexing WIKI after a database restore

Raoufgui (talkcontribs)

Hello,

I have two MW servers that work fine:

1 - production server

2 - backup server

I will restore the database backup from the production server onto the second server so that it can be brought into service.

The database on the production server is more recent and contains more data.

After each restore of the DB and each run of the update script:

- Should I rebuild the index from scratch on the second server?

- If not, will the new data (the difference between the two DBs) be indexed automatically, or should I run a specific script to index it?

- How can I confirm that all data is indexed on the second server and that searches will return the same results as on the first server?

NB: I'm using the CirrusSearch extension and Elasticsearch.

Thanks

DCausse (WMF) (talkcontribs)

I'm assuming here that all the MediaWiki dependencies are running on the same server: PHP, your database and Elasticsearch. If not, please be careful, especially if your Elasticsearch cluster is shared between your production and backup installations.

If this is the case, when restoring a database backup you should also reindex everything from scratch. The same way that your relational database will get erased by restoring the backup, elasticsearch also needs to be reset based on the new content of the restored database. This is the easiest and safest solution.

There is no way to ensure that the same query will return identical results on two different Elasticsearch servers; the reason is that ranking uses some stats that will certainly differ even if the documents are the same. What you could do is run some sanity checks, e.g. counting the number of indexed documents on both Elasticsearch servers to make sure that they are close.
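Such a count comparison could look roughly like the Python sketch below, which uses Elasticsearch's _count endpoint. The host names and the index name in the usage comment are assumptions (CirrusSearch index names depend on your wiki ID), and the 1% tolerance is an arbitrary choice for illustration.

```python
import json
import urllib.request


def fetch_doc_count(base_url: str, index: str) -> int:
    """Return the document count reported by GET /<index>/_count."""
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]


def counts_are_close(a: int, b: int, tolerance: float = 0.01) -> bool:
    """True when the two counts differ by at most `tolerance` (relative)."""
    if a == b:
        return True
    return abs(a - b) / max(a, b) <= tolerance


# Usage (hypothetical hosts and index name, adjust to your installation):
#   prod = fetch_doc_count("http://prod-es.example:9200", "wikidb_content")
#   backup = fetch_doc_count("http://backup-es.example:9200", "wikidb_content")
#   print(prod, backup, counts_are_close(prod, backup))
```

A large gap between the two counts points to pages that were never indexed on one side; a small gap is expected if the databases are not exactly in sync.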

Raoufgui (talkcontribs)
Reply to "Indexing WIKI after a database restore"
Novem Linguae (talkcontribs)

I was googling to see what the CompletionSuggester algorithm is. I found the page Extension:CirrusSearch/CompletionSuggester, and it says "The algorithm used to rank suggestions is still under development." Could someone knowledgeable consider updating that to describe the current algorithm? Thank you.

Novem Linguae (talkcontribs)
DCausse (WMF) (talkcontribs)

Thanks for the edit! I removed this bullet from the Limitations and added a more detailed section Ranking criteria.

many pages not indexed after restore database

Raoufgui (talkcontribs)

Hello

- I restored a database backup from my current MediaWiki server (1.28) onto my new MediaWiki server (1.39, an upgrade of 1.28).

- After the restore I get the same number of articles on the current MW (1.28) and the new MW, so I haven't lost any data.

- After the restore I rebuilt the index by running these two commands:

php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now

php forceSearchIndex.php

--> The second command shows: Indexed a total of 57624 pages at 99/second

On my new MW, when I search, some pages do not appear in the results even though they exist.

I checked the job queue and found about 11711 queued jobs of type cirrusSearchElasticaWrite, so I ran php runJobs.php from a cron job until there were 0 jobs queued.

After that some pages were indexed, but others still are not.


So:

1- After restoring a database backup from MW 1.28 to 1.39, should I rebuild the index (is it necessary)?

2- If yes, which command(s) should I run, and in which order?

3- How can I make sure that all pages are indexed and appear in search before moving MW 1.39 to the production environment?


Thank you

DCausse (WMF) (talkcontribs)

Hi,

Having jobs in cirrusSearchElasticaWrite might possibly mean that there are failures, could you check your logs (mediawiki and elasticsearch ones) to see if there are any errors?

Raoufgui (talkcontribs)

Hello,

MW and Elasticsearch don't show any errors in their log files.

When I run the Saneitize.php script, it lists many pages that are not in the index:

    Page not in index      41591 Exploit lnaswpdat003.lna
    Page not in index      41593 Reseau:Api tbs
    Page not in index      41596 Exploit rtr-delaprtr-vip.del
    Page not in index      41597 Exploit rtr-delaprtr02-phy.del
    Page not in index      41598 Exploit rtr-delaprtr03-phy.del
    ....

At the end of the run it reports "Fixed 10425 page(s) (76328 checked)".

What can I do to resolve the issue, please?


Thanks

DCausse (WMF) (talkcontribs)

Did the Saneitize.php script actually fix your problems in the end?

If not then there must be an issue with these pages preventing them from being indexed.

If you open one of these pages in a browser and add &cirrusDump to the URL, it should print what's inside Elasticsearch for this page; an empty array is shown if it is not indexed.

If it's empty, can you check that CirrusSearch is actually able to generate the document that will be indexed? For this, request api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=41591&formatversion=2 (note the pageids param).

Have you identified anything in common among the pages that are not indexed? Are they from the same namespace or the same content type?
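One way to look for a common pattern is to group the "Page not in index" lines printed by Saneitize.php by their title prefix. This is a rough Python sketch, assuming the output looks like the lines quoted earlier in this thread. The function names are made up for illustration, and treating everything before the first colon as a namespace is only a heuristic (a main-namespace title containing a colon would be miscounted).

```python
import re
from collections import Counter

# Matches lines such as: "Page not in index      41593 Reseau:Api tbs"
LINE_RE = re.compile(r"Page not in index\s+(\d+)\s+(.+)")


def parse_missing_pages(output: str) -> list:
    """Return (page_id, title) tuples for every 'Page not in index' line."""
    pages = []
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match:
            pages.append((int(match.group(1)), match.group(2).strip()))
    return pages


def count_by_namespace(pages: list) -> Counter:
    """Count pages per title prefix; titles without a colon count as main."""
    counts = Counter()
    for _, title in pages:
        prefix, sep, _rest = title.partition(":")
        counts[prefix if sep else "(main)"] += 1
    return counts
```

The page IDs returned by parse_missing_pages can then be fed one by one into the cirrusbuilddoc request to see whether they all fail in the same way.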

Hope it helps.

Raoufgui (talkcontribs)

Hello

- The Saneitize.php script fixed the problem for a small number of pages, but not for all of them.

- When I append &cirrusDump to the URL of a page that is not indexed, it prints an empty array.

I don't understand where I should put this code:

"api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=41591&formatversion=2"

- The indexed and non-indexed pages are all in the same namespace, and the non-indexed pages have nothing in common.


Thank you in advance

DCausse (WMF) (talkcontribs)

Depending on how you configured your wiki this might vary, but you can request api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=41591&formatversion=2 at https://mywikihostname/w/api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=41591&formatversion=2; this assumes that scripts are under /w (which I believe is the default).

If the Saneitize script fixed a few pages, does running it again and again fix more and more pages?
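To make the URL construction concrete, here is a small Python sketch that assembles the same request. The host name and the /w script path are the placeholders from the example above, and cirrus_build_doc_url is a made-up helper name, not part of MediaWiki.

```python
from urllib.parse import urlencode


def cirrus_build_doc_url(wiki_base: str, page_id: int, script_path: str = "/w") -> str:
    """Build the api.php URL that asks CirrusSearch to build a page's document."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "cirrusbuilddoc",
        "pageids": page_id,  # the page ID reported by Saneitize.php
        "formatversion": 2,
    }
    return f"{wiki_base.rstrip('/')}{script_path}/api.php?{urlencode(params)}"


# Placeholder host; replace with your own wiki and script path.
print(cirrus_build_doc_url("https://mywikihostname", 41591))
```

The printed URL can simply be pasted into a browser's address bar while logged in to the wiki; no code needs to be added to the wiki itself.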

Sadly I'm a bit puzzled by your problem and not sure what to look at next. I'd look more into understanding why you do not get any errors in the logs because having jobs cirrusSearchElasticaWrite means that they failed somehow and are being retried and we should have logged something somewhere (unless you mis-configured how MW logs are generated?).

Raoufgui (talkcontribs)

Hello

Thank you for the reply.

When I request the URL I get this message:

- Running the Saneitize script many times doesn't fix more and more pages.

- In the MW log I found an error like this (I don't know if it really is an error):

Would it be possible to have a meeting together to troubleshoot the issue, please?

Thanks

DCausse (WMF) (talkcontribs)

You have to understand why the ParserOutput cannot be obtained. I can see two main reasons:

  • You use a ContentHandler that does not support CirrusSearch, perhaps you enabled a new Extension, or forgot to enable one you previously had?
  • You have inconsistencies in your database causing some errors, see ParserOutputAccess.php and how it might fail. Did you run Manual:Update.php after importing your database?

We do run office hours (wikitech:Search_Platform/Contact#Office_Hours) every first Wednesday of the month if you want to get in touch with the WMF Search Platform Team, but your issue suggests that the problem is not directly related to CirrusSearch.


You could also ask for help on IRC, see MediaWiki on IRC.


Hope it helps.

Raoufgui (talkcontribs)

1- No, I just upgraded the same extensions that were installed on MW 1.28.

2- Yes, I ran update.php after importing the database. I did it like this:

backed up the database from my current MW 1.28

imported the database on the new server, which holds MW 1.35 and 1.39

ran update.php first on 1.35 and then on 1.39 (it is impossible to upgrade directly from 1.28 to 1.39, see Manual:Upgrading)

3- I tested (requested the URL) with another non-indexed page and got this message. Does this help you?

Reply to "many pages not indexed after restore database"
Novem Linguae (talkcontribs)
DCausse (WMF) (talkcontribs)

Indeed, thanks for the heads up. We initially used this page to gather feedback on this feature; I'll watch it.