Help talk:CirrusSearch


How to export search result

Bennylin (talkcontribs)

Hi, I have a search result on wiki "articles without ref tags", and I want to dump/export the list of all the titles from that search. I tried the API, but each call is limited to 500 results.

Can anyone help? Thanks in advance.

DCausse (WMF) (talkcontribs)

Hi, sadly this is not possible.

You can try to make multiple calls to the API using pagination via the API:Continue parameter to gather more than 500 results.

But there are limits there too: you can't paginate past the 10000th result.

Such limits are in place to protect the service, because even with the continue parameter Elasticsearch (the underlying search engine used by CirrusSearch) has to keep all the results from the start in memory.
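
The pagination described above can be sketched in Python roughly as follows. This is a minimal sketch, not an official client: the names `iter_search_titles` and `http_fetch`, the User-Agent string and the id.wikipedia.org endpoint are illustrative assumptions, while `list=search`, `srsearch`, `srlimit` and `sroffset` are the search API's own parameters. The loop stops after 10000 titles, since paging further is refused anyway.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

# Hard cap mentioned above: results past the 10000th cannot be paged to.
MAX_RESULTS = 10_000


def http_fetch(params: Dict[str, str],
               endpoint: str = "https://id.wikipedia.org/w/api.php") -> dict:
    """Send one API request (the endpoint is an assumption; adjust to your wiki)."""
    url = endpoint + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(
        url, headers={"User-Agent": "search-export-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_search_titles(fetch: Callable[[Dict[str, str]], dict],
                       query: str, limit: int = 500) -> Iterator[str]:
    """Yield page titles for `query`, following API continuation."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": str(limit),
        "format": "json",
    }
    seen = 0
    while True:
        data = fetch(params)
        for hit in data.get("query", {}).get("search", []):
            yield hit["title"]
            seen += 1
            if seen >= MAX_RESULTS:
                return
        cont = data.get("continue")
        if not cont:
            return
        # Merge the continuation values (e.g. sroffset) into the next request.
        params = {**params, **{k: str(v) for k, v in cont.items()}}
```

Passing `http_fetch` as the first argument runs the export against the live API; passing a stub makes the loop easy to try out offline.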

A quick note regarding your query:

-insource:<ref>

The characters < and > will be ignored and what is actually run is

-insource:ref

and thus you might exclude pages that use the word ref outside a <ref> tag, e.g. https://id.wikipedia.org/wiki/Sumber_primer.

If you want to actually search for the < and > characters, you have to use the regular expression syntax: wrap your search text in a pair of / and escape the < and > characters with a \:

-insource:/\<ref\>/

But beware that the query above might not filter pages with named references <ref name="named ref"> or pages where the reference tag is added via a template.
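
To make that caveat concrete, here is a small illustration using Python's `re` module. It only demonstrates the matching logic: Python's regex dialect is not the one CirrusSearch uses, the sample strings (including the `{{sfn}}` template call) are made up, and the broader pattern is an assumption that has not been tested as an insource: query.

```python
import re

# Made-up wikitext snippets covering the cases discussed above.
samples = {
    "plain": "Text.<ref>Source</ref>",
    "named": 'Text.<ref name="named ref">Source</ref>',
    "word_only": "See the ref desk.",
    "template": "Text.{{sfn|Smith|2001}}",
}

# Literal "<ref>" as in the regex query above.
literal = re.compile(r"<ref>")
# A broader pattern that also accepts "<ref " followed by attributes.
broader = re.compile(r"<ref[ >]")

literal_hits = {name for name, text in samples.items() if literal.search(text)}
broader_hits = {name for name, text in samples.items() if broader.search(text)}

print("literal:", sorted(literal_hits))
print("broader:", sorted(broader_hits))
```

Neither pattern sees the reference added through the template, which is the second limitation mentioned above.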

Bennylin (talkcontribs)

Thank you for the answer and the correction!

Reply to "How to export search result"

Search index update

Jonteemil (talkcontribs)

The page says that the search index is updated at least once a day. I've been trying to fix broken files over at Commons that have 0 x 0 px. I used the search fileh:0 filew:0 filetype:image -filemime:image/tiff to find them. Now, files I fixed weeks ago are still listed in the results. When will they go away?

DCausse (WMF) (talkcontribs)

Thanks for reporting this. There seems to be a problem in the way CirrusSearch handles these edits; I filed Phab:T342562 to track and fix the issue.

Jonteemil (talkcontribs)

Okay, perfect.

Reply to "Search index update"

"Words in all content pages" on Special:Statistics not updated when words are added and php ./maintenance/runJobs.php yields "Job queue is empty."

Alberto56789 (talkcontribs)

When trying to post my question here I get the ⧼abusefilter-warning-linkspam⧽ error, so I posted my full question on stackoverflow at questions/75269346 and I will post only a summary here:

I have installed Cirrus, Elastica and ElasticSearch as per the instructions, but no matter what I do (for example php ./maintenance/runJobs.php, php maintenance/updateSpecialPages.php), the number of words on the statistics page never updates.

How can I get that to update? Thanks!

DCausse (WMF) (talkcontribs)
216.246.250.184 (talkcontribs)

Thanks! Wow that was driving me nuts!

Reply to ""Words in all content pages" on Special:Statistics not updated when words are added and php ./maintenance/runJobs.php yields "Job queue is empty.""

i want all existing templates

Wladek92 (talkcontribs)

Hi all, going to https://www.mediawiki.org/w/index.php?search=%2A&title=Special:Search&profile=advanced&fulltext=1&ns10=1 I want all existing templates, i.e. the titles of all pages in the Template: namespace. After selecting only this namespace from the drop-down list, I tried several forms without success: 1. with no search string I get no results; 2. with the wildcard '*' I only get the template named *.

So what is the syntax for this elementary request, "give me all page titles in the Template: namespace"? Thanks -- Christian 🇫🇷 FR (talk) 07:03, 27 June 2023 (UTC)

TheDJ (talkcontribs)

Search cannot do that. That's what the api or quarry is for.

Tacsipacsi (talkcontribs)
Cpiral (talkcontribs)

That is a feature that I too once wanted: a list of page titles matching some query. Instead I settled on storing the search result as text, and then using my text-processing skills to extract the titles.


In your case it works to first capture the search result of prefix: template: to a file.


Then you can grep the titles and sort them alphabetically.

TheDJ (talkcontribs)

Again, this is not what you are supposed to use search for. If you want a list, you should use something made to generate lists, like Special:AllPages, database dumps or quarry. Search is fuzzy; it's optimised to find words, not to generate lists.


This is an example to get the first 50 template names on mediawiki.org which are not redirects and not deleted:

https://quarry.wmcloud.org/query/74910


And when lists get really big, you will HAVE to use pagination. There is no way around this as WMF properties generally are very big properties.
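
For the API route, paginating over all non-redirect pages in the Template: namespace could look roughly like this in Python. It is a sketch under assumptions: `iter_template_titles` and the `fetch` callable (anything that sends the parameters to the wiki's api.php and returns the decoded JSON) are hypothetical names, while `list=allpages`, `apnamespace`, `apfilterredir`, `aplimit` and `apcontinue` are the API's own parameters. Namespace 10 is Template:, matching the ns10=1 in the URL above.

```python
from typing import Callable, Dict, Iterator


def iter_template_titles(fetch: Callable[[Dict[str, str]], dict],
                         limit: int = 500) -> Iterator[str]:
    """Yield every non-redirect title in the Template: namespace."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": "10",              # Template: namespace
        "apfilterredir": "nonredirects",  # skip redirects
        "aplimit": str(limit),
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            return
        # Carry apcontinue (and the generic continue value) into the next call.
        params = {**params, **{k: str(v) for k, v in cont.items()}}
```

Because the function is a generator, the titles can be written to a file as they arrive instead of being held in memory all at once.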

Reply to "i want all existing templates"

Synonyms

2001:1711:FA4B:D10:B1BE:F13C:8327:704F (talkcontribs)

Hi,

Any profile example on how we can use a synonym file with CirrusSearch and Elastic ?

Thanks

EBernhardson (WMF) (talkcontribs)

Unfortunately synonyms aren't something CirrusSearch has any support for. It's been in the background as something to work on, but we need to come up with a solution that works in hundreds of languages and likely defers the actual synonym definition to wiki editors rather than system administrators.

While not exactly synonyms, on the WMF wikis we rely on redirects to pages to provide alternate names for them. In most cases where wiki search externally appears to have used synonyms what actually happened was there was a redirect to the page giving alternate titles (that are used as a fairly strong ranking signal).

Aparolini (talkcontribs)

Thanks for the feedback.

Because Elasticsearch does support synonyms as a filter and Cirrus is really just a bridge to Elastic, I was hoping we could work this out with profiles, such as

'default' => [
    'builder_class' => Query\FullTextQueryStringQueryBuilder::class,
    'settings' => [
        'filter' => [
            'type' => 'synonym',
            'settings' => [
                'synonyms_path' => 'my_synonyms.txt',
                'updateable' => 'true',
            ],
        ],
    ],
],

Synonyms are important to us (medical wiki). For instance, if you look for, say, "audition", you should find not only pages with "audition" in them, but also pages with "hear" or "malleus" (a small bone inside the ear).

Editing the pages to add synonyms is not an option for us, as this would add a lot of work for page producers.

Reply to "Synonyms"

Fatal error: UpdateSearchIndexConfig.php

189.254.175.249 (talkcontribs)

Hi, I am running the page:UpdateSearchIndexConfig.php and it is throwing the following error:

Fatal error: Cannot declare interface CirrusSearch\Maintenance\Printer, because the name is already in use in /mnt/remoto/lamp/operacion/extensions/CirrusSearch/includes/Maintenance/Printer.php on line 5

I have media wiki 1.39.1, Cirrus Search 6.5.4, Elastic Search 7.10.2


Does anyone have an idea what it could be?

Thanks

DCausse (WMF) (talkcontribs)

At a glance I'd say that the CirrusSearch classes get loaded twice, but I have no clue how this could happen. Could you clarify how you are running the "page:UpdateSearchIndexConfig.php"?

Where did you install "Cirrus Search 6.5.4" from? CirrusSearch should follow the same version as MediaWiki, so I'm curious to know where you fetched a 6.5.4 version of CirrusSearch (6.5.4 sounds more like a version of Elasticsearch).

Using MW 1.39.1 you need to fetch a compatible CirrusSearch version by selecting 1.39 here: Special:ExtensionDistributor/CirrusSearch. Or via git by using the REL1_39 branch.

Rsilvamty (talkcontribs)

Hi, I downloaded the extension for MediaWiki 1.39 from:

Special:ExtensionDistributor/CirrusSearch


I got the version number of CirrusSearch from the file extension.json in extensions\CirrusSearch


I run the file UpdateSearchIndexConfig.php from console with command:

php UpdateSearchIndexConfig.php


The UpdateSearchIndexConfig.php file includes this file: require_once __DIR__ . '/../includes/Maintenance/Maintenance.php';

That is where I think the error is coming from.


The CirrusSearch\includes\Maintenance\Maintenance.php file calls the Printer.php file


part of the Maintenance.php file:

namespace CirrusSearch\Maintenance;

use CirrusSearch\Connection;
use CirrusSearch\MetaStore\MetaStoreIndex;
use CirrusSearch\SearchConfig;
use CirrusSearch\UserTestingEngine;
use MediaWiki\MediaWikiServices;
use MediaWiki\Settings\SettingsBuilder;

// Maintenance class is loaded before autoload, so we need to pull the interface
require_once (__DIR__ . '/Printer.php');

abstract class Maintenance extends \Maintenance implements Printer


And all the content of the file Printer.php (line 5 is in bold)

namespace CirrusSearch\Maintenance;

interface Printer {
        public function output( $message, $channel = null );

        public function outputIndented( $message );

        public function error( $err, $die = 0 );
}

Rsilvamty (talkcontribs)

I have tried some options, for example:

- Rename the namespace

- Put the Printer interface code in the Maintenance.php file

- Change the name Maintenance to Maintenance2 in the line where the abstract class is declared

All unsuccessful so far.

DCausse (WMF) (talkcontribs)

Thanks for the details, but unfortunately I'm not sure I understand under which conditions you could get this problem.

Perhaps the PHP engine tries to load the Printer interface from two different files (which is why require_once would not prevent it from being loaded twice). Or it believes the files are different?

Perhaps try to add some logging statement in global scope of the Printer.php file to try to understand from where it's loaded initially, something along those lines:

<?php

namespace CirrusSearch\Maintenance;

try {
        throw new \Exception( __FILE__ );
} catch ( \Exception $e ) {
        print( $e );
}

interface Printer {
        public function output( $message, $channel = null );

        public function outputIndented( $message );

        public function error( $err, $die = 0 );
}
200.94.128.249 (talkcontribs)

I checked the entire wiki directory for a duplicate file, but only one Printer.php exists, and it is only loaded from CirrusSearch/includes/Maintenance/Maintenance.php.


I then checked whether CirrusSearch/includes/Maintenance/Maintenance.php is loaded from other files. It is loaded from 18 files in CirrusSearch, among them:

- IndexNamespaces.php (used in UpdateSearchIndexConfig.php)

- Metastore.php (used in CirrusSearch/includes/Maintenance/Maintenance.php)

I commented out this line in those 2 files:

require_once __DIR__ . '/../includes/Maintenance/Maintenance.php';

And then it worked.


This is strange, because I had already gone through this process on the same server (PHP 8.1.14) a couple of months ago, in another folder, and it worked without problems. Now I was repeating it to test all the steps I did the first time (I migrated from MediaWiki 1.11 to 1.35 and then to 1.39), and I get this error.


Thanks, the try/catch instruction helped me.

Reply to "Fatal error: UpdateSearchIndexConfig.php"
Justin C Lloyd (talkcontribs)

I'm currently working on testing CirrusSearch with AWS Elasticsearch in my dev environment, but first had to implement (AWS Elasticache) Redis for the job queue. However, I was recently told by someone on the Search team (I apologize, I forget who) that this may not be necessary unless there are perhaps hundreds of thousands to millions of jobs as WMF has. I've seen at most maybe 100-120k (aggregate for my five wikis) but mine are usually 100s, occasionally 1000s, and even more rarely 10s of thousands, which were otherwise handled fine in MySQL. So is it really necessary for CirrusSearch to have the job queues in Redis at that level?

Ciencia Al Poder (talkcontribs)

Using match_phrase_prefix

2001:1711:FA4B:D10:1163:390A:525:F58B (talkcontribs)

Hi. MediaWiki 1.38.2 with CirrusSearch generates Elasticsearch queries using "query_string". How do I make Cirrus use "match_phrase_prefix" instead?

This would allow me to find pages using partial keywords. Example: "Cirr" would return pages containing "Cirrus".

Any ideas ? Thanks.

EBernhardson (WMF) (talkcontribs)

Within cirrus we don't have anything that directly supports match_phrase_prefix. We generally avoid this style of query as it provides queries that give unexpected outputs that can change depending on which replicas of the index it lands on. In particular there is no guarantee with match_phrase_prefix that "cirr" will return pages with "Cirrus" inside of them. Instead it will look at term dictionaries and select a number of words somewhat arbitrarily that start with cirr and then search for those words. Depending on the exact term statistics in the replica it lands on this can choose a different set of words to search for when repeating the same query.

While I would generally suggest avoiding it, the existing query_string queries do support this style of query. You can achieve the same functionality by appending a *, such as cirr*
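
As an illustration of the trailing-wildcard form, here is a small Python helper that appends a * to the last word of a query before it is sent to search. It is only a sketch: `add_prefix_wildcard` is a hypothetical name, it is not part of CirrusSearch, and the rules for when to leave a query untouched are assumptions.

```python
def add_prefix_wildcard(query: str) -> str:
    """Append * to the last word unless it already carries search syntax."""
    stripped = query.rstrip()
    if not stripped:
        return stripped
    last = stripped.split()[-1]
    # Leave quoted phrases, existing wildcard or fuzzy markers,
    # and keyword:value terms (e.g. insource:foo) untouched.
    if last.endswith(("*", "~", '"')) or ":" in last:
        return stripped
    return stripped + "*"


if __name__ == "__main__":
    for q in ["cirr", "cirr*", '"exact phrase"', "insource:ref"]:
        print(repr(q), "->", repr(add_prefix_wildcard(q)))
```

The same caveat applies as for any wildcard query: which words the prefix expands to depends on the index, so results can be broader than expected.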

2001:1711:FA4B:D10:1029:E025:F952:194F (talkcontribs)

Thanks for your answer.

It's a Swiss-French medical wiki used by doctors, with a lot of long words. Our basic users don't know Elastic tricks like "*" or "~".

E.g. "Prostatectomie" should be found just by entering "Prostat".

So if we cannot use match_phrase_prefix, can we append the final "*" by default to all searches with Cirrus?

DCausse (WMF) (talkcontribs)

To customize the main full text search query you can implement your own \CirrusSearch\Query\FullTextQueryBuilder implementation and register it in the wgCirrusSearchFullTextQueryBuilderProfiles config var, see some examples for other builder profiles here. Then you can activate this new profile as the default by setting wgCirrusSearchFullTextQueryBuilderProfile to its name.

You have some examples of how to implement a FullTextQueryBuilder here.

Note that doing this is not trivial, but I think it is the only way to achieve what you want without teaching your users to use the search syntax.

Reply to "Using match_phrase_prefix"

UpdateSearchIndexConfig.php not working?

Summary by DCausse (WMF)
Gamebrew (talkcontribs)

I haven't been able to figure out why UpdateSearchIndexConfig.php isn't working on the latest MediaWiki 1.39. Could someone help me a bit?


Error Log:

php /extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php


Updating cluster ...
indexing namespaces...
mw_cirrus_metastore missing, creating new metastore index.
Creating metastore index... mw_cirrus_metastore_first   Scanning available plugins...none
Elastica\Exception\ResponseException from line 178 of /public/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php:
#0 /public/extensions/Elastica/vendor/ruflin/elastica/src/Request.php(178): Elastica\Transport\Http->exec()
#1 /public/extensions/Elastica/vendor/ruflin/elastica/src/Client.php(513): Elastica\Request->send()
#2 /public/extensions/Elastica/vendor/ruflin/elastica/src/Index.php(655): Elastica\Client->request()
#3 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(201): Elastica\Index->request()
#4 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(139): CirrusSearch\MetaStore\MetaStoreIndex->createNewIndex()
#5 /public/extensions/CirrusSearch/includes/Maintenance/Maintenance.php(227): CirrusSearch\MetaStore\MetaStoreIndex->createIfNecessary()
#6 /public/extensions/CirrusSearch/maintenance/IndexNamespaces.php(40): CirrusSearch\Maintenance\Maintenance->maybeCreateMetastore()
#7 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(72): CirrusSearch\Maintenance\IndexNamespaces->execute()
#8 /public/maintenance/includes/MaintenanceRunner.php(309): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
#9 /public/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#10 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(117): require_once('/home/nginx/dom...')
#11 {main}


My Mediawiki Info:

Product Version
Mediawiki 1.39.1
PHP 7.4.33 (fpm-fcgi)
MariaDB 10.3.37-MariaDB
ICU 62.2
Pygments 2.11.2
Elasticsearch 7.10.2
CirrusSearch 6.5.4 (e15ac38) 06:42, January 10, 2023 GPL-2.0-or-later
Elastica 6.2.0 (1baee3b) 06:13, December 4, 2022

Elastic is working fine at my end.

curl -XGET 'localhost:9200'

{
  "name" : "node",
  "cluster_name" : "nodecluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

46.193.3.148 (talkcontribs)

Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?

Ciencia Al Poder (talkcontribs)