Help talk:CirrusSearch

46.193.3.148 (talkcontribs)

Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?

Ciencia Al Poder (talkcontribs)
Reply to "Elastic and Cirrus Search"
Jonteemil (talkcontribs)

Hello!

Some of the filenames of these files aren't complete for some reason. Also, you would expect this search to say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)"; however, it doesn't. Might anyone here know why?

DCausse (WMF) (talkcontribs)

Hi,

I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac. Do you have a specific page that is missing?

The second query you pasted contains an error. After fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query:

intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/.

When searching for a redirect, the search engine will always display the redirected page. Sometimes you may see a hint that you matched a redirect, when the mention (redirect from: page_name) appears after the page title; see for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".

Speravir (talkcontribs)

DCausse, the filenames are only partially displayed. From the first search Jonteemil provided, it seems there is a maximum display length (some character limit), and the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.

But with an altered search I get the full display, although it links to the redirected file with only one file extension, as already pointed out by DCausse: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/. Note that I merged both regex searches, as a query with two of them is really bad in terms of server load (I also added the namespace as search domain; this should always be added if possible).

Reply to "2 questions"

cirrussearch vs database backup dumps

69.191.241.48 (talkcontribs)

Hi - there are two types of dumps available for enwiki pages: a monthly database dump structured in XML, which you can subscribe to, and weekly cirrussearch dumps, which are structured in JSON for bulk upload to Elasticsearch. We're trying to diff the two dumps to see if they're comparable, but we notice that some articles are in the monthly XML dump but not in the weekly cirrussearch dump. I'm having trouble finding an explanation on the main Wikimedia site that clearly states the difference between these two enwiki dumps. Any additional information would be much appreciated.

I would post links, but I am getting an error when trying to post them, so please navigate to dumps.wikimedia.org and look for these paths:

cirrussearch dump: /other/cirrussearch/

xml dump: /enwiki/latest/

Ciencia Al Poder (talkcontribs)

Check whether the "missing" articles in the cirrussearch dumps exist on the live wiki. If not, those articles were deleted after the monthly XML dump was taken but before the weekly cirrussearch dump.
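The comparison itself could be sketched roughly as follows in Python. This is only a minimal sketch: it assumes the cirrussearch dump uses the Elasticsearch bulk layout (an action line followed by a document line that carries a "title" field), and the function names and sample data are made up for illustration.

```python
import json


def titles_from_cirrus_bulk(lines):
    """Collect page titles from bulk-format lines (action line, then document line)."""
    titles = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Action lines such as {"index": {...}} carry no page fields; skip them.
        if "title" in record:
            titles.add(record["title"])
    return titles


def missing_from_cirrus(xml_titles, cirrus_titles):
    """Titles present in the XML dump but absent from the cirrussearch dump."""
    return sorted(set(xml_titles) - set(cirrus_titles))


# Hypothetical sample data standing in for the two dumps.
sample_bulk = [
    '{"index": {"_type": "page", "_id": "1"}}',
    '{"title": "Alpha", "namespace": 0}',
    '{"index": {"_type": "page", "_id": "2"}}',
    '{"title": "Beta", "namespace": 0}',
]
cirrus_titles = titles_from_cirrus_bulk(sample_bulk)
print(missing_from_cirrus(["Alpha", "Beta", "Gamma"], cirrus_titles))
```

The titles it reports are the candidates to look up on the live wiki. A real comparison would probably also need to take the namespace into account, since the same title can exist in several namespaces.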

Reply to "cirrussearch vs database backup dumps"
Nicolas senechal (talkcontribs)

I tried -incategory:"Actif", but the results I get are those of an include. So I tried it with !, and that doesn't work either (-! and !- behave the same as -). I also tried to include Actif with incategory:"Actif", and that doesn't work.

This isn't normal... how can I fix it?

Any help is appreciated.

DCausse (WMF) (talkcontribs)

The syntax you are using is correct and should work:

-incategory:"Cartographie"

is searching for pages that are not directly under the Cartographie category.


It might be that you are expecting incategory to find articles belonging to the tree of subcategories?

This is not the case, incategory is limited to direct relationships.

For finding articles in a category and the subcategories of this category you must use the deepcat keyword: deepcat:"Cartographie".


This keyword is available on WMF wikis.

Nicolas senechal (talkcontribs)

Thank you @DCausse (WMF), but it still doesn't work.

Actif is directly linked.

I tried the same again with deepcat, but it doesn't work with deepcat alone either.

DCausse (WMF) (talkcontribs)

If you are using a publicly accessible wiki, could you provide the link to it so that we can have a closer look?

If you are using your own private wiki, could you double-check that CirrusSearch is properly installed? To do this, I generally append &cirrusDumpQuery to the URL of the search results page.

This is how it looks on the French Wikipedia:

https://fr.wikipedia.org/w/index.php?search=-incategory%3ACartographie&title=Sp%C3%A9cial%3ARecherche&ns0=1&cirrusDumpQuery


This should display the JSON document sent to Elasticsearch, and we might be able to detect what is wrong in your setup, in particular whether or not the match query on the category.lowercase_keyword field is wrapped inside a must_not block.

The deepcat keyword requires a graph engine to be installed and set up; this most probably explains why it does not work out of the box.
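The must_not check on the cirrusDumpQuery output can also be done programmatically. The following Python sketch walks the JSON and reports whether every match on category.lowercase_keyword sits below a must_not clause; the function names and the sample document are illustrative and not part of CirrusSearch.

```python
import json

FIELD = "category.lowercase_keyword"


def find_field_contexts(node, field=FIELD, path=()):
    """Yield the key path leading to every occurrence of `field` in the query."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == field:
                yield path
            yield from find_field_contexts(value, field, path + (key,))
    elif isinstance(node, list):
        for item in node:
            yield from find_field_contexts(item, field, path)


def is_excluded(dump, field=FIELD):
    """True if the field occurs and every occurrence sits below a must_not clause."""
    contexts = list(find_field_contexts(dump, field))
    return bool(contexts) and all("must_not" in ctx for ctx in contexts)


# Trimmed-down, hypothetical excerpt of a cirrusDumpQuery document.
sample = json.loads("""
{"query": {"bool": {"filter": [{"bool": {"must_not": [
  {"bool": {"should": [
    {"match": {"category.lowercase_keyword": {"query": "Cartographie"}}}
  ]}}
]}}]}}}
""")
print(is_excluded(sample))
```

If this prints False for a -incategory: search, the keyword was either not parsed at all (the field is absent) or it ended up in a positive clause.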

Nicolas senechal (talkcontribs)

OK - after checking my install, I have one question:

Do I only need the Elastica extension for MediaWiki and the cURL library in order to run CirrusSearch?

If not, how can I download the Elastica library for PHP for XAMPP on Windows?

Nicolas senechal (talkcontribs)

After some time I retried downloading Elasticsearch and now it works, but incategory doesn't really work (for me, !incategory: and -incategory: both include; I get no exclude, and operators like OR, AND, NOT don't work).

To run Elasticsearch I have to run a command in cmd on Windows, but that requires a cmd window running constantly. I searched the internet and found the command START /B. So now I have Elasticsearch running in a hidden cmd window. To stop it I just have to close my regular cmd window.


So here is the result with &cirrusDumpQuery for !incategory:"Actif" -incategory:"Sql"

{
    "__main__": {
        "description": "full_text search for '!incategory:\"Actif\" -incategory:\"Sql\"'",
        "path": "test\/page\/_search",
        "params": {
            "timeout": "20s",
            "search_type": "dfs_query_then_fetch"
        },
        "query": {
            "_source": [
                "namespace",
                "title",
                "namespace_text",
                "wiki",
                "redirect.*",
                "timestamp",
                "text_bytes"
            ],
            "stored_fields": [
                "text.word_count"
            ],
            "query": {
                "bool": {
                    "minimum_should_match": 1,
                    "should": [
                        {
                            "query_string": {
                                "query": "!incategory\\: (all.plain:\"Actif\"~0^1)",
                                "fields": [
                                    "all.plain^1",
                                    "all^0.5"
                                ],
                                "phrase_slop": 0,
                                "default_operator": "AND",
                                "allow_leading_wildcard": true,
                                "fuzzy_prefix_length": 2,
                                "rewrite": "top_terms_boost_1024"
                            }
                        },
                        {
                            "multi_match": {
                                "fields": [
                                    "all_near_match^2",
                                    "all_near_match.asciifolding^1.5"
                                ],
                                "query": "!incategory:"
                            }
                        }
                    ],
                    "filter": [
                        {
                            "bool": {
                                "must": [
                                    {
                                        "terms": {
                                            "namespace": [
                                                0,
                                                4,
                                                6,
                                                14
                                            ]
                                        }
                                    }
                                ],
                                "must_not": [
                                    {
                                        "bool": {
                                            "should": [
                                                {
                                                    "match": {
                                                        "category.lowercase_keyword": {
                                                            "query": "Sql"
                                                        }
                                                    }
                                                }
                                            ]
                                        }
                                    }
                                ]
                            }
                        }
                    ]
                }
            },
            "highlight": {
                "pre_tags": [
                    "\ue000"
                ],
                "post_tags": [
                    "\ue001"
                ],
                "fields": {
                    "title": {
                        "type": "fvh",
                        "number_of_fragments": 0,
                        "order": "score",
                        "matched_fields": [
                            "title",
                            "title.plain"
                        ]
                    },
                    "redirect.title": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 10000,
                        "matched_fields": [
                            "redirect.title",
                            "redirect.title.plain"
                        ]
                    },
                    "category": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 10000,
                        "matched_fields": [
                            "category",
                            "category.plain"
                        ]
                    },
                    "heading": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 10000,
                        "matched_fields": [
                            "heading",
                            "heading.plain"
                        ]
                    },
                    "text": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 150,
                        "no_match_size": 150,
                        "matched_fields": [
                            "text",
                            "text.plain"
                        ]
                    },
                    "auxiliary_text": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 150,
                        "matched_fields": [
                            "auxiliary_text",
                            "auxiliary_text.plain"
                        ]
                    },
                    "file_text": {
                        "type": "fvh",
                        "number_of_fragments": 1,
                        "order": "score",
                        "fragment_size": 150,
                        "matched_fields": [
                            "file_text",
                            "file_text.plain"
                        ]
                    }
                },
                "highlight_query": {
                    "query_string": {
                        "query": "!incategory\\: (title.plain:\"Actif\"~0^20 OR redirect.title.plain:\"Actif\"~0^15 OR category.plain:\"Actif\"~0^8 OR heading.plain:\"Actif\"~0^5 OR opening_text.plain:\"Actif\"~0^3 OR text.plain:\"Actif\"~0^1 OR auxiliary_text.plain:\"Actif\"~0^0.5 OR file_text.plain:\"Actif\"~0^0.5)",
                        "fields": [
                            "title.plain^20",
                            "redirect.title.plain^15",
                            "category.plain^8",
                            "heading.plain^5",
                            "opening_text.plain^3",
                            "text.plain^1",
                            "auxiliary_text.plain^0.5",
                            "file_text.plain^0.5",
                            "title^10",
                            "redirect.title^7.5",
                            "category^4",
                            "heading^2.5",
                            "opening_text^1.5",
                            "text^0.5",
                            "auxiliary_text^0.25",
                            "file_text^0.25"
                        ],
                        "phrase_slop": 1,
                        "default_operator": "AND",
                        "allow_leading_wildcard": true,
                        "fuzzy_prefix_length": 2,
                        "rewrite": "top_terms_boost_1024"
                    }
                }
            },
            "stats": [
                "full_text",
                "full_text_querystring",
                "complex_query",
                "incategory",
                "query_string"
            ],
            "rescore": [
                {
                    "window_size": 8192,
                    "query": {
                        "query_weight": 1,
                        "rescore_query_weight": 1,
                        "score_mode": "multiply",
                        "rescore_query": {
                            "function_score": {
                                "functions": [
                                    {
                                        "field_value_factor": {
                                            "field": "incoming_links",
                                            "modifier": "log2p",
                                            "missing": 0
                                        }
                                    },
                                    {
                                        "weight": 0.1,
                                        "filter": {
                                            "terms": {
                                                "namespace": [
                                                    4
                                                ]
                                            }
                                        }
                                    },
                                    {
                                        "weight": 0.2,
                                        "filter": {
                                            "terms": {
                                                "namespace": [
                                                    6,
                                                    14
                                                ]
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ],
            "size": 21
        },
        "options": {
            "timeout": "20s",
            "search_type": "dfs_query_then_fetch"
        }
    }
}

DCausse (WMF) (talkcontribs)

Hi,

The proper syntax is -incategory:Sql, and the JSON output you pasted shows that it works, since it contains this section:

"must_not": [
    {
        "bool": {
            "should": [
                {
                    "match": {
                        "category.lowercase_keyword": {
                            "query": "Sql"
                        }
                    }
                }
            ]
        }
    }
]
Nicolas senechal (talkcontribs)

Thank you very much for your patience and your response!

But why does it work without quotes to exclude and work with quotes to include?

DCausse (WMF) (talkcontribs)

incategory:"sql" and incategory:sql should produce the same query.

Similarly, -incategory:"sql" and -incategory:sql should also produce the same query. If not, please try to identify what differs in the cirrusDumpQuery output.
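One way to spot what differs between two cirrusDumpQuery outputs is to flatten both JSON documents into path/value pairs and compare them. The Python sketch below illustrates the idea; the helper names and the two sample documents are hypothetical.

```python
ABSENT = "<absent>"


def flatten(node, prefix=""):
    """Map every leaf of a nested JSON structure to a dotted path."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        child = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, child))
    return flat


def diff_dumps(left, right):
    """Return {path: (left_value, right_value)} for every path that differs."""
    a, b = flatten(left), flatten(right)
    return {
        path: (a.get(path, ABSENT), b.get(path, ABSENT))
        for path in sorted(set(a) | set(b))
        if a.get(path, ABSENT) != b.get(path, ABSENT)
    }


# Hypothetical excerpts of two dumps, e.g. with and without quotes.
with_quotes = {"must_not": [{"match": {"category": {"query": "sql"}}}]}
without_quotes = {"must_not": [{"match": {"category": {"query": "Sql"}}}]}
for path, (old, new) in diff_dumps(with_quotes, without_quotes).items():
    print(path, old, new)
```

Empty objects and arrays produce no entries in this sketch, so a clause that is present but empty would not show up in the diff.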

Reply to "-incategory don't work"

Code used to parse the text as in cirrus dump

80.12.85.103 (talkcontribs)

Could someone provide the code or the means to parse the wikicode the way it is done for the "text" attribute of the Elasticsearch pages, or point me to where I can find out more? I've been working on NLP dataset generation from Wikipedia dumps, but I can't get satisfactory results with most of the parsers I've tested (mwparserfromhell, wikitextparser, mediawiki-parser). I need the same text as in the cirrus dump, but with the internal links kept. Thank you for any information!

EBernhardson (WMF) (talkcontribs)

The text used in the CirrusSearch dumps comes from the allText value created by WikiTextStructure::extractWikitextParts.

For the most part the processing takes the html output from mediawiki's wikitext parser, strips out elements matching a set of css selectors identifying some of the non-content and auxiliary parts of a page, and then strips all the tags out of the remaining content.

Unfortunately I'm not aware of a way to get the bulk HTML content of the wiki. You may need to use the MediaWiki parser, and that may still run into difficulties depending on template and Lua usage.
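To illustrate the kind of processing described above (drop elements matching an exclusion list, then strip the remaining tags), here is a minimal Python sketch. It is not the CirrusSearch code: the class list is a hypothetical stand-in for the real selectors, it matches on class names only, and it assumes reasonably well-formed HTML. Unlike the dump, it also records link targets, since that was part of the question.

```python
from html.parser import HTMLParser

# Hypothetical exclusion list; the real selectors are defined in MediaWiki/CirrusSearch.
EXCLUDED_CLASSES = {"navbox", "mw-editsection"}
# Void elements have no closing tag, so they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.links = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self.skip_depth or classes & EXCLUDED_CLASSES:
            self.skip_depth += 1
            return
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract(html):
    """Return (plain text with collapsed whitespace, list of kept link targets)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()), parser.links


sample = ('<p>Paris is the capital of <a href="/wiki/France">France</a>.'
          '<span class="mw-editsection">[edit]</span></p>'
          '<div class="navbox"><a href="/wiki/X">nav</a></div>')
print(extract(sample))
```

The input would still have to be HTML produced by the MediaWiki parser, so this only covers the stripping step, not the rendering of templates or Lua modules.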

Reply to "Code used to parse the text as in cirrus dump"

Need to upgrade elastic search Library log4j-1.2-api-2.11.1.jar

Summary by Ciencia Al Poder

Stop creating duplicate posts. Topic:Wm67mprhel2mv59q

Pooja2425 (talkcontribs)

Hi Team,

We are using the following:

MediaWiki 1.35.3
PHP 7.4.23 (apache2handler)
MySQL 8.0.26
Lua 5.1.5
Elasticsearch 6.5.4

/usr/share/elasticsearch/lib/log4j-1.2-api-2.11.1.jar

log4j-api-2.11.1.jar

log4j-core-2.11.1.jar

x-pack-security/log4j-slf4j-impl-2.11.1.jar

Please provide us with a patch that brings log4j to a version higher than 2.15.0.

TheDJ (talkcontribs)

We do not provide Elasticsearch. MediaWiki only uses it. Please contact elastic.co itself. Or just restart Elasticsearch with the variable that disables the affected functionality. This is widely documented, but it looks something like -Dlog4j2.formatMsgNoLookups=true

Pooja2425 (talkcontribs)

Thanks @TheDJ for the help.

Please let me know whether I only need to set -Dlog4j2.formatMsgNoLookups=true in etc/elasticsearch/jvm.options, or whether anything else needs to be done.

We are using Elasticsearch v6.5.4 and Java 1.8.0.

Or please point me to a link where I can confirm all this.

Thanks

Pooja2425 (talkcontribs)
TheDJ (talkcontribs)

Please contact elastic.co

Summary by Wkee4ager

Referred to phabricator

Wakelamp (talkcontribs)

Is there a sandbox to test in, so that I don't affect performance?

Wkee4ager (talkcontribs)

What are you planning to test? Abuluntu (talk) 12:46, 1 November 2021 (UTC)

Wakelamp (talkcontribs)

No, I am not doing a performance load test :-) I just saw the performance issue affecting others. Have many people screwed up? I assume you have some sort of timeout. Is that a switch I can set lower? Anyway, I am trying to work out ways of getting what I am after some other way.

Wkee4ager (talkcontribs)

I’m not sure if I’m the right one to ask. If there is a performance issue it’s probably discussed at phabricator. Wish you the best of luck!

EBernhardson (WMF) (talkcontribs)

If you are referring to the WMF wikis, when the search system cuts off a query for performance reasons (typically because it took too long to execute), that's normal and expected behaviour that shouldn't negatively impact others (assuming you aren't a bot making many parallel queries). Overall you shouldn't need to worry about it, beyond pondering how to construct a query that doesn't time out.

If you are trying to construct a query and it keeps timing out, could you post some details of the information you are trying to retrieve? Someone might be able to point out a more efficient way to get the same information.

Problem indexing PDF documents that include Cyrillic characters

LveFunc (talkcontribs)

Hello guys,


Lately I've been working on deploying a local MediaWiki. Everything went smoothly until it came to indexing the contents of PDF files that contain characters other than US-ASCII. Doing '?action=cirrusDump' and looking at the 'file_text' field shows that all Cyrillic characters are dropped while Latin characters are preserved. The folks at ru.wikipedia.org somehow managed to do it, but I couldn't find a solution online. I would be very thankful if somebody could point out why this happens and how I could potentially solve it.

My configuration is:

MediaWiki - 1.36.1

PHP - 7.4.22 (apache2handler)

PostgreSQL - 13.3

ICU - 66.1

Elasticsearch - 6.5.4

PDF Handler - c9705a8

AdvancedSearch - c8a42b8

CirrusSearch - 6.5.4 (ab802b7)

Elastica - 6.1.3 (9f6e66a)

My elasticsearch configuration:

analysis-icu

extra MediaWiki plugin

ingest-attachment

DCausse (WMF) (talkcontribs)

Hi,

CirrusSearch does not manipulate the text it receives from Extension:PdfHandler. I would check whether this extension is working properly, and in particular whether the tool it depends on (set via $wgPdftoText, likely to be pdftotext) is properly extracting the text you expect.
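A quick way to test the extraction step outside MediaWiki is to run the tool by hand and count the Cyrillic characters in its output. The Python sketch below assumes that pdftotext is on the PATH and that it is the tool configured in $wgPdftoText; the function names and the file name in the usage note are made up.

```python
import subprocess
import unicodedata


def extract_pdf_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext and return its output; the trailing '-' sends the text to stdout."""
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_cyrillic(text):
    """Count characters whose Unicode name marks them as Cyrillic."""
    return sum(1 for ch in text if "CYRILLIC" in unicodedata.name(ch, ""))


print(count_cyrillic("Привет, wiki"))
```

For a real check one would call count_cyrillic(extract_pdf_text("sample.pdf")) on a PDF known to contain Cyrillic text. A result of 0 would suggest that the characters are already lost at the extraction step, before CirrusSearch sees the text.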

DonSimon (talkcontribs)

How can I exclude edits by certain users (say, from a given list of users) from the search view via Cirrus?

It would be useful for patrolling recent changes when, for example, you don't want to see the two most active users right now.

PerfektesChaos (talkcontribs)

Unsatisfying answer: Search can analyze content only, but not metadata.

  • The adjoined metadata of a page would be timestamp, user, summary etc.
  • The content is the visible text of the rendered page, or source text of the page itself without transclusions.

For recent changes a search within all articles or pages within the wiki is the wrong tool.

  • There might be some gadgets already existing, which will filter the recent change list by a vector of trusted users, to focus on less well known people. At least a gadget programmer could easily remove those by screen grabbing. This one could do that, among many other things, but is not focussing on such list of trusted users.
  • By MediaWiki software there are options on watchlist and recent changes (which are based on the identical software) to suppress bots, registered users, or minor edits (unsafe).
Reply to "Exclude edits by users"

Couldn't connect to host, Elasticsearch down?", Elastica\\Exception\\Connection\\HttpException

Pooja2425 (talkcontribs)

Hi, I am using the versions below. I installed the Elasticsearch server, but it does not show up in the list below.

Product Version
MediaWiki 1.35.3
PHP 7.4.24 (apache2handler)
MySQL 8.0.26
ICU 65.1
Lua 5.1.5

I also installed the required extensions and the "elasticsearch/elasticsearch": "6.5" client.

Elastica 6.1.3 (f3c9459) 01:29, 3 September 2021
CirrusSearch 6.5.4 (95b958b) 19:07, 20 August 2021

wfLoadExtension( 'Elastica' );

wfLoadExtension( 'CirrusSearch' );

$wgDisableSearchUpdate = true;

$wgCirrusSearchIndexBaseName =  ''; //DataBase Name

php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php

Now remove $wgDisableSearchUpdate = true;

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse

$wgSearchType = 'CirrusSearch';

# php /data/www/html/wiki/maintenance/runJobs.php


1) When trying to search for anything in the search engine:

An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later.


2) http://wiki/api.php?action=cirrus-settings-dump


"code": "internal_api_error_Elastica\\Exception\\Connection\\HttpException", "info": "[YUsbMuAOtsHM4e8mbGG6NwAAAAs] Exception caught: Couldn't connect to host, Elasticsearch down?", "errorclass": "Elastica\\Exception\\Connection\\HttpException", "*": "Elastica\\Exception\\Connection\\HttpException at /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(190)\n#0 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(194): Elastica\\Transport\\Http->exec()\n#1 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(689): Elastica\\Request->send()\n#2 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php(571): Elastica\\Client->request()\n#3 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(383): Elastica\\Index->request()\n#4 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(75): Elastica\\Index\\Settings->request()\n#5 /data/www/html/wiki/extensions/CirrusSearch/includes/Api/SettingsDump.php(36): Elastica\\Index\\Settings->get()\n#6 /data/www/html/wiki/includes/api/ApiMain.php(1593): CirrusSearch\\Api\\SettingsDump->execute()\n#7 /data/www/html/wiki/includes/api/ApiMain.php(529): ApiMain->executeAction()\n#8 /data/www/html/wiki/includes/api/ApiMain.php(500): ApiMain->executeActionWithErrorHandling()\n#9 /data/www/html/wiki/api.php(90): ApiMain->execute()\n#10 /data/www/html/wiki/api.php(45): wfApiMain()\n#11 {main}" }


Please suggest which step I am missing.
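Since the exception says the host could not be reached, a first check is whether anything answers on the Elasticsearch HTTP port at all. The Python sketch below assumes Elasticsearch listens on its default address http://localhost:9200 and that the wiki is configured to use that address; the function name is made up for illustration.

```python
import json
import urllib.request


def elasticsearch_version(base_url="http://localhost:9200", timeout=3):
    """Return the version reported by the server, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            info = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    if not isinstance(info, dict):
        return None
    return info.get("version", {}).get("number", "unknown")


if __name__ == "__main__":
    print(elasticsearch_version())
```

If this prints None when run on the web server, the service is probably not running or not reachable from that machine, and the indexing scripts cannot succeed until that is resolved.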

Reply to "Couldn't connect to host, Elasticsearch down?", Elastica\\Exception\\Connection\\HttpException"