This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Discussion related to the CirrusSearch MediaWiki extension.
See also the open tasks for CirrusSearch on phabricator.
Latest comment: 3 years ago | 3 comments | 2 people in discussion
RESOLVED
The root cause was not using the default node.name or cluster.name in /etc/elasticsearch/elasticsearch.yml
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
I couldn't figure out why UpdateSearchIndexConfig.php isn't working on the latest MediaWiki 1.39. Can someone help me a bit?
Creating the mw_cirrus_metastore index seems to be failing. The stack trace indicates that an error was detected in the Elasticsearch response body, so Elasticsearch did receive the request but refused to process it.
Unfortunately the exception does not tell us more, so you might have to investigate. The Elasticsearch logs might contain some indication. Another way might be to hack the includes/MetaStore/MetaStoreIndex.php script with something like (around line 199 in the createNewIndex method):
Latest comment: 3 years ago | 4 comments | 3 people in discussion
RESOLVED
Elasticsearch 8.6.0 is not supported; with MW 1.39 you must install Elasticsearch 7.10.2.
MediaWiki: 1.39.1
CirrusSearch: 6.5.4 (d13de17)
Elastica: 6.2.0 (1baee3b)
Elasticsearch: 8.6.0
On a fresh install of MediaWiki and Elasticsearch I'm running into an issue with creating the Elasticsearch index. Running the `maintenance/UpdateSearchIndexConfig.php` script produces the following error:
Updating cluster ...
indexing namespaces...
Elastica\Exception\Connection\HttpException from line 186 of /var/www/team-wiki/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: Unknown error:52
Elasticsearch is running when this curl error 52 is produced. It looks like CirrusSearch is getting an "empty reply from server" from Elasticsearch. I'm not sure whether I should be passing additional parameters besides what is provided in the CirrusSearch README. HadleySo (talk) 00:24, 23 January 2023 (UTC)
Latest comment: 3 years ago | 3 comments | 3 people in discussion
I have PdfHandler installed, as I thought I read that you need it installed in order for CirrusSearch to be able to parse PDFs.
Additionally, I have PdfEmbed installed, so that on the (few) pages that contain PDFs, I can embed them directly in the page, which works great for those cases. I only wish that CirrusSearch could index either the embedded PDF itself or the upload in the File namespace. But alas, despite seeing messages here and in other places that imply that one of those solutions might exist, I have not found definitive instructions for how to actually implement this.
Can anyone advise? Is this a pipe dream that I have? Or is there a solution that I'm just not seeing?
@Cavila - thank you for the link. From reading that, then reading the PdfHandler documentation, I realized I didn't have the handler's dependencies installed. Once I installed those and re-indexed, I was able to search by file content, but only in the advanced tab. (By default, I search the main namespace, but in advanced search I can choose to search files.)
Latest comment: 3 years ago | 4 comments | 2 people in discussion
I opened a SMW GitHub issue back in January, and someone else today finally reported having the same problem when introducing CirrusSearch into a wiki that uses SMW, but it only just occurred to me to mention it here.
Sadly, without any concrete errors it is extremely hard to determine the cause of the issue.
CirrusSearch does not do anything special when SMW is installed. The best I can suggest is to keep investigating to find the root cause. My understanding is that unparsed SMW queries are considered an error; could SMW log some more context when that happens? DCausse (WMF) (talk) 12:47, 3 April 2023 (UTC)
Out of curiosity, does the issue diminish or increase if you delay when CirrusSearch jobs are processed:
$wgCirrusSearchUpdateDelay = [
	'prioritized' => 60, // in seconds, generally updates happening to the page itself
	'default' => 60, // in seconds, generally indirect updates (e.g. template change propagation)
];
This was extremely difficult to diagnose, even with the help of an experienced consultant. That said, I will see about testing your delay idea, though I don't know if that would be a problem with very heavy traffic, e.g. an average search rate of about 500 search operations per minute, especially when changes to templates can generate hundreds or even thousands of jobs.
Justin
EDIT: The settings.txt docs say that that variable is ignored with JobQueueDB and is only used for JobQueueRedis. We do use the DB for the job queue, so this wouldn't have any effect. Justin C Lloyd (talk) 22:33, 3 April 2023 (UTC)
Latest comment: 3 years ago | 3 comments | 2 people in discussion
Searching with regex doesn't work for me. Example: insource:/linu[xs]/
Debug log entry: [CirrusSearch] Search backend error during regex search for 'insource:/linu[xs]/' after 3: Parse error on script_lang not supported [groovy]
execute /usr/share/elasticsearch/bin/elasticsearch-plugin/elasticsearch-plugin install org.wikimedia.search:extra:7.10.2-wmf4 (see gerrit.wikimedia.org/g/search/extra#installation - sorry got "abusefilter-warning-linkspam" here ... so I couldn't paste the complete URL)
add $wgCirrusSearchWikimediaExtraPlugin[ 'regex' ] = array( 'build', 'use', 'max_inspect' => 10000 ); to LocalSettings.php (see above)
restart Elasticsearch: systemctl restart elasticsearch (is it necessary?!)
recreate index:
php /var/www/wiki/w/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver (--startOver was needed; without it I no longer got an error in MediaWiki, but the results were not correct)
I have only a small private wiki, so the runtime for recreating the index was no problem. Elasticsearch is installed on my Ubuntu server via apt.
However, whenever I run UpdateSearchIndexConfig.php it says "Couldn't connect to host, Elasticsearch down?"
I feel as if I'm not configuring it properly. Shouldn't I have to add a secret key? I read the README in its entirety, along with all the options in settings.txt in the docs folder, and still can't find an option to properly configure the server.
Also note that this configuration option accepts an array of servers.
If Elasticsearch is protected by some sort of authentication mechanism, you might be able to configure it with the additional config keys username, password, auth_type and possibly headers. Please refer to the Elastica documentation for more details; I don't have much knowledge of how auth is supposed to work with Elastica. DCausse (WMF) (talk) 07:22, 24 April 2023 (UTC)
Latest comment: 3 years ago | 4 comments | 2 people in discussion
Looks like autocomplete does not work well with entries containing an iota subscript. For example, try entering διαττω on the site below; you will not get autocomplete results for διᾴττω (using the core ICU plugin and the WMF extra plugin). Since my implementation uses older versions, I cannot verify whether this issue is resolved in the currently used versions. Spiros71 (talk) 18:32, 25 April 2023 (UTC)
If this is a bug, it's a bug in the ICU library. This particular normalization is done by ICU normalization (in the icu_normalizer filter). I also disagree with some of the choices ICU normalization and ICU folding make, so I see where you are coming from.
I think they have a general heuristic of unpacking "multi-letter" characters into their component parts, which makes sense for things like 🅎 → ppv; I'm less sure about 🄪 → 〔s〕 and ⑴ → (1). I don't know enough Greek to know whether ᾴ should normalize to άι or ά—but I could see it going either way.. that iota subscript is dangerously close to being demoted to "just a diacritic"!
Anyway, the list of ICU normalization conversions isn't documented anywhere, but I brute forced them all back in 2020, and I documented the full list (excluding simple lowercasing). It may not be 100% accurate anymore, but it should be close. The Greek sections I see start at U+3C2, U+1D5D, and U+1F71 (plus Math Greek at U+1D6A8).
If you want to normalize ᾴ to ά, and do the same for other iota subscript characters, you could add a char_filter to your language analyzer. If you want to keep ᾴ as is (or keep some other character from being normalized) you could add it to the exceptions for icu_normalizer. (Note that you actually specify all allowable normalization characters in the unicodeSetFilter parameter, so it uses a negated character class. For example, the German Wikipedia config uses [^ẞß]: everything but ẞ and ß gets normalized.)
Trey, thanks so much for your erudition and devotion to computational linguistics issues, I am really impressed :)
Indeed, the iota subscript is parsed as main letter + iota (αι in the case discussed) as it "contains" an "embedded" form of iota.
Another example from wiktionary, τηδε will match nothing, τηιδε though will give τῇδε. Although this parsing is, strictly speaking, linguistically correct, I think that when it comes to usability it would be good that the iota subscript is also parsed as a diacritic (and hence stripped off for autocomplete purposes) so that both τηδε and τηιδε will give τῇδε.
As for adding the exceptions in icu_normalizer or unicodeSetFilter, that is great to know! Where can I find those files, for example to normalize ᾴ to α, and so on (whilst also maintaining the current approach of being able to type αι to get ᾴ)? Spiros71 (talk) 20:07, 26 April 2023 (UTC)
Sorry I missed your reply! I was not watching my notifications carefully this week.
If you are available at 15:00-16:00 UTC, we are having office hours this Wednesday, May 3, and we can talk live about your situation. More info on the etherpad.
> so that both τηδε and τηιδε will give τῇδε
I'm not seeing an easy way to allow both τηδε and τηιδε to match τῇδε. If you convert ῇ to η, then τηιδε won't match anymore. A hack would be to also convert ηι to η, but I don't know enough Greek to know if that's a reasonable idea. In general, it sounds like a bad idea, though, because there are probably other contexts where the ι in ηι shouldn't be ignored. More on this below.
Before I give some pointers on how to modify your config, I'm not sure what the best practice is for maintaining language analyzer customizations to a MediaWiki installation. (I only normally work with the code that generates the config for the Wikimedia wikis.) You might want to follow up with the MediaWiki Stakeholders' Group if you aren't already familiar with them. By coincidence, they are having their monthly meeting next Friday, May 5. No idea if that's the right place to ask these questions, but I'm sure they can point you in a good direction. There are also EMWCon (which just happened, alas) and SMWCon (in the fall, in Europe). Some of the folks from EMWCon may be helpful, too. Finally, on the off chance you are in or near Athens (not impossible, given our topic of conversation!) the 2023 Wikimedia Hackathon is happening there in a few weeks. They will be much more focused on Wikimedia sites/wikis, but if you happen to live around the corner from the venue, it might be worth it to talk with some Wikimedia developers in person. (And some of the people from the above groups will likely overlap.)
[Looking around, I also found this conversation which points to this Phab Paste, which allows you to use hooks to add custom analysis config. @DCausse (WMF) wrote it, though, and it's basically magic to me. It may also be a little out of date; not sure about that.]
With that disclaimer and distraction out of the way, the readily available ways to customize your analyzer config are either in code, in mediawiki/extensions/CirrusSearch/includes/Maintenance/SuggesterAnalysisConfigBuilder.php if you are a PHP hacker, or with curl or the Elastic console if you are an Elastic hacker. (There are probably other, maybe better ways, too.) The curl/console option is probably the easiest to pick up from online examples, since it's not MediaWiki-specific.
Anyway, to get τηδε to match τῇδε, a mapping character filter would let you map "ᾴ=>α", "ῇ=>η" (and all the others) before icu_normalizer does something you don't want.
If you look at the Greek Wikipedia config, there's titlesuggest / titlesuggest / index / analysis. You could add another mapping char_filter there, and configure it in analyzer / plain / char_filter and analyzer / plain_search / char_filter (plain applies to the text in your documents; plain_search applies to user queries; they are often the same). The new character filter could go before or after word_break_helper; since that filter doesn't do anything to Greek letters, the order doesn't really matter.
The problem is that adding this kind of mapping char_filter would mean that τηιδε would not match τῇδε anymore (other than to the degree any typo might match). If τηδε matching is more important than τηιδε matching, then this might be the way to go.
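To make the trade-off concrete, here is a minimal Python sketch of the two normalizations being discussed. It is only an emulation written for illustration: it uses Python's standard `unicodedata` module, not Elasticsearch or ICU, and the function names `fold_subscript` and `expand_subscript` are made up for this example. The first treats the iota subscript as "just a diacritic" (what the mapping char_filter would achieve); the second unpacks it into a full iota (roughly the behaviour described for icu_normalizer above).

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI (the iota subscript)


def _strip_marks(decomposed: str) -> str:
    """Drop nonspacing combining marks (accents, breathings, iota subscript)."""
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def fold_subscript(text: str) -> str:
    """Treat the iota subscript as a plain diacritic and strip it."""
    return _strip_marks(unicodedata.normalize("NFD", text))


def expand_subscript(text: str) -> str:
    """Unpack the iota subscript into a full iota, then strip other marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return _strip_marks(decomposed.replace(YPOGEGRAMMENI, "\u03b9"))


word = "\u03c4\u1fc7\u03b4\u03b5"  # the title discussed above, with iota subscript
print(fold_subscript(word))    # the form a query without the iota would match
print(expand_subscript(word))  # the form a query with a full iota would match
```

A single analysis chain can only produce one of these two forms per title, which is why matching both queries needs something that emits more than one token.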
It is possible to make both match, but the only ways I can think to do it are pretty complex. If you are a Java hacker, you could write a custom plugin. The extra-analysis-homoglyph plugin does something similar (1 token in, 1 to 3 tokens out). [Another option would be to convince me that this is something we need to do for Greek-language Wikipedia and other wikis in general—is it?—and then I could write the plugin.. but it'd probably be quite a while before I'd have a chance to get to it as part of my normal work; 6–12 months would be super quick.]
Another approach might be the multiplexer token filter and have a path with nothing and a path with something to normalize the iota subscripts the way you want... but thinking about it, that gets out of hand very quickly. A char filter is the right way to do it, but there's no multiplexer for char filters (with good reason). The only way to replace characters in a non-custom token filter is probably a pattern replace token filter, but you can't do multiple mappings in a single filter... Never mind, that's just too messy.
I hope this isn't too overwhelming. Or, rather, sorry this is probably overwhelming. It can be pretty complex and not easy to do something no one else has tried to do before. [Hmm.. I guess yet another approach would be to try to find a plugin that does something like this. I've never seen such a thing, but I also haven't gone looking very hard for one before.]
Verifying counts...Not close enough! old=4689 new=3118 difference=0.33503945404137
Waiting to re-check counts...Not close enough! old=4689 new=3118 difference=0.33503945404137
[the same "Waiting to re-check counts" line repeated 19 more times]
Failed to load index - counts not close enough. old=4689 new=3118 difference=0.33503945404137. Check for warnings above. S0ring (talk) 07:54, 3 May 2023 (UTC)
I could see several reasons for this to happen:
You ran ForceSearchIndex to populate your index but did not wait for it to be fully populated before running UpdateSearchIndexConfig.
Many pages are being created on your wiki while the script is running.
To be certain, try to count the number of docs on kim_general and kim_content before running this script, for instance:
Using the logs you pasted as an example: if the count was around 3118 initially, then it's very likely due to one of the two reasons I stated above.
If it was 4689 initially, then I'm not sure how this could happen, and it might require deeper investigation.
Ultimately you can bypass this check by increasing the acceptable deviation using the --reindexAcceptableCountDeviation flag:
# php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now --reindexAcceptableCountDeviation 50%
But this probably means that some docs were not properly moved to the new index and running ForceSearchIndex again might be required. DCausse (WMF) (talk) 09:41, 9 May 2023 (UTC)
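For reference, the "difference" figure in the log above is consistent with a relative deviation, |old - new| / old. The following Python sketch reproduces that arithmetic; the formula is inferred from the logged numbers rather than taken from the CirrusSearch source, and the helper names are made up for this example.

```python
def count_deviation(old: int, new: int) -> float:
    """Relative difference between the old and new document counts."""
    if old == 0:
        return 0.0
    return abs(old - new) / old


def parse_percent(value: str) -> float:
    """Turn a string such as '50%' into the fraction 0.5."""
    return float(value.rstrip("%")) / 100


def counts_close_enough(old: int, new: int, acceptable: str) -> bool:
    """True if the deviation is within the acceptable percentage."""
    return count_deviation(old, new) <= parse_percent(acceptable)


# Numbers taken from the log pasted above.
print(count_deviation(4689, 3118))
print(counts_close_enough(4689, 3118, "50%"))
```

With the counts from the log, roughly a third of the documents are missing from the new index, so a 50% threshold would let the reindex pass while leaving those documents to be restored by another ForceSearchIndex run.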
> Can I find a list of the words considered stopwords in my language?
It would be easier to answer if you mentioned the language, since there is a lot going on with stopwords!
For many languages, we use the stopword filters built into Elasticsearch, which are based on data from Lucene. Elastic has a list with links to the Lucene code base. Note that there are both Portuguese and "Brazilian"; we use Portuguese. We don't use CJK except for Japanese (and that may change eventually). Oddly the CJK stopword list is all English; we use the actual English list, which is only slightly different from the CJK list.
For some historical reason, certain language analyzers in Lucene are kept separate from the rest. I think it's because they were originally developed outside Lucene. They include:
Kuromoji (Japanese, which we don't use, yet)—stopwords
Ukrainian Morfologik—stopwords... however, for technical reasons, we maintain our own copy—currently they are the same
Nori (Korean)—which doesn't use stopwords per se, but rather filters part-of-speech tags put on words by the parser. We have a custom list.
SmartCN (Chinese)—it has a stopword list, but it is only punctuation (for technical reasons)
For Moroccan Arabic (ary) and Egyptian Arabic (arz) but not Standard Arabic (ar), we add a fair number of additional stop words.
For Romanian, we add additional variants for some words because the Lucene list is so old that it uses the incorrect letters (ş & ţ) because the correct letters (ș & ț) were not available on computers back then (to be fair, they weren't reliably available until almost 2010).
The Mirandese stopword list was provided by a community member, inspired by the Portuguese stopword list.
The Polish list is the same as the Stempel list above, except we add "o.o" to go with "o.o."—by the time we get to stopwords, no tokens have final periods, so "o.o." doesn't filter anything.
We have smaller lists of additional stopwords that are embedded in the code.
For Chinese/SmartCN we have our own punctuation list, which is just a comma (again for technical reasons)
We have additional stop word filters for Irish and Polish, but they aren't for proper stopwords, they are just tools for filtering bits and bobs that come up during analysis. (The SmartCN filter is like that, too, I guess.)
> How can I add a word to the list?
So, it depends on the language and where the stopword list comes from, whose list you want to update, and how long you want to wait to see results.
For quicker results for on-wiki search, we can make changes to CirrusSearch. You can tell me the language and the word(s) and I can take care of it, you can open a ticket on Phabricator and add the tag "Discovery-Search" if you want to track progress, or if you are a MediaWiki programmer, you could submit a patch to the codebase and the Search Team would be happy to review it.
If you want to help a wider audience, you could open a ticket or a pull request upstream. Elastic is our immediate source of stopwords for most of these, but they are just wrappers around Lucene, so if they pay attention to a ticket, they'd just open a ticket in Lucene, so you can skip that step and open the ticket or pull request with Lucene. If it's accepted, it will eventually trickle down to Elastic again—though not directly to CirrusSearch, because we can't upgrade Elasticsearch anymore because of licensing changes. We haven't worked out our longer-term plan yet, but there is a decent chance we will end up on an Elasticsearch fork or other Lucene-based search engine and see the benefit eventually.
For most of the core Lucene stopword lists, there's another source mentioned in the code. The most common sources are Jacques Savoy and Snowball, though there are others. You can try to contact Lucene's upstream source and get them to update their list of stopwords, too, which might reach a wider audience, and might eventually trickle down to Lucene (they did update their Snowball-based stemmers and stopword lists 3 years ago; I think it's ad hoc, but they do update from time to time).
And now the question you didn't ask, but you must be thinking if you read this far...
> Why is it so complicated!
At least, I ask myself this now and then. Lucene tries to be the central repository for lots of open source language analysis because they want to make it available to their users, but they don't have everything. We make modifications and customizations in CirrusSearch in response to things we find in our data, or that community members bring to our attention. We try to push things upstream, but it can take a long time, and it's work when there are other things to do. TJones (WMF) (talk) 14:19, 9 May 2023 (UTC)
Completion profiles not appearing in user preferences, and autocomplete search suggestions not fuzzy
Latest comment: 3 years ago | 5 comments | 3 people in discussion
RESOLVED
wgCirrusSearchUseCompletionSuggester and wgCirrusSearchCompletionSuggesterSubphrases must be configured.
I installed Elasticsearch/Elastica/CirrusSearch with the intent to replicate Wikipedia's vector-2022 fuzzy search box:
For example, let's say I have a page titled Bees Eat Apples. Then I can currently enter Bee or Be or Bees ea and I will get a suggestion for the corresponding page, but I cannot enter Apples or anything like that and still get a suggestion.
I cannot figure out how to enable this: even though CirrusSearch seems to work in general, none of the profile selection tools appear on the Special:Preferences#mw-prefsection-searchoptions page, nor does setting $wgCirrusSearchCompletionSettings = 'fuzzy-subphrases'; in LocalSettings.php seem to have any effect on the autocomplete suggestions (note that the index.php?search=Bees search page results are fine). MrArsGravis (talk) 15:36, 15 May 2023 (UTC)
I seem to have solved most of my problem by setting
to generate the suggester index, and then selecting "subphrase matching" in my user search preferences.
I don't know whether the suggester index is periodically/automatically updated. If so, when? Or do I need a cron job for this? Also, how is the default user suggestion option set? MrArsGravis (talk) 19:07, 15 May 2023 (UTC)
You need to add UpdateSuggesterIndex.php (without any parameter) to a cron job. For example, to execute it once or twice a day at least. New pages, page moves or deletions won't be tracked in the suggestions until you run it.
Thanks you two, I think it all works as desired now. It might be worthwhile to add this info on the main article page somehow, but I'm also not quite sure where. MrArsGravis (talk) 18:16, 16 May 2023 (UTC)
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.
Latest comment: 3 years ago | 6 comments | 2 people in discussion
After a fresh install on a newly-upgraded mediawiki (upgraded in several steps, now at 1.39) I get this after trying to run UpdateSearchIndexConfig.php:
Elastica\Exception\Connection\HttpException from line 186 of /var/www/xxxwiki/w/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: Unknown error:52
After the first failure, I tried to test the newly installed Elasticsearch and ran into this issue: (link removed, but I guess it isn't needed: the .deb installer's permissions are not correct for the Elasticsearch certificate directories). I changed the permissions as suggested and the curl command worked, but the error from UpdateSearchIndexConfig.php was unchanged.
I am using Ubuntu 20.04 with php 8.1 added (8.1.18). The end of my LocalSettings.php is
// search
wfLoadExtension('Elastica');
wfLoadExtension('CirrusSearch');
$wgDisableSearchUpdate = true;
All settings in Elastic and in the CirrusSearch extension are left at default. I may have missed something, but it looks like defaults are fine. The Elastic server is on localhost, and this is a stand-alone server.
How are you configuring the list of elasticsearch servers?
Curl error 52 means "Empty reply from server" so it might be related to how CirrusSearch is trying to access elasticsearch, esp. if contacting elasticsearch using an http client while elasticsearch is running behind https.
When testing elasticsearch with curl what is the URL that you are using?
Yep, I missed that. Thanks! Reading settings.txt I was convinced that not setting anything created a search cluster on localhost and that the searchservers was then optional... I now have
$wgCirrusSearchServers = [
[
'transport' => 'Https',
'host' => 'localhost',
'port' => 9200
]
];
at the end of LocalSettings.php, but now I get an error:60.
Elastica\Exception\Connection\HttpException from line 186 of /var/www/xxxwiki/w/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: Unknown error:60
Alternatively, if Elasticsearch is accessed only via localhost, there might not be much value in having encryption between MW and Elasticsearch, and if disabling SSL on Elasticsearch is easier for you, that might be a better approach.
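For anyone else decoding these "Unknown error:NN" messages: the number is a libcurl error code. The Python sketch below is a small lookup for the codes relevant to this thread. The code names come from libcurl's error list; the hints are only the likely causes in this context, and the helper name `explain_elastica_error` is made up for this example.

```python
import re

# libcurl error codes relevant to this thread; hints are likely causes, not certainties.
CURL_ERRORS = {
    7: ("CURLE_COULDNT_CONNECT",
        "nothing is listening on that host/port, or a firewall blocks it"),
    52: ("CURLE_GOT_NOTHING",
         "the server sent an empty reply, e.g. plain HTTP sent to an HTTPS port"),
    60: ("CURLE_PEER_FAILED_VERIFICATION",
         "the TLS certificate could not be verified, e.g. it is self-signed"),
}


def explain_elastica_error(message: str) -> str:
    """Extract the curl code from an Elastica HttpException message and explain it."""
    match = re.search(r"error:(\d+)", message)
    if match is None:
        return "no curl error code found"
    code = int(match.group(1))
    name, hint = CURL_ERRORS.get(code, ("unknown", "see the libcurl error code list"))
    return f"curl error {code} ({name}): {hint}"


print(explain_elastica_error("src/Transport/Http.php: Unknown error:52"))
print(explain_elastica_error("src/Transport/Http.php: Unknown error:60"))
```

So going from 52 to 60 after switching the transport to Https is progress of a sort: the connection now speaks TLS, but the certificate presented by Elasticsearch is not trusted by the PHP/curl client.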
mw_cirrus_metastore missing, creating new metastore index.
Creating metastore index... mw_cirrus_metastore_first Scanning available plugins...none
Elastica\Exception\ResponseException from line 178 of /var/www/xxxwiki/w/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php: request [/mw_cirrus_metastore_first/] contains unrecognized parameter: [include_type_name]
contains unrecognized parameter: [include_type_name] does seem to suggest that the version of elasticsearch running is not compatible with the installed version of CirrusSearch.
The CirrusSearch version accompanying MW in version 1.39 should be compatible with elasticsearch 7.10.
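A quick way to rule this out before running the maintenance scripts is to look at the `version.number` field that Elasticsearch returns from its root endpoint. The Python sketch below only parses such a response body; the expected 7.10 series is taken from this thread, the sample bodies are trimmed to the one field used, and the function names are made up for this example.

```python
import json

# Per this thread: the CirrusSearch shipped with MW 1.39 expects Elasticsearch 7.10.x.
EXPECTED_SERIES = (7, 10)


def parse_version(number: str) -> tuple:
    """Split a version string such as '7.10.2' into a tuple of integers."""
    return tuple(int(part) for part in number.split(".")[:3])


def is_supported(root_response_body: str) -> bool:
    """Check the JSON body of GET / against the expected major.minor series."""
    info = json.loads(root_response_body)
    return parse_version(info["version"]["number"])[:2] == EXPECTED_SERIES


# Sample bodies, trimmed to the field used here.
print(is_supported('{"version": {"number": "7.10.2"}}'))
print(is_supported('{"version": {"number": "8.6.0"}}'))
```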
Latest comment: 2 years ago | 2 comments | 2 people in discussion
When I do a search in wikimedia, I receive the following error message - An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later. Surrat48 (talk) 20:45, 8 June 2023 (UTC)
Thanks for the report!
Can you share the wiki and the search query you are making, to help us understand under what conditions you get this failure?
Latest comment: 2 years ago | 9 comments | 2 people in discussion
Hello
- I restored a database backup from my current MediaWiki server (1.28) onto my new MediaWiki server (1.39, upgraded from 1.28)
- After the restore, I get the same number of articles on the current MW (1.28) and the new MW, so I haven't lost any data
- After the restore, I rebuilt the index by running these two commands:
php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
php forceSearchIndex.php
--> The second command shows: Indexed a total of 57624 pages at 99/second
On my new MW, when I search, some pages do not appear in my results even though the pages exist.
I checked the job queue and found about 11711 queued jobs related to cirrusSearchElasticaWrite, so I ran php runJobs.php from a cron job until there were 0 jobs queued.
After that, some pages were indexed, but others are still not indexed.
So:
1- After restoring a database backup from MW 1.28 to 1.39, should I rebuild the index (is it necessary)?
2- If yes, which command(s) should I run, and in what order?
3- How can I ensure that all pages are indexed and appear in search before moving MW 1.39 to the production environment?
Having jobs in cirrusSearchElasticaWrite might mean that there are failures. Could you check your logs (both the MediaWiki and Elasticsearch ones) to see if there are any errors? DCausse (WMF) (talk) 07:46, 12 July 2023 (UTC)
Hello,
MW and Elasticsearch don't show any errors in the log files.
When I run the Saneitize.php script, it lists many pages which are not in the index:
Page not in index 41591 Exploit lnaswpdat003.lna
Page not in index 41593 Reseau:Api tbs
Page not in index 41596 Exploit rtr-delaprtr-vip.del
Page not in index 41597 Exploit rtr-delaprtr02-phy.del
Page not in index 41598 Exploit rtr-delaprtr03-phy.del
....
At the end of the execution it indicates "Fixed 10425 page(s) (76328 checked)".
Did the Saneitize.php script actually fix your problems in the end?
If not, then there must be an issue with these pages preventing them from being indexed.
If you open one of these pages in a browser and add &cirrusDump to the URL, it should print what's inside Elasticsearch for this page; an empty array is shown if the page is not indexed.
If it's empty, can you check that CirrusSearch is actually able to generate the document that will be indexed? For this, use api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=41591&formatversion=2 (note the pageids param).
Have you identified anything in common in the pages that are not indexed, are they from the same namespace or same content type?
If the Sanitize script fixed a few pages, does running it again and again fix more and more pages?
Sadly I'm a bit puzzled by your problem and not sure what to look at next. I'd look more into understanding why you do not get any errors in the logs, because having jobs in cirrusSearchElasticaWrite means that they failed somehow and are being retried, and we should have logged something somewhere (unless you misconfigured how MW logs are generated?). DCausse (WMF) (talk) 08:58, 13 July 2023 (UTC)
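To check several of the "Page not in index" IDs at once, the cirrusbuilddoc request mentioned above can be generated rather than typed by hand. This Python sketch only builds the URL; the query parameters are the ones quoted in this thread, while the base URL `https://wiki.example.org/w/api.php` and the function name are placeholders for this example.

```python
from urllib.parse import urlencode


def cirrus_build_doc_url(api_base: str, page_ids: list) -> str:
    """Build the api.php URL that asks CirrusSearch to build the docs for some pages."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "cirrusbuilddoc",
        # Multiple page IDs are joined with '|' (percent-encoded by urlencode).
        "pageids": "|".join(str(page_id) for page_id in page_ids),
        "formatversion": 2,
    }
    return f"{api_base}?{urlencode(params)}"


# Placeholder base URL; the page ID is one reported as missing from the index above.
print(cirrus_build_doc_url("https://wiki.example.org/w/api.php", [41591]))
```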
Hello,
Thank you for the reply.
When I request the URL I get this message:
- Running the Sanitize script many times doesn't fix more and more pages.
- In the MW log I found an error like this (I don't know if it really is an error):
Would it be possible to have a meeting together to troubleshoot the issue, please?
You have to understand why the ParserOutput cannot be obtained. I can see two main reasons:
You use a ContentHandler that does not support CirrusSearch, perhaps you enabled a new Extension, or forgot to enable one you previously had?
You have inconsistencies in your database causing some errors, see ParserOutputAccess.php and how it might fail. Did you run Manual:Update.php after importing your database?
We run office hours (see wikitech:Search_Platform/Contact#Office_Hours) every first Wednesday of the month if you want to get in touch with the WMF Search Platform Team, but your issue suggests that the problem is not directly related to CirrusSearch.
Latest comment: 2 years ago, 2 comments, 2 people in discussion
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Latest comment: 2 years ago, 3 comments, 2 people in discussion
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
I'm assuming here that all the MediaWiki dependencies are running on the same server: PHP, your database and Elasticsearch. If not, please be careful, especially if your Elasticsearch cluster is shared between your production and backup installations.
If this is the case, when restoring a database backup you should also reindex everything from scratch. The same way that your relational database will get erased by restoring the backup, elasticsearch also needs to be reset based on the new content of the restored database. This is the easiest and safest solution.
There is no way to ensure that the same query will return identical results on two different Elasticsearch servers; the reason is that ranking uses statistics that will certainly differ even if the documents are the same. What you could do is run some sanity checks, e.g. counting the number of indexed documents on both Elasticsearch servers to make sure that they are close. DCausse (WMF) (talk) 16:41, 3 August 2023 (UTC)
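As an illustration of the document-count sanity check suggested above, here is a minimal Python sketch using Elasticsearch's `_count` endpoint. The host names, the index name `wikidb_content` and the 1% tolerance are all assumptions to adapt to your own installation.

```python
import json
from urllib.request import urlopen


def fetch_doc_count(base_url, index):
    """Ask Elasticsearch's _count API how many documents an index holds."""
    with urlopen(f"{base_url}/{index}/_count") as response:
        return json.load(response)["count"]


def counts_are_close(count_a, count_b, tolerance=0.01):
    """True when the two counts differ by at most `tolerance` (relative)."""
    largest = max(count_a, count_b)
    if largest == 0:
        return True
    return abs(count_a - count_b) / largest <= tolerance


if __name__ == "__main__":
    # Hypothetical hosts and index name; adjust to your installation.
    prod = fetch_doc_count("http://prod-es.example.org:9200", "wikidb_content")
    backup = fetch_doc_count("http://backup-es.example.org:9200", "wikidb_content")
    print(prod, backup, counts_are_close(prod, backup))
```

A matching count only shows that roughly the same number of pages were indexed; it says nothing about ranking, which is expected to differ between servers.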
It would probably be easier, or less prone to breaking on upgrade, to add a JavaScript gadget that automatically prepends the insource: text to the search term when submitting the form. Ciencia Al Poder (talk) 10:40, 6 August 2023 (UTC)
Latest comment: 2 years ago, 4 comments, 2 people in discussion
Hi,
We are currently using MediaWiki 1.39.4, PHP 7.4.33, MariaDB 10.4.12 and Elasticsearch 7.10.2.
I was given the task of updating my company's wiki from version 1.35 to 1.39, and now I am confused about CirrusSearch and Elasticsearch (which we run locally on the server).
On the extension page of CirrusSearch it says you need Elasticsearch version 7.10.2, but in the CirrusSearch README it says that only version 6.8 is supported. Which of these is true?
I followed the steps in the README normally and the search works fine.
But when I create a new article it is not found in Special:Search, the content is not found either.
Thanks for the answer, but shouldn't new pages still be indexed automatically?
We have a separate department in the company that only edits wiki pages and they say it worked with the old version...
If I create a job, would I have to reindex the database very often? Or what should such a job look like so that a new page is found right away? I'm not making any progress at all. Davidgbc (talk) 07:18, 21 October 2023 (UTC)
You don't have to create jobs; they're automatically created by MediaWiki (usually after saving an edit on a page or performing any other modification) and placed on the job queue. Then, jobs are picked from the queue on following page loads, or by a job runner, depending on how you configured things. See Manual:Job queue for more information.
Unfortunately the README is wrong and the wiki page is correct. As linked in the wiki page, there is a compatibility layer that can be activated for 1.39 to talk to 6.8.23, but it is focused on ensuring write compatibility and it's possible you would run into query issues. EBernhardson (WMF) (talk) 19:52, 25 October 2023 (UTC)
Thank you.
I face the problem of upgrading from 1.35 to 1.39 on RHEL9.
I already established that 1.35 works with 10.5.22-MariaDB, so the database version can remain the same when I switch MW version. I expect update.php to do the job for the wikidb database.
But having to upgrade Elasticsearch synchronously with MediaWiki is a problem.
Elasticsearch > 6.8 is not in the Red Hat repositories. I can get 8.x but not 7.10.2.
Can I have two Elasticsearch versions installed at the same time? For example, one on port 9200 and another on 9250?
It's technically possible to run multiple versions of Elasticsearch on the same host, but I'm not aware of any documentation to that end. Much would depend on your available infrastructure, and in my experience it generally leads to ongoing complexities. In WMF infra we run multiple instances (of the same version) of Elasticsearch on a single host and it's led to a number of minor problems and headaches over the years. If you have the ability to spin up virtual machines, then one plausible way forward is to spin up a new instance running the newer version. Another potential option might be to use the Docker container Elastic makes available; those are isolated enough that it should reduce the complexities of running two instances on one host. EBernhardson (WMF) (talk) 22:20, 25 October 2023 (UTC)
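If two instances do end up running side by side, a small check of which version answers on which port can help avoid pointing the wiki at the wrong one. The sketch below reads `version.number` from the JSON that Elasticsearch serves at its root URL; the ports are the ones from the question above, and `localhost` is an assumption.

```python
import json
from urllib.request import urlopen

# Version this page says MediaWiki 1.39 requires.
EXPECTED_VERSION = "7.10.2"


def version_from_root(payload):
    """Extract version.number from the JSON served at the Elasticsearch root URL."""
    return payload["version"]["number"]


def fetch_version(port, host="localhost"):
    """Query one instance and return its reported version string."""
    with urlopen(f"http://{host}:{port}/") as response:
        return version_from_root(json.load(response))


if __name__ == "__main__":
    for port in (9200, 9250):  # ports taken from the question above
        version = fetch_version(port)
        status = "expected" if version == EXPECTED_VERSION else "not the expected version"
        print(port, version, status)
```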
Is it possible to reach whoever created the compatibility layer found in the 1.39 version?
My teammate DCausse wrote the layer, but if you look inside you can see it is very simple. The problem this compatibility layer solves is a breaking change in the bulk write API of Elasticsearch. It doesn't do anything with search requests. In WMF production we ran the upgrade such that we had one cluster running 7.10 and one cluster running 6.8. As the code that knew how to talk to 7.10 was deployed, it would also switch its query endpoint between clusters. Only the write layer required compatibility, because it had to write to both clusters at the same time.
There is a reasonable chance it would work for most simple queries. The general problem is that when Elastic releases a major version update they make a wide variety of breaking changes (see breaking changes list for 7.0). You could test and see what happens to work, but if problems do arise I don't know if there will be much we can do to help you. EBernhardson (WMF) (talk) 18:58, 26 October 2023 (UTC)
Latest comment: 2 years ago, 5 comments, 2 people in discussion
Hi,
As in the title. Are these supported? If not, could you point me to a Phabricator task? On my wiki we've got two content namespaces (one for official, the other for unofficial content), but unfortunately the second namespace is never offered in suggestions. Are there any workarounds, maybe (without disabling Cirrus suggestions)?
Alternatively, if someone has some pointers on how to potentially implement it in the extension, I'd greatly appreciate them, though I've never done any search work in the past. Alex44019 (talk) 00:31, 20 November 2023 (UTC)
Getting suggestions (title completion) should be supported.
For 2 pages:
My_Page
Unofficial:My_Page
I suspect that what you want when typing "Ma Pag" in the search box is getting at least these two pages suggested?
If yes I think that the way to get this working is:
Configure wgNamespacesToBeSearchedDefault with [ 0 => 1, 100 => 1 ] (assuming that 100 is the Unofficial namespace)
Note that changing wgNamespacesToBeSearchedDefault will require reindexing your wiki.
You can see it in action on https://es.wikipedia.org, for example, where the Author and Portal namespaces are searched by default: if you search for `Lenguas portuguesa` you should obtain results from both the main content namespace and the Portal namespace.
I'll link the wiki as it may be helpful for the thread: https://ark.wiki.gg/. We use the main namespace for official game content, and a "Mod" namespace for unofficial modifications, all following a format of a mod's main page at "Mod:modname", and mod's content as sub-pages to that main page. For example "Mod:ARK Additions/Acrocanthosaurus".
In my expectations, typing "Acro", "Acrocantho", or the full title "Acrocanthosaurus", in the mw-head search bar would suggest the article that's in the Mod namespace. We have no other page titled Acrocanthosaurus in any namespace (ignoring files of course). However, there are simply no results returned at all.
To get the suggestions, the reader has to type the mod namespace prefix and the mod's name. "Mod:ARK Additions/Acr" returns valid suggestions. There's no "partial" completion, the prefix must be complete and without typos. And that's not very intuitive or useful.
Regular Special:Search already handles this well [enough], and our mod namespace is weighted below main.
(I've put "enough" in brackets, as searching for "acrocanth" in Special:Search yields no results until a wildcard is added to the end. I'm not familiar with Cirrus's configuration though, so not sure if there's a setting to alter the behaviour so search acts as if there was a wildcard at all times. However, this is not related to this thread.) Alex44019 (talk) 13:01, 23 November 2023 (UTC)
You seem to use the fuzzy-subphrases profile of the completion suggester, which allows it to complete in the middle of titles. When running a completion search across multiple namespaces, the CompletionSuggester (if enabled) will only use this algorithm for the main namespace; the other namespaces will be searched using the classic prefix search algorithm. This is why searching for Acro does not yield Mod:ARK Additions/Acrocanthosaurus: you have to search for ARK Additions Acro for it to work.
So indeed, in order to support subphrase matching in your context, the CompletionSuggester would have to be adapted to support multiple namespaces; sadly it was not designed with this use case in mind. I'm unclear on what the main difficulty would be in adapting the codebase, but at a glance I think the context suggester has to be used, and I fear that the assumption that only NS_MAIN is indexed is probably hard-coded in many places.
An alternative might be to change how the classic prefix search works by enabling wgCirrusSearchPrefixSearchStartsWithAnyWord. We never enabled this on WMF wikis, so I don't have much experience with how it behaves, but it might greatly help to increase recall on non-main namespaces in your case.
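To make the difference between the two matching ideas concrete, here is a toy Python sketch. It is only a model of "the title must start with the query" versus "some word of the title must start with the query"; it is not CirrusSearch's actual analysis chain, and the tokenizer (splitting on spaces, "/", ":" and "_") is an assumption made for illustration.

```python
import re


def words(title):
    """Split a title into lowercase words (toy tokenizer, see assumptions above)."""
    return [w for w in re.split(r"[\s/:_]+", title.lower()) if w]


def classic_prefix(query, title):
    """Classic idea: the whole title must start with the query."""
    return title.lower().startswith(query.lower())


def any_word_prefix(query, title):
    """Any-word idea: some word of the title must start with the query."""
    q = query.lower()
    return any(word.startswith(q) for word in words(title))


title = "ARK Additions/Acrocanthosaurus"  # title without the "Mod:" prefix
print(classic_prefix("Acro", title))                # False
print(any_word_prefix("Acro", title))               # True
print(classic_prefix("ARK Additions/Acro", title))  # True
```

How the real setting tokenizes titles and ranks results would need to be tested on a sandbox wiki.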
Interesting, thank you. I'll get in contact with our hosting platform provider about current Cirrus settings, and I'll set up a sandbox to test out the variable you mentioned. I might have a try at getting more familiar with the extension's internals for the CompletionSuggester (mainly for fun), but currently need to burn through my existing to-do lists...
Also... it seems the slash is required in "ARK Additions/Acro" to get article results. Dropping the slash only returns our legacy redirects. Still useful to know! Alex44019 (talk) 12:31, 24 November 2023 (UTC)