Topic on Project:Support desk

Some pages index and others do not in CirrusSearch

Bruceillest (talkcontribs)

Running on Windows Server 2012R2

Product Version
MediaWiki 1.32.0
PHP 7.2.7 (cgi-fcgi)
MySQL 8.0.15
ICU 61.1
Elasticsearch 5.6.16
Extension Version
CirrusSearch 0.2 (b1fa4bd) 07:47, 20 February 2019
Elastica 1.3.0.0 (9fcf88c) 02:09, 11 October 2018

When I search, I am able to pull up results from some pages but not all. I appended &action=cirrusdump to most of my page URLs and noticed that some of them gave me an output while others only showed []. I ran the commands in the README file and everything went through smoothly. I was wondering which extension/application is responsible for indexing the pages: is it Elasticsearch, CirrusSearch, or the Elastica extension? Knowing this should help me narrow down which configuration to focus on.
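For reference, the per-page dump can be fetched directly; a minimal sketch, using the wiki URL that appears in the dump below (adjust for your setup):

```shell
# A fully indexed page returns a JSON document; an unindexed page returns [].
curl "http://houcaswiki01/CAS/index.php?title=Main_Page&action=cirrusdump"
```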


Thanks

MarkAHershberger (talkcontribs)

I wasn't aware of the action=cirrusdump parameter. Do the pages that return [] show up in search?

Bruceillest (talkcontribs)

Pages that return [] do not show up in search. The other thing is that when I first set this up, no pages were showing in search until I added &action=cirrusdump to a page URL. Initially it would return [], but after hitting refresh it would populate what I guess is called the JSON output (example below). On the majority of my pages, though, it only returns []. This tells me that whatever automatic process is supposed to do this is not running. The question is: what is the process that is supposed to run this?


[{"_index":"caswiki_content_XXXXXXXXXXX","_type":"page","_id":"1","_version":[],"_source":{"version":848,"wiki":"wiki","namespace":0,"namespace_text":"","title":"Main Page","timestamp":"2019-08-06T13:27:23Z","create_timestamp":"2019-02-11T18:01:17Z","category":[],"external_link":[],"outgoing_link":[],"template":[],"text":"Welcome to Wiki Need to know how to add info, Click on the link below for instructions. Wiki How Category:XXXXXX Category:XXXXXXXX Category:XXXXXXX Category:XXXXXX Category:XXXXXXXXX:Instructions Category:XXXXX Category:XXXXXXXXX Category:XXXXXXX Category:XXXXXXXXXXXXXXXX Category:XXXXXXXXX Category:XXXXXXXX Category:XXXXXXXXX","source_text":"Welcome to Wiki<\/big>\n\n\nNeed to know how to add info, Click on the link below for instructions.\n\n[http:\/\/houcaswiki01\/CAS\/index.php?title=Wiki_How Wiki How<\/u><\/big>]\n\n==Categories==\nSpecial:AllPages","text_bytes":239,"content_model":"wikitext","language":"en","heading":["Categories"],"opening_text":"Welcome to Wiki Need to know how to add info, Click on the link below for instructions. Wiki How","auxiliary_text":[],"defaultsort":false,"display_title":null,"redirect":[],"incoming_links":0}}]

Ciencia Al Poder (talkcontribs)

Indexing is done by the job queue (see Manual:Job queue). If the job queue is not being processed, or jobs fail to execute, then new pages won't be indexed.
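A minimal sketch of checking and draining the queue from the MediaWiki root (the Cirrus job type names below are assumptions based on CirrusSearch's job classes; `--group` shows counts per type):

```shell
# Show how many jobs of each type are queued.
php maintenance/showJobs.php --group

# Run the CirrusSearch update jobs explicitly.
php maintenance/runJobs.php --type cirrusSearchLinksUpdate
php maintenance/runJobs.php --type cirrusSearchElasticaWrite
```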

Bruceillest (talkcontribs)

So I ran php runJobs.php and still have the same issue where some pages show and others don't. Then I went through the README again, enabling and disabling the search update while running the respective scripts in order, set action.auto_create_index: false, action.disable_close_all_indices: true, action.disable_delete_all_indices: true, and action.disable_shutdown: true in the elasticsearch.yml file, added $wgRunJobsAsync = false; (per the README), and ran php runJobs.php again. This time it took some time to complete, but the result was the same: some pages show up in search and most don't. I then restarted my server and Elasticsearch wouldn't start because, per the Elasticsearch logs, action.disable_close_all_indices: true, action.disable_delete_all_indices: true, and action.disable_shutdown: true are "unknown settings". After removing those settings I was able to start it again and verify that the issue was still there. FYI, not sure if this matters, but for security reasons, when I originally set up MediaWiki with MySQL, I changed the default port of 3306; that hasn't been an issue for my other extensions.

Bruceillest (talkcontribs)

I got some more info in the logs when I ran php showJobs.php --list. I was wondering if you are familiar with this?


IP: 127.0.0.1
Start command line script showJobs.php
[caches] cluster: WinCacheBagOStuff, WAN: mediawiki-main-default, stash: db-replicated, message: WinCacheBagOStuff, session: WinCacheBagOStuff
[caches] LocalisationCache: using store LCStoreDB
[DBConnection] Wikimedia\Rdbms\LoadBalancer::openConnection: calling initLB() before first connection.
[DBReplication] Wikimedia\Rdbms\LBFactory::getChronologyProtector: using request info {
    "IPAddress": "127.0.0.1",
    "UserAgent": false,
    "ChronologyProtection": false,
    "ChronologyPositionIndex": 0,
    "ChronologyClientId": null
}
[DBConnection] Wikimedia\Rdbms\LoadBalancer::openLocalConnection: connected to database 0 at '127.0.0.1:6720'.
[error] [9542873de79a620a1e3149db] [no req]   ErrorException from line 586 of C:\inetpub\wwwroot\CAS\includes\jobqueue\JobQueueDB.php: PHP Notice: unserialize(): Error at offset 65302 of 65535 bytes
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 C:\inetpub\wwwroot\CAS\includes\jobqueue\JobQueueDB.php(586): unserialize(string)
#2 C:\inetpub\wwwroot\CAS\includes\libs\MappedIterator.php(78): JobQueueDB->{closure}(stdClass)
#3 [internal function]: MappedIterator->accept()
#4 C:\inetpub\wwwroot\CAS\includes\libs\MappedIterator.php(74): FilterIterator->rewind()
#5 C:\inetpub\wwwroot\CAS\maintenance\showJobs.php(73): MappedIterator->rewind()
#6 C:\inetpub\wwwroot\CAS\maintenance\doMaintenance.php(94): ShowJobs->execute()
#7 C:\inetpub\wwwroot\CAS\maintenance\showJobs.php(109): require_once(string)
#8 {main}
[exception] [9542873de79a620a1e3149db] [no req]   Error from line 75 of C:\inetpub\wwwroot\CAS\extensions\CirrusSearch\includes\Job\ElasticaWrite.php: Unsupported operand types
[DBConnection] Wikimedia\Rdbms\LoadBalancer::closeAll: closing connection to database '127.0.0.1:6720'.

Ciencia Al Poder (talkcontribs)

This seems to be documented in those two bug reports:

The "Notice: unserialize(): Error at offset 65302 of 65535 bytes" (and 65535 is a very magic number) means the job queue is stored in MySQL, and the contents of the job parameters got truncated because the column doesn't allow more than 65535 bytes. The fix for this would be to use Redis as the job storage.
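The magic number is no coincidence: 65535 = 2^16 − 1 is the maximum length of a MySQL BLOB column, the type MediaWiki uses for job parameters, so any serialized blob longer than that is cut off on insert. A one-line sanity check:

```shell
# A MySQL BLOB stores at most 2^16 - 1 bytes; longer data is truncated.
echo $(( (1 << 16) - 1 ))   # prints 65535
```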

But on the other hand, according to the comment in T157759#4762940, when Cirrus jobs are inserted this way, it's because they weren't executed directly due to an error. That means those jobs shouldn't be there in the first place; the original error is somewhere else.

You should try to clear your job queue of all those broken jobs and then either switch jobs to Redis, or edit a page that's not in the index, save it, and inspect the debug log for any errors while the CirrusSearch update job executes.
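For the Redis option, the switch is made in LocalSettings.php; a sketch, assuming a Redis server on the default local port (server address and TTL are assumptions to adjust for your setup):

```php
// LocalSettings.php — route the default job queue through Redis.
// '127.0.0.1:6379' is an assumption; point it at your Redis server.
$wgJobTypeConf['default'] = [
	'class' => 'JobQueueRedis',
	'redisServer' => '127.0.0.1:6379',
	'redisConfig' => [],
	'claimTTL' => 3600,
];
```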

Bruceillest (talkcontribs)

OK, I think I have figured it out somewhat, but I still have an issue. I was able to expand the column size in MySQL from BLOB (65535 bytes) to MEDIUMBLOB (16 MB). After doing that, I cleared the stuck job from MySQL by deleting the row's job_params value in the job table. I was then able to run php showJobs.php --list without any errors. Next I ran updateSearchIndexConfig.php and was getting [gitinfo] Cache incomplete for C:\inetpub\wwwroot\CAS, so I found this article https://phabricator.wikimedia.org/T131003 and applied the fix, which resolved that issue. After going through the README I still had the problem where some pages index and others don't, but I was getting no errors in the wiki log or Elasticsearch.
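The column change described above can be done with a single ALTER; a sketch, assuming the database is named caswiki (as in the index name earlier in the thread) and no table prefix is in use:

```shell
# Widen job_params from BLOB (64 KB) to MEDIUMBLOB (16 MB) so large
# serialized Cirrus jobs are no longer truncated on insert.
mysql -u root -p caswiki -e \
  "ALTER TABLE job MODIFY job_params MEDIUMBLOB NOT NULL;"
```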


I then set up Redis, followed the README instructions again, and still some pages index and others don't, also with no errors. So I went to a page, edited the source directly instead of using VisualEditor this time (since I had tried editing through VisualEditor before and that didn't work), saved it, and it indexed just fine. FYI, I tried this with and without Redis (after fixing MySQL). But here's the issue: I need it to go through and automatically index all of the existing pages, not just new ones. Do I have to do this to all my pages manually? Isn't there an automatic process that is supposed to do this? And if so, what could it be? Because that is what might be failing now.


Thanks

MarkAHershberger (talkcontribs)

It looks like you can reindex by running

php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now

See the upgrade notes.

Bruceillest (talkcontribs)

Sorry, just getting back to this. So far this didn't work. I still need to edit the source and hit save for a page to index (basically to create the JSON output it needs to index, which I verified with &action=cirrusdump). There has to be some process that creates this output that is failing. Could it be something in VisualEditor that is interfering?

Bruceillest (talkcontribs)

OK, I got this to work the hardest way possible, which is ridiculous. I had to go to each one of my pages, click on "Edit source", and hit save for it to create the document (viewable with &action=cirrusdump) so it would index. I verified that any new changes are actively being updated, so that's good. It seems I should have installed this feature before creating any wiki pages so I wouldn't have gone through this. It was very tedious and time-consuming to go through already-created pages, uploaded files, and discussion pages just to get them indexed.


I am afraid that when it comes time for me to do an upgrade, it will force me to do this again; by then I hope this is already resolved. I will just stop using this extension if it comes to doing this again just to index what was already there, since in the future I'm bound to have even more pages and files.


I never got a solid answer as to what creates the file/content/JSON output you see when you append &action=cirrusdump, since that process is definitely failing. I do appreciate the help you both have given me so far, but this needs to be addressed.


It would be nice if there were a better or different search option than CirrusSearch, because even though I finally have some of it running, it's still not perfect. When I search for one of the headings of a page, it won't come up unless I add the wildcard *, and it won't come up when a letter is missing even though I'm using the wildcard; for example, *Task will not match unless you spell it out completely, like *Tasks. Unfortunately, this extension is only slightly better than the built-in one.

MarkAHershberger (talkcontribs)

Your experience is an example of poor documentation, I think. I haven't run into similar problems when upgrading CirrusSearch, so I'm sorry that you had to do so much work.

I'm sure there is a better way than what you did, but, again, my experience with CirrusSearch is such that I don't know what that would be.

Ciencia Al Poder (talkcontribs)

Maybe follow the Upgrading notes from the README?

Olson.jared.m (talkcontribs)

I had this same issue, where some pages were indexed and others weren't. Pages that returned [] did not show up in search.

The first command here was not enough, but once I ran the second, the pages showed up in search correctly.

php updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now

php forceSearchIndex.php

As the rebuild happens you'll see "...Indexed 10 pages ending at 5267 at 71/second..." etc.
