Help talk:CirrusSearch

CirrusSearch search in text files

PushpendraJadaun12 (talkcontribs)

I use documentation versioning to show pages for different versions, so some text files are not stored in the database. Is there any way to search the content of these text files using the CirrusSearch extension, given that it only searches data present in the database? Please let me know how to do it.

DTankersley (WMF) (talkcontribs)

Can you give an explicit sample url that you're using?

Thanks!

PushpendraJadaun12 (talkcontribs)

@DTankersley (WMF) Sorry, I have updated the question with some more detail. Please have a look at it.

DCausse (WMF) (talkcontribs)

CirrusSearch can search file content as long as a MediaHandler extension supporting the file types you want is installed. As far as I know, only PDF files are currently supported this way; this is handled by Extension:PdfHandler.

Ksfield (talkcontribs)

@DCausse (WMF) can you elaborate on what Extension:PdfHandler is doing that allows PDFs to be indexed by CirrusSearch? I am interested in adding the capability to index other types of files, but am having a hard time figuring out what is needed. Thanks!

DCausse (WMF) (talkcontribs)

You just need to install CirrusSearch and PdfHandler to be able to index and search PDF and DjVu content (please see their respective documentation). Other types of files (MS Office/LibreOffice docs) are not supported as far as I know.

PushpendraJadaun12 (talkcontribs)

The commands below are supposed to rebuild the search index from scratch:

1) php updateSearchIndexConfig.php --startOver
2) php forceSearchIndex.php

But they don't seem to work either. I ran them hoping they would index my text files and let me search their content. Both commands complete successfully, but the search still finds nothing in the existing files.

Note: When I add a new text file and search for its content, the new file's content does appear in the search results.

Please let me know if anyone can help me out as soon as possible.

DCausse (WMF) (talkcontribs)

Please read the documentation at Extension:PdfHandler esp. the "Debugging" part. You may have to run other commands before rebuilding the index from scratch.

PushpendraJadaun12 (talkcontribs)

@DCausse (WMF) My search is on txt files. Can you please let me know whether Extension:PdfHandler will work for searching in txt files?

php updateSearchIndexConfig.php is giving error

PushpendraJadaun12 (talkcontribs)

While installing the CirrusSearch extension after upgrading MediaWiki from version 1.28 to 1.32, I ran the following command (following the README):

php updateSearchIndexConfig.php

I get this error: Elastica\Exception\Connection\HttpException from line 187 of $MW_INSTALL_PATH\extensions\Elastica\vendor\ruflin\elastica\lib\Elastica\Transport\Http.php: Couldn't resolve host

I tried searching but had no luck. Please let me know what I need to do to resolve it.

Note: CirrusSearch, Elastica and MediaWiki are all at the same version, 1.32.

EBernhardson (WMF) (talkcontribs)

"Couldn't resolve host" suggests that whatever hostname it finds for the Elasticsearch cluster cannot be resolved. Are you using $wgCirrusSearchServers? Or one of the more complex configuration options? What format are you using to specify $wgCirrusSearchServers?
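
One way to test the name resolution part independently of MediaWiki is a small Python check like the sketch below. The host names in the list are placeholders, not values from this thread; substitute whatever your own $wgCirrusSearchServers contains. Note that a host name can resolve fine while the server behind it is still down, so this only rules out one possible cause.

```python
import socket


def host_resolves(hostname: str) -> bool:
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False


# Hypothetical entries; replace with the hosts from your own configuration.
candidate_hosts = ["127.0.0.1", "elasticsearch.invalid"]

for host in candidate_hosts:
    status = "resolves" if host_resolves(host) else "does NOT resolve"
    print(f"{host}: {status}")
```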

PushpendraJadaun12 (talkcontribs)

The problem is solved. The Elasticsearch server was not running properly: it kept shutting down after some time, which was causing this error.

Search results

GiorgioGaleotti (talkcontribs)

Would it be possible to have search results in alphabetical order? It would make finding duplicate files much easier.

EBernhardson (WMF) (talkcontribs)

Unfortunately, limitations of the current search implementation prevent sorting results by title.

PerfektesChaos (talkcontribs)

I would like to advertise User:PerfektesChaos/js/resultListSort, which sorts the output of many special pages in various orders, for example search results by page name, modification time or page size.

exclude redirections ?

Summary by Tacsipacsi
Tomates Mozzarella (talkcontribs)

Is there a way to exclude redirect pages from search results? For example, I'd like to isolate only real articles in this search. Thanks for your help.

Tomates Mozzarella (talkcontribs)

No one to help me ? ;-(

Nemo bis (talkcontribs)

You can use -insource:/REDIRECT/ and similar, of course, but it's not particularly clean.
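
For anyone who wants to bookmark or script such a search, here is a rough Python sketch that puts the exclusion into a Special:Search URL. The base URL and the word "mozzarella" are just examples, and the parameter names (search, title, fulltext) are copied from the search URLs posted elsewhere on this board, so treat them as an assumption for your own wiki.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base: str, query: str) -> str:
    """Build a full-text search URL for the given query string."""
    params = {"search": query, "title": "Special:Search", "fulltext": "1"}
    return f"{base}?{urlencode(params)}"


# Hypothetical search term combined with the negated regex from the reply above.
query = "mozzarella -insource:/REDIRECT/"
url = build_search_url("https://en.wikipedia.org/w/index.php", query)
print(url)

# Round trip: decoding the URL gives back the original query.
print(parse_qs(urlsplit(url).query)["search"][0])
```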

Quiddity (WMF) (talkcontribs)

I've filed the feature request at phab:T90807 ("Option to exclude redirection pages from search results")

Issue in English version breaks German translation

Summary by Speravir

An issue caused by improper tag pairing of ref and translate tags caused an incomplete display of the German translation.

Speravir (talkcontribs)

At least the German translation is broken: it currently has 14 sections, while the English one has 16. In German, the English sections 7 and 8 (Page weighting and Regular expression searches) are missing entirely, and I know they are translated, because I was the one who did most of it. The Filters section has only 2 subsections in German, but 7 in English. (Edit: Oh, in fact parts of the Regular expressions section are present, but displayed as part of the Filters section.) So I assume that somewhere in or after the second Filters subsection (Deepcategory) there is a missing or misplaced <translate> tag (or, vice versa, there is one too many somewhere before).

FWIW the translation was broken with this edit: Special:Diff/2780543/2787288 on 21 May 2018, but the latest translation update before that was on 27 December 2017, hence the mistake(s) must have occurred sometime in this period.

DTankersley (WMF) (talkcontribs)

Can you provide some samples of where this is happening, please?

Speravir (talkcontribs)

Sorry, but I do not understand what you want from me. What kind of samples? Or: What do you not get from my first message? Simply compare the table of contents of Help:CirrusSearch with that of Help:CirrusSearch/de, especially in the Filters section (Filter in German). Note also that the Geo Search section is the ninth in English, but at the moment the seventh in German (as Geosuche). Regarding my edit above: in German, section 6.2.1 currently belongs to Regex searches; in English this is in fact section 8.2. A closer look shows that some lines before 6.2.1 already do not belong to section 6.2: everything starting from the box is about RegEx (you can see this string in the box).

Wargo (talkcontribs)
Speravir (talkcontribs)

I had actually noticed the error message in the German help version, but thought it was just caused by a translation change (this has happened before, too). Because of the bigger issue I didn't want to start a new translation, though. I've now looked at the English text of this passage and decided to take a cleaner approach: Special:Diff/2883110/2887712. Could someone please mark the English text for translation, so we can see whether this was the reason for the mess?

If so, this edit broke everything: Special:Diff/2704479/2714274. I just rearranged the tags to the state before this.

Wargo (talkcontribs)

Done

Speravir (talkcontribs)

Thanks, Wargo. This was indeed the reason. Sorry for all the noise.

Truncated search and results snippet

Loman87 (talkcontribs)

Hi,
I was wondering why, when I do a truncated search using the asterisk, the matching terms aren't shown in bold as in a simple search; instead, the page snippet shows only the first lines of the page containing the occurrence. Is this an issue or is it the expected behaviour? My wiki is at this link.
Apart from this, I was wondering whether there are plans to improve the search results page, especially regarding the readability of the page snippets (sometimes wiki markup interferes with reading, especially when pages have a content model other than wikitext) and linking to pages (currently the results link to a page containing the search term, but when a user arrives at that page they have to search for the word again using Ctrl-F). I am not a developer, so I am just asking the experts for their technical opinion.
Thanks for your answer.

PerfektesChaos (talkcontribs)

If there is a short matching phrase, it is shown in bold.

  • If not, the beginning of visible text is displayed.
  • That depends on the words and whether they occur close together.

Improvements may arrive year after year. There are many ideas.

Is it possible to search by section?

Karlpoppery (talkcontribs)

For example, if I have a section called "Etymology" on many of my pages, would it be possible to search only the text that appears in the etymology sections?

If there's no option to do that, is it something that could be done with regex in a reasonable time?

197.235.75.234 (talkcontribs)

No, it isn't possible to do it directly.

Yes, it is theoretically possible to use regex, but in many cases it will time out.

Karlpoppery (talkcontribs)

I see, that's a bummer. There must be ways to hack around this, though. For example I could automatically save those sections in their own articles under the same category, then do a search by category and redirect the result to the original article. Or maybe I could use Cargo

197.235.219.81 (talkcontribs)

Such hacks will work, but they seem like a lot of effort. The way to make a timeout less likely is to use an efficient regex, along with simplified text. For example:

"personal life" insource:/\=\=\s*Personal life\s*\=\=.*?he attended.*?\=\=\w*?\=\=.*?/

https://en.wikipedia.org/w/index.php?search=insource%3A%22personal+life%22+insource%3A%2F%5C%3D%5C%3D%5Cs*Personal+life%5Cs*%5C%3D%5C%3D.*%3Fhe+attended.*%3F%5C%3D%5C%3D%5Cw*%3F%5C%3D%5C%3D.*%3F%2F&title=Special%3ASearch&profile=default&fulltext=1

The search above attempts to find "he attended" within a section, and because it is simplified it will fail in many cases, for example if an article only contains one section or if a section is created by a template. Sections are simply too complicated to deal with, because people think of them as sub-documents but in reality they are just pieces of the same document. A related feature request is https://phabricator.wikimedia.org/T27062.

Note that the query above results in a timeout.

TJones (WMF) (talkcontribs)

Regexes in general are tricky, and making them work efficiently in on-wiki search is hard, too. So, a few notes:

  • Put as many required words outside the regex as possible. "personal life" gets about 150K results. "personal life" attended only gets 50K, which cuts the number the regex has to scan to a third. "personal life" "he attended" gets only 13K. The regex matches both "he attended" and "she attended", but I would suggest searching them separately, since "personal life" "she attended" only gets 5K results, and splitting your search into two queries against ~18K results in total is better than one query against 50K.
    • Any other non-regex info, like categories, also helps. Anything to allow the search index to give the regex fewer documents to scan is a plus.
  • In general, put as much relevant plain text in your regex as possible. We use plain text trigrams to accelerate the regex search, so we're limiting the regexes to scan only documents with "Per", "ers", "rso", etc. In this case it doesn't help much because the plain text in the regex is almost the same as the non-regex search terms, though the regex is case sensitive, so the "Per" will filter the list down a bit more.
  • Regexes are case sensitive unless you tell them not to be with /i at the end, so this won't match sentences that start with "He attended", like Jamie How. Rather than let everything be case insensitive, I'd use [Hh] to allow that one character to match upper or lower case, since here it still leaves a lot of plain text trigrams in place. (I don't recall whether the trigram processing looks into simple character classes like [Hh] and makes trigrams across them, so don't count on it.)
  • The regex suggested above doesn't quite guarantee what you want, only that "he attended" appears after the Personal life section title. It could be in a different section, as in the case of Jeff Grub, where it's in the section after the Personal life section.
    • You can instead match "not equal sign" with [^=]* though it is more expensive. It will also fail to match if there is an equal sign in the Personal life section before the "he attended" part. Unlikely, but possible. It will also exclude results with a sub-section under Personal life that contains "he attended", which is probably not desirable.
  • Also, the extra \=\=... at the end requires another same-level section after the Personal life section, so it won't match if Personal life is the very last section, which may be unlikely, but is not true for all section titles. I'd drop it.
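
To make the trigram point a bit more concrete, here is a small Python sketch that lists the plain-text trigrams of a literal string. It only illustrates the idea of three-character pieces; how the search backend actually extracts and uses them is not shown here, and the function name is made up for the example.

```python
def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


literal = "Personal life"
grams = trigrams(literal)
print(grams[:3])  # ['Per', 'ers', 'rso']

# Trigrams keep their case, so "Per" is present but "per" is not.
print("Per" in grams, "per" in grams)  # True False
```

The more literal text a regex contains, the more such pieces there are to narrow down the set of documents before the regex itself runs.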

If you don't need a perfect list, but just a short list of ~100 articles you could review manually to find what you need, I'd recommend these two for this use case (link for ..he.. and ..she.. queries):

"personal life" "he attended" insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/

and..

"personal life" "she attended" insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/

Both of these queries finish on English Wikipedia, and return ~750 to ~1600 results. Still a lot, almost certainly with false positives, but potentially manageable.
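
To see where the false positives come from, the patterns can be tried offline on made-up wikitext. The sketch below uses Python's re module, which is not the same regex dialect as on-wiki search (for example, the DOTALL flag is needed here so that "." crosses line breaks), so it only illustrates the logic. The sample texts and variable names are invented.

```python
import re

# Loose pattern, modelled on the first recommended query.
loose = re.compile(r"==\s*Personal life\s*==.*?[hH]e attended", re.DOTALL)
# Stricter variant using "not equal sign", as discussed in the list above.
strict = re.compile(r"==\s*Personal life\s*==[^=]*[hH]e attended")

in_section = "== Personal life ==\nHe attended Example College.\n"
later_section = (
    "== Personal life ==\nBorn in 1970.\n"
    "== Career ==\nLater he attended Example College.\n"
)
no_heading = "== Career ==\nhe attended Example College.\n"

for name, text in [
    ("in_section", in_section),
    ("later_section", later_section),
    ("no_heading", no_heading),
]:
    print(name, bool(loose.search(text)), bool(strict.search(text)))
```

The loose pattern also matches the second sample, where the phrase sits in the following section, while the stricter one does not, at the price of the limitations already mentioned for [^=]*.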

EDIT: I'm sure there is still some way to further improve the regex. There always is! Hopefully this helps, though.

Why is '#' a regex special character?

AlanM1 (talkcontribs)
PerfektesChaos (talkcontribs)

elastic.co suggests that this is used to introduce a comment, especially in a multiline environment. It has no syntactic meaning like any other reserved character.

In our environment, an unescaped # breaks any regex, apparently returning no results at all.

197.235.216.151 (talkcontribs)

Simply because CirrusSearch uses Elasticsearch, which in turn uses Lucene, and Lucene defines its own regex syntax that interprets # as an "empty language": https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
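
If you build regex searches from literal text, a small helper can escape such characters for you. The Python sketch below is only an illustration: the set of reserved characters is an assumption based on the operators listed in the Elasticsearch regexp documentation, the slash is added on the assumption that it has to be escaped inside insource:/.../, and the function name is made up.

```python
# Assumed reserved characters (standard and optional operators), plus "/"
# because it delimits the regex in insource:/.../ (also an assumption).
RESERVED = set('.?+*|{}[]()"\\#@&<>~/')


def escape_for_insource(literal: str) -> str:
    """Backslash-escape every assumed-reserved character in literal."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)


print(escape_for_insource("C# and F#"))  # C\# and F\#
print("insource:/" + escape_for_insource("issue #12") + "/")
```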

How to stop all namespaces from displaying in advanced search

216.194.21.34 (talkcontribs)

When I click the magnifying glass in the search box the advanced search is displayed and all namespaces are listed. I don't want to see the entire list of namespaces. Is there a way to turn this off or have this toggle?

Tacsipacsi (talkcontribs)

I don’t understand you. The point of the advanced search is to show all namespaces so that you can choose which ones you want to search in. What would you expect from it, if not showing all namespaces?

Search Results page - default tab

216.194.21.34 (talkcontribs)

I have MediaWiki 1.29 installed along with the CirrusSearch extension. After setting up the extension, all search results open on the Advanced tab rather than the Content pages tab, as they did with the default search feature. Is there a way to configure CirrusSearch to show the Content pages tab of the Search Results page by default?

Thanks,

Tom