Help talk:CirrusSearch

Issue in English version breaks German translation

Summary by Speravir

An issue caused by improper tag pairing of ref and translate tags caused an incomplete display of the German translation.

Speravir (talkcontribs)

At least the German translation is broken: it currently has 14 sections, while the English one has 16. In German, the English sections 7 and 8 (Page weighting and Regular expression searches) are completely missing, and I know they are translated, because I was the one who did most of it. The Filters section has only 2 subsections in German, but 7 in English. (Edit: Oh, in fact parts of the Regular expressions section are present, but displayed as part of the Filters section.) So I assume that somewhere in or after the second Filters subsection (Deepcategory) there is a missing or misplaced <translate> tag (or maybe, vice versa, there is one too many somewhere before).

FWIW, the translation was broken by this edit: Special:Diff/2780543/2787288 on 21 May 2018, but the latest update for translation before that was on 27 December 2017, hence the mistake(s) must have occurred at some point in this period.

DTankersley (WMF) (talkcontribs)

Can you provide some samples of where this is happening, please?

Speravir (talkcontribs)

Sorry, but I do not understand what you want from me. What kind of samples? Or: What do you not get from my first message? Simply compare the content overview of Help:CirrusSearch with that of Help:CirrusSearch/de, especially in the section Filters (or Filter in German). Note also that the section Geo Search is the ninth in English, but at the moment the seventh in German (as Geosuche). Regarding my edit above: in German, section 6.2.1 currently belongs to Regex searches; in English this is in fact section 8.2. Looking more carefully shows that already some lines before 6.2.1 do not belong to section 6.2: everything starting from the box is for RegEx (you can see this string in the box).

Wargo (talkcontribs)
Speravir (talkcontribs)

I actually noticed the error message I got in the German help version, but thought it was just caused by a translation change (this has happened before, too). Because of the bigger issue I didn’t want to start a new translation, though. I've now taken a look at the English text of this passage and decided to take a cleaner approach: Special:Diff/2883110/2887712. Could someone please mark the English text for translation, so we can see whether this was the reason for the mess?

If so, this edit broke everything: Special:Diff/2704479/2714274. I just rearranged the tags to the state before this.

Wargo (talkcontribs)

YesY Done

Speravir (talkcontribs)

Thanks, Wargo. This was indeed the reason. Sorry for all the noise.
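
For anyone chasing a similar problem, here is a minimal sketch of a pairing check for translate and ref tags. It assumes that the two tag types have to nest properly inside each other; the function name and the regex are illustrative only and are not part of the Translate extension.

```python
import re

# Matches opening, closing and self-closing <translate> and <ref> tags.
# The word boundary keeps <references/> from being mistaken for <ref>.
TAG = re.compile(r"<(/?)(translate|ref)\b[^>]*?(/?)>")


def find_pairing_errors(wikitext: str) -> list[str]:
    """Return a list of messages for tags that are not properly nested."""
    stack = []   # currently open tags as (name, offset) pairs
    errors = []
    for match in TAG.finditer(wikitext):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # e.g. <ref name="x" /> opens nothing
        if not closing:
            stack.append((name, match.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}> at offset {match.start()}")
    errors.extend(f"unclosed <{name}> at offset {pos}" for name, pos in stack)
    return errors


# A ref that starts inside a translate block but ends outside of it.
print(find_pairing_errors("<translate>Text<ref>Note</translate></ref>"))
```

An empty list means that every tag was closed in the order it was opened.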

Truncated search and results snippet

Loman87 (talkcontribs)

I was wondering why, when doing a truncated search using the asterisk, matching search results aren't shown in bold as in a simple search; instead, the page snippet shows only the first lines of the page containing the occurrence. Is this an issue or is it the expected behaviour? My wiki is at this link.
Other than this, I was wondering whether there are plans to improve the search results page, especially regarding the readability of the page snippets (sometimes wiki markup interferes with reading, especially when pages have a content model other than wikitext) and linking to pages (currently the results link to a page containing the search term, but when a user arrives at that page they have to search for the word again using Ctrl+F). I am not a developer, so I am just asking the experts for their technical opinion.
Thanks for your answer.

PerfektesChaos (talkcontribs)

If there is a short matching phrase, it is shown in bold.

  • If not, the beginning of visible text is displayed.
  • That depends on the words and whether they occur close together.

Improvements may arrive year after year. There are many ideas.

Is it possible to search by section?

Karlpoppery (talkcontribs)

For example, if I have a section called "Etymology" on many of my pages, would it be possible to search only the text that appears in the etymology sections?

If there's no option to do that, is it something that could be done with regex in a reasonable time?

(talkcontribs)

No, it isn't possible to do it directly.

Yes, it is theoretically possible to use regex, but in many cases it will timeout.

Karlpoppery (talkcontribs)

I see, that's a bummer. There must be ways to hack around this, though. For example I could automatically save those sections in their own articles under the same category, then do a search by category and redirect the result to the original article. Or maybe I could use Cargo

(talkcontribs)

Such hacks will work, but they seem like a lot of effort. The way to make a timeout less likely is to use an efficient regex, along with a simplified text. For example:

"personal life" insource:/\=\=\s*Personal life\s*\=\=.*?he attended.*?\=\=\w*?\=\=.*?/

The search above attempts to find "he attended" within a section, and because it is simplified it will fail in many cases, for example, if an article only contains one section or if a section is created by a template. Sections are simply too complicated to deal with, because people think of them as sub-documents but in reality they are just pieces of the same document. A related feature request is

Note that the query above results in a timeout.

TJones (WMF) (talkcontribs)

Regexes in general are tricky, and making them work efficiently in on-wiki search is hard, too. So, a few notes:

  • Put as many required words outside the regex as possible. "personal life" gets about 150K results. "personal life" attended only gets 50K, which cuts the number of documents the regex has to scan to a third. "personal life" "he attended" gets only 13K. The regex matches both "he attended" and "she attended", but I would suggest searching them separately, since "personal life" "she attended" only gets 5K results, and splitting your search into two queries against ~18K results in total is better than one query against 50K.
    • Any other non-regex info, like categories, also helps. Anything to allow the search index to give the regex fewer documents to scan is a plus.
  • In general, put as much relevant plain text in your regex as possible. We use plain text trigrams to accelerate the regex search, so we're limiting the regexes to scan only documents with "Per", "ers", "rso", etc. In this case it doesn't help much because the plain text in the regex is almost the same as the non-regex search terms, though the regex is case sensitive, so the "Per" will filter the list down a bit more.
  • Regexes are case sensitive unless you tell them not to be with /i at the end, so this won't match sentences that start with "He attended", like Jamie How. Rather than let everything be case insensitive, I'd use [Hh] to allow that one character to match upper or lower case, since here it still leaves a lot of plain text trigrams in place. (I don't recall whether the trigram processing looks into simple character classes like [Hh] and makes trigrams across them, so don't count on it.)
  • The regex suggested above doesn't quite guarantee what you want, only that he attended appears after the Personal life section title—it could be in a different section, as in the case of Jeff Grub, where it's in the section after the Personal life section.
    • You can instead match "not equal sign" with [^=]* though it is more expensive. It will also fail to match if there is an equal sign in the Personal life section before the "he attended" part. Unlikely, but possible. It will also exclude results with a sub-section under Personal life that contains "he attended", which is probably not desirable.
  • Also, the extra \=\=... at the end requires another same-level section after the Personal life section, so it won't match if Personal life is the very last section, which may be unlikely, but is not true for all section titles. I'd drop it.
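
To make the trigram point above more concrete, here is a small conceptual sketch. It only illustrates the idea of using trigrams as a cheap pre-filter; it is not the actual CirrusSearch implementation, and the function names are made up for this example.

```python
def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def may_match(document: str, literal: str) -> bool:
    """Cheap pre-filter: a document can only contain `literal`
    if it contains every trigram of `literal` (case sensitive)."""
    return all(gram in document for gram in trigrams(literal))


print(trigrams("Personal")[:3])  # ['Per', 'ers', 'rso']

# Only documents that pass the pre-filter need the expensive regex scan.
print(may_match("== Personal life ==\nHe attended ...", "Personal life"))
```

The more plain text the regex contains, the more trigrams there are to filter on, and the fewer documents are left for the regex itself.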

If you don't need a perfect list, but just a short list of ~100 articles you could review manually to find what you need, I'd recommend these two for this use case (link for ..he.. and ..she.. queries):

"personal life" "he attended" insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/


"personal life" "she attended" insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/

Both of these queries finish on English Wikipedia, and return ~750 to ~1600 results. Still a lot, almost certainly with false positives, but potentially manageable.
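
If you need the same kind of query for other section titles or phrases, it can be generated. The following is a rough sketch that reproduces the two queries above; the function name is made up, and it assumes that both arguments contain only letters and spaces (anything else would need regex escaping first).

```python
def section_phrase_query(section_title: str, phrase: str) -> str:
    """Build a search query with plain-text terms in front of the regex.

    Assumption: section_title and phrase contain only letters and spaces.
    """
    first, rest = phrase[0], phrase[1:]
    # Let only the first character match in upper or lower case.
    char_class = f"[{first.lower()}{first.upper()}]"
    regex = rf"\=\=\s*{section_title}\s*\=\=.*?{char_class}{rest}"
    # The quoted plain-text terms let the index narrow down the documents
    # before the regex has to scan anything.
    return f'"{section_title.lower()}" "{phrase}" insource:/{regex}/'


print(section_phrase_query("Personal life", "he attended"))
print(section_phrase_query("Personal life", "she attended"))
```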

EDIT: I'm sure there is still some way to further improve the regex. There always is! Hopefully this helps, though.

Why is '#' a regex special character?

AlanM1 (talkcontribs)
PerfektesChaos (talkcontribs) suggests that this is used to introduce a comment, especially in a multiline environment. It has no syntactic meaning like any other reserved character.

In our environment it breaks any RegExp, apparently with no result ever if not escaped.

(talkcontribs)

Simply because CirrusSearch uses Elasticsearch, which in turn uses Lucene, and Lucene defines its own regex syntax that interprets # as the "empty language".
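
In practice this means that # has to be escaped with a backslash whenever it should be matched literally. Below is a minimal sketch of a helper that does this for a piece of literal text. The helper name is made up, and the character list is an assumption based on the reserved characters of the Lucene regex syntax, plus the slash that delimits insource:/.../.

```python
# Characters with a special meaning in Lucene regular expressions,
# including optional operators such as # (empty language).
# The slash is added because it delimits insource:/.../ (assumption).
RESERVED = set('.?+*|{}[]()"\\#@&<>~/')


def escape_lucene_regex(literal: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)


print(escape_lucene_regex("C#"))  # C\#
print("insource:/" + escape_lucene_regex("C#") + "/")
```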

How to stop all namespaces from displaying in advanced search

2 (talkcontribs)

When I click the magnifying glass in the search box the advanced search is displayed and all namespaces are listed. I don't want to see the entire list of namespaces. Is there a way to turn this off or have this toggle?

Tacsipacsi (talkcontribs)

I don’t understand you. The point of the advanced search is to show all namespaces so that you can choose which ones you want to search in. What would you expect from it, if not showing all namespaces?

Search Results page - default tab

1 (talkcontribs)

I have MediaWiki 1.29 installed along with the CirrusSearch extension. After setting up the extension all Search Results open to the Advanced tab rather than the Content pages tab as it did with the default search feature. Is there a way to configure CirrusSearch to show by default the Content pages tab of the Search Results page?



Reply to "Search Results page - default tab"
Wuestenarchitekten (talkcontribs)


MediaWiki 1.30.0
PHP 7.0.22
MySQL 5.7.21-0ubuntu0.16.04.1
ICU 55.1
Elasticsearch 5.4.3
Lua 5.1.5

CirrusSearch (0.2) and Elastica (

CirrusSearch seems to be working but the CompletionSuggester is not functional at all. While there is an article called "Movie", CirrusSearch doesn't suggest anything if I look for a slightly misspelled "Movei".

After the installation I did not make any changes to the configuration but followed the CirrusSearch README instructions.

Anything I'm missing?

Wuestenarchitekten (talkcontribs)

Sorry, I didn't see the answer earlier further down below:

The completion suggester is enabled by setting $wgCirrusSearchUseCompletionSuggester to 'yes'.

Loading wiki cirrus index dumps into elastic search

Ksha run (talkcontribs)

I am having a hard time loading index dumps using the instructions from the Elastic blog. It looks like others are too.

I am trying to do something very basic - load wikiquotes/wikivoyage index and run simple elastic search queries. There are a lot of docs saying a lot of different things but looks like I just need clarity on 3 pieces of info

[1] Version of elasticsearch to use? I was trying things out with 6.3 and I can't figure out what version is being used by wikipedia or where to find the version number.

[2] Do I need to add plugins after Elasticsearch is installed (I am on Ubuntu 16.04)? Some docs say analysis-icu and others point at other plugins. I can't figure out what these plugins are doing, so I am not sure if they are the source of my errors.

[3] Creating the index initially needs a mapping file and a settings file, which seem to be very different for Elasticsearch 2, 5, and 6. I tried using these Mapping & Setting files, but it looks like they are from a much older version. Is there a place where I can find the current files and just drop them into my basic single-node, no-replication setup?

Thanks in advance for any assistance!

Tacsipacsi (talkcontribs)
Reply to "Loading wiki cirrus index dumps into elastic search"

How to exclude redirects from search results?

Automatik (talkcontribs)

Hi. Is there any way to exclude redirects from search results? I want, e.g., to find entries that are not redirections and that contain some character in their title. How to do that?

TJones (WMF) (talkcontribs)

Unfortunately, there's no easy way to exclude redirects from search results.

However, depending on the scope of the task you are trying to complete and your technical ability you could try to use the Search API to semi-automatically do what you need.

This query will give you back the top results with "English Wikipedia" in the title or a redirect:

The default format is JSON converted to HTML so it's easy to read for a human, but hard to read for a computer. If you only have a small number of queries to deal with, and only need a limited number of results from each (up to 500—set by srlimit), you might be able to get what you need by getting these results and looking through the titles by hand.

If you need a computer to process the results for you, say, because you have many queries, you can get real JSON by adding &format=json:

On a Unix-like command line (I'm working in Terminal on OS X) you can use curl to fetch the JSON, python to make it pretty, and grep to pull out the titles, and grep again to find the specific ones you want:

curl -s "" | python -m json.tool | grep "\"title\":" | grep -i "english wikipedia"

Note that the API URL is URL-encoded (spaces become %20, quotes become %22, etc.).


   "title": "English Wikipedia",
   "title": "Simple English Wikipedia",
   "title": "Notability in the English Wikipedia",

The results aren't pretty, and in this case there are only 8 results total and 3 that are not redirects. If you are searching for specific characters, you may need to do some more pre-processing before the final grep. (If you are searching for "e", everything will match, because "title" has an "e" in it, for example.) If you need to go through more than the top 500 results, you'll have to figure out how to get the API to give you additional results, etc.
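
The same steps can also be sketched in Python instead of curl and grep. This is only an illustration: the endpoint and the example query string are assumptions (the original links are not reproduced here), and the helper names are made up. The parameters action=query, list=search, srsearch, srlimit, sroffset and format=json are the ones discussed in this thread.

```python
from urllib.parse import quote, urlencode

# Assumption: English Wikipedia's API endpoint; use your own wiki's api.php.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(query: str, limit: int = 500, offset: int = 0) -> str:
    """Return a URL-encoded search API request that asks for real JSON."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "sroffset": offset,
        "format": "json",
    }
    # quote_via=quote encodes spaces as %20 and quotes as %22.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def titles_containing(response: dict, needle: str) -> list[str]:
    """Keep only hits whose own title contains `needle` (case-insensitive).

    Hits that matched only through a redirect are dropped this way,
    just like with the final grep above.
    """
    hits = response.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits if needle.lower() in hit["title"].lower()]


# Hypothetical example query, not necessarily the one used above.
print(build_search_url('intitle:"English Wikipedia"'))
```

The response itself can be fetched with any HTTP client and parsed with json.loads before passing it to titles_containing. Filtering on the parsed "title" field also avoids the problem of matching the word "title" itself.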

It's not pretty and it's not easy, but it's a start.

Automatik (talkcontribs)

Thanks for this answer. It is clearly not easy or convenient, and pretty similar to running the query manually (then filtering visually with CTRL+F "(redirection" and picking only the results without the "(redirection" text highlighted). Developers should add an option "do not follow redirects", to avoid tedious work for all users using this functionality. I guess it is not so difficult, as this option already exists in some use cases (e.g. when displaying a page with &redirect=no).

TJones (WMF) (talkcontribs)

It is very similar to the ctrl-F solution, just more automatic! For me, somewhere around 25 to 50 queries it would be faster (or at least less boring and thus less error-prone) to go for a hacked-together semi-automatic solution.

Adding a title-only index is probably not a trivial change to make from our current state. We have a search index for intitle:, with the text from titles and redirects in it. There's no differentiation between the title and redirect text once it's in the index. I think we'd have to create another field that was title-only (and maybe a redirect-only field would be equally useful—which together would be bigger than the size of the current title index).

It's not clear to me how many people would need such an index. I'm really curious what your use case is—both to get a sense of how useful title-only search would be, and to see if there's a better clever way to get what you need.

You could open a Phabricator ticket and ask for this feature, but that certainly doesn't guarantee that it would be implemented any time soon.

Automatik (talkcontribs)

On the French Wiktionary, we use the typographic apostrophe in titles, instead of the typewriter/vertical apostrophe. I was looking for titles that use the vertical apostrophe, without being a redirection.

Moreover, I am using Windows, which is less convenient than a Unix-like system for command-line tools (unclear documentation, no unified way to run commands, etc.).

TJones (WMF) (talkcontribs)

Ah.. that's a sensible use case. No other obvious solution comes to mind, but I'll think about it more and if I think of anything useful I'll let you know.

If you are already familiar with Unix-like commands (or want to learn), but just don't have them available because you are on Windows, you could look at Cygwin (English WP, French WP, website)—it's not an emulator or virtual machine, it just gives you versions of standard Unix commands that work on Windows. I used it about 15 years ago when I had a Windows machine for my job. I found it very useful back then, but haven't used it since.

Automatik (talkcontribs)

Thanks for the advice; however, the bash terminal from Cygwin does not work (and the solution suggested in does not work out either). Moreover, now that I have installed the program, I cannot uninstall it anymore (at least, not easily), as it does not appear in "Programs and features", and when I click "Uninstall" from a right click on the program icon, it just opens the "Programs and features" window anyway.

TJones (WMF) (talkcontribs)

Oh no! I should have known better than to suggest software I haven't used in so long—but it was so nice back in the day. I haven't used Windows in almost 15 years either, so I don't really have any helpful advice. Crap, I'm sorry!

Automatik (talkcontribs)

No worries: I "uninstalled" it by removing its folders, and re-installed it using another repository, and now it works! Thanks for the tip then. To look for more than 500 results, I added the &sroffset=500 parameter (then 1000, 1500,... until no results are found)
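
The paging described above (sroffset=500, then 1000, 1500, and so on until no results are found) can be sketched as a small loop. The function and parameter names are hypothetical; fetch_page stands for whatever code performs one API request with the given sroffset and returns the list of hits, and the upper bound is only an assumption that can be adjusted.

```python
from typing import Callable, Dict, List


def collect_all_hits(
    fetch_page: Callable[[int], List[Dict]],
    page_size: int = 500,
    max_offset: int = 10000,  # assumed upper bound, adjust as needed
) -> List[Dict]:
    """Call fetch_page(offset) with growing offsets until a page is empty."""
    hits: List[Dict] = []
    offset = 0
    while offset < max_offset:
        page = fetch_page(offset)
        if not page:
            break  # no more results
        hits.extend(page)
        offset += page_size
    return hits


# Stand-in for real API calls: 1200 fake hits served in slices of 500.
fake_results = [{"title": f"Page {i}"} for i in range(1200)]
all_hits = collect_all_hits(lambda offset: fake_results[offset:offset + 500])
print(len(all_hits))  # 1200
```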

Speravir (talkcontribs)

Oh, slightly funny: Unaware of this thread I recently opened a ticket on Phabricator: phab:T204089.

(talkcontribs)

It seems that it used to be possible to filter redirects at some point, and this was removed,

It seems developers are confused and going back and forth between removing and readding redirects to search. As the old saying goes, "clients don't know what they want". Anyway, a more sensible approach would be a degree of faceting, where it returns all results but aggregates similar properties, e.g. many pages will be in the same category, or many pages will be redirects, disambiguations, poor quality stubs, etc...

It is probably simpler to resolve this using the API, since it already has options for redirect titles. There are also at most about 10000 results, so it would probably be less challenging to filter through those. Anyway, if there aren't too many search results, it is easier to include the redirect title in the API search results and use your favorite replace tool to clean up all those that don't match. This would be easier if CSV were a valid API output format.

(talkcontribs)
Speravir (talkcontribs)

(Nitpicking) @IP, apparently not: User/developer debt closed phab:T90807 as declined, but with the words “If there is more of a use case than what is in this ticket, please reopen and show examples / steps to reproduce.” Well I did not reopen, because this ticket was not found in a search for older tickets, but the same user/dev debt did not close the ticket opened by me. It seems I showed some valid use cases.

(talkcontribs)

Well, it seems more sensible to formulate it as "restore ability to remove redirects from search results". This was explicitly and deliberately removed for specific reasons.

The general problem with wikis is that they attempt to cater to two sometimes conflicting groups. Pure readers, and editors. The average reader wants the best results, and doesn't even know about the existence of redirects. An editor sometimes wants worse results because they want to address a specific problem.

There are several orders of magnitude more readers than editors, and that's likely the reason it was removed. There is no doubt that such filters have their uses, although the question is whether that justifies restoring the older functionality. Also, chances are that "debt" forgot about the older ticket, or they would likely have reopened it and marked that task as a duplicate.

Speravir (talkcontribs)

Fair enough.

Reply to "How to exclude redirects from search results?"

Tracing parameters from API to Elasticsearch

Pj quil (talkcontribs)

Hi, while trying to experiment with providing new search options I have been digging around in the Wikipedia search code, and I find this pattern a lot: some frontend extension that exposes a search feature generates an action=query API request with a bunch of params, which passes through a couple of other extensions that create the Elasticsearch query. What I am looking for is a way to simplify this debugging process. Basically, for each search feature provided I would like to see the Elasticsearch query parameters produced at the end of the pipeline.

Any suggestions or advice? Thanks!

(am not a PHP dev so apologies if I am overlooking something)

DCausse (WMF) (talkcontribs)
Pj quil (talkcontribs)

ah that made my day! Thanks :)
