Help talk:CirrusSearch



Is it possible to search by section?

Karlpoppery (talkcontribs)

For example, if I have a section called "Etymology" on many of my pages, would it be possible to search only the text that appears in the etymology sections?

If there's no option to do that, is it something that could be done with regex in a reasonable time?

197.235.75.234 (talkcontribs)

No, it isn't possible to do it directly.

Yes, it is theoretically possible to use regex, but in many cases it will time out.

Karlpoppery (talkcontribs)

I see, that's a bummer. There must be ways to hack around this, though. For example, I could automatically save those sections in their own articles under the same category, then do a search by category and redirect the result to the original article. Or maybe I could use Cargo.

197.235.219.81 (talkcontribs)

Such hacks will work, but they seem like a lot of effort. The way to make a query less likely to time out is to use an efficient regex along with a simplified text. For example:

insource:"personal life" insource:/\=\=\s*Personal life\s*\=\=.*?he attended.*?\=\=\w*?\=\=.*?/

https://en.wikipedia.org/w/index.php?search=insource%3A%22personal+life%22+insource%3A%2F%5C%3D%5C%3D%5Cs*Personal+life%5Cs*%5C%3D%5C%3D.*%3Fhe+attended.*%3F%5C%3D%5C%3D%5Cw*%3F%5C%3D%5C%3D.*%3F%2F&title=Special%3ASearch&profile=default&fulltext=1

The search above attempts to find "he attended" within a section, and because it is simplified it will fail in many cases, for example if an article only contains one section or if a section is created by a template. Sections are simply too complicated to deal with, because people think of them as sub-documents but in reality they are just pieces of the same document. A related feature request is https://phabricator.wikimedia.org/T27062.

Note that the query above results in a timeout.
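To see why the query works (and why it is fragile), the same matching logic can be sketched offline with Python's re module; the sample wikitext below is invented for illustration:

```python
import re

# Sample wikitext (invented for illustration): three sections of one page.
wikitext = """
== Early life ==
Born in 1900.

== Personal life ==
In 1918 he attended Example College.

== Career ==
Worked as an engineer.
"""

# Same idea as the insource:/.../ query: find "he attended" between the
# "Personal life" heading and the next heading.
pattern = re.compile(
    r"==\s*Personal life\s*==.*?he attended.*?==[^=]*?==",
    re.DOTALL,
)

match = pattern.search(wikitext)
print(bool(match))  # True: "he attended" occurs inside the Personal life section
```

re.DOTALL is the Python counterpart of the dot-matches-newline behavior the insource regex relies on; without it the match would stop at the first line break, and as noted above the sketch still fails for single-section pages or template-generated headings.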

Reply to "Is it possible to search by section?"

Issue in English version breaks German translation

Summary by Speravir

Improper pairing of ref and translate tags caused an incomplete display of the German translation.

Speravir (talkcontribs)

At least the German translation is broken: it currently has 14 sections, while the English one has 16. In German, the English sections 7 and 8 are missing entirely, namely the sections for Page weighting and Regular expression searches – and I know they are translated, because I was the one who did most of it. The Filters section in German has only 2 subsections, while in English there are 7. (Edit: Oh, in fact parts of the Regular expressions section are present, but displayed as part of the Filters section.) So I assume that somewhere in or after the second Filters subsection (Deepcategory) there is a missing or misplaced <translate> tag (or maybe, vice versa, one too many somewhere before).

FWIW, the translation was broken by this edit: Special:Diff/2780543/2787288 on 21 May 2018; but since the latest update for translation before that was on 27 December 2017, the mistake(s) must have occurred sometime in that period.

DTankersley (WMF) (talkcontribs)

Can you provide some samples of where this is happening, please?

Speravir (talkcontribs)

Sorry, but I do not understand what you want from me. What kind of samples? Or: what do you not get from my first message? Simply compare the table of contents of Help:CirrusSearch with Help:CirrusSearch/de, especially in section Filters (or Filter in German). Note also that the Geo Search section is the ninth in English, but at the moment the seventh in German (as Geosuche). Regarding my edit above: in German, section 6.2.1 currently belongs to Regex searches, while in English this is in fact section 8.2 – looking more carefully shows that already some lines before 6.2.1 do not belong to section 6.2: everything starting from the box is for RegEx (you can see this string in the box).

Wargo (talkcontribs)
Speravir (talkcontribs)

I actually noted the error message I got in the German help version, but thought it was just caused by a translation change (this has happened before, too). Because of the bigger issue I didn't want to start with a new translation, though. I've now taken a look into the English text at this passage and decided to make a cleaner approach: Special:Diff/2883110/2887712. Could someone please mark the English text for translation, so we can see whether this was the reason for the mess.

If so, this edit broke everything: Special:Diff/2704479/2714274. I just rearranged the tags to the state before this.

Wargo (talkcontribs)

Done

Speravir (talkcontribs)

Thanks, Wargo. This was indeed the reason. Sorry for all the noise.

Reply to "Issue in English version breaks German translation"
Wuestenarchitekten (talkcontribs)

Hi

MediaWiki 1.30.0
PHP 7.0.22
MySQL 5.7.21-0ubuntu0.16.04.1
ICU 55.1
Elasticsearch 5.4.3
Lua 5.1.5

CirrusSearch (0.2) and Elastica (1.3.0.0)

CirrusSearch seems to be working, but the CompletionSuggester is not functional at all. While there is an article called "Movie", CirrusSearch doesn't suggest anything if I search for the slightly misspelled "Movei".

After the installation I did not make any changes to the configuration but followed the CirrusSearch README instructions.

Anything I'm missing?

Wuestenarchitekten (talkcontribs)

Sorry, I didn't see the answer further down below:

The completion suggester is enabled by setting $wgCirrusSearchUseCompletionSuggester = 'yes'; in LocalSettings.php.

Loading wiki cirrus index dumps into elastic search

3
Ksha run (talkcontribs)

I am having a hard time loading index dumps using the instructions from the Elastic blog. It looks like others are too.

I am trying to do something very basic: load the wikiquotes/wikivoyage index and run simple Elasticsearch queries. There are a lot of docs saying a lot of different things, but it looks like I just need clarity on 3 pieces of info:

[1] Which version of Elasticsearch should I use? I was trying things out with 6.3, and I can't figure out what version is being used by Wikipedia or where to find the version number.

[2] Do I need to add plugins after Elasticsearch is installed (I am on Ubuntu 16.04)? Some docs say analysis-icu and others point at https://github.com/wikimedia/search-extra. I can't figure out what these plugins are doing, so I'm not sure if they are the source of my errors.

[3] Creating the index initially needs a mapping and a settings file, which seem to be very different for Elasticsearch 2, 5, and 6. I tried using these Mapping & Setting files, but it looks like they are from a much older version. Is there a place where I can find the current files and just drag and drop them into my basic single-node, no-replication setup?

Thanks in advance for any assistance!

Tacsipacsi (talkcontribs)
Reply to "Loading wiki cirrus index dumps into elastic search"
Automatik (talkcontribs)

Hi. Is there any way to exclude redirects from search results? I want, e.g., to find entries that are not redirects and that contain some character in their title. How can I do that?

TJones (WMF) (talkcontribs)

Unfortunately, there's no easy way to exclude redirects from search results.

However, depending on the scope of the task you are trying to complete and your technical ability you could try to use the Search API to semi-automatically do what you need.

This query will give you back the top results with "English Wikipedia" in the title or a redirect:

https://en.wikipedia.org/w/api.php?action=query&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

The default format is JSON converted to HTML, so it's easy for a human to read but hard for a computer. If you only have a small number of queries to deal with, and only need a limited number of results from each (up to 500, set by srlimit), you might be able to get what you need by fetching these results and looking through the titles by hand.

If you need a computer to process the results for you, say, because you have many queries, you can get real JSON by adding &format=json:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

On a Unix-like command line (I'm working in Terminal on OS X) you can use curl to fetch the JSON, python to make it pretty, grep to pull out the titles, and grep again to find the specific ones you want:

curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=50&srsearch=intitle:%22english%20wikipedia%22" | python -m json.tool | grep "\"title\":" | grep -i "english wikipedia"

Note that the API URL is URL-encoded (spaces become %20, quotes become %22, etc.).
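The encoding can be reproduced with Python's standard urllib.parse.quote (a small aside, not part of the original thread):

```python
from urllib.parse import quote

# Build the srsearch parameter the same way the URLs above do:
# ":" becomes %3A, quotes become %22, and spaces become %20.
query = 'intitle:"english wikipedia"'
encoded = quote(query)
print(encoded)  # intitle%3A%22english%20wikipedia%22
```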

Results:

   "title": "English Wikipedia",
   "title": "Simple English Wikipedia",
   "title": "Notability in the English Wikipedia",

The results aren't pretty, and in this case there are only 8 results total and 3 that are not redirects. If you are searching for specific characters, you may need to do some more pre-processing before the final grep. (If you are searching for "e", everything will match, because "title" has an "e" in it, for example.) If you need to go through more than the top 500 results, you'll have to figure out how to get the API to give you additional results, etc.

It's not pretty and it's not easy, but it's a start.
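The curl/python/grep pipeline above can also be done in a single short Python script. This sketch substitutes a miniature hand-written response for the live API call, but the shape matches the real JSON (query.search is a list of objects with a title key):

```python
import json

# A miniature stand-in for the API's JSON response (same shape as the real one).
api_response = json.loads("""
{"query": {"search": [
    {"title": "English Wikipedia"},
    {"title": "Simple English Wikipedia"},
    {"title": "Notability in the English Wikipedia"}
]}}
""")

# Equivalent of the two greps: pull out the titles, keep the matching ones.
titles = [hit["title"] for hit in api_response["query"]["search"]]
matches = [t for t in titles if "english wikipedia" in t.lower()]
print(matches)  # all three sample titles match
```

In a real script the api_response would come from fetching the URL shown above; everything else stays the same.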

Automatik (talkcontribs)

Thanks for this answer. It is clearly not easy or convenient, and pretty similar to running the query manually (then filtering visually with Ctrl+F for "(redirection" and picking only the results where that text is not highlighted). Developers should add a "do not follow redirects" option, to spare all users of this functionality the tedious work. I guess it is not so difficult, as this option already exists in some cases (e.g. when displaying a page with &redirect=no).

TJones (WMF) (talkcontribs)

It is very similar to the ctrl-F solution, just more automatic! For me, somewhere around 25 to 50 queries it would be faster (or at least less boring and thus less error-prone) to go for a hacked-together semi-automatic solution.

Adding a title-only index is probably not a trivial change to make from our current state. We have a search index for intitle:, with the text from titles and redirects in it. There's no differentiation between the title and redirect text once it's in the index. I think we'd have to create another field that was title-only (and maybe a redirect-only field would be equally useful—which together would be bigger than the size of the current title index).

It's not clear to me how many people would need such an index. I'm really curious what your use case is—both to get a sense of how useful title-only search would be, and to see if there's a better clever way to get what you need.

You could open a Phabricator ticket and ask for this feature, but that certainly doesn't guarantee that it would be implemented any time soon.

Automatik (talkcontribs)

On the French Wiktionary, we use the typographic apostrophe in titles, instead of the typewriter/vertical apostrophe. I was looking for titles that use the vertical apostrophe, without being a redirection.

Moreover, I am using Windows, which is less convenient than a Unix-like command line when it comes to command-line tools (unclear documentation, no unified way to run commands, etc.).
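For this use case the character test itself is cheap once the titles are in hand. A hypothetical filter (titles invented for illustration) might look like:

```python
# Titles invented for illustration: typographic apostrophe (U+2019)
# vs the typewriter/vertical apostrophe (U+0027).
titles = ["l\u2019heure", "l'heure", "aujourd\u2019hui"]

# Keep only titles containing the vertical apostrophe.
vertical = [t for t in titles if "'" in t]
print(vertical)  # ["l'heure"]
```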

TJones (WMF) (talkcontribs)

Ah, that's a sensible use case. No other obvious solution comes to mind, but I'll think about it more, and if I come up with anything useful I'll let you know.

If you are already familiar with Unix-like commands (or want to learn), but just don't have them available because you are on Windows, you could look at Cygwin (English WP, French WP, website)—it's not an emulator or virtual machine, it just gives you versions of standard Unix commands that work on Windows. I used it about 15 years ago when I had a Windows machine for my job. I found it very useful back then, but haven't used it since.

Automatik (talkcontribs)

Thanks for the advice; however, the bash terminal from Cygwin does not work (and the solution suggested in https://superuser.com/questions/1172759/cygwin-error-failed-to-run-bin-bash-no-such-file-or-directory does not work out either). Moreover, now that I have installed the program, I cannot uninstall it anymore (at least, not easily), as it does not appear in "Programs and features", and when I click "Uninstall" from a right-click on the program icon, it just opens the "Programs and features" window anyway.

TJones (WMF) (talkcontribs)

Oh no! I should have known better than to suggest software I haven't used in so long—but it was so nice back in the day. I haven't used Windows in almost 15 years either, so I don't really have any helpful advice. Crap, I'm sorry!

Automatik (talkcontribs)

No worries: I "uninstalled" it by removing its folders and re-installed it from another repository, and now it works! Thanks for the tip, then. To look through more than 500 results, I added the &sroffset=500 parameter (then 1000, 1500, ... until no results are found).
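The sroffset walk described above can be sketched as a loop; fetch_page here is a stand-in for the real API call, so the page size and total are invented:

```python
def fetch_page(sroffset, page_size=500, total=1234):
    """Stand-in for an API call with &sroffset=<n>; returns fake titles.
    A real implementation would fetch
    .../w/api.php?action=query&list=search&srlimit=500&sroffset=<n>&..."""
    return [f"Title {i}" for i in range(sroffset, min(sroffset + page_size, total))]

# Walk through the result set 500 at a time until a page comes back empty.
all_titles = []
offset = 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    all_titles.extend(page)
    offset += 500

print(len(all_titles))  # 1234
```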

Speravir (talkcontribs)

Oh, slightly funny: Unaware of this thread I recently opened a ticket on Phabricator: phab:T204089.

197.235.98.211 (talkcontribs)

It seems that it used to be possible to filter redirects at some point, and this was removed https://phabricator.wikimedia.org/T5174, https://phabricator.wikimedia.org/rMW52e699441edf2958701cea692a5dc3243ec3b064.

It seems the developers are confused, going back and forth between removing and re-adding redirects to search. As the old saying goes, "clients don't know what they want". Anyway, a more sensible approach would be a degree of faceting, where the search returns all results but aggregates similar properties; e.g. many pages will be in the same category, or many pages will be redirects, disambiguations, poor-quality stubs, etc.

It is probably simpler to resolve this using the API, since it already has options for redirect titles. There are also at most about 10000 results, so it would probably be less challenging to filter through those. Anyway, if there aren't too many search results, it is easier to include the redirect title in the API search results and use your favorite replace tool to clean up all those that don't match, e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=shakespeare&srlimit=500&srprop=redirecttitle . This would be easier if CSV were a valid API output format.
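With srprop=redirecttitle, each hit that matched via a redirect carries a redirecttitle field, so filtering them out is a one-liner. A sketch with invented sample data, assuming the field name shown in the API sandbox link above:

```python
# Miniature stand-in for query.search with srprop=redirecttitle:
# hits that matched through a redirect carry a "redirecttitle" key.
hits = [
    {"title": "William Shakespeare"},
    {"title": "William Shakespeare", "redirecttitle": "Shakespear"},
    {"title": "Hamlet"},
]

# Keep only pages that matched directly, not via a redirect.
direct = [h["title"] for h in hits if "redirecttitle" not in h]
print(direct)  # ['William Shakespeare', 'Hamlet']
```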

197.235.98.211 (talkcontribs)
Speravir (talkcontribs)

(Nitpicking) @IP, apparently not: user/developer debt closed phab:T90807 as declined, but with the words "If there is more of a use case than what is in this ticket, please reopen and show examples / steps to reproduce." Well, I did not reopen it, because that ticket did not turn up in a search for older tickets; but the same user/dev debt did not close the ticket opened by me. It seems I showed some valid use cases.

197.235.98.211 (talkcontribs)

Well, it seems more sensible to formulate it as "restore the ability to remove redirects from search results". This was explicitly and deliberately removed for specific reasons.

The general problem with wikis is that they attempt to cater to two sometimes conflicting groups: pure readers and editors. The average reader wants the best results and doesn't even know that redirects exist. An editor sometimes wants worse results because they want to address a specific problem.

There are several orders of magnitude more readers than editors, and that's likely the reason it was removed. There is no doubt that such filters have their uses, although the question is whether that justifies restoring the older functionality. Also, chances are that "debt" simply forgot about the older ticket, or they would likely have reopened it and marked the new one as a duplicate.

Speravir (talkcontribs)

Fair enough.

Reply to "How to exclude redirects from search results?"

Tracing parameters from API to Elasticsearch

Pj quil (talkcontribs)

Hi, while trying to experiment with providing new search options I have been digging around in the Wikipedia search code, and I find this pattern a lot: some frontend extension that exposes a search feature generates an action=query API call with a bunch of params, which passes through a couple of other extensions that create the Elasticsearch query. What I am looking for is a way to simplify this debugging process. Basically, for each search feature provided, I would like to see the Elasticsearch query parameters produced at the end of the pipeline.

Any suggestions or advice? Thanks!

(am not a PHP dev so apologies if I am overlooking something)

DCausse (WMF) (talkcontribs)
Pj quil (talkcontribs)

ah that made my day! Thanks :)

Reply to "Tracing parameters from API to Elasticsearch"
RodolfoEBDR (talkcontribs)

CirrusSearch 0.2 (0be5deb), Elastica 1.3.0.0 (75e2f58)

MediaWiki 1.31.0

PHP 7.0.30-0+deb9u1 (apache2handler)

MariaDB 10.1.26-MariaDB-0+deb9u1

ICU 57.1

Elasticsearch 5.6.4

Debian Stretch

With that setup, I'm having major trouble in my wiki: I'm getting no results when searching. Files are indexed. Adding ?action=cirrusDump to any URL returns the right data in JSON (I think). When trying a query via curl on the command line it retrieves the right results (for example curl -X GET "127.0.0.1:9200/_search?q=SOCIAL&pretty"). The relevant LocalSettings.php section lists this:

$wgServer = "https://192.168.0.154";

[...]

wfLoadExtension( 'PdfHandler' );

wfLoadExtension('PDFEmbed');

wfLoadExtension( 'Elastica' );

require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";

#$wgDisableSearchUpdate = true;

$wgCirrusSearchServers = ['127.0.0.1'];

$wgSearchType = 'CirrusSearch';

The var/log/daemon shows nothing unusual. No error. No strange message. The search engine is ALWAYS retrieving null results. Could it be something about stunneling? If yes, how can I solve the issue? I'm accessing my wiki from "outside" via https://192.168.0.154 (I'm in a local network and the wiki server is on a VM on the same network). Although if I try curl to httpS://127.0.0.1:9200 I get the message curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol . Can it be something from this perspective? The URLs generated by the search look like this ("Squad" is a word that appears inside a wiki page, and my wiki language is set to Spanish):

https://192.168.0.154/index.php?search=Squad&title=Especial:Buscar&go=Ir&searchToken=2m258n64r6folcjwvc8ad54un

Please heeeelp! I've tried everything I've Googled and meta-searched in Help_Talk, but I'm really frustrated. The main goal of using CirrusSearch+Elastica+Elasticsearch is searching inside the content of PDF files (I'm building a knowledge wiki). Thank you for reading!

RodolfoEBDR (talkcontribs)
EBernhardson (WMF) (talkcontribs)

One suspicious part of your post is:

curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol

A common way to get this error is contacting an http service via https. If you connect via plain http, does it work?

RodolfoEBDR (talkcontribs)

Hi, @EBernhardson (WMF)! That's correct. If I try via plain http it works. I've tried debugging CirrusSearch via log and it doesn't add "errors". There are some entries, but they are not generated by the wiki query (I think, based on the timestamps).

2018-07-25 14:23:49 mediawiki mediawiki: Response does not has any data. <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>

<title>404 Not Found</title>

</head><body>

<h1>Not Found</h1>

<p>The requested URL /_msearch was not found on this server.</p>

</body></html>

Is there any way to debug step by step what the Wiki-Cirrus-Elastica-ES chain is trying to do after I push the "Search" button?

RodolfoEBDR (talkcontribs)

I've disabled SSL in the Apache2 sites-available configuration and also set $wgServer = "http:..." (deleted the s). The wiki opens fine, but still zero results. Anyone?

RodolfoEBDR (talkcontribs)

OK, I've detected something really weird:

I've installed httpry (which captures HTTP traffic) and executed httpry -i eth0 (eth0 is my network interface). Browsing the wiki writes the expected log entries. When I push the Search button, it captures this:

2018-07-26 14:21:09     192.168.0.22    192.168.0.154   >       GET     192.168.0.154   /load.php?debug=false&lang=es&modules=mediawiki.helplink%2CsectionAnchor%2Cspecial%2Cui%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.special.search.styles%7Cmediawiki.ui.button%2Cinput%7Cmediawiki.widgets.SearchInputWidget.styles%7Cmediawiki.widgets.styles%7Coojs-ui-core.styles%7Coojs-ui.styles.icons-alerts%2Cicons-content%2Cicons-interactions%2Cindicators%2Ctextures%7Cskins.vector.styles&only=styles&skin=vector        HTTP/1.1        -       -

2018-07-26 14:21:09     192.168.0.154   192.168.0.22    <       -       -       -       HTTP/1.1        304     Not Modified

That made me think outside the box a little, and see what happens if I execute an Elasticsearch query from outside the wiki (Firefox on an external PC that can actually access the wiki): http://192.168.0.154:9200/_search?q=Vasco&pretty . Funny thing: it returns the right hits (in plain text / JSON), but httpry did not capture ANYTHING. No logs, no record. Then I ran tcpdump on port 9200, and effectively the search goes through Elasticsearch the right way, and ES returns the right data...

So... I'm thinking the issue is effectively between the way Elastica? CirrusSearch? is sending the requests and how Apache? is receiving those requests.
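One way to narrow down this kind of mystery is to check, from the machine running MediaWiki, whether anything at all is listening on the configured host and port. The helper below is a generic sketch; note that it cannot tell Elasticsearch apart from Apache, which is exactly the ambiguity suspected here:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("127.0.0.1", 9200) should be True if Elasticsearch is up,
# while port_open("127.0.0.1", 80) tells you whether a web server also answers.
```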

EBernhardson (WMF) (talkcontribs)

The error message The requested URL /_msearch was not found on this server. is very suspicious. My best guess here would be that the server with Elasticsearch on it has not only Elasticsearch on 9200, but perhaps also plain http on port 80? Not finding _msearch tells me that the http server you connected to is not an Elasticsearch instance.

Port 9200 should be the default, but what if you define your elasticsearch connection more explicitly?

['host' => 'my.host.wherever', 'port' => 9200]

RodolfoEBDR (talkcontribs)

Thank you, @EBernhardson (WMF). I've changed the $wgCirrusSearchServers param with that option but nothing happened. I'm starting to think that I'm jinxed. Hahaha

RodolfoEBDR (talkcontribs)

:'(

EBernhardson (WMF) (talkcontribs)

Did you use the following?

$wgCirrusSearchServers = [
    ['host' => 'my.host.whatever', 'port' => 9200]
];

You asked earlier if you could step through the code. This is possible with https://xdebug.org/. There are a variety of interfaces and instructions for setting that up. A reasonable spot to set a breakpoint and start stepping from would be ElasticaConnection::getClient().

RodolfoEBDR (talkcontribs)

Hi, @EBernhardson (WMF). Yes, I've tried. Nothing happened. I've quit the project for now. I was getting really frustrated and blocked, so I decided to suspend the Wiki for a while. Thanks for your help.

Reply to "No results by any means"
185.124.231.251 (talkcontribs)

I installed CirrusSearch (0.2), Elastica (1.3.0.0), and Elasticsearch (5.6.9) according to the instructions at Extension:CirrusSearch.

After that I tried searching, and search worked well.

But when I create a new page in the wiki, the search does not find it. How do I update the index in the Elasticsearch database? Is it automatic, or do I sometimes need to run maintenance scripts from CirrusSearch to update the index?

DCausse (WMF) (talkcontribs)

CirrusSearch uses the job queue to index live updates. The job queue may be configured in many different ways, so it depends on how you configured it. I think the default behavior is to run jobs while there are visits on your wiki, using DeferredUpdates. To see if the problem is that the job queue is not properly running jobs, try running mwscript runJobs.php. It may be another problem, in which case you would have to inspect your log files for an indication of something that is not working well. Unfortunately, without more information it's hard for us to help you. Good luck!

Reply to "How to update index"
Perhelion (talkcontribs)

: "Do not run a bare insource:/regexp/ search."

There is no clarification of what that means. What is the alternative to a bare regexp? All the examples are bare regexps.

Tacsipacsi (talkcontribs)

It refers to the previous paragraph, i.e. you should define a search domain in addition to the regex search. So the examples are not actually bare regexes: the first four use a non-regex insource: term, and the fifth uses prefix: to limit the search domain on which the regex is tested.
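Concretely, the difference could be illustrated like this (queries assembled for illustration following the help page's conventions, not taken from this thread):

```
insource:/foo.*bar/                        (bare: the regex runs over every page, likely to time out)
insource:"foo" insource:/foo.*bar/         (a plain insource: term narrows the candidate pages first)
prefix:Help: insource:/foo.*bar/           (prefix: limits the search domain before the regex runs)
```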

Reply to "Warning Regular expression"

Only getting partial search results

199.16.64.3 (talkcontribs)

Hi,

I have a wiki running on a dev server with the following:

MediaWiki 1.27.4

PHP 5.6.25 (apache2handler)

MariaDB 5.5.56-MariaDB

Elasticsearch 1.7.6

I recently installed CirrusSearch, and it works as expected except for one issue: it's only returning a partial set of pages in the search results. For example, there are about 200 pages (yeah, it's not big) in the main namespace, but only 20 are returned. Likewise, there are about 1800 images, but only 160 are returned. I increased the memory for Elasticsearch, but that had no discernible effect.

Any ideas as to how to fix this? Thanks in advance.

199.16.64.3 (talkcontribs)

I should add that Elastica is installed and running, and a null edit forces the change through.

Cpiral (talkcontribs)

Only twenty results are shown on the first page? What was the query you used to display all pages?

199.16.64.3 (talkcontribs)

Yes. I was just using the wildcard * to check the results. It seems as though only an initial set of pages gets indexed after running the update scripts, but as I noted above, any edit or null edit will in fact force a change through.

Reply to "Only getting partial search results"