
Help talk:CirrusSearch/2018

From mediawiki.org


Error after search in an updated database


I don't know whether this is quite the right place for this discussion.

I installed mw1.27.4 as a test and had no search problems.

Then I made a new installation in a new online directory and connected it to a mw1.19.24 database that I had previously updated successfully to mw1.27 on XAMPP (where the search works). The online installation was successful, but the search does not work: 'Exception encountered, of type "Error"' (see: 127.spiritwiki.de)

Then I imported the same database into a new, empty online database, installed mw1.29.2 (http://w129.mb-info.eu), connected it to that database, and had a similar problem: [WkpAIMCoLbQAAEggcPUAAAEH] 2018-01-01 14:05:20: Fataler Ausnahmefehler des Typs „Error“ (German: "Fatal exception of type 'Error'"). Of course php maintenance/update.php was run!

Running php maintenance/rebuildtextindex.php does not help.

I have now turned debugging on.

The problem occurs, for example, with shivar but not with shiva.

Who has experience with this, and what can I do? I do not want to install Cirrus on a defective CMS.

P.S. Solved it by installing Composer using the curl method from https://packagist.org/ Manbu (talk) 14:37, 1 January 2018 (UTC)

You describe a problem with the default search system that uses the main database as a search index.
Looking at your site I see that the search is functioning, is your problem resolved by now?
If not, could you try to gather more debug information to help us understand what's happening on your system?
Thanks! DCausse (WMF) (talk) 16:52, 2 January 2018 (UTC)Reply

Is there an existing list of Wikimedia supported MIME types?


For the section Help:CirrusSearch#filemime, is there an existing list of the supported MIME types, that we could link to there?

I was trying to search for MIDI files and guessed "audio/mid", which was incorrect, so I had to search for file:example.mid in order to look up the correct name ("audio/midi").

It might be useful to have a direct link to an already-maintained listing (i.e. don't duplicate it locally, if it changes at all often), or a listing + link-to-source if the list is very stable or very scattered.

It looks like https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/libs/mime/mime.types might be what I'm looking for, but I don't know if CirrusSearch and/or Wikimedia supports all of those (e.g. there might be overrides/blacklist entries I haven't stumbled across yet; I skimmed our DefaultSettings.php and various links from Manual:MIME type detection but got a bit lost).

Thanks! Quiddity (WMF) (talk) 19:53, 2 January 2018 (UTC)Reply

Hmm, maybe commons:Special:MediaStatistics is comprehensive and automatically up to date, and includes everything that we allow on all the public Wikimedia project wikis? Quiddity (WMF) (talk) 19:58, 2 January 2018 (UTC)
Looks like currently this class is responsible for handling MIME: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/libs/mime/MimeAnalyzer.php
Which seems to use http://php.net/manual/en/function.mime-content-type.php (this depends on local configs) and then a set of hardcoded types. So I think it won't be easy to get the full set of types supported... Smalyshev (WMF) (talk) 21:11, 2 January 2018 (UTC)Reply
Special:MediaStatistics though could probably be a good reference for types worth looking for (as opposed to the list of types supported in theory), so maybe it indeed makes sense to point people there. You can see that these lists differ slightly for different wikis - e.g. Commons doesn't have application/zip but Mediawiki does. So theoretical support and actual files present can differ between wikis. Smalyshev (WMF) (talk) 21:21, 2 January 2018 (UTC)Reply
One place we can check is in a separate instance of elasticsearch we keep for testing. Looks like there are around 21 unique terms that have related content in commonswiki. This dump isn't the most up to date, but probably close enough for these purposes. EBernhardson (WMF) (talk) 21:29, 2 January 2018 (UTC)Reply

Hello friends. I need your guidance here


When I search for a name on the English Wikipedia, it comes up and its details are shown, but on the Persian one it does not. Why? 37.255.102.158 (talk) 15:21, 12 January 2018 (UTC)

Hello and apologies for posting in English. Can you put in quotes the search term you're using, and tell us which Wikipedia you're searching on?
Thanks! DTankersley (WMF) (talk) 16:31, 12 January 2018 (UTC)Reply

searchToken


The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


If I publish a link to a search query, should I include the searchToken (like &searchToken=a1b2c3d4) in the URL, or should I not? (Let's say this search link will be used a lot.) Does it matter at all?

What is the purpose of the searchToken parameter? Pipetricker (talk) 12:04, 31 January 2018 (UTC)Reply

If you want to use comprehensible links, please omit the &searchToken=whaffle suffix.
It is used by developers to track the most recent queries, but has no effect later.
I once thought it would reuse a previous result set and speed up scrolling through thousands of recent hits, but I learned that it just leaves a trace in a temporary server log for a few minutes. PerfektesChaos (talk) 14:58, 31 January 2018 (UTC)
The possibility that it was for re-use of a cached search was what I wanted to rule out with this question. Pipetricker (talk) 09:09, 1 February 2018 (UTC)Reply
Results are not affected by the searchToken so it's perfectly fine (and certainly cleaner) to drop it when linking to the search result page (a new token will be generated anyway).
The search token allows WMF engineers to associate frontend logs with backend logs (see https://meta.wikimedia.org/wiki/Schema:TestSearchSatisfaction2). DCausse (WMF) (talk) 15:00, 31 January 2018 (UTC)Reply
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.
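
For anyone who wants to clean up search links before publishing them, here is a minimal Python sketch that drops the searchToken parameter from a URL and leaves everything else in place. The function name strip_search_token and the example.org URL are made up for illustration; only the standard library is used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_search_token(url):
    """Return the URL without any searchToken query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values=True so parameters with empty values survive the round trip
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "searchToken"
    ]
    # safe=":" keeps titles such as Special:Search readable in the output
    return urlunsplit(parts._replace(query=urlencode(kept, safe=":")))


print(strip_search_token(
    "https://example.org/index.php?search=test&searchToken=a1b2c3d4"
))
```

Since results are not affected by the token, the cleaned link should lead to the same result page.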

Search titles with regex


The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


Is it even possible? intitle:/regex/ does not work. Alexis Jazz (talk) 21:22, 23 February 2018 (UTC)Reply

Hi,
Currently, this does not work, but we do have a phabricator ticket that has some details on how we might add this functionality in: https://phabricator.wikimedia.org/T156474. DTankersley (WMF) (talk) 22:02, 23 February 2018 (UTC)Reply
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.

CompletionSuggester not working?


The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


Hi

MediaWiki 1.30.0
PHP 7.0.22
MySQL 5.7.21-0ubuntu0.16.04.1
ICU 55.1
Elasticsearch 5.4.3
Lua 5.1.5

CirrusSearch (0.2) and Elastica (1.3.0.0)

CirrusSearch seems to be working but the CompletionSuggester is not functional at all. While there is an article called "Movie", CirrusSearch doesn't suggest anything if I look for a slightly misspelled "Movei".

After the installation I did not make any changes to the configuration but followed the CirrusSearch README instructions.

Anything I'm missing? Wuestenarchitekten (talk) 23:04, 7 March 2018 (UTC)Reply

Sorry, I didn't see the answer earlier further down below:
The completion suggester is enabled by setting $wgCirrusSearchUseCompletionSuggester to 'yes'. Wuestenarchitekten (talk) 23:24, 7 March 2018 (UTC)
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.

How to update index


I installed CirrusSearch (0.2), Elastica (1.3.0.0) and Elasticsearch (5.6.9) according to the instructions at Extension:CirrusSearch.

After that I tried searching, and search works well.

But when I create a new page in the wiki, search does not find it. How do I update the index in the Elasticsearch database? Is it automatic, or do I need to run the CirrusSearch maintenance scripts from time to time to update the index? 185.124.231.251 (talk) 17:51, 15 May 2018 (UTC)

CirrusSearch uses the jobqueue to index live updates.
The jobqueue may be configured in many different ways so it depends on how you configured it.
I think the default behavior is to run jobs while there are visits on your wiki using DeferredUpdates.
To see if it's because the jobqueue is not properly running jobs, try to run mwscript runJobs.php.
It may be other problems in which case you would have to inspect your log files to find an indication of something that is not working well.
Unfortunately without more information it's hard for us to help you.
Good luck! DCausse (WMF) (talk) 14:12, 26 July 2018 (UTC)Reply
Manually running runJobs.php solved this for me. I will need to do some testing to see whether new content keeps getting picked up after manually running this script, and then look into debugging why it's not being run automatically. 24.176.93.67 (talk) 13:20, 10 October 2019 (UTC)

Can CirrusSearch search in contents of uploaded files?


Can CirrusSearch search the CONTENT of uploaded files, such as xml, doc, xls, docx, xlsx and others? 185.124.231.251 (talk) 17:57, 15 May 2018 (UTC)

CirrusSearch supports searching the content of uploaded files when mediawiki has appropriate handling installed for the file type. For example Extension:PdfHandler provides this for PDF files. If there are any extensions for handling those file types they might be linked from Category:Media handling extensions. EBernhardson (WMF) (talk) 20:58, 23 May 2018 (UTC)Reply

Only getting partial search results


Hi,

I have a wiki running on a dev server with the following:

MediaWiki 1.27.4

PHP 5.6.25 (apache2handler)

MariaDB 5.5.56-MariaDB

Elasticsearch 1.7.6

Recently installed CirrusSearch, and it works as expected except for one issue: it's only returning a partial number of pages in the search results. For example, there are about 200 pages (yeah, it's not big) in the main namespace, but only 20 are returned. Likewise, there are about 1800 images, but only 160 are returned. I increased the memory for elasticsearch, but that had no discernible effect.

Any ideas as to how to fix this? Thanks in advance. 199.16.64.3 (talk) 19:08, 30 May 2018 (UTC)Reply

Should add that Elastica is installed and running, and a null edit forces the change through. 199.16.64.3 (talk) 21:30, 30 May 2018 (UTC)Reply
Only twenty results are shown on the first page? What was the query you used to display all pages? Cpiral (talk) 03:48, 1 June 2018 (UTC)Reply
Yes. I was just using the wildcard * to check the results. It seems as though only an initial set of pages get indexed after running the update scripts, but as I noted above any edits or null edits will in fact force a change through. 199.16.64.3 (talk) 18:25, 4 June 2018 (UTC)Reply

This is the most worst search engine on the internet


The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


This search finds only terms/strings/words that are exactly as present. Then I don't need a search! An absolute waste of time. PS: Also exactly named terms: phab:T196820 User: Perhelion 18:49, 9 June 2018 (UTC)

Clearly an exaggerated claim. The issue there seems partly due to MediaWiki itself allowing the creation of an interwiki redirect. This is clearly invalid; otherwise anyone viewing that page would automatically be sent to de:wiki. Cirrus seems to depend on valid redirects to store search data.
Removing the invalid redirect will likely fix the problem. Although the issue needs to be fixed in cirrus as well. 197.218.92.155 (talk) 19:48, 9 June 2018 (UTC)Reply
The discussion above is closed. Please do not modify it. No further edits should be made to this discussion.
[edit]
I am having a hard time loading index dumps using instructions from the elastic blog Looks like others are too [1] [2] [3]
I am trying to do something very basic - load wikiquotes/wikivoyage index and run simple elastic search queries. There are a lot of docs saying a lot of different things but looks like I just need clarity on 3 pieces of info
[1] Version of elasticsearch to use? I was trying things out with 6.3 and I can't figure out what version is being used by wikipedia or where to find the version number.
[2] Do I need to add plugins after Elasticsearch is installed (I am on Ubuntu 16.04)? Some docs say analysis-icu and others point at https://github.com/wikimedia/search-extra. I can't figure out what these plugins do, so I'm not sure whether they are the source of my errors.
[3] Creating the index initially needs a mapping and setting file which seem to be very different for elastic 2,5,6. I tried using these Mapping & Setting but it looks like they are of a much older version. Is there a place where I can find the current files and just drag and drop them into my basic single node no replication setup?
Thanks in advance for any assistance! Ksha run (talk) 02:59, 24 June 2018 (UTC)Reply
[1] Wikimedia uses 5.5.2, see Special:Version. Tacsipacsi (talk) 18:51, 24 June 2018 (UTC)Reply

Tracing parameters from API to Elasticsearch


Hi, While trying to experiment with providing new search options I have been digging around in the wikipedia search code and I find this pattern a lot. Some frontend extension that exposes a search feature generates an action=query API with a bunch of params, which passes through a couple other extensions that create the elasticsearch query. What I am looking for is to simplify this debugging process. Basically for each search feature provided I would like to see the elastic search query parameters produced at the end of the pipeline.

Any suggestions or advice? Thanks!

(am not a PHP dev so apologies if I am overlooking something) Pj quil (talk) 01:19, 11 July 2018 (UTC)Reply

Yes, simply add &cirrusDumpQuery to the API request URL, e.g. https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srsearch=test&cirrusDumpQuery and you'll see the query sent to elastic. DCausse (WMF) (talk) 17:37, 11 July 2018 (UTC)Reply
ah that made my day! Thanks :) Pj quil (talk) 02:32, 13 July 2018 (UTC)Reply
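
If you want to generate such debug URLs for many search strings, a small Python helper along these lines might do. It is only a sketch: the function name dump_query_url is made up, and the parameters are simply the ones from the example URL above.

```python
from urllib.parse import urlencode


def dump_query_url(search, api="https://en.wikipedia.org/w/api.php"):
    """Build a search API URL that asks CirrusSearch to dump the Elasticsearch query."""
    params = [
        ("action", "query"),
        ("format", "json"),
        ("list", "search"),
        ("srsearch", search),
    ]
    # cirrusDumpQuery is a bare flag in the example above, so it is appended without a value
    return api + "?" + urlencode(params) + "&cirrusDumpQuery"


print(dump_query_url("test"))
```

Opening the printed URL in a browser should show the generated query instead of the usual search results.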

No results by any means


CirrusSearch 0.2 (0be5deb), Elastica 1.3.0.0 (75e2f58)

MediaWiki 1.31.0

PHP 7.0.30-0+deb9u1 (apache2handler)

MariaDB 10.1.26-MariaDB-0+deb9u1

ICU 57.1

Elasticsearch 5.6.4

Debian Stretch

With this setup, I'm having major trouble with my wiki: I'm getting no results when searching. Files are indexed. Adding ?action=cirrusDump to any URL returns the right data in JSON (I think). A query via curl on the command line retrieves the right results (for example curl -X GET "127.0.0.1:9200/_search?q=SOCIAL&pretty" ). The relevant LocalSettings.php section looks like this:

$wgServer = "https://192.168.0.154";

[...]

wfLoadExtension( 'PdfHandler' );

wfLoadExtension('PDFEmbed');

wfLoadExtension( 'Elastica' );

require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";

#$wgDisableSearchUpdate = true;

$wgCirrusSearchServers = ['127.0.0.1'];

$wgSearchType = 'CirrusSearch';

The var/log/daemon log shows nothing unusual. No error. No strange message. The search engine ALWAYS returns no results. Could it be something to do with tunneling? If so, how can I solve the issue? I'm accessing my wiki from "outside" via https://192.168.0.154 (I'm on a local network and the wiki server is a VM on the same network). However, if I try curl against httpS://127.0.0.1:9200 I get the message curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol . Could that be related? The URLs generated by the search look like this ("Squad" is a word that appears in a wiki page, and my wiki language is set to Spanish):

https://192.168.0.154/index.php?search=Squad&title=Especial:Buscar&go=Ir&searchToken=2m258n64r6folcjwvc8ad54un

Please heeeelp! I've tried everything I've Googled and found by searching Help_Talk, but I'm really frustrated. The main goal of using CirrusSearch+Elastica+Elasticsearch is searching inside the content of PDF files (I'm building a knowledge wiki). Thank you for reading! RodolfoEBDR (talk) 14:52, 25 July 2018 (UTC)

I don't know if helps to understand, but when running lsof I get this
https://imgur.com/ytCLELS RodolfoEBDR (talk) 15:17, 25 July 2018 (UTC)Reply
One suspicious part of your post is:
curl: (35) error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
A common way to get this error is to contact an http service via https. If you connect via plain http does this work? EBernhardson (WMF) (talk) 19:05, 25 July 2018 (UTC)Reply
Hi, @EBernhardson (WMF)! That's correct. If I try via plain http it works. I've tried debugging CirrusSearch via the log and it doesn't record any errors. There are some entries, but they are not generated by the wiki query (I think, based on the timestamps).
2018-07-25 14:23:49 mediawiki mediawiki: Response does not has any data. <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /_msearch was not found on this server.</p>
</body></html>
Is there any way to debug step by step what the Wiki-Cirrus-Elastica-ES chain is trying to do after I push the "Search" button? RodolfoEBDR (talk) 19:28, 25 July 2018 (UTC)
I've disabled SSL in Apache's sites-available and also set $wgServer to start with "http:" (deleted the S). The wiki opens fine, but I still get zero results. Anyone? RodolfoEBDR (talk) 15:23, 26 July 2018 (UTC)
OK, I've detected something really weird:
I've installed httpry (an HTTP traffic logger) and executed httpry -i eth0 (eth0 is my network interface). Browsing the wiki produces the expected log entries. When I push the Search button, it captures this:
2018-07-26 14:21:09     192.168.0.22    192.168.0.154   >       GET     192.168.0.154   /load.php?debug=false&lang=es&modules=mediawiki.helplink%2CsectionAnchor%2Cspecial%2Cui%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.special.search.styles%7Cmediawiki.ui.button%2Cinput%7Cmediawiki.widgets.SearchInputWidget.styles%7Cmediawiki.widgets.styles%7Coojs-ui-core.styles%7Coojs-ui.styles.icons-alerts%2Cicons-content%2Cicons-interactions%2Cindicators%2Ctextures%7Cskins.vector.styles&only=styles&skin=vector        HTTP/1.1        -       -
2018-07-26 14:21:09     192.168.0.154   192.168.0.22    <       -       -       -       HTTP/1.1        304     Not Modified
That made me think outside the box a little and see what happens if I execute an Elasticsearch query from outside the wiki (Firefox on an external PC that can actually access the wiki): http://192.168.0.154:9200/_search?q=Vasco&pretty . Funny thing: it returns the right hits (in plain text / JSON), but httpry did not capture ANYTHING. No logs, no record. Then I ran tcpdump on port 9200, and the search does go through Elasticsearch correctly, and ES returns the right data...
So... I really think the issue is in the way Elastica (or CirrusSearch?) sends the requests and how Apache (?) receives them. RodolfoEBDR (talk) 17:36, 26 July 2018 (UTC)
The error message The requested URL /_msearch was not found on this server. is very suspicious. My best guess is that the server with Elasticsearch on it has not only Elasticsearch on port 9200 but also an HTTP server on port 80. Not finding _msearch tells me that the HTTP server being contacted is not an Elasticsearch instance.
Port 9200 should be the default, but what if you define your elasticsearch connection more explicitly?
['host' => 'my.host.wherever', 'port' => 9200] EBernhardson (WMF) (talk) 16:46, 27 July 2018 (UTC)Reply
Thank you, @EBernhardson (WMF). I've changed the $wgCirrusSearchServers param to that option but nothing changed. I'm starting to think that I'm jinxed. Hahaha RodolfoEBDR (talk) 18:01, 27 July 2018 (UTC)

:'(

RodolfoEBDR (talk) 18:26, 30 July 2018 (UTC)Reply
Did you use the following?
$wgCirrusSearchServers = [
['host' => 'my.host.whatever', 'port' => 9200]
];
You asked earlier if you could step through the code. This is possible with https://xdebug.org/. There are a variety of interfaces and instructions for setting that up. A reasonable spot to set a breakpoint and start stepping from would be ElasticaConnection::getClient() EBernhardson (WMF) (talk) 22:38, 1 August 2018 (UTC)Reply
Hi, @EBernhardson (WMF). Yes, I've tried. Nothing happened. I've quit the project for now. I was getting really frustrated and blocked, so I decided to suspend the Wiki for a while. Thanks for your help. RodolfoEBDR (talk) 11:42, 2 August 2018 (UTC)Reply
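
For anyone hitting the same "/_msearch was not found" message: one quick sanity check is to look at what the configured host and port return for the root URL and see whether it looks like Elasticsearch at all. The Python sketch below only inspects a response body you have already fetched (for example with curl). It assumes that an Elasticsearch node answers its root URL with a JSON object containing a "version" object with a "number" field; the function name and the sample bodies are made up for illustration.

```python
import json


def looks_like_elasticsearch(body):
    """Guess whether a response body came from the root URL of an Elasticsearch node."""
    try:
        info = json.loads(body)
    except ValueError:
        # An HTML error page from a regular web server is not valid JSON
        return False
    if not isinstance(info, dict):
        return False
    version = info.get("version")
    # Assumption: Elasticsearch reports its version as {"version": {"number": "..."}}
    return isinstance(version, dict) and "number" in version


# Illustrative sample bodies, not captured from a real server
es_body = '{"name": "node-1", "version": {"number": "5.6.4"}}'
apache_body = "<html><head><title>404 Not Found</title></head></html>"

print(looks_like_elasticsearch(es_body))
print(looks_like_elasticsearch(apache_body))
```

If the body from the host and port in $wgCirrusSearchServers fails this check, the wiki is probably talking to a different HTTP server, which would match the 404 on /_msearch.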

How to exclude redirects from search results?


Hi. Is there any way to exclude redirects from search results? I want, e.g., to find entries that are not redirections and that contain some character in their title. How to do that? Automatik (talk) 03:15, 23 August 2018 (UTC)Reply

Unfortunately, there's no easy way to exclude redirects from search results.
However, depending on the scope of the task you are trying to complete and your technical ability you could try to use the Search API to semi-automatically do what you need.
This query will give you back the top results with "English Wikipedia" in the title or a redirect:
https://en.wikipedia.org/w/api.php?action=query&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22
The default format is JSON converted to HTML so it's easy to read for a human, but hard to read for a computer. If you only have a small number of queries to deal with, and only need a limited number of results from each (up to 500—set by srlimit), you might be able to get what you need by getting these results and looking through the titles by hand.
If you need a computer to process the results for you, say, because you have many queries, you can get real JSON by adding &format=json:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22
On a Unix-like command line (I'm working in Terminal on OS X) you can use curl to fetch the JSON, python to make it pretty, and grep to pull out the titles, and grep again to find the specific ones you want:
curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=50&srsearch=intitle:%22english%20wikipedia%22" | python -m json.tool | grep "\"title\":" | grep -i "english wikipedia"
Note that the API URL is URL-encoded (spaces become %20, quotes become %22, etc.).
Results:
"title": "English Wikipedia",
"title": "Simple English Wikipedia",
"title": "Notability in the English Wikipedia",
The results aren't pretty, and in this case there are only 8 results total and 3 that are not redirects. If you are searching for specific characters, you may need to do some more pre-processing before the final grep. (If you are searching for "e", everything will match, because "title" has an "e" in it, for example.) If you need to go through more than the top 500 results, you'll have to figure out how to get the API to give you additional results, etc.
It's not pretty and it's not easy, but it's a start. TJones (WMF) (talk) 15:41, 23 August 2018 (UTC)Reply
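
The grep step above can also be done on the parsed JSON, which avoids the problem of the needle matching the word "title" itself. This is a minimal Python sketch, assuming the response has the usual query → search → title layout of list=search; the function name matching_titles and the sample response are made up for illustration.

```python
import json


def matching_titles(response_text, needle):
    """Return titles from a list=search JSON response that contain needle (case-insensitive)."""
    data = json.loads(response_text)
    hits = data.get("query", {}).get("search", [])
    # Compare against the title value only, never against the JSON key names
    return [hit["title"] for hit in hits if needle.lower() in hit["title"].lower()]


# Hypothetical response: the last hit stands in for a page matched through a redirect
sample = json.dumps({
    "query": {
        "search": [
            {"title": "English Wikipedia"},
            {"title": "Simple English Wikipedia"},
            {"title": "Some other page"},
        ]
    }
})

print(matching_titles(sample, "english wikipedia"))
```

Searching for a single character such as "e" works the same way, since only the title values are compared.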
Thanks for this answer. It is clearly not easy or convenient, and pretty similar to running the query manually (then filtering visually with CTRL+F "(redirection" and picking only the results without the "(redirection" text highlighted). Developers should add a "do not follow redirects" option, to avoid tedious work for all users of this functionality. I guess it is not so difficult, as this option already exists in some use cases (e.g. when displaying a page with &redirect=no). Automatik (talk) 16:45, 23 August 2018 (UTC)
It is very similar to the ctrl-F solution, just more automatic! For me, somewhere around 25 to 50 queries it would be faster (or at least less boring and thus less error-prone) to go for a hacked-together semi-automatic solution.
Adding a title-only index is probably not a trivial change to make from our current state. We have a search index for intitle:, with the text from titles and redirects in it. There's no differentiation between the title and redirect text once it's in the index. I think we'd have to create another field that was title-only (and maybe a redirect-only field would be equally useful—which together would be bigger than the size of the current title index).
It's not clear to me how many people would need such an index. I'm really curious what your use case is—both to get a sense of how useful title-only search would be, and to see if there's a better clever way to get what you need.
You could open a Phabricator ticket and ask for this feature, but that certainly doesn't guarantee that it would be implemented any time soon. TJones (WMF) (talk) 17:03, 23 August 2018 (UTC)Reply
On the French Wiktionary, we use the typographic apostrophe in titles, instead of the typewriter/vertical apostrophe. I was looking for titles that use the vertical apostrophe, without being a redirection.
Moreover, I am using Windows, which is less convenient than Unix-like command line regarding command-line tools (documentation unclear/not a unified way to run commands in Windows, etc.) Automatik (talk) 17:18, 23 August 2018 (UTC)Reply
Ah.. that's a sensible use case. No other obvious solution comes to mind, but I'll think about it more and if I think of anything useful I'll let you know.
If you are already familiar with Unix-like commands (or want to learn), but just don't have them available because you are on Windows, you could look at Cygwin (English WP, French WP, website)—it's not an emulator or virtual machine, it just gives you versions of standard Unix commands that work on Windows. I used it about 15 years ago when I had a Windows machine for my job. I found it very useful back then, but haven't used it since. TJones (WMF) (talk) 17:28, 23 August 2018 (UTC)Reply
Thanks for the advice, however the bash terminal from Cygwin does not work (and the solution suggested in https://superuser.com/questions/1172759/cygwin-error-failed-to-run-bin-bash-no-such-file-or-directory does not work out either). Moreover, now that I have installed the program, I cannot uninstall it anymore (at least, not easily), as it does not appear in "Programs and features", and when I click "Uninstall" from a right click on the program icon, it opens the "Programs and features" windows, anyway. Automatik (talk) 18:08, 23 August 2018 (UTC)Reply
Oh no! I should have known better than to suggest software I haven't used in so long—but it was so nice back in the day. I haven't used Windows in almost 15 years either, so I don't really have any helpful advice. Crap, I'm sorry! TJones (WMF) (talk) 18:20, 23 August 2018 (UTC)Reply
No worries: I "uninstalled" it by removing its folders, and re-installed it using another repository, and now it works! Thanks for the tip then.
To look for more than 500 results, I added the &sroffset=500 parameter (then 1000, 1500,... until no results are found) Automatik (talk) 22:32, 23 August 2018 (UTC)Reply
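
The sroffset loop described above could be sketched in Python roughly as follows. This is an illustration, not a tested client: page_url and collect_all are made-up names, fetch_page is a hypothetical callable that you would implement yourself (it takes an offset and returns the list of titles for that page), and the 10000 cap is only an assumed upper bound on how far to page.

```python
from urllib.parse import urlencode


def page_url(search, offset, page_size=500, api="https://en.wikipedia.org/w/api.php"):
    """Build the list=search URL for one page of results starting at offset."""
    params = [
        ("action", "query"),
        ("format", "json"),
        ("list", "search"),
        ("srlimit", page_size),
        ("sroffset", offset),
        ("srsearch", search),
    ]
    return api + "?" + urlencode(params)


def collect_all(fetch_page, page_size=500, max_results=10000):
    """Call fetch_page(offset) with offsets 0, 500, 1000, ... until a page comes back empty."""
    titles = []
    for offset in range(0, max_results, page_size):
        batch = fetch_page(offset)
        if not batch:
            break
        titles.extend(batch)
    return titles


# Offline demonstration with a fake data source instead of real HTTP requests
fake_titles = ["Title %d" % i for i in range(1200)]
print(len(collect_all(lambda offset: fake_titles[offset:offset + 500])))
```

In real use, fetch_page would request page_url(search, offset) and pull the titles out of the JSON response.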
Oh, slightly funny: Unaware of this thread I recently opened a ticket on Phabricator: phab:T204089. Speravir (talk) 22:21, 19 September 2018 (UTC)Reply
It seems that it used to be possible to filter redirects at some point, and this was removed https://phabricator.wikimedia.org/T5174, https://phabricator.wikimedia.org/rMW52e699441edf2958701cea692a5dc3243ec3b064.
It seems developers are confused and going back and forth between removing and readding redirects to search. As the old saying goes, "clients don't know what they want". Anyway, a more sensible approach would be a degree of faceting, where it returns all results but aggregates similar properties, e.g. many pages will be in the same category, or many pages will be redirects, disambiguations, poor quality stubs, etc...
It is probably simpler to resolve this using the API, since it already has options for redirect titles. There are also at most about 10000 results, so it would probably be less challenging to filter through those. Anyway, if the search results aren't too many it is easier to include redirect title in API search results and use your favorite replace tool to clean up all those that don't match, e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=shakespeare&srlimit=500&srprop=redirecttitle . This would be easier if CSV was a valid API output format. 197.235.98.211 (talk) 23:00, 19 September 2018 (UTC)Reply
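
As a rough illustration of the srprop=redirecttitle idea, the filtering step could look like this in Python. It rests on an assumption worth checking against the API documentation: that a hit matched through a redirect carries a "redirecttitle" field and a hit matched on its own title does not. The function name and the sample hits are made up.

```python
def non_redirect_hits(hits):
    """Keep titles of hits that were not matched through a redirect.

    Assumption: with srprop=redirecttitle, redirect matches carry a
    "redirecttitle" key and direct title matches do not.
    """
    return [hit["title"] for hit in hits if "redirecttitle" not in hit]


# Hypothetical hits, shaped like entries of query.search in the API response
sample_hits = [
    {"title": "William Shakespeare"},
    {"title": "Shakespeare's plays", "redirecttitle": "Shakespeare plays"},
    {"title": "Shakespeare (crater)"},
]

print(non_redirect_hits(sample_hits))
```

This replaces the manual search-and-replace cleanup for result sets small enough to fetch through the API.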
Also that task is a duplicate: https://phabricator.wikimedia.org/T90807 197.235.98.211 (talk) 23:20, 19 September 2018 (UTC)Reply
(Nitpicking) @IP, apparently not: User/developer debt closed phab:T90807 as declined, but with the words “If there is more of a use case than what is in this ticket, please reopen and show examples / steps to reproduce.” Well I did not reopen, because this ticket was not found in a search for older tickets, but the same user/dev debt did not close the ticket opened by me. It seems I showed some valid use cases. Speravir (talk) 23:39, 19 September 2018 (UTC)Reply
Well, it seems more sensible to formulate it as "restore ability to remove redirects from search results". This was explicitly and deliberately removed for specific reasons.
The general problem with wikis is that they attempt to cater to two sometimes conflicting groups: pure readers and editors. The average reader wants the best results and doesn't even know that redirects exist. An editor sometimes wants worse results because they want to address a specific problem.
There are several orders of magnitude more readers than editors, and that's likely the reason it was removed. There is no doubt that such filters have their uses, although the question is whether that justifies restoring the older functionality. Also, chances are that "debt" simply forgot about the older ticket; otherwise they would likely have reopened it and marked this task as a duplicate. 197.235.98.211 (talk) 00:06, 20 September 2018 (UTC)
Fair enough. Speravir (talk) 00:10, 20 September 2018 (UTC)Reply
A partial workaround is to restrict search to Talk pages, which often are missing for redirects. Fgnievinski (talk) 04:54, 7 December 2024 (UTC)Reply

Issue in English version breaks German translation

An issue caused by improper tag pairing of ref and translate tags caused an incomplete display of the German translation.

At least the German translation is broken: it currently has 14 sections, while the English one has 16. In German, the English sections 7 and 8 (Page weighting and Regular expression searches) are missing entirely, and I know they are translated, because I was the one who did most of it. The Filters section has only 2 subsections in German; in English there are 7. (Edit: Oh, in fact parts of the Regular expressions section are present, but displayed as part of the Filters section.) So I assume that somewhere in or after the second Filters subsection (Deepcategory) there is a missing or misplaced <translate> tag (or maybe, vice versa, there is one too many somewhere before it).

FWIW the translation broke with this edit: Special:Diff/2780543/2787288 on 21 May 2018, but the latest update for translation before that was on 27 December 2017, hence the mistake(s) must have occurred sometime in this period. Speravir (talk) 22:56, 19 September 2018 (UTC)

Can you provide some samples of where this is happening, please? DTankersley (WMF) (talk) 16:52, 20 September 2018 (UTC)Reply
Sorry, but I do not understand what you want from me. What kind of samples? Or: what do you not get from my first message? Simply compare the table of contents of Help:CirrusSearch with that of Help:CirrusSearch/de, especially in the Filters section (Filter in German). Note also that the Geo Search section is the ninth in English but currently the seventh in German (as Geosuche). Regarding my edit above: in German, section 6.2.1 currently belongs to Regex searches; in English this is in fact section 8.2. Looking more carefully shows that some lines before 6.2.1 already do not belong to section 6.2: everything starting from the box is for RegEx (you can see this string in the box). Speravir (talk) 18:39, 20 September 2018 (UTC)
Updating Translations:Help:CirrusSearch/438/de may help. wargo (talk) 20:45, 20 September 2018 (UTC)
I actually noticed the error message I got in the German help version, but thought it was just caused by a translation change (this has happened before, too). Because of the bigger issue I didn't want to start a new translation, though. I've now taken a look at the English text of this passage and decided to take a cleaner approach: Special:Diff/2883110/2887712. Could someone please mark the English text for translation, so we can see whether this was the reason for the mess?
If so, this edit broke everything: Special:Diff/2704479/2714274. I just restored the tags to their state before this. Speravir (talk) 21:22, 20 September 2018 (UTC)
Done. wargo (talk) 22:51, 20 September 2018 (UTC)
Thanks, Wargo. This was indeed the reason. Sorry for all the noise. Speravir (talk) 22:54, 20 September 2018 (UTC)

Is it possible to search by section?

For example, if I have a section called "Etymology" on many of my pages, would it be possible to search only the text that appears in the etymology sections?

If there's no option to do that, is it something that could be done with regex in a reasonable time? Karlpoppery (talk) 10:47, 22 November 2018 (UTC)

No, it isn't possible to do it directly.
Yes, it is theoretically possible to use regex, but in many cases it will time out. 197.235.75.234 (talk) 16:48, 22 November 2018 (UTC)
I see, that's a bummer. There must be ways to hack around this, though. For example, I could automatically save those sections in their own articles under the same category, then do a search by category and redirect the result to the original article. Or maybe I could use Cargo. Karlpoppery (talk) 12:25, 23 November 2018 (UTC)
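A minimal sketch of the first step of such a workaround: pulling one named section out of a page's wikitext so that it could be saved or indexed separately. The function name `extract_section` and the sample text are hypothetical, and the sketch assumes plain level-2 headings (`== Title ==`) written directly in the wikitext, not headings generated by templates.

```python
import re
from typing import Optional

# Matches a level-2 heading line such as "== Etymology ==".
# Deeper headings ("=== Notes ===") are deliberately not matched,
# so subsections stay inside their parent section.
HEADING = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def extract_section(wikitext: str, title: str) -> Optional[str]:
    """Return the body of the first level-2 section named `title`, or None."""
    headings = list(HEADING.finditer(wikitext))
    for i, match in enumerate(headings):
        if match.group(1).strip().lower() == title.lower():
            start = match.end()
            # The section ends at the next level-2 heading, or at end of page.
            if i + 1 < len(headings):
                end = headings[i + 1].start()
            else:
                end = len(wikitext)
            return wikitext[start:end].strip()
    return None


# Hypothetical sample page used only for illustration.
sample = (
    "Intro\n"
    "== Etymology ==\n"
    "From Latin.\n"
    "=== Notes ===\n"
    "More.\n"
    "== History ==\n"
    "Old.\n"
)
etymology = extract_section(sample, "Etymology")
```

Saving the extracted text to its own page and keeping it in sync with the original would still need a bot or an extension, which this sketch does not cover.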
Such hacks will work, but they seem like a lot of effort. The way to make a timeout less likely is to use an efficient regex, along with a simplified text. For example:
"personal life" insource:/\=\=\s*Personal life\s*\=\=.*?he attended.*?\=\=\w*?\=\=.*?/
https://en.wikipedia.org/w/index.php?search=insource%3A%22personal+life%22+insource%3A%2F%5C%3D%5C%3D%5Cs*Personal+life%5Cs*%5C%3D%5C%3D.*%3Fhe+attended.*%3F%5C%3D%5C%3D%5Cw*%3F%5C%3D%5C%3D.*%3F%2F&title=Special%3ASearch&profile=default&fulltext=1
The search above attempts to find "he attended" within a section. Because it is simplified, it will fail in many cases, for example if an article contains only one section or if a section is created by a template. Sections are simply too complicated to deal with, because people think of them as sub-documents, but in reality they are just pieces of the same document. A related feature request is https://phabricator.wikimedia.org/T27062.
Note that the query above results in a timeout. 197.235.219.81 (talk) 15:35, 23 November 2018 (UTC)
Regexes in general are tricky, and making them work efficiently in on-wiki search is hard, too. So, a few notes:
  • Put as many required words outside the regex as possible. "personal life" gets about 150K results. "personal life" attended only gets 50K, which cuts the number the regex has to scan to a third. "personal life" "he attended" gets only 13K. The regex matches both "he attended" and "she attended", but I would suggest searching them separately, since "personal life" "she attended" only gets 5K results, and splitting your search into two queries against ~18K results is better than one query against 50K.
    • Any other non-regex info, like categories, also helps. Anything to allow the search index to give the regex fewer documents to scan is a plus.
  • In general, put as much relevant plain text in your regex as possible. We use plain text trigrams to accelerate the regex search, so we're limiting the regexes to scan only documents with "Per", "ers", "rso", etc. In this case it doesn't help much because the plain text in the regex is almost the same as the non-regex search terms, though the regex is case sensitive, so the "Per" will filter the list down a bit more.
  • Regexes are case sensitive unless you tell them not to be with /i at the end, so this won't match sentences that start with "He attended", like Jamie How. Rather than make everything case insensitive, I'd use [Hh] to allow that one character to match upper or lower case, since that still leaves a lot of plain text trigrams in place. (I don't recall whether the trigram processing looks into simple character classes like [Hh] and makes trigrams across them, so don't count on it.)
  • The regex suggested above doesn't quite guarantee what you want, only that he attended appears after the Personal life section title—it could be in a different section, as in the case of Jeff Grub, where it's in the section after the Personal life section.
    • You can instead match "not equal sign" with [^=]* though it is more expensive. It will also fail to match if there is an equal sign in the Personal life section before the "he attended" part. Unlikely, but possible. It will also exclude results with a sub-section under Personal life that contains "he attended", which is probably not desirable.
  • Also, the extra \=\=... at the end requires another same-level section after the Personal life section, so it won't match if Personal life is the very last section, which may be unlikely, but is not true for all section titles. I'd drop it.
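The trade-off between `.*?` and `[^=]*` described in the notes above can be illustrated offline with Python's `re` module. This is only a sketch of the matching logic: Python's regex engine is not the one behind insource:, it says nothing about on-wiki performance, and `re.DOTALL` is used on the assumption that `.` may span lines in the on-wiki search. The three sample pages are made up.

```python
import re

# "he attended" appears only in the section AFTER Personal life.
PAGE_NEXT_SECTION = (
    "== Personal life ==\n"
    "He lives in Auckland.\n"
    "== Education ==\n"
    "As a boy he attended a local school.\n"
)

# "he attended" appears inside Personal life, which is the last section.
PAGE_INSIDE = (
    "== Personal life ==\n"
    "In 1990 he attended a local school.\n"
)

# An equal sign (here inside a template) precedes "he attended".
PAGE_EQUALS = (
    "== Personal life ==\n"
    "{{cite|url=x}} Later he attended a local school.\n"
)

# Loose shape: ".*?" is free to run across the next heading.
LOOSE = re.compile(r"==\s*Personal life\s*==.*?[Hh]e attended", re.DOTALL)

# Strict shape: "[^=]*" cannot cross any "=", so it stops at the next heading.
STRICT = re.compile(r"==\s*Personal life\s*==[^=]*[Hh]e attended")


def hits(pattern: re.Pattern, page: str) -> bool:
    """Return True if the pattern matches anywhere in the page."""
    return pattern.search(page) is not None


results = {
    "next_section": (hits(LOOSE, PAGE_NEXT_SECTION), hits(STRICT, PAGE_NEXT_SECTION)),
    "inside": (hits(LOOSE, PAGE_INSIDE), hits(STRICT, PAGE_INSIDE)),
    "equals": (hits(LOOSE, PAGE_EQUALS), hits(STRICT, PAGE_EQUALS)),
}
```

On these samples the loose pattern gives a false positive for the first page, while the strict pattern misses the third page because of the equal sign inside the section. Neither pattern requires a trailing heading, so both match the second page, where Personal life is the last section.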
If you don't need a perfect list, but just a short list of ~100 articles you could review manually to find what you need, I'd recommend these two for this use case (link for ..he.. and ..she.. queries):
"personal life" "he attended" insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/
and..
"personal life" "she attended" insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/
Both of these queries finish on English Wikipedia, and return ~750 to ~1600 results. Still a lot, almost certainly with false positives, but potentially manageable.
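For reuse, the two queries can be turned into search links with a short script. This is a sketch, assuming the same `index.php` parameters that appear in the search URL quoted earlier in this thread (`search`, `title`, `profile`, `fulltext`). The helper name `search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://en.wikipedia.org/w/index.php"

# The two queries recommended above, as raw strings so the backslashes survive.
HE_QUERY = (
    r'"personal life" "he attended" '
    r"insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/"
)
SHE_QUERY = (
    r'"personal life" "she attended" '
    r"insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/"
)


def search_url(query: str) -> str:
    """Build a Special:Search URL for the given query string."""
    params = {
        "search": query,
        "title": "Special:Search",
        "profile": "default",
        "fulltext": "1",
    }
    # urlencode percent-encodes quotes, slashes, brackets and backslashes.
    return f"{BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url(HE_QUERY))
    print(search_url(SHE_QUERY))
```

Python may percent-encode a few characters that a browser would leave as-is (for example `*`); the decoded query is the same either way.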
EDIT: I'm sure there is still some way to further improve the regex. There always is! Hopefully this helps, though. TJones (WMF) (talk) 16:59, 17 January 2019 (UTC)