Help talk:CirrusSearch


About this board

Feature suggestion

Be..anyone (talkcontribs)

On the Phabricator pages folks discuss some obscure feature related to file uploads. I vaguely recall that I added links to two images on Phabricator as "other versions" on a Commons file. So where was this, and how can I find it again? Maybe Special:Contributions should offer a search limited to all pages edited by the given user.

Nemo bis (talkcontribs)

I don't know if it's worth it, but this could be feasible by "simply" dumping the history into Elasticsearch. Even just usernames would end up being huge, though.

Reply to "Feature suggestion"

Add articletopic to Draft space

Summary by DCausse (WMF)
Sadads (talkcontribs)

I was working on topics on English Wikipedia and realized that it would be super handy to have the ORES topic models applied to draft space to make it easier to search.


About the copyrights of the pictures

141.237.124.41 (talkcontribs)

Hello guys, I am new to your community and I just wanted to ask you one thing. I have written some e-books and I need to add some pictures to them. Is it OK to use pictures from your community? Of course I will state that these pictures are not mine, and I will give you full credit for them.

Reply to "About the copyrights of the pictures"

What is the recipe for properly re-initializing Elastic/CirrusSearch?

WhitWye (talkcontribs)

Somehow I've ended up with CirrusSearch mostly working, but failing entirely to find some terms known to be in the imported wiki. Also, since we keep a live backup of our wiki, into which we nightly import the whole of the main one, we should have the search DB there thoroughly refreshed each night. What is the proper formula for purging and rebuilding the search DB? Apologies if it's documented someplace obvious that I've so far missed.

EBernhardson (WMF) (talkcontribs)

CirrusSearch contains a maintenance script called forceSearchIndex.php for this purpose. It can be invoked something like the following. This will essentially queue up to 10k indexing jobs, wait for the queue to drop to ~1k jobs (to prevent dominating the job queue and forcing other jobs to wait for the entire process to complete), and then push more jobs, up to 10k at a time, in a repeated fashion.


php extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 10000 --pauseForJobs 1000
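
For what it's worth, with --queue the script only enqueues the indexing work; the jobs themselves still have to be processed by MediaWiki's job queue. A rough sketch of checking and draining the queue with the standard (non-CirrusSearch) maintenance scripts, paths and option values illustrative:

php maintenance/showJobs.php --group
php maintenance/runJobs.php --maxjobs 10000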

WhitWye (talkcontribs)

Running that script on an otherwise idle system, after a string of "Queued 100 pages" messages there's a seemingly endless repeat of "[              wikidb] 179 jobs left on the queue." After many minutes of that, htop shows a load between 0.00 and 0.01. Is there a prerequisite to running this maintenance script successfully? Running it without the flags, I see it runs into a parsing error:


MWException from line 348 of /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php: Bad parser output text. ....


Obviously I should report a bug: https://phabricator.wikimedia.org/T244603

EBernhardson (WMF) (talkcontribs)

I see in the phab ticket you came up with a temporary solution to the parser failure. With that somewhat resolved, does the reindexing complete?

Legaulph (talkcontribs)

I'm seeing the same issue: MWException from line 310 of mediawiki-1.31.7\includes\parser\ParserOutput.php: Bad parser output text. I tried the temporary solution from the reported bug and it does not work. Legaulph (talk) 13:32, 21 May 2020 (UTC)

Reply to "What is the recipe for properly re-initializing Elastic/CirrusSearch?"

search results in rendered form

2001:638:607:205:0:0:0:30 (talkcontribs)

For my MediaWiki project I'm looking for a way to convert the search results from their raw (wikitext) form into rendered form to make them look better. Is CirrusSearch able to do that? Or do you guys have any other idea how I can achieve this?


DCausse (WMF) (talkcontribs)

CirrusSearch is not able to do this; the snippets presented are issued from a text version obtained from \WikiTextStructure::getMainText(). To highlight, we insert HTML tags at precise offsets returned by the highlighter run inside Elasticsearch. If the text indexed by Elasticsearch and the text displayed are different, then you'll have to track where the offsets are, basically knowing that offset 123 in the text version is at offset 342 in the rendered output. Add to this the fact that, since you can't display the whole content (too big), you need to select a consistent chunk of the rendered output to display. This is very challenging in my opinion. Perhaps limiting to a set of known text formatting options might make this a bit easier, but handling everything, including tables, sounds particularly complex.
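
To make the offset bookkeeping concrete, here is a minimal sketch in plain Python (not CirrusSearch code; it assumes the rendered output differs from the indexed text only by the markup tags themselves, which real parser output does not guarantee):

def build_offset_map(html):
    """Map each visible-character index to its offset in the HTML string."""
    offset_map = []
    in_tag = False
    for html_pos, ch in enumerate(html):
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        elif not in_tag:
            offset_map.append(html_pos)  # this visible character came from here
    return offset_map

def highlight(html, start, end):
    """Wrap the plain-text range [start, end) in <em> tags inside the HTML."""
    offset_map = build_offset_map(html)
    html_start = offset_map[start]
    html_end = offset_map[end - 1] + 1
    return html[:html_start] + "<em>" + html[html_start:html_end] + "</em>" + html[html_end:]

# Suppose the indexed text is "The quick brown fox" and the highlighter
# reported a match at offsets 4-9 ("quick"), but the page is rendered as HTML.
rendered = "<p>The <b>quick</b> brown fox</p>"
print(highlight(rendered, 4, 9))  # <p>The <b><em>quick</em></b> brown fox</p>

Real rendered output also adds text that was never indexed (infobox labels, reference markers, and so on), which is why this toy mapping does not solve the actual problem.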

Reply to "search results in rendered form"

Easy way to identify articles with/without images?

Astinson (WMF) (talkcontribs)

So one of the larger theories about reader experience is that illustrated content is more catchy and engaging -- and that we will want to connect content from Commons with those potential articles.

Moreover, when community groups want to organize events like meta:VisibleWikiWomen or meta:Wikipedia_Pages_Wanting_Photos, it would be super useful to be able to identify which articles don't yet have Commons media on them.

Is there a way to surface in search whether or not articles have images? Right now the closest thing I can find to this is whether or not a page has a pageimage indexed, as shown at https://en.wikipedia.org/wiki/Mother_Teresa?action=info . Magnus's PetScan surfaces that element, but it's not reliable -- sometimes a page will have an image, but not a high-quality one -- or it will be a logo or something that doesn't meet whatever criteria are being used for that filter.

@DTankersley (WMF) @DCausse (WMF) & @EBernhardson (WMF) -- would love your thoughts.

TJones (WMF) (talkcontribs)

I don't think there is currently a good tool for this. You can do something with insource: regular expressions, but regexes can be very expensive queries and they aren't necessarily scalable. (We only allow so many regex queries at once, and if you have no other search terms to narrow the scope of the regex, it will always return incomplete results on large wikis because it times out.)

Here's a fairly generic regex that finds File: links with image suffixes (you may want to add other suffixes):

insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/

So, this query on enwiki currently returns about 100K results, but it times out, so the list is not complete.

The negation ( -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ ) returns 440K documents, and also times out.

However, if you can limit your search to a particular category or title match or even fairly rare keyword, it should complete. For example: deepcat:"Film stubs" -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ finishes and gives 632 results (deepcat:"Film stubs" only gets 641 results, so it is easy for the regular expression to run over that limited set).

Note that insource: looks at the actual source of the page, so images included by templates, transclusion, etc, will not be detected.

So, as a once-in-a-while query or set of queries to generate lists for an editathon or other event, this would work. As a widely deployed user-facing tool, it probably would not—though maybe if there are always focused additional search terms.

If you are open to non-search approaches, you could also look at the dumps and write a tool to scan the latest dump for articles without images. It wouldn't be up-to-the-minute, but you could process 100% of a wiki if you wanted to, which would never be possible with insource: searches on larger wikis.

TJones (WMF) (talkcontribs)

As @DCausse (WMF) pointed out, not every wiki uses File:, so another regex may work, or may not. Infoboxes and templates may have other syntaxes. I suppose just looking for things that look like image file names might work, with a few false positives where an article discusses images without actually having one—which seems rare. Parsing dumps sounds better and better.
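
If anyone wants to try the dump route, here is a rough sketch (not a production tool): stream a pages-articles dump and print main-namespace titles whose wikitext has no obvious image link. The file name, export-schema version, and suffix list are assumptions, redirects are not filtered out, and images supplied by templates are still invisible, just as with insource:.

import bz2
import re
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local path
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # adjust to the dump's schema version
IMAGE_RE = re.compile(r"\[\[(?:File|Image):[^|\]]+\.(?:jpe?g|png|gif|svg|tiff?)", re.I)

with bz2.open(DUMP, "rb") as dump:
    for _, elem in ET.iterparse(dump):
        if elem.tag == NS + "page":
            # main-namespace pages whose wikitext has no obvious image link
            if elem.findtext(NS + "ns") == "0":
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                if not IMAGE_RE.search(text):
                    print(elem.findtext(NS + "title"))
            elem.clear()  # keep memory bounded while streaming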

Astinson (WMF) (talkcontribs)

@TJones (WMF) that is a very interesting solution, but regex is not what I would want to provide to organizers for regular use.

Also, I just tried your query with a small set and got one image almost immediately (https://en.wikipedia.org/wiki/Lyra_McKee). I have tried multiple examples, and it seems to be retrieving a not-insignificant number of false positives. I tried something else without regex and it seems to produce a better result. That solves my short-term question.

In the long term, I would think a filter like this would be super useful in the search interface itself. I think the challenge with dumps is that you create a huge barrier to novel use cases for folks who are wiki-literate but not necessarily technically literate. There are some tools that kind of do this kind of search live (e.g. FIST: https://tools.wmflabs.org/fist ), but that tool is kind of overwhelming and breaks from the typical workflows (e.g. leveraging PetScan for categories, because deepcat seems to break every time I use it due to too many categories). But search makes a lot more sense in a tool like PetScan (or any other end-user tool). Ideally you would want a tool that generates a query link you can share around with others, so that the updates stay consistent because they come from search.

197.218.85.218 (talkcontribs)

It seems like the most sensible place to add this functionality would be as a way to fetch page properties (https://phabricator.wikimedia.org/T200860). Of course, there would need to be a generic property added to any page that contains an image at all, e.g. a "hasimages" property; currently one only gets added when an image fulfills certain criteria. Then it would work in a similar manner to Special:PagesWithProp.
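
For illustration, the property that PageImages already sets (page_image_free on English Wikipedia, for a freely licensed lead image; a hypothetical "hasimages" property would work the same way) can be listed today via the special page or the API:

https://en.wikipedia.org/wiki/Special:PagesWithProp?propname=page_image_free
https://en.wikipedia.org/w/api.php?action=query&list=pageswithprop&pwppropname=page_image_free&pwplimit=50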

Anyway, just knowing whether a page has an image can probably be done with the Wikidata Query Service, for instance in a SPARQL query (https://w.wiki/MrK). That example simply searches for cats within Category:Cats on English Wikipedia.

I just cobbled that together using a few SPARQL examples. People more proficient with this would probably be able to make an image appear there, and make it possible to filter out pages without images. It would also be possible to generate those using templates for various use cases.


Reply to "Easy way to identify articles with/without images?"

How to list more than 1 result from a wiki page

Chachacha2020 (talkcontribs)

Hi, I'm using


MediaWiki 1.27.1
PHP 5.5.9-1ubuntu4.22 (apache2handler)
MySQL 5.5.53-0ubuntu0.14.04.1
ICU 52.1
Elasticsearch 1.7.5

and I'm kinda pleased with the search results. However, I have a problem. My wiki has a page "Windows tip" with two headings, "Windows can't sleep" and "Windows wake from sleep". A search for "windows sleep" only brings up "Windows wake from sleep", and then the next results come from other pages. How can I list more than one result from a wiki page?

PS: I can code a bit, so if this feature is not available I can contribute.

DCausse (WMF) (talkcontribs)

Sadly, with CirrusSearch the smallest unit is the wiki page; diverging from that might require significant changes to CirrusSearch's internal data model. Another issue is that CirrusSearch does not know how to attach sections and text to each other: imagine the search query matches a section name; the text displayed below it won't necessarily be extracted from that same section, causing some confusion (see phab:T131950).

Perhaps changing the structure of some of your wiki pages (subpages instead of sections) is an option for you?

If not, I'm sorry for not being able to point at a reasonable solution for adapting the CirrusSearch code.
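
As a sketch of what the subpage option could look like: if the sections became subpages such as Windows tip/Windows wake from sleep, a query like the one below would return each matching subpage as its own result (note that prefix: has to be the last keyword in the query):

windows sleep prefix:Windows tip/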

Reply to "How to list more than 1 result from a wiki page"
beginswith:?

Jonteemil (talkcontribs)

Hello!

Is there a feature you can use to search for the beginnings of pages? For example, if you want to find every page on Commons that begins with {{Information but exclude every page that begins with something else and only has {{Information on the second line?

DCausse (WMF) (talkcontribs)

Hi,

Cirrus does not allow searching for anchors (start or end of document) but I believe you can search for what you want by combining two regular expressions:

insource:/\{\{Information/ -insource:/.\{\{Information/


The first insource:/\{\{Information/ will search for all pages containing the wikitext {{Information. The second -insource:/.\{\{Information/ will exclude all pages that contain a character followed by {{Information (these are all the pages where the Information template is not used at the beginning of the wikitext).

Note that this regular expression is a bit slow to process, as it has to scan a lot of pages, so you may end up only seeing partial results.

Jonteemil (talkcontribs)

I see, thanks! Why doesn't cirrus allow searching for anchors?

DCausse (WMF) (talkcontribs)

Simply because the underlying regular expression engine that we use does not support such a feature :)

Jonteemil (talkcontribs)

Just to be sure: will "beginswith:" and your insource regex give the exact same results, just with different methods? "beginswith:" is what I call the non-existent feature that would serve my need.

DCausse (WMF) (talkcontribs)

@Jonteemil: no, the solution I provided only works if the characters you search for are only used at the beginning of the wikitext content and not repeated elsewhere.


Assuming that we want to search for "xyz" appearing only at the beginning of the wikitext, insource:/xyz/ -insource:/.xyz/ will discard valid results where "xyz" appears at the beginning but also somewhere else in the text.

In other words, the query I provided is only 100% accurate for pages that include the Information template only once.


Allowing the search string to be anchored to the start or the end of the text has been brought up before, in https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Search#Search_by_suffix

I think it would make more sense to add support for ^ and $ in the insource:// and intitle:// keywords rather than adding a new keyword.

Jonteemil (talkcontribs)

Okay, thanks!

Jonteemil (talkcontribs)

Aha, thanks for the knowledge!

Speravir (talkcontribs)

In addition to @DCausse (WMF): citing the help page, “when possible, please avoid running a bare regexp search”. But you also have to take care of the different possible cases. Note that all of these are allowed: {{Information, {{information, {{ Information, {{ information, and in fact an almost endless number of spaces between the opening braces and the template name.

Even though I narrowed down the search, I got a warning with this query because of the heavy template use: file: hastemplate:information insource:"information" insource:/\{\{ *[Ii]nformation/

And, of course, for this I was warned, too: file: hastemplate:information insource:"information" insource:/\{\{ *[Ii]nformation/ -insource:/.\{\{ *[Ii]nformation/

Jonteemil, this is the wrong place for this (it should be discussed at Commons’ Village pump, I guess), but why do you want to know this? Do you want to add == {{int:filedesc}} ==? If so: this is not mandatory!

Jonteemil (talkcontribs)

To add == {{int:filedesc}} == was indeed my intention. Even though it might not be mandatory, I think the goal should be that all files have it, but as you say this is not a MediaWiki matter, rather a Commons one. I asked the question here since the question itself could be of use for every Wikimedia project, even if I intended to use the answer on Commons.

Reply to "beginswith:?"

Question about spelling corrections and "no results"

Equinox (talkcontribs)

For example: I put parimion into Wikipedia's search box. It says: "Showing results for pavilion. (LINK:) Search instead for parimion." I click that link and it says: "There were no results matching the query."

The spelling correction is (sometimes) useful, but in my experience, the "search instead" link never ever gives any results. Indeed that link only seems to be offered when your typed text is not present in the entire wiki, and then it does the best-guess spelling for you.

Am I right? If so, what's the point of that "search instead" link, which is guaranteed to produce no results?

TJones (WMF) (talkcontribs)

We do only replace your query with the suggestion if the original query got zero results. I think the "search instead" language pre-dates all of us who are currently working on the search platform team, so I can't give you the original justification for it—though mimicking Google's UI patterns generally makes search more understandable for most users. However, I can imagine that some people—particularly power users and editors of various sorts—would be upset if they searched for parimion, got results for pavilion, and then couldn't verify that parimion did in fact get zero results.

Google will override your intended search with their suggestion and give a link for your original search that gives fewer results. So, we are working in an environment where people might expect valid results to be overridden by a search engine; letting them see their original results even though there will be zero is goofy, but it's goofiness in the name of transparency.

197.235.220.190 (talkcontribs)

Seems rather simple to improve. Make it clear to the user that the query they chose will result in 0 entries, e.g.: "Showing results for pavilion. Search instead for parimion (Note: there are 0 results)".


>Am I right? If so, what's the point of that "search instead" link, which is guaranteed to produce no results?

No.

It is very important to keep the option allowing the user to search instead for whatever they typed. First, it allows them to verify the search engine's claim; it also makes it clear that they aren't getting wild results because of some bug; and lastly, they can always check that it is accurate. After all, machines can and do make mistakes, and more importantly, the search engine can be wrong more often than not, especially in a wiki where things can change. At the time of the query the search engine might be right, but just a few seconds later someone can create the page, or a new entry might simply be taking time to reach the index even though the content was created right before your search.


Anyway, if a particular wiki doesn't like the message, I guess they could edit it using MediaWiki:search-rewritten.

Equinox (talkcontribs)

If there are no results then I think it would be better to say "no results for X; here are results for Y", and drop the pointless link. I take "197"'s point that there might be results if you search again a few seconds later, but if you want to do that you can just hit Refresh or F5 etc. Hardly a common use case.

Equinox (talkcontribs)

What is my next step? I have had bad experiences with bug trackers. How can I suggest this change without being shit on? Thanks.

TJones (WMF) (talkcontribs)

I'm sorry that you've had bad experiences with task trackers. It's a recurring problem for a lot of people, unfortunately. In this case, the people who would be working on it agree with you, so there shouldn't be any reason for unpleasant discussion.

I've uncovered some of the history of the message—turns out one person on our team was here when it was implemented—and the original thought was that we might allow suggestions to overwrite queries that got a non-zero number of results, but that never materialized.

The current plan is to create a new message that says there are no results for the original query, which we'll show when appropriate, and keep the existing message for a possible future case where we overwrite a query with non-zero results.

I've created a task: T236296

Equinox (talkcontribs)

Okay. Thanks. I really appreciate your help here as an "insider". Let's see how it goes :)

TJones (WMF) (talkcontribs)

Glad I could help. Please do keep in mind that we have to prioritize and work through lots of tasks, so while this is probably straightforward, it may take a while for us to get to it. But you definitely gave us a helpful push in the right direction. Thanks!

Reply to "Question about spelling corrections and "no results""
Prefixes don't negate?

Colin M (talkcontribs)

The filters section says: "A namespace or a prefix term is not a filter because a namespace will not run standalone, and a prefix will not negate." This seems empirically untrue. On EnWP I get the following number of results for each of these queries, as expected:

  • incategory:"LGBT-related musical films": 58
  • incategory:"LGBT-related musical films" prefix:"Hello": 2
  • incategory:"LGBT-related musical films" -prefix:"Hello": 56

So it seems like negating a prefix does work. Am I misunderstanding what this is trying to say? For now I added a {{dubious}} tag.

TJones (WMF) (talkcontribs)

I took a long hard look at this and I'm confused, too. I'm not sure I understand the definition of "filter" being used. I don't know if the documentation is out of date or using some model that I'm not able to wrap my head around. Similarly, I don't get this: "Insource ... is also a filter, but insource:/regexp/ is not a filter." insource:word and insource:/regex/ behave pretty much the same, other than the regex being much slower. Sounds like the documentation could use a thorough review to make sure all the advanced features and special cases are still described correctly.

Cpiral (talkcontribs)

A "filter" can reduce unwanted matches, providing refinement. Regex are special, catered for, terms that use filters. I made educated guesses at numerous terms, and even invented "greyspace". Just trying to help.

When I rewrote the help page to its current form (years ago), there was neither documentation nor discussion. So I would say prefixes can be negated now, but not then.

TJones (WMF) (talkcontribs)

@Cpiral, thanks for the explanation. Also, I very much appreciate all the work you've put into these help pages! They definitely keep getting better.

Pols12 (talkcontribs)

So, I’m editing the page to indicate we can actually negate prefixes.

EDIT: since I’m not really sure what is true, I’ve only removed the concerned sentence. Feel free to explain what is possible and what is not.

Cpiral (talkcontribs)

It tested well. It is true that prefix is a filter. I added the sentence back, modified: "A namespace is a specified search domain but not a filter because a namespace will not run standalone. A ''prefix'' will negate, so it is a filter."

Reply to "Prefixes don't negate?"