Topic on Talk:Search/status

New "insource:" syntax, etc.

3 comments • 03:10, 22 November 2014 9 years ago


I'm happy about the new "insource:" syntax, because every once in a while I find myself wishing for just that kind of low-level inspection (finding certain kinds of malformed information on commons file pages, for instance). Can I assume that the regex flavor is implemented in a way that's smart enough to only run it on files that would be the results returned by the rest of the query? I don't want to hammer the servers playing with it, but it did seem to be quite fast in my first use, which had a "prefix:" term that narrowed the field to ~600 hits by itself. Is this a reasonable usage?

Also, does the non-regex version basically just ignore non-word characters, much like the other search functionality? It seemed so from a few quick tests.

And, just to completely overload this post, I'm also wondering about what the "first paragraph" weighting would apply to in the context of a typical commons File: page... basically only up to the first heading? This could be significant in terms of best practices for adding information to the pages, I think. Current upload methods tend to slap a ==Summary== header at the very top of the page, while many older uploads are lacking this. Could this cause some wonky weighting of older vs. newer uploads?

Reply 02:03, 29 June 2014 9 years ago

😂 (talk | contribs)

You can't hammer the servers too much, pool counter is pretty small for regex searches ;-) I'll leave it to Nik to say how late/early in result processing it handles regular expressions.

Yes, most punctuation and so forth is ignored for non-regex searches. insource: searches a different field `source_text` instead of `text`, the latter of which is configured with all kinds of language-specific bells and whistles to make it better at finding content for the majority of readers.
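
To make "punctuation is ignored" concrete, here is a rough sketch in Python. It is only an illustration: `rough_tokens` and `matches` are made-up names, and splitting on non-word characters is a stand-in assumption, not the analyzer Elasticsearch actually runs on `source_text`.

```python
import re

def rough_tokens(text):
    """Lowercase the text and split it on runs of non-word characters.

    Hypothetical stand-in for a search analyzer, for illustration only.
    """
    return [t for t in re.split(r"\W+", text.lower()) if t]

def matches(query, source):
    """Return True if every token of the query occurs among the source tokens."""
    source_tokens = set(rough_tokens(source))
    return all(tok in source_tokens for tok in rough_tokens(query))

# Punctuation in the query makes no difference under this model:
# "|Description=" and "description" reduce to the same single token.
print(rough_tokens("{{Information|Description=Foo}}"))
print(matches("|Description=", "{{Information|Description=Foo}}"))
```

Under this model, a search for markup like `|Description=` behaves the same as a search for the bare word, which is why the regex flavor is the one to reach for when the punctuation itself matters.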

The first paragraph weighting isn't as nice on the PHP side as I'd like. It uses the pretty naïve approach you outline there: it just takes whatever comes before the first heading, which isn't necessarily the best choice.

Reply Edited by Shirayuki 03:10, 22 November 2014 9 years ago

NEverett (WMF) (talk | contribs)

Sorry it took me so long to get to this.... Vacation and performance work have been squeezing me dry.

As far as order of operations - Elasticsearch _should_ do the right thing and execute the expensive filter last. On Thursday (I think) we're pushing a change to Cirrus that gives Elasticsearch a big hint that the regexes need to come last. If it isn't fast now it should be then.
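
For anyone wondering why the ordering matters, here is a toy sketch in Python of "cheap filter first, expensive filter last". It is not how Elasticsearch or Cirrus implement it; the `search` function, the `(title, source)` page tuples, and the `regex_runs` counter are all assumptions made up for the illustration.

```python
import re

def search(pages, title_prefix, pattern):
    """Filter pages by title prefix first, then run the regex on the survivors.

    `pages` is an iterable of (title, source) pairs (a hypothetical shape).
    Returns the matching titles and how many times the regex was evaluated.
    """
    regex = re.compile(pattern)
    regex_runs = 0
    hits = []
    for title, source in pages:
        # Cheap check: skip anything outside the prefix without touching the regex.
        if not title.startswith(title_prefix):
            continue
        # Expensive check: only pages that passed the cheap filter get here.
        regex_runs += 1
        if regex.search(source):
            hits.append(title)
    return hits, regex_runs

pages = [
    ("File:A.jpg", "{{Information|date=}}"),
    ("File:B.jpg", "{{Information|date=2014}}"),
    ("Talk:X", "date="),
]
hits, regex_runs = search(pages, "File:", r"date=\s*\}\}")
print(hits, regex_runs)
```

The regex cost scales with the number of pages that survive the prefix filter, not with the size of the whole index, which is the behavior the hint to Elasticsearch is meant to guarantee.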

The non-regex flavor of insource uses the standard analyzer used for the rest of the text. So it's exactly how intitle works, except against source. It's not perfect but it's at least somewhat intuitive.

The first paragraph weighting is something we should change to work around on-wiki habits, rather than the other way around. I built it so you could plug multiple implementations into it, but only implemented the naive until-the-first-heading approach. It'd be simple enough to modify it, or to create a new one that skips the first heading if it is the very first thing.
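
As a sketch of the difference between the two approaches (in Python rather than the PHP that Cirrus uses, with hypothetical function names and a simplified heading regex that is my own assumption, not the real extractor):

```python
import re

# Simplified wikitext heading: a line like "==Summary==" or "=== Notes ===".
HEADING = re.compile(r"^(={1,6}).+?\1[ \t]*$", re.MULTILINE)

def opening_text_naive(wikitext):
    """Everything before the first heading (the naive approach)."""
    m = HEADING.search(wikitext)
    return (wikitext[:m.start()] if m else wikitext).strip()

def opening_text_skip_leading_heading(wikitext):
    """Same, but a heading that is the very first thing on the page is skipped."""
    body = wikitext.lstrip()
    m = HEADING.match(body)
    if m:
        body = body[m.end():]
    return opening_text_naive(body)

new_upload = "==Summary==\n{{Information|description=A cat}}\n==Licensing==\n{{cc-zero}}"
old_upload = "{{Information|description=A cat}}\n==Licensing==\n{{cc-zero}}"

print(repr(opening_text_naive(new_upload)))                 # empty: the page starts with a heading
print(repr(opening_text_naive(old_upload)))
print(repr(opening_text_skip_leading_heading(new_upload)))  # same text as the old upload
```

With the naive version, a page that opens with ==Summary== has no opening text at all, while an older upload without the header does. The second version treats both the same way.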

Reply 14:56, 29 July 2014 9 years ago
