Topic on Talk:Search/status

Weight hits that are early in the article more highly than results at the end

Summary by Nemo bis
Shawn à Montréal (talkcontribs)

I couldn't figure out why my search results had suddenly gone to pot until I figured out that "new search" had been auto-enabled. Sorry, but in my experience it's complete junk. I use search to find documentaries in related fields quite a lot, and suddenly it had seemed as if the search function was returning near-random results. Now, with old search back, a search for say, "Algerian War" and "documentary" gets me what I'm looking for: articles related to those terms. With the "new" function on, such a search is virtually useless.

Deskana (WMF) (talkcontribs)

Thanks for the report. Can you tell us what wiki you're searching on?

Shawn à Montréal (talkcontribs)

English Wikipedia. I'm surprised no one else has mentioned it. I'm no SEO guy, but it's as if the "new" search has lost the ability to weight results depending on where they occur within an article. Specifically, I'd been surprised to see that if there was a mention of one of my search terms -- say, the word "documentary" -- even in a reference or an external link, as opposed to the actual body text of the article, the new search would return those results near the top. "Old" search seems to me to be much closer to what I get in a Google search, which is to say, the search function would have the intelligence to distinguish between non-trivial and trivial mentions, somehow.

Junkyardsparkle (talkcontribs)

Part of the issue may be related to the introduction of slop into phrase searches, which confused me initially... see "Double quotes no longer result in phrase search" thread below for gory details. TLDR:

"algerian war"~0 documentary

in the new search will give results fairly similar to

"algerian war" documentary

in the old search, where not using the "~0" gives looser results.
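
To illustrate why "~0" tightens the results, here is a minimal sketch of phrase matching with slop. It is a simplification for illustration only, not how the new search actually computes slop: the function name `phrase_matches` and the rule "at most `slop` extra words in total between the phrase's words" are assumptions.

```python
def phrase_matches(text, phrase, slop=0):
    """Return True if the phrase's words occur in order in text,
    with at most `slop` extra words in total between them.

    Simplified, hypothetical model of phrase slop.
    """
    words = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return True
    for start, word in enumerate(words):
        if word != terms[0]:
            continue
        pos, gaps, ok = start, 0, True
        for term in terms[1:]:
            # Search only as far ahead as the remaining slop budget allows.
            limit = min(len(words), pos + 2 + slop - gaps)
            nxt = next((j for j in range(pos + 1, limit) if words[j] == term), None)
            if nxt is None:
                ok = False
                break
            gaps += nxt - pos - 1
            pos = nxt
        if ok:
            return True
    return False
```

With `slop=0` only the exact phrase matches; any larger value also admits looser hits such as "algerian civil war" for the phrase "algerian war".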

Shawn à Montréal (talkcontribs)

I don't think it's (just) that. For example, a search for the words Mugabe and documentary in old mode returns the two articles on documentaries about the leader, first and second, flawlessly. But switch to new mode, and the two docs are in first and sixth place -- clearly not as good. I don't know what you folks have done, but it's a net loss not a gain, from what I can see.

Junkyardsparkle (talkcontribs)

No, I didn't mean to imply that the problem was entirely (or even mostly) that... the new search does seem to be less magical with respect to your examples. The weighting voodoo is beyond my monkey comprehension, I'm just happy that I can create an explicit query when I want to, and there is some nice syntax available now... for instance, for your purposes, this seems to work pretty well:

mugabe documentary boost-templates:"Template:infobox film|300%"

Again, I'm not trying to say you don't have a valid complaint, just presenting what might be a useful workaround (or potentially even an improvement on hoping the search will weight things the way you want). :)

Shawn à Montréal (talkcontribs)

I'm sorry I don't know what that means or what to do with it. But thanks for trying to help.

Anyway, so long as we still have access to the old search function, the one that worked, it's fine.

Junkyardsparkle (talkcontribs)

It boosts the weighting of results that have the "film" infobox on the page. I don't think they plan to maintain the old search indefinitely, so forgive me if I hijack the thread with some ideas about how to make the new one work better for your purposes.

I'm wondering if it would be possible to implement a weighting method that uses boost-templates under the hood, by mapping certain high-confidence templates to the occurrence of certain associated terms (when not used in a phrase). For instance, if "documentary", "movie", etc implied a boost to the "Infobox film" template. Sorry if this isn't feasible or is already implemented in some way, I'm pretty ignorant of the weighting magic, like I said...
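
The idea above could be sketched roughly as a query rewrite step. This is only an illustration of the proposal, not anything the search actually does: `TERM_BOOSTS`, `add_implied_boosts`, and the choice of trigger words are hypothetical, and the only syntax taken from this thread is the `boost-templates:"Template:infobox film|300%"` form.

```python
import re

# Hypothetical mapping from plain query words to a template boost.
TERM_BOOSTS = {
    "documentary": "Template:infobox film|300%",
    "movie": "Template:infobox film|300%",
}

def add_implied_boosts(query):
    """Append a boost-templates clause when a trigger word appears
    outside a quoted phrase. Returns the query unchanged otherwise."""
    if "boost-templates:" in query:
        # The user already supplied an explicit boost; leave it alone.
        return query
    # Drop quoted phrases (and an optional ~N suffix) before scanning.
    unquoted = re.sub(r'"[^"]*"(~\d+)?', " ", query)
    for term in unquoted.lower().split():
        boost = TERM_BOOSTS.get(term)
        if boost:
            # Only the first matching boost is applied in this sketch.
            return query + ' boost-templates:"' + boost + '"'
    return query
```

For example, `add_implied_boosts("mugabe documentary")` produces the same query that was typed by hand earlier in the thread, while a query that only mentions "documentary" inside double quotes is left untouched.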

Shawn à Montréal (talkcontribs)

Oh, I see, you're actually talking about improving how the dingus works?

But what I don't get is why folks messed with it in the first place. It worked just fine.

Junkyardsparkle (talkcontribs)

I'm talking about that now, but I was also pointing out that you can use the boost-template syntax in your own searches; using the example given should help articles about films bubble up towards the top of the list. From what I understand, the old search was difficult to maintain on the back end, and the new one will be better in that regard.

Shawn à Montréal (talkcontribs)

I see. I'm sorry but I have no idea how to modify the syntax or anything of that nature. Like most, I guess. I just type words in the window and hit the button.

Shawn à Montréal (talkcontribs)

I'm confident they'll fix the new search before they remove the old one - but one other thing I realize one can do is use Google Advanced Search to search Wikipedia. Tried it and it works pretty well.

NEverett (WMF) (talkcontribs)

Yeah, that boost-templates thing is more to test the default template boosting. You see, there is a configuration parameter on wiki that can be set to make everyone's searches silently contain some boost. The idea was to allow community curation of the results. Commons uses it but not that extensively. You'd use boost-templates in your query either when you want to disable the defaults or when you want to test new ones. So it's really a "super expert" kind of thing. In addition to that, it's a convenient hook for my regression tests to check the feature.

NEverett (WMF) (talkcontribs)

Thanks for coming here to complain about these results. We'll figure some way out to make it at least as good for this class of search.

As to why we're replacing the old search when it is so good at finding results, here is the short list:

  • Old search crashes/runs out of resources from time to time and no one knows how to fix it. It's a pretty large code base based on really old libraries. New search is based off of relatively standard services under active development.
  • Old search updates every few days and often misses things. New one updates pretty near real time. Page edits are usually in the index in under a minute. Template edits can take longer to be reflected in the pages that contain those templates.
  • Old search doesn't do anything with templates. New search fully resolves templates. It's *righter* but it's more trouble.


The truth is that the replacement project was driven internally by ops folks raising a ruckus because the old one had no maintainer and wasn't super stable. There is also a significant backlog of bugs and feature requests for search that we've had to ignore because the old one was so hard to work on. So that's how you get where we are.

As for why the new search doesn't spit out results exactly like the old one, one of the reasons is that the old one is super customized for English Wikipedia. It's difficult to navigate and many of the customizations were speculative: they didn't really provide better results, they just were there. So we implemented the ones that were obviously better and deployed the new search as a BetaFeature so folks could try it. When we tried it we found the results were usually similar but not better or worse. You've hit on one of the customizations that we didn't reimplement: the old search weights hits that are early in the article more highly than results at the end. We didn't do this because our tests didn't show it made much difference. But for your searches it makes a pretty huge difference.

Long story short, we'll implement that.
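
As a rough picture of what weighting early hits more highly can mean, here is a toy scorer. It is a sketch under stated assumptions, not the old or new search's actual formula: the function name `positional_score`, the linear falloff, and the `decay` parameter are all made up for illustration.

```python
def positional_score(text, term, decay=0.5):
    """Sum a weight for each occurrence of `term` in `text`.

    The weight falls linearly from 1.0 for the first word of the
    text to (1 - decay) for the last word, so early hits count more.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    span = max(len(words) - 1, 1)  # avoid dividing by zero for one-word texts
    target = term.lower()
    score = 0.0
    for i, word in enumerate(words):
        if word == target:
            score += 1.0 - decay * (i / span)
    return score
```

Under this model an article that opens with the word "documentary" outscores one that only mentions it in its last line, which is the behaviour the search examples in this thread depend on.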


Also, if you are curious about how scoring works you can read the first half of this presentation. The other half won't be all that interesting.

Shawn à Montréal (talkcontribs)

Thank you very much. Frankly, I didn't think people would much care what I had to say.

Junkyardsparkle (talkcontribs)

Interesting, I wouldn't have guessed that was the optimization involved, but now that you mention it, that weighting does make a huge amount of sense in the context of wikipedia articles, being summarized in the lead section...

Shawn à Montréal (talkcontribs)

I'm surprised it was not judged to be worth retaining, initially. Google has made great strides in making its search more intelligent, in distinguishing between relevant and trivial mentions of search terms.

In Wikipedia, we have guidelines that explain the importance of summarizing key concepts in the article lead. To design a search engine to intentionally disregard that very structure is puzzling to me.

Nemo bis (talkcontribs)

They did not "intentionally disregard" the feature, they just have not spent time developing it from scratch; but you were told they will now. Also consider that this weighting is not a search backend standard, and it's not even valid for many MediaWikis including Wikimedia wikis (specifically, in order of traffic: Commons, Wiktionary, Wikiquote, Wikisource and Wikibooks).

This system of prioritisation makes sense to me: it would have been worse if they had tried to reimplement every single feature and customisation of the old custom search moloch, even unrequested. We would have wasted lots of developer time and ended up with another unmaintainable system which would receive no love for the next 5 years.

NEverett (WMF) (talkcontribs)

Thanks for the defence, but Shawn's right; it's a relatively obvious optimization. It's something that's "been on the list" for a long time but it kept getting lower and lower as we'd been in beta and no one complained about quality in a way that this would have caught. I frankly forgot about it.

As far as "intentionally disregard" goes, if anyone did any disregarding, it was me. I'd prefer to characterize what I did in this case as getting so snowblinded by all the (probably) speculative features to improve search quality that I didn't give this one as much weight as it deserves. But there isn't a clear line between that and intentionally disregarding it. It did, after all, make it onto my list, just too low.

I will admit to getting mired in a pet issue of mine, highlighting. The highlighter wasn't going to support it so I spent quite a bit of time on it. In fact, the highlighter used on enwiki and commons right now does prefer snippets from the beginning of the article. But I got distracted by the snippet issue and didn't cover the scoring issue.

Anyway, I'm going to go fiddle with positional boosts now. Depending on how that goes you'll get a solution soon.

Shawn à Montréal (talkcontribs)

I certainly didn't mean to offend anyone, sorry. I think it's kinda neat that this one lone comment from me has been helpful to the cause, and thanks.

NEverett (WMF) (talkcontribs)

Complaining is how I know more has to be done!

I implemented weighting terms early in the article more highly than later ones (locally, not deployed) but I'm not happy that'd be enough for your case. Mugabe's Zimbabwe doesn't have the word "documentary" in the opening. It calls it a "factual film". I'm sure there is a distinction but I imagine it's small enough people still think of it as a documentary. I mean, it is in the "Documentary films about politicians" category. I think I'll add a search in the category with a decent weight as well. That seems like it'd help.
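
A minimal sketch of how a category match could lift a page whose text never uses the search word. The function `score_with_category` and the `category_weight` value are hypothetical and only illustrate the idea of adding category hits to an existing text score; they are not the change that was actually written.

```python
def score_with_category(text_score, categories, query_terms, category_weight=2.0):
    """Add a fixed bonus for every query term found in a category name.

    `text_score` is whatever score the body text already earned;
    `categories` is a list of category names for the page.
    """
    terms = {t.lower() for t in query_terms}
    hits = 0
    for category in categories:
        category_words = set(category.lower().split())
        hits += len(terms & category_words)
    return text_score + category_weight * hits
```

In this model a page in "Documentary films about politicians" gains a bonus for the query word "documentary" even when its opening only says "factual film".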

Shawn à Montréal (talkcontribs)

Oh yes, the "factual film" thing is a real outlier. Don't worry about that. But yes if you could weight the categories a bit more, then that might indeed help search results. Good idea.

NEverett (WMF) (talkcontribs)

Both of those changes are ready for review. I imagine the category thing will catch the factual film outlier. My best guess is we'll deploy them to the test wikis next Thursday and to Wikipedias the Thursday after that. Both changes, though, will require some time to take effect because the index will have to be rebuilt. That'll take a few days. That is one of the problems with Cirrus: the old search could rebuild the entire index more quickly because it didn't bother with stuff like templates. We can't. We react more quickly because we're able to hook more tightly into the infrastructure and we can throw more CPU at the problem. But when you have to change the index it takes some time. OTOH it's like 100 times easier to debug than the old stuff, so tradeoffs.....
