
Talk:Search/Old/status

From mediawiki.org
Latest comment: 11 years ago by NEverett (WMF) in topic Works for Japanese


Search class


How to implement the search class - by the wikipedian formerly known as Uncle.bungle

It took me days to piece this together so I'm saving it here for posterity (and since I know I can always check back here :p )

$search = SearchEngine::create();
// Return at most 5000 results, starting at result 0. The SearchEngine
// constructor defaults the limit to zero, so if this is not set there are no results.
$search->setLimitOffset( 5000, 0 );
$q = $search->replacePrefixes("searchterm");
$matches = $search->searchText($q);
while( $result = $matches->next() ) {
	$t = $result->getTitle();
	$link = $t->getFullURL();
	echo $link . "\n";
}

— Preceding unsigned comment added by 65.127.188.10 (talk · contribs) 13:25, 4 January 2009. 65.127.188.10 (talk · contribs) 00:48, 19 March 2012 (UTC)

Maximum size


Hi! What is the maximum size allowed for the search string? Or is it defined by the number of words? Thanks! --189.59.166.40 14:43, 9 March 2012 (UTC)

There's not really a strict maximum. brion (talk) 20:34, 19 March 2012 (UTC)Reply

Sister projects


I don't understand this edit: [1]. Is it about some current implementation (which I've not seen yet), or an aim for the future? If the latter, does someone understand how those bullets were selected? Nemo 18:17, 9 December 2013 (UTC)Reply

Not sure. There isn't any implementation other than the one that's in Gerrit and on WMF sites.
Well, maybe there's secret uncommitted magic on my localhost, but nobody's got that but me ;-) ^demon[omg plz] 22:00, 9 December 2013 (UTC)

Search for short titles not easy


If I search for (for example) places with the name "Kil" (nl-wiki), it suggests a lot of articles starting with Kil but misses the essential ones. Just entering "Kil" in the search field doesn't give the places named "Kil" itself, only suggestions for much longer article names. Romaine (talk) 13:49, 25 December 2013 (UTC)

I've filed this as https://bugzilla.wikimedia.org/show_bug.cgi?id=59842 and I'll have a look at it soon. Manybubbles (talk) 23:10, 8 January 2014 (UTC)Reply

"WTF" search results


After enabling this feature on Wiktionary, when I search for "son", instead of landing at son, I get pointed to the page... "són".

If I wanted to look for a page with a diacritic in its title, I would just type it myself, dammit. Keφr 21:09, 1 January 2014 (UTC)Reply

Same for me. This is a MAJOR BUG in my book. Wikitiki89 (talk) 01:08, 8 January 2014 (UTC)Reply
I'll have a look at this soon. I've filed it here: https://bugzilla.wikimedia.org/show_bug.cgi?id=59841 Manybubbles (talk) 23:07, 8 January 2014 (UTC)Reply
Replying to my non-WMF account as my WMF account just to make things more confusing. Anyway, I have information:
Wiktionary has 8 pages that all "near match" son with the current analysis setup:
sơn
Son
són
son
sön
SON
søn
soñ
If we turn off ascii folding it still has three:
Son
son
SON
I'm inclined to have the "go" search for things that have multiple options just drop you on the search results page.
Other options are to try to guess what you wanted with various strategies:
1. Assume you want the most linked page that near matches "son". That is "son" on wiktionary by a wide margin.
2. Assume you want the page that is "closest" to what you typed. I'm not sure how I'd resolve ties other than giving up and showing you the search results page. Still, that would also take you to the "son" page in this case.
More stuff?
Obviously going to "són" is wrong. I'm not really sure what is right though. NEverett (WMF) (talk) 14:04, 9 January 2014 (UTC)Reply
The following is a must (I think): If there is a page with the exact same capitalization and accents, go to that page (disregarding capitalization of the first letter for non-case-sensitive projects).
I would then prefer to have: if there is one page with the exact same accents, but different capitalization, go directly to that page.
Otherwise, show the search results. Skalman (talk) 15:03, 11 January 2014 (UTC)Reply
@NEverett: It is undesirable to just go to search results. Skalman's sequence seems right, though there may be still more wrinkles. If you can get this right for English Wiktionary (lots of entries, lots of languages, lots of scripts), you should have many situations covered. DCDuring (talk) 15:29, 14 January 2014 (UTC)
I've submitted for review something pretty close to Skalman's sequence. I'm not sure when we'll next push code to Wiktionary but you should be able to monitor the bug for status. NEverett (WMF) (talk) 20:53, 14 January 2014 (UTC)Reply
Awesome! Skalman (talk) 22:06, 14 January 2014 (UTC)Reply
This is deployed. When you get a chance please give it a shot. I tried it with various changes to "son" and it seemed much better to me. NEverett (WMF) (talk) 19:30, 27 January 2014 (UTC)Reply
I'll try it. DCDuring (talk) 00:47, 19 March 2014 (UTC)Reply
Thanks. In the meantime, I've disabled the Beta. DCDuring (talk) 23:40, 14 January 2014 (UTC)Reply
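
For readers following along: Skalman's sequence above (exact match first, then a single page with the same accents but different capitalization, otherwise the results page) can be sketched roughly as below. This is an illustration in Python, not the code that was submitted for review; the function name `resolve_go`, the `first_letter_case_sensitive` flag and the in-memory title list are all assumptions made for the example.

```python
# Titles from the thread that all "near match" the query "son".
TITLES = ["sơn", "Son", "són", "son", "sön", "SON", "søn", "soñ"]


def resolve_go(query, titles, first_letter_case_sensitive=False):
    """Return the title to jump to, or None to show the search results page."""

    def normalize_first(text):
        # On wikis where the first letter is not case-sensitive,
        # "son" and "Son" name the same page.
        if first_letter_case_sensitive or not text:
            return text
        return text[0].upper() + text[1:]

    # Step 1: same capitalization and same accents.
    exact = [t for t in titles if normalize_first(t) == normalize_first(query)]
    if exact:
        return exact[0]

    # Step 2: exactly one page with the same accents but different capitalization.
    same_accents = [t for t in titles if t.casefold() == query.casefold()]
    if len(same_accents) == 1:
        return same_accents[0]

    # Step 3: anything else is ambiguous, so fall back to the results page.
    return None
```

With the Wiktionary titles listed earlier (where the first letter is case-sensitive), a query of "son" resolves to "son", while "sON" is ambiguous between three pages and falls through to the results page.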

"Near-real-time search" also for search suggestions


I understand that one of the goals is to have "near-real-time search", and that seems to be working, except for search suggestions.

Writing the start of the article name still doesn't suggest going to the article. (tested on svwiktionary) Skalman (talk) 14:51, 11 January 2014 (UTC)Reply

Works for me. I heard the search suggestions are currently quite resource-intensive, maybe they're just a bit too slow? It took a fraction of a second to get one when I tried. Nemo 22:12, 12 January 2014 (UTC)
I think I'm expressing myself unclearly. What I meant to say is that I believe that:
One of the goals is to have new articles and changes to articles be reflected in the search results almost "instantly", i.e. if I create an article and then search for it, it'll show up in the search results (with the old engine you had to wait a while).
The above does work for me, but the search suggestions don't: If I create the page "Testing" and enter "Test" into the search box, I would expect "Testing" to show up, but only other, previously existing pages show up. Therefore I suspect that the search suggestions use the old system. Skalman (talk) 01:55, 13 January 2014 (UTC)Reply
I'm reasonably sure what you are seeing is the 24 hour cache time on prefix searches. I saw that a few months ago and thought it would be problematic but then forgot about it. I've filed a bug for it so I won't forget it again. I can't promise anything on this soon because prefix search really is slower than I'd like. We have a fix in review that cuts the time for short prefixes on large wikis by more than half, but it still isn't fast enough. I'm told there are features in Elasticsearch that should let me cut that by an order of magnitude but they are still labelled "experimental". At some point I'll have a look at that though.
Another option, while I'm thinking about it, is to make the cache time relative to the number of characters being prefix searched and the size of the wiki. Only wikis with tons of titles really need much caching. Also it doesn't do too much good to cache prefixes longer than a few characters I think. NEverett (WMF) (talk) 16:59, 13 January 2014 (UTC)Reply
Thanks - I hope that you figure it out. I have another suggestion that would mostly solve the problem without incurring any additional server load.
The recently created pages that I hope to find with the search suggestions are those that I myself have created, or those that I have at least visited. A possible solution would thus be:
  • Create an index of recently visited pages (e.g. with localStorage)
  • Merge search suggestions from the JSON API and the localStorage
Do you think that this solution would be appreciated? Should I file a bug? Skalman (talk) 20:30, 13 January 2014 (UTC)Reply
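
A rough sketch of the merge step proposed above, written in Python for brevity rather than as browser code. The two input lists stand in for the JSON API response and the localStorage index of recently visited pages; `merge_suggestions` and its parameters are hypothetical names, and putting the local titles first is just one possible ordering.

```python
def merge_suggestions(prefix, api_titles, recent_titles, limit=10):
    """Combine server suggestions with locally remembered titles.

    Locally remembered titles that start with the typed prefix are listed
    first, followed by the server suggestions, with duplicates removed.
    """
    wanted = prefix.casefold()
    local = [t for t in recent_titles if t.casefold().startswith(wanted)]

    merged = []
    seen = set()
    for title in local + list(api_titles):
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged[:limit]
```

In the "Testing" example from this thread, a freshly created page would show up for the prefix "Test" even while the server-side prefix cache still returns only older titles.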

sim city vs. simcity


Hi,

I just enabled the new search on wp:fr and I have a problem with the query "sim city". This query refers to the video game SimCity (without a space). With the classic search engine, the first results are the different versions of the game SimCity. With the new search, those don't appear in the 20 first results. Note that page w:fr:Sim City exists and redirects to article w:fr:SimCity. Rinaku (t · c) 13:14, 12 January 2014 (UTC)Reply

I'm starting to have a look at this. The first thing that jumps out at me is that Cirrus finds sim city properly when we ask it to include redirects in the search: example. The current search seems to ignore that check box entirely. Include redirects really ought to be the default. I've filed a bug for that.
The second problem is that even when redirects are included w:fr:SimCity is the sixth result and the redirect that is highlighted is w:fr:Sim City (série) which obviously isn't as good as w:fr:Sim City. So this is both an unexpected score problem and a highlighting problem. I'll investigate those more and reply to myself. NEverett (WMF) (talk) 15:00, 14 January 2014 (UTC)Reply
The results ordering problem is due to bugzilla:60045
Quick explanation: the more redirects to a page the less each redirect counts. That is silly and I'm going to fix it. NEverett (WMF) (talk) 17:00, 14 January 2014 (UTC)Reply
Thanks! Rinaku (t · c) 19:20, 14 January 2014 (UTC)Reply

Searching common words yields nothing


Originally I tried searching for åt (Swedish for "to" or "for"), but didn't get any search results. I had expected to see pages such as wikt:sv:gå åt and other pages with "åt" as a word in the title. I would've wanted to also see pages like wikt:sv:inåt and wikt:sv:utåt (which link to "åt"), but not "båt" ("boat").

  • At the very least, searching common words shouldn't be disabled for titles.
  • If possible, I'd also like to see pages with links that have "åt" as a word in their link text.
  • If possible, it'd be nice to also see other pages with "åt" in their text.

Edit: An English example would be searching for "to". Skalman (talk) 13:35, 15 January 2014 (UTC)Reply

This is the first thing I'll work on when I can actually sit down and code again. At worst it'll be early next week. Tracking at bugzilla:60302. The fix for that will likely also be the fix for bugzilla:54937. NEverett (WMF) (talk) 19:51, 21 January 2014 (UTC)Reply
Wonderful! It's amazing how it at first feels like the new search is horrible and much worse than the old one, but then you come along and say that this and other stuff will soon be fixed. A few more weeks of this magic and I'll wonder how I put up with the old search.
Sorry... just a convoluted way to say thanks. Skalman (talk) 23:20, 21 January 2014 (UTC)Reply
We have a fix for this in review. It mostly makes things better but it isn't perfect, unfortunately. This is one of the cases where lsearchd is still more advanced than the rest of the open source world. In particular the change does this well:
1. Finds both exact matches and stemmed matches but sorts the exact matches higher. Yay.
2. Finds articles that only contain stop words! Sweet!
3. Highlights stopwords in results. Very nice!
But it does these things poorly:
1. Stop words will be worth as much as exact matching text. They really ought to be worth less than stemmed text. Like 10% or something. Anyway, that requires some work in Elasticsearch to get that hopping. I mean, stop words in the article text will still be worth very little because they are common, but they will be uncommon in things like headings and titles, which will make them worth more when they appear.
2. Stop words are now required. They used to be ignored, which was bad, but now they are required, which, I think, is less bad but still not good. The side effect here is that if you search for "the once and future king" you won't find an article named "once and future king" if the article doesn't contain the word "the". This is somewhat moot because most articles (in English) will likely contain the word "the". On enwiki, the article for "The Once and Future King" even contains the word "The" in the title.
I think the trade-off is worth it, but it is a trade-off. I've started work upstream to get rid of the problems, though I really can't comment on when we'll get the fixes. NEverett (WMF) (talk) 19:28, 27 January 2014 (UTC)
We've deployed the fix I mentioned above. NEverett (WMF) (talk) 16:40, 4 February 2014 (UTC)Reply

Underscores (and a question)


I think it's a bug that the new search displays all talk namespaces with underscores instead of spaces, e.g. "Benutzer_Diskussion" instead of "Benutzer Diskussion". Here is an example. The old search used spaces.

Additional question: The order of the results in my example is very, very different. I'm not sure if this is good or bad. To understand what's happening you need to know my user name "TMg" is unfortunately used as an abbreviation for a law in Germany (Telemediengesetz, TMG). The old search focused on articles. The new search pushes my own pages, project and talk pages up. The articles are far, far away. Basically I have to disable all other namespaces to be able to search for articles. Is this on purpose? Does the search take into account if a user searches for his own name? TMg 21:23, 17 January 2014 (UTC)Reply

Regarding Underscores: That certainly is a bug. Filed here: bugzilla:60298.
Regarding searching for TMg: We certainly don't push articles to the top based on user information. CirrusSearch is pulling the articles with matches in the title above articles with matches in the text. LuceneSearch does that too, but it isn't pulling quite as hard as CirrusSearch. CirrusSearch is doing exactly what I expect it to do in this case. Do you think we should lower the importance of title matches relative to text matches? NEverett (WMF) (talk) 18:43, 21 January 2014 (UTC)
My "TMG" example feels very unbalanced. I tried other examples where a word is in the title of non-article pages and unfortunately it's always the same.
  1. The old search focused very much on articles. This wasn't a bad thing. Most users are looking for articles and should look for articles first. If you don't want to search for articles there are plenty of options.
    → To get roughly the same old result with the new search I have to disable all namespaces except for the article, file and portal namespaces.
  2. The new search pretty much does the opposite. My user and user talk pages are boosted way up.
    → To get roughly the same new result with the old search I have to disable the article, file and portal namespaces.
In my opinion:
  • Articles should be more important than all other namespaces.
  • Subject pages should be more important than talk pages (in all namespaces, e.g. "User:" should be more important than "User talk:").
In my example a perfect match like the article "TMG" should always be first, followed by a decent mixture of both articles that have the word in the text and user pages that have the word in the title. Basically my idea is to give
  • text in the article namespace and
  • titles in non-article namespaces
more or less the same relevance. If that makes sense. TMg 18:00, 22 January 2014 (UTC)Reply
Sorry I didn't get to this earlier. I've filed it as bugzilla:61053. NEverett (WMF) (talk) 20:03, 7 February 2014 (UTC)Reply
Thanks for nailing it down! Very interesting bug report. Nemo 14:12, 8 February 2014 (UTC)Reply
Wow, this is so much better. Thank you. The ordering of the search results in my example is still very different in the old and the new search but much, much better. I guess the old search used a different ranking. The Portal: had a higher ranking, for example. Would be good to add NS_PORTAL to your patch, if that's possible. TMg 20:54, 27 February 2014 (UTC)Reply
I see what you mean about portal. I've filed 62056 to handle that. My plan is to default everything that isn't in MAIN to 0.5 if it isn't otherwise specified. Maybe 0.25. I'll play with your query a bit and see what makes sense. If you have some idea what the number should be please share. NEverett (WMF) (talk) 15:16, 28 February 2014 (UTC)Reply
Aren't you giving content namespaces a special treat? Nemo 22:21, 6 March 2014 (UTC)Reply
Yeah, I did. This ended up being annoyingly complicated. The rules:
  1. If there is an override take that score.
  2. Is this the main namespace? Score = 1
  3. Is this a talk namespace? Start this process over again for the non-talk version then multiply the result by .25.
  4. Score = .2
There are some "default overrides" for namespaces that don't follow the rules. We also can change them wiki by wiki....
NS_USER => 0.05,
NS_PROJECT => 0.1,
NS_MEDIAWIKI => 0.05,
NS_TEMPLATE => 0.005,
NS_HELP => 0.1,
NEverett (WMF) (talk) 20:05, 14 March 2014 (UTC)Reply
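
The four rules and the default overrides quoted above can be written out as a small function. This is a sketch in Python, not the CirrusSearch implementation: `namespace_weight` is a hypothetical name, and it relies on the standard MediaWiki numbering in which talk namespaces have odd ids and the matching subject namespace is the id minus one.

```python
# Standard MediaWiki namespace ids for the namespaces named in the thread.
NS_MAIN, NS_TALK = 0, 1
NS_USER, NS_USER_TALK = 2, 3
NS_PROJECT = 4
NS_FILE = 6
NS_MEDIAWIKI = 8
NS_TEMPLATE = 10
NS_HELP = 12

# The "default overrides" as quoted in the comment above.
DEFAULT_OVERRIDES = {
    NS_USER: 0.05,
    NS_PROJECT: 0.1,
    NS_MEDIAWIKI: 0.05,
    NS_TEMPLATE: 0.005,
    NS_HELP: 0.1,
}


def namespace_weight(ns, overrides=DEFAULT_OVERRIDES):
    # Rule 1: an explicit override wins.
    if ns in overrides:
        return overrides[ns]
    # Rule 2: the main namespace scores 1.
    if ns == NS_MAIN:
        return 1.0
    # Rule 3: a talk namespace scores a quarter of its subject namespace.
    if ns % 2 == 1:
        return 0.25 * namespace_weight(ns - 1, overrides)
    # Rule 4: everything else.
    return 0.2
```

Under these rules a user talk page ends up at 0.25 × 0.05 = 0.0125, and a namespace with no override, such as File, gets 0.2.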
Sorry, I've not understood: what ended up being very complicated, using $wgContentNamespaces ? Ideally $wgContentNamespaces would also directly affect the scoring. Nemo 10:43, 15 March 2014 (UTC)Reply
Sorry, I mean the logic. Do you have a proposal for how $wgContentNamespaces should affect scoring? Do the same thing we do for NS_MAIN for all content namespaces? NEverett (WMF) (talk) 14:05, 17 March 2014 (UTC)Reply
Yes. All namespaces in $wgContentNamespaces are equal to main namespace, in core; I don't see a reason to make people set them up for search separately. Nemo 00:02, 18 March 2014 (UTC)Reply
What was the purpose of this comment? Jorm (WMF) (talk) 19:28, 19 February 2014 (UTC)Reply

If I want "clientèle" (there are about 25 on en.wikipedia), I also get several thousand "clientele". Is there a way to make it sensitive to accents and diacritics? Also very desirable would be a case-sensitive search.

I brought this topic up on Help talk:CirrusSearch, but that does not seem to be a very active page. I don't know whether the differences are due to incomplete help instructions, or features that are actually missing from CirrusSearch.

Wikignomes are going to need to search for *exact* matches for accents/diacritics/hyphens (or lack thereof). While we're at it, a case-sensitive search would be very helpful (e.g. to find and fix "washington" and "FaceBook"). Chris the speller yack 21:41, 20 January 2014 (UTC)

Nik said in October that «Accents folding is turned on for English wikis but off for all other wikis». These things can be configured per language. Nemo 09:29, 21 January 2014 (UTC)Reply
Well, that's not going to be a good situation. If I am in reading mode and searching for an article that contains "Harrods has an upper-class clientele", I will surely accept "clientèle" as well. If I am in fixing mode and looking for articles that incorrectly use the French word "clientèle" instead of the English word "clientele", then I don't want to see every article that contains "clientele", just the ones with the diacritic mark. Same issue with hyphens; Lucene pays attention to hyphens if you specify them, but CirrusSearch does not. With Lucene, I can search "to clean-out the" and remove the incorrect hyphens, but CirrusSearch produces all pages that contain that phrase with or without the hyphen. I need a way to specify that an exact match is sought. Chris the speller (talk) 16:04, 21 January 2014 (UTC)Reply
Proposal for accents: If you search with an accent it should only return results with the accent. If you search without it it should return both. This will still only be enabled for English unless some other language wants it. A potential twist: only forgo squashing accents if the accented string is quoted like "clientèle" or "Harrods has an upper-class clientèle". I could see arguments for both behaviours and both would be about the same to implement.
Regarding hyphens: Cirrus certainly treats all hyphens as word breaks. We could do the same trick for hyphenated words as I propose for accented words, though it'd apply to all languages.
I'm tracking both together as bugzilla:60299. NEverett (WMF) (talk) 19:02, 21 January 2014 (UTC)Reply
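
To make the accent proposal concrete, here is a minimal Python sketch of the matching rule: a query typed with accents only matches the accented form, while a query typed without accents matches both. It is an illustration only. `fold_accents` and `term_matches` are hypothetical names, and the folding here just strips combining marks, which is narrower than the ASCII folding discussed in this thread (it would not turn "ø" into "o", for example).

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks, e.g. turn "è" into "e"."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def term_matches(query, word):
    """Apply the proposed rule to a single query term and a single indexed word."""
    q = unicodedata.normalize("NFC", query).casefold()
    w = unicodedata.normalize("NFC", word).casefold()
    if fold_accents(q) != q:
        # The query was typed with accents: require the accented form.
        return q == w
    # The query was typed without accents: match with or without them.
    return q == fold_accents(w)
```

So a search for "clientele" would still find pages containing "clientèle", but a search for "clientèle" would skip pages that only contain "clientele".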
I think this will get us back on the right track. Thanks. Chris the speller yack 21:38, 21 January 2014 (UTC)Reply
We've deployed this and it looks better to me. If you get a chance to give it a shot please let me know if it is better for you.
NEverett (WMF) (talk) 17:52, 24 January 2014 (UTC)Reply
I tried searching "clientèle" and "clientele" on en.wikipedia, and both returned 2,609 pages. This is what it did before, not what I was hoping for. Chris the speller (talk) 20:47, 25 January 2014 (UTC)Reply
Sorry about that! I was commenting on my phone while at a conference and commented on the wrong thread. I haven't had a chance to start this yet. NEverett (WMF) (talk) 19:25, 27 January 2014 (UTC)Reply
Just an update while I'm here:
I've landed a fix to Lucene and Elasticsearch to support this but it has yet to be released. When it is released we'll upgrade and then we can start using it.... NEverett (WMF) (talk) 19:57, 14 March 2014 (UTC)Reply
I'm also looking for a way to search on ptwiki for "ciencia" without getting results for "ciência", so I can add the accent where users forgot about them. Helder.wiki 14:48, 11 March 2014 (UTC)
If you turn on the "New Search" BetaFeature then you can search for "ciencia" and won't get ciência. I'm unsure of exactly why the quotes are required. They switch to a "plain" analyzer, but it looks to me like pt shouldn't get the ascii folding even in its aggressive analyzer. I'll look into it. NEverett (WMF) (talk) 19:56, 14 March 2014 (UTC)

prefix: search for localized namespaces broken


The following queries stopped working with the new search. I guess it has something to do with the localized namespaces.

On the second, no, it's for the space (confirmed it works with lucene): 33 results, 0 results. Nemo 18:51, 22 January 2014 (UTC)Reply
Oh, thanks for the hint. Unfortunately this does make things more confusing. ;-)
  1. TMg prefix:Benutzer_Diskussion:TMg is valid in both the old and the new search → OK.
  2. prefix:Benutzer_Diskussion:TMg TMg with the two terms switched is not valid in both. According to Help:CirrusSearch this is by design → OK.
  3. TMg prefix:Benutzer Diskussion:TMg was valid in the old search. According to Help:CirrusSearch it should be the same in the new search → bug.
  4. TMg prefix:"Benutzer Diskussion:TMg" should be possible if the space does have a different meaning now → missing feature.
  5. TMg prefix:Benutzer does not work but TMg prefix:Benutzer: does → missing feature. TMg 22:44, 22 January 2014 (UTC)Reply
Logged as bugzilla:60489. I feel bad about travelling last week. It is just when folks are really starting to leave lots of feedback. NEverett (WMF) (talk) 19:40, 27 January 2014 (UTC)Reply
I believe I have fixes for 3 and 4 (proposed, not merged or deployed). I don't _think_ Cirrus' behavior differs from LuceneSearch's for 5, so I'm reticent to change it. What were you looking for TMg prefix:Benutzer to do? Right now I believe what it does is search for articles in the content namespaces that contain tmg and whose titles start with benutzer. Nothing to do with namespaces without the :. NEverett (WMF) (talk) 16:39, 4 February 2014 (UTC)
Probably he means that the latter is enough to trigger search in an otherwise non-searched namespace, while the other doesn't. Someone proposed that if I search "help references" the search should select the "help" namespace too (was this on wikitech-l? can't remember if it's filed), but I'm not sure making prefix too smart is a good idea. For instance, how would I search subpages of "Benutzer" in main namespace, one could ask, adding to complications. Nemo 21:44, 4 February 2014 (UTC)Reply
Yeah, doing too much "smart" stuff can get a bit scary. I was always in favor of using a suggest-like mechanism to ask folks searching for tables help if they actually mean to search for help with tables. Something the user will notice but can ignore if they don't mean it. "All we'd have to do" is detect the help namespace (in the language of the user or the wiki or both?) and then make a suggest like link that hits the help namespace with their search without the word "help" in it.
For now, I'm closing the bug. If I'm wrong in doing so either reopen it or file another. Or reply here. Either is fine by me. NEverett (WMF) (talk) 19:51, 7 February 2014 (UTC)Reply
You are right. Seems I got a bit confused. Searching for something like prefix:talk can mean two things:
  1. Search for articles with that prefix.
  2. Search in that namespace.
It would be possible to check if the prefix:... value the user entered is equal to a known namespace and if that is the case do both. This leads to the next problem: How to rank this? It's basically a mixture of two different queries. Probably more confusing than helpful. Shouldn't be done. TMg 16:03, 26 February 2014 (UTC)Reply

Cirrus returns pages where the text is only found within templates on the page


On en.wikipedia, a search for "one way ticket" dredges up "The Fur Collar", which does not contain that text, but a template on that page, "Lawrence Huntington", does contain that text. Lucene does not return that page. I doubt that this feature will be universally welcomed. Chris the speller (talk) 21:38, 25 January 2014 (UTC)Reply

It's actually the most popularly-acclaimed feature of Cirrus. :) If you have examples/usecases of stuff that shouldn't be included, they would be useful to identify how to satisfy them. Originally, I proposed to add a feature for some templates to be hidden from indexing, a bit like the "Hide from print" category, but in months of testing nobody complained about this yet! So it's not clear to me if and what is needed. Nemo 11:14, 26 January 2014 (UTC)Reply
Search en.wikipedia for ~"finacial". Lucene returns a handful of the misspellings, while Cirrus returns more than 90, most of them from a template (Canadian banks) where I fixed it. I don't want to see those pages pop up for the next few months or few years. I think we need a switch. Chris the speller (talk) 19:35, 26 January 2014 (UTC)Reply
So your problem is not with template expansion itself, but with stale content from templates. It's fair to ask what are the plans for re-indexing/re-parsing pages after one of their templates is edited: it's not a trivial problem. Nemo 21:22, 26 January 2014 (UTC)Reply
Yes, the stale content is my immediate problem, but who wants to see hundreds or thousands of pages that have identical text? I may be the first to complain, but you should expect many more. The other problem with this is that the template's content may be hidden. The person who is searching is liable to be confused and frustrated; even if they pick one, and even if they know that the text is in a hidden template (and they won't, not the first time), they won't know on which template they need to click "Show". Chris the speller (talk) 00:10, 27 January 2014 (UTC)Reply
On stale stuff: We're working on it. The goal is to keep up with the refreshLinks job and we're mostly there. The refreshLinks job itself is slow so I'm rattling the appropriate chains to get it faster. I don't have a bug for tracking this at the moment.
On hidden templates/elements: bugzilla:60484
Identical text: For the most part I've heard most folks want to search on the expanded contents of most templates. I'm coming to the conclusion that expanding some templates is more trouble than it is worth, commons:Template:Cc-by-sa-all for example. I wonder if we can find the space to index the unexpanded text as well... Right now I think that'd consume more space than we have, but hard drives we can get. Especially if it is useful for finding errors in articles. I've filed it as bugzilla:60487. NEverett (WMF) (talk) 19:08, 27 January 2014 (UTC)

Plurals are not found when singular form is searched


On en.wikipedia, a CirrusSearch for "one way ticket" does not find plurals (it should find the page "Flying imams incident", which contains "one-way tickets"). OK, so "one way ticket*" should find all endings, but no. Searching for "one way tickets" finds it. Compare to Lucene, which finds "ticket" and "tickets" when searching for "ticket". Chris the speller (talk) 23:05, 25 January 2014 (UTC)Reply

Quotes turn off all stemming in Cirrus. You can turn it back on by searching "one way ticket"~. That syntax isn't discoverable at all but it is there. I'm certainly open to change here but I think everything but the ~ turning stemming back on is pretty google-like. I'd love to have a more discoverable solution though. NEverett (WMF) (talk) 19:23, 27 January 2014 (UTC)Reply
Thanks. That helps. Chris the speller (talk) 03:49, 28 January 2014 (UTC)Reply

Appears to give totally incorrect results for "the the "


Hi, I have tried the new search on one of my usual searches and get totally the wrong results.

The search is '"the the " Yorkshire', which should give me all of the pages with a doubled word "the" and Yorkshire in the text. The old search gives 61 results, most of which are false reports; the new search gives 15,404 results. It looks as though it is not breaking correctly on a word boundary, as it should, especially when a trailing space follows the second "the" in the search to indicate a word boundary. In the old search, though, the trailing blank is not necessary in this case, as it gives the same results with and without the space. Keith D (talk) 23:17, 4 February 2014 (UTC)

Does this search "the the"~0 yorkshire yield better results for you? Cirrus defaults to a phrase slop of 1 mostly because when I made the decision we weren't doing the right thing with stopwords. We're doing better (not right, yet, unfortunately) with them now so it'd make sense to drop the default slop to 0 which would make the ~0 search the default. NEverett (WMF) (talk) 19:44, 7 February 2014 (UTC)Reply
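
For anyone unfamiliar with phrase slop, the following Python sketch shows why a slop of 1 lets "the the" match far more text than a slop of 0. It is a deliberate simplification and not how Lucene or Elasticsearch implement it: terms must appear in order, and slop is counted as the total number of extra positions allowed between them. `phrase_matches` is a hypothetical name.

```python
def phrase_matches(tokens, phrase, slop=0):
    """Return True if `phrase` occurs in `tokens`, in order, within `slop`.

    `slop` is the total number of extra token positions allowed between
    consecutive phrase terms (0 means the terms must be adjacent).
    """
    if not phrase:
        return False

    def extend(pos, idx, budget):
        # `pos` is where the previous phrase term matched,
        # `idx` is the next phrase term to place.
        if idx == len(phrase):
            return True
        for gap in range(budget + 1):
            nxt = pos + 1 + gap
            if nxt < len(tokens) and tokens[nxt] == phrase[idx]:
                if extend(nxt, idx + 1, budget - gap):
                    return True
        return False

    return any(
        tokens[i] == phrase[0] and extend(i, 1, slop)
        for i in range(len(tokens))
    )
```

With slop 0 only a genuinely doubled "the the" matches; with slop 1 any "the X the" matches as well, which would be consistent with the much larger result count reported above.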

I tried searching for "Template:non-existent template", but I didn't get a link for creating the page. Skalman (talk) 09:10, 19 February 2014 (UTC)Reply

Agreed. This has caused me to turn the feature off. Is there some other way of creating a page that I'm missing? The documentation seems to hide it so that they aren't flooded with bad articles. Exercisephys (talk) 20:36, 21 February 2014 (UTC)Reply
No, it's just disabled when keywords contain (apparently) hyphens or colons, to avoid creating pages like writer prefix:Federico. It needs to be tweaked a bit. Nemo 08:30, 24 February 2014 (UTC)Reply
Filed as 62055 and set to "high" priority so we'll pick it up in the next few days. The idea was to stop it asking you if you want to create a page containing search syntax. It'd be silly to make a page called "intitle:yummy incategory:cake chocolate" but it certainly isn't silly to make a page with a dash in the title. NEverett (WMF) (talk) 15:07, 28 February 2014 (UTC)Reply

Problem with %27


When I search for Burger%27s_Daughter in the old search, it gives me a page of results, with the page I'm looking for, Burger's Daughter, as the third result. When I search for the same thing in the new search, it gives me no results. I am not sure if it is tripping up on the %27 part, or if it's something else, but being able to search for things without converting %27 to ' each time is useful. Sven Manguard (talk) 03:20, 27 February 2014 (UTC)

Filed as bugzilla:62059. I believe it is tripping over both the _ and the %27. NEverett (WMF) (talk) 15:38, 28 February 2014 (UTC)Reply
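
As a stopgap for pasted URL fragments like the one above, the query can be cleaned up before searching. A minimal Python sketch, assuming the input really is a percent-encoded page title (a query containing a literal "%" would be mangled); `normalize_title_query` is a hypothetical helper, not part of either search backend.

```python
from urllib.parse import unquote


def normalize_title_query(raw):
    """Decode percent-escapes and turn underscores into spaces."""
    decoded = unquote(raw).replace("_", " ")
    # Collapse any runs of whitespace left over from the replacement.
    return " ".join(decoded.split())
```

For example, `normalize_title_query("Burger%27s_Daughter")` gives `Burger's Daughter`.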

I randomly stumbled upon a major problem with searches like that in the German Wikipedia. The old search found 3 pages, the new CirrusSearch finds 2 pages only. The missing page is de:Portal Diskussion:Nationalsozialismus/Archiv/2008 where my search term is part of a link. It seems the page is not indexed because the word is part of a link. Which would be a bug. I can confirm that behavior with other similar searches. TMg 19:49, 27 February 2014 (UTC)Reply

I have no clue what is going on here but will have a further look soon. bugzilla:62058
To be clear, I thought we did index words inside links. I know we have some trouble with :s when performing exact matches (we don't think of them as word breaks). But I don't know if any of those are what caused this. I've linked other bugs in the bugzilla link above if you enjoy reading bugs. NEverett (WMF) (talk) 15:29, 28 February 2014 (UTC)Reply

Zero Width Joiner and Zero Width Non Joiner

Hi.

I'm curious as to what behaviour search has when an input string has a ZWJ or ZWNJ unicode character. Are results without the ZWJ / ZWNJ searched for? And what if a search doesn't contain ZWJ/ZWNJ but a page with the exact same spelling but including one of these characters in between exists?

As far as I know, search on the WMF cluster as of now doesn't treat words including ZWJ/ZWNJ the same as those not including these. I don't think this behaviour is correct, and the matter probably needs to be investigated since I think some indic language IMEs provide options for the input of these characters (to force the rendering of a particular glyph) and pages with titles containing these characters may be created. Siddhartha Ghai (talk) 13:18, 1 March 2014 (UTC)Reply

So we've been holding off on these kinds of issues until we're able to get the unicode plugin for Elasticsearch deployed on the cluster. The plan is for CirrusSearch to use it (if it is installed on Elasticsearch) to take a first crack at the problem and then go from there. We're willing and able to go beyond that but we'd like to start there. The holdup is just that Elasticsearch plugins are deployed differently than most other things at WMF so we have to work up a special mechanism for them. We're moving along on that project so we should be able to start really improving things "soon".
Still, I'd love some good test cases to make sure that we're going in the right direction. I'd be thrilled if you filed a bug with some examples of things that don't work but should. NEverett (WMF) (talk) 17:50, 4 March 2014 (UTC)Reply
TL;DR version:
I would file a bug except that I'm not sure what the behaviour should be. I think the issue needs some discussion before an actual bug is filed, since as I see the issue, it is complicated, and there are several potential methods to resolve it.
Full comment:
My interest in these chars is in indic languages, specifically hindi.
Per the Unicode Indic joining behaviour model, there are 4 different ways in which ZWJ/ZWNJ can be used, with the resulting renderings differing.
An example case is the following four pages (the page content has the unicode sequence used):
(Note: The last two were created today and may not show up in search till tomorrow)
It should be noted that the rendering would differ depending on what glyphs the actual font has. So, a font designed for, say, Sanskrit may have a full conjunct glyph, whereas one for hindi may not (since sanskrit used many more conjunct forms than hindi IIRC). As for the current situation, the proprietary Mangal font that ships with Windows by default shows the above four in the same way, in the fully expanded form with explicit viram, since it doesn't contain any glyph. However, changing the font family to Lohit (the font used for hindi in ULS), the rendering for the first page differs from the other three, the first showing a conjunct glyph with the others still showing the fully expanded form. There may be cases where all four renderings differ, but I'm not aware if the behaviour model is implemented by any fonts yet or not.
Now, as far as language is concerned, the subpagename in all four is essentially the same word. The fact that the glyph may be rendered differently doesn't change how it's read (pronounced), or what it means.
So what we have effectively is four different ways to write the same word, possibly with four different renderings or one rendering depending on the font the user has.
This means that as of now, depending on the IME a particular user is using, he/she may not find in search what they were looking for and end up creating duplicate pages on the same topic. And the two titles may be rendered exactly the same for another user. Needless to say, this will leave the average user perplexed.
(Note: IIRC, I have come across one such case where a dupe was created by a newbie when he couldn't find the article that he created)
I find this to be complicated, similar to the unicode normalization issue, with various possible solutions.
Solution 1
Strip all ZWJ/ZWNJ from all text and pagenames and search queries
Pros:
  • No chances of page duplication
  • No search issues
Cons:
  • No ability to force particular glyphs
  • Probably problematic for sanskrit wikisource (where ZWJ/ZWNJ may be really needed)
Solution 2
Strip all ZWJ/ZWNJ from pagenames and search queries
Pros:
  • No chances of page duplication
  • No search issues
Cons:
  • No ability to force particular glyphs
Solution 3
Treat all four cases as one for search
Pros:
  • Probably easiest to implement
Cons:
  • Duplicate page creation remains possible
  • Even if the search functionality works, the text find and replace in the editbar, and the inbuilt find/replace feature of browsers may not work correctly. Siddhartha Ghai (talk) 05:10, 7 March 2014 (UTC)Reply
Sorry for the super duper late reply, but, here goes:
I can use case folding to flatten all four of these examples into "the same" word from search's perspective. That is, NFKC with case folding tacked on the end.
Now some choices:
1. Do this on both the analyzers that we use for text or just the less exact one. If I just do the less exact one then the words that match without normalization will bubble above those that match with normalization. And, by default, "quoting" a word will not find it normalized. I'm leaning towards adding the normalization to both analyzers for this reason.
2. Should I add this to all languages, most languages, just languages for which I don't have a good default, or just languages that ask for it? Note that I'm actually waiting on a change upstream to enable me to add things to "all" or "most" languages.
3. Other stuff? NEverett (WMF) (talk) 17:50, 8 May 2014 (UTC)Reply
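
To make the "flatten all four into the same word" idea concrete, here is a small Python sketch of a search key. It is an illustration under assumptions, not the Elasticsearch analyzer: plain NFKC in Python's `unicodedata` leaves ZWJ and ZWNJ in place, so the sketch strips them explicitly before normalizing and case folding. The four sequences are made-up examples built from KA, VIRAMA and SSA, not the four pages mentioned in the thread.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def search_key(text: str) -> str:
    """Strip joiners, then apply NFKC and case folding."""
    stripped = text.replace(ZWNJ, "").replace(ZWJ, "")
    return unicodedata.normalize("NFKC", stripped).casefold()


# Illustrative Devanagari sequences (assumed, not taken from the thread).
KA, VIRAMA, SSA = "\u0915", "\u094d", "\u0937"
variants = [
    KA + VIRAMA + SSA,
    KA + VIRAMA + ZWNJ + SSA,
    KA + VIRAMA + ZWJ + SSA,
    KA + ZWJ + VIRAMA + SSA,
]
keys = {search_key(v) for v in variants}
print(len(keys))
```

Because the key is only used for matching, the stored page text and titles keep their joiners, which corresponds to "Solution 3" above and leaves the duplicate-title question open.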
Sorry for the super duper late reply (went on a wikibreak):
I don't think applying case folding to search queries will have a major effect on projects in languages that don't have case. AFAIK, none of the indic family scripts have case. Do note though that just because the project is in an indic language doesn't necessarily mean that there won't be any content in other case-sensitive languages. There can always be discussions, Help pages and Mediawiki: namespace stuff in english. So searches related to such stuff will be affected.
The decision about whether or not to apply case folding by default could be made on the basis of how much content on a particular project seems to be in a case-sensitive language. Finding this out will, of course, require some database queries to analyze how much content is in which script on the project.
So:
  1. I also think applying it to both analyzers would be better
  2. The change should be applied on a case-by-case basis to language projects that ask for it (Although if the change is found useful on a few language projects of the indic script family, I think it can be extended to all indic scripts).
  3. Other stuff: This resolves the search part, but not the title part. Ideally, it shouldn't be possible to create four different pages for the same title, and, if needed, the glyph to be used in the title should be controlled by a magic word or something. Not sure where to raise this point for a proper discussion. Ideas? Siddhartha Ghai (talk) 15:25, 24 July 2014 (UTC)Reply

Cannot create article with hyphen

Searching for Swap-Option on dewiki yields some fulltext search results. However, there is no link (as with the old search) to create a new article under this name. Pajz (talk) 23:58, 11 March 2014 (UTC)Reply

This one is fixed but not yet deployed: https://bugzilla.wikimedia.org/show_bug.cgi?id=62055 NEverett (WMF) (talk) 19:40, 14 March 2014 (UTC)Reply

CSS/JS not indexed?

This example search returns

  • 46 results with the old search but
  • 5 results with the new.

I guess JS and CSS pages aren't indexed at all. Please re-add them. TMg 13:18, 16 March 2014 (UTC)Reply

Filed at Bugzilla:62733. NEverett (WMF) (talk) 14:03, 17 March 2014 (UTC)Reply

I cannot find text inside of templates.

For example, if I search for the IMDb ID of en:12 Years a Slave (film), "2024544", I only get a result if I use the old search. CennoxX (talk) 14:15, 30 March 2014 (UTC)Reply

'.' considered a word character?

When using the following query on commons:

incategory:California_Historical_Society_Collection,_1860-1960 intitle:restoration.jpg

One result is found, but if "restoration.jpg" is truncated to "restoration" (as would normally be the case when searching for that term) no results are returned. This is highly problematic for title searches on commons, where most page titles include file extensions. Possibly related to this bug? Junkyardsparkle (talk) 01:08, 7 April 2014 (UTC)Reply
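
The behaviour reported here is what one would expect if "." were kept inside tokens. The following Python sketch contrasts the two tokenizations; the regular expressions, function names and sample title are assumptions for illustration and do not reproduce the analyzer actually used on commons.

```python
import re


def tokens_dot_as_word_char(text: str) -> list:
    """Tokenizer that keeps '.' inside a token."""
    return re.findall(r"[\w.]+", text.lower())


def tokens_dot_as_break(text: str) -> list:
    """Tokenizer that splits on '.' like any other punctuation."""
    return re.findall(r"\w+", text.lower())


def intitle(term: str, title: str, tokenizer) -> bool:
    """True if every token of the term is also a token of the title."""
    title_tokens = tokenizer(title)
    return all(tok in title_tokens for tok in tokenizer(term))


title = "Building after restoration.jpg"  # hypothetical file title
print(intitle("restoration", title, tokens_dot_as_word_char))
print(intitle("restoration", title, tokens_dot_as_break))
```

With "." treated as a word character, only the full "restoration.jpg" matches; splitting on "." lets both "restoration" and "restoration.jpg" find the file.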

I believe this will be fixed by the fix for bugzilla:63861. That should hit commons tomorrow and I'll rebuild the index and see if that fixes it. NEverett (WMF) (talk) 21:55, 21 April 2014 (UTC)Reply
Great. In general, the new search works so well that I forget that it's an opt-in beta... in particular, it's nice for creating fairly tight heuristically-defined lists of files for acting on with cat-a-lot on commons. Most of the outliers are conveniently sorted to the end of the listing for easy unselecting. I have bumped my head on (apparently) some query complexity limits while doing crazy things, but otherwise it's a very powerful tool for more than just landing a casual user on the right article page... cheers to everybody working on it. :) Junkyardsparkle (talk) 23:11, 21 April 2014 (UTC)Reply
Thanks! It's just User:Demon and I working on it but we're leaning on Elasticsearch and Lucene which are pretty powerful. Would you mind posting examples of some of the neat queries that work for you? I'll add them to the regression test suite.
I did try to rebuild commons yesterday but bumped up against a timeout error during one of the rebuild steps an hour and a half into the process. I built a fix this morning and I'll try to squeeze it out to production today and try again. NEverett (WMF) (talk) 13:02, 23 April 2014 (UTC)Reply
@Junkyardsparkle: That seems to have done the trick. Give it a shot now. NEverett (WMF) (talk) 19:02, 23 April 2014 (UTC)Reply
Looks good, I'll get back to categorizing now... don't know if my actually used queries would be useful for regression testing, because they tend to obsolete themselves after I act on the results, at least as far as the (possibly negated) "incategory:" terms go, which is what I tend to end up with a lot of... but if I can abstract a good test that reflects my use case, I'll mention it here. Thanks again. Junkyardsparkle (talk) 06:08, 24 April 2014 (UTC)Reply
As of right now, things have regressed to this being a problem again. Example case, this query no longer finds this file. Hope those new servers help enough to put the fix back in place. :) Junkyardsparkle (talk) 07:57, 18 July 2014 (UTC)Reply

Odd results on Location searches.

  • search on denmark goes to DENMARK which is redirected to Denmark
  • search on florida goes to Flórida which is redirected to Florida
  • search on maryland goes to MarylanD which is redirected to Maryland
  • search on oregon goes to Oregón which is redirected to Oregon
  • search on california goes to Califórnia which is redirected to California
  • search on germany goes to GerMany (what?) which is redirected to Germany

What would be the problems with making the first search result on a lowercase search be the version simply initcapped and without the accented characters folded? Naraht (talk) 16:17, 8 April 2014 (UTC)Reply
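
One way to picture the proposal is as a ranking of candidate titles by how closely they match the typed query. This Python sketch is only an illustration of that ordering, with hypothetical function names; it is not how the fix for the bug referenced in this thread was implemented.

```python
import unicodedata


def fold(text: str) -> str:
    """Remove combining accents and fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        c for c in decomposed if not unicodedata.combining(c)
    ).casefold()


def match_rank(query: str, title: str):
    """Lower is better; None means the title does not match at all."""
    if title == query:
        return 0  # exact match
    if title == query[:1].upper() + query[1:]:
        return 1  # query with only the first letter capitalized
    if title.casefold() == query.casefold():
        return 2  # differs only by case
    if fold(title) == fold(query):
        return 3  # differs by case and accents
    return None


def pick_title(query: str, titles: list):
    """Return the best-ranked matching title, or None."""
    ranked = [(match_rank(query, t), t) for t in titles]
    ranked = [(r, t) for r, t in ranked if r is not None]
    return min(ranked)[1] if ranked else None


print(pick_title("florida", ["Flórida", "Florida"]))
```

Under this ordering a lowercase search for "germany" prefers "Germany" over the redirect "GerMany", and an accented redirect is used only when nothing closer exists.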

This looks like bugzilla:63627. A fix was approved and should be live in a week at most; you can also try and test at http://en.wikipedia.beta.wmflabs.org Nemo 09:07, 10 April 2014 (UTC)Reply
The test looks good. The page it said "didn't" exist, at least had the capitalization that I was expecting. So for searching on florida, it said Florida didn't exist, which at least means it is searching for the right thing in the test database. Naraht (talk) 19:43, 10 April 2014 (UTC)Reply

Double quotes no longer result in phrase search.

Not sure when this happened, but just noticing that "find this phrase" now returns results with "find", "this", and "phrase" anywhere instead of the expected behavior. I'm assuming this isn't intentional... Junkyardsparkle (talk) 02:02, 27 April 2014 (UTC)Reply

Doesn't seem to be happening anymore. Ok, it doesn't seem to happen in a consistent way, which is driving me nuts, so I'm switching back to old search for now. Example case: "pacific electric" search on commons is sneaking in results with "pacific gas & electric", etc... probably not a big deal for some purposes, but for rounding up files to batch process, not so great. :/ Junkyardsparkle (talk) 23:58, 3 May 2014 (UTC)Reply
I don't _believe_ I've changed the behavior of cirrus with regards to double quotes on commons. I'm going to be swapping out the component that does the highlighting with another one that is faster. It is currently deployed on test, test2, wikidatatest, and mediawiki.org. It doesn't support limiting the results to matching phrases at the moment but I'm fixing it. What I believe you're seeing is that Cirrus's default phrase slop is 1 rather than 0. In other words, one word in between is OK. I'll switch it to 0 this week. You can actually control the slop by putting ~0 after the phrase. So "pacific electric"~0 will get you what you want. It just isn't intuitive and I'll fix that. NEverett (WMF) (talk) 15:35, 5 May 2014 (UTC)Reply
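
To show why a slop of 1 lets "pacific gas & electric" match the phrase "pacific electric", here is a simplified Python model. It is a sketch under assumptions: terms must appear in order and slop is counted as the total number of extra words between them, which is simpler than Lucene's real slop calculation. The tokenizer and function names are made up for the example.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase word tokens; punctuation such as '&' is dropped."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(text: str, phrase: str, slop: int = 0) -> bool:
    """True if the phrase terms occur in order with at most `slop`
    extra words in between (simplified model of phrase slop)."""
    doc = tokenize(text)
    terms = tokenize(phrase)
    if not terms:
        return False

    def search(term_idx: int, prev_pos: int, used: int) -> bool:
        if term_idx == len(terms):
            return True
        for pos in range(prev_pos + 1, len(doc)):
            gap = pos - prev_pos - 1
            if used + gap > slop:
                break  # later positions only widen the gap
            if doc[pos] == terms[term_idx] and search(
                term_idx + 1, pos, used + gap
            ):
                return True
        return False

    return any(
        search(1, start, 0)
        for start, tok in enumerate(doc)
        if tok == terms[0]
    )


print(phrase_matches("Pacific Gas & Electric building", "pacific electric", slop=1))
print(phrase_matches("Pacific Gas & Electric building", "pacific electric", slop=0))
```

The first call succeeds because "gas" is the single word allowed in between; the second corresponds to "pacific electric"~0 and rejects it.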
Yes, I'm pretty sure now that I just wasn't tripping over the slop feature enough to notice it at first. Thanks very much for the ~0 workaround, that's useful information to have. If it's documented somewhere, I missed it, but if it's made clear in the basic search syntax guide, then the slop may not be such a terrible default setting. I don't really know what the "normal" user expectation about this behavior is. :)
EDIT: I found this, but it states that "closer to 1 is less fuzzy"... which is backwards, isn't it? Ok, it works differently for phrases and single terms. Should have read a little bit more, sorry. I think I'm gonna need a cheat sheet. Junkyardsparkle (talk) 18:07, 5 May 2014 (UTC)Reply
I updated it to make it more clear, I hope. If you have ideas for what cheat sheet should look like please start one and I'll work on it! NEverett (WMF) (talk) 12:41, 6 May 2014 (UTC)Reply
Well, I just meant something like this... of course, having made it, I feel like I won't need it, but I'm always wrong when I assume that. ;) Junkyardsparkle (talk) 23:30, 6 May 2014 (UTC)Reply
If you move it to Help namespace here on mediawiki.org we can later also translate it. Nemo 05:51, 7 May 2014 (UTC)Reply
If you think it's worth translating, you're more than welcome to copy it to any appropriate place, but please look it over first to make sure it actually makes sense to someone other than me... Junkyardsparkle (talk) 05:50, 8 May 2014 (UTC)Reply

search by mimetype/filetype?

I suspect this isn't quite the right place to ask it; feel free to point me at the right component in Bugzilla if not. But here goes: it should be possible to do something equivalent to Google's filetype: operator, at least on Commons. There are times when you want video; times when you want audio; etc., etc. So it'd be really nice to provide that :) LuisVilla (talk) 23:37, 3 May 2014 (UTC)Reply

I've found that using something like "intitle:ogg" works reasonably well, since the number of allowed file types/extensions on commons is fairly limited... Junkyardsparkle (talk) 23:55, 3 May 2014 (UTC)Reply
That's helpful, thanks! I still think a proper/accurate solution would be really helpful if we want people to treat us as a serious source of non-photo materials.
(With the right account this time). LuisVilla (talk) 03:01, 4 May 2014 (UTC)Reply
Luis, this is one of many things depending on bugzilla:17503. Nemo 22:02, 4 May 2014 (UTC)Reply
Great pointer, thanks, Nemo. LuisVilla (talk) 22:18, 4 May 2014 (UTC)Reply

Filters not working

Search filters like prefix: and intitle: are not working from the search box. Spinningspark (talk) 00:21, 7 May 2014 (UTC)Reply

Which wiki? Can you provide an example? NEverett (WMF) (talk) 17:15, 8 May 2014 (UTC)Reply
This is en.wiki with the new search engine checked under my preferences in beta. Here are some results using the filter "prefix:Mechanical"
As you can see, the Cirrus results do not have the search term in the title of many results at all, let alone as a prefix. SpinningSpark 17:53, 8 May 2014 (UTC)Reply
Seems to be following redirects (ie. "Mechanical and Aeronautical Engineering" > "Engineering"). Junkyardsparkle (talk) 18:18, 8 May 2014 (UTC)Reply
Well that's still a problem. First of all, it is not transparent that redirects are being followed, I am being presented with the top result of "Engineering" which patently does not match my search specification and no indication of why it was included. Secondly, while I might possibly (or possibly not) have wanted to know about redirects, I probably don't want them at the top of the list. If redirects are going to be included, the name of the redirect should be in the results, not its target, it should be marked as a redirect, and it would be rather useful if it could be optionally suppressed.
The whole reason a user would use the prefix filter is they want exactly those pages that match. Not pages that are associated in some way. SpinningSpark 21:32, 8 May 2014 (UTC)Reply
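
The behaviour being requested can be sketched as follows: match the prefix against each page's own title, list redirects under their own name, mark them as redirects, and allow them to be suppressed. This is a Python illustration with a hypothetical data layout and sample pages, not the patch that was later submitted.

```python
def prefix_search(pages: dict, prefix: str, include_redirects: bool = False) -> list:
    """pages maps a title to its redirect target, or None for a real page.

    Only the page's own title is tested against the prefix; a redirect
    target never causes a match.
    """
    results = []
    for title, target in sorted(pages.items()):
        if not title.startswith(prefix):
            continue
        if target is None:
            results.append(title)
        elif include_redirects:
            results.append(f"{title} (redirect to {target})")
    return results


# Sample data for illustration only.
pages = {
    "Mechanical engineering": None,
    "Mechanical and Aeronautical Engineering": "Engineering",
    "Engineering": None,
}
print(prefix_search(pages, "Mechanical"))
print(prefix_search(pages, "Mechanical", include_redirects=True))
```

In neither mode does "Engineering" itself appear for prefix:Mechanical, which is the complaint raised in this thread.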
Filed at bugzilla:65232 and I'll look at it now. NEverett (WMF) (talk) 18:19, 12 May 2014 (UTC)Reply
And I've submitted a patch to fix it (https://gerrit.wikimedia.org/r/#/c/132973/). If all goes well we'll merge the patch in a few hours and this'll be fixed on enwiki a week from Thursday. We can get it faster if it is really killing you. NEverett (WMF) (talk) 18:54, 12 May 2014 (UTC)Reply
It's not urgent as far as I'm concerned. While it's still a beta function it can be turned off. You might want to look at bug 65237 which I just raised before you patch anything though. SpinningSpark 19:27, 12 May 2014 (UTC)Reply

wikidata multi-lingual search.

Are we using wikidata translations for cirrussearch, for example for string matching to show wikidata items, or even just to get multi-lingual searching working? Jaredzimmerman (WMF) (talk) 16:49, 8 May 2014 (UTC)Reply

We haven't done any special integration with wikidata beyond being their primary search backend. We don't have a proper strategy for multilingual wikis with Cirrus either. The plan is to get there after we've got cirrus everywhere. At least we'll be better off than we were. I'll admit that it isn't a great argument but it is where we are. NEverett (WMF) (talk) 17:22, 8 May 2014 (UTC)Reply

Search result not including articles created in recent months in zh-yue

Hello. I am a user of Cantonese Wikipedia (zh-yue). I am seeking help because there is a major bug in our search function: only articles created in or before 2012 are included in the search results. Admins in Cantonese Wikipedia said that they don't know how to fix it. Please help us. Thank you very much. (Related discussion in Cantonese) Yaukasin (talk) 02:43, 10 May 2014 (UTC)Reply

Can you check how the "New Search" BetaFeature works for you? If it does a decent job of finding stuff then I can switch the whole wiki over to using it as the default. This (not being updated) is exactly the kind of thing that is difficult to fix in the old search and really simple to work on in the new one. NEverett (WMF) (talk) 18:39, 12 May 2014 (UTC)Reply
Thanks for your advice. I turned on the "New Search" BetaFeature and found that there seems to be no problem searching for recent articles. Take searching for the Han character "辣" (meaning hot & spicy) as an example: in the old search, only 8 results were found, dated from Aug 2011 to Apr 2012. In the new search, 163 results are found and the top 50 results contain quite a few articles created during 2013 and 2014. Yaukasin (talk) 14:43, 15 May 2014 (UTC)Reply
I'll schedule zh-yue to switch to "New Search" as the primary search backend sometime next week then. NEverett (WMF) (talk) 16:48, 16 May 2014 (UTC)Reply
Switched. Let me know (here or in bugzilla) if anyone has any trouble with it. NEverett (WMF) (talk) 14:45, 19 May 2014 (UTC)Reply
Thank you. I have just notified the zh-yue community about this improvement. Yaukasin (talk) 15:57, 19 May 2014 (UTC)Reply

Works for Japanese

I don't see much feedback from Japanese wikis so I'd like to give one. The new search works pretty well for me on Japanese Wikipedia and Wiktionary. I especially like the section title highlighting and the improved word count in each search result. For example, against this query the old search gives a search result with a line saying "6 kb (24 words)" which is unreasonable, while the new search gives "6 kb (1,836 words)" which is reasonable. whym (talk) 08:31, 16 May 2014 (UTC)Reply

Yay! Thanks! The old search used spaces for word count (I believe) but Cirrus delegates to the text analyzer, which has some knowledge about Japanese.
Questions:
  1. There is an Elasticsearch plugin that is supposed to make Japanese analysis better. Would you be willing to try it out if I expose it in jawiki in beta and tell me if it is better/worse/the same?
  2. I'd like to start enabling cirrus as the default on more wikipedias. We're almost everywhere but wikipedias. Anyway, would you be willing to talk about it on jawiki's village pump? I'd love to do it with community support rather than force it on folks. NEverett (WMF) (talk) 16:47, 16 May 2014 (UTC)Reply
+1 on Nik: whym, it would be wonderful if you could help with that. :) Nemo 19:02, 16 May 2014 (UTC)Reply
I deployed the plugin in beta this afternoon and loaded a few pages. You can try it and compare:
http://ja.wikipedia.beta.wmflabs.org/w/index.php?title=%E7%89%B9%E5%88%A5%3A%E6%A4%9C%E7%B4%A2&profile=default&search=%E4%B8%89&fulltext=Search NEverett (WMF) (talk) 21:31, 16 May 2014 (UTC)Reply
NEverett, I'd like to try both. Do you know what exactly is the difference between the kuromoji plugin and the one you currently use? Is the current one inherited from lsearchd? This will also allow the community to know how the difference will be (and maybe how they can help debug).
The version on the beta.wmflabs.org looks not bad, but it is hard to say whether it is "better" unless we test these analyzers against the same document set. In general, I believe the difference of those Japanese analysis engines will be very subtle when looking at the search result quality, as long as they use the same or a similar morphological dictionary. whym (talk) 02:16, 17 May 2014 (UTC)Reply
The Kuromoji plugin looks to be an effort to integrate this which claims support for lemmatization and readings for kanji. I'm playing with the default setup for it and I don't see any kanji normalization, but it does a much better job with word segmentation than the one that is deployed on jawiki now. The one deployed on jawiki now is Lucene's StandardAnalyzer which implements unicode word segmentation. I haven't dug into that deeply enough to explain it, but here are some examples:
日本国 becomes
日本 and 国 in kuromoji
日 and 本 and 国 in standard
にっぽんこく becomes
にっぽん and こい in kuromoji
に and っ and ぽ and ん and こ and い in standard
From that it looks like kuromoji should be better but standard is saved by executing the search for all the characters as a phrase search which makes everything line up _reasonably_ well. It won't perform as well, but that should be ok too.
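
The reason per-character tokens still line up can be shown with a toy model in Python. It assumes, as the examples above suggest, that the standard analyzer emits one token per character for this text and that the query is run as a phrase over those tokens; the function names and sample strings are made up for illustration.

```python
def unigrams(text: str) -> list:
    """One token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def contains_phrase(doc_tokens: list, query_tokens: list) -> bool:
    """True if the query tokens occur consecutively in the document."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


query = unigrams("日本国")
print(query)
print(contains_phrase(unigrams("日本国憲法"), query))
print(contains_phrase(unigrams("国の本日"), query))
```

The second document contains all three characters but not as a consecutive run, so the phrase requirement filters it out even without real word segmentation.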
And it looks like my fancy highlighter chokes on kuromoji, which isn't cool. Look here. There are results without any highlighted anything which isn't good.
With regards to lsearchd: I'm not sure what it uses. It doesn't have the api that lets me see how text is analyzed so I have to guess from reading the code and there is a lot of it. NEverett (WMF) (talk) 17:45, 19 May 2014 (UTC)Reply
Do you want to continue working on kuromoji plugin until it is ready regarding highlighting? Or do you want to officialize the current beta feature as it is? I support your observations in that kuromoji's segmentation is more linguistically meaningful, which could improve search. However, failing to highlight is a major issue, and so far I personally cannot see how much search/snippet would be improved by kuromoji.
Is it easy for you to import all pages from jawiki to the test instance, or create another test instance using the same reduced document set, but processed by StandardAnalyzer? I'd be interested in testing various queries to check differences in search results, looking at what are retrieved and what are not by which. whym (talk) 14:16, 23 May 2014 (UTC)Reply
My, I'm bad at replying to these. Sorry. So, yeah, my plan right now is to go ahead with the standard analyzer and do more work on getting the Japanese one better later. Sorry for the late reply. NEverett (WMF) (talk) 14:57, 29 July 2014 (UTC)Reply

Weight hits that are early in the article more highly than results at the end

I couldn't figure out why my search results had suddenly gone to pot until I figured out that "new search" had been auto-enabled. Sorry, but in my experience it's complete junk. I use search to find documentaries in related fields quite a lot, and suddenly it had seemed as if the search function was returning near-random results. Now, with old search back, a search for say, "Algerian War" and "documentary" gets me what I'm looking for: articles related to those terms. With the "new" function on, such a search is virtually useless. Shawn in Montreal (talk) 16:07, 23 May 2014 (UTC)Reply

Thanks for the report. Can you tell us what wiki you're searching on? Dan Garry, Wikimedia Foundation (talk) 21:17, 23 May 2014 (UTC)Reply
English Wikipedia. I'm surprised no one else has mentioned it. I'm no SEO guy, but it's as if the "new" search has lost the ability to weight results depending on where they occur within an article. Specifically, I'd been surprised to see that if there was a mention of one of my search terms -- say, the word "documentary" -- even in a reference or an external link, as opposed to the actual body text of the article, the new search would return those results near the top. "Old" search seems to me to be much closer to what I get in a Google search, which is to say, the search function would have the intelligence to distinguish between non-trivial and trivial mentions, somehow. Shawn in Montreal (talk) 22:23, 23 May 2014 (UTC)Reply
Part of the issue may be related to the introduction of slop into phrase searches, which confused me initially... see "Double quotes no longer result in phrase search" thread above for gory details. TLDR:
"algerian war"~0 documentary
in the new search will give results fairly similar to
"algerian war" documentary
in the old search, where not using the "~0" gives looser results. Junkyardsparkle (talk) 23:07, 23 May 2014 (UTC)Reply
I don't think it's (just) that. For example, a search for the words Mugabe and documentary in old mode returns the two articles on documentaries about the leader, first and second, flawlessly. But switch to new mode, and the two docs are in first and sixth place -- clearly not as good. I don't know what you folks have done, but it's a net loss not a gain, from what I can see. Shawn in Montreal (talk) 23:37, 23 May 2014 (UTC)Reply
No, I didn't mean to imply that the problem was entirely (or even mostly) that... the new search does seem to be less magical with respect to your examples. The weighting voodoo is beyond my monkey comprehension, I'm just happy that I can create an explicit query when I want to, and there is some nice syntax available now... for instance, for your purposes, this seems to work pretty well:
mugabe documentary boost-templates:"Template:infobox film|300%"
Again, I'm not trying to say you don't have a valid complaint, just presenting what might be a useful workaround (or potentially even an improvement on hoping the search will weight things the way you want). :) Junkyardsparkle (talk) 00:22, 24 May 2014 (UTC)Reply
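
For readers unfamiliar with the syntax, the value after boost-templates: pairs a template name with a percentage. The Python sketch below parses the single-entry form shown above and applies it as a multiplier to a page's score. How CirrusSearch combines the boost with its own scoring is not described in this thread, so the multiplication, the case-insensitive name comparison and the function names are all assumptions.

```python
def parse_boost(value: str):
    """Split 'Template:infobox film|300%' into (name, factor)."""
    name, sep, percent = value.rpartition("|")
    if not sep or not percent.endswith("%"):
        raise ValueError("expected 'Template name|NNN%'")
    return name.strip().casefold(), float(percent[:-1]) / 100.0


def boosted_score(base_score: float, page_templates: list, boost_value: str) -> float:
    """Multiply the score if the page uses the boosted template (assumed rule)."""
    name, factor = parse_boost(boost_value)
    uses_template = any(t.casefold() == name for t in page_templates)
    return base_score * factor if uses_template else base_score


boost = "Template:infobox film|300%"
print(parse_boost(boost))
print(boosted_score(2.0, ["Template:Infobox film"], boost))
```

Pages carrying the film infobox would rise in the ranking while all other pages keep their original score.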
I'm sorry I don't know what that means or what to do with it. But thanks for trying to help.
Anyway, so long as we still have access to the old search function, the one that worked, it's fine. Shawn in Montreal (talk) 01:31, 24 May 2014 (UTC)Reply
It boosts the weighting of results that have the "film" infobox on the page. I don't think they plan to maintain the old search indefinitely, so forgive me if I hijack the thread with some ideas about how to make the new one work better for your purposes.
I'm wondering if it would be possible to implement a weighting method that uses boost-templates under the hood, by mapping certain high-confidence templates to the occurrence of certain associated terms (when not used in a phrase). For instance, if "documentary", "movie", etc implied a boost to the "Infobox film" template. Sorry if this isn't feasible or is already implemented in some way, I'm pretty ignorant of the weighting magic, like I said... Junkyardsparkle (talk) 02:21, 24 May 2014 (UTC)Reply
Oh, I see, you're actually talking about improving how the dingus works?
But what I don't get is why folks messed with it in the first place. It worked just fine. Shawn in Montreal (talk) 02:52, 24 May 2014 (UTC)Reply
I'm talking about that now, but I was also pointing out that you can use the boost-template syntax in your own searches; using the example given should help articles about films bubble up towards the top of the list. From what I understand, the old search was difficult to maintain on the back end, and the new one will be better in that regard. Junkyardsparkle (talk) 03:04, 24 May 2014 (UTC)Reply
I see. I'm sorry but I have no idea how to modify the syntax or anything of that nature. Like most, I guess. I just type words in the window and hit the button. Shawn in Montreal (talk) 03:12, 24 May 2014 (UTC)Reply
I'm confident they'll fix the new search before they remove the old one - but one other thing I realize one can do is use Google Advanced Search to search Wikipedia. Tried it and it works pretty well. Shawn in Montreal (talk) 11:14, 24 May 2014 (UTC)Reply
Yeah, that boost-templates thing is more to test the default template boosting. You see, there is a configuration parameter on wiki that can be set to make everyone's searches silently contain some boost. The idea was to allow community curation of the results. Commons uses it but not that extensively. You'd use boost-templates in your query either when you want to disable the defaults or when you want to test new ones. So it's really a "super expert" kind of thing. In addition to that, it's a convenient hook for my regression tests to check the feature. NEverett (WMF) (talk) 20:50, 4 June 2014 (UTC)Reply
Thanks for coming here to complain about these results. We'll figure some way out to make it at least as good for this class of search.
As to why we're replacing the old search when it is so good at finding results, here is the short list:
  • Old search crashes/runs out of resources from time to time and no one knows how to fix it. It's a pretty large code base based on really old libraries. New search is based off of relatively standard services under active development.
  • Old search updates every few days and often misses things. New one updates pretty near real time. Page edits are usually in the index in under a minute. Template edits can take longer to be reflected in the pages that contain those templates.
  • Old search doesn't do anything with templates. New search fully resolves templates. It's *righter* but it's more trouble.
The truth is that the replacement project was driven internally by ops folks raising a ruckus because the old one had no maintainer and wasn't super stable. There is also a significant backlog of bugs and feature requests for search that we've had to ignore because the old one was so hard to work on. So that's how you get where we are.
As far as why the new search doesn't spit out results exactly like the old one, one of the reasons is that the old one is super customized for English Wikipedia. It's difficult to navigate and many of the customizations were speculative: they didn't really provide better results, they just were there. So we implemented the ones that were obviously better and deployed the new search as a BetaFeature so folks could try it. When we tried it we found the results were usually similar but not better or worse. You've hit on one of the customizations that we didn't reimplement: the old search weights hits that are early in the article more highly than results at the end. We didn't do this because our tests didn't show it made much difference. But for your searches it makes a pretty huge difference.
Long story short, we'll implement that.
Also, if you are curious about how scoring works you can read the first half of this presentation. The other half won't be all that interesting. NEverett (WMF) (talk) 17:53, 2 June 2014 (UTC)Reply
Thank you very much. Frankly, I didn't think people would much care what I had to say. Shawn in Montreal (talk) 01:00, 3 June 2014 (UTC)Reply
Interesting, I wouldn't have guessed that was the optimization involved, but now that you mention it, that weighting does make a huge amount of sense in the context of wikipedia articles, being summarized in the lead section... Junkyardsparkle (talk) 21:58, 3 June 2014 (UTC)Reply
I'm surprised it was not judged to be worth retaining, initially. Google has made great strides in making its search more intelligent, in distinguishing between relevant and trivial mentions of search terms.
In Wikipedia, we have guidelines that explain the importance of summarizing key concepts in the article lead. To design a search engine to intentionally disregard that very structure is puzzling to me. Shawn in Montreal (talk) 02:31, 4 June 2014 (UTC)Reply
They did not "intentionally disregard" the feature, they just have not spent time developing it from scratch; but you were told they will now. Also consider that this weighting is not a search backend standard, and it's not even valid for many MediaWikis including Wikimedia wikis (specifically, in order of traffic: Commons, Wiktionary, Wikiquote, Wikisource and Wikibooks).
This system of prioritisation makes sense to me: it would have been worse if they had tried to reimplement every single feature and customisation of the old custom search moloch, even unrequested. We would have wasted lots of developer time and ended up with another unmaintainable system which would receive no love for the next 5 years. Nemo 07:21, 4 June 2014 (UTC)Reply
Thanks for the defence, but Shawn's right; it's a relatively obvious optimization. It's something that's "been on the list" for a long time but it kept getting lower and lower as we'd been in beta and no one complained about quality in a way that this would have caught. I frankly forgot about it.
As far as "intentionally disregard" goes, if anyone did any disregarding, it was me. I'd prefer to characterize what I did in this case as getting so snowblinded by all the (probably) speculative features to improve search quality that I didn't give this one as much weight as it deserves. But there isn't a clear line between that and intentionally disregarding it. It did, after all, make it onto my list, just too low.
I will admit to getting mired in a pet issue of mine, highlighting. The highlighter wasn't going to support it so I spent quite a bit of time on it. In fact, the highlighter used on enwiki and commons right now does prefer snippets from the beginning of the article. But I got distracted by the snippet issue and didn't cover the scoring issue.
Anyway, I'm going to go fiddle with positional boosts now. Depending on how that goes you'll get a solution soon. NEverett (WMF) (talk) 20:02, 4 June 2014 (UTC)Reply
I certainly didn't mean to offend anyone, sorry. I think it's kinda neat that this one lone comment from me has been helpful to the cause, and thanks. Shawn in Montreal (talk) 20:21, 4 June 2014 (UTC)Reply
Complaining is how I know more has to be done!
I implemented weighting terms early in the article more highly than later ones (locally, not deployed) but I'm not convinced that'd be enough for your case. Mugabe's Zimbabwe doesn't have the word "documentary" in the opening. It calls it a "factual film". I'm sure there is a distinction but I imagine it's small enough that people still think of it as a documentary. I mean, it is in the "Documentary films about politicians" category. I think I'll add a search in the category with a decent weight as well. That seems like it'd help. NEverett (WMF) (talk) 11:41, 5 June 2014 (UTC)Reply
Oh yes, the "factual film" thing is a real outlier. Don't worry about that. But yes if you could weight the categories a bit more, then that might indeed help search results. Good idea. Shawn in Montreal (talk) 13:54, 5 June 2014 (UTC)Reply
Both of those changes are ready for review. I imagine the category thing will catch the factual film outlier. My best guess is we'll deploy them to the test wikis next Thursday and to wikipedias the Thursday after that. Both changes, though, will require some time to take effect because the index will have to be rebuilt. That'll take a few days. That is one of the problems with Cirrus: the old search could rebuild the entire index more quickly because it didn't bother with stuff like templates. We can't. We react more quickly because we're able to hook more tightly into the infrastructure and we can throw more CPU at the problem. But when you have to change the index it takes some time. OTOH it's like 100 times easier to debug than the old stuff, so tradeoffs..... NEverett (WMF) (talk) 14:45, 5 June 2014 (UTC)Reply
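To make the two changes discussed in this thread concrete, here is a toy sketch of the idea: matches near the start of the text count for more, and matches in category names add a separate weighted score. This is not how Cirrus implements it (that happens inside Elasticsearch). The function names, the 50-word opening window and the weights 2.0 and 1.5 are all made-up assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score(query, body, categories,
          opening_words=50, opening_boost=2.0, category_weight=1.5):
    """Toy relevance score; all weights are arbitrary assumptions."""
    terms = set(tokenize(query))
    total = 0.0
    # A hit inside the opening window counts more than a later one.
    for position, word in enumerate(tokenize(body)):
        if word in terms:
            total += opening_boost if position < opening_words else 1.0
    # Each query term found in a category name adds a weighted bonus.
    for category in categories:
        total += category_weight * len(terms & set(tokenize(category)))
    return total


body = "Mugabe's Zimbabwe is a factual film about the president."
cats = ["Documentary films about politicians"]
without_categories = score("documentary", body, [])
with_categories = score("documentary", body, cats)
```

In this toy setup the "factual film" article scores nothing for "documentary" on body text alone and only becomes findable through its category, which is the outlier case described above.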

New "insource:" syntax, etc.

I'm happy about the new "insource:" syntax, because every once in a while I find myself wishing for just that kind of low-level inspection (finding certain kinds of malformed information on commons file pages, for instance). Can I assume that the regex flavor is implemented in a way that's smart enough to only run it on files that would be the results returned by the rest of the query? I don't want to hammer the servers playing with it, but it did seem to be quite fast in my first use, which had a "prefix:" term that narrowed the field to ~600 hits by itself. Is this a reasonable usage?

Also, does the non-regex version basically just ignore non-word characters much like the other search functionality? It seemed so from a few quick tests.

And, just to completely overload this post, I'm also wondering about what the "first paragraph" weighting would apply to in the context of a typical commons File: page... basically only up to the first heading? This could be significant in terms of best practices for adding information to the pages, I think. Current upload methods tend to slap a ==Summary== header at the very top of the page, while many older uploads are lacking this. Could this cause some wonky weighting of older vs. newer uploads? Junkyardsparkle (talk) 02:03, 29 June 2014 (UTC)Reply

You can't hammer the servers too much, pool counter is pretty small for regex searches ;-) I'll leave it to Nik to say how late/early in result processing it handles regular expressions.
Yes, most punctuation and so forth is ignored for non-regex searches. insource: searches a different field `source_text` instead of `text`, the latter of which is configured with all kinds of language-specific bells and whistles to make it better at finding content for the majority of readers.
The first paragraph weighting isn't as nice on the PHP side as I'd like. It uses the pretty naïve approach you outline there, where it just uses stuff before the first heading which isn't necessarily the best. ^demon[omg plz] 21:16, 30 June 2014 (UTC)Reply
Sorry it took me so long to get to this.... Vacation and performance work have been squeezing me dry.
As far as order of operations - Elasticsearch _should_ do the right thing and execute the expensive filter last. On Thursday (I think) we're pushing a change to Cirrus that gives Elasticsearch a big hint that the regexes need to come last. If it isn't fast now it should be then.
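The ordering described here can be illustrated with a small sketch: apply the cheap prefix filter first so the expensive regex only runs on the pages that survive it. This is only an illustration in plain Python, not Elasticsearch's or Cirrus's actual execution. `search_source`, the page titles and the pattern are hypothetical.

```python
import re


def search_source(pages, prefix, pattern):
    """Return matching titles and how many times the regex was run.

    `pages` maps a title to its wikitext source. The cheap prefix
    check runs first; the regex only sees pages that pass it.
    """
    regex = re.compile(pattern)
    regex_runs = 0
    hits = []
    for title, source in pages.items():
        if not title.startswith(prefix):
            continue  # cheap filter: skip before touching the regex
        regex_runs += 1
        if regex.search(source):
            hits.append(title)
    return hits, regex_runs


pages = {
    "File:Kil 01.jpg": "{{Information|date=2014-13-40}}",
    "File:Kil 02.jpg": "{{Information|date=2014-06-29}}",
    "File:Other.jpg": "{{Information|date=2014-99-99}}",
}
# Look for dates whose month is greater than 12, within one prefix only.
hits, regex_runs = search_source(pages, "File:Kil", r"date=\d{4}-(1[3-9]|[2-9]\d)-")
```

With a prefix that narrows the field to a few hundred pages, the regex cost scales with those pages rather than with the whole wiki, which matches the usage described at the top of this section.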
The non-regex flavor of insource uses the standard analyzer used for the rest of the text. So it's exactly how intitle works except against source. It's not perfect but it's at least somewhat intuitive.
The first paragraph weighting thing is more something we should change to work around on-wiki habits rather than the other way around. I built it so you could plug multiple implementations into it but only implemented the naive "until the first heading" approach. It'd be simple enough to modify it or create a new one that skips the first heading if it is the very first thing. NEverett (WMF) (talk) 14:56, 29 July 2014 (UTC)Reply
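A rough sketch of the two extraction strategies mentioned above, written in Python for illustration only (the real code is on the PHP side and is not shown in this thread). The function names and the simplified heading regex are assumptions.

```python
import re

# Simplified wikitext heading: ==Title== through ======Title======.
HEADING = re.compile(r"^(={1,6})[^=\n].*?\1[ \t]*$", re.MULTILINE)


def opening_naive(wikitext):
    """Everything before the first heading (the naive approach)."""
    match = HEADING.search(wikitext)
    return wikitext[:match.start()].strip() if match else wikitext.strip()


def opening_skip_leading_heading(wikitext):
    """Like the naive approach, but ignore a heading that is the
    very first thing on the page."""
    stripped = wikitext.lstrip()
    match = HEADING.match(stripped)
    if match:
        stripped = stripped[match.end():]
    return opening_naive(stripped)


newer_upload = "==Summary==\nA photo of Kil.\n==Licensing==\nCC0"
older_upload = "A photo of Kil.\n==Licensing==\nCC0"
```

With the naive version the newer upload has an empty opening while the older one does not, which is the wonky weighting asked about earlier; the second version treats both pages the same.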