Talk:Requests for comment/CirrusSearch

Questions, comments

  • Most stuff we'll test in labs. Will the labs instance(s) show search results for live Wikipedia(s)? I hope so, otherwise people won't be motivated to test. If so, you could even change the messages on the live wiki's Special:Search page to add "Repeat this search against our new search engine in testing". (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
  • How will the scoring work? Will it replicate the current Lucene scoring (which considers the number of incoming links and the like) or not, and will it be easier to adjust gradually as needed in the future (as opposed to the current monolithic scoring that nobody understands and that is impossible to tailor)? --Nemo 07:05, 18 June 2013 (UTC)
    • We aren't going to replicate the current scoring. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
      • Thanks. Is scoring another thing you'd rather work on upstream, like the tokenizers mentioned below? --Nemo 16:31, 24 June 2013 (UTC)
        • That depends on the scoring problem and what we determine is the most appropriate way to solve it. Some scoring issues will be resolved by patching upstream (Solr) and submitting those patches (problems with tokenizers and analyzers), some will be resolved by modifying CirrusSearch (weighting issues, sending more data to the index?), and yet others might require some tweaks to MediaWiki itself (template expansion?). I'm sorry I can't be too specific but I think we'll discover more as the rubber meets the road. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
    • We haven't yet decided whether to take into account the number of incoming links. The results seem pretty good without it. I think it's worth deploying to a subset of wikis without it, and adding it if we feel that search results are worse and that this is the cause. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
  • Will it index the pages with all templates expanded? This is particularly important for Wiktionary, but also Wikisource and Wikipedia. --Nemo 07:05, 18 June 2013 (UTC)
    • The plan is to expand all templates. One question that has come up is, should we not expand some of the templates? NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
      • Thanks. In general, I'd say no, but it depends on how smart the scoring is: you wouldn't want thousands of articles containing a word or name in an infobox to come before articles with actual mentions of that word when searching for it. If the scoring is not smart enough, it's possible wikis would like to exclude some templates (say, navigational templates) with a tag similar to noinclude or Category:Exclude in print. --Nemo 16:31, 24 June 2013 (UTC)
        • What'll happen is that phrases in super common infoboxes will become less important with regards to scoring. Searching for <citation publication> will sort things about publications but not citations higher than things about citations but not publications. It'll still spit out things about citation publications above all of those, which should be fine. Really, the problem with expanding all templates while indexing is that it is slow during batch indexing, which is what we're actively working on right now. NEverett (WMF) (talk) 14:45, 25 June 2013 (UTC)
        • If we do decide that we want to not expand (or remove entirely) some of the templates then we can always make that change later and reindex everything. It'll take time but we're making sure that reindexing is something we can do if we need it. NEverett (WMF) (talk) 14:45, 25 June 2013 (UTC)
  • I absolutely love the faceted search feature of Solr. Are there any plans to use this one? I think this might have to be reflected in building the schema. --Mglaser (talk) 10:02, 18 June 2013 (UTC)
    • One of our requirements is the ability to change the schema without too much pain. So no, we don't plan on using faceting just yet, but yes, we'll gladly do something with it when we know what would be useful (a rough facet-by-category query sketch follows this list). NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
  • A search for categories would be so great! In BlueSpice, when indexing the articles, we also store their categories. On the scale of our wikis, this is very performant. Would you think this might also be an idea at this large scale? --Mglaser (talk) 10:02, 18 June 2013 (UTC)
  • In order to measure the quality of full-text search, would it be helpful to compare (in an automated way) with the search results given by major web search engines when restricted to a Wikimedia site, such as "site:lang.wikipedia.org" on Google? Of course we should not aim to replicate them, but the comparison could give a hint when we are doing something wrong (such as weird tokenization). --Whym (talk) 14:43, 22 June 2013 (UTC)
    • I think something like this would probably take a while to implement and give too many false positives, so I don't plan on it. I'd prefer to deal with individual bugs and build a regression suite around that. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
  • I don't mean to sound like a broken record, but I'm not sold on the comparison of Solr vs. ElasticSearch. The current Solr installation is irrelevant as it's small, built for a different purpose, badly designed, does not use Solr 4.x/SolrCloud, and it really just needs to go (unless you really want to compare Solr 3 with ElasticSearch :). I can't argue with the "we have more experience with Solr" argument, but I'd really prefer a comparison on their technical merits, if only to learn more about their differences. http://solr-vs-elasticsearch.com/ seems like a good resource and it seems to suggest ElasticSearch for "large installations" (note that the comparison welcomes input; its HTML is on GitHub and contributions are accepted). Faidon Liambotis (WMF) (talk) 23:56, 22 June 2013 (UTC)
  • Will it be possible to use CirrusSearch on a non-Wikimedia installation of MediaWiki? -- Tim Starling (talk) 06:08, 22 July 2013 (UTC)
    • That's the plan. We've written the extension to be very generic, so as long as you're able to set up Solr or Elastic you'd be able to use it. Of course this doesn't help people on shared hosting, but they've never been able to use anything except database-backed searching. ^demon[omg plz] 06:17, 22 July 2013 (UTC)
  • How will updates work exactly? How much latency, if any, will be added to page saves? Will the parser cache be used? Will the API be used? -- Tim Starling (talk) 06:13, 22 July 2013 (UTC)
    • We're using the SearchUpdate deferred update. There shouldn't be much latency noticeable to users as it's deferred until after page output, but I haven't done any profiling yet. And yes, the pcache is being used. ^demon[omg plz] 06:17, 22 July 2013 (UTC)
      • With some configurations deferred updates do delay the serving of the page. I don't know if it is inherent when using nginx or if it is a configuration issue. I haven't seen that explained anywhere. --Nikerabbit (talk) 17:43, 22 July 2013 (UTC)
  • I do not see any changes in the section about languages after changing the preference from Solr to Elasticsearch. How well does Elasticsearch support languages other than English, compared to Solr? siebrand (talk) 15:48, 22 July 2013 (UTC)
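
The sketch below is a minimal illustration of the kind of facet-by-category query discussed above, written against Solr's stock facet API. The core name ("wiki") and the "text"/"category" fields are hypothetical placeholders, not CirrusSearch's actual schema.

```python
import requests

# Hypothetical Solr core and fields, for illustration only -- not the
# CirrusSearch schema. This only demonstrates Solr's standard facet parameters.
SOLR_SELECT = "http://localhost:8983/solr/wiki/select"

params = {
    "q": "text:citation",       # full-text query
    "fq": "category:Physics",   # filter query: restrict hits to one category
    "facet": "true",
    "facet.field": "category",  # ask Solr to count matching docs per category
    "facet.limit": 10,
    "rows": 5,
    "wt": "json",
}

resp = requests.get(SOLR_SELECT, params=params).json()

for doc in resp["response"]["docs"]:
    print(doc.get("title"))

# facet_fields comes back as a flat [value, count, value, count, ...] list.
print(resp["facet_counts"]["facet_fields"]["category"])
```

The same facet counts could drive both a category filter and a faceted sidebar in the search UI, which is why the two questions above are closely related.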

Features lost after migrating to CS

One point that is lost on wikis migrating to this new search engine: there are lots and lots of navboxes proliferating across pages (people try to fight them, but the navboxes are often stronger), and there are other kinds of templates that also add standard text to many pages. As a result, many articles are connected to a term only because some template mentions it, in that article and in hundreds of others, and the WhatLinksHere (WLH) mechanism can't filter them out. Since the old search engine ignores template text, it can return results where the term is mentioned in the article text rather than in a template, but the new engine doesn't behave that way. You can't simply say "don't show me pages with template XXX", because many of them are genuinely relevant.

I would like that a parallel search engine is introduced that searches in bare wikitexts. It also will help in cases like finding deprecatedly named parameters of protected templates; currently, such task can be effectively done only by a bot operating on a dump, which is slower and unaccessible to many users. Ignatus (talk) 21:31, 13 January 2014 (UTC)Reply