User:Dantman/Search system

From mediawiki.org

I encountered some deficiencies in our search system that could ideally be improved and made more flexible and abstract.

  • maintenance/Maintenance.php contains SQL-specific searchindex code. This should be moved into a common subclass...
  • maintenance/rebuildtextindex.php is specific to database full-text search. It is also not completely abstract, as it contains some MySQL conditionals.
  • maintenance/updateSearchIndex.php uses some abstract code but ultimately makes some calls that are specific to full-text search.
  • MWSearch (Lucene) has its own luceneUpdate.php maintenance script for pushing data. This kind of thing should probably be supported as part of MW's own search support, common to all search engine types, combined with and replacing maintenance/rebuildtextindex.php and maintenance/updateSearchIndex.php.
  • MWSearch looks like it can support live push updates to the index. We should try to support this as an option, instead of only supporting incremental and pull-based index updates.
  • We have a script that does incremental updates based on rc (and is specific to database full-text search). However, we do not have a config option to turn off live push updates of the index in order to switch over to incremental updates.
  • The scripts and interfaces we use should not be locked to the current $wgSearchType configured search index:
    • The maintenance script we use to do push updates and rebuilding of search indexes should support an argument to work on a different search type, so that we can pre-populate an index before switching to it.
    • Special:Search should also support an argument so that an alternative search index can be tested before it is enabled.
  • Internal and interwiki searching support is adequate. However, I believe we should support an extra alternate section of the output (like the interwiki section) that uses URLs rather than . An intersite section. This would support interwiki results where the index doesn't have good interwiki data, and also cases where MW is indexed alongside things like a site's own blog and other sites, so that the wiki search can display extra results from those related sites. The interwiki support we have isn't good enough for all situations.
  • We may want to consider an extra $wg variable with an array of index types to update instead of just the one specified by $wgSearchType. This would allow for support of multiple indexers in a transitional period or when a wiki is evaluating different search indexes and deciding what to use.
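
The last few points (a common interface, a search-type argument for scripts and Special:Search, and a list of index types to keep updated) could be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the class names, the `search_update_types` setting and the helper functions are all hypothetical, and the toy in-memory backend merely stands in for a real engine.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional, Tuple


class SearchIndex(ABC):
    """Hypothetical common interface every search backend would implement."""

    @abstractmethod
    def update(self, page_id: int, title: str, text: str) -> None:
        """Add or replace one page in the index."""

    @abstractmethod
    def search(self, term: str) -> List[int]:
        """Return the ids of pages matching the term."""


class InMemoryIndex(SearchIndex):
    """Toy backend standing in for database full-text, Solr, Sphinx, etc."""

    def __init__(self) -> None:
        self.docs: Dict[int, str] = {}

    def update(self, page_id: int, title: str, text: str) -> None:
        self.docs[page_id] = (title + " " + text).lower()

    def search(self, term: str) -> List[int]:
        needle = term.lower()
        return sorted(pid for pid, body in self.docs.items() if needle in body)


# Hypothetical settings: one type serves searches, several are kept up to date.
CONFIG = {"search_type": "db", "search_update_types": ["db", "solr"]}
REGISTRY: Dict[str, SearchIndex] = {
    "db": InMemoryIndex(),
    "solr": InMemoryIndex(),
    "sphinx": InMemoryIndex(),
}


def push_update(page_id: int, title: str, text: str) -> None:
    """Update every index type listed in the (assumed) update-types setting."""
    for name in CONFIG["search_update_types"]:
        REGISTRY[name].update(page_id, title, text)


def rebuild(pages: Iterable[Tuple[int, str, str]], search_type: str) -> None:
    """Pre-populate an index that is not (yet) the configured search type."""
    for page_id, title, text in pages:
        REGISTRY[search_type].update(page_id, title, text)


def search(term: str, search_type: Optional[str] = None) -> List[int]:
    """Search the configured index, or another one when a type is passed."""
    return REGISTRY[search_type or CONFIG["search_type"]].search(term)
```

With this shape, a maintenance script or a Special:Search request would only need to pass a different `search_type` to work against an index that is not the live one, for example `rebuild(pages, "sphinx")` followed by `search("wiki", search_type="sphinx")`.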

Comparison of search indexers

We take the following search types into account here:

  • Database engine based fulltext search indexing
  • SphinxSearch and Sphinx
  • The Solr search engine and a potential extension supporting it
  • Wikimedia's MWSearch + Lucene-search (lsearchd) + OAI (somewhat optional)

| Engine | Query-suggest ("Did you mean?") | Push update support | Pull update support | Compressed text, ExternalStore compatible |
|---|---|---|---|---|
| Database full-text | None | Yes | No | Yes |
| Sphinx | No. SphinxSearch supports a deficient spellchecker-based one | Can't support | Yes, only method | No, does SQL on text table |
| Solr | Yes, theoretically good (should be index based) | Yes | Theoretically possible (DataImportHandler: SQL, HTTP, XML Dump, etc...) | Yes |
| lsearchd | Yes, best (index based) | Cron only, no live push | Through OAI | Yes |

  • push updates:
    • A push update is one where part of MW makes an update call to the daemon or index.
    • Push updates can be done live, in other words, they can be made on-save for instant updates to the search index.
    • Push updates can also be run incrementally using a maintenance script and cron.
    • However, we currently do not have a config option to defer live updates, so you can't simply switch to incremental push updates.
  • pull updates:
    • A pull update is one where the index daemon, or a cron script tied to the indexer, makes a request to part of MW to fetch data and update the search index.
    • If this request is SQL based, like Sphinx's, the indexing solution is incompatible with compressed text and ExternalStore, i.e. it only works with inefficient methods of text storage.
    • Pull updates can usually support incremental indexing.
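
The difference between the update styles described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not existing MediaWiki behaviour: `PushUpdater`, its `live` switch (standing in for the missing config option), and `pull_update` are hypothetical names, and a plain dict stands in for the search index or daemon.

```python
class PushUpdater:
    """Wiki side: pushes page text to the index, live or deferred."""

    def __init__(self, index, live=True):
        self.index = index    # dict standing in for the search index or daemon
        self.live = live      # hypothetical config switch; False defers updates
        self.pending = {}     # page_id -> latest text awaiting an incremental run

    def on_save(self, page_id, text):
        """Called when a page is saved."""
        if self.live:
            self.index[page_id] = text      # live push: index changes on save
        else:
            self.pending[page_id] = text    # deferred: only the newest text is kept

    def run_incremental(self):
        """Called from a maintenance script or cron; returns pages pushed."""
        count = len(self.pending)
        self.index.update(self.pending)
        self.pending.clear()
        return count


def pull_update(index, fetch_changes, since):
    """Indexer side: ask the wiki for changes newer than `since`.

    `fetch_changes(since)` is an assumed callable yielding
    (timestamp, page_id, text) tuples. Returns the newest timestamp seen,
    to be passed as `since` on the next incremental pull.
    """
    latest = since
    for timestamp, page_id, text in fetch_changes(since):
        index[page_id] = text
        latest = max(latest, timestamp)
    return latest
```

In the push case the wiki decides when data reaches the index; in the pull case the indexer does, which is why a pull that reads SQL directly never sees text that only MW knows how to decompress or fetch from ExternalStore.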