Extension:SphinxSearch

Description
As the administrator of a MediaWiki-based site, one of the most common complaints I receive is that the default search engine is far from excellent. In this day and age, when Google sets the standard for search engine capabilities, users aren't happy with a basic search engine. They need, or should I say demand, a faster, easier, better engine.

The Sphinx Search Engine seems to promise exactly that: a full-text search engine that is both flexible and fast. This extension incorporates the Sphinx engine into MediaWiki, currently only as a standalone Special Page. It is the intention of this developer to find a convenient hook in the core code to have Sphinx act as the default search engine (i.e. the search box on the left of every page), while still keeping MediaWiki's built-in search engine accessible as a special page.

This extension is very similar in nature to Extension:LuceneSearch. The main difference is obviously the search engine backend.

Step 1
Download the Sphinx Search Engine and follow its installation instructions.

Step 2
Create a sphinx.conf file in some directory (let's call it $SPHINX) with the following content:

source src1
{
    type             = mysql
    strip_html       = 0
    index_html_attrs =

    sql_host         = localhost
    sql_user         = root
    sql_pass         = *******
    sql_db           = ???????
    sql_sock         = /var/lib/mysql/mysql.sock
    sql_port         = 3306

    sql_query_pre    =
    sql_query        = SELECT old_id,old_text,page_title,page_namespace FROM \
                       ( \
                           SELECT MAX(rev_text_id) AS latest_text_id, page_title, page_namespace \
                           FROM wiki_page, wiki_revision \
                           WHERE rev_page=page_id GROUP BY page_latest \
                       ) AS latest, wiki_text WHERE old_id=latest_text_id
    sql_query_post   =
    sql_group_column = page_namespace
    sql_query_info   = SELECT old_text,page_title FROM \
                       ( \
                           SELECT MAX(rev_text_id) AS latest_text_id, page_title FROM wiki_page, wiki_revision \
                           WHERE rev_page=page_id GROUP BY page_latest \
                       ) AS latest, wiki_text WHERE old_id=$id
}

source src1stripped : src1
{
    strip_html       = 1
}

index wiki
{
    source           = src1
    path             = $SPHINX/wiki.sphinx
    docinfo          = extern
    morphology       = none
    stopwords        =
    min_word_len     = 1
    charset_type     = utf-8
    min_prefix_len   = 0
    min_infix_len    = 0
}

index wikistemmed : wiki
{
    path             = $SPHINX/wikistemmed
    morphology       = stem_en
}

indexer
{
    mem_limit        = 512M
}

searchd
{
    port             = 3312
    log              = /tmp/sphinx-searchd.log
    query_log        = /tmp/sphinx-query.log
    read_timeout     = 5
    max_children     = 30
    pid_file         = /tmp/sphinx-searchd.pid
    max_matches      = 1000
}

Make sure to adjust all values to suit your setup. Pay careful attention to the sql_ settings, including the wiki database table prefix (here assumed to be wiki_), and make sure to substitute all instances of $SPHINX with whichever directory you chose above.
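The $SPHINX substitution can be scripted rather than done by hand. Here is a minimal sketch, assuming the file was saved with the literal text $SPHINX in it. The helper name substitute_sphinx_dir, the SPHINX_DIR variable and its default /usr/local/sphinx are my own illustrative choices, not part of Sphinx:

```shell
# Assumption: the directory you picked in Step 2; adjust to your setup.
SPHINX_DIR="${SPHINX_DIR:-/usr/local/sphinx}"

# Hypothetical helper: replace every literal "$SPHINX" in a config file
# with a real directory. It writes to a temp file first, so the original
# is only overwritten if sed succeeds.
# Caveat: the directory must not contain "|" or "&", which sed would
# treat specially in this expression.
substitute_sphinx_dir() {
    conf_file="$1"
    target_dir="$2"
    sed 's|\$SPHINX|'"$target_dir"'|g' "$conf_file" > "$conf_file.tmp" \
        && mv "$conf_file.tmp" "$conf_file"
}

if [ -f "$SPHINX_DIR/sphinx.conf" ]; then
    substitute_sphinx_dir "$SPHINX_DIR/sphinx.conf" "$SPHINX_DIR"
else
    echo "no sphinx.conf in $SPHINX_DIR yet; create it first" >&2
fi
```

The database settings (sql_user, sql_pass, sql_db and the wiki_ prefix) still have to be edited by hand, since only you know the right values.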

Note: I have to give credit where credit is due. This sphinx.conf file was in large part borrowed from this excellent article.

Step 3
Run the Sphinx indexer to prepare for searching, pointing it at the sphinx.conf file from Step 2. Once again, make sure to replace $SPHINX with whatever you chose above. This process is actually pretty fast, but how long it takes clearly depends on how large your wiki is. Just be patient and watch the screen for updates.
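A typical indexer invocation looks roughly like the sketch below. The --config and --all options are standard indexer options; the SPHINX_DIR variable, its default path and the run_indexer wrapper are assumptions made for illustration:

```shell
# Assumption: the directory that holds sphinx.conf from Step 2.
SPHINX_DIR="${SPHINX_DIR:-/usr/local/sphinx}"

# Build every index defined in sphinx.conf (here: wiki and wikistemmed).
run_indexer() {
    indexer --config "$SPHINX_DIR/sphinx.conf" --all
}

# Only run when the indexer binary is actually installed.
if command -v indexer >/dev/null 2>&1; then
    run_indexer
else
    echo "indexer not found on PATH; install Sphinx first (Step 1)" >&2
fi
```

Instead of --all you can also name a single index, for example wiki, if you only want to rebuild that one.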

Step 4
When the indexer is finished, test that Sphinx searching is actually working by running a query against the new index from the command line.
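Older Sphinx releases ship a small command-line utility called search for exactly this kind of check. A minimal sketch, assuming that utility is installed and that you want to query the wiki index from Step 2 (the SPHINX_DIR variable, the test_search wrapper and the sample words are illustrative):

```shell
# Assumption: the directory that holds sphinx.conf from Step 2.
SPHINX_DIR="${SPHINX_DIR:-/usr/local/sphinx}"

# Query the "wiki" index directly, without going through the daemon.
# All arguments are passed through as the words to search for.
test_search() {
    search --config "$SPHINX_DIR/sphinx.conf" --index wiki "$@"
}

if command -v search >/dev/null 2>&1; then
    test_search main page
else
    echo "search utility not found on PATH" >&2
fi
```

If the index was built correctly, the output should list matching documents along with per-word statistics. Pick words you know appear somewhere in your wiki.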

Step 5
In order to speed up searching for the wiki, we must run Sphinx in daemon mode (searchd). Add the command that starts the daemon to whatever server startup script you have access to.

Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.
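The startup lines could look something like the following sketch. The --config option is a standard searchd option and the pid file path is the pid_file value from Step 2. The SPHINX_DIR variable, the wrapper functions and the idea of placing this in a script such as /etc/rc.local are assumptions on my part:

```shell
# Assumption: the directory that holds sphinx.conf from Step 2.
SPHINX_DIR="${SPHINX_DIR:-/usr/local/sphinx}"
# Same path as pid_file in the searchd section of sphinx.conf.
PID_FILE="/tmp/sphinx-searchd.pid"

# True if the pid file exists and names a live process.
searchd_running() {
    [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null
}

# Start the daemon using the configuration from Step 2.
start_searchd() {
    searchd --config "$SPHINX_DIR/sphinx.conf"
}

if searchd_running; then
    echo "searchd is already running"
elif command -v searchd >/dev/null 2>&1; then
    start_searchd
else
    echo "searchd not found on PATH" >&2
fi
```

The running check keeps the script harmless if it is executed twice, which is handy when you also want to start the daemon by hand right after indexing.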

Step 6
Copy the Sphinx API file, sphinxapi.php, to the extensions directory of your wiki. This file is part of the Sphinx source code, under the api/ directory.
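As a sketch, the copy could be done like this. Both default paths and the install_sphinx_api helper are assumptions, so point SPHINX_SRC at wherever you unpacked the Sphinx source and WIKI_DIR at your MediaWiki root:

```shell
# Assumptions: location of the unpacked Sphinx source and of the wiki.
SPHINX_SRC="${SPHINX_SRC:-/usr/local/src/sphinx}"
WIKI_DIR="${WIKI_DIR:-/var/www/wiki}"

# Copy api/sphinxapi.php from the source tree into the wiki's
# extensions directory.
install_sphinx_api() {
    src_tree="$1"
    wiki_root="$2"
    cp "$src_tree/api/sphinxapi.php" "$wiki_root/extensions/sphinxapi.php"
}

if [ -f "$SPHINX_SRC/api/sphinxapi.php" ] && [ -d "$WIKI_DIR/extensions" ]; then
    install_sphinx_api "$SPHINX_SRC" "$WIKI_DIR"
else
    echo "adjust SPHINX_SRC and WIKI_DIR before running" >&2
fi
```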

Step 7
Copy the following contents into extensions/SphinxSearch.php:

");
      }
   }

   function createNewSearchForm($SearchWord='') {
      global $wgOut, $wgUser;

      $titleObj = SpecialPage::getTitleFor( "SpecialSphinxSearch" );
      $kiAction = $titleObj->getLocalUrl();
      $wgOut->addHTML("                                                  ");

      # get user settings for which namespaces to search
      $wgOut->addHtml(wfMsg('sphinxSearchInNamespaces'));
      foreach( SearchEngine::searchableNamespaces() as $ns => $name ) {
         $checked = $wgUser->getOption('searchNs' . $ns) ? ' checked="checked"' : '';
         $name = str_replace( '_', ' ', $name );
         if('' == $name) $name = wfMsg('blanknamespace');
         $wgOut->addHtml(" {$name} ");
      }
      $wgOut->addHTML("");

      # Put a Sphinx label for this search
      $wgOut->addHtml(" Powered by  ");
   }
}

?>

Note: Make sure to adjust the Configuration Options section near the top of this file. Most importantly, set $wgSphinxSearch_index to the name of the search index you defined in sphinx.conf in Step 2.

Step 8
Add a require_once call for extensions/SphinxSearch.php to your LocalSettings.php.
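Judging by the Configuration section, which refers to the require_once call for this extension, the line in question is presumably a require_once of extensions/SphinxSearch.php. Here is a minimal shell sketch that appends such a line only if it is not already there. The WIKI_DIR path, the enable_sphinx_search helper and the use of MediaWiki's $IP install-path variable are assumptions:

```shell
# Assumption: MediaWiki root directory.
WIKI_DIR="${WIKI_DIR:-/var/www/wiki}"

# Hypothetical helper: append the require_once line to a settings file
# unless the extension is already referenced there.
enable_sphinx_search() {
    settings_file="$1"
    if grep -q 'SphinxSearch\.php' "$settings_file"; then
        echo "already enabled"
    elif tail -n 1 "$settings_file" | grep -q '?>'; then
        # Text appended after a closing PHP tag would be sent to the
        # browser as-is, so leave such files for manual editing.
        echo "file ends with ?>; add the line by hand above it"
        return 1
    else
        # Single quotes keep $IP literal so PHP, not the shell, expands it.
        printf '%s\n' 'require_once( "$IP/extensions/SphinxSearch.php" );' >> "$settings_file"
        echo "enabled"
    fi
}

if [ -f "$WIKI_DIR/LocalSettings.php" ]; then
    enable_sphinx_search "$WIKI_DIR/LocalSettings.php"
else
    echo "no LocalSettings.php found in $WIKI_DIR" >&2
fi
```

Editing LocalSettings.php by hand works just as well, of course; the script is only a convenience if you set up several wikis.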

Searching
Sphinx search is currently only available as a Special Page. To access it, go to Special:SpecialSphinxSearch. The syntax for searching is quite intuitive. For the complete set of options, see the Sphinx documentation.

Configuration
There are currently three configuration options, which can be set either in LocalSettings.php or directly in SphinxSearch.php. When setting these options in LocalSettings.php, make sure to do so after the call to require_once for this extension. The options are:
 * $wgSphinxSearch_index - the name of the index Sphinx will search. See above for details.
 * $wgSphinxSearch_mode - the Sphinx search mode. The default mode is the most intuitive. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_matches - the number of search hits to display per result page.
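Because the ordering rule is easy to get wrong, here is a small sketch of a check you could run against LocalSettings.php. It only compares line numbers of the first require_once for SphinxSearch.php and the first $wgSphinxSearch_ assignment. The helper name check_sphinx_settings_order and the WIKI_DIR default are made up for this example:

```shell
# Assumption: MediaWiki root directory.
WIKI_DIR="${WIKI_DIR:-/var/www/wiki}"

# Hypothetical helper: verify that $wgSphinxSearch_* overrides come
# after the require_once line for the extension.
check_sphinx_settings_order() {
    settings_file="$1"
    # Line number of the first require_once mentioning SphinxSearch.php.
    req_line="$(sed -n '/require_once.*SphinxSearch\.php/{=;q;}' "$settings_file")"
    if [ -z "$req_line" ]; then
        echo "no require_once for SphinxSearch.php found"
        return 1
    fi
    # Line number of the first option override (empty if there is none).
    opt_line="$(sed -n '/^[[:space:]]*\$wgSphinxSearch_/{=;q;}' "$settings_file")"
    if [ -n "$opt_line" ] && [ "$opt_line" -lt "$req_line" ]; then
        echo "option on line $opt_line precedes require_once on line $req_line"
        return 1
    fi
    echo "ok"
}

if [ -f "$WIKI_DIR/LocalSettings.php" ]; then
    check_sphinx_settings_order "$WIKI_DIR/LocalSettings.php"
else
    echo "no LocalSettings.php found in $WIKI_DIR" >&2
fi
```

The order matters because SphinxSearch.php assigns its own defaults when it is loaded, so anything set before the require_once would simply be overwritten.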

ToDo

 * 1) Make the search box on the left of every page search with the Sphinx search engine, while still having access to the built-in search through a special page.
 * 2) Make the database fulltext extraction faster.
 * 3) Document incremental updates to the indexer.

Revisions

 * v0.1 - September 24, 2007 - Initial release (RFC)