Extension:SphinxSearch

Description
As administrators of MediaWiki-based sites, one of the most common complaints we receive is that the default search engine is far from excellent. In an age where Google sets the standard for search engine capabilities, users aren't happy with a basic search engine. They need, or rather demand, a faster, easier, better engine.

The Sphinx Search Engine seems to promise exactly that: a full-text search engine that is both flexible and fast. This extension incorporates the Sphinx engine into MediaWiki to provide a better alternative for searching. The extension can be installed in one of two modes:
 * 1) Provide an additional Special Page for searching using Sphinx. This mode is well suited to evaluating the performance of the extension while still keeping the default search engine.
 * 2) Completely replace the built-in search engine with the Sphinx search engine.

This extension is very similar in nature to Extension:LuceneSearch and Extension:Hyper Estraier. The main difference is obviously the search engine backend. The SphinxSearch extension also adds some features of its own, such as "Did You Mean" suggestions for misspelled searches. This functionality is fundamentally different from Extension:DidYouMean, which only suggests alternate article names for existing articles. Also, SphinxSearch can easily be evaluated before rolling it out as a complete replacement search engine.

Compatibility
This extension has been shown to work / not work with the following MediaWiki versions. Please add more successes and failures to this list.
 * 1.6.? - Fails - The guy who tested an old version of MediaWiki - (Pessoft)
 * 1.8.? - Fails - The guy who uses an old version of MediaWiki - (125.17.142.146)
 * 1.9.3 - Works - (Gri6507)
 * 1.10 - Works - (Svemir)
 * 1.11 - Works - (125.17.142.146, Svemir)
 * 1.12 - Works - (80.152.175.189 (Windows/IIS), Svemir)
 * 1.13.0 - Works - (Erik Gregg), Thanks guys! Nice Job!  It works on Wikipedia!
 * 1.13.2 - Works - (Jipipayo)
 * 1.13.3 - Works - (RADION Openlab), Kamil Wencel, thanks works well on our new testsite LAMP + sphinx 0.9.8.1
 * 1.14 - Works - 130.234.189.190 12:47, 24 February 2009 (UTC), works great on our WIMP

The extension has been shown to work / not work with the following Sphinx versions. Please add more successes and failures to this list
 * 0.9.6rc1 - Does not work - (125.17.142.146)
 * 0.9.7 - Works - (Gri6507)
 * 0.9.8svn - Works - (Svemir)
 * 0.9.8svn-r1112 (Jan 28, 2008 snapshot) - Does not work for 130.234.189.190, but it works for Svemir
 * 0.9.8-rc2 r1234 - Works - (Gmoyle, 80.152.175.189 (Windows/IIS))
 * 0.9.8 - Works - (Svemir)
 * 0.9.8.1 - Works - 130.234.189.190 12:47, 24 February 2009 (UTC), works great on our WIMP

The extension has been shown to work / not work with the following languages. The main problem with some languages may be that it cannot separate the text into words and phrases.
 * English - Works - all versions - (Alpha3)
 * Chinese - Works on win2003 wamp 1.7.3 - all versions - (Alpha3). Please see this post in Sphinx forums for details.
 * German - Works on W2k3 and IIS - (80.152.175.189)
 * Russian - Works (XAMPP, Debian) - StasFomin.
 * Hebrew - Works on W2K3 and IIS - CrushKing.

Step 1 - Install Sphinx
Download Sphinx Search Engine. Follow the installation instructions. You only need to do the actual installation, which means you do not need to do the "Quick Sphinx usage tour". You can verify your installation by following the rest of the steps here. Note: if installing on a Windows server, you do not need to compile anything; just download the Win32 release binaries.

Step 2 - Configure Sphinx
Download and extract the extension to a temporary directory. Copy the sphinx.conf file from this download to some directory (we will refer to this file as "/path/to/sphinx.conf" below.) This directory should not be web-accessible, so you should not use the extensions folder. Make sure to adjust all values to suit your setup:
 * Set correct database, username, and password for your MediaWiki database
 * Update table names in SQL queries if your MediaWiki installation uses a prefix (backslash line breaks may need to be removed if the indexer step below fails)
 * Update the file paths (/var/data/sphinx/..., /var/log/sphinx/...) and create folders as necessary
 * If your wiki is very large, you may want to consider specifying a query range in the conf file.
 * If your wiki is not in English, you will need to change (or remove) the morphology attribute.
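The folder-creation step in the list above can be sketched in shell. This is only a minimal sketch and not part of the extension: the function name prepare_sphinx_dirs is hypothetical, and the default paths are simply the example locations mentioned above, so substitute whatever your sphinx.conf actually uses.

```shell
#!/bin/sh
# Hypothetical helper: create the data and log directories that sphinx.conf
# points to. The defaults are the example paths from this page (assumption:
# your sphinx.conf still uses them).
prepare_sphinx_dirs() {
    data_dir="${1:-/var/data/sphinx}"
    log_dir="${2:-/var/log/sphinx}"
    mkdir -p "$data_dir" "$log_dir" || return 1
    # Print the directories so they can be compared with sphinx.conf.
    printf '%s\n' "$data_dir" "$log_dir"
}

# Example call (the default locations usually require root):
#   prepare_sphinx_dirs /var/data/sphinx /var/log/sphinx
```

Remember that the user running the indexer and the daemon needs write access to both directories.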

Note: To give credit where credit is due, we must thank the author of this excellent article for providing an excellent starting point on configuring this file.

Step 3 - Run Sphinx Indexer
Run the Sphinx indexer to prepare for searching. Once again, make sure to replace the paths to match your installation. This process is actually pretty fast, but the time it takes depends on how large your wiki is. Just be patient and watch the screen for updates.
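As a rough sketch, the indexing run described above might be wrapped like this. The helper name run_full_index is made up for illustration, the paths in the example call are placeholders, and the use of the indexer's --all switch (build every index defined in sphinx.conf) is an assumption about what the first run should cover.

```shell
#!/bin/sh
# Hypothetical helper: run a full index build.
# $1 = path to the Sphinx indexer binary, $2 = path to sphinx.conf.
run_full_index() {
    indexer_bin="$1"
    conf="$2"
    if [ ! -x "$indexer_bin" ] || [ ! -r "$conf" ]; then
        echo "indexer or sphinx.conf not found; check the paths" >&2
        return 1
    fi
    # --all asks the indexer to build every index defined in the config file.
    "$indexer_bin" --config "$conf" --all
}

# Example call with placeholder paths:
#   run_full_index /path/to/sphinx/installation/indexer /path/to/sphinx.conf
```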

Step 4 - Test Out Sphinx
When the indexer has finished, test that Sphinx searching actually works from the command line. You will see the result stats immediately (Sphinx is FAST). Note that the article data you see at this point comes from the sql_query_info setting in the sphinx.conf file. Within the extension we can get to the actual article content because the text old_id is available as an extra attribute. Fetching article content on the command line would be slow (we would have to join the page, revision, and text tables), so at this point we fetch only page_title and page_namespace.
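A command-line test along the lines described above could look like the following sketch. It assumes the search utility that ships alongside indexer in the Sphinx releases listed under Compatibility; the helper name test_sphinx_search and the paths in the example call are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: query the freshly built indexes from the command line.
# $1 = path to the Sphinx "search" utility, $2 = sphinx.conf, rest = search words.
test_sphinx_search() {
    if [ "$#" -lt 3 ]; then
        echo "usage: test_sphinx_search SEARCH_BIN CONF WORD..." >&2
        return 2
    fi
    search_bin="$1"
    conf="$2"
    shift 2
    if [ ! -x "$search_bin" ]; then
        echo "search utility not found: $search_bin" >&2
        return 1
    fi
    # Remaining arguments are passed through as the words to search for.
    "$search_bin" --config "$conf" "$@"
}

# Example call with placeholder paths:
#   test_sphinx_search /path/to/sphinx/installation/search /path/to/sphinx.conf main page
```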

Step 5 - Start Sphinx Daemon
To speed up searching on the wiki, Sphinx must run in daemon mode. Add the searchd start command to whatever server startup script you have access to (e.g. /etc/rc.local). Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.
 * Windows users: please refer to http://www.mediawiki.org/wiki/Extension_talk:SphinxSearch#More_Windows_Install_Issues for help.
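For UNIX-like servers, the daemon startup described above might be sketched as follows. The helper name start_searchd is hypothetical and the paths in the example call are placeholders; the only Sphinx-specific part is calling searchd with its --config option.

```shell
#!/bin/sh
# Hypothetical startup helper, e.g. called from /etc/rc.local.
# $1 = path to the searchd binary, $2 = path to sphinx.conf.
start_searchd() {
    searchd_bin="$1"
    conf="$2"
    if [ ! -x "$searchd_bin" ]; then
        echo "searchd not found: $searchd_bin" >&2
        return 1
    fi
    if [ ! -r "$conf" ]; then
        echo "cannot read config: $conf" >&2
        return 1
    fi
    # searchd detaches into the background on its own once it has started.
    "$searchd_bin" --config "$conf"
}

# Example call with placeholder paths:
#   start_searchd /path/to/sphinx/installation/searchd /path/to/sphinx.conf
```

Checking both paths first means a typo in the startup script shows up as a clear message in the boot log rather than as a search page that silently stops working.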

Step 6 - Configure Incremental Updates
To keep the search index up to date, the indexer must be scheduled to run at regular intervals. On most UNIX systems, edit your crontab file by running the command:
 crontab -e
Add this line to set up a cron job for the full index, for example once every night:
 0 3 * * * /path/to/sphinx/installation/indexer --config /path/to/sphinx.conf wiki_main --rotate >/dev/null 2>&1
Add this line to set up a more frequent cron job that updates the smaller index regularly:
 0 9,15,21 * * * /path/to/sphinx/installation/indexer --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1
As before, make sure to adjust the paths to suit your configuration. Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index file while it is being used. It creates a new file and copies it over the existing one when it is done.

Step 7 - Extension Preparation - Sphinx PHP API
Create the extensions/SphinxSearch directory and copy the Sphinx API file, sphinxapi.php, there. This file is part of the Sphinx source code, under the api/ directory. Note: if you installed Sphinx from a Win32 binary release, it may not have come with a copy of sphinxapi.php. In that case, download either the source code package or an API update package. Just use your favorite uncompress utility (e.g. WinZip) and extract only sphinxapi.php to the extensions/SphinxSearch directory; the other files can be ignored.

Step 8 - Extension Preparation - MediaWiki Extension Functions
Download ExtensionFunctions.php from SVN and copy it to your extensions/SphinxSearch directory:
 svn export http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExtensionFunctions.php

Step 9 - Extension Installation - PHP Files
Copy all remaining files (SphinxSearch.php, SphinxSearch_body.php, SphinxSearch.js, SphinxSearch_PersonalDict.php, SphinxSearch_spell.php, spinner.gif) from the temporary directory where you extracted the code into your extensions/SphinxSearch directory.

Step 10 - Extension Installation - Local Settings
Add the require_once call for this extension to your LocalSettings.php.
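The exact line is not reproduced here, so the following is a sketch under an assumption: that the extension is loaded in the usual MediaWiki way, with a require_once of SphinxSearch.php from the extensions/SphinxSearch directory created in the earlier steps. The helper name enable_sphinxsearch is hypothetical. It appends the line only if it is not already present.

```shell
#!/bin/sh
# Hypothetical helper: append the extension's require_once line to
# LocalSettings.php unless it is already there.
# $1 = path to LocalSettings.php
enable_sphinxsearch() {
    settings="$1"
    # Assumed include line; $IP is MediaWiki's installation path variable.
    line='require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );'
    if [ ! -w "$settings" ]; then
        echo "cannot write to $settings" >&2
        return 1
    fi
    if grep -qF "$line" "$settings"; then
        echo "already enabled"
    else
        printf '%s\n' "$line" >> "$settings"
        echo "added"
    fi
}

# Example call with a placeholder path:
#   enable_sphinxsearch /path/to/wiki/LocalSettings.php
```

Any $wgSphinxSearch_* option overrides belong after this line in LocalSettings.php.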

Options
For the most part, the extension's default options do not need any modification. However, if tweaking is needed or desired, a number of options can be set from LocalSettings.php or directly in SphinxSearch.php. Those are:
 * $wgSphinxSearch_host - the hostname on which Sphinx's searchd daemon is running (defaults to localhost)
 * $wgSphinxSearch_port - the port number on which Sphinx's searchd daemon is running (defaults to 3312)
 * $wgSphinxSearch_mode - the Sphinx search mode. The default mode is the most intuitive. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_matches - the number of search hits to display per result page.
 * $wgSphinxSearch_weights - the way Sphinx orders the results. The default is pretty good. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_groupby, $wgSphinxSearch_groupsort - define how to group the results. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_sortby - the sorting mode for matches (defaults to SPH_SORT_RELEVANCE). See Sphinx documentation for other valid options.

When setting these options in LocalSettings.php, make sure to do so after the call to require_once for this extension.

Mode Of Operation
By default, this extension does not replace the built-in search engine, but instead provides a new Special Page called Search Wiki Using Sphinx. This allows users to evaluate the extension by directly comparing the results of the built-in search with those of the Sphinx search.

If the performance is deemed good enough to replace the built-in search engine, this extension can easily be configured to act as the default search engine. To do so, modify SphinxSearch.php and uncomment the lines that enable this mode. The standard search method will then use Sphinx by default. Note: when used in this way, the extension preserves the functionality of the Go and Search buttons.

Did You Mean
When a search query is misspelled, the search results can be greatly impaired. Without knowing about the misspelling, it may take the user a while to figure out why the results are not very good. That is why this extension has optional "Did You Mean" support. When enabled, this feature suggests a properly spelled search query whenever it detects a spelling mistake. Also, since many wikis use their own jargon, the extension can optionally use a personalized dictionary to make the "Did You Mean" suggestions more reasonable.

The spell checking capability is provided via one of two methods:
 * 1) Aspell - a command line program for performing spell checking
 * 2) Pspell - PHP native interface to Aspell

The Did You Mean feature is turned off by default because it requires the presence of a spell checker and some configuration. To enable it, edit SphinxSearch.php and uncomment the line that turns the feature on. The extension will then automatically pick whichever method of interacting with the spell checker is more efficient. If your wiki server does not have Pspell support, specify the path to the Aspell executable by editing the corresponding line. If for whatever reason the Aspell dictionary files on the server are not in the default location, you can specify the proper path to the dictionary files with the corresponding setting. If using a personalized dictionary, edit the line that sets its location to point to where you'd like to keep the dictionary file.

When the Did You Mean feature is enabled and is configured to use a personal dictionary file, then the next step is to add contents to this dictionary. SphinxSearch will create a new restricted access special page called Wiki-specific Sphinx search spellcheck dictionary. This page is only accessible by users with DELETE permissions (typically PowerUser and SysOp groups). These users can utilize this page to view the words already in the dictionary, add words into the dictionary, and remove words from the dictionary.

Stop Words
When modifying the sphinx.conf file (see Step 2 - Configure Sphinx), there is an option for specifying a file containing search stop words. Stop words are common words like 'a' and 'the' that appear frequently in text and should be ignored when searching. A somewhat complete list of English stop words can be found here. Simply copy those words into a text file, and modify your sphinx.conf to point to that file.
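As an illustration of the stop word setup described above, the sketch below writes a word list to a file and prints the matching configuration line. The helper name write_stopwords and the file path are placeholders; the printed line assumes Sphinx's stopwords index option, which belongs inside the index definition in sphinx.conf.

```shell
#!/bin/sh
# Hypothetical helper: write stop words (one per line) to a file and print
# the sphinx.conf line that points at it.
# $1 = output file, remaining arguments = stop words.
write_stopwords() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_stopwords FILE WORD..." >&2
        return 2
    fi
    out_file="$1"
    shift
    printf '%s\n' "$@" > "$out_file" || return 1
    # Paste this line into the index section of sphinx.conf.
    printf 'stopwords = %s\n' "$out_file"
}

# Example call with a placeholder path and a very short word list:
#   write_stopwords /path/to/stopwords.txt a an and the of
```

After changing the stop word file, rebuild the indexes so the new list takes effect.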

ToDo

 * Use auto-load and make other performance improvements.
 * Add "ignore" checkbox to category search (so only articles that do not have that category are returned.)
 * Additional search options (exact match, etc.)
 * Add image thumbnails to image matches.
 * Smarter handling of multiple Sphinx index files.
 * Assign weights to namespaces.
 * Sort the results in SPH_SORT_EXTENDED mode by @relevance and by number of times the page has been viewed (available from wiki database). The idea behind this is that given two pages that have the same relevance to the search, if one has been viewed more times, there is probably a reason for that. Number of links to each page could also be included in the calculation.
 * Use existing titles in "did you mean" suggestions.
 * If originally "Go" was clicked, and "did you mean" link results in a direct match, redirect to that page.
 * Easier install of the extension. Perhaps a script?

Completed ToDos

 * We use SPH_MATCH_EXTENDED for better relevance weights, but we process the search term so that multiple words are combined with OR instead of AND. This will be replaced with an option on the search form.
 * Add the "did you mean" functionality to the search results.

Revisions

 * v0.6.1 - November 11, 2008
    * Added SphinxSearchGetNearMatch hook - called with $term and $title (or null) returned from SearchEngine::getNearMatch.
    * If PECL SphinxClient is installed, do not include sphinxapi.php.
    * Added $wgSphinxSearch_maxmatches (defaults to 1000) and $wgSphinxSearch_cutoff (default 0) for full control of SetLimits call (and to prevent PECL extension from breaking.)
    * $wgSphinxSearchJSPath can be used to specify a different web path for SphinxSearch.js (for category search.)
    * Search term is now urlencoded when used in URLs (thanks Stas!)
    * Make sure $wgSphinxSuggestMode and $wgSphinxSearchPersonalDictionary are declared before being accessed.

 * v0.6 - August 25, 2008
    * fixed several bugs discovered since 0.6 beta release (or earlier...)

 * v0.6 beta - April 12, 2008
    * category filtering and AJAX-based sub-category filtering
    * various bug fixes
    * compatibility with Sphinx 0.9.8

 * v0.5.3 - October 27, 2007
    * case insensitive Did You Mean suggestions
    * allow for custom ASpell dictionary locations
    * support for editing the Personal Dictionary via special page.

 * v0.5.1 - October 25, 2007
    * fixed a bug where search results with long strings without spaces forced the user to use the horizontal scroll bar.

 * v0.5.0 - October 20, 2007
    * added google-like "Did You Mean" support for misspelled queries
    * fixed a bug for Internet Explorer users where pressing enter in the search form did not act like clicking the Go button
    * have an option to match any or match all terms
    * added the Before and After hooks around the search results

 * v0.4.3 - October 15, 2007
    * when sphinx is not the default search engine, viewing pages 2 and up of the results now actually uses sphinx.

 * v0.4.2 - October 12, 2007
    * when sphinx is not the default search engine, the special page search actually uses sphinx now.

 * v0.4.1 - October 11, 2007
    * made it optional to replace the default search with Sphinx completely. By default, Sphinx search becomes just another special page.
    * fixed a bug when search would crash if a matching article was deleted after last indexer run.

 * v0.3 - October 5, 2007
    * numerous updates and improvements by Svemir Brkic

 * v0.1 - September 24, 2007
    * initial release (RFC)

Charset for all languages
Just copy the charset you need to the end of the charset_table definition in the sphinx.conf file. After doing so, you need to run a full index. For me, it also helped to restart the service.
 * http://www.sphinxsearch.com/wiki/doku.php?id=charset_tables