Extension:SphinxSearch

As administrators of MediaWiki-based sites, one of the most common complaints we receive is that the default search engine is far from excellent. Now that Google sets the standard for search engine capabilities, users aren't happy with a basic search engine. They need, or rather demand, a faster, easier and better engine.

The Sphinx Search Engine seems to promise exactly that: a full-text search engine that is both flexible and fast. This extension incorporates the Sphinx engine into MediaWiki to provide a better alternative for searching.

Sphinx operates as a standalone server and does not store the text itself. It builds an index from a SQL query that retrieves documents from a database (the MediaWiki MySQL database, for example), stores the index, and at a later stage returns the rows that match the search.

Download
Two separate software components are necessary: the Sphinx Search Engine (hereafter called Sphinx) and the SphinxSearch Extension (hereafter called the extension).

Sphinx
Sphinx Search Engine.

Extension
The current (SVN trunk) version of the extension supports "intitle:", "incategory:", "prefix:", and other advanced Wikipedia search techniques described here, and it also supports the extended Sphinx search syntax.

Installation Instructions
Instructions on how to install Sphinx on Windows can be found on the Windows installation subpage.

Step 1 - Install Sphinx
Follow the instructions. You only need to do the actual installation, which means you do not need to do the "Quick Sphinx usage tour". You can verify your installation by following the rest of the steps here. Note: if installing on a Windows server, you do not need to compile anything; just download the Win32 release binaries.

A more detailed description of the Sphinx Search Engine installation process can be found in the Sphinx Search Beginner's Guide.

Step 2 - Configure Sphinx
Download and extract the extension to a temporary directory. Copy the sphinx.conf file from this download to some directory (we will refer to this file as "/path/to/sphinx.conf" below.) This directory should not be web-accessible, so you should not use the extensions folder. Make sure to adjust all values to suit your setup:
 * Set correct database, username, and password for your MediaWiki database
 * Update table names in SQL queries if your MediaWiki installation uses a prefix (backslash line breaks may need to be removed if the indexer step below fails)
 * Update the file paths (/var/data/sphinx/..., /var/log/sphinx/...) and create folders as necessary (e.g. for a default Unix install, add /usr/local on the front and run mkdir /usr/local/var/data/sphinx).
 * If your wiki is very large, you may want to consider specifying a query range in the conf file.
 * If your wiki is not in English, you will need to change (or remove) the morphology attribute.
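
The folder setup from the list above can be sketched as follows for a default Unix install. The PREFIX, DATA_DIR and LOG_DIR variable names are assumptions made for illustration; the paths themselves follow the ones named in the list.

```shell
# Assumed install prefix for a default Unix build of Sphinx; adjust as needed.
PREFIX="/usr/local"

# Paths as referenced in sphinx.conf, with the prefix added on the front.
DATA_DIR="$PREFIX/var/data/sphinx"
LOG_DIR="$PREFIX/var/log/sphinx"

# Create both folders (and any missing parents); report a failure
# instead of aborting, since this usually needs root privileges.
for dir in "$DATA_DIR" "$LOG_DIR"; do
    mkdir -p "$dir" || echo "could not create $dir (try running as root)" >&2
done
```

Whatever paths you choose here must match the ones you put into sphinx.conf.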

Credit: Thanks to the author of this excellent article for providing an excellent starting point on configuring this file.

Note: If running on SQLite instead of MySQL, please see Extension:SphinxSearch/SQLite_configuration

Step 3 - Run Sphinx Indexer
Run the Sphinx indexer against your configuration file to build the indexes and prepare for searching. Once again, make sure to replace the paths to match your installation. The process is fast, though the time it takes depends on how large your wiki is. Be patient and watch the screen for updates.
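
A minimal sketch of the indexer run, assuming the placeholder paths used elsewhere on this page. The SPHINX_BIN and SPHINX_CONF variable names are illustrative only; --all asks the indexer to build every index defined in the configuration file.

```shell
# Placeholder paths; replace both to match your installation.
SPHINX_BIN="/path/to/sphinx/installation"
SPHINX_CONF="/path/to/sphinx.conf"

# Build every index defined in the configuration file.
if [ -x "$SPHINX_BIN/indexer" ]; then
    "$SPHINX_BIN/indexer" --config "$SPHINX_CONF" --all
else
    echo "indexer not found in $SPHINX_BIN" >&2
fi
```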

Step 4 - Test Out Sphinx
When the indexer is finished, test that Sphinx searching actually works by running a query from the command line. You will see the result stats immediately (Sphinx is FAST). Note that the article data you see at this point comes from the sql_query_info setting in the sphinx.conf file. In the extension we can get to the actual article content because we have the text old_id available as an extra attribute. Fetching article content on the command line would be slow (we would have to join the page, revision, and text tables), so we fetch only page_title and page_namespace at this point.
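
A sketch of such a test query, assuming the search command-line utility is installed next to indexer in your Sphinx installation. The paths, the variable names and the query string are placeholders.

```shell
# Placeholder paths; replace both to match your installation.
SPHINX_BIN="/path/to/sphinx/installation"
SPHINX_CONF="/path/to/sphinx.conf"

# Any word or phrase that you know appears in your wiki.
QUERY="search string"

# Query the freshly built indexes directly, without the daemon.
if [ -x "$SPHINX_BIN/search" ]; then
    "$SPHINX_BIN/search" --config "$SPHINX_CONF" "$QUERY"
else
    echo "search utility not found in $SPHINX_BIN" >&2
fi
```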

Step 5 - Start Sphinx Daemon
To speed up searching for the wiki, Sphinx must run in daemon mode. Add the command that starts the searchd daemon to a server startup script you have access to (e.g. /etc/rc.local).
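
A sketch of lines that could go into such a startup script, assuming the same placeholder paths as in the earlier steps. The variable names are illustrative only.

```shell
# Placeholder paths; replace both to match your installation.
SPHINX_BIN="/path/to/sphinx/installation"
SPHINX_CONF="/path/to/sphinx.conf"

# Start the search daemon with the same configuration the indexer used.
if [ -x "$SPHINX_BIN/searchd" ]; then
    "$SPHINX_BIN/searchd" --config "$SPHINX_CONF"
else
    echo "searchd not found in $SPHINX_BIN" >&2
fi
```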

Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.


 * Please see Windows installation subpage if running sphinx on Windows.

Step 6 - Configure Incremental Updates
To keep the index for the search engine up to date, the indexer must be scheduled to run at a regular interval. If your wiki is small, it's best to comment out wiki_incremental in sphinx.conf and just run the indexer for wiki_main. The reason is that wiki_main and wiki_incremental are additive only. Words that have been removed since wiki_main was updated will still appear even after wiki_incremental is run.

On most UNIX systems, edit your crontab file by running the command:

    crontab -e

Add this line to set up a cron job for the full index, for example once every night:

    0 3 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_main --rotate >/dev/null 2>&1; /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

Add this line to set up a more frequent cron job that updates the smaller index regularly:

    0 9,15,21 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

As before, make sure to adjust the paths to suit your configuration. Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index file while it is being used. It creates a new file and copies it over the existing one when it is done.

On Windows, commands like these inside a batch file should do the trick, provided you previously created the .CMD files running the indexer:

    at 23:00 /INTERACTIVE /every:M,T,W,TH,F,S,Su "%~dp0%__IndexMain__.cmd"
    at 08:00 /INTERACTIVE /every:M,T,W,TH,F,S,Su "%~dp0%__IndexIncr__.cmd"

Note that those tasks will only be manageable by the "at" command, and not through the control panel "Scheduled tasks" interface.

Also, adjust the SQL query for the src_wiki_incremental source in sphinx.conf to match the time in the crontab for wiki_main, keeping in mind that MediaWiki may be storing the times in UTC while the server that runs the cron job may be using a different time zone.
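
To check the time zone arithmetic, the sketch below converts a local cron time into the 14-digit UTC form (YYYYMMDDHHMMSS) that MediaWiki uses for timestamps. It assumes a date command that accepts -d (GNU date, for example); the function name to_mw_utc, the sample date and the EST5 zone are hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: print a local wall-clock time as a UTC
# timestamp in MediaWiki's YYYYMMDDHHMMSS form.
#   $1: local time, e.g. "2011-09-07 03:00"
#   $2: time zone of the server that runs cron, e.g. "EST5"
to_mw_utc() {
    # Interpret the input in the given zone and get seconds since epoch.
    epoch=$(TZ="$2" date -d "$1" +%s) || return 1
    # Print the same instant in UTC.
    date -u -d "@$epoch" +%Y%m%d%H%M%S
}

# A 03:00 cron run on a server that is five hours behind UTC.
to_mw_utc "2011-09-07 03:00" "EST5"   # prints 20110907080000
```

In this example a full index that starts at 03:00 server time corresponds to 08:00 in the stored timestamps, so the incremental query would need to use 08:00 as its cutoff.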

Step 7 - Extension Preparation - Sphinx PHP API
Create the extensions/SphinxSearch directory and copy the Sphinx API file, sphinxapi.php, into it. This file is part of the Sphinx download, under the api/ directory. You will need to copy this file again each time you update the Sphinx engine.
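
A minimal sketch of this step. SPHINX_SRC and WIKI_ROOT are placeholder locations assumed for illustration: the first is where you unpacked the Sphinx download, the second is your MediaWiki installation root.

```shell
# Placeholder locations; adjust both to your setup.
SPHINX_SRC="/path/to/sphinx"
WIKI_ROOT="/path/to/mediawiki"

EXT_DIR="$WIKI_ROOT/extensions/SphinxSearch"

# Create the extension directory and copy the PHP API file into it.
if [ -f "$SPHINX_SRC/api/sphinxapi.php" ]; then
    mkdir -p "$EXT_DIR" && cp "$SPHINX_SRC/api/sphinxapi.php" "$EXT_DIR/"
else
    echo "sphinxapi.php not found under $SPHINX_SRC/api" >&2
fi
```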

Step 8 - Extension Installation - PHP Files
Copy all remaining files of the extension (SphinxSearch.php, SphinxSearch_body.php, etc.) from the temporary directory you extracted the code to into your extensions/SphinxSearch directory.

Step 9 - Extension Installation - Local Settings
Add a require_once line for extensions/SphinxSearch/SphinxSearch.php to your LocalSettings.php.
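
A sketch of that addition, written as a shell command that appends the PHP lines to the file. It assumes LocalSettings.php does not end with a closing ?> tag. The $wgSearchType value 'SphinxMWSearch' is an assumption based on recent versions of the extension; check the SphinxSearch.php you installed before relying on it. LOCAL_SETTINGS is a placeholder variable.

```shell
# Path to LocalSettings.php; defaults to the current directory.
LOCAL_SETTINGS="${LOCAL_SETTINGS:-./LocalSettings.php}"

# The quoted 'EOF' keeps the shell from expanding $wgSearchType and $IP.
cat >> "$LOCAL_SETTINGS" <<'EOF'
// Assumption: search class name used by recent extension versions.
$wgSearchType = 'SphinxMWSearch';
require_once "$IP/extensions/SphinxSearch/SphinxSearch.php";
EOF
```

You can of course paste the PHP lines into the file with an editor instead.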

Troubleshooting
What can I do when it doesn't seem to work? What should I check first? Is there a way to switch to some kind of debug mode?

For those and other questions, please consult the troubleshooting page, which is a collection of some of the more common issues that might happen during an installation.

Options
For the most part, the extension's default options do not need any modification. However, if tweaking is needed or desired, a number of configuration options can be set in LocalSettings.php after the above require_once line. Those are:
 * $wgSphinxSearch_host - the hostname on which sphinx's searchd daemon is running (defaults to localhost)
 * $wgSphinxSearch_port - the port number on which sphinx's searchd daemon is running (defaults to 9312)
 * $wgSphinxSearch_mode - the Sphinx search mode. The default mode is the most intuitive. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_matches - the number of search hits to display per result page.
 * $wgSphinxSearch_weights - the way Sphinx orders the results. The default is pretty good. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_groupby, $wgSphinxSearch_groupsort - define how to group the results. See Sphinx documentation for other valid options.
 * $wgSphinxSearch_sortby - set matches sorting mode (defaults to SPH_SORT_RELEVANCE). See Sphinx documentation for other valid options.
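
As an illustration of the options above, the sketch below appends a few overrides to LocalSettings.php from the shell. The host and port shown are the defaults named in the list; the value 20 for $wgSphinxSearch_matches is a hypothetical choice, and LOCAL_SETTINGS is a placeholder variable.

```shell
# Path to LocalSettings.php; defaults to the current directory.
LOCAL_SETTINGS="${LOCAL_SETTINGS:-./LocalSettings.php}"

# The quoted 'EOF' keeps the shell from expanding the PHP variables.
cat >> "$LOCAL_SETTINGS" <<'EOF'
// Where searchd is listening (these two values are the defaults).
$wgSphinxSearch_host = 'localhost';
$wgSphinxSearch_port = 9312;
// Hypothetical value: show 20 hits per result page.
$wgSphinxSearch_matches = 20;
EOF
```

These lines only take effect if they end up after the require_once line for the extension.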

Did You Mean
When a search query is misspelled, the search results can be greatly impaired. Without knowing about the misspelling, it may take the user a while to figure out why their search results are not very good. That is why this extension has optional "Did You Mean" support. When enabled, this feature suggests a properly spelled search query to the user in case of a spelling mistake. Also, since many wikis use their own jargon, this extension can optionally use a personalized dictionary to make the "Did You Mean" suggestions more reasonable.

This section is being updated. In the meantime, please see: Extension:SphinxSearch/Search suggestions

Stop Words
When modifying the sphinx.conf file (see Step 2 - Configure Sphinx), there is an option for specifying a file containing search stop words. Stop words are common words like 'a' and 'the' that appear frequently in text and should be ignored when searching. A fairly complete list of English stop words can be found here. Simply copy those words into a text file, and modify your sphinx.conf to point to that file with the stopwords directive.
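
A small sketch of the stop word setup. The file location and the six sample words are assumptions for illustration; a real list would be much longer. The last command prints the line to place inside the index definition in sphinx.conf.

```shell
# Hypothetical location for the stop word list; keep it outside the web root.
STOPWORDS_FILE="${STOPWORDS_FILE:-./stopwords.txt}"

# A tiny illustrative list, one word per line.
printf '%s\n' a an and of the to > "$STOPWORDS_FILE"

# Line to add to the index definition in sphinx.conf.
echo "stopwords = $STOPWORDS_FILE"
```

After changing the stop word list, rebuild the indexes so the change takes effect.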

Charsets for all languages
Copy the charset you need from here to the end of the definition of the charset_table in the sphinx.conf file. After doing so you need to run a full index and restart the service. See this post on Sphinx forums for additional details.

Compatibility
MediaWiki prior to 1.9 is not supported. MW from 1.9 to 1.15 requires extension version 0.7.2 or below.

The extension has been shown to work with the following Sphinx versions. Sphinx engine prior to 0.9.9 may require older versions of the extension.
 * 0.9.9 - Works - (Svemir)
 * 1.1.0 beta - Works, but only with SVN version of this extension - (Fungiblename)

The extension has been shown to work with the following languages. See below for

Comparison matrix
The following matrix should help identify commonalities and differences among the various search engines available for MediaWiki. It is a work in progress, and anybody with additional information is encouraged to alter the matrix. Additional information is available about the standard MediaWiki search design deficits, the relative performance of Sphinx and Lucene, and a benchmark study of Sphinx searchd performance by Jon Schutz. For a more general comparison of open source search engines, please see.

ToDo

 * Assign weights to namespaces
 * Sort the results in SPH_SORT_EXTENDED mode by @relevance and by number of times the page has been viewed (available from wiki database). The idea behind this is that given two pages that have the same relevance to the search, if one has been viewed more times, there is probably a reason for that. Number of links to each page could also be included in the calculation. (Not official yet, but here is how you can do this)
 * If originally "Go" was clicked, and "did you mean" link results in a direct match, redirect to that page.
 * Show categories in result list
 * Search separate databases and display combined results
 * Exclude selected categories from search (done in 0.8.4)
 * Support for $wgCompressRevisions
 * Using the search function in templates
 * Bugzilla 30839 Real-time indexes (Sphinx 2.0.1)
 * Use updateAttributes call to update categories and other attributes as soon as the page is updated (at least until RT indexes become stable.)
 * Bugzilla 30869 Display of relevance ranking (search confidence) within search results

Support
For general inquiries, consult the SphinxSearch talk page or the Troubleshooting page; for errors appearing in connection with the extension, file a Bugzilla report. Questions related to the Sphinxsearch software, the Sphinxsearch API, or the Sphinxsearch indexer itself should be directed to the Sphinxsearch forum.

When reporting problems or issues, always include the Sphinxsearch software version, MediaWiki version and extension version to help track down possible areas of impact.

Revisions
Old revisions described here can be downloaded at SourceForge
 * v0.8 - September 7, 2011
   * Still updating the documentation
 * v0.7 - February 17, 2010
   * Added "ignore" checkbox to category search (so only articles that do not have that category are returned.)
   * Smarter handling of multiple Sphinx index files.
   * Added experimental support for excluding categories ($wgUseExcludes)
   * Use addcslashes to escape new sphinx operators (/[]"!)
   * Added a warning when sphinx stats may appear misleading
   * Added 'match titles only' checkbox
   * Added $wgSphinxSearch_index_list (defaults to '*', can be used to set specific list of indexes to search)
   * Added $wgSphinxSearch_index_weights (allows setting different weight per index)
   * Added i18n and alias files for correct way to provide translations
   * Changed the default port to the new official sphinx port (9312)
   * Use listen in sphinx.conf (address directive deprecated)
   * Use autoload directly, no more dependence on ExtensionFunctions
   * Moved things around so it is not necessary to edit SphinxSearch.php anymore
   * Added initial Search API support
   * Added $wgSphinxSearch_sortmode, fixed $wgSphinxSearch_sortby
   * Add nowiki tags around user input
 * v0.6.1 - November 11, 2008
   * Added SphinxSearchGetNearMatch hook - called with $term and $title (or null) returned from SearchEngine::getNearMatch.
   * If PECL SphinxClient is installed, do not include sphinxapi.php.
   * Added $wgSphinxSearch_maxmatches (defaults to 1000) and $wgSphinxSearch_cutoff (default 0) for full control of SetLimits call (and to prevent PECL extension from breaking.)
   * $wgSphinxSearchJSPath can be used to specify a different web path for SphinxSearch.js (for category search.)
   * Search term is now urlencoded when used in URLs (thanks Stas!)
   * Make sure $wgSphinxSuggestMode and $wgSphinxSearchPersonalDictionary are declared before being accessed.
 * v0.6 - August 25, 2008
   * fixed several bugs discovered since 0.6 beta release (or earlier...)
 * v0.6 beta - April 12, 2008
   * category filtering and AJAX-based sub-category filtering
   * various bug fixes
   * compatibility with Sphinx 0.9.8
 * v0.5.3 - October 27, 2007
   * case insensitive Did You Mean suggestions
   * allow for custom ASpell dictionary locations
   * support for editing the Personal Dictionary via special page.
 * v0.5.1 - October 25, 2007
   * fixed a bug where search results with long strings without spaces forced the user to use the horizontal scroll bar.
 * v0.5.0 - October 20, 2007
   * added google-like "Did You Mean" support for misspelled queries
   * fixed a bug for Internet Explorer users where pressing enter in the search form did not act like clicking the Go button
   * have an option to match any or match all terms
   * added the Before and After hooks around the search results
 * v0.4.3 - October 15, 2007
   * when sphinx is not the default search engine, viewing pages 2 and up of the results now actually uses sphinx.
 * v0.4.2 - October 12, 2007
   * when sphinx is not the default search engine, the special page search actually uses sphinx now.
 * v0.4.1 - October 11, 2007
   * made it optional to replace the default search with Sphinx completely. By default, Sphinx search becomes just another special page.
   * fixed a bug when search would crash if a matching article was deleted after last indexer run.
 * v0.3 - October 5, 2007
   * numerous updates and improvements by Svemir Brkic
 * v0.1 - September 24, 2007
   * initial release (RFC)