Extension:SphinxSearch

SphinxSearch

Release status: beta

Implementation: Search, Hook
Description: Replaces built-in MediaWiki search with Sphinx
Author(s): Svemir Brkic, Paul Grinberg
Latest version: 0.8.5 (Sep 11, 2011)
MediaWiki: 1.16+ (use previous versions for older MW)
License: GPL
Example: New World Encyclopedia

Parameters: $wgSphinxSearch_host, $wgSphinxSearch_port, $wgSphinxSearch_index, $wgSphinxSearch_index_list, $wgSphinxSearch_index_weights, $wgSphinxSearch_mode, $wgSphinxSearch_sortmode, $wgSphinxSearch_sortby, $wgSphinxSuggestMode, $wgSphinxSearchAspellPath, $wgSphinxSearchPersonalDictionary, $wgSphinxSearch_maxmatches, $wgSphinxSearch_cutoff, $wgSphinxSearch_weights, $wgSphinxSearchMWHighlighter, $wgSearchType, $wgAdvancedSearchHighlighting, $wgEnableMWSuggest

Hooks used: SphinxSearchBeforeResults, SphinxSearchBeforeQuery


As administrators of MediaWiki-based sites, one of the most common complaints we receive is that the default search engine is far from excellent. Now that Google sets the standard for search engine capabilities, users are no longer happy with a basic search engine; they expect, or rather demand, a faster, easier, and better one.

The Sphinx Search Engine promises exactly that: a full-text search engine that is both flexible and fast. This extension incorporates the Sphinx engine into MediaWiki to provide a better alternative for searching.

Sphinx operates as a standalone server and does not store the document text itself. It builds an index from an SQL query that retrieves documents from a database (for example, the MediaWiki MySQL database), stores that index, and at a later stage returns the corresponding rows that match a search.[1]

Download

Two separate software components are necessary: the Sphinx Search Engine (hereafter called Sphinx) and the SphinxSearch extension (hereafter called the extension).

Sphinx

Sphinx Search Engine.

Extension

SphinxSearch version 0.8.5 supports "intitle:", "incategory:", "prefix:", and other advanced Wikipedia search techniques described here, and it also supports the extended Sphinx search syntax.

Download 0.8.5 git trunk.

Installation Instructions

The steps for installing Sphinx on Windows and on Linux are similar, but for a more comprehensive view of Sphinx on Windows see:

If you are running SQLite instead of MySQL, you might have a look at

Step 1 - Install Sphinx

Follow the instructions. You only need to do the actual installation, which means you do not need to do the "Quick Sphinx usage tour". You can verify your installation by following the rest of the steps here. Note: if installing on a Windows server, you do not need to compile anything; just download the Win32 release binaries.

A more detailed description of the Sphinx Search Engine installation process can be found in the Sphinx Search Beginner's Guide.[2]

Step 2 - Configure Sphinx

Extract the file sphinx.conf from the ../extensions/SphinxSearch directory and copy it into the Sphinx installation directory (we will refer to this file as /path/to/sphinx.conf hereafter). The destination directory should not be web-accessible, so do not keep the working copy in the extensions folder. Make sure to adjust all values to suit your setup:

  • Set correct database, username, and password for your MediaWiki database
  • Update table names in SQL queries if your MediaWiki installation uses a prefix (backslash line breaks may need to be removed if the indexer step below fails)
  • Update the file paths (/var/data/sphinx/..., /var/log/sphinx/...) and create folders as necessary (e.g. for a default Unix install, prepend /usr/local and run mkdir /usr/local/var/data/sphinx).
  • If your wiki is very large, you may want to consider specifying a query range in the conf file.
  • If your wiki is not in English, you will need to change (or remove) the morphology attribute.
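
The folder-creation step in the list above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name make_sphinx_dirs is hypothetical, and it assumes that your sphinx.conf paths sit under <prefix>/var/data/sphinx and <prefix>/var/log/sphinx, as in the default Unix example.

```shell
# Hypothetical helper: create the data and log directories that the
# paths in sphinx.conf are assumed to point at.
# $1 is the install prefix; /usr/local matches a default Unix install.
make_sphinx_dirs() {
    prefix="${1:-/usr/local}"
    mkdir -p "$prefix/var/data/sphinx" "$prefix/var/log/sphinx"
}
```

For a default install you would call make_sphinx_dirs /usr/local as a user who is allowed to write there; mkdir -p makes the call safe to repeat.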

Step 3 - Run Sphinx Indexer

Run the sphinx indexer to prepare for searching:

/path/to/sphinx/installation/bin/indexer --config /path/to/sphinx.conf --all

Once again, make sure to replace the paths to match your installation. This process is actually pretty fast, but clearly depends on how large your wiki is. Just be patient and watch the screen for updates.
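
If the indexer fails because a path is wrong, a thin wrapper can report that before anything runs. The following is a minimal sketch; the function name run_full_index is hypothetical, and both paths are passed in rather than assumed.

```shell
# Hypothetical wrapper around the indexer command shown in this step.
# $1 = path to the indexer binary, $2 = path to sphinx.conf.
run_full_index() {
    indexer_bin="$1"
    conf="$2"
    if [ ! -x "$indexer_bin" ]; then
        echo "indexer not found or not executable: $indexer_bin" >&2
        return 1
    fi
    if [ ! -r "$conf" ]; then
        echo "cannot read config file: $conf" >&2
        return 1
    fi
    # Same invocation as above: build every index defined in the config.
    "$indexer_bin" --config "$conf" --all
}
```

It would be called as run_full_index /path/to/sphinx/installation/bin/indexer /path/to/sphinx.conf, with both paths replaced to match your installation.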

Step 4 - Test Out Sphinx

With older versions of Sphinx, there was a working command-line tool you could use to test the index without starting the searchd daemon. Currently, only a very basic test can be done using that tool, but it will at least provide some confirmation that things are set up correctly. Run this command:

/path/to/sphinx/installation/bin/search --config /path/to/sphinx.conf

You will see the result stats immediately (Sphinx is FAST). Note that the article data you see at this point comes from the sql_query_info setting in the sphinx.conf file. In the extension we can get to the actual article content because we have the text old_id available as an extra attribute. It would be slow to fetch article content on the command line (we would have to join the page, revision, and text tables), so we just fetch page_title and page_namespace at this point.

Note: Even if there are issues at this step, they may be due to the nature of the command tool, which is meant only for debugging. Proceed with the remaining steps.

Step 5 - Start Sphinx Daemon

To make searching the wiki fast, Sphinx must run in daemon mode. Add the following to whatever server startup script you have access to (e.g. /etc/rc.local):

/path/to/sphinx/installation/bin/searchd --config /path/to/sphinx.conf >> /var/log/sphinx/sphinx-startup.log 2>&1

Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.
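
One way to confirm that the daemon survived a restart is to check the process recorded in its pid file. This is a sketch under assumptions: the function name searchd_running is hypothetical, and it assumes your sphinx.conf has a pid_file setting in its searchd section whose path you pass in.

```shell
# Hypothetical check: succeed if the process recorded in a pid file is alive.
# $1 = path to the pid file (assumed to be the one named by pid_file in sphinx.conf).
searchd_running() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric content before signalling anything.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 delivers nothing; it only tests that the process exists
    # and that the current user may signal it.
    kill -0 "$pid" 2>/dev/null
}
```

A monitoring script could run searchd_running with the pid file path and start searchd again when the check fails.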

For Windows see ...

Step 6 - Configure Incremental Updates

To keep the index for the search engine up to date, the indexer must be scheduled to run at a regular interval. If your wiki is small, it's best to comment out wiki_incremental in sphinx.conf and just run the indexer for wiki_main. The reason is that wiki_main and wiki_incremental are additive only. Words that have been removed since wiki_main was updated will still appear even after wiki_incremental is run.

On most UNIX systems edit your crontab file by running the command:

crontab -e

Add this line to set up a cron job for the full index - for example once every night:

0 3 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_main --rotate >/dev/null 2>&1; /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

Add this line to set up a more frequent cron to update the smaller index regularly:

0 9,15,21 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

As before, make sure to adjust the paths to suit your configuration. Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index file while it is being used. It creates a new file and copies it over the existing one when it is done.
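
The nightly entry is long, so some administrators may prefer to keep the logic in a script and call that from cron. The following is a minimal sketch, not part of the extension: the function name reindex_all is hypothetical, and unlike the crontab line above it stops if the wiki_main run fails.

```shell
# Hypothetical cron wrapper: rebuild wiki_main, then wiki_incremental.
# $1 = path to the indexer binary, $2 = path to sphinx.conf.
reindex_all() {
    indexer_bin="$1"
    conf="$2"
    for index in wiki_main wiki_incremental; do
        # --rotate is used because searchd is expected to be running
        # while the index is rebuilt.
        "$indexer_bin" --quiet --config "$conf" "$index" --rotate || return 1
    done
}
```

A script that defines this function and ends with a call such as reindex_all /path/to/sphinx/installation/indexer /path/to/sphinx.conf could then replace the long command in the crontab entry.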

On Windows, commands like these inside a batch file should do the trick, provided you previously created the .CMD files running the indexer:

 at 23:00 /INTERACTIVE /every:M,T,W,TH,F,S,Su "%~dp0%__IndexMain__.cmd"
 at 08:00 /INTERACTIVE /every:M,T,W,TH,F,S,Su "%~dp0%__IndexIncr__.cmd"

Note that those tasks will only be manageable by the "at" command, and not through the control panel "Scheduled tasks" interface.

Also, adjust the SQL query for the src_wiki_incremental source in sphinx.conf to match the time in the crontab for wiki_main, keeping in mind that MediaWiki may be storing the times in UTC while the server that runs the cron may be using a different time zone.
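
To see how far the server's clock is from UTC, it can help to print both side by side. This is a small sketch; the function name mw_utc_now is hypothetical, and the 14-digit YYYYMMDDHHMMSS layout is the timestamp format MediaWiki uses on MySQL, so check it against your own database before relying on it.

```shell
# Hypothetical helper: print the current time as a 14-digit UTC
# timestamp (YYYYMMDDHHMMSS).
mw_utc_now() {
    date -u +%Y%m%d%H%M%S
}

# Show the local wall-clock time next to the UTC timestamp.
echo "server local time: $(date +%H:%M)  UTC timestamp: $(mw_utc_now)"
```

If the two hours differ, the time used in the src_wiki_incremental query needs the same offset applied relative to the crontab entry.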

Step 7 - Extension Preparation - SphinxSearch Folder

Create a 'SphinxSearch' folder, either by extracting a compressed file or by downloading via SVN, and place it within the main MediaWiki 'extensions' folder.

Step 8.1 - Extension Preparation - Sphinx PHP API

The sphinxapi.php file is part of the Sphinx download tarball (tar.gz). Extract the tarball and you will find sphinxapi.php under the api/ directory. Copy this file into the SphinxSearch folder inside the main MediaWiki 'extensions' folder. You will need to copy this file again each time you update the Sphinx engine.
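
Because this copy has to be repeated after every Sphinx upgrade, it may be worth scripting. The following is a minimal sketch; the function name install_sphinx_api is hypothetical, and both directory arguments are assumptions about where you unpacked Sphinx and where MediaWiki lives.

```shell
# Hypothetical helper: copy sphinxapi.php from an unpacked Sphinx tarball
# into the extension folder.
# $1 = unpacked Sphinx source directory, $2 = MediaWiki root directory.
install_sphinx_api() {
    src="$1/api/sphinxapi.php"
    dest="$2/extensions/SphinxSearch"
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "missing extension folder: $dest" >&2; return 1; }
    cp "$src" "$dest/"
}
```

The helper refuses to run if either location is missing, which catches the common case of pointing it at the wrong directory.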

Step 8.2 - Extension Installation - PHP Files

Copy all remaining files of the extension (SphinxSearch.php, SphinxMWSearch.php, SphinxSearch_setup.php, SphinxSearch.i18n.php as of 2012-10-23) to your extensions/SphinxSearch directory.

Step 9 - Extension Installation - Local Settings

In the file LocalSettings.php (for more help, please see the LocalSettings.php manual) in the main MediaWiki directory, add the following lines:

$wgSearchType = 'SphinxMWSearch';
require_once "$IP/extensions/SphinxSearch/SphinxSearch.php";

Step 10 - Show Sphinx Search Support

If you want to let the general public know that you are using Sphinx as the back-end search engine, you can add the following lines to your SphinxSearch.php. The logo can be downloaded from [3] and copied into the directory .../extensions/SphinxSearch/skins/images/

$wgFooterIcons['poweredby']['sphinxsearch'] = array(
	'src' => "$wgScriptPath/extensions/SphinxSearch/skins/images/Powered_by_sphinx.png",
	'url' => 'http://www.mediawiki.org/wiki/Extension:SphinxSearch',
	'alt' => 'Search Powered by Sphinx',
);

Troubleshooting

What can I do when it doesn't seem to work? What should I check first? Is there a way to switch to some kind of debug mode?

For those and other questions, please consult the troubleshooting page, which is a collection of some of the more common issues that might happen during an installation.

Configuration

Options

For the most part, the extension's default options do not need any modification. However, if tweaking is needed or desired, a number of configuration options can be set in LocalSettings.php after the above require_once line. Those are:

  • $wgSphinxSearch_host - the hostname on which sphinx's searchd daemon is running (defaults to localhost)
  • $wgSphinxSearch_port - the port number on which sphinx's searchd daemon is running (defaults to 9312)
  • $wgSphinxSearch_mode - the Sphinx search mode. The default mode is the most intuitive. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_matches - the number of search hits to display per result page.
  • $wgSphinxSearch_weights - the way Sphinx orders the results. The default is pretty good. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_groupby, $wgSphinxSearch_groupsort - define how to group the results. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_sortby - the sorting mode for matches (defaults to SPH_SORT_RELEVANCE). See Sphinx documentation for other valid options.

Search Box "As-You-Type" Suggestions

  • $wgEnableSphinxPrefixSearch - set to true to return suggestions from sphinx index by matching the query against the beginning of page titles.

Note: If you are using an older version of MW, you may need to set $wgEnableMWSuggest to true to enable search box suggestions.

Namespaces

A description of how to change the default namespaces can be found here.

Did You Mean

When a search query is misspelled, the search results can be greatly impaired. Without knowing about the misspelling, it may take the user a while to figure out why the results are not very good. That is why this extension has optional "Did You Mean" support. When enabled, this feature suggests a properly spelled search query in case of a spelling mistake. Also, since many wikis use their own jargon, the extension can optionally use a personalized dictionary to make the "Did You Mean" suggestions more reasonable.

This section is being updated. In the meantime, please see: Extension:SphinxSearch/Search suggestions

Stop Words

When modifying the sphinx.conf file (see #Step 2 - Configure Sphinx), there is an option for specifying a file containing search stop words. Stop words are common words such as 'a' and 'the' that appear frequently in text and should be ignored when searching. Fairly complete lists of English stop words can be found at [4], [5] and [6]. Simply copy those words into a text file, and modify your sphinx.conf to point to that file with

stopwords = /path/to/stopwords.txt
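
If you collect words from several lists, they can be merged into one file before pointing sphinx.conf at it. This is a sketch under assumptions: the function name build_stopwords and the file names are hypothetical, and it assumes a plain-text file with one word per line is acceptable, which matches the plain-text stop word format Sphinx reads.

```shell
# Hypothetical helper: merge word lists into one stop-word file.
# $1 = output file, remaining arguments = input word lists.
build_stopwords() {
    out="$1"
    shift
    # Lowercase, put one word per line, drop blank lines, remove duplicates.
    cat "$@" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -s '[:space:]' '\n' \
        | sed '/^$/d' \
        | LC_ALL=C sort -u > "$out"
}
```

For example, build_stopwords /path/to/stopwords.txt list1.txt list2.txt writes the merged list to the path named in the stopwords setting; the indexer has to be run again before the change takes effect.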

Sphinx Indexing Performance

Please have a look at How To Improve Sphinx Indexing Performance for more details.

Charsets for all languages

Copy the charset you need from here to the end of the definition of the charset_table in the sphinx.conf file. After doing so you need to run a full index and restart the service. See Sphinx forum or How to tell Sphinx that your document has CJK characters? for additional details.

Compatibility

MediaWiki prior to 1.9 is not supported. MW from 1.9 to 1.15 requires extension version 0.7.2 or below.

MW | Sphinx engine | MW Sphinx Search | Status | Description
1.16 | ? | ? | Yes | works on CentOS 5.5 LAMP server (contact:tigerheight@gmail.com)
1.16.2 | ? | ? | Yes | (Fungiblename)

Sphinx Search 0.7.2
1.16.5 | 1.10-beta (r2420) | 0.7.2 | Yes | PHP 5.2.13 (apache2handler), MySQL 5.1.44-community, Windows Vista --MWJames 18:35, 3 June 2011 (UTC)
1.17.0 | 1.10-beta (r2420) | 0.7.2 | Yes | PHP 5.2.13 (apache2handler), MySQL 5.1.44-community, Windows Vista --MWJames 18:35, 3 June 2011 (UTC)
1.19alpha (r92860) | 0.9.9-6 | 0.7.2 (r92860) | Yes | PHP 5.3.5-1ubuntu7.2 (apache2handler), MySQL 5.1.54-1ubuntu4, --Jeroen De Dauw 17:05, 22 July 2011 (UTC)

Sphinx Search 0.8+
1.16.5 | 2.0.5-release (r3308) | 0.8.5 | Yes | PHP 5.2.17 (cli), MySQL 4.1.22-community-nt, Server 2008 R2 Datacenter --Steevie (talk) 16:26, 5 August 2012 (UTC)
1.17.0 | 1.10-beta (r2420) | 0.8.2 | Yes | PHP 5.3.5 (apache2handler), MySQL 5.5.8, Windows Vista --MWJames 03:39, 10 September 2011 (UTC)
1.18.0 | 2.0.3-release (Dec 2011) | 0.8.5 | Yes | PHP 5.3.8 (apache2handler) MySQL 5.5.16, Windows Vista --MWJames 01:38, 4 January 2012 (UTC)
1.18.0 | 2.0.4-release (r3135) | 0.8.5 | Yes | PHP 5.3.8 (apache2handler) MySQL 5.5.16, Windows Vista --MWJames (talk) 00:17, 6 May 2012 (UTC)
1.19.0 | 2.0.4-0ubuntu1 release | 0.8.5 | Yes | PHP 5.3.10, Apache 2.2.22, MySQL-Client-Version: 5.5.24, Linux Ubuntu 12.04 --SmartK (talk) 12:43, 16 August 2012 (UTC)
1.19.2 | 2.0.5-release | 0.8.5 | Yes | PHP 5.4.6, Apache 2.2.14, Oracle Client 10.2.0 + sqlrelay-0.46, Solaris 2.10 David Taylor (talk) 18:33, 16 September 2012 (UTC)
1.20.2 | 2.0.4-0ubuntu1 release | 0.8.5 | Yes | PHP 5.3.10, Apache 2.2.22, MySQL-Client-Version: 5.5.28, Linux Ubuntu 12.04 --SmartK (talk) 16:11, 2 January 2013 (UTC)
1.21.0 | 2.1.1-beta Win64 w/MySQL (30-08-2013) | 0.8.5 | Yes | PHP 5.4.16 TS (Win32, Apache 2.0 Handler), Apache 2.4.4 (Win32), MySQL Server 5.6 (64 bit), Windows Server 2008 SP1 64-bit --Jongfeli 13:11, 30 August 2013 (UTC)
1.21.1 | 2.0.4-0ubuntu1 release | 0.8.5 | Yes | PHP 5.3.10-1ubuntu3.7 (apache2handler), Apache 2.2.22, MySQL-Client-Version: 5.5.32-0ubuntu0.12.04.1, Linux Ubuntu 12.04 --SmartK (talk)
1.22.1 | 2.1.3-release | 0.8.5 | Yes | PHP 5.4.22, Apache 2.2.24, MySQL 5.6.14, OSX 10.9.1 --Svemir Brkic (talk) 00:45, 27 January 2014 (UTC)

The extension has been shown to work with the following Sphinx versions. Sphinx engine versions prior to 0.9.9 may require older versions of the extension.

  • 0.9.9 - Works - (Svemir)
  • 1.1.0 beta - Works, but only with SVN version of this extension - (Fungiblename)
  • 2.1.3 seems to work with SphinxSearch-ad8780e (Centos 64b, xampp 1.8.3-2, MySQL 5.6.14, MW 1.22.0)

The extension has been shown to work with the following languages. See also #Charsets for all languages.

Language | Status | Description
English | Works | all versions - (Alpha3)
German | Works | W2k3 and IIS - (80.152.175.189)
German | Works | apache2 on Ubuntu 12.04 - (SmartK)
Chinese | Works | MW1.15 + XAMPP + SphinxSearch 0.7 (MarkYin)
Chinese | Works | Win2003 wamp 1.7.3 - (Alpha3)
Chinese | Works | RHEL 5.4 + Nginx + Mediawiki With HTTPS - (atyu30)
Chinese | Works | OpenBSD 4.5 - (atyu30)
Russian | Works | (XAMPP, Debian) - StasFomin.
Hebrew | Works | W2K3 and IIS - CrushKing.
Japanese | Works | MediaWiki 1.16.5, PHP 5.2.13 (apache2handler), MySQL 5.1.44-community, SphinxSearch (Version 0.7.2), Sphinx 1.10-beta (r2420), Windows Vista --MWJames 18:35, 3 June 2011 (UTC)

Comparison matrix

The following matrix should help identify commonalities and differences among the various search engines available for MediaWiki. It is a work in progress, and anybody with additional information is encouraged to alter the matrix. Additional information is available on the design deficits of the standard MediaWiki search[3], on the performance of Sphinx compared with Lucene[4][5], and in a benchmark study of Sphinx searchd performance by Jon Schutz[6]. For a more general comparison of open source search engines, please see [7].

Topic | Sphinx Search (1) | MWSearch (2) | EzMwLucene (2) | Lucene-search (2) | ZSL (2) | Extension:SolrStore
OS System
  • Windows
  • Linux
  • Reportedly running on Windows 2003 server
  • Linux
  • Windows
  • Linux
  • Only needs a PHP environment
  • Windows
  • Linux
Requirements
  • MediaWiki 1.16
  • Sphinx Search 0.8+
  • Sphinx search engine 0.9+
  • MediaWiki 1.13+ (1.11.1 works with only basic features)
  • MediaWiki 1.13
  • Java 1.6+
  • php_curl package
  • MediaWiki 1.16+
  • PHP Zend Framework 1.11-1.12,
  • PHP >=5.2.3
  • MediaWiki 1.16+
  • PHP 5
Features
Search Syntax
  • Proximity search
  • Boolean search
  • Phrases in double quotes
  • Wildcard search
  • Exclusion
  • Supporting intitle:, incategory:, prefix:
  • Proximity search
  • Boolean search
  • Phrases in double quotes
  • Wildcard search
  • Exclusion
  • Supporting intitle:, incategory:, prefix:
Result weighting
Indexing
Performance
Miscellaneous
  • External Sphinx search engine
  • Simplified Lucene search to Mediawiki
  • It is based on Lucene search API
  • Provide ranking based on number of backlinks, distributed searching and indexing
  • PHP based on Java-based original Apache Lucene
  • Java-based Apache Solr/Lucene
Target group Full-text search Semantic supported search
(1) Data corresponds to Sphinx Search 0.8+

(2) Data are copied from the corresponding pages; changes might have occurred.

Feature requests

Support

For general inquiries, consult the SphinxSearch talk page or the Troubleshooting page; for errors that appear in connection with the extension, file a Bugzilla report. Questions related to the Sphinxsearch software, the Sphinxsearch API, or the Sphinxsearch indexer itself should be directed to the Sphinxsearch forum.

When reporting problems or issues, always include the Sphinxsearch software version, the MediaWiki version, and the extension version to help track down possible areas of impact.

Revisions

Revisions prior to version 0.8 are listed in the revisions log, and old revisions can still be downloaded from SourceForge.

  • v0.8 - September 7, 2011
    • Use of standard MW search interface
    • Support of individual indexed columns weight
    • Support of three different suggestion modes (enchant, soundex, aspell)
    • Still updating the documentation

See also

  • Rhea/Assimi is a visual search engine using the SphinxSearch extension to MediaWiki.
  • despite-behavior.com is a French installation guide.
  • Mars Tekkom DK is a site using an older version of this extension, and it has good installation instructions.
  • IEEE Global History Network uses SphinxSearch on its Global History Network wiki to search documentation, analysis, and explanation of the history of electrical, electronic, and computer technologies.

Notes

  1. Sphinx & MySQL: facts and misconceptions
  2. Ali, A. (2011). Sphinx Search Beginner's Guide, Packt Publishing, ISBN 9781849512541
  3. An evaluation of the standard Mediawiki search
  4. Choosing a stand-alone full-text search server: Sphinx or SOLR?
  5. Performance comparison between Sphinx and Lucene
  6. Sphinx Search Engine Comparative Benchmarks, a comparative benchmark study of Sphinx searchd performance that looks at the effect of distributing an index across multiple CPU cores and/or multiple CPUs
  7. Christian Middleton, Ricardo Baeza-Yates. A Comparison of Open Source Search Engines