Extension:Lucene-search

Lucene-search is a search engine extension for large MediaWiki websites; it is currently the search engine for Wikimedia wikis. (Smaller sites might want to consider SphinxSearch.) It extends the Apache Lucene search API with ranking based on the number of backlinks, distributed searching and indexing, wiki text parsing, incremental updates, and more.

Lucene-search requires a front-end extension to fetch the results from the search engine:
 * Extension:MWSearch (for MediaWiki version 1.13+)
 * Extension:LuceneSearch (for MediaWiki version prior to 1.13)

Versions

 * 2.1 (development), used on all Wikimedia Foundation wikis. Features:
   * "Did you mean..." suggestions
   * Result highlighting
   * Advanced ranking based on term proximity, relatedness and anchor text

 * 2.0.2 (stable); see the Extension:Lucene-search/2.0 docs. Features:
   * Distributed search
   * Scalability
   * Basic ranking
   * Accentless search

The following documentation is for the latest development version (2.1). The old documentation is at Extension:lucene-search/2.0 docs.

Requirements

 * Linux
 * Java 6+ JDK (OpenJDK or Sun)
 * Apache Ant 1.6
 * Rsync (required for the distributed architecture)
 * Subversion client

Note to Windows users: From version 2.0 onward, the LSearch daemon doesn't support the Windows platform (since it uses hard and soft file links). You can still use the old daemon written in C#. See the installation instructions.

Single Host Setup
1. If using MediaWiki version 1.17 or earlier, ensure that AdminSettings.php is set up. Rename AdminSettings.sample to AdminSettings.php and modify it so that it contains:

$wgDBadminuser = "database_admin_username";
$wgDBadminpassword = "database_admin_password";

2. Get Lucene-search, either by:
 * downloading the binary release and unpacking it, or
 * downloading the source from Subversion and running "ant" to build the jar (recommended):

ant

3. Generate configuration files by running:

./configure 


 * This script examines your MediaWiki installation and generates matching configuration files. Before running configure, you may customize some options in template/simple/lsearch-global.conf, for example the language option. See /2.0 docs for more details about these options.

4. If configure finished without errors, build the indexes:

./build


 * This builds the search, highlight and spellcheck indexes from an XML database dump. For small wikis, just run this script from a daily cron job and the installation is done; move on to Running.


 * For larger wikis, install the Extension:OAIRepository MediaWiki extension and, after building the initial index, use the incremental updater:

./update


 * This fetches the latest updates from your wiki and updates the various indexes with search, page link and spellcheck data. Run it from a daily cron job to keep the indexes up to date.
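One way to run the build or update step from cron is a small wrapper script. The following is only a sketch: the installation path /usr/local/search/lucene-search-2.1, the LSEARCH_HOME variable and the function name run_index_job are assumptions made for illustration and are not part of Lucene-search.

```shell
#!/bin/sh
# Hypothetical cron wrapper for the Lucene-search index scripts.
# LSEARCH_HOME is an assumed install location; adjust it to your setup.
LSEARCH_HOME="${LSEARCH_HOME:-/usr/local/search/lucene-search-2.1}"

# Run ./build or ./update from the installation directory.
run_index_job() {
    job="$1"
    case "$job" in
        build|update) ;;
        *)
            echo "usage: run_index_job build|update" >&2
            return 2
            ;;
    esac
    # Use a subshell so the caller's working directory is left unchanged.
    ( cd "$LSEARCH_HOME" && "./$job" )
}

# When called with an argument (for example from cron), run that job.
if [ "$#" -gt 0 ]; then
    run_index_job "$1"
fi
```

A crontab entry such as "0 3 * * * /usr/local/bin/lsearch-job.sh update" (the script path is hypothetical) would then run the updater daily at 03:00.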

5. Start the daemon. Do this by running:

./lsearchd


 * The Lucene-search daemon must be running for searching to work. If you want to set up lsearchd to start automatically when the server boots, you can use this init.d script (for Ubuntu 10.04).

6. Install Extension:MWSearch and make sure to set $wgLuceneSearchVersion = 2.1.

Running
Once the indexes have been built and MWSearch installed, run the daemon:

./lsearchd

The daemon listens on port 8123 for incoming search requests from MediaWiki, and on port 8321 for incoming incremental updates to the index. The MWSearch extension reroutes all search requests to this daemon.

You can test the search results by browsing to a URL of the form http://<host>:8123/search/<dbname>/<query>. For example, http://localhost:8123/search/wikidb/hello.
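The same check can be scripted from the command line. The sketch below builds a URL following the pattern of the example above and fetches it with curl. The function names lsearch_url and lsearch_query are hypothetical, and the sketch assumes the query contains no characters that need URL encoding.

```shell
#!/bin/sh
# Build the lsearchd test URL, following the example
# http://localhost:8123/search/wikidb/hello
# Assumes the query contains no characters that need URL encoding.
lsearch_url() {
    host="$1"
    dbname="$2"
    query="$3"
    port="${4:-8123}"   # 8123 is the search port the daemon listens on
    printf 'http://%s:%s/search/%s/%s\n' "$host" "$port" "$dbname" "$query"
}

# Fetch the raw result for a query; needs curl and a running daemon.
lsearch_query() {
    curl --silent --show-error --fail "$(lsearch_url "$@")"
}
```

For instance, "lsearch_query localhost wikidb hello" requests the example URL; a non-zero exit status suggests the daemon is not reachable or returned an HTTP error.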

Further instructions
This extension supports many advanced options, such as a distributed search architecture, index updates with custom rotation exceptions, and multiple wikis. However, the documentation for these options is currently scattered across Javadoc strings. The old documentation and this page's talk page might provide further information.

A brief discussion of the algorithms used is available at User:Rainman/search internals.

... put the indexer on a different host
 * Make sure Java is installed on the new indexer host.
 * Look at the [Index] section of lsearch-global.conf. Replace the search host name in this section with the new indexer host name.
 * Copy your lucene-search installation (with config files and indexes) to the new indexer host.
 * On the indexer, edit /etc/rsyncd.conf and add these lines:

[search]
path = <local path to indexes>
comment = Lucene Search 2 index data
read only

The local path to indexes is just the indexes/ subdirectory of your lucene-search installation on the indexer.
 * Run the rsync daemon via rsync --daemon.
 * Transfer the appropriate cron jobs (e.g. build or update) to the indexer.
 * (Re)start lsearchd on the indexer and the searcher.

After a new index is built on the indexer (e.g. via a daily cron job), the searcher will pick it up, transfer it and use it.
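Before relying on the automatic transfer, it can help to confirm from the searcher that the rsync module on the indexer is reachable. The sketch below is one possible check: the function names index_module_addr and check_index_module are hypothetical, and the module name "search" is taken from the [search] section added to rsyncd.conf.

```shell
#!/bin/sh
# Build the rsync daemon address for the index module: <host>::<module>/
# "search" matches the [search] section name in rsyncd.conf.
index_module_addr() {
    indexer_host="$1"
    module="${2:-search}"
    printf '%s::%s/\n' "$indexer_host" "$module"
}

# List the module contents from the searcher; needs rsync and network access.
# With a source and no destination, rsync only lists the files.
check_index_module() {
    rsync "$(index_module_addr "$@")"
}
```

Running "check_index_module indexer.example.org" (host name hypothetical) on the searcher should list the contents of the indexes/ directory if the daemon and the module are set up as described.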

Note, however, that this produces two different sets of configuration files, and you need to update both of them on any subsequent change. A better idea is to share the lsearch-global.conf file via NFS, or to put it at a URL (to do this, edit the lsearch-global.conf location in lsearch.conf).