Extension:Lucene-search

Lucene-search is a search engine designed to index and search MediaWiki content on large websites. It is based on the Lucene search API, which it extends to provide ranking based on the number of backlinks, distributed searching and indexing, parsing of wikitext, incremental updates, and more. This is the search engine currently used on Wikimedia wikis.

MediaWiki can use Extension:LuceneSearch (before MediaWiki 1.13) or Extension:MWSearch (MediaWiki 1.13 and later) to fetch results from this search engine.

Versions

 * 2.1 (devel) - enabled on English Wikipedia (last announcement)
 * Features: "Did you mean..", highlighting, ranking based on proximity, relatedness and anchor text


 * 2.0.2 (stable) - running on other Wikimedia projects
 * Features: distributed search, scalability, basic ranking, accentless search

The following documentation is for the latest development version (2.1). The old documentation is at Extension:lucene-search/2.0 docs.

Installation
Requires: Linux, Java 5, Apache Ant 1.6, Rsync (for distributed architecture)


 * Note for Windows users: the LSearch daemon from version 2.0 onward does not support Windows, since it uses hard and soft file links. (It should be possible to get this to work in Vista with enough fiddling.) You can still use the old daemon written in C#; the installation instructions are at m:Installing lucene search.

The rest of the documentation assumes Linux.


 * Download the binary and unpack it. Alternatively, get the latest version from svn and run "ant" to build the jar.
 * Generate configuration files by running:

./configure 


 * This script will examine your MediaWiki installation, and generate configuration files to match your installation.


 * If that completed without any exceptions, build the indexes:

./build


 * This will build the search, highlight and spellcheck indexes from an XML database dump. For small wikis, just run this script from a daily cron job and the installation is done.


 * For larger wikis, install the Extension:OAIRepository MediaWiki extension and, after building the initial index, use the incremental updater:

./update


 * This will fetch the latest updates from your wiki and update the various indexes with search, page link and spell check data. Run this from a daily cron job to keep the indexes up to date.


 * Install Extension:MWSearch and make sure to set $wgLuceneSearchVersion = 2.1.
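
The daily cron step for ./update described above could be wrapped roughly as follows. This is only a sketch: the function name run_update, the install directory, the log file and the script location in the crontab comment are all assumptions for illustration and are not part of lucene-search.

```shell
#!/bin/sh
# Hypothetical cron wrapper for the incremental updater.
# Usage: lsearch-update.sh INSTALL_DIR LOG_FILE

run_update() {
    # $1 = lucene-search install directory (assumed), $2 = log file (assumed).
    # Run from the install directory so ./update sees the generated
    # configuration files; append all output to the log.
    ( cd "$1" && ./update ) >> "$2" 2>&1
}

# Only run when both arguments are supplied, e.g. from cron.
if [ "$#" -eq 2 ]; then
    run_update "$1" "$2"
fi

# Example crontab entry (paths are assumptions), running daily at 03:00:
#   0 3 * * * /usr/local/bin/lsearch-update.sh /usr/local/lucene-search /var/log/lsearch-update.log
```

For a small wiki that rebuilds from a dump every day, the same wrapper shape would apply with ./build in place of ./update.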

Running
Once the indexes have been built and MWSearch installed, run the daemon:

./lsearchd

The daemon will listen on port 8123 for incoming search requests from MediaWiki, and on port 8321 for incoming incremental updates to the index. The MWSearch extension will reroute all search requests to this daemon.
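
To confirm that the daemon is accepting connections on the two ports named above, a quick check along these lines may help. It is a sketch that assumes bash (for its built-in /dev/tcp redirection) and a daemon running on localhost; the helper name port_open is made up for this example.

```shell
#!/bin/bash
# port_open HOST PORT: returns 0 if a TCP connection to HOST:PORT succeeds.
# Relies on bash's /dev/tcp redirection, so plain sh will not work here.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# 8123 = search requests, 8321 = incremental updates.
# "localhost" is an assumption for a single-host setup.
for port in 8123 8321; do
    if port_open localhost "$port"; then
        echo "port $port: listening"
    else
        echo "port $port: not reachable"
    fi
done
```

This only shows that something is listening on each port; it does not verify that the indexes loaded correctly or that searches return results.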

Further instructions
This extension supports a range of advanced options, such as a distributed search architecture, index updates with custom rotation exceptions, and multiple wikis. However, the documentation for these options is currently scattered across Java doc strings in the source. The old documentation and this page's talk page might provide further information.

A brief overview of the algorithms used is available at User:Rainman/search internals.