Extension:Lucene-search

Lucene-search is a search engine back end for large MediaWiki websites. It is the search engine used by Wikimedia wikis. (Smaller sites might want to consider SphinxSearch.) Lucene-search extends the Apache Lucene search API with page ranking based on the number of backlinks, distributed searching and indexing, parsing of wiki text, incremental updates, and more.

Lucene-search requires a front-end extension to fetch the results from the search engine:
 * Extension:MWSearch (for MediaWiki version 1.13+)
 * Extension:LuceneSearch (for MediaWiki versions prior to 1.13)

Versions

 * 2.1 (development) - used on all Wikimedia Foundation wikis. Features:
 ** Result highlighting
 ** "Did you mean..." query correction (spell checking)
 ** Advanced ranking capabilities based on term proximity, relatedness and anchor text

 * 2.0.2 (stable) - see Extension:Lucene-search/2.0 docs. Features:
 ** Distributed search
 ** Scalability
 ** Basic ranking
 ** Accentless search
The following documentation is for the latest development version (2.1). The old documentation is at Extension:lucene-search/2.0 docs.

Requirements

 * Linux
 * Java 6+ JDK (OpenJDK or Sun)
 * Apache Ant 1.6
 * Rsync (required for the distributed architecture)
 * Subversion client

Note to Windows users: From version 2.0 onward, the LSearch daemon doesn't support the Windows platform (since it uses hard and soft file links). You can still use the old daemon written in C#. See the installation instructions.

Single Host Setup (MediaWiki & Lucene-Search On The Same Host)
1. If you are using MediaWiki 1.17 or earlier, ensure that AdminSettings.php is set up. Rename AdminSettings.sample to AdminSettings.php and modify it so that it contains:

$wgDBadminuser = "database_admin_username";
$wgDBadminpassword = "database_admin_password";

2. Get Lucene-search:
 * Download the binary release and unpack it.
 * Download the source from Subversion.
 * Run "ant" to build the jar (recommended).

ant

3. Generate configuration files by running:

./configure 


 * This script will examine your MediaWiki installation and generate configuration files to match it. Before running configure, you may customize some options in template/simple/lsearch-global.conf, for example the language option. See /2.0 docs for more details about these options.

4. If everything ran without exceptions, build the indexes:

./build


 * This will build the search, highlight and spellcheck indexes from an XML database dump. For small wikis, just put this script into a daily cron job and installation is done; move on to running the daemon.


 * For larger wikis, install the Extension:OAIRepository MediaWiki extension and, after building the initial index, use the incremental updater:

./update


 * This will fetch the latest updates from your wiki and update the various indexes with search, page link and spell check data. Put this into a daily cron job to keep the indexes up to date.

5. Start the daemon. Do this by running:

./lsearchd


 * The Lucene-search daemon needs to be started in order for searching to work. If you want to set up lsearchd to start automatically when the server boots, you can use this init.d script (for Ubuntu 10.04).

6. Install Extension:MWSearch and make sure to set $wgLuceneSearchVersion = 2.1.

7. Once the indexes have been built and MWSearch installed, run the daemon:

./lsearchd

The daemon will listen on port 8123 for incoming search requests from MediaWiki, and on port 8321 for incoming incremental updates for the index. The MWSearch extension will reroute all search requests to this daemon.
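As a quick sanity check after starting the daemon, the two ports mentioned above can be probed with a plain TCP connection. The following is only a sketch: the helper names `port_is_open` and `check_daemon` are made up for this illustration, and `localhost` is an assumption that matches the single-host setup. It only confirms that something is listening, not that searches return sensible results.

```python
import socket

# Ports as stated in the text above: 8123 for search requests,
# 8321 for incremental index updates.
LSEARCHD_PORTS = {"search": 8123, "incremental updates": 8321}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_daemon(host: str = "localhost") -> dict:
    """Probe both lsearchd ports on the given host (hypothetical helper)."""
    return {name: port_is_open(host, port) for name, port in LSEARCHD_PORTS.items()}


if __name__ == "__main__":
    for name, ok in check_daemon().items():
        print(f"{name}: {'listening' if ok else 'not reachable'}")
```

If either port is reported as not reachable, check that lsearchd is still running and that no firewall blocks the port.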

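You can test the search by browsing to an HTTP URL on the daemon's search port, such as http://<host>:8123/search/...

The sections described below belong to the global configuration file (lsearch-global.conf). Each of these sections needs to be updated to use the correct host name and database.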



[Database] Section

 * List the databases to be indexed in the [Database] section. Each entry names either:
 ** a database, using the name set in $wgDBname in your MediaWiki LocalSettings.php file, or
 ** a database list, given as a URI in braces.

 * Each entry is followed by options:
 ** The distributed index configuration, whose type takes one of these values:
 *** single - the index is not distributed.
 *** mainsplit - a two-part index: mainspace holds namespace [0]; restspace holds all other namespaces. (recommended)
 *** split - documents are assigned randomly to one of N index parts.
 *** nssplit - split by namespace list, written (nssplit,number).
 ** Optional settings for the index:
 *** true to optimize while indexing, false to skip.
 *** the size of the document cache; default is XXX.
 *** the merge factor; default is YY.
 *** the number of subdivisions; default is YY.
 ** (language,en) - the default language and stemming type.
 ** (warmup,NUM) - optional. Warms up the index after an update by running NUM queries against it. This smooths the transition in performance by ensuring that indexes are always well cached and buffered.

An Example
[Database]
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2)
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)

Further example entries:
 * wikilucene : (single) (language,en) (warmup,100)
 * wikidev : (single) (language,sr)
 * splitLucene : (nssplit,3), (nspart1,[0]), (nspart2,[4,5,12,13]), (nspart3,[])
 * wikilucene : (language,en) (warmup,10)


 * {file:///home/wikipedia/common/pmtpa.dblist} - all the databases listed at this URI should be indexed.
 * wikilucene : (single) (language,en) (warmup,100) - declares that wikilucene is a single (nondistributed) index and that it should use English as its default language, and thus use English stemming. The optional (warmup,100) instructs Lucene to apply 100 queries to the index when an updated version is fetched, to warm it up.
 * splitLucene : (nssplit,3), (nspart1,[0]), (nspart2,[4,5,12,13]), (nspart3,[]) - declares that splitLucene is a distributed index with three parts:
 ** nspart1, which will store the index for namespace 0
 ** nspart2, which will store the index for namespaces 4, 5, 12 and 13
 ** nspart3, which will store the index for the other namespaces
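To make the shape of a [Database] entry more concrete, here is a minimal sketch that breaks an entry such as `wikilucene : (single) (language,en) (warmup,100)` into its name and its parenthesized options. The functions `parse_database_entry` and `split_top_level` are hypothetical helpers written for this page; they are not part of Lucene-search, and the real configuration parser may behave differently.

```python
import re


def split_top_level(text: str) -> list:
    """Split on commas that are not inside [...] (e.g. a namespace list)."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    parts.append(current.strip())
    return parts


def parse_database_entry(line: str):
    """Return (name, options) for a line of the form 'name : (a,b) (c,d)'.

    Each option becomes a tuple of strings; values are not interpreted.
    """
    name, _, rest = line.partition(" : ")
    options = [tuple(split_top_level(group))
               for group in re.findall(r"\(([^()]*)\)", rest)]
    return name.strip(), options


if __name__ == "__main__":
    print(parse_database_entry("wikilucene : (single) (language,en) (warmup,100)"))
```

Values are deliberately kept as strings, because the meaning of each position depends on the option type.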

[Search-Group] Section
Map your server host name to the databases that are being searched and indexed. Replace oblak with your local host name:

oblak : wikilucene wikidev+

[Index] Section
Change oblak to your host name like you did for [Search-Group].

[Namespace-Prefix] Section
Add customized user namespaces used in the wiki to this section.

You can leave the other properties at their default values.

Incremental Updates
If you feel that periodically rebuilding the index puts too much load on your database, you can use the incremental updater. It requires some additional work:
 * Install the OAI Repository extension for MediaWiki. This extension enables the incremental updater to fetch the latest articles. It is a fairly complex installation, but it is the most practical way to keep your index up to date without causing serious performance issues. This is used on Wikimedia servers.

Split index
If your index is too big to fit into memory, you might want to split it into smaller parts. There are a few ways to do this. The simplest is mainsplit, which splits the index into two parts: one with all articles in the main namespace, and one for everything else. You can also use nssplit, which lets you split the index by any combination of namespaces. Finally, there is the split architecture, which randomly assigns documents to one of N index parts. From a performance viewpoint it is best to split the index by namespaces, as mainsplit if possible, on the assumption that users almost always want to search only the main namespace.
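The namespace-based splits can be pictured as a lookup from a page's namespace to the index part that holds it. The sketch below is illustrative only: `part_for_namespace` is a made-up helper, and it assumes that a part declared with an empty namespace list (`[]`) receives every namespace not claimed by another part, which is how the splitLucene example in lsearch-global.conf reads.

```python
def part_for_namespace(namespace: int, parts: dict) -> str:
    """Return the name of the index part that stores the given namespace.

    parts maps a part name to its list of namespaces. An empty list is
    treated as the catch-all part (an assumption, see the text above).
    """
    catch_all = None
    for name, namespaces in parts.items():
        if not namespaces:
            catch_all = name
        elif namespace in namespaces:
            return name
    if catch_all is None:
        raise KeyError(f"no index part covers namespace {namespace}")
    return catch_all


# mainsplit: main namespace in one part, everything else in the other.
MAINSPLIT = {"mainspace": [0], "restspace": []}

# nssplit with three parts, mirroring the splitLucene example.
NSSPLIT_EXAMPLE = {"nspart1": [0], "nspart2": [4, 5, 12, 13], "nspart3": []}

if __name__ == "__main__":
    for ns in (0, 12, 2):
        print(ns, "->", part_for_namespace(ns, NSSPLIT_EXAMPLE))
```

A search restricted to the main namespace only needs the part that holds namespace 0, which is why splitting by namespace tends to perform well.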

If you spread a split index across many hosts, usage is load-balanced: each search may query a different combination of the hosts that hold the required index parts. The MediaWiki Lucene search extension doesn't need to worry about this; it only has to send the request to a host that has some part of the index.
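One simple way to picture this load balancing is to choose, for every search, one host per required index part from the hosts that carry that part. The sketch below models this with a random choice. That is an assumption made for illustration and not necessarily how lsearchd selects hosts; the host names `search1` to `search3` and the helper `pick_hosts` are hypothetical.

```python
import random


def pick_hosts(part_hosts: dict, rng=random) -> dict:
    """Choose one host for each index part.

    part_hosts maps an index part name to the hosts that hold a copy of it.
    rng is anything with a choice() method, e.g. random.Random(seed).
    """
    return {part: rng.choice(hosts) for part, hosts in part_hosts.items()}


# Hypothetical layout: two hosts carry the main namespace part,
# two carry the rest, with one host holding both.
EXAMPLE_LAYOUT = {
    "mainspace": ["search1", "search2"],
    "restspace": ["search2", "search3"],
}

if __name__ == "__main__":
    for _ in range(3):
        print(pick_hosts(EXAMPLE_LAYOUT))
```

Repeated calls yield different host combinations, so the query load spreads across every host that has a copy of a given part.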

There are examples of using these index architectures in lsearch-global.conf in the package.

This extension supports all kinds of exotic options:
 * index updates with custom rotation exceptions,