Extension:Lucene-search

From MediaWiki.org


MediaWiki extensions manual
lucene-search

Release status: beta

Implementation Search
Description Search engine for MediaWiki
Author(s) Robert Stojnić (Rainmantalk)
Latest version 2.1.3 (devel)
2.0.2 (stable)
MediaWiki 1.5+
License GPL
Download Binary for Java6 (2.1)

SVN (2.1)
Binary (2.0)
Wikimedia git repository: operations/debs/lucene-search-2
README.txt (stable)


Lucene-search is a search engine back end for large MediaWiki websites. It is the search engine used by Wikimedia wikis. (Smaller sites might want to consider SphinxSearch.) Lucene-search extends the Apache Lucene search API with page ranking based on the number of backlinks, distributed searching and indexing, wikitext parsing, incremental updates, and more.

Lucene-search requires a front-end extension, such as Extension:MWSearch, to fetch results from the search engine.

Versions

2.1 (development) - used on all Wikimedia Foundation wikis
Features:
  • Result highlighting
  • "Did you mean..?"-style query correction (i.e. spell checking)
  • Advanced ranking based on term proximity, relatedness, and anchor text
2.0.2 (stable) - see Extension:Lucene-search/2.0 docs
Features:
  • Distributed search
  • Scalability
  • Basic ranking
  • Accentless search

The following documentation is for the latest development version (2.1). The old documentation is at Extension:lucene-search/2.0 docs.

Requirements

  • Linux
  • Java 6+ JDK (OpenJDK or Sun)
  • Apache Ant 1.6 (for building from source)
  • Rsync (required for distributed architecture)
  • Subversion client

Note to Windows users: From version 2.0 onward, the LSearch daemon doesn't support the Windows platform (since it uses hard and soft file links). You can still use the old daemon written in C#. See the installation instructions.

Installation

Single Host Setup (MediaWiki & Lucene-Search On The Same Host)

1. If using MediaWiki version 1.17 or before, ensure that AdminSettings.php is set up. AdminSettings.sample must be renamed AdminSettings.php, and modified so that it contains:

 $wgDBadminuser = "database_admin_username";
 $wgDBadminpassword  = "database_admin_password";

2. Get Lucene-search into /usr/local/search/ls2

  • Download the binary release and unpack it, or
  • Download the source from Subversion and run ant to build the jar:
  ant

3. Generate configuration files by running:

 ./configure <path to mediawiki root directory>
This script will examine your MediaWiki installation and generate configuration files to match it. Before running configure, you may customize some options in template/simple/lsearch-global.conf, for example the language option. These options are explained below.

4. If everything ran without errors, build the indexes:

 ./build
This will build the search, highlight, and spellcheck indexes from an XML database dump.
  • For small wikis, just put this script into a daily cron job and your installation is done (i.e. skip the ./update step below, since small wikis don't need the OAIRepository extension).
  • For larger wikis, install the Extension:OAIRepository MediaWiki extension and, after building the initial index, use the incremental updater:
 ./update
This will fetch the latest changes from your wiki and update the search, page-link, and spellcheck indexes. Put it into a daily cron job to keep the indexes up to date.
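For example, a daily cron entry for the updater might look like the following sketch (the installation path, schedule, and log location are assumptions):

```
# crontab entry: run the incremental updater every night at 03:00
0 3 * * * cd /usr/local/search/ls2 && ./update >> /var/log/lsearch-update.log 2>&1
```

For small wikis that skip the updater, the same pattern applies with ./build in place of ./update.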

5. Start the daemon. Do this by running:

 ./lsearchd
Note: The Lucene-search daemon must be running for searching to work, and the package does not install an init.d script to start it automatically at boot. As noted in this post, an init.d script can be created manually and added to the startup sequence with a separate command. This approach is not version-specific but has been tested on Ubuntu 10.04 and 12.04 LTS. Alternatively, an rc.local entry (tested on Ubuntu 12.04 LTS) can be used.
Use the optional command-line parameter -configfile to specify the path to the lsearch.conf file you wish to use. This is handy when invoking lsearchd by its absolute path.
 /opt/lucene-search/lsearchd -configfile /opt/lucene-search/lsearch.conf
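As a sketch, the rc.local approach could be a single line added before the final exit 0 (the installation path and log location here are assumptions):

```
# /etc/rc.local -- start the Lucene-search daemon at boot
cd /usr/local/search/ls2 && ./lsearchd > /var/log/lsearchd.log 2>&1 &
```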

6. Install Extension:MWSearch and make sure to set $wgLuceneSearchVersion = 2.1.

7. Once the indexes have been built and MWSearch installed, run the daemon:

 ./lsearchd

The daemon will listen on port 8123 for incoming search requests from MediaWiki, and on port 8321 for incoming incremental index updates. The MWSearch extension will route all search requests to this daemon.

You can test search results by browsing to an HTTP URL of the form:

http://<hostname>:8123/search/<database_name>/<your_test_query>

For example, http://localhost:8123/search/wikidb/hello.
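The test URL above can be assembled from your own values; the host and database name in this sketch are assumptions matching the example:

```shell
# Build the test URL from your values; host and database name are assumptions
HOST=localhost
DB=wikidb
QUERY=hello
URL="http://${HOST}:8123/search/${DB}/${QUERY}"
echo "$URL"
# then fetch it while the daemon is running, e.g.: curl "$URL"
```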

Dual Host Setup (MediaWiki & Lucene-Search On Different Hosts)

  1. Install Lucene-search on the MediaWiki host as described above.
  2. Ensure that Java is installed on the new indexer host.
  3. In the [Index] section of lsearch-global.conf, replace the search host name with the new indexer host name.
  4. Copy your lucene-search installation (with configuration files and indexes) to the new indexer host.
  5. On the indexer, edit /etc/rsyncd.conf and add these lines:
 [search]
 path = <put your local path to indexes here>
 comment = Lucene Search 2 index data
 read only = yes

The local path to indexes is just the indexes/ subdirectory of your lucene-search installation on the indexer.

  • Run rsyncd via rsync --daemon
  • Transfer the appropriate cron jobs (e.g. build or update) to the indexer
  • (Re)start lsearchd on both the indexer and the searcher

After a new index is built on the indexer (e.g. via a daily cron job), the searcher will pick it up, transfer it, and use it.

Note, however, that this will produce two different sets of configuration files, and you will need to update both of them on any subsequent change. A better idea is to share the lsearch-global.conf file via NFS, or to serve it from a URL (to do this, edit the lsearch-global.conf location in lsearch.conf).
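For example, instead of keeping two copies, the lsearch.conf on both hosts could point at one shared copy of the global configuration (the URL below is hypothetical):

```
# lsearch.conf on both searcher and indexer: fetch the global config from one place
MWConfig.global=http://config.example.com/lsearch-global.conf
```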

Local configuration

  1. Obtain a copy of lsearch daemon, unpack it in e.g. /usr/local/search/ls2/
    If you downloaded from SVN, you'll also need mwdumper.jar in e.g. /usr/local/search/ls2/lib
  2. Make a directory where the indexes will be stored, e.g. /usr/local/search/indexes
  3. Edit lsearch.conf file:
    • MWConfig.global - put here the URL of global configuration file (see below), e.g. file:///etc/lsearch-global.conf
    • MWConfig.lib - put here the local path to lib directory, e.g. /usr/local/search/ls2/lib
    • Indexes.path - base path where you want the daemon to store the indexes, e.g. /usr/local/search/indexes
    • Localization.url - url to MediaWiki message files, e.g. file:///var/www/html/wiki/phase3/languages/messages
    • Logging.logconfig - local path to log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch SVN has a sample log4j file you can use called lsearch.log4j-example)

For other properties you can leave default values.
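Pulling the properties above together, a minimal lsearch.conf might contain (all values are the example paths from the list above):

```
MWConfig.global=file:///etc/lsearch-global.conf
MWConfig.lib=/usr/local/search/ls2/lib
Indexes.path=/usr/local/search/indexes
Localization.url=file:///var/www/html/wiki/phase3/languages/messages
Logging.logconfig=/etc/lsearch.log4j
```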

Global configuration tells the daemon about your databases, and your network setup.

Advanced Configuration

To index multiple databases, or to distribute search across multiple search servers, you will need to modify the global configuration and follow the additional steps below.

Global Configuration

Edit lsearch-global.conf file. Each of these sections needs to be updated to use the correct host name and database.


Warning: Make sure there are no spaces within the arguments (e.g. (warmup,10)). Stray spaces can cause failures when creating the search, snapshot, or index folders while building the index.

[Database] Section

  • List the databases to be indexed in the [Database] section.
    • <database_name> : <opts>+ where <database_name> is the value of $wgDBname in your MediaWiki LocalSettings.php file, or
    • {http://path/to/all.dblist} : <opts>+ where you provide a URI to a list of databases.
  • <opts> are:
  1. distributed index configuration (<index_type>,<optimize>,<docBuffer>,<mergeFactor>,<subdivisions>)
    • <index_type> takes one of these values:
      • single - the index is not distributed.
      • mainsplit - a two-part index: mainspace with the [0] namespace, restspace with all other namespaces. (recommended)
      • split - the index is split into several parts, with documents assigned randomly to the parts.
      • nssplit - split by namespace list.
    • <optimize> : true to optimize while indexing, false to skip. (optional)
    • <docBuffer> : the size of the document cache; default is 10. (optional)
    • <mergeFactor> : the merge factor; default is 2. (optional)
    • <subdivisions> : the number of index subdivisions. (required for nssplit)
  2. (language,en) - the default language, which determines the stemming type
  3. optional parameters: (warmup,NUM) - warm up the index with NUM queries after an index update. This keeps performance smooth by ensuring indexes are always well cached and buffered.
  4. <typeid>
    • spell:
    • links:
    • related:
    • prefix:
    • title_ngram:
Additional definitions for the same database override previous settings.
An Example
[Database]
{file:///home/wikipedia/common/all.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2) 
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
#wikilucene : (single) (language,en) (warmup,100)
#wikidev : (single) (language,sr)
#splitLucene : (nssplit,3), (nspart1,[0]), (nspart2,[4,5,12,13]), (nspart3,[])
#wikilucene : (language,en) (warmup,10)
  • all the databases at file:///home/wikipedia/common/all.dblist should be indexed
  • wikilucene : (single) (language,en) - declares that wikilucene is a single (nondistributed) index and that it should use English as its default language, and thus use English stemming. The optional (warmup,100) instructs Lucene to apply 100 queries to the index when an updated version is fetched to warm it up.
  • splitLucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]) (nspart3,[]) declares that splitLucene is a distributed index with three parts:
    • nspart1 which will store the index for namespace 0
    • nspart2 which will store the index for namespaces 4,5,12,13
    • nspart3 which will store the index for the other namespaces

[Search-Group] Section

  • [Search-Group] section: Map your server hostname to the database that's being searched and indexed.
[Search-Group]
#oblak : wikilucene wikidev+
<host_name> : <database_name>
Replace <host_name> with your local host name.
Warning: don't use localhost; use your host name exactly as it appears in the environment variable $HOSTNAME. To find this value, run echo $HOSTNAME and use whatever it returns.

[Index] Section

Change oblak to your host name like you did for [Search-Group].
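Assuming the [Index] section follows the same host : database format as [Search-Group], the result could look like this sketch (the host and database names are hypothetical):

```
[Index]
# the indexer host, exactly as in $HOSTNAME on that machine
mywikihost : wikidb
```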

[Namespace-Prefix] Section

Add customized user namespaces used in the wiki to this section.

For other properties you can leave default values.


Incremental updates

If you feel that periodically rebuilding the index puts too much load on your database, you can use the incremental updater. It requires some additional work:

  1. Install the OAI Repository extension for MediaWiki. This extension enables the incremental updater to fetch the latest articles. The installation is fairly complex, but it is the most practical way to keep your index up to date without causing serious performance issues. This is what the Wikimedia servers use.

Distributed architecture

A common setup is the many-searchers/one-indexer approach. A quick look at the global configuration file (lsearch-global.conf) should make it clear how to distribute searching: just add more host : dbname mappings and start lsearchd on those hosts. However, the searchers need to be able to fetch and update their indexes, so:

  1. Set up rsyncd.conf and start the rsync daemon on the indexer host (there is a sample config file in SVN).
  2. Add the rsync path on the indexer host to the Index-Path section of the global configuration.

Split index

If your index is too big to fit into memory, you might want to split it into smaller parts. There are a couple of ways to do this. The simplest is mainsplit, which splits the index into two parts: one with all articles in the main namespace, and one with everything else. You can also use nssplit, which lets you split the index by any combination of namespaces. Finally, there is the split architecture, which randomly assigns documents to one of N index parts. From a performance viewpoint it is best to split the index by namespaces, if possible as mainsplit; this works best under the assumption that users almost always want to search only the main namespace.
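For example, switching a single-index database to a mainsplit index in lsearch-global.conf could look like this sketch (the database name is hypothetical):

```
[Database]
# before: wikidb : (single) (language,en)
wikidb : (mainsplit) (language,en)
```

Remember to rebuild the indexes after changing the index architecture.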

If you split the index across many hosts, usage will be load-balanced: each search queries a different combination of hosts holding the required index parts. The MediaWiki Lucene Search extension doesn't need to worry about this; it just has to get the request to a host that has some part of the index.

There are examples of using these index architectures in lsearch-global.conf in the package.

This extension supports all kinds of exotic options:

  • index updates with custom rotation exceptions,

Setting Up Suggestions for the Search Box

To enable this feature:

1. Modify the global settings:

Add the "(prefix)" option to your Database entry, for example:

[Database]
wikidb : (single) (prefix) (spell,4,2) (language,en)

2. Re-run the build script to build the prefix index as well.

3. Update the MediaWiki installation to use Lucene as the backend for prefix matches by modifying LocalSettings.php:

$wgEnableMWSuggest = true;
$wgEnableLucenePrefixSearch = true;
# default host for mwsuggest backend
$wgLucenePrefixHost = '10.0.3.18'; # IP or hostname of your lucene box

This part is tricky: do not use substitutes such as localhost or 127.0.0.1 for $wgLucenePrefixHost. This value is injected into AJAX JavaScript sent to your clients' browsers, and a client browser cannot figure out where your server is unless you tell it. So use the real IP or host name of the server where Lucene is running.

Performance tuning

The default values for the Lucene indexer favor minimal memory usage and a minimal number of segments; however, they can make indexing very slow. The defaults are 10 buffered documents and a merge factor of 2. You might want to increase these values, for instance to 500 buffered documents and a merge factor of 20. You can do this in the global configuration, e.g. wikidb : (single,true,20,500). Beware, however, that increasing the number of buffered documents will quickly eat up heap space. It is best to try different values and see which work best for your memory profile.

If you run the searcher on a multi-CPU host, you might want to adjust SearcherPool.size in the local configuration file. The pool size is the number of IndexSearchers per index. Set it to at least the number of CPUs, or better, the number of CPUs + 1. This prevents CPUs from blocking each other by accessing the index through a single RandomAccessFile instance.
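For example, on a 4-CPU search host the local configuration could follow the CPUs + 1 rule above (a sketch; the value depends on your hardware):

```
# lsearch.conf: 4 CPUs + 1 = 5 IndexSearchers per index
SearcherPool.size=5
```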

FAQ

Q1. Is a single search across multiple wikis using multiple databases possible?

A1. It is not supported. A possible workaround is to dump all the wikis into a single file and index that.

Q2. If Lucene's being used by WMF, why isn't it in the CommonSettings or InitialiseSettings files?

A2. It's just not called "Lucene-search" there. Look at CommonSettings.php, which refers to $wmfConfigDir/lucene.php (operations/mediawiki-config/wmf-config/lucene.php).
