Extension:LuceneSearch

What can this extension do?
Full-text search based on Lucene. Can distribute the search/indexing process, use stemmers, do some link analysis for ranking, support prefixed queries...

The software is split into two parts. The first is the lsearch daemon, which does all the searching and indexing. The other is the MediaWiki extension, which is only a MediaWiki interface for the daemon.

Installing MediaWiki Extension
Put the following into your LocalSettings.php. $wgLuceneHost can be an array of hosts.

$wgDisableInternalSearch = true;
$wgDisableSearchUpdate = true;
$wgSearchType = 'LuceneSearch';
$wgLuceneHost = '127.0.0.1';
$wgLucenePort = 8123;
require_once("$IP/extensions/LuceneSearch/LuceneSearch.php");
$wgLuceneSearchVersion = 2;
$wgLuceneDisableSuggestions = true;
$wgLuceneDisableTitleMatches = true;

Using search (namespace) prefixes
Searches can be prefixed with canonical namespace names or localized namespace names. E.g. entering help:images into the search box will search the Help namespace for occurrences of the word images. There is also a special prefix all, which instructs the search engine to search everything (by default only the main namespace is searched).

Note that spaces are not allowed between the namespace name and the colon.

If you want to search more than one namespace, you can comma-separate their names, e.g. help,project:images.
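To illustrate the prefix syntax, here is a minimal shell sketch that splits a query into its namespace list and search terms. The function name split_prefix is hypothetical and this is not the daemon's actual parser. In particular, falling back to a plain default search when there is a space before the colon is an assumption made for illustration.

```shell
#!/bin/sh
# Illustration of the "namespace[,namespace...]:terms" query syntax.
# split_prefix is a hypothetical helper, not part of lsearch or MediaWiki.
split_prefix() {
    query=$1
    case $query in
        *:*)
            prefix=${query%%:*}   # text before the first colon
            terms=${query#*:}     # text after the first colon
            case $prefix in
                ""|*" ")
                    # Empty prefix or a space before the colon: assumed
                    # here to fall back to a plain default search.
                    printf 'namespaces=default terms=%s\n' "$query" ;;
                *)
                    printf 'namespaces=%s terms=%s\n' "$prefix" "$terms" ;;
            esac ;;
        *)
            # No colon at all: plain search in the default namespace.
            printf 'namespaces=default terms=%s\n' "$query" ;;
    esac
}
```

For example, split_prefix 'help,project:images' prints namespaces=help,project terms=images.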

Customizing namespace prefixes
You can also specify your own prefixes that combine different namespaces. You can do this by editing MediaWiki:searchaliases. Here is a sample entry:
wp|Project
all_talk|Talk,User_talk,Image_talk,MediaWiki_talk,Template_talk,Help_talk,Category_talk

So, the syntax is: "alias|<list of namespaces>" per line. You can also localize the all keyword by editing MediaWiki:searchall and adding your localized names one per line.

Installing LSearch daemon
Requires: Linux, Java 5, Apache Ant 1.6, Rsync (for distributed architecture)

There are a few typical installation scenarios, depending on the size of your system.

Single-host setup
In most cases a single host will be able to handle both indexing and searching. Searching is typically very memory-hungry, and it's a good practice to have at least half of the index buffered up in memory. If the index is 2x larger than available memory you'll probably experience some serious performance degradation, and should consider distributing search.

Typically, the search index is around 3-5 times smaller than the corresponding XML database dump.
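As a worked example of these rules of thumb, the following sketch estimates the index size and the memory you would want for buffering, given the size of the XML dump in megabytes. The helper name estimate_memory is hypothetical, and the numbers are only as good as the rough 3-5x ratio and the "at least half of the index in memory" guideline above.

```shell
#!/bin/sh
# Rough sizing from the rules of thumb above (integer MB, rounded down).
# estimate_memory is a hypothetical helper name.
estimate_memory() {
    dump_mb=$1
    index_min=$((dump_mb / 5))   # index about 5x smaller than the dump
    index_max=$((dump_mb / 3))   # index about 3x smaller than the dump
    # Aim to keep at least half of the index buffered in memory.
    echo "index: ${index_min}-${index_max} MB, memory: at least $((index_min / 2))-$((index_max / 2)) MB"
}
```

For a 3000 MB dump, estimate_memory 3000 prints index: 600-1000 MB, memory: at least 300-500 MB.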

For easy maintenance of a distributed architecture, configuration is split into two parts: global and local. In a single-host install you also need to set up both of them:

Local configuration:
 * 1) Obtain a copy of the lsearch daemon and unpack it in e.g. /usr/local/search/ls2/. If you downloaded from SVN, you'll also need mwdumper.jar in e.g. /usr/local/search/ls2/lib
 * 2) Make a directory where the indexes will be stored, e.g. /usr/local/search/indexes
 * 3) Edit the lsearch.conf file:
   * MWConfig.global - put here the URL of the global configuration file (see below), e.g. file:///etc/lsearch-global.conf
   * MWConfig.lib - put here the local path to the lib directory, e.g. /usr/local/search/ls2/lib
   * Indexes.path - base path where you want the daemon to store the indexes, e.g. /usr/local/search/indexes
   * Localization.url - URL to the MediaWiki message files, e.g. file:///var/www/html/wiki/phase3/languages/messages
   * Logging.logconfig - local path to the log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch SVN has a sample log4j file you can use)
For other properties you can leave the default values.
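The local configuration properties above might end up in a file like the one written by the sketch below. The property names and example values come from the list above. The key=value line syntax is an assumption, so compare with the sample lsearch.conf shipped with the daemon. The function name write_local_conf is hypothetical.

```shell
#!/bin/sh
# Writes a minimal lsearch.conf using the example values from the steps above.
# ASSUMPTION: properties are written as "key=value" lines; check the sample
# lsearch.conf shipped with the daemon for the exact syntax.
write_local_conf() {
    cat > "$1" <<'EOF'
MWConfig.global=file:///etc/lsearch-global.conf
MWConfig.lib=/usr/local/search/ls2/lib
Indexes.path=/usr/local/search/indexes
Localization.url=file:///var/www/html/wiki/phase3/languages/messages
Logging.logconfig=/etc/lsearch.log4j
EOF
}
```

Calling write_local_conf ./lsearch.conf writes the five properties to the given path; adjust the values to your own layout before using the file.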

Global configuration tells the daemon about your databases, and your network setup.

Global configuration:
 * 1) Add some databases (the database name is $wgDBname in MediaWiki) in the Database section, e.g. wikidb : (single) (language,en). This declares that wikidb should be built as a single (non-distributed) index and that it should use English as its default language, and thus use English stemming. Additionally, you might add the property (warmup,100), which instructs the searchers to apply 100 queries in the background to warm up the index when an updated version is fetched. This enables a smooth transition in performance and ensures indexes are always well cached and buffered.
 * 2) Declare your host as searching and indexing wikidb in the Search-Group and Index sections (Important: don't use localhost, but your hostname as in the environment variable $HOSTNAME)
 * 3) Optionally, add your custom user namespace to the Namespace-Prefix section
For other properties you can leave the default values.
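A single-host global configuration following the steps above could be generated roughly as sketched here. The section names, the wikidb entry and the host : dbname mapping come from this page. The "[Name]" header syntax is an assumption, so compare it with the sample lsearch-global.conf in the package. The function name write_global_conf is hypothetical.

```shell
#!/bin/sh
# Writes a minimal single-host lsearch-global.conf.
# ASSUMPTION: sections are written as "[Name]" headers followed by
# "key : value" entries; compare with the sample file in the package.
write_global_conf() {
    out=$1
    host=$2    # must be the real hostname, not localhost
    cat > "$out" <<EOF
[Database]
wikidb : (single) (language,en) (warmup,100)

[Search-Group]
$host : wikidb

[Index]
$host : wikidb
EOF
}
```

A typical call would be write_global_conf /etc/lsearch-global.conf "$(hostname)", so that the file contains the machine's actual hostname.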

Next, build LuceneSearch.jar by invoking ant (you can skip this if you got a binary release), then start the daemon with ./lsearchd. Note that you need to have java in your path.

Building the index
The simplest way to keep the index up-to-date is to periodically rebuild it. You can put the rebuild in a cron job, or make a script that rebuilds the index and then sleeps for some time.

To build the index, you need an XML dump of the database. Then you can use the helper tool Importer to rebuild the index. Here is the sample code (you might want to adjust the dump file path, etc.):
php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s wikidb.xml wikidb

The Importer will import the XML dump and make an index snapshot (-s option). The index snapshot will be picked up by the lsearch daemon (which periodically checks for index snapshots) and the working copy of the index will be updated. Indexes for the lsearch daemon are stored in standard locations. If /usr/local/search/indexes is your root index path, then indexes/snapshot will contain snapshots, indexes/search will contain the current working copy of the index, indexes/update the previous working copies and index updates, etc.
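The periodic rebuild could be wrapped in a small script such as the sketch below. It only chains the two commands shown above. The function name rebuild_index, the script path in the crontab comment and the schedule are hypothetical, and the relative paths assume the script is started from a directory where both maintenance/dumpBackup.php and LuceneSearch.jar can be found, so adjust them to your installation.

```shell
#!/bin/sh
# Sketch of a periodic full rebuild; paths and names are placeholders.
rebuild_index() {
    dbname=$1    # MediaWiki $wgDBname, e.g. wikidb
    dump=$2      # where to write the XML dump, e.g. wikidb.xml
    # Dump the current revisions, then import them and make a snapshot (-s).
    # The import only runs if the dump succeeded.
    php maintenance/dumpBackup.php --current --quiet > "$dump" &&
    java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s "$dump" "$dbname"
}

# Example crontab entry (hypothetical script path), rebuilding nightly at 03:00:
# 0 3 * * * /usr/local/search/ls2/rebuild.sh
```

Inside the script you would call rebuild_index wikidb wikidb.xml after the function definition.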

And that's it. If you have correctly set up the MW extension, you should be able to search and have the index updated.

Incremental updates
If you feel that periodically rebuilding the index puts too much load on your database, you can use the incremental updater. It requires some additional work:
 * 1) Install the OAI extension for MediaWiki. This extension enables the incremental updater to fetch the latest articles.
 * 2) Create a new MySQL database, e.g. lsearchdb, and make sure it's a UTF-8 database. It's needed to store the article ranking data. This data is normally recalculated by the importer at each import.
 * 3) Set up the Storage section in the local configuration (lsearch.conf). Supply user and admin passwords for the MySQL database; the admin password is needed for creating tables, etc.
 * 4) Rebuild the article rank data. You can put this in a cron job that runs once a week or once a month (article ranks typically change very slowly):
   php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.ranks.RankBuilder wikidb.xml wikidb
 * 5) Create the initial version of the index - you can do this using the importer described in previous sections

Finally, set up the OAI repository for the incremental updater: in the global config (lsearch-global.conf), set up a mapping of dbname : host, and in the local settings supply the username and password in OAI.username/password, if any. Start the incremental updater with:
java -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -dt start_time wikidb
The parameters are:
 * -n - wait for notification from the indexer that articles have been successfully added
 * -d - daemonize, i.e. run updates in an infinite loop
 * -s 600 - after one round of updates sleep 10 minutes (600s)
 * -dt timestamp - default timestamp (e.g. 2007-06-17T15:00:00Z) - This is the timestamp of your initial index build. You need to pass this parameter the first time you start the incremental updater, so it knows from what time to start the updates. Afterward the incremental updater will keep the timestamp of the last successful update in indexes/status/wikidb.
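The first-run handling of -dt can be sketched as follows. The function prints the updater command line and adds -dt only when no status file exists yet. That -dt can be left out on later runs is an assumption drawn from the description above, and the function name updater_cmd is hypothetical.

```shell
#!/bin/sh
# Prints the IncrementalUpdater command line for a database.
# ASSUMPTION: -dt may be omitted once <index root>/status/<dbname> exists,
# since the updater then keeps its own timestamp there.
updater_cmd() {
    dbname=$1
    index_root=$2     # e.g. /usr/local/search/indexes
    start_time=$3     # timestamp of the initial index build
    cmd="java -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600"
    if [ ! -e "$index_root/status/$dbname" ]; then
        cmd="$cmd -dt $start_time"    # first run: tell it where to start
    fi
    echo "$cmd $dbname"
}
```

Running updater_cmd wikidb /usr/local/search/indexes 2007-06-17T15:00:00Z prints the command so you can review it before starting the updater.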

An alternative to steps (2), (3) and (4) is not to use ranking. You can do this by passing the --no-ranks parameter to the incremental updater, and it won't try to fetch ranks from the MySQL database. If your wiki is small and has some hundreds of pages, you probably don't need any ranking. But if you have or plan to have hundreds of thousands of pages, you will definitely benefit from ranking data.

So far we have only managed to incrementally update the index. To instruct the indexer to periodically make a snapshot of the index (which gets picked up by the searchers), put this into your cron job:
curl http://indexerhost:8321/makeSnapshots

The indexer has an HTTP command interface. Other commands include getStatus, flushAll, etc.
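A small helper for building these command URLs might look like the sketch below. The port 8321 and the command names are taken from the text above. The helper name indexer_url and the hourly schedule in the crontab comment are hypothetical.

```shell
#!/bin/sh
# Builds the URL for a command on the indexer's HTTP interface.
# Port 8321 is taken from the example above; indexer_url is a
# hypothetical helper name.
indexer_url() {
    echo "http://$1:8321/$2"
}

# Example crontab entry (hourly snapshot), assuming curl is installed:
# 0 * * * * curl -s http://indexerhost:8321/makeSnapshots
```

For instance, curl "$(indexer_url indexerhost getStatus)" would query the indexer status.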

Distributed architecture
A common distribution is the many searchers/one indexer approach. A quick look at the global config file (lsearch-global.conf) should make it clear how to distribute searching. You just need to add more host : dbname mappings and start up lsearchd on those hosts. However, searchers need to be able to fetch and update their index, so:
 * 1) Set up rsyncd.conf and start the rsync daemon (there is a sample config file in SVN) on the indexer host
 * 2) Add rsync path on the indexer host to global configuration in Index-Path section.

Restart everything (searchers and indexer) and they should now be aware of each other. Searchers will periodically check for updates to the indexes they are assigned. You need to run the Importer on the indexer host, but you can run the IncrementalUpdater on any host, since it knows from the global config where the indexer is.

Split index
If your index is too big and cannot fit into memory, you might want to split it up into smaller parts. There are a couple of ways to do this. The simplest way is to do a mainsplit. This splits the index into two parts, one with all articles in the main namespace, and one for everything else. You can also do an nssplit, which lets you split the index by any combination of namespaces. Finally, there is a split architecture which randomly assigns documents to one of the N index parts. From a performance viewpoint it's best to split the index by namespaces, if possible as a mainsplit. This is best if we assume the user almost always wants to search only the main namespace.

If you split the index across many hosts, the usage will be load-balanced, i.e. each search uses a different combination of hosts that hold the required index parts. The MediaWiki Lucene Search extension doesn't need to worry about this; it just has to send the request to a host that has some part of the index.

There are examples of using these index architectures in lsearch-global.conf in the package.

Performance tuning
The default values for the Lucene indexer favor minimal memory usage and a minimal number of segments. However, indexing might be very slow because of this. The default is 10 buffered documents and a merge factor of 2. You might want to increase these values, for instance to 500 buffered docs and a merge factor of 20. You can do this in the global configuration, e.g. wikidb : (single,true,20,500). Beware, however, that increasing the number of buffered docs will quickly eat up heap. It's best to try out different values and see which work best for your memory profile.

If you run the searcher on a multi-CPU host, you might want to adjust SearcherPool.size in the local config file. The pool size corresponds to the number of IndexSearchers per index. You need to set it to at least the number of CPUs, or better, the number of CPUs+1. This prevents CPUs from blocking each other when accessing the index via a single RandomAccessFile instance.
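As a small illustration of the CPUs+1 rule, the sketch below prints a SearcherPool.size line for a given CPU count. The key=value syntax of lsearch.conf is an assumption, and pool_size_line is a hypothetical helper name.

```shell
#!/bin/sh
# Prints a SearcherPool.size line set to the number of CPUs + 1.
# ASSUMPTION: lsearch.conf uses "key=value" lines.
pool_size_line() {
    ncpu=$1
    echo "SearcherPool.size=$((ncpu + 1))"
}

# On Linux the CPU count can usually be read with getconf:
# pool_size_line "$(getconf _NPROCESSORS_ONLN)"
```

On a 4-CPU host, pool_size_line 4 prints SearcherPool.size=5.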