Extension:LuceneSearch

What can this extension do?
Full-text search based on Lucene. It can distribute the search/indexing process across multiple hosts, uses stemmers, applies some link analysis for ranking, and supports prefixed queries.

Installation
Requires: Linux, Java 5, Apache Ant 1.6, Rsync (for distributed architecture)

The software is split into two parts: the lsearch daemon, which does all the searching and indexing, and the MediaWiki extension, which is only a MediaWiki interface to the daemon.

There are a few typical installation scenarios, depending on the size of your system.

Single-host setup
In most cases a single host will be able to handle both indexing and searching. Searching is typically very memory-hungry, and it's good practice to have at least half of the index buffered in memory. If the index is more than twice the size of available memory you'll probably experience serious performance degradation, and should consider distributing search.

Typically, the search index is around 3-5 times smaller than the corresponding XML database dump.

For easy maintenance of the distributed architecture, configuration is split into two parts: global and local. In a single-host install you still need to set up both of them:

Local configuration: For other properties you can leave the default values.
 * 1) Obtain a copy of the lsearch daemon and unpack it in e.g. /usr/local/search/ls2/
 * 2) Make a directory where the indexes will be stored, e.g. /usr/local/search/indexes
 * 3) Edit the lsearch.conf file (a sample is sketched after this list):
  * MWConfig.global - the URL of the global configuration file (see below), e.g. file:///etc/lsearch-global.conf
  * MWConfig.lib - the local path to the lib directory, e.g. /usr/local/search/ls2/lib
  * Indexes.path - the base path where you want the daemon to store the indexes, e.g. /usr/local/search/indexes
  * Localization.url - the URL to the MediaWiki message files, e.g. file:///var/www/html/wiki/phase3/languages/messages
  * Logging.logconfig - the local path to the log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch package includes a sample log4j file you can use)
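
Put together, and assuming the example paths from the list above, the relevant lsearch.conf entries would look roughly like this sketch (lsearch.conf is a plain key=value properties file; every value here is an example to adjust):

 MWConfig.global=file:///etc/lsearch-global.conf
 MWConfig.lib=/usr/local/search/ls2/lib
 Indexes.path=/usr/local/search/indexes
 Localization.url=file:///var/www/html/wiki/phase3/languages/messages
 Logging.logconfig=/etc/lsearch.log4j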

The global configuration tells the daemon about your databases and your network setup.

Global configuration: For other properties you can leave the default values.
 * 1) Add some databases (the database name is $wgDBname in MediaWiki) in the Database section, e.g. wikidb : (single) (language,en) - this declares that wikidb should be built as a single (nondistributed) index and that it should use English as its default language, and thus English stemming.
 * 2) Declare your host as searching and indexing wikidb in the Search-Group and Index sections (see the sketch after this list)
 * 3) Optionally, add your custom user namespaces to the Namespace-Prefix section
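
For a single host (written here as myhost, a stand-in for your real hostname), a minimal lsearch-global.conf could look like the sketch below. The section names come from the list above; treat the exact host : database mapping syntax as an assumption to check against the sample config shipped with the package:

 [Database]
 wikidb : (single) (language,en)

 [Search-Group]
 myhost : wikidb

 [Index]
 myhost : wikidb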

Next, build LuceneSearch.jar by invoking ant, or, if you got a binary release, just start the daemon with ./lsearchd. Note that you need to have java in your path.
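
Concretely, with the layout used above, that amounts to something like:

 cd /usr/local/search/ls2
 ant          # builds LuceneSearch.jar (needs Apache Ant 1.6)
 ./lsearchd   # starts the daemon; java must be in your path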

Building the index
The simplest way to keep the index up to date is to rebuild it periodically. You can put the rebuild in a cron job, or make a script that rebuilds the index and then sleeps for some time.

To build the index, you need an XML dump of the database. Then you can use the helper tool Importer to rebuild the index. Here is a sample invocation (you might want to adjust the dump file path, etc.):

 php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s wikidb.xml wikidb
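
To run this periodically as suggested above, a crontab entry along these lines would do (all paths are illustrative and depend on where MediaWiki and LuceneSearch.jar live on your system):

 # rebuild the wikidb index every night at 03:00
 0 3 * * * cd /var/www/html/wiki/phase3 && php maintenance/dumpBackup.php --current --quiet > /tmp/wikidb.xml && java -cp /usr/local/search/ls2/LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s /tmp/wikidb.xml wikidb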

The Importer will import the XML dump and make an index snapshot (the -s option). The snapshot will be picked up by the lsearch daemon (which periodically checks for index snapshots), and the working copy of the index will be updated. Indexes for the lsearch daemon are stored in standard locations: if /usr/local/search/indexes is your root index path, then indexes/snapshot will contain snapshots, indexes/search will contain the current working copy of the index, indexes/update the previous working copies and index updates, etc.
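
Spelled out, the layout under the root index path looks roughly like this (the status directory is used by the incremental updater described below):

 /usr/local/search/indexes/
   snapshot/   # index snapshots produced by the Importer
   search/     # current working copy of the index
   update/     # previous working copies and index updates
   status/     # per-database timestamp of the last incremental update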

And that's it: if you have correctly set up the MW extension, you should be able to search and have the index updated.

Incremental updates
If you feel that periodically rebuilding the index puts too much load on your database, you can use the incremental updater. It requires some additional work:
 * 1) Install the OAI extension for MediaWiki. This extension enables the incremental updater to fetch the latest articles.
 * 2) Create a new MySQL database, e.g. lsearchdb, and make sure it is a UTF-8 database. It is needed to store the article ranking data; this data is normally recalculated by the Importer at each import.
 * 3) Set up the Storage section in the local configuration (lsearch.conf). Supply user and admin passwords for the MySQL database; the admin password is needed for creating tables, etc.
 * 4) Rebuild the article rank data. You can put this on a cron job once a week or once a month (article ranks typically change very slowly; see the cron sketch below):
  php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.ranks.RankBuilder wikidb.xml wikidb
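
As with full rebuilds, this fits naturally in cron; an illustrative weekly entry (paths again depend on your install):

 # rebuild article ranks every Sunday at 04:00
 0 4 * * 0 cd /var/www/html/wiki/phase3 && php maintenance/dumpBackup.php --current --quiet > /tmp/wikidb.xml && java -cp /usr/local/search/ls2/LuceneSearch.jar org.wikimedia.lsearch.ranks.RankBuilder /tmp/wikidb.xml wikidb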

Finally, set up the OAI repository for the incremental updater: in the global config (lsearch-global.conf), set up a mapping of dbname : host, and in the local settings supply the username/password in OAI.username/OAI.password, if any. Start the incremental updater with:

 java -cp LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -t start_time wikidb

The parameters are (a concrete first-run example follows the list):
 * -n - wait for notification from the indexer that articles have been successfully added
 * -d - daemonize, i.e. run updates in an infinite loop
 * -s 600 - after one round of updates sleep 10 minutes (600s)
 * -t timestamp - start updates from timestamp (e.g. 2007-06-17T15:00:00Z). You need to supply this only the first time the incremental updater is run, so it knows from when to fetch updates. Typically, if your database is small, you can rebuild the whole index by omitting this parameter (beware: this is some 10 times slower than rebuilding directly from the XML dump!). Or better, you can rebuild the first version of the index from a dump (see the previous section), and then incrementally update it afterwards. At all times, you will find the timestamp of the latest incremental update in indexes/status/wikidb, so you can stop the incremental updater and rerun it without the -t parameter, and it will pick up from the time it was stopped.
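
Putting it together, a first run and a later restart would look like this (the timestamp is just an example):

 # first run: tell the updater where to start fetching updates
 java -cp LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -t 2007-06-17T15:00:00Z wikidb
 # later runs: omit -t; the updater resumes from the timestamp stored in indexes/status/wikidb
 java -cp LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 wikidb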

An alternative to steps (2), (3) and (4) is not to use ranking at all. You can do this by passing the --no-ranks parameter to the incremental updater, and it won't try to fetch ranks from the MySQL database. If your wiki is small and has some hundreds of pages, you probably don't need any ranking. But if you have, or plan to have, hundreds of thousands of pages, you will definitely benefit from ranking data.
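
That is, the same command as above with the extra flag:

 java -cp LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 --no-ranks wikidb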

Changes to LocalSettings.php
require_once("$IP/extensions/LuceneSearch/LuceneSearch.php");
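
Beyond the require_once line, the extension needs to know where the lsearch daemon is listening. On this extension family that is typically configured with variables along the following lines; treat the variable names and the port as assumptions and check the extension's own documentation:

 $wgLuceneHost = 'localhost'; // host running lsearchd (variable name assumed)
 $wgLucenePort = 8123;        // daemon's search port (default port assumed)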