Extension:Lucene-search

Lucene-search is a search engine designed to index and search MediaWiki content on large websites. It is based on the Lucene search API and extends it with ranking based on the number of backlinks, distributed searching and indexing, parsing of wikitext, incremental updates, and more. This is the search engine currently used on Wikimedia wikis.

MediaWiki can use Extension:LuceneSearch (pre 1.13) or Extension:MWSearch (1.13+) to fetch results from this search engine.

Installing LSearch daemon
Requires: Linux, Java 5, Apache Ant 1.6, Rsync (for distributed architecture).

Note for Windows users: from version 2.0 onward, the LSearch daemon does not support the Windows platform (since it uses hard and soft file links). You can still use the old daemon written in C#. Installation instructions are at Installing lucene search.

There are a few typical installation scenarios, depending on the size of your system.

Single-host setup
In most cases a single host will be able to handle both indexing and searching. Searching is typically very memory-hungry, and it is good practice to have at least half of the index buffered in memory. If the index is more than twice the size of available memory, you will probably experience serious performance degradation and should consider distributing search.

Typically, the search index is around 3-5 times smaller than the corresponding XML database dump.

For easy maintenance of a distributed architecture, configuration is split into two parts: global and local. In a single-host install you need to set up both of them.
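The two rules of thumb above can be turned into a quick sizing check. This is only a sketch: the function names are made up for illustration, sizes are in megabytes, and the thresholds simply restate the text (index is 3-5 times smaller than the dump, and at least half of it should fit in memory).

```shell
#!/bin/sh
# Rough sizing check using the rules of thumb from the text.
# All sizes are in megabytes and purely illustrative.

# Print the expected index size range for a dump of the given size:
# the index is typically 3-5 times smaller than the XML dump.
estimate_index_mb() {
    dump_mb=$1
    echo "$((dump_mb / 5)) $((dump_mb / 3))"
}

# Print "ok" when at least half of the index fits in memory,
# otherwise "distribute" (index is more than twice the available RAM).
check_memory() {
    index_mb=$1
    ram_mb=$2
    if [ "$index_mb" -le $((ram_mb * 2)) ]; then
        echo "ok"
    else
        echo "distribute"
    fi
}

# Example: a 12000 MB dump on a host with 4096 MB of RAM.
estimate_index_mb 12000        # prints "2400 4000"
check_memory 4000 4096         # prints "ok"
```

Treat the result as a starting point only; actual index size depends on the content of your wiki.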

Local configuration:
 * 1) Obtain a copy of the lsearch daemon and unpack it, e.g. in /usr/local/search/ls2/. If you downloaded from SVN, you'll also need mwdumper.jar, e.g. in /usr/local/search/ls2/lib
 * 2) Make a directory where the indexes will be stored, e.g. /usr/local/search/indexes
 * 3) Edit the lsearch.conf file:
    * MWConfig.global - the URL of the global configuration file (see below), e.g. file:///etc/lsearch-global.conf
    * MWConfig.lib - the local path to the lib directory, e.g. /usr/local/search/ls2/lib
    * Indexes.path - the base path where you want the daemon to store the indexes, e.g. /usr/local/search/indexes
    * Localization.url - the URL of the MediaWiki message files, e.g. file:///var/www/html/wiki/phase3/languages/messages
    * Logging.logconfig - the local path to the log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch SVN has a sample log4j file you can use, called lsearch.log4j-example)

For other properties you can leave default values.
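As an illustration, the properties listed above could be written out like this. The sketch assumes lsearch.conf uses plain key=value lines, and the output path in CONF is a hypothetical location chosen so nothing existing is overwritten; the values are the example paths from the list.

```shell
#!/bin/sh
# Write a sample lsearch.conf using the example paths from the list above.
# CONF is a hypothetical output location; review the file before installing it.
CONF="${CONF:-${TMPDIR:-/tmp}/lsearch.conf.sample}"

cat > "$CONF" <<'EOF'
# URL of the global configuration file
MWConfig.global=file:///etc/lsearch-global.conf
# Local path to the lib directory (contains mwdumper.jar)
MWConfig.lib=/usr/local/search/ls2/lib
# Base path where the daemon stores the indexes
Indexes.path=/usr/local/search/indexes
# URL of the MediaWiki message files
Localization.url=file:///var/www/html/wiki/phase3/languages/messages
# Local path to the log4j configuration file
Logging.logconfig=/etc/lsearch.log4j
EOF

echo "wrote $CONF"
```

Compare the result with the lsearch.conf shipped in the package before using it, since the shipped file contains further properties that should keep their defaults.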

Global configuration tells the daemon about your databases, and your network setup.

Global configuration
Edit lsearch-global.conf file. Each of these sections needs to be updated to use the correct host name and database.


 * [Database] section: Add your databases, using the database name set in $wgDBname in your MediaWiki LocalSettings.php file. Examples:
    * wikilucene : (single) (language,en) (warmup,0) - declares that wikilucene is a single (nondistributed) index and that it should use English as its default language, and thus use English stemming.
    * wikidev : (single) (language,sr)
    * wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])
    * wikilucene : (language,en) (warmup,10)
 * [Search-Group] section: Map your server hostname to the databases that are being searched and indexed, replacing oblak with your local host name:
    * oblak : wikilucene wikidev+


 * [Index] section: Change oblak to your host name like you did for [Search-Group].
 * [Namespace-Prefix] section: Optionally, add your custom user namespaces to this section.

For other properties you can leave default values.
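For a single-host setup, the sections described above might end up looking like the fragment printed by this sketch. The [Database] and [Search-Group] lines follow the examples in the text; the [Index] line format is an assumption modelled on [Search-Group], and the function name is hypothetical. Check the result against the lsearch-global.conf shipped in the package.

```shell
#!/bin/sh
# Print a minimal single-host lsearch-global.conf fragment.
# $1 is the database name (must match $wgDBname), $2 is the host running lsearchd.
# The [Index] line format is an assumption modelled on [Search-Group].
global_conf_fragment() {
    db=$1
    host=$2
    cat <<EOF
[Database]
$db : (single) (language,en)

[Search-Group]
$host : $db

[Index]
$host : $db
EOF
}

# Example: database "wikidb" on the local machine.
global_conf_fragment wikidb "$(uname -n)"
```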

Build the JAR file
Next, if you didn't download the binary release, build LuceneSearch.jar by invoking ant.

Start the Daemon
Start the daemon with ./lsearchd. Note that you need to have java in your path.

Building the index
The simplest way to keep the index up to date is to rebuild it periodically. You can put the rebuild in a cron job, or write a script that rebuilds the index and then sleeps for some time.

To build the index, you will need an XML dump of the database (use dumpBackup.php). To be able to make the XML dump you need to set up AdminSettings.php. Then use the helper tool Importer to rebuild the index. Here is sample code (you might want to adjust the dump file path, etc.):

 php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s wikidb.xml wikidb

The Importer will import the XML dump and make an index snapshot (-s option). The index snapshot will be picked up by the lsearch daemon (which periodically checks for index snapshots) and the working copy of the index will be updated. Indexes for the lsearch daemon are stored in standard locations. If /usr/local/search/indexes is your root index path, then indexes/snapshot will contain snapshots, indexes/search will contain the current working copy of the index, indexes/update the previous working copies and index updates, etc.

And that's it. If you have correctly set up the MW extension, you should be able to search and have the index updated.
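The "rebuild, then sleep" approach could be scripted roughly as follows. This is a sketch, not part of the package: the variable names, the one-day interval and the --loop switch are all illustrative, and the script is expected to run from the MediaWiki root with LuceneSearch.jar reachable at the given path.

```shell
#!/bin/sh
# Periodic full rebuild, following the dump + Importer commands from the text.
# PHP, JAVA, JAR, DUMP, DB and INTERVAL are illustrative defaults; adjust them.
PHP="${PHP:-php}"
JAVA="${JAVA:-java}"
JAR="${JAR:-LuceneSearch.jar}"
DUMP="${DUMP:-wikidb.xml}"
DB="${DB:-wikidb}"
INTERVAL="${INTERVAL:-86400}"   # seconds between rebuilds (one day)

# Dump the current revisions, then import them and make a snapshot (-s).
rebuild_index() {
    "$PHP" maintenance/dumpBackup.php --current --quiet > "$DUMP" &&
        "$JAVA" -cp "$JAR" org.wikimedia.lsearch.importer.Importer -s "$DUMP" "$DB"
}

# Loop forever only when started with --loop; without it nothing is run,
# so the function can also be called once from a cron wrapper.
if [ "${1:-}" = "--loop" ]; then
    while :; do
        rebuild_index || echo "rebuild failed, retrying after sleep" >&2
        sleep "$INTERVAL"
    done
fi
```

With cron, call rebuild_index once per run instead of using the loop, so that a crashed script does not silently stop the rebuilds.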

Troubleshooting
Because several components are involved, getting LuceneSearch up and running can be difficult. A few notes:


 * If curl is installed in your PHP installation, it must be working in order for the script to return results. Otherwise you will get a search failure notice.
 * The database ("wikidb" in the explanations above) must match the MySQL (or other) database in which the wiki to be indexed is stored.
 * If you do get a search failure notice, check the lsearchd output. This can be found on the console where you started the daemon, assuming you started the daemon with the default log4j configuration. If you get error messages, including Java exceptions (ArrayIndexOutOfBoundsException, NullPointerException), carefully check over all your configuration settings for inconsistencies or mistakes.
 * If nothing seems to be awry in the lsearchd output, turn on MediaWiki logging as explained on How to debug.
 * Internally, the LuceneSearch extension queries the Lucene daemon via an HTTP request. You'll be able to see the URL requested by looking in the MediaWiki log output. To test that the Lucene side is working, try typing this URL in a web browser and visiting it. If you get a text list of search results, it's working. If not, this should allow you to see what's going wrong.

Incremental updates
If you feel that periodically rebuilding the index puts too much load on your database, you can use the incremental updater. It requires some additional work:
 * 1) Install the OAI extension for MediaWiki. This extension enables the incremental updater to fetch the latest articles.
 * 2) Create a new MySQL database, e.g. lsearchdb, and make sure it's a UTF-8 database. It's needed to store the article ranking data. This data is normally recalculated by the importer at each import.
 * 3) Set up the Storage section in local configuration (lsearch.conf). Supply user and admin passwords for the MySQL database; admin passwords are needed for creation of tables, etc.
 * 4) Rebuild the article rank data. You can put this in a cron job that runs once a week or once a month (article ranks typically change very slowly):
 php maintenance/dumpBackup.php --current --quiet > wikidb.xml && java -cp LuceneSearch.jar org.wikimedia.lsearch.ranks.RankBuilder wikidb.xml wikidb
 * 5) Create the initial version of the index. You can do this using the importer described in previous sections.

Finally, set up the OAI repository for the incremental updater: in global config (lsearch-global.conf), set up a mapping of dbname : host, and in local settings supply the username/password in OAI.username/password, if any. Start the incremental updater with:

 java -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater -n -d -s 600 -dt start_time wikidb

The parameters are:
 * -n - wait for notification from the indexer that articles have been successfully added
 * -d - daemonize, i.e. run updates in an infinite loop
 * -s 600 - after one round of updates sleep 10 minutes (600s)
 * -dt timestamp - default timestamp (e.g. 2007-06-17T15:00:00Z). This is the timestamp of your initial index build. You need to pass this parameter the first time you start the incremental updater, so it knows from what time to start the updates. Afterward the incremental updater will keep the timestamp of the last successful update in indexes/status/wikidb.
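One way to obtain a value for -dt is to record a UTC timestamp in the same format as the example above when the initial index is built. The sketch below only prints the resulting command rather than running it. Taking the timestamp just before the build, so that edits made during the build are not skipped, is an assumption made here and not something the package prescribes.

```shell
#!/bin/sh
# Record a UTC timestamp in the format used by -dt (e.g. 2007-06-17T15:00:00Z).
# Taking it just before the initial dump is an assumption made here so that
# edits made while the index is being built fall inside the first update run.
start_time=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# ... build the initial index with the Importer here ...

# Print the command for the first run of the incremental updater.
echo "java -cp LuceneSearch.jar org.wikimedia.lsearch.oai.IncrementalUpdater" \
     "-n -d -s 600 -dt $start_time wikidb"
```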

An alternative to steps (2), (3) and (4) is not to use ranking. You can do this by passing the --no-ranks parameter to the incremental updater, and it won't try to fetch ranks from the MySQL database. If your wiki is small and has some hundreds of pages, you probably don't need any ranking. But if you have or plan to have hundreds of thousands of pages, you will definitely benefit from ranking data.

So far we have only managed to incrementally update the index. To instruct the indexer to periodically make a snapshot of the index (which gets picked up by searchers), put this into your cron job:

 curl http://indexerhost:8321/makeSnapshots

The indexer has an HTTP command interface. Other commands include getStatus, flushAll, etc.
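A small wrapper can make these HTTP commands easier to call from cron or by hand. This is a sketch: the function and variable names are hypothetical, while the host, port and command names are the ones mentioned in the text.

```shell
#!/bin/sh
# Send a command to the indexer's HTTP interface.
# INDEXER_HOST and INDEXER_PORT default to the example values from the text.
INDEXER_HOST="${INDEXER_HOST:-indexerhost}"
INDEXER_PORT="${INDEXER_PORT:-8321}"
CURL="${CURL:-curl}"

# $1 is the command name, e.g. makeSnapshots, getStatus or flushAll.
indexer_cmd() {
    "$CURL" "http://$INDEXER_HOST:$INDEXER_PORT/$1"
}

# Example (uncomment to use from a cron script):
# indexer_cmd makeSnapshots
```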

Distributed architecture
A common setup is the many-searchers/one-indexer approach. A quick look at the global config file (lsearch-global.conf) should make it obvious how to distribute searching: you just need to add more host : dbname mappings and start lsearchd on those hosts. However, searchers need to be able to fetch and update their index, so:
 * 1) Set up rsyncd.conf and start the rsync daemon on the indexer host (there is a sample config file in SVN)
 * 2) Add the rsync path on the indexer host to the global configuration, in the Index-Path section.

Restart everything (searchers and indexer) and they should now be aware of each other; searchers will periodically check for updates of the indexes they are assigned. You need to run Importer on the indexer host, but you can run IncrementalUpdater on any host, since it learns from the global config where the indexer is.

Split index
If your index is too big to fit into memory, you might want to split it into smaller parts. There are a couple of ways to do this. The simplest is mainsplit, which splits the index into two parts: one with all articles in the main namespace, and one for everything else. You can also do an nssplit, which lets you split the index by any combination of namespaces. Finally, there is a split architecture, which randomly assigns documents to one of the N index parts. From a performance viewpoint it's best to split the index by namespaces, if possible as mainsplit, on the assumption that the user almost always wants to search only the main namespace.

If you split the index across many hosts, usage will be load-balanced: each search uses a different combination of hosts that have the required index parts. The MediaWiki Lucene Search extension doesn't need to worry about this; it just has to get the request to a host that has some part of the index.

There are examples of using these index architectures in lsearch-global.conf in the package.

Performance tuning
The default values for the Lucene indexer favor minimal memory usage and a minimal number of segments. However, indexing might be very slow because of this. The defaults are 10 buffered documents and a merge factor of 2. You might want to increase these values, for instance to 500 buffered documents and a merge factor of 20. You can do this in global configuration, e.g. wikidb : (single,true,20,500). Beware, however, that increasing the number of buffered documents will quickly eat up heap. It's best to try out different values and see which work best for your memory profile.

If you run the searcher on a multi-CPU host, you might want to adjust SearcherPool.size in the local config file. The pool size corresponds to the number of IndexSearchers per index. Set it to at least the number of CPUs, or better, the number of CPUs + 1. This prevents CPUs from blocking each other by accessing the index via a single RandomAccessFile instance.
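The "number of CPUs + 1" suggestion can be computed on the searcher host itself. The sketch below prints a line to paste into the local config file; it assumes key=value syntax in lsearch.conf, and the function name is made up for illustration.

```shell
#!/bin/sh
# Suggest a SearcherPool.size of "number of CPUs + 1", as recommended above.
searcher_pool_size() {
    echo $(($1 + 1))
}

# getconf _NPROCESSORS_ONLN reports the number of online CPUs on Linux.
cpus=$(getconf _NPROCESSORS_ONLN)
echo "SearcherPool.size=$(searcher_pool_size "$cpus")"
```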