Fulltext search engines

This is a list of Fulltext Search Engines, and technologies that could potentially be used to build them, for MediaWiki.

JODA
(ioda, because joda was already taken by some other project)
 * Download http://sourceforge.net/projects/ioda/
 * See it live at http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page only demonstrates the indexer and is not intended as a mirror of Wikipedia)

From a mailing list posting by Jochen Magnus: Older versions of Joda have been in use since 1996 as the newspaper archive for the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into action as the full-text index for Europe's biggest magazine. It is also in use for the public index of the state archive of Rheinland-Pfalz (Germany).

Last year I created two mirrors of Wikipedia: one using MediaWiki for demonstration purposes, and another (our public one) using our own read-only web frontend. Joda is integrated into both mirrors:
 * http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
 * http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)

At the suggestion of Magnus Manske (not related :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both figures were measured on a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).

Joda can erase or update entries on the fly and can handle queries with parentheses and word distance operators, such as http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/
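To illustrate what a word distance operator does, here is a minimal Python sketch. It is not Joda's implementation or query syntax: the positional index layout and the names `build_positional_index` and `near` are assumptions made for this example only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def near(index, left, right, max_distance):
    """Return ids of documents where both words occur within max_distance words."""
    left_docs = index.get(left.lower(), {})
    right_docs = index.get(right.lower(), {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in left_docs.keys() & right_docs.keys():
        if any(abs(a - b) <= max_distance
               for a in left_docs[doc_id]
               for b in right_docs[doc_id]):
            hits.add(doc_id)
    return hits


docs = {
    1: "Albert Einstein developed the theory of relativity",
    2: "Albert Schweitzer never met a physicist named Einstein",
}
index = build_positional_index(docs)
print(near(index, "Albert", "Einstein", 1))
print(near(index, "Albert", "Einstein", 10))
```

With a distance of 1 only the first document matches; widening the distance to 10 lets the second document match as well.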

The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command line program, a TCP socket driven server and a CGI program.

Lucene-search

 * Extension:EzMwLucene
 * Lucene documentation
 * Lucene-search in MediaWiki's CVS
 * Brion's announcement that Lucene search is now being used

Lucene is a text search engine written in Java, sponsored by the Apache project.

A Lucene-based search server is now up and running experimentally to cover searches on the English Wikipedia. It is compiled with GCJ, so it is free software and does not rely on Sun's Java VM.

Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases.

To compare our options Brion did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java.

More information on this implementation can be found on the Wikitech LiveJournal and at meta:User:Brion VIBBER/MWDaemon

At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. Brion has said that he will either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend.

Here are some step-by-step instructions on how to install this kind of search on a wiki.

Solr

 * http://lucene.apache.org/solr/
 * A lucene based search server with XML/HTTP interfaces, caching, replication, web admin.

DBSight

 * http://www.dbsight.net/
 * J2EE application
 * Database + Lucene + Display Template, with Scheduler
 * Scalable, online demo http://search.dbsight.com holds 1.2G data, 1.7 million records
 * Work on live systems, new or old legacy systems, without changing existing code.
 * Customizable crawl, customizable indexing, customizable searching, customizable results templates

Pylucene

 * http://pylucene.osafoundation.org/
 * can be GCJ-compiled which avoids the "non free" java issue above.

Plucene

 * perl port of lucene
 * http://search.cpan.org/perldoc?Plucene

Google Search Appliance

 * Hardware box made by Google
 * http://www.google.com/enterprise/gsa/index.html
 * proprietary, closed-source, etc, etc.
 * but we may be able to receive one gratis.
 * but Kate says: "the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices." (wikitech-l Wed, 30 Mar 2005 08:08:16) gmane official archives
 * According to Google, the basic single-slot GSA only handles 500,000 documents (but can be licensed to search up to 1.5 million documents at a rate of 300 queries per minute). For perspective, here are the totals for each of the English projects hosted by Wikimedia:

{|
! Project name !! Number of pages
|-
| Wikipedia || 1,487,491
|-
| Wiktionary || 81,836
|-
| Wikibooks || 23,643
|-
| Wikiquote || 7,397
|-
| Wikisource || 26,565
|-
| Wikinews || 7,396
|-
| Wikispecies || 4,906
|-
| Commons || 100,207
|-
| Meta || 15,092
|-
| sep11 || 1,627
|-
! Grand total: !! 1,754,533
|}
 * Note that that's just English! I have not gathered stats on any of the other large languages. Considering that there are several languages with 6-digit figures for articles, the total number of pages hosted by Wikimedia could easily be triple or quadruple this number! I hope Google is willing to give you more than just hosting.
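The comparison between the page counts and the GSA limits quoted above can be checked with a few lines of Python. All numbers are copied from this section; the constant and function names are chosen for this sketch only, and the tripled figure is just the rough estimate mentioned above, not a measured count.

```python
BASE_LIMIT = 500_000          # documents on the basic single-slot GSA
LICENSED_LIMIT = 1_500_000    # documents with the extended license
QUERIES_PER_MINUTE = 300      # query rate quoted for the licensed setup

english_wikipedia = 1_487_491  # English Wikipedia pages, as listed above
english_total = 1_754_533      # grand total for English projects, as stated above


def shortfall(pages, limit):
    """Number of pages that would not fit under the limit (0 if all fit)."""
    return max(0, pages - limit)


cases = [
    ("English Wikipedia", english_wikipedia),
    ("All English projects", english_total),
    ("Tripled estimate", 3 * english_total),
]
for label, pages in cases:
    print(f"{label}: {pages:,} pages, "
          f"{shortfall(pages, LICENSED_LIMIT):,} over the licensed limit")

print(f"Query ceiling: {QUERIES_PER_MINUTE / 60:.0f} per second")
```

The English Wikipedia alone would just fit under the licensed limit, while the English projects together already exceed it by roughly a quarter of a million pages.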

Lupy

 * http://www.divmod.org/projects/lupy

Sphinx

 * Very fast
 * Plugs directly into MySQL and Postgresql if desired.
 * Handles some major sites such as ljseek.com (> 100 million records, 120+GB database) and rss-spider.com
 * http://sphinxsearch.com/
 * Installation guide: http://www.mediawiki.org/wiki/Extension:SphinxSearch

Zend Framework
Lucene Class of the Zend Framework (http://framework.zend.com/manual/en/zend.search.html).


 * 100% PHP
 * Lucene Binary Compatible Index
 * Extension:Woogle4MediaWiki

swish-e

 * Very fast
 * Easy to set up
 * Can index almost everything
 * Differential indexing capabilities
 * http://swish-e.org

Sphider
An easy-to-install PHP web application on top of MySQL that implements a web spider for indexing and a flexible search page. It will index a complete wiki and can easily replace the built-in search functionality.

Ksana Search For Wikipedia
Ksana Search For Wikipedia (剎那搜尋維基百科) is GPL.

points to consider

 * efficiency is key
 * we already have full text search, but it uses the databases and isn't efficient. any alternative needs to be sufficiently "cheaper" in terms of hardware to make it worthwhile


 * http://www.google.com/search?q=site:en.wikipedia.org+&q=search
 * we can link to google for free.
 * not as fresh, as google won't update as often as wikipedia does
 * not 100% coverage
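A link to a Google search restricted to the wiki could be generated as in the following sketch. The helper name `google_site_search_url` is hypothetical, and the sketch assumes that a single `q` parameter carrying the `site:` restriction together with the search terms is acceptable.

```python
from urllib.parse import urlencode


def google_site_search_url(terms, site="en.wikipedia.org"):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    query = f"site:{site} {terms}"
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return "http://www.google.com/search?" + urlencode({"q": query})


print(google_site_search_url("red wine"))
```

The freshness and coverage caveats listed above still apply to any such link.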


 * do we want to be able to search across older versions / diffs?
 * if yes, this content should probably not be searched by default; the default should be to search only the current content


 * can we take the index off-line when we need to update entries?
 * swish-e 2.2.0 now supports this feature, lucene as well


 * do we want to update the index in small chunks (e.g. if only a single file has changed)?
 * swish-e can do this but it's somewhat hackish (you would use multiple indexes) while Lucene is designed for this.
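What "updating the index in small chunks" means can be shown with a toy in-memory inverted index in Python. This is only an illustration of the idea; it does not describe how swish-e or Lucene store or update their indexes, and the class and method names are made up for the example.

```python
class InvertedIndex:
    """Toy inverted index that can re-index one document at a time."""

    def __init__(self):
        self.postings = {}   # word -> set of ids of documents containing it
        self.doc_words = {}  # doc id -> set of words currently indexed for it

    def update(self, doc_id, text):
        """Index a new document or replace the indexed text of an existing one."""
        new_words = set(text.lower().split())
        old_words = self.doc_words.get(doc_id, set())
        # Drop postings for words that no longer occur in the document.
        for word in old_words - new_words:
            ids = self.postings[word]
            ids.discard(doc_id)
            if not ids:
                del self.postings[word]
        # Add postings for words that are new in this revision.
        for word in new_words - old_words:
            self.postings.setdefault(word, set()).add(doc_id)
        self.doc_words[doc_id] = new_words

    def remove(self, doc_id):
        """Remove a document from the index entirely."""
        self.update(doc_id, "")
        self.doc_words.pop(doc_id, None)

    def search(self, word):
        """Return the ids of all documents containing the word."""
        return set(self.postings.get(word.lower(), set()))


index = InvertedIndex()
index.update("Wine", "red wine is made from dark grapes")
index.update("Beer", "beer is brewed from malted grain")
# A single edit only touches the postings of the words that changed.
index.update("Wine", "red wine is fermented from dark grapes")
print(index.search("made"), index.search("fermented"), index.search("from"))
```

Only the words that differ between the old and new revision are touched, so the cost of an edit depends on the size of the change rather than the size of the whole index.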

outstanding question

 * if we include a summary, like Google, for each result, what should be shown?
 * the google style : the section of the document that contains the search terms
 * some short meta description of the article
 * the first paragraph, or first N words
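Two of the summary styles listed above, the Google-style excerpt around the search terms and the "first N words" variant, can be sketched in Python as follows. The function names, the window size and the punctuation handling are assumptions for illustration, not a proposal for the actual implementation.

```python
def first_words(text, n=30):
    """Summary made of the first n words of the text."""
    words = text.split()
    summary = " ".join(words[:n])
    return summary + (" ..." if len(words) > n else "")


def keyword_in_context(text, terms, window=5):
    """Excerpt of `window` words on each side of the first matching term."""
    words = text.split()
    wanted = {term.lower() for term in terms}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing words.
        if word.lower().strip(".,;:!?()\"'") in wanted:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term found: fall back to the beginning of the text.
    return first_words(text, 2 * window + 1)


text = ("Wine is an alcoholic drink. Most red wine is made from dark grapes, "
        "while white wine is not.")
print(first_words(text, 5))
print(keyword_in_context(text, ["grapes"], window=2))
```

The excerpt style needs the article text at query time, whereas the first-words style can be precomputed once per revision.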


 * should titles be given more weighting?
 * namely, if I search for the term "red wine", and there are two identical documents, except that one contains "red wine" as a section title while the other simply mentions it in the text ... should we return the first doc first, or should they be truly equal?
 * is text in a title more important than other text
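The "red wine" question can be made concrete with a small scoring sketch in Python. The document layout (a dict with `headings` and `body`), the function names and the boost value of 2.0 are assumptions made for this example.

```python
def count_phrase(text, phrase):
    """Case-insensitive count of non-overlapping occurrences of the phrase."""
    return text.lower().count(phrase.lower())


def score(doc, phrase, title_boost=2.0):
    """Score a document, weighting hits in headings by title_boost."""
    heading_hits = sum(count_phrase(h, phrase) for h in doc["headings"])
    body_hits = count_phrase(doc["body"], phrase)
    return title_boost * heading_hits + body_hits


in_heading = {"headings": ["Red wine"], "body": "Served at room temperature."}
in_body = {"headings": ["Varieties"], "body": "Red wine is served at room temperature."}

print(score(in_heading, "red wine"), score(in_body, "red wine"))
print(score(in_heading, "red wine", title_boost=1.0),
      score(in_body, "red wine", title_boost=1.0))
```

With a boost of 1.0 the two documents tie; any boost above 1.0 ranks the document with the matching section title first.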


 * do we want a page rank style link analysis?
 * eg, a wikipedia article that is linked to more often within the context of wikipedia suggests it is more important
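A link analysis of this kind could look like the following simplified PageRank iteration in Python. The link graph is invented for the example, and the damping factor and iteration count are conventional default choices, not values taken from this discussion.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical internal link graph between four articles.
links = {
    "Wine": ["Grape", "France"],
    "Grape": ["Wine"],
    "France": ["Wine"],
    "Merlot": ["Wine", "Grape"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.3f}")
```

In this graph "Wine" ends up with the highest rank because every other article links to it, while "Merlot", which nothing links to, ends up lowest.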


 * an alternative is length/edit-rank
 * articles with more edits, or that are longer, get boosted in the results?
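One way such a length/edit boost might be applied is sketched below in Python. The logarithmic damping and both weights are arbitrary assumptions for illustration; nothing in this discussion fixes a formula.

```python
import math


def boosted_score(text_score, edit_count, length_words,
                  edit_weight=0.1, length_weight=0.05):
    """Scale a text-match score by damped edit-count and length factors."""
    boost = (1.0
             + edit_weight * math.log1p(edit_count)
             + length_weight * math.log1p(length_words))
    return text_score * boost


# Two articles with the same text-match score but different histories.
stub = boosted_score(1.0, edit_count=3, length_words=200)
mature = boosted_score(1.0, edit_count=500, length_words=3000)
print(f"stub: {stub:.3f}, mature article: {mature:.3f}")
```

The logarithm keeps very long or heavily edited articles from swamping the text-match score.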

Discussion

 * Why not find an efficient database solution?
 * Because databases aren't the best solution for high volume free text search. In the same way Excel could do tax returns, but there is much better software for cracking that nut in many cases.
 * I don't agree with that. Keeping the searching as close to the data as possible makes sense, and there are plenty of solutions out there (e.g. tsearch2) that seem efficient enough. Most of them are basically applications that have been joined to the database already, which certainly reduces a step for us.
 * tsearch2 is a PostgreSQL feature, afaik. do you have an equivalent thing that works with MySQL?
 * MySQL surprisingly does full-text search. Many PHP-based bulletin boards make use of this. It's certainly convenient, but I don't think it's as powerful or flexible as an external engine like Lucene.
 * We already support MySQL's fulltext search. Its uselessness is mainly what inspired me to write the Lucene support :)

Thunderstone makes a product similar to Google's Search Appliance but it appears to be substantially less expensive. Another option to consider. --TidyCat 14:53, 9 December 2005 (UTC)