Fulltext search engines

List of fulltext search engines, and technologies that could potentially be used to build them, for MediaWiki.

JODA

 * Download http://sourceforge.net/projects/ioda/
 * See it live at http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page demonstrates only the indexer and is not intended as a mirror of Wikipedia)

From a mailing list posting of Jochen Magnus: older versions of Joda have been working since 1996 as the newspaper archive of the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into action as the full-text index for Europe's biggest magazine. It is also in use for the public index of the state archive of Rheinland-Pfalz (Germany).

Last year I created two mirrors of Wikipedia, one using MediaWiki for demonstration purposes and another - our public one - using our own read-only web frontend. Joda is integrated into both mirrors:
 * http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
 * http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)

At the suggestion of Magnus Manske (not related :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both on a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).

Joda can erase or update entries on the fly and can handle queries with parentheses and word-distance operators, like http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/
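As an illustration, a client could pass such a query string to the public web interface by percent-encoding it. The URL scheme below is an assumption for illustration only; the exact query-parameter handling of the Joda CGI may differ.

```python
# Sketch: building a URL for a Joda-style query. The endpoint layout is
# an assumption; only the percent-encoding step is standard.
from urllib.parse import quote

def joda_query_url(base, query):
    """Percent-encode a Joda query (operators, parentheses, wildcards)."""
    return base + "?" + quote(query)

url = joda_query_url("http://lexikon.rhein-zeitung.de/",
                     "((Albert OR Alfred) AND.1 Einstein) NEAR Quant*")
print(url)
```

Note that the parentheses, spaces and the `*` wildcard all need encoding to travel safely in a URL.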

The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command-line program, a TCP-socket-driven server and a CGI.
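The TCP-socket-driven server variant implies a simple connect/send/read loop on the client side. The wire format below (one newline-terminated query per request) is an assumption for illustration; the real protocol is documented with the Joda sources, and only the general shape carries over.

```python
# Sketch of a client for a Joda-style TCP search server.
# ASSUMPTION: queries are sent as one newline-terminated line; the real
# Joda protocol may differ.
import socket

def frame_query(query):
    """Encode a query as a single newline-terminated line (assumed framing)."""
    return query.encode("utf-8") + b"\n"

def search(host, port, query, timeout=5.0):
    """Send a query to the server and return the raw textual reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(frame_query(query))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(frame_query("Einstein AND Quant*"))
```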

Lucene-search

 * http://lucene.apache.org/java/docs/
 * concerns about requiring Java, which some consider to be "not free"

From a mailing list posting of Kate Turner: look at the "lucene-search" module in CVS (http://cvs.sourceforge.net/viewcvs.py/wikipedia/lucene-search/). This is a (mostly) complete and functional search server for MediaWiki based on Lucene (the Java version). I'm not sure how similar the Java version is to the other ports, but you may be able to port the relevant bits without too much effort.

An experimental test of this on the live site showed that it was able to handle our search load on a single 3.0 GHz P4 with very minimal CPU usage, as long as the typo-suggestion feature isn't enabled (that feature uses several slow searches to produce its result; it could almost certainly be reimplemented in a much more efficient manner).
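The typo-suggestion idea can be sketched with Python's standard difflib, which scans the whole index vocabulary for terms similar to the query term. This is not how lucene-search implements it, and the brute-force scan is exactly why such a feature is slow; an efficient reimplementation would index the vocabulary (e.g. by n-grams) instead. The vocabulary below is made up for illustration.

```python
# Sketch: "did you mean" via a brute-force similarity scan over the
# vocabulary. Illustrative only; slow for a real index with millions
# of distinct terms.
import difflib

vocabulary = ["einstein", "eisenstein", "england", "quantum", "gravitation"]

def suggest(term, vocab, n=3, cutoff=0.75):
    """Return up to n vocabulary terms close to the (possibly mistyped) term."""
    return difflib.get_close_matches(term.lower(), vocab, n=n, cutoff=cutoff)

print(suggest("einstien", vocabulary))
```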

Pylucene

 * http://pylucene.osafoundation.org/
 * can be GCJ-compiled, which avoids the "non-free" Java issue above.

Plucene

 * Perl port of Lucene
 * http://search.cpan.org/perldoc?Plucene

Google Search Appliance

 * Hardware box made by Google
 * http://www.google.com/enterprise/gsa/index.html
 * proprietary, closed-source, etc, etc.
 * but Wikimedia may be able to receive one gratis.
 * but Kate says: "the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices." (wikitech-l, Wed, 30 Mar 2005 08:08:16; Gmane official archives)

Lupy

 * http://www.divmod.org/Home/Projects/Lupy/

Points to consider

 * efficiency is key
 * we already have full-text search, but it uses the database and isn't efficient. Any alternative needs to be sufficiently "cheaper" in terms of hardware to make it worthwhile
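The efficiency gap can be made concrete with a toy inverted index, the core data structure a dedicated search engine builds: answering a query becomes a dictionary lookup per term instead of a scan over every row, which is what a plain SQL text match amounts to. The documents below are made-up examples.

```python
# A minimal inverted index: map each word to the set of documents
# containing it, so lookups avoid scanning every document.
from collections import defaultdict

def build_index(docs):
    """docs maps document id -> text; returns word -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "Albert Einstein developed the theory of relativity",
    2: "Red wine is made from dark grapes",
}
index = build_index(docs)
print(index["einstein"])
```

An AND query is then just a set intersection of the posting sets for each term.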


 * http://www.google.com/search?q=site:en.wikipedia.org+&q=search
 * we can link to Google for free.
 * not as fresh, as Google won't update its index as often as Wikipedia changes
 * not 100% coverage


 * do we want to be able to search across older versions / diffs?
 * if yes, this content should probably not be searched by default; i.e. the default is to search only the current content

Outstanding questions

 * if we include a summary, like Google, for each result, what should be shown?
 * the Google style: the section of the document that contains the search terms
 * some short meta description of the article
 * the first paragraph, or first N words
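The first option (the "Google style") can be sketched as a text window around the first occurrence of a search term. The window width and example text below are arbitrary illustrations.

```python
# Sketch: extract a snippet centred on the first hit of a search term.
def snippet(text, term, width=40):
    """Return a window of text around the first occurrence of term."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return text[: 2 * width]  # fall back to the opening of the article
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

text = ("Wine is an alcoholic drink. Red wine is made from dark grapes, "
        "while white wine is usually made from green grapes.")
result = snippet(text, "red wine")
print(result)
```

A real implementation would also expand to word boundaries and highlight the matched terms.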


 * should titles be given more weighting?
 * namely, if I search for the term "red wine", and there are two identical documents, except one contains "red wine" as a section title while the other simply mentions it in the text ... should we return the first doc first, or should they be truly equal?
 * is text in a title more important than other text?
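Title weighting amounts to multiplying hits in the title by a boost factor when scoring. The factor of 2.0 and the example documents below are arbitrary illustrations, not a recommendation.

```python
# Sketch: score a document higher when the term appears in its title.
def score(term, title, body, title_boost=2.0):
    """Count term occurrences, weighting title hits by title_boost."""
    t = title.lower().count(term.lower()) * title_boost
    b = body.lower().count(term.lower()) * 1.0
    return t + b

doc_a = ("Red wine", "Notes on fermentation and grapes.")
doc_b = ("Fermentation", "Red wine is made from dark grapes.")
print(score("red wine", *doc_a), score("red wine", *doc_b))
```

With a boost of 1.0 the two documents in the "red wine" question above would rank equally; any boost above 1.0 answers the question in favour of the title match.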


 * do we want a page rank style link analysis?
 * e.g. a Wikipedia article that is linked to more often from within Wikipedia is presumably more important
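A toy version of such a link analysis is the standard PageRank iteration over the internal link graph. The tiny graph below is invented for illustration, and a real implementation would need to handle dangling pages (no out-links) and much larger graphs.

```python
# Sketch: PageRank-style iteration over internal wiki links.
def pagerank(links, iterations=50, damping=0.85):
    """links maps each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dangling pages)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

links = {
    "Physics": ["Einstein"],
    "Relativity": ["Einstein", "Physics"],
    "Einstein": ["Physics"],
}
rank = pagerank(links)
print(rank)
```

Here "Relativity" receives no links and ends up with the lowest rank, which is the behaviour the bullet above describes.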


 * an alternative is length/edit-rank
 * articles with more edits, or that are longer, get boosted in the results?
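Such a boost would typically be damped, e.g. logarithmically, so a 100-edit article doesn't score 100 times a single-edit one. The formula and weights below are arbitrary illustrations of that idea, not a proposal for specific values.

```python
# Sketch: damp edit count and length with a log before boosting.
import math

def boost(base_score, num_edits, length_bytes):
    """Boost a relevance score by log-damped edit count and length."""
    return base_score * (1 + 0.1 * math.log1p(num_edits)
                           + 0.05 * math.log1p(length_bytes / 1024))

print(boost(1.0, 100, 50_000), boost(1.0, 2, 500))
```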

Discussion

 * Why not find an efficient database solution?
 * Because databases aren't the best solution for high-volume free-text search. In the same way, Excel could do tax returns, but there is much better software for that job.
 * I don't agree with that. Keeping the searching as close to the data as possible makes sense, and there are plenty of solutions out there (e.g. tsearch2) that seem efficient enough. Most of them are basically applications that have already been joined to the database, which certainly saves us a step.