Fulltext search engines

This is a list of full-text search engines, and of technologies that could be used to build one, for MediaWiki.

Clusterpoint Server
Clusterpoint (from "cluster" and "point") Server is a hybrid database management system that combines server-based transactional database storage, a fast full-text search engine and native clustering software in a single software platform with an open API. It is a high-performance, schema-free, XML document-oriented database server written in C++. It manages collections of data objects that are stored in native XML format, which lets applications store data in a human-readable way that matches their native data types and structures. All database content is indexed automatically and completely for fast structured, unstructured and semi-structured search. Because Clusterpoint combines several widely used but otherwise separate technologies in one platform, database developers can substantially simplify their application software.

Clusterpoint Ltd. began developing Clusterpoint Server in August 2006. The first public release was in January 2008. The latest stable production version is 2.0.3, released in 2011.

Among the features are:


 * Fast full text search performance: delivers sub second query response times that do not depend on the total database size or particular data structure
 * Ad hoc search: supports free format user-friendly Internet-style search queries
 * Real-time full text index updates: adding new documents or modifying existing ones automatically updates the database index used for full text search
 * Capacity for big data: stores, indexes and searches large databases without the performance loss characteristic of SQL-based search solutions
 * Open data storage platform: uses only industry standard XML data format at database storage level, API and in all client-server transactions
 * Data structure agnostic database: handles custom schema-less free format XML documents as database objects, tolerates different data structure objects in the same database
 * Consistent UTF-8 encoding. Non-UTF-8 data can be saved, queried, and retrieved with a special binary data type.
 * Cross-platform support: binaries are available for Linux, FreeBSD and OS X. Clusterpoint can be compiled on almost any operating system.
 * Type-rich: supports unstructured data, dates, numbers, meta-data, binary data, and more (all XML types)
 * XML objects for query results: enables direct integration in programming languages supporting XML parsing, no client software required
 * Includes rich enterprise search functionality: eliminates the need to integrate database application software with 3rd party search software
 * Flexible data ranking at search: a customizable mechanism for programming database content ranking for the best search relevance
 * Transparent cluster software architecture: no single point of failure, any cluster node can serve as master
 * Horizontal scalability: scales out from a single server to hundreds of servers per database in bigger clusters
 * Security partitioning: users, administrators and access rights are based on groups and roles, granular to specific storages and API commands
 * Centralized web GUI based database administration: lets administrators create, manage and control all Clusterpoint databases, including clustered and replicated databases

Licensing:

Clusterpoint DBMS is available for free under the Clusterpoint DBMS Free Community License for use on a single hardware server or a virtual machine.

Community non-profit projects qualify for a free Clusterpoint DBMS Non-commercial License.

There are several commercial Clusterpoint DBMS software licenses available, starting from a 2-server cluster. For licensing details, please see Clusterpoint DBMS Licensing Options.

Clusterpoint DBMS is commercially supported software, with free basic support over email and paid premium technical support services for customers using the software in production environments. For details, please see Clusterpoint DBMS Technical Support Options.

More information is here:
 * Official Clusterpoint Website
 * Clusterpoint DBMS Developer's Guide
 * Clusterpoint Developers Wiki Resource

JODA
(ioda, because joda was already taken by some other project)
 * Download http://sourceforge.net/projects/ioda/
 * See live on http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page demonstrates only the indexer and is not intended as a mirror for Wikipedia)

From a mailing list posting by Jochen Magnus: older versions of Joda have been running since 1996 as the newspaper archive for the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into service as the full-text index for Europe's biggest magazine. It is also used for the public index of the state archive of Rheinland-Pfalz (Germany).

Last year I created two mirrors of Wikipedia, one using MediaWiki for demonstration purposes and another (our public one) using our own read-only web frontend. Joda is integrated into both mirrors:
 * http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
 * http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)

At the suggestion of Magnus Manske (not related :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both were measured on a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).

Joda can erase or update entries on the fly and can handle queries with parentheses and word distance operators, like http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/

The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command line program, a TCP socket driven server and a CGI program.

Lucene-search

 * Extension:EzMwLucene
 * Lucene documentation
 * Lucene-search in MediaWiki's CVS

Lucene is a text search engine written in Java, sponsored by the Apache project.

A Lucene-based search server is now up and running experimentally to cover searches on the English Wikipedia. It is compiled with GCJ, so it is free software and does not rely on Sun's Java VM.

Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases.

To compare our options Brion did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java.

More information on this implementation can be found on the Wikitech LiveJournal and at meta:User:Brion VIBBER/MWDaemon

At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. Brion has said that he will either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend.

Here are some step-by-step instructions on how to install this kind of search on a wiki.

Solr

 * http://lucene.apache.org/solr/
 * A Lucene-based search server with XML/HTTP interfaces, caching, replication and web admin.
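
As a rough illustration of what an XML/HTTP interface means in practice, the sketch below only builds the URL of a Solr query; it does not contact a server. The `/select` handler and the `q`, `rows` and `wt` parameters are standard Solr, but the host, port and the core name `wiki` are placeholders assumed for this example, and `solr_select_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build the URL for a Solr /select request that returns XML."""
    params = urlencode({"q": query, "rows": rows, "wt": "xml"})
    return f"{base_url}/select?{params}"


# Placeholder host and core name, assumed for illustration only.
url = solr_select_url("http://localhost:8983/solr/wiki", "red wine", rows=5)
print(url)
```

Fetching that URL from a running Solr instance would return the matching documents as an XML response, which any language with an XML parser can consume.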

DBSight

 * http://www.dbsight.net/
 * J2EE application
 * Database + Lucene + Display Template, with Scheduler
 * Scalable, online demo http://search.dbsight.com holds 1.2G data, 1.7 million records
 * Work on live systems, new or old legacy systems, without changing existing code.
 * Customizable crawl, customizable indexing, customizable searching, customizable results templates

Pylucene

 * http://pylucene.osafoundation.org/
 * can be GCJ-compiled, which avoids the "non-free" Java issue above.

Plucene

 * Perl port of Lucene
 * http://search.cpan.org/perldoc?Plucene

KinoSearch

 * Search engine library
 * http://search.cpan.org/perldoc?KinoSearch

Google Search Appliance

 * Hardware box made by Google
 * http://www.google.com/enterprise/gsa/index.html
 * proprietary, closed-source, etc, etc.
 * but we may be able to receive this gratis.
 * but Kate says: "the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices." (wikitech-l Wed, 30 Mar 2005 08:08:16) gmane official archives
 * According to Google, the basic single-slot GSA only handles 500,000 documents (but can be licensed to search up to 1.5 million documents at a rate of 300 queries per minute). For perspective, here are the totals for each of the English projects hosted by Wikimedia:

{|
! Project name !! Number of pages
|-
| Wikipedia || 1,487,491
|-
| Wiktionary || 81,836
|-
| Wikibooks || 23,643
|-
| Wikiquote || 7,397
|-
| Wikisource || 26,565
|-
| Wikinews || 7,396
|-
| Wikispecies || 4,906
|-
| Commons || 100,207
|-
| Meta || 15,092
|-
| sep11 || 1,627
|-
! Grand total: !! 1,754,533
|}
 * Note that that's just English! I have not gathered stats on any of the other large languages. Considering that there are several languages with 6-digit article counts, the total number of pages hosted by Wikimedia could easily be triple or quadruple this number! I hope Google is willing to give you more than just hosting.
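
For a rough sense of scale, the page counts listed above can be checked against the two capacity figures quoted from Google. The sketch below only redoes that arithmetic; the variable names are illustrative. One detail it surfaces: the stated grand total of 1,754,533 equals the sum of the rows without sep11, and including sep11 gives 1,756,160.

```python
# Page counts for the English-language projects, as listed above.
pages = {
    "Wikipedia": 1_487_491,
    "Wiktionary": 81_836,
    "Wikibooks": 23_643,
    "Wikiquote": 7_397,
    "Wikisource": 26_565,
    "Wikinews": 7_396,
    "Wikispecies": 4_906,
    "Commons": 100_207,
    "Meta": 15_092,
    "sep11": 1_627,
}

BASE_LIMIT = 500_000        # basic single-slot GSA, per Google
LICENSED_LIMIT = 1_500_000  # maximum licensed document count, per Google

total = sum(pages.values())
total_without_sep11 = total - pages["sep11"]  # matches the stated grand total

# How far the English projects alone exceed the largest licensed capacity.
over_licensed = total - LICENSED_LIMIT

# Whether English Wikipedia on its own would still fit under that limit.
wikipedia_fits = pages["Wikipedia"] <= LICENSED_LIMIT

print(total, total_without_sep11, over_licensed, wikipedia_fits)
```

On these figures, English Wikipedia alone would fit within the licensed limit with little headroom, while the English projects together already exceed it.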

Lupy

 * http://www.divmod.org/projects/lupy

Sphinx

 * Very fast
 * Plugs directly into MySQL and Postgresql if desired.
 * Handles some major sites such as ljseek.com (> 100 million records, 120+GB database) and rss-spider.com
 * http://sphinxsearch.com/
 * Installation guide : Extension:SphinxSearch

Zend Framework
The Lucene class of the Zend Framework (http://framework.zend.com/manual/en/zend.search.html).


 * 100% PHP
 * Lucene Binary Compatible Index
 * Extension:Zend_Search_Lucene_for_MediaWiki
 * Extension:Woogle4MediaWiki

swish-e

 * Very fast
 * Easy to set up
 * Can index almost everything
 * Differential indexing capabilities
 * http://swish-e.org

Sphider
A PHP web application on top of MySQL that is easy to install and set up. It implements a web spider for indexing and a flexible search page. It will index a complete wiki and can easily replace the built-in search functionality.

ScimoreDB
Uses Lucene. Windows only. Supports data compression that reduces disk usage by a factor of 4 to 10. Scalable (up to 1024 nodes), clustered and fault tolerant. SQL, T-SQL, stored procedures, .NET provider targeting .NET4.0/.NET2.0.

Ksana Search For Wikipedia
Ksana Search For Wikipedia (剎那搜尋維基百科) is GPL.

points to consider

 * efficiency is key
 * we already have full text search, but it uses the databases and isn't efficient. Any alternative needs to be sufficiently "cheaper" in terms of hardware to make it worthwhile


 * http://www.google.com/search?q=site:en.wikipedia.org+&q=search
 * we can link to Google for free.
 * not as fresh, as Google won't update as often as Wikipedia does
 * not 100% coverage


 * do we want to be able to search across older versions / diffs?
 * if yes, this content should probably not be searched by default; that is, the default would be to search only the current content


 * can we take the index off-line when we need to update entries?
 * swish-e 2.2.0 now supports this feature, and so does Lucene


 * do we want to update the index in small chunks (e.g. if only a single file has changed)?
 * swish-e can do this but it's somewhat hackish (you would use multiple indexes), while Lucene is designed for this.
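
The chunked-update question above comes down to whether one document can be re-indexed without rebuilding everything else. A minimal sketch of that idea, using a toy in-memory inverted index, follows. It is not the Lucene or swish-e API; `TinyIndex` and its methods are hypothetical names, and tokenisation is a plain whitespace split.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_terms = {}  # doc id -> terms currently indexed for it

    def remove(self, doc_id):
        """Drop one document's postings; other documents are untouched."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        """Index or re-index a single document after an edit."""
        self.remove(doc_id)
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms

    def search(self, term):
        """Return the ids of documents containing the term, sorted."""
        return sorted(self.postings.get(term.lower(), set()))


index = TinyIndex()
index.update("Red wine", "red wine is made from dark grapes")
index.update("White wine", "white wine is made from green grapes")

# A single page is edited: only its entries are replaced.
index.update("Red wine", "red wine is a fermented drink")
print(index.search("grapes"))
print(index.search("wine"))
```

The cost of an update here is proportional to the size of the edited page, not to the size of the whole index, which is the property the question is asking about.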

outstanding question

 * if we include a summary, like Google, for each result, what should be shown?
 * the Google style: the section of the document that contains the search terms
 * some short meta description of the article
 * the first paragraph, or first N words
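
Two of the summary options above are easy to compare side by side. The sketch below shows a "first N words" summary and a simple keyword-in-context snippet in the Google style; the function names, the window size and the punctuation handling are assumptions made for illustration, not a description of any existing MediaWiki code.

```python
def first_words(text, n=30):
    """Summary made of the first n words of the text."""
    words = text.split()
    summary = " ".join(words[:n])
    return summary + " ..." if len(words) > n else summary


def keyword_in_context(text, query, window=5):
    """Snippet around the first word that matches any query term."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing words.
        if word.lower().strip(".,;:!?\"'()") in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term found in the text: fall back to the opening words.
    return first_words(text, 2 * window + 1)


text = ("Wine is an alcoholic drink. Red wine is made from dark grapes "
        "and aged in oak barrels.")
print(first_words(text, 3))
print(keyword_in_context(text, "grapes", window=2))
```

The first-words summary is the same for every query and could be precomputed per page, whereas the keyword-in-context snippet depends on the query and needs access to the page text at search time.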


 * should titles be given more weighting?
 * namely, if I search for the term "red wine", and there are two identical documents, except that one contains "red wine" as a section title while the other simply mentions it in the text ... should we return the first doc first, or should they be truly equal?
 * is text in a title more important than other text?
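
The "red wine" case above can be made concrete with a simple term-count score in which matches in a title are multiplied by a boost factor. This is a sketch under assumed names (`score`, `title_boost`) and a deliberately naive scoring rule; real engines use more elaborate ranking formulas.

```python
def score(doc, query, title_boost=2.0):
    """Count query-term matches, weighting title matches by title_boost."""
    terms = query.lower().split()

    def count(text):
        words = text.lower().split()
        return sum(words.count(term) for term in terms)

    return title_boost * count(doc["title"]) + count(doc["body"])


# Same words overall; only the position of "red wine" differs.
in_title = {"title": "Red wine", "body": "pairs well with cheese"}
in_text = {"title": "", "body": "red wine pairs well with cheese"}

for boost in (1.0, 2.0):
    print(boost, score(in_title, "red wine", boost), score(in_text, "red wine", boost))
```

With a boost of 1.0 the two documents tie, which corresponds to treating them as truly equal; any boost above 1.0 ranks the document with the term in its title first.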


 * do we want a page rank style link analysis?
 * e.g., a Wikipedia article that is linked to more often from within Wikipedia is probably more important
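
A link analysis of this kind can be sketched as a small PageRank-style power iteration over the internal link graph. The damping factor, the iteration count and the sample link graph below are assumptions for illustration, and `link_rank` is a hypothetical name.

```python
def link_rank(links, damping=0.85, iterations=50):
    """PageRank-style scores for a dict mapping page -> list of linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Made-up internal links between four articles.
links = {
    "Wine": ["Grape"],
    "Red wine": ["Wine", "Grape"],
    "White wine": ["Wine", "Grape"],
    "Grape": ["Wine"],
}
rank = link_rank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this sample graph "Wine" and "Grape" receive links from every other article and end up with the highest scores, while the two articles nobody links to share the lowest one.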


 * an alternative is length/edit-rank
 * articles with more edits, or that are longer, get boosted in the results?
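
One way to picture a length/edit-rank is as a multiplier applied to whatever text-match score the engine already produces. The sketch below uses logarithms so that very long or heavily edited articles do not swamp the text match; the weights and the function name `boosted_score` are assumptions for illustration only.

```python
import math


def boosted_score(text_score, edit_count, length_words,
                  edit_weight=0.1, length_weight=0.1):
    """Scale a text-match score by damped edit-count and length boosts."""
    boost = (1.0
             + edit_weight * math.log1p(edit_count)
             + length_weight * math.log1p(length_words))
    return text_score * boost


# Same text-match score, different article histories (made-up numbers).
stub = boosted_score(1.0, edit_count=5, length_words=200)
mature = boosted_score(1.0, edit_count=500, length_words=3000)
print(round(stub, 3), round(mature, 3))
```

Setting both weights to zero switches the boost off, which makes it straightforward to compare rankings with and without it.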

Discussion

 * Why not find an efficient database solution?
 * Because databases aren't the best solution for high volume free text search. In the same way, Excel could do tax returns, but in many cases there is much better software for cracking that nut.
 * I don't agree with that. Keeping the searching as close to the data as possible makes sense, and there are plenty of solutions out there (e.g. tsearch2) that seem efficient enough. Most of them are basically applications that have been joined to the database already, which certainly reduces a step for us.
 * tsearch2 is a PostgreSQL feature, afaik. Do you have an equivalent thing that works with MySQL?
 * MySQL surprisingly does full-text search. Many PHP-based bulletin boards make use of this. It's certainly convenient, but I don't think it's as powerful or flexible as an external engine like Lucene.
 * We already support MySQL's fulltext search. Its uselessness is mainly what inspired me to write the Lucene support :)

Thunderstone makes a product similar to Google's Search Appliance but it appears to be substantially less expensive. Another option to consider. --TidyCat 14:53, 9 December 2005 (UTC)