Requests for comment/CirrusSearch

Purpose
We'd like to replace our home-grown search service (lsearchd) with one that:
 * is more stable
 * is more actively worked on
 * has better support (there are more people we can beg for help if we are having trouble)

In the process we'll replace MWSearch2, which is customized for lsearchd, with CirrusSearch, which we are currently building for SolrCloud (aka Solr4).

Choice of Solr
There are really two big names in open source search that provide replication, sharding, and the degree of customization that we need: Solr and ElasticSearch. While ElasticSearch is really great and has more experience with sharding than Solr, we chose Solr because it has already seen use at WMF and we have more Solr expertise. In some sense there isn't much of a choice, because Lucene powers both systems: Solr, ElasticSearch, and our own lsearchd are all fancy wrappers around Lucene.

Building the Solr Configuration
Solr must be configured up front with a few things. CirrusSearch generates that configuration with a maintenance script, so that it can read $wg settings from MediaWiki and so that it can be more easily used outside of WMF.
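As a rough illustration of generating configuration from wiki settings, here is a minimal Python sketch. It is not the CirrusSearch maintenance script. The field names (id, title, text), the language-to-field-type mapping, and the function name are all assumptions made for the example, and the field types are assumed to be defined elsewhere in the schema. Only the general shape of a Solr schema field element (name, type, indexed, stored) is taken from Solr itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a wiki language code to a Solr field type name.
# The field types themselves are assumed to be declared elsewhere in the schema.
FIELD_TYPE_BY_LANGUAGE = {"en": "text_en", "de": "text_de"}


def build_schema_fields(settings):
    """Return a <fields> XML fragment derived from MediaWiki-style settings.

    `settings` is a plain dict standing in for the wiki's $wg variables.
    """
    language = settings.get("wgLanguageCode")
    text_type = FIELD_TYPE_BY_LANGUAGE.get(language, "text_general")

    fields = ET.Element("fields")
    # Illustrative fields only; the real schema may look quite different.
    ET.SubElement(fields, "field", name="id", type="string",
                  indexed="true", stored="true")
    ET.SubElement(fields, "field", name="title", type=text_type,
                  indexed="true", stored="true")
    ET.SubElement(fields, "field", name="text", type=text_type,
                  indexed="true", stored="false")
    return ET.tostring(fields, encoding="unicode")


if __name__ == "__main__":
    print(build_schema_fields({"wgLanguageCode": "en"}))
```

The point of the sketch is only that the wiki's own settings drive the output, so another wiki can rerun the generator instead of editing Solr files by hand.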

Terms
These terms mostly come from Solr. Some of them are explicitly defined; others are only mentioned in passing, and we're inferring their meaning.


 * Core       :All data that a single Solr instance stores about a certain kind of document, including stored fields and all of the indexes that it uses to make searching quick.
 * Collection :A group of cores running on multiple instances which can be queried just like a single core.
 * Cluster    :A group of Solr instances that replicate and shard collections between each other.
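The relationship between these three terms can be sketched as a toy Python model. This is not how Solr is implemented: the class names simply mirror the terms above, the instance names are made up, the sharding rule is a stand-in for Solr's own document routing, and replication is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Everything one Solr instance stores for one kind of document."""
    instance: str                                   # hypothetical instance name
    documents: dict = field(default_factory=dict)   # document id -> text

    def search(self, term):
        return sorted(d for d, text in self.documents.items() if term in text)


@dataclass
class Collection:
    """A group of cores on multiple instances, queried like a single core."""
    name: str
    cores: list

    def add(self, doc_id, text):
        # Toy sharding rule (an assumption, not Solr's routing):
        # choose a core from the bytes of the document id.
        shard = sum(doc_id.encode()) % len(self.cores)
        self.cores[shard].documents[doc_id] = text

    def search(self, term):
        # Fan the query out to every core and merge the results.
        hits = []
        for core in self.cores:
            hits.extend(core.search(term))
        return sorted(hits)


@dataclass
class Cluster:
    """A group of Solr instances that host collections between them."""
    instances: list
    collections: dict = field(default_factory=dict)


def build_example():
    cores = [Core("solr1"), Core("solr2")]
    pages = Collection("pages", cores)
    pages.add("Main_Page", "welcome to the wiki")
    pages.add("Help", "help for the wiki search")
    pages.add("Sandbox", "scratch space")
    return Cluster(["solr1", "solr2"], {"pages": pages})


if __name__ == "__main__":
    cluster = build_example()
    print(cluster.collections["pages"].search("wiki"))
```

The caller searches the collection exactly as it would search a single core, without knowing which instance holds which document.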