Requests for comment/CirrusSearch

Purpose
We'd like to replace our home-grown search service (lsearchd) with one that:
 * is more stable
 * is more actively worked on
 * has better support available (there are more people we can beg for help if we are having trouble)

In the process we'll replace MWSearch2, which is customized for lsearchd, with CirrusSearch, which we are currently building to be customized for SolrCloud (part of Solr 4).

We'll get a search backend that we can scale by starting new instances and asking them to join a particular search cluster (well, it is almost that easy).

Choice of Solr
There are really two big names in open source search that provide replication, sharding, and the degree of customization we need: Solr and ElasticSearch. While ElasticSearch is really great and has more experience with sharding than Solr, we chose Solr because it has already seen use at WMF and we have more Solr expertise. In some sense there isn't much of a choice anyway: Lucene powers both systems, so Solr, ElasticSearch, and our own lsearchd are all fancy wrappers around Lucene.

Building the Solr Configuration
Solr must be configured up front with a few things. CirrusSearch generates the configuration using a maintenance script so that it can read $wg fields from MediaWiki and so it can be more easily used outside of WMF.

What Must We Configure?

 * About fields in documents:
   * Should the field be stored?
   * How should it be analyzed on import?
   * How should it be queried?
 * About the URLs served by Solr for each collection of documents:
   * Where should the search URL be and what are the defaults? (we just use the standard configuration, but it must be explicitly included)
   * Should we turn on the helpful admin console? (yes)
   * Should we turn on replication? (yes, because that is how SolrCloud works)
   * Should we turn on analysis? (yes, because it is useful for debugging)
   * Should we turn on some random other stuff that is required for SolrCloud but that we probably wouldn't need otherwise? (yes!)
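To make the configuration step concrete, here's a rough sketch of what the generation amounts to. The real generator is a PHP maintenance script reading $wg fields; this Python sketch, with made-up field names and types, just shows the idea of turning per-field answers (stored? how analyzed?) into schema.xml entries.

```python
# Sketch only: render Solr 4 schema.xml <field/> entries from per-field
# choices. CirrusSearch's real generator is a PHP maintenance script that
# reads $wg settings; these field names and types are illustrative.

def render_field(name, field_type, stored, indexed=True):
    """Render one <field/> element for schema.xml."""
    return ('<field name="%s" type="%s" indexed="%s" stored="%s" />'
            % (name, field_type, str(indexed).lower(), str(stored).lower()))

def render_schema(fields):
    """Wrap the rendered fields in a <fields> block."""
    lines = ['<fields>']
    lines += ['  ' + render_field(**f) for f in fields]
    lines.append('</fields>')
    return '\n'.join(lines)

# Hypothetical fields a wiki page document might carry.
fields = [
    {'name': 'title', 'field_type': 'text_general', 'stored': True},
    {'name': 'text', 'field_type': 'text_general', 'stored': True},
]
print(render_schema(fields))
```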

Can We Rebuild the Configuration on the Fly?
Yes, but:
 * I (manybubbles) am having trouble doing that right now. I know it can be done because I've done it in past lives with previous versions of Solr.
 * We'll have to be careful not to invalidate anything already indexed, because rebuilding the search index takes a while. This mostly means that if you need to change a field, the procedure is:
   * Create a new field built the way you need it
   * Rebuild the whole search index with both the old and new fields
   * Start using the new field
   * Blow away the old field
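The procedure above can be pictured as a transform over each document during the rebuild. This Python sketch uses hypothetical field names; the point is just that the old and new fields coexist until every document has been rebuilt.

```python
# Sketch of the field-migration step: while reindexing, populate the new
# field alongside the old one so searches can switch over before the old
# field is blown away. The field names here are hypothetical.

def reindex_doc(doc, old_field, new_field, convert):
    """Return a copy of doc with new_field populated from old_field."""
    out = dict(doc)
    if old_field in out:
        out[new_field] = convert(out[old_field])
    return out

doc = {'id': '42', 'text_old': 'Some Wiki Text'}
migrated = reindex_doc(doc, 'text_old', 'text_new', str.lower)
# Both fields coexist until the rebuild finishes; then queries move to
# text_new and text_old can be dropped.
```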

Getting Data Into the Index
CirrusSearch offers three ways to load pages into the search index.

In Process
With the flick of a global you can engage in-process updates to the search index that happen right after the user makes an edit. With SolrCloud's soft auto commits and push updates, these should be replicated and searchable within two seconds. What that does to our cache hit rate remains to be seen, but some impact is certainly possible.
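As a rough illustration of what such an update looks like on the wire, this sketch builds (but doesn't send) a Solr JSON update request with a commitWithin of two seconds. The URL, core name, and document fields are assumptions for illustration; the actual CirrusSearch update path may differ.

```python
import json
from urllib.parse import urlencode

# Sketch: build (but don't send) the Solr update an in-process hook could
# push right after an edit. A commitWithin of ~2000 ms lines up with the
# "searchable in two seconds" goal; the URL and fields are illustrative.

def build_update_request(base_url, page):
    params = urlencode({'commitWithin': 2000, 'wt': 'json'})
    url = '%s/update?%s' % (base_url, params)
    body = json.dumps([page])  # Solr's JSON update format takes a list of docs
    return url, body

url, body = build_update_request('http://localhost:8983/solr/wiki',
                                 {'id': '12345', 'title': 'Example page'})
```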

Bootstrapping
CirrusSearch has a maintenance script that shoves all pages into the search index. It works pretty much the same way as the maintenance script that rebuilds the MySQL full-text index from scratch.
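The bootstrap pass is essentially "walk every page and index it, in batches". This sketch shows only the batching shape; fetching pages from MediaWiki and posting them to Solr are elided, and the batch size is arbitrary.

```python
# Sketch of the bootstrapping shape: walk every page id in batches so each
# update request stays a manageable size. Reading pages from MediaWiki and
# posting to Solr are elided; batch size 4 is arbitrary for the example.

def batched(ids, size):
    """Yield successive batches of ids."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

all_page_ids = list(range(1, 11))  # pretend the wiki has ten pages
batches = list(batched(all_page_ids, 4))
# batches -> [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```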

Rebuilding Specific Time Windows
If you have to turn off in-process indexing for any reason, you'll have to rebuild the gap in time. The same maintenance script used for bootstrapping accepts a time window for document production, but the query it needs to identify those documents is less efficient. It should still be a fair sight better than rebuilding the whole index, though.
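The window rebuild is the same walk restricted to pages touched during the gap. A minimal sketch, assuming a list of pages with last-touched timestamps (the real script would query MediaWiki for these, which is the less efficient part):

```python
from datetime import datetime

# Sketch: select only the pages touched during the indexing outage instead
# of walking every page. The pages and timestamps are made up.

def pages_in_window(pages, start, end):
    """Return pages whose last edit falls inside [start, end)."""
    return [p for p in pages if start <= p['touched'] < end]

pages = [
    {'id': 1, 'touched': datetime(2013, 7, 1, 9, 0)},
    {'id': 2, 'touched': datetime(2013, 7, 1, 12, 0)},
    {'id': 3, 'touched': datetime(2013, 7, 2, 8, 0)},
]
gap = pages_in_window(pages,
                      datetime(2013, 7, 1, 10, 0),
                      datetime(2013, 7, 2, 0, 0))
# Only page 2 was edited during the outage window.
```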

Testing
Ho boy, testing search is always hard because you can't always expect the same results to come back as before.

Indexing
We'll need to make sure indexing is fast enough and light enough not to get us into trouble. If Solr goes down, we don't want to barf at our users, just into our logs.

Prefix Search
We should expect matches similar to what we have now, so we can test this with no problem by loading a wiki in labs and just trying it.
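Since prefix search is just another query, "loading a wiki in labs and trying it" amounts to firing requests like the one this sketch builds. The core name, field name, and parameters are assumptions for illustration, not CirrusSearch's actual query shape.

```python
from urllib.parse import urlencode

# Sketch: the kind of URL a title prefix search might hit. The core name
# ('wiki'), field name ('title_prefix'), and parameters are assumptions.

def prefix_query_url(base_url, prefix, rows=10):
    params = urlencode({'q': 'title_prefix:%s' % prefix,
                        'rows': rows, 'wt': 'json'})
    return '%s/select?%s' % (base_url, params)

url = prefix_query_url('http://localhost:8983/solr/wiki', 'Albert')
```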

Full Text Search
We can play with full text search, but we really shouldn't expect the same results: we're not making an effort to match the current behavior exactly because, so far as we know, the current behavior isn't really what our users want.

Non English
Both prefix and full text have to be sensible for:
 * Non-English but still space-delimited languages
 * Non-space-delimited languages (Japanese, Chinese, Thai, etc.)
 * Right-to-left languages (Hebrew, Arabic)

How We're Going to Test Them All
Most stuff we'll test in labs but for "are these results better" kinds of questions we'll have to deploy it carefully and see what happens.

Performance Testing
???

Phased Roll Out
A phased roll out is how we're going to have to handle some of the "are these results better" questions. If people don't like the results, we'd like them to complain.
 * 1) Deploy to mediawiki.org
 * 2) Deploy to one or two non-English wikis that are part of WMF and have been complaining about search.

How Do We Deploy This?
There are instructions in the README. In short, we'll do this for each wiki:
 * 1) Deploy the plugin
 * 2) Build the search index configuration and get it into Solr
 * 3) Start in process indexing
 * 4) Bootstrap the index
 * 5) Cut searching over to our plugin
 * 6) Wait for folks to complain
 * 7) If there are no complaints, uninstall MWSearch2 and remove the index from lsearchd

Hardware
Thankfully we're able to shard our search indices, so we won't need the monster machines that lsearchd needs, but full text searching still loves RAM. We're looking at a clutch of machines to run Solr and a few machines to run ZooKeeper that can be shared across the entire organization. It is possible to run multiple Solr clusters using the same ZooKeeper ensemble, and we'll probably want to do that.
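The reason sharding shrinks the per-machine requirement is that each document lands on exactly one shard. SolrCloud's real routing hashes the document id; this sketch shows the same idea with a plain CRC32, purely as an illustration.

```python
import zlib

# Sketch of why sharding spreads load: each document is routed to one
# shard by hashing its id, so each machine holds and searches only a
# slice of the index. SolrCloud's actual routing differs in detail.

def shard_for(doc_id, num_shards):
    """Deterministically pick a shard for a document id."""
    return zlib.crc32(doc_id.encode('utf-8')) % num_shards

counts = [0, 0, 0]
for i in range(300):
    counts[shard_for('page-%d' % i, 3)] += 1
# Each shard ends up with roughly a third of the 300 documents.
```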

Terms
These terms mostly come from Solr. Some of them are explicitly defined; some of them are just mentioned in passing, and we're inferring their meaning.

 * Core: all of the data that a single Solr instance stores about a certain kind of document, including stored fields and all of the indices it uses to make searching quick.
 * Collection: a group of cores running on multiple instances which can be queried just like a single core. The cores are split into shards and then those shards are replicated.
 * Cluster: a group of Solr instances handling some set of cores.

Open Questions

 * How exactly do we upgrade the Solr configuration on the fly?
 * How exactly do we reclaim space from fields we don't use?
 * How are we going to do performance testing?