Requests for comment/CirrusSearch

Purpose
We'd like to replace our homegrown search backend (lsearchd) with one that is:
 * more stable
 * more actively worked on
 * better supported (we want more people we can beg for help if we are having trouble)

In the process we'll replace MWSearch, which is customized for lsearchd, with CirrusSearch, which we are currently building to be customized for SolrCloud (aka Solr 4).

We'll get a search backend that we can scale by starting new instances and asking them to join a particular search cluster (well, it is almost that easy).

We'd also like this replacement to be simpler to set up than the current MWSearch/lsearchd setup so it can be more easily implemented by users outside of WMF.

Choice of Solr
There are really two big names in open source search that provide replication, sharding, and the degree of customization that we need: Solr and ElasticSearch. While ElasticSearch is really great and has more experience with sharding than Solr, we chose Solr because it has already been used in WMF and we have more Solr expertise. In some sense there isn't too much of a choice anyway: Lucene powers both systems, so Solr, ElasticSearch, and our own lsearchd are all fancy wrappers around Lucene.

Building the Solr configuration
Solr must be configured up front with a few things. CirrusSearch generates the configuration using a maintenance script so that it can read $wg fields from MediaWiki and so it can be more easily used outside of WMF.
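
For illustration, here is a hedged sketch of the kind of LocalSettings.php values the script might read, plus a hypothetical invocation. The CirrusSearch variable and script names are assumptions, not necessarily the extension's real ones; $wgLanguageCode is core MediaWiki.

 // Hypothetical LocalSettings.php values the generator might read.
 $wgCirrusSearchServers = array( 'solr1001.example.org', 'solr1002.example.org' );
 $wgLanguageCode = 'en'; // would select the analyzer chain for text fields

 // Then run the generator like any other maintenance script, e.g.:
 //   php extensions/CirrusSearch/maintenance/buildSolrConfig.php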

What must we configure? (a concrete sketch follows this list)

 * About the fields in documents:
   * Should they be stored?
   * How should they be analyzed on import?
   * How should they be queried?
 * About the URLs served by Solr for each collection of documents:
   * Where should the search URL be and what are the defaults? (we just use the standard configuration, but it must be explicitly included)
   * Should we turn on the helpful admin console? (yes -- for shell users only, with limits; the API is mostly GET requests, which makes it easy to mess something up)
   * Should we turn on replication? (yes, because that is how SolrCloud works)
   * Should we turn on analysis? (yes, because it is useful for debugging)
   * Should we turn on some random other stuff that is required for SolrCloud but we probably wouldn't need otherwise? (yes!)
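
To make the list concrete, here is one way the generator could represent these decisions internally before serializing them into schema.xml and solrconfig.xml. This is purely an illustrative sketch, not the script's actual structure; every name in it is made up.

 // Purely illustrative: a possible internal representation of the
 // decisions above, later serialized into schema.xml and solrconfig.xml.
 $fields = array(
     'text' => array(
         'stored'        => true,            // Should it be stored?
         'indexAnalyzer' => 'text_general',  // How should it be analyzed on import?
         'queryAnalyzer' => 'text_general',  // How should it be queried?
     ),
 );
 $handlers = array(
     '/select'      => array( 'enabled' => true ),  // standard search URL with explicit defaults
     '/admin'       => array( 'enabled' => true ),  // shell users only, with limits
     '/replication' => array( 'enabled' => true ),  // required by SolrCloud
     '/analysis'    => array( 'enabled' => true ),  // useful for debugging
 );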

Can we rebuild the configuration on the fly?
Yes, but:
 * We'll have to wait until a patch is released, which looks to be in Solr version 4.4.
 * We'll have to be careful not to make breaking changes to the schema, because rebuilding the search index takes a while. This mostly means that if you need to change a field, the procedure is (sketched below):
   * 1) Create a new field built the way you need it
   * 2) Rebuild the whole search index with both the old and new fields
   * 3) Start using the new field
   * 4) Blow away the old field
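
As a sketch of the dual-field window in step 2, assuming Solarium 3's API (the field names, document id, and endpoint are all made up):

 // A hedged sketch of step 2: while both fields exist, write the same
 // value to each so queries can switch from old to new without a gap.
 $client = new Solarium\Client( array( 'endpoint' => array(
     'solr' => array( 'host' => 'solr1001.example.org', 'port' => 8983, 'path' => '/solr/' ),
 ) ) );
 $update = $client->createUpdate();
 $doc = $update->createDocument();
 $doc->id = 'enwiki:Main_Page';  // illustrative document id
 $doc->title_old = 'Main Page';  // existing field, still being queried
 $doc->title_new = 'Main Page';  // new field with the new analysis chain
 $update->addDocument( $doc );
 $client->update( $update );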

Getting data into the index
CirrusSearch offers three ways to load pages into the search index.

In process
With the flick of a global, you can engage in-process updates to the search index that happen right after the user makes the edit. With SolrCloud's soft auto commits and push updates, these should be replicated and searchable within two seconds. What that does to the cache hit rate remains to be seen, but it is certainly possible.
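
A minimal sketch of what the in-process path could look like, again assuming Solarium 3's API; the hook wiring, field names, and endpoint are illustrative, not the extension's actual code:

 // A minimal sketch, not the extension's actual code. No explicit commit
 // is sent; SolrCloud's autoSoftCommit makes the document searchable soon
 // after the edit.
 $wgHooks['ArticleSaveComplete'][] = function ( $article, $user, $text ) {
     $client = new Solarium\Client( array( 'endpoint' => array(
         'solr' => array( 'host' => 'solr1001.example.org', 'port' => 8983, 'path' => '/solr/' ),
     ) ) );
     $update = $client->createUpdate();
     $doc = $update->createDocument();
     $doc->id = $article->getTitle()->getPrefixedDBkey();
     $doc->text = $text;
     $update->addDocument( $doc );
     $client->update( $update );  // replicated and soft-committed by SolrCloud
     return true;
 };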

Bootstrapping
CirrusSearch has a maintenance script that shoves all pages into the search index. It works pretty much the same way as the maintenance script that rebuilds the MySQL full-text index from scratch.
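
The invocation would look something like this; the script name is an assumption (check the extension's maintenance/ directory for the real one):

 php extensions/CirrusSearch/maintenance/forceSearchIndex.php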

Rebuilding specific time windows
If you have to turn off in-process indexing for any reason, you'll have to rebuild the gap in time. The same maintenance script used for bootstrapping accepts a time window for document production, but the query it needs to identify those documents is less efficient. It should still be a fair sight better than rebuilding the whole index, though.
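
Something like this, where the script and flag names are again assumptions:

 php extensions/CirrusSearch/maintenance/forceSearchIndex.php \
     --from 2013-06-01T00:00:00Z --to 2013-06-02T00:00:00Z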

Testing
Testing search stuff is always hard because you don't always get the same results as before.

Indexing
We'll need to make sure indexing is fast enough and light enough not to get us into trouble, and if Solr goes down we don't want to barf at our users -- just into our logs.
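
Concretely, the failure mode we want looks something like this, continuing the Solarium sketch above (the log group name is made up):

 // If Solr is unreachable, log the failure and move on instead of
 // surfacing an error to the editor.
 try {
     $client->update( $update );
 } catch ( Exception $e ) {
     wfDebugLog( 'CirrusSearch', 'Update failed: ' . $e->getMessage() );
 }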

Prefix search
We should expect matches similar to what we have now, so we can test this without much trouble by loading a wiki in labs and just trying it.

Full-text search
We can play with full-text search, but we really shouldn't expect the same results: we're not making an effort to match the current behavior exactly because, well, the current behavior isn't really what our users want, so far as we know.

Non-English
Both prefix and full-text have to be sensible for:
 * Non-English but still space delimited languages
 * Non-space delimited languages (Japanese, Chinese, Thai, etc)
 * Right-to-left languages (Hebrew, Arabic)

How we're going to test them all
Most stuff we'll test in labs, but for the "are these results better" kinds of questions we'll have to deploy it carefully and see what happens.

Performance testing
???

Resiliency testing
Test the resiliency of the extension to cluster failures, both immediate failures (e.g. ECONNREFUSED) and slow or non-responding backends (e.g. black holes, respecting short, explicitly set timeouts).
 * Haven't done it yet, but we plan to wrap the calls to Solr in PoolCounter, just like we did with MWSearch. This should help keep the Apaches from stampeding an already-overloaded Solr (see the sketch below).
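
A sketch of what that wrapping might look like with core's PoolCounterWorkViaCallback, again assuming Solarium 3's API; the pool type, key, and query are made up:

 // A sketch only: wrap the Solr query in PoolCounter so a slow or
 // overloaded Solr can't soak up every Apache worker.
 $client = new Solarium\Client( array( 'endpoint' => array(
     'solr' => array( 'host' => 'solr1001.example.org', 'port' => 8983, 'path' => '/solr/' ),
 ) ) );
 $query = $client->createSelect()->setQuery( 'example search terms' );
 $work = new PoolCounterWorkViaCallback( 'CirrusSearch-Search', wfWikiID(), array(
     'doWork' => function () use ( $client, $query ) {
         return $client->select( $query );
     },
     'error' => function ( $status ) {
         wfDebugLog( 'CirrusSearch', 'Search pool error: ' . $status->getWikiText() );
         return null;  // degrade quietly rather than erroring at the user
     },
 ) );
 $result = $work->execute();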

Phased roll out
A phased roll out is how we're going to have to handle some of the "are these results better" questions. If people don't like the results, we'd like them to complain.
 * 1) Deploy to mediawiki.org
 * 2) Deploy to one or two non-English wikis that are part of WMF and have been complaining about search.

How do we deploy this?
There are instructions in the README.... In short, we'll do this for each wiki (a configuration sketch follows the list):
 * 1) Deploy the plugin
 * 2) Build the search index configuration and get it into Solr
 * 3) Start in-process indexing
 * 4) Bootstrap the index
 * 5) Cut searching over to our plugin
 * 6) Wait for folks to complain
 * 7) Once we're super sure we're done with lsearchd, we'll delete all of its indices but not before
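
Steps 3 and 5 are mostly configuration flips. A hedged sketch of what they might look like in LocalSettings.php: the update flag name is hypothetical, while $wgSearchType is the core setting that selects the SearchEngine implementation.

 // Step 3 (hypothetical flag name): start sending edits to Solr while
 // lsearchd still serves queries.
 $wgCirrusSearchUpdateInProcess = true;

 // Step 5: cut searching over to CirrusSearch.
 $wgSearchType = 'CirrusSearch';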

Hardware
Thankfully we're able to shard our search indices, so we won't need the monster machines that lsearchd needs, but full-text searching still loves RAM. We're looking at a clutch of machines to run Solr and a few machines to run ZooKeeper that can be shared across the entire organization. It is possible to run multiple Solr clusters using the same ZooKeeper ensemble, and we'll probably want to do that.

Simple to install
To make this simpler to install for those who don't want to maintain a large infrastructure:
 * The configuration files are generated by the plugin, not puppet.
 * The configuration files work just as well for a single-node Solr install as for a multi-node install.
 * Bootstrapping the index is done by the extension and doesn't require any special dump.
 * We try to keep our dependency list small (currently curl and the Solarium MediaWiki extension).

Terms

 * Core: All the data that a single Solr instance stores about a certain kind of document, including stored fields and all of the indices it uses to make searching quick.
 * Collection: A group of cores running on multiple instances which can be queried just like a single core. The cores are split into shards and then those shards are replicated.
 * Cluster: A group of Solr instances that handle some set of cores.

Open questions

 * How exactly do we upgrade the Solr configuration on the fly?
   * Perhaps we could use git-deploy for this (would need Ryan's input). Basically we'd ssh to the deployment host, run the config updater script, then kick off git-deploy to commit the changes and sync them out. It'd also give us the ability to do things like restart ZooKeeper or Solr (on a rolling basis).
 * How exactly do we reclaim space from fields we don't use?
 * How are we going to do performance testing?
   * I think what we do is get performance to an acceptable level in labs, going ahead and tuning things for performance now (while remembering that it is labs, so performance will never match production). The only way to get a real idea of load is the gradual rollout with proper logging/monitoring. I assume this will be iterative, and we'll learn lessons (and tweak things) as we move forward.
 * Do we actually want all of the wikis on the same cloud? Or would we split into a couple of clouds, like the DB clusters (s1-s7)?
   * We probably want to split into multiple clouds, because every member of a cloud could grab any core at any time, even becoming the master. Being the master for one of enwiki's shards will be a lot more work than being the master for one of mediawiki.org's.
   * Counter argument: you can split shards (the API for cleaning up the parent shard isn't there yet but is coming), so you might be able to keep all the shards the same size by splitting them when they get too big.
 * Will transcluded pages be able to be indexed in situ, especially where the pages are transcluded cross-namespace, or would this be part of a future build?
   * This is the plan for the first iteration, yes.