Requests for comment/CirrusSearch

Purpose
We'd like to replace our homegrown search backend (lsearchd) with one that is:
 * more stable
 * more actively worked on
 * better supported (we want more people we can beg for help if we are having trouble)

In the process we'll also replace MWSearch, which is customized for lsearchd, with CirrusSearch, which we are building to be customized for Elasticsearch.

We'll get a search backend that we can scale by starting new instances and asking them to join a particular search cluster.

We'd also like this replacement to be simpler to set up than the current MWSearch/lsearchd setup so it can be more easily deployed by users outside of WMF.

Choice of Elasticsearch
There are really two big names in open source search that provide replication, sharding, and the degree of customization that we need: Solr and Elasticsearch. After a few weeks of evaluation of each tool we've chosen Elasticsearch because of its simpler replication, more easily composable queries, and very nice suggester.

Building the Elasticsearch configuration
While Elasticsearch is officially schemaless, it doesn't make perfect analysis or scaling decisions without some hints. CirrusSearch includes a maintenance script that reads $wg configuration globals and configures Elasticsearch with index-specific parameters.

What must we configure?

 * Field analysis
   * Should the field be stored?
   * How should it be analyzed on import?
   * How should it be queried?
 * Scaling hints
   * How many shards should the index be split into?
   * How many replicas should the index have?
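To make the two groups of hints concrete, here is a minimal Python sketch of the kind of index configuration the maintenance script might emit. The field names, analyzer names, and shard/replica counts are illustrative assumptions, not the extension's actual output:

```python
# Hypothetical sketch of a generated index configuration; field and
# analyzer names are made up for illustration.
def build_index_config(shards=4, replicas=2):
    return {
        "settings": {
            "number_of_shards": shards,      # scaling hint: how the index is split
            "number_of_replicas": replicas,  # scaling hint: copies of each shard
        },
        "mappings": {
            "page": {
                "properties": {
                    # stored, with separate import-time and query-time analysis
                    "text": {
                        "type": "string",
                        "store": True,
                        "index_analyzer": "text",
                        "search_analyzer": "text_search",
                    },
                    # title gets its own chain so prefix search can work
                    "title": {"type": "string", "store": True},
                }
            }
        },
    }

config = build_index_config(shards=4, replicas=2)
```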

Can we rebuild the configuration on the fly?
Yes, but sometimes reindexes are required. The maintenance script can (theoretically) perform these reindexes, and they should be quite quick because they can be done by streaming documents from Elasticsearch back to itself.
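That streaming reindex amounts to a scroll-and-bulk copy. Here is an in-memory stand-in for the idea, with no real Elasticsearch calls; `index_batch` is a hypothetical callback, not part of CirrusSearch:

```python
def stream_reindex(source_docs, index_batch, batch_size=100):
    """Stream documents from the old index into the new one in batches,
    the way a scroll-and-bulk reindex would (in-memory stand-in)."""
    batch = []
    for doc in source_docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            index_batch(batch)  # one bulk request per full batch
            batch = []
    if batch:
        index_batch(batch)      # flush the final partial batch

# Simulate copying 250 documents into a fresh "index".
new_index = []
stream_reindex(({"id": i} for i in range(250)), new_index.extend, batch_size=100)
```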

Getting data into the index
CirrusSearch offers three ways to load pages into the search index.

In process
With the flick of a global, you can enable in-process updates to the search index that happen right after the user makes the edit. With Elasticsearch's near-real-time refresh and push updates, these should be replicated and searchable within about two seconds. What that does to the cache hit rate has yet to be seen.
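A sketch of the shape of that path, with an in-memory stand-in for the index; the hook name, global name, and document fields are hypothetical, not CirrusSearch's actual API:

```python
# Hypothetical sketch: an edit hook that pushes the saved page straight
# to the search index when the in-process-updates switch is on.
IN_PROCESS_UPDATES = True  # stand-in for the $wg global described above

def on_page_saved(page, index_document):
    if not IN_PROCESS_UPDATES:
        return False
    # Build the search document from the freshly saved revision.
    doc = {"id": page["id"], "title": page["title"], "text": page["text"]}
    index_document(doc)  # replicated and searchable shortly afterwards
    return True

indexed = []
on_page_saved({"id": 7, "title": "RFC", "text": "Sample text"}, indexed.append)
```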

Bootstrapping
CirrusSearch has a maintenance script that shoves all pages into the search index. It works pretty much the same way as the maintenance script that rebuilds the MySQL full-text index from scratch.

Rebuilding specific time windows
If you have to turn off in-process indexing for any reason, you'll have to rebuild the gap in time. The same maintenance script used for bootstrapping accepts a time window for document production, but the query it needs to identify the documents is less efficient. It should still be a fair sight better than rebuilding the whole index, though.
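The window selection amounts to filtering pages by their last-touched timestamp; a minimal sketch, assuming a hypothetical `touched` field:

```python
from datetime import datetime

def pages_in_window(pages, start, end):
    """Select pages edited inside the gap so only those get reindexed,
    mirroring the maintenance script's time-window option (illustrative)."""
    return [p for p in pages if start <= p["touched"] < end]

pages = [
    {"id": 1, "touched": datetime(2013, 7, 1)},
    {"id": 2, "touched": datetime(2013, 7, 15)},
    {"id": 3, "touched": datetime(2013, 8, 2)},
]
# Rebuild only the gap between July 10 and August 1.
gap = pages_in_window(pages, datetime(2013, 7, 10), datetime(2013, 8, 1))
```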

We're considering replacing in-process indexing and time-window rebuilds with the job queue
This has the advantage of offering a much simpler path for reindexing documents after a template update. This also might allow us to remove our custom reindex script and use the one built into MediaWiki for bootstrapping. We'd probably have to expand it a bit to get nice batching and stuff, but that shouldn't be too hard because we'd mostly be porting it from our custom script.
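A rough sketch of what job-queue-based updates could look like after a template edit; the job type name and batch size are made up for illustration:

```python
from collections import deque

job_queue = deque()  # stand-in for MediaWiki's job queue

def enqueue_reindex(page_ids, batch_size=50):
    """Hypothetical sketch: after a template update, push one reindex job
    per batch of transcluding pages rather than indexing in-process."""
    for i in range(0, len(page_ids), batch_size):
        job_queue.append({
            "type": "cirrusSearchIndex",          # illustrative job type
            "pages": page_ids[i:i + batch_size],  # batched for efficiency
        })

# A template used on 120 pages produces three jobs of up to 50 pages each.
enqueue_reindex(list(range(120)), batch_size=50)
```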

Testing
Testing search is always hard because you rarely get exactly the same results as before.

Indexing
We'll need to make sure indexing is fast enough and light enough not to get us into trouble, and if Elasticsearch goes down we don't want to barf at our users, just into our logs.

Prefix search
We should expect matches similar to what we have now, so we can test this easily by loading a wiki in labs and just trying it.

Full-text search
We can play with full-text search, but we really shouldn't expect the same results: we're not trying to match the current behavior exactly because, so far as we know, the current behavior isn't really what our users want.

Non-English
Both prefix and full-text have to be sensible for:
 * Non-English but still space delimited languages
 * Non-space delimited languages (Japanese, Chinese, Thai, etc)
 * Right-to-left languages (Hebrew, Arabic)

How we're going to test them all
Most stuff we'll test in labs, but for "are these results better" kinds of questions we'll have to deploy it carefully and see what happens.

Performance testing
We'll be setting up production hardware before rolling this out and doing performance testing with that.

Resiliency testing
Test the resiliency of the extension to search backend failures, both immediate failures (e.g. ECONNREFUSED) and slow/nonresponding backends (e.g. blackholes, respecting short explicitly set timeouts)
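A sketch of the kind of probe such a resiliency test needs: a short explicit timeout and a clean failure instead of a hang. This is illustrative test scaffolding, not the extension's actual client code:

```python
import socket

def probe_backend(host, port, timeout=0.2):
    """Attempt a connection with a short, explicit timeout; classify the
    failure mode instead of hanging or crashing (illustrative)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"    # immediate failure (e.g. ECONNREFUSED)
    except socket.timeout:
        return "timed out"  # slow or blackholed backend
```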

On a positive note, all communication with Elasticsearch is wrapped in PoolCounter, so we can limit the number of simultaneous requests and fail fast if we've crushed Elasticsearch.
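A toy stand-in for that behavior, assuming nothing about the real PoolCounter API: cap concurrent requests and reject immediately rather than queue:

```python
import threading

class PoolCounter:
    """Toy stand-in for the fail-fast behavior described above: cap the
    number of simultaneous requests and reject the rest immediately."""

    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)

    def run(self, fn):
        # Non-blocking acquire: if the pool is full, fail fast instead
        # of queueing another request against a crushed backend.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("search backend busy; failing fast")
        try:
            return fn()
        finally:
            self._sem.release()

pool = PoolCounter(limit=2)
result = pool.run(lambda: "searched")
```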

Phased roll out
A phased roll out is how we're going to have to handle some of the "are these results better" questions. If people don't like the results, we'd like them to complain.
 * 1) Deploy to beta.
 * 2) Deploy to mediawiki.org
 * 3) Deploy to one or two non-English wikis that are part of WMF and have been complaining about search.

How do we deploy this?
There are instructions in the README. In short, we'll do this for each wiki:
 * 1) Deploy the plugin
 * 2) Run the script to configure the Elasticsearch index
 * 3) Start in-process indexing
 * 4) Bootstrap the index
 * 5) Cut searching over to our plugin
 * 6) Wait for folks to complain
 * 7) Once we're super sure we're done with lsearchd, we'll delete all of its indices but not before

Hardware
Search loves RAM. Even sharded, scalable search loves RAM. Though I think Elasticsearch would actually prefer more machines with slightly fewer resources each.

Simple to install
To make this simpler to install for those who don't want to maintain a large infrastructure:
 * The Elasticsearch configuration is generated by the plugin instead of puppet.
 * Bootstrapping the index is done by the extension and doesn't require any special dump.
 * Try to keep our dependency list small (currently curl and the Elastica library)

Terms
See Elasticsearch's glossary.

Open questions

 * How exactly do we reclaim space from fields we don't use?
 * How are we going to do performance testing?
   * I think what we do is get performance to an acceptable level in labs, tuning things for performance now (while remembering that it is labs, so performance will never match production). The only way to get a real idea of load is the gradual rollout with proper logging/monitoring. This will likely be iterative, and we'll learn lessons (and tweak things) as we move forward.
 * Do we actually want all of the wikis on the same cluster? Or would we split into a couple of clusters, like the DB clusters (s1-s7)?
   * We probably want to split into multiple clusters, because every member of a cluster could grab any shard at any time, even becoming the master. Being the master for one of enwiki's shards will be a lot more work than being the master for one of mediawiki.org's.
   * Counter argument: you can move nodes around if you need to.
 * Will transcluded pages be able to be indexed in situ, especially where the pages are transcluded cross-namespace, or would this be part of a future build?
   * This is the plan for the first iteration, yes.
 * Will an oversighting action cause the indexing action to trip immediately?
   * What do you mean by "immediately"? As in faster than a few minutes (which seems to be the current target), with a special queue for revert+oversight?