Wikimedia Discovery/2015 Notes on unbreaking and optimizing elasticsearch

The following notes were taken by James Douglas in a meeting on 2015-07-01.

=Unbreaking=

Two other failure modes we've had
We accidentally configured the cluster with zero replicas. Then, to upgrade a server's disks, we shut it down, removed its data, and turned it back on. With no replicas, the shards on that server had no other copy.

It's safer to tell Elastic to move all shards off a given node first -- this is how we decommission a node (see docs on the Search page).
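The decommissioning step can be sketched as a cluster settings update. This is a minimal sketch, not the procedure documented on the Search page: it only builds the request body for `PUT /_cluster/settings`, using Elasticsearch's allocation-exclusion setting `cluster.routing.allocation.exclude._ip`. The function name and the node address are made up for illustration.

```python
import json


def exclude_node_body(node_ip: str) -> dict:
    """Build a body for PUT /_cluster/settings that asks Elasticsearch to
    relocate every shard away from the node with the given IP address."""
    return {
        "transient": {
            "cluster.routing.allocation.exclude._ip": node_ip
        }
    }


# 10.0.0.12 is a placeholder address, not a real production node.
body = exclude_node_body("10.0.0.12")
print(json.dumps(body, indent=2))
```

Once the shards have finished relocating, the node holds no data and can be shut down without losing the only copy of anything.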

The other outage was much more insidious. First, we decided to add nodes to the cluster on a Friday. We expected some increase in I/O load, but what we saw was higher than expected; the geo search nodes used spinning disks while the rest of the nodes did not.

Lessons learned
There are lots of lessons to learn from this:

If we are going to get taken down by high load (vs. memory), it's probably going to be I/O load, rather than something else like CPU load.

If you are going to take down one node, you're probably going to take them all down.


 * Commands to an Elastic cluster are fanned out (e.g. enwiki is replicated four ways)
 * Elastic does not have tools to automatically remove a sick node from the cluster
 * If a node is sick, it is just sick, and slows down queries
 * This is "ok", because we don't want cascading failures

Normally in a system like this we would have nice timeouts, e.g. 500ms, and results would be degraded as necessary based on available shards. This works fine for systems like PHP, but Java threads aren't killable -- you can only pause them.

 * Could we s/Thread/forked JVM/, or is this too fundamental to Lucene? Nope.

The other issue with timeouts is that we don't actually know what good timeouts are. We've studied it, but it's a hard question to ask at the place where you'd set the timeout -- upfront in the PHP app. It's something like ten seconds right now, because sometimes we really do want queries to take several seconds.
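The timeout-and-degrade idea can be sketched from the client side. The sketch below assumes the `timeout` field of an Elasticsearch search request body and the `timed_out` and `_shards` fields of the response; the field name `text`, the function names, and the 500ms default are illustrative only. Because running searches cannot be killed, a timeout like this is best treated as best-effort.

```python
def search_body(query_text: str, timeout_ms: int = 500) -> dict:
    """Build a search request that asks Elasticsearch to stop collecting
    hits after roughly timeout_ms and return what it has so far."""
    return {
        "timeout": f"{timeout_ms}ms",
        # "text" is a hypothetical field name.
        "query": {"match": {"text": query_text}},
    }


def is_degraded(response: dict) -> bool:
    """Return True if the response is partial: the search timed out or
    some shards did not answer successfully."""
    shards = response.get("_shards", {})
    incomplete = shards.get("successful", 0) < shards.get("total", 0)
    return bool(response.get("timed_out")) or incomplete
```

A caller could show degraded results with a notice rather than failing the whole request.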

Ideas for improvement
Some random thoughts for fixing this:


 * Reassess the queries that are slow
 * Integrate the results degradation/query continuation abstractions

A slow node in the cluster will slow down roughly a quarter of the queries. It will also, due to the pool counter, limit the number of queries that can be sent to the cluster. If you get enough slow queries gumming up the pool counter, we'll start rejecting queries.
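The way slow queries can exhaust the pool counter can be illustrated with a toy model. This is a sketch under stated assumptions, not the real pool counter: the class, the limit of two slots, and the tick-based durations are all invented for illustration. One query arrives per tick, and every fourth query is slow.

```python
from typing import List


class PoolCounterModel:
    """Toy concurrency limit: at most `limit` queries in flight."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # pool is full, the query is rejected
        self.in_flight += 1
        return True

    def release(self) -> None:
        if self.in_flight > 0:
            self.in_flight -= 1


def count_rejected(limit: int, durations: List[int]) -> int:
    """One query arrives per tick; durations[i] is how many ticks
    query i holds its slot. Returns the number of rejected queries."""
    pool = PoolCounterModel(limit)
    finish_ticks: List[int] = []  # tick at which each running query ends
    rejected = 0
    for tick, duration in enumerate(durations):
        # Free the slots of queries that have finished by this tick.
        still_running = []
        for finish in finish_ticks:
            if finish <= tick:
                pool.release()
            else:
                still_running.append(finish)
        finish_ticks = still_running
        if pool.try_acquire():
            finish_ticks.append(tick + duration)
        else:
            rejected += 1
    return rejected


all_fast = [1] * 8
one_in_four_slow = [10, 1, 1, 1, 10, 1, 1, 1]
print(count_rejected(2, all_fast), count_rejected(2, one_in_four_slow))
```

With only fast queries nothing is rejected. Once two slow queries hold both slots, every query that arrives behind them is rejected until one of them finishes.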

Ideally we would get a distributed search expert who could advise on, e.g., the safest algorithms for distributing queries.

=Optimizing=

Phrase rescore
This is a big hit on performance, and is triggered any time there is a space in a query. It is used when you want to rate clumps of words more highly than individual words.
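The trigger described above can be sketched as a request builder that only attaches a phrase rescore when the query contains a space. This is a minimal sketch using the Elasticsearch `rescore` request section with a `match_phrase` query; it is not the query Cirrus actually builds. The field name `text`, the window size, the slop, and the weights are arbitrary assumptions.

```python
def build_query(user_query: str, window_size: int = 512) -> dict:
    """Build a search body; add a phrase rescore only for multi-word queries."""
    user_query = user_query.strip()
    # "text" is a hypothetical field name.
    body = {"query": {"match": {"text": user_query}}}
    if " " in user_query:
        body["rescore"] = {
            # Only the top window_size hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "match_phrase": {"text": {"query": user_query, "slop": 1}}
                },
                # Illustrative weights, not production values.
                "query_weight": 1.0,
                "rescore_query_weight": 10.0,
            },
        }
    return body
```

Single-word queries skip the rescore entirely, which is why the cost only shows up when the query contains a space.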

Optimizations
(Something I didn't understand about ngrams.)

Most searches are hitting within content namespaces.

Get that disk I/O down -- rarely are we concerned about other types of load (e.g. CPU).

=Other thoughts=


 * Upgrading to 1.6 (in labs? beta?) caused a failure on a node, where the index became corrupted. It is possible (likely) that the index was already corrupt, and this simply exposed it. The right way to handle this is to recreate the indexes.


 * We need to upgrade one node at a time, and wait for the shards to settle down. Perhaps even only do one node per day.


 * Node intercommunication in 1.6 is better, but may not be perfect. Tread carefully.


 * The going theory is that corruption happens because serialization over the wire is not handled carefully.


 * Labs and Beta are more likely to go wrong, because their architectures and topologies are janky.


 * TranslateWiki and API Feature Usage also rely on Elastic. We should probably get involved.


 * Lucene calculates the relevance of a particular word by looking for words that don't appear often across documents but do appear often in this document. It's not very accurate, but it's very useful: it's fast, and it seeds results for things like phrase rescore.


 * Lucene includes deleted documents when calculating its statistics, on the assumption that they're expunged frequently. This leads to dangerous (from an I/O load perspective) behavior, and garbage results.


 * The highlighter used to be a big I/O cost.


 * When upgrading the Cirrus mappings in an incompatible way, you can bite the bullet and move it over all at once, or use feature flags that wait until roll-out to turn on. The feature flag approach is brittle -- much more likely to cause failures in between deployments across the many wikis. Another way is to -1 any commit that is incompatible with the production index.
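The relevance idea in the notes above (rare across documents, frequent in this document) is commonly called tf-idf. The sketch below uses the textbook formula, not Lucene's exact scoring, and all the counts are made up. It also shows how counting not-yet-expunged deleted documents can skew the score.

```python
import math


def tf_idf(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Textbook tf-idf: high when the term is frequent in this document
    (term_freq) and rare across the index (doc_freq out of num_docs).
    Assumes doc_freq >= 1."""
    return term_freq * math.log(num_docs / doc_freq)


# A term found in 10 of 1000 documents outweighs one found in 900 of them.
rare = tf_idf(term_freq=3, doc_freq=10, num_docs=1000)
common = tf_idf(term_freq=3, doc_freq=900, num_docs=1000)

# Hypothetical index with 100 deleted documents still on disk, 90 of which
# contain the term. Counting them makes the term look far more common.
live_only = tf_idf(term_freq=3, doc_freq=10, num_docs=1000)
with_deleted = tf_idf(term_freq=3, doc_freq=10 + 90, num_docs=1000 + 100)

print(rare > common, with_deleted < live_only)
```

In this made-up case the same term in the same document scores roughly half as much when the deleted documents are counted, which is one way stale deletes can produce garbage rankings.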