Extension:CirrusSearch/Tour

CirrusSearch is an open source MediaWiki plugin that powers on-wiki search with Elasticsearch. It is designed for wikis with anywhere from tens of pages to tens of millions of pages. The reference installation at the Wikimedia Foundation (WMF) powers search for wikidata.org, ca.wikipedia.org, and most wikis in Italian, and is a "beta feature" on most wikis at WMF, including en.wikipedia.org and commons.wikimedia.org. In addition to the project itself, almost everything used to maintain the installation is open source and open for viewing, including the puppet scripts and the monitoring data. This tour describes at a high level the bits that make the application go, how WMF maintains the search cluster with (almost) no downtime, and how to debug problems.

Why Replace Home Grown Search

 * Home grown search isn't maintained, so any issues become archaeological expeditions or just don't get worked on.
 * Solution: Use a general purpose search backend which is actively maintained and submit all customizations upstream.


 * Home grown search is updated once a day
 * Solution: Hook into the MediaWiki jobs infrastructure, which already kicks off a job whenever a page's contents change (for example, to clear the Varnish cache).


 * Home grown search doesn't transclude templates
 * Solution: Hook into MediaWiki to use its parser. Since MediaWiki's syntax isn't a spec it can't be reliably reimplemented in Java.  It is hard enough for Parsoid to do it in JavaScript.


 * Home grown search isn't used much outside of WMF
 * Solution: Don't rely on unmaintained extensions or uncommon software.

Why Elasticsearch
We knew we needed to use an open source general purpose full text search backend. Our obvious front runners were Solr and Elasticsearch. We chose Elasticsearch because:
 * Installing Elasticsearch is very easy. They provide zip, tar.gz, deb, and rpm.  There is a puppet module and a chef cookbook.  There are even apt and rpm repositories.  And an msi installer.
 * The instructions for filing a bug report and contributing code are wonderful. We've found submitting patches upstream to be pleasant.
 * Elasticsearch's index settings and mappings APIs are very good. When we made the choice, Solr's schema REST API was in its infancy.
 * We really like Elasticsearch's phrase suggester.

Java vs REST vs Thrift
There are three ways to access Elasticsearch so I'll rank them in ascending order of my (User:Manybubbles) preference:
 * Thrift
 * I've heard of very few people using it and it requires installing an Elasticsearch plugin. I'd stay away due to lack of testing by the community.


 * Java
 * You can use this if your application is JVM based. Just skip it if not.  This API has the highest performance because it embeds part of Elasticsearch's cluster knowledge in the client so it can send requests right where they need to go.  Almost all the Elasticsearch integration tests are written against it.  It'd be a clear win but for compatibility issues: you really ought to run the same version of the Elasticsearch jar on the clients as is running on the server, and you really ought to run the same JVM as the server.  You don't strictly have to, but if the versions get out of sync then you'll need to look askance at any failure and think "is this a version thing?"


 * REST
 * The REST API is available to everyone, including browser plugins like Sense. It is also how bugs are filed against Elasticsearch (curl recreation).  For this reason it's a safe choice, if higher overhead.  Compatibility across upgrades is simpler with curl.  For example, CirrusSearch currently supports both Elasticsearch 1.0 and 0.90 without needing prior knowledge of which version of Elasticsearch it is communicating with.  It needs to use a lowest common denominator style of requests, but that makes upgrades simpler by requiring fewer parts to be changed at once.

Development process
TODO insert diagram

Create index with proper settings
We use a script to keep the index up to date. It is responsible both for creating the index in the first place and for changing the mapping.

Shove in documents
We have a bulk script which we use when starting up a new wiki or when we change the document building logic. We also hook into the MediaWiki job system to update pages after they've changed.
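To make the bulk path a little more concrete, here is a minimal Python sketch of the newline-delimited body that Elasticsearch's `_bulk` endpoint accepts. It is not CirrusSearch's actual script; the index name, type name, and document fields are made up for illustration.

```python
import json


def build_bulk_body(index, doc_type, pages):
    """Build a newline-delimited _bulk request body.

    `pages` is an iterable of (page_id, document_dict) pairs. The index
    name, type name, and document fields used below are hypothetical.
    """
    lines = []
    for page_id, doc in pages:
        # Action line: names the document that the following source line replaces.
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(page_id)}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the document itself.
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk body has to end with a newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "examplewiki_content",
    "page",
    [
        (1, {"title": "Main Page", "text": "Welcome"}),
        (2, {"title": "Help", "text": "How to search"}),
    ],
)
print(body)
```

Each page costs two lines, so a batch of a few hundred pages is still a single HTTP request.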

Bounce queries off the index
First write your queries manually and submit them against the index with something like Sense. You can submit them with curl but I like Sense better for development and debugging. Compared to using Lucene directly, being able to bounce any query you like off the index is a huge advantage. The documentation starts here and is pretty extensive.
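If you would rather script this than use Sense or curl, something like the following Python sketch works. It only builds the request; the host, index name, and the "text" field are assumptions for illustration, not CirrusSearch's real mapping.

```python
import json
import urllib.request


def build_search_request(base_url, index, text):
    """Return an unsent HTTP request for a simple match query.

    The URL, index name, and the "text" field are hypothetical.
    """
    body = {"query": {"match": {"text": text}}, "size": 10}
    return urllib.request.Request(
        url="{}/{}/_search".format(base_url.rstrip("/"), index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "http://localhost:9200", "examplewiki_content", "phrase suggester"
)
print(request.get_method(), request.full_url)

# To actually bounce it off a live cluster:
#   with urllib.request.urlopen(request, timeout=5) as response:
#       print(json.load(response))
```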

Maintenance process
TODO insert diagram

Install Elasticsearch on some servers
Start with at least three servers or you won't have enough redundancy to be really "production ready". We use the debs and our own puppet module but you should use your favorite installation mechanism.

As far as picking the machine:
 * RAM
 * 64GB is normally the sweet spot because Java can do pointer compression if the heap is under ~32GB and you should save half your RAM for disk caching.


 * CPU
 * Depends on usage. We're handling about 85% of updates at WMF for about 10% load across a 12 node * 12 core cluster.  Searching can vary widely on CPU usage depending on what you do.


 * Disk
 * If you want raw indexing speed there is no substitute for an SSD. If you are ok with slower indexing then rotating is fine.  Since Elasticsearch handles redundancy at the application level super redundant RAID isn't exactly required.  I've seen recommendations to use RAID 0 over RAID 1.
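The RAM guidance above boils down to a small rule of thumb, sketched here in Python. The 31GB cap is an assumption standing in for "under ~32GB"; the exact point where the JVM stops compressing pointers depends on the JVM.

```python
def suggest_heap_gb(ram_gb, compressed_pointer_limit_gb=31):
    """Suggest a JVM heap size in GB for an Elasticsearch node.

    Leaves half the RAM for the operating system's disk cache and stays
    under the ~32GB threshold where the JVM can still compress pointers.
    The 31GB default cap is an assumption, not a measured limit.
    """
    half = ram_gb // 2
    return min(half, compressed_pointer_limit_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", suggest_heap_gb(ram), "GB heap")
```

Past 64GB the suggested heap stops growing, which is why adding more RAM mostly buys disk cache rather than heap.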

Add production settings
You should really use Puppet or Chef for this kind of thing but if you want to be safe you need to set:
 * Elasticsearch's memory usage
 * Set  to   in your equivalent to


 * Memlockall
 * Uncomment  in your   equivalent
 * Uncomment  in
 * Add logging to see if it fails by adding  to


 * cluster and node name
 * Uncomment  and   in   and set them to a unique name for the cluster based on the environment (prod, test, dev, etc) and the machine's hostname, respectively.


 * Master nodes/Avoid split brain
 * Elasticsearch will suffer from a split brain if the network containing the nodes is partitioned and each partition thinks it has a quorum for cluster state decisions
 * Simple solution is to nominate three of your nodes as master eligible nodes and set  to.
 * You nominate the three nodes by leaving  set to true for that node (the default).  Best to explicitly set it on each node.
 * Set both of these in
 * If your nodes are very busy long GC pause times can knock out the master node, causing a new master election, worsening the problems. If you are worried about this then build three small master only nodes (no disk or ram usage and very little cpu) to ,  .   Set the others to  ,.
 * The whole master only thing is also a big deal if you have a ton of cluster state to maintain because you have a ton of nodes. Dozens or something.  Anyway, you should make your minimum master nodes as above in that case but with more CPU.  Disk and memory utilization should still be low.  At least, that is my understanding.  Take it with a grain of salt because my cluster isn't like that.


 * Rack/row/zone awareness
 * Add  to and   to   if you have different racks/rows/zones.
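To see why a quorum of master eligible nodes avoids a split brain, here is a toy Python sketch. It assumes the usual majority rule for quorums and models nothing about Elasticsearch's real election; the function names are made up.

```python
def quorum(master_eligible):
    """Smallest number of master eligible nodes that forms a majority."""
    return master_eligible // 2 + 1


def partitions_that_can_elect(partition_sizes, minimum_masters):
    """Return the partitions that see enough master eligible nodes to elect a master."""
    return [size for size in partition_sizes if size >= minimum_masters]


needed = quorum(3)
# A network split leaves two master eligible nodes on one side and one on the other.
print(needed, partitions_that_can_elect([2, 1], needed))
# With a minimum of one, both sides could elect a master: a split brain.
print(partitions_that_can_elect([2, 1], 1))
```

With a majority requirement at most one side of any partition can ever elect a master, because two disjoint groups cannot both hold more than half of the nodes.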

TODO finish going through puppet changes

Java
Don't run Java 7 update 51, either OpenJDK or Oracle. It is known not to work with Lucene and Elasticsearch. Something is wrong in faceting/aggregation. (Current as of 0.90.x, may not still be true for 1.0.x?)

The OpenJDK is fine so long as you don't use 1.6. 1.6 doesn't work properly with Elasticsearch. WMF's mission requires that we use open source software unless we have a super compelling reason not to so we use the OpenJDK and it is working fine.

Do use the same version of Java on all the nodes in the cluster and on every client application using the Java API.

Updates
Elasticsearch's updates are quite quick but it is useful to be able to control the number of concurrent update processes so large batches of updates don't spike the CPU on the nodes. Remember, each Elasticsearch shard performs the update. Other systems (Solr's Master/Slave) have the update performed on the master and synced to the slave. They are more tolerant of update hammering because the master doesn't do any searching. Anyway, our solution to this was to move updates to a job queue system, which also brings us retry on failure and makes sure that any intermittent Elasticsearch issues are transparent to the user. If we accidentally trash a shard we lose the search results and updates start failing on the job queue, but no one gets error messages while we frantically try to recover. If you end up on ActiveMQ or something like it, make sure job failures don't lock the workers. ActiveMQ will do this to maintain some kind of in-order delivery, and updating Elasticsearch doesn't need it so long as you make sure to always update Elasticsearch from the most recent version of the data.
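The two properties that matter here, retry without blocking and always indexing the latest data, can be sketched in a few lines of Python. This is a toy in-memory queue, not MediaWiki's job queue; `load_latest` and `index_document` are hypothetical stand-ins for the wiki's storage and the Elasticsearch client.

```python
from collections import deque


def run_update_jobs(page_ids, load_latest, index_document, max_attempts=3):
    """Drain a queue of page update jobs, retrying failures without blocking.

    `load_latest` and `index_document` are hypothetical callables.
    Returns the page ids that still failed after `max_attempts`.
    """
    queue = deque((page_id, 1) for page_id in page_ids)
    gave_up = []
    while queue:
        page_id, attempt = queue.popleft()
        try:
            # Always index the most recent revision, so the order in which
            # jobs run (or rerun) does not matter.
            index_document(page_id, load_latest(page_id))
        except Exception:
            if attempt < max_attempts:
                # Requeue at the back; other jobs keep flowing meanwhile.
                queue.append((page_id, attempt + 1))
            else:
                gave_up.append(page_id)
    return gave_up


latest = {1: "rev 7 of page 1", 2: "rev 3 of page 2"}
indexed = {}
failures_left = {2: 1}  # page 2 fails once, then succeeds


def flaky_index(page_id, doc):
    """Simulated index call with an intermittent failure."""
    if failures_left.get(page_id, 0) > 0:
        failures_left[page_id] -= 1
        raise IOError("simulated intermittent failure")
    indexed[page_id] = doc


failed = run_update_jobs([2, 1], latest.get, flaky_index)
print(failed, indexed)
```

Because every attempt reloads the latest revision, a retried job can never overwrite newer data with older data, which is what makes in-order delivery unnecessary.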

Timeouts
Elasticsearch has some pretty sweet timeouts and you should remember to use them. It has three kinds of timeouts built in:
 * Master shard availability for indexes
 * If the master shard for a particular document isn't available then Elasticsearch will wait this long for the shard to come online. Defaults to what feels to me like a long time.  If you use a job queue with retries for updates then set this to something small and let the job failures accumulate and retry rather than locking up a worker for the wait time.  Most of the time when master shards are lost it is because you trashed them somehow and it'll take manual action to bring them back online.  On that time scale the difference between five milliseconds and two minutes isn't that large, and five milliseconds is a lot less likely to jam up your job queue.  Documentation is here but you have to search for "timeout" because there isn't a deep link.


 * Shard timeouts that yield partial results from the shard after the timeout
 * The timeouts on searches do this for certain search phases. I know they do for the match finding phase.


 * Shard timeouts that yield no results from the shard after the timeout, but still let the search return partial results from the shards that didn't time out
 * As above this is set by setting the timeout on searches but I'm not 100% sure which search phases will do this rather than provide partial results. Regardless, you should know that both are possible.
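Whichever kind of shard timeout you hit, the application should be able to tell that the results are partial. A minimal Python sketch, assuming the response has already been parsed from JSON and carries the `timed_out` flag and `_shards` counts that Elasticsearch search responses include; the sample responses are made up.

```python
def describe_search_response(response):
    """Classify a parsed search response as "complete" or "partial"."""
    shards = response.get("_shards", {})
    timed_out = response.get("timed_out", False)
    failed = shards.get("failed", 0)
    missing = shards.get("total", 0) - shards.get("successful", 0)
    if timed_out or failed or missing:
        return "partial"
    return "complete"


healthy = {"timed_out": False, "_shards": {"total": 5, "successful": 5, "failed": 0}}
slow = {"timed_out": True, "_shards": {"total": 5, "successful": 5, "failed": 0}}
lost_shard = {"timed_out": False, "_shards": {"total": 5, "successful": 4, "failed": 1}}

for sample in (healthy, slow, lost_shard):
    print(describe_search_response(sample))
```

What to do with a partial result (show it with a warning, retry, or log it) is up to the application.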

It is wrong to think: "Cool!  I'll put timeouts on updates because I can retry those but I'll leave the timeouts off of searches because the user asked for the results so I may as well wait for them." Your heart would be in the right place but this is wrong because: 1. The user isn't likely to wait a long time for results anyway. They'll just give up and move on. Better to show them partial results than to make them wait forever. 2. If a rogue node joins the cluster, gets a shard, and starts hanging every time you search it, then without timeouts all the request queues in Elasticsearch will fill up very quickly, making your cluster close to worthless. Search timeouts should help with this.

For reason number 2 above it is important for your application to set a request timeout on the HTTP connection but to keep it longer than the Elasticsearch timeouts. If your application chops the connection, Elasticsearch will not be able to release the associated resources until it times out.
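One way to keep the two timeouts in the right order is to derive the HTTP timeout from the search timeout. A small Python sketch; the one second margin and the function name are arbitrary assumptions, and the fragment is meant to be merged into a search request body.

```python
def search_timeouts(search_timeout_ms, margin_ms=1000):
    """Return (request body fragment, HTTP client timeout in seconds).

    The HTTP timeout is kept longer than the Elasticsearch search timeout so
    Elasticsearch gives up first and can release its resources. The margin
    is an arbitrary assumption.
    """
    if search_timeout_ms <= 0 or margin_ms <= 0:
        raise ValueError("timeouts must be positive")
    body_fragment = {"timeout": "{}ms".format(search_timeout_ms)}
    http_timeout_s = (search_timeout_ms + margin_ms) / 1000.0
    return body_fragment, http_timeout_s


fragment, http_timeout = search_timeouts(500)
print(fragment, http_timeout)
```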