Extension:CirrusSearch/Tour

CirrusSearch is an open source MediaWiki extension that powers on-wiki search with Elasticsearch. It is designed for wikis with anywhere from tens of pages to tens of millions of pages. The reference installation at the Wikimedia Foundation powers search for wikidata.org, ca.wikipedia.org, and most wikis in Italian, and is a "beta feature" on most wikis at WMF, including en.wikipedia.org and commons.wikimedia.org. In addition to the project itself, almost everything used to maintain the installation is open source and open for viewing, including the Puppet scripts and the monitoring data. This tour will describe at a high level the bits that make the application go, how WMF maintains the search cluster with (almost) no downtime, and how to debug problems.

Why Replace Home Grown Search

 * Home grown search isn't maintained, so any issues become archaeological expeditions or simply don't get worked on.
 * Solution: Use a general purpose search backend which is actively maintained and submit all customizations upstream.


 * Home grown search is updated once a day at best.
 * Solution: Hook into the MediaWiki jobs infrastructure, which already kicks off a job whenever a page's contents change (for example, to clear the Varnish cache).


 * Home grown search doesn't transclude templates.
 * Solution: Hook into MediaWiki to use its parser. Since MediaWiki's syntax isn't a spec, it can't be reliably reimplemented in Java.  It is hard enough for Parsoid to do it in JavaScript.


 * Home grown search isn't used much outside of WMF.
 * Solution: Don't rely on unmaintained extensions or uncommon software.


 * Debugging home grown search is difficult because the requests are fixed in code and can't easily be tweaked.
 * Solution: Elasticsearch's searches are JSON, so you can play with them quite easily.

Why Elasticsearch
We knew we needed to use an open source general purpose full text search backend. Our obvious front runners were Solr and Elasticsearch. We chose Elasticsearch because:
 * Installing Elasticsearch is very easy. They provide zip, tar.gz, deb, and rpm packages.  There is a Puppet module and a Chef cookbook.  There are even apt and rpm repositories, and an MSI installer.
 * The instructions for filing a bug report and contributing code are wonderful. We've found submitting patches upstream to be pleasant.
 * Elasticsearch's index settings and mappings APIs are very good. When we made the choice Solr's schema REST API was in its infancy.
 * We really like Elasticsearch's phrase suggester.

Create index with proper settings
We use a script to keep the index up to date. It is responsible both for creating the index in the first place and for changing the mapping.
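As a sketch of what such a script does, here is a minimal index body with analysis settings and a mapping, built in Python. The index layout, analyzer, and field names here are illustrative assumptions, not CirrusSearch's actual schema.

```python
import json

# Hypothetical settings and mapping for a wiki-page index (Elasticsearch 1.x
# style "string" fields). All names here are illustrative.
def build_index_body(shards=4, replicas=1):
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "analysis": {
                "analyzer": {
                    "text_plain": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                }
            },
        },
        "mappings": {
            "page": {
                "properties": {
                    "title": {"type": "string", "analyzer": "text_plain"},
                    "text": {"type": "string", "analyzer": "text_plain"},
                    "timestamp": {"type": "date"},
                }
            }
        },
    }

body = build_index_body()
print(json.dumps(body, indent=2))
# A maintenance script would PUT this body to create the index, roughly:
#   curl -XPUT localhost:9200/enwiki -d '<the JSON above>'
```

Keeping this in a script (rather than creating indexes by hand) is what makes it practical to change analysis config later and rebuild.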

Shove in documents
We have a bulk script which we use to start up a new wiki or when we change the document building logic and we hook into the MediaWiki job system to update pages after they've changed.
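A bulk loader boils down to building `_bulk` payloads: newline-delimited JSON alternating an action line and a document line. This sketch assumes hypothetical index/type names ("enwiki"/"page"); the trailing-newline requirement is real Elasticsearch bulk API behavior.

```python
import json

# Build an Elasticsearch _bulk payload for a batch of pages.
def build_bulk_payload(pages):
    lines = []
    for page in pages:
        # Action line: index this document under its page id.
        lines.append(json.dumps({"index": {"_index": "enwiki",
                                           "_type": "page",
                                           "_id": page["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps({"title": page["title"],
                                 "text": page["text"]}))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"

payload = build_bulk_payload([{"id": 1, "title": "Foo", "text": "..."}])
# POST the payload to localhost:9200/_bulk in reasonably sized batches.
```

Batching a few hundred documents per request is usually much faster than indexing one document per request.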

Bounce queries off the index
First write your queries manually and submit them against the index with something like Sense. You can submit them with curl, but I like Sense better for development and debugging. Compared to using Lucene directly, being able to bounce any query you like off the index is a huge advantage. The documentation starts here and is pretty extensive.
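For instance, a minimal full text query body you could paste into Sense or send with curl might look like this (the "enwiki" index and "text" field are illustrative assumptions):

```python
import json

# A basic match query with highlighting; this is the kind of JSON you iterate
# on by hand before wiring it into application code.
query = {
    "query": {"match": {"text": "search term"}},
    "highlight": {"fields": {"text": {}}},
    "size": 10,
}
print(json.dumps(query, indent=2))
# curl -XPOST 'localhost:9200/enwiki/page/_search' -d '<the JSON above>'
```

Because the request is just JSON, debugging usually means deleting clauses until the problem disappears.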

Reindex
While not strictly a step in the process, it is super important that you can rebuild the index quickly. There are two kinds of rebuild:
 * Reindex
 * When you want to change index settings (analysis config, shard count, stuff like that)


 * Repopulate
 * When you want to change how your documents are built at your source system

Reindexing can be done with a transparent swap. See this blog post for the basics.
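The transparent swap works because searches go through an alias rather than a concrete index name, and the `_aliases` endpoint applies its actions atomically. A sketch, with illustrative index names:

```python
# Build an _aliases request that repoints an alias from the old index to the
# newly reindexed one. Both actions apply atomically, so searchers never see
# a moment with no index (or two indexes) behind the alias.
def alias_swap_actions(alias, old_index, new_index):
    return {
        "actions": [
            {"remove": {"alias": alias, "index": old_index}},
            {"add": {"alias": alias, "index": new_index}},
        ]
    }

body = alias_swap_actions("enwiki", "enwiki_v1", "enwiki_v2")
# POST body to localhost:9200/_aliases, then delete enwiki_v1 when satisfied.
```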

At least for us, repopulating the index cannot be done with a transparent swap. We make the changes and just re-add all the documents. You'll need some process to make sure that new fields get the right configuration. In our case we do nothing with them until we reindex everything, then they show up all at once. Seriously, the operation is atomic across the cluster.

Much of what you do with Elasticsearch will do sensible stuff if a field doesn't exist so it is often safe to just deploy code that uses the field and it'll magically switch on when it is created. Not everything works this way so you should be careful.

Java vs REST vs Thrift/Memcached
There are three ways to access Elasticsearch so I'll rank them in ascending order of my (User:Manybubbles) preference:
 * Thrift/Memcached
 * I've heard of very few people using them and they require installing an Elasticsearch plugin. I'd stay away.


 * Java
 * You can use this if your application is JVM based; just skip it if not.  This API is the highest performance because it embeds part of Elasticsearch's cluster knowledge in the client so it can send requests right where they need to go.  Almost all the Elasticsearch integration tests are written against it.  It'd be a clear win but for compatibility issues: you really ought to run the same version of the Elasticsearch jar on the clients as is running on the server, and you really ought to run the same JVM as the server.  You don't strictly have to, but if the versions get out of sync then you'll need to look askance at any failures and think "is this a version thing?"


 * REST
 * The REST API is available to everyone, including browser plugins like Sense. It is also how bugs are filed against Elasticsearch (curl recreation).  For this reason it's a safe choice, if higher overhead.  Compatibility across upgrades is also simpler with the REST API.  For example, CirrusSearch currently supports both Elasticsearch 1.0 and 0.90 without needing prior knowledge of which version of Elasticsearch it is communicating with.  It needs to use some lowest common denominator style of requests, but it makes upgrades simpler by requiring fewer parts to be changed at once.
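To show how thin the REST route is, here is a sketch of issuing a search with nothing but the standard library. The host and index names are illustrative assumptions; any HTTP client works the same way.

```python
import json
import urllib.request

# Minimal REST access sketch: everything Elasticsearch does is reachable with
# plain HTTP + JSON, which is why curl recreations work for bug reports.
def search_request(host, index, body):
    return urllib.request.Request(
        url="http://%s/%s/_search" % (host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = search_request("localhost:9200", "enwiki",
                     {"query": {"match_all": {}}})
# urllib.request.urlopen(req) would execute it against a live cluster.
```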

Install Elasticsearch on some servers
Start with at least three servers or you won't have enough redundancy to be really "production ready". We use the debs and our own puppet module but you should use your favorite installation mechanism.

As far as picking the machine:
 * RAM
 * 64GB is normally the sweet spot because Java can do pointer compression if the heap is under ~32GB and you should save half your RAM for disk caching.


 * CPU
 * Depends on usage. We're handling about 85% of updates at WMF for about 8% load across a 16 node * 12 core cluster.  Searching can vary widely on CPU usage depending on what you do.


 * Disk
 * If you want raw indexing speed there is no substitute for an SSD. If you are OK with slower indexing then spinning disks are fine.  Since Elasticsearch handles redundancy at the application level, super redundant RAID isn't exactly required.  I've seen recommendations to use RAID 0 over RAID 1.  We just use a single disk per Elasticsearch node.

Add production settings
You should really use Puppet/Chef/Salt/whatever for this kind of thing. Here are the settings you should look at or modify:
 * Elasticsearch's memory usage
 * Set ES_HEAP_SIZE to about half the machine's RAM (staying under ~32GB) in your equivalent of /etc/default/elasticsearch.


 * Memlockall
 * Uncomment bootstrap.mlockall: true in your elasticsearch.yml equivalent.
 * Uncomment MAX_LOCKED_MEMORY=unlimited in your /etc/default/elasticsearch equivalent.
 * Add logging to see if it fails by adding common.jna: DEBUG to logging.yml.


 * Cluster and node name
 * Uncomment cluster.name and node.name in elasticsearch.yml and set them to a unique name for the cluster based on the environment (prod, test, dev, etc) and the machine's hostname, respectively.


 * Master nodes/Avoid split brain
 * Elasticsearch will suffer from a split brain if the network containing the nodes is partitioned and each partition thinks it has a quorum for cluster state decisions.
 * The simple solution is to nominate three of your nodes as master eligible nodes and set discovery.zen.minimum_master_nodes to 2.
 * You nominate the three nodes by leaving node.master set to true for those nodes (the default).  Best to explicitly set it on each node.
 * Set both of these in elasticsearch.yml.
 * If your nodes are very busy, long GC pause times can knock out the master node, causing a new master election and worsening the problems. If you are worried about this then build three small master only nodes (low disk and RAM usage and very little CPU) with node.master: true and node.data: false.  Set the others to node.master: false and node.data: true.
 * The whole master only thing is also a big deal if you have a ton of cluster state to maintain because you have a ton of nodes.  Dozens or something.  Anyway, you should set up master only nodes as above in that case, but with more CPU.  Disk and memory utilization should still be low.  At least, that is my understanding.  Take it with a grain of salt because my cluster isn't like that.
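Put together, the dedicated-master layout described above looks roughly like this in elasticsearch.yml (node counts and the quorum value are for a cluster with three master-eligible nodes):

```yaml
# On the three dedicated master-eligible nodes:
node.master: true
node.data: false

# On all other (data) nodes:
# node.master: false
# node.data: true

# Everywhere: quorum of master-eligible nodes, (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

With a quorum of 2, a partition holding only one master-eligible node refuses to elect a master, which is what prevents the split brain.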


 * Rack/row/zone awareness
 * Add a custom attribute such as node.rack_id to elasticsearch.yml on each node and set cluster.routing.allocation.awareness.attributes: rack_id if you have different racks/rows/zones.
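A minimal sketch of what that looks like in elasticsearch.yml (the attribute name "rack_id" and value "r1" are illustrative; any attribute name works as long as both settings agree):

```yaml
# On a node in rack r1:
node.rack_id: r1

# Everywhere, so Elasticsearch spreads replicas across values of rack_id:
cluster.routing.allocation.awareness.attributes: rack_id
```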


 * Node discovery
 * Some folks don't like multicast discovery. We run all kinds of multicast at WMF so we use it.  There is a special plugin for EC2 discovery.  I don't use EC2 so I don't know anything about it.


 * Full cluster restart settings
 * To help Elasticsearch cleanly restart you should set three settings:

gateway:
    recover_after_nodes: 10
    recover_after_time: 20m
    expected_nodes: 16
 * The process goes like this: Wait until recover_after_nodes nodes have shown up.  Then wait recover_after_time or until expected_nodes nodes show up.  Then start the cluster again.  Without setting these the cluster won't know that more nodes are coming and so will start frantically copying shards after a few nodes have shown up, trying to restore redundancy.  When the remaining nodes show up things will calm down.  Setting these prevents or shortens the frantic copying phase.

Java
Don't run Java 7 update 51, either OpenJDK or Oracle. It is known not to work with Lucene and Elasticsearch. It crashes. Bad.

Do use the same version of Java on all the nodes in the cluster and on every client application using the Java API.

You may use either the OpenJDK or Oracle JDK.

Updates
Elasticsearch's updates are quite quick, but it is useful to be able to control the number of concurrent update processes so large batches of updates don't spike the CPU on the nodes. Remember, each Elasticsearch replica performs the update. Other systems (e.g., Solr's master/slave setup) have the update performed on the master and synced to the slaves; they are more tolerant of update hammering because the master doesn't do any searching. Anyway, our solution was to move updates to a job queue system, which also brings us retry on failure and makes sure that any intermittent Elasticsearch issues are transparent to the user. If we accidentally trash a shard we lose the search results and updates start failing on the job queue, but no one gets error messages while we frantically try to recover. If you end up on ActiveMQ or something like it, make sure job failures don't lock the workers. ActiveMQ used to do this to maintain some kind of in order delivery; I'm not sure if it still does. Updating Elasticsearch doesn't need in order delivery so long as you make sure to always update Elasticsearch from the most recent version of the data.
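The queue-worker idea above can be sketched like this. Everything here is hypothetical scaffolding (the job shape, `send_update`, the retry count); the point is that the worker re-fetches the latest page content on every attempt, so retries and out-of-order delivery are harmless.

```python
# Sketch of a job-queue worker that retries failed Elasticsearch updates
# instead of surfacing errors to users.
def process_job(job, send_update, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            # Always send the *latest* version of the page, so in order
            # delivery isn't required and a stale retry can't clobber
            # newer content.
            send_update(job["page_id"], job["fetch_latest"]())
            return True
        except IOError:
            continue  # transient failure: try again
    return False  # give up; leave the job for a later retry sweep

# Tiny usage example with a fake sender standing in for the real client.
calls = []
def fake_send(page_id, doc):
    calls.append((page_id, doc))

ok = process_job({"page_id": 7, "fetch_latest": lambda: {"text": "hi"}},
                 fake_send)
```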

Timeouts
Elasticsearch has some pretty sweet timeouts and you should remember to use them. It has three kinds of timeouts built in:
 * Master shard availability for indexes
 * If the master shard for a particular document isn't available then Elasticsearch will wait this long for the shard to come online. Defaults to what feels to me like a long time.  If you use a job queue with retries for updates then set this to something small and let the job failures accumulate and retry rather than locking up a worker for the wait time.  Most of the time when master shards are lost it is because you trashed them somehow and it'll take manual action to bring them back online.  On that time scale the difference between five milliseconds and two minutes isn't that large, and five milliseconds is a lot less likely to jam up your job queue.  Documentation is here, but you have to search for "timeout" because there isn't a deep link.


 * Shard timeouts that yield partial results from the shard after the timeout
 * The timeouts on searches do this for certain search phases; I know they do for the match finding phase.


 * Shard timeouts that yield no results from the shard after the timeout still let the search return partial results from the shards that didn't time out
 * As above this is set by setting the timeout on searches but I'm not 100% sure which search phases will do this rather than provide partial results. Regardless, you should know that both are possible.
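The shard-level search timeout is just another field on the request body. A sketch (the 500ms value and field name "text" are illustrative):

```python
import json

# Attach a per-shard timeout to a search so slow shards return partial
# results (or none) instead of jamming up request queues.
query = {
    "timeout": "500ms",  # shard-level search timeout
    "query": {"match": {"text": "search term"}},
}
print(json.dumps(query))
# The response reports "timed_out": true when any shard gave up early,
# so the application can tell the user the results may be partial.
```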

It is wrong to think: "Cool!  I'll put timeouts on updates because I can retry those, but I'll leave the timeouts off of searches because the user asked for the results so I may as well wait for them." Your heart would be in the right place, but: 1. The user isn't likely to wait a long time for results anyway. They'll just give up and move on. Better to show them partial results than to wait forever. 2. If a rogue node joins the cluster, gets a shard, and starts hanging every time you search it, then without timeouts all the request queues in Elasticsearch will fill up very quickly, making your cluster close to worthless. Search timeouts should help with this.

For reason number 2 above it is important for your application to set a request timeout on the HTTP connection, but to keep it longer than the Elasticsearch timeouts. If your application chops the connection, Elasticsearch will not be able to release the associated resources until it times out.
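One way to keep that invariant is to derive the client timeout from the Elasticsearch timeout plus a margin, rather than configuring the two independently. The margin value here is an illustrative assumption:

```python
# Derive the HTTP client timeout from the Elasticsearch-side search timeout,
# always leaving a margin so the server times out (and frees its resources)
# before the client chops the connection.
def client_timeout_seconds(es_timeout_ms, margin_ms=500):
    return (es_timeout_ms + margin_ms) / 1000.0

t = client_timeout_seconds(500)
# e.g. urllib.request.urlopen(req, timeout=t)
```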

Relative speeds
I think it is useful to get a sense of the relative speeds of the various components. So I tried to make a chart. This isn't scientific. I really just ran the same query over and over again subtracting parts and eyeballing how long each one took. Actual run times varied quite a bit so these are really more useful in an order of magnitude sense.