Extension:CirrusSearch/Tour

CirrusSearch is an open source MediaWiki plugin that powers on-wiki search with Elasticsearch. It is designed for wikis with anywhere from tens of pages to tens of millions of pages. The reference installation at the Wikimedia Foundation powers search for wikidata.org, ca.wikipedia.org, and most wikis in Italian, and is a "beta feature" on most wikis at WMF including en.wikipedia.org and commons.wikimedia.org. In addition to the project itself, almost everything used to maintain the installation is open source and open for viewing, including the puppet scripts and the monitoring data. This tour will describe at a high level the bits that make the application go, how WMF maintains the search cluster with (almost) no downtime, and how to debug problems.

Why Replace Home Grown Search

 * Home grown search isn't maintained, so any issues become archaeological expeditions or just don't get worked on.
 * Solution: Use a general purpose search backend which is actively maintained and submit all customizations upstream.


 * Home grown search is updated once a day.
 * Solution: Hook into MediaWiki's job infrastructure, which already kicks off a job whenever a page's contents change (for example, to clear the Varnish cache).


 * Home grown search doesn't transclude templates.
 * Solution: Hook into MediaWiki to use its parser. Since MediaWiki's syntax isn't a spec, it can't be reliably reimplemented in Java. It is hard enough for Parsoid to do it in JavaScript.


 * Home grown search isn't used much outside of WMF.
 * Solution: Don't rely on unmaintained extensions or uncommon software.

Why Elasticsearch
We knew we needed to use an open source general purpose full text search backend. Our obvious front runners were Solr and Elasticsearch. We chose Elasticsearch because:
 * Installing Elasticsearch is very easy. They provide zip, tar.gz, deb, and rpm.  There is a puppet module and a chef cookbook.  There are even apt and rpm repositories.  And an msi installer.
 * The instructions for filing a bug report and contributing code are wonderful. We've found submitting patches upstream to be pleasant.
 * Elasticsearch's index settings and mappings APIs are very good. When we made the choice, Solr's schema REST API was in its infancy.
 * We really like Elasticsearch's phrase suggester.

Development Process
TODO insert diagram

Create index with proper settings
We use a script to keep the index up to date. It is responsible both for creating the index in the first place and for changing the mapping.
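
As a rough illustration of the kind of request such a script sends, here is a minimal Python sketch that builds the JSON body for creating an index. It is not CirrusSearch's actual script: the shard and replica counts and the field names `title` and `text` are made up for illustration, and the mapping uses the typeless layout of recent Elasticsearch releases (older releases nest the properties under a type name and use different field types).

```python
import json


def build_index_body(shards, replicas):
    """Build the JSON body for an index creation request."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
        # Hypothetical fields; a real wiki document has many more.
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
            }
        },
    }


body = build_index_body(shards=4, replicas=2)
print(json.dumps(body, indent=2))
```

Keeping the body in one function makes it easy to compare what the script wants against what the cluster currently reports before deciding whether to change the mapping.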

Shove in documents
We have a bulk script, which we use to start up a new wiki or when we change the document building logic, and we hook into the MediaWiki job system to update pages after they've changed.
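
To give a feel for what a bulk load looks like on the wire, the sketch below serializes a few pages into the newline-delimited body that Elasticsearch's bulk API accepts: one action line followed by one document line per page, with a trailing newline. The index name `wiki_pages` and the document fields are hypothetical, and this is not the CirrusSearch bulk script itself. Older Elasticsearch releases also expect a type in the action line.

```python
import json


def build_bulk_body(index_name, pages):
    """Serialize (page_id, document) pairs into a bulk API request body."""
    lines = []
    for page_id, doc in pages:
        # Action line: index this document under the page id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(page_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


pages = [
    (1, {"title": "Main Page", "text": "Welcome to the wiki."}),
    (2, {"title": "Help", "text": "How to edit pages."}),
]
bulk_body = build_bulk_body("wiki_pages", pages)
print(bulk_body, end="")
```

Using the page id as the document id means re-indexing a page overwrites the old document instead of creating a duplicate.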

Bounce queries off the index
First write your queries manually and submit them against the index with something like Sense. You can submit them with curl but I like Sense better for development and debugging. Compared to using Lucene directly, being able to bounce any query you like off the index is a huge advantage. The documentation starts here and is pretty extensive.
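
If you would rather script this than use Sense or curl, something like the following Python sketch works too. It builds a simple `match` query and posts it to an index's `_search` endpoint. The field name, the index name `wiki_pages`, and the address `http://localhost:9200` are assumptions for illustration only; the `search` function is defined but not called here because it needs a running node.

```python
import json
import urllib.request


def build_match_query(field, text, size=10):
    """Build a basic full text query against a single field."""
    return {"query": {"match": {field: text}}, "size": size}


def search(base_url, index_name, query):
    """POST the query to the index's _search endpoint and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/{index_name}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_match_query("title", "elasticsearch")
print(json.dumps(query))
# With a node running, for example:
# results = search("http://localhost:9200", "wiki_pages", query)
```

Printing the query body first is handy because the same JSON can be pasted straight into Sense or a curl command.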

Install Elasticsearch on some servers
Start with at least three servers or you won't have enough redundancy to be really "production ready". We use the debs and our own puppet module but you should use your favorite installation mechanism.

As far as picking the machine:
 * RAM
 * 64GB is normally the sweet spot because Java can do pointer compression if the heap is under ~32GB and you should save half your RAM for disk caching.


 * CPU
 * Depends on usage. We're handling about 85% of updates at WMF at about 10% load across a cluster of 12 nodes with 12 cores each. Searching can vary widely on CPU usage depending on what you do.


 * Disk
 * If you want raw indexing speed there is no substitute for an SSD. If you are OK with slower indexing then rotating disks are fine. Since Elasticsearch handles redundancy at the application level, super redundant RAID isn't exactly required. I've seen recommendations to use RAID 0 over RAID 1.
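
The RAM rule of thumb above can be written down as a tiny calculation: give the heap half of the machine's RAM, but keep it under the ~32GB mark where Java stops compressing pointers. In this Python sketch the 31GB cap is an assumed stand-in for "under ~32GB", not a measured limit.

```python
def suggest_heap_gb(total_ram_gb, compressed_pointer_limit_gb=31):
    """Half of RAM for the heap, capped below the pointer compression limit."""
    # The other half of RAM is left to the operating system for disk caching.
    return min(total_ram_gb // 2, compressed_pointer_limit_gb)


for ram_gb in (16, 64, 128):
    print(ram_gb, "GB RAM ->", suggest_heap_gb(ram_gb), "GB heap")
```

This is why 64GB is the sweet spot: below it the heap is limited by the half-of-RAM rule, and above it the extra memory only goes to disk cache.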

Add production settings
You should really use Puppet or Chef for this kind of thing but if you want to be safe you need to set:
 * Elasticsearch's memory usage
 * Set  to   in your equivalent to


 * Memlockall
 * Uncomment  in your   equivalent
 * Uncomment  in
 * Add logging to see if it fails by adding  to


 * cluster and node name
 * Uncomment  and   in   and set them to a unique name for the cluster based on the environment (prod, test, dev, etc) and the machine's hostname, respectively.


 * Master nodes/Avoid split brain
 * Elasticsearch will suffer from a split brain if the network containing the nodes is partitioned and each partition thinks it has a quorum for cluster state decisions.
 * Simple solution is to nominate three of your nodes as master eligible nodes and set  to.
 * You nominate the three nodes by leaving  set to true for that node (the default).  Best to explicitly set it on each node.
 * Set both of these in
 * If your nodes are very busy, long GC pause times can knock out the master node, causing a new master election and worsening the problem. If you are worried about this then build three small master only nodes (no disk or RAM usage and very little CPU) to ,  .   Set the others to  ,.
 * The whole master only thing is also a big deal if you have a ton of cluster state to maintain because you have a ton of nodes. Dozens or something. Anyway, in that case you should build your minimal master only nodes as above but with more CPU. Disk and memory utilization should still be low. At least, that is my understanding. Take it with a grain of salt because my cluster isn't like that.


 * Rack/row/zone awareness
 * Add  to and   to   if you have different racks/rows/zones.
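
To make the split brain point above concrete, here is a small Python sketch of the general majority rule: a partition may only elect a master if it can see more than half of the master eligible nodes. This only illustrates the arithmetic and does not reconstruct the setting names or values missing from the list above. The function names are made up for this example.

```python
def majority(master_eligible_nodes):
    """Smallest number of nodes that is more than half of the total."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(partition_size, master_eligible_nodes):
    """A partition can elect a master only if it holds a majority."""
    return partition_size >= majority(master_eligible_nodes)


# Three master eligible nodes split into partitions of 2 and 1.
print(majority(3))
print(can_elect_master(2, 3), can_elect_master(1, 3))
```

Because two disjoint partitions can never both hold a majority, at most one side of a network split keeps making cluster state decisions.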