Extension:CirrusSearch/Tour

CirrusSearch is an open source MediaWiki extension that powers on-wiki search with Elasticsearch. It is designed for wikis with anywhere from tens of pages to tens of millions of pages. The reference installation at the Wikimedia Foundation powers search for wikidata.org, ca.wikipedia.org, and most Italian-language wikis, and is a "beta feature" on most WMF wikis, including en.wikipedia.org and commons.wikimedia.org. In addition to the project itself, almost everything used to maintain the installation is open source and open for viewing, including the puppet scripts and the monitoring data. This tour describes, at a high level, the bits that make the application go, how WMF maintains the search cluster with (almost) no downtime, and how to debug problems.

Why Replace Home Grown Search

 * Home grown search isn't maintained, so any issues become archaeological expeditions or simply don't get worked on.
 * Solution: Use a general purpose search backend which is actively maintained and submit all customizations upstream.


 * Home grown search is updated once a day
 * Solution: Hook into MediaWiki's jobs infrastructure, which already kicks off a job whenever a page's contents change (e.g., to clear the Varnish cache).


 * Home grown search doesn't transclude templates
 * Solution: Hook into MediaWiki to use its parser. Since MediaWiki's syntax isn't a spec, it can't be reliably reimplemented in Java. It is hard enough for Parsoid to do it in JavaScript.


 * Home grown search isn't used much outside of WMF
 * Solution: Don't rely on unmaintained extensions or uncommon software.

Why Elasticsearch
We knew we needed to use an open source general purpose full text search backend. Our obvious front runners were Solr and Elasticsearch. We chose Elasticsearch because:
 * Installing Elasticsearch is very easy. They provide zip, tar.gz, deb, and rpm packages, apt and rpm repositories, a puppet module, a chef cookbook, and even an MSI installer.
 * The instructions for filing a bug report and contributing code are wonderful. We've found submitting patches upstream to be pleasant.
 * Elasticsearch's index settings and mappings APIs are very good. When we made the choice, Solr's schema REST API was in its infancy.
 * We really like Elasticsearch's phrase suggester.

Development Process
TODO insert diagram

Create index with proper settings
We use a script to keep the index up to date. It is responsible both for creating the index in the first place and for changing the mapping.
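A creation request of that kind might look like the following sketch. The index name, shard counts, and field names here are illustrative, not CirrusSearch's actual schema:

```python
import json

def build_index_body(shards=4, replicas=1):
    # Hypothetical settings and mappings for a wiki-like index; the
    # real CirrusSearch mapping is far richer than this.
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "page": {
                "properties": {
                    "title": {"type": "string"},
                    "text": {"type": "string"},
                }
            }
        },
    }

# This body would be PUT to the index, e.g.:
#   curl -XPUT localhost:9200/wiki_content -d '<the JSON below>'
print(json.dumps(build_index_body(), indent=2))
```

Keeping this in a script rather than doing it by hand means index creation and mapping changes are repeatable across wikis.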

Shove in documents
We have a bulk script which we use to bring up a new wiki or to rebuild when we change the document building logic, and we hook into the MediaWiki job system to update pages after they've changed.
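The bulk updates go through Elasticsearch's _bulk API, which takes alternating action and document lines in newline-delimited JSON. A minimal sketch of building such a body (the index name, type, and fields are illustrative):

```python
import json

def build_bulk_body(pages):
    # One action line plus one document line per page; _bulk
    # requires a trailing newline at the end of the body.
    lines = []
    for page in pages:
        lines.append(json.dumps({"index": {"_index": "wiki_content",
                                           "_type": "page",
                                           "_id": page["id"]}}))
        lines.append(json.dumps({"title": page["title"],
                                 "text": page["text"]}))
    return "\n".join(lines) + "\n"

pages = [
    {"id": 1, "title": "Main Page", "text": "Welcome to the wiki."},
    {"id": 2, "title": "Help", "text": "How to edit pages."},
]
# The result would be POSTed to localhost:9200/_bulk
print(build_bulk_body(pages))
```

Batching documents this way is much faster than indexing them one HTTP request at a time, which matters when bootstrapping a wiki with millions of pages.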

Bounce queries off the index
First write your queries manually and submit them against the index with something like Sense. You can submit them with curl, but I like Sense better for development and debugging. Compared to using Lucene directly, being able to bounce any query you like off the index is a huge advantage. The documentation starts here and is pretty extensive.
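A hypothetical full text query of the kind you might paste into Sense or send with curl; the index and field names are illustrative, not what CirrusSearch actually generates:

```python
import json

# A simple match query against a "text" field, limited to 10 hits.
query = {
    "query": {
        "match": {
            "text": "elasticsearch tour"
        }
    },
    "size": 10,
}

# Equivalent curl invocation:
#   curl -XGET localhost:9200/wiki_content/_search -d '<the JSON below>'
print(json.dumps(query, indent=2))
```

Once a hand-written query does what you want, you can translate it into the code that builds queries programmatically.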

Install Elasticsearch on some servers
Start with at least three servers or you won't have enough redundancy to be really "production ready". We use the debs and our own puppet module but you should use your favorite installation mechanism.
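One reason three is the magic number: Elasticsearch elects a master, and to avoid split brain you want a strict majority of master-eligible nodes required for election (the `discovery.zen.minimum_master_nodes` setting in the Elasticsearch versions of this era). The arithmetic, as a sketch:

```python
def minimum_master_nodes(n):
    # A strict majority of n master-eligible nodes.
    return n // 2 + 1

# With 2 servers the quorum is 2, so losing either server stops the
# cluster; with 3 servers the quorum is 2, so one server can fail.
for n in (2, 3, 5):
    print(n, minimum_master_nodes(n))
```

So two servers give you no tolerance for failure at all, while three let you lose one and keep serving.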

Add production settings
You should really use Puppet or Chef for this kind of thing, but however you do it, to be safe you need to set:
 * Elasticsearch's memory usage: min(30 GB, physical memory / 2)


 * Uncomment  in your   equivalent
 * Uncomment  in
 * Add logging to see if it fails by adding  to