CirrusSearch/BlogDraft

This is a draft of a blog post.

We recently replaced the on-site search for English Wikipedia with CirrusSearch, a MediaWiki extension we developed to serve searches from Elasticsearch. During the first two weeks of operation we've served about 408 million full text searches and 1.5 billion find-as-you-type searches across the 900 wikis we host.

The full list of features is documented here, but here are some highlights:
 * Near real time indexing (< a minute from page edit to index update on average)
 * Semi-curated search results (featured articles and picture of the day given a community-configured boost)
 * A large complement of filters and exact matching techniques useful for editors
 * Optionally prefer recently edited articles (see the sketch after this list)
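
To give a feel for the last two bullets, here's roughly what boosting curated pages and recent edits can look like as an Elasticsearch query. This is a sketch, not CirrusSearch's actual query: the index name, the `featured` and `timestamp` fields, and the weights are all invented for illustration.

```python
import requests

# Hypothetical example: boost community-curated pages and prefer recent
# edits using function_score. Index, fields, and numbers are made up.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"text": "search terms"}},
            "functions": [
                # Double the score of pages the community has flagged.
                {"filter": {"term": {"featured": True}}, "boost_factor": 2},
                # Decay the score of stale pages: a page last edited
                # ~30 days ago scores half of one edited just now.
                {"gauss": {"timestamp": {"scale": "30d", "decay": 0.5}}},
            ],
            "score_mode": "multiply",
        }
    }
}

result = requests.post("http://localhost:9200/wiki/_search", json=query).json()
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```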

The old on-site search was quite good for its time but hadn't seen much love in many years and was beginning to show its age: crashing, blowing up on certain search terms, and otherwise waking people up in the middle of the night. The open source search world has moved on to the point where little of its code was still useful, and the Wikimedia Foundation wanted to get out of the business of maintaining a custom search solution.

So we ultimately chose to rebuild the system using a MediaWiki extension (CirrusSearch) to act as a go-between for MediaWiki and Elasticsearch. We chose Elasticsearch because we liked how we could configure everything about how data was indexed and searched over HTTP. Elasticsearch does a good job of making search seem simple most of the time but doesn't hide the complexity when you genuinely need to do complex things. We appreciate that queries, documents, and analysis configuration are all JSON that can be sent and fetched over HTTP. We also appreciate that you can get tons of metrics relating to Lucene internals and write plugins to extend the query language if you need to.
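
To make that concrete, here's the whole loop of putting a document into a vanilla Elasticsearch and searching for it over HTTP. A minimal sketch assuming Elasticsearch is running locally on the default port; the index and field names are invented, and the `/index/type/id` URL layout matches the Elasticsearch of the era this post describes.

```python
import requests

ES = "http://localhost:9200"

# Documents are JSON, PUT to an index over HTTP.
doc = {"title": "Elasticsearch", "text": "You know, for search."}
requests.put(f"{ES}/wiki/page/1", json=doc)

# Force a refresh so the new document is visible to search right away.
requests.post(f"{ES}/wiki/_refresh")

# Queries are JSON too, sent to the _search endpoint.
query = {"query": {"match": {"text": "search"}}}
result = requests.post(f"{ES}/wiki/_search", json=query).json()
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])
```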

We chose to make a MediaWiki extension so we could get access to the wealth of information that MediaWiki builds when it parses wikitext and so we could easily notice changes to the articles and index them immediately after update.

At the Wikimedia Foundation we try to keep things as open as possible. You can see what data we have for a given page by adding a special URL parameter like this. You can see the JSON that we use for a query by adding a parameter to the query page like this. Our server metrics are public. So is the Puppet module we use, and of course so is the source for CirrusSearch and all of our plugins.

We've really enjoyed working with Elasticsearch and think that the way to get the most out of it is to start simple and iterate. Download it, run it, POST some documents, try searching for them. Then decide something is wrong and go back a few steps. Maybe make different searches or different documents, or maybe set up a mapping rather than using the default one, or maybe try a different analyzer. If all else fails you can install a plugin. Or write one! Or contribute to Elasticsearch. Or Lucene. Each step is a bit harder and rarer than the last. Most bug fixes for CirrusSearch involve changing the query, which is simple to develop and deploy.
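
As an example of the "set up a mapping rather than using the default one" step, here's roughly what creating an index with a custom analyzer looks like. The analyzer chain and field names are invented for illustration, not the ones CirrusSearch actually ships.

```python
import requests

ES = "http://localhost:9200"

# Analysis configuration is JSON too: define a custom analyzer that
# lowercases and strips accents, and point a field's mapping at it.
config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "string"},
                "text": {"type": "string", "analyzer": "folded_text"},
            }
        }
    },
}
requests.put(f"{ES}/wiki", json=config)
```

With that in place, a search for "resume" also matches "résumé", which is the kind of small win that makes iterating on the analysis chain worthwhile.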

How we got here
The name of the game for this project was "don't break anything important" and "deploy early and often". "Don't break anything important" matters because of the scale and reach of the project. While we consider Wikipedia a grand experiment (similar to Sesame Street), it's still an important thing that lots of people rely on and breaking it would be bad (again, like Sesame Street). "Deploy early and often" was important because the old search system didn't have a specification. How it worked was documented across many wiki pages in many languages, but even that documentation wasn't complete. As such we had no choice but to take a very educated guess at the specification, allow users to test it, and fix anything that users complained about.

The basic plan was:
 * Deploy the extension so updates were indexed
 * Perform the initial import
 * Expose the new search as a Beta Feature
 * Fix anything users didn't like
 * Performance testing and fixes
 * Make the new search the primary search and remove the Beta Feature

We did this as hardware permitted over the course of about a year, finally finishing up with English Wikipedia. For the most part this went super smoothly. Here are some milestones in the process:
 * 1) Improved the performance, correctness, and documentation of the Phrase Suggester. The reasons for this were pretty simple: we needed to be able to catch both spelling mistakes and context mistakes. The Phrase Suggester gave us that power.
 * 2) Wrote our own highlighter. We liked some things about all three of the built-in highlighters and thought "wouldn't it be nice if you could mix all the best parts?" So we did. That allowed us to save a ton of disk space and disk IO.
 * 3) Wrote a plugin for trigram-accelerated regex search similar to [wp:PostgreSQL]'s pg_trgm. We did this because some of the behavior of the old search system was too difficult to replicate without forking Lucene and Elasticsearch. So we offered an olive branch: a slower but more powerful alternative.
 * 4) Scripted, noop-detecting updates. We implemented these in Groovy against vanilla Elasticsearch. They were super important because the way we detect and queue changes to the wikis can cause lots of false positives. Updates against Elasticsearch are basically deletes and reindexes, so it was important to be as careful about them as possible. We ended up implementing noop detection natively in Elasticsearch, but it wasn't powerful enough for us. (A simplified sketch of such an update follows this list.)
 * 5) Purchased really nice disks. To handle the query rate on enwiki (about a million searches an hour) we had to purchase better SSDs. Ultimately the cost of keeping the index constantly up to date is constant writes and handling deleted documents. That takes disk. And RAM.
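
To give a flavor of the scripted, noop-detecting updates in milestone 4, here's a simplified sketch against the Elasticsearch update API of that era (1.x, with Groovy scripting enabled). The script and field are stand-ins, not CirrusSearch's real update logic.

```python
import requests

ES = "http://localhost:9200"

# Re-send the page text, but turn the update into a noop when nothing
# actually changed so Elasticsearch skips the delete-and-reindex cycle.
update = {
    "script": (
        "if (ctx._source.text == new_text) { ctx.op = 'none' } "
        "else { ctx._source.text = new_text }"
    ),
    "params": {"new_text": "The text the update queue handed us."},
    "lang": "groovy",
}
requests.post(f"{ES}/wiki/page/1/_update", json=update)
```

Setting `ctx.op` to `'none'` is what cancels the write. A script can use whatever comparison it likes, which is the extra power that plain field-equality noop detection doesn't give you.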