CirrusSearch/BlogDraft

This is a draft of a blog post.

Title ideas
  • New search infrastructure is now live
  • Why we gave Wikipedia a new search infrastructure, and how
  • ....

Text

The Wikimedia Foundation recently completed our year-long project to replace MediaWiki's underlying search infrastructure with CirrusSearch, an Elasticsearch-powered MediaWiki search extension. During the month of December it served 872 million full-text searches and 3.1 billion find-as-you-type searches across the 900 wikis we host. The full list of features is documented on the CirrusSearch help page, but here are some highlights:

  • Near-real-time indexing - on average, a page edit now reaches the search index in under a minute
  • Semi-curated search results - ranking can be raised or lowered based on category membership; for example, featured articles and pictures of the day are considered more relevant
  • A large complement of filters and exact-matching techniques useful for editors (see the sketch after this list)
  • An option to prefer recently edited articles
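
To make the editor-facing features concrete, here is a minimal sketch of using them through the public MediaWiki search API, which CirrusSearch serves behind the scenes. The Python is illustrative and not part of CirrusSearch; the example queries and category are made up, while the keywords themselves (intitle:, incategory:, prefer-recent:) come from the help page.

    # Illustrative only: full-text search via the MediaWiki API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def search(query, limit=5):
        """Run a full-text search and return matching page titles."""
        params = {
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": limit,
            "format": "json",
        }
        result = requests.get(API, params=params).json()
        return [hit["title"] for hit in result["query"]["search"]]

    # Exact matching and filters aimed at editors:
    print(search('intitle:"solar eclipse" incategory:"Solar eclipses"'))
    # Optionally prefer recently edited articles:
    print(search("prefer-recent: hurricane"))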


The old on-site search was quite good for its time, but it hadn't seen much love in years and was beginning to show its age: crashing, returning errors on certain search terms, and otherwise waking engineers up in the middle of the night. Unfortunately, it implemented all of its search functionality against a version of Apache Lucene released so long ago that getting any new functionality would have meant rewriting most of it against a new version with very different APIs. That was daunting, and the Wikimedia Foundation wanted to get out of the business of maintaining a custom search solution, so we decided to replace the system entirely rather than improve it incrementally.

We ultimately chose to rebuild the system on Elasticsearch. It is an active open source project in its own right, and we think its developers will do a better job of keeping up with Lucene development than we can. We then built the CirrusSearch MediaWiki extension to mediate between MediaWiki and Elasticsearch: it hooks into the appropriate spots to direct updates and searches to Elasticsearch. We like Elasticsearch for a ton of reasons:

  • The query language is super flexible and expressive. Compared to writing a low-level wrapper around Lucene, this is really nice: we can change queries on the fly just by redeploying CirrusSearch. (A sketch covering this and the next few points follows the list.)
  • Almost all maintenance tasks, from defining a new index and configuring how its fields are analyzed to changing recovery rate limits, can be done over HTTP. That made tasks like building new indices easy to script.
  • Which servers hold what data is dynamic, within constraints you specify. This is powerful enough to express rules like "don't put two English Wikipedia shards on the same server", but for the most part we just leave it alone and it makes sensible decisions.
  • Tons of metrics are available over HTTP for reading, graphing, and alerting fun, both as JSON and in an easily grep-able form. On the off chance something does go sideways, we can track it down by combining these metrics with the slow query log and the extra logging we do on the MediaWiki side.
  • The plugin system is robust enough for us to write new queries, filters, and a new result summarizer. This gave us a ton of flexibility and, most importantly, let us iterate quickly when we couldn't wait for Elasticsearch to cut a new release.
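
Here is a taste of what "everything over HTTP" looks like: defining an index (analysis config, a mapping, and a don't-colocate-shards rule in one request), running a query, and pulling metrics. It's an illustrative Python sketch against a local 1.x-era cluster; the index and field names are made up and are nothing like our production setup.

    # Illustrative only: index definition, a query, and metrics, all HTTP.
    import json
    import requests

    ES = "http://localhost:9200"

    # Define a new index: analyzer config, a mapping, and the allocation
    # rule "no two shards of this index on the same server".
    requests.put(ES + "/pages", data=json.dumps({
        "settings": {
            "number_of_shards": 2,
            "index.routing.allocation.total_shards_per_node": 1,
            "analysis": {
                "analyzer": {"page_text": {"type": "english"}},
            },
        },
        "mappings": {
            "page": {
                "properties": {
                    "title": {"type": "string", "analyzer": "page_text"},
                    "text": {"type": "string", "analyzer": "page_text"},
                },
            },
        },
    }))

    # Queries are plain JSON, so changing ranking is a CirrusSearch deploy,
    # not a rewrite of a low-level Lucene wrapper.
    hits = requests.post(ES + "/pages/_search", data=json.dumps({
        "query": {"match": {"text": "search infrastructure"}},
    })).json()

    # Metrics: JSON for graphing and alerting, _cat for grep-ability.
    health = requests.get(ES + "/_cluster/health").json()
    print(requests.get(ES + "/_cat/health?v").text)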

Overall, Elasticsearch does a good job of making search seem simple most of the time without hiding the complex bits when you genuinely need them. You get the most out of it when you start simple and iterate. Download it, run it, POST some documents, and try searching for them. Then decide something is wrong and back up a few steps: try different searches or different documents, or set up a mapping rather than using the default one, or swap in a different analyzer. If all else fails you can install a plugin. Or write one! Or contribute to Elasticsearch. Or Lucene. Not that you'll have to do that on every project. The point is that you can start simple and go deeper.
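
In code, that loop might look something like this; again an illustrative Python sketch against a local cluster, with made-up index and field names.

    # Illustrative only: the start-simple loop from the paragraph above.
    import json
    import requests

    ES = "http://localhost:9200"

    # POST some documents; the index and a default mapping spring into
    # existence on first write.
    requests.post(ES + "/scratch/note/1", data=json.dumps({"body": "hello search"}))
    requests.post(ES + "/scratch/note/2", data=json.dumps({"body": "goodbye search"}))
    requests.post(ES + "/scratch/_refresh")  # make the writes visible to search

    # Try searching for them.
    found = requests.post(ES + "/scratch/_search", data=json.dumps({
        "query": {"match": {"body": "hello"}},
    })).json()
    print(found["hits"]["total"])

    # Decide something is wrong and back up a step: delete the index,
    # define an explicit mapping or a different analyzer, and load again.
    requests.delete(ES + "/scratch")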

Finally, one of the really cool (to us, anyway) things about the Wikimedia Foundation is that we're very open source and open in general. We run Elasticsearch on OpenJDK. MediaWiki, CirrusSearch, and our Elasticsearch plugins are open source. So is our Puppet configuration. The contents of the search index are available for every article by adding a URL parameter. The JSON query we send to Elasticsearch is similarly available, as are the index mapping, configuration, and metrics.
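
A couple of concrete entry points, with the parameter names as they stand at the time of writing (treat them as illustrative; they may change):

    https://en.wikipedia.org/wiki/Wikipedia?action=cirrusdump  (an article as it sits in the search index)
    https://en.wikipedia.org/w/index.php?search=example&cirrusDumpQuery  (the JSON query we send to Elasticsearch)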

Still curious? Read on for more about how we went about the project and its milestones, both internal and contributed upstream to Elasticsearch.

How we got here

The names of the game for this project were "don't break anything important" and "deploy early and often". "Don't break anything important" mattered because of the scale and reach of the project: while we consider Wikipedia a grand experiment (much as the people at Sesame Street consider what they do an experiment), it's still something lots of people rely on, and breaking it would be bad (also like Sesame Street). "Deploy early and often" mattered because the old search system had no specification. How it worked was documented across many wiki pages in many languages, and even that documentation wasn't complete. So we had no choice but to make a very educated guess at the specification, let users test it, and fix anything they complained about.

The basic plan was:

  • Deploy the extension so updates were indexed
  • Perform the initial import
  • Expose the new search as a Beta Feature
  • Fix anything users didn't like
  • Test performance and fix what we found
  • Make the new search primary and remove the Beta Feature

We did this as hardware permitted over the course of about a year, finally finishing up with English Wikipedia. For the most part this went super smoothly. Here are some milestones in the process:

  1. Improved the performance, correctness, and documentation of the phrase suggester. The reasons were pretty simple: we needed to catch both spelling mistakes and context mistakes ("noble prize" is spelled fine, but it should be "Nobel Prize"), and the phrase suggester gave us that power. (A sketch of a phrase-suggester request follows this list.)
  2. Wrote our own highlighter. We liked some things about each of the three built-in highlighters and thought "wouldn't it be nice if you could mix all the best parts?" So we did, and it saved us a ton of disk space and disk IO.
  3. Wrote a plugin for trigram accelerated regex search similar to PostgreSQL's pg_trgm. We did this because some of the behavior of the old search system was too difficult to replicate without forking Lucene and Elasticsearch. So we offered an olive branch: a slower but more powerful alternative.
  4. Replaced our blind bulk updates with scripted, noop-detecting updates (also sketched after this list). These were super important because the way we detect and queue changes to the wikis can produce lots of false positives, and an update in Elasticsearch is basically a delete and reindex, so we wanted to be as careful about them as possible. We ended up implementing simple noop detection natively in Elasticsearch, but it wasn't powerful enough for our case, so we use scripts for the rest. Elasticsearch's transition to Groovy as its scripting language was instrumental here: the old scripting language, MVEL, had concurrency issues when run at the scale at which we receive updates.
  5. Purchased really nice disks. To handle the query rate on English Wikipedia (over a million searches an hour) we had to buy better SSDs. And just as the price of freedom is eternal vigilance, the price of a constantly up-to-date index is constant writes and constant cleanup of deleted documents. That takes disk. And RAM.
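
Here is roughly what asking the phrase suggester for a "did you mean" looks like (item 1). This is an illustrative Python sketch with made-up index, field, and suggester names; in production the suggester runs against a field analyzed with shingles.

    # Illustrative only: a phrase-suggester request.
    import json
    import requests

    ES = "http://localhost:9200"

    resp = requests.post(ES + "/pages/_search", data=json.dumps({
        "size": 0,  # we only want the suggestion, not the hits
        "suggest": {
            "text": "noble prize",  # spelled fine, wrong in context
            "did_you_mean": {
                "phrase": {"field": "text", "size": 1},
            },
        },
    })).json()
    # resp["suggest"]["did_you_mean"][0]["options"] holds the corrections.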

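And here is the shape of the noop-detecting updates from item 4: the simple case uses detect_noop on a partial-document update, and a script handles anything fancier. Again illustrative Python with made-up index, document, and field names.

    # Illustrative only: updates that turn into noops when nothing changed.
    import json
    import requests

    ES = "http://localhost:9200"

    # Simple case: Elasticsearch diffs the partial doc against the stored
    # source and skips the delete-and-reindex when they already match.
    requests.post(ES + "/pages/page/42/_update", data=json.dumps({
        "doc": {"title": "Search", "text": "Search is the stuff of dreams."},
        "detect_noop": True,
    }))

    # Fancier case: a Groovy script decides for itself, setting ctx.op to
    # "none" to mark the update a noop.
    requests.post(ES + "/pages/page/42/_update", data=json.dumps({
        "script": "if (ctx._source.views == views) { ctx.op = 'none' } "
                  "else { ctx._source.views = views }",
        "params": {"views": 7},
    }))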

In the end we needed more hardware than the previous search system used, mostly in the disk and RAM department. We'd wanted to run on the same hardware, but that just wasn't possible given the new features we added while implementing CirrusSearch: the flexibility and real-time updates simply take more power. Blame Wirth's law; all the cool kids do. We're happy with the outcome anyway. Not only did we get our new features, but Elasticsearch doesn't wake us up in the middle of the night. It's exactly the right kind of boring.