CirrusSearch/BlogDraft

This is a draft of a blog post.


 * Title ideas
 * New search infrastructure is now live
 * Why we gave Wikipedia a new search infrastructure, and how

Text
The Wikimedia Foundation recently completed a year-long project to replace our underlying search infrastructure with CirrusSearch, an Elasticsearch-powered search extension for MediaWiki. During the first two weeks of operation we served about 408 million full text searches and 1.5 billion find-as-you-type searches across the 900 wikis we host. The full list of features is documented here, but here are some highlights:
 * Near real-time indexing (under a minute from page edit to index update on average)
 * Semi-curated search results (featured articles and pictures of the day get a community-configured boost)
 * A large complement of filters and exact matching techniques useful for editors
 * Optionally prefer recently edited articles

The old on-site search was quite good for its time but hadn't seen much love in many years and was beginning to show its age by crashing, returning errors on certain search terms, and otherwise waking engineers up in the middle of the night. Unfortunately it implemented all the search functionality using a version of Apache Lucene released so long ago that to get any new functionality we'd have had to rewrite most of it against a new version with very different APIs. That was pretty daunting, and the Wikimedia Foundation wanted to get out of the business of maintaining a custom search solution, so we decided to replace the system entirely rather than attempt to improve it incrementally.

We ultimately chose to rebuild the system using Elasticsearch. We reasoned that as an active open source project it would do a better job keeping up with Lucene development than we would. So we built a MediaWiki extension, CirrusSearch, to mediate between MediaWiki and Elasticsearch. It hooks the appropriate spots to direct updates and searches to Elasticsearch.

Overall, Elasticsearch does a good job of making search seem simple most of the time but doesn't hide the complex bits when you genuinely need them. It feels like you get the most out of it when you start simple and iterate. Download it, run it, POST some documents, and try searching for them. Then decide something is wrong and go back some steps. Maybe make different searches or different documents, or maybe set up a mapping rather than using the default one, or maybe try a different analyzer. If all else fails you can install a plugin. Or write one! Or contribute to Elasticsearch. Or Lucene. Not that you'll have to do that on every project. The point is that you can start simple and go deeper.

We liked Elasticsearch for a ton of reasons:
 * The query language is super flexible and expressive. This is really nice compared to writing a low level wrapper around Lucene because it allows us to change queries on the fly just by redeploying CirrusSearch.
 * Almost all maintenance tasks can be done over HTTP, from defining a new index and configuring how fields are analyzed to changing recovery rate limits. This made it easy to script tasks like building new indices and changing analysis settings.
 * Which servers hold what data is dynamic within constraints you specify. This is powerful enough for us to specify rules like "don't put two English Wikipedia shards on the same server" but for the most part we just leave it alone and it makes sensible decisions.
 * Tons of metrics are available for reading, graphing, and alerting fun over HTTP both in JSON and easily grep-able form. On the off chance something does go sideways we can track it down by combining these metrics with the slow query log and more logging that we do on the MediaWiki side.
 * The plugin system is robust enough for us to write new queries, filters, and a new result summarizer. This gave us a ton of flexibility and, most importantly, it let us iterate quickly when we couldn't wait for Elasticsearch to cut a new release.
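
To give a feel for the "maintenance over HTTP" and shard placement points above, here is a minimal Python sketch that builds (but does not send) an index-creation request. The host, the index name, and the shard and replica counts are assumptions made up for this example, not our production configuration. The one real Elasticsearch name it relies on is the `index.routing.allocation.total_shards_per_node` setting, which caps how many shards of a single index may live on one node.

```python
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # assumption: a local test node


def build_create_index_request(index_name, shards, replicas):
    """Build a PUT request that creates an index with an allocation limit."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                # Allow at most one shard of this index per server.
                "routing": {"allocation": {"total_shards_per_node": 1}},
            }
        }
    }
    return urllib.request.Request(
        url=f"{ES_HOST}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name, used only for illustration.
request = build_create_index_request("enwiki_content_example", shards=4, replicas=2)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

With a limit of one shard per node, a cluster needs at least as many nodes as there are shard copies (here 4 primaries plus 8 replicas), otherwise some shards stay unassigned. That is why a rule like this is best treated as a constraint for a few large indices rather than a default.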

Finally, one of the really cool (to us, anyway) things about the Wikimedia Foundation is that we're very open source and open in general. We run Elasticsearch using the OpenJDK. MediaWiki, CirrusSearch, and our Elasticsearch extensions are open source. So is our Puppet configuration. The contents of the search index are available for all articles by adding a URL parameter like this. The query JSON is similarly available, as are the index mapping, configuration, and metrics.

Still curious? Read more about how we went about the project and its milestones, both internal and contributed to Elasticsearch, here. [Links to the below. Or expands or something.]

How we got here
The two rules for this project were "don't break anything important" and "deploy early and often". "Don't break anything important" mattered because of the scale and reach of the project. While we consider Wikipedia a grand experiment (similar to Sesame Street), it's still an important thing that lots of people rely on, and breaking it would be bad (again, like Sesame Street). "Deploy early and often" mattered because the old search system didn't have a specification. How it worked was documented across many wiki pages in many languages, but even that documentation wasn't complete. As such we had no choice but to take a very educated guess at the specification, allow users to test it, and fix anything that users complained about.

The basic plan was:
 * Deploy the extension so updates were indexed
 * Perform the initial import
 * Expose the new search as a Beta Feature
 * Fix anything users didn't like
 * Performance testing and fixes
 * Make the new search the primary and remove the Beta Feature.

We did this as hardware permitted over the course of about a year, finally finishing up with English Wikipedia. For the most part this went super smoothly. Here are some milestones in the process:
 * 1) Improved the performance, correctness, and documentation of the Phrase Suggester.  The reasons for this were pretty simple: we needed to be able to catch both spelling mistakes and context mistakes.  The phrase suggester gave us that power.
 * 2) Wrote our own highlighter.  We liked some things about all three of the built-in highlighters and thought "wouldn't it be nice if you could mix all the best parts?"  So we did.  That allowed us to save a ton of disk space and disk IO.
 * 3) Wrote a plugin for trigram-accelerated regex search similar to PostgreSQL's pg_trgm.  We did this because some of the behavior of the old search system was too difficult to replicate without forking Lucene and Elasticsearch.  So we offered an olive branch: a slower but more powerful alternative.
 * 4) Replaced our blind bulk updates with scripted, noop-detecting updates.  These were super important because the way we detect and queue changes to the wikis can cause lots of false positives.  Updates against Elasticsearch are basically deletes and reindexes, so it was important to be as careful about them as possible.  We ended up implementing noop detection natively in Elasticsearch but it wasn't powerful enough for us.  Elasticsearch's transition to Groovy as the scripting language was instrumental in this because the old scripting language, MVEL, had concurrency issues when run at the scale at which we receive updates.
 * 5) Purchased really nice disks.  To handle the query rate on enwiki (more than a million an hour) we had to purchase better SSDs.  Just as the price of freedom is vigilance, the price of keeping the index constantly up to date is constant writes and handling deleted documents.  That takes disk.  And RAM.
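
For readers wondering what "trigram-accelerated" means in milestone 3, here is a minimal Python sketch of the general idea, in the spirit of pg_trgm: index every 3-character substring of each document, use those trigrams to cheaply narrow down candidates, and only then run the real (slow) regex. This is an illustration under simplifying assumptions, not how our plugin is implemented. In particular, the `TrigramIndex` class and its methods are hypothetical names, and the caller passes in the literal strings the regex is known to require, whereas a real implementation would derive them from the regex automatically.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory index, for illustration only."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # trigram -> ids of docs containing it

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def candidates(self, required_literals):
        """Ids of docs containing every trigram of every required literal."""
        ids = set(self.docs)
        for literal in required_literals:
            for gram in trigrams(literal):
                ids &= self.postings.get(gram, set())
        return ids

    def search(self, required_literals, pattern):
        """Run the regex only on documents that pass the trigram filter."""
        regex = re.compile(pattern)
        return sorted(
            doc_id
            for doc_id in self.candidates(required_literals)
            if regex.search(self.docs[doc_id])
        )


index = TrigramIndex()
index.add(1, "Elasticsearch powers CirrusSearch")
index.add(2, "Lucene powers Elasticsearch")
index.add(3, "Sesame Street is an experiment")

print(index.candidates(["powers"]))                   # docs 1 and 2 pass the filter
print(index.search(["powers"], r"^Lucene.*powers"))   # [2]
print(index.search(["Cirrus"], r"Cirrus\w+"))         # [1]
```

The trigram filter can only produce false positives, never false negatives, so the final regex pass keeps results exact. Literals shorter than three characters yield no trigrams, in which case the sketch falls back to checking every document.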

In the end we needed more hardware than for our previous search system, mostly in the disk and RAM department. We'd wanted to do it on the same hardware but it just wasn't possible given the new features we added in the process of implementing CirrusSearch. The flexibility and real-time updates just take more power. Blame Conway's Law, all the cool kids do.[This is a bit cryptic, even if one knows what Conway's law is.] We're happy with the outcome anyway. Not only did we get our new features, but Elasticsearch doesn't wake us up in the middle of the night. It's exactly the right kind of boring.