CirrusSearch/BlogDraft

This is a draft of a blog post.


Title ideas:
 * New search infrastructure is now live
 * Why we gave Wikipedia a new search infrastructure, and how

With the conversion of the English Wikipedia's on-site search function, we recently completed our year-long project to replace the underlying search infrastructure with CirrusSearch, a MediaWiki plugin we developed to serve searches from Elasticsearch, a widely used open source search engine. During the first two weeks of operation we served about 408 million full text searches and 1.5 billion find-as-you-type searches across the 900 wikis we host.

The full list of features is documented here, but here are some highlights:
 * Near real time indexing (less than a minute from page edit to index update, on average)
 * Semi-curated search results (featured articles and the picture of the day get a community-configured boost)
 * A large complement of filters and exact matching techniques useful for editors
 * An option to prefer recently edited articles

The old on-site search was quite good for its time but hadn't seen much love in many years and was beginning to show its age by crashing, blowing up on certain search terms, and otherwise waking people up in the middle of the night. The open source search world had moved on to the point where little of its code was still useful, and the Wikimedia Foundation wanted to get out of the business of maintaining a custom search solution.

We ultimately chose to rebuild the system on Elasticsearch. We built a MediaWiki extension, CirrusSearch, to mediate between MediaWiki and Elasticsearch: it hooks into the appropriate spots in MediaWiki to direct updates and searches to Elasticsearch. We chose Elasticsearch because we liked how everything, from indexing and searching to configuring how documents are analyzed, could be done over HTTP. We appreciated the flexibility and expressivity of the query language. Overall, Elasticsearch does a good job of making search seem simple most of the time, but doesn't hide the complex bits when you genuinely need them. We also appreciate that you can get tons of metrics relating to Lucene internals and write plugins to extend the query language if you need to.
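To give a flavor of that query language, here is a minimal sketch of the kind of query body one might send to Elasticsearch over HTTP. The field names, boosts, and index layout here are illustrative only, not CirrusSearch's actual production mapping:

```python
import json

def build_query(search_terms):
    """Sketch of a full text query body: match the search terms against
    a title field and a body text field, weighting title matches higher.
    Field names and the boost value are hypothetical."""
    return {
        "query": {
            "multi_match": {
                "query": search_terms,
                "fields": ["title^3", "text"],  # title matches count three times as much
            }
        },
        "highlight": {"fields": {"text": {}}},  # request highlighted snippets
        "size": 20,  # first page of results
    }

body = build_query("lucene internals")
# This JSON would be POSTed to a search endpoint such as /wiki/_search.
print(json.dumps(body, indent=2))
```

The whole request and response cycle is plain JSON over HTTP, which is what makes it so easy to inspect, log, and tweak.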

The Wikimedia Foundation is militantly open and militantly open source. We run Elasticsearch against OpenJDK. MediaWiki, CirrusSearch, and our Elasticsearch plugins are open source. So is our Puppet configuration. The contents of the search index are available for all articles by adding a URL parameter like this. The query JSON is similarly available, as are the index mapping and configuration and the overall metrics.

We've really enjoyed working with Elasticsearch and think the way to get the most out of it is to start simple and iterate. Download it, run it, POST some documents, try searching for them. Then decide something is wrong and back up a few steps. Maybe try different searches or different documents, or set up a mapping rather than using the default one, or try a different analyzer. If all else fails you can install a plugin. Or write one! Or contribute to Elasticsearch. Or Lucene. Each step is a bit harder and rarer than the last. Most bug fixes for CirrusSearch involve changing the query, which is simple to develop and deploy.
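To make that start-simple loop concrete, here is a hedged sketch of the JSON bodies involved in the first few iterations, written as Python dicts. The index name ("scratch"), the fields, and the choice of the "english" analyzer are all made up for illustration; they are not our real setup:

```python
import json

# Iteration 1: a document to index (POSTed to something like /scratch/_doc/1).
doc = {"title": "Hello", "text": "my first document"}

# Iteration 2: a search for it (POSTed to something like /scratch/_search).
search = {"query": {"match": {"text": "first"}}}

# Iteration 3: when the defaults stop being good enough, an explicit mapping
# with a non-default analyzer, created before indexing anything.
# (Mapping layout varies by Elasticsearch version; this follows recent versions.)
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "text": {"type": "text", "analyzer": "english"},
        }
    }
}

for name, body in [("doc", doc), ("search", search), ("mapping", mapping)]:
    print(name, json.dumps(body))
```

Each iteration is just another small JSON body, which is why the loop of "try it, decide something is wrong, back up a step" stays cheap for so long.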

Still curious? Read more about how we went about the project and its milestones, both internal and contributed to Elasticsearch, here. < Links to the below. Or expand or something.

How we got here
The names of the game for this project were "don't break anything important" and "deploy early and often". "Don't break anything important" matters because of the scale and reach of the project. While we consider Wikipedia a grand experiment (similar to Sesame Street), it's still an important thing that lots of people rely on, and breaking it would be bad (again, like Sesame Street). "Deploy early and often" was important because the old search system didn't have a specification. How it worked was documented across many wiki pages in many languages, but even that documentation wasn't complete. As such we had no choice but to make a very educated guess at the specification, let users test it, and fix anything they complained about.

The basic plan was:
 * Deploy the extension so updates were indexed
 * Perform the initial import
 * Expose the new search as a Beta Feature
 * Fix anything users didn't like
 * Performance testing and fixes
 * Make the new search primary and remove the Beta Feature

We did this as hardware permitted over the course of about a year, finally finishing up with English Wikipedia. For the most part this went super smoothly. Here are some milestones in the process:
 * 1) Improved the performance, correctness, and documentation of the Phrase Suggester.  The reasons were pretty simple: we needed to catch both spelling mistakes and context mistakes, and the phrase suggester gave us that power.
 * 2) Wrote our own highlighter.  We liked some things about each of the three built-in highlighters and thought "wouldn't it be nice if you could mix all the best parts?"  So we did.  That saved us a ton of disk space and disk IO.
 * 3) Wrote a plugin for trigram accelerated regex search similar to [wp:PostgreSQL]'s pg_trgm.  We did this because some of the behavior of the old search system was too difficult to replicate without forking Lucene and Elasticsearch.  So we offered an olive branch: a slower but more powerful alternative.
 * 4) Replaced our blind bulk updates with scripted, noop-detecting updates.  These were super important because the way we detect and queue changes to the wikis can cause lots of false positives.  Updates in Elasticsearch are basically deletes followed by reindexes, so it was important to be as careful about them as possible.  We ended up implementing noop detection natively in Elasticsearch, but it wasn't powerful enough for us.  Elasticsearch's transition to Groovy as its scripting language was instrumental here because the old scripting language, MVEL, had concurrency issues when run at the scale at which we receive updates.
 * 5) Purchased really nice disks.  To handle the query rate on enwiki (more than a million searches an hour) we had to purchase better SSDs.  Just as the price of freedom is eternal vigilance, the price of keeping the index constantly up to date is constant writes and constant handling of deleted documents.  That takes disk.  And RAM.
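The noop detection in item 4 boils down to one idea: before sending an update, compare the incoming document against what is already indexed and skip the write when nothing changed. Here is an illustrative reimplementation in Python; it mimics the concept, not the actual scripted update that runs inside Elasticsearch:

```python
def needs_update(stored, incoming):
    """Return True only if some incoming field differs from the stored copy.
    Since an Elasticsearch update is effectively a delete plus a reindex,
    skipping a no-op write avoids real disk and IO cost."""
    return any(stored.get(field) != value for field, value in incoming.items())

# A hypothetical stored document (fields are made up for illustration).
stored = {"title": "Elasticsearch", "text": "a search engine", "views": 100}

# A false-positive change notification: the re-rendered page is identical,
# so no write should be issued.
assert not needs_update(stored, {"title": "Elasticsearch", "text": "a search engine"})

# A real edit: the text changed, so the update must go through.
assert needs_update(stored, {"text": "a distributed search engine"})
```

Our production version is fancier (it can, for example, treat small counter changes as noops), but the skip-when-identical core is the part that cut out most of the wasted writes.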

In the end we needed more hardware than our previous search system used, mostly in the disk and RAM department. We'd wanted to do it on the same hardware, but that just wasn't possible given the new features we added while implementing CirrusSearch. The flexibility and real time updates simply take more power. Blame Conway's Law, all the cool kids do. We're happy with the outcome anyway. Not only did we get our new features, but Elasticsearch doesn't wake us up in the middle of the night. It's exactly the right kind of boring.