CirrusSearch/BlogDraft

This is a draft of a blog post.

We recently replaced the on-site search for English Wikipedia with CirrusSearch, a MediaWiki extension we developed to serve searches from Elasticsearch. During its first two weeks of operation we served about 408 million full-text searches and 1.5 billion find-as-you-type searches across the 900 wikis we host.

The full list of features is documented here, but these are some highlights:
 * Near-real-time indexing (less than a minute from page edit to index update)
 * Semi-curated search results (featured articles and pictures of the day get a community-configured boost)
 * A large complement of filters and exact-matching techniques useful for editors
 * Optionally prefer recently edited articles

The old on-site search was quite good for its time but hadn't seen much love in many years and was beginning to show its age by crashing, blowing up on certain search terms, and otherwise paging people in the middle of the night. The open source search world had moved on to the point where little of its code was still useful, and the Wikimedia Foundation wanted to get out of the business of maintaining a custom search solution anyway.

So we ultimately chose to rebuild the system using a MediaWiki extension (CirrusSearch) to act as a go-between for MediaWiki and Elasticsearch. We chose Elasticsearch because we liked how we could configure, over HTTP, everything about how data is indexed and searched. Elasticsearch does a good job of making search seem simple most of the time but doesn't hide the complexity when you genuinely need to do complex things. We appreciate that queries, documents, and analysis configuration are all JSON that can be sent and fetched over HTTP. We also appreciate that you can get tons of metrics relating to Lucene internals and write plugins to extend the query language if you need to.
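To make the "everything is JSON over HTTP" point concrete, here is a minimal sketch that builds three requests: one creating an index with its analysis configuration, one indexing a document, and one searching. It is not how CirrusSearch itself talks to Elasticsearch. The node URL, the index name `wiki_sketch`, the analyzer name `folded_text`, and the field names are all assumptions made for illustration, and the mapping uses the typeless format of recent Elasticsearch releases, so older versions would need a different shape. The requests are only built, not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local test node
INDEX = "wiki_sketch"             # hypothetical index name


def es_request(method, path, body):
    """Build (but do not send) an HTTP request carrying a JSON body."""
    return urllib.request.Request(
        url=f"{ES_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Analysis configuration and mapping, supplied when the index is created.
# "folded_text" is a made-up analyzer name; the tokenizer and filters are
# standard ones that ship with Elasticsearch.
create_index = es_request("PUT", INDEX, {
    "settings": {"analysis": {"analyzer": {"folded_text": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding"],
    }}}},
    "mappings": {"properties": {
        "title": {"type": "text", "analyzer": "folded_text"},
        "text": {"type": "text", "analyzer": "folded_text"},
    }},
})

# A document is just a JSON object stored under an id.
index_page = es_request("PUT", f"{INDEX}/_doc/1", {
    "title": "Search engine",
    "text": "A search engine finds documents that match a query.",
})

# A query is JSON too.
search = es_request("POST", f"{INDEX}/_search", {
    "query": {"match": {"text": "search"}},
})
```

Against a running node, each request could be sent with `urllib.request.urlopen(request)`. Nothing about the three calls differs except the HTTP method, the path, and the JSON body.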

We chose to make a MediaWiki extension so we could get access to the wealth of information that MediaWiki builds when it parses wikitext and so we could easily notice changes to the articles and index them immediately after update.

At the Wikimedia Foundation we try to keep things as open as possible. You can see what data we have for a given page by adding a special URL parameter like this. You can see the JSON that we use for a query by adding a parameter to the query page like this. Our server metrics are public. So is the Puppet module we use and, of course, so is the source for CirrusSearch and all of our plugins.

We've really enjoyed working with Elasticsearch and think that the way to get the most out of it is to start simple and iterate. Download it, run it, POST some documents, try searching for them. Then decide something is wrong and go back some steps. Maybe make different searches or different documents, or maybe set up a mapping rather than using the default one, or maybe try a different analyzer. If all else fails you can install a plugin. Or write one! Or contribute to Elasticsearch. Or Lucene. Each step is a bit harder and rarer than the last. Most bug fixes for CirrusSearch involve changing the query, which is simple to develop and deploy.
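As an illustration of why a query change is such a cheap fix, here is a hedged sketch of one iteration step. Neither function is taken from CirrusSearch; the function names, the `text` field, and the boost value are assumptions. The first version is a plain match query. The second keeps that match as a requirement and adds an optional phrase clause, so pages containing the words in the same order score higher.

```python
import json


def simple_query(user_text):
    """First attempt: match the words anywhere in the text field."""
    return {"query": {"match": {"text": user_text}}}


def phrase_boosted_query(user_text):
    """Second attempt: same matches, but exact phrases rank higher.

    The "must" clause decides which documents match; the "should" clause
    only adds score to documents that also contain the phrase.
    """
    return {"query": {"bool": {
        "must": [{"match": {"text": user_text}}],
        "should": [{"match_phrase": {
            "text": {"query": user_text, "boost": 2.0},  # boost is a guess
        }}],
    }}}


if __name__ == "__main__":
    # Print both bodies so they can be pasted into a _search request.
    for build in (simple_query, phrase_boosted_query):
        print(json.dumps(build("full text search"), indent=2))
```

Going from the first version to the second needs no reindexing and no mapping change. Only the JSON body sent to the search endpoint differs.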