CirrusSearch/Presentation

This is a presentation for use in one-hour sessions about CirrusSearch.

Presentation Notes
Goal of presentation is education. Stop me if I'm not making sense. I'll take questions any time - just stop me.

The sum total of the requirements
Make the search engine not page people. Users have to like it.

Oh yeah, and it'd be nice if it were kept up to date in real time.

It took 18 months
Why 18 months? Many, many months went into "users have to like it." "Users have to like it" really means:
 * Can't take any features away (mostly)
 * You have to figure out what features the old system actually had in the first place to do this
 * Old search didn't have tests or specs or anything fancy like that
 * Have to add some shiny features
 * You have to figure out what users actually want

Solution: roll out slowly. Phase it in community by community, and offer it as an opt-in feature before it's the default.
 * Hit communities that are underserved by search first (zh-yue wikipedia, wikisource, wiktionary, wikidata)
 * Hit change-averse communities last (enwiki, dewiki)
 * Give power users lots and lots and lots of time to try the feature before dumping it on them

And build a running set of regression tests.

Only once we had all the features sorted out could we start to predict hardware requirements. So we had to wait until right at the end to order hardware.

The traffic volume
870 million full-text searches and 3.1 billion find-as-you-type searches a month.
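To put those monthly totals in per-second terms, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and perfectly even traffic, neither of which comes from the numbers above, so real peaks will be higher than these averages.

```python
# Convert the monthly search totals into average requests per second.
# Assumption (not from the source): a 30-day month with uniform traffic.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def per_second(monthly_count):
    """Average events per second for a given monthly total."""
    return monthly_count / SECONDS_PER_MONTH


full_text = per_second(870_000_000)
find_as_you_type = per_second(3_100_000_000)

print(f"full text:        {full_text:.0f}/s")
print(f"find as you type: {find_as_you_type:.0f}/s")
print(f"combined:         {full_text + find_as_you_type:.0f}/s")
```

Under those assumptions that works out to about 336 full-text and about 1,196 find-as-you-type searches per second on average.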

The solution
Replace the search system that had heroically powered search for years from the ground up with one based on Elasticsearch.

Why?
The old search engine, lsearchd, was a Java application based on Lucene 2.3. Lucene 5.0 is coming out soon. The world of open source search moved on and we never kept up. We simply don't have the manpower to maintain our own search system. Solr and Elasticsearch each have a significant community behind them keeping them up to date, and both have people working on them full time.

We chose Elasticsearch over Solr for lots of reasons:
 * Contributor pipeline was good and maintainers were nice
 * At the time Elasticsearch had better support for configuring the schema over HTTP
 * Elasticsearch's REST API just *felt* better
 * Deb and rpm packages available that work well
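For a feel of what "configuring the schema over HTTP" means, here is a minimal sketch that builds (but does not send) a create-index request carrying a field mapping. The host, the index name `example_index`, and the field names are made up for illustration, and the mapping body uses the typeless format of recent Elasticsearch releases, which differs from the versions that were current when this was written.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, properties):
    """Build a PUT request that creates an index with the given field mappings.

    The request is only constructed here, not sent.
    """
    body = {"mappings": {"properties": properties}}
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, index, and fields, chosen only for this example.
request = build_create_index_request(
    "http://localhost:9200",
    "example_index",
    {
        "title": {"type": "text"},
        "timestamp": {"type": "date"},
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it to a running cluster. The point is that the schema is plain JSON sent over plain HTTP, with no config files to edit on the servers.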

Openness
Lots of links go here.