Search/Old

This page describes the Wikimedia Foundation's activities surrounding our sites' search functionality.

Rationale
The Wikimedia search infrastructure hasn't had significant development work for many years. The current system is based on a homegrown layer on top of Lucene (lsearchd), which solves a problem that has since been tackled by much larger projects such as Solr. The search system frequently breaks in ways that are difficult to diagnose, and generally makes our Operations staff sad.

Goals for our current effort:
 * Make our existing tools more robust
 * Improve logging in our existing tools to make problems easier to diagnose
 * Migrate away from lsearchd to Solr (or something similar)

Our current search infrastructure is highly outdated and difficult to manage due to tons of custom code. We'd like to replace it with Solr (also based on Lucene), as it's very stable, contains many of the features we need, and doesn't require nearly as much custom code to support.

Solr implementation plan
We don't yet have a firm timeline for a Solr migration. A few considerations:


 * 1) Solr is web-based and has its own query syntax (Solr query syntax)
 * 2) We have a rather complex set of search modes that we support in lsearchd (user documentation)
 * 3) As an initial step, we need to decide how much of the lsearchd syntax we want to support in the Solr implementation and if we want to enhance it in some way to take advantage of newer Solr capabilities (e.g. RegEx search). This will have a strong impact on the rest of the architecture since it determines what indices are generated.
 * 4) Based on this, we need to map out how the MWSearch extension needs to change for Solr.
 * 5) For a new implementation, some sort of incremental approach seems best where we deploy Solr for smaller wikis first, and learn from that experience for the larger wikis.
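
As a rough illustration of the first consideration (Solr is queried over HTTP and has its own query syntax), the sketch below escapes a user-supplied term and builds a `/select` URL for a single fielded query. The base URL, core name, and field name are hypothetical placeholders, not our actual configuration, and the function names are made up for this example. It handles a single term only; multi-word input would need quoting or further parsing.

```python
from urllib.parse import urlencode

# Characters that the Solr/Lucene query parser treats as syntax.
SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_solr_term(term: str) -> str:
    """Backslash-escape every character the query parser treats as syntax."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL else ch for ch in term)


def build_select_url(base_url: str, field: str, term: str, rows: int = 10) -> str:
    """Build a Solr /select URL for one fielded term query.

    base_url and field are assumptions for illustration,
    e.g. "http://localhost:8983/solr/wiki" and "title".
    """
    params = {
        "q": f"{field}:{escape_solr_term(term)}",
        "rows": rows,
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("http://localhost:8983/solr/wiki", "title", "C++"))
```

Whatever subset of the lsearchd syntax we keep would have to be translated into queries of this shape, which is why the syntax decision drives which fields and indices get generated.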

Requirements

 * A solid PHP library
 * Translation memory and GeoData both use Solarium, which is widely used and very robust.
 * The Solr library in PECL is poorly maintained and incomplete.

GeoData
The index is relatively small (so no need to make it distributed), but requires a lot of computational power to work with. Full-text search is not currently used. Currently, data from all the wikis is stored in the same core. In the future we will need to split the data across many cores (the puppet changes for using multiple cores with shared configuration/schema are here, but need more work).
 * Load expectations: unclear, but will be high if we start using it heavily e.g. for maps display.
 * Backups: not really needed - if master is down just switch to a slave. If all servers are down, reindexing from scratch is quick.
 * Note: because GeoData's schema is very stripped-down, /admin/ping doesn't work. This should be remembered if someone wants to rewrite the current monitoring.

Nice to haves

 * A pony

In progress

 * Index pages when their templates change.
 * https://gerrit.wikimedia.org/r/#/c/75151/
 * Set up enwiki in labs and play with it.
 * Nik is restoring enwiki right now. It'll be a process though.
 * This is super on hold now due to lack of useful hardware.
 * Waiting on Peter to hook me up with other ways to do this.
 * Automated test suite?
 * https://gerrit.wikimedia.org/r/#/c/75793/
 * Empty string
 * Just namespace:
 * Just intitle:
 * Just incategory:
 * Namespace: with other search terms
 * Intitle: with other search terms
 * Incategory: with other search terms
 * Bug 47770 - make sure appropriate characters aren't stripped
 * Create a page
 * Delete a page
 * Edit a page
 * Include a template
 * Edit an included template
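
The query-syntax cases in the test list above (empty string, a bare prefix, a prefix combined with other terms) can be pinned down with a small parser sketch. This is a simplified stand-in, not the MWSearch or lsearchd grammar: the `ParsedQuery` type, the `parse_query` function, and the namespace list are all hypothetical, and a namespace prefix is only recognised on the first token.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical subset of namespace prefixes, for illustration only.
KNOWN_NAMESPACES = {"talk", "user", "help", "template"}


@dataclass
class ParsedQuery:
    namespace: Optional[str] = None
    intitle: List[str] = field(default_factory=list)
    incategory: List[str] = field(default_factory=list)
    terms: List[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a query into a namespace, intitle:/incategory: filters and plain terms."""
    parsed = ParsedQuery()
    for position, token in enumerate(query.split()):
        prefix, sep, rest = token.partition(":")
        key = prefix.lower()
        if sep and key == "intitle":
            if rest:  # a bare "intitle:" adds no filter
                parsed.intitle.append(rest)
        elif sep and key == "incategory":
            if rest:
                parsed.incategory.append(rest)
        elif sep and position == 0 and key in KNOWN_NAMESPACES:
            parsed.namespace = key
            if rest:
                parsed.terms.append(rest)
        else:
            # Unknown prefixes are kept verbatim as ordinary search terms.
            parsed.terms.append(token)
    return parsed


if __name__ == "__main__":
    print(parse_query("help:editing intitle:templates"))
```

Each bullet then becomes one assertion against the parsed result, which keeps the suite independent of whichever backend ends up answering the query.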

Must

 * Build puppet configuration for machines in beta.
 * Figure out monitoring.

Maybe

 * Automated test suite?
 * And another round of tests that are lower priority but still cool:
 * Highlighting
 * Suggestions
 * Document routing probably by namespace somehow - we really only need to do this if we find we're taking too much disk space or our queries are too slow.
 * After some digging it might be better to have an index for the article namespace and another for all other namespaces
 * Since most queries are just against the main namespace this cuts the search space pretty significantly
 * Allows us to configure more slaves for the main namespace to handle load
 * Allows us to configure a minimum number of slaves for the non-main namespace for redundancy.
 * We can (and should) use namespaces to make this relatively invisible on the querying end
 * This will have side effects for search suggestions. They might be positive.  They might not.  I haven't thought them through enough to know.
 * Figure out how we want to secure elasticsearch and do it.
 * Downgraded after talking with Peter
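
The namespace-based document routing idea above (one index for the article namespace, another for everything else) might look roughly like the sketch below. The index names and function names are hypothetical; the only fixed point is that the main namespace has ID 0 in MediaWiki.

```python
from typing import Iterable, List

MAIN_NAMESPACE = 0                # MediaWiki's main (article) namespace
CONTENT_INDEX = "wiki_content"    # hypothetical index name
GENERAL_INDEX = "wiki_general"    # hypothetical index name


def index_for_page(namespace_id: int) -> str:
    """Pick the index a page is written to, based on its namespace."""
    return CONTENT_INDEX if namespace_id == MAIN_NAMESPACE else GENERAL_INDEX


def indices_for_search(namespace_ids: Iterable[int]) -> List[str]:
    """Return the indices a search over the given namespaces has to query."""
    return sorted({index_for_page(ns) for ns in namespace_ids})


if __name__ == "__main__":
    print(indices_for_search([0]))         # main-namespace search: one index
    print(indices_for_search([0, 1, 10]))  # mixed search: both indices
```

Because most searches target only the main namespace, they would touch a single, smaller index, and the caller never needs to know that two indices exist.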

Done

 * Package JMXTrans
 * Puppetize JMXTrans
 * Pool Counter for Solr Updates
 * Give ElasticSearch a shot.
 * Search redirects to a page somehow.
 * Indexed in a separate field with highlighting, etc.
 * Work done in Elasticsearch branch.
 * Use Pool Counter for Searches
 * Using upstream deb files for installation and default configuration.
 * Plan machines running in beta.
 * Prefix search uses edgengrams
 * Puppetize installation of elasticsearch.
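
For readers unfamiliar with the edge n-gram approach to prefix search mentioned above: every leading substring of a title is indexed as its own token, so a prefix query becomes an exact token lookup. The sketch below shows the idea in plain Python. It is not our Elasticsearch analysis configuration; the function names and the `min_gram`/`max_gram` defaults are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 10) -> List[str]:
    """Return the lowercased leading substrings of text, from min_gram to max_gram characters."""
    text = text.lower()
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def build_prefix_index(titles: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every edge n-gram to the set of titles that start with it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title in titles:
        for gram in edge_ngrams(title):
            index[gram].add(title)
    return index


def prefix_search(index: Dict[str, Set[str]], prefix: str) -> List[str]:
    """Look the lowercased prefix up as an exact token; no scanning at query time."""
    return sorted(index.get(prefix.lower(), set()))


if __name__ == "__main__":
    idx = build_prefix_index(["Solr", "Solar", "Lucene"])
    print(prefix_search(idx, "sol"))
```

The trade-off is a larger index in exchange for cheap lookups, and prefixes longer than `max_gram` will not match unless the query is truncated the same way.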

Rejected

 * Caching results from Solr.
 * We'll wait and see if we need this.
 * According to the mailing list, folks tend to cache Solr using Varnish. Lucky for us we understand and like Varnish.

Documents

 * Search documentation on Wikitech: Search
 * Ram's setup instructions: wikitech:User:Ram/Search
 * Some notes from Brion in 2008
 * The MWSearch extension provides a SearchEngine subclass which contacts Wikimedia's Lucene-based search server. This replaces the older LuceneSearch extension which reimplemented the entire Special:Search page.
 * /2013-02 discussion - Discussion with Rainman about how the current system works