Search/Old/2013-02 discussion

Notes from a discussion with several WMF staff developers and Rainman. Original notes copied from http://etherpad.wmflabs.org/pad/p/SearchDiscussion (authors: Tim, aaron, dsc, robla + 2 unnamed authors)

lsearchd deep dive
What processes are running?
 * searchfrontend & searchbackend

Indexer:
 * index is a collection of files in a certain format
 * one index daemon (per server?) (avoids synchronization/locking)
 * RMI is used as a wrapper for searching the indexes that manages local/foreign index shard access transparently
 * Indexes are sharded on namespace and further into smaller parts (each checked on query, e.g. map/reduced)
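
The "each shard checked on query" behaviour above can be sketched as a scatter/gather: query every shard, then merge the partial top hits. This is only an illustration, not lsearchd's actual code. The in-memory dict shards, the substring match and the names search_shard and search_all are all assumptions made for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, limit):
    """Return the top `limit` (score, title) hits from one shard.

    A shard here is a plain dict of title -> score (an assumption;
    real shards are Lucene index files, possibly on a foreign host).
    """
    hits = [(score, title) for title, score in shard.items()
            if term in title.lower()]
    return heapq.nlargest(limit, hits)


def search_all(shards, term, limit=3):
    """Scatter the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, limit), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(limit, merged)


shards = [
    {"Search engine": 0.9, "Apache Solr": 0.4},
    {"Full-text search": 0.7, "Lucene": 0.8},
]
results = search_all(shards, "search")
print(results)
```

Each shard only has to return its own top `limit` hits, because the global top `limit` is always contained in the union of the per-shard tops.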

Index updates:
 * Initial index building for a wiki is via an XML dump using an indexbuilder tool
 * Incremental updates work via polling OAI
 * There used to be a synchronous update triggered by the searchupdate hook on article edit
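
A rough sketch of what "incremental updates via polling" means: remember the timestamp of the last change seen, ask the repository for everything newer, apply it, and advance the mark. The record fields, the fake_fetch stand-in and the dict used as an index are assumptions for illustration. A real poller would issue OAI requests against the wiki and write to the Lucene index.

```python
def poll_once(fetch_changes, index, since):
    """Apply all changes newer than `since`; return the new high-water mark."""
    latest = since
    for record in fetch_changes(since):
        if record["deleted"]:
            index.pop(record["title"], None)   # page deleted: drop from index
        else:
            index[record["title"]] = record["text"]  # new or updated page
        latest = max(latest, record["timestamp"])
    return latest


# Hypothetical change feed standing in for the OAI repository.
CHANGES = [
    {"timestamp": 1, "title": "A", "text": "first", "deleted": False},
    {"timestamp": 2, "title": "B", "text": "second", "deleted": False},
    {"timestamp": 3, "title": "A", "text": "", "deleted": True},
]


def fake_fetch(since):
    """Return the changes strictly newer than `since`, oldest first."""
    return [r for r in CHANGES if r["timestamp"] > since]


index = {}
mark = poll_once(fake_fetch, index, since=0)
mark_again = poll_once(fake_fetch, index, since=mark)  # nothing new: no-op
print(index, mark)
```

Because the poller only keeps a single timestamp as state, a missed poll is harmless: the next one picks up everything since the last mark.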

Misc notes:
 * requests to the daemon use the /db/searchterm format; responses come back in one of the opensearch/xml/json formats
 * "prefix format" is used for "lists of suggestions"
 * search daemon uses 80 threads (class SearchServer), so it can run 80 search requests in parallel, higher than normal load (~10?)
 * one daemon running on each server
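
To make the request format and the fixed thread pool concrete, here is a minimal sketch. It only follows the /db/searchterm shape written in the notes above. The function names, the placeholder response dict and the example paths are assumptions, not the daemon's real interface.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote

MAX_WORKERS = 80  # thread count mentioned in the notes


def parse_request(path):
    """Split '/db/searchterm' into (db, term), URL-decoding the term."""
    db, _, term = path.lstrip("/").partition("/")
    if not db or not term:
        raise ValueError("expected /db/searchterm, got %r" % path)
    return db, unquote(term)


def handle(path):
    """Placeholder handler: a real one would run the query against the index."""
    db, term = parse_request(path)
    return {"db": db, "query": term, "hits": []}


# A bounded pool caps the number of searches that run at the same time.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    responses = list(pool.map(handle, ["/enwiki/solr", "/dewiki/lucene%20index"]))
print(responses)
```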

Possible things to fix:
 * The pool avoids synchronization around Files, which would curtail concurrency. Solr already makes optimizations for resource sharing. Current code is in searchpool (searchcache?) in the search package.
 * better error handling? (e.g. on timeout)
 * index opened multiple times and handles pooled. Searchers check locally and then check foreign servers (index is partitioned).
 * RMI load balancing is not smart, just random (using solr probably would deal with this)
 * XMLRPC not used anymore (not since the switch to OAI)
 * Fix bugs in disabled interwiki search code that caused it to hang
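
On the "not smart, just random" load balancing point, one possible alternative is picking the host with the fewest requests in flight. The sketch below contrasts the two. The Balancer class and host names are hypothetical, and this is not a description of how Solr balances requests.

```python
import random


class Balancer:
    """Tracks in-flight requests per host (hypothetical helper)."""

    def __init__(self, hosts):
        self.in_flight = {host: 0 for host in hosts}

    def pick_random(self, rng=random):
        """Current behaviour per the notes: ignore load entirely."""
        return rng.choice(list(self.in_flight))

    def pick_least_loaded(self):
        """Prefer the host with the fewest outstanding requests."""
        return min(self.in_flight, key=self.in_flight.get)

    def start(self, host):
        self.in_flight[host] += 1

    def finish(self, host):
        self.in_flight[host] -= 1


balancer = Balancer(["search1", "search2", "search3"])
balancer.start("search1")
balancer.start("search1")
balancer.start("search2")
choice = balancer.pick_least_loaded()
print(choice)
```

Random choice evens out only on average. Counting in-flight requests also steers traffic away from a host that is stuck on slow queries.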

Solr
Lucene features to make sure Solr has:
 * Custom ranking metric (we have custom MW logic for determining hit score)
 * "Did You Mean?" engine that can handle multi-word queries (e.g. for spellchecking)
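
As a toy illustration of the multi-word "Did You Mean?" requirement, the sketch below corrects each word of a query against a known vocabulary using Python's difflib. It is a deliberately naive per-word approach. The vocabulary and the suggest function are assumptions, and neither the existing engine nor Solr's spellchecker is claimed to work this way.

```python
import difflib


def suggest(query, vocabulary, cutoff=0.6):
    """Return a corrected query string, or None if nothing would change."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)  # already a known word
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    suggestion = " ".join(corrected)
    return suggestion if suggestion != query.lower() else None


vocabulary = ["wikipedia", "search", "engine"]
print(suggest("wikipeda serch", vocabulary))
```

A real multi-word engine would also need to score the corrected phrase as a whole, since fixing each word independently can produce a phrase that never occurs in the index.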

...potentially related Solr features: http://lucene.apache.org/solr/features.html
 * (Query) Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.
 * (Core) Pluggable user functions for Function Query
 * (Query) Auto-suggest functionality for completing user queries
 * (Query) Dynamic search results clustering using Carrot2
 * (Schema) Many additional text analysis components including word splitting, regex and sounds-like filters

Solr Links:
 * http://lucene.apache.org/solr/ - single-node frontend for index query/update
 * http://lucene.apache.org/solr/4_1_0/tutorial.html - 4.1.0 tutorial
 * http://wiki.apache.org/solr/SolrCloud - Sharding indices and using a federated group of solr instances to serve query responses

Ram's Prepared Questions
1. Assuming everything is configured and running, overview of what is running where.
2. Quick summary of use of RMI.
3. Design/philosophy of error handling in both Java and PHP.
4. Are there significant parts of code/functionality that are currently unused?
5. We have 25 servers dedicated to search; rationale for this number.
6. The search and index servers have thread-pools of 80 and 25 threads respectively; rationale for these numbers.
7. Track a search query from browser to PHP to LuceneSearch back to PHP and to browser (at any level of detail).
8. Seems like we have a single indexing server at each DC where updates to the index happen. Is this correct?
9. When an article is edited and saved, a PHP hook triggers to update index; track the series of steps that happen (at any level of detail).

OAI: http://www.mediawiki.org/wiki/Extension:OAIRepository