User:Ram (WMF)/Search

NOTE: Areas that I'm still unclear about are marked by a bracketed comment: [?Add ...]

Overview of MediaWiki search
 * The Apache project Lucene provides search capabilities in MediaWiki. The lucene daemon, which is a Java program, runs identically configured on a cluster of 25 machines at our data center in Ashburn, Virginia (a.k.a. eqiad), with a similar cluster running at the data center in Tampa, Florida (a.k.a. pmtpa) as a hot failover standby.
 * Each search server listens on port 8123 for search queries.
 * Clustering uses LVS (Linux Virtual Server); further details about that tool are at: http://www.linuxvirtualserver.org/whatis.html [? Add details on how clustering works in our case.]
 * Each server is configured automatically using Puppet; the Puppet code can be cloned from (replace xyz with your user name):
    ssh://xyz@gerrit.wikimedia.org:29418/operations/puppet.git
   The config files are under templates/lucene; similarly, LVS clustering is configured via Puppet using files under templates/lvs. [? Add details on how Puppet uses these files.]
 * The status of the various servers can be seen at: http://ganglia.wikimedia.org/latest/. From the Choose source dropdown, select Search eqiad. You can click on the Physical View button at top right to see details like the amount of RAM, number of cores, etc.
 * The MediaWiki extension MWSearch (PHP code) receives search queries and routes them to a search server [?Add details on how this is done.]
 * The file operations/mediawiki-config/wmf-config/lucene.php defines a number of globals to configure search; these include the port number, LVS cluster IP addresses, timeout, cache-expiry, etc. The main search file extensions/MWSearch/MWSearch.php is also require'd here. NOTE: The timeout is defined as 10s: $wgLuceneSearchTimeout = 10; which may be too small when servers are busy.

Search Details (PHP)
ApiQuerySearch seems to be the main class handling search requests. Its inheritance hierarchy looks like this: ApiQuerySearch → ApiQueryGeneratorBase → ApiQueryBase → ApiBase → ContextSource/IContextSource.

ApiQuerySearch::run starts query processing ''[? Where is this called from?]'' and does the following:


 * Invokes ApiBase::extractRequestParams to get the parameter list.
 * Creates a new LuceneSearch object and invokes the searchText method on it, which invokes LuceneSearchSet::newFromQuery.
 * That routine does the following:
  * Creates the search URL like this: $searchUrl = "http://$host:$wgLucenePort/$method/$wgDBname/$enctext?" to which a few parameters (namespaces, etc.) are appended.
  * Invokes Http::get, which invokes MWHttpRequest::factory to get a new request object, typically a CurlHttpRequest, and invokes execute on it.
  * That method uses the native PHP functions curl_init, curl_setopt, curl_exec, and curl_close to make the HTTP call (which goes to the Java engine); the results are saved in the request object.
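The PHP flow above amounts to building a URL and issuing an HTTP GET against the Java daemon. A sketch of the URL construction in Java (the host, port, and wiki name below are hypothetical stand-ins for the values configured in lucene.php):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
    // Mirrors the PHP $searchUrl construction:
    // "http://$host:$wgLucenePort/$method/$wgDBname/$enctext?"
    static String build(String host, int port, String method, String dbName, String query) {
        String enc = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "http://" + host + ":" + port + "/" + method + "/" + dbName + "/" + enc + "?";
    }

    public static void main(String[] args) {
        // Hypothetical values; in production the host comes from the LVS cluster IPs
        System.out.println(build("search.svc.eqiad.wmnet", 8123, "search", "enwiki", "hello world"));
        // -> http://search.svc.eqiad.wmnet:8123/search/enwiki/hello+world?
    }
}
```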

Search Details (Java)
Most of the code is in subdirectories of src/org/wikimedia/lsearch/. The main class dealing with search itself is search/SearchServer.java; classes interfacing with PHP are in frontend, those dealing with networking are in interoperability, and the main entry point is config/StartupManager.java.

Some important classes are described below.
 * StartupManager
  * Performs these steps:
   * Get local and global configurations and retrieve various parameters (language codes, localization data, etc.)
   * Invoke the static methods createRegistry and bindRMIObjects in RMIServer (see below for more on this class).
   * If this is an indexer machine, start a new HTTPIndexServer [default] or RPCIndexServer [? Is this ever used?].
   * If this is a search machine:
    * Start a new SearchServer.
    * Create the singleton SearcherCache.
    * Start the singleton threads UpdateThread and NetworkStatusThread.


 * HttpHandler
 * This is an abstract class (with processRequest the only abstract method) that extends Thread; it is extended by HTTPIndexDaemon (handles index update requests) and SearchDaemon (handles search requests).
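The handler hierarchy follows a template-method shape: the thread's run() does the generic request handling and defers to processRequest. A minimal sketch under that assumption (everything except the processRequest name is illustrative):

```java
// Sketch of the HttpHandler pattern: a Thread subclass whose run() extracts
// the request path and defers the real work to an abstract processRequest().
abstract class HandlerSketch extends Thread {
    private final String requestLine;

    HandlerSketch(String requestLine) { this.requestLine = requestLine; }

    // SearchDaemon and HTTPIndexDaemon would each supply their own version
    abstract String processRequest(String path);

    @Override
    public void run() {
        // e.g. "GET /search/enwiki/hello HTTP/1.0" -> "/search/enwiki/hello"
        String path = requestLine.split(" ")[1];
        System.out.println(processRequest(path));
    }
}

class SearchDaemonSketch extends HandlerSketch {
    SearchDaemonSketch(String req) { super(req); }
    @Override String processRequest(String path) { return "search: " + path; }
}

public class HandlerDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new SearchDaemonSketch("GET /search/enwiki/hello HTTP/1.0");
        t.start();
        t.join(); // prints "search: /search/enwiki/hello"
    }
}
```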


 * SearchDaemon
  * Extends HttpHandler; one of these is created for each incoming search request and run by the thread pool in SearchServer (see below). Provides a definition of processRequest, which does the following:
   * If it is a non-search request (e.g. /robots.txt, /stats, /status), return the relevant data.
   * Otherwise:
    * Create a new SearchEngine (the top-level search class) and invoke it to get search results.
    * Return results in one of three formats: standard, JSON, or OpenSearch.
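Sketched as a dispatch, processRequest looks roughly like this (the non-search paths are from the list above; the search branch, its responses, and the three output formats are stubbed):

```java
public class DispatchSketch {
    // Hypothetical stand-in for SearchDaemon.processRequest's routing
    static String dispatch(String path) {
        switch (path) {
            case "/robots.txt": return "User-agent: *\nDisallow: /";
            case "/stats":      return "stats";
            case "/status":     return "OK";
            default:
                // real code: run a SearchEngine query, then render the results
                // as standard, JSON, or OpenSearch output
                return "search:" + path;
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch("/status"));              // OK
        System.out.println(dispatch("/search/enwiki/hello")); // search:/search/enwiki/hello
    }
}
```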


 * HTTPIndexDaemon
  * Similar to SearchDaemon (above); extends HttpHandler; one of these is created for each incoming index request and run by the thread pool in HTTPIndexServer (see below). Provides a definition of processRequest. [?Add details on what it does.]


 * SearchServer
  * Extends Thread. Though not defined as a singleton, it appears to be one in practice. Started by StartupManager (see above). Does the following:
   * Creates Statistics and StatisticsThread objects to supply stats to Ganglia.
   * Creates a thread pool of maxThreads [default: 80] threads.
   * Listens on a ServerSocket [default port: 8123]; when a connection is made, creates a new SearchDaemon object and runs it in the pool if the pool is not full. If the pool is full, logs an error and simply closes the socket! [?NOTE: There may be an off-by-one error in the check to see if the pool is full.]
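The full-pool behavior can be sketched with a bounded java.util.concurrent pool: with no request queue, a busy pool rejects the task, which is the point where SearchServer would log the error and close the client socket (the pool size and names here are illustrative; production maxThreads is 80):

```java
import java.util.concurrent.*;

public class PoolSketch {
    // Returns how many of `submissions` long-running tasks a full pool rejects.
    static int simulate(int poolSize, int submissions) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),            // no queue: full pool => reject
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch block = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < submissions; i++) {
            try {
                pool.execute(() -> {
                    try { block.await(); } catch (InterruptedException ignored) {}
                });
            } catch (RejectedExecutionException e) {
                rejected++;  // SearchServer logs an error and closes the socket here
            }
        }
        block.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rejected=" + simulate(2, 4)); // rejected=2
    }
}
```

AbortPolicy surfaces the rejection as an exception; the real code performs its own size check before submitting, which is where the suspected off-by-one would live.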


 * HTTPIndexServer
  * Similar to SearchServer above; extends Thread. Though not defined as a singleton, it appears to be one in practice. Started by StartupManager. Does the following:
   * Creates a thread pool of 25 (hardcoded) threads.
   * Listens on a ServerSocket [default port: 8321]; when a connection is made, creates a new HTTPIndexDaemon object and runs it in the pool if the pool is not full. If the pool is full, logs an error and simply closes the socket! [?NOTE: There may be an off-by-one error in the check to see if the pool is full.]


 * IndexDaemon
  * A simple class that functions as an interface adapter, presenting a much simpler interface to clients of the somewhat complex IndexThread class. It is not clear why this is done via a concrete class rather than an interface implemented by IndexThread.
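The adapter shape is simple; a sketch with hypothetical names (the real IndexThread API is considerably more involved):

```java
// Stand-in for the complex IndexThread API that clients shouldn't see directly.
class ComplexIndexer {
    String enqueue(String db, String op, String article, boolean flush) {
        return op + ":" + db + "/" + article + (flush ? " (flushed)" : "");
    }
}

// Sketch of the IndexDaemon idea: a thin concrete adapter exposing a couple
// of simple entry points over the complex worker. Names are illustrative.
public class IndexAdapter {
    private final ComplexIndexer indexer = new ComplexIndexer();

    public String update(String db, String article) {
        return indexer.enqueue(db, "update", article, false);
    }

    public String delete(String db, String article) {
        return indexer.enqueue(db, "delete", article, true);
    }

    public static void main(String[] args) {
        IndexAdapter d = new IndexAdapter();
        System.out.println(d.update("enwiki", "Lucene")); // update:enwiki/Lucene
    }
}
```

A concrete adapter keeps call sites simple, but it cannot be swapped for a test double; an interface implemented by IndexThread would allow that, which may be why the concrete-class choice looks odd.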


 * HttpMonitor
 * Coming soon


 * RPCIndexDaemon
 * Coming soon


 * RPCIndexServer
 * Coming soon

Installing MediaWiki and lucene-search-2 for debugging
These instructions are targeted at developers who want to set up an instance of MediaWiki and the Lucene-based search functionality for testing and debugging; the intent here is not to set up a production system.

Details on how to install MediaWiki are at: http://www.mediawiki.org/wiki/Installation A summary appears below along with some additional details.

 * Download the latest release from: http://www.mediawiki.org/wiki/Download and extract the archive; then rename the top-level directory to 'core' or something similar for ease of typing, e.g.:
    cd ~/src
    tar xvf mediawiki-1.20.2.tar.gz
    mv mediawiki-1.20.2 core
 * Install prerequisites (if you prefer MySql to SQLite3, replace the sqlite packages below with the corresponding MySql packages):
    list="php5 php5-sqlite sqlite3 apache2 git default-jdk ant debhelper javahelper"
    list="$list liblog4j1.2-java libcommons-logging-java libslf4j-java"
    sudo apt-get install $list
 * Check out the MWSearch extension from the git repository, e.g.:
    cd ~/src
    mkdir extensions; cd extensions
    git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/MWSearch.git
 * Make sure the apache/php combo is working by creating a file named info.php at /var/www containing:
    <?php phpinfo(); ?>
   Now point your browser (or use wget/curl to fetch the page) at http://localhost/info.php. You should see lots of tables with PHP configuration info (replace localhost with the appropriate host name or IP if necessary).
 * Create a data directory somewhere and make it world read/write (this is to allow apache to create an sqlite DB file under it). Also make sure that all the directories from your home directory to the MediaWiki root are world readable and searchable (otherwise you'll get errors from Apache as it searches for .htaccess files), e.g.:
    mkdir ~/data; chmod 777 ~/data
    chmod 755 ~ ~/src ~/src/core
 * Reconfigure apache by editing /etc/apache2/sites-available/default to remove unnecessary stuff and also set DocumentRoot to point to the freshly unpacked MediaWiki root above. Something close to this should work (replace xyz by a proper user name):
    ServerAdmin webmaster@localhost
    DocumentRoot /home/xyz/src/core
    Alias /extensions /home/xyz/src/extensions
    ErrorLog ${APACHE_LOG_DIR}/error.log
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/access.log combined
    php_admin_flag engine on
    php_admin_flag engine off
    AllowOverride All
 * Reload the apache configuration with:
    sudo /etc/init.d/apache2 reload
 * You should now be able to point your browser at http://localhost/index.php and follow the on-screen instructions to configure MediaWiki; at the end you'll be prompted to download the generated LocalSettings.php file and place it in the MediaWiki root directory to complete the configuration step.
 * The previous step can also be done from the command line, e.g.:
    php core/maintenance/install.php --help
   Documentation on the various parameters is at: http://www.mediawiki.org/wiki/Manual:Config_script You can edit the generated LocalSettings.php file manually to add additional configuration options as needed; for example, some of these may be useful:
    require( "$IP/../extensions/MWSearch/MWSearch.php" );
    $wgLuceneHost = 'lucene-test1';
    $wgLucenePort = 8123;
    $wgLuceneSearchVersion = '2.1';
    $wgLuceneUseRelated = true;
    $wgEnableLucenePrefixSearch = false;
    $wgSearchType = 'LuceneSearch';
 * Check out lucene-search-2:
    cd ~/src
    git clone https://gerrit.wikimedia.org/r/p/operations/debs/lucene-search-2.git
   There is a top-level README.txt file that describes how to build it; we summarize the steps below.
 * Run ant to build everything; the result should be a local file named LuceneSearch.jar:
    cd lucene-search-2; ant
 * The README.txt file mentions running the configure script, but that script is missing in the git checkout. Create it to contain:
    #!/bin/bash
    dir=`cd $1; pwd`
    java -cp LuceneSearch.jar org.wikimedia.lsearch.util.Configure $dir
   Now run it with the full path to the MediaWiki root directory as an argument, e.g.:
    bash configure ~/src/core
   It will examine your MediaWiki configuration and generate these matching configuration files for search: lsearch.log4j, lsearch-global.conf, lsearch.conf
 * The generated lsearch.log4j uses ScribeAppender, which requires installation of additional packages (without them you'll get Java exceptions when you run the lsearchd daemon); one way to get around this is to remove those references and use a RollingFileAppender:
    log4j.rootLogger=INFO, R
    log4j.appender.R=org.apache.log4j.RollingFileAppender
    log4j.appender.R.File=logs/test.log
    log4j.appender.R.MaxFileSize=10MB
    log4j.appender.R.MaxBackupIndex=2
    log4j.appender.R.layout=org.apache.log4j.PatternLayout
    log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %-5p %c %m%n
    log4j.logger.org.wikimedia.lsearch.interoperability=DEBUG
 * Now get an XML dump:
    wget http://dumps.wikimedia.org/simplewiktionary/20130113/simplewiktionary-20130113-pages-meta-current.xml.bz2
   and build Lucene indexes from it:
    java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.BuildAll simplewiktionary-20130113-pages-meta-current.xml.bz2
   This last command is equivalent to running the build script mentioned in README.txt; it creates a new directory named indexes and a number of files and directories under it.
 * Finally, you can run the search daemon:
    ./lsearchd &
   It listens for search queries on port 8123, so you can test it like this:
    wget http://localhost:8123/search/my_wiki/hello
   Logs can be found under the logs directory.