User:Smalyshev (WMF)/test

This file provides documentation for CirrusSearch configuration variables.

It should be updated each time a new configuration parameter is added or changed.

Configuration

 * $wgCirrusSearchDefaultCluster

Default: $wgCirrusSearchDefaultCluster = 'default';

Default cluster for read operations. This is an array key mapping into $wgCirrusSearchClusters. When running multiple clusters this should be pointed to the closest cluster, and can be pointed at an alternate cluster during downtime.

As a form of backwards compatibility the existence of $wgCirrusSearchServers will override all cluster configuration.


 * $wgCirrusSearchClusters

Default: $wgCirrusSearchClusters = [ 'default' => [ 'localhost' ], ];

List of search clusters. Each key is the name of an Elasticsearch cluster and each value is a list of addresses to connect to. If no port is specified it defaults to 9200.

All writes will be processed in all configured clusters by the ElasticaWrite job, unless $wgCirrusSearchWriteClusters is configured (see below).

Example:

    $wgCirrusSearchClusters = array(
        'eqiad' => array( 'es01.eqiad.wmnet', 'es02.eqiad.wmnet' ),
        'codfw' => array( 'es01.codfw.wmnet', 'es02.codfw.wmnet' ),
    );


 * $wgCirrusSearchWriteClusters

Default: $wgCirrusSearchWriteClusters = null;

List of clusters that can be used for writing. Must be a subset of keys from $wgCirrusSearchClusters. By default or when set to null, all keys of $wgCirrusSearchClusters are available for writing.
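As a rough sketch of how the three cluster settings fit together, the LocalSettings.php fragment below reuses the eqiad/codfw host names from the example above. Which cluster counts as the closest one, and the choice to restrict writes to a single cluster, are assumptions made purely for illustration.

```php
// Two clusters, keyed by name; ports default to 9200 when omitted.
$wgCirrusSearchClusters = [
    'eqiad' => [ 'es01.eqiad.wmnet', 'es02.eqiad.wmnet' ],
    'codfw' => [ 'es01.codfw.wmnet', 'es02.codfw.wmnet' ],
];

// Read from the closest cluster (assumed here to be eqiad).
$wgCirrusSearchDefaultCluster = 'eqiad';

// Assumption for illustration: write only to eqiad.
// Leave this as null to write to every cluster in $wgCirrusSearchClusters.
$wgCirrusSearchWriteClusters = [ 'eqiad' ];
```

Every name listed in $wgCirrusSearchWriteClusters and the value of $wgCirrusSearchDefaultCluster must be keys of $wgCirrusSearchClusters.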


 * $wgCirrusSearchConnectionAttempts

Default: $wgCirrusSearchConnectionAttempts = 1;

How many times to attempt connecting to a given server. If you're behind LVS and everything looks like one server, you may want to reattempt 2 or 3 times.


 * $wgCirrusSearchShardCount

Default: $wgCirrusSearchShardCount = [ 'content' => 4, 'general' => 4, 'titlesuggest' => 4 ];

Number of shards for each index.

You can also set this setting for each cluster:

    $wgCirrusSearchShardCount = array(
        'cluster1' => array( 'content' => 2, 'general' => 2 ),
        'cluster2' => array( 'content' => 3, 'general' => 3 ),
    );


 * $wgCirrusSearchReplicas

Default: $wgCirrusSearchReplicas = '0-2';

Number of replicas Elasticsearch can expand or contract to. This allows for easy development and deployment to a single node (0 replicas) while still scaling up to higher levels of replication. If you need more redundancy you could adjust this to '0-10' or '0-all', or even 'false' (string, not boolean) to disable the behavior entirely. The default should be fine for most people.

You can also specify this as an array of index type to replica count. If you do then you must specify all index types. For example:

    $wgCirrusSearchReplicas = array( 'content' => '0-3', 'general' => '0-2' );

You can also set this setting for each cluster:

    $wgCirrusSearchReplicas = array(
        'cluster1' => array( 'content' => '0-1', 'general' => '0-2' ),
        'cluster2' => array( 'content' => '0-2', 'general' => '0-3' ),
    );


 * $wgCirrusSearchMaxShardsPerNode

Default: $wgCirrusSearchMaxShardsPerNode = [];

Maximum number of shards of an index allowed on the same Elasticsearch node, keyed by index type. Set this to 1 to prevent two shards from the same high traffic index from being allocated onto the same node.

Example: $wgCirrusSearchMaxShardsPerNode[ 'content' ] = 1;


 * $wgCirrusSearchSlowSearch

Default: $wgCirrusSearchSlowSearch = 10.0;

How many seconds must a search of Elasticsearch take before we consider it slow? The default value is 10 seconds, which should be fine for catching the rare truly abusive queries. For more granular logs that don't contain user information, use Elasticsearch's own query logging.


 * $wgCirrusSearchUseExperimentalHighlighter

Default: $wgCirrusSearchUseExperimentalHighlighter = false;

Should CirrusSearch attempt to use the "experimental" highlighter? It is an Elasticsearch plugin that should produce better snippets for search results. Installation instructions are here: https://github.com/wikimedia/search-highlighter

If you have the highlighter installed you can switch this on and off so long as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true. Setting it to true without the highlighter installed will break search.


 * $wgCirrusSearchOptimizeIndexForExperimentalHighlighter

Default: $wgCirrusSearchOptimizeIndexForExperimentalHighlighter = false;

Should CirrusSearch optimize the index for the experimental highlighter? This will speed up indexing, save a lot of space, and speed up highlighting slightly. This only takes effect if you rebuild the index. The downside is that you can no longer switch $wgCirrusSearchUseExperimentalHighlighter on and off: it has to stay on.
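The interaction between the two highlighter settings can be sketched as two valid LocalSettings.php states. This is only an illustration of the constraints described above, and it assumes the search-highlighter plugin is already installed on every Elasticsearch node.

```php
// State A: highlighter in use, index not optimized for it.
// $wgCirrusSearchUseExperimentalHighlighter can still be toggled freely.
$wgCirrusSearchUseExperimentalHighlighter = true;
$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = false;

// State B (alternative): index optimized for the highlighter.
// Takes effect only after an index rebuild; from then on the
// highlighter must stay enabled.
// $wgCirrusSearchUseExperimentalHighlighter = true;
// $wgCirrusSearchOptimizeIndexForExperimentalHighlighter = true;
```

The combination to avoid is an index rebuilt with the optimization on while $wgCirrusSearchUseExperimentalHighlighter is false.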


 * $wgCirrusSearchWikimediaExtraPlugin

Default: $wgCirrusSearchWikimediaExtraPlugin = [];

Should CirrusSearch try to use the wikimedia/extra plugin? An empty array means don't use it at all.

Here is an example to enable faster regex matching:

$wgCirrusSearchWikimediaExtraPlugin[ 'regex' ] = array( 'build', 'use', 'max_inspect' => 10000 );

The 'build' value instructs Cirrus to build the index required to speed up regex queries. The 'use' value instructs Cirrus to use it to power regular expression queries. If 'use' is added before the index is rebuilt with 'build' in the array then regex will fail to find anything. The value of the 'max_inspect' key is the maximum number of pages to recheck the regex against. It is optional and defaults to 10000, which seems like a reasonable compromise to keep regexes fast while still producing good results.
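Because 'use' must not be enabled before the index has been rebuilt with 'build', a cautious rollout might be done in two steps. The fragment below is a sketch of that ordering, not a prescribed procedure; the index rebuild itself happens outside of this configuration.

```php
// Step 1: only build the regex acceleration data, then rebuild the search index.
$wgCirrusSearchWikimediaExtraPlugin[ 'regex' ] = [ 'build' ];

// Step 2 (after the rebuild has finished): also use it for regex queries.
// 'max_inspect' is optional and defaults to 10000.
// $wgCirrusSearchWikimediaExtraPlugin[ 'regex' ] = [ 'build', 'use', 'max_inspect' => 10000 ];
```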

This turns on noop-detection for updates and is compatible with wikimedia-extra versions 1.3.1, 1.4.2, 1.5.0, and greater:

$wgCirrusSearchWikimediaExtraPlugin[ 'super_detect_noop' ] = true;

This turns on document level noop-detection for updates based on revision ids and is compatible with wikimedia-extra versions 2.3.4.1 and greater:

$wgCirrusSearchWikimediaExtraPlugin[ 'documentVersion' ] = true;

This allows forking on reindexing and is compatible with wikimedia-extra versions 1.3.1, 1.4.2, 1.5.0, and greater:

$wgCirrusSearchWikimediaExtraPlugin[ 'id_hash_mod_filter' ] = true;


 * $wgCirrusSearchEnableRegex

Default: $wgCirrusSearchEnableRegex = true;

Should CirrusSearch try to support regular expressions with insource:? These can be really expensive, but mostly ok, especially if you have the extra plugin installed. Sometimes they still cause issues though.


 * $wgCirrusSearchRegexMaxDeterminizedStates

Default: $wgCirrusSearchRegexMaxDeterminizedStates = 20000;

Maximum complexity of regexes. Raising this will allow more complex regexes to use the memory that they need to compile in Elasticsearch. The default allows reasonably complex regexes and doesn't use too much memory.


 * $wgCirrusSearchQueryStringMaxDeterminizedStates

Default: $wgCirrusSearchQueryStringMaxDeterminizedStates = null;

Maximum complexity of wildcard queries. Raising this value will allow more wildcards in search terms. 500 will allow about 20 wildcards. Setting a high value here can cause the cluster to consume a lot of memory when compiling complex wildcard queries. This setting requires Elasticsearch 1.4+. With Elasticsearch 1.4+, if this setting is disabled (null) the default value is 10000. With Elasticsearch 1.3 this setting must be disabled.

Example: $wgCirrusSearchQueryStringMaxDeterminizedStates = 500;


 * $wgCirrusSearchNamespaceMappings

Default: $wgCirrusSearchNamespaceMappings = [];

By default, Cirrus will organize pages into one of two indexes (general or content) based on whether a page is in a content namespace. This should suffice for most wikis. This setting allows individual namespaces to be mapped to specific index suffixes. The keys are the namespace numbers, and each value is the string name of the index suffix to use. Changing this setting requires a full reindex (not in-place) of the wiki. If this setting contains any values then the index names must also exist in $wgCirrusSearchShardCount.


 * $wgCirrusSearchExtraIndexes

Default: $wgCirrusSearchExtraIndexes = [];

Extra indexes (if any) you want to search, and for what namespaces? The key should be the local namespace, with the value being an array of one or more indexes that should be searched as well for that namespace.

NOTE: This setting makes no attempts to ensure compatibility across multiple indexes, and basically assumes everyone's using a CirrusSearch index that's more or less the same. Most notably, we can't guarantee that namespaces match up; so you should only use this for core namespaces or other times you can be sure that namespace IDs match 1-to-1.

NOTE Part Two: Adding an index here causes Cirrus to spawn jobs to update that other index, trying to set the local_sites_with_dupe field. This is used to filter duplicates that appear on the remote index. This is always done by a job, even when run from forceSearchIndex.php. If you add an image to your wiki after it is already in the extra search index, you'll see duplicate results until the job is done.
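As an illustration, the fragment below makes searches in the local file namespace also consult one additional index. The index name commonswiki_file is a hypothetical placeholder; substitute the name of an index that actually exists on your cluster. NS_FILE is a core namespace, so its ID should match across wikis.

```php
// Key: local namespace ID. Value: list of extra indexes to search as well.
// 'commonswiki_file' is a hypothetical index name used for illustration only.
$wgCirrusSearchExtraIndexes[ NS_FILE ] = [ 'commonswiki_file' ];
```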


 * $wgCirrusSearchUpdateShardTimeout

Default: $wgCirrusSearchUpdateShardTimeout = '1ms';

Shard timeout for index operations. This is the amount of time Elasticsearch will wait around for an offline primary shard. Currently this is just used in page updates and not deletes. It is defined in Elasticsearch's time format which is a string containing a number and then a unit which is one of d (days), m (minutes), h (hours), ms (milliseconds) or w (weeks). Cirrus defaults to a very tiny value to prevent job executors from waiting around a long time for Elasticsearch. Instead, the job will fail and be retried later.


 * $wgCirrusSearchClientSideUpdateTimeout

Default: $wgCirrusSearchClientSideUpdateTimeout = 120;

Client side timeout, in seconds, for non-maintenance index and delete operations. Set it long enough to account for operations that may be delayed on the Elasticsearch node.


 * $wgCirrusSearchClientSideConnectTimeout

Default: $wgCirrusSearchClientSideConnectTimeout = 5;

Client side timeout when initializing connections. Useful to fail fast if Elasticsearch is unreachable. Set to 0 to use Elastica defaults (300 sec).

You can also set this setting for each cluster:

    $wgCirrusSearchClientSideConnectTimeout = array(
        'cluster1' => 10,
        'cluster2' => 5,
    );


 * $wgCirrusSearchSearchShardTimeout

Default: $wgCirrusSearchSearchShardTimeout = [ 'default' => '20s', 'regex' => '120s', ];

The amount of time Elasticsearch will wait for search shard actions before giving up on them and returning the results from the other shards. Defaults to 20s for regular searches which is about twice the slowest queries we see. Some shard actions are capable of returning partial results and others are just ignored. Regexes default to 120 seconds because they are known to be slow at this point.
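A sketch of overriding both shard timeouts, for example on a small wiki where long-running regex searches are not wanted. The specific values are assumptions chosen for illustration, not recommendations.

```php
$wgCirrusSearchSearchShardTimeout = [
    'default' => '10s', // regular searches
    'regex' => '60s',   // regex searches
];
```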