Manual:Searchindex table

<- MediaWiki architecture < Database layout

The searchindex table is used to provide full text searches. Those can only be done with the MyISAM table type and the cur table uses the InnoDB type to improve concurrency, so a copy is required.

"DESCRIBE searchindex" gives the following:

+--+-+--+-+-+---+ +--+-+--+-+-+---+ +--+-+--+-+-+---+
 * Field   | Type            | Null | Key | Default | Extra |
 * si_page | int(8) unsigned |      | PRI | 0       |       |
 * si_title | varchar(255)   |      | MUL |         |       |
 * si_text | mediumtext      |      |     |         |       |

The fields are:
 * si_page: page ID from cur.
 * si_title: page title from cur.
 * si_text: the indexed text from cur.

Indexes for this, using en wikipedia for cardinality: +-++--+--+-+---+-+--++--++-+ +-++--+--+-+---+-+--++--++-+ +-++--+--+-+---+-+--++--++-+
 * Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
 * searchindex |         0 | si_page  |            1 | si_page     | A         |      797275 |     NULL | NULL   |      | BTREE      |         |
 * searchindex |         1 | si_title |            1 | si_title    | A         |      265758 |     NULL | NULL   |      | FULLTEXT   |         |

A common search query is of the form:

SELECT cur_id,cur_namespace,cur_title,cur_text FROM cur,searchindex WHERE cur_id=si_page AND (MATCH(si_text) AGAINST('+searchterm' IN BOOLEAN MODE) AND cur_is_redirect=0) AND cur_namespace IN (0,9,11)  LIMIT 0, 20;

Explain shows the following: +-++---+---+-+++-+ +-++---+---+-+++-+ +-++---+---+-+++-+
 * table      | type   | possible_keys                             | key           | key_len | ref        | rows   | Extra       |
 * cur        | range  | cur_id,cur_namespace,name_title_timestamp | cur_namespace |       1 | NULL       | 317499 | Using where |
 * searchindex | eq_ref | si_page                                  | si_page       |       4 | cur.cur_id |      1 | Using where |

Known problems
A range scan with namespaces containing a large number of records requires that a large portion of the cur table be loaded into RAM. This is very costly, in part because the cur table contains the full text of the article. A long term plan is to remove the text from the cur table. Temporary measures may involve adding fields to searchindex and indexing them (and using a self join to allow both that index and the fulltext index to be used) or adding a cur index on cur_namespace and cur_is_redirect to decrease the number of complete cur records which must be read to match the query.