Extension:CirrusSearch/Query Construction

This page describes how CirrusSearch transforms the user query into a structured Elasticsearch query.

Overview
CirrusSearch interacts with MediaWiki core by extending SearchEngine. This class exposes 3 main ways to query the index and find pages (called SearchEngine entry points in CirrusSearch):
 * fulltext: the classic full text search provided by Special:Search or the search module of API:Query
 * near match: not directly exposed through an interface or an API, this call is responsible for the "go" feature: when the typed text matches a page title almost perfectly, the user is sent directly to that page instead of Special:Search.
 * completion: used by all autocomplete (search-as-you-type) features.

When the query string and its associated metadata enter Cirrus, they undergo several transformation steps:
 * 1) Parsing
 * 2) Profile selection
 * 3) Elasticsearch query building
 * 4) Elasticsearch response transformation
 * 5) Fallback methods evaluation
 * 6) CrossProject searches

Parsing
Parsing is responsible for extracting features from the user query string. Although parsing matters most for fulltext search queries, it also applies to the other search entry points: for instance, namespace prefix extraction happens in all searches and can be considered a parsing step.
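As an illustration of namespace prefix extraction as a parsing step, here is a minimal Python sketch. It is not the CirrusSearch implementation: the function name and the small namespace table are assumptions made for the example (the IDs follow MediaWiki's default numbering), and a real wiki also has localized names and aliases.

```python
from typing import Tuple

# Hypothetical, deliberately tiny namespace table used only for this sketch.
NAMESPACES = {"talk": 1, "user": 2, "help": 12}
MAIN_NAMESPACE = 0


def extract_namespace_prefix(query: str) -> Tuple[int, str]:
    """Split 'Prefix:rest' into (namespace id, rest) when Prefix is a known namespace."""
    prefix, sep, rest = query.partition(":")
    ns = NAMESPACES.get(prefix.strip().lower())
    if sep and ns is not None:
        return ns, rest.strip()
    # No recognised prefix: keep the whole string and search the main namespace.
    return MAIN_NAMESPACE, query.strip()
```

A string such as "Help:search syntax" is split into the Help namespace and the remaining text, while "intitle:foo bar" is left untouched because "intitle" is a keyword, not a namespace.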

Parsing produces a  instance that contains all the information known about the query and its context.
 * the search engine entry point
 * all its metadata (size, offset, ...)
 * contextual filters (e.g. the prefix option provided by Extension:InputBox)
 * the parsed query (AST)

The  is immutable.
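The idea of an immutable container for the parsed query can be sketched in Python with a frozen dataclass. The class name ParsedSearch and all of its field names are hypothetical and chosen only for this example; they do not reflect the actual CirrusSearch class or its API.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ParsedSearch:  # hypothetical name, not the CirrusSearch class
    entry_point: str                     # "fulltext", "near_match" or "completion"
    limit: int                           # metadata: number of hits requested
    offset: int                          # metadata: pagination offset
    contextual_filters: Tuple[str, ...]  # e.g. a prefix restriction
    parsed_query: Mapping[str, object]   # stand-in for the AST


first_page = ParsedSearch(
    entry_point="fulltext",
    limit=20,
    offset=0,
    contextual_filters=("prefix:Help",),
    parsed_query=MappingProxyType({"type": "words", "value": "query building"}),
)

# Later steps derive a new instance instead of mutating the original.
second_page = replace(first_page, offset=20)

try:
    first_page.offset = 40
except FrozenInstanceError:
    print("in-place modification is rejected")
```

Because the instance cannot change after parsing, every later step (profile selection, query building, fallback evaluation) can rely on seeing the same input.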

Profile selection
Profile selection is the process responsible for deciding which profiles are best to use for a given.

Elasticsearch query building
This is the process of building the Elasticsearch search request body.

Retrieval query
The retrieval query is meant to extract all the documents that match the user query. This Elasticsearch query is split into two parts.

Scoring part
Elements of the query that affect scoring. Changing something here should not change the set of hits found by the retrieval query. This section of the query must only affect the initial ranking of the results. The scoring part of a query is controlled by a FullTextQueryBuilder, which is currently only supported by the fulltext SearchEngine entry point.

Filtering part
Elements of the query that do not affect ranking. Changing something here does not affect ranking but changes the set of hits found by the retrieval query. Filtering is also controlled by FullTextQueryBuilder but will change similarly to have a  as input.
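The split between scoring and filtering can be illustrated with an Elasticsearch bool query, written here as a Python dict. This is a simplified sketch, not the query CirrusSearch actually generates: the helper name and the field names text and namespace are assumptions. It relies on the Elasticsearch behaviour that, when a bool query has a filter clause, its should clauses are optional and only contribute to the score.

```python
import json


def build_retrieval_query(scoring_clauses, filter_clauses):
    """Combine scoring and filtering clauses into a single bool query."""
    if not filter_clauses:
        # Without a filter (or must) clause, at least one "should" clause
        # would have to match, so scoring would start affecting the hit set.
        raise ValueError("at least one filter clause is required")
    return {
        "bool": {
            # Filtering part: decides which documents match, never scored.
            "filter": list(filter_clauses),
            # Scoring part: optional clauses that only influence ranking.
            "should": list(scoring_clauses),
        }
    }


query = build_retrieval_query(
    scoring_clauses=[{"match": {"text": {"query": "query building", "boost": 2.0}}}],
    filter_clauses=[
        {"match": {"text": {"query": "query building", "operator": "and"}}},
        {"terms": {"namespace": [0, 12]}},
    ],
)
print(json.dumps({"query": query}, indent=2))
```

With this layout, changing a boost in the should list reorders the results but leaves the set of hits unchanged, while changing the filter list changes the set of hits without affecting how they are scored.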

Rescore query
The rescore query fine-tunes the ranking. Depending on the need, multiple rescore queries can be combined, and so can their scores. Some searches may prefer to combine the score from the scoring part of the retrieval query with some rescore components.
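As a rough sketch of how rescore queries are expressed in an Elasticsearch request body, the Python helper below appends a rescore entry to an existing body. The helper name, the weights, the window sizes, and the fields incoming_links and text are assumptions chosen for the example; the keys inside the rescore entry (window_size, rescore_query, query_weight, rescore_query_weight, score_mode) are the standard Elasticsearch ones.

```python
def add_rescore(body, rescore_query, window_size=100,
                query_weight=1.0, rescore_weight=1.0, score_mode="total"):
    """Return a copy of the request body with one more rescore step appended."""
    rescore = {
        # Only the top window_size hits of each shard are rescored.
        "window_size": window_size,
        "query": {
            "rescore_query": rescore_query,
            # Weight of the score coming from the retrieval query.
            "query_weight": query_weight,
            # Weight of the score produced by this rescore query.
            "rescore_query_weight": rescore_weight,
            # How both scores are combined: total, multiply, avg, max or min.
            "score_mode": score_mode,
        },
    }
    new_body = dict(body)
    new_body["rescore"] = list(body.get("rescore", [])) + [rescore]
    return new_body


body = {"query": {"match": {"text": "query building"}}}

# First rescore: boost pages by popularity (hypothetical field name).
body = add_rescore(
    body,
    {"function_score": {"field_value_factor": {
        "field": "incoming_links", "modifier": "log2p", "missing": 0}}},
    window_size=8192,
    score_mode="multiply",
)

# Second rescore: reward exact phrase matches on a smaller window.
body = add_rescore(
    body,
    {"match_phrase": {"text": {"query": "query building", "slop": 1}}},
    window_size=512,
    rescore_weight=10.0,
)
```

Rescore entries run in order, each one seeing the scores produced by the previous step, which is one way the scores of several rescore queries end up combined.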

Fetch phase configuration
This is the part of the search request that tells Elasticsearch what data to extract for every hit we display to the user. This phase of the query building process is not yet fully designed and the current way of doing things is not optimal. A ResultsType is chosen early in the process and is responsible for selecting the fields to extract and the fields to highlight.

This is tricky because it is directly connected to the way we display the search hits in Special:Search. Some extensions may want to extract and display specific data that they stored using a custom mapping and a custom ContentHandler. Some keywords may want to tell the user that they matched a particular part of the document. Some extensions may want to completely transform the data and the appearance of the hits using hooks like Manual:Hooks/ShowSearchHit.

Currently:
 * Fields are extracted using ResultsType::getSourceFiltering and ResultsType::getStoredFields
 * Highlighting is set up by ResultsType::getHighlightingConfiguration
 * Special:Search's look can be changed using Manual:Hooks/ShowSearchHit or Manual:Hooks/ShowSearchHitTitle
 * Extensions can register extra data into SearchResult using Manual:Hooks/SearchResultsAugment
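To make the fetch phase more concrete, the sketch below builds the corresponding parts of an Elasticsearch request body in Python. It only loosely mirrors what the three ResultsType methods listed above contribute; the helper name, the field names, the fragment settings, and the highlight tags are assumptions for the example, not CirrusSearch defaults.

```python
def build_fetch_config(source_fields, stored_fields, highlight_fields):
    """Build the fetch-related keys of an Elasticsearch search request."""
    return {
        # Source filtering: which parts of _source are returned for each hit.
        "_source": list(source_fields),
        # Stored fields: values fetched from separately stored fields.
        "stored_fields": list(stored_fields),
        # Highlighting: one short fragment per requested field.
        "highlight": {
            "pre_tags": ['<span class="searchmatch">'],
            "post_tags": ["</span>"],
            "fields": {
                name: {"number_of_fragments": 1, "fragment_size": 150}
                for name in highlight_fields
            },
        },
    }


fetch = build_fetch_config(
    source_fields=["namespace", "title", "timestamp"],
    stored_fields=["text.word_count"],
    highlight_fields=["title", "text"],
)

# The fetch keys sit next to the retrieval query in the request body.
request_body = {"query": {"match": {"text": "query building"}}, **fetch}
```

Keeping these three pieces in one place shows why they are inter-dependent: a field that is highlighted or displayed must also be requested, yet each piece is currently configured through a separate method or hook.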

Drawbacks are:
 * None of these techniques can be strongly coupled but they are highly inter-dependent
 * ResultsType is not driven by profile and it's unclear when it should be constructed. Cirrus decides the ResultsType before anything else, but some FullTextQueryBuilder may override it
 * Manual:Hooks/ShowSearchHit being a hook gives no guarantee that it'll be executed in the right order (not have its values overridden) nor that it has all the required context to know what to do.
 * Keywords are unable to cleanly add new highlighting hints