Extension:CirrusSearch/Query Construction

This page describes how CirrusSearch transforms the user query into a structured Elasticsearch query.

Overview
CirrusSearch interacts with MediaWiki core by extending SearchEngine. This class exposes 3 main ways to query the index and find pages (called SearchEngine entry points in CirrusSearch):
 * full text: the classic full text search provided by Special:Search or the search module of API:Query
 * near match: not directly exposed through a user interface or an API, this call is responsible for the "go" feature: when the typed text nearly perfectly matches a page title, the user is taken directly to that page instead of Special:Search.
 * completion: used by all autocomplete (search-as-you-type) features.

When the query string and its associated metadata enter Cirrus, they go through several processing steps:
 * 1) Parsing
 * 2) Profile selection
 * 3) Elasticsearch query building
 * 4) Elasticsearch responses transformation
 * 5) Fallback methods evaluation
 * 6) Cross-project searches

Parsing
Parsing is responsible for extracting features from the user query string. While parsing is particularly important for full text search queries, it is also present for the other search entry points: for instance, namespace prefix extraction happens in all searches and can be considered a parsing step.

Parsing produces a  instance that contains all the information known about the query and its context.
 * the search engine entry point
 * all its metadata (size, offset, ...)
 * contextual filters (e.g. the prefix option provided by Extension:InputBox)
 * the parsed query (AST)

The  is immutable.
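As an illustration of the idea, the parse result can be thought of as a frozen value object built once and never modified afterwards. The sketch below is written in Python for brevity and is not CirrusSearch code: the class name `ParsedSearchRequest`, its fields, the `parse` helper and the namespace map are all hypothetical, and the parsed query AST is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ParsedSearchRequest:
    """Hypothetical stand-in for the immutable object produced by parsing."""
    entry_point: str                      # "fulltext", "near_match" or "completion"
    query_text: str                       # query with any namespace prefix removed
    namespaces: Tuple[int, ...] = (0,)    # namespace ids to search
    limit: int = 20                       # metadata: page size
    offset: int = 0                       # metadata: pagination offset
    prefix_filter: Optional[str] = None   # contextual filter, e.g. from InputBox


def parse(raw: str, entry_point: str,
          known_namespaces: Dict[str, int]) -> ParsedSearchRequest:
    """Extract a leading 'Namespace:' prefix if it names a known namespace."""
    prefix, sep, rest = raw.partition(":")
    ns_id = known_namespaces.get(prefix.strip().lower()) if sep else None
    if ns_id is None:
        # No recognised prefix: keep the query untouched.
        return ParsedSearchRequest(entry_point, raw.strip())
    return ParsedSearchRequest(entry_point, rest.strip(), namespaces=(ns_id,))
```

Because the dataclass is frozen, any later step that needs a different limit or namespace set has to build a new object instead of mutating the shared one.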

Profile selection
Profile selection is the process responsible for deciding which profiles are best suited to a given query. This component is currently under discussion.

Elasticsearch query building
This is the process of building the Elasticsearch search request body.

Retrieval query
The retrieval query is meant to find all the documents that match the user query. This Elasticsearch query is split into two parts: a scoring part and a filtering part.

Scoring part
Elements of the query that affect scoring. Changing something here should not change the set of hits found by the retrieval query. This section of the query must only affect the initial ranking of the results. The scoring part of a query is controlled by a FullTextQueryBuilder, which is currently only supported by the full text SearchEngine entry point.

Filtering part
Elements of the query that do not affect ranking. Changing something here changes the set of hits found by the retrieval query but not how those hits are ranked. Filtering is also controlled by FullTextQueryBuilder but will change similarly to have a  as input.
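In the Elasticsearch query DSL, this split maps naturally onto a `bool` query: clauses under `must` contribute to the score, while clauses under `filter` only restrict the set of hits. The following Python sketch builds such a request fragment. It is an illustration, not the query CirrusSearch actually emits: the function name, the field names (`title`, `text`, `namespace`) and the boost are assumptions.

```python
from typing import Any, Dict, List


def build_retrieval_query(text: str, namespaces: List[int]) -> Dict[str, Any]:
    """Build a bool query with separate scoring and filtering parts."""
    return {
        "bool": {
            # Scoring part: determines the initial ranking of the hits.
            "must": [
                {"multi_match": {"query": text, "fields": ["title^3", "text"]}}
            ],
            # Filtering part: narrows the hit set without touching scores.
            "filter": [
                {"terms": {"namespace": namespaces}}
            ],
        }
    }
```

Keeping the two lists separate makes it possible to swap the scoring clauses without altering the filters.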

Rescore query
Fine-tuning of the ranking. Depending on the need, multiple rescore queries can be chained, and their scores can be combined. Some searches may prefer to combine the score from the scoring part of the retrieval query with some rescore components.
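Elasticsearch exposes this through the `rescore` section of the search request, which re-ranks only the top `window_size` hits of each shard and mixes the original and rescore scores according to `query_weight`, `rescore_query_weight` and `score_mode`. The Python sketch below attaches one rescore step to an existing request body. The helper name, the `incoming_links` field, the weights and the window size are assumptions chosen for illustration.

```python
from typing import Any, Dict


def add_rescore(body: Dict[str, Any], window_size: int = 1000) -> Dict[str, Any]:
    """Return a copy of the request body with one rescore step appended."""
    rescore_step = {
        "window_size": window_size,  # only the top hits per shard are rescored
        "query": {
            # Assumed popularity signal: boost pages with many incoming links.
            "rescore_query": {
                "function_score": {
                    "field_value_factor": {
                        "field": "incoming_links",
                        "modifier": "log2p",
                        "missing": 0,
                    }
                }
            },
            "query_weight": 1.0,          # weight of the retrieval query score
            "rescore_query_weight": 0.5,  # weight of the rescore query score
            "score_mode": "total",        # add the two weighted scores
        },
    }
    new_body = dict(body)
    new_body["rescore"] = list(body.get("rescore", [])) + [rescore_step]
    return new_body
```

Calling the helper several times chains several rescore steps, each one applied to the output of the previous one.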

Fetch phase configuration
This is the part of the search request that instructs Elasticsearch what data to extract for every hit we display to the user. This phase of the query building process is not yet fully designed and the current way of doing things is not optimal. A ResultsType is chosen early in the process and is responsible for selecting the fields to extract and the fields to highlight.

This is tricky because it is directly connected to the way search hits are displayed in Special:Search. Some extensions may want to extract and display specific data that they stored using a custom mapping and a custom ContentHandler. Some keywords may want to tell the user that they matched a particular part of the document. Some extensions may want to completely transform the displayed data and its appearance using hooks like Manual:Hooks/ShowSearchHit.

In general, ShowSearchHit is currently used in a dual capacity: as a hook for extensions to incrementally tweak search results (i.e. add some widget or formatting), and as a means to completely override the result display, as Wikibase does (both with and without CirrusSearch enabled). The challenge here is that some scenarios, like Wikibase without CirrusSearch, may call for a complete display override without actually involving a custom result type, so the only way to implement such customization now is the hook.

Currently:
 * Fields are extracted using ResultsType::getSourceFiltering and ResultsType::getStoredFields
 * Highlighting is set up by ResultsType::getHighlightingConfiguration
 * Special:Search's look can be changed using Manual:Hooks/ShowSearchHit or Manual:Hooks/ShowSearchHitTitle
 * Extensions can register extra data into SearchResult using Manual:Hooks/SearchResultsAugment
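On the Elasticsearch side, the first two points correspond to the `_source`, `stored_fields` and `highlight` sections of the search request. The Python sketch below shows what such a fetch configuration could look like. The function name, the field lists and the highlight tags are assumptions for illustration and do not reflect the values returned by any actual ResultsType.

```python
from typing import Any, Dict


def fetch_phase_config() -> Dict[str, Any]:
    """Assemble the parts of a search request that control the fetch phase."""
    return {
        # Source filtering: only these fields are returned from _source.
        "_source": ["namespace", "title", "timestamp"],
        # Stored fields: read directly from the index instead of _source.
        "stored_fields": ["text.word_count"],
        # Highlighting: which fields to highlight and how to mark matches.
        "highlight": {
            "pre_tags": ['<span class="searchmatch">'],
            "post_tags": ["</span>"],
            "fields": {
                "title": {"number_of_fragments": 0},  # highlight the whole title
                "text": {"fragment_size": 150, "number_of_fragments": 1},
            },
        },
    }
```

The three sections are independent in the request but not in practice: a field can only be displayed or highlighted by later steps if it was requested here.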

Drawbacks are:
 * None of these techniques can be strongly coupled but they are highly interdependent
 * ResultsType is not driven by profile and it's unclear when it should be constructed. Cirrus decides the ResultsType before anything else, but some FullTextQueryBuilder may override it
 * Manual:Hooks/ShowSearchHit, being a hook, gives no guarantee that it will be executed in the right order (so that its values are not overridden), nor that it has all the context required to know what to do.
 * Keywords are unable to cleanly add new highlighting hints

Elasticsearch responses transformation
The process of reading the Elasticsearch response and returning a:
 * SearchResultSet for the full text search engine entry point
 * SearchSuggestionSet for the completion search engine entry point
 * Title for the near match search engine entry point

This transformation is performed by the CirrusSearch ResultsType.
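To make the idea concrete, the Python sketch below reads the standard `hits.hits` array of an Elasticsearch response and produces either a list of results (full text) or a single title (near match). The `Hit` class, the function names and the `title` and `namespace` source fields are hypothetical, and the real SearchResultSet and Title objects carry far more information.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Hit:
    """Hypothetical, simplified stand-in for one search result."""
    title: str
    namespace: int
    score: float
    snippet: Optional[str]


def to_full_text_results(response: Dict[str, Any]) -> List[Hit]:
    """Convert every raw hit into a Hit, keeping Elasticsearch's order."""
    results = []
    for raw in response.get("hits", {}).get("hits", []):
        source = raw.get("_source", {})
        fragments = raw.get("highlight", {}).get("text", [])
        results.append(Hit(
            title=source["title"],
            namespace=source.get("namespace", 0),
            score=raw.get("_score") or 0.0,
            snippet=fragments[0] if fragments else None,
        ))
    return results


def to_near_match_title(response: Dict[str, Any]) -> Optional[str]:
    """Near match only needs the title of the best hit, if there is one."""
    results = to_full_text_results(response)
    return results[0].title if results else None
```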

Fallback methods evaluation
Fallback methods are only used (for now) in full text searches. Fallback is a process that spans the entirety of the query construction, up to the evaluation of the results. It is meant to repair a query that may not produce desirable results (where desirable means, e.g., at least 3 results to display).

Phrase suggester
Attaches an Elasticsearch suggest request to the main search query and displays the suggestion if a title is not highlighted. If the initial query did not produce any results, the entire result set may be rewritten using the suggestion as the query. The phrase suggester is meant to detect typos and fix them.
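A minimal Python sketch of this mechanism follows, using the standard Elasticsearch `suggest` section with a `phrase` suggester. The helper names, the suggestion name `did_you_mean`, the `suggest` field and the parameter values are assumptions, and only the "rewrite when there are no results" rule is modelled.

```python
from typing import Any, Dict, Optional


def attach_phrase_suggest(body: Dict[str, Any], text: str,
                          field: str = "suggest") -> Dict[str, Any]:
    """Return a copy of the request body with a phrase suggester attached."""
    new_body = dict(body)
    new_body["suggest"] = {
        "text": text,
        "did_you_mean": {
            "phrase": {
                "field": field,
                "size": 1,        # only the best suggestion is needed
                "max_errors": 2,  # at most two misspelled terms are corrected
                "highlight": {"pre_tag": "<em>", "post_tag": "</em>"},
            }
        },
    }
    return new_body


def pick_suggestion(response: Dict[str, Any]) -> Optional[str]:
    """Return the text of the first suggestion option, if any."""
    for entry in response.get("suggest", {}).get("did_you_mean", []):
        for option in entry.get("options", []):
            return option["text"]
    return None


def should_rewrite(response: Dict[str, Any]) -> bool:
    """Rewrite the query only when it found nothing and a suggestion exists."""
    no_hits = len(response.get("hits", {}).get("hits", [])) == 0
    return no_hits and pick_suggestion(response) is not None
```

Attaching the suggester to the main request avoids a second round trip: the suggestion is already available when the results are evaluated.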

TextCat language detection
This process runs language detection on the user query and runs a second search on the corresponding wiki in the detected language. The results are appended to the first ones.
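The control flow can be sketched as follows in Python. Everything here is hypothetical: the function name, the `detect_language` and `search_wiki` callbacks, the `home_language` parameter and the threshold of 3 results (borrowed from the example given for fallback methods) are assumptions, not the actual CirrusSearch interfaces.

```python
from typing import Callable, List, Optional


def with_language_fallback(
    query: str,
    results: List[str],
    detect_language: Callable[[str], Optional[str]],
    search_wiki: Callable[[str, str], List[str]],
    home_language: str = "en",
    min_results: int = 3,
) -> List[str]:
    """Append results from the wiki in the detected language when needed."""
    if len(results) >= min_results:
        # The first search is good enough: no fallback.
        return list(results)
    language = detect_language(query)
    if language is None or language == home_language:
        # Nothing detected, or the query is already in the wiki's language.
        return list(results)
    # Second search on the other wiki; its results go after the first ones.
    return list(results) + list(search_wiki(language, query))
```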

Cross-project searches
This process runs  for every sister project of the same language. The search request is attached to the main one using the  feature.