Extension:CirrusSearch/Scoring

This page gives some insight into the scoring functions and techniques used by CirrusSearch to rank search results.

Basics
Cirrus follows a very basic concept used by many search engines: a document score combines two types of sub-scores:
 * 1) A score that measures the similarity between the query and the document
 * 2) Scores that depend only on the document metadata (e.g. recency, number of incoming links, language...)

Query Architecture
The whole purpose of CirrusSearch is to parse the user query into an ElasticSearch query using the features available in the ElasticSearch Query DSL.

ElasticSearch Query
This is the query we send to the cluster in order to retrieve ranked results. The full query is rather large, even for a single-word query. Its parts can be grouped into several components that serve different purposes. Note that the small numeric label (1 or 2) on the diagram indicates whether a component produces a sub-score of one of the types mentioned in the Basics section.

Retrieval
The purpose of this step is to retrieve the documents in the index that match the user query. There are two different ways to retrieve documents:
 * full-text queries, which compute a score for each document.
 * filters, which do not compute any score.

The first part of the full-text query in Cirrus is currently a specific query on the title and redirects. To match, the query must contain all the words of the title or redirect, in the same order. Its impact on the score is very high. This part of the query is important to make sure that if the user types a query that perfectly matches a title or redirect, that page is likely to be among the top search results.

The second part of the full-text query can use the QueryString. It uses a default AND operator between words, meaning that all the words in the query must appear in the document.
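As a rough illustration, the default AND behaviour can be expressed with the `query_string` query of the ElasticSearch Query DSL. The sketch below builds such a clause as a Python dictionary. The helper name and the field names `title` and `text` are assumptions made for the example, not the fields Cirrus actually queries.

```python
import json


def build_query_string(user_query, fields=("title", "text")):
    """Build a query_string clause in which every word must match."""
    return {
        "query_string": {
            "query": user_query,
            "fields": list(fields),      # hypothetical field names
            "default_operator": "AND",   # all words must appear in the document
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query_string("albert einstein"), indent=2))
```

With `default_operator` set to `AND`, a document containing only "einstein" would not match the query "albert einstein".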

An alternative to QueryString is also available in Cirrus. It uses the Common Terms Query as a base component and is useful for long search queries such as questions. It works by separating words into two groups, frequent (common) words and non-common words, so that different matching criteria can be set for each group.
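A minimal sketch of what a Common Terms clause can look like in the ElasticSearch Query DSL is given below. The field name `text`, the cutoff value and the operators chosen for each group are illustrative assumptions, not the settings used by Cirrus.

```python
import json


def build_common_terms(user_query, field="text", cutoff_frequency=0.001):
    """Build a common terms clause that treats frequent and rare words differently."""
    return {
        "common": {
            field: {                                   # hypothetical field name
                "query": user_query,
                # Words more frequent than this cutoff go into the "common" group.
                "cutoff_frequency": cutoff_frequency,
                "low_freq_operator": "and",            # every rare word must match
                "high_freq_operator": "or",            # common words are optional
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_common_terms("what is the capital of france"), indent=2))
```

In a question like the one above, words such as "what", "is" and "the" would typically fall into the common group, so the rare words "capital" and "france" decide which documents are retrieved.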

In the end, a document must match either the NearMatch (the title/redirect query) or the QueryString/CommonTermsQuery part (if the query matches the NearMatch, it is very likely to match the second part as well). The computed score determines the order in which documents are handed to the rescoring step. This score is extremely important, as it is the main contributor to the type 1 score.
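The "either/or" combination described above can be sketched with a `bool` query whose two `should` clauses require at least one match. The `match_phrase` clause is only a stand-in for the NearMatch part, and the field names and boost value are assumptions made for the example.

```python
import json


def build_retrieval_query(user_query):
    """Combine a title match and a full-text match; one of them must match."""
    near_match = {
        # Stand-in for NearMatch: phrase match on a hypothetical title field,
        # boosted so that title matches weigh heavily in the score.
        "match_phrase": {"title": {"query": user_query, "boost": 10.0}}
    }
    full_text = {
        "query_string": {
            "query": user_query,
            "fields": ["text"],          # hypothetical field name
            "default_operator": "AND",
        }
    }
    return {
        "bool": {
            "should": [near_match, full_text],
            "minimum_should_match": 1,   # either clause is enough to match
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_retrieval_query("albert einstein"), indent=2))
```

Documents matching both clauses accumulate the scores of both, which is one way a title match can end up ranked near the top.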

Filters work in nearly the same way, with the exception that they do not contribute to the score. A very common filter in Cirrus is the namespace filter, but other filters can be activated by using a special syntax.
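The difference between scoring clauses and filters can be sketched with the `filter` section of a `bool` query, which restricts the result set without affecting the score. The helper name and the field name `namespace` are assumptions made for the example.

```python
import json


def add_namespace_filter(scoring_query, namespaces):
    """Wrap a scoring query so that only the given namespaces are returned."""
    return {
        "bool": {
            "must": [scoring_query],     # contributes to the score
            "filter": [                  # restricts results, no score contribution
                {"terms": {"namespace": list(namespaces)}}  # hypothetical field
            ],
        }
    }


if __name__ == "__main__":
    query = {"match": {"text": "einstein"}}
    print(json.dumps(add_namespace_filter(query, [0]), indent=2))
```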

It is essential for the retrieval step to be extremely efficient.

Rescoring
The rescoring step allows Cirrus to rescore the top-N documents returned by the retrieval phase. By working on a limited subset of documents, it becomes possible to perform "costly" operations that could not be done in the retrieval step.
 * Phrase rescore: when there are more than two words (generally a phrase) in the query, this rescore function tries to rank documents that contain the same phrase higher. This function is very costly, so it is applied only to the top-512 docs by default, while other methods are applied to the top-8196 docs.
 * Incoming links: applies a boost factor that depends on the number of incoming links to the page.
 * Recency: applies an exponential decay to the document timestamp; this is useful to rank recent pages higher.
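The idea of rescoring only a window of top results can be sketched in plain Python as follows. The formulas are assumptions chosen for illustration (a logarithmic incoming-links boost, a half-life based exponential decay, and multiplication to combine them); they are not the functions Cirrus uses, and all names are hypothetical.

```python
import math


def incoming_links_boost(n_links):
    """Boost that grows slowly with the number of incoming links (assumed form)."""
    return math.log2(2 + n_links)


def recency_decay(age_days, half_life_days=30.0):
    """Exponential decay: the factor halves every half_life_days (assumed form)."""
    return 0.5 ** (age_days / half_life_days)


def combined_score(hit):
    """Multiply the retrieval score by the metadata-based factors."""
    return (hit["score"]
            * incoming_links_boost(hit["incoming_links"])
            * recency_decay(hit["age_days"]))


def rescore_top_n(hits, window):
    """Reorder only the first `window` hits; the rest keep their retrieval order."""
    head, tail = hits[:window], hits[window:]
    return sorted(head, key=combined_score, reverse=True) + tail


if __name__ == "__main__":
    # Hits are assumed to arrive sorted by retrieval score, best first.
    hits = [
        {"id": "a", "score": 2.0, "incoming_links": 0, "age_days": 0},
        {"id": "b", "score": 1.9, "incoming_links": 1000, "age_days": 0},
        {"id": "c", "score": 0.2, "incoming_links": 10**6, "age_days": 0},
    ]
    print([h["id"] for h in rescore_top_n(hits, window=2)])
    print([h["id"] for h in rescore_top_n(hits, window=3)])
```

A document outside the window can never move up, however strong its metadata, which is why the order produced by the retrieval step matters so much.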