Extension:CirrusSearch/Schema

CirrusSearch uses Elasticsearch as the underlying search engine. The schema used by CirrusSearch is defined through Elasticsearch index settings and mappings. Both the settings and the mappings can be requested from any wiki running CirrusSearch to retrieve the current configuration. Attempts are made to keep the documentation here up to date, but the API responses are the source of truth.

Analysis Chains Used
CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying Elasticsearch. For example, the near_match analysis of the title field is typically exposed as title.near_match. There are no strict guarantees about the sub-property naming, but by convention the property shares the name of the analyzer.

The results of using an analysis chain can be checked with the elasticsearch analyze API. This can be queried on the cloudelastic servers or by importing the settings provided by the cirrus-settings-dump api call into a local elasticsearch instance.
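As a rough illustration, the following sketch builds a request for the Elasticsearch `_analyze` endpoint using only the Python standard library. The base URL and index name used in the usage note are placeholders, not real endpoints; substitute the host and index of the cluster or local instance you are querying.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request for the _analyze API."""
    url = f"{base_url.rstrip('/')}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, index, analyzer, text):
    """Send the request and return the emitted token strings."""
    req = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(req) as resp:
        return [tok["token"] for tok in json.load(resp)["tokens"]]
```

For example, `analyze("http://localhost:9200", "mywiki_content", "near_match", "Main Page")` would return the tokens produced by the near_match analyzer, assuming a local instance with an index of that (hypothetical) name.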

keyword
Strict matching of the property text against the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, so nothing after the first 5000 characters is considered when matching.

lowercase_keyword
Identical to keyword, but with ICU normalization and folding applied.

near_match
Identical to keyword, but with additional flattening of various space-like tokens to spaces. This is used to power the "Go" functionality of CirrusSearch.

near_match_asciifolding
Identical to lowercase_keyword, but with additional flattening of various space-like tokens to spaces.

plain
Applied to textual content to represent the words in a form very close to the original words. Minimal transformations are applied. This only represents words; various special characters (quotes, commas, etc.) are removed in the tokenization step.

prefix
Generates all possible prefixes of a keyword. ICU normalization is applied along with flattening of various space-like tokens to spaces. Any matching against a prefix must start from the very first character of the field.
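A minimal sketch of the prefix generation idea in Python. It deliberately leaves out the ICU normalization and space flattening steps, so it only illustrates which tokens would be emitted for an already-normalized keyword; the function name is made up for this example.

```python
def edge_prefixes(text):
    """Return every prefix of text, shortest first."""
    return [text[:n] for n in range(1, len(text) + 1)]


print(edge_prefixes("main"))
```

Because every emitted token starts at the first character, a query for "ain" would not match a field containing "main".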

prefix_asciifolding
Similar to prefix, but with ICU folding applied as well.

trigram
Generates trigrams, or three character sequences, of the textual content. This is primarily used to accelerate regex search. For example the string "example text" will yield the tokens: "exa", "xam", "amp", "mpl", "ple", "le ", "e t", " te", "tex", "ext"
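The trigram generation can be sketched in a few lines of Python. This is only an illustration of the sliding three-character window, not the Elasticsearch tokenizer itself; the function name is an assumption.

```python
def char_trigrams(text):
    """Return every three-character window of text, left to right."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(char_trigrams("example text"))
```

Strings shorter than three characters yield no tokens in this sketch.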

text
Standard analyzer for text content. This is similar to the plain analyzer but with more aggressive normalization applied to the content. These normalizations may include stop word filtering, stemming, and other language specific handling.

short_text
Similar to the text analyzer, but specialized for short text strings such as headings and titles.

source_text_plain
Analyzer primarily used against wikitext to provide word-level queries. Uses only ICU normalization along with some special rules to help separate words seen in wikitext.

suggest
Shingled analyzer used to power search suggestions (aka "did you mean"). Shingles are similar to trigrams, but operate on the word level instead of the character level. This analyzer is configured to emit 1-, 2- and 3-grams. For example the string "cats with hats" will emit the tokens: "cats", "cats with", "cats with hats", "with", "with hats", "hats"
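A small Python sketch of word-level shingling with sizes 1 through 3. It splits on whitespace only, which is a simplification of the real tokenizer, and the function name is hypothetical.

```python
def shingles(text, max_size=3):
    """Return word n-grams of size 1..max_size, grouped by start word."""
    words = text.split()
    out = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out


print(shingles("cats with hats"))
```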

token_count
Reports the number of tokens in a field, rather than the textual content.

Native Document Properties
These properties are calculated in the CirrusSearch extension and provided to elasticsearch when sending updates.

version
The revision id that was indexed

wiki
The dbname of the wiki this document belongs to

namespace
The integer namespace the document is in

namespace_text
The textual representation of the namespace the document is in. This is in the wiki's content language

title
The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.

timestamp
The timestamp the most recently indexed revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ

create_timestamp
The timestamp the first revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
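The YYYY-MM-DDTHH:MM:SSZ format used by the timestamp properties can be produced and parsed with the Python standard library. This sketch assumes the trailing Z denotes UTC; the helper names are made up for illustration.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_ts(dt):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    return dt.astimezone(timezone.utc).strftime(TS_FORMAT)


def parse_ts(value):
    """Parse a timestamp string into a UTC-aware datetime."""
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


print(format_ts(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
```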

category
A list of categories the page belongs to. The categories use the text format, where spaces in the title are preserved.

external_link
A list of external URLs this page links to.

outgoing_link
A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.
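The difference between the text format and the dbkey format comes down to spaces versus underscores. A minimal Python sketch, with hypothetical helper names and no handling of namespaces or other title normalization:

```python
def to_dbkey(title):
    """Convert a text-format title to dbkey format."""
    return title.replace(" ", "_")


def to_text(dbkey):
    """Convert a dbkey-format title back to text format."""
    return dbkey.replace("_", " ")


print(to_dbkey("Main Page"))
```

This matters when comparing values across fields: title, category and template use the text format, while outgoing_link uses the dbkey format.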

template
A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.

text
The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate HTML, removing non-text content, and stripping all HTML. Content removed from this field, such as tables, captions, and hatnotes, is moved to the auxiliary_text field.

source_text
The source wikitext of the page.

text_bytes
The size of the content as reported by the associated mediawiki Content implementation. For wikitext this is the number of bytes in the wikitext.

content_model
String representing the name of the content model for this page.

wikibase_item
String containing the wikidata Q-item this page is associated with.

coordinates
List of coordinates associated with this page.

Properties of each coordinate:
 * coord - elasticsearch geo_point. Represented as an object with two properties, lat and lon, each containing a floating point number. lat is in the range [-90, 90] and lon is in the range [-180, 180]
 * country - country code
 * dim - dimension. Integer radius, in meters, of the item being referenced
 * globe - The globe the coordinates are on. Typically "earth".
 * name - Name of the item referenced. Often null
 * primary - Boolean representing if this is the primary coordinate for the article. Only one coordinate can be primary.
 * region - Sub-region of country this coordinate is within. For example, if the country code is US, region will be a two-letter US state code.
 * type - ???. Same value as gt_type field of GeoData table in mysql
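To make the shape of a single coordinate entry concrete, here is a hedged Python sketch that builds one as a plain dictionary. The helper name and the sample values in the usage line are invented for illustration, and whether unset properties are stored as null or omitted entirely is an assumption. The range checks use the usual geographic limits for latitude and longitude.

```python
def make_coordinate(lat, lon, *, primary=False, globe="earth", name=None,
                    country=None, region=None, dim=None, type_=None):
    """Build a coordinate object with the properties listed above."""
    if not -90 <= lat <= 90:
        raise ValueError("lat must be within [-90, 90]")
    if not -180 <= lon <= 180:
        raise ValueError("lon must be within [-180, 180]")
    return {
        "coord": {"lat": float(lat), "lon": float(lon)},
        "country": country,
        "dim": dim,
        "globe": globe,
        "name": name,
        "primary": primary,
        "region": region,
        "type": type_,
    }


# Sample values are made up for this example.
print(make_coordinate(37.77, -122.42, primary=True, country="US", region="CA"))
```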

language
The language code this page is in

heading
List of headings on this page

opening_text
Text content of the page prior to the first heading. The content is also available in the text property.

auxiliary_text
List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in mediawiki core.

display_title
Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.

file_bits
Contains the integer bit depth of the media represented by this page

file_height
Contains the integer height of the media represented by this page

file_media_type
Contains the media type of the media represented by this page.

file_mime
Contains the mime type of the media represented by this page.

file_resolution
Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
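The resolution formula can be written directly in Python. `math.isqrt` returns the floor of the square root for non-negative integers, so it matches floor(square_root(file_width * file_height)) without floating point rounding; the function name is hypothetical.

```python
import math


def file_resolution(file_width, file_height):
    """Return floor(sqrt(file_width * file_height)) as an integer."""
    return math.isqrt(file_width * file_height)


print(file_resolution(1920, 1080))
print(file_resolution(800, 600))
```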

file_size
Contains the size of the media represented by this page in bytes.

file_text
Contains the text content of the media represented by this page for mime types that mediawiki knows how to extract the content of, such as PDF and DJVU.

file_width
Contains the width of the media represented by this page in pixels

incoming_links
Contains an integer representing the number of pages on the same wiki that link to this page.

redirect
List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: the integer namespace in the namespace property, and the title of the redirect.

local_sites_with_dupe
Only found on commonswiki in the File namespace. Contains list of wiki dbnames that have an uploaded file with the exact same name as this file.

External Document Properties
These properties are calculated external to CirrusSearch and populated within the production search clusters

popularity_score
A floating point number representing the percentage of page views on this wiki that request this page. This is only available for content pages.

ores_articletopic
Contains classification predictions about the page from various sources, including ORES models and link recommendations. While the name says articletopic, this will be renamed to something semantically appropriate in the future.

Predictions are provided in the source documents as an array of strings, each with a per-model prefix and a suffixed integer in [0,1000] representing the confidence. The analysis chain interprets this value as the term frequency. For legacy reasons, unprefixed predictions (those without a model prefix) belong to the ORES articletopic model. For example:

[ "STEM.Computing|780", "drafttopic/STEM.STEM*|988", "link_recommend/exists|1" ]
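A hedged Python sketch of how such prediction strings could be split into model, label and confidence. It assumes the model prefix is separated from the label by the first "/" and the confidence by the last "|", as in the example values; the function name and the "articletopic" default label are illustrative choices, not part of CirrusSearch.

```python
def parse_prediction(raw, default_model="articletopic"):
    """Split 'model/label|score' into (model, label, score)."""
    label_part, _, score = raw.rpartition("|")
    model, sep, label = label_part.partition("/")
    if not sep:
        # No model prefix: treat as a legacy articletopic prediction.
        model, label = default_model, label_part
    return model, label, int(score)


for raw in ["STEM.Computing|780", "drafttopic/STEM.STEM*|988", "link_recommend/exists|1"]:
    print(parse_prediction(raw))
```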

copy_to Document Properties
These properties are not provided directly by CirrusSearch, rather the elasticsearch mapping is instructed to create these fields by copying content from other fields.

all
Contains all text content copied to a single field. This consolidation into a single field is an optimization; semantically it shouldn't be important. The general idea is to use this field as a first-pass filter that removes most irrelevant results, leaving the individual field queries to affect only scoring.

all_near_match
Contains both titles and redirects in a single field for filtering with the near_match analyzer.

suggest
The suggest field is populated by the copy_to section of the title and redirect fields. The suggest field uses shingles (word ngrams), which provide phrase matching in a way that doesn't have to be restricted to the rescore window for performance reasons.

labels_all
Only generated on wikis containing wikibase repo. Contains a copy of all labels in all languages.