Extension:CirrusSearch/Schema

Analysis Chains Used
CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying Elasticsearch. For example, the near_match analysis of the title field is exposed as title.near_match.
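As a rough illustration of the sub-property naming, the sketch below builds the body of a standard Elasticsearch match query against title.near_match. The helper names subfield and match_query are hypothetical and not part of CirrusSearch; the queries CirrusSearch itself generates are considerably more involved.

```python
def subfield(field: str, analysis: str) -> str:
    """Name of the sub-property exposing one analysis chain of a field."""
    return f"{field}.{analysis}"


def match_query(field: str, analysis: str, text: str) -> dict:
    """Build a minimal Elasticsearch match query body (hypothetical helper)."""
    return {"query": {"match": {subfield(field, analysis): {"query": text}}}}


body = match_query("title", "near_match", "Albert Einstein")
print(body)
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request.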

keyword
Strict matching of the property text against the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, so nothing after the first 5000 characters is taken into consideration when matching.
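The matching behaviour described above can be modelled roughly as follows. This is only a sketch of the semantics, not the actual Elasticsearch analyzer: it assumes characters are counted as Python code points and that the queried text is compared as-is, neither of which is stated here.

```python
KEYWORD_LIMIT = 5000  # truncation length stated above


def keyword_value(property_text: str) -> str:
    """Value considered for matching: the first 5000 characters only."""
    return property_text[:KEYWORD_LIMIT]


def keyword_matches(property_text: str, queried_text: str) -> bool:
    """Whole-string comparison; the text is never split into words."""
    return keyword_value(property_text) == queried_text


print(keyword_matches("Main Page", "Main Page"))  # whole text matches
print(keyword_matches("Main Page", "Main"))       # partial text does not
```

Under these assumptions, a property longer than 5000 characters can only be matched by its first 5000 characters.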

Native Document Properties
These properties are calculated in the CirrusSearch extension and provided to Elasticsearch when sending updates.

version
The revision id that was indexed

wiki
The dbname of the wiki this document belongs to

namespace
The integer namespace the document is in

namespace_text
The textual representation of the namespace the document is in. This is in the wiki's content language

title
The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.

timestamp
The timestamp at which the most recently indexed revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.

create_timestamp
The timestamp at which the first revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
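Values in this format can be parsed and produced with the Python standard library. The sketch below assumes the trailing Z denotes UTC, as in ISO 8601; the function names are illustrative only.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # YYYY-MM-DDTHH:MM:SSZ


def parse_timestamp(value: str) -> datetime:
    """Parse a document timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def format_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime in the document timestamp format."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


parsed = parse_timestamp("2015-03-04T05:06:07Z")
print(parsed.year, format_timestamp(parsed))
```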

category
A list of categories the page belongs to. The categories use the text format, where spaces in the title are preserved.

external_link
A list of external URLs this page links to.

outgoing_link
A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.
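The difference between the text format used by title, category and template and the dbkey format used by outgoing_link comes down to how spaces are written. A minimal sketch of converting between the two is shown below; it covers only the space/underscore substitution described here, whereas full MediaWiki title normalization does more (for example namespace and capitalization handling). The function names are hypothetical.

```python
def text_to_dbkey(title: str) -> str:
    """Text format to dbkey format: spaces become underscores."""
    return title.replace(" ", "_")


def dbkey_to_text(dbkey: str) -> str:
    """Dbkey format to text format: underscores become spaces."""
    return dbkey.replace("_", " ")


print(text_to_dbkey("Albert Einstein"))
print(dbkey_to_text("Albert_Einstein"))
```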

template
A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.

text
The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate HTML, removing non-text content, and stripping all HTML. Content removed from this field, such as table captions and hatnotes, is copied to the auxiliary_text field.

source_text
The source wikitext of the page.

text_bytes
The size of the content as reported by the associated MediaWiki Content implementation. For wikitext this is the number of bytes in the wikitext.

content_model
String representing the name of the content model for this page.

wikibase_item
String containing the Wikidata Q-item this page is associated with.

coordinates
List of coordinates associated with this page. Each coordinate has the following structure:

coord
geo_point

language
The language code this page is in

heading
List of headings on this page

opening_text
Text content of the page prior to the first heading. The content is also available in the text property.

auxiliary_text
List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in MediaWiki core.

display_title
Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.

file_bits
Contains the integer bit depth of the media represented by this page

file_height
Contains the integer height of the media represented by this page

file_media_type
Contains the media type of the media represented by this page.

file_mime
Contains the mime type of the media represented by this page.

file_resolution
Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
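The formula above can be evaluated with integer arithmetic. The sketch below uses math.isqrt (Python 3.8+), which returns the floor of the square root for non-negative integers; the function name file_resolution simply mirrors the property name and is not taken from CirrusSearch.

```python
import math


def file_resolution(file_width: int, file_height: int) -> int:
    """floor(square_root(file_width * file_height)) without float rounding."""
    return math.isqrt(file_width * file_height)


print(file_resolution(1920, 1080))  # 1440, since 1440 * 1440 == 1920 * 1080
print(file_resolution(800, 600))    # 692, since 692**2 <= 480000 < 693**2
```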

file_size
Contains the size of the media represented by this page in bytes.

file_text
Contains the text content of the media represented by this page, for mime types that MediaWiki knows how to extract content from, such as PDF and DjVu.

file_width
Contains the width of the media represented by this page in pixels

incoming_links
Contains an integer representing the number of pages on the same wiki that link to this page.

redirect
List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: The integer namespace in the namespace property, and the title of the redirect in

local_sites_with_dupe
Only found on commonswiki in the File namespace. Contains list of wiki dbnames that have an uploaded file with the exact same name as this file.

External Document Properties
These properties are calculated outside of CirrusSearch and populated within the production search clusters.

popularity_score
A floating point number representing the percentage of page views to this wiki that request this page. This is only available for content pages.

copy_to Document Properties
These properties are not provided directly by CirrusSearch. Instead, the Elasticsearch mapping is instructed to create these fields by copying content from other fields.

all
Contains all text content copied to a single field. The primary use case of this field is filtering.
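To illustrate the filtering use case, the sketch below builds a standard Elasticsearch bool query whose filter clause requires every query term to appear in the all field. It shows only the general shape of such a filter using stock query DSL; the helper name all_filter_query is hypothetical and this is not necessarily how CirrusSearch constructs its own queries.

```python
def all_filter_query(text: str) -> dict:
    """Build a bool query that filters on the `all` field (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # Filter clauses restrict the result set without affecting scoring.
                "filter": [
                    {"match": {"all": {"query": text, "operator": "and"}}}
                ]
            }
        }
    }


body = all_filter_query("general relativity")
print(body)
```

Scoring clauses against more specific fields, such as title or text, could then be added alongside the filter.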

labels_all
Only generated on wikis that run a Wikibase repo. Contains a copy of all labels in all languages.