Requests for comment/expose structured data to the search engine

Background
It would be useful to give the ContentHandler (resp. the Content object) for a specific content model the ability to control what fields are exposed to the search engine index. For wikitext, this would be the main "text" field, but could also include "html" for rendered html, "links" for outgoing links, etc. For Wikibase entities, this would include fields like "label" (which would be a multi-lingual field), "alias", "description", "sitelink", "property-value", etc.

Currently, Content::getTextForSearchIndex exposes only flat text to the search index, on the assumption that word-based full-text indexing is applied.

Proposal
This RFC proposes to add the following:

A Content::getFieldsForSearchIndex method that returns an associative array mapping field names to index values. The type and structure of each index value must correspond to the type of the field as declared by getSearchIndexFieldDefinitions (see below). At least the "text" field should be returned; it would be populated by calling the existing getTextForSearchIndex method.

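A ContentHandler::getSearchIndexFieldDefinitions method that returns a list of SearchIndexFieldDefinition objects. The search engine should use this information when defining indexes. TextContentHandler would implement this to return a definition for the "text" field, declaring it to be plain text eligible for word-based full-text indexing.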

If two content handlers declare the same field with different types, an exception is thrown.


Each SearchIndexFieldDefinition provides:
 * the field name. Fields with the same name may be used by different content models, but they must have the same declaration.
 * the field type (see below).
 * a flag that is true if the field is a list of values of that type.
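The declaration rules above can be sketched as follows. This is only an illustration, not the proposed implementation: the accessor names (getName, getType, isMultiValue), the optional third constructor argument, the constant values, and the mergeFieldDefinitions helper are all assumptions made for this example. It shows how definitions from several content handlers might be merged, throwing an exception when the same field name is declared differently.

```php
<?php
// Assumed constant values; the RFC only names the constants.
const FIELD_TYPE_TEXT = 'text';
const FIELD_TYPE_IDENTIFIER = 'identifier';

// Hypothetical shape of a field definition (accessor names are assumptions).
class SearchIndexFieldDefinition {
	private $name;
	private $type;
	private $multiValue;

	public function __construct( $name, $type, $multiValue = false ) {
		$this->name = $name;
		$this->type = $type;
		$this->multiValue = $multiValue;
	}

	public function getName() {
		return $this->name;
	}

	public function getType() {
		return $this->type;
	}

	public function isMultiValue() {
		return $this->multiValue;
	}
}

/**
 * Hypothetical helper: merge the definition lists of several content
 * handlers into one map of field name => definition.
 *
 * @param array $definitionLists List of lists of SearchIndexFieldDefinition
 * @return array Field name => SearchIndexFieldDefinition
 * @throws InvalidArgumentException If a field name is declared twice with
 *   a different type or a different multi-value flag
 */
function mergeFieldDefinitions( array $definitionLists ) {
	$merged = array();
	foreach ( $definitionLists as $definitions ) {
		foreach ( $definitions as $def ) {
			$name = $def->getName();
			if ( !isset( $merged[$name] ) ) {
				// First declaration of this field name.
				$merged[$name] = $def;
				continue;
			}
			$known = $merged[$name];
			if ( $known->getType() !== $def->getType()
				|| $known->isMultiValue() !== $def->isMultiValue()
			) {
				throw new InvalidArgumentException(
					"Conflicting declarations for search field '$name'"
				);
			}
		}
	}
	return $merged;
}
```

Identical redeclarations are accepted silently, so several content models can share the "text" field as long as they all declare it the same way.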

Field types:
 * FIELD_TYPE_TEXT: String. Allow (word based) full text search if possible.
 * FIELD_TYPE_MULTILINGUAL: Associative array of language code mapping to FIELD_TYPE_TEXT values. Allow (word based) full text search if possible.
 * FIELD_TYPE_IDENTIFIER: String. Allow prefix matches if possible.
 * FIELD_TYPE_QUANTITY: Signed float. Allow range queries if possible.
 * FIELD_TYPE_GEOPOINT: A pair of longitude and latitude, represented as floats. Allow special queries if feasible.
 * FIELD_TYPE_DATETIME: A timestamp (in a format wfTimestamp understands). Allow range queries if possible.

Search engines may ignore fields that have unsupported types, or may treat values of such types as plain strings or text.
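One way a search backend might apply this rule is sketched below. The resolveBackendTypes helper, its parameters, and the constant values are hypothetical and not part of the proposal; the sketch only illustrates the two options named above: dropping a field with an unsupported type, or indexing it as plain text.

```php
<?php
// Assumed constant values; the RFC only names the constants.
const FIELD_TYPE_TEXT = 'text';
const FIELD_TYPE_MULTILINGUAL = 'multilingual';
const FIELD_TYPE_IDENTIFIER = 'identifier';
const FIELD_TYPE_QUANTITY = 'quantity';
const FIELD_TYPE_GEOPOINT = 'geopoint';
const FIELD_TYPE_DATETIME = 'datetime';

/**
 * Hypothetical helper: decide how a backend indexes each declared field.
 *
 * @param array $declared Field name => declared field type
 * @param array $supported Field types the backend handles natively
 * @param bool $fallbackToText Index unsupported types as plain text
 *   instead of ignoring them
 * @return array Field name => type actually used; ignored fields are absent
 */
function resolveBackendTypes( array $declared, array $supported, $fallbackToText = true ) {
	$resolved = array();
	foreach ( $declared as $name => $type ) {
		if ( in_array( $type, $supported, true ) ) {
			// The backend supports this type directly.
			$resolved[$name] = $type;
		} elseif ( $fallbackToText && in_array( FIELD_TYPE_TEXT, $supported, true ) ) {
			// Unsupported type: treat the value as plain text.
			$resolved[$name] = FIELD_TYPE_TEXT;
		}
		// Otherwise the field is ignored by this backend.
	}
	return $resolved;
}
```

A backend that only supports full-text search would, under these assumptions, either index a FIELD_TYPE_GEOPOINT field as text or skip it, depending on the third argument.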

So, for text, TextContentHandler::getSearchIndexFieldDefinitions would return

array( new SearchIndexFieldDefinition( 'text', FIELD_TYPE_TEXT ) )

And TextContent::getFieldsForSearchIndex would return

array( 'text' => $this->getTextForSearchIndex() )

(Alternatively,  would call  )
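For a content model with more than one field, such as the Wikibase entities mentioned in the background, the requirement that index values match their declared type could be checked roughly as below. The declaration format (field name mapped to a type and a multi-value flag), the function names, the constant values, and the sample data are assumptions made for this sketch; only the text and multilingual types are covered.

```php
<?php
// Assumed constant values; the RFC only names the constants.
const FIELD_TYPE_TEXT = 'text';
const FIELD_TYPE_MULTILINGUAL = 'multilingual';

// Hypothetical declarations for an entity-like content model:
// field name => array( field type, is multi-valued ).
function getEntityFieldDeclarations() {
	return array(
		'label' => array( FIELD_TYPE_MULTILINGUAL, false ),
		'alias' => array( FIELD_TYPE_TEXT, true ),
	);
}

/**
 * Hypothetical check that an index value has the structure its
 * declaration calls for.
 */
function valueMatchesDeclaration( $value, $type, $multiValue ) {
	if ( $multiValue ) {
		// A multi-valued field is a list whose items each match the type.
		if ( !is_array( $value ) ) {
			return false;
		}
		foreach ( $value as $item ) {
			if ( !valueMatchesDeclaration( $item, $type, false ) ) {
				return false;
			}
		}
		return true;
	}
	switch ( $type ) {
		case FIELD_TYPE_TEXT:
			return is_string( $value );
		case FIELD_TYPE_MULTILINGUAL:
			// Associative array of language code => text.
			if ( !is_array( $value ) ) {
				return false;
			}
			foreach ( $value as $lang => $text ) {
				if ( !is_string( $lang ) || !is_string( $text ) ) {
					return false;
				}
			}
			return true;
		default:
			// Types not covered by this sketch are rejected.
			return false;
	}
}

// Sample values as an entity's field map might look (made-up data).
$fields = array(
	'label' => array( 'en' => 'Berlin', 'de' => 'Berlin' ),
	'alias' => array( 'Berlin, Germany' ),
);
```

Each entry of $fields can then be checked against the matching entry of getEntityFieldDeclarations() before it is handed to the search engine.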

Pending Issues
In the next step, SearchEngine should be modified to make use of the new information. In particular,  should be deprecated, and replaced by a   method.

To make full use of having multiple fields indexed for search, these fields should be accessible in the SearchResult. This ties in with Brion's proposal for T78011.