Requests for comment/expose structured data to the search engine

From mediawiki.org
Request for comment (RFC)
Component General
Creation date
Author(s) Daniel Kinzler (WMDE)
Document status implemented
See Phabricator.


Background

It would be useful to give the ContentHandler (or the Content object) for a specific content model the ability to control which fields are exposed to the search engine index. For wikitext, this would be the main "text" field, but could also include "html" for rendered HTML, "links" for outgoing links, etc. For Wikibase entities, this would include fields like "label" (which would be a multilingual field), "alias", "description", "sitelink", "property-value", etc.

Currently, Content::getTextForSearchIndex() exposes flat text to the search index, on the assumption that word-based full text indexing is applied.

Proposal

This RFC proposes to add the following:

Content::getFieldsForSearchIndex(): This would return an associative array mapping field names to index values. The type and structure of each index value must correspond to the type of the field as declared by getSearchIndexFieldDefinitions() (see below). At least the "text" field should be returned; it would be populated by calling the old getTextForSearchIndex() method.

ContentHandler::getSearchIndexFieldDefinitions(): This returns a list of SearchIndexFieldDefinition objects, representing the fields that Content::getFieldsForSearchIndex() may return for the handler's content model. This information should be used by the search engine when defining indexes. TextContent would implement this to return a definition for the "text" field, defining it to be plain text eligible for word-based full text indexing.

ContentHandler::getAllSearchIndexFieldDefinitions(): Static method that calls getSearchIndexFieldDefinitions() on all registered content handlers and combines the results. If two content handlers declare the same field with different types, an exception is thrown.
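The combining step could look roughly like the following sketch. It is not the actual MediaWiki implementation: FieldDef is a hypothetical stand-in for SearchIndexFieldDefinition, combineFieldDefinitions() is an illustrative free function rather than the proposed static method, and the choice of InvalidArgumentException is an assumption, since the RFC only says that an exception is thrown.

```php
<?php
// Hypothetical stand-in for SearchIndexFieldDefinition (name and type only).
class FieldDef {
	public $name;
	public $type;

	public function __construct( string $name, string $type ) {
		$this->name = $name;
		$this->type = $type;
	}
}

/**
 * Combine per-handler definition lists into one map keyed by field name.
 *
 * @param FieldDef[][] $definitionsPerHandler One list of definitions per content handler
 * @return FieldDef[] Map of field name to definition
 * @throws InvalidArgumentException If the same field is declared with different types
 */
function combineFieldDefinitions( array $definitionsPerHandler ): array {
	$combined = [];
	foreach ( $definitionsPerHandler as $definitions ) {
		foreach ( $definitions as $def ) {
			$known = $combined[$def->name] ?? null;
			// Only the type is compared here; a fuller check might also
			// compare the multi-value flag.
			if ( $known !== null && $known->type !== $def->type ) {
				throw new InvalidArgumentException(
					"Conflicting types for search index field '{$def->name}'"
				);
			}
			$combined[$def->name] = $def;
		}
	}
	return $combined;
}
```

Keying the result by field name keeps the conflict check to a single lookup per definition, and identical declarations from several handlers collapse into one entry.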

class SearchIndexFieldDefinition: This is a value object with the following methods:

  • getName(): returns the field name. Fields with the same name may be used by different content models, but they must have the same declaration.
  • getType(): returns the field type (see below)
  • isMultiValue(): returns true if the field is a list of values of the said type.
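As a rough sketch, such a value object could be written as follows. The constructor signature, in particular the optional third $multiValue parameter and its default of false, is an assumption made for illustration; the RFC only names the three getters. The type is passed as a plain string here because the RFC does not say how the field type constants are represented.

```php
<?php
/**
 * Sketch of the proposed value object. The constructor parameters are an
 * assumption; the RFC only specifies the three getter methods.
 */
class SearchIndexFieldDefinition {
	private $name;
	private $type;
	private $multiValue;

	public function __construct( string $name, string $type, bool $multiValue = false ) {
		$this->name = $name;
		$this->type = $type;
		$this->multiValue = $multiValue;
	}

	/** @return string The field name */
	public function getName(): string {
		return $this->name;
	}

	/** @return string The field type */
	public function getType(): string {
		return $this->type;
	}

	/** @return bool True if the field holds a list of values of the given type */
	public function isMultiValue(): bool {
		return $this->multiValue;
	}
}

// Example: an "alias" field holding several text values.
$alias = new SearchIndexFieldDefinition( 'alias', 'text', true );
```

The object has no setters, so a definition cannot change after construction, which is what makes it safe to share between content models.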

Field types:

  • FIELD_TYPE_TEXT: String. Allow (word-based) full text search if possible.
  • FIELD_TYPE_MULTILINGUAL: Associative array mapping language codes to FIELD_TYPE_TEXT values. Allow (word-based) full text search if possible.
  • FIELD_TYPE_IDENTIFIER: String. Allow prefix matches if possible.
  • FIELD_TYPE_QUANTITY: Signed float. Allow range queries if possible.
  • FIELD_TYPE_GEOPOINT: A pair of longitude and latitude, represented as floats. Allow special queries if feasible.
  • FIELD_TYPE_DATETIME: A timestamp (in a format wfTimestamp understands). Allow range queries if possible.

Search engines may ignore fields that have unsupported types, or may treat values of such types as plain strings or text.
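A backend's handling of unsupported types could be sketched as below. The string values given to the FIELD_TYPE_* constants, the function name resolveIndexType(), and the $fallBackToText switch are all assumptions for illustration; the RFC defines neither the constant values nor any such function.

```php
<?php
// Hypothetical string values for the field type constants; the RFC does not define them.
const FIELD_TYPE_TEXT = 'text';
const FIELD_TYPE_MULTILINGUAL = 'multilingual';
const FIELD_TYPE_IDENTIFIER = 'identifier';
const FIELD_TYPE_QUANTITY = 'quantity';
const FIELD_TYPE_GEOPOINT = 'geopoint';
const FIELD_TYPE_DATETIME = 'datetime';

/**
 * Decide how a backend indexes a field of the given type.
 *
 * @param string $type One of the FIELD_TYPE_* constants
 * @param string[] $supportedTypes Types the backend handles natively
 * @param bool $fallBackToText Index unsupported types as plain text instead of ignoring them
 * @return string|null Type to index the field as, or null to ignore the field
 */
function resolveIndexType( string $type, array $supportedTypes, bool $fallBackToText ): ?string {
	if ( in_array( $type, $supportedTypes, true ) ) {
		return $type;
	}
	return $fallBackToText ? FIELD_TYPE_TEXT : null;
}

// A simple backend that only knows text and identifier fields.
$supported = [ FIELD_TYPE_TEXT, FIELD_TYPE_IDENTIFIER ];
$asText = resolveIndexType( FIELD_TYPE_GEOPOINT, $supported, true );
$ignored = resolveIndexType( FIELD_TYPE_GEOPOINT, $supported, false );
```

Here $asText is the text type and $ignored is null, matching the two behaviours the RFC allows for unsupported types.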

So, for text, TextContentHandler::getSearchIndexFieldDefinitions() would return

 array(
   new SearchIndexFieldDefinition( 'text', FIELD_TYPE_TEXT )
 )

And TextContent::getFieldsForSearchIndex() would return

 array(
   'text' => $this->getTextForSearchIndex()
 )

(Alternatively, getTextForSearchIndex() would call getFieldsForSearchIndex())
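The requirement that every value returned by getFieldsForSearchIndex() matches its declared type can be illustrated with a small check. This is a sketch only: the function names, the plain-string type names, and the sample entity data are hypothetical, and the geopoint and datetime types are deliberately left unchecked.

```php
<?php
/**
 * Check a single (non-list) value against a field type.
 * Type names are hypothetical plain strings standing in for FIELD_TYPE_* constants.
 */
function isValidSingleValue( string $type, $value ): bool {
	switch ( $type ) {
		case 'text':
		case 'identifier':
			return is_string( $value );
		case 'multilingual':
			// Expect an associative array of language code => text.
			if ( !is_array( $value ) ) {
				return false;
			}
			foreach ( $value as $lang => $text ) {
				if ( !is_string( $lang ) || !is_string( $text ) ) {
					return false;
				}
			}
			return true;
		case 'quantity':
			return is_float( $value ) || is_int( $value );
		default:
			// geopoint and datetime are not checked in this sketch.
			return true;
	}
}

/** Check a value against a field type, honouring the multi-value flag. */
function isValidFieldValue( string $type, bool $multiValue, $value ): bool {
	if ( !$multiValue ) {
		return isValidSingleValue( $type, $value );
	}
	if ( !is_array( $value ) ) {
		return false;
	}
	foreach ( $value as $item ) {
		if ( !isValidSingleValue( $type, $item ) ) {
			return false;
		}
	}
	return true;
}

// Made-up field values for a Wikibase-like entity.
$fields = [
	'text' => 'Berlin capital of Germany',
	'label' => [ 'en' => 'Berlin', 'de' => 'Berlin' ],
	'alias' => [ 'Berlin, Germany' ],
];
$labelOk = isValidFieldValue( 'multilingual', false, $fields['label'] );
$aliasOk = isValidFieldValue( 'text', true, $fields['alias'] );
```

A check like this could run when content is handed to the indexer, so that a mismatch between a handler's declarations and its returned values is caught early.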

Pending Issues

In the next step, SearchEngine should be modified to make use of the new information. In particular, SearchEngine::getTextFromContent should be deprecated, and replaced by a getFieldsFromContent method.

To make full use of having multiple fields indexed for search, these fields should be accessible in the SearchResult. This ties in with Brion's proposal for SearchResult::getMetadata() (T78011).