Help:CirrusSearch/Logical operators

CirrusSearch currently does not support classic boolean searching, and the logical operators   and   should be used with great care, if at all.

Negation and parentheses
CirrusSearch does support several ways of indicating negation. The following queries are all equivalent:   (minus sign),   (exclamation point), and   (  operator).

CirrusSearch does not support parentheses, and they are removed from the query.

Lucene,  , and <tvar|2> </>
CirrusSearch is built on top of Elasticsearch, which in turn is built on Lucene. Our Lucene implementation does not support the classic boolean <tvar|1> </> or <tvar|2> </> operators, though it does offer those keywords as binary operators.

Instead, Lucene converts <tvar|1> </> and <tvar|2> </> to a different formalism, unary <tvar|3> </> and <tvar|4> </> operators, giving results that sometimes mimic the expected boolean results but can also diverge sharply from them. (Note that CirrusSearch does not currently support <tvar|1> </> or <tvar|2> </> operators in user queries. They are used here only to demonstrate the internal workings of Lucene.)

In Lucene,  indicates that a search term is required and must be present in any results. So, a query like <tvar|1> </> would only return results that contain some form of <tvar|2>dog</> in them (note that this would also be equivalent to just searching for <tvar|3> </>).

On the other hand, <tvar|1> </> terms are optional but should be present if possible; while they are not strictly required, they do affect ranking. So <tvar|1> </> would require <tvar|2>dog</> in every result, but would generally rank those that also contain <tvar|3>cat</> as better matches.

The one exception to <tvar|1> </> terms being optional is that if there are zero <tvar|2> </> terms, then at least one <tvar|3> </> term would be present in each result. Thus, <tvar|1> </> would actually give results that have at least one of <tvar|2>dog</>, <tvar|3>cat</>, or <tvar|4>fish</> present—though any results with all three would generally rank higher.
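The required/optional rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tiny in-memory `docs` collection, and the "count the matched terms" score are all invented for this sketch and are not how Lucene actually ranks results.

```python
def matches(doc_terms, required, optional):
    """Return True if a document satisfies the required/optional rule."""
    doc = set(doc_terms)
    if required:
        # Every required term must be present; optional terms never exclude.
        return all(t in doc for t in required)
    # No required terms at all: at least one optional term has to be present.
    return any(t in doc for t in optional)


def toy_score(doc_terms, required, optional):
    """Count matched terms (a hypothetical stand-in for real relevance scoring)."""
    doc = set(doc_terms)
    return sum(t in doc for t in list(required) + list(optional))


# Hypothetical miniature collection, used only for this illustration.
docs = {
    "d1": ["dog"],
    "d2": ["dog", "cat"],
    "d3": ["cat", "fish"],
    "d4": ["bird"],
}


def search(required, optional):
    """Return matching document names, best toy score first."""
    hits = [n for n, terms in docs.items() if matches(terms, required, optional)]
    return sorted(hits, key=lambda n: (-toy_score(docs[n], required, optional), n))


# "dog" required, "cat" optional: both dog documents match, d2 ranks first.
print(search(["dog"], ["cat"]))
# No required terms: any document with at least one optional term matches.
print(search([], ["dog", "cat", "fish"]))
```

In this toy setup the optional term changes only the order of the results when a required term exists, and it decides membership in the result set only when there are no required terms.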

Classic boolean search often has an implicit <tvar|1> </>, meaning that any query terms without an explicit boolean operator between them are assumed to have an <tvar|1> </> between them. In Lucene, any query term without an explicit <tvar|1> </> or <tvar|2> </> is assumed to have an implicit <tvar|1> </> applied to it.

Converting <tvar|1> </> and <tvar|2> </>
Lucene converts <tvar|1> </> and <tvar|2> </> to <tvar|3> </> and <tvar|4> </> in a way that sometimes gives the expected results, but often leads to very unexpected results.

When Lucene encounters <tvar|1> </>, it applies <tvar|2> </> to the terms before and after the <tvar|3> </>. When it encounters <tvar|1> </>, it applies <tvar|2> </> to the terms before and after the <tvar|3> </>. The query is processed left to right, and later <tvar|1> </> or <tvar|2> </> operators override earlier ones (see examples below).

This effectively gives an unusual "backward order precedence" to the operators, and the results can be quite unexpected compared to classic boolean searching.
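The left-to-right conversion described above can be sketched as follows. This is a simplified model of the behavior described in this section, not Lucene's parser code. The keywords `AND` and `OR` and the flag labels `MUST` and `SHOULD` are assumptions made for this sketch (the labels borrow the names of Lucene's `BooleanClause.Occur` values), and the whitespace tokenizer is deliberately naive.

```python
def convert(query):
    """Flatten a query with binary AND/OR into (term, flag) pairs.

    Each operator flags the term before it and the term after it.
    Because the query is read left to right, a later operator
    overrides the flag an earlier one gave to a shared term.
    """
    clauses = []    # list of [term, flag]
    pending = None  # operator seen since the previous term, if any
    for tok in query.split():
        if tok in ("AND", "OR"):
            pending = tok
            continue
        # Terms with no explicit operator get the implicit "required" flag.
        flag = "SHOULD" if pending == "OR" else "MUST"
        if pending is not None and clauses:
            clauses[-1][1] = flag  # override the term before the operator
        clauses.append([tok, flag])
        pending = None
    return [tuple(c) for c in clauses]


print(convert("a OR b AND c"))  # [('a', 'SHOULD'), ('b', 'MUST'), ('c', 'MUST')]
print(convert("a OR b c"))      # [('a', 'SHOULD'), ('b', 'SHOULD'), ('c', 'MUST')]
print(convert("a AND b OR c"))  # [('a', 'MUST'), ('b', 'SHOULD'), ('c', 'SHOULD')]
```

Note how the middle term `b` ends up with whatever flag the last operator next to it assigned, which is the "backward order precedence" effect.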

Examples that go wrong
Below are some worked examples where the conversion from <tvar|1> / </> to <tvar|2> / </> gives results that diverge from what classic boolean operators would lead you to expect.


 * convert <tvar|1> </> to <tvar|2> </> before and after, giving:
 * convert <tvar|1> </> to <tvar|2> </> before and after (in this case overriding the previously applied <tvar|3> </>), giving:
 * The result set is thus the same as <tvar|1> </>, with <tvar|2> </> being optional (and only affecting ranking).


 * convert <tvar|1> </> to <tvar|2> </> before and after, giving:
 * apply an implicit <tvar|1> </> to any term without an explicit <tvar|2> </> or <tvar|3> </>, giving:
 * In a classic boolean system with implicit <tvar|1> </>, we would expect <tvar|2> </> and <tvar|3> </> to be the same, but compare this to the example above to see the difference: only <tvar|4> </> is required here, while <tvar|5> </> and <tvar|6> </> are both required above.


 * convert <tvar|1> </> to <tvar|2> </> before and after, giving:
 * convert <tvar|1> </> to <tvar|2> </> before and after, giving:
 * The result set is thus the same as simply searching for <tvar|1> </>, with <tvar|2> </> and <tvar|3> </> only affecting ranking. This also means that if there are zero documents with either <tvar|1> </> or <tvar|2> </> in them, you will get the same results searching for <tvar|3> </> as you would for just searching for <tvar|4> </>, which is not what you would expect from a classic boolean system.

In general, mixing <tvar|1> </> with <tvar|2> </> (including implicit <tvar|2> </>) in one query gives results that are unintuitive in a classic boolean framework. It can also be very difficult to detect cases where the boolean logic goes awry unless you already know exactly how many documents contain each possible positive and negative combination of your query terms.
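One way to see where the two readings part ways is to enumerate every combination of present and absent terms and compare the outcomes. The sketch below does that for small queries. It rests on several assumptions: the keywords `AND` and `OR` and the labels `MUST` and `SHOULD` are chosen for this illustration, the "classic" reading assumes that `AND` (explicit or implicit) binds more tightly than `OR`, and parentheses are not handled.

```python
from itertools import product


def flat_flags(tokens):
    """Left-to-right conversion: each operator flags both of its neighbours."""
    out, pending = [], None
    for tok in tokens:
        if tok in ("AND", "OR"):
            pending = tok
            continue
        flag = "SHOULD" if pending == "OR" else "MUST"
        if pending and out:
            out[-1] = (out[-1][0], flag)  # later operator overrides earlier
        out.append((tok, flag))
        pending = None
    return out


def flat_match(flags, present):
    """Required terms decide the match; optional ones only if none are required."""
    must = [t for t, f in flags if f == "MUST"]
    should = [t for t, f in flags if f == "SHOULD"]
    if must:
        return all(t in present for t in must)
    return any(t in present for t in should)


def classic_match(tokens, present):
    """Assumed classic reading: split on OR, then AND the terms in each group."""
    groups, current = [], []
    for tok in tokens:
        if tok == "OR":
            groups.append(current)
            current = []
        elif tok != "AND":
            current.append(tok)
    groups.append(current)
    return any(bool(g) and all(t in present for t in g) for g in groups)


def disagreements(query):
    """List the term combinations on which the two readings differ."""
    tokens = query.split()
    terms = [t for t in tokens if t not in ("AND", "OR")]
    flags = flat_flags(tokens)
    diffs = []
    for bits in product([False, True], repeat=len(terms)):
        present = {t for t, b in zip(terms, bits) if b}
        if flat_match(flags, present) != classic_match(tokens, present):
            diffs.append(sorted(present))
    return diffs


print(disagreements("a OR b AND c"))  # [['a'], ['a', 'c'], ['a', 'b']]
print(disagreements("a b c"))         # []
```

A document containing only `a` is the kind of case that silently goes missing: the assumed classic reading accepts it, while the flattened query requires both `b` and `c`.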

Common use cases
If you have no explicit operators, then the boolean default is <tvar|1> </> and the Lucene default is <tvar|2> </>, which are equivalent if they are the only operators present in the query:


 * — user intent: all three terms must be present in any results
 * — explicit classic boolean query: all three terms must be present in any results
 * — Lucene interpretation: all three terms must be present in any results

However, since <tvar|1> </> is implicit, nothing is gained by making it explicit by using <tvar|2> </>, other than the potential for later boolean confusion.

If the only operator in the query is <tvar|1> </> (crucially meaning that there is no implicit <tvar|2> </>), then it is the same as everything having a <tvar|3> </>. Recall that if a query has <tvar|3> </> terms but no <tvar|4> </> terms, then at least one of the <tvar|3> </> terms will be present in any result:


 * — classic boolean query: at least one of the three terms must be present in any results
 * — Lucene interpretation: at least one of the three terms must be present in any results

Be very careful with implicit <tvar|1> / </>! In the example above, <tvar|1> </>, the implicit <tvar|2> </> applied to <tvar|3> </> means that neither <tvar|4> </> nor <tvar|5> </> is strictly required to be in the results.

Booleans, keywords, and prefixes
<tvar|1> </> and <tvar|2> </> do not interact predictably with special keywords (like <tvar|3> </> or <tvar|4> </>) or with namespaces (like <tvar|5> </> or <tvar|6> </>) and probably should not be used in conjunction with either.

Future plans
Of course, the Search Platform team is not very happy with this state of affairs.

In the short term we are creating this document and updating the <tvar|1></> documentation to reflect the reality of our current system.

Longer term, we plan to implement a new layer in CirrusSearch that will properly construct a Lucene <tvar|1> / </> query equivalent to a given classic boolean query, including proper support for parentheses, and return the expected results. (It is possible to specify in Lucene that at least one of a set of query terms or clauses is required to match, which is equivalent to a boolean <tvar|1> </>; requiring that all of a set of query terms or clauses match is the same as a boolean <tvar|2> </>.)
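The idea in the parenthetical, nested groups in which either at least one member or every member must match, can be sketched as a small tree evaluator. This is not the planned CirrusSearch layer, whose design is not described here. The `("any", [...])` and `("all", [...])` tuple encoding is a hypothetical representation invented for this sketch.

```python
def evaluate(node, present):
    """Evaluate a nested group tree against the set of terms in a document.

    A node is either a bare term (a string) or a (kind, children) pair,
    where kind is "any" (at least one child must match) or "all"
    (every child must match). Nesting plays the role of parentheses.
    """
    if isinstance(node, str):
        return node in present
    kind, children = node
    results = [evaluate(child, present) for child in children]
    if kind == "all":
        return all(results)
    if kind == "any":
        return any(results)
    raise ValueError(f"unknown group kind: {kind!r}")


# Hypothetical tree for the classic expression: a OR (b AND c)
tree = ("any", ["a", ("all", ["b", "c"])])

print(evaluate(tree, {"a"}))       # True
print(evaluate(tree, {"b"}))       # False
print(evaluate(tree, {"b", "c"}))  # True
```

Because each group is evaluated on its own before its parent sees the result, a term's role cannot be overridden by an operator elsewhere in the query, which is the property the flat left-to-right conversion lacks.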

Beyond that, we may also make explicit the <tvar|1> </> and <tvar|2> </> operators, possibly using the unary syntax shown in this document, but also possibly using some other syntax, as yet to be determined.