Wikimedia Search Platform/Search/Glossary

This is a place for us to provide definitions, context, and links for terms we use that other people may not be familiar with. If you come across a term on the mailing list, IRC, or elsewhere that should be defined here, add a note on the Talk page!

Discernatron
Discernatron is a tool that allows participants to judge the relevance of search results to help the Search team be able to test changes before making them available on-wiki. In particular, we are using the Discernatron to gather data so that we can calculate DCG scores for CirrusSearch results (current and for proposed changes).

Discounted Cumulative Gain (DCG, nDCG, IDCG, DCG@5, etc.)
Discounted cumulative gain (DCG) is a measure of ranking quality. It requires human judgements to be made on rankings (hence the need for the Discernatron to gather that data).

F-Score (F0.5, F1, F2)
The F-Score is the harmonic mean (used to average rates) or recall and precision, and is used to provide a single number for comparison purposes. Trey doesn't like it, but admits it makes ranking recall and precision numbers easier. Fβ is a weighted harmonic mean that is used to favor recall (β > 1) or precision (β < 1). The weighting isn't linear, so F0.5 and F2 are complementary.

PaulScore
"PaulScore" is the name we've given to a metric proposed by Paul Nelson in a talk he gave at Elasticon (and has given in many other places). "NelsonScore" was also tossed around as an alternative and may occur in very early discussions. We use PaulScore to evaluate the quality of results provided by CirrusSearch or proposed modifications to CirrusSearch, based on historical click data. A big advantage of the PaulScore is that it relies on user click history to award points, so it is easy to compute. A disadvantage is that radical improvements may provide results in the lab (vs in production) that are better, but which earn no points because they were never seen by an actual user. As such, the PaulScore is great for (and recommended by Nelson) for tracking live performance, though it is still useful for evaluating changes in the lab.

A PaulScore can be calculated for the result set returned for a query. PaulScores can then be aggregated to a user session, and then across user sessions. Since PaulScore is not a common term of art (yet, Paul's working on it) and there's little external documentation beyond videos of his presentations, extra detail is provided here. We are putting together a dashboard that will automatically approximate the daily PaulScore for full-text and auto-completion queries on CirrusSearch, using a range of factors (0.1-0.9 to start).
 * Users queries are grouped into sessions (in our case, queries that occur within ten minutes of the previous query are in the same session).
 * All results (across queries within a session) are considered potentially relevant to all queries within that session; this is not entirely realistic, but it is a good heuristic.
 * Imagine that a user searches for query Q and gets result R in 8th place, but doesn't click on it (possibly because they didn't look that far down the list). They then search for Q' and the same result R is now in 1st place and they click on it. Query Q' gets credit for result R at position 1, but query Q also gets credit for result R in position 8, even though that instance wasn't clicked on. The justification for the heuristic is that queries with overlapping results are probably on the same or related topics; queries on completely unrelated topics generally won't have overlapping results.
 * For each result clicked on by a given user for a given query, accumulate a score such that query_score += Fk, where F is a scoring factor (see below) and k is the 0-based position of the result.
 * The factor F is a value between 0 and 1 that gives more or less weight to resutls farther down the result list. Nelson suggests values between 0.8 ("low") and 0.99 ("high"), though we are investigating values between 0.1 and 0.9. The credit for results decreases exponentially as you go down the results list. Higher factors give more weight to lower-ranked results. We know from previous research that most users on English Wikipedia, for example, are much more likely to click on the top 3 results.
 * As an aid to debugging, it is also good to note that the numerically maximum score for a query is 1/(1-F), though such a score is extremely unlikely. The minimum score would be zero (no clicks).
 * For each user session, the scores are averaged across all queries into a user session score. Thus, hyperactive users (i.e., bots) that run 20,000 queries in a session don't skew the results, and a person who only searched once is as important as someone who searched many times.
 * For the whole data set (e.g., a day's worth of data), all user session session scores are averaged to give an engine score.

Nelson suggest that a score of 0.25 is "very good", though of course the specific F factor and other vagaries of search will affect the score.

For auto-completion suggestions, we expect a much lower score since most queries get no clicks (i.e., while typing many results are shown and ignored) and most users will only click on one results, whereas full-text searchers can more easily go back to the results page or open multiple results in other windows.

Precision
When measuring accuracy (or classification, or search results, etc.) precision is the fraction of retrieved instances that are relevant. When you focus on precision, you care more about every answer you return being correct, and worry less about missing some correct answers. Improving precision decreases false positives.

Recall
When measuring accuracy (or classification, or search results, etc.) recall  is the fraction of relevant instances that are retrieved. When you focus on recall, you care more about returning every correct answer you can, and worry less about having some additional wrong answers. Improving recall decreases false negatives.

Relevance Forge (RelForge)
The primary purpose of the Relevance Forge is to allow us to experiment with proposed modifications to our search process and gauge their effectiveness and impact before releasing them into production, and even before doing any kind of user acceptance or A/B testing.