Wikimedia Discovery/Search/Glossary

Jump to navigation Jump to search

This is a place for us to provide definitions, context, and links for terms we use that other people may not be familiar with. If you come across a term on the mailing list, IRC, or elsewhere that should be defined here, add a note on the Talk page!


Discernatron is a tool that allows participants to judge the relevance of search results to help the Search team be able to test changes before making them available on-wiki. In particular, we are using the Discernatron to gather data so that we can calculate DCG scores for CirrusSearch results (current and for proposed changes).

Discounted Cumulative Gain[edit]

Also known as DCG, nDCG, IDCG, DCG@5, etc.
Discounted cumulative gain (DCG) is a measure of ranking quality. It requires human judgement to be made on rankings (hence the need for the Discernatron to gather that data).


Also known as F0.5, F1, and F2
The F-Score is the harmonic mean (used to average rates) or recall and precision, and is used to provide a single number for comparison purposes. It makes ranking recall and precision numbers easier. Fβ is a weighted harmonic mean that is used to favor recall (β > 1) or precision (β < 1). The weighting isn't linear, so F0.5 and F2 are complementary.


"PaulScore" is the name we've given to a metric proposed by Paul Nelson in a talk he gave at Elasticon (and has given in many other places). "NelsonScore" was also tossed around as an alternative and may occur in very early discussions.
We use PaulScore to evaluate the quality of results provided by CirrusSearch or proposed modifications to CirrusSearch, based on historical click data.
A big advantage of the PaulScore is that it relies on user click history to award points, so it is easy to compute. A disadvantage is that radical improvements may provide results in the lab (vs in production) that are better, but which earn no points because they were never seen by an actual user. As such, the PaulScore is great for (and recommended by Nelson) tracking live performance, though it is still useful for evaluating changes in the lab.
A PaulScore can be calculated for the result set returned for a query. PaulScores can then be aggregated to a user session, and then across user sessions. Since PaulScore is not a common term of art (Paul's working on it) and there's little external documentation beyond videos of his presentations, extra detail is provided here.
  • Users queries are grouped into sessions (in our case, queries that occur within ten minutes of the previous query are in the same session).
  • All results (across queries within a session) are considered potentially relevant to all queries within that session; this is not entirely realistic, but it is a good heuristic.
    • Imagine that a user searches for query Q and gets result R in 8th place, but doesn't click on it (possibly because they didn't look that far down the list). They then search for Q' and the same result R is now in 1st place and they click on it. Query Q' gets credit for result R at position 1, but query Q also gets credit for result R in position 8, even though that instance wasn't clicked on. The justification for the heuristic is that queries with overlapping results are probably on the same or related topics; queries on completely unrelated topics generally won't have overlapping results.
  • For each result clicked on by a given user for a given query, accumulate a score such that query_score += Fk, where F is a scoring factor (see below) and k is the 0-based position of the result.
    • The factor F is a value between 0 and 1 that gives more or less weight to resutls farther down the result list. Nelson suggests values between 0.8 ("low") and 0.99 ("high"), though we are investigating values between 0.1 and 0.9. The credit for results decreases exponentially as you go down the results list. Higher factors give more weight to lower-ranked results. We know from previous research that most users on English Wikipedia, for example, are much more likely to click on the top 3 results.
    • As an aid to debugging, it is also good to note that the numerically maximum score for a query is 1/(1-F), though such a score is extremely unlikely. The minimum score would be zero (no clicks).
  • For each user session, the scores are averaged across all queries into a user session score. Thus, hyperactive users (i.e., bots) that run 20,000 queries in a session don't skew the results, and a person who only searched once is as important as someone who searched many times.
  • For the whole data set (e.g., a day's worth of data), all user session session scores are averaged to give an engine score.
  • We are putting together a dashboard that will automatically approximate the daily PaulScore for full-text and auto-completion queries on CirrusSearch, using a range of factors (0.1-0.9 to start).
Nelson suggests that a score of 0.25 is "very good", though of course the specific F factor and other vagaries of search will affect the score.
For auto-completion suggestions, we expect a much lower score since most queries get no clicks (i.e., many results are shown and ignored while typing and most users will only click on one result), whereas full-text searchers can more easily go back to the results page or open multiple results in other windows.


When measuring accuracy (or classification, or search results, etc.), precision is the fraction of retrieved instances that are relevant. When you focus on precision, you care more about every answer you return being correct, and worry less about missing some correct answers. Improving precision decreases false positives.


  1. When measuring accuracy (or classification, or search results, etc.), recall is the fraction of relevant instances that are retrieved. When you focus on recall, you care more about returning every correct answer you can, and worry less about having some additional wrong answers. Improving recall decreases false negatives.
  2. "Recall" is also sometimes used more loosely to refer to the number of results made available. So someone might speak of "increasing recall" when they mean specifically that they want more results to be considered and scored, usually on the assumption that some of them will be good results. An example would be configuring your search engine to only require 70% of query terms to be in a returned document—that would certainly increase "recall" in the sense of there being more results, but wouldn't guarantee that there would be a larger number of desirable results.

Relevance Forge[edit]

Also known as RelForge
The primary purpose of the Relevance Forge is to allow us to experiment with and test proposed modifications to our search process and gauge their effectiveness and impact before releasing them into production, and even before doing any kind of user acceptance or A/B testing.


Elasticsearch is a distributed, open source search and analytics engine which powers search on Wikipedia and other WMF wikis. Elasticsearch has a glossary defining terms used in the elasticsearch world. Below are a few of the most used ones.


- Lucene is the indexing engine used by elasticsearch.


- In the context of elasticsearch, a node is an instance of elasticsearch. In practice, this maps to a server, even if it is actually possible to run multiple nodes on the same server. However, this is not something that we do at the moment.


- An elasticsearch index can be compared to a database.


- An index is split in multiple shards. Each shard is a lucene instance. Each shard has at least one replica. (Multiple replicas are supported)

See also[edit]

What search metrics do you use most to understand how your search is performing (Justin-Ormont) (