ORES/Thresholds

Overview
This page describes why thresholds are needed for ORES models and how to choose them.

What are thresholds?
Figure 1. A damaging score from ORES

ORES includes likelihood estimates as part of the scores it generates. These likelihood estimates allow consumers (tool developers and product designers) to interpret ORES' confidence in a prediction within the technologies they build.

By choosing thresholds on ORES confidence, consumers can make effective use of ORES predictions.

The anatomy of an ORES score
Figure 1 shows an example of a damaging prediction from ORES. Under "probability" there are two likelihood estimates, one for "true" and one for "false". When trying to detect damaging edits, we are interested in the "true" probability.
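As a minimal sketch of reading that value in code, the snippet below pulls the "true" likelihood estimate out of a score. The dictionary is a hypothetical stand-in for the score in Figure 1: the nesting, the "prediction" field, and the "false" value are assumptions made for illustration, and only the 0.764 "true" value comes from the text.

```python
# Hypothetical score, loosely shaped like the one described for Figure 1.
# The nesting and the "false" value are assumptions, not ORES output.
score = {
    "damaging": {
        "score": {
            "prediction": True,
            "probability": {"false": 0.236, "true": 0.764},
        }
    }
}


def damaging_probability(score: dict) -> float:
    """Return the likelihood estimate for the "true" (damaging) class."""
    return score["damaging"]["score"]["probability"]["true"]


print(damaging_probability(score))
```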

Note: likelihood estimates do not necessarily correspond directly to measures such as recall (sensitivity) or precision. This is due to nuances of how a model is trained.

Regardless, applying a prediction model in practice requires balancing operational concerns that can't be captured in a single value.

Balancing sensitivity vs. specificity
The goal when applying thresholds to ORES likelihood estimates ("probability") is to choose a trade-off based on the important operational parameters that you wish to support.

For example, when using ORES to identify damaging edits for human review, one might choose to emphasize sensitivity (high recall) by selecting all edits that pass a low threshold. This would allow human reviewers to ensure that the majority of damage is caught in review, at the cost of needing to review more edits.

However, in the development of an auto-revert bot, one would choose to emphasize specificity (high precision) by selecting only the edits that pass a high threshold. This would minimize the harm of automatically reverting good edits by ensuring that such reverts are rare, while letting a large amount of less-obvious damage pass on for human review.
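The trade-off can be made concrete with a small calculation. The sketch below computes precision and recall at a given threshold over a labeled sample. The sample is invented purely for illustration (it is not ORES data), and treating "passes the threshold" as "greater than or equal to" is an assumption.

```python
def precision_recall(scored, threshold):
    """Compute (precision, recall) for edits whose "true" probability
    is at or above the threshold.

    scored: list of (probability_true, is_damaging) pairs.
    Returns None for a measure whose denominator is zero.
    """
    selected = [is_damaging for p, is_damaging in scored if p >= threshold]
    true_positives = sum(selected)
    total_damaging = sum(is_damaging for _, is_damaging in scored)
    precision = true_positives / len(selected) if selected else None
    recall = true_positives / total_damaging if total_damaging else None
    return precision, recall


# Made-up labeled sample: (probability of "true", was the edit damaging?)
sample = [
    (0.95, True), (0.92, True), (0.80, False), (0.76, True),
    (0.55, False), (0.40, True), (0.35, False), (0.20, False),
    (0.10, False), (0.05, False),
]

for threshold in (0.3, 0.9):
    print(threshold, precision_recall(sample, threshold))
```

On this toy sample, the low threshold selects every damaging edit but also several good ones, while the high threshold selects only damaging edits but misses half of them.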

In the case of the score in Figure 1, the "true" probability is relatively high (0.764), which means that ORES estimates this edit is likely to be damaging, but it is not very confident. If we were to set a high-sensitivity threshold (targeting high recall) of 0.3, then we'd flag this edit for review. If we were to set a high-specificity threshold (targeting high precision) of 0.9, we'd let this edit pass on to be reviewed by others.
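The two thresholds from this example can be sketched as a simple triage function. The function name, the returned labels, and the idea of combining both thresholds in one tool are assumptions for illustration. The 0.3 and 0.9 values are the example thresholds from the text, not recommended settings.

```python
def triage(p_true, review_threshold=0.3, revert_threshold=0.9):
    """Decide what to do with an edit given its "true" probability.

    Hypothetical helper: edits at or above the high threshold are
    candidates for automatic revert; edits at or above the low
    threshold are flagged for human review; the rest are left alone.
    """
    if p_true >= revert_threshold:
        return "auto-revert"
    if p_true >= review_threshold:
        return "flag for review"
    return "no action"


# The score from Figure 1 clears the low threshold but not the high one.
print(triage(0.764))
```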

Querying thresholds in ORES
We've included an API that allows for exploration of the threshold statistics of any model. For example, if we were to choose to emphasize specificity (high precision), we might want to make sure that fewer than 1 in 10 of the edits that an auto-revert bot reverts are actually good. This operational concern corresponds to a precision of at least 0.9. To ask ORES what threshold we should set in order to meet it, we'll write the following query:

https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true."maximum recall @ precision >= 0.9"
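Because the query contains spaces, quotes, and comparison operators, it can help to build the URL programmatically so those characters are encoded. The sketch below uses only the Python standard library and makes no network request. The helper name and its parameters are hypothetical, and it assumes the service accepts standard form encoding of the query string.

```python
from urllib.parse import urlencode

# Endpoint taken from the query above.
BASE = "https://ores.wikimedia.org/v3/scores/enwiki/"


def threshold_query_url(model, target, optimization):
    """Build a threshold-statistics query URL (hypothetical helper).

    model:        model name, e.g. "damaging"
    target:       class label the threshold applies to, e.g. "true"
    optimization: e.g. "maximum recall @ precision >= 0.9"
    """
    model_info = f'statistics.thresholds.{target}."{optimization}"'
    # urlencode percent-encodes the quotes, "@", ">" and "=",
    # and encodes spaces as "+".
    query = urlencode({"models": model, "model_info": model_info})
    return f"{BASE}?{query}"


url = threshold_query_url(
    "damaging", "true", "maximum recall @ precision >= 0.9"
)
print(url)
```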

The response reports the threshold that satisfies this constraint, together with the statistics expected at that threshold. The statistics have some margin of error, but at the returned threshold we expect to have 92.3% precision, select 5.5% of all damaging edits (recall), and so on. These are good properties for an auto-revert bot (like en:User:ClueBot NG or fa:User:Dexbot).

Note: thresholds may change as we update the models. Applications should track the model version and request new thresholds when the model's version is incremented.
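One way to follow this advice is to cache the threshold alongside the model version it was fetched for, and to refresh it only when the version changes. The sketch below is a minimal illustration: the class, its method names, and the `fetch_threshold` callable (which would wrap a threshold query) are all hypothetical, and how an application learns the current model version is left open.

```python
class ThresholdCache:
    """Cache a threshold and refetch it when the model version changes.

    fetch_threshold is a caller-supplied function (hypothetical here)
    that asks the service for the current threshold and returns it.
    """

    def __init__(self, fetch_threshold):
        self._fetch = fetch_threshold
        self._version = None
        self._threshold = None

    def get(self, current_version):
        """Return the threshold for current_version, refetching if stale."""
        if current_version != self._version:
            self._threshold = self._fetch()
            self._version = current_version
        return self._threshold
```

An application would call `get()` with the model version reported alongside each score, so a version bump triggers exactly one new threshold request.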

Worked example
TODO