ORES/Thresholds

Overview
This page describes how to choose thresholds for ORES models and why this needs to be done.

What are thresholds?
[Figure 1. A damaging score from ORES]

ORES includes likelihood estimates as part of the scores it generates. These likelihood estimates allow consumers (tool developers and product designers) to interpret ORES' confidence in a prediction within the technologies they build. By choosing thresholds on ORES confidence, tool developers and product designers can make effective use of ORES predictions.

The anatomy of an ORES score
Figure 1 shows an example damaging prediction from ORES. Note that under "probability" there are two likelihood estimates: one for "true" and one for "false". When trying to detect damaging edits, we are interested in the "true" probability. These likelihood estimates do not necessarily correspond directly to measures of sensitivity (recall) or specificity (precision), due to nuances of how a model is trained. Regardless, applying a prediction model in practice requires balancing operational concerns that can't be captured in a single value.
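Reading the "true" likelihood out of a score response is a matter of walking the nested JSON. A minimal sketch in Python; the payload below mirrors the shape of the score in Figure 1, but the revision ID and probability values are illustrative, not taken from a real response:

```python
# Illustrative ORES score payload, shaped like the response in Figure 1.
# The revision ID and probability values are made up for this example.
score_response = {
    "enwiki": {
        "scores": {
            "123456789": {
                "damaging": {
                    "score": {
                        "prediction": True,
                        "probability": {"true": 0.764, "false": 0.236},
                    }
                }
            }
        }
    }
}

def damaging_probability(response, wiki, rev_id):
    """Extract the 'true' likelihood estimate for the damaging model."""
    score = response[wiki]["scores"][rev_id]["damaging"]["score"]
    return score["probability"]["true"]

print(damaging_probability(score_response, "enwiki", "123456789"))  # 0.764
```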

Balancing sensitivity vs. specificity
The goal when applying thresholds to ORES likelihood estimates ("probability") is to choose a trade-off based on the operational parameters you wish to support. For example, when using ORES to identify damaging edits for human review, one might choose to err on the side of sensitivity (high recall) by selecting all edits that pass a low threshold. This would allow human reviewers to ensure that the majority of damage is caught in review, at the cost of needing to review more edits. However, in the development of an auto-revert bot, one would choose to err on the side of high specificity (high precision) by selecting only the edits that pass a high threshold. This would minimize the harm of automatically reverting good edits by ensuring that the situation is rare, at the cost of letting a large amount of less-obvious damage pass on for human review.

In the case of the score in Figure 1, the probability is relatively high (0.764), which means ORES estimates this edit is likely to be damaging, but it is not very confident. If we were to set a high-sensitivity threshold (targeting high recall) of 0.3, we'd flag this edit for review. If we were to set a high-specificity threshold (targeting high precision) of 0.9, we'd let this edit pass on to be reviewed by others.
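The two operating points above reduce to a single comparison against the chosen threshold. A minimal sketch (0.764 is the "true" probability from Figure 1; the threshold values follow the example in the text):

```python
def flag_for_review(true_probability, threshold):
    """Flag an edit when its 'damaging' likelihood meets the threshold."""
    return true_probability >= threshold

score = 0.764  # the "true" probability from Figure 1

# High-sensitivity (high-recall) setting: a low threshold catches most
# damage for human review, at the cost of more edits to look at.
print(flag_for_review(score, 0.3))   # True -> flagged for review

# High-specificity (high-precision) setting: a high threshold, suitable
# for an auto-revert bot, lets this edit pass on to human reviewers.
print(flag_for_review(score, 0.9))   # False -> passed on
```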

Querying thresholds in ORES
We've included an API that allows exploration of the threshold statistics of any model. For example, if we were to choose to err on the side of specificity (high precision), we might want to make sure that fewer than 1 in 10 edits that an auto-revert bot reverts were actually good. This operational concern corresponds to a precision of at least 0.9. To ask ORES what threshold we should set in order to achieve it, we'd write the following query:

https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true."maximum recall @ precision >= 0.9"

This returns a threshold along with its expected statistics. The statistics have some margin of error, but at the returned threshold we expect 92.3% precision while selecting 5.5% of all damaging edits (recall), and so on. These are good properties for an auto-revert bot (like en:User:ClueBot NG or fa:User:Dexbot).
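The query string above contains spaces and quotes, so it needs URL-encoding when built programmatically. A minimal sketch that constructs the same request URL from its parts (only the URL construction is shown; fetching and parsing the response are left to whatever HTTP client you use):

```python
from urllib.parse import urlencode

# Build the threshold query from the example above. The quoted clause is
# passed verbatim as the model_info value and URL-encoded automatically.
BASE = "https://ores.wikimedia.org/v3/scores/enwiki/"
params = {
    "models": "damaging",
    "model_info": 'statistics.thresholds.true."maximum recall @ precision >= 0.9"',
}
url = BASE + "?" + urlencode(params)
print(url)
```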

Note that thresholds may change as we update the models, so your application should track the model version and request new thresholds whenever the model's version is incremented.
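One way to honor this is to cache the threshold keyed by model version and refresh it only when the version changes. A minimal sketch; the two fetcher functions stand in for real ORES API calls (they are hypothetical and injected here so the example stays self-contained):

```python
class ThresholdCache:
    """Cache a model's threshold, refreshing when the model version changes."""

    def __init__(self, fetch_version, fetch_threshold):
        # fetch_version() and fetch_threshold() stand in for real ORES
        # API calls; they are injected so this sketch runs offline.
        self._fetch_version = fetch_version
        self._fetch_threshold = fetch_threshold
        self._version = None
        self._threshold = None

    def threshold(self):
        current = self._fetch_version()
        if current != self._version:
            # Version bump (or first call): re-query the threshold.
            self._version = current
            self._threshold = self._fetch_threshold()
        return self._threshold

# Usage with stubbed fetchers simulating a version bump on the third call:
versions = iter(["0.5.0", "0.5.0", "0.5.1"])
thresholds = iter([0.92, 0.94])
cache = ThresholdCache(lambda: next(versions), lambda: next(thresholds))
print(cache.threshold())  # 0.92 (fetched)
print(cache.threshold())  # 0.92 (cached, same version)
print(cache.threshold())  # 0.94 (refreshed after version bump)
```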

Worked example
TODO