ORES/Thresholds


Overview

This page describes why thresholds are needed for ORES models and how to choose them.

What are thresholds?

"damaging": {
  "score": {
    "prediction": true,
    "probability": {"true": 0.764, "false": 0.236}
  }
}

Figure 1. A damaging score from ORES

ORES includes likelihood estimates as part of the scores it generates. These likelihood estimates allow consumers (tool developers and product designers) to interpret ORES' confidence in a prediction within the technologies they build.

By choosing thresholds on ORES confidence, consumers can make effective use of ORES predictions.

The anatomy of an ORES score

Figure 1 shows an example of a damaging prediction from ORES. Under "probability" there are two likelihood estimates, one for "true" and one for "false". When trying to detect damaging edits, we are interested in the "true" probability.
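
For readers who want to pull these values programmatically, here is a minimal sketch in Python (using the requests library) that fetches a damaging score and reads the "true" probability. The revision ID is a made-up placeholder, and the response shape is assumed to match the v3 examples shown on this page.

import requests

# Hypothetical revision ID, used only for illustration.
REV_ID = 123456789

# Ask the ORES v3 API for the English Wikipedia "damaging" score of one revision.
response = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "damaging", "revids": REV_ID},
)
response.raise_for_status()

score = response.json()["enwiki"]["scores"][str(REV_ID)]["damaging"]["score"]

# The likelihood estimate we care about when detecting damage (Figure 1's 0.764).
print(score["probability"]["true"])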

Note: likelihood estimates do not necessarily correspond directly to the measures of sensitivity (recall) or specificity (precision). This is due to nuances of how a model is trained.

Regardless, the operational concerns of applying a prediction model in practice require a balance of concerns that can't be captured in a single value.

Balancing sensitivity vs. specificity

The sensitivity vs. specificity trade-off of the English Wikipedia damage detection model is presented below for 9 probability (damaging=true) thresholds.

Threshold  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Precision  0.135  0.171  0.215  0.272  0.347  0.429  0.510  0.641  0.868
Recall     0.905  0.820  0.750  0.664  0.566  0.493  0.391  0.262  0.109

The goal when applying thresholds to ORES likelihood estimates ("probability") is to choose a trade-off based on the important operational parameters that you wish to support.

For example, when using ORES to identify damaging edits for human review, one might choose to emphasize sensitivity (high recall) by selecting all edits that pass a low threshold. This would allow human reviewers to ensure that the majority of damage is caught in review, at the cost of needing to review more edits.

However, in the development of an auto-revert bot, one would choose to emphasize specificity (high precision) by selecting only the edits that pass a high threshold. This would minimize the harm of automatically reverting good edits by ensuring that such mistakes are rare, while letting a large amount of less-obvious damage pass on for human review.

In the case of the score in Figure 1, the damaging=true probability is relatively high (0.764), which means that ORES estimates this edit is likely to be damaging, but it is not very confident. If we were to set a high-sensitivity threshold (targeting high recall) of 0.3, then we'd flag this edit for review. If we were to set a high-specificity threshold (targeting high precision) of 0.9, we'd let this edit pass on to be reviewed by others.
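
The same logic can be written down directly. The sketch below (Python; the function and threshold names are illustrative, not part of any ORES client) routes an edit based on its damaging=true probability using the two thresholds discussed above.

# A low threshold favours sensitivity (recall): most damage gets flagged.
REVIEW_THRESHOLD = 0.3
# A high threshold favours specificity (precision): only very confident
# predictions are acted on automatically.
AUTO_REVERT_THRESHOLD = 0.9

def route_edit(p_damaging):
    """Decide what to do with an edit, given its damaging=true probability."""
    if p_damaging >= AUTO_REVERT_THRESHOLD:
        return "auto-revert"
    if p_damaging >= REVIEW_THRESHOLD:
        return "flag for human review"
    return "ignore"

# The score from Figure 1 clears the review threshold but not the
# auto-revert threshold.
print(route_edit(0.764))  # -> "flag for human review"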

Querying thresholds in ORES

We've included an API that allows for exploration of the threshold statistics of any model. For example, if we were to emphasize specificity (high precision), we might want to make sure that fewer than 1 in 10 of the edits an auto-revert bot reverts is actually good. This operational concern corresponds to precision >= 0.9. To ask ORES what threshold we should set in order to meet this concern, we write the following query:

https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true."maximum recall @ precision >= 0.9"

This returns:

{
  "enwiki": {
    "models": {
      "damaging": {
        "statistics": {
          "thresholds": {
            "true": [
              {"!f1": 0.983, "!precision": 0.968, "!recall": 1.0, "accuracy": 0.968, "f1": 0.103,
                "filter_rate": 0.998, "fpr": 0.0, "match_rate": 0.002, "precision": 0.923,
                "recall": 0.055, "threshold": 0.936}
            ]
          }
        }
      }
    }
  }
}

The statistics have some margin of error, but we see that the threshold should be damaging.true > 0.936. We expect to have 92.3% precision, select 5.5% of all damaging edits (recall), and so on. These are good properties for an auto-revert bot (like en:User:ClueBot NG or fa:User:Dexbot).
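
The same query can be made from code. Here is a sketch in Python (requests) that asks for the "maximum recall @ precision >= 0.9" statistic and pulls the threshold out of the response, assuming the response shape matches the example above.

import requests

response = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={
        "models": "damaging",
        # The same statistic as the query above: the most sensitive threshold
        # that still keeps precision at or above 0.9.
        "model_info": 'statistics.thresholds.true."maximum recall @ precision >= 0.9"',
    },
)
response.raise_for_status()

stats = response.json()["enwiki"]["models"]["damaging"]["statistics"]
optimization = stats["thresholds"]["true"][0]

print(optimization["threshold"])  # e.g. 0.936
print(optimization["precision"])  # e.g. 0.923
print(optimization["recall"])     # e.g. 0.055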

Note: thresholds may change as we update the models. Applications should track the model version and request new thresholds when the model's version is incremented.
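
One way to follow this note is to cache the threshold keyed by the model version and refresh it whenever the version changes. The sketch below assumes that model_info=version returns the model's version string; treat the exact field layout as an assumption to verify against the live API.

import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"
_threshold_cache = {}  # model version -> threshold statistics

def damaging_threshold():
    """Return cached threshold statistics, refreshing when the model version changes."""
    # Assumption: model_info=version exposes the model's version string.
    version = requests.get(
        ORES_URL, params={"models": "damaging", "model_info": "version"}
    ).json()["enwiki"]["models"]["damaging"]["version"]

    if version not in _threshold_cache:
        info = requests.get(
            ORES_URL,
            params={
                "models": "damaging",
                "model_info": 'statistics.thresholds.true."maximum recall @ precision >= 0.9"',
            },
        ).json()["enwiki"]["models"]["damaging"]
        _threshold_cache.clear()
        _threshold_cache[version] = info["statistics"]["thresholds"]["true"][0]

    return _threshold_cache[version]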

Worked example

Aaron wants to generate a list of newly created articles that are likely to be vandalism. Since he is going to review them before taking any action, he doesn't mind seeing some good pages in the set, but he's really concerned about letting any vandalism slip past review. After considering this for a little while, he's willing to leave 5% of the least concerning vandalism for others to catch. This means he is targeting 95% recall. So he wants the most specific (highest-precision) threshold he can get that is still sensitive (high-recall) enough to catch 95% of vandalism.

"vandalism" is one of the target classes of the draftquality models, so Aaron knows he wants to do a threshold optimization on that class/model pair. He designs the following query to look for an appropriate threshold:

https://ores.wikimedia.org/v3/scores/enwiki/?models=draftquality&model_info=statistics.thresholds.vandalism."maximum precision @ recall >= 0.95"

And he gets the following response:

{"!f1": 0.915, "!precision": 1.0, "!recall": 0.844, "accuracy": 0.845, "f1": 0.081, 
 "filter_rate": 0.839, "fpr": 0.156, "match_rate": 0.161, "precision": 0.042,
 "recall": 0.95, "threshold": 0.016}

Great! He can look at all of the new page creations that ORES scores with more than a 0.016 probability of "vandalism" (the threshold) and expect to catch 95% of the vandalism (recall). But oh no! It looks like he only gets 4.2% precision. That means that only about 1 out of every 24 pages he looks at will actually be vandalism. Is this any help at all? It turns out that it's a huge help, because without ORES he'd have to review 95% of all pages to know that he was catching 95% of the vandalism. By using this model, he only needs to review 16% of new pages (the match_rate), cutting his workload by a factor of about 6.
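
The arithmetic behind that conclusion comes straight from the statistics in the response; here is a quick check (Python) using the match_rate, recall, and precision values.

# Figures reported by the threshold query above.
match_rate = 0.161  # fraction of new pages scoring at or above the 0.016 threshold
recall = 0.95       # fraction of all vandalism caught at that threshold
precision = 0.042   # fraction of flagged pages that are actually vandalism

# Pages reviewed per vandalism page found (roughly 1 in 24).
print(round(1 / precision))           # -> 24

# Without ORES, catching 95% of the vandalism means reviewing ~95% of pages;
# with ORES, only the flagged ~16% need review.
print(round(recall / match_rate, 1))  # -> 5.9, i.e. about 6x less review work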