ORES/Thresholds

This page describes how to choose thresholds for ORES models, and why every client that uses ORES scores needs to do so.

What are thresholds?
Many ORES models answer a yes-or-no question, such as "Is this revision damaging?". Each of these models emits a numeric score. It is tempting to interpret that score as a probability, for example "75% likely to be damaging", and to pick a probability as your cut-off, but doing so gives poor and unpredictable performance. In the worst case, your cut-off could be as arbitrary as a coin toss, or could select half of all edits.

A helpful way to define these thresholds is to choose bounds for important operational parameters. For example, the "very likely bad" threshold defaults to "maximum recall of damaging edits at precision 90% or greater", i.e. "Catch as many damaging edits as possible, while limiting falsely identified damaging edits to 10% or less of the revisions we select using this threshold."
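To make the "maximum recall at precision 90% or greater" rule concrete, here is a minimal sketch of how such a threshold could be derived from a labeled sample of scored edits. It is not the ORES implementation: the function name `max_recall_at_precision` and the toy data are hypothetical, and it assumes that an edit is selected when its score is greater than or equal to the threshold.

```python
from typing import List, Optional, Tuple


def max_recall_at_precision(
    scores: List[float],
    labels: List[int],
    min_precision: float = 0.9,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold qualifies. labels are 1 for damaging edits and 0 otherwise."""
    total_positive = sum(labels)
    if total_positive == 0:
        return None

    # Walk the edits from highest to lowest score; each distinct score is a
    # candidate threshold that selects every edit at or above it.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(ranked):
        true_positives += label
        # Edits with equal scores are always selected together, so only
        # evaluate once the whole group of ties has been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        selected = i + 1
        precision = true_positives / selected
        recall = true_positives / total_positive
        # A strict comparison keeps the higher threshold when recall ties.
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (score, precision, recall)
    return best


# Toy sample (made-up data): ten scored edits, four of them damaging.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(max_recall_at_precision(scores, labels, min_precision=0.9))
```

Lowering `min_precision` trades false positives for coverage: on the toy sample, relaxing the bound to 0.75 moves the threshold down far enough to catch the fourth damaging edit.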

Any client that uses scores should have well-defined thresholds that it will use for each task. The Scoring Platform team is happy to help choose and calculate these thresholds, or you can use the table here.

Threshold calculation service
We've included an API that can calculate thresholds for any model. To fetch the configured "very likely bad" threshold for English Wikipedia as described above, we would request

The statistics in the response have some margin of error, but they show the threshold to use and what to expect from it: 92.3% precision, selection of 0.2% of the total volume of edits, and so on. These are good properties for a workflow such as nearly automatic reversion, but the threshold would have to be tuned for a different use. [TODO: for comparison]
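The statistics that accompany a threshold can be illustrated with a short sketch that computes them from a labeled sample. This is an assumption-laden illustration, not the service's code: `threshold_statistics` and the key `selected_fraction` are hypothetical names, the data is made up, and selection is again assumed to mean "score greater than or equal to the threshold". Because such figures are estimated from a sample, they carry a margin of error.

```python
from typing import Dict, List, Optional


def threshold_statistics(
    scores: List[float],
    labels: List[int],
    threshold: float,
) -> Dict[str, Optional[float]]:
    """Estimate precision, recall, and the fraction of edits selected when
    every edit scoring at or above the threshold is selected."""
    selected = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(selected)
    total_positive = sum(labels)
    return {
        # Share of selected edits that really are damaging.
        "precision": true_positives / len(selected) if selected else None,
        # Share of all damaging edits that the threshold catches.
        "recall": true_positives / total_positive if total_positive else None,
        # Share of the total edit volume that the threshold selects.
        "selected_fraction": len(selected) / len(scores) if scores else None,
    }


# Toy sample (made-up data): ten scored edits, four of them damaging.
sample_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
sample_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(threshold_statistics(sample_scores, sample_labels, threshold=0.8))
```

Comparing these three numbers for a few candidate thresholds is a quick way to see whether a threshold suits a strict workflow (high precision, few edits selected) or a broad review queue (high recall, many edits selected).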

Note that thresholds may change as we update the models, so your application should track the model version and request new thresholds whenever the model version is incremented.
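One way a client might follow this advice is to cache each threshold together with the model version it was calculated for, and to refetch when the version changes. The sketch below is only an illustration under stated assumptions: `ThresholdCache` is a hypothetical class, and `fetch_threshold` stands in for whatever function your application uses to request a threshold from the service.

```python
from typing import Callable, Dict, Tuple


class ThresholdCache:
    """Remember one threshold per model and refetch it when the model
    version changes."""

    def __init__(self, fetch_threshold: Callable[[str, str], float]) -> None:
        # fetch_threshold(model, version) is supplied by the caller; it is a
        # placeholder for a real request to the threshold service.
        self._fetch = fetch_threshold
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, model: str, version: str) -> float:
        """Return the threshold for this model at this version."""
        cached = self._cache.get(model)
        if cached is None or cached[0] != version:
            # First use, or the model was updated: request a new threshold.
            cached = (version, self._fetch(model, version))
            self._cache[model] = cached
        return cached[1]
```

The application passes in the version reported alongside the scores it receives, so a model update triggers exactly one new threshold request rather than one per scored edit.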

Worked example
TODO