ORES/RCFilters

Very rough notes for now, just a transcription of what I scribbled on my notepad

To document:


 * Vocabulary (model, outcome, precision, recall, threshold, level)
 * How this appears in the UI (+Special:ORESModels)
 * Threshold selection
 * Config structure
 * Defaults (goodfaith verylikelybad missing)
 * Variability when models change
 * Disabling/skipping levels
 * Need to redo thresholds for a bunch of older models (which ones are up to date?)
 * Maintenance scripts to run during deployment
 * How to disable models (mostly bad goodfaith)
 * Thresholds in the API are 1-complemented (1 - t) for false outcomes

Threshold selection


 * maybebad: P>=0.15 or R>=0.9, whichever gives the tighter threshold
 * verylikelybad: P>=0.9 or R>=0.1, whichever gives the wider threshold (but not below P=0.6)
 * likelybad: 0.3 precision points down from verylikelybad (=P(verylikelybad)-0.3)? Wider of P>=0.6 and R>=0.2? Or R(verylikelybad)+0.1? But at least P=0.5
 * Never use thresholds with R=1 or P=1
 * likelygood: P>=0.995 but R<=0.9? Minimum recall in case high precision is not available? No overlap with likelybad (but overlap with maybebad is OK)
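A minimal sketch of how the selection rules above could be applied mechanically, assuming threshold/precision/recall statistics for the "true" outcome are already available (e.g. from ORES's model statistics). The data layout and function names are illustrative, not the actual deployment tooling, and this ignores the non-monotonicity caveat discussed under precision/recall below.

```python
# Sketch of threshold selection. `stats` is assumed to be a list of
# (threshold, precision, recall) tuples for the "true" outcome, sorted
# by threshold ascending. Higher threshold = tighter (narrower) range.

def pick_maybebad(stats):
    """P>=0.15 or R>=0.9, whichever gives the tighter (higher) threshold."""
    by_precision = min((t for t, p, r in stats if p >= 0.15), default=None)
    by_recall = max((t for t, p, r in stats if r >= 0.9), default=None)
    candidates = [t for t in (by_precision, by_recall) if t is not None]
    return max(candidates) if candidates else None

def pick_verylikelybad(stats):
    """P>=0.9 or R>=0.1, whichever gives the wider (lower) threshold,
    but never below the lowest threshold that still has P>=0.6."""
    by_precision = min((t for t, p, r in stats if p >= 0.9), default=None)
    by_recall = max((t for t, p, r in stats if r >= 0.1), default=None)
    candidates = [t for t in (by_precision, by_recall) if t is not None]
    if not candidates:
        return None
    chosen = min(candidates)
    floor = min((t for t, p, r in stats if p >= 0.6), default=chosen)
    return max(chosen, floor)

# Made-up example statistics, purely for illustration:
stats = [(0.1, 0.05, 0.95), (0.3, 0.20, 0.90), (0.5, 0.50, 0.60),
         (0.7, 0.75, 0.30), (0.9, 0.92, 0.10)]
```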

Vocabulary

 * Model: software that predicts a certain attribute of an edit or page. For RCFilters, we use the damaging and goodfaith models, which predict how likely an edit is to be damaging or made in good faith, respectively.
 * Outcome: a possible value for the attribute that the model predicts. For damaging and goodfaith, the only outcomes are true (is damaging / in good faith) and false (is not damaging / not in good faith). Some other models have more than two outcomes, but RCFilters only uses true/false models right now.
 * Score: a number between 0 and 1 returned by a model. The higher the score, the higher the likelihood of a true outcome. But this does not necessarily correspond to a probability! If edit A has a damaging score of 0.9 and edit B has a damaging score of 0.3, that means the model thinks A is more likely to be damaging than B, but it doesn't mean that there's a 90% chance that A is damaging. It doesn't even have to mean that A is more likely to be damaging than not damaging.
 * Filter: a feature in RCFilters that lets you display only those edits that match certain criteria. The ORES integration in RCFilters typically provides the following filters:
 * For  ("contribution quality"): very likely good, may have problems , likely have problems , very likely have problems
 * For  ("user intent"): very likely good faith , may be bad faith , likely bad faith , very likely bad faith
 * Note that for, the   filter looks for a   outcome (high scores) and the   filters look for a   outcome (low scores), but for   this is reversed (because there,   outcomes are "bad")
 * Threshold: a cutoff value for scores. Filters are implemented as score >= X (when looking for true outcomes) or score <= X (when looking for false outcomes), and the number X is called the threshold.
 * Note how thresholds are reversed for false outcomes: when the ORES API reports a 0.123 threshold for a false outcome, that means score <= 0.877 (i.e. 1 - 0.123). This is a bit confusing, but this definition has advantages, like higher thresholds always corresponding to narrower score ranges. Another way you can think of it is that the false model is a mirror image of the true model, with falsescore = 1 - truescore, and you're working with thresholds on falsescore (since falsescore >= 0.123 is equivalent to truescore <= 0.877).
 * Precision: The expected percentage of results of a filter that truly match the filter. For example, when we say "the precision of the verylikelygood filter is 95%", that means that we expect 95% of the edits returned by the verylikelygood filter to actually be good (and the other 5% to be false positives).
 * Recall: The expected percentage of the truly matching population that is returned by a filter. For example, when we say "the recall of the likelybad filter is 30%", that means that of all the bad edits that exist, 30% are found by the likelybad filter.
 * Precision/recall at threshold: When we say "the precision at threshold 0.687 is 60%", that means that we expect 60% of the edits with score >= 0.687 to be true positives (and 40% to be false positives). When we say "the recall at threshold 0.123 is 80%", that means we expect 80% of all edits that are truly damaging/goodfaith to have scores above 0.123.
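To make the threshold reversal for false outcomes concrete, here is a tiny helper (illustrative only, not actual RCFilters code):

```python
def true_score_cutoff(false_threshold):
    """Convert an API-reported threshold on the "false" outcome into the
    equivalent cutoff on the "true" score: falsescore >= t is the same as
    truescore <= 1 - t, because falsescore = 1 - truescore."""
    return 1 - false_threshold

# Example from the note above: a 0.123 threshold reported for the "false"
# outcome selects edits whose true-score is at most 0.877.
cutoff = true_score_cutoff(0.123)
```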

More about precision and recall
For more about the definitions of precision and recall, see the Wikipedia article on the subject.

Precision and recall are a trade-off: at lower thresholds (wider score ranges), recall will be high but precision will be low, and as the threshold is increased (score range is narrowed), precision will increase but recall will decrease. At the extremes, at threshold 0 (score >= 0, so all scores) recall will by definition be 100% but precision will be low, and at threshold 1 (only edits with the highest possible score) precision will generally be 100% but recall will be low. The increase in precision and decrease in recall aren't monotonic: it's possible for a small increase in the threshold to cause a decrease in precision. The ORES UI tool lets you graph the precision/recall curve for a model (with precision and recall on the Y axis, and threshold on the X axis).
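The trade-off can be demonstrated with a toy calculation over hypothetical labeled scores (the data below is made up for illustration, not real model output):

```python
def precision_recall_at(threshold, scored):
    """Precision and recall of the filter "score >= threshold", given a
    list of (score, truly_matches) pairs."""
    selected = [truth for score, truth in scored if score >= threshold]
    true_positives = sum(selected)
    total_positives = sum(truth for _, truth in scored)
    precision = true_positives / len(selected) if selected else None
    recall = true_positives / total_positives
    return precision, recall

# Hypothetical scores with ground-truth labels:
scored = [(0.05, False), (0.20, False), (0.40, True), (0.60, False),
          (0.70, True), (0.85, True), (0.95, True)]

# Threshold 0 selects everything: recall is 100%, precision only 4/7.
# Threshold 0.65 narrows the range: precision rises to 100%, recall falls to 3/4.
```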

When we say that precision or recall is "low", that's relative. In the damaging model, for instance, only a small portion of edits (say 4%) are damaging, and the overwhelming majority are non-damaging. That means that for the true outcome, the precision at threshold 0 is 4%, but for the false outcome the precision at threshold 0 is 96%(!). This means that increasing the threshold for the false outcome increases the precision some, but not much (because it doesn't have much room to grow), and it's why the likelygood filters tend to have such high precision stats (99.5% is standard, and 99.7% or 99.8% are sometimes used).
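As a sanity check of the prevalence argument (using the illustrative 4% figure from the text):

```python
# At threshold 0 every edit is selected, so precision simply equals the
# prevalence of the outcome in the population. The 4% figure is the
# illustrative number from the text, not a measured value.
prevalence = 0.04

precision_true_at_zero = prevalence        # "true" outcome: 4%
precision_false_at_zero = 1 - prevalence   # "false" outcome: 96%
```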