## Metrics for ORES usage


Request for more ORES metrics.

Do you have metrics for ORES usage, and specifically:

1. How many users enable the ORES RC filters, and how often, compared to other RC filters? (per project) [I'm wondering whether ORES works well in some projects but predicts poorly in others.]
2. How many edits could be driven by ORES, and how many were skipped? (e.g., for edits with a high damaging score, $|DAMAGING \cap ROLLBACK|$ and $|DAMAGING \cap \neg ROLLBACK|$)
3. Wouldn't it be better to evaluate ORES's bad predictions based on actions (such as high-damaging edits that were not reverted) rather than on stricter feedback? While that data is somewhat noisier to use, it is, and always will be, a much richer database than explicit judgments.
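The two counts in question 2 are plain set intersections over edit IDs. A minimal sketch, using made-up edit IDs (the sets and the 0.9 threshold are illustrative assumptions, not real ORES data):

```python
# Hypothetical edit IDs whose ORES damaging score exceeded a chosen
# threshold (e.g. 0.9), and edit IDs that editors actually rolled back.
damaging = {101, 102, 103, 104, 105}
rollback = {103, 104, 105, 106}

# |DAMAGING ∩ ROLLBACK|: flagged edits that were reverted.
caught = damaging & rollback
# |DAMAGING ∩ ¬ROLLBACK|: flagged edits nobody reverted.
missed = damaging - rollback

print(len(caught), len(missed))  # 3 2
```

In practice the two sets would come from the ORES scores and the revert history of a project's recent changes over the same time window.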

Hey!

1. We don't develop RC filters. That's the Collaboration team folks. They probably do have stats. Ping Roan Kattouw.
2. For this, I would use our test statistics. E.g. https://ores.wikimedia.org/v2/scores/enwiki/damaging?model_info=test_stats shows that you could auto-revert 8% of damaging edits if you're OK with 99% precision.
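Figures like "8% of damaging edits at 99% precision" come from sweeping a score threshold over labeled test data and keeping the highest recall whose precision still meets the target. A minimal sketch of that computation, with made-up scores and labels (ORES publishes these thresholds precomputed; this only illustrates how such a number is derived):

```python
def max_recall_at_precision(scores, labels, min_precision=0.99):
    """Highest recall achievable by thresholding `scores` while keeping
    precision >= min_precision. `scores` are predicted damaging
    probabilities; `labels` are True if the edit really was damaging."""
    total_pos = sum(labels)
    tp = fp = 0
    best_recall = 0.0
    # Walk thresholds from the most to the least confident prediction.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall

# Toy data: at precision 1.0 we can only auto-revert the top two edits,
# i.e. 2 of the 3 truly damaging ones.
scores = [0.99, 0.95, 0.90, 0.50]
labels = [True, True, False, True]
print(max_recall_at_precision(scores, labels))  # 0.666...
```

The test statistics endpoint reports exactly this kind of trade-off at several preset precision levels.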
3. In some ways yes, but in a lot of ways, no.
   * The biggest problem with doing that is that we don't know how often ORES was right to *not* flag something, since people won't be directly reviewing those things.
   * The second problem is that revert != damage != vandalism, so the action is often ambiguous.
   * The third problem is that ORES affects your judgement. Just by having something flagged as "likely to be bad", your view of it changes. This is well documented in the research lit.

So I think it is best to evaluate ORES by specific judgements (is this edit damaging? Is it intentionally damaging?) over a random sample of edits. That's what the test statistics linked above report.