ORES/Model info/Statistics

This page contains an overview of the model fitness statistics that ORES presents with classifier models.

Custom documentation of metrics

Example scenario

Let’s assume a total of 100 edits, of which 35 are damaging – an unrealistically high ratio of damaging edits, but useful for illustration purposes. This leaves us with the following labels (or actual values): 35 positives and 65 negatives, as visualized in Figure 1.1, where each edit is represented by one editor.


Figure 1.1: Total of 100 edits, represented by 100 editors, divided into actual positives in green and actual negatives in red.

A binary classifier might now predict 40 positives, of which 30 actually are positive, and 60 negatives, of which 55 actually are negative.

This also means that 10 non-damaging edits have been predicted to be damaging, and 5 damaging edits have been predicted not to be damaging. Figure 1.2 illustrates this state by marking predicted positives with a hazard symbol and predicted negatives with a sun symbol. Referring to the confusion matrix, we have:

  • 30 true positives (damaging edits correctly predicted as damaging)
  • 5 false negatives (damaging edits wrongly predicted as non-damaging)
  • 55 true negatives (non-damaging edits correctly predicted as non-damaging)
  • 10 false positives (non-damaging edits wrongly predicted as damaging)

Figure 1.2: Edits divided into TP, FN, TN and FP.

We will get back to this example scenario in the definitions of the metrics.
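The counts above can be reproduced with a few lines of Python. The following is only an illustrative sketch: the encoding (1 = damaging, 0 = non-damaging) and the hand-built label lists are assumptions chosen to mirror the example, not anything ORES produces.

```python
# Illustrative sketch of the example scenario: 1 = damaging, 0 = non-damaging.
actual    = [1] * 35 + [0] * 65                        # 35 damaging, 65 non-damaging edits
predicted = [1] * 30 + [0] * 5 + [1] * 10 + [0] * 55   # 30 TP, 5 FN, then 10 FP, 55 TN

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives

print(tp, fn, tn, fp)  # 30 5 55 10
```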

Confusion Matrix

Since we are dealing with a binary damaging classifier, there are four different classification cases:

  1. Correctly classifying an edit as damaging – a true positive
  2. Wrongly classifying an edit as damaging – a false positive
  3. Correctly classifying an edit as good – a true negative
  4. Wrongly classifying an edit as good – a false negative

A popular way of representing these cases is a confusion matrix such as the one in Figure 1.3. Throughout this documentation, the abbreviations TP, FP, TN and FN will be used to denote the four cases.

Figure 1.3: Confusion matrix of a binary classifier. Predicted positives in red, predicted negatives in blue, consistent with PreCall’s design.
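As a sketch of how such a matrix can be computed in practice, the snippet below uses scikit-learn's confusion_matrix on the example data. scikit-learn is an assumption here, not part of ORES, and the label lists are the illustrative ones from the example scenario.

```python
# Sketch: confusion matrix of the example with scikit-learn (assumed installed).
# Encoding: 1 = damaging (positive), 0 = non-damaging (negative).
from sklearn.metrics import confusion_matrix

actual    = [1] * 35 + [0] * 65                        # 35 damaging, 65 non-damaging
predicted = [1] * 30 + [0] * 5 + [1] * 10 + [0] * 55   # 30 TP, 5 FN, 10 FP, 55 TN

# With labels=[1, 0], rows are actual classes and columns are predicted classes:
# [[TP, FN],
#  [FP, TN]]
print(confusion_matrix(actual, predicted, labels=[1, 0]))
# [[30  5]
#  [10 55]]
```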

Metrics Overview

By performing optimization queries, we can tell ORES that we want a specific metric to be greater than or equal to, or less than or equal to, a specified value while maximizing or minimizing a second metric. The following table gives a quick definition for each metric and, where possible, its value in terms of the confusion matrix:

Metric       | Quick Definition                                  | Value
recall       | Ability to find all relevant cases                | TP/(TP+FN)
precision    | Ability to find only relevant cases               | TP/(TP+FP)
f1           | Harmonic mean of recall and precision             | 2·precision·recall/(precision+recall)
fpr          | Probability of a false alarm                      | FP/(FP+TN)
rocauc       | Measure of classification performance             | –
prauc        | Measure of classification performance             | –
accuracy     | Portion of correctly predicted data               | (TP+TN)/Total
match_rate   | Portion of observations predicted to be positive  | (TP+FP)/Total
filter_rate  | Portion of observations predicted to be negative  | 1−match_rate = (TN+FN)/Total
!recall      | Negated recall                                    | TN/(TN+FP)
!precision   | Negated precision                                 | TN/(TN+FN)
!f1          | Negated f1                                        | 2·!precision·!recall/(!precision+!recall)
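The simple ratio metrics in this table can be computed directly from the four confusion-matrix counts. The sketch below does this in plain Python for the example scenario; it is an illustration only, not ORES's implementation, and the function name and output format are made up for this page.

```python
# Illustrative sketch: compute the table's ratio metrics from TP, FP, TN, FN.
def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    n_recall = tn / (tn + fp)        # !recall
    n_precision = tn / (tn + fn)     # !precision
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / total,
        "match_rate": (tp + fp) / total,
        "filter_rate": (tn + fn) / total,
        "!recall": n_recall,
        "!precision": n_precision,
        "!f1": 2 * n_precision * n_recall / (n_precision + n_recall),
    }

# Example scenario: 30 TP, 10 FP, 55 TN, 5 FN.
for name, value in metrics(tp=30, fp=10, tn=55, fn=5).items():
    print(f"{name:12s} {value:.3f}")
# recall ≈ 0.857, precision = 0.750, f1 = 0.800, fpr ≈ 0.154, accuracy = 0.850,
# match_rate = 0.400, filter_rate = 0.600, !recall ≈ 0.846, !precision ≈ 0.917, !f1 ≈ 0.880
```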

Detailed definition of metrics

recall

Recall (TP/(TP+FN)), also called the true positive rate (tpr) or “sensitivity” of a model, is the ability of that model to find all relevant cases within the dataset. To us, a relevant case means a damaging edit: the ability of the model to identify those is the ratio of actual positives that are predicted as such. In terms of the numbers of our example, that would be 30/(30 + 5) ≈ 0.86.

precision

Precision (TP/(TP+FP)), also known as the positive predictive value of a model, is the ability of the model to find only relevant cases within the dataset. We are interested in how good the model is at predicting only those edits to be damaging that actually are. Therefore, we want the ratio of true positives to all edits predicted to be positive: 30/(30 + 10) = 0.75.

f1

The f1-score, the harmonic mean of recall and precision, is a metric from 0 (worst) to 1 (best) that serves as an accuracy evaluation metric. It is defined as 2·precision·recall/(precision+recall). Note that, unlike the arithmetic mean of recall and precision, the harmonic mean punishes extreme values. Referring to the example scenario, we get 2 · 0.75 · (30/35) / (0.75 + 30/35) = 0.8.
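To see the “punishes extreme values” point concretely, the following sketch compares the harmonic and arithmetic means; apart from the example scenario's 0.75 and 30/35, the numbers are made up for illustration.

```python
# Illustrative comparison of harmonic mean (f1) versus arithmetic mean.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 30 / 35))     # ≈ 0.80, the example scenario
print((0.75 + 30 / 35) / 2)  # ≈ 0.80, arithmetic mean is similar for balanced values
print(f1(1.0, 0.1))          # ≈ 0.18, dragged down by the very low recall
print((1.0 + 0.1) / 2)       # = 0.55, arithmetic mean hides the very low recall
```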

fpr

The false positive rate (FP/(FP+TN)) answers the question “what portion of all actual negatives is wrongly predicted as positive?” and can be described as the probability of a false alarm. In our example, a false alarm would be predicting an edit as damaging that isn’t. As a result we get 10/(10 + 55) ≈ 0.15.

rocauc

The area under the ROC curve, a measure between 0.5 (worthless) and 1.0 (perfect: no false positives), can be described as the probability of ranking a random positive higher than a random negative and serves as a measure of classification performance. The receiver operating characteristic (ROC) curve itself is used to visualize the performance of a classifier, plotting the tpr versus the fpr as a function of the model’s threshold for classifying a positive. Assuming that a threshold of 0.5 gave us the previous results, one point on our ROC curve would be (fpr, tpr) = (0.15, 0.86). Doing this for every threshold of interest results in the ROC curve; the area under the curve (auc) is a way of quantifying its performance.
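The sketch below shows how a ROC curve and its area could be computed with scikit-learn. The probability scores are randomly generated stand-ins for real model scores, so the printed AUC value is illustrative only.

```python
# Sketch of rocauc with scikit-learn (assumed available) and made-up scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = np.array([1] * 35 + [0] * 65)                # 35 damaging, 65 non-damaging
y_score = np.concatenate([rng.uniform(0.4, 1.0, 35),  # positives tend to score higher
                          rng.uniform(0.0, 0.6, 65)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one (fpr, tpr) point per threshold
print(roc_auc_score(y_true, y_score))                 # area under the ROC curve
```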

prauc

Similarly to rocauc, the area under the precision-recall curve evaluates a classifier’s performance. The main difference, however, is that the PR curve plots precision versus recall and does not make use of true negatives. It is therefore preferable to use prauc over rocauc if true negatives are unimportant to the general problem, or if there are a lot more negatives than positives, since in the latter case differences between models are more noticeable in the absence of a vast amount of true negatives. The point on the PR curve of our example for the standard threshold of 0.5 is (precision, recall) = (0.75, 0.86). To construct the PR curve, it would be necessary to do this for every threshold of interest. Again, calculating the area under it is a way to quantify the curve’s, and therefore the model’s, performance.
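A corresponding sketch for prauc, again with scikit-learn and the same kind of made-up scores, so the printed value is illustrative only:

```python
# Sketch of prauc with scikit-learn (assumed available) and made-up scores.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(0)
y_true = np.array([1] * 35 + [0] * 65)                # 35 damaging, 65 non-damaging
y_score = np.concatenate([rng.uniform(0.4, 1.0, 35),  # positives tend to score higher
                          rng.uniform(0.0, 0.6, 65)])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                         # area under the precision-recall curve
```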

accuracy

Accuracy ((TP+TN)/Total) measures the ratio of correctly predicted data, positives and negatives alike. In the example, this is the proportion of correctly predicted damaging edits and correctly predicted non-damaging edits to the total number of edits, and is given by (30 + 55)/(35 + 65) = 0.85.

match_rate

The match rate ((TP+FP)/Total) is the ratio of observations predicted to be positive. For our damaging classifier, this is the ratio of edits predicted to be damaging, which is given by (30 + 10)/(35 + 65) = 0.4.

filter_rate

The filter rate (1 − match_rate = (TN+FN)/Total) is the ratio of observations predicted to be negative; it is the complement of the match rate. In the example, the filter rate describes the ratio of edits predicted to be non-damaging, given by 1 − match_rate = (55 + 5)/(35 + 65) = 0.6.

!<metric>

Any metric prefixed with an exclamation mark is the same metric computed for the negative class:

  • !recall = TN/(TN+FP), the ability of a model to predict all negatives as such
  • !precision = TN/(TN+FN), the ability of a model to only predict negatives as such
  • !f1 = 2·!recall·!precision/(!recall+!precision), the harmonic mean of !recall and !precision

Note that these metrics are also particularly useful for multi-class classifiers, as they permit queries to reference all but one class; e.g., in the ORES itemquality model, the recall for all classes except the “E” class comes down to the !recall of the “E” class.
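As a toy sketch of that idea (the labels below are made up and are not itemquality data), the !recall of the “E” class can be computed by collapsing everything that is not “E” into a single negative class and taking the recall of that class:

```python
# Toy sketch: !recall of class "E" = recall of the collapsed "not E" class.
actual    = ["A", "B", "C", "E", "E", "A", "C", "B", "E", "A"]
predicted = ["A", "B", "E", "E", "C", "A", "C", "B", "E", "B"]

not_e_actual    = [label != "E" for label in actual]     # True where the edit is actually "not E"
not_e_predicted = [label != "E" for label in predicted]  # True where it is predicted "not E"

tp = sum(a and p for a, p in zip(not_e_actual, not_e_predicted))      # "not E" items found
fn = sum(a and not p for a, p in zip(not_e_actual, not_e_predicted))  # "not E" items missed
print(tp / (tp + fn))  # !recall of "E"; ≈ 0.86 for these toy labels
```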