ORES/Model info/Statistics

This page contains an overview of the model fitness statistics that ORES presents with classifier models.

1 Custom documentation of metrics

1.1 Example scenario

Let's assume a total of 100 edits, of which 35 are damaging – an unrealistically high ratio of damaging edits, but useful for illustration purposes – leaving us with the following labels (or actual values): 35 positives and 65 negatives, as visualized in Figure 1.1, where each edit is represented by one editor.

[Figure 1.1: Total of 100 edits, represented by 100 editors, divided into actual positives in green and actual negatives in red.]

A binary classifier might now predict 40 positives, of which 30 actually are positive, and 60 negatives, of which 55 actually are negative. This also means that 10 non-damaging edits have been predicted to be damaging, and 5 damaging edits have been predicted not to be damaging. Figure 1.2 illustrates this state by marking predicted positives with a hazard symbol and predicted negatives with a sun symbol.

Referring to the confusion matrix, we have

• 30 true positives (damaging edits correctly predicted to be damaging)
• 5 false negatives (damaging edits wrongly predicted to be good)
• 55 true negatives (non-damaging edits correctly predicted to be good)
• 10 false positives (non-damaging edits wrongly predicted to be damaging)

[Figure 1.2: Edits divided into TP, FN, TN and FP.]

We will get back to this example scenario in the definitions of the metrics.

1.2 Confusion Matrix

As we are faced with a binary damaging classifier, there are four different classification cases:

1. Correctly classifying an edit as damaging – a true positive
2. Wrongly classifying an edit as damaging – a false positive
3. Correctly classifying an edit as good – a true negative
4. Wrongly classifying an edit as good – a false negative

A popular representation of these cases is the confusion matrix, such as the one in Figure 1.3. Throughout this documentation, the abbreviations TP, FP, TN and FN will be used to denote the four cases.

[Figure 1.3: Confusion matrix of a binary classifier. Predicted positives in red, predicted negatives in blue, consistent with PreCall's design.]

1.3 Metrics Overview

By performing optimization queries, we can tell ORES that we want a specific metric to be greater than or equal to, or less than or equal to, a specified value while maximizing or minimizing a second one. The following table gives a quick definition of each metric and, where possible, its value in terms of the confusion matrix:

Metric       Quick definition                                   Value
recall       Ability to find all relevant cases                 TP / (TP + FN)
precision    Ability to find only relevant cases                TP / (TP + FP)
f1           Harmonic mean of recall and precision              2 · rec · prec / (rec + prec)
fpr          Probability of a false alarm                       FP / (FP + TN)
rocauc       Measure of classification performance              –
prauc        Measure of classification performance              –
accuracy     Portion of correctly predicted data                (TP + TN) / Total
matchrate    Portion of observations predicted to be positive   (TP + FP) / Total
filterrate   Portion of observations predicted to be negative   1 − matchrate = (TN + FN) / Total
!recall      Negated recall                                     TN / (TN + FP)
!precision   Negated precision                                  TN / (TN + FN)
!f1          Negated f1                                         2 · !rec · !prec / (!rec + !prec)
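To make the overview concrete, here is a minimal Python sketch (our own illustration, not part of ORES or its API; the variable names are ours) that computes the point metrics of the table from the four confusion-matrix counts of the example scenario:

```python
# Confusion-matrix counts from the example scenario (Figure 1.2).
TP, FN = 30, 5    # of the 35 actually damaging edits
TN, FP = 55, 10   # of the 65 actually good edits
total = TP + FN + TN + FP  # 100 edits in total

recall = TP / (TP + FN)                              # 30/35  ≈ 0.86
precision = TP / (TP + FP)                           # 30/40  = 0.75
f1 = 2 * recall * precision / (recall + precision)   # 0.8
fpr = FP / (FP + TN)                                 # 10/65  ≈ 0.15
accuracy = (TP + TN) / total                         # 85/100 = 0.85
match_rate = (TP + FP) / total                       # 40/100 = 0.4
filter_rate = 1 - match_rate                         # 0.6, equals (TN + FN) / total

# rocauc and prauc cannot be computed from counts alone; they need
# per-edit scores (see the curve sketch further below).

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f} fpr={fpr:.2f}")
print(f"accuracy={accuracy:.2f} matchrate={match_rate:.2f} filterrate={filter_rate:.2f}")
```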
1.4 Detailed definition of metrics

recall

• Recall (TP / (TP + FN)), also called the true positive rate (tpr) or "sensitivity" of a model, is the ability of that model to find all relevant cases within the dataset.
• To us, a relevant case means a damaging edit. The model's ability to identify those is given by the ratio of actual positives that are predicted as such:

  recall = TP / (TP + FN)

• In terms of the numbers of our example, that is 30 / (30 + 5) ≈ 0.86.

precision

• Precision (TP / (TP + FP)), also called the positive predictive value, is the ability of the model to find only relevant cases within the dataset.
• We are interested in how good the model is at only predicting edits to be damaging that actually are. Therefore, we want the ratio of true positives to all edits predicted to be positive:

  precision = TP / (TP + FP) = 30 / (30 + 10) = 0.75

f1

• The f1 score, the harmonic mean of recall and precision, is a metric from 0 (worst) to 1 (best) that serves as a combined evaluation metric.
• It is defined by

  f1 = 2 · precision · recall / (precision + recall)

  Note that, unlike the arithmetic mean of recall and precision, the harmonic mean punishes extreme values.
• Referring to the example scenario, we get

  f1 = 2 · 0.75 · (30/35) / (0.75 + 30/35) = 0.8

fpr

• The false positive rate (FP / (FP + TN)) answers the question "what portion of all actual negatives is wrongly predicted to be positive?" and can be described as the probability of a false alarm.
• In our example, a false alarm would be predicting an edit to be damaging when it isn't. As a result we get

  fpr = FP / (FP + TN) = 10 / (10 + 55) ≈ 0.15

rocauc

• The area under the ROC curve, a measure between 0.5 (worthless) and 1.0 (perfect: no false positives), can be described as the probability of ranking a randomly chosen positive higher than a randomly chosen negative, and serves as a measure of classification performance.
• The receiver operating characteristic (ROC) curve itself is used to visualize the performance of a classifier: it plots the tpr versus the fpr as a function of the model's threshold for classifying an observation as positive.
• Assuming we used a threshold of 0.5 to get the previous results, one point on our ROC curve would be (fpr, tpr) = (0.15, 0.86). Repeating this for every threshold of interest yields the ROC curve; the area under the curve (auc) is a way of quantifying its performance.

prauc

• Similarly to rocauc, the area under the precision-recall curve evaluates a classifier's performance. The main difference is that the PR curve plots precision versus recall and does not make use of true negatives. It is therefore preferable to use prauc over rocauc if true negatives are unimportant to the problem at hand, or if there are many more negatives than positives, since in the latter case differences between models are more noticeable in the absence of a vast number of true negatives.
• The point on the PR curve of our example, for the standard threshold of 0.5, is (precision, recall) = (0.75, 0.86). To construct the PR curve, this would be done for every threshold of interest. Again, calculating the area under the curve quantifies the curve's, and thereby the model's, performance; the sketch below shows how both curves and their areas can be computed.
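As an illustration of how the two curve metrics are obtained in practice, here is a small sketch using scikit-learn; the labels and probability scores below are invented purely for demonstration and have nothing to do with any real ORES model.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

# Hypothetical data: 1 = damaging, 0 = good, with made-up model scores.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.95, 0.85, 0.70, 0.65, 0.60, 0.40, 0.30, 0.25, 0.20, 0.10])

# ROC curve: one (fpr, tpr) point per threshold; rocauc is the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
rocauc = roc_auc_score(y_true, y_score)

# PR curve: one (recall, precision) point per threshold; prauc is the area
# under it. True negatives never enter these formulas.
precision, recall, _ = precision_recall_curve(y_true, y_score)
prauc = auc(recall, precision)

print(f"rocauc = {rocauc:.3f}, prauc = {prauc:.3f}")
```

scikit-learn's average_precision_score is a common step-wise alternative to the trapezoidal PR-curve area computed here.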
accuracy

• Accuracy ((TP + TN) / Total) measures the portion of correctly predicted data: positives and negatives alike.
• In the example, this is the proportion of correctly predicted damaging edits plus correctly predicted non-damaging edits to the total number of edits:

  accuracy = (TP + TN) / Total = (30 + 55) / (35 + 65) = 0.85

matchrate

• The match rate ((TP + FP) / Total) is the portion of observations predicted to be positive.
• Concerning our damaging classifier, this amounts to the ratio of edits predicted to be damaging:

  matchrate = (TP + FP) / Total = (30 + 10) / (35 + 65) = 0.4

filterrate

• The filter rate (1 − matchrate = (TN + FN) / Total) is the portion of observations predicted to be negative. It is the complement of the match rate.
• In the example, the filter rate describes the ratio of edits predicted not to be damaging:

  filterrate = 1 − matchrate = (TN + FN) / Total = (55 + 5) / (35 + 65) = 0.6

!

• Any metric with an exclamation mark is the same metric computed for the negative class:
  – !recall = TN / (TN + FP), the ability of a model to predict all negatives as such
  – !precision = TN / (TN + FN), the ability of a model to only predict negatives as such
  – !f1 = 2 · !rec · !prec / (!rec + !prec), the harmonic mean of !recall and !precision
• Note that these metrics are also particularly useful for multi-class classifiers, as they permit queries to reference all but one class; e.g., in the ORES itemquality model, the recall for all classes except the "E" class comes down to the !recall of the "E" class. A sketch of how the negated metrics relate to standard library calls follows below.
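The negated metrics need no special implementation: they are the ordinary metrics with the roles of the two classes swapped. A minimal sketch, assuming scikit-learn and reconstructing the label vectors of the example scenario (the snippet is our own, not ORES code):

```python
from sklearn.metrics import recall_score, precision_score, f1_score

# Rebuild the example scenario: 35 damaging (1) and 65 good (0) edits, ...
y_true = [1] * 35 + [0] * 65
# ... predicted so that TP = 30, FN = 5, TN = 55, FP = 10.
y_pred = [1] * 30 + [0] * 5 + [0] * 55 + [1] * 10

# Standard metrics treat "damaging" (1) as the positive class ...
recall = recall_score(y_true, y_pred)                         # 30/35 ≈ 0.86
# ... while the negated metrics simply treat "good" (0) as positive.
neg_recall = recall_score(y_true, y_pred, pos_label=0)        # TN/(TN+FP) = 55/65 ≈ 0.85
neg_precision = precision_score(y_true, y_pred, pos_label=0)  # TN/(TN+FN) = 55/60 ≈ 0.92
neg_f1 = f1_score(y_true, y_pred, pos_label=0)                # harmonic mean ≈ 0.88

print(f"!recall={neg_recall:.2f} !precision={neg_precision:.2f} !f1={neg_f1:.2f}")
```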