Figure 4

Interpretation of binary classification performance on the hold-out test set. (a) Mean ROC curve (blue) and variance (yellow) of the 10 models from condition 4 (see Table 2) evaluated on the test set. Across multiple classification thresholds, the true positive rate (TPR, also known as recall) is plotted against the false positive rate (FPR). The dotted red line indicates a no-skill classifier. The mean area under the ROC curve (AUC) is reported. (b) Mean precision-recall curve (blue) and variance (yellow) of the 10 models from condition 4 (see Table 2) evaluated on the test set. Across multiple classification thresholds, precision is plotted against recall (TPR). The dotted red line indicates a no-skill classifier. The mean average precision (AP) score is reported. Together, the two panels characterize performance under complementary metrics and indicate no detriment in test performance due to the imbalanced dataset.
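
To make the aggregation behind the figure concrete, the following is a minimal sketch of how mean ROC and precision-recall curves with per-point variance could be computed across an ensemble of models, using scikit-learn. The names `y_test` and `model_scores` (one score vector per trained model) are illustrative assumptions, not artifacts from the paper.

```python
# Hypothetical sketch: aggregate ROC and PR curves over several trained models
# by interpolating each model's curve onto a common grid, then taking the
# mean and variance at each grid point. `y_test` and `model_scores` are
# assumed inputs, not names from the original work.
import numpy as np
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve, roc_curve)

def mean_roc(y_test, model_scores, grid_size=100):
    """Mean and variance of TPR on a common FPR grid, plus mean AUC."""
    fpr_grid = np.linspace(0.0, 1.0, grid_size)
    tprs, aucs = [], []
    for scores in model_scores:
        fpr, tpr, _ = roc_curve(y_test, scores)   # fpr is increasing
        tprs.append(np.interp(fpr_grid, fpr, tpr))
        aucs.append(auc(fpr, tpr))
    tprs = np.asarray(tprs)
    return fpr_grid, tprs.mean(axis=0), tprs.var(axis=0), np.mean(aucs)

def mean_pr(y_test, model_scores, grid_size=100):
    """Mean and variance of precision on a common recall grid, plus mean AP."""
    recall_grid = np.linspace(0.0, 1.0, grid_size)
    precisions, aps = [], []
    for scores in model_scores:
        precision, recall, _ = precision_recall_curve(y_test, scores)
        # recall is returned in decreasing order; reverse it for np.interp
        precisions.append(np.interp(recall_grid, recall[::-1], precision[::-1]))
        aps.append(average_precision_score(y_test, scores))
    precisions = np.asarray(precisions)
    return recall_grid, precisions.mean(axis=0), precisions.var(axis=0), np.mean(aps)
```

For reference, the no-skill baseline differs between the two panels: in ROC space it is the diagonal (AUC = 0.5), while in precision-recall space it is a horizontal line at the positive-class prevalence, which is why the PR panel is the more informative view under class imbalance.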