Fig. 5: P2.3: disregard of the properties of the dataset.
From: Understanding metric-related pitfalls in image analysis validation

a, High class imbalance. When classes are underrepresented, common metrics may yield misleading values. In the given example, accuracy and BA yield high scores despite the large number of FP samples. The class imbalance is uncovered only by metrics that take predictive values into account (here, MCC). This pitfall is also relevant for other counting and multi-threshold metrics, such as AUROC, EC (depending on the chosen costs), LR+, NB, sensitivity, specificity and weighted Cohen's kappa.

b, Small test set size. The values of the expected calibration error (ECE) depend on the sample size. Even for a simulated perfectly calibrated model, the ECE will be substantially greater than zero for small sample sizes (ref. 21).

c, Imperfect reference standard. A single erroneously annotated pixel can lead to a large decrease in measured performance, especially in the case of the Hausdorff distance (HD) when applied to small structures. The HD 95th percentile (HD95), in contrast, was designed to be robust to spatial outliers. This pitfall is also relevant for localization criteria such as box/approx IoU and point inside box/approx.
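The effect described in panel a can be sketched numerically. The confusion-matrix counts below are hypothetical (not taken from the figure): a rare positive class with 10 true positives among 1,010 samples and 50 false positives. Accuracy and balanced accuracy stay high, while MCC, which incorporates predictive values, drops sharply.

```python
import math

# Hypothetical counts for a rare-positive dataset (illustrative only):
# 10 positives among 1,010 samples; 50 false positives flood the predictions.
tp, fn, fp, tn = 9, 1, 50, 950

accuracy = (tp + tn) / (tp + tn + fp + fn)            # ~0.95: looks excellent
sensitivity = tp / (tp + fn)                          # 0.90
specificity = tn / (tn + fp)                          # 0.95
balanced_accuracy = (sensitivity + specificity) / 2   # ~0.93: also looks excellent

# MCC uses all four cells, so the many FPs pull it down (~0.36).
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```

With these counts, precision is only 9/59 ≈ 0.15, which is exactly the information accuracy and BA ignore and MCC picks up.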
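Panel b's sample-size effect can be reproduced by simulation. The sketch below assumes a standard equal-width 10-bin ECE estimator and a perfectly calibrated model (labels drawn as Bernoulli trials of the predicted probabilities); neither the bin count nor the sample sizes come from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def ece(n, n_bins=10):
    """Expected calibration error of a simulated perfectly calibrated model."""
    p = rng.uniform(size=n)                      # predicted probabilities
    y = (rng.uniform(size=n) < p).astype(float)  # labels ~ Bernoulli(p): calibrated
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # bin weight times |observed frequency - mean confidence|
            err += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return err

ece_small = ece(50)       # substantially above zero despite perfect calibration
ece_large = ece(100_000)  # close to zero
```

Even though the model is calibrated by construction, per-bin binomial noise makes the small-sample ECE substantially positive, which is the pitfall the panel illustrates.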
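Panel c can likewise be sketched with a toy example. The point sets below are illustrative, not the figure's masks: a reference boundary and a prediction that matches it exactly except for one stray false-positive pixel. The HD is dominated by that single pixel, while the HD95 (here computed as the maximum of the two directed 95th-percentile distances, one common convention) discards it.

```python
import numpy as np

def directed_dists(a, b):
    # For each point in a, the distance to its nearest point in b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def hausdorff(a, b, percentile=100):
    # Symmetric (percentile) Hausdorff distance between point sets a and b.
    return max(np.percentile(directed_dists(a, b), percentile),
               np.percentile(directed_dists(b, a), percentile))

ref = np.array([(i, 0) for i in range(40)], dtype=float)  # reference boundary pixels
pred = np.vstack([ref, [(100.0, 0.0)]])                   # one stray FP pixel

hd = hausdorff(pred, ref)         # 61.0: driven entirely by the outlier
hd95 = hausdorff(pred, ref, 95)   # 0.0: outlier falls above the 95th percentile
```

For small structures the effect is even harsher: with fewer boundary pixels, a single mislabeled pixel makes up a larger fraction of the distance distribution.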