Table 1. Metrics Reloaded addresses common and rare pitfalls in metric selection, as compiled in ref. 19
From: Metrics reloaded: recommendations for image analysis validation
Source of pitfall | Addressed in Metrics Reloaded by |
---|---|
Inadequate choice of the problem category | |
Wrong choice of problem category | Problem category mapping (subprocess S1; Extended Data Fig. 1) as a prerequisite for metric selection. |
Disregard of the domain interest | |
Importance of structure boundaries | FP2.1 - Particular importance of structure boundaries; recommendation to complement common overlap-based segmentation metrics with boundary-based metrics (Fig. 2 and Supplementary Note 2.3) if the property holds. |
Importance of structure volume | FP2.2 - Particular importance of structure volume; recommendation to complement common overlap-based and boundary-based segmentation metrics with volume-based metrics (Supplementary Note 2.3) if the property holds. |
Importance of structure center(line) | FP2.3 - Particular importance of structure center(line); recommendation of the clDice as alternative to the common DSC or IoU in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of center point-based localization criterion in ObD (subprocess S8; Extended Data Fig. 8) if the property holds. |
Importance of confidence awareness | FP2.7.1 - Calibration assessment requested; dedicated recommendations on calibration (Supplementary Note 2.6). |
Importance of comparability across datasets | FP4.2 - Provided class prevalences reflect the population of interest; used in the subprocesses S2–S4 (Extended Data Figs. 2–4); general focus on prevalence dependency of metrics in the framework. |
Unequal severity of class confusions | FP2.5 - Penalization of errors; recommendation of the so-far uncommon metric EC as a classification metric (subprocess S2; Extended Data Fig. 2); setting β in the Fβ score according to the preference between FP (oversegmentation) and FN (undersegmentation) errors (see DG3.3 in Supplementary Note 2.7.2). |
Importance of cost–benefit analysis | FP2.6 - Decision rule applied to predicted class scores: incorporation of a decision rule that is based on cost–benefit analysis; recommendation of the so-far uncommon metrics NB (Fig. SN 3.11) and EC (Fig. SN 3.6). |
Disregard of target structure properties | |
Small structure sizes | FP3.1 - Small size of structures relative to pixel size; recommendation to consider the problem an ObD problem (Supplementary Note 2.4); complementation of overlap-based segmentation metrics with boundary-based metrics in the case of small structures with noisy reference (subprocess S6; Extended Data Fig. 6); recommendation of lower ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7). |
High variability of structure sizes | FP3.2 - High variability of structure sizes; recommendation of lower ObD localization threshold (see DG8.3 in Supplementary Note 2.7.7) and size stratification (Supplementary Note 2.4) in case of size variability. |
Complex structure shapes | FP3.3 - Target structures feature tubular shape; recommendation of the clDice as alternative to the common DSC in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of point inside mask/box/approx as localization criterion in ObD if the property holds (subprocess S8; Extended Data Fig. 8). |
Occurrence of overlapping or touching structures | FP3.5 - Possibility of overlapping or touching target structures; explicit recommendation to phrase problem as InS rather than SemS problem (Supplementary Note 2.3); recommendation of higher ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7). |
Occurrence of disconnected structures | FP3.6 - Possibility of disconnected target structure(s); recommendation of appropriate localization criterion for ObD (DG8.2 in Supplementary Note 2.7.7). |
Disregard of dataset properties | |
High class imbalance | FP4.1 - High class imbalance and FP2.5.5 - compensation for class imbalances requested; compensation of class imbalance via prevalence-independent metrics such as EC and BA. |
Small test set size | Recommendation of confidence intervals for all metrics. |
Imperfect reference standard: noisy reference standard | FP4.3.1 - High inter-rater variability and FP2.5.7 - compensation for annotation imprecisions requested; default recommendation of the so-far rather uncommon metric NSD to assess the quality of boundaries. |
Imperfect reference standard: spatial outliers in reference | FP4.3.2 - Possibility of spatial outliers in reference annotation and FP2.5.6 - handling of spatial outliers; recommendation of outlier-robust metrics, such as NSD, in segmentation problems when no distance-based penalization of outliers is requested. |
Occurrence of cases with an empty reference | FP4.6 - Possibility of reference without target structure(s); recommendations for aggregation in the case of empty references according to Supplementary Note 2.4 and Extended Data Table 1. |
Disregard of algorithm output properties | |
Possibility of empty prediction | FP5.2 - Possibility of algorithm output not containing the target structure(s); selection of appropriate aggregation strategy in ObD (Supplementary Note 2.4). |
Possibility of overlapping predictions | FP5.4 - Possibility of overlapping predictions; recommendation of an assignment strategy based on IoU > 0.5 if overlapping predictions are not possible and no predicted class scores are available. |
Lack of predicted class scores | FP5.1 - Availability of predicted class scores; leveraging class scores for optimizing decision regions (FP2.6) and assessing calibration quality (FP2.7). |
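
Three of the recommendations in the table lend themselves to a short numerical illustration: choosing β in the Fβ score to weight FN against FP, using BA as a prevalence-independent metric under class imbalance, and reporting confidence intervals when the test set is small. The following is a minimal sketch under stated assumptions, not the framework's reference implementation. The function names (`confusion_counts`, `f_beta`, `balanced_accuracy`, `bootstrap_ci`) are hypothetical, labels are assumed to be binary (0/1), and the percentile bootstrap is one common choice of interval that the table itself does not prescribe.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score; beta > 1 penalizes FN more, beta < 1 penalizes FP more."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else float("nan")


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; NaN if one class is absent."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        return float("nan")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed, illustrative choice)."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        # Resample cases with replacement, keeping label/prediction pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        v = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if v == v:  # drop NaN resamples (e.g. a class is missing)
            values.append(v)
    if not values:
        return float("nan"), float("nan")
    values.sort()
    lo = values[int((alpha / 2) * (len(values) - 1))]
    hi = values[int((1 - alpha / 2) * (len(values) - 1))]
    return lo, hi


if __name__ == "__main__":
    # Toy data (made up for illustration): 4 positives, 6 negatives.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
    for beta in (0.5, 1.0, 2.0):
        print("F-beta, beta =", beta, "->", f_beta(y_true, y_pred, beta))
    print("BA ->", balanced_accuracy(y_true, y_pred))
    print("95% CI for BA ->", bootstrap_ci(y_true, y_pred, balanced_accuracy))
```

With only ten toy cases the bootstrap interval is wide, which is the reason the table recommends reporting confidence intervals for small test sets.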