Table 1 Metrics Reloaded addresses common and rare pitfalls in metric selection, as compiled in ref. ¹⁹

From: Metrics reloaded: recommendations for image analysis validation

Source of pitfall	Addressed in Metrics Reloaded by
Inadequate choice of the problem category
Wrong choice of problem category	Problem category mapping (subprocess S1; Extended Data Fig. 1) as a prerequisite for metric selection.
Disregard of the domain interest
Importance of structure boundaries	FP2.1 - Particular importance of structure boundaries; recommendation to complement common overlap-based segmentation metrics with boundary-based metrics (Fig. 2 and Supplementary Note 2.3) if the property holds.
Importance of structure volume	FP2.2 - Particular importance of structure volume; recommendation to complement common overlap-based and boundary-based segmentation metrics with volume-based metrics (Supplementary Note 2.3) if the property holds.
Importance of structure center(line)	FP2.3 - Particular importance of structure center(line); recommendation of the clDice as alternative to the common DSC or IoU in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of center point-based localization criterion in ObD (subprocess S8; Extended Data Fig. 8) if the property holds.
Importance of confidence awareness	FP2.7.1 - Calibration assessment requested; dedicated recommendations on calibration (Supplementary Note 2.6).
Importance of comparability across datasets	FP4.2 - Provided class prevalences reflect the population of interest; used in the subprocesses S2–S4 (Extended Data Figs. 2–4); general focus on prevalence dependency of metrics in the framework.
Unequal severity of class confusions	FP2.5 - Penalization of errors; recommendation of the so-far uncommon metric EC as a classification metric (subprocess S2; Extended Data Fig. 2); setting β in the F_β score according to whether false positives (FP; oversegmentation) or false negatives (FN; undersegmentation) should be penalized more strongly (see DG3.3 in Supplementary Note 2.7.2).
Importance of cost–benefit analysis	FP2.6 - Decision rule applied to predicted class scores: incorporation of a decision rule that is based on cost–benefit analysis; recommendation of the so-far uncommon metrics NB (Fig. SN 3.11) and EC (Fig. SN 3.6).
Disregard of target structure properties
Small structure sizes	FP3.1 - Small size of structures relative to pixel size; recommendation to consider the problem an ObD problem (Supplementary Note 2.4); complementation of overlap-based segmentation metrics with boundary-based metrics in the case of small structures with noisy reference (subprocess S6; Extended Data Fig. 6); recommendation of lower ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7).
High variability of structure sizes	FP3.2 - High variability of structure sizes; recommendation of lower ObD localization threshold (see DG8.3 in Supplementary Note 2.7.7) and size stratification (Supplementary Note 2.4) in case of size variability.
Complex structure shapes	FP3.3 - Target structures feature tubular shape; recommendation of the clDice as alternative to the common DSC in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of point inside mask/box/approx as localization criterion in ObD if the property holds (subprocess S8; Extended Data Fig. 8).
Occurrence of overlapping or touching structures	FP3.5 - Possibility of overlapping or touching target structures; explicit recommendation to phrase problem as InS rather than SemS problem (Supplementary Note 2.3); recommendation of higher ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7).
Occurrence of disconnected structures	FP3.6 - Possibility of disconnected target structure(s); recommendation of appropriate localization criterion for ObD (DG8.2 in Supplementary Note 2.7.7).
Disregard of dataset properties
High class imbalance	FP4.1 - High class imbalance and FP2.5.5 - compensation for class imbalances requested; compensation of class imbalance via prevalence-independent metrics such as EC and BA.
Small test set size	Recommendation of confidence intervals for all metrics.
Imperfect reference standard: noisy reference standard	FP4.3.1 - High inter-rater variability and FP2.5.7 - compensation for annotation imprecisions requested; default recommendation of the so-far rather uncommon metric NSD to assess the quality of boundaries.
Imperfect reference standard: spatial outliers in reference	FP4.3.2 - Possibility of spatial outliers in reference annotation and FP2.5.6 - handling of spatial outliers; recommendation of outlier-robust metrics, such as NSD in case no distance-based penalization of outliers is requested in segmentation problems.
Occurrence of cases with an empty reference	FP4.6 - Possibility of reference without target structure(s); recommendations for aggregation in the case of empty references according to Supplementary Note 2.4 and Extended Data Table 1.
Disregard of algorithm output properties
Possibility of empty prediction	FP5.2 - Possibility of algorithm output not containing the target structure(s); selection of appropriate aggregation strategy in ObD (Supplementary Note 2.4).
Possibility of overlapping predictions	FP5.4 - Possibility of overlapping predictions; recommendation of an assignment strategy based on IoU > 0.5 if overlapping predictions are not possible and no predicted class scores are available.
Lack of predicted class scores	FP5.1 - Availability of predicted class scores; leveraging class scores for optimizing decision regions (FP2.6) and assessing calibration quality (FP2.7).

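The first column lists all pitfall sources captured by the published taxonomy that relate to either the inadequate choice of the problem category or poor metric selection. The second column summarizes how Metrics Reloaded addresses these pitfalls. The notation FPX.Y refers to a fingerprint item (Supplementary Note 1.3); the notation DGX.Y refers to a decision guide (Supplementary Note 2.7). BA, balanced accuracy; clDice, centerline Dice similarity coefficient; DSC, Dice similarity coefficient; EC, expected cost; InS, instance segmentation; IoU, intersection over union; NB, net benefit; NSD, normalized surface distance; ObD, object detection; SemS, semantic segmentation.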
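
Two of the table entries can be illustrated with a short sketch: the effect of β in the F_β score (unequal severity of class confusions) and the prevalence independence of BA compared with plain accuracy (high class imbalance). The function names and all counts below are hypothetical and chosen only for illustration; this is not the Metrics Reloaded reference implementation, just the standard count-based definitions of these metrics.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta from confusion counts.

    beta > 1 penalizes false negatives more strongly,
    beta < 1 penalizes false positives more strongly.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else float("nan")


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of correct decisions; depends on class prevalence."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of sensitivity and specificity; independent of class prevalence."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    # Hypothetical result with more false positives (20) than false negatives (10).
    for beta in (0.5, 1.0, 2.0):
        print(f"F_{beta}: {f_beta(tp=80, fp=20, fn=10, beta=beta):.3f}")

    # Same hypothetical classifier (sensitivity 0.9, specificity 0.8)
    # evaluated on a balanced and on a highly imbalanced test set.
    balanced = dict(tp=90, fp=20, fn=10, tn=80)      # 100 positives, 100 negatives
    imbalanced = dict(tp=9, fp=198, fn=1, tn=792)    # 10 positives, 990 negatives
    for name, counts in (("balanced", balanced), ("imbalanced", imbalanced)):
        print(name, f"accuracy={accuracy(**counts):.3f}",
              f"BA={balanced_accuracy(**counts):.3f}")
```

In this toy setting, accuracy changes between the two test sets although the classifier's sensitivity and specificity are unchanged, whereas BA stays the same. Likewise, because the example has more false positives than false negatives, the score rises as β grows and false positives are penalized less.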
