Table 1. Metrics Reloaded addresses common and rare pitfalls in metric selection, as compiled in ref. 19

From: Metrics reloaded: recommendations for image analysis validation

Source of pitfall

Addressed in Metrics Reloaded by

Inadequate choice of the problem category

 Wrong choice of problem category

Problem category mapping (subprocess S1; Extended Data Fig. 1) as a prerequisite for metric selection.

Disregard of the domain interest

 Importance of structure boundaries

FP2.1 - Particular importance of structure boundaries; recommendation to complement common overlap-based segmentation metrics with boundary-based metrics (Fig. 2 and Supplementary Note 2.3) if the property holds.

 Importance of structure volume

FP2.2 - Particular importance of structure volume; recommendation to complement common overlap-based and boundary-based segmentation metrics with volume-based metrics (Supplementary Note 2.3) if the property holds.

 Importance of structure center(line)

FP2.3 - Particular importance of structure center(line); recommendation of the clDice as alternative to the common DSC or IoU in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of center point-based localization criterion in ObD (subprocess S8; Extended Data Fig. 8) if the property holds.

 Importance of confidence awareness

FP2.7.1 - Calibration assessment requested; dedicated recommendations on calibration (Supplementary Note 2.6).

 Importance of comparability across datasets

FP4.2 - Provided class prevalences reflect the population of interest; used in the subprocesses S2–S4 (Extended Data Figs. 2–4); general focus on prevalence dependency of metrics in the framework.

 Unequal severity of class confusions

FP2.5 - Penalization of errors; recommendation of the so-far uncommon metric EC as a classification metric (subprocess S2; Extended Data Fig. 2); setting β in the Fβ score according to preference for FP (oversegmentation) and FN (undersegmentation; see DG3.3 in Supplementary Note 2.7.2).

 Importance of cost–benefit analysis

FP2.6 - Decision rule applied to predicted class scores: incorporation of a decision rule that is based on cost–benefit analysis; recommendation of the so-far uncommon metrics NB (Fig. SN 3.11) and EC (Fig. SN 3.6).

Disregard of target structure properties

 Small structure sizes

FP3.1 - Small size of structures relative to pixel size; recommendation to consider the problem an ObD problem (Supplementary Note 2.4); complementation of overlap-based segmentation metrics with boundary-based metrics in the case of small structures with noisy reference (subprocess S6; Extended Data Fig. 6); recommendation of lower ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7).

 High variability of structure sizes

FP3.2 - High variability of structure sizes; recommendation of lower ObD localization threshold (see DG8.3 in Supplementary Note 2.7.7) and size stratification (Supplementary Note 2.4) in case of size variability.

 Complex structure shapes

FP3.3 - Target structures feature tubular shape; recommendation of the clDice as alternative to the common DSC in segmentation problems (subprocess S6; Extended Data Fig. 6) and recommendation of point inside mask/box/approx as localization criterion in ObD if the property holds (subprocess S8; Extended Data Fig. 8).

 Occurrence of overlapping or touching structures

FP3.5 - Possibility of overlapping or touching target structures; explicit recommendation to phrase problem as InS rather than SemS problem (Supplementary Note 2.3); recommendation of higher ObD localization threshold in case of small sizes (see DG8.3 in Supplementary Note 2.7.7).

 Occurrence of disconnected structures

FP3.6 - Possibility of disconnected target structure(s); recommendation of appropriate localization criterion for ObD (DG8.2 in Supplementary Note 2.7.7).

Disregard of dataset properties

 High class imbalance

FP4.1 - High class imbalance and FP2.5.5 - compensation for class imbalances requested; compensation of class imbalance via prevalence-independent metrics such as EC and BA.

 Small test set size

Recommendation of confidence intervals for all metrics.

 Imperfect reference standard: noisy reference standard

FP4.3.1 - High inter-rater variability and FP2.5.7 - compensation for annotation imprecisions requested; default recommendation of the so-far rather uncommon metric NSD to assess the quality of boundaries.

 Imperfect reference standard: spatial outliers in reference

FP4.3.2 - Possibility of spatial outliers in reference annotation and FP2.5.6 - handling of spatial outliers; recommendation of outlier-robust metrics, such as NSD in case no distance-based penalization of outliers is requested in segmentation problems.

 Occurrence of cases with an empty reference

FP4.6 - Possibility of reference without target structure(s); recommendations for aggregation in the case of empty references according to Supplementary Note 2.4 and Extended Data Table 1.

Disregard of algorithm output properties

 Possibility of empty prediction

FP5.2 - Possibility of algorithm output not containing the target structure(s); selection of appropriate aggregation strategy in ObD (Supplementary Note 2.4).

 Possibility of overlapping predictions

FP5.4 - Possibility of overlapping predictions; recommendation of an assignment strategy based on IoU > 0.5 if overlapping predictions are not possible and no predicted class scores are available.

 Lack of predicted class scores

FP5.1 - Availability of predicted class scores; leveraging class scores for optimizing decision regions (FP2.6) and assessing calibration quality (FP2.7).

The first column lists all pitfall sources captured by the published taxonomy that relate to either the inadequate choice of the problem category or poor metric selection. The second column summarizes how Metrics Reloaded addresses these pitfalls. The notation FPX.Y refers to a fingerprint item (Supplementary Note 1.3).
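
Two of the recommendations in the table can be illustrated with a few lines of code: setting β in the Fβ score to express a preference between FP and FN (FP2.5), and using a prevalence-independent metric such as BA under class imbalance (FP4.1). The following is a minimal sketch, assuming binary confusion-matrix counts are already available. The function names and the example counts are hypothetical and are not taken from the Metrics Reloaded framework or its accompanying software.

```python
# Illustrative sketch only: helper names and counts are assumptions,
# not part of the Metrics Reloaded framework.

def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from counts.

    beta > 1 weights recall more heavily (FN are penalized more);
    beta < 1 weights precision more heavily (FP are penalized more).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    if denom == 0:
        return float("nan")  # undefined when there are no TP, FP or FN
    return (1 + b2) * tp / denom


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Plain accuracy; depends on class prevalence."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of sensitivity and specificity; independent of class prevalence."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Hypothetical counts: precision 0.8, recall 0.67.
    for beta in (0.5, 1.0, 2.0):
        print(f"F_{beta}: {f_beta(tp=8, fp=2, fn=4, beta=beta):.3f}")

    # Same per-class error rates, but ten times as many negative cases.
    print("accuracy:", accuracy(8, 2, 4, 6), accuracy(8, 20, 4, 60))
    print("BA:      ", balanced_accuracy(8, 2, 4, 6), balanced_accuracy(8, 20, 4, 60))
```

In this example, precision exceeds recall, so the score drops as β grows and FN weigh more heavily. Multiplying the number of negative cases by ten changes accuracy but leaves BA unchanged, which is the property the table refers to as prevalence independence.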