Fig. 6: P3: pitfalls related to poor metric application.
From: Understanding metric-related pitfalls in image analysis validation

a, Non-standardized metric implementation. For the average precision (AP) metric and the construction of the precision–recall (PR) curve, the strategy for handling identical confidence (Conf.) scores (here, a score of 0.80 appears twice) has a substantial impact on the metric value. Microsoft COCO (ref. 11) and CityScapes (ref. 23) are used as examples. b, Non-independence of test cases. The number of images taken from Patient 1 is much higher than the numbers taken from Patients 2–5. Averaging over all Dice similarity coefficient (DSC) values (∅) results in a high average score. Aggregating metric values per patient reveals much higher scores for Patient 1 than for the others, a difference hidden by simple aggregation. c, Uninformative visualization. A single box plot (left) does not give sufficient information about the distribution of raw metric values. Adding the raw metric values as jittered dots on top (right) adds important information (here, on clusters). In the case of non-independent validation data, color- or shape-coding helps reveal data clusters.
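The tie-handling pitfall in panel a can be reproduced numerically: when two detections share a confidence score, the order in which the tied items enter the PR curve changes the AP. The following is a minimal sketch with hypothetical detections and a simplified all-point interpolated AP; it is not the exact COCO or CityScapes implementation, whose interpolation grids and tie rules differ.

```python
def average_precision(ranked_detections, num_gt):
    """All-point interpolated AP over a ranked list of 'TP'/'FP' labels.

    Simplified sketch: real toolkits (e.g. COCO) use different
    interpolation grids and tie-handling conventions.
    """
    tp = 0
    points = []  # (recall, precision) after each detection in rank order
    for rank, label in enumerate(ranked_detections, start=1):
        tp += label == "TP"
        points.append((tp / num_gt, tp / rank))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        envelope = max(p for _, p in points[i:])  # precision envelope
        ap += (recall - prev_recall) * envelope
        prev_recall = recall
    return ap

# Hypothetical setup: three ground-truth objects, four detections with
# confidences 0.90, 0.80, 0.80, 0.70 -- the two middle ones are tied.
tp_first = ["TP", "TP", "FP", "TP"]  # tie broken with the TP ranked first
fp_first = ["TP", "FP", "TP", "TP"]  # tie broken with the FP ranked first
print(average_precision(tp_first, num_gt=3))  # 0.9166...
print(average_precision(fp_first, num_gt=3))  # 0.8333...
```

The same detections and the same matches thus yield two different AP values purely through the tie-breaking convention, which is the panel's point.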
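The aggregation pitfall in panel b can be sketched with made-up DSC values (all numbers below are hypothetical): a naive mean over all images is dominated by the over-represented Patient 1, whereas hierarchical aggregation (mean of per-patient means) exposes the weaker performance on Patients 2–5.

```python
from statistics import mean

# Hypothetical DSC values: Patient 1 contributes many more images.
dsc = {
    "Patient 1": [0.95, 0.94, 0.96, 0.93, 0.95, 0.94],
    "Patient 2": [0.60],
    "Patient 3": [0.55],
    "Patient 4": [0.62],
    "Patient 5": [0.58],
}

# Naive: pool every image-level value into a single average.
naive = mean(v for values in dsc.values() for v in values)

# Hierarchical: aggregate per patient first, then across patients.
per_patient = {p: mean(v) for p, v in dsc.items()}
hierarchical = mean(per_patient.values())

print(f"mean over all images:      {naive:.3f}")        # 0.802
print(f"mean of per-patient means: {hierarchical:.3f}")  # 0.659
```

Treating the patient, not the image, as the unit of aggregation is what reveals that performance is high only on Patient 1.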
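Panel c's recommendation can be reproduced with matplotlib; the following is a sketch with hypothetical, simulated DSC values, and the jitter width, colors and figure layout are arbitrary choices, not the figure's actual styling.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical DSC values, grouped by patient (non-independent data).
groups = {
    "Patient 1": rng.normal(0.94, 0.01, 30),
    "Patient 2": rng.normal(0.60, 0.03, 5),
    "Patient 3": rng.normal(0.57, 0.03, 5),
}
values = np.concatenate(list(groups.values()))

fig, (ax_left, ax_right) = plt.subplots(1, 2, sharey=True)

# Left: the box plot alone hides the per-patient clusters.
ax_left.boxplot(values)
ax_left.set_title("Box plot only")

# Right: same box plot with color-coded, jittered raw values on top.
ax_right.boxplot(values)
for label, vals in groups.items():
    jitter = rng.uniform(-0.08, 0.08, len(vals))  # arbitrary jitter width
    ax_right.scatter(1 + jitter, vals, s=12, label=label)
ax_right.set_title("Box plot + jittered dots")
ax_right.legend(fontsize=6)

fig.savefig("metric_distribution.png")
```

In the right-hand panel the color-coded dots make the Patient 1 cluster immediately visible, which the summary statistics of the box plot alone cannot convey.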