Fig. 4: Dermatologist-level CNN models are not robust to image transformations.

From: Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

a Representative example of Model A predictions on a melanoma image from MClass-D. The predicted probability of melanoma is shown below the predicted class. The decision threshold is model confidence >31.0%, determined by the operating point at which the model achieves a sensitivity comparable to that of dermatologists on MClass-D. b The predicted melanoma probability across rotations is plotted for each example in each test set. Non-robust images, whose predictions cross the decision threshold selected to match dermatologists’ sensitivity, are plotted in red; robust predictions are plotted in gray. The false negative example shown in (a) is highlighted by the blue arrow. c The percentage of non-robust lesions, assessed over rotation, horizontal flip, brightness, and contrast transformations, is shown as a percentage of the whole dataset for each test dataset. “Consistently correct” and “Consistently wrong” refer to lesions whose prediction is consistent across all transformations. “Correct original, wrong transformed” refers to lesions whose prediction was correct on the original image but wrong on one or more transformed images. “Wrong original, correct transformed” refers to lesions whose prediction was wrong on the original image but correct on one or more transformed images. Abbreviations: D Dermoscopic, MClass Melanoma Classification Benchmark, ND Non-dermoscopic, PH2 Hospital Pedro Hispano, UCSF University of California, San Francisco, VAMC-C Veterans Affairs Medical Center clinic, VAMC-T Veterans Affairs Medical Center teledermatology.
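For readers who want to reproduce the panel (c) tallies, a minimal sketch of the categorization logic is shown below. It assumes per-lesion melanoma probabilities for the original image and for each transformed copy are already available; the 31.0% threshold comes from the caption, while the function name and data layout are illustrative.

```python
# Illustrative sketch (not the authors' code): bin a lesion into the four
# robustness categories of panel (c), given predicted melanoma probabilities
# for the original image and for each transformed copy.
THRESHOLD = 0.31  # operating point matching dermatologist sensitivity on MClass-D


def categorize_lesion(p_original, p_transformed, is_melanoma):
    """Return one of the four robustness categories for a single lesion."""
    def correct(p):
        # A prediction is correct when the thresholded call matches the label.
        return (p > THRESHOLD) == is_melanoma

    original_correct = correct(p_original)
    transformed_correct = [correct(p) for p in p_transformed]

    if original_correct and all(transformed_correct):
        return "consistently correct"
    if not original_correct and not any(transformed_correct):
        return "consistently wrong"
    if original_correct:
        return "correct original, wrong transformed"
    return "wrong original, correct transformed"


# Example: a melanoma predicted correctly on the original image but missed
# after one rotation would fall into "correct original, wrong transformed".
print(categorize_lesion(0.45, [0.48, 0.22, 0.51], is_melanoma=True))
```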