Fig. 5: AUROC performance of the automated-labeling model at two different pSim threshold values, compared with the sensitivity and specificity of individual expert radiologists and with pooled public labels from three open-source CXR datasets.

AUROC performance of our xAI CXR auto-labeling model applied to the CheXpert, MIMIC, and NIH open-source datasets is shown for each of the five clinical output labels: a cardiomegaly, b pleural effusion, c pulmonary edema, d pneumonia, and e atelectasis. Performance is compared with that of the individual expert radiologists (A–G, red circles) and with the pooled external annotations (blue squares; n = number of labeled external cases available per clinical output label). ROC curves (y-axis: sensitivity; x-axis: 1 − specificity) are shown for both the baseline threshold pSim = 0 (magnified box) and the optimal pSim threshold (i.e., the lowest pSim threshold achieving 100% accuracy, as per Figs. 2–4c and d).
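For readers wishing to reproduce this style of analysis, the following is a minimal Python sketch, not the authors' pipeline: the function name, the 0.5 operating point, and the assumption that higher pSim indicates higher labeling confidence are ours. It shows how a per-label ROC curve and AUROC (the baseline pSim = 0 case, where all cases are retained) and the caption's "optimal" pSim threshold (the lowest value at which every retained auto-label matches the reference standard) could be computed with scikit-learn.

```python
# Illustrative sketch only; all names and conventions here are assumptions,
# not taken from the paper's code.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auroc_and_optimal_psim(y_true, y_score, psim):
    """y_true: binary reference labels; y_score: model probabilities;
    psim: per-case pSim values. All 1-D arrays of equal length."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    psim = np.asarray(psim)

    # Baseline (pSim = 0): every case retained.
    # roc_curve returns fpr (x-axis: 1 - specificity) and tpr (y-axis: sensitivity).
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auroc = roc_auc_score(y_true, y_score)

    # Sweep candidate pSim thresholds from low to high and return the lowest
    # one at which the retained auto-labels agree perfectly with the reference.
    optimal_t = None
    for t in np.sort(np.unique(psim)):
        keep = psim >= t                 # assumption: higher pSim = higher confidence
        if not keep.any():
            break                        # nothing retained; stop the sweep
        y_pred = y_score[keep] >= 0.5    # hypothetical 0.5 operating point
        if np.array_equal(y_pred, y_true[keep].astype(bool)):
            optimal_t = t                # lowest threshold with 100% accuracy
            break
    return auroc, (fpr, tpr), optimal_t

# Toy usage with synthetic data (pSim proxied by distance from 0.5):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, 500), 0, 1)
psim = np.abs(y_score - 0.5) * 2
auroc, _, optimal_t = auroc_and_optimal_psim(y_true, y_score, psim)
print(f"AUROC = {auroc:.3f}, optimal pSim threshold = {optimal_t}")
```

On real data the sweep may return None if no threshold yields perfect agreement; the caption's definition presupposes that such a threshold exists for each label, as established in Figs. 2–4.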