Fig. 4

ROC curves (sensitivity vs. false positive rate) for the three best-performing methods, compared with the baseline (model outputs without post-processing), on two external evaluation datasets. Methods 2 and 4 are omitted because of their lower performance and to improve readability.