Fig. 3

Model performance for disease classification in the external validation cohort. Accuracy, F1 score, and AUROC are shown for models combining seven encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2) with four aggregation methods (max pooling, ABMIL, TransMIL, and CLAM). Data are reported as means and 95% confidence intervals. AUROC area under the receiver operating characteristic curve, ABMIL attention-based multiple instance learning, TransMIL transformer-based multiple instance learning, CLAM clustering-constrained attention multiple instance learning.