Extended Data Fig. 4: Performance Comparison of Model Ensembles and Single-Model Baselines Using DeLong’s Test.
From: Benchmarking foundation models as feature extractors for weakly supervised computational pathology

A, AUROC scores for each model and ensemble approach, with predictions averaged across five folds for individual models and across five or ten folds for ensembles. Two ensembling approaches were used: averaging the prediction scores of downstream models trained on different foundation model backbones (prefix Avg), and concatenating the feature vectors from different backbones to train a single downstream model (prefix Concat). The “Lauren” task was excluded because it is not a binary classification task. B, C, P-values from two-sided DeLong’s tests comparing CONCH (B) or Virchow2 (C) with the other models and ensembles. No correction for multiple testing was applied; alpha was set to 0.05.
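
The two ensembling approaches described in the caption can be illustrated with a minimal sketch. This is not the authors' code: the function names, array shapes, and toy values below are assumptions made for illustration only, and the AUROC helper uses the standard Mann-Whitney formulation rather than any implementation named in the source. DeLong's test itself is not implemented here.

```python
import numpy as np


def average_predictions(scores_per_backbone):
    """Avg-style ensemble: mean of the prediction scores produced by
    downstream models trained on different backbones (hypothetical helper)."""
    stacked = np.stack(scores_per_backbone, axis=0)  # (n_backbones, n_samples)
    return stacked.mean(axis=0)


def concat_features(features_per_backbone):
    """Concat-style ensemble input: join per-backbone feature vectors along
    the feature axis; a single downstream model would be trained on the result."""
    return np.concatenate(features_per_backbone, axis=1)


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative,
    counting ties as one half (Mann-Whitney formulation)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


# Toy binary task with two hypothetical backbones (values are made up).
labels = np.array([0, 0, 1, 1])
scores_a = np.array([0.10, 0.40, 0.35, 0.80])
scores_b = np.array([0.20, 0.30, 0.60, 0.70])

avg_scores = average_predictions([scores_a, scores_b])
concat = concat_features([np.zeros((4, 3)), np.ones((4, 5))])

print(auroc(labels, scores_a), auroc(labels, avg_scores), concat.shape)
```

In this toy case the averaged scores separate the classes better than the first backbone alone, but that is a property of the made-up numbers, not a claim about the figure.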