Extended Data Fig. 4: Performance Comparison of Model Ensembles and Single-Model Baselines Using DeLong’s Test. | Nature Biomedical Engineering

From: Benchmarking foundation models as feature extractors for weakly supervised computational pathology

A, AUROC scores for each model and ensemble approach, with predictions averaged across five folds for individual models and five or ten folds for ensembles. Two ensembling approaches were used: averaging the prediction scores of downstream models trained on different foundation model backbones (prefix Avg), and concatenating feature vectors from different backbones to create a single downstream model (prefix Concat). The “Lauren” task was excluded because it is not a binary classification task. B, C, P values from two-sided DeLong’s tests comparing CONCH (B) or Virchow2 (C) with other models and ensembles. No correction for multiple testing was applied; the significance level (alpha) was set to 0.05.
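
The caption does not say how the tests were implemented, so the following is only a minimal sketch of the two ideas it names: score averaging (the Avg ensembles) and a two-sided DeLong's test for two AUROCs computed on the same cases. The function names `average_ensemble` and `delong_test` and the toy labels and scores are hypothetical and are not taken from the study. The Concat approach is not shown because it requires training a downstream model.

```python
import math


def average_ensemble(score_lists):
    """Average per-case prediction scores across models (Avg-style ensemble)."""
    return [sum(case) / len(case) for case in zip(*score_lists)]


def _placements(pos, neg):
    """Per-positive (V10) and per-negative (V01) placement values."""
    def psi(x, y):
        return 1.0 if x > y else (0.5 if x == y else 0.0)
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, p) for a two-sided DeLong's test.

    Assumes binary labels (1 = positive, 0 = negative), at least two cases
    per class, and both score lists aligned to the same cases.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos_idx), len(neg_idx)
    v10_a, v01_a = _placements([scores_a[i] for i in pos_idx],
                               [scores_a[i] for i in neg_idx])
    v10_b, v01_b = _placements([scores_b[i] for i in pos_idx],
                               [scores_b[i] for i in neg_idx])
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    # Variance of the AUROC difference from the placement covariances.
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n)
    if var <= 0:
        # Degenerate case: no estimable variance for the difference.
        return auc_a, auc_b, (1.0 if auc_a == auc_b else float("nan"))
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, p


# Toy data (hypothetical): six cases, two models, one averaged ensemble.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
model_b = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]
avg_ab = average_ensemble([model_a, model_b])
auc_a, auc_b, p_value = delong_test(labels, model_a, model_b)
significant = p_value < 0.05  # alpha as stated in the caption, uncorrected
```

Because both AUROCs are computed on the same cases, the covariance terms matter: treating the two models as independent samples would misstate the variance of the difference.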
