Fig. 5: Performance of FLEX across training data scales and MIL models.

a Schematic of the experimental design for evaluating performance at different training data scales. The procedure follows the SP-MCCV strategy, with 5 site-preserved outer folds and 3 inner Monte Carlo cross-validation (MCCV) runs. In each of the 5 outer iterations, one fold is held out as the OOD test set, and training sets of increasing size (1, 2, 3, and 4 folds) are constructed from the remaining 4 folds. For each training data scale, 3 inner MCCV runs are performed, giving 15 evaluation runs per scale (5 outer folds × 3 inner runs). b OOD AUROC comparison between FLEX and the baseline (Original) on three tasks from the TCGA-BRCA dataset (n = 875). Box plots show aggregated results from the 15 SP-MCCV runs at each training data scale. The box plots display the median (center line), the first and third quartiles (Q1-Q3; the box), and whiskers extending to 1.5 × the interquartile range. Individual data points from each run are overlaid. P values are from a two-sided paired-samples t-test without adjustment for multiple comparisons. c Dumbbell plot showing OOD AUROC after integrating FLEX with five state-of-the-art MIL models across 16 tasks. Datasets used include TCGA-BRCA (n = 937), TCGA-NSCLC (n = 958), TCGA-STAD (n = 414), and TCGA-CRC (n = 606). Each dumbbell connects the mean performance of an Original model to that of its FLEX-enhanced counterpart. Indicated P values were calculated using a two-sided paired-samples t-test with multiple hypothesis correction. Source data are provided as a Source Data file.
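
The split design in panel a can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the synthetic site labels, the 20% validation fraction, the choice of which remaining folds form the smaller training sets, and the reading of each inner MCCV run as a random train/validation split are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def site_preserved_folds(sites, n_folds=5, seed=0):
    """Assign whole sites to folds so that no site spans two folds."""
    rng = random.Random(seed)
    unique_sites = sorted(set(sites))
    rng.shuffle(unique_sites)
    site_to_fold = {s: i % n_folds for i, s in enumerate(unique_sites)}
    folds = defaultdict(list)
    for idx, site in enumerate(sites):
        folds[site_to_fold[site]].append(idx)
    return [folds[f] for f in range(n_folds)]


def scale_experiment_runs(sites, n_folds=5, n_inner=3, val_frac=0.2, seed=0):
    """Yield one dict per evaluation run of the data-scale experiment."""
    folds = site_preserved_folds(sites, n_folds, seed)
    for test_fold in range(n_folds):
        # The held-out fold serves as the OOD test set.
        remaining = [f for f in range(n_folds) if f != test_fold]
        for n_train_folds in range(1, n_folds):
            # Assumption: smaller training sets use the first k remaining folds.
            pool = [i for f in remaining[:n_train_folds] for i in folds[f]]
            for inner_run in range(n_inner):
                # Assumption: each inner MCCV run is a random train/val split.
                run_seed = seed * 1000 + test_fold * 100 + n_train_folds * 10 + inner_run
                rng = random.Random(run_seed)
                shuffled = pool[:]
                rng.shuffle(shuffled)
                n_val = max(1, int(len(shuffled) * val_frac))
                yield {
                    "test_fold": test_fold,
                    "n_train_folds": n_train_folds,
                    "inner_run": inner_run,
                    "train": shuffled[n_val:],
                    "val": shuffled[:n_val],
                    "test": folds[test_fold],
                }


# Synthetic example: 80 slides from 10 hypothetical sites.
sites = [f"site_{i % 10}" for i in range(80)]
runs = list(scale_experiment_runs(sites))

# Count evaluation runs per training data scale.
runs_per_scale = defaultdict(int)
for run in runs:
    runs_per_scale[run["n_train_folds"]] += 1
print(dict(runs_per_scale))
```

Under these assumptions, each of the four training data scales receives 5 × 3 = 15 runs, and no site contributes slides to both the training pool and the OOD test set of the same run.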