Fig. 3: Evaluation of FLEX’s effectiveness in improving demographic fairness.

The top three rows (NSCLC-TYPE, BRCA-PR, CRC-BRAF) show fairness when evaluated across self-reported race and ancestry groups using results from a 15-fold SP-MCCV on the TCGA-NSCLC (n = 958), TCGA-BRCA (n = 937), and TCGA-CRC (n = 606) datasets, respectively. The bottom two rows (C-BRCA-TYPE, C-LUAD-EGFR) show results from models trained on TCGA datasets and externally validated on CPTAC-BRCA (n = 323) and CPTAC-LUAD (n = 815). For each cross-validation fold, metrics for each demographic-label group were estimated via bootstrapping (n = 500 replicates). a Fairness gap for five representative tasks. The AUROC gap ratio is the performance difference between the best- and worst-performing subgroups relative to the overall AUROC. Smaller dots represent individual folds (n = 15); large dots with error bars show the mean and 95% CI across folds. Marginal plots show the distribution of AUROC gap ratios across folds. P-values shown are from a two-sided Wilcoxon signed-rank test without adjustment for multiple comparisons, comparing FLEX to the Original method (P = 0.026 when evaluated on ancestry for NSCLC-TYPE; P = 0.018 when evaluated on self-reported race for BRCA-PR; P = 0.048 for C-BRCA-TYPE; P = 0.004 for C-LUAD-EGFR). b AUROC distribution across subgroups. Box plots show the distribution across n = 15 cross-validation folds. The center line represents the median (50th percentile), the box bounds represent the interquartile range (IQR; 25th to 75th percentiles), and the whiskers extend to data points within 1.5 × the IQR. c Violin plots of True Positive Rate (TPR) disparity, defined as the difference between a subgroup’s TPR and the overall TPR. The distributions are derived from results across n = 15 cross-validation folds. Within each violin, the box plot shows the median (center line), interquartile range (IQR), and whiskers extending to 1.5 × the IQR. Distributions centered closer to the zero-disparity line indicate higher fairness. P-values shown are from a two-sided paired-samples t-test without adjustment for multiple comparisons. The dashed line represents the overall TPR across the three methods. d Quantitative summary of TPR disparity, showing the Root Mean Square Error (RMSE) for Original (O), Reinhard (R), and FLEX (F). Lower values indicate better fairness. Source data are provided as a Source Data file.
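The fairness metrics named in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "relative to the overall AUROC" means division by the overall AUROC, and that the RMSE in panel d is the root mean square of subgroup TPR disparities measured against zero disparity. The function names (`auroc_gap_ratio`, `tpr_disparity`, `rmse_of_disparities`) and the toy inputs are hypothetical.

```python
import numpy as np


def auroc_gap_ratio(subgroup_auroc, overall_auroc):
    """(best subgroup AUROC - worst subgroup AUROC) / overall AUROC.

    Assumption: "relative to" in the caption is read as division.
    """
    values = np.asarray(list(subgroup_auroc.values()), dtype=float)
    return float((values.max() - values.min()) / overall_auroc)


def tpr(y_true, y_pred):
    """True positive rate: correctly predicted positives / all positives."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float((y_true & y_pred).sum() / y_true.sum())


def tpr_disparity(y_true, y_pred, groups):
    """Per-subgroup TPR minus the overall TPR, as defined in panel c."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    overall = tpr(y_true, y_pred)
    return {
        g: tpr(y_true[groups == g], y_pred[groups == g]) - overall
        for g in np.unique(groups).tolist()
    }


def rmse_of_disparities(disparities):
    """Root mean square of disparities against the zero-disparity line.

    Assumption: this is how the RMSE summary in panel d is formed.
    """
    d = np.asarray(list(disparities), dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))


# Toy example with made-up numbers (not data from the study).
gap = auroc_gap_ratio({"group_a": 0.90, "group_b": 0.81}, overall_auroc=0.90)

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
groups = ["a", "a", "b", "b", "a", "b"]
disparity = tpr_disparity(y_true, y_pred, groups)
summary = rmse_of_disparities(disparity.values())
```

In this toy case the overall TPR is 0.75, subgroup "a" has a TPR of 1.0 and subgroup "b" a TPR of 0.5, so the disparities are +0.25 and -0.25 and the RMSE summary is 0.25. A perfectly fair model under this reading would have every disparity equal to zero.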