Fig. 5: OOD fairness of models with different model selection criteria and for different algorithms.
From: The limits of fair medical imaging AI in real-world generalization

a, We varied the ID model selection criteria and compared the selected model against the oracle that chooses the model that is most fair OOD. We plotted the increase in OOD fairness gap of the selected model over the oracle, averaged across 42 combinations of OOD dataset, task and attribute. We used non-parametric bootstrap sampling (n = 1,000) to define the bootstrap distribution for the metric. We found that selection criteria based on choosing models with minimum attribute encoding achieve better OOD fairness than naively selecting based on ID fairness or other aggregate performance metrics (‘Minimum Attribute Prediction Accuracy’ versus ‘Minimum Fairness Gap’: P = 9.60 × 10⁻⁹⁴, one-tailed Wilcoxon rank-sum test; ‘Minimum Attribute Prediction AUROC’ versus ‘Minimum Fairness Gap’: P = 1.95 × 10⁻¹², one-tailed Wilcoxon rank-sum test). b, We selected the model for each algorithm with the minimum ID fairness gap. We evaluated its OOD fairness against the oracle on the same 42 settings. We found that removing demographic encoding (that is, DANN) leads to the best OOD fairness (‘DANN’ versus ‘ERM’: P = 1.86 × 10⁻¹¹⁷, one-tailed Wilcoxon rank-sum test). On each box, the central line indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to 1.5 times the interquartile range. Points beyond the whiskers are plotted individually using the ‘+’ symbol.
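A minimal sketch of the statistical procedure described in this legend is given below: a non-parametric bootstrap (n = 1,000) over the 42 dataset/task/attribute settings to form the distribution of the mean increase in OOD fairness gap over the oracle, followed by a one-tailed Wilcoxon rank-sum comparison of two selection criteria. This assumes the test is applied to the two bootstrap distributions; the per-setting gap arrays and criterion names are hypothetical placeholders, not values or code from the paper.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

def bootstrap_mean_increase(selected_gaps, oracle_gaps, n_boot=1_000):
    """Bootstrap distribution of the mean increase in OOD fairness gap
    of the selected model over the oracle, resampling the 42 settings."""
    increases = np.asarray(selected_gaps) - np.asarray(oracle_gaps)
    n = len(increases)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample settings with replacement
    return increases[idx].mean(axis=1)          # one mean per bootstrap replicate

# Hypothetical per-setting OOD fairness gaps (42 settings), for illustration only.
oracle_gap   = rng.uniform(0.00, 0.05, size=42)                # oracle: most fair OOD
min_id_gap   = oracle_gap + rng.uniform(0.00, 0.10, size=42)   # 'Minimum Fairness Gap' criterion
min_attr_acc = oracle_gap + rng.uniform(0.00, 0.04, size=42)   # 'Minimum Attribute Prediction Accuracy'

boot_id_gap   = bootstrap_mean_increase(min_id_gap, oracle_gap)
boot_attr_acc = bootstrap_mean_increase(min_attr_acc, oracle_gap)

# One-tailed Wilcoxon rank-sum test: does the attribute-encoding criterion yield
# a smaller increase over the oracle than selecting on ID fairness gap?
stat, p = ranksums(boot_attr_acc, boot_id_gap, alternative='less')
print(f"median increase, min attribute acc.: {np.median(boot_attr_acc):.4f}")
print(f"median increase, min ID fairness gap: {np.median(boot_id_gap):.4f}")
print(f"one-tailed Wilcoxon rank-sum P = {p:.3g}")
```

Because each bootstrap distribution contains 1,000 resampled means, even modest separations between criteria produce very small P values, consistent with the magnitudes reported in the legend.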