Fig. 8: Conceptual diagram showing the workflow of permutation testing and stratification analysis to assess bias in WSI-based models.
From: Confounding factors and biases abound when predicting molecular biomarkers from histological images

The algorithm takes as input a dataset containing prediction scores (\(Z\)), ground truth labels (\(Y\)) and a confounding or stratification variable (\(C\)). In step 1, the algorithm computes foreground statistics, such as the AUROC, within each stratum defined by the values of \(C\). In step 2, the algorithm permutes \(C\) \(Q\) times, generating permuted datasets \(D^{(1)}, D^{(2)}, \ldots, D^{(Q)}\). AUROCs are computed in each permuted dataset, in which any association between \(C\) and \(Y\) has been randomized; together they form a null distribution reflecting the expected model performance under the assumption of no association between \(C\) and \(Y\). In step 3, the algorithm compares the observed AUROCs against the null distributions to assess how extreme they are. If they lie in the tails, the effect of \(C\) is considered statistically significant, and a two-sided P value, corrected for multiple hypothesis testing, is computed. KDE, kernel density estimation.
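
The three steps can be illustrated with a minimal sketch in Python. This is not the authors' implementation. The function names (`auroc`, `stratum_aurocs`, `permutation_test`) are hypothetical, the two-sided P value is taken as the fraction of permuted AUROCs lying at least as far from the null mean as the observed AUROC, and the multiple-hypothesis correction is assumed to be Bonferroni across strata. The caption does not specify either choice.

```python
import random
from collections import defaultdict


def auroc(scores, labels):
    """AUROC as the Mann-Whitney probability; None if a class is missing."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratum_aurocs(z, y, c):
    """Step 1: AUROC of scores z against labels y within each value of c."""
    groups = defaultdict(lambda: ([], []))
    for zi, yi, ci in zip(z, y, c):
        groups[ci][0].append(zi)
        groups[ci][1].append(yi)
    return {k: auroc(s, l) for k, (s, l) in groups.items()}


def permutation_test(z, y, c, q=1000, seed=0):
    """Steps 2 and 3: permute c q times and compare observed to null AUROCs.

    Returns {stratum: (observed AUROC, corrected two-sided P value)}.
    """
    rng = random.Random(seed)
    observed = stratum_aurocs(z, y, c)
    null = {k: [] for k in observed}
    c_perm = list(c)
    for _ in range(q):
        rng.shuffle(c_perm)  # breaks any association between C and (Z, Y)
        for k, a in stratum_aurocs(z, y, c_perm).items():
            if a is not None:
                null[k].append(a)

    n_tests = len(observed)  # one hypothesis per stratum (assumption)
    results = {}
    for k, obs in observed.items():
        if obs is None or not null[k]:
            results[k] = (obs, None)
            continue
        center = sum(null[k]) / len(null[k])
        extreme = sum(abs(a - center) >= abs(obs - center) for a in null[k])
        p = (extreme + 1) / (len(null[k]) + 1)  # add-one avoids P = 0
        results[k] = (obs, min(1.0, p * n_tests))  # Bonferroni (assumption)
    return results
```

Calling `permutation_test(z, y, c)` with three equal-length lists returns, for each stratum, the observed AUROC and its corrected P value. A stratum whose observed AUROC falls in the tails of its null distribution receives a small P value.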