Fig. 4: Model performance under the main-source robustness protocol and cyclic-source generalizability protocol.

a Assessing model robustness under the main-source robustness protocol: models are trained on the main source subset and tested on the four auxiliary source subsets. b Assessing model generalizability under the cyclic-source generalizability protocol: models are trained on the main source subset combined with three of the four auxiliary source subsets and tested on the remaining auxiliary subset. Error bars represent 95% confidence intervals of the estimates, and the center of each bar represents the mean estimate of the displayed metric. Estimates are computed from a bootstrap distribution of 1000 bootstrap samples drawn from the corresponding test sets (n = 1000 samples).
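The two protocols and the bootstrap estimate described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code. The subset names (`main`, `aux1` to `aux4`) and the `accuracy` metric are hypothetical placeholders. The percentile bootstrap is also an assumption: it resamples test cases with replacement and takes the 2.5th and 97.5th percentiles. The caption itself specifies only 1000 bootstrap samples, a 95% confidence interval, and the mean as the bar center.

```python
import random
from statistics import mean, quantiles

# Hypothetical subset names; the caption does not name the subsets.
MAIN = "main"
AUXILIARY = ["aux1", "aux2", "aux3", "aux4"]


def main_source_splits(main=MAIN, auxiliary=AUXILIARY):
    """Main-source robustness protocol: train on the main subset only,
    evaluate on each auxiliary subset."""
    return [([main], aux) for aux in auxiliary]


def cyclic_source_splits(main=MAIN, auxiliary=AUXILIARY):
    """Cyclic-source generalizability protocol: train on the main subset plus
    three auxiliary subsets, test on the held-out one (one split per subset)."""
    splits = []
    for held_out in auxiliary:
        train = [main] + [a for a in auxiliary if a != held_out]
        splits.append((train, held_out))
    return splits


def accuracy(y_true, y_pred):
    """Hypothetical example metric: fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Assumed percentile bootstrap: resample test cases with replacement
    n_boot times, recompute the metric on each resample, and return
    (mean, 2.5th percentile, 97.5th percentile) of the bootstrap distribution."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    # quantiles(..., n=40) returns 39 cut points; the first and last are the
    # 2.5th and 97.5th percentiles, i.e. the bounds of a 95% interval.
    cuts = quantiles(scores, n=40)
    return mean(scores), cuts[0], cuts[-1]


if __name__ == "__main__":
    # Synthetic test set of 1000 cases with roughly 80% correct predictions.
    rng = random.Random(1)
    y_true = [rng.randint(0, 1) for _ in range(1000)]
    y_pred = [t if rng.random() < 0.8 else 1 - t for t in y_true]
    print(cyclic_source_splits())
    print(bootstrap_ci(y_true, y_pred, accuracy))
```

Each split pairs a list of training subsets with one test subset. The bootstrap then runs once per test set, which yields one bar and its error bar in panels a and b.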