Fig. 2: Performance on lung disease classification.

a Pretrain-seen-distribution: models fine-tuned and tested on datasets seen during TANGERINE pretraining. b Pretrain-unseen-distribution: models fine-tuned and tested on datasets not seen during pretraining. c Domain generalisation: models fine-tuned on one dataset and evaluated on a distinct target dataset unseen during both pretraining and fine-tuning. Full dataset sizes, including training, validation, and testing splits for all tasks presented here, are detailed in Supplementary Table 2. Each model was trained with five random seeds; bar centres indicate mean AUROC and error bars show 95% confidence intervals. Pairwise P values were computed using two-sided t tests with Bonferroni correction (the most competitive P values are shown; full values are in Supplementary Data 3–6). Multi-class classification results for specific disease categories are shown in Supplementary Figs. 1–3. TANGERINE consistently outperforms or matches comparison models.
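As a rough illustration of the summary statistics described in the caption, the sketch below computes a mean AUROC with a 95% confidence interval over five seeds and applies a Bonferroni correction to a set of raw P values. It is a minimal sketch under stated assumptions, not the authors' analysis code: the caption does not say how the intervals were constructed, so a t-based interval is assumed here, and the function names and all input numbers are hypothetical.

```python
import math
import statistics

# Two-sided 95% t critical value for 4 degrees of freedom (five seeds).
# Assumption: a t-based interval; the caption does not specify the method.
T_CRIT_DF4 = 2.776


def mean_ci95(aurocs):
    """Return (mean, half_width) of a t-based 95% CI for five per-seed AUROCs."""
    if len(aurocs) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for exactly five values")
    mean = statistics.fmean(aurocs)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(aurocs) / math.sqrt(len(aurocs))
    return mean, T_CRIT_DF4 * sem


def bonferroni(p_values):
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical numbers for illustration only; these are not results from the paper.
example_aurocs = [0.80, 0.82, 0.81, 0.79, 0.83]
mean, half_width = mean_ci95(example_aurocs)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

Here `mean` corresponds to a bar centre and `mean ± half_width` to the ends of its error bar, while `adjusted` holds the corrected P values for a family of three hypothetical comparisons.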