Fig. 3: External validation results for LLaVA-Rad on held-out datasets.

Open-I (a, b), CheXpert (c, d), and US-CXR (e, f). LLaVA-Rad outperforms baselines across all external validation datasets, as assessed by traditional factual-correctness metrics (F1-CheXbert-14, F1-RadGraph) and lexical similarity (ROUGE-L). CheXprompt evaluation (b, d, f) further demonstrates that LLaVA-Rad produces fewer clinically significant errors and fewer errors overall than the baselines. Each dataset consists of image-report pairs (Open-I: n = 2163; CheXpert: n = 61; US-CXR: n = 1751). Values represent mean metric scores on each dataset, and error bars indicate 95% bootstrap confidence intervals computed over 500 resampling iterations. Source data are provided as a Source Data file.
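
The error-bar procedure described above (a 95% percentile bootstrap over 500 resamples of per-report metric scores) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' evaluation code: the function name, random seed, and synthetic scores are hypothetical.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=500, ci=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-report metric scores.

    Illustrative sketch: resamples reports with replacement, records the
    mean of each resample, and takes the (1-ci)/2 and (1+ci)/2 percentiles.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return scores.mean(), lo, hi

# Hypothetical example: synthetic per-report F1-RadGraph scores for one
# model on a dataset of 2163 reports (matching the Open-I sample size).
scores = np.random.default_rng(1).uniform(0.2, 0.6, size=2163)
mean, lo, hi = bootstrap_ci(scores)
print(f"mean = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```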