Fig. 4: Evaluating LLaVA-Rad using CheXprompt.

a GPT-4-based CheXprompt is closer to the average of the left-in radiologists in total error quantification than the left-out radiologist is (mean absolute difference 0.55 vs 0.71). b Comparison between CheXprompt and existing metrics in terms of agreement with radiologist error quantification. c Comparison between LLaVA-Rad and competing methods using CheXprompt on the MIMIC-CXR test set. d Illustration of how CheXprompt can be used to evaluate a report generated by LLaVA-Rad, with errors highlighted. GPT-4T stands for GPT-4 Turbo. In a, p values correspond to a two-sided paired t-test. In b and c, values represent mean metric scores and error bars correspond to 95% bootstrap confidence intervals. Source data are provided as a Source Data file.
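
For readers who want to reproduce the style of summary statistics named in this caption, the following is a minimal sketch of a mean absolute difference between two raters' error counts and a 95% bootstrap confidence interval for a mean. It assumes a percentile bootstrap, which the caption does not specify; the function names, the number of resamples, and the example error counts are hypothetical and are not taken from the paper or its source data.

```python
import random
from statistics import mean


def mean_absolute_difference(a, b):
    """Mean of |a_i - b_i| over paired error counts from two raters."""
    return mean(abs(x - y) for x, y in zip(a, b))


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Assumption: percentile method; the paper may use a different variant.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-report total error counts (illustrative only).
metric_errors = [2, 1, 0, 3, 2, 1, 4, 2]
radiologist_errors = [2, 2, 0, 2, 3, 1, 3, 2]

mad = mean_absolute_difference(metric_errors, radiologist_errors)
ci_low, ci_high = bootstrap_ci(metric_errors)
```

Fixing the seed makes the interval reproducible across runs; in practice the number of resamples would be chosen to match the analysis being replicated.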