Fig. 1: Schematic overview of our human evaluation framework.

From: Collaboration between clinicians and vision–language models in radiology report generation

a, To compare radiology reports generated by our AI model with reports written by human experts, we devise two evaluation schemes: (1) a pairwise preference test in which a certified expert is given two reports without knowing their source (one report from our model and the original report from a radiologist) and is asked to choose which report should be ‘used downstream for the care of this patient’; and (2) an error correction task in which a single report (either AI-generated or the original one) is evaluated carefully and edited if required. The expert is also asked to give the reason for each correction and to indicate whether or not the error is clinically significant. b, We measure the utility of the AI-based report generation system in an assistive scenario in which the AI model first generates a report and the human expert revises it as needed. For this task, we repeat the same pairwise preference test as before, but this time the expert is asked to compare an AI-generated report corrected with human edits against a report written by a human alone. We perform this evaluation on two datasets, one acquired in outpatient care delivery in India and another from intensive care in the United States. Board-certified radiologists are recruited in both countries to study regional inter-rater variation.