Fig. 1: Evaluation Procedure for GPT-4 with Vision (GPT-4V).
From: Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

This figure illustrates the evaluation workflow for GPT-4V using 207 NEJM Image Challenges. The example instance is adapted from the New England Journal of Medicine, Xiaojing Tang and Lijun Sun, Encapsulating Peritoneal Sclerosis. Copyright © 2024 Massachusetts Medical Society. Reprinted with permission from Massachusetts Medical Society (ref. 18). (a) A medical student answered all questions and triaged them into specialties. (b) Nine physicians answered the questions in their respective specialties. (c) GPT-4V was prompted to answer each challenge question with a final choice and a structured response reflecting three specific capabilities. (d) The physicians then appraised the validity of each component of GPT-4V’s responses against the ground-truth explanations.