Fig. 2: Evaluation results.
From: Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

a Average multiple-choice accuracy achieved by various models and individuals, grouped by question difficulty. b Confusion matrices showing the overlap between errors made by GPT-4V and by human physicians. c Bar graphs showing the percentage of GPT-4V’s rationales in each capability area, as rated for accuracy by human physicians. ***p < 0.001; n.s., not significant.