Fig. 5: Accuracy and self-reported confidence across question difficulty for nine large language models. | npj Gut and Liver

From: Across generations, sizes, and types, large language models poorly report self-confidence in gastroenterology clinical reasoning tasks

a–i show smoothed model accuracy (blue solid line, left y-axis, %) and smoothed self-reported confidence (orange dashed line, right y-axis, 0–10 scale) as a function of question difficulty. Question difficulty is defined as the percentage of human test-takers answering correctly (lower values indicate more difficult questions), and questions were grouped into 5-percentage-point difficulty bins on the x-axis. Panels are ordered by Brier score from best- to worst-calibrated models, with the six lowest Brier-score (best calibrated) models in a–f and the three highest Brier-score (worst calibrated) models in g–i.
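
The binning and ordering described in the caption can be illustrated with a minimal sketch. It is not the authors' analysis code: the record layout, the function names, and the rescaling of the 0–10 confidence to a 0–1 probability for the Brier score are assumptions made here for illustration, and the smoothing step is omitted because the caption does not specify the method.

```python
from collections import defaultdict


def difficulty_bin(pct_correct, width=5):
    """Lower edge of the difficulty bin for a question.

    pct_correct is the percentage of human test-takers who answered
    correctly (lower means harder). A value of exactly 100 is placed
    in the top bin rather than in a bin of its own.
    """
    edge = int(pct_correct // width) * width
    return min(edge, 100 - width)


def bin_summary(records, width=5):
    """Per-bin accuracy (%) and mean self-reported confidence (0-10).

    records: iterable of (pct_humans_correct, model_correct, confidence)
    tuples. This layout is hypothetical, not taken from the paper.
    """
    groups = defaultdict(list)
    for pct, correct, conf in records:
        groups[difficulty_bin(pct, width)].append((correct, conf))

    summary = {}
    for edge in sorted(groups):
        items = groups[edge]
        accuracy = 100.0 * sum(1 for c, _ in items if c) / len(items)
        mean_conf = sum(conf for _, conf in items) / len(items)
        summary[edge] = (accuracy, mean_conf)
    return summary


def brier_score(records):
    """Mean squared gap between confidence and correctness.

    Assumption: confidence on the 0-10 scale is divided by 10 and
    treated as a probability. Lower scores mean better calibration.
    """
    total = 0.0
    count = 0
    for _, correct, conf in records:
        outcome = 1.0 if correct else 0.0
        total += (conf / 10.0 - outcome) ** 2
        count += 1
    return total / count


if __name__ == "__main__":
    # Made-up records purely to exercise the functions.
    demo = [(42, True, 8), (44, False, 6), (91, True, 10)]
    print(bin_summary(demo))
    print(brier_score(demo))
```

Computing `brier_score` once per model and sorting the results in ascending order would give a best-to-worst calibration ordering of the kind used to arrange the panels.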