Fig. 3: Calibration diagrams for model confidence and human confidence across experiments 1 and 2.
From: What large language models know and what people think they know

The top and middle rows show results for multiple-choice questions with the GPT-3.5 and PaLM2 models, respectively. The bottom row shows results for short-answer questions with the GPT-4o model. The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity). The shaded regions represent the 95% confidence interval of the mean computed across participants and questions.
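As a rough illustration of how the quantities in such a calibration diagram can be derived, the sketch below bins confidence ratings and computes per-bin accuracy, an interval around it, and the proportion of observations per bin. It is not the authors' analysis code. The function name `calibration_bins`, the choice of ten equal-width bins, and the normal-approximation interval are all assumptions made for this example; the figure's intervals are computed across participants and questions, which this simple version does not reproduce.

```python
import math


def calibration_bins(confidences, correct, n_bins=10):
    """Summarize (confidence, correctness) pairs in equal-width bins on [0, 1].

    Hypothetical helper for illustration only; bin count and interval
    method are assumptions, not taken from the paper.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    conf_sums = [0.0] * n_bins

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += y
        conf_sums[idx] += c

    n = len(confidences)
    rows = []
    for i in range(n_bins):
        if totals[i] == 0:
            continue  # skip empty bins
        acc = hits[i] / totals[i]
        # Assumed: normal-approximation 95% interval for a proportion.
        half = 1.96 * math.sqrt(acc * (1 - acc) / totals[i])
        rows.append({
            "bin": i,
            "mean_confidence": conf_sums[i] / totals[i],
            "accuracy": acc,
            "ci_low": max(0.0, acc - half),
            "ci_high": min(1.0, acc + half),
            "proportion": totals[i] / n,  # height of the histogram bar
        })
    return rows
```

Plotting `accuracy` against `mean_confidence` gives the calibration curve, where a perfectly calibrated rater falls on the diagonal, and `proportion` gives the histogram shown under each plot.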