Fig. 4: Calibration of self-reported confidence across best- and worst-calibrated large language models.

Panels a–i show calibration curves for each LLM, plotting observed model accuracy (y-axis, %) against mean normalized self-reported confidence (x-axis) within 15 quantile-based bins of the original 0–10 confidence scale. The dashed diagonal line indicates perfect calibration. Points represent confidence bins, with adjacent numbers indicating the number of questions contributing to each bin; bins containing fewer than three responses are not displayed. Panels are ordered by Brier score from best to worst calibrated: the six models with the lowest Brier scores (best calibrated) appear in a–f, and the three models with the highest Brier scores (worst calibrated) appear in g–i.
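
For readers who want to reproduce this kind of plot, the binning described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names `calibration_curve` and `brier_score` are hypothetical, normalization is assumed to be division by 10, quantile edges are assumed to be computed on the pooled confidences with tied edges merged, and the Brier score is assumed to be the mean squared difference between normalized confidence and a 0/1 correctness indicator.

```python
import numpy as np

def calibration_curve(confidence_0_10, correct, n_bins=15, min_count=3):
    """Return (mean confidence, accuracy in %, n) for each displayed bin."""
    # Assumption: normalize the 0-10 self-reported scale to [0, 1].
    conf = np.asarray(confidence_0_10, dtype=float) / 10.0
    correct = np.asarray(correct, dtype=float)

    # Quantile-based edges; tied ratings produce duplicate edges, which are merged.
    edges = np.unique(np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1)))

    # Assign each response to a bin using the interior edges only.
    bin_index = np.searchsorted(edges[1:-1], conf, side="left")

    points = []
    for b in np.unique(bin_index):
        mask = bin_index == b
        n = int(mask.sum())
        if n < min_count:
            continue  # bins with fewer than three responses are not displayed
        mean_conf = float(conf[mask].mean())
        accuracy_pct = 100.0 * float(correct[mask].mean())
        points.append((mean_conf, accuracy_pct, n))
    return points

def brier_score(confidence_0_10, correct):
    """Mean squared gap between normalized confidence and 0/1 correctness."""
    conf = np.asarray(confidence_0_10, dtype=float) / 10.0
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))
```

Because a discrete 0–10 scale yields many tied values, fewer than 15 distinct bins will usually survive under these assumptions. A lower Brier score corresponds to better calibration, which is the ordering used for the panels.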