Fig. 2: Average response accuracy versus mean self-reported confidence across large language models.

Scatterplot of mean accuracy against mean self-reported confidence (0–10 scale) for models with more than 150 valid responses; each point represents a single model. The dashed line denotes perfect calibration, where mean confidence equals mean accuracy. Models above this line are overconfident, whereas models below it are underconfident. A subset of closely clustered models is magnified to improve readability.
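The quantities plotted in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the record layout `(model, is_correct, confidence)`, the function names, and the rescaling of confidence from 0–10 to 0–1 before comparing it with accuracy are all assumptions made here for the example.

```python
from collections import defaultdict

MIN_VALID_RESPONSES = 150    # caption: models with more than 150 valid responses
CONFIDENCE_SCALE_MAX = 10.0  # caption: confidence is reported on a 0-10 scale


def summarize_models(responses):
    """Aggregate (model, is_correct, confidence) records into per-model means.

    Returns {model: (mean_accuracy, mean_confidence, gap)}, where
    gap = mean_confidence / 10 - mean_accuracy. The rescaling is an
    assumption of this sketch; accuracy is taken as a fraction in [0, 1].
    """
    grouped = defaultdict(list)
    for model, is_correct, confidence in responses:
        grouped[model].append((1.0 if is_correct else 0.0, float(confidence)))

    summary = {}
    for model, rows in grouped.items():
        # Keep only models with strictly more than the minimum response count.
        if len(rows) <= MIN_VALID_RESPONSES:
            continue
        mean_accuracy = sum(acc for acc, _ in rows) / len(rows)
        mean_confidence = sum(conf for _, conf in rows) / len(rows)
        gap = mean_confidence / CONFIDENCE_SCALE_MAX - mean_accuracy
        summary[model] = (mean_accuracy, mean_confidence, gap)
    return summary


def calibration_label(gap, tolerance=1e-9):
    """Classify a model by the sign of its confidence-minus-accuracy gap."""
    if gap > tolerance:
        return "overconfident"
    if gap < -tolerance:
        return "underconfident"
    return "calibrated"
```

A gap of zero corresponds to a point lying on the dashed perfect-calibration line; a positive gap means stated confidence exceeds measured accuracy.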