Fig. 2: Average response accuracy versus mean self-reported confidence across large language models. | npj Gut and Liver

From: Across generations, sizes, and types, large language models poorly report self-confidence in gastroenterology clinical reasoning tasks

Scatterplot of mean accuracy and mean confidence scores (0–10 scale) for models with more than 150 valid responses, with each point representing a single model. The dashed line denotes perfect calibration, where mean confidence equals mean accuracy. Models above this line are overconfident, whereas models below are underconfident. A subset of closely clustered models is magnified to improve readability.
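
The comparison behind each point can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function name `calibration_summary`, the input layout (a list of correctness and confidence pairs per model), and the rescaling of the 0–10 confidence score to a 0–1 fraction by dividing by 10 are all assumptions made here so that confidence and accuracy share a scale.

```python
from statistics import mean

# From the caption: only models with more than 150 valid responses are plotted.
MIN_VALID_RESPONSES = 150


def calibration_summary(responses):
    """Summarise accuracy and confidence per model.

    `responses` (hypothetical layout) maps a model name to a list of
    (is_correct, confidence) pairs, with confidence on a 0-10 scale.
    """
    summary = {}
    for model, rows in responses.items():
        if len(rows) <= MIN_VALID_RESPONSES:
            continue  # too few valid responses to include
        accuracy = mean(1.0 if correct else 0.0 for correct, _ in rows)
        # Assumption: divide by 10 so confidence is comparable to accuracy.
        confidence = mean(score for _, score in rows) / 10.0
        gap = confidence - accuracy  # positive means confidence exceeds accuracy
        if gap > 0:
            label = "overconfident"
        elif gap < 0:
            label = "underconfident"
        else:
            label = "calibrated"
        summary[model] = {
            "accuracy": accuracy,
            "confidence": confidence,
            "gap": gap,
            "label": label,
        }
    return summary
```

A model that lies exactly on the dashed line would have a gap of zero under this sketch; the sign of the gap, rather than the point's position relative to either axis, determines the label.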
