Fig. 3: Box-and-Whisker Plot Depicting the Distribution of Confidence Scores Attributed by ChatGPT to Different Categories of Responses.
From: Exploring the capabilities of ChatGPT in women’s health: obstetrics and gynaecology

The y-axis shows the probability score (expressed as a percentage) that ChatGPT assigned to each answer, taken as its self-assessed confidence. The x-axis shows four scenarios: ChatGPT correctly identifying a correct answer (red), ChatGPT incorrectly identifying a correct answer as incorrect (blue), ChatGPT incorrectly identifying an incorrect answer as correct (green), and ChatGPT correctly identifying an incorrect answer (purple). The central line in each box denotes the median confidence score, and the bounds of each box mark the interquartile range (IQR). The median confidence was 70.0% both for correct answers correctly identified and for incorrect answers misidentified as correct, whereas for correct answers misclassified as incorrect the median was significantly lower, at 10.0%. Incorrect answers accurately identified as such had a median confidence of 5.0%. Although the differences were statistically significant, the minimal practical variance in confidence scores suggests a limitation in ChatGPT’s ability to accurately self-evaluate the certainty of its responses.
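
For readers unfamiliar with the quantities drawn in each box, the following is a minimal sketch of how a median and interquartile range could be computed for one category of confidence scores. The function name `box_summary` and the sample values are hypothetical and are not the study's data. The quartile convention (Python's "inclusive" method) is also an assumption, since the caption does not state which convention the authors used.

```python
from statistics import median, quantiles


def box_summary(scores):
    """Return (q1, median, q3, iqr) for a list of confidence scores in percent."""
    # Quartiles via the "inclusive" method (an assumed convention, see lead-in).
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q1, median(scores), q3, q3 - q1


# Hypothetical scores for a single category, NOT taken from the study.
example_scores = [50.0, 60.0, 70.0, 80.0, 90.0]
q1, med, q3, iqr = box_summary(example_scores)
print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}")
```

The box in a box-and-whisker plot spans Q1 to Q3, so its height equals the IQR, and the central line sits at the median.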