Fig. 6: GPT-4’s performance on all four metrics in two settings, five-choice vs. three-choice WEP options.
From: An evaluation of estimative uncertainty in large language models

Results are analyzed under both narrow (less uncertain) and wide (more uncertain) outcome ranges. Standard errors and significance are reported as in Fig. 5.