Fig. 4: A heat map visualizing Kullback-Leibler (KL) divergence for 17 words of estimative probability (WEPs) across three comparison pairs.
From: An evaluation of estimative uncertainty in large language models

The three comparison pairs are ERNIE-4.0 (Chinese) vs. humans, GPT-3.5/4 (English vs. Chinese), and GPT-3.5/4 vs. ERNIE-4.0 (Chinese). Darker colors indicate higher divergence. *, **, and *** denote significance under the Brunner-Munzel test at the 90%, 95%, and 99% confidence levels, respectively. Kolmogorov-Smirnov (KS) statistics are reported in Supplementary Fig. S17.
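
As a rough illustration of the two quantities in this caption, the sketch below computes a histogram-based KL divergence between two sets of numerical responses and maps a p-value to the star notation. It is not the authors' code: the 0-100 response scale, the bin count, the smoothing constant `eps`, and the function names are all assumptions made for this example. The p-value itself would come from a Brunner-Munzel test (for instance `scipy.stats.brunnermunzel`), which is not reimplemented here.

```python
import math


def histogram(samples, bins=20, lo=0.0, hi=100.0):
    """Count samples into equal-width bins over [lo, hi] (assumed scale)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        # Clamp so that x == hi (or slightly out-of-range values) stays in range.
        idx = max(0, min(int((x - lo) / width), bins - 1))
        counts[idx] += 1
    return counts


def kl_divergence(p_samples, q_samples, bins=20, eps=1e-9):
    """KL(P || Q) between two binned response distributions.

    eps is an assumed additive smoothing term that keeps empty bins
    from producing log(0) or division by zero.
    """
    p_counts = histogram(p_samples, bins)
    q_counts = histogram(q_samples, bins)
    p_total = sum(p_counts) + eps * bins
    q_total = sum(q_counts) + eps * bins
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def significance_stars(p_value):
    """Map a p-value to the caption's notation: * 90%, ** 95%, *** 99%."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""
```

Each heat map cell would then correspond to one `kl_divergence` value for a given WEP and comparison pair, annotated with the output of `significance_stars`. Note that KL divergence is asymmetric, so the order of the two arguments matters.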