Fig. 10: Impact of data contamination on reasoning evaluation.
From: Automating expert-level medical reasoning evaluation of large language models

The bar plot shows reasoning scores obtained via LLM-w-Rationale on the full dataset (blue bars) and on an uncontaminated subset (red bars). Error bars represent the 95% confidence interval (CI) of the mean, calculated via bootstrapping. For each model, reasoning performance varies minimally between the full dataset and the uncontaminated subset, indicating that data leakage has limited influence on reasoning evaluation.
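
As an illustration of how a bootstrapped 95% CI of the mean can be obtained, the following is a minimal sketch using the percentile bootstrap. The caption does not state which bootstrap variant, number of resamples, or software was used, so the function name `bootstrap_ci_mean`, its parameters, and the example scores are all assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_ci_mean(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI of the mean (hypothetical helper, not the paper's code).

    Returns (sample_mean, ci_lower, ci_upper).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)

    # Resample with replacement n_resamples times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )

    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up per-question reasoning scores for one model (illustrative only).
    full_dataset = [0.62, 0.71, 0.80, 0.55, 0.90, 0.68, 0.74, 0.66]
    uncontaminated = [0.62, 0.80, 0.55, 0.68, 0.74]

    for name, scores in [("full", full_dataset), ("uncontaminated", uncontaminated)]:
        mean, lower, upper = bootstrap_ci_mean(scores)
        print(f"{name}: mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Under these assumptions, heavily overlapping intervals for the full dataset and the uncontaminated subset would be consistent with the caption's reading that contamination has little effect on the reasoning scores.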