Fig. 10: Impact of data contamination on reasoning evaluation. | npj Digital Medicine

From: Automating expert-level medical reasoning evaluation of large language models

The bar plot illustrates reasoning scores obtained via LLM-w-Rationale on the full dataset (blue bars) and an uncontaminated subset (red bars). Error bars represent the 95% CI of the mean, calculated via bootstrapping. Reasoning performance shows minimal variation between the full dataset and the uncontaminated subset across models, indicating limited influence of data leakage on reasoning evaluation.
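
As an illustration of how error bars of this kind can be computed, the following is a minimal sketch of a percentile bootstrap for the 95% CI of a mean score. It is not the authors' code: the function name `bootstrap_ci_mean`, the number of resamples, the seed, and the score lists are all assumptions made for the example, and the article may use a different bootstrap variant.

```python
import random
import statistics


def bootstrap_ci_mean(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


# Made-up per-question reasoning scores, NOT data from the article.
full_scores = [0.8, 0.6, 0.9, 0.7, 0.75, 0.85, 0.65, 0.7, 0.9, 0.8]
clean_scores = [0.8, 0.9, 0.7, 0.85, 0.7, 0.8]

for label, scores in [("full", full_scores), ("uncontaminated", clean_scores)]:
    mean, lo, hi = bootstrap_ci_mean(scores)
    print(f"{label}: mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Under this sketch, largely overlapping intervals for the full dataset and the uncontaminated subset would correspond to the "minimal variation" described in the caption.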
