Table 2 Estimated data contamination rates of MedThink-Bench in investigated LLMs

From: Automating expert-level medical reasoning evaluation of large language models

| Prediction model | Data contamination rate |
| --- | --- |
| GPT-4o | 0.004 |
| Gemini-2.5-flash | 0.032 |
| Claude-sonnet-3.5 | 0.054 |
| DeepSeek-R1 | 0.048 |
| HuatuoGPT-o1-70B | 0.000 |
| Llama-3.3-70B | 0.118 |
| Med42-70B | 0.016 |
| Qwen3-32B | 0.038 |
| QwQ-32B | 0.034 |
| MedGemma-27B | 0.252 |
| Baichuan-M1-14B | 0.076 |

  1. Lower values indicate a lower likelihood of data contamination. Contamination for OpenAI-o3 was not assessed because the model does not permit temperature control, which is required by the contamination detection method.