Table 2. Estimated data contamination rates of MedThink-Bench in investigated LLMs
From: Automating expert-level medical reasoning evaluation of large language models
| Prediction model | Data contamination rate |
|---|---|
| GPT-4o | 0.004 |
| Gemini-2.5-flash | 0.032 |
| Claude-sonnet-3.5 | 0.054 |
| DeepSeek-R1 | 0.048 |
| HuatuoGPT-o1-70B | 0.000 |
| Llama-3.3-70B | 0.118 |
| Med42-70B | 0.016 |
| Qwen3-32B | 0.038 |
| QwQ-32B | 0.034 |
| MedGemma-27B | 0.252 |
| Baichuan-M1-14B | 0.076 |