Table 3 Error detection performance of the LLM-w-Rationale framework against expert evaluation (ground truth) for medical reasoning assessment
From: Automating expert-level medical reasoning evaluation of large language models
| Model | Precision | Recall | F1 |
|---|---|---|---|
| OpenAI-o3 | 0.902 | 0.884 | 0.893 |
| GPT-4o | 0.826 | 0.839 | 0.832 |
| Gemini-2.5-flash | 0.884 | 0.822 | 0.852 |
| Claude-sonnet-3.5 | 0.812 | 0.850 | 0.831 |
| DeepSeek-R1 | 0.899 | 0.878 | 0.889 |
| HuatuoGPT-o1-70B | 0.910 | 0.887 | 0.899 |
| Llama-3.3-70B | 0.800 | 0.764 | 0.781 |
| Med42-70B | 0.815 | 0.881 | 0.847 |
| QwQ-32B | 0.820 | 0.769 | 0.794 |
| Qwen3-32B | 0.805 | 0.755 | 0.779 |
| MedGemma-27B | 0.900 | 0.922 | 0.911 |
| Baichuan-M1-14B | 0.809 | 0.815 | 0.812 |
| Average | 0.849 | 0.839 | 0.843 |
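
As a sanity check on the reported numbers, the short Python sketch below recomputes the F1 column and the Average row from the precision and recall values in Table 3. It assumes F1 is the harmonic mean of precision and recall and that the Average row is an unweighted (macro) mean over the twelve models; neither convention is stated in the table itself, and third-decimal differences are rounding artifacts of the published values.

```python
# Sketch: recompute F1 and the Average row of Table 3 from the reported
# precision/recall pairs. Assumes F1 = harmonic mean of precision and recall
# and "Average" = unweighted macro-mean over the 12 models (conventions not
# stated in the table); small third-decimal differences come from rounding.

rows = {
    "OpenAI-o3":         (0.902, 0.884),
    "GPT-4o":            (0.826, 0.839),
    "Gemini-2.5-flash":  (0.884, 0.822),
    "Claude-sonnet-3.5": (0.812, 0.850),
    "DeepSeek-R1":       (0.899, 0.878),
    "HuatuoGPT-o1-70B":  (0.910, 0.887),
    "Llama-3.3-70B":     (0.800, 0.764),
    "Med42-70B":         (0.815, 0.881),
    "QwQ-32B":           (0.820, 0.769),
    "Qwen3-32B":         (0.805, 0.755),
    "MedGemma-27B":      (0.900, 0.922),
    "Baichuan-M1-14B":   (0.809, 0.815),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-model F1 recomputed from precision and recall.
for name, (p, r) in rows.items():
    print(f"{name:<18} P={p:.3f} R={r:.3f} F1={f1(p, r):.3f}")

# Unweighted macro-average over the 12 models (the Average row).
mean_p = sum(p for p, _ in rows.values()) / len(rows)
mean_r = sum(r for _, r in rows.values()) / len(rows)
mean_f1 = sum(f1(p, r) for p, r in rows.values()) / len(rows)
print(f"{'Average':<18} P={mean_p:.3f} R={mean_r:.3f} F1={mean_f1:.3f}")
```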