Table 3 Error detection performance of the LLM-w-Rationale framework against expert evaluation (ground truth) for medical reasoning assessment

From: Automating expert-level medical reasoning evaluation of large language models

Model               Precision   Recall   F1
OpenAI-o3           0.902       0.884    0.893
GPT-4o              0.826       0.839    0.832
Gemini-2.5-flash    0.884       0.822    0.852
Claude-sonnet-3.5   0.812       0.850    0.831
DeepSeek-R1         0.899       0.878    0.889
HuatuoGPT-o1-70B    0.910       0.887    0.899
Llama-3.3-70B       0.800       0.764    0.781
Med42-70B           0.815       0.881    0.847
QwQ-32B             0.820       0.769    0.794
Qwen3-32B           0.805       0.755    0.779
MedGemma-27B        0.900       0.922    0.911
Baichuan-M1-14B     0.809       0.815    0.812
Average             0.849       0.839    0.843
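
As a sanity check (not part of the original table), the reported F1 scores agree with the harmonic mean of the reported precision and recall to within rounding, and the Average row is the unweighted mean over the twelve judge models. The following is a minimal Python sketch of that check under those assumptions; the dictionary layout and helper function are illustrative, not taken from the paper.

```python
# Illustrative check: F1 = 2*P*R / (P + R), and the "Average" row is the
# unweighted mean over the twelve judge models. Values copied from Table 3.
scores = {
    "OpenAI-o3": (0.902, 0.884),
    "GPT-4o": (0.826, 0.839),
    "Gemini-2.5-flash": (0.884, 0.822),
    "Claude-sonnet-3.5": (0.812, 0.850),
    "DeepSeek-R1": (0.899, 0.878),
    "HuatuoGPT-o1-70B": (0.910, 0.887),
    "Llama-3.3-70B": (0.800, 0.764),
    "Med42-70B": (0.815, 0.881),
    "QwQ-32B": (0.820, 0.769),
    "Qwen3-32B": (0.805, 0.755),
    "MedGemma-27B": (0.900, 0.922),
    "Baichuan-M1-14B": (0.809, 0.815),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-model F1 (matches the table to within rounding of P and R).
for model, (p, r) in scores.items():
    print(f"{model:<19} P={p:.3f}  R={r:.3f}  F1={f1(p, r):.3f}")

# Column averages, corresponding to the last row of the table.
n = len(scores)
avg_p = sum(p for p, _ in scores.values()) / n
avg_r = sum(r for _, r in scores.values()) / n
avg_f = sum(f1(p, r) for p, r in scores.values()) / n
print(f"{'Average':<19} P={avg_p:.3f}  R={avg_r:.3f}  F1={avg_f:.3f}")
```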