Table 3 Error detection performance of the LLM-w-Rationale framework against expert evaluation (ground truth) for medical reasoning assessment

From: Automating expert-level medical reasoning evaluation of large language models

Model               Precision   Recall   F1
OpenAI-o3           0.902       0.884    0.893
GPT-4o              0.826       0.839    0.832
Gemini-2.5-flash    0.884       0.822    0.852
Claude-sonnet-3.5   0.812       0.850    0.831
DeepSeek-R1         0.899       0.878    0.889
HuatuoGPT-o1-70B    0.910       0.887    0.899
Llama-3.3-70B       0.800       0.764    0.781
Med42-70B           0.815       0.881    0.847
QwQ-32B             0.820       0.769    0.794
Qwen3-32B           0.805       0.755    0.779
MedGemma-27B        0.900       0.922    0.911
Baichuan-M1-14B     0.809       0.815    0.812
Average             0.849       0.839    0.843
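
As a sanity check (not part of the original table), the reported F1 scores agree with the harmonic mean of the reported precision and recall to within rounding, and the Average row is the unweighted mean over the twelve judge models. The following is a minimal Python sketch of that check under those assumptions; the dictionary layout and helper function are illustrative, not taken from the paper.

```python
# Illustrative check: F1 = 2*P*R / (P + R), and the "Average" row is the
# unweighted mean over the twelve judge models. Values copied from Table 3.
scores = {
    "OpenAI-o3": (0.902, 0.884),
    "GPT-4o": (0.826, 0.839),
    "Gemini-2.5-flash": (0.884, 0.822),
    "Claude-sonnet-3.5": (0.812, 0.850),
    "DeepSeek-R1": (0.899, 0.878),
    "HuatuoGPT-o1-70B": (0.910, 0.887),
    "Llama-3.3-70B": (0.800, 0.764),
    "Med42-70B": (0.815, 0.881),
    "QwQ-32B": (0.820, 0.769),
    "Qwen3-32B": (0.805, 0.755),
    "MedGemma-27B": (0.900, 0.922),
    "Baichuan-M1-14B": (0.809, 0.815),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-model F1 (matches the table to within rounding of P and R).
for model, (p, r) in scores.items():
    print(f"{model:<19} P={p:.3f}  R={r:.3f}  F1={f1(p, r):.3f}")

# Column averages, corresponding to the last row of the table.
n = len(scores)
avg_p = sum(p for p, _ in scores.values()) / n
avg_r = sum(r for _, r in scores.values()) / n
avg_f = sum(f1(p, r) for p, r in scores.values()) / n
print(f"{'Average':<19} P={avg_p:.3f}  R={avg_r:.3f}  F1={avg_f:.3f}")
```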