Table 3 QA Accuracy Resultᵃ by Modelᵇ

From: Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education

| Model Type | Model Name | Overall | Symptom Group | Medical History | Family and Social History |
|---|---|---|---|---|---|
| Closed-Source | Claude-3-Haiku | 72.25% | 68.00% | 40.91% | 56.76% |
| | Claude-3.5-Sonnet | 83.82% | 88.00% | 86.36% | 48.65% |
| | Claude-4-Sonnet | 86.13% | 68.00% | 86.36% | 64.86% |
| | Claude-4-Opus | 90.80% | 88.00% | 86.36% | 78.38% |
| | GPT-3.5-Turboᶜ | 55.49% | 60.00% | 36.36% | 60.00% |
| | GPT-4o | 89.02% | 92.00% | 81.82% | 64.86% |
| | GPT-4-Turbo | 94.15%ᶜ | 91.20% | 87.10% | 85.56% |
| Open-Source | Deepseek-v3-671b | 80.81% | 63.64% | 79.31% | 62.50% |
| | Llama-3-70b | 75.72% | 84.00% | 36.36% | 43.24% |
| | Qwen-3-32b | 78.61% | 84.00% | 77.27% | 40.54% |

  1. ᵃ All models were tested using the setup with all agents and few-shot learning.
  2. ᵇ The Claude-3-Sonnet model was removed due to model deprecation.
  3. ᶜ We observe that GPT-3.5-Turbo performs poorly in the Cypher query generation task, suggesting limitations in structured reasoning and in translating natural language into formal queries.