Table 3 QA Accuracy Results^a by Model^b

Model Type	Model Name	Overall	Symptom Group	Medical History	Family and Social History
Closed-Source	Claude-3-Haiku	72.25%	68.00%	40.91%	56.76%
	Claude-3.5-Sonnet	83.82%	88.00%	86.36%	48.65%
	Claude-4-Sonnet	86.13%	68.00%	86.36%	64.86%
	Claude-4-Opus	90.80%	88.00%	86.36%	78.38%
	GPT-3.5-Turbo^c	55.49%	60.00%	36.36%	60.00%
	GPT-4o	89.02%	92.00%	81.82%	64.86%
	GPT-4-Turbo	94.15%	91.20%	87.10%	85.56%
Open-Source	Deepseek-v3-671b	80.81%	63.64%	79.31%	62.50%
	Llama-3-70b	75.72%	84.00%	36.36%	43.24%
	Qwen-3-32b	78.61%	84.00%	77.27%	40.54%

^aAll models were tested using the setup with all agents and few-shot learning.
^bThe Claude-3-Sonnet model was removed because it has been deprecated.
^cWe observe that GPT-3.5-Turbo performs poorly in the Cypher query generation task, suggesting limitations in structured reasoning and translating natural language into formal queries.
