Table 1. Model performance comparison across tasks and evaluation methods

| User setting | Model | Triage level: Exact match | Triage level: Range | Specialty: Matched | Specialty: At least one | Diagnosis: Matched | Diagnosis: At least one | Average |
|---|---|---|---|---|---|---|---|---|
| General User | RAG-Assisted LLM | 64.10 | 78.20 | 77.12 | 86.35 | 69.43 | 80.85 | 76.01 |
| General User | Claude 3.5 Sonnet | 62.20 | 82.80 | 78.26 | 88.05 | 70.22 | 82.00 | 77.26 |
| General User | Claude 3 Sonnet | 58.35 | 74.40 | 78.10 | 87.70 | 70.17 | 81.55 | 75.05 |
| General User | Claude 3 Haiku | 57.70 | 71.80 | 77.86 | 87.10 | 67.39 | 79.60 | 73.58 |
| Clinical User | RAG-Assisted LLM | 65.75 | 77.15 | 77.28 | 86.45 | 69.77 | 81.70 | 76.35 |
| Clinical User | Claude 3.5 Sonnet | 64.40 | 82.40 | 78.86 | 88.55 | 70.26 | 82.10 | 77.76 |
| Clinical User | Claude 3 Sonnet | 61.65 | 74.55 | 77.72 | 87.15 | 70.51 | 82.05 | 75.61 |
| Clinical User | Claude 3 Haiku | 59.00 | 66.15 | 78.02 | 87.05 | 67.46 | 79.30 | 72.83 |