Table 1. Model performance comparison across tasks and evaluation methods

| User setting | Model | Triage level: Exact match | Triage level: Range | Specialty: Matched | Specialty: At least one | Diagnosis: Matched | Diagnosis: At least one | Average |
|---|---|---|---|---|---|---|---|---|
| General User | RAG-Assisted LLM | 64.10 | 78.20 | 77.12 | 86.35 | 69.43 | 80.85 | 76.01 |
| General User | Claude 3.5 Sonnet | 62.20 | 82.80 | 78.26 | 88.05 | 70.22 | 82.00 | 77.26 |
| General User | Claude 3 Sonnet | 58.35 | 74.40 | 78.10 | 87.70 | 70.17 | 81.55 | 75.05 |
| General User | Claude 3 Haiku | 57.70 | 71.80 | 77.86 | 87.10 | 67.39 | 79.60 | 73.58 |
| Clinical User | RAG-Assisted LLM | 65.75 | 77.15 | 77.28 | 86.45 | 69.77 | 81.70 | 76.35 |
| Clinical User | Claude 3.5 Sonnet | 64.40 | 82.40 | 78.86 | 88.55 | 70.26 | 82.10 | 77.76 |
| Clinical User | Claude 3 Sonnet | 61.65 | 74.55 | 77.72 | 87.15 | 70.51 | 82.05 | 75.61 |
| Clinical User | Claude 3 Haiku | 59.00 | 66.15 | 78.02 | 87.05 | 67.46 | 79.30 | 72.83 |