Table 6 Overall classification accuracy (proportion) with 95% CIs (cheap subsampling bootstrap; \(m=0.6N\), \(B=100\)) on the Ophtimus-Eval-V1, MedMCQA, and PubMedQA datasets.
From: Ophtimus-V2-Tx: a compact domain-specific LLM for ophthalmic diagnosis and treatment planning
Model | Multi-choice question | | |
|---|---|---|---|
| | Ophtimus-Eval-V1 | MedMCQA | PubMedQA |
OpenAI GPT-4o | 0.72 [0.70, 0.74] | 0.83 [0.82, 0.84] | 0.91 [0.87, 0.94] |
LLaMA 3-8B Instruct | 0.47 [0.45, 0.49] | 0.74 [0.73, 0.75] | 0.82 [0.78, 0.86] |
LLaMA 3.1-8B Instruct | 0.41 [0.39, 0.43] | 0.61 [0.60, 0.62] | 0.78 [0.73, 0.83] |
Eye-LLaMA | 0.29 [0.28, 0.31] | 0.58 [0.57, 0.59] | 0.60 [0.54, 0.66] |
PMC-LLaMA-13B | 0.38 [0.36, 0.40] | 0.56 [0.55, 0.58] | 0.72 [0.67, 0.78] |
Ophtimus-V1-Inst | 0.61 [0.59, 0.63] | 0.67 [0.66, 0.68] | 0.74 [0.69, 0.79] |
Ophtimus-V2-Inst | 0.64 [0.62, 0.66] | 0.72 [0.71, 0.72] | 0.73 [0.69, 0.76] |
Ophtimus-V2-Tx | 0.42 [0.40, 0.44] | 0.55 [0.54, 0.56] | 0.88 [0.84, 0.93] |
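
The caption reports intervals from a cheap subsampling bootstrap with \(m=0.6N\) and \(B=100\). As a rough illustration of how such an interval could be computed for an accuracy score, the sketch below draws \(B\) subsamples of size \(m\) without replacement and applies one commonly cited form of the interval, \(\hat\theta \pm t_{B,0.975}\sqrt{m/(N-m)}\,S\), where \(S^2\) is the mean squared deviation of the subsample accuracies from the full-sample accuracy. This is an assumption about the procedure, not the authors' confirmed implementation; the function name `cheap_subsampling_ci`, the seed, and the synthetic 0/1 outcomes are hypothetical and are not data from the paper.

```python
import math
import random

# 0.975 quantile of Student's t with 100 degrees of freedom (matches B = 100).
T_975_DF100 = 1.984


def cheap_subsampling_ci(correct, m_frac=0.6, n_boot=100,
                         t_crit=T_975_DF100, seed=0):
    """Return (accuracy, lower, upper) for a list of 0/1 per-question outcomes.

    Assumed interval form (not confirmed by the source):
        acc +/- t_crit * sqrt(m / (N - m)) * S
    where S^2 is the mean squared deviation of subsample accuracies
    from the full-sample accuracy.
    """
    rng = random.Random(seed)
    n = len(correct)
    m = int(round(m_frac * n))          # subsample size, e.g. m = 0.6 N
    acc = sum(correct) / n              # full-sample accuracy

    sq_dev = 0.0
    for _ in range(n_boot):
        sub = rng.sample(correct, m)    # subsample WITHOUT replacement
        sq_dev += (sum(sub) / m - acc) ** 2

    s = math.sqrt(sq_dev / n_boot)
    half_width = t_crit * math.sqrt(m / (n - m)) * s
    # Accuracy is a proportion, so clip the interval to [0, 1].
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Hypothetical data: 1000 questions, 720 answered correctly.
outcomes = [1] * 720 + [0] * 280
acc, lower, upper = cheap_subsampling_ci(outcomes)
print(f"{acc:.2f} [{lower:.2f}, {upper:.2f}]")
```

The factor \(\sqrt{m/(N-m)}\) rescales the spread of the subsample estimates so that the interval reflects the uncertainty of the full-sample accuracy rather than that of a smaller subsample. The hard-coded critical value only suits \(B=100\) at the 95% level; other settings would need a different quantile.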