Table 6 Overall classification accuracy (proportion) with 95% CIs (cheap subsampling bootstrap; \(m=0.6N\), \(B=100\)) on the Ophtimus-Eval-V1, MedMCQA, and PubMedQA datasets.

From: Ophtimus-V2-Tx: a compact domain-specific LLM for ophthalmic diagnosis and treatment planning

Multi-choice question

| Model | Ophtimus-Eval-V1 | MedMCQA | PubMedQA |
| --- | --- | --- | --- |
| OpenAI GPT-4o | 0.72 [0.70, 0.74] | 0.83 [0.82, 0.84] | 0.91 [0.87, 0.94] |
| LLaMA 3-8B Instruct | 0.47 [0.45, 0.49] | 0.74 [0.73, 0.75] | 0.82 [0.78, 0.86] |
| LLaMA 3.1-8B Instruct | 0.41 [0.39, 0.43] | 0.61 [0.60, 0.62] | 0.78 [0.73, 0.83] |
| Eye-LLaMA | 0.29 [0.28, 0.31] | 0.58 [0.57, 0.59] | 0.60 [0.54, 0.66] |
| PMC-LLaMA-13B | 0.38 [0.36, 0.40] | 0.56 [0.55, 0.58] | 0.72 [0.67, 0.78] |
| Ophtimus-V1-Inst | 0.61 [0.59, 0.63] | 0.67 [0.66, 0.68] | 0.74 [0.69, 0.79] |
| Ophtimus-V2-Inst | 0.64 [0.62, 0.66] | 0.72 [0.71, 0.72] | 0.73 [0.69, 0.76] |
| Ophtimus-V2-Tx | 0.42 [0.40, 0.44] | 0.55 [0.54, 0.56] | 0.88 [0.84, 0.93] |
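
The caption reports intervals from a cheap subsampling bootstrap with \(m=0.6N\) and \(B=100\), but does not spell out the computation. The sketch below shows one way such an interval could be obtained for an accuracy score. It is an assumption, not the authors' code: it draws \(B\) subsamples of size \(m\) without replacement, measures their spread around the full-sample accuracy, rescales by \(\sqrt{m/(N-m)}\), and uses a Student-t critical value. The function name `cheap_subsampling_ci`, the hard-coded critical value 1.984 (approximately the 97.5% t quantile at 100 degrees of freedom), and the clipping to \([0, 1]\) are illustrative choices.

```python
import math
import random


def cheap_subsampling_ci(correct, m_frac=0.6, B=100, t_crit=1.984, seed=0):
    """Accuracy with an approximate 95% cheap-subsampling interval.

    correct : list of 0/1 flags, one per evaluated question.
    m_frac  : subsample size as a fraction of N (the caption uses 0.6).
    B       : number of subsamples (the caption uses 100).
    t_crit  : assumed critical value, roughly the 97.5% Student-t
              quantile with 100 degrees of freedom.
    """
    n = len(correct)
    m = int(round(m_frac * n))
    if not 0 < m < n:
        raise ValueError("subsample size m must satisfy 0 < m < N")

    rng = random.Random(seed)
    acc = sum(correct) / n  # full-sample accuracy

    # Squared deviations of each subsample accuracy from the full-sample one.
    sq_dev = 0.0
    for _ in range(B):
        sub = rng.sample(correct, m)  # draws m items without replacement
        sq_dev += (sum(sub) / m - acc) ** 2

    # Rescale the subsample spread to the full-sample scale.
    half_width = t_crit * math.sqrt(m / (n - m)) * math.sqrt(sq_dev / B)

    # Clip to [0, 1] because accuracy is a proportion.
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


if __name__ == "__main__":
    # Synthetic outcomes for illustration only, not data from the paper.
    outcomes = [1] * 720 + [0] * 280
    acc, lo, hi = cheap_subsampling_ci(outcomes)
    print(f"{acc:.2f} [{lo:.2f}, {hi:.2f}]")
```

Because the interval width shrinks roughly with \(1/\sqrt{N}\) under this scheme, a smaller benchmark would be expected to give wider intervals than a larger one at similar accuracy.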