Table 4 Hierarchical F1 scores for CliBench across all tasks—primary/secondary diagnosis, medication, and surgical procedure—with 95% confidence intervals (CI).

From: Ophtimus-V2-Tx: a compact domain-specific LLM for ophthalmic diagnosis and treatment planning

#	Model	Task	L1	L2	L3	L4	Full
1	OpenAI API (ChatGPT-4o)	Primary	0.74 [0.72, 0.76]	0.57 [0.55, 0.60]	0.52 [0.49, 0.54]	0.38 [0.36, 0.41]	0.26 [0.24, 0.28]
		Secondary	0.56 [0.53, 0.59]	0.31 [0.28, 0.34]	0.25 [0.22, 0.28]	0.17 [0.15, 0.19]	0.12 [0.10, 0.14]
		Medication	0.66 [0.63, 0.69]	0.63 [0.60, 0.66]	0.51 [0.48, 0.54]	0.44 [0.41, 0.47]	0.32 [0.30, 0.35]
		Surgical	0.75 [0.72, 0.78]	0.62 [0.58, 0.65]	0.35 [0.31, 0.39]	–	0.16 [0.12, 0.19]
2	LLaMA 3.1 8B Instruct	Primary	0.63 [0.60, 0.65]	0.45 [0.42, 0.47]	0.38 [0.36, 0.41]	0.27 [0.25, 0.29]	0.18 [0.16, 0.20]
		Secondary	0.33 [0.31, 0.35]	0.23 [0.21, 0.25]	0.17 [0.15, 0.19]	0.14 [0.12, 0.16]	0.12 [0.10, 0.14]
		Medication	0.47 [0.45, 0.50]	0.45 [0.42, 0.48]	0.36 [0.33, 0.39]	0.31 [0.29, 0.34]	0.22 [0.19, 0.24]
		Surgical	0.58 [0.56, 0.61]	0.29 [0.26, 0.31]	0.13 [0.11, 0.14]	–	0.06 [0.05, 0.07]
3	LLaMA 3.1 8B Ophthalmic Instruct	Primary	0.72 [0.69, 0.74]	0.56 [0.54, 0.59]	0.50 [0.48, 0.52]	0.40 [0.37, 0.42]	0.24 [0.22, 0.26]
		Secondary	0.39 [0.37, 0.41]	0.26 [0.24, 0.29]	0.21 [0.19, 0.23]	0.17 [0.16, 0.19]	0.14 [0.13, 0.16]
		Medication	0.54 [0.52, 0.57]	0.52 [0.49, 0.55]	0.41 [0.39, 0.44]	0.37 [0.35, 0.40]	0.28 [0.26, 0.31]
		Surgical	0.66 [0.63, 0.68]	0.40 [0.38, 0.42]	0.20 [0.18, 0.22]	–	0.09 [0.08, 0.10]
4	Ophtimus-V2-Inst	Primary	0.67 [0.65, 0.70]	0.50 [0.48, 0.53]	0.42 [0.40, 0.45]	0.31 [0.28, 0.33]	0.17 [0.15, 0.19]
		Secondary	0.37 [0.34, 0.40]	0.23 [0.21, 0.26]	0.19 [0.16, 0.22]	0.15 [0.13, 0.18]	0.12 [0.09, 0.14]
		Medication	0.52 [0.49, 0.55]	0.49 [0.46, 0.52]	0.39 [0.36, 0.43]	0.34 [0.31, 0.37]	0.25 [0.22, 0.27]
		Surgical	0.62 [0.59, 0.64]	0.39 [0.36, 0.41]	0.20 [0.18, 0.22]	–	0.10 [0.09, 0.12]
5	Ophtimus-V2-Tx	Primary	0.73 [0.71, 0.75]	0.58 [0.56, 0.61]	0.51 [0.48, 0.53]	0.40 [0.37, 0.42]	0.23 [0.21, 0.25]
		Secondary	0.36 [0.33, 0.38]	0.25 [0.23, 0.27]	0.20 [0.18, 0.22]	0.17 [0.15, 0.19]	0.15 [0.13, 0.16]
		Medication	0.55 [0.52, 0.57]	0.52 [0.49, 0.55]	0.43 [0.40, 0.46]	0.40 [0.37, 0.42]	0.31 [0.28, 0.33]
		Surgical	0.75 [0.72, 0.77]	0.42 [0.39, 0.46]	0.25 [0.22, 0.28]	–	0.13 [0.10, 0.15]

F1 is reported at hierarchy levels L1–L4 and for Full; L4 is not applicable to surgical procedure and is shown as “–”. Bold indicates the best score per column within each task (ties bolded). Complementary metrics (accuracy, precision, and recall) are reported in the Appendix.
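As a reading aid for the L1–L4 and Full columns, the sketch below shows one way a level-wise F1 with a 95% confidence interval could be computed. It is an illustration under stated assumptions, not the procedure used for this table: the prefix lengths in `LEVEL_PREFIX`, the set-based per-case F1, the averaging over cases, and the percentile bootstrap are all hypothetical choices, and every function name is invented for the example.

```python
import random

# Hypothetical mapping from column name to code-prefix length.
# This is an assumption for illustration; the table does not define it.
LEVEL_PREFIX = {"L1": 1, "L2": 2, "L3": 3, "L4": 4, "Full": None}


def truncate(codes, n):
    """Cut each code to its first n characters (None keeps the full code)."""
    return {c if n is None else c[:n] for c in codes}


def f1_at_level(pred, gold, n):
    """Set-based F1 between predicted and gold codes after truncation."""
    p, g = truncate(pred, n), truncate(gold, n)
    if not p and not g:
        return 1.0  # nothing to predict and nothing predicted
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


def mean_f1(cases, n):
    """Average the per-case F1 over a list of (pred, gold) pairs."""
    return sum(f1_at_level(pred, gold, n) for pred, gold in cases) / len(cases)


def bootstrap_ci(cases, n, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean_f1 (assumed CI method)."""
    rng = random.Random(seed)
    stats = sorted(
        mean_f1([rng.choice(cases) for _ in cases], n) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Under these assumptions, a prediction that matches the gold code only on its leading characters scores at the coarse levels and misses at Full, which is consistent with the scores in the table falling from L1 toward Full.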
