Table 4 Hierarchical F1 scores for CliBench across all tasks—primary/secondary diagnosis, medication, and surgical procedure—with 95% confidence intervals (CI).

From: Ophtimus-V2-Tx: a compact domain-specific LLM for ophthalmic diagnosis and treatment planning

| # | Model | Task | L1 | L2 | L3 | L4 | Full |
|---|-------|------|----|----|----|----|------|
| 1 | OpenAI API (ChatGPT-4o) | Primary | 0.74 [0.72, 0.76] | 0.57 [0.55, 0.60] | 0.52 [0.49, 0.54] | 0.38 [0.36, 0.41] | 0.26 [0.24, 0.28] |
|   |   | Secondary | 0.56 [0.53, 0.59] | 0.31 [0.28, 0.34] | 0.25 [0.22, 0.28] | 0.17 [0.15, 0.19] | 0.12 [0.10, 0.14] |
|   |   | Medication | 0.66 [0.63, 0.69] | 0.63 [0.60, 0.66] | 0.51 [0.48, 0.54] | 0.44 [0.41, 0.47] | 0.32 [0.30, 0.35] |
|   |   | Surgical | 0.75 [0.72, 0.78] | 0.62 [0.58, 0.65] | 0.35 [0.31, 0.39] | – | 0.16 [0.12, 0.19] |
| 2 | LLaMA 3.1 8B Instruct | Primary | 0.63 [0.60, 0.65] | 0.45 [0.42, 0.47] | 0.38 [0.36, 0.41] | 0.27 [0.25, 0.29] | 0.18 [0.16, 0.20] |
|   |   | Secondary | 0.33 [0.31, 0.35] | 0.23 [0.21, 0.25] | 0.17 [0.15, 0.19] | 0.14 [0.12, 0.16] | 0.12 [0.10, 0.14] |
|   |   | Medication | 0.47 [0.45, 0.50] | 0.45 [0.42, 0.48] | 0.36 [0.33, 0.39] | 0.31 [0.29, 0.34] | 0.22 [0.19, 0.24] |
|   |   | Surgical | 0.58 [0.56, 0.61] | 0.29 [0.26, 0.31] | 0.13 [0.11, 0.14] | – | 0.06 [0.05, 0.07] |
| 3 | LLaMA 3.1 8B Ophthalmic Instruct | Primary | 0.72 [0.69, 0.74] | 0.56 [0.54, 0.59] | 0.50 [0.48, 0.52] | 0.40 [0.37, 0.42] | 0.24 [0.22, 0.26] |
|   |   | Secondary | 0.39 [0.37, 0.41] | 0.26 [0.24, 0.29] | 0.21 [0.19, 0.23] | 0.17 [0.16, 0.19] | 0.14 [0.13, 0.16] |
|   |   | Medication | 0.54 [0.52, 0.57] | 0.52 [0.49, 0.55] | 0.41 [0.39, 0.44] | 0.37 [0.35, 0.40] | 0.28 [0.26, 0.31] |
|   |   | Surgical | 0.66 [0.63, 0.68] | 0.40 [0.38, 0.42] | 0.20 [0.18, 0.22] | – | 0.09 [0.08, 0.10] |
| 4 | Ophtimus-V2-Inst | Primary | 0.67 [0.65, 0.70] | 0.50 [0.48, 0.53] | 0.42 [0.40, 0.45] | 0.31 [0.28, 0.33] | 0.17 [0.15, 0.19] |
|   |   | Secondary | 0.37 [0.34, 0.40] | 0.23 [0.21, 0.26] | 0.19 [0.16, 0.22] | 0.15 [0.13, 0.18] | 0.12 [0.09, 0.14] |
|   |   | Medication | 0.52 [0.49, 0.55] | 0.49 [0.46, 0.52] | 0.39 [0.36, 0.43] | 0.34 [0.31, 0.37] | 0.25 [0.22, 0.27] |
|   |   | Surgical | 0.62 [0.59, 0.64] | 0.39 [0.36, 0.41] | 0.20 [0.18, 0.22] | – | 0.10 [0.09, 0.12] |
| 5 | Ophtimus-V2-Tx | Primary | 0.73 [0.71, 0.75] | 0.58 [0.56, 0.61] | 0.51 [0.48, 0.53] | 0.40 [0.37, 0.42] | 0.23 [0.21, 0.25] |
|   |   | Secondary | 0.36 [0.33, 0.38] | 0.25 [0.23, 0.27] | 0.20 [0.18, 0.22] | 0.17 [0.15, 0.19] | 0.15 [0.13, 0.16] |
|   |   | Medication | 0.55 [0.52, 0.57] | 0.52 [0.49, 0.55] | 0.43 [0.40, 0.46] | 0.40 [0.37, 0.42] | 0.31 [0.28, 0.33] |
|   |   | Surgical | 0.75 [0.72, 0.77] | 0.42 [0.39, 0.46] | 0.25 [0.22, 0.28] | – | 0.13 [0.10, 0.15] |

  1. L1–L4 and Full are reported (L4 is not applicable to surgical procedure and is shown as “–”). Bold indicates the best score per column within each task (ties bolded). Complementary metrics—accuracy, precision, and recall—are reported in the Appendix.
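
For readers who want to reproduce entries of the form "score [lower, upper]", the following is a minimal sketch of one way a level-wise F1 and a 95% interval could be computed. It rests on assumptions not stated in this table: that a hierarchy level corresponds to a fixed-length code prefix, that F1 is computed per case on sets of codes and then averaged, and that the interval is a percentile bootstrap over cases. The helper names (`truncate`, `level_f1`, `bootstrap_ci`) and the example code strings are hypothetical, and the procedure actually used for the table may differ.

```python
import random


def truncate(codes, level):
    """Keep the first `level` characters of each code.

    Assumption: a hierarchy level maps to a fixed-length prefix.
    """
    return {c[:level] for c in codes}


def level_f1(pred, gold, level=None):
    """Set-based F1 for one case; `level=None` compares full codes."""
    p = truncate(pred, level) if level else set(pred)
    g = truncate(gold, level) if level else set(gold)
    if not p and not g:
        return 1.0  # nothing to predict and nothing predicted
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of per-case scores with a percentile-bootstrap CI.

    Assumption: cases are resampled with replacement; the table's
    intervals may have been obtained differently.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lower, upper
```

Under these assumptions, `level_f1({"H2511"}, {"H2512"}, level=3)` counts as a match because both codes share the prefix `H25`, while the full-code comparison `level_f1({"H2511"}, {"H2512"})` does not. This mirrors the pattern in the table, where scores fall from L1 toward Full as the match criterion becomes stricter.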