Table 2 Post-re-evaluation classification performance of expert prompts across different LLMs on the prompt validation dataset

Prompt	Model	Sensitivity (95% CI)	Specificity (95% CI)	F1 score (95% CI)	Accuracy (95% CI)
P0	Llama3.1	0.85 (0.81, 0.89)	0.54 (0.53, 0.56)	0.54 (0.52, 0.56)	0.62 (0.61, 0.64)
P0	Llama3.2	0.82 (0.79, 0.84)	0.42 (0.39, 0.45)	0.47 (0.45, 0.49)	0.52 (0.50, 0.55)
P0	Med42	0.62 (0.58, 0.66)	0.92 (0.92, 0.93)	0.68 (0.65, 0.71)	0.85 (0.83, 0.86)
XP1	Llama3.1	1.00 (1.00, 1.00)	0.36 (0.34, 0.38)	0.52 (0.52, 0.53)	0.53 (0.51, 0.54)
XP1	Llama3.2	0.65 (0.58, 0.73)	0.66 (0.64, 0.69)	0.50 (0.46, 0.54)	0.66 (0.64, 0.68)
XP1	Med42	0.64 (0.61, 0.66)	0.94 (0.92, 0.95)	0.70 (0.68, 0.72)	0.86 (0.85, 0.87)
XP2	Llama3.1	0.84 (0.82, 0.86)	0.76 (0.74, 0.79)	0.67 (0.64, 0.69)	0.78 (0.76, 0.80)
XP2	Llama3.2	0.85 (0.80, 0.89)	0.55 (0.53, 0.56)	0.54 (0.52, 0.57)	0.63 (0.61, 0.64)
XP2	Med42	0.45 (0.42, 0.47)	0.96 (0.96, 0.96)	0.57 (0.55, 0.60)	0.83 (0.82, 0.83)
XP3	Llama3.1	0.82 (0.79, 0.84)	0.93 (0.92, 0.94)	0.81 (0.79, 0.83)	0.90 (0.89, 0.91)
XP3	Llama3.2	0.32 (0.29, 0.34)	0.93 (0.93, 0.93)	0.42 (0.40, 0.44)	0.77 (0.77, 0.78)
XP3	Med42	0.26 (0.24, 0.28)	0.99 (0.99, 0.99)	0.40 (0.38, 0.43)	0.80 (0.79, 0.80)

XP expert prompt, CI confidence interval. Bold text indicates the best performance.
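
For readers who want to relate the four reported columns to one another, the following is a minimal sketch of how sensitivity, specificity, F1 score, and accuracy are conventionally derived from a binary confusion matrix. The function name `classification_metrics` and the counts in the usage line are hypothetical illustrations and are not taken from the study. The sketch does not reproduce the confidence intervals, whose estimation method is not stated in the table.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. A ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
metrics = classification_metrics(tp=80, fp=20, tn=180, fn=20)
```

Because accuracy weights sensitivity and specificity by class prevalence, a model with high specificity and modest sensitivity can still report high accuracy when negatives dominate, which is one way to read rows such as XP3 with Med42.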
