Table 2 Classification performance of the expert prompts across different LLMs on the prompt validation dataset after re-evaluation
| Prompt | Model | Sensitivity (95% CI) | Specificity (95% CI) | F1 score (95% CI) | Accuracy (95% CI) |
|---|---|---|---|---|---|
| P0 | Llama3.1 | 0.85 (0.81, 0.89) | 0.54 (0.53, 0.56) | 0.54 (0.52, 0.56) | 0.62 (0.61, 0.64) |
| P0 | Llama3.2 | 0.82 (0.79, 0.84) | 0.42 (0.39, 0.45) | 0.47 (0.45, 0.49) | 0.52 (0.50, 0.55) |
| P0 | Med42 | 0.62 (0.58, 0.66) | 0.92 (0.92, 0.93) | 0.68 (0.65, 0.71) | 0.85 (0.83, 0.86) |
| XP1 | Llama3.1 | 1.00 (1.00, 1.00) | 0.36 (0.34, 0.38) | 0.52 (0.52, 0.53) | 0.53 (0.51, 0.54) |
| XP1 | Llama3.2 | 0.65 (0.58, 0.73) | 0.66 (0.64, 0.69) | 0.50 (0.46, 0.54) | 0.66 (0.64, 0.68) |
| XP1 | Med42 | 0.64 (0.61, 0.66) | 0.94 (0.92, 0.95) | 0.70 (0.68, 0.72) | 0.86 (0.85, 0.87) |
| XP2 | Llama3.1 | 0.84 (0.82, 0.86) | 0.76 (0.74, 0.79) | 0.67 (0.64, 0.69) | 0.78 (0.76, 0.80) |
| XP2 | Llama3.2 | 0.85 (0.80, 0.89) | 0.55 (0.53, 0.56) | 0.54 (0.52, 0.57) | 0.63 (0.61, 0.64) |
| XP2 | Med42 | 0.45 (0.42, 0.47) | 0.96 (0.96, 0.96) | 0.57 (0.55, 0.60) | 0.83 (0.82, 0.83) |
| XP3 | Llama3.1 | 0.82 (0.79, 0.84) | 0.93 (0.92, 0.94) | 0.81 (0.79, 0.83) | 0.90 (0.89, 0.91) |
| XP3 | Llama3.2 | 0.32 (0.29, 0.34) | 0.93 (0.93, 0.93) | 0.42 (0.40, 0.44) | 0.77 (0.77, 0.78) |
| XP3 | Med42 | 0.26 (0.24, 0.28) | 0.99 (0.99, 0.99) | 0.40 (0.38, 0.43) | 0.80 (0.79, 0.80) |
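The metrics in Table 2 are standard binary-classification statistics. The excerpt does not state how the 95% CIs were obtained; assuming a nonparametric percentile bootstrap over cases (a common choice for this setting), the sketch below shows how such values could be reproduced from gold labels and a prompt-model's predictions. The function names `metrics` and `bootstrap_ci` are illustrative, not from the paper.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Sensitivity, specificity, F1, and accuracy for binary labels (1 = positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn) if tp + fn else 0.0   # recall on the positive class
    spec = tn / (tn + fp) if tn + fp else 0.0   # recall on the negative class
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    acc = (tp + tn) / len(y_true)
    return {"sensitivity": sens, "specificity": spec, "f1": f1, "accuracy": acc}

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs: resample cases with replacement, take quantiles."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    samples = {k: [] for k in ("sensitivity", "specificity", "f1", "accuracy")}
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        for k, v in metrics(y_true[idx], y_pred[idx]).items():
            samples[k].append(v)
    return {k: (np.quantile(v, alpha / 2), np.quantile(v, 1 - alpha / 2))
            for k, v in samples.items()}

# Usage with synthetic data standing in for one prompt-model pair:
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 0])
print(metrics(y_true, y_pred))
print(bootstrap_ci(y_true, y_pred))
```

Under this assumption, each table cell would correspond to the point estimate from `metrics` on the full validation set, with the parenthesized interval taken from `bootstrap_ci`.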