Table 2 Post-re-evaluation classification performance of expert prompts across different LLMs on the prompt validation dataset

From: An autonomous agentic workflow for clinical detection of cognitive concerns using large language models

| Prompt | Model | Sensitivity (95% CI) | Specificity (95% CI) | F1 score (95% CI) | Accuracy (95% CI) |
|---|---|---|---|---|---|
| P0 | Llama3.1 | 0.85 (0.81, 0.89) | 0.54 (0.53, 0.56) | 0.54 (0.52, 0.56) | 0.62 (0.61, 0.64) |
| P0 | Llama3.2 | 0.82 (0.79, 0.84) | 0.42 (0.39, 0.45) | 0.47 (0.45, 0.49) | 0.52 (0.50, 0.55) |
| P0 | Med42 | 0.62 (0.58, 0.66) | 0.92 (0.92, 0.93) | 0.68 (0.65, 0.71) | 0.85 (0.83, 0.86) |
| XP1 | Llama3.1 | 1.00 (1.00, 1.00) | 0.36 (0.34, 0.38) | 0.52 (0.52, 0.53) | 0.53 (0.51, 0.54) |
| XP1 | Llama3.2 | 0.65 (0.58, 0.73) | 0.66 (0.64, 0.69) | 0.50 (0.46, 0.54) | 0.66 (0.64, 0.68) |
| XP1 | Med42 | 0.64 (0.61, 0.66) | 0.94 (0.92, 0.95) | 0.70 (0.68, 0.72) | 0.86 (0.85, 0.87) |
| XP2 | Llama3.1 | 0.84 (0.82, 0.86) | 0.76 (0.74, 0.79) | 0.67 (0.64, 0.69) | 0.78 (0.76, 0.80) |
| XP2 | Llama3.2 | 0.85 (0.80, 0.89) | 0.55 (0.53, 0.56) | 0.54 (0.52, 0.57) | 0.63 (0.61, 0.64) |
| XP2 | Med42 | 0.45 (0.42, 0.47) | 0.96 (0.96, 0.96) | 0.57 (0.55, 0.60) | 0.83 (0.82, 0.83) |
| XP3 | Llama3.1 | 0.82 (0.79, 0.84) | 0.93 (0.92, 0.94) | 0.81 (0.79, 0.83) | 0.90 (0.89, 0.91) |
| XP3 | Llama3.2 | 0.32 (0.29, 0.34) | 0.93 (0.93, 0.93) | 0.42 (0.40, 0.44) | 0.77 (0.77, 0.78) |
| XP3 | Med42 | 0.26 (0.24, 0.28) | 0.99 (0.99, 0.99) | 0.40 (0.38, 0.43) | 0.80 (0.79, 0.80) |

XP, expert prompt; CI, confidence interval. Bold text indicates the best performance.
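
For readers who want to relate the four reported metrics to one another, the following is a minimal sketch of their standard definitions in terms of confusion-matrix counts. It is not the authors' evaluation code: the function name `classification_metrics` and the example counts are hypothetical, and the table does not state how the 95% confidence intervals were obtained, so they are not reproduced here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive cases predicted positive/negative.
    tn/fp: negative cases predicted negative/positive.
    Assumes at least one positive and one negative case.
    """
    sensitivity = tp / (tp + fn)               # recall on positive cases
    specificity = tn / (tn + fp)               # recall on negative cases
    f1 = 2 * tp / (2 * tp + fp + fn)           # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only (not taken from the study).
metrics = classification_metrics(tp=80, fp=50, tn=150, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because F1 ignores true negatives while accuracy counts them, a configuration with high specificity and low sensitivity can show high accuracy alongside a low F1 score when negative cases outnumber positive ones.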