Table 4 Comparative performance on two datasets (post re-adjudication)

From: An autonomous agentic workflow for clinical detection of cognitive concerns using large language models

| Dataset | Method | Sensitivity | Specificity | F1 Score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Refinement dataset | Agentic workflow (AP3) | 0.91 (0.86, 0.96) | 0.95 (0.90, 1.00) | 0.93 (0.90, 0.96) | 0.93 (0.90, 0.96) |
| Refinement dataset | Expert-driven workflow (XP3) | 0.84 (0.79, 0.89) | 0.91 (0.86, 0.96) | 0.87 (0.83, 0.91) | 0.88 (0.84, 0.91) |
| Validation dataset | Agentic workflow (AP3) | 0.62 (0.58, 0.66) | 0.98 (0.95, 1.00) | 0.74 (0.68, 0.80) | 0.88 (0.86, 0.91) |
| Validation dataset | Expert-driven workflow (XP3) | 0.82 (0.79, 0.84) | 0.93 (0.92, 0.94) | 0.81 (0.79, 0.83) | 0.90 (0.89, 0.91) |

  1. XP3: expert prompt 3; AP3: agent prompt 3; CI: confidence interval.
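
For reference, the four column headers follow the standard confusion-matrix definitions. The sketch below shows how such point estimates are typically computed. It is illustrative only: the `ConfusionCounts` class, the function names, and the example counts are hypothetical and are not taken from the study, and the source does not state how its confidence intervals were derived, so none are computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container, not from the paper)."""
    tp: int  # true positives: concern present, flagged
    fp: int  # false positives: concern absent, flagged
    tn: int  # true negatives: concern absent, not flagged
    fn: int  # false negatives: concern present, not flagged


# All functions assume non-zero denominators.

def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true cases that were detected: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of non-cases correctly left unflagged: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and sensitivity: 2TP / (2TP + FP + FN)."""
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all notes classified correctly: (TP + TN) / total."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


if __name__ == "__main__":
    # Made-up counts for illustration; these are NOT the study's data.
    example = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)
    print(f"Sensitivity: {sensitivity(example):.2f}")
    print(f"Specificity: {specificity(example):.2f}")
    print(f"F1 Score:    {f1_score(example):.2f}")
    print(f"Accuracy:    {accuracy(example):.2f}")
```

Because accuracy pools positive and negative cases, it can stay high when sensitivity is low if negatives dominate the sample and specificity is high, which is worth keeping in mind when reading the rows above.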