Table 4 Comparative performance on two datasets (post re-adjudication)

Method	Sensitivity	Specificity	F1 Score	Accuracy
Refinement dataset
Agentic workflow (AP3)	0.91 (0.86, 0.96)	0.95 (0.90, 1.00)	0.93 (0.90, 0.96)	0.93 (0.90, 0.96)
Expert-driven workflow (XP3)	0.84 (0.79, 0.89)	0.91 (0.86, 0.96)	0.87 (0.83, 0.91)	0.88 (0.84, 0.91)
Validation dataset
Agentic workflow (AP3)	0.62 (0.58, 0.66)	0.98 (0.95, 1.00)	0.74 (0.68, 0.80)	0.88 (0.86, 0.91)
Expert-driven workflow (XP3)	0.82 (0.79, 0.84)	0.93 (0.92, 0.94)	0.81 (0.79, 0.83)	0.90 (0.89, 0.91)
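
For readers who want to see how the four reported metrics relate to one another, the following is a minimal sketch using the standard confusion-matrix definitions. The function name `classification_metrics` and the counts in the example call are hypothetical and are not taken from the study; how the intervals in parentheses were obtained is not stated in the table, so they are not reproduced here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    Assumes each denominator is nonzero (at least one positive case,
    one negative case, and one case overall).
    """
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of all cases classified correctly
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only; NOT data from the study.
example = classification_metrics(tp=8, fp=1, tn=9, fn=2)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Because F1 depends on both sensitivity and precision, a method can show high specificity and accuracy yet a lower F1 when its sensitivity drops, which is the pattern in the AP3 row for the validation dataset.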
