Table 4 Comparative performance on two datasets (post re-adjudication)
| Method | Sensitivity | Specificity | F1 Score | Accuracy |
|---|---|---|---|---|
| Refinement dataset | | | | |
| Agentic workflow (AP3) | 0.91 (0.86, 0.96) | 0.95 (0.90, 1.00) | 0.93 (0.90, 0.96) | 0.93 (0.90, 0.96) |
| Expert-driven workflow (XP3) | 0.84 (0.79, 0.89) | 0.91 (0.86, 0.96) | 0.87 (0.83, 0.91) | 0.88 (0.84, 0.91) |
| Validation dataset | | | | |
| Agentic workflow (AP3) | 0.62 (0.58, 0.66) | 0.98 (0.95, 1.00) | 0.74 (0.68, 0.80) | 0.88 (0.86, 0.91) |
| Expert-driven workflow (XP3) | 0.82 (0.79, 0.84) | 0.93 (0.92, 0.94) | 0.81 (0.79, 0.83) | 0.90 (0.89, 0.91) |