Table 3. Summary of Performance Metrics Across Studies
| Study | NLP Method | Sensitivity | Specificity | PPV | NPV | F1-Score | AUROC/AUC-PR | Validation |
|---|---|---|---|---|---|---|---|---|
| Chen et al.30 | Keyword frequency detection | 61.8%¹ | 85.4%¹ | NR | NR | NR | 0.76 (0.69–0.81)² | Internal: Test-retest reliability |
| Amjad et al.24 | Dictionary-based (D-NLP-pos) | 82.8% | 94.1% | 92.6% | 86.0% | 87.5% | 0.917 | Internal: 80/20 split |
| | Dictionary + negation (D-NLP-pos+neg) | 89.4% | 89.4% | 88.3% | 90.4% | 88.8% | 0.914 | |
| | All-tokens (T-NLP-pos) | 84.2% | 63.5% | 67.3% | 81.8% | 74.8% | 0.826 | |
| Fu et al.29 | NLP-CAM (rule-based) | 91.9% | 100% | NR | NR | NR³ | NR | Internal: Split sample |
| | NLP-mCAM (rule-based) | 82.7% | 91.3% | NR | NR | NR³ | NR | |
| Amonoo et al.27 | ClinicalRegex NLP | NR | NR | NR | NR | NR | NR | Inter-rater: κ = 0.90 |
| St. Sauver et al.42 | Rule-based NLP | 64% (56–72%) | 84% (74–93%) | NR | NR | NR | NR | Manual review n = 200 |
| Ge et al.31 | Transformer model | 99.1%⁴ / 98.5%⁵ | NR | 98.6%⁴ / 99.1%⁵ | NR | 97.8%⁶ / 91.8%⁷ | 0.984⁸ | External: LTM dataset |
| Shao et al.32 | LDA topic modeling | 48.1% | NR | 45.5% | NR | 46.8% | NR | Internal: n = 100 |
| | ICD-2 method | 61.2% | NR | 75.7% | NR | 67.7% | NR | |
| | Keyword search | 28.5% | NR | 98.4% | NR | 44.2% | NR | |
| Chen et al.25 | GatorTron (transformer) | 81.19%⁹ / 88.23%¹⁰ | NR | 79.93%⁹ / 86.96%¹⁰ | NR | 80.55%⁹ / 87.59%¹⁰ | NR | Internal: 381/55/110 split |
| Veeranki et al.33 | Random Forest (ML) | NR | NR | NR | NR | NR | 0.8804–0.8857¹¹ | 10-fold CV |
| Pagali et al.28 | NLP-CAM algorithm | 80%¹² | NR | NR | NR | NR | NR | Manual chart review |
| Young et al.43 | NLP-Dx-BD (rule-based) | NR¹³ | NR¹³ | NR | NR | NR | NR | Comparison with CAM-ICU |
| Young et al.35 | NLP-Dx-BD (rule-based) | NR¹³ | NR¹³ | NR | NR | NR | NR | None |
| Mikalsen et al.26 | Elastic net + anchors | NR | NR | NR | NR | NR | 0.962–0.964¹⁴ | Bootstrap CI |
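
For readers comparing rows, the F1-score is the harmonic mean of sensitivity (recall) and PPV (precision), so it can be recomputed wherever a study reports both. The short Python sketch below illustrates this for a few rows of Table 3. The function name `f1_score` and the row labels are hypothetical names introduced only for this illustration, and the check assumes that each study's reported sensitivity, PPV, and F1-score were computed on the same evaluation set, which the table alone does not confirm.

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision), both as fractions in [0, 1]."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV, reported F1) copied from Table 3, converted from percent to fractions.
# Row labels are illustrative shorthand, not identifiers used by the studies.
rows = {
    "Amjad D-NLP-pos": (0.828, 0.926, 0.875),
    "Amjad D-NLP-pos+neg": (0.894, 0.883, 0.888),
    "Amjad T-NLP-pos": (0.842, 0.673, 0.748),
    "Shao LDA topic modeling": (0.481, 0.455, 0.468),
    "Shao keyword search": (0.285, 0.984, 0.442),
}

for name, (sens, ppv, reported) in rows.items():
    computed = f1_score(sens, ppv)
    # A gap larger than rounding error would suggest the metrics come from different subsets.
    print(f"{name}: computed F1 = {computed:.3f}, reported F1 = {reported:.3f}")
```

Rows with multiple footnoted values (for example, Ge et al. and Chen et al.25) are left out of this sketch because the footnotes, which may define different evaluation conditions for each value, are needed to pair the numbers correctly.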